Loading data → getting the data into your analysis tool
Transforming data → cleaning and reshaping the data so it’s useful
Think of it like cooking:
Let’s break everything down clearly.
1️⃣ What Does “Loading Data” Mean?
Loading data means importing data from somewhere into a program so you can work with it.
Data can come from:
In beginner data analysis, most people start with CSV files.
Example Dataset (students.csv)
name,age,grade,city Alice,23,88,New York Bob,25,92,Los Angeles Charlie,22,85,Chicago David,23,90,New York
2️⃣ Loading Data in Python (Using Pandas)
In Python, the most common library for data analysis is:
pandas
It allows you to work with data in table format called a DataFrame.
Step 1: Install pandas (if needed)
pip install pandas
Step 2: Load the CSV File
import pandas as pd
# Load CSV file
df = pd.read_csv("students.csv")
# Display first 5 rows
print(df.head())
What is a DataFrame?
A DataFrame is a table with rows and columns.
Here’s what the data looks like:
| name | age | grade | city | |---------|-----|-------|------------| | Alice | 23 | 88 | New York | | Bob | 25 | 92 | Los Angeles| | Charlie | 22 | 85 | Chicago | | David | 23 | 90 | New York |
3️⃣ What Does “Transforming Data” Mean?
Raw data is usually messy.
Transformation means:
4️⃣ Common Data Transformations (With Examples)
A) Selecting Columns
df["name"]
Selecting multiple columns:
df[["name", "grade"]]
B) Filtering Rows
Example: Students older than 23
df[df["age"] > 23]
C) Creating a New Column
Example: Add “Pass” column (grade ≥ 90)
df["pass"] = df["grade"] >= 90
Result:
| name | age | grade | city | pass | |---------|-----|-------|------------|-------| | Alice | 23 | 88 | New York | False | | Bob | 25 | 92 | Los Angeles| True | | Charlie | 22 | 85 | Chicago | False | | David | 23 | 90 | New York | True |
D) Handling Missing Data
If data has missing values:
df.isnull().sum()
Remove missing rows:
df = df.dropna()
Fill missing values:
df["grade"] = df["grade"].fillna(df["grade"].mean())
E) Grouping and Aggregating
Example: Average grade per city
df.groupby("city")["grade"].mean()
Result:
| city | average_grade | |-------------|--------------| | Chicago | 85 | | Los Angeles | 92 | | New York | 89 |
F) Sorting Data
Sort by grade descending:
df.sort_values(by="grade", ascending=False)
G) Renaming Columns
df.rename(columns={"grade": "final_grade"}, inplace=True)
5️⃣ Why Data Transformation is Important
Real-world data is:
If you skip transformation:
Good analysts spend 70–80% of their time cleaning data.
6️⃣ Real-World Workflow
Here’s how loading and transforming usually works:
1. Import libraries 2. Load data 3. Inspect data (head, info, shape) 4. Clean missing values 5. Fix data types 6. Remove duplicates 7. Create new useful columns 8. Filter relevant data 9. Aggregate/group if needed 10. Analyze or visualize
7️⃣ Full Example Workflow (Step-by-Step Code)
import pandas as pd
# 1. Load data
df = pd.read_csv("students.csv")
# 2. Inspect data
print(df.head())
print(df.info())
# 3. Remove duplicates
df = df.drop_duplicates()
# 4. Handle missing values
df["grade"] = df["grade"].fillna(df["grade"].mean())
# 5. Create pass column
df["pass"] = df["grade"] >= 90
# 6. Filter students older than 22
filtered_df = df[df["age"] > 22]
# 7. Group by city and calculate average grade
city_avg = df.groupby("city")["grade"].mean()
# 8. Sort by grade
sorted_df = df.sort_values(by="grade", ascending=False)
print(city_avg)
print(sorted_df)
8️⃣ Summary (Simple Explanation)
Load and Transforming Data means:
StepMeaningLoadingImporting data into your programCleaningFixing errors and missing valuesTransformingChanging data structure or formatFilteringSelecting needed dataAggregatingSummarizing dataSortingOrganizing data
Complete Compilation of All Code Used
Here is everything combined into one script:
import pandas as pd
# Load CSV file
df = pd.read_csv("students.csv")
# Display first rows
print(df.head())
# Check info
print(df.info())
# Check missing values
print(df.isnull().sum())
# Remove duplicates
df = df.drop_duplicates()
# Fill missing grades with average
df["grade"] = df["grade"].fillna(df["grade"].mean())
# Select columns
names = df["name"]
name_grade = df[["name", "grade"]]
# Filter rows
older_students = df[df["age"] > 23]
# Create new column
df["pass"] = df["grade"] >= 90
# Group and aggregate
city_avg = df.groupby("city")["grade"].mean()
# Sort values
sorted_df = df.sort_values(by="grade", ascending=False)
# Rename column
df.rename(columns={"grade": "final_grade"}, inplace=True)
print(city_avg)
print(sorted_df)