

Handling Data Using SQL and Pandas
Let's dive into the world of Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) data workflows.
1. Loading & Accessing Data
SQL
- Query data directly from relational tables
- SELECT * FROM sales;
- Join multiple tables
- SELECT * FROM a JOIN b ON a.id = b.id;
- Filter at source
- WHERE date >= '2024-01-01'
Pandas
- Load from CSV, Excel, JSON, SQL, Parquet
- pd.read_csv("file.csv")
- Connect to SQL via SQLAlchemy
- pd.read_sql("SELECT * FROM sales", engine)
- Load APIs or web data
- pd.read_json(url)
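
A minimal pandas loading sketch tying the bullets above together. The file name, SQLite connection string, and table name are placeholders, not real sources:

import pandas as pd
from sqlalchemy import create_engine

# Load a flat file (path is hypothetical); parse the date column up front
sales_csv = pd.read_csv("sales.csv", parse_dates=["date"])

# Query a relational table through SQLAlchemy, filtering at the source
engine = create_engine("sqlite:///sales.db")  # placeholder connection string
sales_db = pd.read_sql("SELECT * FROM sales WHERE date >= '2024-01-01'", engine)
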
2. Cleaning & Preparing Data
SQL
- Handle missing values
- COALESCE(col, 'Unknown')
- Remove duplicates
- SELECT DISTINCT
- Standardize formats
- UPPER(), TRIM(), CAST()
Pandas
- Fill missing values
- df.fillna(value)
- Drop duplicates
- df.drop_duplicates()
- String cleanup
- df['col'].str.strip().str.upper()
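
A short cleaning sketch, using made-up sample data, that mirrors the SQL side (COALESCE, DISTINCT, TRIM/UPPER):

import pandas as pd

df = pd.DataFrame({
    "region": [" east", "WEST ", " east", None],
    "sales":  [100, 250, 100, 80],
})

df["region"] = df["region"].fillna("Unknown")         # like COALESCE(col, 'Unknown')
df["region"] = df["region"].str.strip().str.upper()   # like TRIM() + UPPER()
df = df.drop_duplicates()                              # like SELECT DISTINCT
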
3. Filtering & Subsetting
SQL
- Row filtering
- WHERE amount > 1000
- Conditional logic
- CASE WHEN amount > 1000 THEN 'High' END
Pandas
- Boolean filtering
- df[df.amount > 1000]
- Conditional columns
- df['flag'] = np.where(df.amount > 1000, 'High', 'Low')
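
A quick sketch of boolean filtering and a CASE-style conditional column, with illustrative column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [500, 1500, 2200]})

high_value = df[df["amount"] > 1000]                        # WHERE amount > 1000
df["flag"] = np.where(df["amount"] > 1000, "High", "Low")   # CASE WHEN amount > 1000 THEN 'High' ELSE 'Low' END
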
4. Grouping & Aggregation
SQL
- Group by one or more columns
- GROUP BY region
- Aggregations
- SUM(), AVG(), COUNT()
Pandas
- Group and aggregate
- df.groupby('region').agg({'sales':'sum'})
- Multi‑level grouping
- df.groupby(['region','year']).sum()
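
A grouping sketch on toy data; column names are assumptions. Selecting the numeric column explicitly keeps the multi-level sum well defined:

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "year":   [2024, 2024, 2024],
    "sales":  [100, 200, 300],
})

# GROUP BY region with SUM(sales) and AVG(sales)
summary = df.groupby("region").agg(total_sales=("sales", "sum"),
                                   avg_sales=("sales", "mean"))

# Multi-level grouping on region and year
by_region_year = df.groupby(["region", "year"])["sales"].sum()
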
5. Joining & Merging Data
SQL
- INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN
Pandas
- pd.merge(df1, df2, on='id', how='left')
- Supports complex merges with suffixes and indicators
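
A merge sketch equivalent to a LEFT JOIN, with hypothetical orders/customers frames; indicator=True flags which side each row came from:

import pandas as pd

orders    = pd.DataFrame({"id": [1, 2, 3], "amount": [100, 200, 300]})
customers = pd.DataFrame({"id": [1, 2],    "name": ["Ana", "Bo"]})

# LEFT JOIN ... ON orders.id = customers.id
merged = pd.merge(orders, customers, on="id", how="left",
                  suffixes=("_order", "_cust"), indicator=True)
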
6. Reshaping Data (Pivot, Unpivot, Melt)
SQL
- Pivot tables (varies by engine)
- PIVOT (SUM(sales) FOR month IN (...))
- Unpivot using UNION ALL or UNPIVOT
Pandas
- Pivot
- df.pivot(index=..., columns=..., values=...)
- Melt
- pd.melt(df, id_vars=['id'])
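
A reshaping sketch on a small made-up long table: pivot goes long to wide (like PIVOT), melt goes back (like UNPIVOT):

import pandas as pd

long_df = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [10, 20, 30, 40],
})

# Long -> wide
wide_df = long_df.pivot(index="id", columns="month", values="sales")

# Wide -> long again
back_to_long = wide_df.reset_index().melt(id_vars=["id"],
                                          var_name="month", value_name="sales")
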
7. Window Functions (Advanced Analytics)
SQL
- Ranking
- ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
- Running totals
- SUM(sales) OVER (ORDER BY date)
Pandas
- Rolling windows
- df['roll'] = df.sales.rolling(7).mean()
- Expanding windows
- df['cum'] = df.sales.expanding().sum()
- Rank
- df['rank'] = df.sales.rank()
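
A window-function sketch on toy data. The group-wise rank approximates ROW_NUMBER() OVER (PARTITION BY ...), cumsum gives a running total, and the 2-row rolling window is purely illustrative:

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "date":   pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]),
    "sales":  [100, 200, 150, 50],
})
df = df.sort_values("date")

# ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
df["rank_in_region"] = df.groupby("region")["sales"].rank(method="first", ascending=False)

# SUM(sales) OVER (ORDER BY date) -- running total
df["running_total"] = df["sales"].cumsum()

# Rolling average within each region
df["rolling_avg"] = df.groupby("region")["sales"].transform(
    lambda s: s.rolling(2, min_periods=1).mean())
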
8. Data Transformation
SQL
- Mathematical transformations
- sales * 1.1
- Date extraction
- EXTRACT(YEAR FROM date)
Pandas
- Vectorized math
- df['sales'] * 1.1
- Date components
- df['date'].dt.year
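
A transformation sketch with assumed column names, showing vectorized math and date-part extraction:

import pandas as pd

df = pd.DataFrame({
    "sales": [100.0, 200.0],
    "date":  pd.to_datetime(["2024-03-15", "2024-07-01"]),
})

df["sales_with_tax"] = df["sales"] * 1.1   # vectorized math, like sales * 1.1
df["year"]  = df["date"].dt.year           # like EXTRACT(YEAR FROM date)
df["month"] = df["date"].dt.month
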
9. Data Validation
SQL
- Range checks
- WHERE age BETWEEN 18 AND 60
- Regex validation
- REGEXP_LIKE(email, '...')
Pandas
- Boolean masks
- df[df.age.between(18,60)]
- Regex
- df['email'].str.contains(pattern)
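
A validation sketch on sample rows; the email regex is deliberately simplistic and only for illustration:

import pandas as pd

df = pd.DataFrame({
    "age":   [25, 70, 40],
    "email": ["a@example.com", "not-an-email", "b@example.com"],
})

in_range = df[df["age"].between(18, 60)]              # WHERE age BETWEEN 18 AND 60

email_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"         # toy pattern, not production-grade
valid_email = df["email"].str.contains(email_pattern, regex=True)
invalid_rows = df[~valid_email]
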
10. Exporting & Saving Data
SQL
- Insert into tables
- INSERT INTO table SELECT ...
- Create new tables
- CREATE TABLE new AS SELECT ...
Pandas
- Save to CSV, Excel, Parquet
- df.to_csv("out.csv")
- Write back to SQL
- df.to_sql("table", engine, if_exists='replace')
Let me know your opinions on this brief overview of data operations.
Thanks,
Jyoti.
