

Handling Data Using SQL and Pandas
Let's dive into the world of Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) data workflows.
1. Loading & Accessing Data
SQL
- Query data directly from relational tables
- SELECT * FROM sales;
- Join multiple tables
- SELECT * FROM a JOIN b ON a.id = b.id;
- Filter at source
- WHERE date >= '2024-01-01'
Pandas
- Load from CSV, Excel, JSON, SQL, Parquet
- pd.read_csv("file.csv")
- Connect to SQL via SQLAlchemy
- pd.read_sql("SELECT * FROM sales", engine)
- Load APIs or web data
- pd.read_json(url)
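
A minimal pandas loading sketch tying the bullets above together. The file name, SQLite connection string, and table name are placeholders, not real sources:

import pandas as pd
from sqlalchemy import create_engine

# Load a flat file (path is hypothetical); parse the date column up front
sales_csv = pd.read_csv("sales.csv", parse_dates=["date"])

# Query a relational table through SQLAlchemy, filtering at the source
engine = create_engine("sqlite:///sales.db")  # placeholder connection string
sales_db = pd.read_sql("SELECT * FROM sales WHERE date >= '2024-01-01'", engine)
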
2. Cleaning & Preparing Data
SQL
- Handle missing values
- COALESCE(col, 'Unknown')
- Remove duplicates
- SELECT DISTINCT
- Standardize formats
- UPPER(), TRIM(), CAST()
Pandas
- Fill missing values
- df.fillna(value)
- Drop duplicates
- df.drop_duplicates()
- String cleanup
- df['col'].str.strip().str.upper()
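
A short cleaning sketch, using made-up sample data, that mirrors the SQL side (COALESCE, DISTINCT, TRIM/UPPER):

import pandas as pd

df = pd.DataFrame({
    "region": [" east", "WEST ", " east", None],
    "sales":  [100, 250, 100, 80],
})

df["region"] = df["region"].fillna("Unknown")         # like COALESCE(col, 'Unknown')
df["region"] = df["region"].str.strip().str.upper()   # like TRIM() + UPPER()
df = df.drop_duplicates()                              # like SELECT DISTINCT
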
3. Filtering & Subsetting
SQL
- Row filtering
- WHERE amount > 1000
- Conditional logic
- CASE WHEN amount > 1000 THEN 'High' END
Pandas
- Boolean filtering
- df[df.amount > 1000]
- Conditional columns
- df['flag'] = np.where(df.amount > 1000, 'High', 'Low')
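
A quick sketch of boolean filtering and a CASE-style conditional column, with illustrative column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [500, 1500, 2200]})

high_value = df[df["amount"] > 1000]                        # WHERE amount > 1000
df["flag"] = np.where(df["amount"] > 1000, "High", "Low")   # CASE WHEN amount > 1000 THEN 'High' ELSE 'Low' END
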
4. Grouping & Aggregation
SQL
- Group by one or more columns
- GROUP BY region
- Aggregations
- SUM(), AVG(), COUNT()
Pandas
- Group and aggregate
- df.groupby('region').agg({'sales':'sum'})
- Multi‑level grouping
- df.groupby(['region','year']).sum()
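
A grouping sketch on toy data; column names are assumptions. Selecting the numeric column explicitly keeps the multi-level sum well defined:

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "year":   [2024, 2024, 2024],
    "sales":  [100, 200, 300],
})

# GROUP BY region with SUM(sales) and AVG(sales)
summary = df.groupby("region").agg(total_sales=("sales", "sum"),
                                   avg_sales=("sales", "mean"))

# Multi-level grouping on region and year
by_region_year = df.groupby(["region", "year"])["sales"].sum()
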
5. Joining & Merging Data
SQL
- INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN
Pandas
- pd.merge(df1, df2, on='id', how='left')
- Supports complex merges with suffixes and indicators
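
A merge sketch equivalent to a LEFT JOIN, with hypothetical orders/customers frames; indicator=True flags which side each row came from:

import pandas as pd

orders    = pd.DataFrame({"id": [1, 2, 3], "amount": [100, 200, 300]})
customers = pd.DataFrame({"id": [1, 2],    "name": ["Ana", "Bo"]})

# LEFT JOIN ... ON orders.id = customers.id
merged = pd.merge(orders, customers, on="id", how="left",
                  suffixes=("_order", "_cust"), indicator=True)
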
6. Reshaping Data (Pivot, Unpivot, Melt)
SQL
- Pivot tables (varies by engine)
- PIVOT (SUM(sales) FOR month IN (...))
- Unpivot using UNION ALL or UNPIVOT
Pandas
- Pivot
- df.pivot(index=..., columns=..., values=...)
- Melt
- pd.melt(df, id_vars=['id'])
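
A reshaping sketch on a small made-up long table: pivot goes long to wide (like PIVOT), melt goes back (like UNPIVOT):

import pandas as pd

long_df = pd.DataFrame({
    "id":    [1, 1, 2, 2],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [10, 20, 30, 40],
})

# Long -> wide
wide_df = long_df.pivot(index="id", columns="month", values="sales")

# Wide -> long again
back_to_long = wide_df.reset_index().melt(id_vars=["id"],
                                          var_name="month", value_name="sales")
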
7. Window Functions (Advanced Analytics)
SQL
- Ranking
- ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
- Running totals
- SUM(sales) OVER (ORDER BY date)
Pandas
- Rolling windows
- df['roll'] = df.sales.rolling(7).mean()
- Expanding windows
- df['cum'] = df.sales.expanding().sum()
- Rank
- df['rank'] = df.sales.rank()
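
A window-function sketch on toy data. The group-wise rank approximates ROW_NUMBER() OVER (PARTITION BY ...), cumsum gives a running total, and the 2-row rolling window is purely illustrative:

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "date":   pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]),
    "sales":  [100, 200, 150, 50],
})
df = df.sort_values("date")

# ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
df["rank_in_region"] = df.groupby("region")["sales"].rank(method="first", ascending=False)

# SUM(sales) OVER (ORDER BY date) -- running total
df["running_total"] = df["sales"].cumsum()

# Rolling average within each region
df["rolling_avg"] = df.groupby("region")["sales"].transform(
    lambda s: s.rolling(2, min_periods=1).mean())
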
8. Data Transformation
SQL
- Mathematical transformations
- sales * 1.1
- Date extraction
- EXTRACT(YEAR FROM date)
Pandas
- Vectorized math
- df['sales'] * 1.1
- Date components
- df['date'].dt.year
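
A transformation sketch with assumed column names, showing vectorized math and date-part extraction:

import pandas as pd

df = pd.DataFrame({
    "sales": [100.0, 200.0],
    "date":  pd.to_datetime(["2024-03-15", "2024-07-01"]),
})

df["sales_with_tax"] = df["sales"] * 1.1   # vectorized math, like sales * 1.1
df["year"]  = df["date"].dt.year           # like EXTRACT(YEAR FROM date)
df["month"] = df["date"].dt.month
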
9. Data Validation
SQL
- Range checks
- WHERE age BETWEEN 18 AND 60
- Regex validation
- REGEXP_LIKE(email, '...')
Pandas
- Boolean masks
- df[df.age.between(18,60)]
- Regex
- df['email'].str.contains(pattern)
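
A validation sketch on sample rows; the email regex is deliberately simplistic and only for illustration:

import pandas as pd

df = pd.DataFrame({
    "age":   [25, 70, 40],
    "email": ["a@example.com", "not-an-email", "b@example.com"],
})

in_range = df[df["age"].between(18, 60)]              # WHERE age BETWEEN 18 AND 60

email_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"         # toy pattern, not production-grade
valid_email = df["email"].str.contains(email_pattern, regex=True)
invalid_rows = df[~valid_email]
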
10. Exporting & Saving Data
SQL
- Insert into tables
- INSERT INTO table SELECT ...
- Create new tables
- CREATE TABLE new AS SELECT ...
Pandas
- Save to CSV, Excel, Parquet
- df.to_csv("out.csv")
- Write back to SQL
- df.to_sql("table", engine, if_exists='replace')
Let me know your opinions on this brief overview of data operations.
Thanks,
Jyoti.
