Handling Data Using SQL and Pandas


Let's dive into the world of Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) data handling, comparing how the common operations look in SQL and in Pandas.

1. Loading & Accessing Data

SQL

  • Query data directly from relational tables
  • SELECT * FROM sales;
  • Join multiple tables
  • SELECT * FROM a JOIN b ON a.id = b.id;
  • Filter at source
  • WHERE date >= '2024-01-01'

Pandas

  • Load from CSV, Excel, JSON, SQL, Parquet
  • pd.read_csv("file.csv")
  • Connect to SQL via SQLAlchemy
  • pd.read_sql("SELECT * FROM sales", engine)
  • Load APIs or web data
  • pd.read_json(url)
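
A minimal sketch of the loading step, assuming a local sales.csv file and a SQLite database containing a sales table; the file name, connection string and table name are placeholders, not real sources.

  import pandas as pd
  from sqlalchemy import create_engine

  # Load a flat file into a DataFrame (path is a placeholder)
  df = pd.read_csv("sales.csv")

  # Query a relational table through SQLAlchemy, filtering at the source
  engine = create_engine("sqlite:///sales.db")  # connection string is a placeholder
  sales = pd.read_sql("SELECT * FROM sales WHERE date >= '2024-01-01'", engine)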

2. Cleaning & Preparing Data

SQL

  • Handle missing values
  • COALESCE(col, 'Unknown')
  • Remove duplicates
  • SELECT DISTINCT
  • Standardize formats
  • UPPER(), TRIM(), CAST()

Pandas

  • Fill missing values
  • df.fillna(value)
  • Drop duplicates
  • df.drop_duplicates()
  • String cleanup
  • df['col'].str.strip().str.upper()
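
A small sketch of the cleaning step on a toy DataFrame; the column name and fill value are just illustrative.

  import pandas as pd

  df = pd.DataFrame({"name": [" alice ", "BOB", "BOB", None]})

  df["name"] = df["name"].str.strip().str.upper()  # standardize formats (TRIM/UPPER equivalent)
  df["name"] = df["name"].fillna("UNKNOWN")        # handle missing values (COALESCE equivalent)
  df = df.drop_duplicates()                        # remove duplicates (SELECT DISTINCT equivalent)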

3. Filtering & Subsetting

SQL

  • Row filtering
  • WHERE amount > 1000
  • Conditional logic
  • CASE WHEN amount > 1000 THEN 'High' END

Pandas

  • Boolean filtering
  • df[df.amount > 1000]
  • Conditional columns
  • df['flag'] = np.where(df.amount > 1000, 'High', 'Low')
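
Putting both ideas together on a toy table; the 1000 threshold simply mirrors the snippets above.

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({"amount": [500, 1500, 2500]})

  high_value = df[df.amount > 1000]                       # boolean row filter (WHERE equivalent)
  df["flag"] = np.where(df.amount > 1000, "High", "Low")  # conditional column (CASE WHEN equivalent)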

4. Grouping & Aggregation

SQL

  • Group by one or more columns
  • GROUP BY region
  • Aggregations
  • SUM(), AVG(), COUNT()

Pandas

  • Group and aggregate
  • df.groupby('region').agg({'sales':'sum'})
  • Multi‑level grouping
  • df.groupby(['region','year']).sum()
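
A quick sketch of both grouping patterns on an invented sales table.

  import pandas as pd

  df = pd.DataFrame({
      "region": ["East", "East", "West"],
      "year":   [2023, 2024, 2024],
      "sales":  [100, 200, 300],
  })

  by_region = df.groupby("region").agg({"sales": "sum"})          # GROUP BY region
  by_region_year = df.groupby(["region", "year"])["sales"].sum()  # multi-level grouping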

5. Joining & Merging Data

SQL

  • INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN

Pandas

  • pd.merge(df1, df2, on='id', how='left')
  • Supports complex merges with suffixes and indicators
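
A sketch of a LEFT JOIN equivalent in Pandas; the customers and orders frames are invented for illustration.

  import pandas as pd

  customers = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Ben"]})
  orders = pd.DataFrame({"id": [1, 1, 3], "amount": [10, 20, 30]})

  # LEFT JOIN equivalent; indicator adds a _merge column showing which side each row came from
  merged = pd.merge(customers, orders, on="id", how="left",
                    suffixes=("_cust", "_ord"), indicator=True)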

6. Reshaping Data (Pivot, Unpivot, Melt)

SQL

  • Pivot tables (varies by engine)
  • PIVOT (SUM(sales) FOR month IN (...))
  • Unpivot using UNION ALL or UNPIVOT

Pandas

  • Pivot
  • df.pivot(index, columns, values)
  • Melt
  • pd.melt(df, id_vars=['id'])
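
A minimal pivot/melt round trip on a toy table, going long to wide and back again.

  import pandas as pd

  df = pd.DataFrame({
      "id":    [1, 1, 2, 2],
      "month": ["Jan", "Feb", "Jan", "Feb"],
      "sales": [10, 20, 30, 40],
  })

  wide = df.pivot(index="id", columns="month", values="sales")   # long -> wide
  long = pd.melt(wide.reset_index(), id_vars=["id"],
                 var_name="month", value_name="sales")           # wide -> long again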

7. Window Functions (Advanced Analytics)

SQL

  • Ranking
  • ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
  • Running totals
  • SUM(sales) OVER (ORDER BY date)

Pandas

  • Rolling windows
  • df['roll'] = df.sales.rolling(7).mean()
  • Expanding windows
  • df['cum'] = df.sales.expanding().sum()
  • Rank
  • df['rank'] = df.sales.rank()
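
A sketch of the Pandas counterparts on a small daily series; note that rank() handles ties differently from ROW_NUMBER(), so treat it as a close analogue rather than an exact match.

  import pandas as pd

  df = pd.DataFrame({
      "date":  pd.date_range("2024-01-01", periods=10, freq="D"),
      "sales": [5, 3, 8, 2, 9, 4, 7, 6, 1, 10],
  })

  df["roll_7"] = df["sales"].rolling(7).mean()    # 7-row rolling average
  df["running"] = df["sales"].expanding().sum()   # running total, like SUM(sales) OVER (ORDER BY date)
  df["rank"] = df["sales"].rank(ascending=False)  # ranking, analogous to RANK()/ROW_NUMBER()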

8. Data Transformation

SQL

  • Mathematical transformations
  • sales * 1.1
  • Date extraction
  • EXTRACT(YEAR FROM date)

Pandas

  • Vectorized math
  • df['sales'] * 1.1
  • Date components
  • df['date'].dt.year
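
A short sketch of vectorized math and date extraction; the 10% uplift is only illustrative.

  import pandas as pd

  df = pd.DataFrame({
      "sales": [100.0, 200.0],
      "date":  pd.to_datetime(["2024-03-01", "2024-07-15"]),
  })

  df["sales_uplift"] = df["sales"] * 1.1  # vectorized math
  df["year"] = df["date"].dt.year         # like EXTRACT(YEAR FROM date)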

9. Data Validation

SQL

  • Range checks
  • WHERE age BETWEEN 18 AND 60
  • Regex validation
  • REGEXP_LIKE(email, '...')

Pandas

  • Boolean masks
  • df[df.age.between(18,60)]
  • Regex
  • df['email'].str.contains(pattern)
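
A sketch of both validation checks; the email regex is deliberately naive and only for illustration.

  import pandas as pd

  df = pd.DataFrame({
      "age":   [17, 25, 70],
      "email": ["a@example.com", "not-an-email", "c@example.org"],
  })

  in_range = df[df.age.between(18, 60)]                                          # range check
  valid = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)    # naive email pattern
  invalid_rows = df[~valid]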

10. Exporting & Saving Data

SQL

  • Insert into tables
  • INSERT INTO table SELECT ...
  • Create new tables
  • CREATE TABLE new AS SELECT ...

Pandas

  • Save to CSV, Excel, Parquet
  • df.to_csv("out.csv")
  • Write back to SQL
  • df.to_sql("table", engine, if_exists='replace')
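
A final sketch writing the same frame to a flat file and back to a database; the output path, connection string and table name are placeholders.

  import pandas as pd
  from sqlalchemy import create_engine

  df = pd.DataFrame({"id": [1, 2], "sales": [100, 200]})

  df.to_csv("out.csv", index=False)                 # flat-file export
  engine = create_engine("sqlite:///warehouse.db")  # placeholder connection string
  df.to_sql("sales_summary", engine, if_exists="replace", index=False)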

Let me know your opinions on this brief overview of data operations.

Thanks,

Jyoti.
