PySpark DataFrames vs. Spark SQL: Which One Should You Use?

When building data pipelines in Apache Spark, one of the most common questions is: “Should I write this in Spark SQL or use the DataFrame API?” As far as performance goes, the short answer is that it doesn't matter: both are powered by the Catalyst Optimizer, meaning Spark converts both into the same optimized physical plan under the hood. However, from a Data Engineering perspective, the choice significantly impacts how you build, test, scale, and maintain your code. ...

February 26, 2026 · Arjun Sajeevan

Real-Time Streaming Analytics: Kafka & PySpark

Level: Intermediate to Advanced Data Engineering
Tech Stack: Python · Apache Kafka · Docker · PySpark Structured Streaming · JVM

The Problem: Batch is Too Slow

In modern e-commerce, waiting 24 hours to analyze sales data is no longer acceptable. Businesses need to know what is selling right now — to manage inventory, detect fraud, and trigger real-time marketing. To solve this, I designed and built a decoupled, event-driven streaming architecture locally. This project serves as a blueprint for how enterprise companies move from static batch processing to real-time data-in-motion. ...

February 22, 2026 · Arjun Sajeevan