Why pandas breaks at 57GB
The NYC TLC dataset is a classic "too big for a laptop" problem. A single year of trip data is ~10GB compressed. Across multiple years, you're looking at 57GB+ of Parquet files - far beyond what pandas can load into memory at once. Traditional approaches either crash, require expensive cloud clusters, or force you to downsample and lose signal.
The goal was to process the full dataset on a single machine, efficiently, without approximation - then layer on time-series analysis and PCA-based clustering to find patterns that matter.
Dataset: NYC TLC Yellow & Green Taxi Trips - multiple years, 57GB+ of Parquet files. Features include pickup/dropoff datetime, location IDs, passenger count, trip distance, fare amount, tip, payment type, and congestion surcharge.
How the pipeline works
The pipeline has three layers: ingest → transform → analyse. Dask handles the first two with lazy evaluation - nothing is computed until absolutely necessary. PyArrow provides the columnar Parquet I/O. The analysis layer materialises only the aggregated results, not the full dataset.
```python
import dask.dataframe as dd

# Lazy load - nothing reads from disk yet
ddf = dd.read_parquet(
    "data/yellow_tripdata_*.parquet",
    engine="pyarrow",
    columns=["tpep_pickup_datetime", "trip_distance", "fare_amount",
             "tip_amount", "PULocationID"],
)

# Feature engineering (still lazy)
ddf["hour"] = ddf["tpep_pickup_datetime"].dt.hour
ddf["dayofweek"] = ddf["tpep_pickup_datetime"].dt.dayofweek
ddf = ddf[(ddf.trip_distance > 0) & (ddf.fare_amount > 0)]

# Aggregate - this is when Dask actually reads + processes
hourly = ddf.groupby("hour")["trip_distance"].mean().compute()
```
Distributed Processing with Dask
Replaced pandas with Dask DataFrames for out-of-core, parallel computation. Dask partitions the dataset across files and processes each partition lazily, only materialising results when .compute() is called.
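The project's exact scheduler settings aren't shown here, but a local distributed cluster can be spun up along these lines - worker count and memory limits below are illustrative assumptions, not the pipeline's actual configuration:

```python
from dask.distributed import Client, LocalCluster

# Illustrative local-cluster setup (worker count and memory limit are assumed)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)
print(client.dashboard_link)  # live dashboard for watching the task graph execute
```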
Time-Series Analysis
Extracted temporal features (hour, day of week, month, season) and computed rolling averages for ride volume, fare trends, and tip rates - revealing rush-hour peaks, weekend shifts, and seasonal demand drops.
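A minimal sketch of the ride-volume piece, assuming the lazy `ddf` from above: aggregate to daily counts in Dask, then apply a rolling mean to the small materialised series in pandas. The 7-day window is illustrative.

```python
# Daily ride volume (computed in Dask) with a 7-day rolling mean (in pandas)
daily = (
    ddf.assign(pickup_date=ddf["tpep_pickup_datetime"].dt.floor("D"))
       .groupby("pickup_date")
       .size()
       .compute()
       .sort_index()
)
rolling_volume = daily.rolling(window=7, min_periods=1).mean()
```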
PCA on Zone Aggregations
Aggregated trip statistics by TLC zone (pickup rate, avg distance, avg fare, tip ratio) then applied PCA to reduce to 2D - revealing clusters of tourist zones, commuter zones, and airport corridors.
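A sketch of that step, assuming the column names from the read above; the exact feature set and scaling choices in the project may differ:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Per-zone aggregates: avg distance, avg fare, tip ratio, pickup volume
ddf["tip_ratio"] = ddf["tip_amount"] / ddf["fare_amount"]
zone_stats = (
    ddf.groupby("PULocationID")
       .agg({"trip_distance": "mean", "fare_amount": "mean", "tip_ratio": "mean"})
       .compute()
)
zone_stats["pickups"] = ddf.groupby("PULocationID").size().compute()

# Standardise, then project each zone onto 2 principal components
X = StandardScaler().fit_transform(zone_stats)
coords = PCA(n_components=2).fit_transform(X)  # one 2-D point per zone
```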
Data Quality Pipeline
Filtered impossible trips (zero distance, negative fares, zero passengers), capped fare outliers at the 99th percentile, and handled timezone alignment across multi-year data with different schema versions.
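The filters might look like the sketch below; it assumes passenger_count is included in the columns read, and note that Dask's quantile is computed approximately across partitions.

```python
# Drop impossible trips, then cap fares at the 99th percentile
valid = ddf[
    (ddf.trip_distance > 0)
    & (ddf.fare_amount > 0)
    & (ddf.passenger_count > 0)
]
fare_cap = valid["fare_amount"].quantile(0.99).compute()  # approximate across partitions
valid = valid[valid.fare_amount <= fare_cap]
```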
Spatial Insights
Joined TLC zone lookups to map location IDs to borough and neighbourhood names - enabling borough-level demand analysis without expensive geospatial joins on the full dataset.
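The lookup table is tiny, so it can stay in pandas and be broadcast-merged onto the Dask frame. Column names below follow the public taxi_zone_lookup.csv layout and are assumed here:

```python
import pandas as pd

# Small lookup table in pandas, merged onto the lazy Dask frame
zones = pd.read_csv("data/taxi_zone_lookup.csv")  # LocationID, Borough, Zone
ddf = ddf.merge(zones, left_on="PULocationID", right_on="LocationID", how="left")
borough_demand = ddf.groupby("Borough").size().compute()
```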
Visualisation Suite
Built a full set of Matplotlib/Seaborn charts: hourly demand curves, day-of-week heatmaps, fare distributions by borough, PCA scatterplots with zone labels, and tip-rate trends by payment type.
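As one example, the hourly demand curve only needs the small materialised aggregate; styling here is illustrative rather than the project's exact chart code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hourly demand curve from the materialised groupby result
hourly_counts = ddf.groupby("hour").size().compute().sort_index()
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(hourly_counts.index, hourly_counts.values, marker="o")
ax.set(xlabel="Pickup hour", ylabel="Trips", title="Hourly demand")
fig.tight_layout()
```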
What the data showed
Dual peak demand
Ridership peaks at 8–9am and 5–7pm on weekdays, with a completely different profile on weekends - late-night entertainment demand replacing the commuter spike.
3 zone clusters via PCA
PCA revealed distinct airport zones (long trips, high fares), Midtown/tourist zones (short trips, high tip rates), and outer-borough zones (longer trips, low tips).
Card payments = higher tips
Credit card payments average 18.2% tip vs 12.1% for cash - a consistent pattern across all boroughs and time windows in the dataset.
Winter dip, summer plateau
January–February shows ~15% demand drop vs summer months. December spikes around holidays but crashes immediately after New Year.
Technologies Used
| Category | Tool & Role |
|---|---|
| Distributed Computing | Dask - out-of-core parallel DataFrame processing; lazy evaluation graph |
| File I/O | PyArrow - columnar Parquet reading with column pruning for efficiency |
| Analysis | Pandas & NumPy - post-aggregation analysis on materialised results |
| Dimensionality Reduction | scikit-learn PCA - zone-level clustering and pattern discovery |
| Visualisation | Matplotlib & Seaborn - full chart suite for temporal and spatial analysis |
| Dataset | NYC TLC Yellow & Green Taxi Trips - 57GB+ multi-year Parquet files |
Results
- Proved single-machine distributed processing is viable for 57GB+ datasets with Dask
- Dual-peak demand structure validated - actionable for surge pricing and driver allocation
- PCA zone clusters align closely with known NYC borough geography without using coordinates
- Pipeline is fully modular - new years of data are automatically picked up by the glob pattern
What I took away
- Lazy evaluation is the key insight of Dask - building a computation graph before executing means you only pay for what you actually need.
- Column pruning at read time (specifying columns=[...] in read_parquet) gave a 3–4x speedup over reading all columns and dropping them later.
- PCA on aggregated zone-level statistics is surprisingly powerful for spatial segmentation without needing actual geospatial tooling.
- Schema drift between dataset years (column renames, new fields) requires careful validation before blindly unioning partitions - see the sketch below.
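One way to catch that drift early is to inspect each file's Parquet schema before reading; the expected column set and paths below are illustrative assumptions:

```python
from glob import glob
import pyarrow.parquet as pq

# Check each file's schema before unioning, so renames/missing columns surface early
expected = {"tpep_pickup_datetime", "trip_distance", "fare_amount",
            "tip_amount", "PULocationID"}
for path in sorted(glob("data/yellow_tripdata_*.parquet")):
    names = set(pq.read_schema(path).names)
    missing = expected - names
    if missing:
        print(f"{path}: missing {sorted(missing)}")
```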