Why pandas breaks at 57GB

The NYC TLC dataset is a classic "too big for a laptop" problem. A single year of trip data is ~10GB compressed. Across multiple years, you're looking at 57GB+ of Parquet files - far beyond what pandas can load into memory at once. Traditional approaches either crash, require expensive cloud clusters, or force you to downsample and lose signal.

The goal was to process the full dataset on a single machine, efficiently, without approximation - then layer on time-series analysis and PCA-based clustering to find patterns that matter.

Dataset: NYC TLC Yellow & Green Taxi Trips - multiple years, 57GB+ of Parquet files. Features include pickup/dropoff datetime, location IDs, passenger count, trip distance, fare amount, tip, payment type, and congestion surcharge.

How the pipeline works

The pipeline has three layers: ingest → transform → analyse. Dask handles the first two with lazy evaluation - nothing is computed until absolutely necessary. PyArrow provides the columnar Parquet I/O. The analysis layer materialises only the aggregated results, not the full dataset.

import dask.dataframe as dd

# Lazy load - nothing reads from disk yet
ddf = dd.read_parquet("data/yellow_tripdata_*.parquet",
                      engine="pyarrow",
                      columns=["tpep_pickup_datetime", "trip_distance",
                               "fare_amount", "tip_amount", "PULocationID"])

# Feature engineering (still lazy)
ddf["hour"] = ddf["tpep_pickup_datetime"].dt.hour
ddf["dayofweek"] = ddf["tpep_pickup_datetime"].dt.dayofweek
ddf = ddf[(ddf.trip_distance > 0) & (ddf.fare_amount > 0)]

# Aggregate  -  this is when Dask actually reads + processes
hourly = ddf.groupby("hour")["trip_distance"].mean().compute()

Distributed Processing with Dask

Replaced pandas with Dask DataFrames for out-of-core, parallel computation. Dask partitions the dataset across files and processes each lazily - only materialising results when .compute() is called.

Time-Series Analysis

Extracted temporal features (hour, day of week, month, season) and computed rolling averages for ride volume, fare trends, and tip rates - revealing rush-hour peaks, weekend shifts, and seasonal demand drops.
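The rolling-average step can be sketched on the materialised (already aggregated) side with plain pandas. The data here is synthetic and the column name is illustrative, not the pipeline's actual output:

```python
import numpy as np
import pandas as pd

# Hypothetical daily ride counts over four weeks: weekdays busier than weekends
idx = pd.date_range("2023-01-01", periods=28, freq="D")
gen = np.random.default_rng(0)
rides = pd.Series(
    1000 + 200 * (idx.dayofweek < 5).astype(int) + gen.integers(-50, 50, 28),
    index=idx, name="rides",
)

# A 7-day rolling mean smooths the weekday/weekend cycle into a trend line;
# the first 6 positions are NaN until the window fills
trend = rides.rolling(window=7).mean()
```

The same pattern applies to fare and tip-rate series once they are aggregated to a daily or hourly index.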

PCA on Zone Aggregations

Aggregated trip statistics by TLC zone (pickup rate, avg distance, avg fare, tip ratio) then applied PCA to reduce to 2D - revealing clusters of tourist zones, commuter zones, and airport corridors.
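A minimal sketch of that reduction step, using a random stand-in for the per-zone feature matrix (the real one comes out of the Dask aggregation; the 260-row size roughly matches the TLC zone lookup):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical per-zone feature matrix: rows = TLC zones,
# columns = [pickup_rate, avg_distance, avg_fare, tip_ratio]
gen = np.random.default_rng(42)
zone_stats = gen.normal(size=(260, 4))

# Standardise first so no single feature (e.g. fare in dollars) dominates
X = StandardScaler().fit_transform(zone_stats)

# Project onto the first two principal components for plotting / clustering
coords = PCA(n_components=2).fit_transform(X)
```

The 2D `coords` array is what gets scatter-plotted with zone labels to eyeball the clusters.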

Data Quality Pipeline

Filtered impossible trips (zero distance, negative fares, zero passengers), capped fare outliers at the 99th percentile, and handled timezone alignment across multi-year data with different schema versions.
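The filtering and capping rules look roughly like this in pandas form (the sample rows are invented to show each failure mode):

```python
import pandas as pd

# Hypothetical sample containing the kinds of bad rows the pipeline drops
df = pd.DataFrame({
    "trip_distance": [1.2, 0.0, 3.4, 5.0, 2.1],
    "fare_amount": [8.0, 5.0, -3.0, 450.0, 12.0],
    "passenger_count": [1, 2, 1, 0, 3],
})

# Drop impossible trips: zero distance, non-positive fares, zero passengers
clean = df[(df.trip_distance > 0)
           & (df.fare_amount > 0)
           & (df.passenger_count > 0)]

# Cap fare outliers at the 99th percentile instead of dropping them
cap = clean.fare_amount.quantile(0.99)
clean = clean.assign(fare_amount=clean.fare_amount.clip(upper=cap))
```

On the real pipeline the same expressions run lazily on the Dask DataFrame, so the filters are fused with the reads.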

Spatial Insights

Joined TLC zone lookups to map location IDs to borough and neighbourhood names - enabling borough-level demand analysis without expensive geospatial joins on the full dataset.
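Since the join happens on small aggregated frames, it is a plain attribute merge. A sketch with a made-up slice of the zone lookup:

```python
import pandas as pd

# Hypothetical slice of the TLC taxi zone lookup table
zones = pd.DataFrame({
    "LocationID": [132, 161, 7],
    "Borough": ["Queens", "Manhattan", "Queens"],
    "Zone": ["JFK Airport", "Midtown Center", "Astoria"],
})

# Aggregated trip counts per pickup zone (already materialised, so small)
trips = pd.DataFrame({"PULocationID": [132, 161, 161, 7],
                      "n_trips": [120, 300, 250, 80]})

# Cheap attribute join - no geospatial ops needed
labelled = trips.merge(zones, left_on="PULocationID", right_on="LocationID")
by_borough = labelled.groupby("Borough")["n_trips"].sum()
```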

Visualisation Suite

Built a full set of Matplotlib/Seaborn charts: hourly demand curves, day-of-week heatmaps, fare distributions by borough, PCA scatterplots with zone labels, and tip-rate trends by payment type.

What the data showed

Time Pattern: dual-peak demand

Ridership peaks at 8–9am and 5–7pm on weekdays, with a completely different profile on weekends - late-night entertainment demand replacing the commuter spike.

Spatial Pattern: 3 zone clusters via PCA

PCA revealed distinct airport zones (long trips, high fares), Midtown/tourist zones (short trips, high tip rates), and outer-borough zones (longer trips, low tips).

Economic Pattern: card payments = higher tips

Credit card payments average 18.2% tip vs 12.1% for cash - a consistent pattern across all boroughs and time windows in the dataset.
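A metric like this reduces to one grouped mean of per-trip tip rates. A toy sketch (the numbers are invented; in the TLC data dictionary `payment_type` 1 is credit card and 2 is cash):

```python
import pandas as pd

# Hypothetical trips; payment_type 1 = credit card, 2 = cash
df = pd.DataFrame({
    "payment_type": [1, 1, 2, 2, 1],
    "fare_amount": [10.0, 20.0, 10.0, 15.0, 8.0],
    "tip_amount": [2.0, 4.0, 1.0, 1.5, 1.6],
})

# Mean per-trip tip rate, grouped by payment type
tip_rate = (df.tip_amount / df.fare_amount).groupby(df.payment_type).mean()
```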

Seasonal Pattern: winter dip, summer plateau

January–February shows ~15% demand drop vs summer months. December spikes around holidays but crashes immediately after New Year.

Technologies Used

  • Distributed Computing: Dask - out-of-core parallel DataFrame processing; lazy evaluation graph
  • File I/O: PyArrow - columnar Parquet reading with column pruning for efficiency
  • Analysis: Pandas & NumPy - post-aggregation analysis on materialised results
  • Dimensionality Reduction: scikit-learn PCA - zone-level clustering and pattern discovery
  • Visualisation: Matplotlib & Seaborn - full chart suite for temporal and spatial analysis
  • Dataset: NYC TLC Yellow & Green Taxi Trips - 57GB+ multi-year Parquet files

Results

  • 57GB+ processed on a single machine without memory errors
  • 3 distinct zone clusters discovered via PCA
  • 10M+ trip records analysed across multiple years
  • Proved single-machine distributed processing is viable for 57GB+ datasets with Dask
  • Dual-peak demand structure validated - actionable for surge pricing and driver allocation
  • PCA zone clusters align closely with known NYC borough geography without using coordinates
  • Pipeline is fully modular - new years of data are automatically picked up by the glob pattern

What I took away

  • Lazy evaluation is the key insight of Dask - building a computation graph before executing means you only pay for what you actually need.
  • Column pruning at read time (specifying columns=[...] in read_parquet) gave a 3–4x speedup over reading all columns and dropping later.
  • PCA on aggregated zone-level statistics is surprisingly powerful for spatial segmentation without needing actual geospatial tooling.
  • Schema drift between dataset years (column renames, new fields) requires careful validation before blindly unioning partitions.
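That last check can be as simple as diffing column sets before unioning. A minimal sketch, using empty pandas frames as stand-ins for two dataset years (names and the drifted column are illustrative):

```python
import pandas as pd

# Stand-ins for two dataset years with drifted schemas
y2019 = pd.DataFrame(columns=["tpep_pickup_datetime", "fare_amount"])
y2020 = pd.DataFrame(columns=["tpep_pickup_datetime", "fare_amount",
                              "congestion_surcharge"])  # field added later

frames = {"2019": y2019, "2020": y2020}

# Columns present in every year - the only ones safe to union blindly
common = set.intersection(*(set(f.columns) for f in frames.values()))

# Per-year columns outside the shared schema, to review before merging
drift = {year: set(f.columns) - common for year, f in frames.items()}
```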