Why pandas breaks at 57GB
The NYC TLC dataset is a classic "too big for a laptop" problem. A single year of trip data is ~10GB compressed. Across multiple years, you're looking at 57GB+ of Parquet files - far beyond what pandas can load into memory at once. Traditional approaches either crash, require expensive cloud clusters, or force you to downsample and lose signal.
The goal was to process the full dataset on a single machine, efficiently, without approximation - then layer on time-series analysis and PCA-based clustering to find patterns that matter.
Dataset: NYC TLC Yellow & Green Taxi Trips - multiple years, 57GB+ of Parquet files. Features include pickup/dropoff datetime, location IDs, passenger count, trip distance, fare amount, tip, payment type, and congestion surcharge.
How the pipeline works
The pipeline has three layers: ingest → transform → analyse. Dask handles the first two with lazy evaluation - nothing is computed until absolutely necessary. PyArrow provides the columnar Parquet I/O. The analysis layer materialises only the aggregated results, not the full dataset.
```python
import dask.dataframe as dd

# Lazy load - nothing reads from disk yet
ddf = dd.read_parquet(
    "data/yellow_tripdata_*.parquet",
    engine="pyarrow",
    columns=["tpep_pickup_datetime", "trip_distance", "fare_amount",
             "tip_amount", "PULocationID"],
)

# Feature engineering (still lazy)
ddf["hour"] = ddf["tpep_pickup_datetime"].dt.hour
ddf["dayofweek"] = ddf["tpep_pickup_datetime"].dt.dayofweek
ddf = ddf[(ddf.trip_distance > 0) & (ddf.fare_amount > 0)]

# Aggregate - this is when Dask actually reads + processes
hourly = ddf.groupby("hour")["trip_distance"].mean().compute()
```
Distributed Processing with Dask
Replaced pandas with Dask DataFrames for out-of-core, parallel computation. Dask partitions the dataset across files and processes each partition lazily, only materialising results when .compute() is called.
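The project's exact scheduler settings aren't shown here, but a local distributed cluster can be spun up along these lines - worker count and memory limits below are illustrative assumptions, not the pipeline's actual configuration:

```python
from dask.distributed import Client, LocalCluster

# Illustrative local-cluster setup (worker count and memory limit are assumed)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)
print(client.dashboard_link)  # live dashboard for watching the task graph execute
```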
Time-Series Analysis
Extracted temporal features (hour, day of week, month, season) and computed rolling averages for ride volume, fare trends, and tip rates - revealing rush-hour peaks, weekend shifts, and seasonal demand drops.
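A minimal sketch of the ride-volume piece, assuming the lazy `ddf` from above: aggregate to daily counts in Dask, then apply a rolling mean to the small materialised series in pandas. The 7-day window is illustrative.

```python
# Daily ride volume (computed in Dask) with a 7-day rolling mean (in pandas)
daily = (
    ddf.assign(pickup_date=ddf["tpep_pickup_datetime"].dt.floor("D"))
       .groupby("pickup_date")
       .size()
       .compute()
       .sort_index()
)
rolling_volume = daily.rolling(window=7, min_periods=1).mean()
```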
PCA on Zone Aggregations
Aggregated trip statistics by TLC zone (pickup rate, avg distance, avg fare, tip ratio) then applied PCA to reduce to 2D - revealing clusters of tourist zones, commuter zones, and airport corridors.
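A sketch of that step, assuming the column names from the read above; the exact feature set and scaling choices in the project may differ:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Per-zone aggregates: avg distance, avg fare, tip ratio, pickup volume
ddf["tip_ratio"] = ddf["tip_amount"] / ddf["fare_amount"]
zone_stats = (
    ddf.groupby("PULocationID")
       .agg({"trip_distance": "mean", "fare_amount": "mean", "tip_ratio": "mean"})
       .compute()
)
zone_stats["pickups"] = ddf.groupby("PULocationID").size().compute()

# Standardise, then project each zone onto 2 principal components
X = StandardScaler().fit_transform(zone_stats)
coords = PCA(n_components=2).fit_transform(X)  # one 2-D point per zone
```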
Data Quality Pipeline
Filtered impossible trips (zero distance, negative fares, zero passengers), capped fare outliers at the 99th percentile, and handled timezone alignment across multi-year data with different schema versions.
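The filters might look like the sketch below; it assumes passenger_count is included in the columns read, and note that Dask's quantile is computed approximately across partitions.

```python
# Drop impossible trips, then cap fares at the 99th percentile
valid = ddf[
    (ddf.trip_distance > 0)
    & (ddf.fare_amount > 0)
    & (ddf.passenger_count > 0)
]
fare_cap = valid["fare_amount"].quantile(0.99).compute()  # approximate across partitions
valid = valid[valid.fare_amount <= fare_cap]
```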
Spatial Insights
Joined TLC zone lookups to map location IDs to borough and neighbourhood names - enabling borough-level demand analysis without expensive geospatial joins on the full dataset.
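The lookup table is tiny, so it can stay in pandas and be broadcast-merged onto the Dask frame. Column names below follow the public taxi_zone_lookup.csv layout and are assumed here:

```python
import pandas as pd

# Small lookup table in pandas, merged onto the lazy Dask frame
zones = pd.read_csv("data/taxi_zone_lookup.csv")  # LocationID, Borough, Zone
ddf = ddf.merge(zones, left_on="PULocationID", right_on="LocationID", how="left")
borough_demand = ddf.groupby("Borough").size().compute()
```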
Visualisation Suite
Built a full set of Matplotlib/Seaborn charts: hourly demand curves, day-of-week heatmaps, fare distributions by borough, PCA scatterplots with zone labels, and tip-rate trends by payment type.
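As one example, the hourly demand curve only needs the small materialised aggregate; styling here is illustrative rather than the project's exact chart code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hourly demand curve from the materialised groupby result
hourly_counts = ddf.groupby("hour").size().compute().sort_index()
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(hourly_counts.index, hourly_counts.values, marker="o")
ax.set(xlabel="Pickup hour", ylabel="Trips", title="Hourly demand")
fig.tight_layout()
```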
What the data showed
Dual peak demand
Ridership peaks at 8–9am and 5–7pm on weekdays, with a completely different profile on weekends - late-night entertainment demand replacing the commuter spike.
3 zone clusters via PCA
PCA revealed distinct airport zones (long trips, high fares), Midtown/tourist zones (short trips, high tip rates), and outer-borough zones (longer trips, low tips).
Card payments = higher tips
Credit card payments average 18.2% tip vs 12.1% for cash - a consistent pattern across all boroughs and time windows in the dataset.
Winter dip, summer plateau
January–February shows ~15% demand drop vs summer months. December spikes around holidays but crashes immediately after New Year.
Technologies Used
| Category | Tool & Role |
|---|---|
| Distributed Computing | Dask - out-of-core parallel DataFrame processing; lazy evaluation graph |
| File I/O | PyArrow - columnar Parquet reading with column pruning for efficiency |
| Analysis | Pandas & NumPy - post-aggregation analysis on materialised results |
| Dimensionality Reduction | scikit-learn PCA - zone-level clustering and pattern discovery |
| Visualisation | Matplotlib & Seaborn - full chart suite for temporal and spatial analysis |
| Dataset | NYC TLC Yellow & Green Taxi Trips - 57GB+ multi-year Parquet files |
Results
- Proved single-machine distributed processing is viable for 57GB+ datasets with Dask
- Dual-peak demand structure validated - actionable for surge pricing and driver allocation
- PCA zone clusters align closely with known NYC borough geography without using coordinates
- Pipeline is fully modular - new years of data are automatically picked up by the glob pattern
What I took away
- Lazy evaluation is the key insight of Dask - building a computation graph before executing means you only pay for what you actually need.
- Column pruning at read time (specifying columns=[...] in read_parquet) gave a 3–4x speedup over reading all columns and dropping them later.
- PCA on aggregated zone-level statistics is surprisingly powerful for spatial segmentation without needing actual geospatial tooling.
- Schema drift between dataset years (column renames, new fields) requires careful validation before blindly unioning partitions - see the sketch below.
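One way to catch that drift early is to inspect each file's Parquet schema before reading; the expected column set and paths below are illustrative assumptions:

```python
from glob import glob
import pyarrow.parquet as pq

# Check each file's schema before unioning, so renames/missing columns surface early
expected = {"tpep_pickup_datetime", "trip_distance", "fare_amount",
            "tip_amount", "PULocationID"}
for path in sorted(glob("data/yellow_tripdata_*.parquet")):
    names = set(pq.read_schema(path).names)
    missing = expected - names
    if missing:
        print(f"{path}: missing {sorted(missing)}")
```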