GridGreen

ML training has hidden, variable climate costs

The same training job can emit twice as much CO₂ depending on when and where it runs, and model choice alone can determine whether it consumes 0.5 kWh or 15 kWh. Existing tools like CodeCarbon only measure emissions after execution - by then the decision is already made.

GridGreen flips this: it analyzes ML training code before execution and gives engineers the information they need to make carbon-aware decisions at development time.

Key insight: Carbon efficiency in ML is a decision problem, not just a measurement problem. GridGreen shifts awareness from after execution to before execution.

System Architecture

The user pastes training code into a Monaco editor. The FastAPI backend fans the script out to three analyses in parallel and returns results in under 20ms, as sketched below.
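
A minimal sketch of that fan-out, assuming three hypothetical analyzer coroutines (estimate_carbon, suggest_models, forecast_grid) standing in for GridGreen's real modules:

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    code: str

async def estimate_carbon(code: str) -> dict:   # placeholder analyzer
    return {"kwh": 0.5}

async def suggest_models(code: str) -> dict:    # placeholder analyzer
    return {"swap": "DistilGPT-2"}

async def forecast_grid() -> dict:              # placeholder analyzer
    return {"cleanest_start": "02:00"}

@app.post("/analyze")
async def analyze(req: AnalyzeRequest):
    # All three analyses run concurrently, so end-to-end latency is
    # bounded by the slowest analysis rather than the sum of the three.
    carbon, suggestions, grid = await asyncio.gather(
        estimate_carbon(req.code),
        suggest_models(req.code),
        forecast_grid(),
    )
    return {"carbon": carbon, "suggestions": suggestions, "grid": grid}
```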

Carbon Estimation Engine

Python AST parsing plus regex extracts model type, batch size, and training loops from the script, then applies Kaplan 2020, Patterson 2022, and Strubell 2019 scaling laws along the chain Code → FLOPs → Energy → CO₂. Every estimate ships with methodology citations and explicit limitations.
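
A minimal sketch of that chain. The constants (GPU throughput, utilization, power draw, PUE, grid intensity) are illustrative assumptions, not GridGreen's calibrated values:

```python
def training_flops(params: float, tokens: float) -> float:
    # Kaplan et al. 2020 approximation: ~6 FLOPs per parameter per token.
    return 6 * params * tokens

def energy_kwh(flops: float,
               gpu_flops_per_s: float = 312e12,  # A100 BF16 peak (assumed hardware)
               utilization: float = 0.3,          # model FLOPs utilization (assumed)
               gpu_power_kw: float = 0.4,         # GPU board power (assumed)
               pue: float = 1.1) -> float:        # datacenter PUE, Patterson 2022 style
    gpu_hours = flops / (gpu_flops_per_s * utilization) / 3600
    return gpu_hours * gpu_power_kw * pue

def co2_kg(kwh: float, grid_gco2_per_kwh: float = 400.0) -> float:
    # Strubell et al. 2019 style conversion via grid carbon intensity.
    return kwh * grid_gco2_per_kwh / 1000

# e.g. a GPT-2-small-scale run: 124M params, 1B tokens.
flops = training_flops(params=124e6, tokens=1e9)
print(f"{co2_kg(energy_kwh(flops)):.2f} kg CO2")
```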

RAG Model Suggestions

Sentence Transformers (MiniLM) retrieves over a curated dataset of 58 model-swap pairs, with a TF-IDF fallback for improved recall on the small corpus. Returns alternatives with compute reduction %, benchmark retention (MMLU), and citations. Example: GPT-2 Large → DistilGPT-2 (-77.6% compute, 94.2% MMLU).
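
A minimal sketch of the hybrid retrieval idea - dense MiniLM similarity with a sparse TF-IDF fallback when the dense score is low. The toy corpus, the confidence threshold, and the all-MiniLM-L6-v2 checkpoint are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [  # 58 curated model-swap pairs in practice; two shown here
    "GPT-2 Large -> DistilGPT-2: -77.6% compute, 94.2% MMLU retention",
    "BERT Large -> DistilBERT: lower compute, strong benchmark retention",
]

dense = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = dense.encode(corpus, convert_to_tensor=True)
tfidf = TfidfVectorizer().fit(corpus)
corpus_sparse = tfidf.transform(corpus)

def suggest(query: str, min_dense_score: float = 0.4) -> str:
    scores = util.cos_sim(dense.encode(query, convert_to_tensor=True), corpus_emb)[0]
    best = int(scores.argmax())
    if scores[best] >= min_dense_score:
        return corpus[best]
    # Fallback: sparse lexical match catches queries that land outside
    # the embedding space of the small curated corpus.
    sparse_scores = cosine_similarity(tfidf.transform([query]), corpus_sparse)[0]
    return corpus[int(sparse_scores.argmax())]

print(suggest("cheaper alternative to GPT-2 Large"))
```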

Grid-Aware Scheduling

Ingests US EIA hourly grid carbon-intensity data into SQLite. A Prophet model (with a Seasonal Naive fallback) forecasts the next 48 hours of intensity and picks the cleanest execution window within it. The same job can emit half the CO₂ just by running at the right time.
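
A minimal sketch of the scheduling step - fit Prophet on hourly history, then slide a job-length window over the 48-hour forecast. The CSV source and the 6-hour job duration are assumptions; column names follow Prophet's ds/y convention:

```python
import pandas as pd
from prophet import Prophet

# Hourly history: ds (timestamp), y (gCO2/kWh). Assumed export from the
# SQLite store; a Seasonal Naive fallback (same hour last week) covers
# gaps where Prophet cannot fit.
history = pd.read_csv("eia_hourly.csv")

model = Prophet(daily_seasonality=True, weekly_seasonality=True)
model.fit(history)

future = model.make_future_dataframe(periods=48, freq="h", include_history=False)
forecast = model.predict(future)[["ds", "yhat"]]

def cleanest_window(forecast: pd.DataFrame, job_hours: int) -> pd.Timestamp:
    # Mean forecast intensity over every contiguous job-length window;
    # the window with the lowest mean gives the cleanest start time.
    rolling = forecast["yhat"].rolling(job_hours).mean()
    end = int(rolling.idxmin())
    return forecast["ds"].iloc[end - job_hours + 1]

print("Start training at:", cleanest_window(forecast, job_hours=6))
```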

Session Scorecard

Tracks cumulative CO₂ saved and decisions made across a session - gamifying sustainable ML decisions and encouraging long-term behavior change.

MCP Agent Integration

Wraps all backend APIs as Model Context Protocol tools. GridGreen works inside Claude Desktop and Cursor with zero UI changes - call estimate_carbon() and suggest_alternatives() directly from any agent workflow.
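
A minimal sketch of such a wrapper using the MCP Python SDK's FastMCP helper; the backend URL, endpoint paths, and response shapes are assumptions:

```python
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gridgreen")
BACKEND = "http://localhost:8000"  # assumed FastAPI address

@mcp.tool()
def estimate_carbon(code: str) -> dict:
    """Estimate CO2 emissions of an ML training script before running it."""
    return httpx.post(f"{BACKEND}/estimate", json={"code": code}).json()

@mcp.tool()
def suggest_alternatives(model_name: str) -> dict:
    """Suggest lower-compute model swaps with benchmark retention."""
    return httpx.post(f"{BACKEND}/suggest", json={"model": model_name}).json()

if __name__ == "__main__":
    mcp.run()  # exposes both tools to Claude Desktop, Cursor, etc.
```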

Cloud & Infra Pipeline

Snowflake Cortex for vector storage, Databricks DLT for data ingestion, AWS SageMaker for batch processing, NVIDIA Brev for GPU workloads, Weights & Biases for experiment tracking, Google Gemini for NL reasoning.

Technologies Used

Frontend: Next.js 15 + Monaco Editor + Tailwind CSS + Recharts + Framer Motion
Backend: FastAPI - <20ms latency, all analysis endpoints
Code Analysis: Python AST + Regex - extract model type, training loops, batch size
Carbon Engine: Scaling laws (Kaplan 2020, Patterson 2022, Strubell 2019) - FLOPs → CO₂
RAG System: Sentence Transformers (MiniLM) + TF-IDF fallback + 58 model-swap pairs
Grid Forecasting: US EIA hourly data + SQLite + Prophet (+ Seasonal Naive fallback)
MCP Integration: Model Context Protocol - Claude Desktop, Cursor, agent pipelines
Cloud / Infra: Snowflake Cortex · Databricks DLT · AWS SageMaker · NVIDIA Brev · W&B · Gemini

Evaluation (12/12 workloads)

Success Rate: 100% (12/12 workloads)
Mean Analysis Latency: <20ms
Suggestion Coverage: 66.7%
CO₂ Reduction (LLMs): 54.9%
CO₂ Reduction (Vision/Audio): 57.1%
Avg Compute Reduction: 77.6%

Impact

100% success rate across 12 ML workload types
77.6% average compute reduction via model swaps
<20ms end-to-end analysis latency
  • Pre-run analysis shifts carbon decisions from measurement to prevention
  • RAG with TF-IDF fallback handles small dataset retrieval better than dense retrieval alone
  • MCP integration means GridGreen works in Claude Desktop and Cursor with no UI changes required
  • Grid scheduling finds windows where the same job emits half the CO₂

What made it hard

Challenge: Carbon estimate credibility

Early estimates lacked grounding. Added citations to published scaling laws and included explicit methodology limitations in every output.

Challenge: RAG on 58 pairs

Dense retrieval alone failed at this scale. TF-IDF fallback significantly improved recall for edge cases outside the embedding space.

Challenge: EIA API instability

Rate limits caused failures in grid data ingestion. Added mock mode and a diagnostics endpoint so the system degrades gracefully without breaking the core analysis flow.
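
A minimal sketch of that degradation path; the endpoint path, the mock curve, and the truncated EIA URL placeholder are illustrative assumptions:

```python
import httpx
from fastapi import FastAPI

app = FastAPI()
EIA_URL = "https://api.eia.gov/v2/..."  # placeholder; real series path omitted

def fetch_grid_data() -> tuple[list, str]:
    try:
        resp = httpx.get(EIA_URL, timeout=5)
        resp.raise_for_status()
        return resp.json()["response"]["data"], "live"
    except httpx.HTTPError:
        # Deterministic mock curve keeps the scheduler functional when
        # the API rate-limits or the network is down.
        mock = [{"hour": h, "gco2_per_kwh": 350 + (100 if 8 <= h % 24 <= 20 else 0)}
                for h in range(48)]
        return mock, "mock"

@app.get("/diagnostics")
def diagnostics():
    # Reports which data source is currently backing the grid forecast.
    _, source = fetch_grid_data()
    return {"grid_data_source": source}
```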

Challenge: Multi-system integration

MCP + FastAPI + Next.js had conflicting dev environments. Required careful process coordination and explicit port management across three separate servers.

What I took away

  • Pre-run analysis is fundamentally more useful than post-hoc measurement - you cannot un-run a training job.
  • Hybrid retrieval (dense + sparse TF-IDF) outperforms pure embedding search on small, curated datasets.
  • MCP is a powerful abstraction - wrapping a FastAPI service as MCP tools makes it instantly usable in any agent workflow without a single additional line of UI.
  • Prophet handles seasonality cleanly for grid forecasting, but always build a Seasonal Naive fallback - EIA data gaps will break pure ML forecasters.