ML training has hidden, variable climate costs
The same training job can emit 2× the CO₂ depending on when and where it runs; model choice alone can determine whether a job consumes 0.5 kWh or 15 kWh. Existing tools like CodeCarbon only measure emissions after execution - by then the decision is already made.
GridGreen flips this: it analyzes ML training code before execution and gives engineers the information they need to make carbon-aware decisions at development time.
Key insight: Carbon efficiency in ML is a decision problem, not just a measurement problem. GridGreen shifts awareness from after execution to before execution.
System Architecture
User pastes training code into a Monaco editor. FastAPI backend runs three analyses in parallel and returns results in under 20ms.
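A minimal sketch of that fan-out, assuming a single /analyze endpoint (the endpoint name and the stubbed engines are illustrative, not GridGreen's actual internals):

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    code: str

# Stubbed engines; the real ones are described in the sections below.
async def estimate_carbon(code: str) -> dict:
    return {"kwh": 0.0, "co2_kg": 0.0}

async def suggest_alternatives(code: str) -> list[dict]:
    return []

async def forecast_grid_window() -> dict:
    return {"start": None, "g_co2_per_kwh": None}

@app.post("/analyze")
async def analyze(req: AnalyzeRequest) -> dict:
    # The three engines are independent, so run them concurrently:
    # total latency is the slowest engine, not the sum of all three.
    carbon, swaps, window = await asyncio.gather(
        estimate_carbon(req.code),
        suggest_alternatives(req.code),
        forecast_grid_window(),
    )
    return {"carbon": carbon, "suggestions": swaps, "schedule": window}
```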
Carbon Estimation Engine
Python AST + regex parsing extracts model type, batch size, and training loops from the script. Scaling laws (Kaplan 2020, Patterson 2022, Strubell 2019) then map Code → FLOPs → Energy → CO₂. Every estimate ships with methodology citations and explicit limitations.
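A hedged sketch of the FLOPs → Energy → CO₂ chain (the AST/regex extraction step is omitted; the throughput, power, PUE, and grid-intensity constants are illustrative placeholders, not GridGreen's calibrated values):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # Kaplan et al. 2020 rule of thumb: compute ≈ 6 · N · D
    return 6.0 * n_params * n_tokens

def energy_kwh(flops: float, flops_per_sec: float = 3.12e14,
               gpu_power_w: float = 400.0, pue: float = 1.1) -> float:
    # Wall-clock seconds at sustained throughput, then joules → kWh,
    # scaled by datacenter PUE (Patterson et al. 2022).
    seconds = flops / flops_per_sec
    return seconds * gpu_power_w / 3.6e6 * pue

def co2_kg(kwh: float, grid_g_per_kwh: float = 400.0) -> float:
    # Grid carbon intensity turns energy into emissions (Strubell et al. 2019).
    return kwh * grid_g_per_kwh / 1000.0

# Example: a 124M-parameter model trained on 1B tokens.
flops = training_flops(124e6, 1e9)
print(f"{co2_kg(energy_kwh(flops)):.3f} kg CO2")
```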
RAG Model Suggestions
Sentence Transformers (MiniLM) over a curated dataset of 58 model-swap pairs, with a TF-IDF fallback to improve recall on the small corpus. Returns alternatives with compute reduction (%), benchmark retention (MMLU), and citations. Example: GPT-2 Large → DistilGPT-2 (-77.6% compute, 94.2% MMLU retention).
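Roughly how the dense-first, sparse-fallback retrieval could look (the similarity threshold and corpus handling are assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "GPT-2 Large -> DistilGPT-2: -77.6% compute, 94.2% MMLU retention",
    # ... the remaining curated swap pairs
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(docs, normalize_embeddings=True)
tfidf = TfidfVectorizer().fit(docs)
doc_tfidf = tfidf.transform(docs)

def retrieve(query: str, min_dense_score: float = 0.35) -> str:
    q_emb = encoder.encode([query], normalize_embeddings=True)
    dense = (doc_emb @ q_emb.T).ravel()  # cosine via normalized dot product
    if dense.max() >= min_dense_score:
        return docs[int(dense.argmax())]
    # Fallback: lexical overlap catches queries the embeddings place poorly.
    sparse = cosine_similarity(tfidf.transform([query]), doc_tfidf).ravel()
    return docs[int(sparse.argmax())]

print(retrieve("smaller alternative to GPT-2 Large"))
```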
Grid-Aware Scheduling
Ingests US EIA hourly grid carbon-intensity data into SQLite. A Prophet model (with a Seasonal Naive fallback) forecasts intensity over the next 48 hours and surfaces the cleanest execution window. Same job, half the CO₂, just by timing it right.
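A sketch of the window search using Prophet's standard ds/y interface on a toy intensity series (the 4-hour job length and the series itself are illustrative):

```python
import pandas as pd
from prophet import Prophet

# Toy hourly intensity series with a dirtier evening peak (gCO2/kWh).
history = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=24 * 14, freq="h"),
    "y": [400 + 80 * (17 <= h % 24 < 22) for h in range(24 * 14)],
})

model = Prophet(daily_seasonality=True)
model.fit(history)
future = model.make_future_dataframe(periods=48, freq="h")
forecast = model.predict(future).tail(48)

# Cleanest contiguous window for a job of a given length (4 hours here):
end = forecast["yhat"].rolling(4).mean().idxmin()
start = forecast.loc[end, "ds"] - pd.Timedelta(hours=3)
print(f"Lowest-carbon 4-hour window starts at {start}")
```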
Session Scorecard
Tracks cumulative CO₂ saved and decisions made across a session - gamifying sustainable ML decisions and encouraging long-term behavior change.
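A tiny sketch of the scorecard state (field names are assumptions, not GridGreen's schema):

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    co2_saved_kg: float = 0.0
    decisions: int = 0

    def record(self, baseline_kg: float, chosen_kg: float) -> None:
        # A "save" is the gap between the original plan and the greener choice.
        self.co2_saved_kg += max(baseline_kg - chosen_kg, 0.0)
        self.decisions += 1
```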
MCP Agent Integration
Wraps all backend APIs as Model Context Protocol tools. GridGreen works inside Claude Desktop and Cursor with zero UI changes - call estimate_carbon() and suggest_alternatives() directly from any agent workflow.
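Conceptually, the wrapping can be as thin as the official MCP Python SDK's FastMCP over HTTP calls to the backend (the routes below are assumptions):

```python
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gridgreen")
BACKEND = "http://localhost:8000"  # the FastAPI analysis server

@mcp.tool()
def estimate_carbon(code: str) -> dict:
    """Estimate the energy and CO2 footprint of an ML training script."""
    return httpx.post(f"{BACKEND}/estimate", json={"code": code}).json()

@mcp.tool()
def suggest_alternatives(code: str) -> dict:
    """Suggest lower-carbon model swaps with benchmark retention."""
    return httpx.post(f"{BACKEND}/suggest", json={"code": code}).json()

if __name__ == "__main__":
    mcp.run()  # stdio transport; register this script in Claude Desktop/Cursor
```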
Cloud & Infra Pipeline
Snowflake Cortex for vector storage, Databricks DLT for data ingestion, AWS SageMaker for batch processing, NVIDIA Brev for GPU workloads, Weights & Biases for experiment tracking, and Google Gemini for natural-language reasoning.
Technologies Used
| Component | Tool & Purpose |
|---|---|
| Frontend | Next.js 15 + Monaco Editor + Tailwind CSS + Recharts + Framer Motion |
| Backend | FastAPI - <20ms latency, all analysis endpoints |
| Code Analysis | Python AST + Regex - extract model type, training loops, batch size |
| Carbon Engine | Scaling laws (Kaplan 2020, Patterson 2022, Strubell 2019) - FLOPs → CO₂ |
| RAG System | Sentence Transformers (MiniLM) + TF-IDF fallback + 58 model-swap pairs |
| Grid Forecasting | US EIA hourly data + SQLite + Prophet (+ Seasonal Naive fallback) |
| MCP Integration | Model Context Protocol - Claude Desktop, Cursor, agent pipelines |
| Cloud / Infra | Snowflake Cortex · Databricks DLT · AWS SageMaker · NVIDIA Brev · W&B · Gemini |
Evaluation
| Metric | Value |
|---|---|
| Success Rate | 100% (12/12 workloads) |
| Mean Analysis Latency | <20ms |
| Suggestion Coverage | 66.7% |
| CO₂ Reduction (LLMs) | 54.9% |
| CO₂ Reduction (Vision/Audio) | 57.1% |
| Avg Compute Reduction | 77.6% |
Impact
- Pre-run analysis shifts carbon decisions from measurement to prevention
- RAG with a TF-IDF fallback handles small-corpus retrieval better than dense retrieval alone
- MCP integration means GridGreen works in Claude Desktop and Cursor with no UI changes required
- Grid scheduling finds windows where the same job emits half the CO₂
What made it hard
Carbon estimate credibility
Early estimates lacked grounding. Added citations to published scaling laws and included explicit methodology limitations in every output.
RAG on 58 pairs
Dense retrieval alone failed at this scale. A TF-IDF fallback significantly improved recall for edge-case queries the embedding space covered poorly.
EIA API instability
Rate limits caused failures in grid data ingestion. Added mock mode and a diagnostics endpoint so the system degrades gracefully without breaking the core analysis flow.
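The degradation pattern, sketched (the EIA route, response parsing, and mock values are illustrative):

```python
import httpx

EIA_URL = "https://api.eia.gov/v2/electricity/rto/fuel-type-data/data/"
MOCK_INTENSITY = [420.0] * 48  # flat gCO2/kWh placeholder series

def parse_intensity(payload: dict) -> list[float]:
    # Stub: real parsing depends on the EIA v2 response schema.
    return [float(row["value"]) for row in payload["response"]["data"]]

def fetch_grid_data(api_key: str) -> tuple[list[float], str]:
    try:
        resp = httpx.get(EIA_URL, params={"api_key": api_key}, timeout=10)
        resp.raise_for_status()
        return parse_intensity(resp.json()), "live"
    except httpx.HTTPError:
        # Rate limit or outage: serve mock data and report the mode so a
        # diagnostics endpoint can expose the degraded status.
        return MOCK_INTENSITY, "mock"
```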
Multi-system integration
MCP + FastAPI + Next.js had conflicting dev environments. Required careful process coordination and explicit port management across three separate servers.
What I took away
- Pre-run analysis is fundamentally more useful than post-hoc measurement - you cannot un-run a training job.
- Hybrid retrieval (dense + sparse TF-IDF) outperforms pure embedding search on small, curated datasets.
- MCP is a powerful abstraction - wrapping a FastAPI service as MCP tools makes it instantly usable in any agent workflow without a single additional line of UI code.
- Prophet handles seasonality cleanly for grid forecasting, but always build a Seasonal Naive fallback - EIA data gaps will break pure ML forecasters.