01 Clarify the Problem & Scope (5-7 min)
"We're designing a weather application like The Weather Channel, Dark Sky, or Apple Weather. There are three distinct subsystems: (1) a data ingestion pipeline that collects observations from stations, satellites, and radar, (2) a forecast computation layer that turns raw observations into predictions, and (3) a serving layer that delivers weather data to hundreds of millions of users with extreme read-heaviness and high cacheability."
Questions I'd Ask
  • What data are we serving? Current conditions, hourly forecast, 10-day forecast, radar maps, air quality? → All of these. Current conditions + hourly + daily forecasts are the core. Radar tiles and AQI are secondary.
  • Do we run our own forecast models (NWP)? Or consume forecasts from government agencies (NWS, ECMWF)? → We consume raw observations + model output from external sources (like NWS GFS/HRRR, ECMWF). We run a post-processing layer (bias correction, downscaling, blending) but not the physics simulation itself.
  • Granularity? City-level, zipcode-level, or point-level (lat/lng)? → Point-level at ~1km grid resolution. User sends lat/lng, we return the nearest grid point's data. This gives millions of unique "locations."
  • Severe weather alerts? → Yes: push notifications for tornadoes, flash floods, hurricane warnings. These are life-safety critical and must be delivered within seconds.
  • Scale? → ~500M daily active users globally. Extreme read-heaviness. Weather data is highly cacheable: it changes every 15-60 minutes, not every second.
  • Monetization? → Free tier with ads, premium tier with minute-level precipitation (hyper-local nowcasting), extended forecasts, no ads.
Agreed Scope
In Scope | Out of Scope
Current conditions (temp, humidity, wind, UV) | Running NWP physics models (GFS, ECMWF)
Hourly forecast (48h) + daily forecast (14d) | Historical weather data warehouse
Radar / precipitation map tiles | Agricultural / aviation-specific products
Severe weather alerts (push notifications) | IoT device ingestion (personal weather stations)
Location search (city, zipcode, lat/lng) | Social features (user-reported conditions)
Air quality index (AQI) | Climate modeling / long-range seasonal forecasts
Core Use Cases
  • UC1: "What's the weather right now?" User opens app → current conditions for their location in <200ms. Temp, feels-like, humidity, wind, UV index, condition icon.
  • UC2: "Will it rain this afternoon?" Hourly forecast for next 48 hours. Precipitation probability per hour. Minute-level rain forecast for next 2 hours (premium).
  • UC3: "10-day outlook." Daily high/low, condition summary, precipitation chance. Updated every 6 hours.
  • UC4: "Show me the radar." Animated radar map tiles. Past 2 hours + 1 hour forecast. Pan/zoom map interaction.
  • UC5: "Tornado warning!" Severe weather alert pushed to all affected users within 60 seconds of NWS issuance. Life-safety: zero message loss.
Non-Functional Requirements
  • Extreme read-heaviness (100,000:1 read:write): Weather data is produced by a handful of ingestion pipelines (writes) and consumed by 500M users (reads). This is about as read-heavy as systems get. Caching is not an optimization; it's the architecture.
  • High cacheability: Current conditions update every 15 minutes. Hourly forecasts update every 1-6 hours. 10-day forecasts update every 6 hours. A response cached for 10 minutes is still fresh for nearly all requests.
  • API response <200ms p95: Users expect instant weather. CDN cache hit should be <50ms. Cache miss (origin fetch) should be <200ms.
  • Alert delivery <60 seconds: From NWS issuing a tornado warning to user seeing the push notification. This is the only latency-critical write path.
  • Availability > consistency: Showing 15-minute-stale weather data is fine. Showing nothing (503 error) during a hurricane, when people need weather most, is catastrophic.
  • Global reach: Users everywhere. CDN-first architecture. Data sources are mostly US/EU-centric (NWS, ECMWF), but serving is global.
The key architectural insight: weather data is one of the most cacheable datasets on the internet. The same forecast serves every user in a ~1km grid cell. A city like NYC has ~800 grid cells, each updating every 15-60 minutes. That's a total of ~800 unique responses to cache, and they serve millions of users. The entire architecture is a caching strategy with a data pipeline behind it.
02 Back-of-the-Envelope Estimation (3-5 min)
"Let me separate the ingestion pipeline (low-volume, high-value data) from the serving layer (high-volume, highly cacheable)."
Daily Active Users
~500M
Global. ~3 app opens/day avg. Morning check, midday, evening.
API Requests / Day
~5B
500M × 3 opens × 3-4 API calls per open (current, hourly, daily, radar)
Peak QPS
~200K
5B / 86,400 ≈ 58K avg. Peak = 3-4x avg, reached when the morning rush coincides with severe weather.
Unique Grid Points
~10M
Populated land at 1km resolution. Earth's total land surface is ~150M km², but only populated land matters.
CDN Cache Hit Rate
>95%
10M grid points × 15min TTL. Popular cities: 99%+ hit rate. Rural: lower but still high.
Forecast Data Size
~5 KB / location
Current + 48h hourly + 14d daily ≈ 5KB JSON. 10M locations = 50GB total.
Key insight #1: With a 95%+ CDN cache hit rate, only ~5% of 200K peak QPS actually hits origin servers: ~10K QPS to origin. That's a modest backend. The architecture IS the CDN; origin servers are just the cache-miss path.
Key insight #2: The entire forecast dataset (10M locations × 5KB) is ~50GB. This fits in memory. A single Redis cluster could hold every forecast for every location on earth. The "database" is essentially a pre-computed, in-memory grid that gets refreshed every 15-60 minutes by the ingestion pipeline.
Key insight #3: Severe weather spikes correlate with the highest user demand. A hurricane approaching makes millions of users check weather simultaneously, exactly when availability matters most. CDN absorbs the read spike. But the alert push path must scale independently (not compete with read traffic for resources).
03 High-Level Design (8-12 min)
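The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the constant names are illustrative, and the midpoints (3.5 calls per open, 3.5x peak factor) are assumptions chosen from the ranges quoted in this section.

```python
# Back-of-the-envelope numbers from this section; every input is an estimate.
DAU = 500_000_000            # daily active users
OPENS_PER_DAY = 3            # average app opens per user
CALLS_PER_OPEN = 3.5         # assumed midpoint of "3-4 API calls per open"
PEAK_FACTOR = 3.5            # assumed midpoint of "3-4x avg"
CDN_HIT_RATE = 0.95          # lower bound of the expected hit rate
GRID_POINTS = 10_000_000     # populated 1km grid cells
BYTES_PER_LOCATION = 5_000   # ~5 KB of JSON per grid cell

requests_per_day = DAU * OPENS_PER_DAY * CALLS_PER_OPEN
avg_qps = requests_per_day / 86_400
peak_qps = avg_qps * PEAK_FACTOR
origin_qps = peak_qps * (1 - CDN_HIT_RATE)   # only cache misses reach origin
dataset_gb = GRID_POINTS * BYTES_PER_LOCATION / 1e9

print(f"requests/day: {requests_per_day:,.0f}")
print(f"avg QPS:      {avg_qps:,.0f}")
print(f"peak QPS:     {peak_qps:,.0f}")
print(f"origin QPS:   {origin_qps:,.0f}")
print(f"dataset size: {dataset_gb:.0f} GB")
```

Changing the hit rate to 0.99 shows how sensitive origin sizing is to cache effectiveness: the origin load drops by a factor of five.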
"The architecture has three planes: an ingestion pipeline (collect + process weather data), a precomputation layer (turn raw data into API-ready responses), and a serving layer (CDN + origin that delivers to users). Plus a separate alert path for severe weather push notifications."
Key Architecture Decisions
"Here's WHY I chose each technology, mapping requirements to tradeoffs."
Requirement | Decision | Why (and what was rejected) | Consistency
200K QPS peak, 95%+ identical responses | CDN-first (CloudFront/Fastly) | Weather responses are identical for every user in the same grid cell. CDN serves from edge, <50ms. Without CDN, origin would need 200K QPS capacity, 20x overprovisioned for normal traffic. | AP
Forecast data fits in memory (~50GB) | Redis cluster for origin data | Entire forecast grid in-memory. O(1) lookup by grid cell ID. PostgreSQL would add 5-10ms disk I/O on cache miss, which is unnecessary when data fits in RAM. | AP
Ingest data from 10+ external sources | Kafka pipeline (not direct polling) | Each source has different format, frequency, reliability. Kafka decouples ingestion from processing. If ECMWF is late, NWS data still flows. Direct polling creates tight coupling. | N/A
Pre-render 10M location responses every 15min | Batch precomputation (Spark/Flink) | Compute all forecasts in bulk, write to Redis. On-demand computation at 200K QPS would overload the forecast models. Precompute = O(1) serving latency regardless of query complexity. | N/A
Tornado alert to 10M users in <60s | Geo-targeted push via FCM/APNs | Push, not pull. Users can't poll for alerts; they need immediate notification. Geo-fenced: only affected users. Fan-out via mobile push platforms (FCM handles delivery at scale). | CP
Radar tiles: interactive map with pan/zoom | Pre-rendered tile pyramid (z/x/y) | Standard slippy map tiles at zoom levels 2-12. Pre-rendered PNGs stored in S3, served via CDN. Dynamic rendering on each request would be 100x more expensive. | AP
Major Components
High-Level Architecture (diagram summary)
  • External data sources: NWS / GFS (US forecasts, 6h cycle); ECMWF / ERA5 (global models, 12h); NEXRAD Radar (160 US radar sites); METAR / SYNOP (airport stations, 1h); Satellites (GOES-16/18, Himawari)
  • Ingestion & processing: Data Collectors (pull / subscribe feeds); Kafka Pipeline (dedup, normalize, route); QC / Validation (reject outliers, fill gaps); Observation Store (TimescaleDB)
  • Precomputation: Forecast Blender (MOS + downscale + blend); Radar Tiler (mosaic → PNG z/x/y tiles); Response Builder (JSON per grid cell → Redis); Tile Store (S3 → CDN)
  • Serving layer: CDN Edge (95%+ cache hit); API Gateway (rate limit, auth, route); Weather API (grid lookup + format); Redis Cluster (forecast grid, 50GB)
  • Clients (500M DAU): iOS / Android (native apps); Web / Widget (browser, embeds); Developer API (B2B partners); Smart Displays (Echo Show, Nest Hub)
  • Alert path (life-safety): NWS CAP Feed (alert XML stream); Alert Processor (parse + geo-fence); Geo-Index (users in alert polygon); FCM / APNs (push to 10M+ devices)

Data Collectors INGEST

  • Poll NWS/ECMWF APIs on schedule (GRIB2, BUFR)
  • Subscribe to METAR/SYNOP feeds (decoded text)
  • Download NEXRAD Level-II radar data (binary)
  • Normalize all sources to internal schema

Forecast Blender COMPUTE

  • Model Output Statistics (MOS): bias-correct NWP output
  • Downscale: 13km GFS grid → 1km local grid
  • Blend: weighted average of GFS, ECMWF, HRRR, NAM
  • Output: best-estimate forecast per grid cell

Response Builder PRECOMPUTE

  • Renders API-ready JSON per grid cell (current + forecast)
  • Writes to Redis with grid_cell_id as key
  • Runs every 15 min for current, every 6h for extended
  • ~10M writes per cycle (parallelized across workers)

CDN + Weather API SERVE

  • CDN edge caches responses by grid cell ID (15min TTL)
  • Cache miss → API server → Redis lookup → respond
  • Lat/lng → grid cell mapping: floor(lat × 100), floor(lng × 100)
  • Cache key: /weather/{grid_id}?units=metric

Radar Tiler PRECOMPUTE

  • Mosaics 160 NEXRAD sites into single composite
  • Renders PNG tiles at zoom levels 2-12 (slippy map)
  • Stores to S3 → CDN. ~500K tiles per radar frame
  • New frame every 5 minutes, ~12 frames for animation
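The z/x/y addressing used by the radar tiles follows the standard Web Mercator "slippy map" convention. As a minimal sketch (the function name `latlng_to_tile` is illustrative, not part of the design above), a client or tiler can find the tile containing a point like this:

```python
import math

def latlng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) slippy-map tile containing a point at a zoom level.

    Uses the standard Web Mercator tiling formula. Latitudes beyond roughly
    +/-85.05 degrees fall outside the projection, so results are clamped.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or polar edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

# Tile containing the equator/prime-meridian intersection at zoom level 1.
print(latlng_to_tile(0.0, 0.0, 1))
```

The resulting pair slots directly into a tile path of the form {z}/{x}/{y}.png, which is what makes the tiles cacheable by URL alone.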

Alert Processor ALERTS

  • Subscribes to NWS CAP (Common Alerting Protocol) feed
  • Parses alert polygon (lat/lng vertices)
  • Queries geo-index: "which users are inside this polygon?"
  • Fans out push notifications via FCM/APNs
Flow 1: User Opens App โ€” Read Path
Weather API Request Flow (diagram summary)
  • Mobile app: GET /weather?lat&lng → CDN Edge
  • 95% HIT: return cached JSON (TTL 15 min, latency <50ms)
  • 5% MISS: API Gateway → Weather API
  • Weather API maps lat/lng → grid cell: floor(40.71 × 100) = 4071
  • Redis lookup: GET grid:4071:-7401
  • Return pre-computed JSON → CDN caches it (TTL: 15m) → client receives it in <200ms
Flow 2: Forecast Update Cycle โ€” Write Path
Forecast Data Pipeline, Every 6 Hours (diagram summary)
  • Trigger: every 6 hours (GFS cycle: 00Z, 06Z, 12Z, 18Z)
  • NWS GFS: GRIB2, ~25GB per cycle
  • Data Collector: download + decode GRIB2
  • Kafka: raw.observations.gfs
  • Forecast Blender (Spark job): bias-correct GFS output using recent observations; downscale 13km grid → 1km grid (interpolation + terrain adjustment); blend with ECMWF, HRRR, NAM (weighted by recent accuracy)
  • Response Builder (parallel workers): for each of 10M grid cells, render JSON response → Redis: SET grid:{lat}:{lng} EX 3600
  • CDN invalidation (optional): most updates let the TTL expire (15 min); critical changes trigger an explicit purge for affected regions
04 Deep Dives (25-30 min)
Deep Dive 1: Data Ingestion & Forecast Pipeline (~10 min)
The weather data supply chain. Multiple external sources with different formats, frequencies, and reliability levels feed into a single unified pipeline that produces a "best estimate" forecast per grid cell.
Source | Format | Update Frequency | Coverage | Resolution
GFS (NWS) | GRIB2 | Every 6 hours | Global | ~13km native (published on a 0.25° grid)
HRRR (NWS) | GRIB2 | Every hour | CONUS only | ~3km
ECMWF IFS | GRIB2 | Every 12 hours | Global | ~9km
METAR | Text (decoded) | Every hour | Airports worldwide | Point observations
NEXRAD Radar | Binary Level-II | Every 5 minutes | 160 US sites | ~250m
GOES-16/18 | NetCDF | Every 5-15 minutes | Western Hemisphere | ~2km
Why blend multiple models instead of using just one? No single model is best everywhere. GFS has global coverage but only ~13km resolution, so it misses local effects (sea breezes, mountain valleys). HRRR is 3km but only covers the US and only forecasts 18 hours ahead. ECMWF is often the most accurate globally but updates only every 12 hours. Blending: use HRRR for 0-18h over the US (highest resolution), GFS/ECMWF for 18h-14 days (longer range), and bias-correct all of them against actual station observations. The weights adapt: if ECMWF has been more accurate than GFS in a region recently, it gets higher weight. The bias-correction step is known as Model Output Statistics (MOS); combined with adaptive blending, it is the core of the forecast quality differentiation.
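A minimal sketch of accuracy-weighted blending follows. The text only says weights track recent accuracy; weighting by inverse mean absolute error (MAE) is one simple scheme assumed here for illustration, and the function names and sample numbers are hypothetical.

```python
def blend_weights(recent_mae: dict[str, float], eps: float = 1e-6) -> dict[str, float]:
    """Turn each model's recent error into a normalized weight.

    Assumption: weight is proportional to 1 / MAE, so a model with half the
    error gets twice the weight. eps guards against division by zero.
    """
    inverse = {model: 1.0 / max(mae, eps) for model, mae in recent_mae.items()}
    total = sum(inverse.values())
    return {model: value / total for model, value in inverse.items()}

def blend(forecasts: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-model forecasts for one grid cell and hour."""
    return sum(weights[model] * value for model, value in forecasts.items())

# Hypothetical numbers: ECMWF has been twice as accurate as GFS in this region.
weights = blend_weights({"gfs": 2.0, "ecmwf": 1.0})
temp_c = blend({"gfs": 24.0, "ecmwf": 21.0}, weights)
print(weights, round(temp_c, 2))
```

In practice the weights would be maintained per region and per forecast variable, since a model that is strong on temperature is not necessarily strong on precipitation.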
Why precompute forecasts instead of computing on-demand? The forecast blending computation (interpolation, bias correction, downscaling) is CPU-intensive: ~10ms per grid cell. At 200K QPS with a pessimistic 10% cache miss rate, that's 20K computations/sec on the request path. Precomputing all 10M grid cells every 15 minutes costs ~100K compute-seconds (parallelized to minutes). The result is O(1) serving: a single Redis GET per request. The tradeoff: 15-minute staleness. But weather changes slowly enough that this is imperceptible to users.
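The cost comparison can be worked through numerically. This sketch uses the figures from the paragraph above; the worker count of 1,000 cores is an assumption picked only to illustrate wall-clock time.

```python
CELLS = 10_000_000          # grid cells to precompute
CPU_SEC_PER_CELL = 0.010    # ~10ms of blending work per cell
PEAK_QPS = 200_000
MISS_RATE = 0.10            # pessimistic miss rate used in the text
WORKER_CORES = 1_000        # assumed batch cluster size (illustrative)

# Batch path: fixed cost per cycle, independent of user traffic.
precompute_cpu_seconds = CELLS * CPU_SEC_PER_CELL
precompute_wall_seconds = precompute_cpu_seconds / WORKER_CORES

# On-demand path: cost scales with traffic and lands on the request path.
on_demand_computations_per_sec = PEAK_QPS * MISS_RATE
on_demand_busy_cores = on_demand_computations_per_sec * CPU_SEC_PER_CELL

print(f"precompute: {precompute_cpu_seconds:,.0f} CPU-seconds per cycle")
print(f"precompute wall time on {WORKER_CORES} cores: {precompute_wall_seconds:.0f}s")
print(f"on-demand at peak: {on_demand_computations_per_sec:,.0f} computations/s")
print(f"on-demand at peak: ~{on_demand_busy_cores:.0f} cores busy")
```

The raw CPU totals are in the same ballpark; the argument for precomputation rests on moving that work off the request path and making it independent of traffic spikes.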
Deep Dive 2: API Serving & Caching Strategy (~8 min)
The serving layer is a three-tier cache. CDN edge (95%+ of requests), Redis cluster (cache miss + precomputed data), and the forecast pipeline (generates new data). The goal: most requests never reach your origin servers.
Tier | Cache TTL | Hit Rate | Latency | What It Serves
CDN Edge | 15 min (current), 1h (forecast), 5min (radar) | 95%+ | <50ms | All users in the same grid cell get the same cached response.
Redis Cluster | Until next precompute cycle | 99.9% | <5ms | Pre-computed JSON per grid cell. Full forecast grid in memory.
Forecast Pipeline | N/A (writes to Redis) | N/A | Minutes | Batch computes all grid cells. Not on the serving path.
Cache key design is critical. The cache key must be deterministic from the request. A lat/lng like 40.7128, -74.0060 maps to grid cell grid:4071:-7401 (floor at 0.01° resolution ≈ 1km). Every user at Times Square hits the exact same cache key. This is why weather is so cacheable: the data is inherently spatial, and users cluster in the same cells. NYC has ~800 grid cells. Even if each is cached separately, that's 800 entries serving millions of users. Cache key: /v1/weather/{grid_id}?units={metric|imperial}&lang={en}. Units and language are part of the key because they change the response body.
Why not compute on demand and cache the result? This is the "cache-aside" pattern: compute on first request, cache for subsequent ones. The problem: weather events cause correlated cache misses. When a hurricane approaches, millions of users check weather simultaneously for the same region. If the CDN TTL just expired, all those requests hit origin at once (thundering herd). Precomputation avoids this entirely: the data is already in Redis before anyone asks for it. The CDN TTL expiring just means the next request fetches from Redis (sub-5ms), not from a computation pipeline.
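The mapping from coordinates to keys can be sketched in a few lines. The Redis key format comes from the text; encoding the grid ID in the URL as "{lat100}:{lng100}" and the function names are assumptions made for illustration.

```python
import math

def grid_cell(lat: float, lng: float) -> tuple[int, int]:
    """Snap a coordinate to its 0.01-degree cell using floor, as in the text."""
    return math.floor(lat * 100), math.floor(lng * 100)

def redis_key(lat: float, lng: float) -> str:
    grid_lat, grid_lng = grid_cell(lat, lng)
    return f"grid:{grid_lat}:{grid_lng}"

def cdn_cache_key(lat: float, lng: float, units: str = "metric", lang: str = "en") -> str:
    # Assumption: grid_id is rendered as "{lat100}:{lng100}" in the URL path.
    grid_lat, grid_lng = grid_cell(lat, lng)
    return f"/v1/weather/{grid_lat}:{grid_lng}?units={units}&lang={lang}"

# Two nearby points in lower Manhattan resolve to the same key.
print(redis_key(40.7128, -74.0060))
print(redis_key(40.7199, -74.0001))
print(cdn_cache_key(40.7128, -74.0060))
```

Note that floor rounds toward negative infinity, so -7400.6 becomes -7401 rather than -7400; using truncation instead would merge cells on either side of the equator and prime meridian.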
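To make the TTL-expiry behavior concrete, here is a toy model of an edge cache that serves a stale entry while refreshing it. It is a simplified sketch, not how any particular CDN is implemented: the class name, the explicit `now` argument, and the synchronous refresh are all assumptions made so the example is deterministic. The origin fetch stands in for the Redis GET.

```python
from typing import Callable

class EdgeCacheModel:
    """Toy edge cache with a TTL and a stale-while-revalidate window."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl: float, swr: float):
        self.fetch_origin = fetch_origin
        self.ttl = ttl            # seconds an entry counts as fresh
        self.swr = swr            # extra seconds a stale entry may be served
        self.entries: dict[str, tuple[str, float]] = {}
        self.origin_calls = 0

    def _refresh(self, key: str, now: float) -> str:
        self.origin_calls += 1
        value = self.fetch_origin(key)
        self.entries[key] = (value, now)
        return value

    def get(self, key: str, now: float) -> tuple[str, str]:
        entry = self.entries.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.ttl:
                return value, "fresh"
            if age <= self.ttl + self.swr:
                # Serve the stale copy; a real CDN refreshes in the background.
                self._refresh(key, now)
                return value, "stale"
        return self._refresh(key, now), "miss"

cache = EdgeCacheModel(lambda key: f"forecast for {key}", ttl=900, swr=300)
for t in (0, 100, 1000, 1001):
    print(t, cache.get("grid:4071:-7401", now=t)[1])
print("origin calls:", cache.origin_calls)
```

Four requests produce two origin calls: the initial miss and one refresh after expiry. Every request after the first is answered from the edge.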
Deep Dive 3: Severe Weather Alerts (~7 min)
Alerts are the one latency-critical path. A tornado warning must reach affected users in under 60 seconds. This is a completely separate system from the read-heavy forecast serving path: it's a geo-targeted push fan-out.
Severe Weather Alert Pipeline (diagram summary)
  • NWS issues a Tornado Warning for Tarrant County, TX. CAP XML with alert polygon: 5 lat/lng vertices · Severity: EXTREME · Urgency: IMMEDIATE
  • Alert Processor (subscribes to NWS CAP feed): parse polygon [(32.75,-97.33), (32.75,-97.15), ...] · Severity: EXTREME
  • Geo-Index Query: "Which registered users have a saved location inside this polygon?" PostGIS: ST_Contains(polygon, user_loc), OR pre-computed geohash prefix matching. Result: 2.1M users in affected area
  • Push Fan-out (parallel workers): iOS → APNs bulk (100K/batch); Android → FCM topics; Web → Web Push API
  • NWS issuance → user notification: ~30 seconds (~5s parse + ~5s geo-query + ~20s push)
Why geo-index users ahead of time instead of computing at alert time? Computing "which of 500M users are inside this polygon?" at alert time is too slow. Instead, users register their locations at app install and when they add saved cities. These locations are indexed by geohash in PostgreSQL (PostGIS) or a dedicated geo-index (Redis Geospatial). At alert time, the query is: "which geohash prefixes overlap this polygon?" That is a fast index scan, not a full table scan. The tradeoff: users who travel (their current location differs from saved location) might miss a hyper-local alert. Mitigation: also use the user's last-known GPS location (updated when the app is open).
Why push instead of pull for alerts? Users don't constantly poll for alerts. The app might be backgrounded. A tornado gives 10-15 minutes of warning, so the user can't be expected to open the app at just the right time. Push notifications wake the device and display immediately. FCM/APNs handle the delivery complexity (device offline? queued. Battery saver? high priority bypasses Doze). The alert path is completely independent from the forecast serving path: it has its own infrastructure, scaling, and SLOs.
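After the index narrows the candidates, each candidate location still needs an exact containment check against the alert polygon. The sketch below uses the classic ray-casting test in plain Python as a stand-in for what ST_Contains does in PostGIS; the rectangle and user records are hypothetical, not the real warning polygon.

```python
Point = tuple[float, float]  # (lat, lng)

def point_in_polygon(lat: float, lng: float, polygon: list[Point]) -> bool:
    """Ray casting: count how many polygon edges lie east of the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lng = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < cross_lng:
                inside = not inside
    return inside

# Hypothetical alert polygon (a simple rectangle) and candidate users.
alert_polygon = [(32.60, -97.40), (32.60, -97.10), (32.90, -97.10), (32.90, -97.40)]
candidates = {"user-a": (32.75, -97.33), "user-b": (33.00, -97.20), "user-c": (32.75, -97.50)}

affected = [uid for uid, (lat, lng) in candidates.items()
            if point_in_polygon(lat, lng, alert_polygon)]
print(affected)
```

Treating lat/lng as planar coordinates is a reasonable approximation for county-sized polygons; a production system would rely on the spatial database for geodesic correctness.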
Deep Dive 4: Data Model & Storage (~5 min)
grid_forecasts -- Redis: pre-computed JSON per grid cell
    key    "grid:{lat100}:{lng100}"   -- e.g., "grid:4071:-7401"
    value  JSON
    {
      "location": {"lat": 40.71, "lng": -74.01, "name": "New York, NY"},
      "current": {
        "temp_c": 22, "feels_like_c": 24, "humidity": 65,
        "wind_kph": 18, "wind_dir": "SW",
        "condition": "Partly Cloudy", "icon": "partly-cloudy-day",
        "uv_index": 6, "visibility_km": 16,
        "observed_at": "2026-02-15T14:00:00Z"
      },
      "hourly": [   -- 48 entries
        {"time": "2026-02-15T15:00:00Z", "temp_c": 23, "precip_pct": 10, "condition": "Partly Cloudy"},
        ...
      ],
      "daily": [    -- 14 entries
        {"date": "2026-02-15", "high_c": 24, "low_c": 14, "precip_pct": 20, "condition": "Scattered Showers"},
        ...
      ],
      "alerts": [],   -- active NWS alerts for this grid cell
      "aqi": {"value": 42, "category": "Good"},
      "astronomy": {"sunrise": "06:48", "sunset": "17:32"},
      "updated_at": "2026-02-15T14:05:00Z"
    }

users -- PostgreSQL: registered users with saved locations
    id               UUID
    device_tokens    JSONB         -- {apns: "...", fcm: "..."}
    saved_locations  JSONB[]       -- [{lat, lng, name, geohash}]
    last_known_loc   GEOGRAPHY     -- PostGIS point (last app open)
    preferences      JSONB         -- {units: "metric", alerts: true}
    tier             TEXT          -- "free" | "premium"
    created_at       TIMESTAMPTZ

observations -- TimescaleDB: raw station observations (time-series)
    time             TIMESTAMPTZ
    station_id       TEXT          -- "KJFK", "KLAX"
    temp_c           FLOAT
    humidity_pct     FLOAT
    wind_kph         FLOAT
    pressure_hpa     FLOAT
    condition_code   INT
API Design
GET /v1/weather?lat=40.71&lng=-74.01&units=metric (Full weather for location)
Returns: current + hourly (48h) + daily (14d) + alerts + AQI. Cache: CDN 15 min. Lat/lng snapped to nearest grid cell (1km).
GET /v1/weather/current?lat=...&lng=... (Current conditions only)
Lightweight: temp, humidity, wind, condition. ~500 bytes. Cache: CDN 10 min.
GET /v1/radar/{z}/{x}/{y}.png?timestamp=... (Radar map tile)
Standard slippy map tile. Pre-rendered PNG, 256×256px. S3 → CDN. Cache: 5 min (refreshes with new radar scan).
GET /v1/alerts?lat=...&lng=... (Active alerts for location)
Returns NWS alerts affecting this grid cell. Low TTL (2 min) for freshness. Alerts also embedded in /weather response.
GET /v1/locations/search?q=New+York (Location autocomplete)
Fuzzy search over city names, zipcodes, airport codes. Returns lat/lng + display name. Powered by Elasticsearch with geo-biased ranking.
05 Cross-Cutting Concerns (10-12 min)
Storage Architecture Summary
Data | Store | Why This Store
Pre-computed forecasts | Redis Cluster (~50GB) | Entire forecast grid in-memory. O(1) lookup by grid cell ID. Sub-5ms reads on cache miss from CDN.
Radar tiles | S3 + CDN | ~500K PNG tiles per frame × 12 frames = 6M objects. Immutable, content-addressed. CDN-native.
Raw observations | TimescaleDB | Time-series optimized. Auto-partitioned by time. Used for bias correction and quality control, not serving.
User profiles + locations | PostgreSQL + PostGIS | ACID for user accounts. PostGIS for geo-spatial queries (alert polygon containment). ~500M rows.
Location search | Elasticsearch | Fuzzy text search + geo-biased ranking. "New Y" → "New York, NY" with distance-based boost.
Alert queue | Kafka (separate cluster) | Dedicated topic for alerts. Independent from forecast pipeline. At-least-once delivery guaranteed.
Failure Scenarios
GFS model run delayed (happens ~5% of runs): A stale forecast is still valid. GFS runs every 6 hours, so a 6-hour-old forecast is still the "current" one. The precompute job simply doesn't run until new data arrives. CDN serves cached responses. Users see "Updated 6 hours ago": slightly stale but functional. ECMWF and HRRR are independent, so if GFS is late, other models still update.
Redis cluster node failure: Redis Cluster with 3 replicas per shard. Automatic failover in <10 seconds. During failover, CDN serves cached responses (TTL not yet expired for 95%+ of keys). Worst case: the ~5% of requests that miss the CDN see errors or slightly stale data for about 10 seconds. Acceptable for weather.
CDN edge outage (regional): Multi-CDN strategy: CloudFront primary, Fastly secondary. DNS-based failover. If one CDN's edge PoP is down, traffic routes to another. Origin servers can handle 10K QPS (the 5% cache miss volume) but cannot handle the full 200K QPS if CDN fails entirely. This is a known single point of failure mitigated by multi-CDN.
Thundering herd (CDN TTL expires during peak): Stale-while-revalidate: CDN serves the stale response while fetching fresh data from origin in the background. With request collapsing at the edge, only ONE request per cache key goes to origin, and all concurrent requests get the stale-but-valid response. This removes most of the thundering-herd risk. Redis also serves pre-computed data, so origin "computation" is just a Redis GET.
NWS CAP feed down during severe weather: Multi-source alerts: subscribe to NWS CAP, IPAWS, and Weather Underground alerts. If primary feed is down, secondary feeds still deliver. Alert dedup by NWS alert ID. Alert processor monitors feed health and pages on-call if no alerts received for >15 minutes during active severe weather.
Push notification delivery failure (FCM/APNs down): Retry with exponential backoff. For life-safety alerts: also display in-app (polled on app open), update the /alerts API endpoint (low TTL), and send SMS for premium users. Multiple delivery channels ensure at least one reaches the user. FCM and APNs are highly reliable in practice, but neither guarantees delivery.
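The retry policy in the last scenario can be sketched as capped exponential backoff with jitter. This is a minimal illustration: `send_with_retry`, its parameters, and the use of ConnectionError as the retryable failure are assumptions, and the `send` callable stands in for whatever push client is used.

```python
import random
import time
from typing import Callable

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * 2 ** attempt)

def send_with_retry(send: Callable[[], str], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    jitter: Callable[[], float] = random.random) -> str:
    """Call send() until it succeeds or max_attempts is exhausted.

    Uses "full jitter": each wait is a random fraction of the backoff bound,
    which spreads retries out so workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller fall back to SMS or in-app
            sleep(backoff_delay(attempt) * jitter())
    raise RuntimeError("max_attempts must be at least 1")

# Demonstration with a fake sender that fails twice, then succeeds.
attempts = {"count": 0}
def flaky_send() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("push gateway unavailable")
    return "delivered"

print(send_with_retry(flaky_send, sleep=lambda seconds: None))
```

Because the overall alert budget is 60 seconds, the cap and attempt count should be chosen so that retries give up early enough for the fallback channels to still meet the deadline.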
Security & Access Control
Security. API keys for developer tier (rate-limited per key: 1000 req/min free, 60K/min paid). CDN-level DDoS protection (AWS Shield in front of CloudFront). No sensitive user data in weather responses: they're public data. User profiles (push tokens, saved locations) are PII: encrypted at rest, accessed only by the alert system. Location data is sensitive: never logged in analytics. Radar tiles and forecasts are public data, so no auth is required for read-only endpoints. Write endpoints (user registration, saved locations) require JWT authentication.
Scalability
Scalability. The serving path scales via CDN: adding capacity means adding CDN PoPs, not origin servers. Origin scales horizontally behind a load balancer, but rarely needs more than a few dozen instances (only serves cache misses). The ingestion pipeline scales by adding Kafka partitions and Spark workers. The bottleneck is data sources, not our infrastructure, because NWS publishes at a fixed rate regardless of demand. Alert push scales via FCM/APNs fan-out: these platforms handle billions of pushes daily. Our alert processor needs to handle the geo-query and fan-out coordination, not the actual delivery. For international expansion: add more data sources (JMA for Japan, DWD for Germany) and more CDN PoPs. The architecture is identical, just with more grid cells and more cache keys.
Monitoring & SLOs
Monitoring & SLOs. Key SLOs: API p95 latency <200ms (CDN miss path), CDN cache hit rate >95%, alert delivery p95 <60 seconds, forecast freshness <30 minutes (no grid cell should have data older than 30 min). Metrics: CDN hit/miss ratio by region (detect stale TTLs), Redis memory usage (detect memory pressure), ingestion pipeline lag (detect delayed model runs), alert delivery latency histogram. Alerting: CDN hit rate drops below 90% (cache key misconfig), Redis memory >80% (needs scaling), no new forecast data in 2 hours (pipeline stuck), alert delivery p95 >60s (push infrastructure degraded). Dashboard: global map showing forecast freshness by region, so any region showing >30min staleness is immediately visible.
06 Wrap-Up & Evolution (3-5 min)
What I'd Build Next
  • Hyper-local nowcasting (minute-level precipitation): ML model using radar + satellite to predict precipitation at 1-minute resolution for the next 2 hours. Dark Sky's killer feature. Requires real-time radar ingestion and GPU inference. Premium-only feature.
  • AI weather foundation model (GenCast, GraphCast): Replace or augment traditional NWP with ML-based forecast models. Google's GenCast is reported to outperform ECMWF's ensemble on forecasts out to 15 days at a fraction of the compute cost. Train on decades of reanalysis data (ERA5). GPU inference instead of supercomputer simulation.
  • Personalized weather intelligence: "It'll rain during your commute" combines user calendar, commute route, and minute-level precipitation. "Bad running weather this afternoon" learns the user's outdoor activity patterns. Requires per-user context beyond just location.
  • Crowd-sourced observations: Users report current conditions ("It's raining here"). Aggregate thousands of reports to improve real-time accuracy between official station observations. Apple Weather does this with iPhone barometer data.
  • Climate adaptation dashboard: Historical trends: "This summer was 2.3°C hotter than the 30-year average." Extreme weather frequency: "Your city had 15 more days above 35°C than a decade ago." Enterprise version for agriculture, insurance, energy.
  • Offline-first mobile app: Cache the next 48h forecast on-device. App works without connectivity, which is critical during severe weather when cell networks are overloaded. Delta sync: only download changed grid cells.
"The weather app architecture is defined by one property: extreme cacheability. The same 5KB JSON response serves every user in a 1km grid cell. That makes it a CDN-first design where origin servers handle <5% of traffic. The ingestion pipeline (ingest → blend → precompute → Redis → CDN) runs on a fixed schedule, completely decoupled from user demand. The alert path is the exception: life-safety push notifications must be delivered in seconds, using a separate geo-targeted push system. The key tradeoff: precomputation adds 15-minute staleness but eliminates the thundering herd problem and enables O(1) serving at any scale."
07 Interview Q&A (Practice)
Q1

Why precompute all forecasts instead of computing on demand with caching?

A

Three reasons: (1) Thundering herd elimination: when a CDN TTL expires during a hurricane, thousands of users hit the same cache key simultaneously. With on-demand computation, all of those requests cascade to your compute tier. With precomputation, they all hit Redis, which is a simple GET, not a computation. (2) Predictable load: the precompute job runs on a fixed schedule (every 15 min) regardless of user traffic. Origin capacity planning is trivial. (3) Cold start: an on-demand system has no cached data when it first deploys or after a cache flush. With precomputation, Redis always has the full forecast grid, so there's no "cold" state. The tradeoff is 15-minute staleness, but weather changes on hourly timescales, so 15-minute-old data is effectively fresh.

Q2

How do you handle a tornado warning reaching 10 million users in under 60 seconds?

A

The alert path is completely independent from forecast serving. Step 1 (~5s): Alert processor receives NWS CAP XML, parses the alert polygon. Step 2 (~5s): Geo-index query finds all users with saved locations or last-known positions inside the polygon. Users are pre-indexed by geohash, so the query is a prefix scan, not a full table scan. Step 3 (~20s): Fan out push notifications via FCM (Android) and APNs (iOS). These platforms are built for millions of concurrent pushes. We split device tokens into work batches of 100K: 10M users ÷ 100K per batch = 100 batches, parallelized across workers, with each worker issuing provider requests within the FCM/APNs per-request limits. The key design choice: push, not pull. Users can't poll for alerts because their app might be closed. Push notifications wake the device and display immediately. For redundancy, we also update the in-app alert banner and send SMS to premium users.
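The batching arithmetic in Step 3 looks like this in code. It is a sketch only: the 100K batch size comes from the answer above, while the helper names are illustrative, and how many tokens a single FCM or APNs request accepts is a provider detail that is deliberately not modeled.

```python
from typing import Iterator, Sequence

def batch_count(total_tokens: int, batch_size: int) -> int:
    """Number of work batches needed, rounding up for a partial last batch."""
    return (total_tokens + batch_size - 1) // batch_size

def batches(tokens: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of device tokens for parallel workers."""
    for start in range(0, len(tokens), batch_size):
        yield tokens[start:start + batch_size]

print(batch_count(10_000_000, 100_000))   # batches for the 10M-user example

# Small-scale demonstration with placeholder tokens.
demo_tokens = [f"token-{i}" for i in range(10)]
print([len(batch) for batch in batches(demo_tokens, 4)])
```

Each batch would be handed to a worker through a queue, so a slow or failing worker delays only its own slice of the fan-out.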

Q3

Your CDN goes down during a hurricane, the exact moment when traffic spikes 10x. What happens?

A

This is the nightmare scenario, and it's a known single point of failure. Mitigations: (1) Multi-CDN: CloudFront primary, Fastly secondary. DNS-based failover. If CloudFront degrades, Route 53 health checks detect it and shift traffic to Fastly within 60 seconds. (2) Origin can't absorb full CDN traffic (200K QPS vs 10K QPS origin capacity), so if BOTH CDNs fail, we degrade gracefully: serve a simplified response (current conditions only, no hourly), shed low-priority endpoints (radar tiles, location search), and activate an emergency static page with regional weather summaries. (3) The alert path is independent: push notifications don't go through the CDN at all, so tornado warnings still reach users even if the forecast API is degraded. The honest answer: if both CDNs fail simultaneously during peak, some users will see errors. But the life-safety alert path survives.

Q4

How do you handle the lat/lng to grid cell mapping efficiently at scale?

A

The mapping is pure math, no database lookup required: grid_lat = floor(lat × 100), grid_lng = floor(lng × 100). This snaps any lat/lng to the 0.01° grid cell that contains it. The cache key becomes deterministic: grid:{grid_lat}:{grid_lng}. Once the client or edge rewrites the request to the grid ID, the CDN can cache by URL path with no further query-string normalization. The 0.01° resolution gives ~10M populated land-surface grid cells. At the equator, 0.01° ≈ 1.1km. At 60° latitude, a cell is only ~550m wide east-west (longitude cells shrink with latitude). For higher fidelity near population centers, we could use a variable-resolution grid (finer in cities, coarser in oceans), but the uniform grid is simpler and 1km is sufficient for consumer weather. The key property: every user in the same 1km patch gets the identical response, making CDN caching trivially effective.
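The shrinking of longitude cells with latitude follows from a cosine factor, which a short calculation makes visible. This sketch assumes a spherical Earth with roughly 111.32 km per degree at the equator; the function name is illustrative.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of arc at the equator

def cell_size_km(lat_deg: float, cell_deg: float = 0.01) -> tuple[float, float]:
    """Return (north-south, east-west) size in km of a grid cell at a latitude.

    The north-south extent is constant on a sphere; the east-west extent
    scales with cos(latitude).
    """
    north_south = KM_PER_DEGREE * cell_deg
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

for lat in (0, 40.71, 60):
    ns, ew = cell_size_km(lat)
    print(f"lat {lat:>5}: {ns:.2f} km north-south x {ew:.2f} km east-west")
```

At New York's latitude a cell is roughly 1.1km by 0.84km, so "1km resolution" is a fair shorthand across most populated latitudes.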

Q5

How do you make the forecast more accurate than just serving raw NWS data?

A

Three techniques compound accuracy: (1) Model blending: GFS, ECMWF, HRRR, and NAM each have different strengths. We maintain a rolling accuracy score per model per region (comparing past forecasts to actual observations). The blender assigns weights proportional to recent accuracy: if ECMWF has been beating GFS in the Pacific Northwest this week, it gets higher weight there. (2) Statistical downscaling (MOS, Model Output Statistics): NWP models output at 9-13km resolution and have systematic biases ("GFS is always 2°C warm in Denver in January"). We train regression models that correct these biases using years of forecast-vs-observation pairs. (3) Observation assimilation: between NWP model runs (every 6-12 hours), we blend real-time station observations to keep "current conditions" fresh. If a METAR station reports 28°C but the GFS forecast said 25°C, we adjust the nearby grid cells toward the observation. This is the secret sauce that differentiates weather apps: everyone gets the same raw NWP data, but the post-processing is proprietary.