- What outcome are we optimizing for? → Two linked outcomes: (1) Audience size: unique monthly visitors, driven by Zestimate accuracy (homeowners checking their value) and listing completeness (buyers searching). (2) Lead conversion: "Contact Agent" clicks that result in a home sale. These form the FLYWHEEL: more accurate data → more visitors → more leads → more agent revenue → fund more data. This tells us that Zestimate accuracy ISN'T a vanity metric; it directly drives the business model.
- What does Zillow actually know about? → Every residential property in the US (~110M). For each: physical attributes (beds, baths, sqft, lot size, year built), tax assessment history, sale transaction history, current listing status (for sale, for rent, off-market), photos, and the Zestimate.
- Data sources? → 800+ MLS feeds (active listings), county assessor/recorder offices (tax records, deeds), user-submitted updates ("I remodeled my kitchen"), agent-submitted data, satellite/aerial imagery.
- Revenue model? → Primarily Premier Agent: agents pay for leads (buyer inquiries) in specific zip codes. Secondary: rental advertising, mortgage marketplace, display ads. NOT transaction commissions.
- Scale? → ~240M monthly visits, ~110M property records, ~2M active for-sale listings, ~1M rental listings, ~5M home sales/year in the US.
- What's the Zestimate? → ML-predicted market value for every home. Neural network model. Median error: ~6.9% for off-market homes, ~2% for on-market. Updated daily for all 110M homes.
| In Scope | Out of Scope |
|---|---|
| Property database (110M homes) | Zillow Offers (iBuying, now discontinued) |
| Multi-source data ingestion & dedup | Mortgage origination / Zillow Home Loans |
| Zestimate ML pipeline | Rental application processing |
| Geo-spatial search & map UI | 3D home tour hosting |
| Listing display & detail pages | Construction marketplace |
| Agent lead marketplace (Premier Agent) | International markets |
| Saved searches & alerts | ShowingTime scheduling internals |
- UC1 (Map Search): User drags map to a neighborhood → system queries all properties visible in the bounding box → shows pins for for-sale listings with prices, overlays Zestimates for off-market homes. Filters: price range, beds, baths, sqft, home type. Must update in <500ms as the map pans.
- UC2 (Property Detail): User clicks a property → full page: 30+ photos, Zestimate with history chart, tax history, sale history, school ratings, nearby comps, estimated mortgage payment, neighborhood stats, and "Contact Agent" button.
- UC3 (Zestimate Computation): Every night, the ML pipeline scores all 110M homes with updated data (new sales, new listings, tax assessments) → each home gets a fresh Zestimate and confidence interval.
- UC4 (Agent Lead): User clicks "Contact Agent" on a $500K listing in Austin → system routes the lead to the Premier Agent who bought that zip code → agent receives lead instantly (email + app notification) → lead tracked through CRM.
- UC5 (Saved Search Alert): User saves "3-bed homes in Brooklyn under $800K" → when a new listing matches, send push notification + email within minutes of listing appearing.
- Comprehensive coverage: Zillow's value proposition depends on having data for EVERY home, not just those for sale. A missing property = a homeowner who can't check their Zestimate = lost traffic.
- Data freshness: New MLS listings should appear on Zillow within 15 minutes of being posted. Zestimates updated daily. Tax/sale records: within days of recording.
- Geo-spatial performance: Map interactions must feel instantaneous. A bounding box query across 110M properties must return results in <200ms. This requires specialized geo-indexing.
- Zestimate accuracy: Median error of ~2% for on-market, ~7% for off-market. Model must handle geographic variation (NYC condo vs. rural Montana ranch) and temporal trends (hot/cold markets).
- SEO performance: Property detail pages are a MAJOR traffic source (Google searches for addresses). Pages must be server-rendered, fast-loading, and structured-data rich. ~110M unique URLs.
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Spatial queries: "homes in this map viewport" | PostgreSQL + PostGIS (GiST spatial index) | ST_MakeEnvelope + GiST index = O(log n) spatial lookup. Returns 500-5K properties in <50ms even with 110M total properties. | N/A |
| Complex faceted search: price + beds + baths + location | Elasticsearch for search (PostGIS for spatial-only) | ES handles combined geo + attribute filters natively. PostGIS struggles at >100 concurrent faceted queries/sec. | AP |
| Zestimate for 110M properties (daily refresh) | Batch ML pipeline on Spark (not real-time) | Recomputing 110M valuations is a distributed computing problem. Real-time per-request inference would cost 1000x more and not improve accuracy. | N/A |
| MLS listings must appear near-real-time | Kafka/Kinesis → Elasticsearch index update | Listing changes stream through Kafka, consumed by ES indexer. New listing searchable within 5 minutes. Batch import would mean hours of staleness. | AP |
| Property images: tens of millions of photos | S3 + CDN (not database BLOBs) | Average listing has 20 photos. 1.5M active listings × 20 = 30M images. S3 + CDN handles this with zero DB overhead. | N/A |
| Data from 800+ MLS feeds with different schemas | Entity resolution pipeline (deduplicate + normalize) | The same property appears in multiple MLS feeds with different formats. Address normalization + fuzzy matching create a single canonical record. | N/A |
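The entity-resolution row above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical abbreviation table and a hand-picked similarity threshold; a production pipeline would likely rely on a full postal suffix list plus geocoding. The house number is compared exactly, because fuzzy matching alone would treat "123 Main St" and "124 Main St" as near-identical strings.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table (assumption); real systems use a much larger list.
ABBREVIATIONS = {
    "street": "st", "avenue": "ave", "road": "rd", "drive": "dr",
    "boulevard": "blvd", "lane": "ln", "apartment": "apt", "unit": "apt",
}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse common suffix variants."""
    text = raw.lower().replace("#", " apt ")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


def same_property(a: str, b: str, threshold: float = 0.85) -> bool:
    """Exact match on house number, fuzzy match on the rest of the address."""
    num_a, _, rest_a = normalize_address(a).partition(" ")
    num_b, _, rest_b = normalize_address(b).partition(" ")
    if num_a != num_b:
        return False
    return SequenceMatcher(None, rest_a, rest_b).ratio() >= threshold


if __name__ == "__main__":
    print(normalize_address("123 Main Street, Apt. 4"))
    print(same_property("123 Main Street, Apt. 4", "123 MAIN ST #4"))
```

The threshold of 0.85 is an arbitrary illustration value; in practice it would be tuned against labeled duplicate pairs.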
Data Ingestion Pipeline (INGEST)
- 800+ MLS feeds (RETS/RESO Web API)
- County public records (tax, deeds, permits)
- User/agent edits (home facts)
- Entity resolution & deduplication
Property Graph (CORE)
- Canonical record for every US home (110M)
- Physical attributes, tax history, sale history
- Current listing status, photos, agent
- Pre-computed Zestimate + confidence interval
Zestimate Pipeline (ML BATCH)
- Neural network trained on sale transactions
- Features: physical, geographic tiling, temporal
- Daily batch scoring: 110M homes
- Outputs: Zestimate, confidence range, rent Zestimate
Geo Search Service (SEARCH)
- Bounding-box & polygon queries
- Geo-spatial index (R-tree / geohash)
- Filters: price, beds, baths, type, status
- Returns pin data for map rendering
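To make the geohash option concrete, here is a small sketch of standard geohash encoding. It is illustrative only and is not presented as the index Zillow actually uses. Points that are close together usually share a prefix, so a prefix lookup approximates a bounding-box search, although cells on opposite sides of a boundary can differ in their first characters.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lng: float, precision: int = 7) -> str:
    """Encode a coordinate by repeatedly halving the lng/lat ranges."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    bits = []
    use_lng = True  # bits are interleaved, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lng = not use_lng

    chars = []
    for i in range(0, len(bits), 5):  # every 5 bits become one base32 character
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=5))
```

Longer hashes describe smaller cells, so the precision can be chosen to roughly match the zoom level of the map.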
Property Detail Service (DETAIL)
- Full property page: all attributes + history
- Comparable sales (comps) computation
- School ratings, neighborhood stats
- SEO-optimized server-rendered pages
Premier Agent Service (REVENUE)
- Lead routing: buyer inquiry → matched agent
- Agent inventory: which zips does each agent own?
- Lead delivery: email, SMS, CRM push
- Performance tracking: response time, conversion
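A minimal sketch of zip-based lead routing follows. The inventory contents, agent IDs, and the tie-breaking rule (send the lead to the agent in that zip who has received the fewest leads so far) are all assumptions made for illustration; the source only states that leads go to the agent who bought the zip code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lead:
    buyer_email: str
    listing_zip: str
    listing_price: int


# Hypothetical inventory: zip code -> agent IDs that purchased that zip.
AGENT_INVENTORY = {
    "78701": ["agent_17", "agent_42"],
    "78704": ["agent_42"],
}


def route_lead(lead: Lead, inventory: dict, leads_sent: dict) -> Optional[str]:
    """Return the agent who should receive this lead, or None if the zip is unsold."""
    candidates = inventory.get(lead.listing_zip, [])
    if not candidates:
        return None
    # Assumed policy: least-loaded agent wins; ties go to the first listed agent.
    chosen = min(candidates, key=lambda agent: leads_sent.get(agent, 0))
    leads_sent[chosen] = leads_sent.get(chosen, 0) + 1
    return chosen


if __name__ == "__main__":
    counts = {}
    lead = Lead("buyer@example.com", "78701", 500_000)
    print(route_lead(lead, AGENT_INVENTORY, counts))
    print(route_lead(lead, AGENT_INVENTORY, counts))
```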
Alert Engine (NOTIFY)
- Saved searches → new listing alerts
- Price change notifications
- Zestimate change alerts for homeowners
- Must fire within 15 min of listing change
Media Service (CDN)
- ~2B+ property photos
- Multiple resolutions (thumbnail, detail, gallery)
- CDN-served, aggressively cached
- Satellite/aerial imagery layer
Zestimate Accuracy by Segment
| Segment | Median Error | Why |
|---|---|---|
| On-market (listing price available) | ~2% | Listing price is an extremely strong signal. Model mostly predicts "close to asking." |
| Off-market, recent sale (<2 yr) | ~5% | Recent sale price + market trend. Good data quality. |
| Off-market, no recent sale | ~7-10% | Must rely on comps, physical attributes, tax assessment. More uncertainty. |
| Rural / unique properties | ~15%+ | Few comps, unique features (100-acre ranch). Model struggles with sparse data. |
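The "median error" column can be read as a median absolute percentage error: half of the estimates fall within that percentage of the eventual sale price. The sketch below computes it for a handful of made-up estimate/sale pairs; the numbers are illustrative and not Zillow data.

```python
from statistics import median


def median_abs_pct_error(estimates, sale_prices):
    """Median of |estimate - sale| / sale across paired observations."""
    if len(estimates) != len(sale_prices):
        raise ValueError("estimates and sale_prices must have the same length")
    errors = [abs(est - sale) / sale for est, sale in zip(estimates, sale_prices)]
    return median(errors)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    estimates = [510_000, 290_000, 1_050_000]
    sales = [500_000, 300_000, 1_000_000]
    print(f"{median_abs_pct_error(estimates, sales):.2%}")
```

Using the median rather than the mean keeps a few badly mispriced unique properties from dominating the headline number, which is why the rural segment is reported separately.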
- Address geocoding: Every address → (latitude, longitude). Used for both entity resolution and geo-search. Geocoder must handle non-standard addresses, new construction (not yet in Google/USPS), and rural addresses ("County Road 42, 3rd mailbox on left").
- MLS data freshness: Zillow receives listing updates via RETS/RESO feeds. Most MLS feeds push updates every 15 minutes. Some smaller MLSs only update hourly. The ingestion pipeline must handle: new listings, price changes, status changes (active → pending → sold), photo additions, and listing withdrawals.
- Streaming architecture: Zillow moved from batch SQL Server processing to streaming (Kinesis Firehose → S3 Data Lake → Spark). Every property change is an event in a stream. Downstream consumers (search index, Zestimate, alerts) subscribe to relevant events.
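One way to turn periodic feed snapshots into the change events described above is to diff the previous and current version of each listing. The sketch below is an assumption about how that step could look; the dictionary field names are hypothetical, while the event type names follow the ones used for the event stream in this document.

```python
from typing import Optional


def diff_listing(old: Optional[dict], new: dict) -> list:
    """Compare two snapshots of one listing and return the events to publish."""
    listing_id = new["listing_id"]
    if old is None:
        return [{"type": "listing.created", "listing_id": listing_id}]

    events = []
    if old["price"] != new["price"]:
        events.append({
            "type": "price.changed",
            "listing_id": listing_id,
            "old": old["price"],
            "new": new["price"],
        })
    # Any other changed field (status, photos, ...) becomes a generic update.
    changed = sorted(k for k in new if k != "price" and old.get(k) != new[k])
    if changed:
        events.append({
            "type": "listing.updated",
            "listing_id": listing_id,
            "fields": changed,
        })
    return events


if __name__ == "__main__":
    before = {"listing_id": "L1", "price": 500_000, "status": "active"}
    after = {"listing_id": "L1", "price": 490_000, "status": "pending"}
    for event in diff_listing(before, after):
        print(event)
```

Emitting a dedicated price event lets the alert consumer subscribe narrowly, while the search indexer can consume every event type.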
| Data | Store | Why This Store |
|---|---|---|
| Property records | PostgreSQL + PostGIS | Address, lat/lng, attributes, tax records. PostGIS for spatial queries. ~110M properties in the US. |
| Listing data (active) | PostgreSQL + Elasticsearch | MLS feed data: price, photos, status, agent. ES for search/filter. Updated near-real-time from MLS. |
| Zestimate model inputs | S3 + feature store | ML features: tax assessments, comparable sales, neighborhood stats. Refreshed daily per property, ahead of the nightly scoring run. |
| Property images | S3 + CDN | MLS photos + satellite/street view. CDN-first serving. ~20 images per listing. |
| Search index | Elasticsearch | Geo-spatial + faceted search. Filters: price, beds, baths, sqft, year built. Updated via Kafka on listing change. |
| User saved searches | PostgreSQL | Saved search criteria + notification preferences. Matched against new listings for email alerts. |
| Event stream | Kafka / Kinesis | listing.created, listing.updated, price.changed. Consumed by search index, Zestimate pipeline, alerts. |
- 110M unique URLs: Every property has a permanent detail page URL. Google indexes these pages. When someone Googles "123 Main St Austin TX," Zillow's page typically ranks at or near the top → free traffic → lead opportunity.
- Server-side rendering (SSR): Property pages must be fully rendered HTML for Googlebot (not client-side React that Googlebot might not execute). SSR adds server cost but is critical for SEO.
- Structured data (Schema.org): Property pages emit JSON-LD structured data: address, price, beds, baths, photos. Google uses this for rich search results (price, photo in the SERP).
- Page speed: Core Web Vitals directly affect Google ranking. LCP (Largest Contentful Paint) target: <2.5 seconds. Photos (the heaviest element) served from CDN with lazy loading and responsive srcset.
- Content freshness signal: Active listings update frequently (price changes, status) → Google re-crawls often. Off-market pages update less → crawl budget managed carefully to prioritize active listings.
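The structured-data point can be sketched as a server-side helper that renders a JSON-LD script tag. The choice of the Schema.org `SingleFamilyResidence` type and the input field names are assumptions for illustration; the source does not say which types Zillow emits. Price is left out here because how a listing price is modeled depends on the type chosen.

```python
import json


def property_json_ld(prop: dict) -> str:
    """Render a JSON-LD <script> tag for a property detail page."""
    doc = {
        "@context": "https://schema.org",
        "@type": "SingleFamilyResidence",  # assumed type, see lead-in
        "address": {
            "@type": "PostalAddress",
            "streetAddress": prop["street"],
            "addressLocality": prop["city"],
            "addressRegion": prop["state"],
            "postalCode": prop["zip"],
        },
        "numberOfBedrooms": prop["beds"],
        "numberOfBathroomsTotal": prop["baths"],
        "image": prop["photo_urls"],
    }
    # Escape "<" so user-supplied text cannot close the script tag early.
    payload = json.dumps(doc).replace("<", "\\u003c")
    return f'<script type="application/ld+json">{payload}</script>'


if __name__ == "__main__":
    print(property_json_ld({
        "street": "123 Main St", "city": "Austin", "state": "TX",
        "zip": "78701", "beds": 3, "baths": 2,
        "photo_urls": ["https://example.com/photo1.jpg"],
    }))
```

Because the tag is produced during server-side rendering, crawlers see it in the initial HTML without executing any client-side JavaScript.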
- Volume: Tens of millions of saved searches. Each defined by: geo boundary + filters (price, beds, status).
- Matching: When a new listing appears (or price changes), evaluate against all saved searches that overlap the listing's location and match its attributes. This is an INVERTED problem: instead of "find listings matching a search," it's "find searches matching a listing."
- Implementation: Publish listing_change events to Kafka → saved-search matcher service consumes events → evaluates each against a spatial index of saved searches → generates notifications. Target: alert within 15 minutes of listing change.
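- Deduplication: Don't send the same user the same alert for the same listing twice. Dedup key: (user_id, property_id, alert_type). Because alert_type is part of the key, a price-change alert can still follow an initial new-listing alert for the same property.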
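The inverted matching and dedup steps above can be sketched with a coarse grid as the spatial index of saved searches. The grid cell size, the field names, and the in-memory set used for dedup are assumptions for illustration; a real service would need a persistent dedup store and a more capable spatial index.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

CELL_DEG = 0.1  # hypothetical grid cell size in degrees


@dataclass(frozen=True)
class SavedSearch:
    user_id: str
    min_lat: float
    min_lng: float
    max_lat: float
    max_lng: float
    max_price: int
    min_beds: int


def cell_of(lat: float, lng: float) -> tuple:
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))


class SavedSearchIndex:
    def __init__(self):
        self._cells = defaultdict(list)  # grid cell -> searches covering it
        self._sent = set()               # (user_id, property_id, alert_type)

    def add(self, search: SavedSearch) -> None:
        lo_row, lo_col = cell_of(search.min_lat, search.min_lng)
        hi_row, hi_col = cell_of(search.max_lat, search.max_lng)
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                self._cells[(row, col)].append(search)

    def match(self, listing: dict, alert_type: str) -> list:
        """Return user IDs to notify for this listing event, without repeats."""
        users = []
        cell = cell_of(listing["lat"], listing["lng"])
        for s in self._cells.get(cell, []):
            inside = (s.min_lat <= listing["lat"] <= s.max_lat
                      and s.min_lng <= listing["lng"] <= s.max_lng)
            if not (inside and listing["price"] <= s.max_price
                    and listing["beds"] >= s.min_beds):
                continue
            key = (s.user_id, listing["property_id"], alert_type)
            if key not in self._sent:
                self._sent.add(key)
                users.append(s.user_id)
        return users
```

Only the searches registered in the listing's grid cell are evaluated, which is what keeps the per-event cost far below scanning every saved search.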
- MLS data rules: Each of the 800+ MLSs has its own rules about how data can be displayed. Some require Zillow to show the listing broker's name. Some restrict showing sold-price data. Some require data to be removed within 24 hours of expiration. Compliance is enforced per-MLS.
- Fair Housing Act: Search results cannot discriminate based on race, color, religion, sex, national origin, familial status, or disability. Zillow must NOT allow targeting ads by protected characteristics. Personalization models must be audited for bias.
- CCPA/privacy: User search behavior and lead submissions contain personal data. Users can request deletion. Agent CRM integration must respect user opt-out preferences.
| Extension | Architecture Impact |
|---|---|
| AI Home Shopping Assistant | Conversational interface: "Find me a 3-bed near good schools in Austin under $500K with a pool." Requires: natural language → structured query translation, multi-turn refinement, integration with search service, and recommendation model that understands taste (not just filters). |
| Touring & Offers (full transaction) | Move beyond leads into facilitating the actual transaction: schedule tours (ShowingTime integration), submit offers digitally, manage escrow/closing. Creates a transaction platform, not just a media company. Requires: real-time availability for tours, offer management workflow, digital signature integration. |
| Renovation Value Estimator | "If I add a bathroom, how much would my home value increase?" Requires: training a model on before/after renovation data (permit records + sale prices), understanding renovation costs by market, and presenting ROI estimates per improvement type. |
| Climate Risk Layer | Show flood, fire, heat, and storm risk per property. Increasingly important for buyers and insurers. Requires: integrating FEMA flood maps, wildfire risk models, climate projection data. Serve as a layer on the property detail page and search filters ("exclude flood zone"). |
| Real-Time Zestimate (streaming) | Today the Zestimate is batch-computed daily. A streaming version would update within minutes of a nearby sale closing. Requires: streaming pipeline (Kafka + Flink) that detects new sale events, identifies affected properties (same neighborhood), and incrementally updates their Zestimates. Much harder than batch but enables "your home value just changed" push notifications. |
How does PostGIS handle "show me all homes for sale in this map viewport" efficiently?
When a user pans the map to a viewport (say, downtown Austin), the frontend sends the bounding box coordinates: (SW_lat, SW_lng, NE_lat, NE_lng). PostGIS uses a GiST (Generalized Search Tree) spatial index to find all properties within that rectangle in roughly O(log n) time. The query is: `WHERE location && ST_MakeEnvelope(sw_lng, sw_lat, ne_lng, ne_lat, 4326)`. Note the argument order: ST_MakeEnvelope takes (xmin, ymin, xmax, ymax), so longitude comes before latitude. The && operator is a bounding-box overlap test that uses the spatial index, so it doesn't scan all 110M properties. For a typical viewport showing a neighborhood, this returns 500-5000 properties in <50ms. The frontend then clusters nearby pins to avoid rendering thousands of markers. As the user zooms in, clusters expand into individual pins. The tricky part is combining spatial with attribute filters: "homes in this viewport under $500K with 3+ bedrooms." PostGIS handles the spatial part, but for complex faceted queries, we route to Elasticsearch, which maintains a geo-point index alongside the attribute indices. The choice between PostGIS and ES depends on query complexity.
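As a sketch of how a backend might assemble that viewport query, the helper below validates the bounding box and returns parameterized SQL. The table name `properties`, the column names, and the `for_sale` status value are hypothetical; the `%s` placeholders follow the style used by common PostgreSQL drivers for Python.

```python
def viewport_query(sw_lat, sw_lng, ne_lat, ne_lng, limit=5000):
    """Build a parameterized PostGIS bounding-box query for a map viewport."""
    if not (-90 <= sw_lat <= ne_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= sw_lat <= ne_lat <= 90")
    if not (-180 <= sw_lng <= ne_lng <= 180):
        # A viewport crossing the antimeridian would need to be split in two.
        raise ValueError("longitudes must satisfy -180 <= sw_lng <= ne_lng <= 180")

    sql = (
        "SELECT property_id, price, ST_Y(location) AS lat, ST_X(location) AS lng "
        "FROM properties "  # hypothetical table and columns
        "WHERE status = 'for_sale' "
        "AND location && ST_MakeEnvelope(%s, %s, %s, %s, 4326) "
        "LIMIT %s"
    )
    # ST_MakeEnvelope expects (xmin, ymin, xmax, ymax): longitude first.
    params = (sw_lng, sw_lat, ne_lng, ne_lat, limit)
    return sql, params


if __name__ == "__main__":
    sql, params = viewport_query(30.25, -97.76, 30.29, -97.72)
    print(sql)
    print(params)
```

Passing coordinates as parameters rather than formatting them into the string keeps the query safe from injection and lets the database reuse the plan. The `LIMIT` caps the payload when a user zooms far out.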
How accurate is the Zestimate, and what's the ML pipeline?
The Zestimate has a median error of ~2-3% for on-market homes and ~7% for off-market homes. The ML pipeline runs as a daily batch: (1) Feature engineering: for each of 110M properties, compute features from public records (tax assessment, lot size, year built, recent renovations), comparable sales (similar homes sold within 1 mile in the last 6 months), neighborhood stats (school ratings, crime rates, walkability), and macro factors (interest rates, local employment). (2) Model training: gradient-boosted trees (XGBoost) trained on the last 12 months of actual sale prices vs. features. Separate models per metro area because real estate is hyperlocal. (3) Inference: apply the model to all 110M properties and write results to the property store. This runs on a Spark cluster, where feature computation is the bottleneck (joining across 20+ data sources per property). The accuracy limitation is fundamentally about data: interior condition (renovated kitchen vs. original) isn't in public records. This is why on-market homes (with listing photos and descriptions that imply condition) have better estimates.
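The comparable-sales feature from step (1) can be sketched as follows. It selects sales within 1 mile and roughly 6 months of a reference date and reduces them to a median price per square foot. The field names, the 183-day window, and the choice of median price per square foot as the summary are assumptions for illustration; the source does not specify how comps are aggregated.

```python
import math
from datetime import date, timedelta
from statistics import median

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def comp_price_per_sqft(subject, sales, as_of, radius_miles=1.0, window_days=183):
    """Median $/sqft of recent nearby sales, or None if there are no comps."""
    cutoff = as_of - timedelta(days=window_days)
    values = [
        sale["price"] / sale["sqft"]
        for sale in sales
        if sale["sold_on"] >= cutoff
        and haversine_miles(subject["lat"], subject["lng"],
                            sale["lat"], sale["lng"]) <= radius_miles
    ]
    return median(values) if values else None


if __name__ == "__main__":
    subject = {"lat": 30.27, "lng": -97.74}
    sales = [  # hypothetical sales for illustration
        {"lat": 30.275, "lng": -97.74, "price": 400_000, "sqft": 2000, "sold_on": date(2024, 3, 1)},
        {"lat": 30.40, "lng": -97.74, "price": 900_000, "sqft": 3000, "sold_on": date(2024, 4, 1)},
    ]
    print(comp_price_per_sqft(subject, sales, as_of=date(2024, 6, 1)))
```

A linear scan like this is only for clarity. At the scale described, the radius filter would be pushed into a spatial join, which is part of why feature computation dominates the pipeline's runtime.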