- What outcome matters most? → Review quality (bugs caught before merge / total bugs) balanced against false positive rate. A tool that flags real bugs is invaluable. A tool that floods PRs with noise gets disabled. Secondary: time-to-first-review (developers shouldn't wait for a human reviewer when AI can start immediately). This shapes the model cascade: prefer precision over recall, because it's better to miss a minor issue than to cry wolf.
- Trigger mechanism? PR creation/update only, or also IDE and CLI? → PR webhook is primary. IDE and CLI as secondary surfaces.
- Platforms? GitHub, GitLab, Bitbucket? → All three. Abstracted behind a platform adapter layer.
- What does "context-aware" mean specifically? → The review should understand: the diff, the full codebase (dependencies, call graph), linked issues (Jira/Linear), past PRs, team coding standards, and previous feedback from this team.
- Static analysis too, or just LLM? → Both. 40+ linters/SAST tools run alongside LLM review. Results synthesized into unified feedback.
- Interactive? Can devs reply to review comments? → Yes, conversational. Dev replies, AI responds, and can learn from feedback.
- Scale? → ~100K repositories, ~500K PRs/day, ~2M review comments posted/day.
| In Scope | Out of Scope |
|---|---|
| Webhook ingestion (GitHub/GitLab/Bitbucket) | IDE inline review (mention as extension) |
| Codebase cloning & context assembly | Code fix auto-application |
| LLM-powered code review pipeline | CI/CD pipeline integration |
| Static analysis tool orchestration (40+ linters) | LLM training / fine-tuning |
| PR comment posting (summary + inline) | Billing / subscription |
| Conversational follow-up in PR threads | Code generation / auto-fix PRs |
| Learning from team feedback | |
| Code graph / dependency analysis | |
- UC1: Dev opens PR → within 90 seconds, CodeRabbit posts a review: summary of changes, walkthrough, architecture diagram, file-by-file comments with inline suggestions.
- UC2: Dev pushes new commits to PR → incremental review of only the new changes, referencing the prior review context.
- UC3: Dev replies to a CodeRabbit comment "we prefer 2-space indentation" → system learns this preference, applies to all future reviews for this repo.
- UC4: CodeRabbit detects a race condition by analyzing the dependency graph, not just the diff, which is something a diff-only reviewer would miss.
- Latency: Full review posted within 90 seconds of PR event. First comment (summary) within 30s.
- Accuracy > speed: False positives erode trust fast. Better to post fewer, high-quality comments than flood the PR with noise.
- Security: Customer code is cloned into an ephemeral sandbox, processed, then destroyed. Zero persistence after review. No code used for model training.
- Scalability: Handle PR spike during business hours (8am to 6pm per timezone = rolling peak).
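- Reliability: Webhook must ACK within 50ms. Actual review processing is async. A slow response risks the platform marking the delivery as failed (GitHub times out after 10 seconds and does not redeliver automatically), so the handler only validates and enqueues.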
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| Review posted within 90 seconds of PR | Async pipeline: Kafka → clone → analyze → post | Webhook triggers async flow. Clone + static analysis + AI review parallelized where possible. Synchronous webhook processing would time out. | - |
| Customer code security: never persisted | Ephemeral sandboxes, destroyed after review | Code cloned into ephemeral container, deleted after review completes. No persistent storage of customer code reduces breach blast radius. | - |
| False positive rate must decrease over time | Feedback loop: developer reactions → per-repo tuning | Thumbs up/down on comments feed into per-repo learned preferences. Comment patterns consistently dismissed get suppressed. | - |
| Multi-language: Python, JS, Go, Rust, etc. | LLM-based review (not language-specific rules) | An LLM can review code in any mainstream language. A purely rule-based approach requires per-language rule sets, which doesn't scale to 20+ languages. | - |
| Context beyond the diff: understand the whole codebase | Vector embeddings of repository (LanceDB) | Index existing codebase for semantic search. Review comments reference patterns elsewhere in the repo, not just the changed lines. | - |
Webhook Gateway [INGESTION]
- Receives GitHub/GitLab/Bitbucket webhooks
- HMAC signature validation
- ACK in <50ms, enqueue event
- Deliberately "dumb": no business logic
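The validate-then-enqueue behaviour of the gateway could look roughly like the sketch below. It assumes GitHub's `X-Hub-Signature-256` header format (`sha256=` followed by a hex HMAC-SHA256 of the raw request body). The names `verify_signature` and `handle_webhook`, the 202 success status, and the in-memory `queue` standing in for Kafka are illustrative assumptions, not part of the design above.

```python
import hashlib
import hmac
from collections import deque

# Stand-in for the Kafka/Redpanda producer (assumption for illustration).
queue = deque()


def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Return True if header equals 'sha256=<hex HMAC-SHA256 of body>'."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters matched.
    return hmac.compare_digest(expected.encode(), header[len(prefix):].encode())


def handle_webhook(secret: bytes, body: bytes, header: str) -> int:
    """Validate, enqueue, and return an HTTP status. No business logic."""
    if not verify_signature(secret, body, header):
        return 403
    queue.append(body)
    return 202
```

Keeping the handler this small is what makes a tight ACK budget plausible: the only work on the request path is one HMAC and one enqueue.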
Event Queue [BUFFER]
- Kafka / Redpanda
- Decouples ingestion from processing
- Absorbs spikes (shock absorber)
- Ordered per-repo for consistency
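Per-repo ordering on a partitioned log is usually obtained by keying every message on the repository, so that all events for one repo land on the same partition. The sketch below illustrates the idea with a hypothetical `partition_for` helper. Real Kafka clients apply their own hash to the message key, so this only shows the principle.

```python
import hashlib


def partition_for(repo_id: str, num_partitions: int) -> int:
    """Map a repo to a stable partition so its events stay in order."""
    digest = hashlib.sha256(repo_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap to the
    # partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping depends only on `repo_id`, a push that follows a PR-open event for the same repo is consumed after it, while unrelated repos spread across partitions.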
Review Orchestrator [CORE]
- Consumes events from queue
- Spins up sandbox per review
- Coordinates: clone → context → lint → LLM → post
- Manages model cascade routing
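The coordination step can be sketched with `asyncio`. Every stage function here (`clone`, `build_context`, `run_linters`, `llm_review`, `post`) is a hypothetical stub, and running context assembly and linting concurrently is one plausible reading of "parallelized where possible", not a confirmed detail.

```python
import asyncio


# Hypothetical stage stubs; real stages would call the sandbox, the
# context engine, the linters, the LLM, and the platform adapter.
async def clone(event):
    return {"repo": event["repo"], "diff": "+ new line"}


async def build_context(workspace):
    return {"callers": []}


async def run_linters(workspace):
    return ["no findings"]


async def llm_review(workspace, context, lint):
    return [f"reviewed {workspace['repo']} with {len(lint)} lint result(s)"]


async def post(event, comments):
    return len(comments)


async def review(event):
    workspace = await clone(event)
    # Context assembly and linting both depend only on the clone,
    # so they can run concurrently.
    context, lint = await asyncio.gather(
        build_context(workspace), run_linters(workspace)
    )
    comments = await llm_review(workspace, context, lint)
    return await post(event, comments)


def run_review(event, budget_s: float = 90.0) -> int:
    """Run one review and enforce the end-to-end latency budget."""

    async def bounded():
        return await asyncio.wait_for(review(event), timeout=budget_s)

    return asyncio.run(bounded())
```

`run_review({"repo": "acme/api"})` returns the number of comments posted, and raises a timeout error if the pipeline exceeds the budget.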
Sandbox (Cloud Run) [EPHEMERAL]
- Isolated container per review
- Clones repo, runs linters, executes AI scripts
- Double-sandboxed: container + jailkit
- Destroyed after review completes
Context Engine [CORE]
- Builds "case file" for the review
- Code graph (dependencies, call sites)
- Linked issues (Jira/Linear/GitHub)
- Past PRs, team learnings, coding guidelines
- Vector search via LanceDB
Linter Orchestrator [ANALYSIS]
- 40+ static analysis tools (ESLint, Rubocop, etc.)
- Auto-detects language → runs relevant linters
- Results fed to LLM for synthesis
- Runs in sandbox alongside code
LLM Review Engine [AI]
- Model cascade: fast model → complex model
- Summary generation, file-by-file review
- Verification pass (reduce hallucinations)
- Conversational replies in PR threads
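A model cascade needs a routing rule that sends each change to the cheapest tier likely to handle it. The sketch below is one possible heuristic. The tier names follow the trivial/moderate/complex split used in the cost metrics, but the `FileChange` fields and every threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass

DOC_SUFFIXES = (".md", ".txt", ".lock")  # assumed low-risk file types


@dataclass
class FileChange:
    path: str
    lines_changed: int
    touches_public_api: bool = False


def pick_tier(change: FileChange) -> str:
    """Route a file to the cheapest tier that is likely to be sufficient."""
    if change.path.endswith(DOC_SUFFIXES):
        return "trivial"
    # Risky changes go to the strongest model even when they are small.
    if change.touches_public_api or change.lines_changed > 200:
        return "complex"
    if change.lines_changed <= 5:
        return "trivial"
    return "moderate"
```

Checking the public-API flag before the size check means a three-line change to an exported function still reaches the complex tier.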
Platform Adapter [OUTPUT]
- Posts review comments back to PR
- Abstracts GitHub/GitLab/Bitbucket APIs
- Summary comment + inline file comments
- Handles rate limits per platform
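One way to express the adapter abstraction is an abstract base class that the pipeline codes against, with one concrete subclass per platform. The interface below (`PlatformAdapter`, `post_summary`, `post_inline`, `publish_review`) is hypothetical, and `RecordingAdapter` is a test double rather than a real GitHub, GitLab, or Bitbucket client.

```python
from abc import ABC, abstractmethod


class PlatformAdapter(ABC):
    """Interface the review pipeline codes against (hypothetical)."""

    @abstractmethod
    def post_summary(self, pr_id: str, body: str) -> None: ...

    @abstractmethod
    def post_inline(self, pr_id: str, path: str, line: int, body: str) -> None: ...


class RecordingAdapter(PlatformAdapter):
    """Test double that records calls instead of hitting a platform API."""

    def __init__(self) -> None:
        self.calls = []

    def post_summary(self, pr_id, body):
        self.calls.append(("summary", pr_id, body))

    def post_inline(self, pr_id, path, line, body):
        self.calls.append(("inline", pr_id, path, line, body))


def publish_review(adapter: PlatformAdapter, pr_id: str, summary: str, comments: list) -> None:
    """Post the summary first, then each (path, line, body) inline comment."""
    adapter.post_summary(pr_id, summary)
    for path, line, body in comments:
        adapter.post_inline(pr_id, path, line, body)
```

Platform-specific concerns such as rate-limit backoff would live inside each concrete subclass, so `publish_review` stays identical across platforms.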
Context Layers (assembled in parallel)
| Layer | Source | What It Provides | Retrieval Method |
|---|---|---|---|
| Diff | Git | What changed: added/removed/modified lines per file | git diff (in sandbox) |
| Code Graph | AST Parser | Who calls this function? What depends on this module? Downstream impact. | Build lightweight dependency graph on clone. Tree-sitter for multi-language AST. |
| Co-change History | Git log | Files that historically change together with the modified files. | git log --follow analysis |
| Linked Issues | Jira/Linear/GitHub | The "why": what was the developer trying to accomplish? | Parse PR description for issue refs, fetch via API |
| Past PRs | LanceDB | Similar changes, past review feedback on related code. | Semantic search over PR embeddings |
| Team Learnings | LanceDB | Preferences from past feedback ("we use 2-space indent", "always check null here"). | Filtered by repo_id, semantic match to current diff |
| Coding Guidelines | .coderabbit.yaml | Explicit rules: naming conventions, error handling, API patterns. | Direct config load |
| Linter Results | 40+ tools | Static analysis findings: type errors, security issues, style violations. | Run in sandbox, parse output |
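For the Linked Issues layer, parsing the PR description for issue references can be approximated with two regular expressions. This is a sketch under assumptions: tracker keys are taken to have the common `PROJECT-123` shape used by Jira and Linear, GitHub references the `#123` shape, and `extract_issue_refs` is a hypothetical helper name.

```python
import re

# PROJECT-123 style keys. This also matches strings like "UTF-8", so
# results should be confirmed against the tracker API before use.
TRACKER_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")
# "#123" not preceded by a word character or slash (skips URL fragments
# such as "pull/9#12").
GITHUB_REF = re.compile(r"(?<![\w/])#(\d+)\b")


def extract_issue_refs(description: str) -> dict:
    """Return de-duplicated, sorted issue references found in the text."""
    return {
        "tracker_keys": sorted(set(TRACKER_KEY.findall(description))),
        "github_issues": sorted({int(n) for n in GITHUB_REF.findall(description)}),
    }
```

The extracted references are only candidates. Fetching each one through the tracker API filters out false matches.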
Model Cascade Strategy
.coderabbit.yaml (path filters, ignore patterns).
Sandbox Architecture (Cloud Run)
- Layer 1: Cloud Run container. Each review gets its own container instance. Auto-scaled based on queue depth. Torn down after review completes. Minimal IAM permissions (no access to other customer data).
- Layer 2: microVM. Cloud Run's second-generation execution environment runs each instance in its own microVM for VM-level isolation. (gVisor is the sandbox used by the first-generation environment, not a microVM.)
- Layer 3: Jailkit + cgroups. Within the container, linter and script execution runs in a jailed process with restricted filesystem access (only the cloned repo) and CPU/memory limits via cgroups.
- Network: Outbound network is blocked by default. Only allowlisted domains (package registries for linter plugin install, issue tracker APIs) are permitted.
- Lifetime: Container exists for the duration of the review (60-90 seconds). All data destroyed on termination. No persistent storage.
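The default-deny network rule can be illustrated with a small allowlist check. In practice the enforcement would sit at the network layer (firewall or egress proxy) rather than in application code. The hosts listed and the `is_egress_allowed` helper are assumptions for illustration only.

```python
from urllib.parse import urlsplit

# Illustrative allowlist: package registries and an issue tracker API.
ALLOWED_HOSTS = {"registry.npmjs.org", "pypi.org", "api.linear.app"}


def is_egress_allowed(url: str) -> bool:
    """Allow a request only if its exact hostname is on the allowlist."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    return host is not None and host in ALLOWED_HOSTS
```

Matching the exact hostname, rather than a substring, keeps look-alike domains such as `pypi.org.evil.example` blocked.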
Learning Sources
| Source | How It's Captured | How It's Applied |
|---|---|---|
| Chat Feedback | Dev replies to review comment with correction/preference. NLP classifies as "learning." | Stored in LanceDB as an embedding. Retrieved via semantic search when similar code is reviewed. |
| .coderabbit.yaml | Explicit rules: "enforce 2-space indent," "flag any use of eval()." | Injected directly into review prompt. Deterministic, always applied. |
| Coding Agent Rules | Import from Cursor/Copilot rules files (.cursorrules, etc.) | Parsed and included as coding guidelines. |
| Review Outcomes | Track which comments get resolved vs. dismissed by developers. | Down-weight comment patterns that are frequently dismissed. |
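Down-weighting frequently dismissed comment patterns can be sketched as a smoothed acceptance rate per pattern. The add-one prior and the 0.3 cutoff are assumptions, as are the names `pattern_weight` and `should_post`. The smoothing stops a pattern with only one or two observations from being suppressed prematurely.

```python
def pattern_weight(resolved: int, dismissed: int,
                   prior_resolved: int = 1, prior_dismissed: int = 1) -> float:
    """Smoothed share of a pattern's comments that developers acted on."""
    total = resolved + dismissed + prior_resolved + prior_dismissed
    return (resolved + prior_resolved) / total


def should_post(resolved: int, dismissed: int, min_weight: float = 0.3) -> bool:
    """Suppress a comment pattern once its smoothed weight drops too low."""
    return pattern_weight(resolved, dismissed) >= min_weight
```

A pattern with no history starts at weight 0.5 and is posted. One resolved and nine dismissed gives 2/12, so the pattern is suppressed.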
| Data | Store | Access Pattern | Retention |
|---|---|---|---|
| Webhook Events | Kafka | Ordered per-repo. Consumed by review workers. | 7 days (replay window) |
| Review Results | PostgreSQL | Per PR: summary, comments, status. Query by repo+PR. | Permanent (customer data) |
| Learnings | LanceDB | Semantic search by repo_id + embedding similarity. | Permanent (grows over time) |
| Code Graph Cache | Redis | Per-repo dependency graph. Invalidated on new PR. | TTL: 1 hour |
| Repo Metadata | PostgreSQL | Installation config, .coderabbit.yaml, connected integrations. | Permanent |
| Customer Code | EPHEMERAL (sandbox) | Cloned on demand, destroyed after review. | 0 (never persisted) |
| Linter Results | In-memory (sandbox) | Generated and consumed within single review. | 0 (destroyed with sandbox) |
| PR Embeddings | LanceDB | Semantic search for "similar past PRs." | 90 days rolling |
| At Scale | What Breaks | Mitigation |
|---|---|---|
| 10× (5M PRs/day) | Cloud Run concurrent container limit. LanceDB index size per customer. Kafka throughput. | Multi-region Cloud Run pools. Shard LanceDB by org_id. Kafka partition per top-100 active repos. |
| 100× (50M PRs/day) | LLM API rate limits become binding constraint. Code graph building at scale. Cost per review must decrease. | Self-hosted model inference for cost control. Incremental code graph updates (don't rebuild from scratch). Aggressive model cascade: push 60%+ to cheapest tier. |
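A back-of-envelope estimate shows why the concurrent container limit is the first thing to break. By Little's law, average concurrency equals arrival rate times time in system. The sketch assumes each sandbox lives for the full 90-second budget and that the rolling business-hours peak is 3× the daily average. The peak factor is an assumption, not a figure from the requirements.

```python
SECONDS_PER_DAY = 86_400


def concurrent_sandboxes(prs_per_day: int, review_seconds: int = 90,
                         peak_factor: int = 1) -> int:
    """Little's law: concurrency = arrival rate x time in system (rounded up)."""
    numerator = prs_per_day * review_seconds * peak_factor
    return -(-numerator // SECONDS_PER_DAY)  # integer ceiling division


average = concurrent_sandboxes(500_000)                      # 521
peak = concurrent_sandboxes(500_000, peak_factor=3)          # 1563
peak_10x = concurrent_sandboxes(5_000_000, peak_factor=3)    # 15625
```

Under these assumptions, today's load needs roughly 500 concurrent sandboxes on average and about 1,600 at peak. At 10× that becomes about 15,600.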
- SOC 2 Type II & GDPR: Customer code never persisted. LLM queries zero-retention. Audit log of all data access.
- Webhook HMAC: Every webhook validated against shared secret. Invalid signatures rejected at gateway (403).
- Tenant isolation: Each review runs in its own container. No shared filesystem. No cross-customer data leakage possible, even in memory.
- No training on customer code: Contractual and technical guarantee. LLM providers configured with zero data retention agreements.
- Review pipeline: End-to-end latency (webhook → comments posted). Stage breakdown (clone, lint, LLM, post). Error rate by stage.
- Quality metrics: Comment dismissal rate, learning adoption rate, false positive rate (tracked via dev reactions).
- Cost: LLM tokens per review, model cascade distribution (% trivial/moderate/complex), cost per review trending.
- Alerting: Review latency p99 > 120s, webhook queue depth > 10K, sandbox OOM rate > 1%, comment posting error rate > 5%.
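The alert thresholds above can be written as data and evaluated against a metrics snapshot. The metric key names and the `firing_alerts` helper are hypothetical. The limits are the ones listed in the Alerting bullet.

```python
# Limits from the alerting rule; the key names are illustrative.
THRESHOLDS = {
    "review_latency_p99_s": 120,
    "webhook_queue_depth": 10_000,
    "sandbox_oom_rate": 0.01,
    "comment_post_error_rate": 0.05,
}


def firing_alerts(metrics: dict) -> list:
    """Return the names of metrics that exceed their threshold, sorted."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit  # a missing metric counts as healthy
    )
```

For example, a snapshot with a p99 latency of 135 seconds and a queue depth of 2,000 fires only `review_latency_p99_s`.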
| Extension | Why It Matters | Architecture Impact |
|---|---|---|
| Auto-Fix PRs | Don't just comment: create a fix PR with one click | Agent generates code changes in sandbox. Creates branch + commits via GitHub API. Requires higher trust bar: verified fixes only. |
| IDE Integration (deep) | Review before PR is even opened | Local LSP-like service running lightweight version of the review pipeline. No sandbox needed because code is already local. |
| Cross-PR Impact Analysis | Detect architectural drift across many PRs over weeks | Longitudinal analysis over PR embeddings. Weekly "codebase health" reports. Requires persistent code graph (beyond ephemeral). |
| Custom Model Fine-Tuning | Enterprise customers with unique codebases | Fine-tune smaller model on customer's past reviews. Serve from dedicated inference endpoint. Higher accuracy, lower latency. |
| Security-Focused Tier | Deep security analysis (SAST/DAST level) | Longer review budget (5 min instead of 90s). Run dynamic analysis (actually execute tests). Requires beefier sandbox with network simulation. |
How do you handle the cold start problem for the first PR in a new repository?
Without repository context, the first review would be generic: flagging style issues and obvious bugs but missing domain-specific patterns. We mitigate this with: (1) language and framework detection during clone (if it's a Rails app, we apply Ruby/Rails-specific review rules), (2) analysis of existing code patterns (even on the first PR, we index the existing codebase to understand naming conventions, test patterns, and architecture), (3) README and config file analysis to understand the project's stated conventions, and (4) diff-only focus (we review what changed, not the entire codebase, so the review is scoped to the PR's intent). By the 10th PR, the system has learned from developer responses (thumbs up/down on comments, which suggestions were accepted) and the reviews become significantly more relevant. The key metric: false positive rate on the first PR is ~30% (acceptable), dropping to ~10% by the 20th PR.
What happens if the AI review hallucinates a bug that doesn't exist?
False positives are the biggest threat to adoption: if developers dismiss reviews as noise, the tool is worthless. Our mitigation is multi-layered: (1) Static analysis first: linting and type checking run before the AI review. If the linter passes, the AI is less likely to flag syntax issues. (2) Confidence scoring: each comment includes an internal confidence score. Below a threshold, we either suppress the comment or soften the language ("Consider whether..." vs. "Bug: this will crash"). (3) Verification step: after the AI generates review comments, a separate validation pass checks whether the flagged code actually exists in the diff and whether the concern is logically consistent. (4) Feedback loop: developer reactions (resolve/dismiss) feed back into per-repository learned preferences (not model fine-tuning, which is out of scope). A comment pattern that's consistently dismissed gets suppressed. The honest answer: hallucinations still happen, which is why the review is always advisory (no blocking PRs on AI review) and every comment links to the specific code line for easy verification.
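The confidence-scoring and verification steps can be combined into a single triage function, sketched below. The `Comment` fields, the `triage` name, and both thresholds (0.4 to suppress, 0.7 to soften) are assumptions. The check that the flagged line exists in the diff is the simplest form of the verification pass described above and does not cover the logical-consistency check.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Comment:
    path: str
    line: int
    text: str
    confidence: float  # internal score in [0, 1]


def triage(comment: Comment, changed_lines: Dict[str, Set[int]],
           suppress_below: float = 0.4, soften_below: float = 0.7) -> Optional[str]:
    """Return the text to post, or None if the comment should be dropped."""
    # Verification: the flagged location must actually be part of the diff.
    if comment.line not in changed_lines.get(comment.path, set()):
        return None
    if comment.confidence < suppress_below:
        return None
    if comment.confidence < soften_below:
        return f"Consider whether this is an issue: {comment.text}"
    return comment.text
```

A comment that points at a line outside the diff is dropped regardless of its score, which catches the case where the model invents code that was never changed.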