AI Code Review Benchmark

We generate realistic bugs based on real-world patterns and see which LLMs can actually find them. Not linting issues. Regressions that break logic and cross-file contracts.

216 eval traces
5 languages
8 models
75 test cases
Methodology

How It Works

Every model gets the same test suite, the same bugs, the same grading criteria. No cherry-picking, no human scoring, no prompt tricks. Just deterministic evals you can audit yourself.

01. Generate bugs

We create synthetic but realistic regressions based on real-world bug patterns across 5 languages. Each test case has a PR patch, the full codebase, and exact bug locations as ground truth.

02. Run AI code review

Every model gets the same PR diff and has to find the bugs, point to the exact lines, and suggest a fix. We parse everything into structured suggestions.

03. Two judges score it

Claude Sonnet and GPT independently score every response. Did the model find the bugs? Were the suggestions actually correct? Final score is the average of both.

04. Publish everything

All scores, all traces, all judge reasoning. Broken down by language, category, and model. Nothing is hidden.
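The inputs and outputs of the steps above can be pictured as a few plain data shapes. This is only a sketch: the benchmark does not publish its schema here, so every type and field name below (`BugLocation`, `TestCase`, `Suggestion`, `pointsAtBug`) is a hypothetical name chosen for illustration. The actual scoring is done by the LLM judges; the line-overlap helper only shows what "point to the exact lines" could mean mechanically.

```typescript
// Hypothetical data shapes; all names are assumptions, not the benchmark's real schema.

/** A known bug location used as ground truth (inclusive line range). */
interface BugLocation {
  file: string;
  startLine: number;
  endLine: number;
}

/** One test case: a PR patch, the full codebase, and the seeded bug locations. */
interface TestCase {
  id: string;
  category: "local-logic" | "cross-file";
  patch: string; // unified diff of the PR
  codebase: Map<string, string>; // file path -> file contents
  bugs: BugLocation[];
}

/** One structured suggestion parsed from a model's review. */
interface Suggestion {
  file: string;
  startLine: number;
  endLine: number;
  description: string;
  suggestedFix: string;
}

/** True when the suggestion's line range overlaps the bug's range in the same file. */
function pointsAtBug(s: Suggestion, bug: BugLocation): boolean {
  return s.file === bug.file && s.startLine <= bug.endLine && bug.startLine <= s.endLine;
}

// Example values loosely modeled on the event-registry trace shown in the explorer.
const exampleCase: TestCase = {
  id: "example-event-rename",
  category: "cross-file",
  patch: "(diff omitted)",
  codebase: new Map([["src/events/event-registry.ts", "(contents omitted)"]]),
  bugs: [{ file: "src/events/event-registry.ts", startLine: 28, endLine: 28 }],
};

const exampleSuggestion: Suggestion = {
  file: "src/events/event-registry.ts",
  startLine: 28,
  endLine: 28,
  description: "Event name mismatch with the emitting service",
  suggestedFix: "Register both names or update the emitter",
};

const exampleHit = pointsAtBug(exampleSuggestion, exampleCase.bugs[0]);
```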

Test Categories

Local Logic

Bugs that live inside a single file. Wrong conditions, broken state, off-by-one errors. Most models do okay here. It's table stakes for AI code review.

Cross-File Context

You change an interface in one file and it breaks three consumers in other files. Can the model catch that? This is where it gets hard, and where we see the biggest gap between models.

Metrics Explained

Score (Primary)

Overall quality, averaged across both judges. This is what we rank by.

Coverage (Recall)

How many of the known bugs did the model actually find?

Validity (Precision)

Out of everything the model flagged, how much was real? Lower means more noise.

Pass Rate (Binary)

Percentage of tests where the model hit the minimum bar to pass.
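As a rough illustration of how these four metrics relate, the sketch below computes them from per-test counts. The coverage and validity ratios and the 0.5/0.5 weighting follow the arithmetic visible in the judge trace in the explorer section; everything else is an assumption: the type and function names are hypothetical, the handling of zero denominators is a guess, and `minScore` is a placeholder because the page does not state the minimum bar for passing.

```typescript
// Hypothetical per-test counts as one judge might report them.
interface JudgedReview {
  referenceBugs: number; // known bugs in the test case
  bugsFound: number; // reference bugs matched by at least one suggestion
  suggestions: number; // total suggestions the model made
  validSuggestions: number; // suggestions judged to be real issues
}

/** Recall: share of known bugs the model found (0 when there are no bugs; an assumption). */
function coverage(r: JudgedReview): number {
  return r.referenceBugs === 0 ? 0 : r.bugsFound / r.referenceBugs;
}

/** Precision: share of flagged issues that were real (0 when nothing was flagged; an assumption). */
function validity(r: JudgedReview): number {
  return r.suggestions === 0 ? 0 : r.validSuggestions / r.suggestions;
}

/** Per-judge score: equal-weight blend, as in the judge trace's final step. */
function score(r: JudgedReview): number {
  return 0.5 * coverage(r) + 0.5 * validity(r);
}

/** Share of tests whose score reaches minScore (the real threshold is not published here). */
function passRate(reviews: JudgedReview[], minScore: number): number {
  if (reviews.length === 0) return 0;
  const passed = reviews.filter((r) => score(r) >= minScore).length;
  return passed / reviews.length;
}

// A perfect review (4 of 4 found, 4 of 4 valid) and a noisy one (2 of 4 found, 1 of 4 valid).
const perfect: JudgedReview = { referenceBugs: 4, bugsFound: 4, suggestions: 4, validSuggestions: 4 };
const noisy: JudgedReview = { referenceBugs: 4, bugsFound: 2, suggestions: 4, validSuggestions: 1 };

const perfectScore = score(perfect); // 1
const noisyScore = score(noisy); // 0.5 * 0.5 + 0.5 * 0.25 = 0.375
const examplePassRate = passRate([perfect, noisy], 0.5); // 0.5 with the placeholder threshold
```

A model that flags everything can reach high coverage while its validity collapses, which is why the two are reported separately as well as blended.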

Dual Judge System

Claude Sonnet 4.5 (Anthropic)

Checks each response against the ground-truth bugs. Did the model find them? Are the suggestions actually valid? Writes out its full reasoning for every score.

GPT (OpenAI)

Second opinion, same criteria, different provider. Using two judges from two companies reduces the risk of single-provider bias. The final score is the average of both.

final_score = (sonnet_score + gpt_score) / 2. Same for coverage and validity. You can see both judges' full reasoning for every trace in the explorer.
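The averaging rule is small enough to write out. The sketch below applies the same mean to score, coverage, and validity; the `JudgeResult` type and the `averageJudges` function are hypothetical names, and scores are assumed to be fractions between 0 and 1.

```typescript
// Hypothetical shape for one judge's verdict on one trace (values in [0, 1]).
interface JudgeResult {
  score: number;
  coverage: number;
  validity: number;
}

/** Combine the two judges by taking the plain mean of each metric. */
function averageJudges(sonnet: JudgeResult, gpt: JudgeResult): JudgeResult {
  const mean = (a: number, b: number): number => (a + b) / 2;
  return {
    score: mean(sonnet.score, gpt.score),
    coverage: mean(sonnet.coverage, gpt.coverage),
    validity: mean(sonnet.validity, gpt.validity),
  };
}

// Example: the judges agree on coverage but disagree on validity.
const sonnetResult: JudgeResult = { score: 1.0, coverage: 1.0, validity: 1.0 };
const gptResult: JudgeResult = { score: 0.75, coverage: 1.0, validity: 0.5 };
const combined = averageJudges(sonnetResult, gptResult);
// combined is { score: 0.875, coverage: 1, validity: 0.75 }
```

A plain mean keeps the final number easy to audit: any published score can be recomputed from the two judge scores in the trace.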

Languages

TypeScript / Node, Python, React / TSX, Ruby, Java

Every language has both local and cross-file tests. Same test suite for every model, so the comparison is fair.

Powered By

Kodus

Every evaluation is run by Kodus, the AI code review engine behind this benchmark. The same pipeline that reviews production code runs these evals.

Rankings

Global Leaderboard

Full ranking
#   Model              Provider     Score   Coverage  Validity  Cross-File  Latency
01  Claude Sonnet 4.5  Anthropic    87.1%   85.0%     89.2%     90.7%       10ms
02  Gemini 2.5 Pro     Google       86.8%   78.0%     95.7%     87.5%       8ms
03  Kimi K2.5          Moonshot AI  85.6%   78.6%     92.7%     85.0%       8ms
04  Claude Haiku 4.5   Anthropic    85.0%   88.8%     81.2%     89.1%       8ms
05  Gemini 3.1 Pro     Google       84.2%   77.1%     91.3%     80.5%       21ms
Explorer

Regression Traces

All traces

Refactor event names to follow domain-driven naming convention (e.g. 'orderCompleted' -> 'order.finalized')

src/events/event-registry.ts
@@ -27,7 +27,7 @@
new hunk
27    private registerDefaults(): void {
28 +    this.register('order.finalized', this.handleOrderComplete.bind(this));  [bug]
29 +    this.register('payment.captured', this.handlePayment.bind(this));  [bug]
30 +    this.register('user.onboarded', this.handleUserRegistration.bind(this));  [bug]
31 +    this.register('inventory.adjusted', this.handleInventory.bind(this));  [bug]
32    }
old hunk
27    private registerDefaults(): void {
28 -    this.register('orderCompleted', this.handleOrderComplete.bind(this));  [bug]
29 -    this.register('paymentProcessed', this.handlePayment.bind(this));  [bug]
30 -    this.register('userRegistered', this.handleUserRegistration.bind(this));  [bug]
31 -    this.register('inventoryUpdated', this.handleInventory.bind(this));  [bug]
32    }
Claude Haiku 4.5 suggested 4 changes

Event name mismatch breaks CheckoutService.completeOrder. The service emits 'orderCompleted' but the registry now registers 'order.finalized', causing the event to be silently dropped with no handlers executing. The warehouse notification and analytics update will never occur.

src/events/event-registry.ts
Suggested change
-this.register('order.finalized', this.handleOrderComplete.bind(this));
+// Ensure CheckoutService.completeOrder emits 'order.finalized' instead of 'orderCompleted'
+// OR register both event names:
+this.register('order.finalized', this.handleOrderComplete.bind(this));
+this.register('orderCompleted', this.handleOrderComplete.bind(this)); // backward compat

Event name mismatch breaks PaymentService.capturePayment. The service emits 'paymentProcessed' but the registry now registers 'payment.captured', causing the event to be silently dropped with no handlers executing. Revenue recording will never occur.

src/events/event-registry.ts
Suggested change
-this.register('payment.captured', this.handlePayment.bind(this));
+// Ensure PaymentService.capturePayment emits 'payment.captured' instead of 'paymentProcessed'
+// OR register both event names:
+this.register('payment.captured', this.handlePayment.bind(this));
+this.register('paymentProcessed', this.handlePayment.bind(this)); // backward compat

Event name mismatch breaks OnboardingService.registerUser. The service emits 'userRegistered' but the registry now registers 'user.onboarded', causing the event to be silently dropped with no handlers executing. Welcome emails will never be sent.

src/events/event-registry.ts
Suggested change
-this.register('user.onboarded', this.handleUserRegistration.bind(this));
+// Ensure OnboardingService.registerUser emits 'user.onboarded' instead of 'userRegistered'
+// OR register both event names:
+this.register('user.onboarded', this.handleUserRegistration.bind(this));
+this.register('userRegistered', this.handleUserRegistration.bind(this)); // backward compat

Event name mismatch breaks WarehouseService.updateStock. The service emits 'inventoryUpdated' but the registry now registers 'inventory.adjusted', causing the event to be silently dropped with no handlers executing. Reorder threshold checks will never occur.

src/events/event-registry.ts
Suggested change
-this.register('inventory.adjusted', this.handleInventory.bind(this));
+// Ensure WarehouseService.updateStock emits 'inventory.adjusted' instead of 'inventoryUpdated'
+// OR register both event names:
+this.register('inventory.adjusted', this.handleInventory.bind(this));
+this.register('inventoryUpdated', this.handleInventory.bind(this)); // backward compat
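
To see why this class of bug is easy to miss, consider a minimal registry in which emitting an unknown event name is not an error. The sketch below is not the benchmark's `event-registry.ts`, which is only visible through the hunk above; `EventRegistry`, `register`, and `emit` are hypothetical stand-ins, and returning the handler count from `emit` is an addition made here so the drop can be observed.

```typescript
type Handler = (payload: unknown) => void;

// Hypothetical minimal registry; the real class in the test case is not shown.
class EventRegistry {
  private handlers = new Map<string, Handler[]>();

  /** Attach a handler to an event name. */
  register(name: string, handler: Handler): void {
    const list = this.handlers.get(name) ?? [];
    list.push(handler);
    this.handlers.set(name, list);
  }

  /** Invoke every handler for the name; returns how many ran (0 means the event was dropped). */
  emit(name: string, payload: unknown): number {
    const list = this.handlers.get(name) ?? [];
    for (const handler of list) {
      handler(payload);
    }
    return list.length;
  }
}

const registry = new EventRegistry();
let ordersHandled = 0;

// The registry was renamed to the new convention...
registry.register("order.finalized", () => {
  ordersHandled += 1;
});

// ...but a consumer in another file still emits the old name: no exception, no handler.
const ranForOldName = registry.emit("orderCompleted", { orderId: 1 }); // 0
const ranForNewName = registry.emit("order.finalized", { orderId: 1 }); // 1
```

Nothing in the changed file fails to compile or throws, so the regression only shows up when the reviewer looks at the emitting services. One possible mitigation, beyond the backward-compatible registration the model suggested, is to share the event names as a single typed constant so that a rename breaks the build in every consumer.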
Reference Bugs (4)
src/events/event-registry.ts, line 28
src/events/event-registry.ts, line 29
src/events/event-registry.ts, line 30
src/events/event-registry.ts, line 31
Judge Evaluation

Score: 100.0%
Coverage: 100.0%
Validity: 100.0%

[...] silently skipped.
- Verdict: VALID
- Reason: This is a concrete bug where a refactored event name in the registry doesn't match the event name being emitted by the service, causing handlers to never execute. This will result in missing reorder threshold checks, risking stock-outs.

## Step 2: Coverage

Comparing against reference issues:
1. **Reference Issue 1** (orderCompleted → order.finalized): **FOUND** by Suggestion 1
2. **Reference Issue 2** (paymentProcessed → payment.captured): **FOUND** by Suggestion 2
3. **Reference Issue 3** (userRegistered → user.onboarded): **FOUND** by Suggestion 3
4. **Reference Issue 4** (inventoryUpdated → inventory.adjusted): **FOUND** by Suggestion 4

All 4 reference issues were found.
coverage_score = 4/4 = 1.0

## Step 3: Validity

Total suggestions: 4
Valid suggestions: 4 (all suggestions are valid)
validity_score = 4/4 = 1.0

## Step 4: Final Score

coverage_score = 4/4 = 1.0
validity_score = 4/4 = 1.0
final_score = 1.0 * 0.5 + 1.0 * 0.5 = 1.0
12.2s | Parse OK | LineAcc 100% | IoU 100% | Matched 4/4
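
The page does not define how the IoU figure in this summary line is calculated. A common reading, assumed here, is intersection over union between the line range a suggestion points at and the reference bug's line range in the same file. The sketch below implements that reading only; `LineRange` and `lineIoU` are hypothetical names, and the benchmark's actual formula may differ.

```typescript
/** Inclusive range of line numbers within a single file (start <= end is assumed). */
interface LineRange {
  start: number;
  end: number;
}

/** Intersection over union of two inclusive line ranges in the same file. */
function lineIoU(predicted: LineRange, reference: LineRange): number {
  // Number of lines shared by both ranges; negative values mean no overlap.
  const overlap = Math.max(
    0,
    Math.min(predicted.end, reference.end) - Math.max(predicted.start, reference.start) + 1,
  );
  const predictedLength = predicted.end - predicted.start + 1;
  const referenceLength = reference.end - reference.start + 1;
  const union = predictedLength + referenceLength - overlap;
  return overlap / union;
}

const exactMatch = lineIoU({ start: 28, end: 28 }, { start: 28, end: 28 }); // 1
const tooWide = lineIoU({ start: 28, end: 29 }, { start: 28, end: 28 }); // 0.5
const missed = lineIoU({ start: 10, end: 12 }, { start: 28, end: 28 }); // 0
```

Under this reading, a suggestion that flags a wider range than the bug is penalized even though it contains the right line, which rewards models that point at exact lines.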