AI Code Review Benchmark
We generate realistic bugs based on real-world patterns and see which LLMs can actually find them. Not linting issues. Regressions that break logic and cross-file contracts.
How It Works
Every model gets the same test suite, the same bugs, the same grading criteria. No cherry-picking, no human scoring, no prompt tricks. Just deterministic evals you can audit yourself.
Generate bugs
We create synthetic but realistic regressions based on real-world bug patterns across 5 languages. Each test case has a PR patch, the full codebase, and exact bug locations as ground truth.
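Concretely, a test case can be thought of as a small record pairing a patch with its ground truth. The field names below are illustrative, not the benchmark's real schema:

```typescript
// Illustrative shape of one benchmark test case (field names are assumptions).
interface BugLocation {
  file: string;        // path within the codebase
  line: number;        // line where the regression was injected
  description: string; // what is wrong there
}

interface TestCase {
  language: string;           // one of the benchmark's 5 languages
  prPatch: string;            // the PR diff shown to the model
  codebaseRoot: string;       // root of the full project the patch targets
  groundTruth: BugLocation[]; // exact bug locations used for grading
}

// A hypothetical instance, loosely modeled on the traces shown later on this page.
const example: TestCase = {
  language: "typescript",
  prPatch: "--- a/src/events/event-registry.ts\n+++ b/src/events/event-registry.ts\n...",
  codebaseRoot: "./fixtures/event-system",
  groundTruth: [
    { file: "src/events/event-registry.ts", line: 42, description: "event name mismatch" },
  ],
};
```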
Run AI code review
Every model gets the same PR diff and has to find the bugs, point to the exact lines, and suggest a fix. We parse everything into structured suggestions.
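The parsing step can be sketched as mapping a model's raw findings into a fixed suggestion shape. All names here are assumptions, not the benchmark's real format:

```typescript
// Hypothetical structured form of one review suggestion.
interface Suggestion {
  file: string;
  startLine: number;
  endLine: number;
  issue: string; // what the model claims is wrong
  fix: string;   // the model's proposed correction
}

// Normalize raw model findings into Suggestion records.
function parseSuggestions(
  raw: Array<{ file: string; lines: [number, number]; issue: string; fix: string }>
): Suggestion[] {
  return raw.map(r => ({
    file: r.file,
    startLine: r.lines[0],
    endLine: r.lines[1],
    issue: r.issue,
    fix: r.fix,
  }));
}

const parsed = parseSuggestions([
  {
    file: "src/events/event-registry.ts",
    lines: [12, 14],
    issue: "event name mismatch",
    fix: "emit 'order.finalized' to match the registry",
  },
]);
```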
Two judges score it
Claude Sonnet and GPT independently score every response. Did the model find the bugs? Were the suggestions actually correct? Final score is the average of both.
Publish everything
All scores, all traces, all judge reasoning. Broken down by language, category, and model. Nothing is hidden.
Test Categories
Local Logic
Bugs that live inside a single file. Wrong conditions, broken state, off-by-one errors. Most models do okay here. It's table stakes for AI code review.
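A minimal example of the kind of single-file regression meant here, an off-by-one in a loop bound:

```typescript
// Correct version: sums every element.
function sumAll(xs: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length; i++) total += xs[i];
  return total;
}

// Regressed version: the bound was tightened to "xs.length - 1",
// silently dropping the last element. Lint-clean, type-safe, still wrong.
function sumAllBuggy(xs: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length - 1; i++) total += xs[i];
  return total;
}
```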
Cross-File Context
You change an interface in one file and it breaks three consumers in other files. Can the model catch that? This is where it gets hard, and where we see the biggest gap between models.
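A tiny sketch of such a cross-file break (all names invented): a refactor swaps a function's argument order, and a caller in another file keeps the old order. The code still type-checks, but the math is wrong:

```typescript
// pricing.ts: the refactor swapped the parameters...
function applyDiscount(rate: number, price: number): number {
  return price * (1 - rate); // was: applyDiscount(price, rate)
}

// checkout.ts: ...but this caller was not updated in the same PR.
// It passes the price where the rate is expected. Both are numbers,
// so the compiler is happy and only the logic breaks.
const total = applyDiscount(100, 0.1); // roughly -9.9 instead of 90
```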
Metrics Explained
Score
Overall quality, averaged across both judges. This is what we rank by.
Coverage
How many of the known bugs did the model actually find?
Validity
Out of everything the model flagged, how much was real? Lower means more noise.
Pass Rate
Percentage of tests where the model hit the minimum bar to pass.
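Assuming these metrics are simple ratios, which the descriptions above suggest but do not guarantee, they could be computed like this:

```typescript
// Sketch of the three per-model metrics as plain ratios (an assumption,
// not the benchmark's published formulas).

// Coverage: share of known ground-truth bugs the model found.
function coverage(foundBugs: number, totalBugs: number): number {
  return totalBugs === 0 ? 1 : foundBugs / totalBugs;
}

// Validity: share of the model's flagged issues that were real.
// Lower validity means more noise in the review.
function validity(validSuggestions: number, totalSuggestions: number): number {
  return totalSuggestions === 0 ? 0 : validSuggestions / totalSuggestions;
}

// Pass rate: share of tests where the model cleared the minimum bar.
function passRate(passedTests: number, totalTests: number): number {
  return totalTests === 0 ? 0 : passedTests / totalTests;
}
```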
Dual Judge System
The first judge checks each response against the ground-truth bugs. Did the model find them? Are the suggestions actually valid? It writes out its full reasoning for every score.
The second judge gives a second opinion: same criteria, different provider. Two judges from two companies means no single-provider bias. The final score is the average of both.
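The averaging itself is simple; a sketch, with an assumed 0-10 scale (the real rubric and scale are internal to the benchmark):

```typescript
// One judge's verdict for a single response (shape is illustrative).
interface JudgeScore {
  judge: string; // e.g. "claude-sonnet" or "gpt"
  score: number; // assumed 0-10 scale
}

// Final score: the mean of the two independent judges.
function finalScore(a: JudgeScore, b: JudgeScore): number {
  return (a.score + b.score) / 2;
}

const result = finalScore(
  { judge: "claude-sonnet", score: 8 },
  { judge: "gpt", score: 7 }
);
```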
Languages
Every language has both local and cross-file tests. Same test suite for every model, so the comparison is fair.
Powered By

Every evaluation is run by Kodus, the AI code review engine behind this benchmark. The same pipeline that reviews production code runs these evals.
Learn about Kodus
Global Leaderboard
Regression Traces
Refactor event names to follow domain-driven naming convention (e.g. 'orderCompleted' -> 'order.finalized')
Event name mismatch breaks CheckoutService.completeOrder. The service emits 'orderCompleted' but the registry now registers 'order.finalized', causing the event to be silently dropped with no handlers executing. The warehouse notification and analytics update will never occur.
src/events/event-registry.ts
Event name mismatch breaks PaymentService.capturePayment. The service emits 'paymentProcessed' but the registry now registers 'payment.captured', causing the event to be silently dropped with no handlers executing. Revenue recording will never occur.
src/events/event-registry.ts
Event name mismatch breaks OnboardingService.registerUser. The service emits 'userRegistered' but the registry now registers 'user.onboarded', causing the event to be silently dropped with no handlers executing. Welcome emails will never be sent.
src/events/event-registry.ts
Event name mismatch breaks WarehouseService.updateStock. The service emits 'inventoryUpdated' but the registry now registers 'inventory.adjusted', causing the event to be silently dropped with no handlers executing. Reorder threshold checks will never occur.
src/events/event-registry.ts
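The traces above all share one failure shape: handlers are registered under a new event name while emitters still use the old one. A miniature reproduction (not the benchmark's actual code):

```typescript
// Minimal event registry: a string-keyed map of handler lists.
type Handler = () => void;

class EventRegistry {
  private handlers = new Map<string, Handler[]>();

  register(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Runs all handlers for the event; returns how many ran.
  emit(event: string): number {
    const list = this.handlers.get(event) ?? [];
    list.forEach(h => h());
    return list.length;
  }
}

const registry = new EventRegistry();
let warehouseNotified = false;

// After the refactor, the handler registers under the new domain-driven name...
registry.register("order.finalized", () => { warehouseNotified = true; });

// ...but the service still emits the old name, so zero handlers run
// and the event is silently dropped. No error, no log, no notification.
const handlersRun = registry.emit("orderCompleted");
```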