AI Code Review Benchmark
We generate realistic bugs based on real-world patterns and see which LLMs can actually find them. Not linting issues. Regressions that break logic and cross-file contracts.
How It Works
Every model gets the same test suite, the same bugs, the same grading criteria. No cherry-picking, no human scoring, no prompt tricks. Just deterministic evals you can audit yourself.
Generate bugs
We create synthetic but realistic regressions based on real-world bug patterns across 5 languages. Each test case has a PR patch, the full codebase, and exact bug locations as ground truth.
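Concretely, a test case can be thought of as a small record pairing a patch with its ground truth. The field names below are illustrative, not the benchmark's real schema:

```typescript
// Illustrative shape of one benchmark test case (field names are assumptions).
interface BugLocation {
  file: string;        // path within the codebase
  line: number;        // line where the regression was injected
  description: string; // what is wrong there
}

interface TestCase {
  language: string;           // one of the benchmark's 5 languages
  prPatch: string;            // the PR diff shown to the model
  codebaseRoot: string;       // root of the full project the patch targets
  groundTruth: BugLocation[]; // exact bug locations used for grading
}

// A hypothetical instance, loosely modeled on the traces shown later on this page.
const example: TestCase = {
  language: "typescript",
  prPatch: "--- a/src/events/event-registry.ts\n+++ b/src/events/event-registry.ts\n...",
  codebaseRoot: "./fixtures/event-system",
  groundTruth: [
    { file: "src/events/event-registry.ts", line: 42, description: "event name mismatch" },
  ],
};
```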
Run AI code review
Every model gets the same PR diff and has to find the bugs, point to the exact lines, and suggest a fix. We parse everything into structured suggestions.
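The parsing step can be sketched as mapping a model's raw findings into a fixed suggestion shape. All names here are assumptions, not the benchmark's real format:

```typescript
// Hypothetical structured form of one review suggestion.
interface Suggestion {
  file: string;
  startLine: number;
  endLine: number;
  issue: string; // what the model claims is wrong
  fix: string;   // the model's proposed correction
}

// Normalize raw model findings into Suggestion records.
function parseSuggestions(
  raw: Array<{ file: string; lines: [number, number]; issue: string; fix: string }>
): Suggestion[] {
  return raw.map(r => ({
    file: r.file,
    startLine: r.lines[0],
    endLine: r.lines[1],
    issue: r.issue,
    fix: r.fix,
  }));
}

const parsed = parseSuggestions([
  {
    file: "src/events/event-registry.ts",
    lines: [12, 14],
    issue: "event name mismatch",
    fix: "emit 'order.finalized' to match the registry",
  },
]);
```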
Two judges score it
Claude Sonnet and GPT independently score every response. Did the model find the bugs? Were the suggestions actually correct? Final score is the average of both.
Publish everything
All scores, all traces, all judge reasoning. Broken down by language, category, and model. Nothing is hidden.
Test Categories
Local Logic
Bugs that live inside a single file. Wrong conditions, broken state, off-by-one errors. Most models do okay here. It's table stakes for AI code review.
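A minimal example of the kind of single-file regression meant here, an off-by-one in a loop bound:

```typescript
// Correct version: sums every element.
function sumAll(xs: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length; i++) total += xs[i];
  return total;
}

// Regressed version: the bound was tightened to "xs.length - 1",
// silently dropping the last element. Lint-clean, type-safe, still wrong.
function sumAllBuggy(xs: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length - 1; i++) total += xs[i];
  return total;
}
```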
Cross-File Context
You change an interface in one file and it breaks three consumers in other files. Can the model catch that? This is where it gets hard, and where we see the biggest gap between models.
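A tiny sketch of such a cross-file break (all names invented): a refactor swaps a function's argument order, and a caller in another file keeps the old order. The code still type-checks, but the math is wrong:

```typescript
// pricing.ts: the refactor swapped the parameters...
function applyDiscount(rate: number, price: number): number {
  return price * (1 - rate); // was: applyDiscount(price, rate)
}

// checkout.ts: ...but this caller was not updated in the same PR.
// It passes the price where the rate is expected. Both are numbers,
// so the compiler is happy and only the logic breaks.
const total = applyDiscount(100, 0.1); // roughly -9.9 instead of 90
```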
Metrics Explained
Score
Overall quality, averaged across both judges. This is what we rank by.
Coverage
How many of the known bugs did the model actually find?
Validity
Out of everything the model flagged, how much was real? Lower means more noise.
Pass Rate
Percentage of tests where the model hit the minimum bar to pass.
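Assuming these metrics are simple ratios, which the descriptions above suggest but do not guarantee, they could be computed like this:

```typescript
// Sketch of the three per-model metrics as plain ratios (an assumption,
// not the benchmark's published formulas).

// Coverage: share of known ground-truth bugs the model found.
function coverage(foundBugs: number, totalBugs: number): number {
  return totalBugs === 0 ? 1 : foundBugs / totalBugs;
}

// Validity: share of the model's flagged issues that were real.
// Lower validity means more noise in the review.
function validity(validSuggestions: number, totalSuggestions: number): number {
  return totalSuggestions === 0 ? 0 : validSuggestions / totalSuggestions;
}

// Pass rate: share of tests where the model cleared the minimum bar.
function passRate(passedTests: number, totalTests: number): number {
  return totalTests === 0 ? 0 : passedTests / totalTests;
}
```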
Dual Judge System
The first judge checks each response against the ground-truth bugs. Did the model find them? Are the suggestions actually valid? It writes out its full reasoning for every score.
The second judge gives a second opinion: same criteria, different provider. Two judges from two companies means no single-provider bias. The final score is the average of both.
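The averaging itself is simple; a sketch, with an assumed 0-10 scale (the real rubric and scale are internal to the benchmark):

```typescript
// One judge's verdict for a single response (shape is illustrative).
interface JudgeScore {
  judge: string; // e.g. "claude-sonnet" or "gpt"
  score: number; // assumed 0-10 scale
}

// Final score: the mean of the two independent judges.
function finalScore(a: JudgeScore, b: JudgeScore): number {
  return (a.score + b.score) / 2;
}

const result = finalScore(
  { judge: "claude-sonnet", score: 8 },
  { judge: "gpt", score: 7 }
);
```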
Languages
Every language has both local and cross-file tests. Same test suite for every model, so the comparison is fair.
Powered By

Every evaluation is run by Kodus, the AI code review engine behind this benchmark. The same pipeline that reviews production code runs these evals.
Learn about Kodus
Global Leaderboard
Regression Traces
Refactor event names to follow domain-driven naming convention (e.g. 'orderCompleted' -> 'order.finalized')
Event name mismatch breaks CheckoutService.completeOrder. The service emits 'orderCompleted' but the registry now registers 'order.finalized', causing the event to be silently dropped with no handlers executing. The warehouse notification and analytics update will never occur.
src/events/event-registry.ts
Event name mismatch breaks PaymentService.capturePayment. The service emits 'paymentProcessed' but the registry now registers 'payment.captured', causing the event to be silently dropped with no handlers executing. Revenue recording will never occur.
src/events/event-registry.ts
Event name mismatch breaks OnboardingService.registerUser. The service emits 'userRegistered' but the registry now registers 'user.onboarded', causing the event to be silently dropped with no handlers executing. Welcome emails will never be sent.
src/events/event-registry.ts
Event name mismatch breaks WarehouseService.updateStock. The service emits 'inventoryUpdated' but the registry now registers 'inventory.adjusted', causing the event to be silently dropped with no handlers executing. Reorder threshold checks will never occur.
src/events/event-registry.ts
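The traces above all share one failure shape: handlers are registered under a new event name while emitters still use the old one. A miniature reproduction (not the benchmark's actual code):

```typescript
// Minimal event registry: a string-keyed map of handler lists.
type Handler = () => void;

class EventRegistry {
  private handlers = new Map<string, Handler[]>();

  register(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Runs all handlers for the event; returns how many ran.
  emit(event: string): number {
    const list = this.handlers.get(event) ?? [];
    list.forEach(h => h());
    return list.length;
  }
}

const registry = new EventRegistry();
let warehouseNotified = false;

// After the refactor, the handler registers under the new domain-driven name...
registry.register("order.finalized", () => { warehouseNotified = true; });

// ...but the service still emits the old name, so zero handlers run
// and the event is silently dropped. No error, no log, no notification.
const handlersRun = registry.emit("orderCompleted");
```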