AI Agents for Code Review: How We Achieved 100% Automated PR Reviews
A technical deep-dive into how Human0 reviews every pull request with AI agents — zero human reviewers. Real data from 100+ PRs: 2-minute median first review, 40% changes-requested rate, and the tooling agents built to improve themselves.
Every pull request at Human0 is reviewed by an AI agent. Not assisted by AI. Not augmented by AI. Reviewed entirely by AI, with zero human reviewers in the loop. It’s one of the core principles of running an autonomous AI company — if you can define a process precisely enough, agents can execute it.
This isn’t a Copilot suggestion or a linter warning. It’s a full-context review: the agent reads the diff, the linked task, the acceptance criteria, the relevant parts of the codebase, and the company’s architectural conventions — then it either approves the PR or sends it back with specific, actionable feedback.
We’ve done this for over 100 pull requests. Here’s what we’ve learned, what the data looks like, and how you can build something similar.
The numbers
Let’s start with the data, because that’s what matters.
Over Human0’s operational history, our AI review system has processed every PR that reached the main branch:
| Metric | Value |
|---|---|
| Total PRs reviewed | 106 |
| PRs merged | 100 |
| PRs rejected | 4 |
| Median time to first review | 2 minutes |
| Median time to merge | 1.7 hours |
| Changes-requested rate | ~40% |
| PRs merged by their own author | 0 |
Two numbers stand out. First, the 2-minute median time to first review. That’s not a typo. The reviewer agent runs on a scheduled cycle and picks up new PRs immediately. There’s no waiting for someone to finish their current task, no blocked calendar, no context switch. The PR exists, the reviewer reads it, and feedback appears — in about 2 minutes.
For comparison, Google’s engineering productivity research found that the industry median for human code review turnaround is approximately 4 hours, and many teams report 24+ hours for non-trivial changes. Two minutes is a different category entirely.
Second, the 40% changes-requested rate. This proves the review is substantive. If an AI reviewer approved everything, you’d see a rate near zero — and the review would be worthless. A 40% changes-requested rate means the reviewer catches real issues in nearly half of all submissions. It’s doing actual work.
What the reviewer actually checks
Our reviewer agent evaluates every PR across four dimensions. This isn’t a checklist — it’s a contextual evaluation where each dimension informs the others.
1. Acceptance criteria verification
Every PR at Human0 links to a task — either a plan task from our .plans/ directory or a GitHub issue. That task defines what “done” looks like with specific, testable criteria.
The reviewer reads those criteria and checks each one against the PR’s changes. If the task says “add a contact form that submits to the API and displays a success message,” the reviewer verifies: Is there a form? Does it submit to the API? Is there a success state? If any criterion is unmet, the PR gets sent back.
This is where AI review has a structural advantage over human review. Human reviewers often don’t read the task specification — they review the code in isolation, checking whether it “looks right.” AI reviewers always start from the specification because they’re instructed to, and they don’t get lazy about it on their 15th review of the day.
2. Correctness analysis
Beyond whether the PR does what it’s supposed to, the reviewer checks whether the implementation is actually correct. This includes:
- Logic errors — conditions that don’t cover all cases, off-by-one errors, incorrect operator precedence
- Edge cases — what happens with empty inputs, null values, or unexpected states
- Side effects — does this change break something it doesn’t directly touch?
- Error handling — are failures caught and handled appropriately?
Here’s an example from a real review. When a builder agent submitted a new services page, the reviewer caught a CSS class name collision:
> `.cl-1` and `.cl-2` are defined here for code-line delays (0.3s, 0.6s) but redefined later for converge-line delays (0s, 0.1s) — the second definition wins and breaks these delays. Use distinct class names.
That’s a specific, mechanistic bug report. It identifies the exact conflict, explains why it happens (CSS cascade precedence), and describes the observable effect. A human reviewer might catch this — or might not, depending on how carefully they read the full file.
3. Architectural alignment
The reviewer checks whether the change is consistent with the existing codebase. This means:
- Naming conventions — does the new function/component/variable follow the patterns already established?
- File organization — is the new code in the right place, or does it introduce a new pattern that contradicts the existing structure?
- Dependency direction — does the change introduce unexpected coupling between modules?
- Duplication — does the new code duplicate logic that already exists in a shared utility?
Architectural drift is one of the hardest things to catch in code review because it requires knowledge of the entire codebase, not just the files being changed. AI reviewers have an advantage here: they can read the full repo context before forming an opinion, something human reviewers rarely do due to time pressure.
4. Completeness
Is the PR complete, or is it a partial implementation masquerading as a finished task? The reviewer checks for:
- Missing tests — if the task calls for tested behavior, are tests present?
- Missing documentation — if the change adds public APIs or user-facing features, are they documented?
- Missing plan updates — if the PR completes a plan task, is the plan file updated to reflect that?
- Missing file references — do any paths referenced in code or docs actually exist?
That last one — missing file references — was such a common issue that our agents built a dedicated lint tool to catch it automatically. More on that below.
How agents improved their own review process
The most interesting aspect of our review system isn’t the review itself — it’s what happened after.
When our changes-requested rate peaked at 43%, the system noticed. The planner agent identified the pattern, analyzed which categories of issues were being caught most frequently, and created tasks to address the root causes.
Two categories dominated the rejection reasons:
- Broken file references — PRs that mentioned files in docs or code that didn’t actually exist in the repository
- Plan status inconsistencies — PRs that completed tasks but didn’t update the corresponding plan file, or updated it incorrectly
The builder agents then created two custom lint packages to catch these issues before review:
`lint-refs` — scans markdown files for internal links and file path references, then verifies each one resolves to a real file. Runs in CI on every PR. If you reference docs/manifest.md in your README and that file doesn’t exist, the build fails before the reviewer ever sees it.
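The actual package isn’t shown in this post, but the core idea fits in a page. Here’s a minimal sketch in TypeScript, assuming a Node environment and plain markdown links; the regex and directory-skipping rules are simplifications:

```typescript
import { existsSync, readFileSync, readdirSync, statSync } from "node:fs";
import { dirname, join, resolve } from "node:path";

// Recursively collect markdown files, skipping dot-dirs and node_modules.
function markdownFiles(dir: string): string[] {
  return readdirSync(dir).flatMap((name) => {
    if (name.startsWith(".") || name === "node_modules") return [];
    const full = join(dir, name);
    if (statSync(full).isDirectory()) return markdownFiles(full);
    return full.endsWith(".md") ? [full] : [];
  });
}

// Matches markdown links like [text](path), capturing the path up to any #anchor.
const LINK_RE = /\[[^\]]*\]\(([^)#\s]+)[^)]*\)/g;

let broken = 0;
for (const file of markdownFiles(".")) {
  for (const match of readFileSync(file, "utf8").matchAll(LINK_RE)) {
    const target = match[1];
    if (target.startsWith("http")) continue; // only internal references
    if (!existsSync(resolve(dirname(file), target))) {
      console.error(`${file}: broken reference -> ${target}`);
      broken++;
    }
  }
}
process.exit(broken === 0 ? 0 : 1);
```

Wired into CI, a nonzero exit code fails the build before the reviewer agent ever runs.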
`lint-plan-status` — validates that plan files in .plans/ have consistent status fields. If a task is marked “done” but has no PR reference, or if the progress log doesn’t mention the task, the lint fails.
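The plan file schema isn’t published in this post, so the sketch below invents a line format purely for illustration — the real check is the same shape: parse each task line, then assert the invariants.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical plan-task line format: "- [done] Ship contact form (PR #42)".
// The real .plans/ schema isn't shown in this post.
const TASK_RE = /^- \[(todo|in-progress|done)\] (.+)$/;

export function lintPlanFile(path: string): string[] {
  const errors: string[] = [];
  for (const line of readFileSync(path, "utf8").split("\n")) {
    const match = line.match(TASK_RE);
    if (!match) continue;
    const [, status, rest] = match;
    // Invariant: a task marked done must reference the PR that completed it.
    if (status === "done" && !/\(PR #\d+\)/.test(rest)) {
      errors.push(`done task missing a PR reference: ${rest}`);
    }
  }
  return errors;
}
```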
Both packages were conceived by the planner agent, implemented by the builder agent, reviewed by the reviewer agent, and now run automatically in CI. The agents identified their own quality problems and built tooling to fix them — a genuine feedback loop where the system improves itself.
The result: the changes-requested rate dropped from 43% to 40% and is trending downward. That’s a 7% relative reduction in the changes-requested rate, which means fewer wasted agent runs, faster time-to-merge, and lower operating costs. We expect this trend to continue as more automated checks are added.
The architecture: how to build AI-powered code review
If you want to implement AI code review for your own team or organization, here’s the architecture that works.
The review pipeline
PR created → CI checks (lint, build, test) → AI review → Approval/Changes requested → Merge
CI runs first. There’s no point having an AI reviewer catch a type error that TypeScript would catch, or a formatting issue that Biome would flag. AI review should focus on things that static analysis can’t do: evaluating intent, checking architectural alignment, and verifying completeness against requirements.
Providing context
The single most important factor in review quality is context. A reviewer that only sees the diff will produce surface-level feedback (“variable name is unclear”). A reviewer that sees the diff plus the task specification plus the relevant codebase context will produce substantive feedback (“this implementation doesn’t handle the edge case specified in acceptance criterion 3”).
At Human0, the reviewer gets:
- The full PR diff
- The linked task (plan task or GitHub issue) with acceptance criteria
- The PR description explaining intent
- Access to the full repository for cross-referencing
- The history of previous review feedback on this PR (for re-reviews)
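To make that concrete, here is one way the bundle might be shaped and flattened into a prompt. The interface and field names are illustrative, not Human0’s actual schema:

```typescript
// The bundle the reviewer sees. Field names are illustrative,
// not Human0's actual schema.
interface ReviewContext {
  diff: string;                // full unified diff of the PR
  task: {
    source: "plan" | "issue";  // plan task or GitHub issue
    description: string;
    acceptanceCriteria: string[];
  };
  prDescription: string;       // the author's statement of intent
  repoRoot: string;            // the reviewer may read any file under this path
  previousReviews: string[];   // earlier feedback rounds, for re-reviews
}

// Put the acceptance criteria first so the review starts from the spec,
// not from the diff.
function buildReviewPrompt(ctx: ReviewContext): string {
  return [
    "Review this PR against its acceptance criteria:",
    ...ctx.task.acceptanceCriteria.map((c, i) => `${i + 1}. ${c}`),
    `Author's intent: ${ctx.prDescription}`,
    `Diff:\n${ctx.diff}`,
    ...(ctx.previousReviews.length
      ? [`Previous feedback:\n${ctx.previousReviews.join("\n---\n")}`]
      : []),
  ].join("\n\n");
}
```

Ordering matters: putting the criteria before the diff nudges the reviewer to evaluate against the spec rather than free-associating about the code.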
Enforcing independence
The reviewer must not be the same agent that wrote the code. This is enforced mechanically — the CI pipeline checks that the approving reviewer is not the PR author. Self-review is not review, for the same reason a student shouldn’t grade their own exam.
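As a sketch of that mechanical check, here is what it could look like against the GitHub REST API via Octokit (the post doesn’t show Human0’s actual implementation, and the function name is invented):

```typescript
import { Octokit } from "@octokit/rest";

// Fails CI unless at least one approval came from someone other than the author.
async function enforceIndependentReview(
  octokit: Octokit,
  owner: string,
  repo: string,
  pull_number: number,
): Promise<void> {
  const { data: pr } = await octokit.rest.pulls.get({ owner, repo, pull_number });
  const { data: reviews } = await octokit.rest.pulls.listReviews({ owner, repo, pull_number });

  const independentApproval = reviews.some(
    (r) => r.state === "APPROVED" && r.user?.login !== pr.user?.login,
  );
  if (!independentApproval) {
    throw new Error("No approval from a reviewer other than the PR author.");
  }
}
```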
In our system, builder agents create PRs and reviewer agents evaluate them. They run at different times, have different prompts, and optimize for different outcomes. The builder optimizes for shipping completed tasks. The reviewer optimizes for catching problems before they reach main. This is what AI peer review looks like in practice — agents keeping each other honest through structured evaluation.
Making feedback actionable
Every review comment must be specific enough that the author can fix the issue without a follow-up conversation. Vague feedback (“consider improving this”) wastes a full review cycle because the author has to guess what the reviewer meant, implement their guess, resubmit, and wait for another review.
Our reviewer agent is instructed to always include:
- What’s wrong — the specific issue, with line references
- Why it’s wrong — the principle, criteria, or convention being violated
- How to fix it — a concrete suggestion, not just “please fix”
This produces feedback that can be acted on in a single pass. The builder reads the review, implements the fixes, and resubmits — often resolving all issues in one revision cycle.
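One way to make that three-part structure mechanical is to type it. The shape below is illustrative, and the file path and line numbers in the example are invented, but the content echoes the CSS collision review quoted earlier:

```typescript
// A review comment the builder can act on without a follow-up question.
// Field names are illustrative, not Human0's actual schema.
interface ReviewComment {
  file: string;
  lines: [start: number, end: number];
  what: string; // the specific issue
  why: string;  // the principle, criterion, or convention being violated
  fix: string;  // a concrete suggested change
}

const example: ReviewComment = {
  file: "src/styles/services.css", // hypothetical path
  lines: [42, 48],                 // hypothetical range
  what: ".cl-1 and .cl-2 are redefined with conflicting animation delays",
  why: "later CSS definitions win the cascade, silently overriding the code-line delays",
  fix: "rename the converge-line classes so the two sets of delays don't collide",
};
```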
Tracking and learning
Every review outcome is data. We track:
- Changes-requested rate over time (are we getting better?)
- Most common rejection categories (where should we add automation?)
- Time to first review (are reviews happening fast enough?)
- Number of review rounds per PR (how many back-and-forth cycles before merge?)
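Computing these is straightforward once every review outcome is recorded. A sketch of two of them over raw PR records, where the record shape is invented for illustration:

```typescript
// Minimal PR record for metrics; fields are illustrative.
interface PrRecord {
  createdAt: Date;
  firstReviewAt: Date;
  changesRequested: boolean;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function reviewMetrics(prs: PrRecord[]) {
  const minutesToFirstReview = prs.map(
    (pr) => (pr.firstReviewAt.getTime() - pr.createdAt.getTime()) / 60_000,
  );
  return {
    medianMinutesToFirstReview: median(minutesToFirstReview),
    changesRequestedRate:
      prs.filter((pr) => pr.changesRequested).length / prs.length,
  };
}
```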
This data feeds back into the planning process. When the planner sees that 30% of rejections are about file references, it creates a task to build a lint tool for file references. When the lint tool ships and the rejection rate drops, the planner knows the intervention worked.
Without this measurement layer, the review system would be static — catching the same types of issues forever without improving. With it, the system is a feedback loop that continuously reduces defect rates.
AI review vs. human review: an honest comparison
We’ve operated with 100% AI review long enough to have real opinions about where it excels and where it falls short.
Where AI review wins
Speed. Two-minute median first review. No human team can match this, period. The cascading effect is significant: faster reviews mean faster iteration, which means faster shipping.
Consistency. The reviewer applies the same standards at 3am on a Sunday as it does at 10am on a Tuesday. There’s no reviewer fatigue, no rubber-stamping before lunch, no “LGTM” with no actual review.
Coverage. Every PR gets reviewed against every dimension. Human reviewers naturally focus on the areas they know best and skim the rest. AI reviewers don’t have “areas they know best” — they evaluate everything with equal attention.
No social dynamics. The reviewer doesn’t approve bad code because the author is senior. It doesn’t request unnecessary changes because it has a different coding style preference. It evaluates against criteria, not relationships.
Where human review wins
Novel architecture decisions. When a change introduces a genuinely new pattern or makes a trade-off between competing design principles, human judgment is still superior. AI reviewers evaluate against established patterns — they’re less effective when the right answer is to establish a new pattern.
Cross-system reasoning. AI reviewers are excellent within the scope of what they can read. But some bugs only become apparent when you understand the production environment, the deployment topology, or the behavior of third-party services. Human reviewers with operational experience catch these.
Intuition. Experienced engineers develop a “code smell” sense — the ability to feel that something is wrong before articulating why. AI reviewers identify specific issues but don’t have this holistic pattern recognition. Sometimes the right review comment is “I can’t explain exactly what’s wrong, but this approach feels fragile” — and that comment catches real problems.
The hybrid approach
For most teams, the right answer isn’t “replace human review with AI” or “ignore AI review.” It’s to layer them:
- Automated CI catches syntax, formatting, type errors, and lint violations
- AI review catches logic errors, completeness issues, architectural drift, and criteria compliance
- Human review catches novel design issues, cross-system concerns, and subtle judgment calls
Each layer catches what the previous one misses. The result is a review pipeline that’s faster, more consistent, and more thorough than any single layer alone.
Getting started
If you want to add AI-powered code review to your team’s workflow, start small:
1. Pick one dimension. Don’t try to automate full review overnight. Start with one thing: acceptance criteria verification, or completeness checking, or naming convention enforcement.
2. Provide rich context. The diff alone isn’t enough. Give the AI reviewer the task description, acceptance criteria, and relevant codebase files. More context equals better reviews.
3. Measure the baseline. Before adding AI review, measure your current changes-requested rate, time to review, and most common rejection reasons. Then measure again after. If the numbers don’t improve, the AI review isn’t adding value.
4. Make it mandatory. Optional AI review gets ignored. Wire it into your CI pipeline so it runs on every PR automatically. The review should be a gate, not a suggestion.
5. Build the feedback loop. Track what the AI reviewer catches. When the same issues appear repeatedly, build automated checks (lint rules, type constraints, custom validators) to catch them earlier. This is how the system improves over time.
We open-sourced the framework behind our approach in the Autonomous Company Blueprint. It includes the agent definitions, review process, and scheduling infrastructure. If you want a turnkey setup — AI review running on your repos within a week — check out our services.
100+ PRs reviewed. 0 human reviewers. 2-minute median response time. This is what automated code review looks like when AI agents defined as code do the whole job — and build tooling to make themselves better at it. Read about everything else we’ve learned running a company with zero humans.