
Your AI Agent Says It's Done. Make It Prove It.

Why every significant AI-built feature should ship with a verification artifact: an HTML page of screenshots, sample outputs, and SQL checks that lets a human verify the work in five minutes instead of fifty

The Two Most Dangerous Words in AI-Augmented Engineering

"All set."

Your coding agent ran the tests. The diff looks reasonable. The PR description is well-organized. The TypeScript compiles. It says it's done.

How do you actually know?

The honest answer for most teams: you don't. You read the diff and trust your gut. You run the dev server, click through the happy path, and call it shipped. Maybe you spot-check one or two edge cases. Then a week later a user hits a flow you never tried, and something breaks even though the code itself is fine. The bug lives in the behavior of the running system under a real-world condition no one tested. The agent built exactly what you described, and exactly what you described turned out to behave differently in a real browser than it did in any diff review.

The verification gap: Coding agents reliably produce code that looks correct, compiles, and passes tests. They are far less reliable at confirming that the resulting system behavior is what you actually wanted. The space between "code is right" and "feature works" is where AI-augmented teams lose the weeks they thought they'd saved.

The pattern I want to describe takes about twenty extra minutes per significant feature. It doesn't require new tooling. It catches the kind of bugs that would otherwise show up in production, and as a bonus, it produces the single best sprint artifact I've ever shipped.

I call it the verification artifact. It's an HTML page. That's all.


What Your Existing QA Doesn't Tell You

Here's why the standard verification stack underperforms when an AI built the feature.

Reading the diff. Diffs are good for catching style issues, obvious logic errors, and things you've seen before. They are nearly useless for catching "this whole approach has a subtle behavioral problem you won't see until you run it." LLMs write plausible-looking code by definition. A clean diff is the expected output of any modern coding agent, so it carries almost no information about whether the feature works.

Running tests. Tests verify code correctness, not feature correctness. The test passes because the agent wrote the test against the same implementation. That's a check on internal consistency, dressed up to look like independent verification.

Type-checking. Same problem. The compiler verifies the code is internally consistent. It doesn't tell you the right thing is being built.

Eyeballing the UI. This actually does work. The problem is nobody does it thoroughly. You click through the happy path, see things rendering, and move on. You don't clear your cookies and click the magic link from a clean browser context, and that's the path that breaks.

The thing that catches behavioral bugs in AI-built features is exactly the thing that's most expensive to do well: a human, looking at the actual running system, with fresh eyes, exercising the real flows. The verification artifact is a way to make that cheap.


What's in the Artifact

A verification page is a single self-contained HTML file that lives next to the implementation plan in the project. It contains, for the feature you just built:

  • Screenshots. Of every meaningful UI state. Empty state, post-action state, error state, modal state, the bit you almost forgot.
  • Sample API outputs. Real request/response pairs from a running dev server, not types or schemas.
  • SQL or data verification. The actual query you ran to confirm the database ended up in the state you expected, with the result printed below it.
  • The checklist. Every assertion the feature is supposed to satisfy, ticked off, with the evidence pointing at the screenshot or sample that proves it.
  • Known follow-ups. What you saw, decided not to fix in this PR, and want to remember.

Here's the structure from a recent feature I shipped, a tool for spinning up throwaway test users to QA the onboarding flow:

Verification page sections:

  • What landed: bulleted list of files, migrations, endpoints
  • Sprint 1, Schema: the migration SQL, the verification query, the result inline
  • Sprint 2, API: five endpoints, with live POST/GET responses captured from curl
  • Sprint 3, UI: eight numbered screenshots showing each state in the flow
  • Middleware gate: screenshot proving the 404 fires when the employee flag is false
  • End-to-end checklist: ten ticked checkboxes, each pointing at the evidence
  • Magic link end-to-end (post-fix): bug 1, bug 2, the server logs proving the bug, the fix snippets
  • Known follow-ups: two items intentionally not fixed in this PR

Total time to produce: about twenty minutes after the work was already done. Total time to audit as a reviewer: roughly five minutes of scrolling. Compare that to the cost of pulling the branch, restoring my dev environment, walking through the feature live, and trying to remember which states to test.
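
To make that concrete, here's a sketch of a small Node script that could assemble such a page. Every name in it is illustrative (the verification/ layout, the sample filenames, the section headings); it's the shape of the thing, not a prescribed tool:

```typescript
// assemble-verification.ts: a sketch of stitching the artifact together.
// Assumes (illustratively) a verification/ folder holding screenshots/
// plus text files of captured curl output and SQL results.
import { readFileSync, readdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

const dir = "verification";

const esc = (s: string) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// Numbered screenshots (01-empty-state.png, 02-created.png, ...)
// sort into the order the flow was exercised.
const figures = readdirSync(join(dir, "screenshots")).sort().map((f) =>
  `<figure><img src="screenshots/${f}" width="900">
  <figcaption>${f}</figcaption></figure>`).join("\n");

// Captured outputs go in verbatim: samples, not schemas.
const apiSamples = esc(readFileSync(join(dir, "api-samples.txt"), "utf8"));
const sqlChecks = esc(readFileSync(join(dir, "sql-checks.txt"), "utf8"));

// One self-contained file: inline CSS, relative image paths,
// openable straight from the filesystem with no build step.
writeFileSync(join(dir, "index.html"), `<!doctype html>
<html><head><meta charset="utf-8"><title>Verification</title>
<style>
  body { font-family: sans-serif; max-width: 960px; margin: 2rem auto; }
  pre { background: #f4f4f4; padding: 1rem; overflow-x: auto; }
</style></head><body>
<h1>Verification: test users feature</h1>
<h2>API: live responses</h2><pre>${apiSamples}</pre>
<h2>Schema: queries and results</h2><pre>${sqlChecks}</pre>
<h2>UI states</h2>${figures}
<h2>End-to-end checklist</h2><!-- each tick links to evidence above -->
<h2>Known follow-ups</h2><!-- deliberately unfixed, with reasons -->
</body></html>`);
```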


The Bug Story That Sold Me on This

The test-users feature looked done. The agent had built it, the API responses were correct, the page rendered, single-delete and bulk-delete both worked. I had the verification page open with seven of the eight planned sections filled in. I hadn't yet done the "click the magic link in a real browser" step because, well, the API was clearly returning a magic link.

I made the agent do it. Open the link in a clean browser context. Watch what happens.

It got stuck on a "Confirming authentication..." spinner forever.

Then I made it dig in. Two bugs were hiding behind the green checkmarks:

Bug 1. Our /auth/confirm page only handled the PKCE flow (?code= query parameter). Magic links from auth.admin.generateLink({ type: 'magiclink' }) return their tokens in the URL hash, not a query string. The page silently ignored them and waited forever for a session that would never arrive.

Bug 2. Even after I fixed that, the link sent users to the marketing landing page with auth tokens orphaned in the URL hash. Why? Supabase's redirect URL allowlist matches by strict equality unless an entry contains a *. We were passing …/auth/confirm?next=/home. The allowlist had …/auth/confirm. No match, so Supabase silently fell back to the bare Site URL.
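
For concreteness, here's roughly the shape of both fixes, assuming supabase-js v2 on a client-rendered confirm page. Treat it as a sketch: the URL constants are placeholders and the structure is simplified from what actually shipped.

```typescript
// auth-confirm.ts: roughly the shape of the Bug 1 fix (a sketch).
// Assumes supabase-js v2 running client-side on /auth/confirm.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  "https://YOUR-PROJECT.supabase.co", // placeholder
  "YOUR-ANON-KEY"                     // placeholder
);

export async function confirmAuth(): Promise<void> {
  // PKCE flow: the token arrives as a ?code= query parameter.
  const code = new URLSearchParams(window.location.search).get("code");
  if (code) {
    await supabase.auth.exchangeCodeForSession(code);
    return;
  }

  // Magic links from auth.admin.generateLink({ type: 'magiclink' })
  // deliver tokens in the URL hash, the case the old page ignored.
  const params = new URLSearchParams(window.location.hash.slice(1));
  const access_token = params.get("access_token");
  const refresh_token = params.get("refresh_token");
  if (access_token && refresh_token) {
    await supabase.auth.setSession({ access_token, refresh_token });
    return;
  }

  throw new Error("No auth tokens in query string or hash");
}

// Bug 2 was configuration, not code: the redirect allowlist entry
// needed a trailing wildcard (e.g. https://app.example.com/auth/confirm*,
// a hypothetical domain) so the ?next=/home variant would match instead
// of silently falling back to the Site URL.
```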

Neither bug would have been caught by reading the diff. Neither bug would have been caught by running the tests. Both bugs lived in the interaction between our code and an external system (Supabase Auth) under conditions (a real browser, a clean cookie state) that the agent never exercised on its own.

The bugs were caught because the artifact required an end-to-end screenshot. Without that requirement, the test-users feature would have shipped with two broken interactions, and nobody would have noticed until production. Once the bugs were fixed, the artifact gained a new section called "Magic link end-to-end (post-fix)", with the server log proving the rewrite was happening, the two fix snippets inline, and a screenshot of a test user successfully landing on the onboarding wizard.

That last section is now the most valuable part of the document. Future me, or future Claude reading the project history, knows exactly what broke, exactly what fixed it, and exactly what evidence proved the fix.


Why This Pattern Is Even More Valuable for "Mushy" AI Features

Everything I've said so far applies to building any software with an AI agent. The verification artifact gets even more valuable when the thing you're building is itself AI-powered.

Consider a RAG feature: "Given a question, retrieve relevant documents and synthesize an answer." How do you verify it works? You can write tests asserting that some documents come back. You can write tests asserting the response is non-empty. You cannot, in any meaningful way, write a test asserting "the answer is good."

The mushy-output problem: For features whose outputs are themselves natural language (summaries, RAG answers, classifications, agent reasoning traces), there is no green-checkmark verification. There is only "does this output look right to a human." The verification artifact is how you make that judgment quickly and reviewably.

The structure that works for mushy outputs:

  • A small but representative input set. Five to ten cases that span the input distribution. The happy path, two edge cases, two adversarial cases.
  • The actual outputs, captured. Not paraphrased. The literal string the system produced.
  • A "what to look for" prompt. What's the human reviewer supposed to be checking? "Does the answer cite the document it claims to be citing?" "Does the summary preserve the action items?" "Does the classification handle the ambiguous case the way we'd expect?"

LLM-as-judge has its place. It scales, it gives a number, it's fine for regression detection. It cannot replace a human looking at outputs for a feature you're shipping for the first time. The judge model has the same blind spots the implementation model does. You need a human, briefly, on real outputs.

The verification artifact is how you give them the "briefly."


Making It Automatic

This pattern only works if it's automatic. The whole point is that the agent produces the artifact while it's already in the context of the implementation, and not as a separate "now write up what you did" task.

The way I've operationalized it: my implementation-plan skill, the prompt the agent reads when starting any non-trivial feature, has a phase called "verification artifact" that requires an index.html file in verification/ containing screenshots and samples before the work can be called done. The screenshots get moved into the verification folder rather than thrown away. The HTML gets opened in my browser automatically when the agent thinks it's done.

That last detail matters more than it sounds. If the artifact only opens when I remember to open it, I won't remember. If it opens automatically, I always look. Once I'm looking, I notice the things the agent skipped.
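
The auto-open step needs almost nothing. A sketch, using the stock opener each OS already ships with:

```typescript
// open-artifact.ts: a sketch of the auto-open step. Runs as the
// agent's final action so looking at the page takes zero effort.
import { execFile } from "node:child_process";
import { resolve } from "node:path";

const page = resolve("verification/index.html");

// macOS ships `open`, most Linux desktops `xdg-open`,
// Windows `start` (a cmd built-in, hence the cmd /c wrapper).
const [cmd, ...args] =
  process.platform === "darwin" ? ["open", page]
  : process.platform === "win32" ? ["cmd", "/c", "start", "", page]
  : ["xdg-open", page];

execFile(cmd, args, (err) => {
  if (err) console.error("Could not open verification page:", err);
});
```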

The verification artifact rules I've landed on:

  • One file, no dependencies. Plain HTML, inline CSS, screenshots as relative paths in a sibling folder. Open it from the filesystem, no build step.
  • Real outputs, not types. Sample API responses copied verbatim from curl, SQL results pasted under the queries that produced them. Schemas tell you what could happen; samples tell you what did.
  • Screenshots over descriptions. "The button changes color on hover" is not verification. A screenshot of the hover state is.
  • Document the bugs you found. When verification catches something, the fix gets a section in the artifact, with the server logs or sample inputs that proved the original was broken. This is the part nobody does and it's the most valuable part.
  • Known follow-ups, written down. The thing you decided not to fix in this PR, and why, lives at the bottom. Future-you will thank you.

The Sprint Artifact Bonus

A few sprints in, the folder of verification pages quietly becomes the most valuable engineering documentation your team has.

The verification pages are the only artifact that captures, for each shipped feature, what we built, what it actually does in the running system, how we proved it, and what we knew we hadn't gotten to yet. READMEs drift away from the code. Architecture diagrams describe what we meant to build. Verification pages describe what the running system actually did when we shipped it.

When a new engineer joins the project, the verification folder is what I point them at. When a bug shows up months later in a feature I haven't touched, the verification page tells me what the original behavior was supposed to be. When a partner asks "are we sure that's working correctly?", I send them a link.

Engineering teams produce a lot of artifacts. PRs, commits, issues, design docs, retros, RFCs. Almost none of them age well. The verification page is the rare exception, because it's a snapshot of a working system at a moment in time, with screenshots that don't lie about what was actually shipping.


What This Costs You

Twenty minutes per feature. The agent produces it; you don't write it by hand. It uses the screenshots it was already taking to verify its own work, the SQL queries it was already running, the API responses it was already inspecting. The work was already happening. The artifact just makes that work legible to a human reviewer.

What you get in return:

  • A forcing function that makes "click the actual flow in a real browser" a non-skippable step
  • A five-minute review path instead of a fifty-minute one
  • A behavioral spec that survives rewrites of the codebase
  • Documentation of the bugs you found during verification, which is otherwise the highest-signal information in the project and the most likely to be lost
  • A growing catalog of how each feature in your product is supposed to work, captured at the moment it was shipping

For AI-augmented teams, this isn't optional. The whole reason you're moving fast with AI is that you've collapsed the implementation cost of features. Verification cost did not collapse along with it. If you're not deliberately rebuilding the verification side of the loop, you're shipping features faster than you can confirm they work, and the gap accumulates as debt.


The One-Sentence Version

Trust only the AI agent that hands you an HTML page proving the feature works.

Make the artifact mandatory, have it open automatically, and require the agent to fill in a section called "bugs we found during verification." That last section is the one that turns "AI builds the feature" into "AI builds the feature and confirms it works." That's the only version of the workflow that scales.