AI Engineering

Erasing and Repainting an AI Writing Voice

Our content pipeline kept publishing posts that readers flagged as AI-generated. The grammar was clean, the facts were right, and readers flagged them anyway. What it took to fix the voice, and a surprise about where 'more passes' stops helping.

"This reads like AI." That was the recurring feedback on the LinkedIn and Instagram posts our auto-posting pipeline produced. The tool behind those posts wasn't making factual mistakes, and the grammar was clean on every draft. People could still tell.

The awkward part is that the project was specifically about responsible, human-first AI adoption. Having our own pipeline pattern-match to ChatGPT in the first sentence of every post was a credibility leak we couldn't live with.

Why "just make it sound more human" is the wrong frame

The obvious fix is to tell the model to sound more human. Humans don't share a voice, so that instruction doesn't land anywhere specific. A grandmother writes differently from a Brooklyn DJ, who writes differently from a civil engineer. The target has to be a specific person whose name ends up on the byline. That's what the reader is pattern-matching against when they say something reads like AI.

The first thing anyone tries is a better prompt. We tried. Stuffing the brand voice into the content agent's system prompt helped on easy drafts and fell apart on hard ones. The same agent was supposed to research the topic, call tools, structure the piece, and then also hit a vocabulary ban list with forty-ish words on it. Something had to give, and the thing that gave was always the voice. Banned words slipped back in, opener clichés came back even when the prompt forbade them by name, and em-dash density crept up across every longer draft.

After enough of those, we stopped trying to write a bigger prompt and started moving the pieces around. The architecture was the problem.

What other people figured out first

Before writing new code, we did the reading. Turns out other people have spent a lot of time on this already.

The most useful source is Reddit. Subs like r/ChatGPT, r/ContentMarketing, and r/ArtificialIntelligence have spent the last two years cataloguing the specific patterns that give AI writing away, with a level of specificity you don't get from any whitepaper. Thousands of upvotes on single-phrase call-outs. The most-cited pattern is the antithesis formula, some variant of "It's not X. It's Y." One r/ChatGPT thread flagging it has over a thousand upvotes and hundreds of replies all saying "yes, that's the one."

Academic work adds harder numbers. The word "delve" alone saw something like a 900% spike in academic paper abstracts in the year after ChatGPT's release. An Originality.ai analysis estimated that 54% of long-form LinkedIn posts are AI-generated as of 2025, up 189% from pre-ChatGPT baselines. Em dashes show up roughly two to three times more often in AI writing than in comparable human writing.

Seven categories of tell kept coming up across sources:

  1. Opener clichés ("In today's fast-paced world...", "I'm excited to announce...", "Let's dive in")
  2. Rhetorical crutches (the antithesis formula, staccato triplets, formal transition words, empty "what are your thoughts?" engagement prompts)
  3. Vocabulary fingerprints (delve, tapestry, multifaceted, leverage, foster, crucial, and a few dozen others that appear 10-50x more often in AI text than in natural writing)
  4. Structural tells (low burstiness in sentence length, uniform one-sentence paragraphs, formulaic hook-bullets-wrap arcs)
  5. Em-dash density
  6. Content-level tells (no stance, no sensory detail, no named people, rounded numbers instead of specific ones)
  7. Assistant-voice leaks ("Here's a draft...", "I hope this helps!", "Let me know if you want to adjust")

The most useful finding was about clusters. An individual tell is forgivable. A real writer might use "delve" once, drop an em dash, start a post with a stale opener when they're in a hurry. What sets off the reader's alarm is two or three tells stacked in the same piece. That gave us a concrete compliance bar to work against: zero on the worst offenders, and at most one on the moderate ones.
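That bar translates directly into code. A minimal sketch, assuming each detected tell arrives tagged with a severity; the dict shape here is illustrative, not a real schema:

```python
def meets_compliance_bar(tells: list[dict]) -> bool:
    """Zero worst-offender tells, at most one moderate one."""
    worst = sum(t["count"] for t in tells if t["severity"] == "high")
    moderate = sum(t["count"] for t in tells if t["severity"] == "medium")
    return worst == 0 and moderate <= 1

# One moderate tell is forgivable; a cluster, or any worst-offender, is not.
assert meets_compliance_bar([{"severity": "medium", "count": 1}])
assert not meets_compliance_bar([{"severity": "high", "count": 1}])
```

The point of writing it this way is that "sounds human" becomes a boolean you can regress against, not a vibe.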

The reading changed the tooling in two concrete ways. It gave us specific patterns to design against, instead of a vague "sound more human" target. And those patterns could be checked with regex in milliseconds, well before any slow LLM critic touched the draft. Both the ban list baked into our prompts and the deterministic test layer in our evaluation harness trace back to that first reading pass.

The fix: treat voice as a separate pass

Humans don't write in one pass either. You get the thoughts down, then you edit. The editing pass is where the voice actually lives. Good writers spend more time on the edit than on the draft.

So we split the work:

  1. Draft. A mid-tier model does what mid-tier models do well: assembles the piece, uses the tools, gets the facts in.
  2. Polish. A second call, with one job only. Its system prompt is nothing but the persona, the vocabulary ban list, the structural rules, and an absolute order to preserve every name, number, quote, and date from the draft. It rewrites the draft in the target voice, or returns it unchanged if it's already there.
  3. Critique (when needed). A cheap classifier scores the polished draft against the rules and returns a structured list of tells with severities. If the score is high, ship. If the score is low and the failures are high-severity, escalate.
  4. Regenerate (last resort). A flagship model rewrites from scratch with the critic's notes folded into the prompt as must-fix feedback. Slow, expensive, gated behind a specific failure signal.

That's four prompts where we used to have one, each tunable without breaking the others. Keeping them separate gets hard when a deadline starts pushing you to fold them back together.

The design rule that does the work: each pass is allowed to be bad at every job except its own. The draft pass is allowed to produce AI-sounding prose, because the polish pass will fix it. The polish pass is allowed to not know the facts, because the draft already has them.
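The four passes above compose into a short control flow. A sketch with each model call passed in as a stub, since the wiring rather than the prompts is the point here; `ship_threshold` and the callable signatures are illustrative:

```python
from typing import Callable

def produce_post(
    brief: str,
    draft: Callable[[str], str],        # mid-tier model: facts, structure, tools
    polish: Callable[[str], str],       # voice-only rewrite, one job
    critique: Callable[[str], dict],    # cheap critic: score plus tagged tells
    regenerate: Callable[[str, list], str],  # flagship rewrite, last resort
    ship_threshold: int = 7,            # illustrative cutoff on a 1-10 score
) -> str:
    text = polish(draft(brief))
    verdict = critique(text)
    if verdict["score"] >= ship_threshold:
        return text
    if any(t["severity"] == "high" for t in verdict["tells"]):
        # fold the critic's notes into the regeneration prompt
        return regenerate(brief, verdict["tells"])
    return text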

What it looks like in the chat

Every pass shows up as a visible tool call. Users see the underlayer and the final paint job. If the polish changed something they didn't want changed, they can see exactly what, and push back on it by name.

user: Draft a short LinkedIn post recapping last night's event.
       48 attendees. [Specific person] asked the sharpest question.

assistant:
  → tool call: polish_voice({ draft: "…", type: "linkedin" })
  ← { pass: true,
      voice_notes: "Kept the facts and tightened the wording.
                    Added a more grounded closer and removed the
                    generic thanks-only ending.",
      tells_found: [
        { pattern: "Generic corporate recap language",
          severity: "medium", count: 1 },
        { pattern: "Bland closing thanks without a point of view",
          severity: "low", count: 1 }
      ] }
  → tool call: create_artifact({ title: "…", content: "<polished>",
                                 type: "linkedin" })

Two things about that matter. The tool card functions as an audit trail, so a human editor who wants to adjust the output doesn't have to guess what the system did. And because the critic's output is machine-readable, we can score every output programmatically in a fixture harness, run it overnight, and read the report in the morning.
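Because the critic returns structured tells rather than prose, the overnight harness reduces to a loop and a tally. A sketch, with the result shape mirroring the tool output above; field names are illustrative:

```python
from collections import Counter

def summarize(results: list[dict]) -> dict:
    """Aggregate per-fixture critic output into an overnight report."""
    tally: Counter = Counter()
    passed = 0
    for r in results:
        passed += r["pass"]
        for tell in r["tells_found"]:
            tally[tell["pattern"]] += tell["count"]
    return {
        "pass_rate": passed / len(results),
        "most_common_tells": tally.most_common(3),
    }
```

Reading that report in the morning tells you which tells the prompts are currently losing to, which is exactly the feedback loop a prose-only critic can't give you.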

What the experiment showed

We built a set of twelve content fixtures covering the genres we actually publish: short event recaps, long event recaps with numbers, event promos, milestone announcements, policy commentary, a deliberately thin prompt with no material to work with, and a deliberately poisoned prompt where the user input itself was full of AI filler (with the word "delve" explicitly requested). Then we ran four different architectures against every fixture.

The four architectures we compared:

  • A. Draft only. One model call. Voice rules in the system prompt. This is the baseline.
  • B. Draft + polish. Two calls. Polish pass always runs.
  • C. Draft + critique + conditional regenerate. The generator uses the voice rules. A cheap critic checks the output. If the critic fails the draft with high-severity tells, a flagship model regenerates with the critic's notes.
  • D. Full hybrid. Voice in the generator, always-on polish, always-on critic, conditional regenerate.

Every output got scored two ways. The first was a batch of regex checks: banned word count, em-dash density, antithesis formula occurrences, hashtag count, sentence-length variance, whether the output preserved the specific names and numbers from the input notes. These are deterministic tests that don't score quality; they score whether the draft trips any of the known tripwires.
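A rough sketch of that deterministic layer, with a deliberately shortened ban list; the regexes and thresholds are illustrative, not our production set:

```python
import re
import statistics

# Shortened for illustration; the real list has a few dozen entries.
BANNED = re.compile(r"\b(delve|tapestry|multifaceted|leverage|foster)\b", re.I)
# Catches variants of the "It's not X. It's Y." antithesis formula.
ANTITHESIS = re.compile(r"\bit'?s not (just )?[^.!?]+[.,;] (it'?s|but) ", re.I)

def tripwires(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return {
        "banned_words": len(BANNED.findall(text)),
        "em_dashes_per_100_words":
            100 * text.count("\u2014") / max(len(text.split()), 1),
        "antithesis_hits": len(ANTITHESIS.findall(text)),
        "hashtags": text.count("#"),
        # Low variance in sentence length = low burstiness, a structural tell.
        "sentence_len_stdev":
            statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,
    }
```

Every check runs in microseconds, so this layer can gate every draft before a single critic token is spent.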

The second was an LLM-judge rubric: a separate model call that scored each output 1–10 on four axes: avoiding AI tells, sounding like the target persona, containing a concrete specific detail, and having a stance.
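The judge itself is one more model call; the part worth pinning down is a rubric whose output parses cleanly. A sketch that builds the rubric prompt and averages the four axis scores, assuming the judge is instructed to reply in JSON; the `AXES` names are illustrative:

```python
import json

AXES = ["avoids_ai_tells", "sounds_like_persona",
        "has_specific_detail", "has_stance"]

def judge_prompt(text: str) -> str:
    return (
        "Score the post below from 1-10 on each axis: "
        + ", ".join(AXES)
        + ". Reply with a single JSON object mapping axis name to score.\n\n"
        + text
    )

def parse_judge(reply: str) -> float:
    """Average the four axis scores; raises if the judge skipped an axis."""
    scores = json.loads(reply)
    return sum(scores[a] for a in AXES) / len(AXES)
```

Demanding JSON and failing loudly on a missing axis matters more than the wording of the rubric: a judge you can't score programmatically can't run overnight.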

The scores told a mostly predictable story, with one exception.

Going from the baseline (A) to a dedicated polish pass (B) nearly doubled the quality score. That part was expected. D is where it got weird.

The intuitive next move was D: every pass, every time, belt and suspenders. That's what we expected to be the default. The numbers disagreed. D scored below C, by about half a point, despite doing more work.

Why the "more" strategy lost

Re-reading the outputs side by side, the failure mode showed up fast. The polish pass, running unconditionally after a voice-aware draft, was over-correcting. The draft had specifics in it; the polish pass, trying to tighten the voice, sanded those specifics off. A named person from the draft would come out the other side as "an attendee." Specific numbers got rounded down to "many" or "most." The critic, reading the polished version, noticed the missing specifics and scored it lower.

The polish pass didn't know what to hold onto. If you tell a pass what to change without telling it what to preserve, it strips things you wanted to keep. The polish prompt had a strong rule against certain vocabulary and a weak rule about keeping the original's specifics. We tightened the preservation contract, added a "never drop a proper noun, a number, a date, or a quote" rule in bold at the top of the prompt, and D's scores climbed toward C's. Still not a clean win. The cost of D's extra calls wasn't paying for itself once the polish pass stopped over-correcting.
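The preservation contract is also checkable without a model: extract the draft's specifics and assert they survive the polish. A crude sketch; the regexes are a rough proxy for numbers and proper nouns, and real name detection would want NER:

```python
import re

def specifics(text: str) -> set[str]:
    """Numbers plus mid-sentence capitalized words, a crude proxy for names."""
    numbers = set(re.findall(r"\b\d[\d,.]*\b", text))
    # Capitalized words not at a sentence boundary are likely proper nouns.
    names = set(re.findall(r"(?<=[a-z,] )([A-Z][a-z]+)", text))
    return numbers | names

def dropped_specifics(draft: str, polished: str) -> set[str]:
    """Anything the polish pass sanded off, e.g. a name rounded to 'an attendee'."""
    return specifics(draft) - specifics(polished)
```

A non-empty result from `dropped_specifics` is exactly the over-correction failure described above, caught before the critic ever sees the draft.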

So we shipped C as the default for routine drafts, with the critic and the regen reserved as escalations. On the interactive chat path, this means users see a polish tool card on every post they create. If the polish pass flags itself as uncertain, they also see a critique card. The regen card is rare, and that's by design.

Three surprises

Regex checks caught more than the LLM judge did. We assumed the soft, qualitative rubric would be the real filter and the regex would be a sanity check. It was closer to the opposite. The LLM judge was generous about phrasings that a quick regex pattern flagged as the exact antithesis formula ("it's not X, it's Y"). The judge rationalized, the regex didn't. Keep both layers, because they catch different things.

Deeper reasoning on the flagship regenerate step mostly didn't help enough to pay for itself. We had a hypothesis that letting the flagship model reason harder would produce much better output on the hardest cases. What we got was slightly better output at 3x the latency. On an interactive path, where a user is waiting for the post to land, slower is a feature regression. We pinned the regen model at the lowest reasoning effort the API allowed and moved on.

The most useful fixture was the poisoned one. The poisoned prompt explicitly asked the model to use the words "transformative," "multifaceted," and "delve into." Strategy A produced exactly what you'd expect: a post that sounded like every other LinkedIn post. Strategies B, C, and D routed around the bait. Watching each architecture handle that one case told us more about their relative quality than any of the polite, realistic fixtures did. If you're evaluating a voice pipeline, throw adversarial inputs at it. The polite inputs won't tell you where the edges are.

The takeaway

Voice is a separable concern. Treat it that way. One model shouldn't be on the hook for research, structure, facts, tools, and voice discipline all at once. Carve voice off into its own pass, give that pass one job and one system prompt, and be explicit about what it's allowed to change and what it has to preserve.

This was one of the more fun problems we've worked on recently. Accuracy has a clean target in the form of facts you can check against. Voice doesn't, which is why Reddit has thousands of posts cataloguing the exact phrases readers learn to flag. The target is a resemblance to a specific person, and that's a different kind of target than any of the ones most agent pipelines are trying to hit.