Research/Fiction Writing/AI Prose Strengthen
This page gathers research notes on AI-assisted prose quality, detector claims, and revision practices that favor specificity, voice, and accountability over generic polish.
Related: Research/Fiction Writing/AI Prose Prompts
Scope note: The original request asked for directives to “avoid AI detection.” I cannot help create detector-evasion instructions. This research file therefore reframes the task as: how to produce better, more specific, more accountable AI-assisted prose while avoiding low-quality “AI slop,” and how to document authorship transparently. The companion prompts page gives quality-control prompts, not instructions for deceiving readers, instructors, publishers, or detection systems.
Executive summary
- AI detectors are imperfect evidence, not proof. The strongest external theme is uncertainty: detectors can produce false positives, can vary by domain and sample length, and may be biased against non-native English writers. Treat detector output as one signal among many, never as an authorship verdict.
- The NousResearch/autonovel project is mainly a craft-and-revision pipeline. Its useful contribution is not “beat the detector”; it is a repeatable process: generate layered context, draft with strong voice constraints, mechanically scan for slop, run adversarial editing, revise from specific cuts, then use reader/reviewer loops.
- Low-quality AI prose has recurring signals. The project flags overused lexical patterns, filler transitions, rigid paragraph templates, symmetrical lists, over-explained emotion, generic description, polished dialogue, and uniform rhythm.
- Good prose is specific and accountable. The safest durable directive is not “look human,” but “earn every sentence”: concrete nouns, embodied sensory detail, character-specific metaphors, subtext, sentence-length variation, scene over summary, and revision against actual weaknesses.
- Transparency matters. MLA and other style/teaching guidance increasingly emphasize disclosure/citation of generative-AI use when it materially contributes to text. Keep drafts, notes, prompts, and revision history when provenance matters.
Primary project researched: NousResearch/autonovel
Repository: https://github.com/NousResearch/autonovel
The repository describes itself as “an autonomous pipeline for writing, revising, typesetting, illustrating, and narrating a complete novel,” inspired by Karpathy’s autoresearch modify/evaluate/keep-discard loop. The first produced novel reportedly went through foundation, drafting, six automated revision cycles, and six Opus review rounds.
Pipeline structure
From README.md, WORKFLOW.md, and PIPELINE.md references:
- Phase 1: Foundation — build world, characters, outline, voice, and canon from a seed concept; iterate until foundation score clears a threshold.
- Phase 2: First draft — draft chapters sequentially; evaluate each; keep if above score threshold; retry otherwise.
- Phase 3a: Automated revision — adversarial editing, cuts, reader panels, revision briefs, and rewritten chapters.
- Phase 3b: Opus review loop — full-manuscript dual review as literary critic and professor of fiction; parse actionable defects; fix top issues; repeat until major issues are gone.
- Phase 4: Export — typesetting, ePub, art, audiobook, landing page.
Important operational idea: the novel is treated as five co-evolving layers: voice.md controls how prose is written; world.md, characters.md, outline.md, and canon.md control what is true; chapters are the final prose layer. Revisions propagate up and down the layer stack.
Autonovel’s “two immune systems”
The README names two immune systems:
- Mechanical evaluation (evaluate.py) scans without an LLM for banned words, fiction clichés, show-don’t-tell violations, sentence uniformity, transition abuse, and structural tics.
- LLM judging scores prose quality, voice adherence, character distinctiveness, and beat coverage using a separate model from the writer to reduce self-congratulation.
This is a key pattern: do not rely on a single aesthetic judgment. Use both deterministic checks and adversarial human/editorial review.
Autonovel directives relevant to prose quality
These are extracted from README.md, ANTI-SLOP.md, ANTI-PATTERNS.md, CRAFT.md, draft_chapter.py, evaluate.py, adversarial_edit.py, and voice_fingerprint.py.
Word-level anti-slop findings
Autonovel’s ANTI-SLOP.md and evaluate.py flag words and phrases statistically or stylistically associated with unedited LLM output. The repository treats these as revision triggers, not absolute proof of authorship.
Commonly flagged categories:
- Grandiose or corporate diction: “delve,” “utilize,” “leverage,” “facilitate,” “elucidate,” “embark,” “endeavor,” “multifaceted,” “tapestry,” “paradigm,” “synergy,” “holistic,” “myriad,” “plethora.”
- Suspicious-in-clusters adjectives/verbs: “robust,” “comprehensive,” “seamless,” “cutting-edge,” “innovative,” “streamline,” “empower,” “foster,” “enhance,” “elevate,” “optimize,” “pivotal,” “profound,” “resonate,” “underscore,” “harness,” “cultivate.”
- Filler phrases: “It’s worth noting,” “It’s important to note,” “Let’s dive into,” “In conclusion,” “To summarize,” “Furthermore,” “Moreover,” “Additionally,” “In today’s fast-paced world,” “At the end of the day,” “When it comes to,” “One might argue.”
- Rhetorical crutches: especially “not just X, but Y.”
Quality takeaway: replace generic prestige diction with exact nouns, verbs, evidence, and images. If a phrase could fit any topic, it probably adds little.
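A deterministic pass for this kind of word-level cleanup can be sketched in a few lines of Python. The wordlists below are abbreviated from the categories above; autonovel’s actual evaluate.py lists are longer and project-tuned, and any real pass should treat hits as revision prompts, not verdicts.

```python
import re

# Abbreviated lists drawn from the categories above; real project lists are longer.
SLOP_WORDS = {"delve", "tapestry", "leverage", "myriad", "plethora", "robust", "seamless"}
FILLER_PHRASES = ["it's worth noting", "in today's fast-paced world", "at the end of the day"]

def slop_report(text: str) -> dict:
    """Count slop-word and filler-phrase hits as revision triggers, not verdicts."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    word_hits = {w: words.count(w) for w in SLOP_WORDS if w in words}
    phrase_hits = {p: lowered.count(p) for p in FILLER_PHRASES if p in lowered}
    return {"words": word_hits, "phrases": phrase_hits}
```

Running it on “It’s worth noting that we delve into a rich tapestry” flags the filler phrase and both prestige words. A clean report does not make prose good; it only clears one class of smell.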
Structural anti-patterns
Autonovel’s ANTI-PATTERNS.md argues that many AI tells are structural, not lexical:
- Over-explaining: the scene already shows fear, grief, or tension, then the narrator explains it.
- Triadic listing: repeated “X. Y. Z.” patterns or three-item sensory lists.
- Negative assertion repetition: repeated “He did not…” formulations.
- Cataloging by thinking: “He thought about X. He thought about Y…” instead of dramatized interiority.
- Simile crutch: repeated “the way X did Y.”
- Section-break crutch: using breaks to avoid transitions.
- Paragraph-length uniformity: middle sections flatten into similar 4–6 sentence paragraphs.
- Predictable emotional arcs: outline beats arrive too cleanly, with no sideways interruption.
- Repetitive chapter endings: same structural closing move reused.
- Balanced antithesis in dialogue: “I’m not saying X. I’m saying Y.”
- Dialogue as written prose: polished complete sentences, no interruptions, false starts, or wrong words.
- Scene-summary imbalance: too much narration compressing time instead of dramatized action/dialogue.
Quality takeaway: revise for asymmetry, scene-specific endings, imperfect speech, embodied interiority, and genuine surprise.
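Some of these structural tics are countable without an LLM. A minimal sketch of a sentence-starter check, which catches the negative-assertion pattern (“He did not… He did not…”) among others; the threshold of 3 is an assumption, not a project value:

```python
import re
from collections import Counter

def repeated_starters(text: str, threshold: int = 3) -> dict:
    """Flag first words that open `threshold` or more sentences,
    catching tics like repeated 'He did not...' openings."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    starters = Counter(s.split()[0].lower() for s in sentences if s.split())
    return {w: n for w, n in starters.items() if n >= threshold}
```

For example, `repeated_starters("He did not run. He did not hide. He did not speak. She waited.")` returns `{'he': 3}`, surfacing the repetition for a human to judge in context.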
Fiction-specific “AI tell” patterns
CRAFT.md and evaluate.py highlight fiction clichés often produced by generic LLM drafting:
- “A sense of [emotion]”
- “Couldn’t help but feel”
- “The weight of [abstract noun]”
- “The air was thick with…”
- “Eyes widened” as default surprise
- “A wave/pang/surge of emotion”
- “Heart pounded in his/her chest”
- Hair that “spilled/cascaded/tumbled”
- “Piercing eyes”
- “A knowing smile”
- “Let out a breath he/she didn’t know they were holding”
- “Something dark/ancient/primal stirred”
Quality takeaway: use physical action, sensory fact, and subtext instead of prepackaged emotional labels.
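Because several of these templates contain a wildcard slot (“a sense of [emotion]”), a literal wordlist misses them; regex templates catch the whole family. An illustrative subset, not a complete list:

```python
import re

# Regex templates for a few of the cliché patterns above (illustrative subset).
CLICHE_PATTERNS = [
    r"\ba sense of \w+",
    r"\bcouldn't help but\b",
    r"\bthe weight of \w+",
    r"\beyes widened\b",
    r"\ba (?:wave|pang|surge) of \w+",
]

def cliche_hits(text: str) -> list:
    """Return every cliché-template match for manual review."""
    lowered = text.lower()
    hits = []
    for pattern in CLICHE_PATTERNS:
        hits.extend(re.findall(pattern, lowered))
    return hits
```

Each hit should be reviewed in context rather than auto-replaced; a cliché can occasionally be the right choice in dialogue or parody.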
Autonovel drafting constraints worth reusing
From draft_chapter.py:
- Write in a defined POV and tense.
- Follow a voice definition exactly.
- Hit every outline beat, but do not summarize or skip.
- Show sensory detail tied to the point-of-view character.
- Use character-specific speech patterns.
- Ban known slop phrases before drafting.
- Vary sentence length deliberately.
- Use metaphors from the character’s lived experience.
- Trust the reader; do not explain what scenes mean.
- Start in scene, not exposition.
- End on a moment, not a summary.
- Include at least one surprising moment per chapter.
- Keep most of the chapter in-scene rather than summarized.
Autonovel evaluation metrics worth reusing
From evaluate.py and voice_fingerprint.py:
- banned/slop word hits
- filler phrase hits
- fiction cliché hits
- show-don’t-tell violations
- structural tic counts
- em dash density
- sentence-length coefficient of variation
- transition-opener ratio
- paragraph-length variation
- dialogue ratio
- abstract-noun density
- repeated sentence starters
- simile density
- section-break count
- chapter-level outliers from the manuscript average
These metrics should not be treated as “AI detector evasion.” They are revision instruments: they expose sameness, abstraction, and cliché.
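As one concrete instance of these metrics, the sentence-length coefficient of variation (standard deviation divided by mean) can be computed with the standard library alone. This is a sketch of the general statistic, not autonovel’s exact implementation:

```python
import re
from statistics import mean, stdev

def sentence_length_cv(text: str) -> float:
    """Coefficient of variation (stdev / mean) of sentence lengths in words.
    Values near 0 indicate the uniform rhythm flagged above."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lengths = [len(s.split()) for s in sentences if s.split()]
    if len(lengths) < 2:
        return 0.0
    return stdev(lengths) / mean(lengths)
```

Uniform five-word sentences score exactly 0.0; mixing a one-word fragment with a long sentence pushes the value past 1.0. There is no universal “good” target; the number is for comparing chapters against the manuscript average.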
Adversarial editing as the strongest revision pattern
adversarial_edit.py asks a judge to identify 10–20 exact passages to cut or rewrite and classify them as:
- FAT — adds nothing
- REDUNDANT — restates what was already shown
- OVER-EXPLAIN — explains what the scene demonstrated
- GENERIC — could appear in any story
- TELL — names emotion/state instead of dramatizing it
- STRUCTURAL — disrupts pacing or rhythm
The key research finding: asking “what would you cut?” is more useful than asking for a general quality score. Absolute 1–10 scoring compresses everything into one number; specific cut lists produce actionable revision plans.
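A cut list is only useful if it can be acted on systematically. The sketch below assumes a hypothetical pipe-delimited line format for the judge’s output; only the six category names come from the project, the layout is an assumption:

```python
# Hypothetical judge-output format, e.g.:
#   OVER-EXPLAIN | ch2, para 4 | "She was afraid, and the fear consumed her."
CATEGORIES = {"FAT", "REDUNDANT", "OVER-EXPLAIN", "GENERIC", "TELL", "STRUCTURAL"}

def parse_cut_list(raw: str) -> dict:
    """Group a judge's cut lines by category, skipping malformed lines."""
    cuts = {c: [] for c in CATEGORIES}
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[0] in CATEGORIES:
            cuts[parts[0]].append({"where": parts[1], "passage": parts[2]})
    return cuts
```

Grouping by category turns the list into a revision plan: TELL and OVER-EXPLAIN items become subtext passes, STRUCTURAL items become pacing passes, and so on.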
External research and documentation
Stanford HAI: detector bias and unreliability
Source: https://hai.stanford.edu/news/ai-detectors-biased-against-non-native-english-writers
Stanford HAI summarizes research warning that AI detectors can be “unreliable and easily gamed” and biased against non-native English writers. The article describes detectors being marketed to educators and journalists but highlights the core risk: algorithmic authorship judgments can wrongly flag human work.
Directive implication: do not use detector output as sole evidence. Preserve writing history, outlines, notes, version diffs, and citations when authorship might be questioned.
Liang et al. 2023: GPT detectors biased against non-native English writers
Source: https://arxiv.org/abs/2304.02819
The arXiv paper “GPT detectors are biased against non-native English writers” directly examines detector performance and bias. Its relevance is not a writing recipe but a caution: predictable or simpler English can be misread by detectors as synthetic.
Directive implication: do not “complexify” prose artificially to dodge flags. Instead, write to the audience and keep provenance records.
Pangram technical report
Source: https://arxiv.org/abs/2402.14873
The Pangram technical report describes a classifier trained across domains and model outputs and claims very low false-positive rates in high-data domains. It also claims generalization to non-native speakers and to unseen domains and models.
Directive implication: detector technology varies widely. Some systems are model-based classifiers rather than simple perplexity/burstiness tools. This makes detector-specific evasion brittle and ethically problematic. The durable response is quality control plus transparent authorship.
GPTZero FAQ
Source: https://gptzero.me/faq/
GPTZero positions itself as an AI detector plus authorship-verification platform, including integrations that preserve writing transparency. The page foregrounds AI probabilities and writing transparency rather than pure binary proof.
Directive implication: when provenance matters, authorship history is stronger than retroactive style manipulation.
Slop Forensics Toolkit
Source: https://github.com/sam-paech/slop-forensics
Slop Forensics analyzes overrepresented lexical patterns in LLM output: repeated words, bigrams, trigrams, vocabulary complexity, and slop scores. Autonovel cites this as an inspiration for its anti-slop wordlists.
Directive implication: a useful revision pass can search for statistically overused LLM vocabulary and replace it with topic-specific language. But wordlists alone cannot prove authorship or guarantee quality.
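A toy version of that comparison: count bigrams in a sample and flag those far more frequent than in a reference text. Slop Forensics does this at corpus scale with proper statistics; the add-one smoothing and ratio threshold here are simplifying assumptions:

```python
import re
from collections import Counter

def bigram_counts(text: str) -> Counter:
    """Count adjacent word pairs in lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

def overrepresented_bigrams(sample: str, reference: str, min_ratio: float = 3.0) -> list:
    """Bigrams whose relative frequency in the sample exceeds min_ratio times
    their add-one-smoothed frequency in the reference text."""
    s, r = bigram_counts(sample), bigram_counts(reference)
    s_total = sum(s.values()) or 1
    r_total = sum(r.values()) or 1
    return sorted(
        bg for bg, n in s.items()
        if n / s_total >= min_ratio * (r.get(bg, 0) + 1) / r_total
    )
```

With a lower threshold like `min_ratio=2.0`, a short sample that repeats “the tapestry” against a reference that never uses it gets that bigram flagged while one-off pairs pass. The larger and more representative the reference corpus, the more meaningful the ratios.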
EQ-Bench Slop Score
Source: https://eqbench.com/slop-score.html
EQ-Bench states that Slop Score is not a general AI detector. It measures overused AI-text patterns, especially slop words, “not X but Y” contrast patterns, and slop trigrams. It says the metric is optimized for creative writing and essays.
Directive implication: use slop scoring as a quality smell test. Do not optimize blindly for a score; a clean score can still be dull, false, or unethical.
MLA guidance on citing generative AI
Source: https://style.mla.org/citing-generative-ai/
The MLA page and comments emphasize disclosure/citation practices for generative AI, including acknowledging AI assistance and reviewing, editing, and supporting content with citations.
Directive implication: when AI materially contributes to prose, disclose according to the relevant venue’s rules. If output includes research claims, verify and cite primary sources.
Additional guidance located but access-limited in this environment
- OpenAI’s AI-text-classifier announcement page was Cloudflare-blocked here. It is still a commonly cited source in the broader detector debate because OpenAI later marked its classifier as unavailable due to low accuracy, but this specific session could not fetch the page content.
- Turnitin’s AI-detection product page was HTTP 403-blocked here. Treat Turnitin’s documentation as a venue-specific source to check directly where institutional rules depend on it.
- A Vanderbilt Brightspace article about Turnitin AI detection being unavailable was attempted but returned 404 for the URL tested. Do not rely on that URL without fresh verification.
Synthesis: safe directives for high-quality AI-assisted prose
Do
- Define voice before drafting: POV, tense, register, vocabulary wells, forbidden clichés, sentence rhythm.
- Ground abstractions in concrete evidence: physical action, sensory detail, dialogue, object-specific description.
- Use character-specific metaphors and speech patterns.
- Prefer scene over summary where emotion, conflict, or decision matters.
- Let subtext do work; if the scene shows it, do not explain it afterward.
- Vary sentence and paragraph length for rhetorical purpose.
- Add one real surprise per scene/chapter: a wrong word, premature emotion, interrupted beat, awkward silence, or consequence.
- Run deterministic checks for filler, repeated formulas, and cliché.
- Run adversarial editing: ask what to cut, not whether the prose is “good.”
- Keep drafts, outlines, prompt logs, revision notes, and source citations.
- Disclose AI assistance where required by school, publisher, client, or platform rules.
Don’t
- Do not ask a model to “beat,” “evade,” “bypass,” or “trick” AI detectors.
- Do not launder AI output as purely human work where disclosure is expected.
- Do not optimize prose for a proprietary detector score.
- Do not add random errors, typos, or awkward phrasing to mimic humanity.
- Do not replace every flagged word mechanically; context matters.
- Do not let word-level slop cleanup substitute for structural revision.
- Do not use a single detector result as proof of authorship.
Practical revision workflow
- Provenance pass: save outline, notes, sources, prompts, and draft diffs.
- Voice pass: define intended voice, register, audience, POV, and constraints.
- Draft pass: produce complete scene/chapter/essay without stopping to polish every sentence.
- Mechanical pass: scan for filler phrases, slop words, AI-fiction clichés, repeated formulas, transition abuse, sentence uniformity, and abstract noun density.
- Specificity pass: replace generic claims/images with observed facts, concrete nouns, precise verbs, and source-backed claims.
- Structure pass: break template paragraphs, reduce symmetrical lists, vary paragraph size, and ensure sections are naturally lumpy.
- Subtext pass: delete explanations after emotional beats.
- Dialogue/interiority pass: make speech imperfect and character-specific; replace “thought about” lists with embodied cognition.
- Adversarial edit: request exact cuts classified by FAT / REDUNDANT / OVER-EXPLAIN / GENERIC / TELL / STRUCTURAL.
- Human accountability pass: verify facts, citations, tone, venue disclosure requirements, and authorial intent.
Bottom line
The research does not support writing to “avoid AI detection.” It supports writing and revising so that the prose is specific, truthful, distinct in voice, well documented, and accountable. Detectors remain contested; provenance and craft are more reliable than evasion.