About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
I run a fleet of LLM agents that audit 42 repositories every day. Same code, same prompts, same model. And for weeks, every single run produced different results. Not because the codebase was changing between runs, but because the agents were sampling instead of enumerating, summarizing instead of counting, and making subjective judgments instead of executing deterministic checks. The audit results were non-reproducible, which meant they were useless as an audit.
This article is not about making LLMs themselves deterministic. That is a physics problem with no complete solution. This is about building agent architectures that produce deterministic outcomes despite the inherent non-determinism of the models powering them. The distinction matters. You do not need a deterministic model to build a deterministic system. You need deterministic scaffolding around a non-deterministic core.
Why LLM Outputs Are Non-Deterministic
Before building solutions, you need to understand why the problem exists at all. The common belief that setting temperature=0 guarantees deterministic output is wrong. Testing Qwen 3 235B with 1,000 identical temperature-zero prompts produced 80 unique completions, with the most common response appearing only 78 times. No consensus majority. GPT-4o showed accuracy swings of up to 72 percentage points across identical runs on the same task.
Three root causes drive this behavior, and temperature is not the primary one.
Batch Invariance Failure
This is the real villain. Modern inference systems dynamically adjust batch sizes based on server load. When your request lands in a batch of 4 versus a batch of 128, the matrix multiplications execute in different orders. Running torch.mm(a[:1], b) separately produces a maximum absolute difference of 1,669.25 compared to extracting the same row from torch.mm(a, b)[:1]. During attention decoding with a 1,000-token KV cache, processing 128 tokens requires five computational blocks versus four without the cache. Different reduction order, different floating-point accumulation, different results.
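You can reproduce the shape of this failure without a GPU. The sketch below (pure Python, not the torch reproduction from the text) sums the same values with two different block sizes, mimicking how a different batch size changes the reduction tree:

```python
# Pure-Python stand-in for a batched GPU reduction: the same numbers,
# reduced with different block sizes, accumulate differently because
# float addition is order-sensitive.
def block_sum(xs, block):
    """Sum xs in chunks of `block`, then sum the partial results --
    the shape of reduction a different batch size would trigger."""
    partials = [sum(xs[i:i + block]) for i in range(0, len(xs), block)]
    return sum(partials)

xs = [0.1, 1e20, -1e20, 0.1]
print(block_sum(xs, 1))  # 0.1 -- strict left-to-right accumulation
print(block_sum(xs, 2))  # 0.0 -- each 0.1 is absorbed into 1e20 inside its block
```

Same inputs, same arithmetic, different answer. This is exactly the mechanism behind the batch-invariance failure, scaled down to four numbers.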
Floating-Point Non-Associativity
Floating-point addition is not associative. Summing the array [1e-10, 1e-5, 1e-2, 1] with its negatives produces 102 unique possible results depending on summation order. The classic example: (0.1 + 1e20) - 1e20 = 0, but 0.1 + (1e20 - 1e20) = 0.1. Lower precision formats like fp16 and bfloat16 amplify the problem. Every attention layer in a transformer performs thousands of these operations, and the accumulated drift is enough to flip token selections.
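Both claims are directly checkable in IEEE 754 doubles from a Python REPL:

```python
# The grouping example from the text: same three terms, different result.
a = (0.1 + 1e20) - 1e20   # 0.1 is absorbed into 1e20, then subtracted away
b = 0.1 + (1e20 - 1e20)   # the large terms cancel first, so 0.1 survives
print(a, b)  # 0.0 0.1

# Summation order alone flips the answer for identical inputs.
xs = [0.1, 1e20, -1e20]
print(sum(xs))            # 0.0 (left-to-right: 0.1 is absorbed early)
print(sum(reversed(xs)))  # 0.1 (the large terms cancel before 0.1 arrives)
```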
Hardware and Infrastructure Variance
Different GPU generations produce different floating-point behavior. vLLM explicitly states that reproducibility is only guaranteed on "the same hardware and same vLLM version." API providers like OpenAI and Anthropic update their infrastructure multiple times per year. Each update can change the system fingerprint and break any reproducibility guarantees the seed parameter was providing.
| Non-Determinism Source | Impact | Mitigation Possible? |
|---|---|---|
| Temperature/sampling | High | Yes, set to 0 (reduces but does not eliminate) |
| Batch size variation | High | Only with custom inference (20% performance cost) |
| Floating-point accumulation | Medium | Requires deterministic CUDA kernels |
| Hardware differences | Medium | Pin to specific hardware (expensive) |
| Infrastructure updates | Low-Medium | Monitor system fingerprint changes |
| KV cache state | Low | Requires stateless inference |
The Temperature Equals Zero Myth
OpenAI introduced a seed parameter in late 2023 as a "best effort" determinism feature. You set a seed integer alongside temperature=0 and top_p=1, and the API attempts to return identical outputs for identical inputs. The operative words are "best effort." OpenAI's own documentation warns of a "small chance that responses differ...due to inherent non-determinism." You can monitor the system_fingerprint field in the response to detect when the backend has changed, but there is nothing you can do about it when it does.
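You can at least refuse to compare runs across a fingerprint change. A minimal guard, assuming the Chat Completions response shape (`system_fingerprint`, `choices[0].message.content`); the two dicts below stand in for responses to the same seeded request:

```python
# Sketch: two seeded runs are only comparable for determinism testing
# when the backend fingerprint is unchanged between them.
def seeded_runs_match(resp_a, resp_b):
    fp_a = resp_a.get("system_fingerprint")
    fp_b = resp_b.get("system_fingerprint")
    if not fp_a or fp_a != fp_b:
        return False  # backend changed (or fingerprint missing): expect drift
    content = lambda r: r["choices"][0]["message"]["content"]
    return content(resp_a) == content(resp_b)

stable = {"system_fingerprint": "fp_abc", "choices": [{"message": {"content": "42"}}]}
updated = {"system_fingerprint": "fp_xyz", "choices": [{"message": {"content": "42"}}]}
print(seeded_runs_match(stable, stable), seeded_runs_match(stable, updated))  # True False
```

Matching content across different fingerprints proves nothing about reproducibility, which is why the guard rejects the pair outright.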
I tested this extensively. Even with seed pinning and fingerprint matching, I observed drift across runs separated by hours. The deterministic inference performance trade-offs tell the story:
| Configuration | Time (1,000 sequences) | Slowdown |
|---|---|---|
| Default (non-deterministic) | 26 seconds | Baseline |
| Deterministic, unoptimized | 55 seconds | 2.1x |
| Deterministic, optimized | 42 seconds | 1.6x |
| Batch-invariant matmul | ~20% throughput loss vs. cuBLAS | Significant |
True deterministic inference costs 1.6x to 2.1x in throughput. API providers are not absorbing that cost for you. When they say "best effort," they mean it.
The Real Problem: Agents, Not Models
Here is the insight that changed everything for me. The non-determinism that was destroying my audit results was not coming from the model. It was coming from the agent architecture. The model might generate slightly different phrasing across runs, but the catastrophic variance, where one run finds 20 issues and the next finds 8, came from three agent-level failures.
Sampling Instead of Enumerating
When I told an agent to "check all packages for required files," it would check a representative sample. Not because the model is lazy, but because the prompt left room for interpretation. The agent decided that checking 30 of 144 packages was sufficient to establish a pattern. Next run, it sampled a different 30. Different sample, different findings.
Subjective Judgment Instead of Objective Measurement
"README looks comprehensive" is not a check result. It is a subjective judgment that varies with the model's interpretation of "comprehensive." One run might decide a README needs an environment variables section. The next run might not. Both are defensible interpretations. Neither is deterministic.
Context Pressure Under Load
When an agent covers too many repositories in a single session, it hits context window pressure and starts summarizing, skipping, and abbreviating. An agent covering 15 repos produces different results than three agents covering 5 repos each, even when checking the same things. The three-agent approach consistently finds more issues because each agent has room to be thorough.
Building Deterministic Agent Architectures
The solution is to stop trying to make the model deterministic and instead build an architecture where the model's non-determinism cannot affect the outcome. I call this approach "deterministic scaffolding around a non-deterministic core."
Strategy 1: Enumerate, Never Sample
Every check that touches a countable set must enumerate the full set and report exact counts. Not "checked several packages" but "checked 144/144 packages." The prompt must be explicit:
```
Check ALL 144 packages. For each package, verify these 5 files exist:
__init__.py, README.md, schema.json, metadata.yaml, tests/
Report format: "X/144 packages have all 5 required files."
List every package that is missing any file.
```
The key is that the expected count (144) is in the prompt. The agent cannot decide to sample because the prompt demands a specific denominator. If the agent reports "142/144," I know exactly which two failed. If it reports "all packages look good," I know the agent skipped the enumeration and I reject the result.
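The rejection can itself be mechanical. A sketch of a report gate that accepts only outputs carrying the exact expected denominator:

```python
import re

def validate_enumeration(report, expected_total):
    """Accept only reports of the form 'X/<expected_total> ...'.
    Returns (accepted, passed_count)."""
    m = re.search(r"(\d+)/%d\b" % expected_total, report)
    if not m:
        return (False, 0)   # no exact denominator: the agent sampled
    return (True, int(m.group(1)))

print(validate_enumeration("142/144 packages have all 5 required files", 144))  # (True, 142)
print(validate_enumeration("all packages look good", 144))                      # (False, 0)
```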
Strategy 2: Deterministic Commands Over Natural Language Inspection
Anywhere a check can be expressed as a bash command that produces a pass/fail result, use the command instead of asking the agent to inspect and judge.
| Check | Non-Deterministic Approach | Deterministic Approach |
|---|---|---|
| Package file inventory | "Check if packages have required files" | find packages/ -maxdepth 2 -name "schema.json" |
| Secret detection | "Look for hardcoded secrets" | grep -rn 'sk-ant-' src/ |
| Port consistency | "Verify ports match the reference" | grep 'port' docker-compose.yml |
| README completeness | "Check if README is comprehensive" | Count specific required section headers |
| Trademark consistency | "Verify trademark usage" | grep -rn 'ProductName®' *.html (must be 0) |
The bash command produces identical output every time. The agent's role shifts from inspector to executor: run the command, parse the output, report the result. The model's non-determinism is irrelevant because the measurement is deterministic.
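Wrapped in code, the executor pattern looks like this sketch (the package layout and required file names are illustrative):

```python
from pathlib import Path
import tempfile

REQUIRED = ["__init__.py", "README.md", "schema.json"]

def check_packages(root):
    """Enumerate every package and report exact counts, never a sample."""
    packages = sorted(p for p in Path(root).iterdir() if p.is_dir())
    failures = [p.name for p in packages
                if not all((p / name).exists() for name in REQUIRED)]
    return len(packages) - len(failures), len(packages), failures

# Demo tree: two complete packages and one missing schema.json.
root = tempfile.mkdtemp()
for pkg, files in [("alpha", REQUIRED), ("beta", REQUIRED), ("gamma", REQUIRED[:2])]:
    d = Path(root) / pkg
    d.mkdir()
    for name in files:
        (d / name).touch()

passed, total, failures = check_packages(root)
print(f"{passed}/{total} packages have all {len(REQUIRED)} required files")  # 2/3 ...
print("failing packages:", failures)  # ['gamma']
```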
Strategy 3: Canonical Sources of Truth
Every number that gets validated needs a single canonical source. I maintain a canonical.json file that contains every count that matters: patent application count, per-application claim counts, domain specification count, package count, lab count. Every check validates against this file, not against a number the agent remembers or derives.
```json
{
  "patent_applications": 25,
  "total_claims": 845,
  "domain_specs": 144,
  "synthesized_packages": 144,
  "lab_definitions": 89,
  "per_application_claims": {
    "Application A": 42,
    "Application B": 38
  }
}
```
When canonical numbers change (new domains synthesized, patents added), the canonical file gets updated first, and all subsequent audits validate against the updated truth. This eliminates a class of non-determinism where agents derive counts differently depending on how they traverse directory structures or parse documents.
Strategy 4: Phase Zero Automation
Before any agent-driven audit begins, a fully deterministic validation script runs first. This script is pure bash with no LLM involvement. It validates all mechanical checks: file counts, schema validation, git status across all repos, trademark consistency, directory structure compliance.
```bash
#!/bin/bash
# Phase 0: Deterministic validation - no LLM involved
CANONICAL=$(cat .audits/canonical.json)
EXPECTED_APPS=$(echo "$CANONICAL" | jq '.patent_applications')
ACTUAL_APPS=$(find CIP/ -maxdepth 1 -name "Application *" -type d | wc -l)
if [ "$ACTUAL_APPS" -ne "$EXPECTED_APPS" ]; then
  echo "FAIL: Expected $EXPECTED_APPS applications, found $ACTUAL_APPS"
  exit 1
fi
```
Phase Zero catches the mechanical failures before the agents even start. This means the agent-driven phases only handle checks that genuinely require semantic understanding: documentation quality, cross-reference consistency, architecture alignment. The surface area for non-determinism shrinks to only the checks where non-determinism is unavoidable.
Strategy 5: Agent Scope Limits
Context pressure is a determinism killer. An agent covering 15 repositories will hit context window limits and start degrading. The fix is simple: limit each agent to 4-5 repositories maximum. Split the work across more agents rather than overloading fewer.
| Agent Configuration | Repos per Agent | Findings Consistency |
|---|---|---|
| 1 agent, 42 repos | 42 | Poor (high variance between runs) |
| 3 agents, 14 repos each | 14 | Moderate (some summarization) |
| 7 agents, 6 repos each | 6 | Good (consistent findings) |
| 10 agents, 4-5 repos each | 4-5 | Excellent (near-deterministic) |
The trade-off is token cost. Ten agents consume more total tokens than three agents. But the ten-agent configuration produces reproducible results, and the three-agent configuration does not. Reproducibility is worth the extra tokens.
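Repo-to-agent assignment should be deterministic too, so the same fleet gets the same workload on every run. A sketch:

```python
def split_scope(repos, max_per_agent=5):
    """Deterministic chunking: sorting the input guarantees identical
    agent assignments on every run."""
    ordered = sorted(repos)
    return [ordered[i:i + max_per_agent]
            for i in range(0, len(ordered), max_per_agent)]

assignments = split_scope([f"repo-{i:02d}" for i in range(42)])
print(len(assignments), [len(a) for a in assignments])  # 9 agents; the last gets 2 repos
```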
Strategy 6: Structured Output Schemas
When agents must produce findings, force the output into a structured schema. Do not ask for free-text summaries.
Research from the Berkeley Function Calling Leaderboard shows the impact of structured output methods:
| Method | GPT-4o Accuracy | Claude 3.5 Sonnet Accuracy |
|---|---|---|
| Function Calling | 87.4% | 78.1% |
| Schema-Aligned Parsing | 93.0% | 94.4% |
| JSON Mode | ~90% | ~90% |
| Free-text Prompting | <50% | <50% |
Schema-Aligned Parsing (SAP), where you post-process the output with schema knowledge rather than constraining generation, achieves the highest consistency. The agent generates its finding, and a deterministic parser validates and normalizes the output against the expected schema. Malformed findings get rejected and re-requested.
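A sketch of the parse-then-validate step (the schema and field names here are illustrative, not BAML's API):

```python
import json
import re

SCHEMA = {"repo": str, "check": str, "status": str, "count": int}

def parse_finding(raw):
    """Pull the first JSON object out of free-form model output and
    normalize it against the schema. None means reject and re-request."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    finding = {}
    for key, coerce in SCHEMA.items():
        if key not in data:
            return None                      # missing field: malformed finding
        try:
            finding[key] = coerce(data[key])  # normalize types, e.g. "12" -> 12
        except (TypeError, ValueError):
            return None
    return finding

raw = 'Here is the finding:\n{"repo": "api", "check": "ports", "status": "PASS", "count": "12"}'
print(parse_finding(raw))  # {'repo': 'api', 'check': 'ports', 'status': 'PASS', 'count': 12}
```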
Strategy 7: Separation of Detection from Remediation
Never let the same agent detect and fix issues in a single pass. Detection and remediation are different cognitive modes with different determinism requirements.
The detection phase produces a structured findings report: every issue found, grouped by repository, with severity classification and exact counts. This report gets dumped to the terminal in full before any fixes begin. The human (or a separate remediation agent) reviews the findings and then executes fixes.
This separation matters because:
- Detection agents can be re-run to verify reproducibility without the confounding effect of fixes from previous runs.
- The findings report serves as an audit trail. If detection and remediation are interleaved, you cannot reconstruct what was found versus what was changed.
- Remediation agents can be given the specific findings list rather than re-discovering issues, eliminating another source of variance.
Strategy 8: Completeness Attestation
Every audit report must include a completeness attestation: "X total checks executed across Y repos. Z items examined exhaustively (N packages, M specs, P applications)." This line makes it immediately obvious if an agent skipped or sampled anything.
If the attestation says "321 checks across 42 repos, 144 packages verified, 25 applications verified, 845 claims counted," I can compare that against the canonical numbers and confirm exhaustiveness. If it says "checked several hundred items across the repos," I reject the audit and re-run it.
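Checking the attestation against canonical numbers is itself a deterministic lookup. A sketch using the article's example counts:

```python
import re

CANONICAL = {"packages": 144, "applications": 25, "claims": 845}

def attestation_ok(line):
    """Every canonical item must appear with its exact count."""
    for noun, expected in CANONICAL.items():
        m = re.search(r"(\d+)\s+%s" % noun, line)
        if not m or int(m.group(1)) != expected:
            return False   # missing item or wrong count: reject the audit
    return True

good = ("321 checks across 42 repos, 144 packages verified, "
        "25 applications verified, 845 claims counted")
print(attestation_ok(good))                             # True
print(attestation_ok("checked several hundred items"))  # False
```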
The Idempotency Test
The ultimate validation for a deterministic agent architecture is the idempotency test: run the audit twice in a row with no code changes between runs. If the second run produces identical findings to the first, the architecture is working. If it finds different issues, the methodology is broken.
I failed this test for three weeks before arriving at the architecture described above. The failure modes were instructive:
| Failure Mode | Root Cause | Fix |
|---|---|---|
| Different issue counts between runs | Agent sampling different subsets | Enumerate, never sample |
| Issues "appearing" on second run | First run's agent hit context limits | Reduce repos per agent to 4-5 |
| Subjective findings varying | "Looks good" vs. "needs improvement" | Replace with deterministic commands |
| Count mismatches | Agents deriving counts differently | Canonical JSON source of truth |
| Phantom fixes | Remediation interleaved with detection | Separate detection from remediation |
After implementing all eight strategies, the audit achieved idempotency. Two consecutive runs against identical code produce identical findings. The false impression that the codebase was "perpetually broken" disappeared. It was never the codebase. It was the audit.
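The gate can be automated: digest the findings of two back-to-back runs and compare. `run_audit` below is a placeholder for the real pipeline:

```python
import hashlib
import json

def findings_digest(findings):
    """Order-independent digest of a findings list."""
    canon = json.dumps(sorted(json.dumps(f, sort_keys=True) for f in findings))
    return hashlib.sha256(canon.encode()).hexdigest()

def idempotency_test(run_audit):
    """Run the audit twice with no changes in between; pass iff the
    findings are identical."""
    return findings_digest(run_audit()) == findings_digest(run_audit())

deterministic = lambda: [{"repo": "api", "issue": "missing env vars section"}]
print(idempotency_test(deterministic))  # True

drifting = iter([[{"repo": "api"}], [{"repo": "web"}]])
print(idempotency_test(lambda: next(drifting)))  # False
```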
The Hybrid Approach: Deterministic Guardrails with Semantic Reasoning
Not every check can be reduced to a bash one-liner. Documentation quality assessment, cross-reference consistency between prose documents, and architecture alignment require semantic understanding that only an LLM can provide. The solution is a hybrid architecture:
- Phase Zero (fully deterministic): Bash scripts validate all mechanical checks against canonical data. No LLM involvement. Produces identical results every time.
- Phase One (constrained LLM): Agents execute structured checks with enumeration requirements, scope limits, and output schemas. The LLM handles checks that require reading comprehension, but the scaffolding constrains its behavior.
- Phase Two (semantic LLM): Agents perform genuinely subjective assessments (documentation quality, naming consistency, architecture coherence). These findings are flagged as "semantic" in the report and are expected to have some variance between runs.
- Validation gate: Post-fix, Phase Zero re-runs to confirm mechanical checks still pass. If Phase Zero regresses, fixes are applied before finalizing.
The key insight is that you do not need 100% determinism. You need determinism where it matters (counts, file existence, security checks, compliance validation) and can tolerate variance where it does not (documentation style suggestions, naming recommendations). Labeling which checks are deterministic and which are semantic lets you set appropriate expectations for reproducibility.
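In code, the phase ordering reduces to a gate. A sketch with stub phase functions (the labels and shapes are illustrative):

```python
def run_hybrid(phase0, phase1, phase2):
    """Phase 0 gates everything; semantic findings are labeled so
    readers know which results may vary between runs."""
    mechanical = phase0()            # pure scripts, no LLM
    if mechanical:                   # fail fast: fix before agents run
        return [dict(f, phase=0, semantic=False) for f in mechanical]
    findings = [dict(f, phase=1, semantic=False) for f in phase1()]
    findings += [dict(f, phase=2, semantic=True) for f in phase2()]
    return findings

out = run_hybrid(lambda: [],                       # Phase 0 clean
                 lambda: [{"check": "ports"}],     # constrained checks
                 lambda: [{"check": "doc tone"}])  # semantic checks
print([(f["phase"], f["semantic"]) for f in out])  # [(1, False), (2, True)]
```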
Cost and Performance Trade-Offs
Deterministic architectures cost more tokens. There is no way around this. Here are the real numbers from my production system:
| Approach | Agents | Tokens per Run | Run Time | Findings Consistency |
|---|---|---|---|---|
| Single agent, open prompt | 1 | ~50K | 15 min | Poor (30-50% variance) |
| Three agents, structured | 3 | ~150K | 25 min | Moderate (10-20% variance) |
| Ten agents, full deterministic | 10 | ~400K | 45 min | Excellent (<2% variance) |
The ten-agent deterministic approach costs 8x more tokens than the single-agent approach. It takes 3x longer. And it is the only approach that produces results I can trust. The cheaper approaches generate findings reports that look plausible but are not reproducible, which means every "finding" requires manual verification, which means the agent is not actually saving time.
The 8x token cost is the price of trustworthy automation. It is worth paying.
Lessons from Production
After running this architecture daily for over a month across 42 repositories, several patterns have emerged.
Agent scope is the biggest lever. Reducing repos-per-agent from 14 to 5 improved findings consistency more than any other single change. Context pressure is the silent killer of determinism.
Canonical data files eliminate an entire class of bugs. Before introducing canonical.json, different agents would derive different counts for the same metric depending on how they traversed the filesystem. One agent counted 143 packages because it excluded a hidden directory. Another counted 145 because it included a template directory. The canonical file ended the argument.
Phase Zero catches 60-70% of issues. Most audit failures are mechanical: missing files, wrong counts, uncommitted changes, stale documentation. A bash script finds these faster and more reliably than any LLM agent. The agents should focus on the 30-40% of checks that require semantic understanding.
The idempotency test is your quality gate. Run it weekly. When it fails, something in the architecture has drifted. Fix the architecture, not the symptoms.
Non-determinism is a feature for some tasks. Code review suggestions, documentation improvement ideas, and architecture recommendations benefit from variance. You want different perspectives across runs. Just do not mix these semantic tasks with deterministic validation in the same report.
Key Patterns and Takeaways
- Temperature equals zero is necessary but not sufficient. It reduces variance but does not eliminate it. Batch invariance and floating-point effects dominate.
- Determinism lives in the scaffolding, not the model. Build deterministic architectures around non-deterministic models. Constrain the agent's behavior so its non-determinism cannot affect outcomes.
- Enumerate, never sample. Every check against a countable set must enumerate the full set and report exact counts. Put the expected count in the prompt.
- Prefer deterministic commands over natural language inspection. If a check can be expressed as a bash one-liner, use the command.
- Maintain canonical sources of truth. Every validated number should come from a single authoritative file, not from agent-derived counts.
- Limit agent scope to 4-5 repos. Context pressure causes non-deterministic degradation. More agents with less scope beats fewer agents with more scope.
- Separate detection from remediation. Never interleave finding issues with fixing them. The findings report is your audit trail.
- Test for idempotency. If two consecutive runs against identical code produce different findings, the architecture is broken.
- Accept that some checks are inherently non-deterministic. Label them as semantic and set appropriate expectations. Do not try to force determinism where it is not achievable.
- Pay the token cost. Deterministic architectures cost more. The alternative is untrustworthy automation, which costs more in human verification time than the tokens ever will.
Appendix A: Sanitized Deployment Readiness Audit Specification
The following is a sanitized version of the actual audit specification I use to validate deployment readiness across 42 repositories daily. Project names, domain-specific details, and proprietary references have been replaced with generic equivalents. The structure and methodology are unchanged.
This document evolved through three weeks of iteration. The first version was a paragraph of prose that said "check all repos for common issues." The current version is 300+ lines of explicit checks, severity classifications, and agent scope rules. Every line exists because a previous audit produced non-deterministic results without it.
Preamble and Philosophy
# Full Deployment Readiness Audit
This audit must be run with a clean context — start a fresh session
with no prior conversation history.
## Exhaustive Verification — Mandatory
This is an audit, not a spot-check. Every check described in this
document must be performed exhaustively against every applicable item.
No sampling. No "spot-checking a representative subset." No "verified
a few and they all passed." If a check says "for every package" it
means every single one of the 144+ packages. If it says "for each
service" it means all services.
### Rules for audit agents
1. Enumerate, don't sample. When checking N items, check all N.
Report the count checked and the count that passed/failed.
Example: "144/144 packages have all 5 required files"
not "checked several packages and they all look good."
2. Use deterministic commands. Prefer find | wc -l, grep -c, diff,
and scripted loops over manual inspection.
3. Report exact counts. Every check result must include the number
of items examined and the number that passed.
"PASS (845/845 specs valid)" not just "PASS."
4. No subjective judgments. "README looks comprehensive" is not a
check result. "README contains: overview (line 1-15), tech stack
(line 16-22), setup (line 23-40), env vars (line 41-55) — 4/4
required sections present" is.
5. Itemize failures completely. If 3 out of 144 packages are
missing a file, list all 3 by name.
6. Idempotency. If the audit is run twice in a row with no code
changes in between, it must produce identical findings.
7. Agent scope limits. Each audit agent should cover at most 4-5
repos. If an agent covers more, it will hit context pressure
and start skipping or summarizing.
Phase Zero: Deterministic Pre-Flight
## Phase 0: Canonical Validation (automated, run first)
Before any agent-driven audit begins, run the deterministic
validation script:
bash .audits/scripts/validate-canonical.sh
This script validates all mechanical checks against canonical.json:
- Component count and per-component sub-counts
- Schema validation (all specs against schema.json)
- Git status across all repos (no uncommitted changes)
- Naming convention consistency
- Cleanup directory detection and removal
- .gitignore completeness across all repos
If any check fails, fix the issue before proceeding to agent phases.
When canonical numbers change, update canonical.json to match
reality. The canonical file is the single source of truth —
all repos should reference these numbers rather than maintaining
their own copies.
Per-Category Check Templates
## Backend Services (run for EACH service)
1. Git status: Clean working tree, up to date with origin.
2. Fast tests: Run pytest -m "not slow" --tb=short -q.
Record: passed, failed, skipped. Zero failures.
3. Coverage: Run with --cov. Must be >= 75%.
4. Security — hardcoded secrets: Grep .py files for API key
patterns, password literals, secret literals.
Exclude test files and .env.example.
5. Security — .env not tracked: git ls-files .env must
return empty. .env.example SHOULD be tracked.
6. Dependency declaration: Check whether deps are in
pyproject.toml or requirements.txt. Flag if test deps
are mixed with runtime deps.
7. Required docs: Verify docs/ contains requirements.md,
design.md, and testing-strategy.md.
8. README completeness: Verify README contains: project
overview, tech stack, local setup, Docker instructions,
environment variables list, API endpoint overview.
Flag any missing section.
9. Port consistency: Check that port numbers in
Dockerfile/docker-compose/uvicorn config match the
canonical port reference document.
10. Commit and push all changes.
## Frontend Clients (run for EACH client)
1. Git status: Clean working tree, up to date with origin.
2. ESLint: Run npx eslint src/ --max-warnings=0.
Zero errors required. Warnings are MEDIUM.
3. TypeScript: Run npx tsc --noEmit. Zero errors required.
4. Security — no tracked secrets: Grep source files for
API key patterns. Exclude test mocks gated behind
environment variable checks.
5. Security — .env not tracked and .env.example present.
6. API endpoint URLs: Grep for localhost: and http://.
Every URL must come from an env var or be a fallback
default matching the canonical port reference.
7. Test script exists: Check package.json for a test script.
8. Feature parity (desktop vs web): Compare directory
listings between web and desktop clients. List any
screen present in one but not the other.
9. README completeness: overview, tech stack, setup,
build instructions, env vars.
10. Commit and push all changes.
Findings Report and Fix Protocol
## Findings Report
After all checks complete but BEFORE starting any fixes,
dump the complete findings as a structured report:
- Summary table of all repos with pass/fail per category
- Every issue grouped by repo with severity
(CRITICAL / HIGH / MEDIUM / LOW)
- Test results (pass/fail counts, coverage percentages)
- Security findings
- Stale documentation and incorrect cross-references
Completeness attestation: "X total checks executed across
Y repos. Z items examined exhaustively (N packages, M specs,
P components)."
Only after the findings report is fully displayed should you
proceed to fixes.
## Fixes
- For every defect found, write a regression test covering
the errant code path.
- For stale documentation, update files to be current.
- After all fixes, re-run Phase 0 to confirm mechanical
checks still pass. The goal is that a hypothetical
immediate second audit run would produce zero findings.
Appendix B: Sanitized Diagram Audit Specification
This is a sanitized version of the specification I use for exhaustive verification of technical diagrams against their corresponding specifications. The original audits diagrams for patent filings, but the architecture applies to any system where diagrams must accurately represent specifications: architecture diagrams vs. design docs, data flow diagrams vs. API specs, or infrastructure diagrams vs. deployment runbooks.
This specification was created after discovering that diagram audits were the most non-deterministic part of the entire system. Agents would flag different issues each run, miss obvious mismatches, and invent severity categories not in the specification. The solution was extreme explicitness: every possible finding has a pre-defined ID, severity classification is mandatory (agents cannot invent severities), and pre-flight checks handle everything that can be checked deterministically.
Severity Classification
## Severity Classification (mandatory — agents MUST NOT invent severities)
Agents do not decide severity. Every finding must match exactly
one of these categories. If a finding does not match any category,
it is ACCEPT (do not report it).
### Automatic FAIL
| ID | Rule |
|---------------------|--------------------------------------------------------|
| F-MISSING-NODE | Spec describes a component and diagram has no node |
| F-MISSING-NUMERAL | Reference numeral in spec section not in diagram |
| F-DUPLICATE-NUMERAL | Same numeral on two distinct nodes in one figure |
| F-PHANTOM-NUMERAL | Numeral in diagram does not exist in spec |
| F-WRONG-CONCEPT | Numeral labels the wrong concept vs. spec |
| F-EDGE-LABEL | Prohibited edge label syntax used |
| F-ID-COLLISION | Identifier used as both subgraph ID and node ID |
| F-LOGIC-ERROR | Diagram flow contradicts spec logic |
| F-DOTTED-FORWARD | Forward data flow uses dotted arrow (reserved for |
| | back-edges only) |
| F-SOLID-BACKEDGE | Back-edge uses solid arrow instead of dotted |
| F-CROSS-CONFLICT | Same numeral refers to different concepts across figs |
### Automatic WARN
| ID | Rule |
|------------------------|-----------------------------------------------------|
| W-DECISION-INCOMPLETE | Decision node has fewer outcomes than spec describes |
| W-MISSING-PARENT | Sub-components shown without parent grouping |
| W-ORPHAN-NODE | Non-entry node has no incoming edge |
| W-COVERAGE-GAP | A spec element has no visual representation |
| W-MISSING-HEADER | Diagram file lacks required header comment block |
### Automatic ACCEPT (do NOT report)
| ID | Rule |
|------------------|-------------------------------------------------------|
| A-COLLAPSED-SUB | Sub-component numerals collapsed into parent range |
| A-UNLABELED-IO | Input/output nodes without spec-assigned numerals |
| A-LONG-LABEL | Labels exceeding convention when improving clarity |
| A-ABBREVIATION | Minor label abbreviations vs. spec wording |
| A-SYNONYM-HEADER | Header using synonymous terms |
Pre-Flight Deterministic Checks
## Pre-Flight Checks (deterministic — run BEFORE reading spec)
For each component, run these bash checks FIRST and include
results in the findings.
### a. Prohibited Syntax Patterns
grep -nP 'PATTERN' Diagrams/*.mmd || echo "CLEAN"
Any match → F-EDGE-LABEL per file:line.
### b. Prohibited Node Types
grep -n 'PATTERN' Diagrams/*.mmd || echo "CLEAN"
Any match → F-PSEUDOSTATE per file:line.
### c. Style Inventory
grep -n 'PATTERN' Diagrams/*.mmd || echo "NONE"
List every match. Agent classifies each using
the cycle test (would removing the edge break a cycle?).
### d. Header Validation
for f in Diagrams/*.mmd; do head -3 "$f"; done
Each file must have required header comments.
### e. Duplicate Reference Detection
for f in Diagrams/*.mmd; do
grep -oP '\(\d+\)' "$f" | sort | uniq -d
done
Any output → potential F-DUPLICATE-NUMERAL.
### f. Identifier Collision Detection
Compare subgraph IDs against node IDs.
Any overlap → F-ID-COLLISION.
Agent Isolation and Output Schema
## Agent Architecture
One agent per component. Each agent gets a clean context with
only its component's spec and diagrams. This prevents context
pollution across components.
## Agent Output Schema
Return a findings table with ONLY FAIL and WARN rows:
| Fig | ID | Severity | Detail |
|-----|-----------------|----------|---------------------------------|
| 3 | F-DOTTED-FORWARD| FAIL | Line 24: A -.-> B is forward |
Then return structured JSON for deterministic aggregation:
```json
{
  "component": "X",
  "figures_checked": 8,
  "fail_count": 2,
  "warn_count": 3,
  "clean_figures": [1, 5, 6],
  "findings": [
    {
      "fig": 3,
      "id": "F-DOTTED-FORWARD",
      "severity": "FAIL",
      "line": 24,
      "detail": "A -.-> B is forward input, not back-edge"
    }
  ]
}
```
## Baseline Comparison
After all agents complete, write findings to baseline.json.
On subsequent audits, classify each finding as:
- NEW: not in previous baseline (regression)
- KNOWN: matches a previous finding (existing debt)
- FIXED: in previous baseline but not found now (resolved)
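A sketch of the classification step; treating a finding's identity as the (fig, id, line) triple is an illustrative choice, not the spec's mandated key:

```python
def classify_findings(previous, current):
    """Split current findings into NEW/KNOWN and detect FIXED ones
    relative to the previous baseline."""
    key = lambda f: (f["fig"], f["id"], f.get("line"))
    prev_keys = {key(f) for f in previous}
    curr_keys = {key(f) for f in current}
    return {
        "NEW":   [f for f in current if key(f) not in prev_keys],
        "KNOWN": [f for f in current if key(f) in prev_keys],
        "FIXED": [f for f in previous if key(f) not in curr_keys],
    }

baseline = [{"fig": 3, "id": "F-DOTTED-FORWARD", "line": 24}]
latest = [{"fig": 3, "id": "F-DOTTED-FORWARD", "line": 24},
          {"fig": 5, "id": "W-ORPHAN-NODE", "line": 10}]
result = classify_findings(baseline, latest)
print({k: len(v) for k, v in result.items()})  # {'NEW': 1, 'KNOWN': 1, 'FIXED': 0}
```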
Exception Suppression
## Exception Suppression
Before reporting findings, read exceptions.json.
- Global exceptions: suppress matching findings across all
figures in all components.
- Per-figure exceptions: suppress for specific figures only.
- Per-component exceptions: suppress across all figures in
a component.
If a finding is suppressed by an exception, omit it entirely.
Do NOT report it as "suppressed."
The exception mechanism is critical for determinism. Without it, agents would re-report known false positives on every run, creating noise that obscures genuine regressions. The exception file is the canonical list of "we know about this and it is not a problem." It turns what would be subjective agent judgment ("should I flag this again?") into a deterministic lookup.
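The lookup itself is a few lines. A sketch covering the global and per-figure cases, with an exception-file shape that is illustrative rather than the actual format:

```python
def is_suppressed(finding, exceptions):
    """Deterministic suppression: report a finding only if no exception
    (global or per-figure) matches its ID."""
    if finding["id"] in exceptions.get("global", []):
        return True
    per_fig = exceptions.get("per_figure", {}).get(str(finding["fig"]), [])
    return finding["id"] in per_fig

exceptions = {"global": ["A-LONG-LABEL"],
              "per_figure": {"3": ["W-ORPHAN-NODE"]}}
findings = [{"fig": 3, "id": "W-ORPHAN-NODE"},   # suppressed for figure 3 only
            {"fig": 4, "id": "W-ORPHAN-NODE"}]   # still reported
reported = [f for f in findings if not is_suppressed(f, exceptions)]
print(reported)  # [{'fig': 4, 'id': 'W-ORPHAN-NODE'}]
```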
Additional Resources
- Defeating Nondeterminism in LLM Inference by Thinking Machines Lab. Deep technical analysis of batch invariance and floating-point sources of non-determinism.
- Why Temperature=0 Does Not Guarantee Deterministic LLM Outputs by Michael Brenndoerfer. Empirical testing of the temperature-zero myth across multiple models.
- Non-Determinism of "Deterministic" LLM Settings (2024, ACM Transactions on Software Engineering and Methodology). Academic study documenting 70 percentage point accuracy swings across identical conditions.
- R-LAM: Reproducibility-Constrained Large Action Models (2025). Framework achieving perfect reproducibility without degrading task performance.
- IMMACULATE: Practical LLM Auditing Framework (2025). Sampling-based audit approach requiring only 3,000 queries to detect 10% fraud with 95% confidence.
- Deterministic AI Architecture by Kubiya. The "Thin Agent / Fat Platform" model for production agent systems.
- Schema-Aligned Parsing vs. Function Calling by BAML. Benchmark data showing SAP outperforms constrained generation for structured outputs.
- OpenAI Reproducible Outputs Cookbook. Official documentation on the seed parameter and its limitations.
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.
