exhibit A / live

LLMs assert.
Proof Engine proves.

An open-source skill that makes every factual claim carry its receipts. Numbers get computed in Python. Quotes get fetched from live URLs and matched against the page. A fabricated citation fails the match — and the verdict downgrades so the gap is visible instead of hidden.

130 proofs · 19 domains · 368 sources checked · MIT · auditable · re-runnable
§01 · premise

The witness cannot corroborate itself.

LLMs hallucinate facts and hallucinate the checks on those facts. Asking one to verify its own answer runs the same error-prone process twice.

Search-grounded AI finds pages, but still lets the model summarize them. A link says "the evidence is over there." A proof says: here is the evidence, here is the code that checked it, here is what happened when it ran.

Proof Engine routes every claim through a gate the LLM can't fake. Python doesn't hallucinate. A fabricated citation fails to match live page text. A partial match downgrades the verdict so the gap is visible — instead of smoothed away by a confident summary.

The LLM is an untrusted author. Every claim it makes passes through a deterministic checkpoint. When it hallucinates, the pipeline breaks visibly instead of hiding the error.

The model still does useful work: it drafts code, finds sources, formalizes claims. It just doesn't get to be the verification.

The circular check
LLM(LLM(claim)) ≠ proof

The proof pipeline
fetch(url) ∘ match(quote) ∘ compute(python) → verdict
— deterministic, replayable, breakable
§02 · how it works

Five steps. No LLM in the verification path.

Each step produces an artifact that survives independently of the model that drafted it. You can re-run the whole pipeline from source, offline, in Python.
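The whole gate fits in a few lines. The sketch below is an illustrative assumption about how the five steps compose, not the engine's actual API; `fetch`, `match`, and `proof` stand in for the real step implementations, and the verdict rules are simplified:

```python
def verify(claim: str, quotes: dict[str, str], fetch, match, proof) -> str:
    """Run the deterministic gate for one claim.

    `quotes` maps source URL -> quoted sentence. The verdict rules here
    are an illustrative assumption, not the engine's exact lattice.
    """
    pages = {url: fetch(url) for url in quotes}                    # 02 · fetch sources
    matched = all(match(q, pages[u]) for u, q in quotes.items())   # 03 · verify quotes
    holds = proof()                                                # 04 · run proof.py
    if holds and matched:
        return "PROVED"
    if holds:
        return "PARTIAL"   # computation holds, but a citation failed to match
    return "DISPROVED"
```

Note the design point: the LLM drafts `quotes` and `proof`, but nothing in this function calls a model.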

01
claim input
Any factual assertion — a viral myth, a stat, a mathematical identity, a VC pitch-deck number. The LLM decomposes it into sub-claims (SC1, SC2…) and extractable facts (B1, B2, A1…).
# from a session
claim = "0.999… < 1"
decompose(claim) → [SC1: "0.999 repeating equals 1"]
02
fetch sources network
Academic papers, government data, reference encyclopedias — never the model's memory. Every URL goes in the audit trail with HTTP status, fetch mode (live, wayback, snapshot), and credibility tier.
fetch("planck.esa.int/2018-legacy") → 200 // tier-1 · academic
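A fetch step like the one above might look like this in stdlib Python. The audit-record field names (`status`, `mode`, `tier`) mirror the description but are assumptions about the engine's actual schema, and the credibility map is illustrative:

```python
import urllib.request

# Illustrative credibility map; the real engine's tiering is richer.
TIERS = {"planck.esa.int": "tier-1 · academic"}

def audit_entry(url: str, status: int, mode: str = "live") -> dict:
    """Build one audit-trail record: URL, HTTP status, fetch mode, tier."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return {"url": url, "status": status, "mode": mode,
            "tier": TIERS.get(host, "unrated")}

def fetch(url: str) -> tuple[str, dict]:
    """Fetch a live page and return (text, audit record). Network required."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        text = resp.read().decode("utf-8", "replace")
        return text, audit_entry(url, resp.status)
```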
03
verify quotes text match
Each quoted sentence must appear on the live page, modulo Unicode/HTML normalization. Partial matches downgrade the verdict. Fabricated citations break the pipeline visibly, with coverage percentages in the audit log.
match(quote="Ω_Λ = 0.6853 ± 0.0074", page) → verified // 100% word coverage
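The coverage score can be computed without any model in the loop. This sketch shows the idea, NFKC normalization, lowercasing, then scoring the best aligned window of words; the exact normalization rules and thresholds are assumptions, not the engine's actual code:

```python
import re
import unicodedata

def word_coverage(quote: str, page_text: str) -> float:
    """Fraction of the quote's words matched, in order, against the page."""
    def tokens(s: str) -> list[str]:
        # Unicode/HTML-entity differences collapse under NFKC + lowercase.
        return re.findall(r"\w+", unicodedata.normalize("NFKC", s).lower())

    q, p = tokens(quote), tokens(page_text)
    if not q:
        return 0.0
    best = 0
    for i in range(len(p) - len(q) + 1):
        hits = sum(1 for a, b in zip(q, p[i:i + len(q)]) if a == b)
        best = max(best, hits)
    return best / len(q)
```

A 100% score verifies the quote; anything less downgrades the verdict, and 0% is a fabricated citation breaking visibly.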
04
run proof.py python
Deterministic computation: sympy for exact math, numpy for quantitative work, every constant version-controlled. Anyone can run python proof.py and see the same result.
compare(omega_lambda, threshold=0.68, op=">") → True
assert "0.6853 > 0.68" # holds
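The comparison above can be re-run with exact rationals so float round-off can't blur the threshold. A minimal sketch, using the best-fit value quoted in the audit line (the real proof tracks the uncertainty separately):

```python
from sympy import Rational

omega_lambda = Rational(6853, 10000)   # Ω_Λ = 0.6853, exact
threshold = Rational(68, 100)          # 0.68, exact

assert omega_lambda > threshold        # holds: 0.6853 > 0.68
```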
05
verdict output
Structured outcome: PROVED, DISPROVED, PARTIAL, SUPPORTED, or UNDETERMINED. Every verdict ships with a Jupyter notebook, PROV-JSON provenance chain, and an RO-Crate 1.1 archive bundle.
verdict: PROVED // Ω_Λ = 0.6853 > 0.68 — confirmed
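The structured outcome is easy to model; this dataclass sketch is an assumption about shape, not the engine's actual types, but the five verdict labels are the ones listed above:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    PROVED = "PROVED"
    DISPROVED = "DISPROVED"
    PARTIAL = "PARTIAL"
    SUPPORTED = "SUPPORTED"
    UNDETERMINED = "UNDETERMINED"

@dataclass
class ProofResult:
    claim: str
    verdict: Verdict
    audit: list = field(default_factory=list)   # citation-by-citation trail

result = ProofResult(
    claim="dark energy > 68% of total energy density",
    verdict=Verdict.PROVED,
)
```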
§03 · the difference

"Why not just ask the model?"

Same claim, same sources, two radically different artifacts. The one on the left is a confident summary. The one on the right is a re-runnable script with a trace.

typical LLM
prompt: "does using AI tools make humans worse at critical thinking?"
"The claim is too absolute to confirm or deny cleanly. The real picture appears to be that passive, over-reliant use degrades critical thinking, while active, interrogative use can augment it — making the unqualified 'makes humans worse' framing false as a universal statement."
  • hedges into unfalsifiability — "too absolute," "context-dependent"
  • names one 2025 study without a URL, quote, or coverage check
  • invokes "cognitive-offloading theory" without citing a source
  • concludes "false as a universal statement" — but no universal was claimed
  • a second ask re-runs the same mechanism; the hedge persists
✗ plausible-sounding · zero provenance
proof engine
same claim → verdict + audit trail
PROVED — four independent research groups, different institutions and methods, reach the same association: AI tool use correlates with measurable drops in critical-thinking scores.
  • B1 Gerlich 2025, n=666 — negative correlation (r=-0.68), cognitive offloading (quote-verified)
  • B2 Lee et al. 2025 (Microsoft Research / CHI), n=319 knowledge workers — confidence ↑ → critical effort ↓ (quote-verified)
  • B3 Harvard Gazette 2025 — faculty cross-discipline panel, same concern (quote-verified)
  • B4 Jose et al. 2025 (Frontiers / PMC) — ChatGPT users solved 48% more problems but scored 17% lower on concept tests (quote-verified)
  • verdict qualifier in the record: correlation, not proven causation; routine use > high-stakes use
✓ 4 sources verified · consensus threshold met · re-runnable · read the full proof →
§04 · exhibit room

Think you know the answer?

11 claims, verdicts redacted. Tap a card to reveal what the pipeline actually found. Every one of them is a re-runnable Python script in the catalog.

tap to reveal · R to reveal all
01 · mathematics / myths
0.999... (with 9s repeating forever) is strictly less than 1.
DISPROVED
The claim that 0.999... (with 9s repeating forever) is strictly less than 1 is false: the repeating decimal equals exactly 1.
read proof →
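The card above is settled by summing the geometric series exactly. A sketch of the step-04 computation, assuming sympy (the library the pipeline already uses):

```python
from sympy import Sum, Rational, oo, symbols

k = symbols("k", integer=True, positive=True)
# 0.999... is the geometric series 9/10 + 9/100 + ...; sympy sums it exactly.
nines = Sum(9 * Rational(1, 10) ** k, (k, 1, oo)).doit()

assert nines == 1   # so "0.999... < 1" is DISPROVED
```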
02 · cosmology
Dark energy constitutes more than 68% of the universe's total energy density according to the Planck 2018 legacy release.
PROVED (with unverified citations)
The Planck 2018 legacy release reports ΩΛ = 0.6853 ± 0.0074, which is strictly greater than 0.68.
read proof →
03 · mathematics
The binary operator eml is defined by the expression eml(a, b) = (a) - (b). There exists a finite binary tree consisting solely of eml operations, whose 9 leaves are drawn from {1, x, y}, such that the tree evaluates exactly to x × y. The tree has K = 17 tokens (8 eml operations and 9 leaves), and the identity holds for all complex x and y (in the algebraic setting where 1 is the multiplicative identity).
PROVED
**Verdict: PROVED.** The 17-token expression eml(eml(1, eml(eml(eml(1, eml(eml(1, eml(1, x)), 1)), y), 1)), 1) evaluates to x * y.
read proof →
04 · myths / biology
Hair and fingernails continue to grow for days after a person dies.
DISPROVED (with unverified citations)
**Verdict: DISPROVED (with unverified citations).** The claim that hair and fingernails continue to grow for days after death is false.
read proof →
05 · neuroscience / myths
Humans use only 10% of their brain at any one time.
DISPROVED
**Verdict: DISPROVED.** Three independent, authoritative neuroscience sources — Scientific American (quoting a Johns Hopkins neurologist), MIT's McGovern Institute for Brain Research, and the University of Washington — each explicitly reject the claim that humans use only 10% of their brain.
read proof →
06 · history / myths
Napoleon Bonaparte stood shorter than the average Frenchman of his era.
DISPROVED
Napoleon Bonaparte was not shorter than the average Frenchman of his era.
read proof →
07 · physics
Quantum entanglement enables the transmission of usable information faster than the speed of light when the distant parties pre-agree on a measurement basis.
DISPROVED
**Verdict: DISPROVED.** The claim that quantum entanglement enables faster-than-light information transmission when parties pre-agree on a measurement basis is false.
read proof →
08 · mathematics
The 100000th prime number is exactly 1299709.
PROVED
The 100,000th prime number is exactly 1,299,709.
read proof →
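The prime card above is a one-call computation in step 04; this sketch assumes sympy's nth-prime function:

```python
from sympy import prime, isprime

n = prime(100000)        # exact, deterministic nth-prime lookup
assert n == 1299709
assert isprime(n)
```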
09 · myths / biology
The average person swallows eight spiders per year while sleeping.
DISPROVED
The claim that the average person swallows eight spiders per year while sleeping is conclusively false.
read proof →
10 · myths / history
The Great Wall of China is the only man-made object visible from space with the naked eye.
DISPROVED
The claim that the Great Wall of China is the only man-made object visible from space with the naked eye is false on both counts.
read proof →
11 · economics / history
The purchasing power of the US dollar has declined by more than 90% since the Federal Reserve was established in 1913.
PROVED
**Verdict: PROVED.** The purchasing power of the US dollar has declined by **96.85%** since 1913, well exceeding the claimed "more than 90%" threshold by 6.85 percentage points.
read proof →
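The dollar card above reduces to one ratio. The two constants below are illustrative assumptions approximating BLS CPI-U annual averages, standing in for the proof's version-controlled data rather than reproducing it:

```python
cpi_1913 = 9.9       # 1913 annual average CPI-U (assumed for illustration)
cpi_2024 = 314.4     # 2024 annual average CPI-U (assumed for illustration)

decline = 1 - cpi_1913 / cpi_2024
assert decline > 0.90                        # "more than 90%" threshold
print(f"purchasing power down {decline:.2%}")
```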
explore all 130 proofs in the catalog →
§05 · scope

What it can and can't do.

Calibrated honesty beats confident vagueness. The engine refuses to gesture at things it can't mechanically check.

works well for
  • Factual claims with citable evidence: dates, numbers, quotes, statistics with verifiable source pages
  • Mathematical assertions: anything Python + sympy can compute deterministically, including symbolic identities
  • Debunking specific claims: "did X really say Y?" · "is statistic Z accurate?" · "does this compound claim decompose?"
  • Compound claims: decomposes "X and Y" into independently verified sub-claims (SC1 ∧ SC2)
doesn't work for
  • Causal claims: "X caused Y" tops out at PARTIAL — facts yes, causal theory weighting no
  • Broad literature synthesis: "coffee reduces diabetes risk" needs a systematic review, not a proof
  • JS-rendered pages: citation match degrades when the source needs a browser to render
  • Absence of evidence: search-based facts reach SUPPORTED at best; the engine can't prove non-existence
  • Contested definitions: "is a hot dog a sandwich?" — depends on definition, not evidence
  • Original theorem proving: computations yes, novel conjectures no
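The compound-claim decomposition and verdict ceilings above suggest a small verdict lattice. The folding rules below are an illustrative assumption, not the engine's actual logic:

```python
def combine(subverdicts: list[str]) -> str:
    """Fold sub-claim verdicts (SC1 ∧ SC2 ∧ ...) into one compound verdict.

    Assumed rules: any DISPROVED sinks the conjunction, any UNDETERMINED
    blocks it, and anything short of all-PROVED caps it at PARTIAL.
    """
    if "DISPROVED" in subverdicts:
        return "DISPROVED"
    if "UNDETERMINED" in subverdicts:
        return "UNDETERMINED"
    if all(v == "PROVED" for v in subverdicts):
        return "PROVED"
    return "PARTIAL"
```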
§06 · every proof ships

Eight files. Nothing to take on faith.

Every verdict in the catalog includes these artifacts, versioned, DOI-minted, and downloadable. Run the Python. Open the notebook. Cite the JSON.

proof.py
re-runnable verification script — python proof.py
.py
proof.md
structured report with verdict + sub-claim breakdown
.md
proof_audit.md
citation-by-citation evidence trail, coverage % per quote
.md
proof_narrative.md
plain-language summary for non-technical readers
.md
Jupyter Notebook
interactive re-verification in a browser cell-by-cell
.ipynb
W3C PROV-JSON
provenance chain — feed it to downstream fact pipelines
.json
RO-Crate 1.1
archival research-object bundle, DOI-minted
.crate
Citation files
BibTeX, RIS, CFF, Chicago, APA — ready to cite
.bib
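For a feel of what the PROV-JSON artifact contains, here is a minimal W3C PROV-JSON document for one verification run. The `pe:` identifiers and labels are illustrative, not the engine's actual vocabulary; the top-level member names follow the PROV-JSON serialization:

```python
import json

prov = {
    "entity": {
        "pe:claim":   {"prov:label": "0.999... < 1"},
        "pe:verdict": {"prov:label": "DISPROVED"},
    },
    "activity": {
        "pe:run": {"prov:label": "python proof.py"},
    },
    "used": {
        "_:u1": {"prov:activity": "pe:run", "prov:entity": "pe:claim"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "pe:verdict", "prov:activity": "pe:run"},
    },
}
doc = json.dumps(prov, indent=2)
```

A downstream fact pipeline can walk `used` and `wasGeneratedBy` to trace any verdict back to its claim and run.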
§07 · install
Build agents that prove instead of assert.
Drop the skill into Claude or any other agent that supports Skills. Then just say: "use the proof-engine skill to verify …" — it auto-activates when a claim needs checking, and refuses to guess when it can't.