exhibit A / live

LLMs assert.
Proof Engine proves.

An open-source skill that makes every factual claim carry its receipts. Numbers get computed in Python. Quotes get fetched from live URLs and matched against the page. A fabricated citation fails the match — and the verdict downgrades so the gap is visible instead of hidden.

130 proofs · 19 domains · 368 sources checked · MIT · auditable · re-runnable
§01 · premise

The witness cannot corroborate itself.

LLMs hallucinate facts and hallucinate the checks on those facts. Asking one to verify its own answer runs the same error-prone process twice.

Search-grounded AI finds pages, but still lets the model summarize them. A link says "the evidence is over there." A proof says: here is the evidence, here is the code that checked it, here is what happened when it ran.

Proof Engine routes every claim through a gate the LLM can't fake. Python doesn't hallucinate. A fabricated citation fails to match live page text. A partial match downgrades the verdict so the gap is visible — instead of smoothed away by a confident summary.

The LLM is an untrusted author. Every claim it makes passes through a deterministic checkpoint. When it hallucinates, the pipeline breaks visibly instead of hiding the error.

The model still does useful work: it drafts code, finds sources, formalizes claims. It just doesn't get to be the verification.

The circular check
LLM(LLM(claim)) ≠ proof

The proof pipeline
fetch(url) ∘ match(quote) ∘ compute(python) → verdict
— deterministic, replayable, breakable
§02 · how it works

Five steps. No LLM in the verification path.

Each step produces an artifact that survives independently of the model that drafted it. You can re-run the whole pipeline from source, offline, in Python.
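The whole gate fits in a few lines. The sketch below is an illustrative assumption about how the five steps compose, not the engine's actual API; `fetch`, `match`, and `proof` stand in for the real step implementations, and the verdict rules are simplified:

```python
def verify(claim: str, quotes: dict[str, str], fetch, match, proof) -> str:
    """Run the deterministic gate for one claim.

    `quotes` maps source URL -> quoted sentence. The verdict rules here
    are an illustrative assumption, not the engine's exact lattice.
    """
    pages = {url: fetch(url) for url in quotes}                    # 02 · fetch sources
    matched = all(match(q, pages[u]) for u, q in quotes.items())   # 03 · verify quotes
    holds = proof()                                                # 04 · run proof.py
    if holds and matched:
        return "PROVED"
    if holds:
        return "PARTIAL"   # computation holds, but a citation failed to match
    return "DISPROVED"
```

Note the design point: the LLM drafts `quotes` and `proof`, but nothing in this function calls a model.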

01
claim input
Any factual assertion — a viral myth, a stat, a mathematical identity, a VC pitch-deck number. The LLM decomposes it into sub-claims (SC1, SC2…) and extractable facts (B1, B2, A1…).
# from a session
claim = "0.999… < 1"
decompose(claim) → [SC1: "0.999 repeating equals 1"]
02
fetch sources network
Academic papers, government data, reference encyclopedias — never the model's memory. Every URL goes in the audit trail with HTTP status, fetch mode (live, wayback, snapshot), and credibility tier.
fetch("planck.esa.int/2018-legacy") → 200 // tier-1 · academic
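A fetch step like the one above might look like this in stdlib Python. The audit-record field names (`status`, `mode`, `tier`) mirror the description but are assumptions about the engine's actual schema, and the credibility map is illustrative:

```python
import urllib.request

# Illustrative credibility map; the real engine's tiering is richer.
TIERS = {"planck.esa.int": "tier-1 · academic"}

def audit_entry(url: str, status: int, mode: str = "live") -> dict:
    """Build one audit-trail record: URL, HTTP status, fetch mode, tier."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return {"url": url, "status": status, "mode": mode,
            "tier": TIERS.get(host, "unrated")}

def fetch(url: str) -> tuple[str, dict]:
    """Fetch a live page and return (text, audit record). Network required."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        text = resp.read().decode("utf-8", "replace")
        return text, audit_entry(url, resp.status)
```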
03
verify quotes text match
Each quoted sentence must appear on the live page, modulo Unicode/HTML normalization. Partial matches downgrade the verdict. Fabricated citations break the pipeline visibly, with coverage percentages in the audit log.
match(quote="Ω_Λ = 0.6853 ± 0.0074", page) → verified // 100% word coverage
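The coverage score can be computed without any model in the loop. This sketch shows the idea, NFKC normalization, lowercasing, then scoring the best aligned window of words; the exact normalization rules and thresholds are assumptions, not the engine's actual code:

```python
import re
import unicodedata

def word_coverage(quote: str, page_text: str) -> float:
    """Fraction of the quote's words matched, in order, against the page."""
    def tokens(s: str) -> list[str]:
        # Unicode/HTML-entity differences collapse under NFKC + lowercase.
        return re.findall(r"\w+", unicodedata.normalize("NFKC", s).lower())

    q, p = tokens(quote), tokens(page_text)
    if not q:
        return 0.0
    best = 0
    for i in range(len(p) - len(q) + 1):
        hits = sum(1 for a, b in zip(q, p[i:i + len(q)]) if a == b)
        best = max(best, hits)
    return best / len(q)
```

A 100% score verifies the quote; anything less downgrades the verdict, and 0% is a fabricated citation breaking visibly.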
04
run proof.py python
Deterministic computation: sympy for exact math, numpy for quantitative work, every constant version-controlled. Anyone can run python proof.py and see the same result.
compare(omega_lambda, threshold=0.68, op=">") → True
assert "0.6853 > 0.68" # holds
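The comparison above can be re-run with exact rationals so float round-off can't blur the threshold. A minimal sketch, using the best-fit value quoted in the audit line (the real proof tracks the uncertainty separately):

```python
from sympy import Rational

omega_lambda = Rational(6853, 10000)   # Ω_Λ = 0.6853, exact
threshold = Rational(68, 100)          # 0.68, exact

assert omega_lambda > threshold        # holds: 0.6853 > 0.68
```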
05
verdict output
Structured outcome: PROVED, DISPROVED, PARTIAL, SUPPORTED, or UNDETERMINED. Every verdict ships with a Jupyter notebook, PROV-JSON provenance chain, and an RO-Crate 1.1 archive bundle.
verdict: PROVED // Ω_Λ = 0.6853 > 0.68 — confirmed
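The structured outcome is easy to model; this dataclass sketch is an assumption about shape, not the engine's actual types, but the five verdict labels are the ones listed above:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    PROVED = "PROVED"
    DISPROVED = "DISPROVED"
    PARTIAL = "PARTIAL"
    SUPPORTED = "SUPPORTED"
    UNDETERMINED = "UNDETERMINED"

@dataclass
class ProofResult:
    claim: str
    verdict: Verdict
    audit: list = field(default_factory=list)   # citation-by-citation trail

result = ProofResult(
    claim="dark energy > 68% of total energy density",
    verdict=Verdict.PROVED,
)
```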
§03 · the difference

"Why not just ask the model?"

Same claim, same sources, two radically different artifacts. The one on the left is a confident summary. The one on the right is a re-runnable script with a trace.

typical LLM
prompt: "does using AI tools make humans worse at critical thinking?"
"The claim is too absolute to confirm or deny cleanly. The real picture appears to be that passive, over-reliant use degrades critical thinking, while active, interrogative use can augment it — making the unqualified 'makes humans worse' framing false as a universal statement."
  • hedges into unfalsifiability — "too absolute," "context-dependent"
  • names one 2025 study without a URL, quote, or coverage check
  • invokes "cognitive-offloading theory" without citing a source
  • concludes "false as a universal statement" — but no universal was claimed
  • a second ask re-runs the same mechanism; the hedge persists
✗ plausible-sounding · zero provenance
proof engine
same claim → verdict + audit trail
PROVED — four independent research groups, different institutions and methods, reach the same association: AI tool use correlates with measurable drops in critical-thinking scores.
  • B1 Gerlich 2025, n=666 — negative correlation (r=-0.68), cognitive offloading (quote-verified)
  • B2 Lee et al. 2025 (Microsoft Research / CHI), n=319 knowledge workers — confidence ↑ → critical effort ↓ (quote-verified)
  • B3 Harvard Gazette 2025 — faculty cross-discipline panel, same concern (quote-verified)
  • B4 Jose et al. 2025 (Frontiers / PMC) — ChatGPT users solved 48% more problems but scored 17% lower on concept tests (quote-verified)
  • verdict qualifier in the record: correlation, not proven causation; routine use > high-stakes use
✓ 4 sources verified · consensus threshold met · re-runnable · read the full proof →
§04 · exhibit room

Think you know the answer?

11 claims, verdicts redacted. Tap a card to reveal what the pipeline actually found. Every one of them is a re-runnable Python script in the catalog.

tap to reveal · R to reveal all
01 · mathematics / myths
0.999... (with 9s repeating forever) is strictly less than 1.
DISPROVED
The claim that 0.999... (with 9s repeating forever) is strictly less than 1 is false: the repeating decimal equals exactly 1.
read proof →
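The card above is settled by summing the geometric series exactly. A sketch of the step-04 computation, assuming sympy (the library the pipeline already uses):

```python
from sympy import Sum, Rational, oo, symbols

k = symbols("k", integer=True, positive=True)
# 0.999... is the geometric series 9/10 + 9/100 + ...; sympy sums it exactly.
nines = Sum(9 * Rational(1, 10) ** k, (k, 1, oo)).doit()

assert nines == 1   # so "0.999... < 1" is DISPROVED
```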
02 · cosmology
Dark energy constitutes more than 68% of the universe's total energy density according to the Planck 2018 legacy release.
PROVED (with unverified citations)
The Planck 2018 legacy release reports ΩΛ = 0.6853 ± 0.0074, which is strictly greater than 0.68.
read proof →
03 · mathematics
The binary operator eml is defined by the expression eml(a, b) = (a) - (b). There exists a finite binary tree consisting solely of eml operations, whose 9 leaves are drawn from {1, x, y}, such that the tree evaluates exactly to x × y. The tree has K = 17 tokens (8 eml operations and 9 leaves), and the identity holds for all complex x and y (in the algebraic setting where 1 is the multiplicative identity).
PROVED
**Verdict: PROVED.** The 17-token expression eml(eml(1, eml(eml(eml(1, eml(eml(1, eml(1, x)), 1)), y), 1)), 1) evaluates to x * y.
read proof →
04 · myths / biology
Hair and fingernails continue to grow for days after a person dies.
DISPROVED (with unverified citations)
**Verdict: DISPROVED (with unverified citations).** The claim that hair and fingernails continue to grow for days after death is false.
read proof →
05 · neuroscience / myths
Humans use only 10% of their brain at any one time.
DISPROVED
**Verdict: DISPROVED.** Three independent, authoritative neuroscience sources — Scientific American (quoting a Johns Hopkins neurologist), MIT's McGovern Institute for Brain Research, and the University of Washington — each explicitly reject the claim that humans use only 10% of their brain.
read proof →
06 · history / myths
Napoleon Bonaparte stood shorter than the average Frenchman of his era.
DISPROVED
Napoleon Bonaparte was not shorter than the average Frenchman of his era.
read proof →
07 · physics
Quantum entanglement enables the transmission of usable information faster than the speed of light when the distant parties pre-agree on a measurement basis.
DISPROVED
**Verdict: DISPROVED.** The claim that quantum entanglement enables faster-than-light information transmission when parties pre-agree on a measurement basis is false.
read proof →
08 · mathematics
The 100000th prime number is exactly 1299709.
PROVED
The 100,000th prime number is exactly 1,299,709.
read proof →
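The prime card above is a one-call computation in step 04; this sketch assumes sympy's nth-prime function:

```python
from sympy import prime, isprime

n = prime(100000)        # exact, deterministic nth-prime lookup
assert n == 1299709
assert isprime(n)
```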
09 · myths / biology
The average person swallows eight spiders per year while sleeping.
DISPROVED
The claim that the average person swallows eight spiders per year while sleeping is conclusively false.
read proof →
10 · myths / history
The Great Wall of China is the only man-made object visible from space with the naked eye.
DISPROVED
The claim that the Great Wall of China is the only man-made object visible from space with the naked eye is false on both counts.
read proof →
11 · economics / history
The purchasing power of the US dollar has declined by more than 90% since the Federal Reserve was established in 1913.
PROVED
**Verdict: PROVED.** The purchasing power of the US dollar has declined by **96.85%** since 1913, well exceeding the claimed "more than 90%" threshold by 6.85 percentage points.
read proof →
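The dollar card above reduces to one ratio. The two constants below are illustrative assumptions approximating BLS CPI-U annual averages, standing in for the proof's version-controlled data rather than reproducing it:

```python
cpi_1913 = 9.9       # 1913 annual average CPI-U (assumed for illustration)
cpi_2024 = 314.4     # 2024 annual average CPI-U (assumed for illustration)

decline = 1 - cpi_1913 / cpi_2024
assert decline > 0.90                        # "more than 90%" threshold
print(f"purchasing power down {decline:.2%}")
```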
explore all 130 proofs in the catalog →
§05 · scope

What it can and can't do.

Calibrated honesty beats confident vagueness. The engine refuses to gesture at things it can't mechanically check.

works well for
  • Factual claims with citable evidence: dates, numbers, quotes, statistics with verifiable source pages
  • Mathematical assertions: anything Python + sympy can compute deterministically, including symbolic identities
  • Debunking specific claims: "did X really say Y?" · "is statistic Z accurate?" · "does this compound claim decompose?"
  • Compound claims: decomposes "X and Y" into independently verified sub-claims (SC1 ∧ SC2)
doesn't work for
  • Causal claims: "X caused Y" tops out at PARTIAL — facts yes, causal theory weighting no
  • Broad literature synthesis: "coffee reduces diabetes risk" needs a systematic review, not a proof
  • JS-rendered pages: citation match degrades when the source needs a browser to render
  • Absence of evidence: search-based facts reach SUPPORTED at best; the engine can't prove non-existence
  • Contested definitions: "is a hot dog a sandwich?" — depends on definition, not evidence
  • Original theorem proving: computations yes, novel conjectures no
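The compound-claim decomposition and verdict ceilings above suggest a small verdict lattice. The folding rules below are an illustrative assumption, not the engine's actual logic:

```python
def combine(subverdicts: list[str]) -> str:
    """Fold sub-claim verdicts (SC1 ∧ SC2 ∧ ...) into one compound verdict.

    Assumed rules: any DISPROVED sinks the conjunction, any UNDETERMINED
    blocks it, and anything short of all-PROVED caps it at PARTIAL.
    """
    if "DISPROVED" in subverdicts:
        return "DISPROVED"
    if "UNDETERMINED" in subverdicts:
        return "UNDETERMINED"
    if all(v == "PROVED" for v in subverdicts):
        return "PROVED"
    return "PARTIAL"
```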
§06 · every proof ships

Eight files. Nothing to take on faith.

Every verdict in the catalog includes these artifacts, versioned, DOI-minted, and downloadable. Run the Python. Open the notebook. Cite the JSON.

proof.py
re-runnable verification script — python proof.py
.py
proof.md
structured report with verdict + sub-claim breakdown
.md
proof_audit.md
citation-by-citation evidence trail, coverage % per quote
.md
proof_narrative.md
plain-language summary for non-technical readers
.md
Jupyter Notebook
interactive re-verification in a browser cell-by-cell
.ipynb
W3C PROV-JSON
provenance chain — feed it to downstream fact pipelines
.json
RO-Crate 1.1
archival research-object bundle, DOI-minted
.crate
Citation files
BibTeX, RIS, CFF, Chicago, APA — ready to cite
.bib
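For a feel of what the PROV-JSON artifact contains, here is a minimal W3C PROV-JSON document for one verification run. The `pe:` identifiers and labels are illustrative, not the engine's actual vocabulary; the top-level member names follow the PROV-JSON serialization:

```python
import json

prov = {
    "entity": {
        "pe:claim":   {"prov:label": "0.999... < 1"},
        "pe:verdict": {"prov:label": "DISPROVED"},
    },
    "activity": {
        "pe:run": {"prov:label": "python proof.py"},
    },
    "used": {
        "_:u1": {"prov:activity": "pe:run", "prov:entity": "pe:claim"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "pe:verdict", "prov:activity": "pe:run"},
    },
}
doc = json.dumps(prov, indent=2)
```

A downstream fact pipeline can walk `used` and `wasGeneratedBy` to trace any verdict back to its claim and run.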
§07 · install
Build agents that prove instead of assert.
Drop the skill into Claude or any other agent that supports Skills. Then just say: "use the proof-engine skill to verify …" — it auto-activates when a claim needs checking, and refuses to guess when it can't.