LLMs assert.
Proof Engine proves.
An open-source skill that makes every factual claim carry its receipts. Numbers get computed in Python. Quotes get fetched from live URLs and matched against the page. A fabricated citation fails the match — and the verdict downgrades so the gap is visible instead of hidden.
The witness cannot corroborate itself.
LLMs hallucinate facts and hallucinate the checks on those facts. Asking one to verify its own answer runs the same error-prone process twice.
Search-grounded AI finds pages, but still lets the model summarize them. A link says "the evidence is over there." A proof says: here is the evidence, here is the code that checked it, here is what happened when it ran.
Proof Engine routes every claim through a gate the LLM can't fake. Python doesn't hallucinate: a fabricated citation fails to match the live page text, and a partial match downgrades the verdict rather than being smoothed away by a confident summary.
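The gate can be pictured in a few lines of Python. This is an illustrative sketch, not the engine's actual matcher: the function names and the 50% partial-match threshold are assumptions.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and case so the comparison is mechanical, not fuzzy."""
    return " ".join(text.split()).lower()

def check_quote(quote: str, page_text: str) -> str:
    """Match a claimed quote against fetched page text.
    Exact substring -> VERIFIED; majority word overlap -> PARTIAL
    (the verdict downgrades); otherwise FAILED (fabricated citation).
    The 50% threshold is illustrative."""
    q, p = normalize(quote), normalize(page_text)
    if q in p:
        return "VERIFIED"
    words = q.split()
    hits = sum(w in p for w in words)
    return "PARTIAL" if hits >= 0.5 * len(words) else "FAILED"

page = "the study found a negative correlation between ai use and critical thinking"
print(check_quote("a negative correlation between AI use and critical thinking", page))  # VERIFIED
print(check_quote("AI use destroys critical thinking entirely", page))  # PARTIAL (verdict downgrades)
```

A confident model can assert a quote; it cannot make this function return `VERIFIED` for text that isn't on the page.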
The model still does useful work: it drafts code, finds sources, formalizes claims. It just doesn't get to be the verification.
Five steps. No LLM in the verification path.
Each step produces an artifact that survives independently of the model that drafted it. You can re-run the whole pipeline from source, offline, in Python.
› claim = "0.999… < 1"
› decompose(claim) → [SC1: "0.999 repeating equals 1"]
Each sub-claim gets a source record: URL status (live, wayback, snapshot) and credibility tier. Computation runs in plain Python: sympy for exact math, numpy for quantitative work, every constant version-controlled. Anyone can run `python proof.py` and see the same result.

› assert 0.6853 > 0.68  # holds
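Run on the example claim above, a proof script might look like this. A minimal sketch assuming sympy is available; the real scripts in the catalog carry full provenance.

```python
import sympy as sp

# SC1: "0.999 repeating equals 1", taken as the exact infinite sum 9/10 + 9/100 + ...
n = sp.symbols("n", integer=True, positive=True)
repeating = sp.Sum(9 / sp.Integer(10)**n, (n, 1, sp.oo)).doit()

assert repeating == 1        # SC1 holds exactly; no floating point involved
assert not (repeating < 1)   # so the original claim "0.999… < 1" fails
print("0.999… < 1 → DISPROVED")
```

The assertion is exact symbolic arithmetic, so re-running it anywhere gives the same verdict.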
The verdict is one of PROVED, DISPROVED, PARTIAL, SUPPORTED, or UNDETERMINED. Every verdict ships with a Jupyter notebook, a PROV-JSON provenance chain, and an RO-Crate 1.1 archive bundle.

"Why not just ask the model?"
Same claim, same sources, two radically different artifacts. The one on the left is a confident summary. The one on the right is a re-runnable script with a trace.
- hedges into unfalsifiability — "too absolute," "context-dependent"
- names one 2025 study without a URL, quote, or coverage check
- invokes "cognitive-offloading theory" without citing a source
- concludes "false as a universal statement" — but no universal was claimed
- a second ask re-runs the same mechanism; the hedge persists
- B1 · Gerlich 2025, n=666: negative correlation (r=-0.68), cognitive offloading (quote-verified)
- B2 · Lee et al. 2025 (Microsoft Research / CHI), n=319 knowledge workers: confidence ↑ → critical effort ↓ (quote-verified)
- B3 · Harvard Gazette 2025: faculty cross-discipline panel, same concern (quote-verified)
- B4 · Jose et al. 2025 (Frontiers / PMC): ChatGPT users solved 48% more problems but scored 17% lower on concept tests (quote-verified)
- verdict qualifier in the record: correlation, not proven causation; routine use > high-stakes use
Think you know the answer?
11 claims, verdicts redacted. Tap a card to reveal what the pipeline actually found. Every one of them is a re-runnable Python script in the catalog.
What it can and can't do.
Calibrated honesty beats confident vagueness. The engine refuses to gesture at things it can't mechanically check.
- Factual claims with citable evidence: dates, numbers, quotes, statistics with verifiable source pages
- Mathematical assertions: anything Python + sympy can compute deterministically, including symbolic identities
- Debunking specific claims: "did X really say Y?" · "is statistic Z accurate?" · "does this compound claim decompose?"
- Compound claims: decomposes "X and Y" into independently verified sub-claims (SC1 ∧ SC2)
- Causal claims: "X caused Y" tops out at PARTIAL (facts yes, causal-theory weighting no)
- Broad literature synthesis: "coffee reduces diabetes risk" needs a systematic review, not a proof
- JS-rendered pages: citation matching degrades when the source needs a browser to render
- Absence of evidence: search-based facts reach SUPPORTED at best; the engine can't prove non-existence
- Contested definitions: "is a hot dog a sandwich?" depends on definition, not evidence
- Original theorem proving: computations yes, novel conjectures no
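The compound-claim rule can be sketched as a verdict lattice: a conjunction is only as strong as its weakest sub-claim. The ordering below is an assumption for illustration; the engine's actual weighting may differ.

```python
# Weakest-to-strongest verdict ordering (illustrative assumption).
ORDER = ["DISPROVED", "UNDETERMINED", "PARTIAL", "SUPPORTED", "PROVED"]

def combine(verdicts: list[str]) -> str:
    """SC1 ∧ SC2 takes the minimum verdict in the ordering."""
    return min(verdicts, key=ORDER.index)

assert combine(["PROVED", "PROVED"]) == "PROVED"
assert combine(["PROVED", "SUPPORTED"]) == "SUPPORTED"   # a search-based fact caps the whole claim
assert combine(["PROVED", "DISPROVED"]) == "DISPROVED"   # one failed sub-claim sinks the conjunction
```

This is why causal and absence-of-evidence claims cap the overall verdict: one capped sub-claim limits everything above it.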
Eight files. Nothing to take on faith.
Every verdict in the catalog includes these artifacts, versioned, DOI-minted, and downloadable. Run the Python. Open the notebook. Cite the JSON.
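The PROV-JSON chain is what ties a verdict to the code and sources that produced it. A minimal illustrative fragment (the `ex:` identifiers and labels are hypothetical, not the catalog's actual records):

```json
{
  "entity": {
    "ex:verdict": { "prov:label": "PROVED" },
    "ex:proof.py": { "prov:label": "verification script" }
  },
  "activity": {
    "ex:run-1": { "prov:label": "python proof.py" }
  },
  "wasGeneratedBy": {
    "_:g1": { "prov:entity": "ex:verdict", "prov:activity": "ex:run-1" }
  }
}
```

Because the chain is plain JSON, any PROV-aware tool can walk from the verdict back to the script that generated it.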
python proof.py