AI Hallucination Agent — unified open-source demo combining strategies, verification modules, and severity triage.

API keys you enter are sent over HTTPS to this app's /api/verify only and are not stored on the server. Keys are saved only if you click "Save for future" (this browser's localStorage). For CI/workflows, send the headers X-LLM-API-Key and X-LLM-Provider.

Verification strategies:
• Multi-source RAG: cross-check against multiple knowledge bases
• Chain-of-thought verify: step-by-step logical consistency
• Adversarial probe: counter-evidence stress test
• Confidence calibration: uncertainty per sub-claim
Verification modules:
• Multi-source RAG: Wikipedia, web, arXiv
• Citation verifier (v2): CrossRef, courts, PubMed
• Code safety (v2): AST, URLs, packages
• Medical guard (v2): evidence & disclaimers
• Policy grounding (v2): vs. source documents
• Calibration: per-claim uncertainty
Risk domains:
• Medical: diagnoses, dosing
• Legal: cases, citations
• Technical: code, packages
• Policy: T&Cs, internal docs
• Historical: events, facts
• Predictive: forecasts
How this demo runs: /api/verify calls your chosen LLM using server-side keys only (never exposed to the browser). Auto-provider priority: Anthropic → Gemini → OpenAI → Groq. Configure at least one key (Cloudflare Secrets or .dev.vars). Opening the page via file:// cannot reach this API; use npm run dev or your deployed URL.
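For CI or scripts, a minimal sketch of calling the same endpoint with the X-LLM-* headers noted above; the JSON body shape here is an assumption, so check functions/api/verify.js for the actual contract:

import requests  # pip install requests

resp = requests.post(
    "https://your-deployment.pages.dev/api/verify",  # placeholder URL
    headers={
        "X-LLM-API-Key": "sk-...",      # your provider key
        "X-LLM-Provider": "anthropic",  # or gemini / openai / groq
    },
    json={"text": "Claim to verify goes here."},  # assumed body shape
    timeout=60,
)
print(resp.json())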
How the agent prevents hallucinations
A multi-layer pipeline intercepts AI outputs, decomposes them into atomic claims, verifies each claim via retrieval and optional domain modules (v2), applies a verifier LLM with calibration and severity labels (v3), and returns a calibrated confidence score with citations or corrections. Beyond verification: constrained decoding, uncertainty elicitation, and self-consistency sampling reduce bad outputs upstream.
This deployed demo: the Live demo tab runs the verifier LLM through /api/verify (Cloudflare Function) using whichever providers you configured — citation/CodeScanner-style modules here are described as product concepts until wired to real APIs in code.
Claim decomposer (Layer 1 — parse & split):
• Splits complex text into atomic factual claims
• Detects verifiable vs. opinion statements
• Extracts named entities, dates, quantities
• Assigns verification priority per claim
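A minimal sketch of Layer 1 as a prompt-based decomposer; call_llm is a placeholder for whatever provider client you use, and the JSON schema is illustrative:

import json

DECOMPOSE_PROMPT = (
    "Split the text below into atomic factual claims. Return a JSON list of "
    "objects with keys: claim, verifiable (bool), entities, priority.\n\nText:\n"
)

def decompose(text: str, call_llm) -> list[dict]:
    raw = call_llm(DECOMPOSE_PROMPT + text)  # call_llm: your provider client
    return json.loads(raw)                   # validate the schema in production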
RAG retriever (Layer 2 — evidence fetch):
• Queries Wikipedia, arXiv, web search in parallel
• Embeds claims → semantic nearest-neighbor search
• Supports custom private knowledge bases
• Source reliability scoring (e.g. PageRank-style)
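A sketch of the parallel fetch, assuming one callable per source; the fetch functions are placeholders for real Wikipedia/arXiv/web clients, and embedding-based ranking is left to application code:

from concurrent.futures import ThreadPoolExecutor

def retrieve(claim: str, sources: dict, k: int = 5) -> list[dict]:
    # sources maps a name ("wikipedia", "arxiv", ...) to a callable that
    # takes the claim and returns a list of evidence strings.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, claim) for name, fn in sources.items()}
        hits = [{"source": name, "text": doc}
                for name, fut in futures.items()
                for doc in fut.result()]
    return hits[:k]  # rank by embedding similarity before truncating in practice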
Verifier LLM (Layer 3 — judge):
• Structured chain-of-thought per claim
• Entailment: evidence supports or contradicts
• Confidence score with calibration
• Backends: GPT-4, Claude, Gemini, Ollama, …
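A sketch of the per-claim judgment, assuming any JSON-capable backend; the prompt wording and output keys are illustrative, not a fixed contract:

import json

JUDGE_PROMPT = (
    "Given the claim and evidence below, reply with JSON keys: verdict "
    "(supported | contradicted | unverifiable), confidence (0-1), reasoning.\n\n"
)

def judge(claim: str, evidence: list[str], call_llm) -> dict:
    prompt = JUDGE_PROMPT + f"Claim: {claim}\nEvidence:\n" + "\n".join(evidence)
    return json.loads(call_llm(prompt))  # call_llm: your provider client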
Report generator (Layer 4 — output):
• Verdict with grounded corrections
• Inline citation links for claims
• JSON, HTML, Markdown, webhooks
• Audit trail + feedback for retraining
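For orientation, an illustrative verdict payload; the real schema would live in the report generator, which this repo does not ship yet:

report = {
    "verdict": "warning",     # trusted | warning | hallucination
    "confidence": 0.74,       # calibrated, not raw model confidence
    "claims": [
        {
            "claim": "...",
            "verdict": "contradicted",
            "citations": ["https://example.org/evidence"],
            "correction": "...",
            "severity": "P2",
        },
    ],
}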
Prevention strategies beyond verification
• Constrained decoding: force citations via logit bias or tool calls.
• Uncertainty elicitation: ask the model to surface uncertainty before asserting facts.
• Self-consistency: sample multiple answers; vote on the majority (sketch below).
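A minimal self-consistency sketch: sample several answers at temperature > 0 and keep the majority (call_llm is again a placeholder):

from collections import Counter

def self_consistent(prompt: str, call_llm, n: int = 5) -> str:
    answers = [call_llm(prompt, temperature=0.8) for _ in range(n)]
    best, _count = Counter(answers).most_common(1)[0]
    return best  # exact-match voting; cluster semantically similar answers in practice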
v2 — four domain-specific modules on the base pipeline
Each module is independently loadable. After claim decomposition they can run in parallel and merge into one verdict with per-module evidence trails. In the Live demo, turning modules on shapes the LLM prompt (simulated cross-checks); production-grade hooks to CourtListener, CrossRef, etc. belong in application code—same architecture as below.
Citation verifier v2 (legal, academic, scientific):
• Resolves case names via CourtListener & Google Scholar APIs
• Verifies DOIs and ISBNs via CrossRef and OpenAlex
• Checks PubMed IDs and arXiv IDs for existence
• Flags hallucinated docket numbers (e.g. fake WL citations)
• Returns the real case summary if found, a correction if not
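The DOI check reduces to one HTTP call, since the public CrossRef REST API returns 404 for unknown DOIs; a minimal sketch:

import requests

def doi_exists(doi: str) -> bool:
    # CrossRef's public API: /works/<doi> is 200 for real DOIs, 404 otherwise
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return r.status_code == 200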
Code safety scanner v2 (static analysis + URL validation):
• AST parse: injection, XSS, unsafe eval patterns
• HTTP HEAD validates URLs (404 / redirect)
• Bandit & Semgrep (Python); ESLint security (JS)
• Dependency checks: package exists on PyPI/npm
• CVE lookup for referenced library versions
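Two of these checks need no special tooling; a sketch against public endpoints (HTTP HEAD for link rot, PyPI's JSON API for package existence):

import requests

def url_alive(url: str) -> bool:
    # HEAD catches 404s and follows redirects without downloading bodies
    r = requests.head(url, allow_redirects=True, timeout=10)
    return r.status_code < 400

def pypi_package_exists(name: str) -> bool:
    # https://pypi.org/pypi/<name>/json returns 404 for unknown packages
    r = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return r.status_code == 200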
Medical guard v2 (clinical evidence grounding):
• Maps diagnoses and drugs to ICD-11 and RxNorm
• Searches PubMed for supporting or contradicting RCTs
• Cross-checks WHO & FDA sources for interactions
• Flags high-confidence claims with thin evidence
• Appends a mandatory “consult a clinician” disclaimer to output
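A sketch of two of these steps, assuming NCBI's public E-utilities for the PubMed lookup; a real module would map terms through ICD-11/RxNorm before querying:

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits(query: str) -> int:
    # Naive term search; production code should build MeSH-aware queries
    r = requests.get(EUTILS, params={"db": "pubmed", "term": query,
                                     "retmode": "json"}, timeout=10)
    return int(r.json()["esearchresult"]["count"])

def with_disclaimer(text: str) -> str:
    return text + "\n\nThis is not medical advice. Consult a clinician."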
Policy grounding v2 (document & policy adherence):
• Ingests company policies as a grounding KB
• Semantic similarity: claim vs. source paragraph
• Detects promises not in policy (Air Canada scenario)
• Sources: PDF, DOCX, Notion, Confluence, …
• Returns the conflicting policy section as evidence
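A similarity sketch using sentence-transformers (an assumed dependency, not shipped in this repo); the 0.6 threshold is illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def grounded(claim: str, policy_paragraphs: list[str], threshold: float = 0.6):
    claim_vec = model.encode(claim, convert_to_tensor=True)
    para_vecs = model.encode(policy_paragraphs, convert_to_tensor=True)
    scores = util.cos_sim(claim_vec, para_vecs)[0]
    best = int(scores.argmax())
    # Return the closest policy section as evidence either way
    return float(scores[best]) >= threshold, policy_paragraphs[best]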
Severity tiers (v3)
• P0 · Extreme risk — blocker. Life, liberty, legal integrity — block output, pager / human review, full audit trail. Examples: medical misdiagnosis, lethal dosage errors, fabricated precedents in filings, emergency misinformation.
• P1 · High risk — escalate. Financial loss, liability, reputational damage — flag, correct, notify an operator, audit. Examples: binding policy promises, high-stakes demo errors, security backdoors in code, fake citations in reports.
• P2 · Moderate risk — correct. Accuracy and credibility — inline correction, uncertainty badge, “verify this” prompt. Examples: non-existent citations, false biographies, broken links / 404s, arithmetic errors.
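The routing itself is small enough to express as data; a sketch that fails closed on unknown tiers (field names are illustrative):

TRIAGE = {
    "P0": {"action": "block",   "notify": "pager",    "audit": True},
    "P1": {"action": "flag",    "notify": "operator", "audit": True},
    "P2": {"action": "correct", "notify": None,       "audit": False},
}

def route(severity: str) -> dict:
    # Unknown tiers fall back to the strictest handling (fail closed)
    return TRIAGE.get(severity, TRIAGE["P0"])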
Hallucination types (fabricated sources, fictional history, policy invention, etc.) appear in the Coverage tab with example mappings.

Incident categories & module mapping below extend to the regulated sectors in the next card. Coverage = intent to detect before users rely on output. Percentages are engineering targets for a full retrieval-backed stack, not guarantees or certifications. Severity defaults (P0–P2) reflect typical harm if a false claim slips through.

Critical sectors requiring hallucination testing

These sectors share near‑zero tolerance for silent fabrication: liability, licensing, fines, or physical harm dominate outcomes. The Live demo on this site runs verifier JSON through /api/verify only — it does not by itself satisfy EU AI Act, FDA, banking, or court evidentiary rules; production systems still need grounded retrieval, logging, human‑in‑the‑loop sign‑off, and your compliance review.

Legal & judiciary
Fake cases, dockets, or reasoning in filings → sanctions / disbarment risk. Architecture maps to: Citation verifier + audit trail in report generator · typical tier P0
Healthcare & life sciences
Misdiagnosis, dosing, contraindications in AI outputs → patient harm; regulated-as-device paths expect reliability & human review. Maps to: Medical guard (text) + severity P0 · multimodal clinical imaging remains a model-layer gap (see partial row below).
Financial services
Bad numbers, “insights,” transfers, or filings grounded in fiction → liability & regulatory breach. Maps to: Policy grounding on approved disclosures + citation/numeric cross-checks + calibration · often P1
Manufacturing & critical infrastructure
Subtly wrong repair, safety, or control guidance → downtime or injury. Maps to: Code / link scanner + procedural RAG against SOPs & OEM docs · treat as high‑stakes technical · P0/P1 depending on consequence.
Government & law enforcement
Essential services, immigration, policing — often classified as high‑risk under frameworks like the EU AI Act: documentation, logging, oversight. Maps to: audit-oriented report output + policy/legal grounding; deployment must meet jurisdictional rules beyond this repo.
Customer service (regulated)
Binding promises (refunds, coverage) invented by bots → direct liability. Maps to: Policy grounding (Air Canada–style checks) · P1
Why these sectors are especially “in need”
  • Liability: Unlike casual chat, a single wrong factual or policy assertion can mean fines in the millions, loss of license, or irreversible harm — there is little room for “mostly right.”
  • Audit trails: Regulators and plaintiffs increasingly ask how a conclusion was reached; hallucination testing & structured verdict JSON support documentation for compliance and litigation defence (when wired into your logging stack).
  • Trust gap: After operators see confident lies from AI, adoption of automation stalls; repeatable verification is needed to recover confidence and productivity.
Headline incidents → module mapping
• Legal fabrications (Mata v. Avianca) → Citation verifier: CourtListener + docket / Westlaw-style checks · P0 · Covered · 95% target
• Customer service misinformation (Air Canada) → Policy grounding: claims not in source documents · P1 · Covered · 92% target
• Academic citation fabrication (fake DOIs) → Citation verifier: CrossRef DOI + PubMed + OpenAlex · P1 · Covered · 97% target
• Historical / factual inaccuracies (Google Bard / Webb) → Multi-source RAG: Wikipedia, news archives, NASA · P1 · Covered · 88% target
• Healthcare misidentification (skin lesion) → Medical guard on text; vision errors occur inside the model — not fixable by post-hoc text checks alone · P0 · Partial · 55% target
• Code errors & broken links (fraud system bugs) → Code scanner: AST vulns, 404 URLs, fake packages · P2 · Covered · 82% target
Healthcare image inference (the multimodal gap) needs model-level calibration — ensemble uncertainty, Platt scaling, and mandatory human-in-the-loop review. Treat that as a roadmap item for multimodal plugins, not something a text-only guard fully solves.
Common hallucination types → module
• Source fabrication (fake news, quotes, papers, legal cites) → Citation verifier
• Logical / arithmetic (wrong math or contradictions) → Chain-of-thought
• Fictional history (“Moonlight Treaty of 1854,” invented events) → Multi-source RAG
• Policy invention (refunds or terms not in your docs) → Policy grounding
• Medical misinformation (unsupported dosing or diagnosis in text) → Medical guard
• Code & link errors (404s, fake npm/PyPI names, risky code) → Code scanner
• “Know-it-all” tone (authoritative phrasing, weak evidence) → Calibration
• Incorrect predictions (markets, weather, outcomes stated as certain) → Calibration
• Image misclassification (CV sees objects that are not there) → Model layer · roadmap
Honest ceiling. This agent is a post-hoc verification layer: it runs after the LLM produces text and before the user sees it. Unknown-unknown fabrications in domains with no reachable ground truth may still land as “uncertain.” Training-time gaps and multimodal failures need prevention inside the model (RLHF, constitutional AI, RAG in the loop, calibrated vision heads).
Deployment options
This repo ships a production UI on Cloudflare Pages: public/index.html + functions/api/*.js + wrangler.toml — see the README (npm run dev / npm run deploy). The cards below are additional ways you might package the same ideas (PyPI, Docker, etc.).

Click a card to copy a starter prompt for an AI assistant or internal docs — not a guarantee that a package name exists on a registry. Reserve names when you publish.

• PyPI package roadmap: future pip install … — name TBD when published
• Docker + REST API: FastAPI server, OpenAI-compatible verify route
• Hugging Face Spaces: free hosted Gradio demo for discovery
• Vercel / Railway: serverless-style deploy + API keys in env
• npm package: npm install hallucination-guard — TS-first SDK
• GitHub Action: CI/CD gate on LLM outputs in PRs
Python / YAML below — honest labeling: These blocks describe a target Python API and CI shape for contributors — they are not importable from this repository today (only functions/*.js + public/ ship). Publishing PyPI/npm/GitHub Actions under those names requires your release process.
from hallucination_guard import GuardAgent
from hallucination_guard.modules import (
    CitationVerifier, CodeScanner, MedicalGuard, PolicyGrounding
)

agent = GuardAgent(
    verifier="claude-sonnet-4",
    modules=[
        CitationVerifier(sources=["courtlistener", "crossref", "pubmed"]),
        CodeScanner(run_url_check=True, check_packages=True),
        MedicalGuard(require_rct_evidence=True, add_disclaimer=True),
        PolicyGrounding(docs=["./policies/refund_policy.pdf"]),
    ],
    strategy="parallel",  # all modules run concurrently
)

result = agent.verify(text=llm_output)
print(result.verdict)         # trusted | warning | hallucination
print(result.module_reports)  # per-module evidence breakdown
print(result.corrections)     # grounded corrections with citations
print(result.safe_rewrite)    # corrected version of the text
from hallucination_guard.middleware import HallucinationGuardMiddleware

app.add_middleware(
    HallucinationGuardMiddleware,
    agent=agent,
    block_on=["hallucination"],  # auto-block hallucinations
    warn_on=["warning"],         # pass warnings with header
    safe_rewrite=True,           # replace with corrected text
)
# Replace `your-org/.../action` when you publish — placeholder name only:
- uses: your-org/hallucination-guard-action@v1
  with:
    text-file: outputs/llm_response.txt
    modules: citation,code,policy
    policy-docs: docs/policies/
    fail-on: hallucination
    # One or more, matching your guard backend:
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
    # gemini-api-key / openai-api-key / groq-api-key if applicable
from hallucination_guard import GuardAgent

agent = GuardAgent(
    verifier="claude-sonnet-4",  # or "gpt-4o", "ollama/mistral"
    strategy="multi_source",
    sources=["wikipedia", "web_search", "arxiv"],
)

result = agent.verify(
    text="The Great Wall is visible from the moon."
)
print(result.verdict)         # → "hallucination"
print(result.confidence)      # → 0.97
print(result.corrections[0])  # → "NASA confirms it is not visible..."
Repo layout — today vs roadmap
ai-hallucination-guard/          # this repository
├── functions/                   # Cloudflare Pages Functions
│   ├── api/
│   │   ├── config.js            # GET which LLM keys are set
│   │   └── verify.js            # POST proxy to LLMs
│   └── lib/
│       └── providers.js         # Anthropic / Gemini / OpenAI / Groq
├── public/
│   ├── index.html               # Live demo + all tabs
│   └── _headers
├── wrangler.toml
├── package.json
├── README.md
├── LICENSE
│
├── # Future optional Python package (not generated yet):
├── hallucination_guard/         # planned core package
├── api/                         # FastAPI server
└── pyproject.toml
At a glance:
• 4 verification strategies
• MIT license — commercial OK
• 6+ LLM backends / modules
• Custom knowledge bases
Ways to contribute
• New strategy plugins: domain verifiers (medical, legal, code)
• LLM backend adapters: Ollama, Gemini, Mistral, Cohere, local
• Benchmark datasets: labeled hallucination examples
• Framework integrations: LangChain, LlamaIndex, Haystack, CrewAI
Open-source growth checklist
• README with badges, demo GIF, quick start
• CONTRIBUTING.md — style, PR process, issue templates
• GitHub Discussions for Q&A and roadmap
• PyPI + npm releases, semver, changelogs
• Hugging Face Spaces live demo
• Launch posts (HN, Reddit r/LocalLLaMA, etc.)
• Docker image on GHCR for self-hosting