Case 01 — Archer Evolv vs. Raw LLM

01 — The stakes

A date is not metadata. It's a deadline.

In risk and compliance, the date attached to a regulatory document is the trigger for everything downstream. The effective date sets when an obligation becomes binding. The comment-close date is a hard window that, once missed, cannot be reopened. The publication date anchors version control, supersession, and audit trails. A wrong date doesn't produce a slightly-off answer — it produces a missed filing, an out-of-date control, or an obligation tracked against the wrong calendar.

That is why a "usually right" model is the dangerous case. An answer that is wrong but plausible and confident flows silently into a compliance calendar and is discovered only after the deadline has passed.

Publication date

Version & supersession

Anchors which version of a rule governs, what it replaced, and the audit trail regulators expect. An error here corrupts the system of record.

Effective date

When the clock starts

Determines when an obligation becomes binding and when controls must be live. An error means operating out of compliance without knowing it.

Comment-close date

A window that won't reopen

The fixed deadline to influence a proposed rule. Miss it and the only remedy is litigation or living with the outcome.

02 — Accuracy

Same 55 documents. Different truth.

Every document was run through the raw-LLM determination mechanism and independently adjudicated by an Expert-in-the-Loop. The raw model was correct on fewer than half. Evolv, applying source-specific extraction configs and a tuned knowledge base, holds error below 5%.

Raw LLM · 56.4% error

Correct · 24 · 43.6%

Wrong but returned an answer · 14 · 25.5%

Failed / timed out (no answer) · 17 · 30.9%

Archer Evolv · <5% error · VERIFIED

Verified correct · >95%

Routed to EITL review · <5%

High-confidence → used directly · auto

Low-confidence → EITL review · caught, not shipped

Result persisted & reusable · once

7/20

Confidence is not a safety net.

Of the 20 answers the raw LLM rated high confidence, 7 were flatly wrong — a 35% false-assurance rate. You cannot filter risk by trusting the model's own confidence score; the failures it hides are precisely the ones a reviewer would have waved through.
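The percentages quoted in this section follow directly from the reported counts. A minimal sketch of that arithmetic, assuming only the figures stated above (the `pct` helper and variable names are illustrative):

```python
# Counts reported above for the 55-document raw-LLM run.
counts = {"correct": 24, "incorrect": 14, "failed": 17}
total = sum(counts.values())


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of each outcome across all documents.
shares = {label: pct(n, total) for label, n in counts.items()}

# "Error" here means anything that was not a correct answer.
error_rate = pct(counts["incorrect"] + counts["failed"], total)

# 7 of the 20 answers rated high confidence were wrong.
false_assurance = pct(7, 20)

print(total, shares, error_rate, false_assurance)
```

Note that the 35% figure is conditional on the model having rated its answer high confidence; it is not a share of all 55 documents.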

03 — The dangerous errors

What "wrong" actually looked like.

These aren't near-misses. The raw model defaulted to tidy, plausible dates — the first of a month, the first of a year — while the true date sat in a statutory citation it never retrieved. Several confident answers were off by years or decades. The correct date in every case is traceable to a specific legal reference.

Source	LLM said	Conf.	Actual	Off by	Authority for the correct date
DE · Gen. Assembly	1996-02-02	high	2024-06-25	~28 yrs	84 Del. Laws, c. 277, §§ 2,3 — latest amendment approved
MT · Sec. of State	2024-10-01	medium	2007-03-22	~17 yrs	Sec. 1, Ch. 38, L. 2007 — amendment approved
UT · Sec. of State	2025-12-06	high	2025-10-14	~2 mo	Ch. 17, 2025 Special Session 1 — signed by governor
DE · Gen. Assembly	2023-08-03	high	2025-06-30	~2 yrs	85 Del. Laws, c. 44, § 1 — latest amendment approved
CA · regulatory	2007-08-01	medium	2014-08-13	~7 yrs	CA Regulatory Notice Register — register filing
DE · Gen. Assembly	2024-07-01	medium	2026-01-30	~1.5 yrs	85 Del. Laws, c. 233, § 10 — latest amendment approved

Each correction was supplied by EITL adjudication and traces to a source-specific authority — exactly the signal Evolv's extraction configs are tuned to find and that retrieval confirms.
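The "Off by" column can be checked from the two date columns. A small sketch using Python's standard `datetime` module, with the rows copied from the table above (the `gap_days` helper is illustrative, and dividing by 365.25 is only an approximation of calendar years):

```python
from datetime import date

# (source, date the raw LLM returned, adjudicated date) from the table above.
rows = [
    ("DE", date(1996, 2, 2), date(2024, 6, 25)),
    ("MT", date(2024, 10, 1), date(2007, 3, 22)),
    ("UT", date(2025, 12, 6), date(2025, 10, 14)),
    ("DE", date(2023, 8, 3), date(2025, 6, 30)),
    ("CA", date(2007, 8, 1), date(2014, 8, 13)),
    ("DE", date(2024, 7, 1), date(2026, 1, 30)),
]


def gap_days(llm: date, actual: date) -> int:
    """Absolute distance in days between the model's date and the true date."""
    return abs((actual - llm).days)


for source, llm, actual in rows:
    days = gap_days(llm, actual)
    print(f"{source}: {days} days (~{days / 365.25:.1f} years)")
```

The absolute value matters because the model erred in both directions: some answers predate the true date and others postdate it.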

04 — Speed & Cost

The raw model pays every time. Evolv pays once.

Per request, the raw LLM averaged ~4 seconds against a 5-second timeout; Evolv serves a verified, persisted date in ~0.05 seconds — about 80× faster. But the real divergence is repetition: when an agent or analyst asks for the same document's date again, the raw model re-computes from scratch — re-incurring latency, inference cost, and a fresh, non-deterministic chance of being wrong. Evolv answers from cache: compute once, verify once, serve forever.

Documents in scope · 500

Repeat lookups per document / month · 12

Raw LLM (recompute each time) · Inference calls / month · Compute time / month · Expected wrong answers served / month *

Archer Evolv (compute once, cache) · Inference calls / month · Compute time / month · Expected wrong answers served / month

[Interactive calculator: the value of each metric is computed from the two inputs above.]

* Wrong-answer estimate applies the measured rates to answers actually served: raw LLM 25.5% (incorrect, returned as if valid) on every recompute; Evolv <5%, caught at ingestion and routed to review rather than shipped. Latency assumptions: raw 4.0s/call, Evolv 0.05s/cache read. Inference is incurred once per document at ingestion for Evolv. Figures are illustrative and scale with the sliders.
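The footnote's assumptions can be worked through for the default inputs (500 documents, 12 repeat lookups each). The sketch below is one plausible reading, not the calculator's actual formulas, which the page does not publish. In particular, it reports the Evolv side as an upper bound on documents routed to review, because the source gives no rate for wrong answers actually served, and it leaves out ingestion latency for the same reason. All function and field names are illustrative.

```python
# Rates and latencies quoted in the footnote above.
RAW_LATENCY_S = 4.0       # seconds per raw-LLM call
CACHE_READ_S = 0.05       # seconds per Evolv cache read
RAW_WRONG_RATE = 0.255    # raw answers returned but incorrect
EVOLV_REVIEW_CAP = 0.05   # upper bound on results routed to review


def raw_llm_month(documents: int, lookups_per_doc: int) -> dict:
    """Raw LLM: every lookup is a fresh inference call."""
    calls = documents * lookups_per_doc
    return {
        "inference_calls": calls,
        "compute_seconds": calls * RAW_LATENCY_S,
        "expected_wrong_served": round(calls * RAW_WRONG_RATE),
    }


def evolv_month(documents: int, lookups_per_doc: int) -> dict:
    """Evolv: one inference per document at ingestion, then cache reads.

    Assumption: the one-time ingestion calls are counted in full here, and
    ingestion latency is omitted because the source does not give one.
    """
    reads = documents * lookups_per_doc
    return {
        "inference_calls": documents,
        "compute_seconds": reads * CACHE_READ_S,
        "max_routed_to_review": round(documents * EVOLV_REVIEW_CAP),
    }


raw = raw_llm_month(500, 12)
evolv = evolv_month(500, 12)
print(raw)
print(evolv)
print("per-lookup speedup:", RAW_LATENCY_S / CACHE_READ_S)
```

Under these assumptions the gap widens with the number of repeat lookups: the raw figures grow with every lookup, while the Evolv inference count depends only on the number of documents.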

05 — How Evolv does it

A tuned pipeline, governed by experts.

Evolv manages reusable, content-source-specific collections of extraction-configs and KB models, each linked to one or many sources. When a new document enters the infer_document_key_date pipeline, the specialized AI operator applies the matching config to drive its determination strategy.

Step 1 · Extract

Pull search signals from the document

Evolv extraction + tuned LLM identify the best in-document clues: title, citation, agency, chapter, register reference, effective date, source-specific metadata.

Step 2 · Retrieve

Find supporting evidence via those signals

The MCP server routes to the CAI App API / Advanced Search — or the pipeline calls Archer Advanced Search directly — across millions of pre-curated documents.

Step 3 · Infer

Determine the date from retrieved evidence

The LLM combines the retrieved authority with the original document to determine the date and return structured evidence + a confidence score.

Step 4 · Persist

Write the verified result onto the document

The answer, evidence, and confidence are stored — reusable on every future lookup at cache speed.

High confidence?

Yes → use the inferred date

No → route to EITL review
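The four steps and the confidence gate can be sketched as a single function. This is an illustration of the control flow only, not Evolv's implementation: apart from the pipeline name `infer_document_key_date`, every name here is hypothetical, the extract, retrieve, and infer stages are passed in as plain callables, and the 0.8 threshold is an assumed value that the source does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; the source does not state one


@dataclass
class Determination:
    key_date: str
    confidence: float
    evidence: List[str]


def infer_document_key_date(
    doc_id: str,
    text: str,
    extract: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    infer: Callable[[str, List[str]], Tuple[str, float]],
    store: Dict[str, Determination],
    review_queue: List[Tuple[str, Determination]],
) -> Determination:
    if doc_id in store:                 # repeat lookup: serve the persisted result
        return store[doc_id]
    signals = extract(text)             # Step 1: pull search signals
    evidence = retrieve(signals)        # Step 2: find supporting evidence
    key_date, confidence = infer(text, evidence)  # Step 3: determine the date
    result = Determination(key_date, confidence, evidence)
    if confidence >= CONFIDENCE_THRESHOLD:
        store[doc_id] = result          # Step 4: persist for reuse
    else:
        review_queue.append((doc_id, result))  # held for EITL review
    return result


# Toy stand-ins so the sketch runs end to end.
calls = {"infer": 0}


def toy_extract(text: str) -> List[str]:
    return text.split()[:3]


def toy_retrieve(signals: List[str]) -> List[str]:
    return [f"evidence for {s}" for s in signals]


def toy_infer(text: str, evidence: List[str]) -> Tuple[str, float]:
    calls["infer"] += 1
    return "2025-10-14", (0.9 if evidence else 0.2)


store: Dict[str, Determination] = {}
queue: List[Tuple[str, Determination]] = []
args = (toy_extract, toy_retrieve, toy_infer, store, queue)
first = infer_document_key_date("doc-1", "Ch. 17 2025 Special Session 1", *args)
second = infer_document_key_date("doc-1", "Ch. 17 2025 Special Session 1", *args)
low = infer_document_key_date("doc-2", "", *args)
```

In this sketch the second request for `doc-1` never reaches the inference step, and the low-confidence `doc-2` result lands in the review queue instead of the store.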

The EITL governance loop

Tuned, tested, validated — before production.

EITL creates & manages source-specific extraction-configs and KB models in the Data Admin Tool.
Tests each config against sample / training data sets before release.
Validates accuracy, then activates the config for a source or spider.
Low-confidence results return to EITL — feeding corrections back into the config.
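The test-then-activate step of the loop might look like the following gate. This is a hedged sketch rather than a description of the Data Admin Tool: `validate_config`, `toy_predict`, and the 0.95 release bar are all assumptions made for illustration, and the sample sentences are invented around dates from the table in section 03.

```python
from typing import Callable, List, Tuple

MIN_ACCURACY = 0.95  # assumed release bar; the source does not state one


def validate_config(
    predict: Callable[[str], str],
    samples: List[Tuple[str, str]],
    min_accuracy: float = MIN_ACCURACY,
) -> Tuple[float, bool]:
    """Score a candidate config on labelled samples and say whether to activate it."""
    if not samples:
        raise ValueError("need at least one labelled sample")
    correct = sum(1 for text, expected in samples if predict(text) == expected)
    accuracy = correct / len(samples)
    return accuracy, accuracy >= min_accuracy


def toy_predict(text: str) -> str:
    """Toy config: return the token that follows the word 'approved'."""
    words = text.split()
    return words[words.index("approved") + 1] if "approved" in words else ""


samples = [
    ("amendment approved 2024-06-25", "2024-06-25"),
    ("amendment approved 2007-03-22", "2007-03-22"),
    ("signed by governor 2025-10-14", "2025-10-14"),
]
accuracy, activate = validate_config(toy_predict, samples)
print(f"accuracy={accuracy:.2f} activate={activate}")
```

Here the toy config misses the "signed by governor" pattern, so it falls below the bar and would stay inactive until corrected, which mirrors the feedback path in the last bullet above.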

06 — The bottom line

One approach guesses. The other is accountable.

Accuracy

Raw LLM: 44% correct, 56% wrong or failed, and wrong on 35% of the answers it rated high confidence (7 of 20). Evolv: >95% verified, with the residual caught and routed to review, never silently shipped.

Speed

~4s per raw request against a 5s timeout, recomputed on every ask. Evolv serves verified dates from persistence in ~0.05s — roughly 80× faster on repeat.

Cost

Raw pays inference, latency, and fresh error risk on every lookup. Evolv computes once at ingestion; every subsequent answer is near-free and already verified.

For a function where a wrong date is a missed deadline, the raw LLM's defining failure isn't that it's wrong — it's that it's confidently wrong, repeatedly, and at full cost each time. Evolv replaces that with a verified, cached, expert-governed answer.