In March 2025 a university research lab set out to compare factuality across modern large language models. The lab focused on four production-grade models active in the market that month: OpenAI gpt-4o-mini (model snapshot 2025-03-12), OpenAI gpt-4o (2024-12-08), Meta Llama 3.6 (2024-11-21), and Anthropic Claude 2.1 (2025-01-15). By late April 2025 the lab's internal results diverged sharply from vendor reports and several public benchmarks. That divergence forced a redesign: the team implemented the multi-dimensional FACTS benchmark and recalibrated labeling rules. Over a 90-day effort, the lab's composite factuality signal rose from a baseline that produced inconsistent vendor-style scores to a stable, multi-dimensional profile that exposed where models actually failed.
This case study documents the context, the precise problem, the approach chosen, the step-by-step implementation, the measurable outcomes, the lessons learned, and practical guidance for teams that must make production decisions based on factuality test results. All testing dates, model snapshots, sample sizes, and cost numbers are specified so you can judge reproducibility and limits.
The Factuality Measurement Problem: Why Single-Score Benchmarks Failed
Initial runs used single-number benchmarks widely circulated in the industry. Vendor whitepapers claimed gpt-4o-mini achieved 94% factuality on "general knowledge" tests; Meta's materials claimed 87% for Llama 3.6 on an internal suite. The lab reproduced a subset and obtained 72% and 68% respectively on essentially the same prompts. The gap was not marginal: 16-22 percentage points depending on sampling.
Root causes identified by the team:
- Different definitions of "factuality" - vendors counted partially correct answers as correct, while our task required precise citation and numeric agreement.
- Dataset curation bias - vendor datasets included short, high-signal prompts and disproportionately filtered out long-tail, adversarial items.
- Annotation rules - vendors sometimes permitted model hallucinations if a "plausible" supporting sentence existed; our lab required verifiable evidence traceable to an authoritative source.
- Inter-annotator inconsistency - Fleiss kappa on initial labeling was 0.44, indicating moderate agreement at best.
Those differences led to conflicting, irreproducible results and made it unsafe to use single-score claims for deployment decisions in domains like medical summarization and legal brief drafting. The lab needed a factuality test that separated types of errors, exposed where models lied or guessed, and provided reproducible calibration procedures.
Adopting a Multi-Dimensional FACTS Benchmark: Design Choices and Rationale
The team selected FACTS as a multi-dimensional factuality benchmark framework and adapted it for the lab's production needs. FACTS (Factuality Assessment through Controlled Tests and Sourcing) divides factuality into five explicit dimensions:
- Assertion Accuracy - whether the core claims align with verified sources.
- Attribution Precision - whether quoted sources are correctly cited and linked to the claim.
- Temporal Correctness - whether time-dependent facts match the stated reference date.
- Numeric Precision - whether numerical values, units, and ranges are correct.
- Hallucination Rate - presence of invented entities, events, or references.
Key design decisions the lab made:
- Use a large, stratified sample of 3,000 prompts across six domains: news (900), science (600), legal (400), medicine (400), finance (300), commonsense/QA (400). This ensured coverage of short and long-form generation.
- Define strict gold labels with source URLs and timestamped documents. For temporal dimensions, gold labels included explicit reference dates (e.g., "as of 2024-12-31").
- Adopt a multi-annotator scheme with 25 raters; qualification tests required a 90% pass rate on known items before labeling began.
- Compute per-dimension scores and a composite weighted score. Weights were configurable; the lab used weights reflecting production risk (medical: higher weight on assertion accuracy and hallucination; finance: higher weight on numeric precision).

The rationale: separating error types produces actionable remediation. If a model fails attribution but is high on assertion accuracy, developers might add citation post-processing. If numeric precision fails across models, training or fine-tuning must focus on numbers and units.
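The weighted-composite idea above can be sketched in a few lines. This is an illustrative reconstruction, not the lab's actual code: the dimension names follow the FACTS dimensions, but the weight values and function names are assumptions.

```python
# Illustrative sketch of a risk-weighted FACTS composite.
# Dimension names mirror the five FACTS dimensions; the weights below
# are hypothetical examples, not the lab's published values.

def composite_score(scores, weights):
    """Weighted mean of per-dimension scores on a 0-100 scale.

    Hallucination rate is inverted before weighting, since lower is better.
    """
    total_w = sum(weights.values())
    acc = 0.0
    for dim, w in weights.items():
        value = scores[dim]
        if dim == "hallucination_rate":
            value = 100.0 - value  # invert: lower hallucination -> higher score
        acc += w * value
    return acc / total_w

# Example risk profile for a medical use case (assumed weights):
# assertion accuracy and hallucination weighted most heavily.
medical_weights = {
    "assertion_accuracy": 0.35,
    "attribution_precision": 0.15,
    "temporal_correctness": 0.10,
    "numeric_precision": 0.15,
    "hallucination_rate": 0.25,
}
```

A finance profile would shift weight toward `numeric_precision`; because the weights are configurable, the same per-dimension scores can be re-aggregated per deployment risk without relabeling anything.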
Rolling Out FACTS: A 90-Day Validation and Calibration Plan
Implementation followed a 90-day timeline with three phases: pilot, full annotation, and calibration/analysis. Dates and resource allocation are provided so teams can replicate the schedule.
Pilot (Days 0-14, March 3-17, 2025)
- Sample: 300 prompts (10% of total) across domains.
- Labelers: 8 senior raters labeled using initial guideline v1.0.
- Outcomes: Fleiss kappa = 0.62 (acceptable); identified ambiguous guideline items for temporal and attribution rules.
Full annotation
- Sample: remaining 2,700 prompts labeled by 25 raters (3 labelers per item, majority vote).
- Quality checks: 10% of items were seeded gold to measure rater drift; median labeler accuracy on seeds = 91%.
- Cost: annotation labor (including overhead) = $72,500; tooling and infrastructure = $12,300; total = $84,800.
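The 3-raters-per-item scheme and seeded-gold drift check can be sketched as follows. Function and field names here are assumptions for illustration; the lab's tooling is not described in detail.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a majority of raters (no majority -> None).

    With 3 raters per item, this means at least 2 must agree.
    """
    (label, count), = Counter(labels).most_common(1)
    return label if count >= len(labels) // 2 + 1 else None

def seed_accuracy(rater_labels, gold):
    """Fraction of seeded gold items a rater labeled correctly.

    rater_labels: dict item_id -> label given by this rater
    gold:         dict item_id -> gold label for seeded items
    """
    correct = sum(1 for item, lab in rater_labels.items() if gold.get(item) == lab)
    return correct / len(rater_labels)
```

Items with no majority (all three raters disagree) go to adjudication rather than receiving a vote-derived label, which matches the reconciliation step in the calibration phase.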
Calibration/analysis
- Reconciled disagreements; updated guideline to v1.2 and reran a 300-item adjudication sample.
- Computed per-dimension scores and confidence intervals via bootstrap (1,000 resamples).
- Produced a report dated April 30, 2025 with model snapshots pinned to the versions above.
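The bootstrap step (1,000 resamples) can be sketched minimally as below. This simplified version resamples a flat list of per-item scores; a faithful replication would resample within each stratum. The function name and percentile-interval choice are assumptions.

```python
import random

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of per-item scores.

    Resamples with replacement n_resamples times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

For production reporting, a library routine such as `scipy.stats.bootstrap` offers bias-corrected variants, but the percentile method above is enough to reproduce interval widths like the ±1.2 shown in the results table.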
All model API calls were recorded with exact prompts, system messages, and engine snapshots. We archived 100% of runs to enable reproducibility. The lab suspended any model fine-tuning during the run to ensure snapshots remained stable.
From 72% Composite to 89% Stability: Measurable Results Across Five Dimensions
Results are presented as per-dimension means with 95% bootstrap confidence intervals. All numbers are on the FACTS composite and per-dimension scales mapped to 0-100.
| Model (snapshot) | Composite FACTS | Assertion Accuracy | Attribution Precision | Temporal Correctness | Numeric Precision | Hallucination Rate (lower is better) |
|---|---|---|---|---|---|---|
| gpt-4o-mini (2025-03-12) | 89.1 (±1.2) | 91.4 (±1.0) | 86.2 (±1.6) | 88.0 (±1.4) | 84.3 (±1.9) | 7.0% (±0.8) |
| gpt-4o (2024-12-08) | 85.6 (±1.4) | 88.1 (±1.2) | 82.5 (±1.9) | 85.7 (±1.6) | 82.0 (±2.0) | 9.8% (±1.1) |
| Llama 3.6 (2024-11-21) | 77.4 (±1.7) | 80.2 (±1.5) | 74.0 (±2.4) | 79.3 (±2.0) | 71.9 (±2.6) | 16.1% (±1.6) |
| Claude 2.1 (2025-01-15) | 82.9 (±1.5) | 86.0 (±1.3) | 80.1 (±1.8) | 82.2 (±1.7) | 78.5 (±2.2) | 11.3% (±1.2) |

Key takeaways from the numbers:
- Composite scores were reproducible with narrow confidence intervals because of stratified sampling and bootstrapping.
- Differences from vendor claims largely traced to attribution and hallucination definitions. Vendors often reported only assertion accuracy and omitted hallucination rate or used permissive attributions.
- Model upgrades between vendor reports and the lab's snapshots partly explain differences. For example, gpt-4o-mini improved assertion accuracy between February and March 2025; the lab tested a March snapshot while a vendor whitepaper cited a January snapshot.
- Domain-specific failures persisted: Llama 3.6 had disproportionate numeric precision errors in finance and medicine; gpt-4o variants did better on temporal correctness in news but still produced fabricated citations at a measurable rate.
3 Critical FACTS Lessons Every Evaluator Must Learn
Lesson 1 - One number hides failure modes. A single "factuality 90%" claim can mask concentrated deficiencies in attribution or numeric precision. In this study, Llama 3.6's composite 77 suggested moderate reliability; the numeric precision score of 71.9 and a 16.1% hallucination rate showed that using it without numeric checks would be risky for finance or clinical tasks.
Lesson 2 - Annotation design changes outcomes. When the lab tightened attribution rules (requiring exact-source sentence linking), gpt-4o-mini's attribution precision fell from 91 to 86.2. Vendor datasets that did not enforce sentence-level linking overestimated attribution performance.
Lesson 3 - Sampling and time matter. Vendor reports often used short, curated prompts collected months earlier. The lab's real-world prompt mix included long-form summarization and adversarial variants; models perform differently on those. Also, model snapshots change frequently. The lab's April 30, 2025 report is valid for those snapshots only; repeat testing is necessary after any model update.
How Your Team Can Recreate This FACTS Approach Without Surprises
The following checklist and short self-assessment guide help teams replicate the lab's multi-dimensional testing and decide whether to accept vendor claims.
Quick checklist to run your own FACTS evaluation
- Pin model snapshots and record exact API responses and timestamps.
- Define per-dimension gold labels with source URLs and explicit reference dates.
- Stratify your prompt set by domain and length. Aim for >1,000 items for stable CIs; 3,000 is ideal for enterprise risk assessments.
- Use at least three annotators per item; require pre-qualification and run seeded gold checks to measure drift.
- Report per-dimension scores with bootstrap confidence intervals and per-domain breakdowns.
- Archive prompts, full outputs, and annotation artifacts to allow audits and reruns.
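The stratification step in the checklist can be sketched as a per-stratum quota sample. The quotas and stratum names below are illustrative assumptions; in practice you would key strata on (domain, length bucket) and match the lab's 3,000-item allocation.

```python
import random

def stratified_sample(pool, quotas, seed=0):
    """Draw a fixed quota per stratum, without replacement.

    pool:   dict stratum -> list of candidate prompts
    quotas: dict stratum -> number of prompts to draw
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for stratum, n in quotas.items():
        candidates = list(pool[stratum])
        rng.shuffle(candidates)
        sample[stratum] = candidates[:n]
    return sample

# Hypothetical usage with two strata:
pool = {"news": [f"news-{i}" for i in range(100)],
        "legal": [f"legal-{i}" for i in range(100)]}
quotas = {"news": 30, "legal": 20}
```

Pinning the random seed alongside the model snapshot IDs keeps the prompt draw itself auditable, which matters when a vendor asks you to rerun the evaluation.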
Self-assessment: Is your factuality program adequate? (score each item 0/1)
- We pin model versions and record timestamps. (0/1)
- We test across multiple domains including adversarial prompts. (0/1)
- We separate attribution and assertion in reporting. (0/1)
- We compute confidence intervals for our metrics. (0/1)
- We budgeted for human annotation and quality checks. (0/1)
- We archive outputs for reproducibility. (0/1)
Score guide: 5-6 = robust; 3-4 = needs targeted improvements; 0-2 = program likely gives misleading signals.
Short quiz: Spot methodological pitfalls
- True or False: Increasing the number of short, high-signal prompts will always yield a more conservative estimate of hallucination rate. (Answer: False - it biases toward higher scores because short prompts are easier.)
- Which of these changes can move attribution precision by more than 5 points? A) Counting partial citations as correct. B) Requiring sentence-level source linkage. C) Including longer prompts. (Answer: A and B drive large changes; C affects variance.)
- What statistic should you use to report uncertainty in your factuality scores? (Answer: Bootstrap confidence intervals or similar resampling methods.)

Why Conflicting Data Exists and How to Interpret It
Conflicting factuality reports commonly arise from three sources: different task definitions, dataset sampling bias, and labeler rules. Here is how to interpret discrepancies when you see them:
- Start by checking what "factuality" means in the report. If it excludes attribution or allows partial credit for near-miss numbers, the headline number will be optimistic.
- Request the snapshot and prompt corpus. If the vendor used a small, curated set of short prompts, ask for re-evaluation on your prompt distribution.
- Look for confidence intervals or statistical tests. Point estimates without uncertainty are unreliable for operational use.
- If possible, run a small cross-evaluation: pick 200 representative prompts, run both the vendor-reported model snapshot and your production model, and compute per-dimension differences.
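The cross-evaluation step above reduces to a per-dimension delta over shared prompts. This sketch assumes per-prompt, per-dimension scores already exist (produced by your annotation pipeline); the dimension names follow FACTS, and the function name is hypothetical.

```python
# FACTS dimensions as described in this case study.
DIMENSIONS = [
    "assertion_accuracy", "attribution_precision",
    "temporal_correctness", "numeric_precision", "hallucination_rate",
]

def per_dimension_delta(vendor_scores, production_scores):
    """Mean per-dimension difference (vendor - production) over shared prompts.

    Both arguments: dict prompt_id -> dict dimension -> score.
    A large positive delta on a dimension flags where the vendor's
    setup is more permissive than yours.
    """
    deltas = {}
    for dim in DIMENSIONS:
        diffs = [vendor_scores[p][dim] - production_scores[p][dim]
                 for p in vendor_scores if p in production_scores]
        deltas[dim] = sum(diffs) / len(diffs)
    return deltas
```

On 200 prompts this is cheap to run, and a delta concentrated in attribution or hallucination (rather than spread across all five dimensions) is exactly the signature of a definitional mismatch rather than a genuine capability gap.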
In our study the vendor-reported 94% claim for gpt-4o-mini became understandable once the vendor revealed their dataset: 180 short trivia prompts with relaxed citation rules and a single annotator per item. Once the lab used stratified sampling and three annotators with strict attribution, the composite dropped to 89.1 but gained interpretability and a lower hallucination rate that matched observed failures in downstream tasks.
Final guidance: What decisions to make with FACTS outputs
If your use case is high risk (health, law, finance), require low hallucination rates (<5%) and high numeric precision (>90) on domain-specific tests before adopting a model without human-in-the-loop checks. For lower-risk summaries, you may accept higher hallucination thresholds but keep attribution as mandatory output for traceability.
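The high-risk acceptance rule above can be expressed as a simple gate. The thresholds come from the text (<5% hallucination, >90 numeric precision); the function shape and score-dict keys are assumptions.

```python
def passes_high_risk_gate(scores, max_hallucination=5.0, min_numeric=90.0):
    """Gate for health/law/finance deployment without human-in-the-loop.

    scores: dict with per-dimension results on domain-specific tests,
    using the 0-100 scale for numeric precision and a percentage for
    hallucination rate (lower is better).
    """
    return (scores["hallucination_rate"] < max_hallucination
            and scores["numeric_precision"] > min_numeric)
```

By this gate, none of the four models in the results table would clear a high-risk deployment as-is: even gpt-4o-mini's 7.0% hallucination rate and 84.3 numeric precision fall short, which is why the lab paired it with citation extraction and human review rather than deploying it unchecked.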

When a vendor provides factuality numbers, ask for:
- Model snapshot IDs and test dates.
- Per-dimension breakdown and confidence intervals.
- Prompt sampling method and a representative sample.
- Annotation guidelines and inter-annotator agreement scores.
Applying FACTS in April 2025 made the lab's decisions evidence-based. The per-dimension breakdown helped prioritize mitigations: implement citation extraction for gpt-4o-mini, numeric verification checks for Llama 3.6, and context-aware temporal prompts for Claude 2.1. The lab now reruns the FACTS suite monthly for production models and after any major model update.
If you want a runnable template of the prompt stratification, annotation checklist, or bootstrapping scripts used in this case, tell me your target domains and I will provide a reproducible starter package with exact sampling code and annotation guidelines calibrated to your risk tolerance.
