Healthcare PDF-Grounding Benchmark Viewer

What this benchmark is

A persona-based, document-grounded question-answering benchmark for evaluating frontier LLMs on real healthcare document work. It doesn't test whether a model knows clinical facts — it tests whether a model can do the document work of a specific healthcare professional: reading a supplied clinical document and answering strictly from it. It extends the GDP.pdf paradigm — faithful reasoning over authentic professional PDFs — into healthcare, with its own construct: personas mapped to real occupations, span-grounded sub-questions, and rubric-based grading.

Domain
Healthcare document work, grounded in authentic clinical and public-health documents from sources such as CDC, TCGA-Reports, MTSamples, and MIMIC-IV, with provenance and access level tracked per document. The task samples currently shown here are grounded in CDC documents.

Taxonomies
The benchmark is organized around two taxonomies: a persona taxonomy that defines who does the work, and a task-type taxonomy that defines what cognitive operation each item tests.

Persona taxonomy
Thirteen healthcare personas cover the field's document-handling roles — from frontline clinicians and diagnostic interpreters to coders, payer reviewers, public health professionals, and patients. Each persona is mapped to real O*NET occupations, so tasks reflect what practitioners actually consult on the job rather than editorial guesswork. The persona is the primary axis of variation: the same source document can yield a protocol-extraction item for a nurse and a compliance-threshold item for an administrator.

Task-type taxonomy
Nineteen capability codes classify the cognitive operation each item tests, spanning four tiers:

  • Foundational — value extraction, faithful grounding, layout navigation
  • Structural — table parsing, conditional flow, chart reading
  • Advanced — cross-document reconciliation, temporal reasoning, absence-aware abstention, synthesis
  • Healthcare-specific — clinical coding, payer-criteria matching, medication reconciliation, care transitions, safety guardrails
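
As a rough illustration, the four tiers above could be held in a simple lookup table. This is only a sketch: the names TASK_TIERS and tier_of are hypothetical, the actual capability code identifiers are not given here, and only the operations named in the list are included, not all nineteen codes.

```python
from typing import Dict, List, Optional

# Tier -> operations named in the list above. Hypothetical structure: the real
# benchmark defines nineteen capability codes whose identifiers are not shown
# in this document, so only the operations spelled out in the text appear here.
TASK_TIERS: Dict[str, List[str]] = {
    "Foundational": ["value extraction", "faithful grounding", "layout navigation"],
    "Structural": ["table parsing", "conditional flow", "chart reading"],
    "Advanced": [
        "cross-document reconciliation",
        "temporal reasoning",
        "absence-aware abstention",
        "synthesis",
    ],
    "Healthcare-specific": [
        "clinical coding",
        "payer-criteria matching",
        "medication reconciliation",
        "care transitions",
        "safety guardrails",
    ],
}


def tier_of(operation: str) -> Optional[str]:
    """Return the tier an operation belongs to, or None if it is not listed."""
    for tier, operations in TASK_TIERS.items():
        if operation in operations:
            return tier
    return None
```

For example, tier_of("table parsing") returns "Structural", and an operation that is not in the table returns None.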

How our approach differs from GDP.PDF

GDP.PDF limitation: No occupational grounding
Items are grouped only by domain, and each one comes from whichever expert happened to have done that task. Nothing ties the tasks to the people who actually do the work, and there's no finer level of coverage than the domain.
Why our approach improves on it: Explicit persona–occupation alignment
Each persona is mapped to real occupations and their work activities, so every item carries a persona, occupation, and work aspect. Coverage is driven by workforce data rather than contributor convenience, and is auditable and reproducible.

GDP.PDF limitation: Domain-agnostic taxonomy
The capability taxonomy is deliberately built to generalize across all domains, so no category tests an operation that matters only inside one profession.
Why our approach improves on it: Added healthcare-specific tier
We keep the generic tiers and add a healthcare-specific one, capturing operations like clinical coding and medication reconciliation that distinguish skilled healthcare document work and that a cross-domain taxonomy cannot see.

GDP.PDF limitation: Per-item rubrics
Every rubric is written from scratch for a single document–question pair, so recurring criteria are re-derived each time, with no reuse and no consistency across items.
Why our approach improves on it: Tiered rubric criteria
Grading criteria are organized into tiers (domain, persona, occupation, and task), and every item draws from each tier. This keeps criteria consistent and reusable across items, and ensures each item is graded from broad grounding down to task-specific requirements.
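
One way to picture the tiered rubric idea is as a small assembly step that reuses shared criteria and adds task-specific ones. The following is a minimal sketch under stated assumptions, not the benchmark's actual grading code: the names RUBRIC_TIERS and assemble_rubric, and the dictionary layout, are hypothetical.

```python
from typing import Dict, List

# The four tiers named in the text, ordered from broad to specific.
RUBRIC_TIERS = ("domain", "persona", "occupation", "task")


def assemble_rubric(
    shared: Dict[str, List[str]], task_criteria: List[str]
) -> Dict[str, List[str]]:
    """Build one item's rubric from reusable and task-specific criteria.

    `shared` holds the reusable domain, persona, and occupation criteria
    (hypothetical layout: tier name -> list of criterion strings), and
    `task_criteria` holds the criteria written for this item's task.
    """
    # Copy the reusable tiers so one item's rubric cannot mutate another's.
    rubric = {tier: list(shared.get(tier, [])) for tier in RUBRIC_TIERS[:-1]}
    rubric["task"] = list(task_criteria)

    # Every item must draw from each tier, so an empty tier is an error.
    missing = [tier for tier in RUBRIC_TIERS if not rubric[tier]]
    if missing:
        raise ValueError(f"rubric is missing criteria for tiers: {missing}")
    return rubric
```

In this sketch the domain, persona, and occupation criteria are written once and passed to every item that shares them, which is where the reuse and consistency come from; only the task list changes per item.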