Beyond Generation: Engineering a Verifiable Clinical AI Pipeline

The most difficult part of applied artificial intelligence is not model selection, prompt design, or fine-tuning. It is engineering a system that understands its own uncertainty.

Clinical AI exposes this challenge more than almost any other domain. In healthcare, a system is not judged by how fluent or impressive its output looks, but by whether its suggestions can be trusted, inspected, and justified. A single hallucinated recommendation can have real-world consequences.

This project was built with a very specific goal:

To design and implement the complete working pipeline of a clinical AI system, where every step is explainable, verifiable, and auditable, not just the final output.

Rather than focusing on a single model or a single prompt, the project focuses on how multiple AI components interact, how information flows between them, and how confidence is built (or withheld) at each stage.

1. Why Clinical AI Is a Pipeline Engineering Problem

Most AI demos reduce the problem to a single transformation:

“Convert audio into a clinical note.”

In practice, clinical documentation is not a single task. It is a chain of tightly coupled reasoning steps, each of which has different failure modes.

A clinician listening to a patient performs multiple cognitive operations:

They distinguish symptoms from background conversation.
They track durations and progression.
They mentally map symptoms to known clinical guidelines.
They decide what is certain versus what needs follow-up.
They document cautiously, often hedging language when evidence is weak.

Attempting to collapse all of this into a single LLM call creates a system that appears confident but lacks accountability.

The central design decision of this project was therefore to treat clinical documentation as a multi-stage pipeline, where:

Each stage performs one responsibility.
Each stage produces structured outputs.
Downstream stages are not allowed to assume correctness from upstream stages.

This framing shifts the problem from “How good is the model?” to “How robust is the system?”

2. Designing the Data Architecture Before the Models

Before building any AI components, the project required a data architecture capable of representing process, not just results.

Most AI systems store:

an input,
a generated output.

This project stores:

the raw transcript,
the extracted entities,
the generated SOAP note,
the individual claims inside that note,
the evidence used to support each claim,
confidence scores,
and the clinician’s final edits.

This distinction is critical because clinical reasoning is iterative. A transcript should not change, but interpretations should be revisable.

Key Architectural Decisions

Transcript and clinical note are intentionally decoupled. The transcript represents objective input data, while the SOAP note represents interpretation. By separating them, the system allows the same transcript to be reprocessed with different logic, models, or evidence sources without losing the original input.

Claims are first-class entities. Instead of treating the note as a block of text, the system extracts and stores individual claims (especially from Assessment and Plan sections). Each claim can then be independently verified, scored, and reviewed.

Verification data is persisted. Evidence snippets, guideline references, and confidence scores are stored alongside claims. This creates an auditable trail that can be inspected later, exported, or used for training improvements.

The database schema reflects the belief that trust is something you store explicitly, not something you infer later.
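As a rough illustration of these decisions, here is a minimal sketch of the data model using Python dataclasses. The class and field names (Transcript, Evidence, Claim, SoapNote) are assumptions made for this example, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Transcript:
    """Immutable input: the raw transcript never changes after creation."""
    transcript_id: str
    text: str


@dataclass
class Evidence:
    """A guideline snippet stored alongside the claim it was checked against."""
    source: str
    snippet: str


@dataclass
class Claim:
    """A single statement from the note, verified and reviewed on its own."""
    text: str
    section: str                      # e.g. "Assessment" or "Plan"
    label: str = "Unverified"         # set later by the verification stage
    confidence: Optional[float] = None
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class SoapNote:
    """An interpretation of a transcript; many notes may share one transcript."""
    note_id: str
    transcript_id: str                # a reference, not a copy of the input
    draft_text: str
    claims: List[Claim] = field(default_factory=list)
    clinician_edits: Optional[str] = None


# The same transcript can be reprocessed into a second note (hypothetical data).
transcript = Transcript("t1", "Patient reports a cough for two weeks.")
note_v1 = SoapNote("n1", transcript.transcript_id, "A: Cough, two weeks.")
note_v2 = SoapNote("n2", transcript.transcript_id, "A: Cough, cause unclear.")
```

Freezing the transcript while leaving notes mutable mirrors the rule that inputs stay fixed and interpretations stay revisable.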

3. High-Level System Architecture

At a system level, the pipeline progresses through the following stages in a strict order:

Audio ingestion and normalization
Offline transcription
Medical entity extraction
Knowledge retrieval
Structured SOAP generation
Claim verification (CRAG)
Human review and finalization

Each stage produces outputs that are both human-readable and machine-consumable, allowing the pipeline to be debugged, inspected, and improved incrementally.

Importantly, no single stage is allowed to silently fail. Each stage reports status, duration, and confidence indicators that are surfaced in the UI.
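One way to make silent failure impossible is to wrap every stage in a runner that always returns a result record. The sketch below assumes a hypothetical StageResult type and run_stage helper; neither name comes from the project itself.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StageResult:
    name: str
    status: str                       # "ok" or "failed"
    duration_s: float
    output: Any = None
    confidence: Optional[float] = None
    error: Optional[str] = None


def run_stage(name: str, fn: Callable[[], Any]) -> StageResult:
    """Run one stage and always return a result, even when the stage raises."""
    start = time.perf_counter()
    try:
        output = fn()
        return StageResult(name, "ok", time.perf_counter() - start, output)
    except Exception as exc:          # record the failure instead of hiding it
        return StageResult(name, "failed", time.perf_counter() - start,
                           error=str(exc))


def broken_stage() -> str:
    raise RuntimeError("model not loaded")


ok = run_stage("transcription", lambda: "verbatim text")
bad = run_stage("entity_extraction", broken_stage)
```

Because the failure is captured as data, a UI can render the status, duration, and error of every stage instead of showing a blank result.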

4. Offline Audio Transcription: Privacy and Determinism

The pipeline begins with raw clinician–patient audio.

Instead of relying on cloud APIs, transcription is performed using whisper.cpp, running entirely offline.

This choice was deliberate:

It avoids transmitting protected health information over the network.
It ensures predictable latency and uptime.
It allows the system to run in constrained or offline environments.

Before transcription, audio is normalized into a consistent format (16 kHz, mono PCM). This ensures stable ASR performance and simplifies downstream processing.
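The article does not say which tool performs the normalization. A common choice is ffmpeg, so the sketch below builds an ffmpeg command for 16 kHz mono PCM. The function names, the file names, and the 16-bit sample depth are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def build_normalize_command(src: Path, dst: Path) -> List[str]:
    """Build an ffmpeg command that resamples to 16 kHz mono 16-bit PCM."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", str(src),
        "-ar", "16000",          # sample rate: 16 kHz
        "-ac", "1",              # channels: mono
        "-c:a", "pcm_s16le",     # codec: signed 16-bit PCM (assumed depth)
        str(dst),
    ]


def normalize_audio(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_normalize_command(src, dst), check=True)


# Building the command does not require ffmpeg to be installed.
cmd = build_normalize_command(Path("visit.m4a"), Path("visit.wav"))
```

Keeping command construction separate from execution makes the conversion settings easy to log and to test without touching real audio.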

The output of this stage is a verbatim transcript. At this point, the system has not made any medical judgments. It has only converted sound into text.

5. Medical Entity Extraction: Structuring the Noise

Raw transcripts are noisy. Patients speak conversationally, jump between topics, and use imprecise language.

To prevent the LLM from reasoning over unstructured text, the system performs a medical Named Entity Recognition (NER) pass.

This stage extracts:

Symptoms mentioned by the patient
Relevant body parts
Durations and temporal modifiers
Other clinically relevant descriptors

The extracted entities serve as constraints, not enhancements.

If a symptom is not detected, the system does not confidently reason about it later. This design choice intentionally sacrifices recall in favor of safety. It is better for the system to say “insufficient evidence” than to hallucinate a diagnosis.
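The idea of entities as constraints can be sketched as a simple gate: a symptom may only be reasoned about if the NER pass detected it. The ExtractedEntities container and gate_symptom function below are hypothetical names, and the exact-match lookup is a simplification of whatever matching the real system uses.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ExtractedEntities:
    # Entity strings are assumed to be stored in lowercase.
    symptoms: FrozenSet[str]
    body_parts: FrozenSet[str] = frozenset()
    durations: FrozenSet[str] = frozenset()


def gate_symptom(symptom: str, entities: ExtractedEntities) -> str:
    """Allow reasoning about a symptom only if the NER pass detected it."""
    if symptom.strip().lower() in entities.symptoms:
        return "allowed"
    return "insufficient evidence"


entities = ExtractedEntities(symptoms=frozenset({"cough", "fever"}))
print(gate_symptom("Cough", entities))
print(gate_symptom("chest pain", entities))
```

The second call returns "insufficient evidence" even if the symptom was actually spoken but missed by NER, which is exactly the recall-for-safety trade described above.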

6. Retrieval: Grounding the System in Evidence

Once entities are extracted, the system conditionally performs retrieval against a curated knowledge base.

The retrieval logic is intentionally conservative:

Retrieval only happens if relevant entities exist.
The system retrieves short, focused guideline snippets.
If no relevant evidence is found, generation proceeds without retrieved context.

The knowledge base consists of:

Trusted clinical guidelines
ICD-10 code descriptions

These documents are embedded and stored in ChromaDB, allowing semantic search rather than keyword matching.

This stage ensures that generation is anchored in real clinical references, not model priors.
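The conservative retrieval rules can be sketched as a small guard around the search call. To keep the example runnable without a vector store, the search function is passed in as a plain callable; in a ChromaDB setup it would presumably wrap a collection query. All names here (Retriever, retrieve_evidence, fake_search) and the limit of three snippets are assumptions.

```python
from typing import Callable, List, Sequence

# A retriever takes a query string and a result limit and returns snippets.
Retriever = Callable[[str, int], List[str]]


def retrieve_evidence(entities: Sequence[str], search: Retriever,
                      max_snippets: int = 3) -> List[str]:
    """Query the knowledge base only when there is something to look up."""
    if not entities:
        return []                      # no entities: skip retrieval entirely
    query = ", ".join(sorted(set(entities)))
    return search(query, max_snippets)[:max_snippets]


def fake_search(query: str, k: int) -> List[str]:
    """Stand-in for a vector store query; returns canned placeholder text."""
    corpus = {"cough": "Guideline snippet about cough (placeholder text)."}
    return [text for key, text in corpus.items() if key in query][:k]


print(retrieve_evidence([], fake_search))
print(retrieve_evidence(["cough"], fake_search))
```

An empty list is a meaningful result here: it tells the generation stage that no evidence is available, rather than handing it a loosely related match.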

7. Structured SOAP Note Generation

Only after transcription, entity extraction, and retrieval does the system invoke the LLM to generate a SOAP note.

The LLM is not asked to “think freely.” Instead, it is constrained by:

The transcript
Extracted entities
Retrieved evidence
Explicit rules about what it is allowed to state

The goal is not creativity. The goal is structured synthesis.

The output at this stage is an AI-generated SOAP draft that is explicitly marked as such and prepared for verification.
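A constrained prompt of this kind might be assembled as follows. This is a sketch only: the rule wording, the section layout, and the build_soap_prompt name are assumptions, and the article does not show the project's actual prompt.

```python
from typing import List

# Hypothetical rules; the real system's rule set is not given in the article.
RULES = [
    "Use only facts present in the transcript, entities, or evidence.",
    "Write 'insufficient evidence' when support is missing.",
    "Return four sections: Subjective, Objective, Assessment, Plan.",
]


def build_soap_prompt(transcript: str, entities: List[str],
                      evidence: List[str]) -> str:
    """Assemble a constrained prompt from the upstream stage outputs."""
    evidence_block = "\n".join(f"- {e}" for e in evidence) or "- (none retrieved)"
    entity_block = ", ".join(entities) or "(none detected)"
    rules_block = "\n".join(f"{i}. {r}" for i, r in enumerate(RULES, 1))
    return (
        "You are drafting a SOAP note for clinician review.\n\n"
        f"RULES:\n{rules_block}\n\n"
        f"TRANSCRIPT:\n{transcript}\n\n"
        f"ENTITIES: {entity_block}\n\n"
        f"EVIDENCE:\n{evidence_block}\n"
    )


prompt = build_soap_prompt("Patient reports a cough for two weeks.",
                           ["cough", "two weeks"], [])
print(prompt)
```

Stating explicitly that no evidence was retrieved gives the model a reason to hedge instead of filling the gap from its priors.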

8. CRAG: Verifying Before Trusting

This is the most important component of the system.

Instead of assuming the generated SOAP note is correct, the system performs Corrective Retrieval-Augmented Generation (CRAG).

The process works as follows:

The system extracts individual claims from the Assessment and Plan sections.
Each claim is independently compared against retrieved guideline evidence.
A semantic similarity score is computed.
The claim is labeled as:
- Supported
- Partially Supported
- Unsupported

This verification step runs independently of generation. Even if the LLM produces fluent text, CRAG can still flag unsupported claims.

This transforms the system from “AI that answers” into “AI that justifies.”
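The scoring and labeling steps can be sketched as below. Two things are assumptions: the bag-of-words cosine similarity is a toy stand-in for the embedding-based similarity a real system would use, and the thresholds (0.6 and 0.3) are arbitrary illustrative values, not the project's.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def _vector(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def verify_claim(claim: str, evidence: List[str],
                 supported: float = 0.6,
                 partial: float = 0.3) -> Tuple[str, float]:
    """Label a claim by its best similarity to any evidence snippet."""
    best = max((cosine(_vector(claim), _vector(e)) for e in evidence),
               default=0.0)
    if best >= supported:
        return "Supported", best
    if best >= partial:
        return "Partially Supported", best
    return "Unsupported", best


label, score = verify_claim("rest and hydration advised today",
                            ["rest and fluids"])
print(label, round(score, 2))
```

With no evidence at all, the best score defaults to 0.0 and the claim is labeled Unsupported, so a fluent but ungrounded claim cannot pass by default.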

9. Agentic Orchestration and Graceful Degradation

Rather than a rigid, linear pipeline, the backend is designed as an agentic orchestrator.

Each stage declares:

its inputs,
its outputs,
and its failure conditions.

If a stage fails:

The failure is recorded.
Downstream stages adapt.
The UI reflects reduced confidence.

For example:

If retrieval fails, generation proceeds with a warning.
If verification fails, claims are marked as unverified.
If the LLM is unavailable, transcription and extraction still work.

This design ensures the system degrades gracefully rather than catastrophically, which is essential in clinical workflows.
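The three degradation examples above can be sketched in a single orchestration function. The PipelineState type, the run_pipeline signature, and the warning strings are assumptions for illustration; an unavailable LLM is modeled here by passing None as the generator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineState:
    transcript: str
    evidence: List[str] = field(default_factory=list)
    draft: Optional[str] = None
    verification: str = "Unverified"
    warnings: List[str] = field(default_factory=list)


def run_pipeline(transcript: str,
                 retrieve: Callable[[str], List[str]],
                 generate: Optional[Callable[[str, List[str]], str]],
                 verify: Callable[[str, List[str]], str]) -> PipelineState:
    """Run the later stages, recording failures instead of aborting."""
    state = PipelineState(transcript)

    try:
        state.evidence = retrieve(transcript)
    except Exception as exc:              # generation proceeds with a warning
        state.warnings.append(f"retrieval failed: {exc}")

    if generate is None:                  # LLM unavailable
        state.warnings.append("generation skipped: LLM unavailable")
        return state                      # earlier outputs remain usable

    state.draft = generate(transcript, state.evidence)

    try:
        state.verification = verify(state.draft, state.evidence)
    except Exception as exc:              # claims stay marked as unverified
        state.warnings.append(f"verification failed: {exc}")
    return state


def failing(*_args):
    raise RuntimeError("service down")


state = run_pipeline("Patient reports a cough.", failing,
                     lambda t, ev: "DRAFT: cough noted.", failing)
print(state.verification, state.warnings)
```

The warnings list is what a UI would read to show reduced confidence, so a degraded run is visibly different from a clean one.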

10. Human-in-the-Loop Review as a Safety Mechanism

The frontend is not just a display layer. It is an active safety mechanism.

The UI:

Clearly separates AI-generated drafts from clinician edits.
Displays verification labels with color-coded confidence.
Allows clinicians to expand and inspect evidence snippets.
Requires explicit sign-off before finalization.

The system never assumes trust. It invites scrutiny.
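On the backend, the sign-off requirement could be enforced with a guard like the one sketched here. The NoteReview class, the SignOffRequired exception, and the rule that any edit clears an earlier sign-off are assumptions for this example rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


class SignOffRequired(Exception):
    """Raised when finalization is attempted without clinician approval."""


@dataclass
class NoteReview:
    ai_draft: str                          # never overwritten by edits
    clinician_text: Optional[str] = None   # edits are stored separately
    signed_off_by: Optional[str] = None

    def edit(self, text: str) -> None:
        self.clinician_text = text
        self.signed_off_by = None          # assumed: an edit clears sign-off

    def sign_off(self, clinician_id: str) -> None:
        self.signed_off_by = clinician_id

    def finalize(self) -> str:
        """Return the final note text, but only after explicit sign-off."""
        if self.signed_off_by is None:
            raise SignOffRequired("explicit clinician sign-off is required")
        if self.clinician_text is not None:
            return self.clinician_text
        return self.ai_draft


review = NoteReview(ai_draft="A: Cough, two weeks.")
review.edit("A: Cough, two weeks. Follow up in one week.")
review.sign_off("clinician-42")
final_text = review.finalize()
```

Keeping ai_draft and clinician_text as separate fields preserves the distinction between what the AI wrote and what the clinician approved.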

Final Reflection

This project demonstrates that clinical AI is not about smarter models, but about better systems.

Trustworthy AI emerges when:

reasoning is decomposed,
evidence is explicit,
uncertainty is surfaced,
and humans remain in control.

The most important outcome of this project is not the SOAP note it generates but the pipeline that explains how that note came to be.
