Living Curriculum

Research & Paper Spotlights

Each week features curated papers from three categories. Our moat against static MOOCs is a curriculum that evolves with the field.

Paper Selection Framework

We balance the three categories intentionally, so the curriculum is not hijacked by LLM hype at the expense of implementation science and regulation.

Method Papers

New models, benchmarks, evaluation frameworks

Clinical Validation

Real-world deployment, prospective studies, workflow impact

Critical Analysis

Bias, data shift, hallucination, failure modes, governance

Five Questions for Every Paper

1. What is the research question?

2. Where does the data come from? Is there selection bias?

3. Are the model and comparator appropriate?

4. Do the metrics matter for clinical decision-making?

5. Can this enter teaching materials? Can this enter clinical practice?

“The most dangerous papers aren’t the obviously flawed ones — they’re the beautifully packaged half-finished products.”

Featured Papers — March 2026

Evaluation & Methodology

A clinical environment simulator for dynamic AI evaluation

Luo L, et al. · Nature Medicine · 2026 Mar 12

DOI: 10.1038/s41591-026-04252-6 · PMID: 41820673

Why This Paper Matters

Medical AI cannot be judged by static benchmark scores alone. This paper proposes a Clinical Environment Simulator (CES) that evaluates LLMs within a digital hospital where each decision changes subsequent patient states, mimicking the path dependency of real clinical care.
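To make the distinction concrete, here is a minimal, hypothetical harness in Python. The simulator interface (`PatientState`, `step`, `dynamic_eval`) is our own illustrative naming, not the paper's API; it shows how a wrong decision mutates the state the model sees next, so errors compound across a trajectory instead of being scored in isolation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class PatientState:
    """Toy patient whose condition drifts with each decision."""
    severity: float = 0.5
    history: list = field(default_factory=list)

def step(state: PatientState, action: str, correct_action: str) -> PatientState:
    # Path dependency: a wrong decision worsens the state the model
    # sees next, so later decisions start from a harder position.
    delta = -0.1 if action == correct_action else 0.2
    state.severity = min(1.0, max(0.0, state.severity + delta))
    state.history.append(action)
    return state

def dynamic_eval(model, episodes: int = 100, horizon: int = 5) -> float:
    """Score whole trajectories, not isolated questions."""
    good_outcomes = 0
    for seed in range(episodes):
        rng = random.Random(seed)
        state = PatientState(severity=rng.uniform(0.3, 0.7))
        for _ in range(horizon):
            correct = "treat" if state.severity > 0.6 else "observe"
            action = model(state)          # model sees the evolving state
            state = step(state, action, correct)
        good_outcomes += state.severity < 0.8
    return good_outcomes / episodes

# A static benchmark would ask each question from a fixed, independent
# snapshot, so errors could never accumulate across a case.
naive_model = lambda s: "treat" if s.severity > 0.5 else "observe"
print(f"trajectory success rate: {dynamic_eval(naive_model):.2f}")
```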

Teaching Points

  • Why USMLE-style benchmarks are insufficient for clinical AI
  • Dynamic vs static evaluation: sequential decisions accumulate errors
  • Foundation for FDA/deployment science and post-deployment monitoring

Track A: Build

Evaluation design, offline benchmark vs dynamic evaluation, task formulation

Track B: Judge

Clinical decision support, human-AI collaboration, safety evaluation

Track C: Deploy

Post-deployment monitoring frameworks, regulatory evaluation standards

Architecture & Applications

The role of agentic artificial intelligence in healthcare: a scoping review

Collaco BG, et al. · npj Digital Medicine · 2026 Mar 14

DOI: 10.1038/s41746-026-02517-5 · PMID: 41832341

Why This Paper Matters

As AI moves from chatbots to autonomous agents, healthcare needs a clear taxonomy. This scoping review maps the landscape of agentic AI, distinguishing copilots, tool-using agents, and multi-agent systems, while noting that the field is still early-stage.
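To ground the taxonomy, the sketch below shows the middle rung, a tool-using agent, as a single policy loop over a whitelisted tool registry. All names (`TOOLS`, `agent_step`, `stub_llm`) are illustrative assumptions, not the review's terminology; the comments note how the other rungs differ.

```python
from typing import Callable

# Whitelisted tools: the deployment boundary is the registry itself.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_guideline": lambda q: f"guideline excerpt for {q!r}",
    "order_lab":        lambda q: f"lab order drafted: {q}",
}

def agent_step(llm, user_msg: str, max_calls: int = 3) -> str:
    """One agent turn: the LLM may request tools before answering.

    Chatbot     = this loop with an empty registry (max_calls=0).
    Copilot     = a human approves each tool call before it runs.
    Multi-agent = several of these loops messaging each other.
    """
    transcript = [user_msg]
    for _ in range(max_calls):
        reply = llm("\n".join(transcript))
        if not reply.startswith("CALL "):        # plain answer: stop
            return reply
        name, _, arg = reply[5:].partition(" ")
        if name not in TOOLS:                    # governance: refuse
            transcript.append(f"ERROR: tool {name!r} not whitelisted")
            continue
        transcript.append(f"RESULT: {TOOLS[name](arg)}")
    return llm("\n".join(transcript) + "\nAnswer now without tools.")

# Stub LLM for demonstration: asks for a guideline once, then answers.
def stub_llm(prompt: str) -> str:
    return ("CALL lookup_guideline sepsis bundle"
            if "RESULT" not in prompt else "Per guideline: start bundle.")

print(agent_step(stub_llm, "Patient meets sepsis criteria; next step?"))
```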

Teaching Points

  • Taxonomy: chatbot vs copilot vs tool-using agent vs multi-agent system
  • Mapping exercise: which clinical tasks merit which automation level?
  • Risk framing: accountability, tool misuse, hallucination amplification

Track A: Build

From generative AI to agentic AI: planning, tool use, autonomy levels

Track B: Judge

Clinical orchestration, documentation, triage, workflow automation

Track C: Deploy

Agent governance, deployment boundary-setting, approval frameworks

Governance & Privacy

Cautious optimism on foundation models in medical imaging: balancing privacy and innovation

Santos R, et al. · npj Digital Medicine · 2026

DOI: 10.1038/s41746-026-02533-5 · PMID: 41833961

Why This Paper Matters

Foundation models in medical imaging can retain patient-identifiable signals; reported re-identification rates for retinal imaging reach 94%. This perspective argues for a dual-track defense: technical safeguards (DP-SGD, feature disentanglement) plus policy frameworks.
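For the technical track, here is a minimal numpy sketch of the DP-SGD recipe (per-example gradient clipping plus calibrated Gaussian noise) on a toy logistic regression. This illustrates the mechanism only; a real deployment would use a vetted library with a proper privacy accountant, and every name and constant below is an assumption of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                  # toy "imaging features"
y = (X[:, 0] + 0.1 * rng.normal(size=256) > 0).astype(float)
w = np.zeros(8)

CLIP, SIGMA, LR = 1.0, 1.2, 0.5                # clip norm, noise multiplier, step size

def per_example_grads(w, X, y):
    # Logistic-loss gradient for each example separately, shape (n, d).
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

for _ in range(200):
    g = per_example_grads(w, X, y)
    # 1) Clip each example's gradient to norm <= CLIP so no single
    #    patient record can dominate the update (bounds sensitivity).
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / CLIP)
    # 2) Sum, then add Gaussian noise calibrated to the clip norm.
    noisy = g.sum(axis=0) + rng.normal(scale=SIGMA * CLIP, size=w.shape)
    w -= LR * noisy / len(X)

acc = ((X @ w > 0) == y).mean()
print(f"train accuracy under DP-SGD: {acc:.2f}")
```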

Teaching Points

  • “Removing names” does not equal anonymization in imaging data
  • Privacy leakage mechanisms: demographic/identity signals in embeddings
  • Dual defense: PII scrubbing, DP-SGD, homomorphic encryption + policy

Track A: Build

Representation learning, privacy leakage, de-identification limits

Track B: Judge

Imaging AI governance, data stewardship, responsible deployment

Track C: Deploy

Institutional privacy policy, vendor due diligence, data agreements

Extended Reading

Benchmark & Evaluation

Holistic evaluation of large language models for medical tasks with MedHELM

Bedi S, et al. · Nature Medicine · 2026

Introduces MedHELM: 5 task categories, 22 subcategories, 121 tasks, 37 evaluations across 9 frontier LLMs. Key finding: no single score represents medical ability — task decomposition matters more than leaderboard rankings.
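The same lesson applies to evaluations you run in-house: aggregate within task categories, never across them. A minimal sketch with made-up category and task names (not MedHELM data):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores; keys are (category, task).
scores = {
    ("clinical_decision", "triage_qa"):      0.81,
    ("clinical_decision", "dosing"):         0.64,
    ("documentation",     "note_summary"):   0.90,
    ("documentation",     "coding"):         0.58,
    ("patient_comm",      "plain_language"): 0.77,
}

by_category = defaultdict(list)
for (category, _task), score in scores.items():
    by_category[category].append(score)

# Macro-average within each category; report these side by side.
for category, vals in sorted(by_category.items()):
    print(f"{category:18s} {mean(vals):.2f}  (n={len(vals)})")

# A single pooled leaderboard number would hide that 'documentation'
# mixes a strong task (0.90) with a weak one (0.58).
```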

Join the Discussion

Our weekly paper spotlight sessions are open to enrolled cohort members. Learn to read, critique, and apply the latest research.

Apply Now