Living Curriculum
Research & Paper Spotlights
Each week features curated papers from three categories. A curriculum that evolves with the field is our moat against static MOOCs.
Paper Selection Framework
We balance the three categories intentionally, so the curriculum is neither hijacked by LLM hype nor blind to implementation science and regulation.
Method Papers
New models, benchmarks, evaluation frameworks
Clinical Validation
Real-world deployment, prospective studies, workflow impact
Critical Analysis
Bias, data shift, hallucination, failure modes, governance
Five Questions for Every Paper
1. What is the research question?
2. Where does the data come from? Is there selection bias?
3. Are the model and comparator appropriate?
4. Do the metrics matter for clinical decision-making?
5. Can this enter teaching materials? Can this enter clinical practice?
“The most dangerous papers aren’t the obviously flawed ones — they’re the beautifully packaged half-finished products.”
Featured Papers — March 2026
A clinical environment simulator for dynamic AI evaluation
Luo L, et al. · Nature Medicine · 2026 Mar 12
DOI: 10.1038/s41591-026-04252-6 · PMID: 41820673
Why This Paper Matters
Medical AI cannot be judged by static benchmark scores alone. This paper proposes a Clinical Environment Simulator (CES) that evaluates LLMs within a digital hospital where each decision changes subsequent patient states — mimicking real clinical path-dependency.
Teaching Points
- Why USMLE-style benchmarks are insufficient for clinical AI
- Dynamic vs static evaluation: sequential decisions accumulate errors
- Foundation for FDA/deployment science and post-deployment monitoring
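The second teaching point can be made concrete with a back-of-the-envelope sketch (illustrative only, not the paper's method): if steps are treated as independent, a model that looks strong per decision can still fail most multi-step trajectories.

```python
# Why sequential decisions accumulate error: a 95%-accurate model looks
# strong on a static benchmark, but a 10-step path-dependent trajectory
# only succeeds if every decision along the way is correct.

def trajectory_success_rate(per_step_accuracy: float, n_steps: int) -> float:
    """Probability of an error-free trajectory, assuming independent steps."""
    return per_step_accuracy ** n_steps

static_view = trajectory_success_rate(0.95, 1)    # the benchmark view
dynamic_view = trajectory_success_rate(0.95, 10)  # the simulator view
print(f"static: {static_view:.2f}, 10-step trajectory: {dynamic_view:.2f}")
```

Real clinical errors are not independent (a wrong diagnosis biases every later order), so this simple model understates the problem; that is precisely what a dynamic simulator is built to expose.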
Track A: Build
Evaluation design, offline benchmark vs dynamic evaluation, task formulation
Track B: Judge
Clinical decision support, human-AI collaboration, safety evaluation
Track C: Deploy
Post-deployment monitoring frameworks, regulatory evaluation standards
The role of agentic artificial intelligence in healthcare: a scoping review
Collaco BG, et al. · npj Digital Medicine · 2026 Mar 14
DOI: 10.1038/s41746-026-02517-5 · PMID: 41832341
Why This Paper Matters
As AI moves from chatbots to autonomous agents, healthcare needs a clear taxonomy. This scoping review maps the landscape of agentic AI, distinguishing copilots, tool-using agents, and multi-agent systems, while noting that the field and its evidence base remain immature.
Teaching Points
- Taxonomy: chatbot vs copilot vs tool-using agent vs multi-agent system
- Mapping exercise: which clinical tasks merit which automation level?
- Risk framing: accountability, tool misuse, hallucination amplification
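The mapping exercise above can be sketched in code. The level names and task ceilings below are hypothetical illustrations, not the review's taxonomy: each clinical task declares the maximum autonomy it merits, and a simple governance check enforces that boundary.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    CHATBOT = 0      # answers questions, takes no actions
    COPILOT = 1      # drafts output for mandatory human sign-off
    TOOL_AGENT = 2   # executes tools under human supervision
    MULTI_AGENT = 3  # orchestrates other agents

# Hypothetical policy table: which tasks merit which maximum level.
TASK_CEILING = {
    "patient_faq": AutonomyLevel.CHATBOT,
    "discharge_summary_draft": AutonomyLevel.COPILOT,
    "order_entry": AutonomyLevel.TOOL_AGENT,
}

def allowed(task: str, proposed: AutonomyLevel) -> bool:
    """Reject deployment above the task's autonomy ceiling.

    Unknown tasks default to the most restrictive level."""
    return proposed <= TASK_CEILING.get(task, AutonomyLevel.CHATBOT)

print(allowed("order_entry", AutonomyLevel.MULTI_AGENT))  # prints False
```

Defaulting unknown tasks to the lowest level is the fail-safe choice; the accountability and misuse risks in the teaching points argue for exactly this kind of explicit boundary-setting.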
Track A: Build
From generative AI to agentic AI: planning, tool use, autonomy levels
Track B: Judge
Clinical orchestration, documentation, triage, workflow automation
Track C: Deploy
Agent governance, deployment boundary-setting, approval frameworks
Cautious optimism on foundation models in medical imaging: balancing privacy and innovation
Santos R, et al. · npj Digital Medicine · 2026
DOI: 10.1038/s41746-026-02533-5 · PMID: 41833961
Why This Paper Matters
Foundation models in medical imaging may retain patient-identifiable signals; reported retinal imaging re-identification rates reach 94%. This perspective argues for a dual-track defense: technical safeguards (DP-SGD, feature disentanglement) plus policy frameworks.
Teaching Points
- “Removing names” does not equal anonymization in imaging data
- Privacy leakage mechanisms: demographic/identity signals in embeddings
- Dual defense: PII scrubbing, DP-SGD, homomorphic encryption + policy
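To ground the DP-SGD teaching point, here is a minimal sketch of one training step (illustrative only; production systems use vetted libraries such as Opacus, and all parameter values here are assumptions). The two privacy mechanisms are clipping each per-example gradient to a fixed norm, then adding Gaussian noise scaled to that norm before averaging.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One differentially private gradient step: clip, add noise, average."""
    rng = rng or np.random.default_rng(0)
    # 1. Clip each per-example gradient to L2 norm <= clip_norm.
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    # 2. Sum, add Gaussian noise calibrated to the clipping bound.
    summed = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    # 3. Average over the batch.
    return (summed + noise) / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.2])]  # first has norm 5 -> clipped
update = dp_sgd_step(grads)
print(update)
```

Clipping bounds any single patient's influence on the update; the noise then makes that bounded influence statistically deniable, which is what limits the embedding-level identity leakage described above.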
Track A: Build
Representation learning, privacy leakage, de-identification limits
Track B: Judge
Imaging AI governance, data stewardship, responsible deployment
Track C: Deploy
Institutional privacy policy, vendor due diligence, data agreements
Extended Reading
Benchmark & Evaluation
Holistic evaluation of large language models for medical tasks with MedHELM
Bedi S, et al. · Nature Medicine · 2026
Introduces MedHELM: 5 task categories, 22 subcategories, 121 tasks, 37 evaluations across 9 frontier LLMs. Key finding: no single score represents medical ability — task decomposition matters more than leaderboard rankings.
Join the Discussion
Our weekly paper spotlight sessions are open to enrolled cohort members. Learn to read, critique, and apply the latest research.
Apply Now