Evidence-Based Medicine

Asking clinical questions, finding evidence, critically appraising studies, applying evidence to patients, hierarchy of evidence, GRADE methodology, clinical practice guidelines, shared decision-making, and every concept and tool needed to practice evidence-based medicine.

01 Overview & History of EBM

Evidence-based medicine (EBM) is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. Rather than relying solely on pathophysiologic reasoning, clinical anecdote, or authority, EBM demands that treatment decisions be informed by the highest quality research evidence available, integrated with clinical expertise and patient values. EBM is not a rigid cookbook; it is a disciplined approach to lifelong learning and decision-making that has reshaped every field of medicine over the past three decades.

Why This Matters

Every clinical decision — which test to order, which drug to start, how to counsel a patient — reflects an implicit estimate of benefit, harm, and probability. EBM forces these estimates to be explicit, transparent, and grounded in the best available evidence. Clinicians who practice EBM make fewer errors, avoid outdated interventions, and more effectively translate trial results into personalized patient care.

Historical Development

The conceptual roots of EBM stretch back centuries — Pierre Louis used the numerical method in 1830s Paris to show that bloodletting did not improve pneumonia survival, one of the first statistical critiques of accepted therapy. Archie Cochrane, a British epidemiologist, published Effectiveness and Efficiency in 1972, arguing that medical practice should be based on rigorous randomized trials and systematic reviews; his name now marks the Cochrane Collaboration, founded in 1993 to produce systematic reviews of healthcare interventions.

The term “evidence-based medicine” was coined by Gordon Guyatt at McMaster University in 1991, building on David Sackett’s clinical epidemiology tradition. The landmark 1992 JAMA paper by the Evidence-Based Medicine Working Group formally introduced EBM as a “new paradigm for medical practice.” Since then EBM has expanded into evidence-based nursing, evidence-based public health, and evidence-based healthcare more broadly.

Key Figures in EBM

Figure | Contribution
Pierre Louis (1830s) | Numerical method; disproved bloodletting for pneumonia
Archie Cochrane (1972) | Advocated RCTs and systematic reviews as basis for clinical practice
David Sackett | Defined EBM; founded the first department of clinical epidemiology at McMaster
Gordon Guyatt (1991) | Coined the term “evidence-based medicine”; lead GRADE developer
Iain Chalmers | Co-founder of the Cochrane Collaboration
Doug Altman | Advanced reporting standards (CONSORT, STROBE, PRISMA)

Archie Cochrane famously awarded a “wooden spoon” to obstetrics in 1979 as the specialty that had done least to adopt randomized trials. This stimulated the Oxford Database of Perinatal Trials, which became a founding project of the Cochrane Collaboration.

Why EBM Matters in Modern Practice

The medical literature doubles roughly every 10–15 years; more than 1 million biomedical articles are indexed annually in PubMed. No clinician can read, let alone synthesize, this volume unaided. Without structured methods for finding, appraising, and applying evidence, practice drifts toward habit, authority, and marketing influence. Studies repeatedly show that routine care lags best evidence by 10–20 years, with many widely used interventions later found ineffective or harmful (e.g., hormone replacement for cardiovascular prevention, routine arthroscopy for knee osteoarthritis, tight glycemic control in ICU patients).

The half-life of medical knowledge has been estimated at about 5–7 years — meaning half of what you learn in medical school will be outdated, revised, or overturned within a decade. EBM skills, not facts, are the durable investment of a medical career.

02 Defining EBM & the Three Pillars

The Sackett Definition

David Sackett’s 1996 BMJ editorial provides the canonical definition: “Evidence-based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence-based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research.” Sackett emphasized that EBM is neither “cookbook medicine” nor a tool for cost-cutters — it is a method for integrating evidence with clinical judgment.

The Three Pillars of EBM

EBM rests on the integration of three components, sometimes visualized as overlapping circles: (1) Best available research evidence, (2) Clinical expertise (the physician’s accumulated skills, pattern recognition, and ability to diagnose and manage individual patients), and (3) Patient values and preferences (the unique concerns, expectations, and circumstances each patient brings to the encounter). Evidence alone is not enough; a decision must fit the patient in front of you.

The Three Pillars in Detail

Pillar | What It Is | Example
Best evidence | Results of valid, relevant clinical research, ideally systematic reviews of RCTs | A meta-analysis showing that statins reduce MI by 25% in high-risk primary prevention
Clinical expertise | Individual clinician’s skills and judgment gained through training and experience | Recognizing a patient with atypical chest pain who truly has ACS despite a normal initial ECG
Patient values | The patient’s unique preferences, concerns, expectations, and circumstances | A frail 85-year-old who declines a colonoscopy, valuing quality of life over screening benefit

What EBM Is — and Is Not

EBM Is | EBM Is Not
An explicit method for using evidence in decisions | Cookbook medicine or rigid protocols
Integration of evidence, expertise, and values | Evidence replacing clinical judgment
Acknowledgment of uncertainty | A guarantee of the right answer
A lifelong learning framework | A one-time memorized fact set
Useful at the bedside in real time | Only for researchers or academics
A tool for shared decision-making | A cost-containment ploy

The most common misunderstanding of EBM is that it reduces medicine to applying the latest RCT. Sackett himself warned: “Without clinical expertise, practice risks becoming tyrannized by evidence. Without current best evidence, practice risks becoming rapidly out of date.” Both pillars are essential.

03 EBM vs Traditional Medicine

Traditional (pre-EBM) medical practice was dominated by eminence-based decisions: senior authorities in a field determined standards of care, often based on pathophysiologic reasoning or uncontrolled clinical experience. Although experience and mechanistic reasoning have real value, this approach produced and perpetuated many practices later shown to be ineffective or harmful.

Historical Examples of Evidence Overturning Authority

Intervention | Initial Rationale | What Evidence Showed
Lidocaine prophylaxis post-MI | Suppress ventricular ectopy, prevent sudden death | RCTs showed increased mortality; abandoned
Class I antiarrhythmics (CAST trial, 1989) | Suppress PVCs after MI | >2-fold increase in mortality; established the surrogate-endpoint fallacy
Hormone replacement therapy (WHI, 2002) | Prevent CAD in postmenopausal women | Increased CV events, stroke, breast cancer
Routine arthroscopy for knee OA | Mechanical debridement helps | Sham-controlled RCTs showed no benefit
Bed rest for low back pain | Rest promotes healing | Worsens outcomes; early mobilization is superior
Tight glycemic control in ICU | Hyperglycemia is harmful | NICE-SUGAR: tight control increased mortality
High-dose chemo + BMT for breast cancer | Dose-response rationale | No survival benefit; substantial harm

Pre-EBM vs EBM Decision-Making

Feature | Traditional / Eminence-Based | Evidence-Based
Source of knowledge | Authority, pathophysiology, anecdote | Systematically appraised research
Handling of uncertainty | Hidden; authority tends to sound certain | Explicit; magnitudes and confidence intervals reported
Role of patient values | Often implicit or ignored | Explicit part of decision-making
Response to new evidence | Slow; driven by opinion leaders | Systematic updating via living reviews
Metric of quality | Experience and reputation | Outcomes and measurable fidelity to best practice

The CAST trial (1989) is the defining moment in the history of EBM. Suppressing PVCs after MI seemed obviously beneficial from a pathophysiologic standpoint — yet encainide, flecainide, and moricizine increased mortality. The lesson: plausible mechanisms and surrogate outcomes are not enough. Hard outcomes and randomized evidence are required.

Why Pathophysiologic Reasoning Alone Fails

Pathophysiologic reasoning is seductive because it is mechanistic and satisfying — if a drug lowers LDL, and high LDL causes atherosclerosis, then lowering LDL should prevent heart attacks. This chain of reasoning is correct for statins but wrong for torcetrapib, ezetimibe (initially), and many niacin trials, where LDL or HDL changes did not translate to outcome benefits. The body is a complex system with countless feedback loops, off-target effects, and unanticipated interactions. Only empirical testing in humans with patient-important outcomes can confirm whether a biologically plausible mechanism produces a real clinical benefit.

The Three Questions EBM Forces You to Ask

(1) What is the evidence that this intervention helps — and how certain is it? (2) How large is the benefit relative to the harm, cost, and burden? (3) Does this evidence apply to the patient in front of me? If you cannot answer all three, you are not yet practicing EBM.

04 Step 1 — Ask (PICO/PICOT)

The 5-step EBM cycle begins with transforming a clinical information need into an answerable question. A well-constructed question is the foundation of an efficient literature search and a relevant appraisal. The most widely used framework is PICO (sometimes extended to PICOT or PICOS).

The PICO(T) Framework

Letter | Component | Example
P | Patient / Population / Problem | Adults with type 2 diabetes and CKD stage 3
I | Intervention / Exposure / Test | SGLT2 inhibitor (empagliflozin)
C | Comparison | Placebo or standard care
O | Outcome | Progression to ESRD, all-cause mortality
T | Time horizon (optional) | Over 3 years of follow-up
S | Study type (optional) | Randomized controlled trial or meta-analysis

Worked Example

Unstructured: “Does aspirin help older patients?” → unanswerable.
PICOT: In adults ≥70 without prior CVD (P), does daily low-dose aspirin (I) compared with placebo (C) reduce cardiovascular events, and at what cost in major bleeding (O), over 5 years (T)? → answerable by the ASPREE trial.

Foreground vs Background Questions

Clinicians ask two fundamentally different types of questions:

  • Background questions seek general knowledge about a condition or intervention (“What causes atrial fibrillation?”). These are best answered by textbooks, review articles, or point-of-care resources (UpToDate, DynaMed).
  • Foreground questions seek specific evidence to guide a clinical decision (“In patients with new-onset AF, does rhythm control with catheter ablation improve mortality compared to rate control?”). These demand systematic literature searches and critical appraisal.
Trainees ask mostly background questions; experienced clinicians ask mostly foreground questions. As knowledge grows, the proportion of foreground questions rises — and so does the need for efficient EBM skills to answer them at the bedside.

05 Types of Clinical Questions

Different types of clinical questions demand different study designs for their answer. Recognizing the question type determines where to look and how to appraise what you find.

The Five Major Question Types

Question Type | What It Asks | Best Study Design | Example
Therapy | Does treatment X help? | RCT / SR of RCTs | Does finerenone reduce CV events in diabetic kidney disease?
Diagnosis | How accurate is this test? | Cross-sectional study comparing test to gold standard | What is the sensitivity of high-sensitivity troponin for MI at 1 hour?
Prognosis | What is the likely course? | Prospective cohort | What is the 10-year mortality after first stroke?
Harm / Etiology | Does this exposure cause harm? | Cohort or case-control | Do fluoroquinolones cause aortic dissection?
Prevention | Can we prevent the outcome? | RCT or cohort | Does HPV vaccination prevent cervical cancer?

Less Common but Important Question Types

Type | Best Design
Screening | RCT with mortality as outcome (ideal)
Cost-effectiveness | Economic analysis / decision model
Quality improvement | Interrupted time series, stepped-wedge trial
Qualitative (patient experience) | Qualitative research (interviews, ethnography)
Clinical prediction rule | Derivation and validation cohort studies

Matching the question to the correct design is essential. You cannot answer a therapy question from a case series, and an RCT is rarely the right tool for a rare-adverse-event harm question (case-control is better for rare outcomes, cohorts for rare exposures).

06 Steps 2–5 — Acquire, Appraise, Apply, Assess

Once the question is framed, the remaining steps of the EBM cycle turn evidence into action and feed back into improved practice.

The Full 5-Step EBM Cycle

Step | Activity | Key Tools
1. Ask | Convert information need into an answerable question | PICO/PICOT framework
2. Acquire | Efficiently track down the best evidence | PubMed, Cochrane, UpToDate, DynaMed
3. Appraise | Critically evaluate the evidence for validity, impact, and applicability | JAMA Users’ Guides, CASP checklists, GRADE
4. Apply | Integrate evidence with clinical expertise and patient values | Decision aids, shared decision-making
5. Assess | Evaluate performance and seek ways to improve | Audit, reflection, QI metrics

The 5 A’s

A common mnemonic: Ask → Acquire → Appraise → Apply → Assess. Each step can fail independently: a poorly framed question, a bad search, a superficial appraisal, failure to apply evidence to a specific patient, or absence of feedback all undermine EBM practice.

Time Constraints at the Point of Care

A complete 5-step cycle — formulating a question, searching, appraising, and applying — takes hours and is impractical during a busy clinic. For real-time decisions, clinicians rely on pre-appraised resources (Cochrane reviews, UpToDate, DynaMed, BMJ Best Practice), which summarize evidence in structured, regularly updated formats. The full 5-step process is reserved for challenging questions, teaching, or when pre-appraised summaries are unavailable.

Step 5 (Assess) is the most neglected. Without feedback — audit, morbidity and mortality review, personal reflection — clinicians cannot know whether their EBM practice is actually improving patient outcomes.

07 Databases & Point-of-Care Resources

Knowing which database to search is as important as knowing how to search it. Resources differ in scope, filtering, and the degree to which evidence has already been appraised.

Major Biomedical Databases

Database | Scope | Notes
PubMed | Free interface to MEDLINE + PubMed Central + NCBI books | ~35 million citations; default starting point
MEDLINE | NLM’s core biomedical database (5,200+ journals) | Indexed with MeSH headings; subset of PubMed
Embase | Elsevier; heavier European and pharmacology coverage | Better for adverse drug reactions, device studies; indexed with EMTREE
Cochrane Library | Systematic reviews (CDSR) + trials registry (CENTRAL) | Gold standard for SRs; critical for therapy questions
CINAHL | Nursing and allied health | Useful for patient experience, nursing interventions
PsycINFO | Psychology / behavioral health | Essential for mental health topics
Web of Science / Scopus | Multidisciplinary citation indices | Citation tracking (“cited by”)

Pre-Appraised & Point-of-Care Resources

Resource | What It Provides | Strengths / Weaknesses
UpToDate | Expert-authored topic reviews, graded recommendations | Comprehensive, frequently updated; expert summaries may lag or reflect opinion
DynaMed | Evidence-graded topic summaries | More explicit evidence grades; some users find it less readable
BMJ Best Practice | Structured diagnosis-to-treatment summaries | Strong integration of guidelines
Cochrane Clinical Answers | Short summaries of Cochrane SRs | Quick point-of-care use
NICE / USPSTF / guidelines.gov archives | National clinical guidelines | Varying methodologic rigor
ACP Journal Club / EvidenceAlerts | Pre-appraised alerts of new high-quality studies | Keeps clinicians current

Gray Literature

“Gray literature” includes conference abstracts, theses, regulatory documents (FDA, EMA), unpublished trials, preprints (medRxiv, bioRxiv), and clinical trial registries (ClinicalTrials.gov, WHO ICTRP, EU Clinical Trials Register). It is essential for comprehensive systematic reviews because positive results are preferentially published — ignoring gray literature can bias conclusions.

ClinicalTrials.gov is now the canonical source for trial protocols and registered outcomes. Comparing a published paper’s primary outcome with the pre-registered outcome is one of the most efficient ways to detect outcome switching and selective reporting.

08 Search Strategy, Boolean Logic & MeSH

Efficient searching combines controlled vocabulary (MeSH), free-text terms, Boolean operators, and filters. A good strategy balances sensitivity (capturing all relevant studies) with specificity (excluding irrelevant ones).

Boolean Operators

Operator | Effect | Example
AND | Both terms required → narrows results | diabetes AND metformin
OR | Either term acceptable → broadens results | “myocardial infarction” OR “heart attack”
NOT | Excludes a term → narrows but may lose relevant studies | hypertension NOT pregnancy

MeSH (Medical Subject Headings)

MeSH is the National Library of Medicine’s controlled vocabulary. Each MEDLINE-indexed article is tagged with MeSH terms by human indexers, allowing retrieval of articles on a concept regardless of exact wording. Searching “myocardial infarction”[MeSH] captures papers that say “heart attack,” “MI,” or “STEMI” without having to list every synonym.

Best Practice Search

Combine MeSH and free-text: (“Myocardial Infarction”[MeSH] OR “heart attack”[tiab]) AND (“Aspirin”[MeSH] OR aspirin[tiab]). MeSH catches indexed articles; free-text catches recent articles not yet indexed and those missed by indexers.

Useful PubMed Filters

Filter | Use
Article type: Randomized Controlled Trial | Therapy questions
Article type: Systematic Review / Meta-Analysis | Synthesized evidence
Article type: Practice Guideline | Current recommendations
Species: Humans | Exclude animal studies
Age group, sex | Population-specific questions
Publication date | Recent evidence
Clinical Queries filter | Pre-built methodologic filters by question type

Four-Step Search Strategy

  1. Identify concepts using PICO.
  2. Find MeSH terms and synonyms for each concept.
  3. Combine synonyms with OR, then concepts with AND.
  4. Apply limits (study type, date, language) appropriate to the question.
If you get 10,000 hits, add more AND terms or filters. If you get zero hits, drop a concept or add synonyms with OR. A useful rule: the number of results should be small enough to scan titles in a few minutes but large enough that you probably haven’t missed important studies.
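
Steps 2–3 of this strategy are mechanical enough to script. Below is a minimal Python sketch of the “OR within a concept, AND across concepts” pattern; the build_query helper and the synonym lists are illustrative assumptions, not a validated search filter.

```python
def build_query(concepts):
    """OR synonyms together within each concept, then AND across concepts."""
    groups = ["(" + " OR ".join(synonyms) + ")" for synonyms in concepts]
    return " AND ".join(groups)

# Illustrative PICO concepts for the aspirin-in-MI example above
mi_terms = ['"Myocardial Infarction"[MeSH]', '"heart attack"[tiab]', 'STEMI[tiab]']
aspirin_terms = ['"Aspirin"[MeSH]', 'aspirin[tiab]']

print(build_query([mi_terms, aspirin_terms]))
# ("Myocardial Infarction"[MeSH] OR "heart attack"[tiab] OR STEMI[tiab])
#   AND ("Aspirin"[MeSH] OR aspirin[tiab])
```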

09 The Evidence Pyramid

Not all evidence is created equal. The evidence pyramid visualizes a rough hierarchy of study designs by their ability to reduce bias in estimating treatment effects. Higher-tier designs are more likely (on average) to provide valid estimates, but a well-done cohort can outperform a flawed RCT.

Classic Evidence Pyramid (Unfiltered Information)

Level | Study Type | Key Feature
Top | Systematic reviews / meta-analyses of RCTs | Synthesize all available trials
2 | Randomized controlled trials (RCTs) | Randomization balances confounders
3 | Cohort studies (prospective > retrospective) | Follow exposed vs unexposed forward in time
4 | Case-control studies | Compare cases with outcome to controls without
5 | Cross-sectional studies | Snapshot of exposure and outcome
6 | Case series / case reports | Descriptive; generate hypotheses
Bottom | Expert opinion / bench research | Mechanism, pathophysiology, speculation

Haynes “5S” / “6S” Pyramid (Pre-Appraised Information)

Brian Haynes proposed an alternative hierarchy focused on how the evidence reaches the clinician, prioritizing pre-appraised sources that save time at the bedside:

Level | Type | Example
Systems | Computerized decision support linked to the EHR | Alerts, order sets with embedded evidence
Summaries | Evidence-based topic reviews | UpToDate, DynaMed, BMJ Best Practice
Synopses of Syntheses | Structured abstracts of SRs | Cochrane Clinical Answers, DARE
Syntheses | Systematic reviews and meta-analyses | Cochrane Reviews
Synopses of Studies | Structured abstracts of individual studies | ACP Journal Club
Studies | Individual primary studies | Original RCTs and cohorts

For a busy clinician at the point of care, the 6S pyramid is more useful than the classic pyramid: start at the top (Systems/Summaries), and only descend if higher levels do not address your question. This reverses the instinct to always start with PubMed.

Systematic Review vs Meta-Analysis — The Critical Distinction

These terms are often confused:

  • A systematic review (SR) is a structured, reproducible summary of all the evidence relevant to a focused question, using explicit methods to search, select, and appraise studies.
  • A meta-analysis (MA) is a statistical technique that quantitatively pools results from multiple studies into a single summary estimate.
  • All meta-analyses should be embedded in a systematic review. Not all systematic reviews include a meta-analysis — when studies are too heterogeneous to pool, a qualitative (narrative) synthesis is appropriate.

10 Study Designs Overview

The choice of study design is driven by the question and by ethical and practical constraints. This section summarizes the major designs and when to use each.

Experimental Designs

Design | Key Feature | Strength | Weakness
Parallel-group RCT | Random assignment to arms | Minimizes confounding; gold standard for therapy | Cost; limited generalizability
Crossover RCT | Each participant receives both interventions | Smaller sample; within-subject comparison | Carryover effects; only for stable chronic conditions
Cluster RCT | Randomize groups (e.g., clinics) | Used for systems/population interventions | Reduced power; design effect from clustering
Factorial RCT | Tests 2+ interventions simultaneously | Efficient for independent interventions | Interaction effects complicate analysis
Non-inferiority RCT | New treatment is not worse than standard by a pre-set margin | Useful when equipoise exists | Margin choice is critical and often controversial
Adaptive trial | Pre-specified modifications based on interim results | Efficient; platform trials (RECOVERY) | Complex statistical adjustments
N-of-1 trial | Single patient, multiple crossover periods | Individualized evidence | Not generalizable

Observational Designs

Design | Direction | Best For | Caveats
Prospective cohort | Exposure → outcome (forward) | Prognosis; rare exposures; multiple outcomes | Long follow-up; loss to follow-up
Retrospective cohort | Exposure → outcome (historical) | Faster; occupational harms | Records quality; missing data
Case-control | Outcome → exposure (backward) | Rare outcomes; etiology | Recall bias; cannot calculate incidence
Cross-sectional | Both at one time point | Prevalence; diagnostic accuracy | Cannot establish temporality
Ecological | Population-level | Hypothesis generation | Ecological fallacy
Case series / report | Descriptive | Rare diseases, novel presentations | No comparison; no inference

Case-control is efficient for rare outcomes (e.g., agranulocytosis from clozapine); cohort is efficient for rare exposures (e.g., an unusual occupational toxin). Choose the design by thinking: which is rarer — the exposure or the outcome?

Quasi-Experimental Designs

When randomization is impossible (policy changes, system interventions), quasi-experimental designs provide stronger inference than simple before-after comparisons:

Design | Description | Use
Interrupted time series | Multiple measurements before and after an intervention | Policy changes, hand-hygiene campaigns
Difference-in-differences | Compares change in treated vs untreated groups over time | Health policy; Medicaid expansion studies
Regression discontinuity | Exploits a threshold-based intervention | Eligibility thresholds (e.g., age 65 for Medicare)
Instrumental variable analysis | Uses a variable correlated with exposure but not outcome except through exposure | Mendelian randomization; geographic variation
Stepped-wedge cluster trial | All clusters eventually get the intervention, at staggered start times | Implementation trials

Key Concepts in Observational Analysis

Concept | Meaning
Confounding | A third variable associated with both exposure and outcome distorts the apparent relationship
Confounding by indication | The reason a drug is prescribed is itself associated with the outcome
Propensity score matching | Balances observational groups on measured covariates to mimic randomization
Effect modification | The effect differs across subgroups; a real phenomenon, not a bias
Healthy-user bias | Users of an intervention tend to be healthier overall, inflating apparent benefit
Immortal time bias | A period during which the outcome cannot occur is misclassified as time on treatment

11 Validity of RCTs

Critical appraisal of a therapy study asks three core questions: (1) Are the results valid? (2) What are the results? (3) Will they help me care for my patient? This section focuses on validity — whether the study design protects against bias.

Key Validity Criteria for RCTs

Criterion | Why It Matters
Randomization | Balances known and unknown confounders between arms
Allocation concealment | Prevents the enroller from manipulating who gets which arm (e.g., sealed opaque envelopes, central randomization)
Blinding (masking) | Minimizes performance bias (caregivers) and ascertainment bias (outcome assessors)
Groups similar at baseline | Confirms that randomization succeeded (Table 1 of the paper)
Co-interventions equal | Prevents confounding by differential treatment outside the protocol
Complete follow-up | <5% loss is reassuring; >20% threatens validity (the “5 and 20 rule”)
Intention-to-treat (ITT) analysis | Analyzes patients in their originally assigned groups regardless of adherence; preserves randomization

Randomization vs Allocation Concealment

These are distinct concepts: randomization is the process of generating a random sequence; allocation concealment is hiding that sequence from the person enrolling participants. Studies with inadequate allocation concealment overestimate treatment effects by ~30–40%.

Levels of Blinding

Type | Who Is Blinded | Protects Against
Single-blind | Usually the participant | Placebo/nocebo effects
Double-blind | Participant + caregiver | Performance bias
Triple-blind | Participant + caregiver + outcome assessor | Ascertainment bias
Quadruple-blind | + data analyst | Analytic bias

Intention-to-Treat vs Per-Protocol

Analysis | Definition | Use
ITT | Analyze everyone in the group to which they were randomized | Preferred for superiority trials; conservative estimate of effect
Per-protocol | Analyze only those who completed the protocol as assigned | Sensitivity analysis; preferred for non-inferiority (more conservative in that context)
As-treated | Analyze by the treatment actually received | Rarely preferred; breaks randomization

Key Biases in Therapy Studies

Selection bias (non-random assignment) → prevented by randomization + concealment.
Performance bias (differential care) → prevented by blinding.
Attrition bias (differential dropouts) → minimized by complete follow-up and ITT.
Detection bias (differential outcome ascertainment) → prevented by blinded assessors.
Reporting bias (selective outcome reporting) → detected via pre-registered protocols.

The most common serious flaw in modern trials is not randomization itself but inadequate allocation concealment and incomplete blinding of outcome assessors. Subjective outcomes (pain, quality of life) are especially vulnerable to unblinded assessment.

Methods of Randomization

Method | Description | Strengths / Weaknesses
Simple randomization | Each participant randomized independently | Easy; can give unequal group sizes in small trials
Block randomization | Randomization within fixed-size blocks | Guarantees balance; block size must be hidden
Stratified randomization | Separate sequences for key prognostic strata | Balances important confounders
Minimization | Dynamic allocation to balance covariates | Good for small trials; semi-random
Cluster randomization | Groups (clinics, wards) randomized rather than individuals | Avoids contamination; needs design-effect adjustment
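
To make permuted blocks concrete, here is a minimal Python sketch (illustrative only; real trials generate the sequence centrally, conceal it from enrollers, and vary the block size so it cannot be guessed):

```python
import random

def block_randomization(n_participants, block_size=4, arms=("A", "B")):
    """Permuted-block allocation: each block contains every arm equally often,
    so group sizes never diverge by more than half a block."""
    assert block_size % len(arms) == 0, "block size must be a multiple of the number of arms"
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * (block_size // len(arms))
        random.shuffle(block)              # permute within the block
        sequence.extend(block)
    return sequence[:n_participants]

random.seed(7)                             # reproducible illustration
print(block_randomization(8))              # e.g. ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A']
```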

When an RCT Is Not Possible

Randomized trials are sometimes impossible or unethical (rare diseases, emergency conditions, long-term exposures, harms). In such cases, observational evidence must carry the inference. Key strategies to strengthen observational inference include large sample sizes, propensity score methods, instrumental variable analysis, negative controls, target trial emulation, and triangulation across independent designs.

12 Results — Effect Size, RR, ARR, NNT, CI

Once validity is established, the next appraisal question is: how large is the effect, and how precise is the estimate? Effect size measures quantify the impact; confidence intervals quantify precision.

Core Effect Measures (Dichotomous Outcomes)

Measure | Formula | Interpretation
Control event rate (CER) | Events in control / control n | Baseline risk
Experimental event rate (EER) | Events in treatment / treatment n | Risk on treatment
Absolute risk reduction (ARR) | CER − EER | Real-world benefit; drives NNT
Relative risk (RR) | EER / CER | Ratio of risks; <1 = protective
Relative risk reduction (RRR) | (CER − EER) / CER = 1 − RR | Proportion of baseline risk removed
Odds ratio (OR) | Odds in treatment / odds in control | Approximates RR when the outcome is rare
Number needed to treat (NNT) | 1 / ARR | Patients to treat to prevent one event
Number needed to harm (NNH) | 1 / ARI | Patients treated to cause one additional harm (ARI = absolute risk increase)
Hazard ratio (HR) | From Cox regression | Relative hazard over time

Worked NNT Example

A statin reduces 5-year MI rate from 10% (CER) to 7% (EER). ARR = 0.03; RRR = 0.30 (30%); NNT = 1/0.03 ≈ 33. You must treat about 33 patients for 5 years to prevent one MI. The same RRR of 30% applied to a baseline risk of 1% gives an NNT of 333 — identical relative effect, very different clinical meaning.
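
These formulas are simple enough to script; the sketch below just reproduces the worked example (the helper name is ours):

```python
def effect_measures(cer, eer):
    """Standard effect measures for a dichotomous outcome."""
    arr = cer - eer              # absolute risk reduction
    return {
        "ARR": arr,
        "RR": eer / cer,         # relative risk
        "RRR": arr / cer,        # relative risk reduction = 1 - RR
        "NNT": 1 / arr,          # number needed to treat
    }

print(effect_measures(0.10, 0.07))    # ARR 0.03, RR 0.70, RRR 0.30, NNT ~33
print(effect_measures(0.01, 0.007))   # same RRR of 30%, but NNT ~333
```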

Why RRR Alone Is Misleading

Press releases love relative risk reductions because they sound larger (“30% reduction”) than absolute numbers (“3 fewer events per 100 patients”). Always translate relative figures into absolute ones. A large RRR applied to a very low baseline risk yields a small ARR and a large NNT, which may not justify the cost or side effects.

Confidence Intervals & Statistical Significance

A 95% confidence interval (CI) is the range of values that would contain the true effect in 95% of hypothetical repetitions of the study; informally, it is the range within which the true effect plausibly lies. If the CI for an RR or OR crosses 1.0 (or the CI for a mean difference crosses 0), the result is not statistically significant at α = 0.05. The width of the CI reflects precision: wider = less precise, narrower = more precise.

Concept | Rule
CI for RR/OR includes 1.0 | Not statistically significant
CI for mean difference includes 0 | Not statistically significant
Narrow CI | Precise estimate (often large sample)
Wide CI | Imprecise (small sample or rare outcome)
p-value < 0.05 | Conventionally “significant” but must be interpreted in context
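
A CI can be computed directly from trial counts. This sketch uses the large-sample Wald approximation for a risk difference; the trial numbers are hypothetical.

```python
from math import sqrt

def arr_with_ci(events_c, n_c, events_t, n_t, z=1.96):
    """ARR with a 95% CI via the normal (Wald) approximation for
    a difference of two proportions; crude for small samples."""
    cer, eer = events_c / n_c, events_t / n_t
    arr = cer - eer
    se = sqrt(cer * (1 - cer) / n_c + eer * (1 - eer) / n_t)
    return arr, (arr - z * se, arr + z * se)

arr, (lo, hi) = arr_with_ci(100, 1000, 70, 1000)
print(f"ARR = {arr:.3f}, 95% CI ({lo:.3f} to {hi:.3f})")
# ARR = 0.030, 95% CI (0.006 to 0.054): excludes 0, so statistically significant
```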

Statistical vs Clinical Significance

A statistically significant result may be clinically trivial (e.g., a 2 mmHg blood pressure difference from a huge trial), and a non-significant result may hide a clinically important effect obscured by low power. Always ask: is this difference big enough to matter to a patient?

The minimum clinically important difference (MCID) is the smallest change a patient perceives as meaningful. Many trials report “statistically significant” effects well below the MCID — these are statistically real but clinically irrelevant.

Types of Outcomes in Trials

Outcome Type | Example | Consideration
Patient-important (hard) | Death, MI, stroke, hospitalization | Primary evidence of benefit
Patient-reported | Pain, quality of life, function | Requires validated instruments; blinding critical
Surrogate | LDL, HbA1c, CD4 count, BP | Must be validated against hard outcomes
Composite | MACE (death + MI + stroke) | Valid if components are comparably important and similarly affected
Time-to-event | Time to relapse, progression-free survival | Analyzed with Kaplan-Meier and Cox regression

Power & Sample Size

Statistical power (1 − β) is the probability of detecting a true effect of a specified size. Convention requires ≥80% power. Sample size depends on expected effect size, baseline event rate, variability, α (usually 0.05), and desired power. Underpowered trials produce wide confidence intervals and risk missing real effects (type II error); they also produce inflated effect estimates when positive (“winner’s curse”).
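
The usual two-proportion calculation can be sketched in a few lines (a normal-approximation estimate; real trials use dedicated software and inflate the result for expected dropouts):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate participants per arm to detect p1 vs p2
    with a two-sided alpha and the given power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
           z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_arm(0.10, 0.07))   # ~1,356 per arm to detect a 10% -> 7% event rate
```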

A “negative” trial that is underpowered does not show the absence of an effect — it shows that the study could not rule one out. Always interpret a non-significant result by looking at the confidence interval: does it exclude a clinically important effect, or is it wide enough to include one?

13 Applicability & CONSORT

A valid, precise result is useless if it does not apply to your patient. Applicability (external validity, generalizability) asks whether you can extrapolate the study findings to the person in front of you.

Applicability Questions

  • Is my patient similar enough to the trial population (age, sex, comorbidities, severity, ethnicity)?
  • Was the treatment protocol feasible in my setting?
  • Were all clinically important outcomes considered, including harms and quality of life?
  • Do the benefits outweigh the harms, costs, and burdens for this patient?
  • Are the patient’s values and preferences consistent with the intervention?
The Efficacy-Effectiveness Gap

Efficacy is the effect of an intervention under ideal, controlled conditions (explanatory trials). Effectiveness is the effect in routine practice (pragmatic trials). Most published RCTs measure efficacy, yet patients live in the effectiveness world — with adherence problems, comorbidities, and resource constraints. Pragmatic trials narrow this gap.

CONSORT — Reporting RCTs

The CONSORT (Consolidated Standards of Reporting Trials) statement is a 25-item checklist and flow diagram for transparent reporting of parallel-group RCTs. Journals increasingly require CONSORT compliance. Key elements include:

CONSORT Item | What to Report
Trial design | Type (parallel, crossover), allocation ratio
Randomization | Sequence generation, concealment mechanism, implementation
Blinding | Who was blinded and how
Participant flow | Enrollment, allocation, follow-up, analysis (CONSORT flow diagram)
Baseline data | Demographics and clinical characteristics by group
Outcomes and estimation | Primary and secondary outcomes with effect size and precision
Harms | All-cause and specific adverse events by group
Registration & protocol | Trial registration number and protocol access

When appraising an RCT, the CONSORT flow diagram is the most efficient single view of trial integrity: it shows how many were screened, randomized, followed up, and analyzed. Large losses between randomization and analysis are a red flag.

14 Diagnostic Study Validity & Biases

Diagnostic studies evaluate how well a test distinguishes patients with the disease from those without it. Critical appraisal here has its own vocabulary and pitfalls.

Validity Criteria for Diagnostic Studies

Criterion | Why It Matters
Independent, blinded comparison | Test and reference standard interpreted without knowledge of each other
Appropriate reference standard | The “gold standard” must truly define disease status
Reference standard applied to all | All patients, positive and negative on the index test, get the gold standard
Appropriate spectrum | Patients resemble those in whom the test will be used in practice
Reproducibility | Clear, reproducible test procedure and thresholds

Key Biases in Diagnostic Studies

Bias | Description | Effect
Spectrum bias | Study enrolls mostly severe cases and healthy controls | Inflates sensitivity and specificity; performance drops in real-world mixed populations
Verification bias (work-up bias) | Only patients with a positive index test get the gold standard | Inflates sensitivity, underestimates specificity
Review bias | Interpreter of one test knows the result of the other | Inflates apparent accuracy
Incorporation bias | The index test is part of the reference standard | Artificially inflates accuracy
Disease-progression bias | Delay between index test and reference standard allows disease status to change | Misclassification

If a study reports 99% sensitivity and 99% specificity, suspect spectrum bias — such numbers usually come from a “sick vs healthy” design that does not reflect the diagnostic challenge (intermediate cases) a clinician actually faces.

15 Sensitivity, Specificity, LRs, ROC, STARD & QUADAS-2

The 2×2 Diagnostic Table

| Disease + | Disease −
Test + | True Positive (TP) | False Positive (FP)
Test − | False Negative (FN) | True Negative (TN)

Core Diagnostic Metrics

Metric | Formula | Interpretation
Sensitivity (Sn) | TP / (TP + FN) | Probability of a positive test in diseased patients; high Sn rules out (SnNOUT)
Specificity (Sp) | TN / (TN + FP) | Probability of a negative test in healthy patients; high Sp rules in (SpPIN)
Positive predictive value (PPV) | TP / (TP + FP) | Probability of disease given a positive test; depends on prevalence
Negative predictive value (NPV) | TN / (TN + FN) | Probability of no disease given a negative test; depends on prevalence
Positive likelihood ratio (LR+) | Sn / (1 − Sp) | >10 strong rule-in; 5–10 moderate
Negative likelihood ratio (LR−) | (1 − Sn) / Sp | <0.1 strong rule-out; 0.1–0.2 moderate
Accuracy | (TP + TN) / total | Overall correct classification; misleading when prevalence is skewed

SnNOUT & SpPIN

A very Sensitive test, when Negative, rules OUT disease (SnNOUT).
A very Specific test, when Positive, rules IN disease (SpPIN). These mnemonics summarize the test-choice logic for ruling in vs ruling out.

Likelihood Ratios & Bayesian Reasoning

Likelihood ratios convert pre-test probability into post-test probability without requiring the population prevalence. They are the most clinically useful diagnostic metric because they work at the individual patient level.

LR+ | Approximate Change in Probability
>10 | Large, often conclusive increase
5–10 | Moderate increase
2–5 | Small increase
1–2 | Minimal; usually unhelpful
1 | No change

LR− | Approximate Change in Probability
<0.1 | Large, often conclusive decrease
0.1–0.2 | Moderate decrease
0.2–0.5 | Small decrease
0.5–1 | Minimal; usually unhelpful
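
The underlying mechanics are just odds arithmetic: convert the pre-test probability to odds, multiply by the LR, and convert back. A minimal sketch, using the troponin LRs from the worked example later in this section and an illustrative 30% pre-test probability:

```python
def post_test_probability(pre_test_prob, lr):
    """Apply a likelihood ratio via the odds form of Bayes' theorem."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

print(round(post_test_probability(0.30, 9.5), 2))    # positive test: ~0.80
print(round(post_test_probability(0.30, 0.056), 3))  # negative test: ~0.023
```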

ROC Curves

A receiver operating characteristic (ROC) curve plots sensitivity (y-axis) against 1 − specificity (x-axis) across all possible thresholds for a continuous test. The area under the curve (AUC) summarizes overall discrimination: AUC 0.5 = no better than chance, 0.7–0.8 acceptable, 0.8–0.9 excellent, >0.9 outstanding.
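
Tracing the curve is a simple threshold sweep; this sketch uses invented test values and approximates the AUC with the trapezoidal rule.

```python
# Toy test values; higher values suggest disease
diseased = [8, 9, 7, 10, 6, 9, 8, 11]
healthy = [3, 5, 4, 6, 2, 5, 7, 4]

points = []
for cutoff in sorted(set(diseased + healthy + [0, 99])):
    sens = sum(x >= cutoff for x in diseased) / len(diseased)
    fpr = sum(x >= cutoff for x in healthy) / len(healthy)   # 1 - specificity
    points.append((fpr, sens))

points.sort()
auc = sum((x2 - x1) * (y1 + y2) / 2                          # trapezoidal rule
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC = {auc:.2f}")   # 0.97 for these toy data: outstanding discrimination
```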

Reporting & Appraisal Tools

Tool | Purpose
STARD | Reporting checklist for diagnostic accuracy studies (analogous to CONSORT for RCTs)
QUADAS-2 | Tool to assess risk of bias and applicability in diagnostic studies included in systematic reviews
TRIPOD | Reporting of multivariable prediction models (diagnosis and prognosis)

PPV and NPV depend on prevalence; sensitivity and specificity (and thus LRs) do not. Board questions often exploit this: if the same test is used in a low-prevalence setting, sensitivity is unchanged but PPV plummets.

Worked 2×2 Example

Suppose a troponin assay is tested in 1,000 patients with chest pain. 100 truly have MI (10% prevalence). The test is positive in 95 of the 100 with MI and in 90 of the 900 without MI.

| MI + | MI − | Total
Test + | TP = 95 | FP = 90 | 185
Test − | FN = 5 | TN = 810 | 815
Total | 100 | 900 | 1000

  • Sensitivity = 95/100 = 95%
  • Specificity = 810/900 = 90%
  • PPV = 95/185 = 51%
  • NPV = 810/815 = 99.4%
  • LR+ = 0.95 / 0.10 = 9.5
  • LR− = 0.05 / 0.90 = 0.056

Notice that even with excellent sensitivity and specificity, PPV is only 51% because of the 10% prevalence. If prevalence were 1% (e.g., screening asymptomatic young adults), PPV would fall below 10%.
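
The same arithmetic in code, reproducing the table and the prevalence effect (the helper names are ours):

```python
def two_by_two(tp, fp, fn, tn):
    """Core diagnostic metrics from a 2x2 table."""
    sn, sp = tp / (tp + fn), tn / (tn + fp)
    return {"Sn": sn, "Sp": sp,
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
            "LR+": sn / (1 - sp), "LR-": (1 - sn) / sp}

print(two_by_two(tp=95, fp=90, fn=5, tn=810))
# Sn 0.95, Sp 0.90, PPV ~0.51, NPV ~0.994, LR+ 9.5, LR- ~0.056

def ppv(sn, sp, prev):
    """PPV from sensitivity, specificity, and prevalence (Bayes' theorem)."""
    return sn * prev / (sn * prev + (1 - sp) * (1 - prev))

print(round(ppv(0.95, 0.90, 0.10), 2))   # 0.51 at 10% prevalence
print(round(ppv(0.95, 0.90, 0.01), 3))   # 0.088 at 1% prevalence: below 10%
```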

Diagnostic Thresholds & Tradeoffs

Most continuous tests force a tradeoff between sensitivity and specificity via the cutoff. Lower cutoffs raise sensitivity (and NPV) but reduce specificity (and PPV). The optimal threshold depends on the clinical consequence of false negatives vs false positives — missing a pulmonary embolism vs a false-positive CT angiogram involve very different harms. The ROC curve displays every possible tradeoff.

16 Prognosis Studies & Cohort Design

Prognosis studies estimate the likely course of a disease — rate of recovery, progression, complications, death, or functional outcome. They answer questions patients ask most often: “What will happen to me?” The ideal design is a prospective inception cohort with complete follow-up.

Validity Criteria for Prognosis Studies

Criterion | Why
Well-defined inception cohort | All patients assembled at a common early point in disease (e.g., new MI diagnosis)
Follow-up sufficiently long and complete | Long enough to capture outcomes; <20% lost to follow-up
Objective outcome criteria | Blinded outcome assessment when possible
Adjustment for prognostic factors | Confounders controlled through multivariable modeling
Validation cohort | Prognostic models should be validated in a separate population

Sources of Bias in Prognosis Studies

Bias | Effect
Lead-time bias | Earlier diagnosis appears to prolong survival without changing disease course
Length bias | Screening preferentially detects slow-growing disease, inflating apparent survival
Survivor cohort bias | Cohorts assembled late miss patients who died early
Loss to follow-up | Lost patients may have different outcomes
Will Rogers phenomenon | Improved staging moves patients between categories, improving apparent survival in both

Reporting Prognostic Results

Measure | Use
Event rate at fixed time | “30-day mortality = 8%”
Median survival | Robust summary of time-to-event data
Kaplan-Meier curves | Survival over time; handles censored data
Hazard ratio (Cox model) | Relative hazard adjusted for covariates
5-year / 10-year survival | Standard cancer reporting
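
The Kaplan-Meier estimator is a running product over event times; censored patients leave the risk set without forcing a drop in the curve. A minimal sketch with invented data:

```python
# (time, event): event = 1 is death, event = 0 is censoring
data = [(2, 1), (3, 0), (5, 1), (5, 1), (8, 0), (11, 1), (12, 0)]

survival = 1.0
print("time  at_risk  deaths  S(t)")
for t in sorted({t for t, e in data if e == 1}):        # distinct event times
    deaths = sum(1 for time, e in data if time == t and e == 1)
    at_risk = sum(1 for time, e in data if time >= t)
    survival *= 1 - deaths / at_risk                    # product-limit update
    print(f"{t:>4}  {at_risk:>7}  {deaths:>6}  {survival:.3f}")
# S(t) steps: 0.857 at t=2, 0.514 at t=5, 0.257 at t=11
```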

Reporting Guidelines

The STROBE statement guides reporting of observational studies (cohort, case-control, cross-sectional). TRIPOD guides multivariable prediction model reporting, and REMARK guides prognostic tumor marker studies.

Lead-time bias and length bias are the classic cancer-screening traps: earlier detection can appear to prolong survival without actually delaying death, and screening preferentially catches indolent disease. Mortality — not survival — is the only unbiased outcome for screening evaluation.
An “inception cohort” is key. A cohort assembled at various points in the disease course (e.g., mixing newly diagnosed and chronic patients) distorts the natural history, because patients who died early are missing while survivors accumulate; this distortion is known as survivor cohort bias.

17 Harm Studies & Hill Criteria

Harm studies ask whether an exposure (drug, environmental factor, device) causes an adverse outcome. Because harms often cannot be ethically randomized, most harm evidence comes from cohorts, case-control studies, and pharmacovigilance.

Cohort vs Case-Control for Harm

Design | Best When | Limitations
Cohort | Exposure is rare, or multiple outcomes are of interest | Inefficient for rare outcomes; long follow-up
Case-control | Outcome is rare | Recall bias; selection of controls
RCT | Common harms already being studied | Underpowered for rare adverse events
Pharmacovigilance (FAERS, VAERS) | Signal detection for rare events | No denominator; voluntary reporting

Validity Criteria for Harm Studies

  • Clearly identified comparison group with similar baseline risk
  • Exposure and outcome measured the same way in both groups
  • Sufficient follow-up
  • Adjustment for confounders (including indication bias)
  • Dose-response relationship where possible

Bradford Hill Criteria for Causation

Austin Bradford Hill proposed nine viewpoints in 1965 to help judge whether an observed association is causal — not a rigid checklist but a structured way to reason:

Criterion | Meaning
Strength | Larger effect sizes are less likely to be due to confounding
Consistency | Repeated observation in different settings and populations
Specificity | One exposure → one outcome (weakest of the criteria)
Temporality | Cause must precede effect (essential, not optional)
Biological gradient | Dose-response relationship
Plausibility | Biologically coherent mechanism
Coherence | Consistent with known facts about the disease
Experiment | Removing the exposure reduces risk (when possible)
Analogy | Similar cause-effect relationships exist

Of Hill’s criteria, temporality is the only one that is strictly necessary — a cause must precede its effect. The others strengthen the inference but can be absent in real causal relationships and present in non-causal ones.

18 SR Methodology & PRISMA

A systematic review is the highest form of literature synthesis when done well. Its value depends entirely on the rigor of its methods — a poorly conducted SR is worse than a single well-done trial because its conclusions carry false authority.

Steps of a Systematic Review

  1. Frame a focused question (PICO)
  2. Pre-register the protocol (e.g., PROSPERO)
  3. Develop an explicit search strategy (multiple databases + gray literature)
  4. Screen titles, abstracts, and full texts in duplicate
  5. Extract data in duplicate with standardized forms
  6. Assess risk of bias in each included study (e.g., Cochrane RoB 2, ROBINS-I)
  7. Synthesize results (narrative and/or meta-analysis)
  8. Assess certainty of evidence (GRADE)
  9. Report transparently (PRISMA)

PRISMA — Reporting Standard

The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement is a 27-item checklist and flow diagram for transparent SR reporting. The 2020 update emphasizes protocol registration, data sharing, and explicit handling of bias.

Risk-of-Bias Tools

Tool | Study Type
Cochrane RoB 2 | RCTs (randomization, deviations from intended intervention, missing data, outcome measurement, selective reporting)
ROBINS-I | Non-randomized studies of interventions
QUADAS-2 | Diagnostic accuracy studies
Newcastle-Ottawa Scale | Observational studies (older but widely used)
AMSTAR-2 | Appraising the quality of a systematic review itself

PROSPERO registration of SR protocols has become standard. Comparing the protocol with the final review is a simple way to detect methods switching, just as ClinicalTrials.gov does for outcome switching in RCTs.

Cochrane Systematic Reviews

The Cochrane Collaboration produces the most methodologically rigorous systematic reviews in healthcare. Cochrane reviews follow standardized procedures documented in the Cochrane Handbook, undergo peer review, and are updated as new evidence accumulates. They are published in the Cochrane Database of Systematic Reviews (CDSR). Key features include:

  • Pre-specified protocols with peer review
  • Comprehensive search strategies including gray literature
  • Standard risk-of-bias assessment (RoB 2)
  • GRADE Summary of Findings tables
  • Plain language summaries for patients
  • Living review format for rapidly evolving topics

Living Systematic Reviews

A living systematic review is continuously updated as new evidence becomes available, rather than representing a snapshot in time. COVID-19 accelerated adoption of this approach — treatment guidelines needed to incorporate new RCTs within weeks rather than years. Living reviews rely on automated search alerts, streamlined screening, and modular updating.

19 Meta-Analysis & Forest Plots

Meta-analysis statistically combines results from multiple studies into a pooled effect estimate. Done well, it increases precision and power; done poorly, it produces a misleadingly confident summary of heterogeneous or biased studies.

Reading a Forest Plot

Element | Meaning
Each horizontal line | One study’s point estimate and 95% CI
Box size | Weight of the study (larger box = more weight)
Vertical line at 1 (or 0) | Line of no effect
Diamond at the bottom | Pooled estimate and 95% CI; the diamond’s tips mark the CI limits
Studies crossing the line of no effect | Individually non-significant
Pooled diamond not crossing the line | Overall significant effect

Fixed vs Random Effects

Model | Assumption | When Appropriate
Fixed-effect (common-effect) | One true underlying effect shared by all studies | Homogeneous studies, identical populations
Random-effects | True effect varies between studies; the pooled estimate is the mean of a distribution | Heterogeneous populations or interventions; more conservative (wider CI)

When heterogeneity is present, random-effects models are generally more appropriate because they acknowledge variability in the underlying effect. Random-effects models give more weight to small studies than fixed-effect models, which can be problematic if small studies are lower quality or biased.

Pooling Methods

Method | Use
Mantel-Haenszel | Dichotomous outcomes, fixed-effect
DerSimonian-Laird | Random-effects (classical)
Inverse variance | Continuous outcomes
Peto | Rare events, odds ratios
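
A compact sketch of how these pieces fit together: inverse-variance pooling on the log scale, Cochran’s Q, I², and DerSimonian-Laird random effects. The study estimates are invented.

```python
from math import exp, log, sqrt

studies = [(log(0.70), 0.10), (log(0.95), 0.12),   # (log effect, SE), invented
           (log(0.60), 0.15), (log(1.05), 0.18)]

w = [1 / se**2 for _, se in studies]               # fixed-effect weights
fixed = sum(wi * y for wi, (y, _) in zip(w, studies)) / sum(w)

q = sum(wi * (y - fixed) ** 2 for wi, (y, _) in zip(w, studies))  # Cochran's Q
df = len(studies) - 1
i2 = max(0.0, (q - df) / q) * 100
tau2 = max(0.0, (q - df) / (sum(w) - sum(wi**2 for wi in w) / sum(w)))  # DL

w_re = [1 / (se**2 + tau2) for _, se in studies]   # random-effects weights
random_eff = sum(wi * y for wi, (y, _) in zip(w_re, studies)) / sum(w_re)

print(f"fixed RR = {exp(fixed):.2f}, random RR = {exp(random_eff):.2f}")
print(f"Q = {q:.1f} (df = {df}), I^2 = {i2:.0f}%, tau^2 = {tau2:.3f}")
# fixed RR 0.78, random RR ~0.80; I^2 ~69%, so random effects is the safer model
```
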
On a forest plot, look first at the diamond for the pooled estimate, then at the spread of individual study estimates to gauge heterogeneity, and finally at box sizes to see which studies dominate the pooled result. A single huge trial can drive the entire meta-analysis.

Sample Forest Plot Interpretation Walk-Through

Imagine a forest plot of eight trials of a new oral anticoagulant vs warfarin for stroke prevention in atrial fibrillation. Six trials show point estimates to the left of the vertical line (favoring the new drug), one is right on the line, and one is slightly to the right. The pooled diamond sits at HR 0.82 with 95% CI 0.75–0.90, I² = 22%. What do you conclude?

  • The pooled estimate shows an 18% relative reduction in stroke with the new drug.
  • The 95% CI (0.75–0.90) excludes 1.0, so the result is statistically significant.
  • I² of 22% indicates low heterogeneity; the studies are reasonably consistent.
  • The consistency across studies (most favoring the new drug) supports the pooled finding.

But before accepting the conclusion, you would check risk of bias in each trial, whether the heterogeneity has a clinical explanation (dose, population), and whether the pooled NNT (derived from baseline stroke risk in the control arm) is clinically meaningful.
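
The last step, converting the pooled HR into an absolute effect, is simple arithmetic once a baseline risk is assumed; the 6% control-arm stroke risk below is a hypothetical figure for illustration.

```python
cer = 0.06                # assumed 5-year stroke risk on warfarin (hypothetical)
hr = 0.82                 # pooled hazard ratio; HR ~ RR when events are uncommon
arr = cer - cer * hr      # approximate absolute risk reduction
print(f"ARR = {arr:.4f}, NNT = {1 / arr:.0f}")
# ARR = 0.0108, NNT = 93: treat ~93 patients for 5 years to prevent one stroke
```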

20 Heterogeneity, Publication Bias & GRADE

Heterogeneity

Heterogeneity is variation between studies beyond what would be expected by chance. Sources include clinical (different populations, interventions), methodological (different designs, risk of bias), and statistical heterogeneity.

Statistic | Interpretation
Cochran’s Q (χ²) | Tests the null hypothesis of no heterogeneity; low power with few studies
I² | Percentage of total variation due to heterogeneity rather than chance: 0–40% might not be important; 30–60% moderate; 50–90% substantial; 75–100% considerable
τ² | Estimate of between-study variance in random-effects models

Publication Bias & Funnel Plots

Publication bias is the preferential publication of studies with positive, statistically significant, or novel results. It inflates meta-analytic effect estimates. Detection tools include:

  • Funnel plot: scatter of effect size vs precision (typically 1/SE). In the absence of bias, it is symmetric about the pooled estimate. Asymmetry suggests missing small “negative” studies.
  • Egger’s test: statistical test for funnel plot asymmetry.
  • Trim-and-fill: imputes missing studies to estimate an adjusted pooled effect.
GRADE — Grading of Recommendations, Assessment, Development and Evaluation

GRADE is the dominant modern framework for rating the certainty of evidence and the strength of recommendations. It is used by WHO, Cochrane, NICE, ACP, and most international guideline bodies. GRADE explicitly separates certainty in the evidence from the strength of a recommendation, which also considers values, preferences, resources, and feasibility.

GRADE Certainty of Evidence

Evidence starts at High (RCTs) or Low (observational studies) and is rated up or down:

Downgrade Reason | Upgrade Reason
Risk of bias | Large magnitude of effect
Inconsistency (heterogeneity) | Dose-response gradient
Indirectness (applicability) | All plausible confounders would reduce the effect
Imprecision (wide CI) |
Publication bias |

Certainty Level | Meaning
High | Very confident the true effect lies close to the estimate
Moderate | Moderately confident; the true effect is likely close but could differ
Low | Limited confidence; the true effect may be substantially different
Very low | Very little confidence in the estimate

GRADE Recommendation Strength

Recommendations are dichotomized as strong (“we recommend”) or weak / conditional (“we suggest”). Strong recommendations require high-certainty evidence and benefits clearly outweighing harms for most patients. Weak recommendations reflect uncertainty or tradeoffs — shared decision-making is particularly important here.

Summary of Findings Tables

GRADE-compliant SRs present a Summary of Findings (SoF) table: for each critical outcome, it shows the risk in the comparator group, the risk in the intervention group, the relative effect, the number of participants and studies, and the GRADE certainty with reasons for downgrading. This single table is often the most valuable artifact of a systematic review.

Under GRADE, observational studies can be upgraded (e.g., the dramatic effect of parachutes or insulin for type 1 diabetes), and RCTs can be downgraded for serious flaws. The starting level is not the ending level.

21 Clinical Practice Guidelines & AGREE II

Clinical practice guidelines (CPGs) are “statements that include recommendations intended to optimize patient care, informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options” (IOM, 2011). Well-made guidelines translate vast evidence into actionable recommendations; poorly made guidelines perpetuate opinion dressed as evidence.

Types of Guidelines

Type | Example | Characteristic
Evidence-based (GRADE) | WHO, NICE, ACP | Systematic review + GRADE certainty ratings
Consensus-based | Specialty society statements | Expert agreement when evidence is sparse
Screening recommendations | USPSTF | Graded A–D / I based on benefit vs harm
Quality indicators | CMS, NQF measures | Derived from guidelines; used for payment

Guideline Development Process

  1. Scope and key questions (PICO)
  2. Multidisciplinary panel with declared conflicts of interest
  3. Systematic reviews of the evidence
  4. Rating of evidence certainty (GRADE)
  5. Drafting recommendations with explicit rationale
  6. External review and public comment
  7. Publication with a planned update cycle

AGREE II — Guideline Quality Appraisal

The AGREE II (Appraisal of Guidelines for Research and Evaluation) tool assesses guideline methodology across six domains:

Domain | Focus
Scope and purpose | Objectives, questions, target population
Stakeholder involvement | Panel composition, target users, patient input
Rigor of development | Systematic search, evidence selection, recommendation formulation
Clarity of presentation | Recommendations clearly stated and identifiable
Applicability | Barriers, facilitators, resource implications
Editorial independence | Funding and conflicts of interest

Major Evidence / Recommendation Grading Systems

System | User | Notes
GRADE | Cochrane, WHO, NICE, ACP | Dominant modern framework
USPSTF | US Preventive Services Task Force | A, B, C, D, I grades for preventive services
ACC/AHA | Cardiology | Class I/IIa/IIb/III with levels of evidence A/B/C
Oxford CEBM | Older but historically influential | Levels 1–5

Conflicts of interest are a major source of guideline distortion. When appraising a guideline, always check the COI disclosures of the panel chair and members — studies have shown systematic differences between guidelines written by conflicted vs non-conflicted panels.

22 Shared Decision-Making

Shared decision-making (SDM) is the process in which clinicians and patients make decisions together using the best available evidence, accounting for the patient’s values, preferences, and circumstances. It is the clinical operationalization of the third pillar of EBM.

When SDM Is Essential

  • Preference-sensitive decisions (e.g., PSA screening, prostatectomy vs active surveillance)
  • Close tradeoffs between benefits and harms
  • Weak / conditional recommendations under GRADE
  • Decisions with major quality-of-life implications
  • Screening tests (over-diagnosis tradeoffs)

Three Talk Model (Elwyn et al.)

Phase | Content
Team talk | “There is a choice to be made; let’s make it together.”
Option talk | Describe reasonable options, benefits, and harms in understandable terms
Decision talk | Elicit preferences and reach a decision the patient can support

Tools for SDM

Tool | Purpose
Decision aids (Mayo Clinic, Option Grid) | Present options, outcomes, and probabilities in patient-friendly formats
Pictographs / icon arrays | Visualize risks (e.g., 100-person figures) to reduce framing bias
Natural frequencies | “3 in 100” rather than percentages
Teach-back | Confirm understanding by asking the patient to explain in their own words

Risk communication matters as much as the evidence itself. Patients understand “3 out of 100 people will have a stroke” far better than “the absolute risk is 3%.” Avoid relative risks (“50% higher”) in patient counseling — they exaggerate perceived danger.

Framing Effects in Risk Communication

Identical data can lead to different decisions depending on how they are presented, a phenomenon known as the framing effect. A 90% survival rate sounds better than a 10% mortality rate, though they are arithmetically identical. Best practices include:

  • Present both positive and negative framings when possible
  • Use consistent denominators (e.g., “3 in 100” and “97 in 100”)
  • Provide absolute numbers, not only relative ones
  • Use visual aids (icon arrays, bar charts; see the sketch after this list)
  • Avoid loaded terminology (“aggressive cancer” vs “slow-growing”)
  • Describe a reasonable time horizon (“over the next 10 years”)
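As one illustration of the icon-array idea, this sketch prints a plain-text 100-person grid; production decision aids use tested graphical figures, but the principle is the same.

```python
def icon_array(events_per_100, rows=10):
    """Print a 100-person grid: X = affected, . = unaffected."""
    cols = 100 // rows
    cells = ["X"] * events_per_100 + ["."] * (100 - events_per_100)
    for r in range(rows):
        print(" ".join(cells[r * cols:(r + 1) * cols]))

icon_array(3)   # e.g., "3 in 100 will have a stroke over the next 10 years"
```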

Barriers to SDM Implementation

Barrier | Mitigation
Time pressure in clinic | Pre-visit decision aids, team-based SDM
Clinician discomfort with uncertainty | Training in risk communication; normalize an honest “I don’t know”
Health literacy limitations | Plain language, teach-back, visual tools
Cultural expectations of paternalism | Culturally adapted communication
Reimbursement structure | Bill SDM visits (CMS codes exist for some decisions)

23 Special Topics — RWE, Big Data, ML, NMA

Real-World Evidence (RWE)

Real-world evidence is clinical evidence derived from routine care data — EHRs, claims databases, registries, wearables. The FDA formally incorporates RWE in post-marketing drug and device decisions. RWE can answer questions RCTs cannot: long-term safety, use in populations excluded from trials, comparative effectiveness in everyday practice.

Big Data & Machine Learning in Medicine

Concept | Role
Big data | Very large, often heterogeneous datasets (claims, genomics, imaging)
Machine learning | Algorithms that learn patterns from data (e.g., random forests, gradient boosting, deep learning)
Supervised learning | Learns from labeled training data (e.g., predicting sepsis from vital signs)
Unsupervised learning | Discovers structure in unlabeled data (clustering, phenotyping)
Model validation | Internal (cross-validation), external (independent cohorts), prospective
Calibration vs discrimination | Calibration: predicted probabilities match observed rates; discrimination: the model separates events from non-events (AUC)

ML prediction models are appraised with TRIPOD+AI, which extends the original TRIPOD reporting statement to AI-based models. Key concerns include dataset shift, algorithmic bias, lack of transparency, and the need for prospective validation before deployment.
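To make the calibration-versus-discrimination distinction concrete, here is a minimal simulation (assuming numpy and scikit-learn are installed) of a model that ranks patients well but systematically overstates their risk:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.5, 5000)      # true event probabilities
y = rng.binomial(1, p_true)                # observed outcomes
p_model = np.clip(2 * p_true, 0, 1)        # preserves ranking, doubles risk

print("AUC:", round(roc_auc_score(y, p_model), 2))   # discrimination intact
obs, pred = calibration_curve(y, p_model, n_bins=5)
print("observed vs predicted:", list(zip(obs.round(2), pred.round(2))))
# predicted risks run about twice the observed rates: poor calibration
```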

Network Meta-Analysis (NMA)

Network meta-analysis pools direct and indirect comparisons to compare multiple interventions simultaneously, even when head-to-head trials are unavailable. For example, NMA can rank several antidepressants using a network of trials in which some drugs have never been directly compared. Key assumptions include transitivity (the indirectly compared populations are similar enough to allow inference) and consistency (direct and indirect evidence agree).
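To see the mechanics, here is a minimal sketch of the Bucher method for a single indirect comparison, valid only under the transitivity assumption; the effect sizes and standard errors below are illustrative, not from any real network.

```python
import math

def bucher_indirect(log_or_ac, se_ac, log_or_bc, se_bc):
    """Indirect A-vs-B effect from A-vs-C and B-vs-C comparisons:
    subtract the log effects; the variances add."""
    log_or_ab = log_or_ac - log_or_bc
    se_ab = math.sqrt(se_ac**2 + se_bc**2)
    lo, hi = log_or_ab - 1.96 * se_ab, log_or_ab + 1.96 * se_ab
    return math.exp(log_or_ab), math.exp(lo), math.exp(hi)

# Illustrative: drug A vs placebo OR 0.60 (SE 0.15); drug B vs placebo OR 0.80 (SE 0.12)
or_ab, lo, hi = bucher_indirect(math.log(0.60), 0.15, math.log(0.80), 0.12)
print(f"Indirect OR, A vs B: {or_ab:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The indirect interval is wider than either direct comparison alone: indirect evidence buys breadth of comparison at the cost of precision.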

Individual Patient Data Meta-Analysis (IPD-MA)

IPD-MA obtains individual participant data from each included trial rather than aggregate summary data. It is the gold standard for pooled analysis because it allows standardized re-analysis, time-to-event modeling, and subgroup analyses that are impossible with published summaries. The tradeoff is substantial time and resource cost.

Machine learning models frequently achieve impressive discrimination in training data but fail catastrophically on external validation — the “reproducibility crisis” of clinical AI. Always ask whether a model has been externally validated and prospectively deployed before trusting it.

Target Trial Emulation

Target trial emulation is a modern framework in which observational data are analyzed as if they were an RCT, with the hypothetical (target) trial specified in advance. This discipline forces researchers to define eligibility, treatment strategies, assignment procedures, follow-up start, outcomes, and analysis plans up front — reducing many common observational biases such as immortal time bias and prevalent-user bias. It has become a standard approach in comparative effectiveness research using EHR and claims data.
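One way to operationalize this discipline is to write the protocol down as a structured object before touching the data; every field below is an illustrative placeholder, not a prescribed schema.

```python
# Hypothetical target-trial protocol for a comparative effectiveness question;
# all values are illustrative placeholders.
target_trial = {
    "eligibility": "adults with type 2 diabetes, no prior insulin use",
    "strategies": ["initiate drug A within 30 days", "initiate drug B within 30 days"],
    "assignment": "emulate randomization by adjusting for baseline confounders",
    "time_zero": "date of first prescription, aligned with eligibility assessment",
    "outcome": "hospitalization for heart failure",
    "follow_up": "time zero until event, death, disenrollment, or 5 years",
    "analysis": "intention-to-treat analog plus per-protocol with censoring weights",
}
# Aligning time zero with treatment assignment is what prevents immortal time bias.
```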

Mendelian Randomization

Mendelian randomization uses genetic variants as instrumental variables to infer causal effects of an exposure on an outcome. Because alleles are randomly assigned at conception (Mendel’s laws), they are largely independent of confounders that plague standard observational analyses. This approach has been used to clarify the causal role of LDL cholesterol, C-reactive protein, and body mass index in cardiovascular disease. Limitations include pleiotropy (variants affecting multiple pathways) and the need for large genetic datasets.
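The simplest MR estimator is the Wald ratio, shown here as a minimal sketch; the coefficients are illustrative stand-ins, not real GWAS estimates.

```python
def wald_ratio(beta_variant_exposure, beta_variant_outcome):
    """Causal effect of exposure on outcome, assuming the variant influences
    the outcome only through the exposure (i.e., no pleiotropy)."""
    return beta_variant_outcome / beta_variant_exposure

# Illustrative: a variant raises LDL by 0.10 SD and the log-odds of CAD by 0.045
print(wald_ratio(0.10, 0.045))  # 0.45 log-odds of CAD per SD of LDL
```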

24 Common Pitfalls in EBM

Even carefully conducted studies can mislead. The following pitfalls are among the most common reasons that apparently strong evidence fails to replicate or translate.

Statistical vs Clinical Significance (Revisited)

A p-value tells you how compatible the data are with no effect, not whether the effect matters. Very large trials can produce statistically significant differences of trivial magnitude; small trials can miss clinically important effects. Always look at the effect size, the confidence interval, and the minimum clinically important difference.
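A quick numeric sketch (illustrative event counts) shows how a trivial difference becomes “significant” at scale:

```python
import math

def two_prop_z(e1, n1, e2, n2):
    """Two-proportion z-test with a pooled standard error."""
    p1, p2 = e1 / n1, e2 / n2
    p = (e1 + e2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_prop_z(5200, 100_000, 5000, 100_000)   # 5.2% vs 5.0% event rates
print(f"z = {z:.2f}")   # ~2.0, two-sided p ~ 0.04, yet ARR is only 0.2% (NNT 500)
```

Here statistical significance and clinical importance diverge by construction.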

Surrogate Outcomes

Surrogate outcomes substitute a lab value or imaging finding for a patient-important outcome (e.g., LDL for cardiovascular death, HbA1c for microvascular complications, bone mineral density for fractures). They are convenient but dangerous: in the CAST trial, flecainide and encainide suppressed PVCs yet increased mortality, and rosiglitazone lowered glucose while raising MI risk. Surrogates must be validated against hard outcomes before being trusted for decision-making.

Composite Endpoints

Composite endpoints combine multiple outcomes (e.g., “MACE” = death + MI + stroke + revascularization) into a single variable to boost power. They are valid only when component outcomes are of similar importance and affected in the same direction. Beware when a composite is driven by the least important component (e.g., hospitalization) while mortality is unchanged.

Subgroup & Post-Hoc Analyses

Subgroup analyses ask whether an effect varies across patient subsets. They are hypothesis-generating, not confirmatory. Pre-specified subgroups examined with a formal interaction test are more trustworthy than post-hoc comparisons. The ISIS-2 investigators famously illustrated the danger with a deliberately absurd subgroup analysis suggesting aspirin did not benefit patients born under certain astrological signs.

p-Hacking, HARKing & the Garden of Forking Paths

Practice | Definition | Consequence
p-hacking | Trying multiple analyses until something “works” | Inflated false positive rate
HARKing | Hypothesizing After Results are Known | Presents exploratory findings as confirmatory
Outcome switching | Changing the primary outcome post hoc | Selective reporting bias
Publication bias | Selective publication of positive results | Distorts meta-analyses
Spin | Framing non-significant results as positive | Misleads readers and media
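A short simulation makes the first row concrete (assuming numpy and scipy are installed): testing 20 truly null outcomes per study and reporting whichever one “works” inflates the false positive rate from a nominal 5% to roughly 64%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
hits = 0
for _ in range(1000):                    # 1000 simulated studies, no true effect
    pvals = [stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
             for _ in range(20)]         # 20 outcomes, all pure noise
    if min(pvals) < 0.05:                # report whichever outcome "worked"
        hits += 1
print(f"{100 * hits / 1000:.0f}% of null studies yielded a 'significant' result")
# expected: 1 - 0.95**20, about 64%, versus the nominal 5%
```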

White-Coat Science & Reverse Causation

White-coat science refers to overconfident interpretation of weak evidence clothed in the authority of expertise. Reverse causation occurs when the outcome causes the exposure rather than vice versa (e.g., low cholesterol in sick patients is often the result of cachexia rather than a cause of death).

When reading any observational study showing a surprising benefit of a lifestyle factor or supplement, first ask: is this reverse causation (sick people stop taking it) or confounding by healthy-user bias (the kind of person who takes it does everything else healthy too)? These are the most common explanations for results that fail in randomized trials.

Number Needed to Treat Traps

  • NNT depends on baseline risk — quote NNT only for a specific population (see the sketch after this list)
  • NNT depends on time horizon — always specify (e.g., NNT 50 over 5 years)
  • NNT must be accompanied by NNH to give a fair picture of tradeoffs
  • NNT from meta-analyses may be misleading when heterogeneity is high
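A minimal sketch of the first two traps, with illustrative baseline risks:

```python
def nnt(baseline_risk, rrr):
    """NNT = 1 / ARR, where ARR = baseline risk x relative risk reduction."""
    return round(1 / (baseline_risk * rrr))

rrr = 0.25  # the same 25% relative risk reduction for everyone
print("low-risk patient  (2% 5-year risk):  NNT", nnt(0.02, rrr))   # 200
print("high-risk patient (20% 5-year risk): NNT", nnt(0.20, rrr))   # 20
# identical trial, identical RRR, a tenfold difference in absolute benefit
```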

The Reproducibility Crisis

Large replication projects (Open Science Collaboration in psychology, the Reproducibility Project: Cancer Biology) have shown that a substantial fraction of published findings do not replicate when the same studies are repeated. Causes include low statistical power, flexible analysis, publication bias, p-hacking, and inadequate description of methods. EBM’s response has included pre-registration, data sharing, methodologic transparency (CONSORT, STROBE, PRISMA), and emphasis on systematic reviews over single studies.

Ethical Considerations in EBM

Issue | EBM Implication
Equipoise | Ethical basis of RCTs — genuine uncertainty about which arm is better
Informed consent | Patients must understand that research differs from care
Vulnerable populations | Extra protections for children, pregnant women, prisoners, cognitively impaired people
Data sharing | Patients whose data are used should benefit from replication and transparency
Placebo controls | Acceptable only when no effective standard exists or an add-on design is used
Conflicts of interest | Industry funding systematically biases results toward sponsor interests

25 Reference — Checklists & Tables

This section consolidates high-yield reference material — critical appraisal checklists, study design comparisons, GRADE summaries, and common database features.

Critical Appraisal Quick Checklist — Therapy

Question | Look For
Randomized? | Method of sequence generation
Allocation concealed? | Central randomization or sealed opaque envelopes
Groups similar at baseline? | Table 1
Blinded? | Participants, caregivers, outcome assessors
Follow-up complete? | >80% or sensitivity analysis
ITT analysis? | Analyzed as randomized
Effect size meaningful? | ARR, NNT, CI, clinical significance
Applicable to my patient? | Population, setting, values

Critical Appraisal Quick Checklist — Diagnosis

Question | Look For
Independent, blinded comparison with reference standard? | Methods section
Appropriate spectrum of patients? | Consecutive series, not extremes
Reference standard applied regardless of index test? | No verification bias
Methods described in enough detail to replicate? | STARD checklist
Sens/Spec and likelihood ratios reported? | With CIs

Critical Appraisal Quick Checklist — Systematic Review

Question | Look For
Focused question (PICO)? | Stated clearly
Protocol registered? | PROSPERO
Comprehensive search? | Multiple databases + gray literature
Study selection in duplicate? | Two reviewers with kappa
Risk of bias assessed? | RoB 2 or ROBINS-I
Heterogeneity evaluated? | I², subgroup analysis
Publication bias assessed? | Funnel plot, Egger’s test
Certainty of evidence rated? | GRADE

Study Design Comparison

Design | Question | Can Show Causation? | Time Direction
RCT | Therapy, prevention | Yes (strongest) | Forward
Cohort | Prognosis, harm, rare exposures | Supports causal inference | Forward
Case-control | Rare outcomes, etiology | Supports causal inference | Backward
Cross-sectional | Prevalence, diagnosis | No (no temporality) | Snapshot
Case series | Description | No | Variable
Ecological | Population-level hypotheses | No (ecological fallacy) | Snapshot

GRADE Summary Table

Starting Point | Downgrade For | Upgrade For
RCT → High; Observational → Low | Risk of bias, inconsistency, indirectness, imprecision, publication bias | Large effect, dose-response gradient, plausible confounding in the opposite direction

Reporting Guidelines by Study Type

Study Type | Guideline
RCT | CONSORT
Observational | STROBE
Systematic review | PRISMA
Diagnostic accuracy | STARD
Prediction model | TRIPOD / TRIPOD+AI
Qualitative | SRQR / COREQ
Case report | CARE
QI intervention | SQUIRE
Protocol | SPIRIT

Common Databases — At a Glance

Database | Strength
PubMed / MEDLINE | Broadest free biomedical resource
Cochrane Library | Systematic reviews & trials registry
Embase | European journals; pharmacology
CINAHL | Nursing / allied health
PsycINFO | Mental health
ClinicalTrials.gov | Trial registry & results
UpToDate / DynaMed | Point-of-care summaries
PROSPERO | Systematic review protocol registry

26 High-Yield Review

This final section distills the most commonly tested and clinically essential EBM concepts.

The EBM Mindset in One Paragraph

Practicing EBM means approaching every clinical decision with humility about uncertainty, discipline about evidence, and respect for patient values. It means framing answerable questions, searching efficiently, appraising critically, quantifying benefit and harm, and integrating evidence with clinical judgment at the bedside. It is not about memorizing trials — it is a durable set of skills that outlasts any individual fact and keeps practice aligned with the best available knowledge as medicine evolves.

Core Definitions to Master

Term | One-Line Definition
EBM | Integration of best evidence, clinical expertise, and patient values
PICO | Population, Intervention, Comparison, Outcome
Sensitivity | P(test + | disease +); SnNOUT
Specificity | P(test − | disease −); SpPIN
PPV / NPV | Depend on prevalence
LR+ | Sn / (1 − Sp); >10 rules in
LR− | (1 − Sn) / Sp; <0.1 rules out
ARR | CER − EER; absolute benefit
RRR | (CER − EER) / CER
NNT | 1 / ARR; patients treated to prevent one event
NNH | 1 / ARI; patients treated to cause one harm
OR | Odds ratio; approximates RR when outcome is rare
HR | Hazard ratio; from Cox regression
ITT | Analyzed as randomized; preserves randomization
I² | Percent of variation due to heterogeneity
GRADE | Framework for certainty of evidence and recommendation strength
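Several of these definitions reduce to quick arithmetic on a 2×2 table. Here is a minimal sketch with illustrative counts, including the prevalence dependence of post-test probability:

```python
def two_by_two(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from a 2x2 table."""
    sn, sp = tp / (tp + fn), tn / (tn + fp)
    return sn, sp, sn / (1 - sp), (1 - sn) / sp

def post_test_prob(pretest, lr):
    """Convert probability to odds, apply the LR, convert back."""
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

sn, sp, lr_pos, lr_neg = two_by_two(tp=90, fp=5, fn=10, tn=95)
print(f"Sn {sn:.2f}, Sp {sp:.2f}, LR+ {lr_pos:.0f}, LR- {lr_neg:.2f}")
print(f"post-test prob at 1% prevalence: {post_test_prob(0.01, lr_pos):.2f}")
```

With Sn 0.90 and Sp 0.95, a positive result at 1% prevalence still leaves only about a 15% post-test probability: the false positive paradox from the pearls below.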

Rapid-Fire Clinical Pearls

Always translate relative risk reductions into absolute risk reductions and NNTs. A 50% RRR on a 2% baseline is an ARR of only 1% and an NNT of 100 — impressive-sounding relative numbers often hide modest absolute benefits.
The CAST trial is the canonical refutation of surrogate-outcome reasoning. Suppressing PVCs after MI with flecainide and encainide seemed logical but increased mortality. Never accept a surrogate outcome without hard-outcome validation.
SnNOUT & SpPIN: a sensitive test, when negative, rules out; a specific test, when positive, rules in. Use sensitive tests for screening (you cannot afford to miss disease) and specific tests for confirmation (you cannot afford false positives).
Likelihood ratios are the most useful diagnostic metric at the bedside because they work directly on a single patient’s pre-test probability, without requiring population prevalence. LR+ >10 and LR− <0.1 move probability substantially.
PPV and NPV depend on prevalence; sensitivity, specificity, and LRs do not. A test with great sensitivity and specificity can still have terrible PPV in a low-prevalence screening setting — the basis of the “false positive paradox” for rare diseases.
Intention-to-treat preserves the benefits of randomization by analyzing patients in their originally assigned group regardless of adherence. Per-protocol analysis should be a sensitivity check, not the primary analysis — except in non-inferiority trials, where it is more conservative.
Allocation concealment and blinding are distinct. Concealment protects randomization from being subverted at enrollment; blinding protects against differential care and biased outcome assessment afterward. Both matter, and they are often confused on exams.
I² quantifies heterogeneity in a meta-analysis. Values >50% suggest substantial heterogeneity and favor random-effects models. When heterogeneity is high, investigate it (subgroups, meta-regression) rather than just pooling blindly.
A funnel plot with asymmetric missing small-study regions suggests publication bias. Trim-and-fill gives a bias-corrected estimate but is not a substitute for comprehensive searching including gray literature and trial registries.
GRADE ratings start at High for RCTs and Low for observational studies, then move up or down based on risk of bias, inconsistency, indirectness, imprecision, publication bias, magnitude of effect, and dose-response. Strong recommendations require high-certainty evidence and a clearly favorable benefit-harm balance.
Shared decision-making is not optional under EBM — it is the operationalization of the third pillar (patient values). It is especially critical for preference-sensitive decisions, weak/conditional recommendations, and screening tests with over-diagnosis tradeoffs.
Case-control studies are efficient for rare outcomes (e.g., aplastic anemia from chloramphenicol); cohorts are efficient for rare exposures (e.g., occupational solvents). Choose by asking which is rarer — exposure or outcome.
Hill’s criteria are a structured way to judge causality in observational data; temporality is the only absolutely necessary criterion. The others strengthen the case but are neither sufficient nor universally necessary.
Pre-appraised resources (UpToDate, DynaMed, Cochrane) are the practical top of the evidence pyramid at the point of care. Start there, and descend to primary studies only when pre-appraised sources are insufficient or out of date.
Always check whether an RCT’s primary outcome was pre-registered on ClinicalTrials.gov or in the protocol. Outcome switching (changing the primary outcome after seeing results) is one of the most common forms of selective reporting bias in published literature.

Exam & Practice Strategy

For EBM questions: (1) Identify the question type (therapy, diagnosis, prognosis, harm) — this determines the best design and appraisal framework. (2) Know validity criteria for each design (randomization, blinding, ITT for therapy; independent comparison and spectrum for diagnosis). (3) Be comfortable converting between RR, RRR, ARR, NNT, sensitivity, specificity, and likelihood ratios. (4) Memorize the 2×2 table and recompute metrics quickly. (5) Recognize common biases (selection, performance, attrition, verification, spectrum, publication). (6) Apply GRADE thinking: start with design, adjust for quality, interpret in light of effect size and precision. These six skills resolve the vast majority of EBM and biostatistics questions on any exam and, more importantly, at the bedside.