Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Surgery
Sophocles H. Voineskos
Charles H. Goldsmith
Editors
Evidence-Based Surgery
A Guide to Understanding
and Interpreting the Surgical
Literature
123
Editors
Achilles Thoma Charles H. Goldsmith
Department of Surgery, Division of Department of Health Research Methods,
Plastic Surgery, Department of Health Evidence and Impact (HEI)
Research Methods, Evidence McMaster University
and Impact (HEI) Hamilton, ON, Canada
McMaster University
Faculty of Health Sciences
Hamilton, ON, Canada
Simon Fraser University
Burnaby, BC, Canada
Sheila Sprague
Department of Surgery, Division of Department of Occupational Science
Orthopaedic Surgery, Department of and Occupational Therapy,
Health Research Methods, Evidence Faculty of Medicine
and Impact (HEI) The University of British Columbia
McMaster University Vancouver, BC, Canada
Hamilton, ON, Canada
Sophocles H. Voineskos
Department of Surgery, Division of
Plastic Surgery
McMaster University
Hamilton, ON, Canada
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
v
vi Preface
Therefore, while the primary goal of this book is to teach surgeons how to
appraise the surgical literature, a collateral benefit is that the concepts
explained here may help research-minded surgeons produce better research.
References
6. Corlan AD. Medline trend: automated yearly statistics of PubMed results for any query
[Internet]; 2004. Available from http://www.webcitation.org/65RkD48SV. Accessed
by 16 July 2018.
7. Rohrich R, Weinstein A. Predator-in-Chief: Wolves in Editors’ Clothing. PRS Go.
2018; 6: 2-pe1652.
8. Farrokhyar F, Amin N, Dath D, Bhandari M, Kelly S. Impact of the Surgical Research
Methodology Program on surgical residents’ research profiles. J Surg Educ. 2014;71:
513–20.
Contents
ix
x Contents
xiii
xiv Contributors
Evidence-based surgery (EBS) as a paradigm rience. The art and knowledge of surgery was
shift in surgical practice evolved from its pre- passed from teacher to student.
cursor, evidence-based medicine (EBM), a term The EBM movement was introduced to address
first coined by Gordon Guyatt in the early 1990s this gap, with a goal to ensure that patients were
[1]. Before the development of “science-based treated based on evidence, not the word of author-
medicine”, physicians from the time of Hip- ities [2]. While EBM, as a movement, is relatively
pocrates learned the craft of medicine, on expe- new to the medical community, clinical trials have
been taking place in some form since the 1500s. It
has been reported that the first clinical trial was
conducted, accidentally, by Ambroise Pare while
treating military soldiers [3]. When the supply of
oil, the standard treatment for wounds was sparse,
A. Thoma (&) J. Murphy he mixed a digestive of egg yolks, rose oil, and
Department of Surgery, Division of Plastic Surgery, turpentine. Pare reported the differences between
McMaster University, Hamilton, ON, Canada
e-mail: athoma@mcmaster.ca
the two treatments, with those receiving the new
treatment experiencing little pain and decreased
J. Murphy
e-mail: murphj11@mcmaster.ca
swelling [3]. The first physician-conducted con-
trolled clinical trial within the modern era was
S. Sprague
Department of Surgery, Division of Orthopaedic
recorded 200 years later by James Lind. Lind
Surgery, McMaster University, Hamilton, ON, planned a trial comparing cures for scurvy using 12
Canada patients at sea [3]. From this study, it was discov-
e-mail: sprags@mcmaster.ca ered that citrus fruits could treat scurvy; however, it
A. Thoma S. Sprague C. H. Goldsmith wasn’t until 50 years later that the British Navy
Department of Health Research Methods, Evidence made lemon juice a part of sailors’ diets [3].
and Impact, McMaster University, Hamilton, ON,
Canada
Notable events that occurred since then lead-
e-mail: cgoldsmi@sfu.ca ing the development of science-based surgery
C. H. Goldsmith
and medicine overall included: (1) The careful
Faculty of Health Sciences, Simon Fraser University, documentation of surgical interventions and
Burnaby, BC, Canada results by Billroth and Halstead [4]; (2) the
C. H. Goldsmith publication of the Flexner report in the early
Department of Occupational Science and twentieth century, cementing “scientific inquiry
Occupational Therapy, Faculty of Medicine, The as the bedrock of American medicine” [5]; and
University of British Columbia, Vancouver, BC,
Canada
(3) the US FDA Kefauver-Harris Act in the
1960s, demanding rigorous empirical testing of primarily internists. Their publications on the
clinical trials in humans leading to the estab- concepts of EBM focused on clinical scenarios in
lished claims regarding drug efficacy and safety internal medicine that were not considered rele-
of new pharmaceutical innovations [5]. vant to surgeons.
The “experiential” nature of the development To introduce the principles of EBM into sur-
of surgery in modern times was highlighted by gery, the definition of EMB was slightly altered to
Meakins [6]. He believed that surgical training EBS by a group of surgeons and methodologists
was performed “in a hierarchical environment” at McMaster University [15]. Evidence-Based
where the professor or chief likely defined the Surgery (EBS) is thus defined as the integration
way clinical situations were to be managed, and of best research evidence, patients’ preferences/
how the operation was to be done [6]. values, health resource availability, clinical set-
Sometime within the 1970s and 80s a paradigm ting, and ultimately our surgical expertise. This
shift occurred in which evidentiary rules changed group of surgeons and methodologists, including
the hierarchical system mentioned above by mentees and colleagues of David Sackett, pub-
Meakins; this new paradigm placed a higher value lished a series of articles in the Canadian Journal
on evidence than authority. The protagonists in this of Surgery with the goal to teach surgeons of all
paradigm shift were a group of clinical epidemi- subspecialties the skills of appraising a surgical
ologists, including David Sackett (considered by article before applying the conclusions of the
most, the father of EBM), David Eddy, Archie article to their practice [15–35]. Gradually some
Cochrane, and others. They highlighted the need surgical subspecialties encouraged their members
for strengthening the empirical evidence of medi- to adopt the principles of EBS [36, 37]. There is
cine [5–11]. They proposed the initial evidentiary an impression that the increase and quality of
rules for guiding clinical decisions, and shortly surgical research seen since 2000 in these spe-
after, published the first of a series of articles in the cialties is due to the adoption of the EBM and
Canadian Medical Association Journal (CMAJ) EBS philosophies.
advising physicians how to appraise the medical In practicing evidence-based surgery, we base
literature [12]. This series of articles was followed our decisions on the five pillars shown in
by the well-known article series, Users’ guides to Fig. 1.3. First, we consider our patient’s prefer-
the medical literature, first published in JAMA in ences and actions. Second, we consider the
1993 [13]. healthcare resources available to us. Third, we
Some may argue that the seeds of the EBM consider the best available evidence. Fourth, we
movement started with the publication of a rudi- consider the clinical setting we are practicing
mentary form of levels of evidence by the Cana- (e.g., academic, community, developing nation,
dian Task Force on Periodic Health Examination war zone, etc.), finally we integrate these with
in 1976 as a result of a joint effort of the Deputy our own surgical skills.
Health Ministers across the ten Canadian pro- Since the introduction of EBM and EBS,
vinces [7]. This Task Force proposed an evidence numerous surgical training programs across
rating system with four levels of evidence (LOE) North American have started to integrate
and the corresponding types of evidence (see evidence-based training and research into their
Fig. 1.1). This LOE rating system which is central core curriculum. Duke University’s surgical
to the philosophy of EBM and EBS was improved departments are associated with the Duke Clini-
by Sackett to include five levels (see Fig. 1.2) [8]. cal Research Institute (DCRI). Here, surgical
These early LOE rating scales have been since residents have the opportunity to receive training,
modified and made more stringent to take into participate, and lead clinical research projects
account a study’s methodological quality [14]. [39]. The DCRI strives to create meaningful and
The advent of the EBM movement in the early realistic clinical trials that can shape patient care,
1980s was slowly adopted by surgery, likely and then integrate the results into practice [39].
because the fathers of this paradigm shift were Similarly, Stanford University has a Center for
1 History of Evidence-Based Surgery (EBS) 3
Clinical Research (SCCR) that focuses on develop and validate clinical and radiographic
research collaboration, innovation, and evidence- measures in rheumatic diseases [42]. Every
based operations [40]. The SCCR integrates 2 years OMERACT organizes meetings to con-
educational sessions into their program with past tinuously develop outcome measure consensus
sessions on clinical research education, good among panel members within the context of
clinical practice, and the informed consent pro- musculoskeletal and autoimmune diseases [42].
cess [40]. McMaster University has included Information on the OMERACT group can be
very similar programming into their core surgical found at their website, www.omeract.org.
curriculum by incorporating a Surgical Research The introduction of OMERACT was closely
Methodology (SRM) course [41]. Through this followed by GRADE (Grading of Recommen-
course, residents learn critical thinking and the dations Assessment, Development and Evalua-
appreciation for statistical interpretation [41]. tion). GRADE was developed by an international
The success of these programs can be shown panel of experts in the area of evidence-based
through the increases in research productivity practice [43]. The aim of GRADE was to create
and performance in surgical residents among all an easy to follow and transparent guide to rating
disciplines [40]. It is believed that the imple- the quality/certainty of evidence and strength of
mentation of such programs not only benefits the recommendations in systematic reviews and
future careers of surgical residents but also guideline development [43]. From this initiative
encourages evidence-based patient care [41]. came the development of the GRADEpro and its
Included in many research-training courses is corresponding phone application that can be used
the introduction of research guidelines that as an all-on-one tool for summarizing and pre-
standardize the reporting and measurement of senting information for decision-making in
outcomes. One of the earliest initiatives to health care [43]. More information can be found
improve outcome measurement was OMERACT at their website www.gradeworkinggroup.org.
(Outcome Measurements in Rheumatology) [42]. GRADE was adopted by the Cochrane Col-
OMERACT is an independent initiative of laboration to evaluate the quality of evidence from
healthcare professionals that was introduced in systematic reviews and further, to summarize the
the early 1990s [42]. OMERACT continues to findings and present the evidence to decision
1 History of Evidence-Based Surgery (EBS) 5
makers [44]. The Cochrane Collaboration (named in the health research literature, the COMET
in honor of Archie Cochrane) is an independent (Core Outcome Measures in Effectiveness Trials)
network of researchers, professionals, healthcare initiative focuses on the development and appli-
workers, and patients, which was formed in the cation of agreed-upon standardized outcome sets
mid-2000s [45]. Its goal is to improve healthcare or “core outcome sets” [48]. This initiative out-
decisions and transform the way decisions are lines the minimum that should be both measured
made in health care. To accomplish this goal, and reported within a clinical trial in regard to a
Cochrane gathers and summarizes the best evi- specific condition [48]. This outlines the expec-
dence to help the professional make informed tation that the identified core outcomes will be
treatment decisions [45]. As of 2018, there are collected and reported for each respective con-
37,000 contributors from over 130 countries dition, allowing results from many trials to be
working together to create a source of credible and compared and combined [48]. The COMET
accessible health information, free of commercial handbook is available at their website: www.
sponsorship or conflict of interest [45]. The con- comet-initiative.org.
tributors include leaders in the fields of medicine, Regulating the quality of both health mea-
health policy, research methodology, and con- surement and health reporting is incredibly
sumer advocacy. The work of Cochrane is neces- important for the standardization of the clinical
sary as access to health evidence, and the literature. However, maintaining public trans-
subsequent risk of misinterpreting data increases parency of methods, anonymized results, and
[45]. More information on Cochrane can be found adverse events is potentially just as important for
at www.cochrane.org. the overall quality of clinical research. In 1997,
To regulate the methodological quality of health the Food and Drug Administration (FDA) estab-
measurement, the COSMIN (COnsensus-based lished an up-to-date and monitored database of
Standards for the selection of health Measure- clinical trials [49]. This government-led website,
ment INstruments) initiative was developed [46]. made public in 2000, provides resources on
In 2006, this group published an appraisal tool for publicly and privately supported clinical trials for
evaluating the quality of studies based on mea- patients, family members, and healthcare pro-
surement properties of health measurement instru- fessionals [49]. Maintained by the National
ments [46]. The checklist, which was developed in Library of Medicine (NLM) from the National
an international Delphi study, focuses on Institutes of Health (NIH), this database is
health-related patient-reported outcomes and is updated and strictly monitored to ensure up to
also useful in evaluating studies on performance- date information [49]. This registry is a govern-
based tests or clinical rating scales [46]. More ment website and therefore does not host, receive
information on the COSMIN initiative, as well as funding or advertise any commercial entities or
their checklist, can be found at www.cosmin.nl. content [49]. As of March 2018, there are over
The EQUATOR (Enhancing the QUAlity and 268,000 study records on the ClinicalTrials.gov
Transparency Of health Research) Network is a database that are being conducted in all 50 States
well-known international initiative that aims to and 203 countries [49]. The registry is a great
improve reliability and value of health research way to stay current on developments in the var-
[47]. To accomplish this goal, the EQUATOR ious fields of medicine, find collaborating part-
Network has created stringent guidelines to be ners and identify gaps within specific areas of
used for reporting, based on study type [47]. interest. Anyone can search this database, with-
The EQUATOR Network provides information out registering, by visiting www.clinicaltrials.
pertaining to all of their guidelines at their website gov/ct2/home.
www.equator-network.org. Table 1.1 lists the Researchers can also register systematic
guidelines and their respective study types. reviews with PROSPERO, the International
While the abovementioned guidelines from prospective registrar of systematic reviews [50].
the EQUATOR network outline proper reporting PROSPERO was produced by the Centre for
6 A. Thoma et al.
Reviews and Dissemination at York University need to be addressed are Knowledge Translation
and funded by the NIH [50]. This international of new advances in surgery into practice. With
database contains detailed information about so many journals and information being pub-
systematic review protocols in numerous areas lished in both legitimate and predatory journals,
including health and social care, welfare, public the “true signal” of surgical advances may be
health, and education [50]. Key features from lost in the “noise”. The new generation of sur-
each review protocol are recorded and main- geons should be trained in appraisal skills and
tained to provide a comprehensive list of regis- informed of important pre-appraised informa-
tered systematic reviews to help avoid tion such as systematic reviews or clinical
duplication and reduce the opportunity for practice guidelines. Utilizing the EQUATOR
reporting bias [50]. Cochrane protocols are also Network, COMET, COSMIN, and GRADE
included in PROSPERO with links to the full guidelines can help surgeons appraise and be
protocol found on the Cochrane Library. Readers aware of clinically important literature within
can learn more about PROSPERO by visiting: their subspecialties.
www.crd.york.ac.uk/prospero.
The application of the abovementioned
guidelines helps to address the many areas of References
failure that can occur during the research pro-
cess [51]. The EQUATOR Network guidelines, 1. Guyatt GH. Evidence-based medicine. ACP J Club.
COMET handbook, and COSMIN tool are 1991; A-16:114.
2. Smith R, Rennie D. Evidence based medicine—an
critical in the design, implementation, and oral history. BMJ. 2014;348:371–3.
reporting of clinical and observational trials; 3. Bhatt A. Evolution of clinical research: a history
GRADE can be applied in systematic reviews, before and beyond James Lind. Perspect Clin Res.
meta-analyses, and guideline development [51]. 2010;1(10):6–10.
4. Imber G. Genius on the edge: the bizarre double life
Through the integration of the recommendations of Dr. William Steweard Halsted. New York: Kaplan
of the EQUATOR guidelines, surgeons will Publishing; 2011.
advance reporting and transparency of surgical 5. Djulbegovic B, Guyatt GH. Progress in evidence-
research; similarly, by applying COSMIN, sur- based medicine: a quarter century on. Lancet.
2017;290(10092):415–23.
gical research will be more efficient. All of these 6. Meakins JL. Innovation in surgery. Am J Surg.
initiatives will thus advance the overall philos- 2002;183:399–405.
ophy of EBS. Future challenges to EBS that
1 History of Evidence-Based Surgery (EBS) 7
7. Canadian Task Force on the Periodic Health Exam- to use a decision analysis. Can J Surg. 2007;50
ination (CTFPHE). The periodic health examination. (5):403–9.
Can Med Assoc J. 1979;121(9):1193–254. 24. Thoma A, Cornacchi S, Lovrics P, Goldsmith CH.
8. Sackett DL. Rules of evidence and clinical recom- Users’ guide to the surgical literature: how to assess
mendations on the use of antithrombotic agents. an article on health-related quality of life. Can J Surg.
Chest. 1989;95(Supp 2):2S–4S. 2008;51(3):215–24.
9. Eddy DM, Billings J. The quality of medical 25. Cadeddu M, Farrokhyar F, Thoma A, Haines T,
evidence: implications for quality of care. Health Garnett A, Goldsmith CH. Users’ guide to the
Aff (Millwood).1988;7(1):19–32. surgical literature: how to assess power and sample
10. Cochrane AL. Effectiveness and efficiency: random size. Can J Surg. 2008;51(6):476–82.
reflections on health services. London: Nuffield 26. Hansebout R, Cornacchi SD, Haines T, Gold-
Provincial Hospitals Trust; 1972. smith CH. Users’ guide to the surgical literature:
11. Eddy DM. Guidelines for the cancer related checkup: how to use an article on prognosis. Can J Surg.
recommendations and rationale. CA: Cancer J Clin. 2009;52(4):328–36.
1980;30:3–50. 27. Dijkman B, Kooistra B, Bhandari M. Users’ guide to
12. Sackett DL. How to read clinical journals: I. Why to the surgical literature: how to work with a subgroup
read them and how to start reading them critically. analysis. Can J Surg. 2009;52(6):515–22.
Can Med Assoc J. 1981;124(5):555–8. 28. Thoma A, Cornacchi SD, Farrokhyar F, Bhandari M,
13. Oxman AD, Sackett DL, Guyatt GH. Users’ guides Goldsmith CH. Users’ guides to the surgical litera-
to the medical literature. I. How to get started. The ture: how to assess a survey in surgery. Can J Surg.
evidence-based medicine working group. JAMA. 2011;54(6):394–402.
1993;270(17):2093–95. 29. Cadeddu M, Farrokhyar F, Levis C, Cornacchi,
14. Oxford Centre for Evidence-based Medicine. Levels Haines T, Thoma A. For the evidence-based surgery
of evidence (March) 2009. http://www.cebm.net/ working group. Users’ guide to the surgical literature:
oxfordcentre-evidence-based-medicine-levels- understanding confidence intervals. Can J Surg.
evidence-march-2009. Accessed 8 Aug 2014. 2012;55(3):207–11.
15. Archibald S, Bhandari M, Thoma A. Evidence based 30. Coroneos CJ, Voineskos SH, Cornacchi SD, Gold-
surgery series. Users’ guide to the surgical literature: smith CH, Ignacy TA, Thoma A. Users’ guide to the
how to use an article about a diagnostic test. Can J surgical literature: how to evaluate clinical practice
Surg. 2001;44(1):17–23. guidelines. Can J Surg. 2014;57(4):280–6.
16. Urschel J, Tandan V, Miller J, Goldsmith CH. Users’ 31. Waltho D, Kaur MN, Haynes RB, Farrokhyar F,
guide to evidence-based surgery: how to use an Thoma A. Users’ guide to the surgical literature: how
article evaluating surgical interventions. Can J Surg. to perform a high-quality literature search. Can J
2001;44(2):95–100. Surg. 2015;58(5):349–58.
17. Thoma A, Sprague S, Tandan V. Users’ guide to the 32. Thoma A, Kaur MN, Farrokhyar F, Waltho D,
surgical literature: how to use an article on economic Levis C, Lovrics P, Goldsmith CH. Users’ guide to
analysis. Can J Surg. 2001;44:347–55. the surgical literature: how to assess an article about
18. Hong D, Tandan V, Goldsmith CH, Simunovic M. harm in surgery. Can J Surg. 2016;59(5):351–7.
Users’ guide to the surgical literature: how to use an 33. Gallo L, Eskicioglu C, Braga L, Farrokhyar F,
article reporting population-based volume-outcome Thoma A. Users’ guide to the surgical literature: how
relationships in surgery. Can J Surg. 2002;45(2): to assess an article using surrogate outcomes. Can J
109–15. Surg. 2017;60(4):280–7.
19. Birch D, Eady A, Robertson D, De Pauw S, 34. Thoma A, Farrokhyar F, Waltho D, Braga LH,
Tandan V. Users’ guide to the surgical literature: Sprague S, Goldsmith CH. Users’ guide to the
how to perform a literature search. Can J Surg. surgical literature: how to assess a noninferiority
2003;46(2):136–41. trial. Can J Surg. 2017;60(6):426–32.
20. Bhandari M, Devereaux PJ, Montori V, Cinà C, 35. Gallo L, Murphy J, Braga L, Farrokhyar F, Thoma A.
Tandan V, Guyatt GH. Users’ guide to the surgical Users’ guide to the surgical literature: how to assess a
literature: how to use a systematic literature review qualitative study. Can J Surg. Accepted for publica-
and meta-analysis. Can J Surg. 2004;47(1):60–7. tion October 2017.
21. Thoma A, Farrokhyar F, Bhandari M, Tandan V. The 36. Thoma A. Evidence-based plastic surgery: design,
evidence-based surgery working group. Users’ guide measurement and evaluation. Toronto Ontario: Clin
to the surgical literature: how to assess a randomized Plastic Surg. 2008;35(2):ix.
controlled trial in surgery. Can J Surg. 2004;47 37. Bhandari M, editor. Evidence-based orthopedics.
(3):200–8. Oxford: Wiley-Blackwell; 2012.
22. Birch D, Goldsmith CH, Tandan V. Users’ guide to 38. Haynes RB, Devereaux PJ, Guyatt GH. Clinical
the surgical literature: self audit and practice appraisal expertise in the era of evidence-based medicine and
for surgeons. Can J Surg. 2005;48(1):57–62. patient choice. BMJ Evid-Based Med. 2002;7:36–8.
23. Mastracci TM, Thoma A, Farrokhyar F, Tandan VR, 39. Duke University. Duke Clinical Research Institute:
Cina CS. Users’ guide to the surgical literature: how from thought leadership to clinical practice—about
8 A. Thoma et al.
us [Internet]. Duke University; 2017 [cited 13 Mar 45. Cochrane. About Us [Internet]. Cochrane group
2018]. Available from https://www.dcri.org/about/ [cited 29 Mar 2018]. Available from http://www.
who-we-are/. cochrane.org/about-us.
40. Stanford University. Stanford Center for Clinical 46. VU University Medical Center. COSMIN [Internet].
Research (SCCR) [Internet]. Stanford University; VU University [cited 13 Mar 2018]. Available from
2018 [cited 13 Mar 2018]. Available from http://med. http://www.cosmin.nl/.
stanford.edu/sccr.html. 47. The EQUATOR Network. Equator network—about
41. Farrokhyar F, Amin N, Dath D, Bhandari M, Kelly S, us [Internet]. The EQUATOR network [cited 13 Mar
Kolkin AM, et al. Impact of the surgical research 2018]. Available from http://www.equator-network.
methodology program on surgical residents’ research org/about-us/x.
profiles. J Surg Educ. 2014;71(4):513–20. 48. COMET Initiative. The COMET initiative [Internet].
42. OMERACT Outcome Measures in Rheumatology. COMET [cited 13 Mar 2018]. Available from http://
Welcome to OMERACT [Internet]. OMERACT www.comet-initiative.org/.
[cited 19 Mar 2018]. Available from http://www. 49. National Institute of Health (NIH). Clinicaltrials.gov
omeract.org. [Internet]. NIH [cited 20 Mar 2018]. Available from
43. GRADE Working Group. GRADE: welcome to the http://www.clinicaltrials.gov.
grade working group—from evidence to recommen- 50. University of York Centre for Reviews and Dissem-
dations—transparent and sensible [Internet]. GRADE ination—PROSPERO. About PROSPERO [Internet].
working group [cited 13 Mar 2018]. Available from York [cited 29 Mar 2018]. Available from https://
http://www.gradeworkinggroup.org/. www.crd.york.ac.uk/prospero/#aboutpage.
44. McMaster University Mac GRADE Centre. About 51. Innes N, Schwendicke F, Lamont T. How do we
GRADE [Internet]. McMaster University [cited 29 create, and improve the evidence? Br Dent J.
Mar 2018]. Available from https://cebgrade.mcmaster. 2016;220(12):651–5.
ca/aboutgrade.html.
The Steps of Practicing
Evidence-Based Surgery (EBS) 2
Achilles Thoma, Sheila Sprague, Luis H. Braga
and Sophocles H. Voineskos
In Chap. 1, we explained the historical devel- evidence-based approach [2]. Many surgeons,
opment of the Evidence-Based Medicine (EBM) however, are not familiar with the concepts of
and Evidence-Based Surgery (EBS) and para- EBS and how they can be applied. We hope that
digm shifts of practicing medicine and surgery to this book will help close that gap.
the present era. EBS is defined as an approach of This chapter will show you step by step how
practicing surgery in which the surgeon is aware to adopt an evidence-based approach to your
of the best evidence in support of practice and the surgical practice. There are five distinct steps in
strength of that evidence. the process of systematically finding, appraising,
In 2008, the British Medical Journal (BMJ) and using the contemporaneous research findings
hailed this paradigm shift as one of the 15 most as the basis for clinical decisions [3].
important milestones in medicine in the last These steps need to be followed precisely in
150 years [1]. The experiential and authoritarian their correct order (see Table 2.1). They will
approach to learning and practicing medicine and become clearer as you read the following chap-
surgery in the past has been supplanted by the ters of the book, which go into more detail and
also explain the methodological issues of the
various study designs.
Table 2.1 Steps to evidence-based surgery formulation, which as you remember is generated
1 Construct a relevant, answerable question from a from our clinical scenario/problem. Such key
clinical scenario (see Chap. 3) words will include “colon cancer”, “laparoscopic
2 Plan and carry out a literature search for best surgery”, “open Surgery”, and “ureter injury”.
external evidence (see Chap. 4) These key words are then first entered into a
3 Critically appraise the literature for validity and filtered electronic database to see if any articles
applicability (see Chaps. 6, 10 and 11) address the research question. Filtered databases,
4 Apply the evidence to your clinical practice (see such as the Cochrane Database, include articles
resolution of scenarios in Chaps. 6, 10 and 11) pre-appraised by methodological experts. If there
5 Evaluate your performance are no hits obtained on this database, an unfil-
tered database could be used. Unfiltered data-
bases are ones in which articles go through the
usual review process; examples of unfiltered
Scenario
database are Medline or Embase.
At the weekly general surgery rounds, a senior sur-
The literature search is an important step and
geon challenged a newly appointed colorectal sur-
all surgeons should be familiar with it; for more
geon when he claimed that the incidence of ureteral
detailed information on how to perform and
injury is lower with a laparoscopic approach rather
evaluate a literature search please see Chap. 4.
than with open surgery because of better visibility.
STEP 3. Critically appraise the literature for
This is a plausible scenario that can generate validity and applicability.
enough controversy to create a research question.
The answer to this question may not be readily The appraisal of a surgical article involves
available. To find it, we proceed as suggested in Step answering three important questions:
1, developing a clinically relevant and answerable
question from this scenario. The question is struc- Question #1: Is the study presented in the article
tured in such a way as to identify the Population or valid?
Patients, Intervention, Comparative Intervention, Question #2: If valid, what were the results?
Outcome, and Time Horizon. These terms can be Question #3: Can the results be applied to my
remembered by the acronym PICOT [4, 5]. patients or practice?
For the abovementioned scenario, the proper
research question using the structure of PICOT Depending on the study design we are
would be, “In colorectal cancer patients, does examining, each of these three main questions
laparoscopic, as compared to open surgery, result may have additional subsidiary questions. These
in less ureteral injury 30 days following surgery?” questions will be a recurrent theme in the chap-
As a rule of thumb, we ask such questions in ters to follow. If on reading the article, in par-
one breath. If you cannot ask such PICOT for- ticular the methods section, you do not believe it
matted question in one breath, it means that your is valid, do not waste your time reading the
question is convoluted. Try practicing “one breath article in its’ entirety. If on the other hand, you
PICOT questions until you become an expert”. believe it is valid, then you should proceed and
The structure and formulation of the research examine the results and what they may mean for
question are discussed in more detail in Chap. 3. the study group in the article and potentially,
your own patients. Chapter 6 explains how to
STEP 2. Plan and carry out a search of the lit- interpret the results of surgical interventions.
erature for best external evidence. This book contains chapters dedicated to
explaining how to appraise properly, the evi-
For the literature search conducted in this dence, based on the specific content or method-
step, we select key words using our PICOT ology used in a study.
2 The Steps of Practicing Evidence-Based Surgery (EBS) 11
STEP 4. Apply the evidence to your clinical the young surgeon claimed that the incidence of
practice. ureteral injury is lower with the laparoscopic
technique than the open technique during col-
orectal surgery. He claimed that the laparoscopic
If you believe that the investigators have used method because of better visualization is less likely
correct methods in their study and on assessing to lead to damage of the ureter. The head of the
their results, you find that the novel intervention Division asks the Minimal Access Surgery Fellow
in the Service to review the surgical literature and
shows better results than the standard approach report whether this is true or not at next week’s
you are using, then you need to decide if you can rounds.
apply them to your practice.
This decision will be made if you believe that
the patients examined in the article you read are STEP 1. Construct a relevant, answerable
similar to your patients. If, for example, the study question from a clinical scenario.
was done in pediatric patients, but your patients
are adults, the study’s results may not be appli- Using the PICOT as a guide, your clinical
cable to your practice. You may also consider the question would include the following:
clinical resources available to you as you may be
working in a rural hospital, where you may not Population: Patients undergoing colorectal
have access to the latest technological advances surgery.
(e.g., robotic surgery, single-port laparoscopic Intervention: Laparoscopic surgery.
instrumentation). Comparison Intervention: Open-technique
surgery.
STEP 5. Evaluate your effectiveness and effi- Outcome: Ureteral injury.
ciency in carrying out Steps 1–4. Time Horizon: Up to 30 days following surgery.
Fig. 2.1 a Search terms from your research questions are entered. b A list of appropriate articles
2 The Steps of Practicing Evidence-Based Surgery (EBS) 13
Fig. 2.2 a Search terms from your research question are entered. b A list of appropriate articles for your research
question is listed
14 A. Thoma et al.
Using PubMed, using the same search terms, focus of the article. For example, with this article,
five articles, which were better suited to answer we may look at if patients in the two groups were
the research question, were found. After review- similar in regard to prognostic factors known to
ing the available articles, you choose the article be associated with the outcome. In this particular
that best addresses your research question, in this example, Andersen et al. [6] explained that the
example, the selected article is iatrogenic ureteral two groups were comparable in gender; however,
injury in colorectal cancer surgery: a nationwide there were significant differences, including but
study comparing laparoscopic and open approa- not limited to age, body mass index, and tumor
ches by Andersen et al. [6]. This article appears to stage. A second question regarding validity may
be a possibly relevant article to give answer your be to look at if follow-up time was sufficiently
research question. You then proceed to read the completed. In the article of Andersen et al. [6],
abstract shown in Fig. 2.3. If it looks promising patients were included if they had surgery
we then proceed to read the whole article. between January 2005 and December 2011, the
database used had a completeness rate of over
STEP 3. Critically appraise the literature for 96% and it followed patients up to 30 days after
validity and applicability. surgery. Therefore, it would be acceptable to state
that follow-up was sufficient.
Question 1: Is the study valid?
Question 2: What were the results?
The validity of the chosen study will be judged When listing the results of Andersen et al. [6],
in different ways based on the methodology and one could look at the association between the
2 The Steps of Practicing Evidence-Based Surgery (EBS) 15
exposed and the outcome. For example, those STEP 5. Evaluate your performance.
patients who had open surgery, 0.37% suffered
ureteric injury; 0.59% experienced ureteric injury The final step in EBS practice is the continuous
in the laparoscopic group. The authors claimed that, assessment of your performance in implementing
for the whole cohort, laparoscopic surgery was all the steps mentioned above. You may want to
significantly associated with an increased risk of approach a mentor and ask him/her to provide you
iatrogenic ureteral injury; an odds ratio of 1.64 and with feedback on whether you mastered the skills
95% Confidence Interval of 1.02–2.63, p = 0.04 of (a) constructing answerable questions on the
was given. An odds ratio greater than 1 indicates a PICOT format (b) performing successful litera-
higher risk of harm in the exposed group (the ture searches to find the best evidence (c) ap-
laparoscopic group). The odds ratio was calculated praising what you read and are capable of
using a multivariable analysis to control for the separating “the chaff from the wheat” and (d) im-
differences in risk factors between groups. plementing what you found in your practice.
When answering question three, one needs to 1. Kamerow D. Milestones, tombstones, and sex educa-
consider if the patients in the study were similar tion. BMJ. 2007;334:0–a.
2. Meakins JL. Innovation in surgery: the rules of
to patients in your practice, if the exposure is evidence. Am J Surg. 2002;183(4):399–405.
similar to what might occur in your patients, the 3. Sackett DL, Strauss SE, Richardson WS, Rosen-
magnitude of the risk, and any possible benefits berg W, Haynes RB. Evidence-based medicine: how
associated with the exposure. In this example to practice and teach EBM. 2nd ed. Edinburgh,
Scotland: Churchill Livingston; 2000.
article, the patients in the study were from Den- 4. Richardson WS, Wilson MC, Nishikawa J, Hay-
mark; therefore, this may be something to con- ward RS. The well-built clinical question: a key to
sider when applying results to your patients. evidence-based decisions. ACP J Club. 1995;123(3):
A12–3.
5. Nollan R, Finout-Overhold E, Stephenson P. Asking
STEP 4. Apply the evidence to your clinical compelling clinical questions. In: Melnyk BM,
practice. Fineout-Overhold E, editors. Evidence-based practice
in nursing and healthcare: a guide to best practice. 2nd
If you believe that the patients and the tech- ed. Philadelphia, PA: Lippincott Williams & Wilkins;
2005.
nique and magnitude of risk and benefits 6. Andersen P, Andersen LM, Iversen LH. Iatrogenic
described in the article are all similar to yours, ureteral injury in colorectal cancer surgery: a nation-
you should consider applying the evidence to wide study comparing laparoscopic and open
your practice. approaches. Sur Endosc. 2015;39:1406–12.
Developing a Surgical Clinical
Research Question: To Find 3
the Answer in Literature Search
or in Pursuing Clinical Research
Abbreviations
FINER F: Feasible; I: Interesting; N: In clinical practice, we are faced with clinical
Criteria Novel; E: Ethical; R: Relevant questions on a daily basis by our medical stu-
PICOT P: Population; I: Intervention; C: dents, surgical residents, colleagues or patients.
Format Comparison Intervention; O: While we frequently know, and can readily
Outcome; T: Time Horizon provide an answer, sometimes this is not the
QOL Quality of Life case and we must proceed to find it through a
DIEP Deep Inferior Epigastric Perforator literature search. In such cases, we suggest uti-
NIH National Institute of Health lizing the approach to literature searches
TAP Transverse Abdominal Plane described by Banfield et al. in Chap. 4. If we do
Blocks Blocks not know the answer and we cannot find it in the
ECTR Endoscopic Carpal Tunnel Release surgical literature, we may decide to find it by
OCTR Open Carpal Tunnel Release performing a clinical research project.
Whether it is to find the evidence through a
literature search or via a clinical experimenta-
tion, the structure of the research question fol-
lows the same format. This chapter will explain
A. Thoma (&)
how to properly form a research question, as it is
Department of Surgery, Division of Plastic Surgery, the foundation of clinical evidence and proper
Department of Health Research Methods, Evidence clinical research. To understand and interpret the
and Impact, McMaster University, Hamilton, ON, findings of a surgical article, or to design one’s
Canada
e-mail: athoma@mcmaster.ca
own clinical research project, a surgeon must
master the art of formulating “the clinical re-
S. Sprague
Department of Surgery, Division of Orthopaedic
search question”. The posing of the research
Surgery, Department of Health Research Methods, question should be automatic in a surgeon’s
Evidence and Impact, McMaster University, daily practice just like driving. The same applies
Hamilton, ON, Canada to the surgeon-investigator; however, in this
e-mail: sprags@mcmaster.ca
case, prior knowledge on the subject must be
S. H. Voineskos J. Murphy identified through a literature search. Being
Department of Surgery, Division of Plastic Surgery,
McMaster University, Hamilton, ON, Canada
familiar with the current knowledge in an area is
e-mail: Sophocles.voineskos@medportal.ca also known as understanding the “boundary of
J. Murphy
knowledge”; this must be done prior to forming
e-mail: murphj11@mcmaster.ca the research question [1].
There are two types of clinical questions that Table 3.1 The FINER criteria
are frequently asked in surgery: (1) background Criteria Definition
questions and (2) foreground questions [2, 3]. Feasible Is the research question manageable?
The background questions are the ones usually (i.e., enough participants, expertise,
asked by medical students, surgical interns, and time, and money)
other learners early in their training. These types Interesting Do the investigator and the target
of questions ask for general knowledge about a audience find the research question to be
clinically important?
surgical problem and have two essential com-
ponents: (1) a question root (what, who, when, Novel Does the research question contribute
new information to the literature?
where, how, why) with a verb and (2) a surgical
problem. Examples of such questions are: Ethical Is the question ethical? (a clinical
research question should be investigated
(1) What causes ileus after laparotomy? (2) Why when it is ethical to do so; it requires
do some thin mastectomy skin flaps undergo approval by an Ethics Review Board)
necrosis? (3) When should a temporary ileost- Relevant Will the research question make an
omy be reversed? impact, guide further research, or
The foreground questions, on the other hand, influence practice?
should be asked by senior surgical residents, Adapted from Hulley et al. [4]
surgical fellows, and surgeons. These questions
ask about the management of patients with sur-
gical problems. (2) technical expertise; (3) resources (time, sup-
It is important for a research question to be port staff, and money); and (4) a proper study
persuasive and clear. A compelling and concise design, to make the project manageable [4–6].
research question helps guide and create a suc- There are numerous barriers that may arise from
cessful study, and is appealing to both grant each research project.
agencies and institutions. For example, a study with the following re-
This chapter will focus on two tools that can search question: “Does the supramicrosurgical
be used to help form a proper research question: periumbilical abdominal flap provide better
(1) The FINER (Feasible, Interesting Novel, Quality of Life (QOL) as compared to the deep
Ethical, Relevant) Criteria [4–6] and (2) The inferior epigastric perforator (DIEP) flap in
PICOT (Population, Intervention, Comparative breast reconstruction after mastectomy?” may
intervention, Outcomes, Time Horizon) formu- experience the following technical barriers:
lation [7, 8]. (1) We do not know how to transfer a flap with a
0.8 mm luminal diameter of the vascular pedicle
and (2) We do not have the required super deli-
The FINER Criteria cate microsurgical instruments. Additionally,
there is the following study design barrier
The FINER criteria, (Table 3.1) have been pro- (3) QOL scales may not be sensitive enough to
posed as important preconditions to develop a capture the difference. Therefore, the research
good research question [4–6]. These criteria, question may need to be reassessed based on
outlined further below, help to highlight the feasibility.
aspects that should be addressed to increase the
chances of success in a research project.
Interesting
speaking with mentors, potential funding agen- Randomized Control Trial design with a narrow
cies and experts in the area to determine if your Confidence Interval would provide the highest
proposed research question will be of interest to a level of evidence in a question of effectiveness,
large audience. this study design would not be deemed ethical if
For example, the use of suprafascial perforator the research question is about harm. For a
flaps for postmastectomy reconstruction would question of Harm in surgery, a better study
be of great interest to patients. With this tech- design is a case–control study [10].
nique of harvesting autogenous tissue to recon-
struct a breast, the patient will not worry about
the potential of abdominal hernia or abdominal Relevant
bulge as the abdominal muscles are not interfered
with. Lastly, this aspect asks whether or not the issue
that is under consideration in a project is clinically
relevant to physicians/surgeons, patients, health
Novel policymakers, and other researchers. This aspect
also focuses on if the issue being researched is
This variable asks if your research will provide timely [4–6]. The National Institutes of Health
new information, or if it will refute, confirm or (NIH), in the United States of America, empha-
extend previous findings. An important aspect of sizes the importance of a research question. By
each research project is explaining why this importance, we mean how the results, will
project is important in light of what is already improve knowledge, and how results will impact
known [4–6]. Additionally, a research question clinical services, methods, and concepts [4].
can be novel, while at the same time not being For example, the use of the suprafascial
completely original. The novelty of your research periumbilical perforator flap will be deemed rel-
question can be established with a literature evant and timely to the stakeholders mentioned
search, which can help identify gaps and above. An older technique for breast recon-
remaining questions in your research area. For struction such as a regional pedicled skin flap
example, recreating a previous study but reduc- would be considered a “historical” method of
ing past limitations, or focusing on a new pop- reconstruction and not currently relevant.
ulation, could also be considered novel [4].
For example, the suprafascial periumbilical While the FINER criteria justify a planned
perforator flaps for postmastectomy reconstruc- research project, The PICOT format helps in the
tion is a novel technique for harvesting autoge- development of a specific research question [6]
nous tissue for reconstruction and extends the which will help us find the answer through the
boundary of microsurgical instrumentation and literature search if we are “Users” of surgical
skills [9]. research. If we are “Doers” of surgical research,
it will help us formulate our research question
before we submit it to the local ethics committee
Ethical or a granting agency.
For the first component, Patient or Population, Using the same reasoning for the Comparative
the type of patient or patient population we are Intervention, we would like to know, again pre-
dealing with must be clarified [3, 5–8] Having a cisely, what this comparative intervention was [3,
clear definition of the patient or patient popula- 5–8]. The Comparative Intervention is often
tion allows one to identify if the patient(s) ref- either a control group or the best current standard
erenced in an article we read are similar enough procedure for the respective condition. Using the
to the patient or population for whom we need example of laparoscopic versus open cholecys-
the answer. For example, in the case of an tectomy where the laparoscopic procedure was
appendicitis patient or population, does our the main intervention, the comparative interven-
patient have acute appendicitis or chronic tion would be the open cholecystectomy. For the
appendicitis? In the case of inguinal hernia, does open cholecystectomy, we may want to know
our patient have a stable inguinal hernia or an how extensive the incision was; was it a Kocher
acute strangulated one? incision? Was it a midline incision from the
xiphoid to the suprapubic area? Or was it just
limited to the epigastric area (above the
umbilicus)?
Table 3.2 The PICOT format
Component Definition
Patient or The patient population of interest Outcome
population
Intervention What is being done for/to the With regard to the Outcome component of the
patient PICOT, we would like to know what was hoping
Comparison What is the intervention being to be measured, accomplished, improved or
intervention compared to? Often what is
affected [3, 5–8]. Is it a critical outcome such as
currently being done or the “gold
standard” mortality, was it a QOL outcome or was it some-
Outcome What is being measured? What
thing minor such as a rash around the incision?
are you hoping to improve/see The outcomes can be variable and may include,
change in? but not be limited to: (1) a critical outcome (mor-
Time horizon or How long following the tality), (2) a QOL outcome, (3) pain, (4) hospital-
time frame intervention do you measure the ization days, or (5) ability to return to work. The
outcome? primary outcome of a study should be succinctly
This table was formed using information from [3, 5–8] stated as it has implications in the study design,
3 Developing a Surgical Clinical Research Question: To Find … 21
sample size, and power of the study. These issues 2. In patients undergoing prostatectomy for
will be discussed later in Chap. 29. prostate cancer, is robotic-assisted prostatec-
tomy more likely to allow patients to maintain
erectile function than traditional open
Time Horizon approach, at 1 year post-surgery?
P Population: patients with prostate cancer
The final component of the PICOT formulation
undergoing prostatectomy
should address the Time Horizon of the study. The
I Intervention: robotic-assisted prostatectomy
time horizon describes when the outcome was
C Comparative Intervention: open prostatectomy
measured in reference to the intervention/event
O Outcome: erectile function
[4]. For example, was the outcome measured
T Time Horizon: 1 year.
1-month post-procedure or 1 year? The time
horizon of a study will depend on the nature of the
3. In patients 65 years or older who experience a
surgical problem we are exploring. Nobody will
hip fracture, does accelerated surgery (sur-
take us seriously if we reported our survival
gery within 6 h) decrease all-cause mortality
results of laparoscopic versus open colon cancer
compared to usual care within 30 days fol-
resection at 6 months after surgery. On the other
lowing surgery?
hand, this time horizon may be quite appropriate
after acute appendectomy. There may be occa- P Population: patients 65 years or older with a
sions, for example, when measuring pain, where hip fracture
an intermediate follow-up period, in addition to a I Intervention: accelerated surgery (surgery
long-term follow-up is relevant. Even if there is no within 6 h)
difference in survival after colon resection with C Comparative Intervention: standard surgical
open vs. laparoscopic approaches, there may be a care
difference in pain levels or length of time hospi- O Outcome: all-cause mortality
talized. It is important for investigators or authors T Time Horizon: 30 days from surgery.
of published reports or clinical trials to justify the
time horizon used [11]. 4. In patients with carpal tunnel syndrome, is the
Endoscopic Carpal Tunnel Release (ECTR)
When all components are combined, they technique more cost-effective than the Open
form a clear and searchable research question. Carpal Tunnel Release (OCTR) technique at
Examples of such foreground questions with 1 year post-surgery?
their equivalent PICOT formulation are:
P Population: patients with clinical symptoms
1. In patients undergoing laparoscopic surgery of carpal tunnel syndrome confirmed with
for colorectal cancer, does the use of electromyography and nerve conduction
ultrasound-guided transversus abdominis studies
plane (TAP) blocks reduce pain up to 72 h I Intervention: ECTR
following surgery? C Comparative Intervention: OCTR
O Outcome: cost-effectiveness (dollars per
P Population: patients with colorectal cancer natural unit)
undergoing laparoscopic surgery T Time Horizon: 1 year.
I Intervention: ultrasound-guided TAP blocks
C Comparative Intervention: no TAP block
O Outcome: pain The development of a research question could
T Time Horizon: up to 72 h following be seen as the most important step in a research
randomization. project. In this chapter, we explained two tools
22 A. Thoma et al.
that can be used to ensure the formation of a clear question and developing the study plan. In: Gaert-
and concise clinical research question: The ner R, editor. Designing clinical research. 4th ed.
Philadelphia, PA: Lippincott Williams & Wilkins, a
FINER criteria and the PICOT format Wolters Kluwer business; 2013. p. 14–22.
(Tables 3.1 and 3.2). A strong research question 5. Thabane L, Thomas T, Ye C, Paul J. Posing the
lays the foundation of a well-designed research research question: not so simple. Can J Anesth.
project. When structured properly, a research 2009;56:71–9.
6. Farrugia P, Petrisor BA, Farrokhyar F, Bhandari M.
question will make your literature search more Research questions, hypotheses and objectives. Can J
efficient through minimizing the results you Surg. 2010;53(4):278–81.
encounter. Furthermore, a well-formed research 7. Richardson WS, Wilson MC, Nishikawa J, Hay-
question lays a strong foundation and provides ward RS. The well-built clinical question: a key to
evidence-based decisions. ACP J Club. 1995;123(3):
direction for the surgeon-investigator throughout A12–3.
all stages of the research project, from study 8. Nollan R, Finout-Overhold E, Stephenson P. Asking
design to publication. compelling clinical questions. In: Melnyk BM,
Fineout-Overhold E, editors. Evidence-based practice
in nursing and healthcare: a guide to best practice.
Philadephia, PA: Lippincott Williams & Wilkings;
References 2005.
9. Koshima I, Yamamoto T, Narushima M, Mihara M,
1. Haynes RB, Sackett DL, Guyatt GH, Tugwell P. Clin- Ilda T. Perforator flaps and supermicrosurgery. Clin
ical epidemiology: how to do clinical practice Plast Surg. 2010;37:683–9.
research. 3rd ed. Philadelphia, PA: Lippincott Wil- 10. Thoma A, Kaur MN, Farrokhyar F, Waltho D,
liams & Wilkins; 2006. Levis C, Lovrics P, Goldsmith CH. Users’ guide to
2. Sackett D, Richardson WS, Rosenburg W, Hay- the surgical literature: how to assess an article about
nes RB. How to practice and teach evidence based harm in surgery. Can J Surg. 2016;59(5):351–7.
medicine. 2nd ed. London, England: Churchill 11. Mowakket S, Karpinski M, Gallo L, Gallo M,
Livingstone; 1997. Banfield L, Murphy J, et al. Reporting time horizons
3. Thoma A, McKnight L, McKay P, Haines T. Form- in randomized controlled trials in plastic surgery: a
ing the research question. Clin Plast Surg. 2008;35 systematic review. Plastic Reconst Surg. 2018; (in
(2):189–93. press).
4. Hulley SB, Cummings SR, Browners WS,
Grady DG, Newman TB. Conceiving the research
Finding the Evidence Through
Searching the Literature 4
Laura Banfield, Jo-Anne Petropoulos
and Neera Bhatnagar
Questions often arise that require finding evidence A 67-year-old female is seen in the bariatric pro-
to inform clinical decisions. However, not all gram with morbid obesity and hypertension. Her
clinicians seek answers to all questions [1, 2]. height is 5 ft., 6 in. Her weight is 360 lbs. She was
Clinician reasons for not seeking answers include: not able to reduce her weight with dieting over the
lack of time, questions lack urgency, importance, last decade. She finally decided to proceed with
or are forgotten, perception that an appropriate or bariatric surgery. Her bariatric surgeon recom-
relevant answer does not exist, questions are mended Roux-en-Y surgery. Her best friend
patient specific and non-generalizable, and chal- however had a sleeve gastrectomy and was happy
lenges in using the resources [1–7]. “Because of with the results. She is not certain which procedure
the increasing volume and speed of new research, to undergo. She is willing to undergo the proce-
finding useful evidence efficiently remains chal- dure which is likely to help her lose at least 50% of
lenging” [8]. PubMed alone has added between her weight over the long term (>5 years). As the
130,000 and 210,000 surgery related articles surgeon, you decide to review the literature to
annually since 2008, reinforcing the necessity to be determine which of the two options will best meet
efficient when seeking information [9]. As stated the patient’s goal. However, you are unsure of how
by Waltho et al. [10], “[s]urgeons must always to search the literature.
ensure that the care they provide is rooted in the
best available evidence” [10]. This chapter will
equip surgeons with knowledge and skills to locate Clinical Question
the best evidence to inform their practice.
As outlined in Chap. 3, Clinical Question, a
number of different questions may arise from any
one scenario or clinical encounter. Remember, “a
well-designed research question addresses sev-
eral components of the clinical scenario”, often
in the form of PICO(T)—P as the population
affected, I as the intervention, C as the com-
L. Banfield (&) J.-A. Petropoulos N. Bhatnagar parator, comparison intervention or standard of
Health Science Library, HSC 2B, McMaster care, O as the outcome of interest, and T as either
University, Hamilton, ON, Canada time or type of study [10]. For example, in this
e-mail: banfie@mcmaster.ca
scenario, the clinical question you developed is: some of the resources found within the pyramids can
in patients with morbid obesity and BMI >40, be used to find background information, the focus of
does Roux-en-Y, in contrast to sleeve gastrec- this chapter will be on how to use them to find
tomy, lead to more weight loss that persists at evidence to support clinical decisions and research.
greater than 5 years? We acknowledge that even a well-written clini-
cal question may not have an answer. Potential
P morbidly obese patients with BMI >40
reasons could include: “no feasible study design or
I Roux-en-Y bariatric surgery
measurement tools exist that investigators could use
C sleeve gastrectomy
to resolve an issue”, or “no one has conducted and
O loss of 50% weight
published the necessary study” [8]. In these instan-
(T) greater than 5 years
ces, you will need to reevaluate your question (see
Chap. 3).
Table 4.4 Applying Boolean operators to search concepts and search terms
Obesity AND Gastric Bypass will retrieve evi- Using either the AND or the NOT operators
dence containing both concepts. The OR opera- will narrow the scope of the search and decrease
tor is most often applied when combining related the number of articles retrieved (see Limits sec-
keywords within a search concept. Using the OR tion); also referred to as specificity. On the other
operator will retrieve evidence containing any of hand, the OR operator will broaden the scope of
the keywords in isolation or where they may the search and increase the number of articles
show up together (Table 4.4). For example, a retrieved, also referred to as sensitivity.
search using Surgery OR Gastric Bypass will
retrieve evidence containing either or both key-
Advanced Search Tip: Applying Boo-
words related to the same concept. In contrast,
lean Operators
the NOT operator will explicitly exclude articles
Typically, search concepts are combined
containing those terms from the result set. For
using the AND operator. However, there are
example, a search using NOT animals will
circumstances in which they can be com-
eliminate all articles with a reference to animals.
bined using the OR operator (e.g., Inter-
The NOT operator should be used with caution
vention (Roux-en-Y bariatric surgery) OR
as it could eliminate relevant articles that inci-
Comparator (Sleeve gastrectomy). This is
dentally mention the term you are excluding.
particularly applicable to surgical questions.
Using the same example, NOT Animals will also
Within each concept, the alternative
eliminate articles that reference both animals and
terms and controlled vocabulary are usu-
humans (i.e., studies that incorporate human and
ally combined using the OR operator.
animal models).
28 L. Banfield et al.
For the purposes of this section, we have iden- A simple search strategy can be applied to any
tified two approaches to searching: simple and resource using one or two concepts, one term per
complex. These approaches can be applied to the concept (Fig. 4.2). However, it is best applied at
sources of evidence identified in the Pyramid of the top levels in both pyramids (e.g., specifically
EBM Resources and 6S Pyramid (see Choosing DynaMed Plus, UptoDate®).
Appropriate Resources). The nature of the Using the concepts from our PICO(T)
resource will dictate the approach to searching. (Table 4.1), the search strategy could be any one
See Table 4.5 for a list of resources in which of the following but not limited to:
these search strategies can be applied.
Table 4.5 List of resources for simple search and complex search
¶
fee-based
+
premium or extra features with additional fee
Note Access to fee-based resources may be provided through your professional, academic, clinical, and hospital
affiliations
4 Finding the Evidence Through Searching the Literature 29
Limits and Search Filters Filters, also known as hedges, are combina-
tions of predefined search terms used to retrieve
Most resources enable you to limit or narrow results on a particular topic (e.g., cancer), with a
your search results. The limit options available particular study design (e.g., randomized con-
vary across the resources. Examples of limits trolled trials or cohort), or using a particular
include: publication date, language, age, and study methodology (e.g., prognosis or therapy).
publication type. It is best to apply these limits Filters are applied to improve the quality of the
one at a time, and at the end of your search. This result set by reducing the number of irrelevant
will allow you to see the impact that each of the results and increasing the number of relevant
limits will have on narrowing your results. In results [13–17]. These filters are added to the
some circumstances, it may be necessary to search terms being used for the content.
revisit the limits you have chosen based on your PubMed, Ovid MEDLINE®, Embase, and
preliminary results. This is easier to do if the PsycINFO®, and EBSCO CINAHL and MED-
limits have been applied one at a time. LINE®, have all adopted the search filters (clinical
4 Finding the Evidence Through Searching the Literature 31
hedges) developed by Health Information resources for making clinical and surgical deci-
Research Unit (HiRU) at McMaster University sions” [10]. This section will provide an over-
[18, 19]. The HiRU hedges, known as Clinical view of selected resources and the differences
Queries, are found under the Additional Limits in between them. Strengths and weaknesses of the
Ovid® and are found within the Edit feature in the resources will not be discussed as such assess-
search history and Refine Results section of ments are partially based on personal prefer-
EBSCO CINAHL. In PubMed, the link to Clinical ences, availability, and access. However, it is
Queries is located in the PubMed Tools section on worth noting that many of these may not have
the homepage. The common clinical queries filters comprehensive coverage of surgical content.
cover the following categories: therapy, diagnosis, There are many ways to describe and organize
etiology, prognosis, and clinical prediction sources of evidence. Two of the more common
guides. Additional HiRU hedges include reviews, ways are: Pyramid of EBM Resources and 6S
economics, qualitative, and costs [19]. hierarchy of evidence-based resources Pyramid
[8, 11].
Advanced Search Tip: Additional
search filters
Pyramid of EBM Resources
Not all filters are created equally. It should
be noted that not all filters have been
The Pyramid of EBM Resources divides
developed using a rigorous methodology
resources into three categories: summaries and
nor have all been validated. As a result,
guidelines, preappraised research, and
you should look critically at those available
non-preappraised research.
before selecting one that best suits your
needs.
HiRU Search Filters
Summaries and Guidelines
https://hiru.mcmaster.ca/hiru/HIRU_
Hedges_home.aspx
Summaries and guidelines are rarely limited to a
Scottish Intercollegiate Guidelines Net-
specific clinical question. Instead, they are topic
work (SIGN)
or condition driven, often integrating background
http://www.sign.ac.uk/search-filters.html
and foreground information to inform clinical
InterTASC Information Specialists’
decision making [8]. These resources rate the
Sub-Group (ISSG) Search Filter
strength of the recommendations based on either
Resource
or both internally defined criteria and externally
https://sites.google.com/a/york.ac.uk/
defined criteria (eg. GRADE). Examples of these
issg-search-filters-resource/
resources include UptoDate®, DynaMed Plus,
Strings attached: CADTH database
BMJ Best Practice, and clinical practice
search filters
guidelines.
https://www.cadth.ca/resources/finding-
evidence/strings-attached-cadths-
database-search-filters
Preappraised Research
Fig. 4.3 Relationship of the Pyramid of EBM Resources to the 6S Pyramid (Used with permission)
4 Finding the Evidence Through Searching the Literature 33
evidence surrounding a particular research journals and through databases including PubMed,
question. Examples of where systematic reviews MEDLINE®, Embase, and Cochrane Register of
can be found include individual journals and Controlled Trials (CENTRAL).
through the Cochrane Database of Systematic
Reviews, PubMed, MEDLINE®, and Embase.
Other Resources
Synopses of Single Studies
After reviewing the resources described, it is
Synopses of original studies are typically common to question the necessity of searching
one-page structured summaries that provide an multiple resources. If your purpose in searching
appraisal of single studies. Similar to synopses of is emerging out of a clinical situation to which
synthesis, they are explicit in their assessment of you need a more immediate answer, the preap-
the quality of the study and will often note praised and summaries resources are likely to be
newsworthiness and relevance to clinical practice your best starting point (Fig. 4.3). However, if
and research. Examples of resources include you are undertaking a systematic review, clinical
OrthoEvidence™, ACP Journal Club, and BMJ practice guideline, or other form of knowledge
Evidence-based Medicine. synthesis, you are more likely to be working with
the non-preappraised resources (synthesis and
studies). Before beginning your search consider
Studies why you are searching, what resources you have
access to, and how much time you have.
Studies are single articles reporting on original re-
search. Unlike the resources above, the responsi- Federated Search Tools
bility for critically appraising or assessing the
quality of the evidence falls to the reader. Examples These are resources that search across multiple
of where studies can be found include individual sources of evidence simultaneously are referred
Fig. 4.4 Result set from TRIP database (Used with permission)
34 L. Banfield et al.
by Casillas et al. is found [21]. As this is a complex topics or difficult to answer questions.
prospective cohort study, you will need to apply We encourage you to reach out, they will be
the critical appraisal skills discussed in Chap. 16. happy to help you.
Paraphrasing Glasziou, Burls, and Gilbert, the
ability to effectively search the literature is as
Conclusion essential as the stethoscope to a clinician’s
practice or the scalpel to a surgeon [22].
Surgeons encounter questions that require them
to search for evidence to inform a clinical deci-
sion or research process. There are many sources References
of information available and different search
strategies that can be applied. This chapter has 1. Del Fiol G, Workman TE, Gorman PN. Clinical
covered the core principles for developing a questions raised by clinicians at the point of care: a
search strategy from identifying concepts, search systematic review. JAMA Intern Med. 2014;174(5):
710–8.
terms, and controlled vocabulary (if applicable),
2. Brassil E, Gunn B, Shenoy AM, Blanchard R.
to combining concepts and terms using Boolean Unanswered clinical questions: a survey of special-
operators, and selecting resources. However, no ists and primary care providers. J Med Libr Assoc.
single resource will be sufficient to address all of 2017;105(1):4–11.
3. Cook DA, Sorensen KJ, Linderbaum JA, Pencille LJ,
your information needs and it may be necessary
Rhodes DJ. Information needs of generalists and
to search multiple sources. Federated search specialists using online best-practice algorithms to
tools, such as TRIP and ACCESSSS, can be answer clinical questions. J Am Med Inform Assoc.
invaluable especially when time is a factor. 2017;24(4):754–61.
4. Ely JW, Osheroff JA, Chambliss ML, Ebell MH,
Medical and health sciences librarians are also
Rosenbaum ME. Answering physicians’ clinical
a key resource in assisting you with developing questions: obstacles and potential solutions. J Am
your searching skills and with searching for Med Inform Assoc. 2005;12(2):217–24.
36 L. Banfield et al.
5. Ely JW, Osheroff JA, Ebell MH, Chambliss ML, can find some but not all knowledge translation
Vinson DC, Stevermer JJ, et al. Obstacles to answer- articles in MEDLINE: an analytic survey. J Clin
ing doctors’ questions about patient care with evi- Epidemiol. 2012;65(6):651–9.
dence: qualitative study. BMJ. 2002;324(7339):710. 15. Lokker C, Haynes RB, Wilczynski NL, McKib-
6. Green ML, Ruff TR. Why do residents fail to answer bon KA, Walter SD. Retrieval of diagnostic and
their clinical questions? A qualitative study of treatment studies for clinical use through PubMed
barriers to practicing evidence-based medicine. Acad and PubMed’s Clinical Queries filters. J Am Med
Med. 2005;80(2):176–82. Inform Assoc. 2011;18(5):652–9.
7. Bennett NL, Casebeer LL, Zheng S, Kristofco R. 16. Shariff SZ, Sontrop JM, Haynes RB, Iansavichus AV,
Information-seeking behaviors and reflective prac- McKibbon KA, Wilczynski NL, et al. Impact of
tice. J Contin Educ Health Prof. 2006;26(2):120–7. PubMed search filters on the retrieval of evidence by
8. Agoritsas T, Vandvik PO, Neumann I, Rochwerg B, physicians. CMAJ. 2012;184(3):E184–90.
Jaeschke R, Hayward R et al. In: Guyatt GH, 17. EBSCO Help. CINAHL Clinical Queries [Internet].
Rennie D, Meade MO, Cook DJ, editors. Finding [cited 2018 June 28]. Available from: https://help.
the best evidence. Users’ guides to the medical ebsco.com/interfaces/CINAHL_MEDLINE_Databases/
literature: a manual for evidence-based clinical CINAHL/CINAHL_Clinical_Queries. Accessed 28
practice. 3rd ed. New York: McGraw-Hill Education; June 2018.
2015. p. 29–49. 18. Scottish Intercollegiate Guidelines Network (SIGN).
9. Corlan AD. Medline trend: automated yearly statis- Search filters [Internet]. [cited 2018 June 25].
tics of PubMed results for any query, 2004 [Internet]. Available from: http://sign.ac.uk/search-filters.html.
Web resource at URL: http://dan.corlan.net/medline- 19. Health Information Research Unit, McMaster
trend.html. Accessed: 2012-02-14. Available at: University. Hedges [Internet]. [cited 2018 May 11].
Archived by WebCite at http://www.webcitation. Available from: https://hiru.mcmaster.ca/hiru/hiru_
org/65RkD48SV. hedges_home.aspx.
10. Waltho D, Kaur MN, Haynes RB, Farrokhyar F, 20. DynaMed Plus [Internet]. Ipswich (MA): EBSCO
Thoma A. Users’ guide to the surgical literature: how Information Services. 1995. Record No. 483434,
to perform a high-quality literature search. Can J Bariatric surgery; [updated 2018 Jul 18, cited July
Surg. 2015;58(5):349–58. 20, 2018]; [about 62 screens]. Available from http://
11. DiCenso A, Bayley L, Haynes RB. ACP Journal www.dynamed.com/login.aspx?direct=true&site=
Club. Editorial: Accessing preappraised evidence: DynaMed&id=483434. Registration and login
fine-tuning the 5S model into a 6S model. Ann Intern required.
Med. 2009;151(6):JC3-2, JC3-3. 21. Casillas RA, Kim B, Fischer H, Zelada Getty JL,
12. Herbert R, Jamtvedt G, Hagen KB, Mead J. Practical Um SS, Coleman KJ. Comparative effectiveness of
evidence-based physiotherapy. 2nd ed. Edinburgh: sleeve gastrectomy versus Roux-en-Y gastric bypass
Elsevier/Churchill Livingstone; 2011. for weight loss and safety outcomes in older adults.
13. McKibbon KA, Wilczynski NL, Haynes RB, Hedges Surg Obes Relat Dis. 2017;13(9):1476–83.
Team. Retrieving randomized controlled trials from 22. Glasziou P, Burls A, Gilbert R. Evidence based
medline: a comparison of 38 published search filters. medicine and the medical curriculum. BMJ.
Health Info Libr J. 2009;26(3):187–202. 2008;337:a1253.
14. McKibbon KA, Lokker C, Wilczynski NL, Hay-
nes RB, Ciliska D, Dobbins M, et al. Search filters
Hierarchy of Evidence in Surgical
Research 5
Gina Del Fabbro, Sofia Bzovsky, Achilles Thoma
and Sheila Sprague
research [4]. High-quality randomized controlled highest LOE, they are not suitable for addressing
trials (RCTs), as well as systematic reviews every surgical research question. When RCTs are
(SR) and meta-analyses (MA) of well-conducted unethical, impractical, or adequate resources are
RCTs, are considered the highest LOE for ther- not available, researchers may elect to conduct an
apeutic research. Randomized controlled trials, in observational study (e.g., prospective compara-
which patients are randomly allocated to an in- tive study, case-control study etc.) [4].
tervention group or a comparison group, are A prospective cohort study for therapeutic re-
considered the highest LOE because they offer search identifies a group of patients who have
the best protection against bias (Fig. 5.2a). received the treatment of interest and compares
Randomization is an effective method of reduc- them with a group of patients who received an
ing bias as it balances the treatment groups with alternative treatment (Fig. 5.2b). Patients in both
respect to both known and unknown prognostic groups are followed for a specified period of time
factors [7, 8]. Randomized controlled trials are to determine their outcomes [10]. These studies
discussed more in Chaps. 12–15. MAs and SRs are similar to the RCT in that the criteria for
of properly conducted RCTs also represent inclusion, the outcomes of interest, and time-
Level I evidence. An SR is an investigation of frame for follow-up are determined prior to the
relevant literature related to a specific research occurrence of any events/outcomes. However, in
question. SRs are conducted according to a pro- contrast to RCTs, the treating surgeon, patient, or
tocol that details the search strategy, eligibility both parties, determine which treatment the
criteria for included studies, as well as data patient receives, thus introducing a potential
abstraction and analysis plans [7–9]. An MA is source of bias. When a large effect is observed
an SR that uses quantitative methods or statistical and there are no other considerations for down-
techniques to pool the data found in the literature, grading the quality of the study, the evidence of
increasing the effective sample size [7, 8]. For an observational study is considered satisfactory
both SRs and MAs, a comprehensive search is for Level II. MAs and SRs may also be catego-
necessary to identify all relevant results, includ- rized under Level II evidence if the studies
ing published, unpublished, and gray literature. included in the analysis or review are considered
The more complete the search, the greater the Level II evidence, or if the included studies
potential for a high-quality study. SRs and MAs represent Level I evidence but demonstrate
are discussed in more detail in Chap. 16. inconsistent results [4].
A lesser quality RCT, defined as an RCT with Retrospective cohort studies comparing two
less than 80% follow-up, no blinding, or impro- or more treatments are commonly encountered in
per randomization, is considered Level II evi- the surgical literature and are classified as
dence when assessing therapeutic research [10]. Level III evidence. In a retrospective cohort
Although high-quality RCTs represent the study, the research study is initiated after both the
5 Hierarchy of Evidence in Surgical Research 39
intervention and outcomes have occurred. may not have been recorded in the past, so the
Researchers review previously recorded data presence or absence of risk factors cannot be
(e.g., from medical records or a registry) to determined. Lastly, case series may not be gen-
identify patients who received the treatments of eralizable, as they often focus on a single surgeon
interest and then compare their outcomes or center’s experiences [8]. Case series are very
(Fig. 5.2c) [4]. A case-control study, also at common in the surgical literature as they are
Level III on the hierarchy, compares patients relatively easy and inexpensive to do (Please see
who have the disease or outcome of interest Chap. 19: Case Series).
(cases) with controls (patients who do not have A historically controlled study is another
the disease or outcome of interest) and looks study design that constitutes Level IV evidence.
back in time to compare how frequently the This study design includes a control and a
exposure to a risk (e.g., fracture and vitamin D treatment group who share the same disorder.
deficiency) is present in each group to determine The treatment group is comprised of individuals
the relationship between the risk factor and the who are currently receiving the treatment of
disease (Fig. 5.2d). However, the status may not interest and the control group consists of indi-
have been recorded in the past, and therefore, the viduals who received an alternative treatment in
presence or absence of the risk factor cannot be the past. This design allows researchers to
determined. Case-control studies are described in investigate the effects of a new treatment quickly
greater detail in Chap. 18. and inexpensively. However, this study design is
Case series are frequently reported in the vulnerable to bias and requires that the selection
surgical literature and are classified as Level IV of historical controls be sufficiently similar to
evidence [4]. A case series is an expansion of an that of the treatment group [12]. Lastly, since the
individual case report to include a group of status may not have been recorded in the past, the
patients with the outcome of interest (Fig. 5.2e). presence or absence of the risk factor cannot be
A case series does not involve a comparison determined.
group; however, may be valuable at making Lowest on the hierarchy of evidence for
observations about a rare surgical case or therapeutic research is mechanism-based rea-
describing a new surgical technique [4]. A case soning [4]. Mechanism-based reasoning is based
series is also beneficial in developing hypotheses on expert opinion and is intended to generate
that are to be tested with analytic studies; how- hypotheses or interpret evidence from other
ever, it is not possible to make causal inferences sources. This LOE relies on the knowledge of
about the relationship between risk factors and underlying mechanisms to predict what the rel-
the outcome of interest [8, 11]. Similar to evant effect of a therapy (or diagnostic test or
case-control studies, in a case series, the status prognosis, depending on the area of research)
5 Hierarchy of Evidence in Surgical Research 41
will be for the patient [13]. This LOE has the is explained in Chap. 20. The hierarchy of evi-
greatest potential for error and is, therefore, the dence for diagnostic research is listed in Fig. 5.3.
lowest on the hierarchy. Evaluating Level V Similar to therapeutic research, RCTs are
evidence is described in more detail in Chap. 26. considered to be Level I evidence. Diagnostic
Level I through Level IV evidence can be RCTs are randomized comparisons of two diag-
graded downward based on study quality, impre- nostic interventions (one standard and one
cision, indirectness, or inconsistency between experimental) with identical therapeutic inter-
studies or when the effect size is very small. ventions based on the results of the competing
Similarly, these studies can be graded upward if diagnostic interventions (for example, disease:
there is a dramatic effect size [10]. yes or no) and with the study outcomes being
clinically important consequences of diagnostic
accuracy (Fig. 5.4a) [14]. Additionally, testing of
Clinical Example previously developed diagnostic criteria, using a
cohort study design with consecutive patients,
An example of a Level I evidence study is, “The consistently applied reference standards, and
Fixation using Alternative Implants for the blinding is also considered to be Level I evidence
Treatment of Hip Fractures trial (FAITH)” RCT, [4]. In these prospective diagnostic accuracy
which is appraised in Chap. 11. cohort studies, patients with suspected disease or
condition undergo both the diagnostic interven-
tion or criteria being evaluated and the diagnostic
Levels of Evidence for Diagnostic reference standard (Fig. 5.4b) [14]. Diagnostic
Research accuracy or the performance of the previously
developed experimental diagnostic intervention
Diagnostic research evaluates the ability of a or criteria is measured using a 2 2 table [14].
diagnostic test to determine the presence or Diagnostic tests with three or more levels allow
absence of the condition in question. Research for likelihood ratios to be computed, but not
questions that diagnostic research addresses sensitivity and specificity. Therefore, diagnostic
include “Is this (early detection) test worth- tests should include likelihood ratios and not
while?” and “Is this diagnostic or monitoring sensitivity and specificity, as they go in the
test accurate?” The appraisal of a diagnostic test wrong direction. While diagnostic cohort studies
inform us about the relative accuracy of an poor or nonindependent reference standard are
experimental diagnostic intervention compared also considered to be Level IV evidence. Lastly,
to a reference standard, they do not inform us therapeutic studies using mechanism-based rea-
about whether the differences in accuracy are soning are considered to be Level V evidence.
clinically important, or the degree of clinical
importance (in other words, the impact on patient
outcomes) [14]. Clinical Example
Level II evidence includes a prospective
cohort study, as described above, evaluating a An example of a Level II evidence study is the
new diagnostic test or diagnostic criteria. Addi- “Prospective validation of the ultrasound based
tionally, studies that describe the development of TIRADS (Thyroid Imaging Reporting And Data
a new diagnostic test or criteria, that include System) classification: results in surgically
consecutive patients with consistently applied resected thyroid nodules” which is appraised in
reference standards and blinding, are also con- Chap. 20.
sidered Level II evidence.
Level III study designs, include retrospective
cohort studies. These studies are similar to the Levels of Evidence for Prognostic
cohort studies described above, however, data Research
are collected retrospectively (Fig. 5.4c). Diag-
nostic case-control studies are also considered to Prognostic research estimates the risk of future
be Level III evidence (Fig. 5.4d). Additionally, outcomes in individuals based on clinical and
prospective cohort studies that includes noncon- nonclinical characteristics [15]. It addresses
secutive patients and does not include a consis- questions regarding the natural history of the
tently applied reference standard are also disease. By identifying prognostic studies
considered to be Level III evidence, and, there- involving patients with a similar clinical pre-
fore, may not be present in records. sentation, an attending surgeon is able to inform
As with therapeutic research, case series their patients about their expected clinical course
describing a diagnostic test or diagnostic criteria and make better-informed decisions on treatment
are considered to be Level IV evidence (Fig. 5.4 [15]. Typically, prognostic research consists of
e). When case series, case-control, or retrospec- prospective, observational studies that aim to
tive studies look back in time, the included data assess the causes of disease progression, predic-
may not be standardized, or the data are tion of risk in individuals, and individual
unavailable and were not recorded in the past, response to treatment [16]. As it is not possible to
resulting in a lower quality of evidence. Addi- randomly assign these factors, one usually cannot
tionally, prospective cohort studies that have a conduct an RCT to address a prognostic research
44 G. Del Fabbro et al.
question. Therefore, compared to therapeutic and Similar to therapeutic and diagnostic research,
diagnostic research, prognostic research studies case series are considered to be level IV evidence
follow different criteria in determining LOE in prognostic research (Fig. 5.6e) and
(Fig. 5.5) [17]. mechanism-based reasoning is considered to be
Inception cohort studies represent the highest Level V evidence in prognostic research [7].
LOE in prognostic research. In an inception
cohort study, all patients are enrolled at a com-
mon time early in the development of their dis- Clinical Example
ease or condition (near the onset of symptoms,
soon after diagnosis, or at detection of a clini- An example of a prognostic study deemed
cally important pathological event), and are fol- Level III evidence is the “Mechanical or Biologic
lowed thereafter [18]. Outcomes and factors Prostheses for Aortic-Valve and Mitral-Valve
associated with patient outcomes are then Replacement” retrospective cohort study, which
reported (Fig. 5.6a). is appraised in Chap. 22.
Level II evidence includes a prospective
cohort study (as described above) in which
patients are enrolled at different points of their Levels of Evidence for Economic
disease (Fig. 5.6b). Additionally, the use of data Research
from a control arm of a RCT to determine
prognostic outcomes is also considered Level II Economic research evaluates whether or not the
evidence. Retrospective cohort studies (Fig. 5.6 intervention or treatment offers the greatest ben-
c) and case-control studies (Fig. 5.6d) reporting efit for the amount of dollars spent [4]. All LOE
on prognostic measures are considered Level III for economic research include computer simula-
evidence. tion models (CSM), using Monte Carlo simula-
5 Hierarchy of Evidence in Surgical Research 45
Fig. 5.6 a Inception cohort design for prognostic studies. b Prospective cohort study design for prognostic studies.
c Retrospective cohort design for prognostic studies. d Case-control design for prognostic studies. e Case series design
for prognostic studies
tion or Markov models. The inputs for each LOE and uncertainty examined using probabilistic
differs. For Level I evidence, inputs should be sensitivity analyses [4].
derived from Level I studies, as well as include Level II evidence consists of computer simu-
lifetime time duration, outcomes expressed in lation from Level II studies, including lifetime
dollars per quality-adjusted life years (QALYs), time duration, outcomes expressed in QALYs
46 G. Del Fabbro et al.
and uncertainty examined using probabilistic informed by prior economic evaluation and
sensitivity analyses. Level III evidence also uses uncertainty is examined by univariate sensitivity
inputs derived from Level II studies, however, analyses [4]. Figure 5.7 summarizes the LOE for
differs as it uses a relevant time horizon, less than economic research and Chap. 23 provides guid-
a lifetime, and outcomes expressed in dollars per ance on how to critically appraise economic
QALYs and stochastic multilevel sensitivity evaluations.
analyses [4].
Level IV and V on the hierarchy of evidence
for economic research uses a decision tree over Clinical Example
the short-time horizon. Level IV uses input data
from original Level II and III studies and An example of a Level I evidence study is the
uncertainty is examined by univariate sensitivity “Cost effectiveness analysis of arthroscopic sur-
analyses. Level V differs as it uses input data gery compared with non-operative management
5 Hierarchy of Evidence in Surgical Research 47
Question
How common Is this diagnostic What will Does this What are the What are the Is this (early
is the or monitoring test happen if we intervention common harms? rare harms? detection) test
problem? accurate do not add a help? worthwhile?
therapy?
Diagnosis Prognosis Treatment Treatment harms Screening
benefits
Step 1 Local and SR of SR of SR of SR of randomized SR of SR of
(Level current crosssectional with inception randomized trials trials, SR of randomized randomized
1a) random consistently cohort or n-of-1 trials nested trials or n-of-1 trials
sample applied reference studies case-control trials
surveys (or standard and studies, n-of-1
censuses) blinding trial with the
patient you are
raising the
question about, or
observational
study with
dramatic effect
Step 2 SR of surveys Individual cross Inception Randomized trial Individual Randomized Randomized
(Level that allow sectional with cohort or observational randomized trial trial or trial
2a) matching to consistently studies study with or (exceptionally) (exceptionally)
local applied reference dramatic effect observational observational
circumstancesb standard and study with study with
blinding dramatic effect dramatic effect
Step 3 Local Non-consecutive, Cohort study Non-randomized Non-randomized controlled Nonrandomized
(Level non-random or studies without or control controlled cohort/follow-up study controlled
3a) sampleb consistently arm of cohort/follow-up (post-marketing surveillance) cohort/followup
applied reference randomized studyb provided there are sufficient numbers studyb
standardsb triala to rule out a common harm. (For
long-term harms the duration of
follow-up must be sufficient)
Step 4 Case-seriesb Case-control, or Case-series Case-series, Case-series, case-control, or historically controlled
(Level poor or or casecontrol or studiesb
4) nonindependent case-control historically
reference studies, or controlled
standardb poor quality studiesb
prognostic
cohort
studyb
Step 5 – Mechanism-based – Mechanism-based reasoning
(Level reasoning
5)
SR Systematic review
a
Level may be graded down on the basis of study quality, imprecision, indirectness (study of PICO does not match question PICO), because of
inconsistency between studies, or because the absolute effect size is very small. Level may be graded up if there is a large or very large effect size
b
A systematic review is generally better than an individual study
5 Hierarchy of Evidence in Surgical Research 49
Clinical decision-making in surgery has evolved The lowest level of evidence after expert
dramatically in the past several years. Tradi- opinion consists of retrospective case series
tionally, surgeons coalesced anecdotal evidence, which present outcomes without comparison, and
previous experience, expert opinion, and retrospective cohorts, which often are used to
extrapolation of basic science into clinical compare two or more therapies (see Chap. 5).
decision-making. While this often served patients While uncomplicated to design, the retrospective
well, it is also possible that this decision-making nature of these studies does not allow for the
process led to suboptimal treatment. To combat collection of all potential confounding factors.
this, a plethora of surgical research has been This limits the ability to infer causation from the
ongoing for the past several decades, and the conclusions. The next level of evidence consists
utilization and understanding of this evidence are of prospective, nonrandomized controlled trials.
vital to surgical practice in the twenty-first cen- While the prospective nature of these studies
tury [1]. The evidence itself can vary greatly in allows for better data collection, the lack of
quality and therefore several grading systems randomization can still lead to an imbalance
have been developed to assess quality [2, 3]. within groups for unmeasured variables that
Regardless of the grading system, the general would not be accounted for fully with statistical
hierarchy of evidence remains the same and is analyses. Properly designed and conducted ran-
based on the propensity of a certain study type to domized trials provide a high level of internal
introduce bias into its conclusions. Bias, specif- validity to their conclusions but can be limited by
ically, refers to any systematic deviation from the their generalizability [3]. Accordingly, the high-
truth, which could result in an underestimation or est level of evidence is when several randomized
overestimation of the true effect of an interven- trials looking at the same topic can be combined
tion [4]. into a meta-analysis [5, 6]. Meta-analyses com-
bine the results of several trials in an effort to
boost statistical power and minimize false nega-
tive results. However, one should be cautious of
A. G. Doumouras
meta-analyses of observational data as a meta-
Department of Surgery, McMaster University, analysis of biased data is still biased though
Hamilton, ON, Canada these types of studies may be useful in summa-
e-mail: aristithes.doumouras@medportal.ca rizing the current observational evidence for a
D. Hong (&) topic. Ideally, surgeons can evaluate the evidence
Department of Surgery, St. Joseph’s Healthcare, they have for the clinical problem at hand and
Hamilton, ON, Canada
e-mail: dennishong70@gmail.com
provide optimal care based on this assessment.
Table 6.2 Primary outcomes and mortality among study participants [7]
Outcome Low-cost Commercial Absolute difference p-value
mesh mesh Percentage 95% CI
(n = 143) (n = 148) points
Primary outcomes Hernia recurrence 1 (0.7) 0 0.7 (−1.2 to 2.6) 1.0
Any postoperative 44 (30.8) 44 (29.7) 1.0 (−9.5 to 11.6) 1.0
complication
Distribution of Hematoma or swelling 35 (24.5) 35 (23.6) 0.8 (−9.0 to 10.7) 1.0
Postoperative in groin or scrotum
outcomes Superficial infection 4 (2.8) 6 (4.1) 1.3 (−5.6 to 3.1) 0.38
Seroma 1 (0.7) 0 0.7 (−1.2 to 2.6) 1.0
Impaired wound 5 (3.5) 8 (5.4) 1.9 (−6.8 to 3.0) 0.29
healing
Severe pain 2 (1.4) 0 1.4 (−0.9 to 3.7) 1.0
Other complications 2 (1.4) 4 (2.7) 1.3 (−4.8 to 2.2) 0.34
Death 2 (1.4) 3 (2.0) 0.6 (−3.9 to 2.6) 1.0
Internal validity refers to the soundness of the Relationships between outcomes predictors of
methodology of the study and relates directly to interest are subject to confounding. Confounding
our ability to infer causation from the results. In is the potential for a third variable to influence
observational studies, the correlations presented the relationship between the outcome and pre-
in the results may not actually represent a true dictor thus limiting our ability to infer direct
causal relationship and therefore using their causation (see Chapter 32: Confounding Factors
conclusions without proper scrutiny may lead to and Interactions). For observational studies,
unnecessary or potentially harmful patient care. many of these confounders can be identified and
Two types of errors can threaten internal validity adjusted for within multivariable statistical anal-
[9]. Type I errors, or false positives, occur when yses. One of the advantages of prospective over
calling a treatment useful when it is not. Type II retrospective studies is the ability to ensure data
errors, or false negatives, occur when concluding collection on important confounding factors.
a treatment has no effect when it is actually However, there are many unmeasurable or
useful (see Chap. 29). There are numerous unknown confounding factors that cannot be
examples of observational study conclusions that collected and adjusted for. It is for these factors
turned out to be equivocal or even harmful when that randomization is crucial as it is the best way
examined in a randomized setting. Randomized to balance these unknown or unmeasurable fac-
trials provide the most sound experimental tors between groups. The method of randomiza-
design and results that are most likely to be tion (usually a computer program) should be
causal. This section will highlight some of the mentioned in the methods of any trial. Another
important points to look for when assessing the vital aspect of randomization is allocation con-
internal validity of a study. Importantly, some of cealment. This refers to the concealment of the
this information may actually be in the study randomization process from the investigators so
protocol rather than the paper itself. that it is unknown what group a patient will be
54 A. G. Doumouras and D. Hong
randomized into. If not done properly, investi- Were All Patients Who Entered
gators can intentionally or unintentionally direct the Trial Accounted for and Was
patients towards the group they feel is most Follow-up Adequate?
suitable thus introducing selection bias into the
process and losing much of the benefit of ran- Another major issue with trials of all types is
domization. In the study by Löfgren et al. [7] the patient attrition rate. Readers should be con-
operation list and order of patients was deter- cerned if not all patients are accounted for at a
mined the day before but randomization was not trial’s conclusion. If a large proportion of
performed until the patient was brought into the patients are unaccounted for at the end of a trial,
operative suite. In this way, randomization was the benefits of randomization may be lost.
concealed from the surgeons. In addition, ran- Moreover, bias can be introduced if the dropout
domization was done by a computer program in is related to some aspect of the procedure itself.
blocks of 4 and 6, rather than randomizing single If the dropout was random, then the benefits of
individuals which allows for more balance of randomization should be maintained. Therefore,
factors. a full report of patient attrition is required. In
addition, the follow-up should be rigorous,
blinded and equal between groups to ensure that
Were Study Personnel “Blinded” all adverse effects are accounted for. It should
to Treatment and Apart also be assessed similarly between groups and be
from the Experimental Intervention, long enough to ensure that the outcomes of
Were the Groups Treated Equally? interest can manifest themselves. In the Löfgren
et al. [7] study, the follow-up was thoroughly
Another major methodological concept is blind- conducted by two physicians who were blinded
ing. This is often confused with allocation con- to the study group assignments. Overall, 4.4% of
cealment but they are distinct concepts. Blinding patients were lost to follow-up which was not
refers to ensuring that stakeholders in the trial do different between groups. In addition, the time
not know what treatment patients received as from surgery to follow-up was similar between
knowing this can influence the behavior of groups. This is not unsurprising as there is little
patients and investigators. While blinding of both morbidity from a hernia repair and the follow-up
patients and investigators is relatively straight- period was relatively short.
forward in drug trials, it is often not possible in
surgical trials as surgeons will often need to
know the treatment the patient had. One way to Were Patients Analyzed According
minimize this issue in surgical trials is to have to the “Intention to Treat” Principle?
different clinicians provide the postoperative care
to patients. In the study by Löfgren et al. [7], The intention-to-treat principle is also funda-
after randomization was done, the surgeons in the mental to ensuring causal inference of results.
operative suite did know the type of mesh to be This principle states that patients should be
used within the procedure (blinded-patient). analyzed in the groups they were originally
However, to minimize bias due to this, the two allocated to, regardless of the treatment actually
physicians performing the follow-up did not received. This is vital in surgical trials as patients
participate in the surgeries and were unaware of of poor operative status sometimes may not
the study group assignments. Considering the receive surgery, despite being randomized for
study question, this was probably the strongest surgery. If these patients were to be analyzed in
level of blinding possible. the nonsurgical group, they can bias the results
6 Evaluating Surgical Interventions 55
by having healthier people in the surgical false negative result. Therefore, before a negative
group. Sometimes, this principle can lead to result can be truly established, the sample size
issues of validity of conclusions if there are too and statistical power should be scrutinized and
many patients that did not receive the treatment found to be adequate (see Chap. 29). In addition,
but in most cases the strategy is sound. In addi- the conclusions of other outcomes assessed, for
tion, in surgical trials, if there are many patients which there was no sample size calculation,
not receiving the treatment it may provide a should not be assumed to be adequately powered.
pragmatic answer as to the feasibility of the In the Löfgren et al. [7] study, the sample size
treatment. was calculated at 150 patients within each
In the Löfgren et al. [7] study, the intention to group. This would give the study a power of 80%
treat principle was followed for the final analysis. and a significance level of 5% to detect a
However, they make no mention of how the five-percentage-point absolute difference in the
missing data was dealt with. It is likely, based on rate of hernia recurrence at one year. Upon
the small number of missing patients that the completion of the trial, this study failed to
authors used a complete case analysis including achieve the desired power level. The inability to
data from only those patients who completed reach power based on accrual issues should have
follow-up. This method of handling missing data been mentioned in the limitations so that it can be
is likely unbiased in this case due to the small considered by the reader (see Chap. 29).
numbers and did not likely substantially change
the power for this study. Had patients dropped
out due to measured or unmeasured confounders, What Are the Results?
the analysis will be biased. Furthermore, if too
many patients were lost, even at random, the The key findings of the Löfgren et al. [7] study
power of the study would be called into question. are found in Table 6.2.
Was the Study Large Enough to Detect What Outcomes Were Used and How
a Difference? Were They Measured?
Ensuring an adequate sample size is essential to Because of the wide variety of outcomes that can
answering any clinical question. A randomized be assessed and the use of sophisticated statistical
trial should clearly describe an a priori sample analyses, simply understanding what the results
size and what factors they used to determine the are can be a challenge for a surgeon. This section
sample size. Generally, calculating a sample size will provide a brief summary of how the most
requires the rate of type I error (usually 5%), rate common outcomes are reported.
of type II error (usually 20%, also known as 80%
power), the allocation ratio between groups (1:1, Binary Versus Continuous Outcomes
2:1, etc.), the expected effect and the variance of The vast majority of outcomes in the surgical
that effect. While the first three are fairly standard literature are measured as either binary or con-
in the sample size calculation, the last few vari- tinuous variables. Binary outcomes represent
ables can be controlled by investigators. A larger dichotomous occurrences where patients either
a priori expected effect means a smaller sample have an outcome or they do not (e.g., death or
size but if that effect is unreasonable then it could anastomotic leak). Continuous outcomes repre-
lead to an underpowered trial. Conversely, the sent data that are measured by real numbers such
effect size chosen should also be large enough to as weight or length of stay. The study by Löfgren
be clinically relevant. Power is the complement et al. [7] had two primary outcomes, both binary,
of the type II error rate for a trial, and can be which were the hernia recurrence at 1 year and
described as the chance for a trial to produce a the overall complication rate at 2 weeks.
56 A. G. Doumouras and D. Hong
Univariate Testing: Univariable the effect on the outcome of interest by a one unit
and Multivariable Analyses change in the predictor. For example, if an out-
There is considerable confusion in the literature come was weight and the predictor was age, the
as to the nomenclature for testing let alone the results would be represented as the amount the
actual test themselves. Univariate testing refers to weight has changed for each year in age. For
statistical tests with a single response variable per dichotomous data analyzed by logistic regres-
observation. This represents the vast majority of sion, results are represented by odds ratios. Odds
tests in the surgical literature where a single ratios are related to risk ratios but are less intu-
patient will have a single outcome associated itive. Their predominant use stems from the fact
with them. Multivariate analyses refer to when a that they are much easier to calculate within
single patient has multiple outcomes associated statistical models. Risk ratios (also known as
with them and are rarely used (i.e., a patient has relative risks or the incidence rate ratios) repre-
several weight changes over the course of the sent the ratio between the actual event rate
study) [10]. Univariate testing can take the form between groups. For example, if the event hap-
of univariable or multivariable tests. Univariable pened 60% of the time in group A and 40% in
tests occur when the outcome is tested against group B then the risk ratio would be 1.5. Simi-
only a single predictor. Examples include the larly, odds ratios are the proportion of the odds of
chi-square test for dichotomous data and a t-test occurrence between the two groups. It is not vital
for continuous data. For observational data, these to truly understand the difference between odds
usually represent preliminary tests and are used and risks but it is important to know that odds
to give some exposition to the data rather than as ratios approximate risk ratios when outcomes are
actual conclusions. Alternatively, for randomized rare but overestimate risk when the outcomes are
trials, these may represent the final analysis common (>10%). Relative risks are common in
because randomization precludes the need for randomized trials where multivariable logistic
multivariable analysis. regression analyses are not required due to ran-
Multivariable analyses use multiple predictors domization. Although not utilized in the study by
to explain the outcome of interest. The simplest Löfgren et al. [7], one could easily calculate the
examples are linear regression for continuous risk ratio of postoperative complications (refer to
data and logistic regression for binary data. Table 6.3 for calculation). In the commercial
These studies demonstrate the effect of a single mesh group, the rate of complication was 29.7%
predictor of interest while holding all other pre- whereas the rate of complication in the low-cost
dictors steady. These analyses, therefore, account mesh group was 30.8%, therefore, the relative
for the effect of many different potential con- risk of complications in the low-cost mesh group
founders which is not done in univariable anal- was 1.04, or 4% higher, not a statistically sig-
yses. In the study by Löfgren et al. [7], because nificant difference. Furthermore, a relative risk of
of the fact that it is a randomized clinical trial, 4% is unlikely to be clinically important even if
multivariable regression was not required. This is statistically different. One could use the 95%
because the groups are expected to be balanced confidence interval and an a priori selected
due to randomization. Therefore, the main anal- clinically important difference benchmark
ysis was done using a chi-square test or the (non-inferiority margin) to determine whether the
Fisher exact test, where appropriate. change is a large one to persuade us to accept
the low-cost mesh. This is the basis of
Multivariable Regression Results: Odds non-inferiority trials (see Chap. 13). Table 6.3
Ratios (OR) and Risk Ratios (RR) explains the measures that are used to explain the
Every surgeon has seen the results of a multi- magnitude and precision of the treatment effect
variable regression but correctly interpreting the of a surgical intervention. The Löfgren et al. [7]
results can be a bit more difficult. For linear article is explained through these measures in
regressions, the conclusions are represented by Column 3.
6 Evaluating Surgical Interventions 57
How Large Was the Treatment Effect? 50% of the time then the absolute risk increase is
2.5%. Despite the clinical utility, absolute risks are
Importantly, the difference between absolute and usually only reported in randomized trials because
relative measures needs to be understood by the they are much more difficult to model in regres-
surgeon contextualizing any results. Odds ratios sion analyses. Accordingly, surgeons must keep in
and risk ratios represent a percentage change from mind the absolute risk of an event to contextualize
the occurrence of an event. The absolute risk the results of many trials. The study by Löfgren
represents the change of the occurrence of an et al. [7] does report the absolute risk difference
event on an absolute scale. From a clinical per- which is 1.1% and this was not seen as statistically
spective, the latter is usually the much more significant. This, and calculations for the other
important measure. For example, a study may results from the Löfgren et al. [7] study are shown
report a predictor increases the event rate by 30% in Table 6.2. The number needed to treat (NNT)
(risk ratio 1.30). This number seems large but if of 91 (see column 3 Table 6.3) is an important
the event only occurs 1% of the time then the measure which adds context to our interpretation
absolute risk only goes up to 1.3%. The same of the data. It tells us that we have to treat 91
study could have reported a 0.3% increase in patients with the commercial mesh (instead of the
absolute risk—a much less provocative number. low-cost mesh) to avoid one complication. Many
Conversely, a study could report a 5% relative surgeons may not see this as a huge benefit and
increase (risk ratio 1.05) but if the event occurs may opt for the low-cost alternative.
Table 6.3 Terms used to show the magnitude and precision of the treatment effect
Term Description Example from Löfgren et al. [7]
Risk The probability that an event would occur calculated Risk in treatment group:
as the number of events of interest divided by the total 44/143 = 0.308 (30.8%)
number within the group. Also, known as the relative Risk in the control group:
risk 44/148 = 0.297 (29.7%)
Risk ratio The ratio of the risk between 2 different groups Risk in treatment 0.308 versus
control 0.297
RR: 30.8/29.7 = 1.04
Odds A ratio of the probability that the event will happen to Risk in the treatment group:
the probability that the event will not happen within 44/(143−44) = 0.44
the same group Risk in the control group:
44/(148−44) = 0.42
Odds ratio The ratio of odds between 2 different groups Odds in treatment 0.44 versus in
control 0.42
RR: 0.44/0.42 = 1.05
Absolute Absolute reduction in events Complication in treatment 30.8%
risk in one group compared with versus in control 29.7%. ARR: 30.8
reduction the other −29.7 = 1.1%
Number The average number of patients who need to be treated ARR from article:1.1%
needed to to prevent one additional bad outcome. It is defined as NNT = 1/ARR = 1/0.011
treat the inverse of the absolute risk reduction NNT = 91 patients
Relative Complement of relative risk, Complication in treatment 30.8% versus
risk expressed as a percentage in control 29.7%. RR (30.8−29.7)/
reduction 30.8 = 3.6%
95% An interval of values that Various methods of calculation
confidence include the true value 95% of
interval the time (calculated)
58 A. G. Doumouras and D. Hong
How Precise was the Estimate assessed, this aspect of any trial should be clearly
of the Treatment Effect? investigated by the reader and the questions from
Table 6.1 should be asked.
The last thing a surgeon should examine are the
measures of statistical significance and precision.
The significance is represented by p values. Were the Study Patients Similar
p values represent the probability of obtaining the to My Patients?
results if the null hypothesis (i.e. the assumption
that the predictor has no effect) were true. Gen- Comparing the patient population of a trial to
erally, when this probability is less than 5% the patient in front of you is essential to the
(<0.05) we consider the result statistically signif- application of evidence to surgical practice.
icant. It should be noted that this specific thresh- Many trials have restricted criteria for enroll-
old is completely arbitrary and there is nothing ment and the results may not be directly appli-
magical between a p-value of 0.051 and 0.049. cable to the patient you are treating. In addition,
This threshold exists to limit the rate of Type I they may only be enrolled from a specific
error (e.g. false positives) to 5%. Lastly, even if patient group (e.g., veterans, men, uncompli-
the p-value is below this threshold, statistical cated surgical problems) which may not be the
significance does not equate to clinical relevance. patient in front of you. Often, the patients we
As a standard, most trials use a two-sided p-value treat are older, have more comorbidities, and
of 0.05 as the threshold for statistical significance complex surgical problems than a trial would
(see Chap. 27). In addition to p values, confidence allow. These factors may mitigate the expected
intervals exist to better characterize the precision benefit of a treatment and thus the surgeon
of the result. The 95% confidence interval is should know how the evidence relates to the
routinely used and can be interpreted as the patient in front of them before making any
interval in which the true effect lies 95% of the treatment decisions. The protocol for this study
time if the same study with the same sample was clearly outlines the inclusion and exclusion cri-
repeated (see Chap. 28). Therefore, in studies with teria. Specifically, it included patients 18 years
small sample sizes, the confidence interval can be of age or older with reducible, unilateral, pri-
quite large but with increasing sample size, the mary inguinal hernias. It excluded females,
confidence interval becomes smaller. In the recurrent hernias, femoral hernias, those on
Löfgren et al. [7] study, the two-sided 95% con- anticoagulation, those with current drug abuse
fidence interval were calculated and for the main and ASA 3 or above. In addition, these men
outcome of postoperative complication rate dif- were all from Uganda with a mean age of around
ference, this interval varied from −9.5 to 11.6%. 45 years old, mean BMI of 21 and ASA score of
One way to interpret this interval is by saying that 1 for nearly 90% of the patients. These criteria
if this same study was repeated multiple times, the give a clear picture of the type of patient these
true rate difference would lie within this interval results can be applied to. Based on the mean
95% of the time. BMI and ASA score of the patients within the
study, this trial may not be as applicable in
North America; however, it may be applicable
Are the Results Applicable to a low-income country. Considering the dif-
to My Patients? ferences, it is important to carefully determine
whether extrapolation to a different patient
External validity refers to the applicability of group is reasonable or whether this patient
results to other groups of patients [11]. After the group has several things that may be too dif-
results are understood and the internal validity ferent to apply to your patients.
6 Evaluating Surgical Interventions 59
Were the Measured Outcomes result may not be applicable to the larger surgical
Clinically Relevant? community. The trial by Löfgren et al. [7] utilized
a relatively simple surgical procedure which is
Another area that surgeons should question outlined in the protocol. Specifically, these were
before accepting a “superior treatment” is the day surgeries under local anesthesia which used a
choice of main outcomes considered in the study. Lichtenstein tension-free method. This method
The choice of main outcomes should be relevant could likely be replicated by most general
to both the surgeon and the patient, and should surgeons.
be clinically meaningful. Certain biochemical
markers may be relevant to surgeons but of no
relevance to patients while postoperative pain Resolution of the Clinical Scenario
and the return of function may be less important
to surgeons but of great substance to patients. A careful critique of this article demonstrates that
The main outcomes of the study include com- the internal validity is quite high and therefore
plication rate at 2 weeks and recurrence rate at 1 the results are likely valid to their goals. The
year. While a 1-year follow-up is relatively long, resident who appraised this article likely felt that
many recurrences occur after this and therefore it was appropriate for her staff to use the low-cost
the long-term durability of this treatment cannot mosquito net in this instance. It was comparable
be determined based on the results of this trial. to the hernia mesh used in North America when
used in these patients. However, it was clear that
this conclusion may not be applicable to other
Are My Surgical Skills Similar to Those groups at home in North America and the
of the Study Surgeons? extrapolation of the results should be limited.
8. Urschel JD, Goldsmith CH, Tandan VR, Miller JD. valid? Evidence-based medicine working group.
For the evidence-based surgery working group. Users’ JAMA. 1993;270:2598–601.
guide to evidence-based surgery: how to use an article 10. Hidalgo B, Goodman M. Multivariate or multivari-
evaluating surgical interventions. Can J Surg. 2001; able regression? Am J Public Health. 2013;103:
44(2):95–100. 39–40.
9. Guyatt GH, Sackett DL, Cook DJ. Users’ guides to 11. Sedgwick P. External and internal validity in clinical
the medical literature. II. How to use an article about trials. BMJ. 2012;344:e1004.
therapy or prevention. A. Are the results of the study
A Primer on Outcome Measures
for Surgical Interventions 7
Joy MacDermid
Table 7.2 Sources of outcome measures better understand what dimensions of health are
Outcome measure databases impacted by a health condition or intervention.
Name Link Disadvantages, particularly in upper extremity
surgery, are that the generic measures often do
Patient-reported health http://phi.uhce.ox.ac.uk/
instruments not adequately sample upper extremity function
ePROVIDE™ https://eprovide.mapi-
and can be less responsive to detecting clinical
trust.org/ change [22–24]. The more diffuse a concept/item
Rehabilitation outcome https://www.sralab.org/ is, the more likely that it will be understood and
measure database rehabilitation-measures calibrated differently by different patients. For
Orthopedic scores http://www.orthoscores. example, a single item, like the SANE [21], is
com/ appealing to clinicians as it provides a quick
PROMIS® http://www. barometer of patient status. However, how
(patient-reported healthmeasures.net/ patients define how ‘normal’ they are can be
outcomes measurement explore-measurement- influenced by a variety of factors, which may not
information system) systems/promis
be related to the treatment provided. These global
NIH Toolbox® http://www.
single item ratings are useful in combination with
healthmeasures.net/
explore-measurement- other scales, rather than as a sole indicator of
systems/nih-toolbox outcome.
An outcome measure strategy often includes
both observer/clinician-based outcomes (CBOs)
the nature of the items evaluated, how they are and PRO. CBO measurements are important
scored and the perspectives that the items are because they often are directly related to the
measuring as these all influence the nature and treatment targets. The effects of surgery, fracture
efficiency of the measurement. reduction and strengthening programs can only be
Constructs measured can be quite broad, to be assessed as being beneficial if they successfully
used generically across health conditions, such as change the underlying anatomy/impairments that
health-related quality of life (HRQoL) or health are being targeted. Therefore, the measurement of
status. HRQoL measures, like the EQ-5D [11– these constructs is important to enhance our
15] can take on a variety of formats that may understanding of the treatment mechanisms.
appear like health status measures, like the SF-36 Ultimately, improving the patient’s symptoms
[16, 17] or SF-12 [18–20], since both sample and functioning are often the primary goal of
important domains of life. The distinction surgery, and these impacts are typically measured
between these is often subtle, but quality of life by PRO. PRO can contain consistent standardized
(QoL) measures infer evaluation of the extent to items, adaptive items that are presented to
which a given health status meets the patient’s patients, based on a computer algorithm or blank
needs or expectations. Utility measures provide items where the patients can define the item to be
an overall evaluation on the value of different rated. Advantages of the consistent format are
health states, which, makes them ideally suited that the clinician and patient can monitor
for economic analyses. Most generic outcome responses to specific individual items, which can
measures are multi-item scales that reflect dif- be useful when tracking functional goals. Further,
ferent dimensions of QoL or health status. Single the administration is feasible in either paper or
item global constructs can provide a brief overall electronic formats (not dependent on technology).
assessment of change (e.g. the global rating of A disadvantage of these standardized PRO is that
change), function, or how normal patients per- they may not be appropriately targeted to some
ceive themselves to be (e.g. the Single Assess- individuals. Further, patients often must answer
ment Numerical Evaluation [SANE]) [21]. items that are not relevant to them. Many stan-
Generic measures can be compared across dif- dardized PRO focus on activities of daily life that
ferent health conditions as a global outcome or to might not adequately reflect important domains of
64 J. MacDermid
life for people with high occupational or sport function for example, the (Quick)DASH [32–34]
demands, which can result in ceiling or floor MHQ [35–38] or PRWE [7, 39–42] are highly
effects. Ceiling or floor effects occur when a validated PROs that are useful in hand surgery
patient score is already at the top or bottom of the practice and research. A 2014 practice survey
scale and so no further worsening or improve- found that the most used PROs in hand therapy
ment can be detected. Ceiling or floor effects for clinical practice was the DASH and second most
groups have been defined as when more than 15% used was the PRWE, followed by the Quick-
of the population scores at the bottom or top [25], DASH, PSFS, the patient-rated elbow evaluation
which should include the need for scores to be (PREE) and the upper limb functional index
able to change a clinically important difference. (ULFI) [43]. More recently, some larger centres
Adaptive measures have the advantage of being are adopting computer adaptive testing such as
efficient since the computer selects an item of the PROMIS system [44–46], which, like the
appropriate difficulty based on a previous DASH, MHQ and PRW(H)E possess strong
response. A disadvantage is that a different set of measurement properties.
items is presented on each occasion and the
change in specific functional tasks is not tracked.
The items selected by the computer are not nec- Choosing the Proper Outcome
essarily relevant to patients. Further, technology Measure
barriers may exclude some clinics or patient
groups. The most patient-centred outcome mea- The ideal characteristics of an outcome instru-
sures are those where the patient selects the items. ment include:
Patient-specific PRO, like the patient-specific
functional scale (PSFS) [26] measures provide • Interval level scaling properties
the highest level of responsiveness [27–29] since • Strong reliability, validity and responsiveness
patients select items that are salient and prob- • Validated for the proposed measurement
lematic. Further, these items are often aligned purpose (responsiveness, diagnosis or pre-
with the patient’s goals. A downside of diction) and population
patient-specific measures is that the scores cannot • Measures constructs relevant to your clinical
be compared between patients because the diffi- population and planned interventions
culty of the items is not calibrated. • Low clinician burden: Time, scoring com-
Disease or symptom-specific PRO measures plexity and training
are indicated where the condition is sufficiently • Low respondent burden: time, cognitive and
unique that important elements are not captured health literacy
by generic instruments. For example, median • Relevant and acceptable to respondents
(Symptom Severity Scale) [30] or the ulnar nerve • Has clinical applicability: published compar-
(Patient-Rated Ulnar Nerve Evaluation ative scores, decision-making benchmarks
[PRUNE])-[26] specific scales have been devel- minimal detectable change (MDC) and clini-
oped for neuropathies to capture the unique cally important difference (CID)
sensory and motor symptoms. Since pain is often • Cross-cultural translations available for mul-
a primary reason for patient consultation it can be tiple groups
measured with a variety of PROs. The most • Feasible Cost/availability
commonly used symptom outcome measure is a • Minimal expertise-required; or training easily
numeric pain rating scale, which can be admin- accessed
istered verbally or by paper to measure pain • Common usage or recommended by interna-
intensity. However, multidimensional pain scales tional consensus panels
that assess other dimensions of pain, such as the
McGill Pain Questionnaire [31], are gaining Multiple factors determine what is the best
popularity. Regional measures of symptoms and outcome measure for any given scenario; both
7 A Primer on Outcome Measures for Surgical Interventions 65
measurement properties and feasibility issues are • Reliability is high (intraclass correlation
important. Outcome measures function for a coefficient (ICC) 0.90 for individual
given population, spectrum of the phenomenon patients, or 0.75 for group’s comparisons or
and measurement purpose. Performance cannot evaluating trends with repeated measures [1]
be certain outside of those characteristics. For • The amount of measurement error is low as
example, PROs that are validated for evaluating indicated by a low standard error of mea-
change (responsiveness), may not be ideal for surement (SEM*); which is expressed in the
discrimination or prediction. Therefore, it is original units of measurement
important to have validation studies that apply pffiffiffiffiffiffiffiffiffiffiffi
both to the context and purpose of measurement. SEM ¼ Standard deviation 1 I CC
Table 7.3 Key measurement threats and respective solutions in outcome measures
Measurement threat Impact Solution
Ceiling or floor effects No room to assess CID; Failure to Choose an appropriately targeted
measure change instrument; patient-specific scales
Reliability or validity Measurement properties may not Conduct validation or choose a
established in different patient generalize; items may not be different instrument
population appropriately targeted
Reliability less than excellent Difficulty evaluating change Multiple assessments; larger margins
of error; large sample sizes in clinical
studies
Lack of interval level scoring Inaccurate measurement of change; Rasch-based scoring [47]
compromise of statistical test
assumptions
Lack of factor Inaccurate attribution of constructs or Use of separate subscales rather than
structure/unidimensionality areas of treatment impact total scores
English PRO Exclusion of some people; biased Cross-cultural validation; surrogate
assessments completion (PRO)
(Health) literacy gaps between Exclusion of segments of the Assess literacy; Choose scales with
the measure and the patient’s population; nonadherence or invalid low cognitive and literacy demands;
ability measurement surrogate completion
Intentional bias or effort failure Invalid measurement Assess respond patterns;
Cross-reference PRO with objective
measures
66 J. MacDermid
How Can I Interpret the Score independent process to reduce bias in adminis-
in Clinical Practice? tration. Key measurement threats and solutions
are listed in Table 7.3. Prognostic factors affect
• Evaluate the outcome trajectory over time the baseline score and outcome trajectory, and
• Compare the score to published these should be considered when comparing
benchmarks/trajectories outcomes across individuals or groups. Potential
• Compare the observed change to the clin- prognostic factors include age, gender, disease
ically important difference (CID) specific severity, psychological status, comorbid health
to the patient population, intervention and and socio-economic factors. Outcome measures
context indicate patient status and may be useful for
• Evaluate if the score achieved is consistent signaling the need for further investigation but
with patient goals cannot diagnose a problem.
• Use the score to differentiate subgroups or Conversely, while some people criticize PROs
predict the future outcome as being ‘subjective’, they have an important role
in the assessment of patient status. They indicate
Numerous studies indicate that the best base- outcomes that are important to patients, and the
line predictor of a future score on a PRO is the reason(s) for clinical intervention. In general,
baseline score on that PRO. When predicting the PROs used in hand surgery practice are as reliable,
meaning of PRO, the initial baseline score, the or more reliable, than impairment measures [39,
expected trajectory, published clinical prediction 48, 49]; and are more associated with important
rules and studies on outcome mediators should be outcomes like return to work [50]. Impairment
considered. There has been substantial focus on measures like grip and range of motion
defining CID for different PRO, and these can be (ROM) require patient cooperation, as do PROs.
useful benchmarks. However, these should be The objective outcome measures are those where
interpreted as loose benchmarks since they are the patient or evaluator input is not required (e.g.
often highly variable between studies with dif- nerve conduction). These measures, while impor-
ferent patient populations or interventions [1]. tant for directing care, are not indicators of the
Further since many PRO are ordinal scores, the patient’s functional outcome. Therefore, a bal-
CID can vary over the values of possible scores. anced measurement strategy that includes
Further, the CID for a group (research study) is diagnostic/mechanism-based impairment mea-
different than for an individual. Because of dif- sures, physical impairments like grip/motion and
ferences in methods of computation between the patient-based outcomes measured by PRO are
CID and MDC, the MDC can equal or exceed the typically needed as core measures in hand surgery.
CID. Studies reporting scores for clinical sub-
group scores can provide useful benchmarks; e.g.
those who return to work versus those who do not. Application of Above-Mentioned
Concepts to Resolve the Clinical
Scenario
What Cautions Should I Exercise?
In keeping with Mary’s condition, imaging will
Outcome measures may be invalid if not be used to monitor fracture alignment. Constructs
administered correctly, if not appropriately con- that are deemed relevant for assessing outcomes
structed or targeted, or if respondents are of Distal Radial Fracture (DRF) management
unwilling or unable to fully cooperate with the include joint and hand function. Unlike some
assessment. Outcome measures should be aspects of literature searching, no methodological
administered by an unbiased person or filters have been developed specifically for
7 A Primer on Outcome Measures for Surgical Interventions 67
clinical measurement studies. However, using a The SEM for the PRWE is 4–5 points [7]. The
simple Boolean search strategy, we can combine MDC90 is 9. This means that there is a 90%
the content constructs, measurement property chance that Mary’s true score is within 9 points
terms and clinical populations in a search strat- of the measured score; so, given her measured
egy to identify appropriate articles, e.g. (pain and score of 70, the true score is likely to fall between
disability) and (reliability or validity or respon- 65.5 and 74.5.
siveness or Rasch or factor analysis) and (distal The CID for the PRWE is 11 [7, 54, 55]. To
radius fractures). We can also consult outcome be confident that Mary score of 70 has changed
measure databases (Table 7.2) to identify an important amount, it would need to improve
potential candidates since often the measurement to 59 or less. We know that recovery is rapid in
tools are not included with the clinical mea- the first three months and slower thereafter [56],
surement studies evaluating their properties. and should approach 12, by 1-year post-DRF
To assess joint function, we measure grip and [57, 58]. After reading some of the clinical
range of motion; to assess hand function at the measurement studies that investigate how the
level of the person we will use a PRO. PRWE can be used to improve our prediction of
The functional consequences of a distal radius Mary’s future outcome, we find studies related
fracture are well documented, and international to both return to work and chronic pain. One
panels have defined the core concepts as pain and study demonstrated that patients with high
disability [51, 52]. In this case we decide to use baseline scores and who do not improve on
The patient-rated wrist evaluation [39, 53] successive evaluations are likely to have a
because it is specific to this patient population, is delayed return to work [50]. We will watch
highly reliable and valid [7], contains a separate Mary’s scores carefully given that she has a
pain score (salient given Mary’s pain), is brief, high pain score to see whether improvement can
has a low cognitive demand, has been translated be quickly obtained. Delayed improvement on
into multiple languages [42, 54] and has been the PRWE would suggest she might be at risk of
recommended as a core measure by international a slow return to work. Another study investi-
panels [51, 52]. The PRWE presents 15 items gated how baseline scores can be used to predict
scored on a 0–10 numeric rating scale. The pain future chronic pain. A score of 35/50 on the
scale is scored out of 50 by summing the 5 pain PRWE pain subscale of the PRWE has high
items. The function score is comprised of 6 items sensitivity (85%) and specificity (79%), Area
on specific activities and 4 items on usual under the Curve = 89%, for predicting chronic
activities that are summed to provide a score out pain at 1 year [59]. For more information on
of 100. This score is divided by 2 to provide the sensitivity and specificity, please see Chap. 21:
second 50% of the total score. Thus, the total Diagnostic Studies in Surgery. This study also
score equally waits for pain and disability on a reported that data as a relative risk. Thus, Mary
scale ranging from 0 to 100, with 100 repre- is 8.4 times more likely to experience chronic
senting the maximum pain and disability. pain at one year because her score exceeds that
Mary has a pain score of 40/50 on the PRWE risk threshold of 35 out of 50.
pain subscale, and a total score of 70/100 on Mary is provided the PRWE by administrative
PRWE. Her pain is constant, a 4/10 at rest, a 9/10 staff in the clinic to reduce social desirability
at worst or when lifting heavy objects. Her bias. Translations are kept in reserve and used
function ratings include two low scores ‘fasten- when needed for non-English speaking patients.
ing buttons’, indicating minimal problems with Her forms are kept in her medical chart, and
fine dexterity; and ‘recreation’ given that her comparison data from published studies are kept
main pastime is watching TV. Otherwise, the on hand for comparison. The clinic develops a
other functional items indicate high levels of database to use for comparison data, quality
difficulty in tasks that require motion or strength. assurance or research.
68 J. MacDermid
27. Abbott JH, Schmitt J. Minimum important differ- disability of the arm, shoulder, and hand question-
ences for the patient-specific functional scale, 4 naire, patient-rated wrist evaluation, and physical
region-specific outcome measures, and the numeric impairment measurements in evaluating recovery
pain rating scale. J Orthop Sport Phys Ther. 2014. after a distal radius fracture. J Hand Surg Am.
28. Gross DP, Battié MC, Asante AK. The patient- 2000;25(2):330–40.
specific functional scale: validity in workers’ com- 42. Goldhahn J, Shisha T, MacDermid JC, Goldhahn S.
pensation claimants. Arch Phys Med Rehabil. 2008. Multilingual cross-cultural adaptation of the
29. Hefford C, Abbott JH, Arnold R, Baxter GD. The patient-rated wrist evaluation (PRWE) into Czech,
patient-specific functional scale: validity, reliability, French, Hungarian, Italian, Portuguese (Brazil),
and responsiveness in patients with upper extremity Russian and Ukrainian, vol. 133, Archives of
musculoskeletal problems. J Orthop Sport Phys Ther. orthopaedic and trauma surgery; 2013. p. 589–93.
2012. 43. Valdes K, MacDermid J, Algar L, Connors B,
30. Levine DW, Simmons BP, Koris MJ, Daltroy LH, Cyr LM, Dickmann S, et al. Hand therapist use of
Hohl GG, Fossel AH, et al. A self-administered patient report outcome (PRO) in practice: a survey
questionnaire for the assessment of severity of study. J Hand Ther. 2014;27(4):299–307; quiz
symptoms and functional status in carpal tunnel 308.
syndrome. J Bone Joint Surg Am. 1993. 44. Döring AC, Nota SPFT, Hageman MGJS, Ring DC.
31. Dworkin RH, Turk DC, Trudeau JJ, Benson C, Measurement of upper extremity disability using the
Biondi DM, Katz NP, et al. Validation of the patient-reported outcomes measurement information
short-form McGill pain questionnaire-2 (SF-MPQ-2) system. J Hand Surg Am. 2014.
in acute low back pain. J Pain. 2015;16(4):357–66. 45. Tyser AR, Beckmann J, Franklin JD, Cheng C,
32. Kennedy CA, Beaton DE, Smith P, Van Eerd D, Hon SD, Wang A et al. Evaluation of the PROMIS
Tang K, Inrig T, et al. Measurement properties of the physical function computer adaptive test in the upper
QuickDASH (Disabilities of the arm, shoulder and extremity. J Hand Surg Am. 2014.
hand) outcome measure and crosscultural adaptations 46. Hung M, Voss MW, Bounsanga J, Crum AB,
of the QuickDASH: a systematic review, vol. 22, Tyser AR. Examination of the PROMIS upper
Quality of life research; 2013. p. 2509–47. extremity item bank. J Hand Ther. 2017.
33. Beaton DE, Wright JG, Katz JN, Group UEC. 47. Tennant A, Conaghan PG. The Rasch measurement
Development of the QuickDASH: comparison of model in rheumatology: what is it and why use it?
three item-reduction approaches. J Bone Jt Surg Am. when should it be applied, and what should one look
2005;87(5):1038–46. for in a Rasch paper? Arthritis Care Res. 2007;57
34. Kennedy CA, Beaton DE. A user’s survey of the (8):1358–62.
clinical application and content validity of the DASH 48. Vincent JI, Macdermid JC, Michlovitz SL, Rafuse R,
(disabilities of the arm, shoulder and hand) outcome Wells-Rowsell C, Wong O, et al. The push-off test:
measure. J Hand Ther. 2017;30(1):30–40. development of a simple, reliable test of upper
35. Chung KC, Pillsbury MS, Walters MR, Hayward RA. extremity weight-bearing capability. J Hand Ther.
Reliability and validity testing of the Michigan hand 2014;27(3):185–91.
outcomes questionnaire. J Hand Surg Am. 1998. 49. MacDermid JC, Kramer JF, McFarlane RM,
36. Shauver MJ, Chung KC. The minimal clinically Roth JH. Inter-rater agreement and accuracy of
important difference of the Michigan hand outcomes clinical tests used in diagnosis of carpal tunnel
questionnaire. J Hand Surg Am. 2009. syndrome. Work; 1997.
37. Chung KC, Hamill JB, Walters MR, Hayward RA. 50. MacDermid JC, Roth JH, McMurtry R. Predictors of
The Michigan hand outcomes questionnaire (MHQ): time lost from work following a distal radius fracture.
assessment of responsiveness to clinical change. Ann J Occup Rehabil. 2007;17(1):47–62.
Plast Surg. 1999. 51. Goldhahn J, Beaton D, Ladd A, Macdermid J,
38. Kotsis SV, Lau FH, Chung KC. Responsiveness of Hoang-Kim A. Recommendation for measuring
the Michigan hand outcomes questionnaire and clinical outcome in distal radius fractures: a core
physical measurements in outcome studies of distal set of domains for standardized reporting in clinical
radius fracture treatment. J Hand Surg Am. 2007. practice and research. Ugeskr Laeger. 2014;175
39. MacDermid JC, Turgeon T, Richards RS, Beadle M, (23):1638–41.
Roth JH. Patient rating of wrist pain and disability: a 52. Waljee JFJF, Ladd A, MacDermid JCJC, Rozen-
reliable and valid measurement tool. J Orthop tal TDTD, Wolfe SWSW, Benson LS, et al. A unified
Trauma. 1998;12(8):577–86. approach to outcomes assessment for distal radius
40. MacDermid JCJC, Tottenham V. Responsiveness of fractures. J Hand Surg Am. 2016;41(4):565–73.
the disability of the arm, shoulder, and hand (DASH) 53. MacDermid JC. Development of a scale for patient
and patient-rated wrist/hand evaluation (PRWHE) in rating of wrist pain and disability. J Hand Ther.
evaluating change after hand therapy. J Hand Ther. 1996;9(2):178–83.
2004;17(1):18–23. 54. Weinstock-Zlotnick G, Mehta SP. A structured liter-
41. MacDermid JC, Richards RS, Donner A, Bellamy N, ature synthesis of wrist outcome measures: an
Roth JH. Responsiveness of the short form-36, evidence-based approach to determine use among
70 J. MacDermid
common wrist diagnoses. J Hand Ther. 2016;29(2): radiographic outcomes following intra-articular distal
98–110. radius fractures. Hand (N Y). 2014;9(2):237–43.
55. Walenkamp MMJ, de Muinck Keizer RJ, Goslings JC, 58. Grewal R, MacDermid JC, Pope J, Chesworth BM.
Vos LM, Rosenwasser MP, Schep NWL. The mini- Baseline predictors of pain and disability one year
mum clinically important difference of the following extra-articular distal radius fractures.
patient-rated wrist evaluation score for patients with Hand. 2007;2(3):104–11.
distal radius fractures. Clin Orthop Relat Res. 59. Mehta SP, MacDermid JC, Richardson J,
2015;473(10):3235–41. MacIntyre NJ, Grewal R. Baseline pain intensity is
56. MacDermid JC, Richards RS, Roth JH. Distal radius a predictor of chronic pain in individuals with distal
fracture: a prospective outcome study of 275 patients. radius fracture. J Orthop Sport Phys Ther. 2015;
J Hand Ther. 2001;14(2):154–69. 45(2):119–27.
57. Lalone EA, Rajgopal V, Roth J, Grewal R, MacDer-
mid JC. A cohort study of one-year functional and
Patient-Important Outcome
Measures in Surgical Care 8
Katherine B. Santosa, Anne Klassen and Andrea L. Pusic
After reviewing the titles and abstracts for the [3]. Broadly speaking, PROs encompass symp-
studies that met your search strategy, you do not toms (e.g. post-operative pain or fatigue),
find a systematic review but do find a 2017 health-related quality of life (HRQoL) , such as
article in a high impact journal (Journal of physical and psychosocial well-being, and satis-
Clinical Oncology) entitled, ‘Patient-Reported faction with care (e.g. satisfaction with preoper-
Outcomes 1 Year After Immediate Breast ative information and postoperative follow-up
Reconstruction: Results from the Mastectomy provided by the care team).
Reconstruction Outcomes Consortium’, by Pusic HRQoL is a multidimensional concept that
and colleagues [1]. Although the abstract of the describes quality-of-life (QoL) as it pertains to
manuscript seems to meet your needs, you also health and disease and includes domains related
keep in mind that one of the MeSH terms in your to physical, mental, emotional and social func-
search, ‘patient-reported outcome measures’ was tioning [4]. HRQoL is more encompassing than
only introduced in 2017 so you find other rele- symptomology. Although symptoms describe the
vant articles from the paper and from the patient’s condition or treatment effect, it usually
remaining 31 citations. mirrors a clinician-reported measure [5]. HRQoL
In the Introduction of the paper by Pusic and on the other hand can be impaired by a patient’s
colleagues [1], the authors state that few studies symptoms but also reflects disease or treatment
in the breast reconstruction literature have eval- characteristics that may not be captured by clin-
uated ‘patient perceptions of outcomes’ and of ical symptomology alone.
those that do, most have relied on generic mea- In a variety of conditions and circumstances
sures or surveys with limited reliability or va- such as in our clinical scenario (QoL after breast
lidity. Therefore, the goal of their study was to reconstruction), it may be best to summarize
evaluate patient-reported outcomes (PROs) of outcomes and the importance of those outcomes
women undergoing immediate post-mastectomy with input from patients through PROs. Fur-
breast reconstruction with implant-based or thermore, despite advances in our ability to col-
autologous techniques at 1 year using a lect physical, physiological and biochemical data
patient-reported outcome measure (PROM) ter- that can inform us about a patient’s condition,
med the BREAST-Q [1]. After reviewing the there are certain data that cannot be obtained or
manuscript, you are encouraged that it will help adequately assessed without the patient’s per-
the shared decision-making process with your spective [6]. It is the addition of the patients’
patient and review other articles to learn more perspective that emphasizes the importance of
about PROMs and their importance. PROs in clinical research, patient care and
quality improvement [6].
Introduction
Other Important Concepts
What is a Patient-Reported Outcome? and Definitions
Over time, there has been a realization of the There are several key concepts and terms used
importance of patient-centered care and evalua- when discussing and applying PROs; Table 8.1
tion of outcomes as reported by the patients summarizes these [3, 7]. Importantly, every PRO
themselves in surgical care [2]. The term measure (PROM) has a conceptual framework,
‘patient-reported outcome (PRO)’, as defined by which, defines: (1) the concepts of interest or
the U.S. Food and Drug Administration importance to patients; and (2) a description of
(FDA) refers to ‘any report of the status of a the relationships between the questions asked
patient’s health condition that comes directly and concepts being measured. From the con-
from the patient, without interpretation of the ceptual framework, scales are developed to
patient’s response by a clinician or anyone else’ measure unidimensional concepts (or constructs).
8 Patient-Important Outcome Measures in Surgical Care 73
Fig. 8.2 Steps in the development of a PRO instrument as described by the FDA
one instrument was found to have properties 4. Collect, analyze, and interpret data: In the
that met internationally accepted criteria. The first round of field testing of the BREAST-Q,
goal was to develop a conceptual framework questionnaires were sent to 2715 women with
of patient satisfaction and health-related 1950 (72%) women returning the question-
quality of life in breast surgery. As such, naire for analysis. A subsequent study in
qualitative interviews were conducted on 2012 was performed to test the instrument for
breast reconstruction, reduction and augmen- other measures of validity and reliability with
tation patients. Together with expert opinion the inclusion of data from 817 women.
from plastic surgeons, oncologist breast sur- 5. Modify instrument: Final cognitive debriefing
geons, nurses, and psychologists, six key interviews were performed with 30 patients
themes were identified and formed the con- (10 patients from each procedure group) to
ceptual framework. In addition to clinical review the final draft of the questionnaire.
expertise, mastery in other skill sets such as During this step, the research team obtained
biostatistics, psychology, and measurement the average completion times for patients.
development are important to have in the Since the initial development of the
development of a PROM. BREAST-Q, the questionnaire has been
2. Adjust conceptual framework and draft translated into 13 languages and has helped
instrument: Using patient input, items and facilitate important outcomes studies among
preliminary scales were generated. Items breast surgery patients.
within each pool of patients (reconstruction,
reduction and augmentation) were grouped Another example is the development of the
into domains based on their conceptual mean- CLEFT-Q, a PROM designed to assess outcomes
ing. Content validity was established during for patients with cleft lip or palate. To develop
this step of development. Additionally, a recall the conceptual framework for this PROM, 138
period of two weeks was determined to be heterogenous individuals across 6 countries,
acceptable to patients, clinically relevant, and spanning different ages, genders, socioeconomic
reflective of ‘current status’ of patients for all classes and cleft types participated in rigorous
domains except sexual well-being. qualitative interviews. Data from the interviews
3. Confirm conceptual framework and assess were transcribed and contributed to the devel-
other measurement properties: Items were opment of this PROM [13]. Successful comple-
reduced such that half of the preliminary tion of each step ensures that the PROM
items were retained in the different modules. accomplishes three key things. First, it is
Rasch analysis [12], which will be discussed imperative that the PROM evaluates the concept
later in this chapter, allowed for the summing of interest. Next, the measure should integrate
of items to form a total score for each scale experiences important to the patient population
within each of the modules. In a subsequent of interest. Finally, the language utilized in the
study, different aspects of validity and relia- PROM should not only be easily understandable
bility of the instrument were performed. to the patient but should also allow the patient to
8 Patient-Important Outcome Measures in Surgical Care 75
respond without confusion. Asking irrelevant however, ongoing and refers not only to the
questions for a specific patient population or measurement itself, but also to how it is used [7].
including questions that are poorly written could When deciding on which PROM to use,
result in measurement error and bias [3]. researchers should consider content validity, or
After a literature search has been performed to the overall ability of the instrument to measure its
identify existing PROMs and potential gaps in stated concepts, and its ability to measure its
their utility in the subject area, qualitative re- stated concept in the patient population being
search is performed. Careful qualitative research studied.
forms the basis of any new PROM. Data are Criterion validity is the process whereby a
derived from patient interviews or focus groups, PROM is assessed against a true value or ‘gold
which are audio recorded to facilitate transcrip- standard’ [7, 15]. Importantly, often, criterion
tion and analysis. Transcribed verbatim, patient validity cannot be examined when there is no
interviews and focus group data are used to gold standard to which to compare the PROM.
identify key concepts of interest and to form the Criterion validity includes both concurrent
conceptual framework. In addition, expert opin- validity and predictive validity. Concurrent
ion from a heterogeneous group of individuals validity compares scores from the PROM of
can also be utilized to refine a PROM [14]. interest with the gold standard administered at
During development of the BREAST-Q, for the same time, whereas predictive validity refers
example a total of 48 patients were interviewed to the assessment of how well the measure pre-
in a semi-structured fashion, generating a total of dicts the gold standard in the future [7].
2749 statements about patient satisfaction and Another form of validity is constructed valid-
HRQoL after breast surgery [8]. ity, which ‘describes the relationship between
PROM scores and factors that describe something
that we can observe’ [14]. Physical attributes such
How Do You Evaluate a PROM? as a patient’s height or weight are concrete mea-
surements; however, measuring an abstract con-
Validity struct such as depression, for example, is more
Validity is a key property to consider in evalu- challenging and cannot be directly observed [7].
ating any PROM. Validity, or the ability of a Instead, there are behaviors that are consequences
PROM to measure what it is intended to mea- of depression such as decreased energy, irritabil-
sure, can be described by content validity, crite- ity, and changes in sleep that can be used to refer to
rion validity, and construct validity [15]. Content abstract ideas that we construct in our minds to
validity, which has been referred to as ‘the explain observed patterns or differences in
cornerstone of validity’ [5, 15] describes ‘the behavior, attitudes, or feelings [7]. A main goal of
degree to which the content of a measurement is any PROM is to measure these abstract concepts as
an adequate reflection of the construct measured’ there are no direct measurements to summarize
[15]. When assessing a PROM, it is important to such experiences. Construct validity, therefore,
ask: Is the questionnaire asking the important refers to the ability of a PROM to assess the
questions for the clinical question at hand? In abstract concept or construct it is trying to measure
other words, is this PROM appropriate for your [14, 17, 18]. It is evaluated by hypothesizing how
patient or patient population? It is apparent to see scores are associated with factors hypothesized to
why performing high-quality patient interviews be associated with the constructs measured by the
and establishing content validity during the PROM [14, 18].
development of the PROM are so imperative. In 2010, the Consensus-based Standards of
Content validity is arguably the most important Health Status Management Instruments (COS-
aspect of validity and is largely established in the MIN) study released a checklist to evaluate the
initial qualitative phase of PROM development methodological quality of studies measuring
[3, 16]. Establishment of content validity is, PROs [18, 19]. This consensus-based study
76 K. B. Santosa et al.
included opinions of international experts in Reliability assesses the precision of the mea-
psychology, epidemiology, statistics and clinical surements of the PROM and has been defined as
medicine, and is used as a guide to evaluate va- ‘the degree to which measurement is free from
lidity and reliability of PROMs. The BREAST-Q measurement error’, [15] and is inversely related
was developed prior to the release of this to measurement error. Reliability is typically
consensus-based checklist [8]. Nonetheless, it referred to in terms of its reproducibility [15, 29].
meets the rigorous criteria for content, criterion Importantly, validity and reliability are not
and construct validity set forth by the COSMIN. independent of each other—in other words, a
In Phase I of development, data from qualitative PROM may be shown to be valid but not reli-
interviews of 48 patients, expert opinion and able, reliable but not valid, neither valid nor
literature review generated the items included in reliable, or both valid and reliable. However,
the instrument. Assessment of all items led to validity can be limited by reliability in that
three pools of items from augmentation, recon- inconsistent responses may affect the validity of
struction and reduction patients, which formed the measurement [7].
the domains of the conceptual framework for There are two approaches to assessing the
each type of surgery and was not just applied to reliability of a PROM: (1) internal consistency
all breast surgery patients. The research team reliability; and (2) repeatability reliability.
then evaluated each of the item lists in each of Internal reliability is applied to multi-item scales
the domains to retain the most appropriate items and refers to the consistency of responses to the
to form the best potential scale. In a subsequent different items that form the scale. According to
study, the research team performed further vali- the Steering Committee of the COSMIN,
dation tests of the BREAST-Q by analyzing Cronbach’s alpha coefficient is the preferred
questionnaires from 817 women [20]. With statistic to measure internal consistency [20].
regards to criterion validity, BREAST-Q scales Cronbach’s alpha coefficient is a statistic that is
were compared to other ‘gold standard’ scales calculated from the pairwise correlations
such as the Short Form 12/36, Breast Evaluation between items [7]. Values for Cronbach’s alpha
Questionnaire, Breast-Related Symptoms Ques- coefficient vary from 0 to 1, with >0.70 being a
tionnaire, Breast Reduction Assessment Severity minimal requirement for internal consistency [7,
Scale, European Organization for Research and 17]. Repeatability reliability on the other hand,
Treatment of Cancer Scales, Body Image Scale, can be applied to single-item or multi-item scale
Body Image after Breast Cancer Questionnaire and evaluates the variances between repeated
Body Stigma Scale [21–27]. Moderate correla- measurements on the same group of subjects.
tions between BREAST-Q scores and other Repeatability reliability can be further divided
scales with related constructs and low correla- into: (1) test–retest reliability; (2) interrater
tions with dissimilar constructs were observed, reliability; and (3) equivalent-forms reliability.
confirming that this PROM meets criteria for Test–retest reliability is used when measure-
criterion validity. The research team also deter- ments are repeated on more than one occasion
mined the extent to which subscales of the separated by a time interval that is sensible to
BREAST-Q measured separate but related con- how the PROM will be used. A minimum of a
structs by calculating inter-correlations to estab- intra-class correlation coefficient (ICC) of 0.70
lish construct validity. Taken together, the in studies with at least 50 patients is generally
BREAST-Q exceeds criteria for validity [28]. considered to meet criteria for acceptable test–
retest reliability [13, 29]. Interrater reliability is
used when measurements are made at the same
Reliability time by different observers; and equivalent-
forms reliability is used when measurements
In addition to validity, another important psy- involve different variants of the same attribute
chometric property to consider is reliability. on construct [7].
8 Patient-Important Outcome Measures in Surgical Care 77
In addition to meeting the validity criteria set What Are the Different Types
forth by COSMIN, the BREAST-Q also meets of PROMs?
the reliability criteria. With regards to internal
consistency, the uni-dimensionality of each scale Various types of PROMs exist that differ by
of the BREAST-Q was checked and Cronbach’s content and the primary intended purpose of their
alpha coefficient was calculated for each scale use [19]. Instruments can be classified into dif-
separately. All domains within each module of ferent categories including:
the BREAST-Q (i.e. Augmentation, Reduction (1) disease/condition-specific; (2) generic; (3) di-
and Reconstruction) had Cronbach’s alpha mension specific; (4) region/site-specific; (5) in-
coefficients of >0.80. With regards to test–retest dividualized; and (6) utility measures. Although
reliability, the intra-class correlation coefficient they can be classified into different groups, they
was calculated as >0.70 across all domains and are not mutually exclusive, as some PROMs fit
modules after 462 patients retook the into more than one category. Here, we delve into
BREAST-Q after two weeks. the advantages and disadvantages between dis-
ease or condition-specific versus generic PROMs
and briefly define the different types of instru-
Classical Test Theory Versus Modern ments that exist.
Psychometric Test Theory Disease or condition-specific instruments
measure the patient’s perception of a specific
There are two main techniques for the develop- disease or health problem. An important advan-
ment and validation of PROMs: classical test tage of these types of instruments is that they are
theory (CTT) and item response theory composed of relevant content that is important
(IRT) [30]. While CTT is the most commonly when used in a clinical trial or in clinical care of
used, it has some drawbacks which IRT attempts a specific patient group. It has been shown that
to address [31]. In CTT for example, the psy- disease-specific instruments are more likely to
chometric properties only relate to a specific detect important changes that occur over time in
population and situation in which the question- the particular disease being studied [32, 33].
naire was developed and may only be valid for Because this type of PROM is more specific, it is
group-level-based research. Second, the also possible that a lower sample size may be
assumption is that all items in a scale contribute needed to detect clinically important differences
equally to the final score. Third, it is difficult to among a sample with the disease or condition of
equate scores that a person achieves on different interest or differences following treatment or
tests [14]. In contrast, IRT focuses on individual both [33, 34]. Additionally, patients may be more
items rather than an overall test score and likely to complete questionnaires that ask about
assumes that individual items contribute differ- relevant concepts that are important to their
ently to the final score [14]. One statistical condition or disease [32]. Region or site-specific
method within the umbrella of IRT is Rasch PROMs, which can be considered a type of
Measurement Theory (RMT) [12], which was condition-specific measures, are particularly
utilized in the development of the BREAST-Q useful in surgery. These measurements were
[8]. Advantages of using Rasch measurement developed to assess problems in a specific part of
methods during the development and item the body. For example, the Shoulder Disability
reduction of the BREAST-Q scales have been Questionnaire consists of 22-items to evaluate
described previously [8]. One important advan- disability arising from the shoulder [35].
tage of RMT over CTT, particularly for sur- The major disadvantage of using a disease or
geons, is the ability to measure clinically condition-specific PROM is the inability to use
meaningful change for individuals with RMT as the instrument on a generalized population
questionnaires developed with CTT may only be without the disease or condition being measured.
applicable to group-based level research [8]. Therefore, it is not possible to compare the health
78 K. B. Santosa et al.
status or well-being of the study sample to a satisfied the patient is with the feel of her breasts
general sample of well individuals. Moreover, or appearance in clothing after reconstruction.
disease or condition-specific measures may not Depending on the research question, it may be
capture health problems associated with a disease appropriate to use a condition-specific, generic,
or its treatment that was not anticipated in the or both in any given study. The advantages of
initial design of the instrument [32], further any measurement tool should be weighed against
emphasizing the importance of establishing the potential disadvantages of its use. In addition
content validity during the development of the to condition-specific and generic measurements,
PROM. other types of measures exist.
Generic PROMs cover a broad spectrum of Dimension-specific measurements evaluate
aspects of health status and can be applied to a one specific aspect of health status [32]. For
wide group of patient groups. The SF-36 is an example, a common type of dimension-specific
example of a generic PROM that has been widely measurement is one that assesses different
used across many disciplines and patient groups aspects of psychological function. The Beck
[17]. Another common and more recently Depression Inventory consists of 21 items
developed generic measurement is the Patient- addressing symptoms of depression [41]. Initially
Reported Outcomes Measurement Information developed for use in patients with psychiatric
System-29 (PROMIS-29). This generic, publicly illness, this dimension-specific measurement has
available approach to measuring HRQoL was been increasingly used to assess depression in
developed by the National Institutes of Health patients with physical illnesses [42, 43].
with the goal of comparing across different health Certain PRO platforms provide for a cus-
conditions [36–40]. The main advantage of using tomizable measure specific to the treatment area
generic measurements like the SF-36 or or specific device. For example, the FACE-Q is a
PROMIS-29 is the ability to use them across a PROM designed to measure the satisfaction and
broad spectrum of health problems and patient quality of life for facial aesthetic procedures [44].
populations. Generic measurements are particu- The instrument was developed with multiple
larly useful if no condition-specific measures modules of quality of life scales, appearance
exist for a particular patient group [41]. Addi- scales, adverse event checklist and patient
tionally, because they have been tested against experience of care scales. Modules can be chosen
healthy individuals, normative values exist for for a clinical study based on the specific treat-
these different measurements, allowing for com- ment or general facial location where the device
parison to a baseline of healthy individuals [32]. is used. For example, for a lip device, only
An important disadvantage of using a generic modules relevant to the lip are utilized in the
PROM is that not all items may be relevant for clinical study. Modules pertaining to other facial
the patient group or study question. Because features (e.g. forehead) are removed for the
generic measurements tend to have fewer rele- specified clinical study. The customization limits
vant items to a particular condition or disease, the patient burden and assures all questionnaire
they can be less sensitive in detecting differences items are relevant and appropriate for the patient
with an intervention than using a condition- cohort [45].
specific measurement [14]. For example, if we Individualized measures allow respondents to
wanted to evaluate the impact of a new technique select issues, domains, or concerns that are of
in breast reconstruction on sensation or particular interest to them [32]. Without a pre-
natural-feeling aspect of the newly reconstructed determined list of questions selected by the
breast, questions in generic measurement such as investigator or researcher, individualized mea-
‘Are you able to walk a flight of stairs?’ or ‘How surements encourage patients to identify the
is your energy level?’ are not as useful as a issues that are of personal concern or importance.
condition-specific measurement like the The McMaster-Toronto Arthritis Patient Prefer-
BREAST-Q that specifically asks about how ence Disability Questionnaire (MACTAR) is an
8 Patient-Important Outcome Measures in Surgical Care 79
individualized PROM that asks patients to iden- [1] to be the most directly relevant, you decide to
tify and subsequently prioritize five activities that share the main findings of the study with your
are most affected by their arthritis [46]. Follow- patient. In summary, the primary goal of the
ing an intervention, changes in the activities that study was to evaluate PROs after immediate
patients identified are measured [46, 47]. implant-based versus autologous breast recon-
Utility measures are typically considered a struction. The main outcome of the study was
class of generic measurements, and have been PROs as measured by BREAST-Q and
derived from economics and decision-analysis Patient-Reported Outcomes Measurement Infor-
[32]. Widely used in cost-effectiveness and mation System (PROMIS) scores at 1 year after
decision-analysis studies, utility scores ascribe reconstruction.
personal preferences of individuals regarding After reading more about how the
health states and provide evidence regarding the BREAST-Q was developed, you conclude that it
value of a particular intervention or treatment to is an appropriate condition-specific measure to
society as a whole [32]. It is beyond the scope of assess PROs after breast reconstruction. In
this chapter to summarize and fully describe all addition, the manuscript describes the scores
the key aspects of utility measures; for more from 7 domains of the PROMIS-29, a widely
information please see Chap. 23. used generic PROM. The study found that
patients who underwent autologous reconstruc-
tion were more satisfied with their breasts and
Applying the Literature to Patients had greater psychosocial and sexual well-being
than those who underwent implant-based tech-
How Do You Interpret Results? niques. Despite these results favoring autologous
reconstruction, there was a notable decrease
Being able to assign qualitative meaning to a physical well-being of the abdomen scores at 1
quantitative score from a PROM or inter- year compared to baseline. You find a similar
pretability, is of paramount importance. If a article assessing similar outcomes, but among
measurement tool shows validity and reliability patients with longer follow-up [49].
but has not measured the issues that matter to You share these results and their implications
patients it is not useful [14]. Calculating the with your patient. The patient is appreciative of
minimal important difference (MID) or the min- the information and states that it has helped her
imal important change (MIC), standard error of come to an informed decision. Using a
measurement (SEM) is one approach to estab- shared-decision-making approach, she utilizes
lishing interpretability of a PROM. Defined by the data from this study to opt for autologous
Jaeschke and colleagues, MID refers to ‘the reconstruction after mastectomy.
smallest difference in score in the domain of
interest which patients perceive as beneficial and
which would mandate in the absence of trou- Appendix 1. Articles found
blesome side effects and excessive cost, a change in literature search
in the patient’s management’ [48]. The MID or
MIC should be defined to provide information 1. Cohen WA, Ballard TN, Hamill JB,
regarding what changes in score would be con- Kim HM, Chen X, Klassen A, et al.
sidered to be clinically meaningful [14, 48]. Understanding and optimizing the patient
experience in breast reconstruction. Ann
Plast Surg. 2016;77(2):237–41.
Resolution of the Scenario 2. Dayicioglu D, Tugertimur B, Zemina K,
Dallarosa J, Killebrew S, Wilson A, et al.
After reviewing the literature and finding the Vertical Mastectomy incision in implant
evidence from the study by Pusic and colleagues breast reconstruction after skin sparing
80 K. B. Santosa et al.
25. Sigurdson L, Kirkland S, Mykhaloyskiy E. Valida- outcomes measured using the PROMIS-29. Value
tion of a questionnaire for measuring morbidity in Health. 2014;17(8):846–53.
breast hypertrophy. Plast Reconstr Surg. 40. Visser MC, Fletcher AE, Parr G, Simpson A,
2007;120:1108–14. Bulpitt CJ. A comparison of three quality of life
26. Hopwood P, Fletcher I, Lee A, Al Ghazal S. A body instruments in subjects with angina pectoris: the
image scale for use with cancer patients. Eur J Sickness Impact Profile, the Nottingham Health
Cancer. 2001;37:189–97. Profile, and the quality of well being scale. J Clin
27. Baxter N, Goodwin P, McLeod R, Dion R, Devins G, Epidemiol. 1994;47(2):157–63.
Bombardier C. Reliability and validity of the body image 41. Beck AT, Ward CH, Mendelson M, Mock J,
after breast cancer questionnaire. Breast J. 2006;12: Erbaugh J. An inventory for measuring depression.
221–32. Arch Gen Psychiatry. 1961;4:561–71.
28. Bohrnstedt GW. Measurement. In: Rossi PH, 42. Grabowska-Fudala B, Jaracz K, Gorna K, Miechow-
Wright JD, Anderson AB, editors. Handbook of icz I, Wojtasz I, Jaracz J, et al. Depressive symptoms
survey research. New York: Academic Press; 1983. in stroke patients treated and non-treated with
p. 69–121. intravenous thrombolytic therapy: a 1-year
29. Giraudeau B, Mary JY. Planning a reproducibility follow-up study. J Neurol. 2018. [Epub ahead of
study: how many subjects and how many replicates print] https://doi.org/10.1007/s00415-018-8938.
per subject for an expected width of the 95 per cent 43. Daniel M, Agewall S, Berglund F, Caidahl K,
confidence interval of the intraclass correlation Collste O, Ekenbäck C, et al. Prevalence of anxiety
coefficient. Stat Med. 2001;20:3205–14. and depression symptoms in patients with myocardial
30. Crocker LAJ. Introduction to classical and modern infarction with non-obstructive coronary arteries.
test theory. Rinehart and Winston: Holt; 1986. AM J Med. 2018. [Epub ahead of print] https://doi.
31. DeVellis RF. Classical test theory. Med Care. org/10.1016/j.amjmed.2018.04.040.
2006;44(11):S50–9. 44. Klassen AF, Cano SJ, Scott A, Snell L, Pusic AL.
32. Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Measuring patient-reported outcomes in facial aes-
Evaluating patient-based outcome measures for use thetic patients: development of the FACE-Q. Facial
in clinical trials. Health Technol Assess. 1998;2(14): Plast Surg. 2010;26(4):303–9.
i–iv, 1–74. 45. Administration USFaD. Value and Use of
33. Deyo RA, Patrick DL. Barriers to the use of health Patient-Reported Outcomes (PROs) in Assessing
status measures in clinical investigation, patient care, Effects of Medical Devices. [Internet]. 2017. [cited
and policy research. Med Care. 1989;27(3 Suppl): 2018 July 14]. Available from https://www.fda.gov/
S254–68. downloads/AboutFDA/CentersOffices/
34. Deyo RA, Patrick DL. The significance of treatment OfficeofMedicalProductsandTobacco/CDRH/
effects: the clinical perspective. Med care. 1995;33(4 CDRHVisionandMission/UCM588576.pdf.
Suppl):As286–291. 46. Tugwell P, Bombardier C, Buchanan WW, Gold-
35. Croft P, Pope D, Zonca M, O’Neill T, Silman A. smith CH, Grace E, Hanna B. The MACTAR patient
Measurement of shoulder related disability: results of a preference disability questionnaire–an individualized
validation study. Ann Rheum Dis. 1994;53(8):525–8. functional priority approach for assessing improve-
36. Jensen RE, Potosky AL, Reeve BB, Hahn E, Cella D, ment in physical disability in clinical trials in rheuma-
Fries J, et al. Validation of the PROMIS physical toid arthritis. J Rheumatol. 1987;14(3):446–51.
function measures in a diverse US population-based 47. Brodin N, Grooten WJA, Strat S, Lofberg E,
cohort of cancer patients. Qual Life Res. 2015;24 Alexanderson H. The McMaster Toronto arthritis
(10):2333–44. patient preference questionnaire (MACTAR): a
37. Hahn EA, Beaumont JL, Pilkonis PA, Garcia SF, methodological study of reliability and minimal
Magasi S, DeWalt DA, et al. The PROMIS satisfac- detectable change after a 6 week-period of acupunc-
tion with social participation measures demonstrated ture treatment in patients with rheumatoid arthritis.
responsiveness in diverse clinical populations. J Clin BMC Res Notes. 2017;10(1):687.
Epidemiol. 2016;73:135–41. 48. Jaeschke R, Singer J, Guyatt GH. Measurement of
38. Teresi JA, Jones RN. Methodological issues in health status: ascertaining the minimal clinically
examining measurement equivalence in patient important difference. Control Clin Trials. 1989;10
reported outcomes measures: methods overview to (4):407–15.
the two-part series, “measurement equivalence of the 49. Santosa KB, Qi J, Kim HM, Hamill JB, Wilkins EG,
patient reported outcomes measurement information Pusic AL. Long-term patient-reported outcomes in
system((R)) (PROMIS((R))) Short Forms”. Psychol postmastectomy breast reconstruction. JAMA Surg.
Test Assess Model. 2016;58(1):37–78. 2018. [Epub ahead of print] https://doi.org/10.1001/
39. Craig BM, Reeve BB, Brown PM, Cella D, jamasurg.2018.1677.
Hays RD, Lipscomb J, et al. US Valuation of health
Surrogate Endpoints
9
Seper Ekhtiari, Ryan P. Coughlin, Nicole Simunovic
and Olufemi R. Ayeni
Introduction
act as a substitute for the true outcome of interest. making them useful particularly in early
For example, measuring joint range-of-motion stages of pharmacological trials: Surrogate
following orthopedic surgery and using it as an endpoints are often used as a
endpoint to estimate function and recovery is “proof-of-concept” in the development and
such a surrogate outcome. evaluation of pharmaceuticals and surgical
Surrogate endpoints offer multiple advan- techniques alike. For example, we know that
tages, hence their popularity in the literature. hemoglobin A1c (HbA1c) is an important
They do, however, have some important limita- marker of sugar control in diabetic patients,
tions that must be recognized and understood and thus a predictor of long-term disease
before using them. The “LDL cholesterol mor- outcomes [8]. It is much more practical to
tality paradox” is a finding of large well-designed measure HbA1c to assess the potential benefit
trials that high-intensity statin drugs (compared of a new antihyperglycemic medication than
to low- or moderate-intensity statins) result in a to wait for the long-term micro- and
statistically significant reduction in serum macrovascular sequela of diabetes to declare
low-density lipid (LDL) values, but do not affect themselves.
overall mortality [4]. In this case, LDL is used as 4. They reduce the sample size required to
a surrogate endpoint for the severity of observe an effect: In the randomized con-
atherosclerosis and the likelihood of eventual trolled trial (RCT) discussed later in this
cardiac events. The cholesterol mortality paradox chapter [9], all postoperative patients rou-
example displays many of the advantages of tinely underwent Doppler ultrasonography.
surrogate endpoints, as well as a major disad- Thus, even patients who were completely
vantage. Some of the most important advantages asymptomatic but had a positive ultrasound
of surrogate endpoints include the following would be considered as DVT events. Clearly,
[5, 6]: this means that patients with “subclinical”
DVTs that in routine practice would have
1. They are easier and cheaper to measure been inconsequential and gone unnoticed
than “true” endpoints: Measuring blood were now being identified. This increases the
pressure is simple, cheap and routinely per- event rate, thus decreasing the total sample
formed. Detecting a stroke requires thorough size that is needed for the study to be suffi-
clinical evaluation and detailed cross- ciently powered.
sectional imaging. Thus, using blood pres-
sure as a surrogate endpoint for risk of stroke Overall, the surrogate endpoints are indis-
is much cheaper and easier than focusing on pensable in both research and clinical settings.
stroke itself as an endpoint. Their ease of use, relative inexpensiveness,
2. They can often be measured more objec- objectivity, and early presentation make them
tively and precisely than patient-important ideal in both contexts if used appropriately.
endpoints: For a postoperative patient, par- That being said, there are certainly drawbacks
ticularly in the elderly and those with a his- to the use of surrogate endpoints. Some of these
tory of cardiac disease, myocardial injury is limitations include misleading conclusions, mis-
always a concern. Measuring serum troponin representation of the effect size of an interven-
levels as a surrogate endpoint for the presence tion, and challenges in evaluating the safety of an
and extent of cardiac injury is more precise intervention.
and objective than a clinical history and One potential pitfall is that the surrogate
physical exam [7]. endpoint produces results that are misleading—
3. They are often present earlier than the meaning they make the risk–benefit balance
clinical manifestation of the outcome, thus appear opposite to its “true” direction. This is a
9 Surrogate Endpoints 87
rare but not unprecedented scenario. Generally, intervention in terms of patient-important com-
this brings into question the validity of the sur- plications, such as death or major adverse clinical
rogate endpoint, and the relationship between the events. The detection of differences in these out-
surrogate and true endpoints should be reas- comes requires large RCTs that are specifically
sessed. Anti-arrhythmic medications provide a designed and powered to detect such events.
good case study for this phenomenon. Cardiac Balancing these advantages and disadvantages
arrhythmias are dangerous conditions which can is important when performing or evaluating RCTs
cause severe complications including heart fail- that employ surrogate endpoints. With the
ure and death [10]. It would seem to make sense advancement of laboratory medicine and tech-
then, that a medication that stops or reverses an nology, the number of surrogate endpoints avail-
arrhythmia would be beneficial to the patient’s able to clinicians and researchers has increased
health and longevity. Interestingly, a recent [15]. Thus, it can be tempting to “treat the num-
meta-analysis of randomized controlled trials bers” in an effort to help patients. Caution should
found an increase in noncardiac mortality and be exercised, however; normalizing or improving
all-cause mortality with anti-arrhythmic medica- laboratory or structural measurements does not
tion; the risk of cardiac death was not different in always yield clinical improvement [4]. For a sur-
patients on these medications [11]. Thus, what rogate endpoint to be valid, it has to meet two
may have seemed like quite a suitable endpoint major, important criteria:
(the reversal of an arrhythmia), turned out in fact
to show a net harm in patient outcomes. 1. Is there strong documented/published evi-
More commonly, surrogate endpoints correctly dence that connects the surrogate outcome
identify the direction of the treatment effect, but to the patient-important outcome under
over- or underestimate the magnitude of the effect. consideration? Troponin is a protein
As mentioned earlier, one of the advantages of expressed in both skeletal and cardiac muscle,
surrogate endpoints is that they allow for a trial to with specific subtypes only being expressed
be conducted with a smaller sample size than in the myocardium and nowhere else.
would be necessary otherwise. The potential Myocardial necrosis causes the release of
downside of this is that they may overestimate the troponins into the bloodstream, where they
effect size [12]. An interesting example of this can be easily measured [16]. In addition,
pitfall has been demonstrated in ophthalmological angiographic studies have confirmed that
trials for treatment of glaucoma. Clearly, glau- troponin concentration is directly related to
coma is a disease of increased intraocular pressure the presence and extent of a coronary artery
[13]. Thus, it makes sense that reducing intraocular clot [16]. Therefore, we can be assured that
pressure (IOP), whether medically or through a the co-occurrence of myocardial injury and
procedure, will lead to better functional outcomes. troponin elevation is not simply by chance
Interestingly, however, there are medications that either— there is no confounding effect here
provide quite clinically important structural (more on this later). Thus, the use of serum
improvements (i.e., lower IOP), but relatively troponin as a surrogate for myocardial injury
smaller functional gains [14]. Using a hypothetical meets the first criterion for a valid surrogate
example, gains in knee range-of-motion beyond a endpoint.
certain limit may not provide any further func- 2. Is there strong clinical data available to
tional benefit—does 150° of knee flexion provide correlate improvement in the surrogate
any additional benefit beyond 140°? Thus, using endpoint with improvement in the
range-of-motion as a surrogate endpoint for the patient-important outcome of interest?
function may overestimate the effect of the Myocardial injury after noncardiac surgery
intervention. (MINS) refers to a phenomenon where serum
Finally, surrogate endpoints are often unable troponin levels elevate after noncardiac sur-
to accurately quantify the overall safety of an gery, in the absence of other classic findings
88 S. Ekhtiari et al.
Literature Search
Table 9.1 Key findings from the article by Camporese et al. [9]
7 day GCS 7 day LMWH 14 day LMWH [95% CI] p-value
N 660 657 444 –
Age (year) 42.3 (14.4) 41.9 (15.1) 42.5 (16.7) –
Male:Female 1.66:1 1.62:1 1.60:1 –
Use of hormonal compounds, n(%) 154 (23.3) 194 (29.5) 114 (25.7) –
Primary efficacy end point, n(%) 21 (3.2) 6 (0.9) 4 (0.9[0.4–2.3]) 0.005
– Death 0 0 0
– Symptomatic PE 2 (0.3) 2 (0.3) 2 (0.5 [0.1–1.6])
– Asymptomatic proximal DVT 7 (1.1) 2 (0.03) 0
– Symptomatic proximal DVT 1 (0.2) 0 1 (0.2 [0.0–1.3])
– Symptomatic distal DVT 11 (1.7) 2 (0.3) 1 (0.2 [0.0–1.3])
Secondary efficacy end point, n(%) 31 (4.7) 12 (1.8) 11 (2.5 [1.4–4.4]) 0.005
– Asymptomatic distal DVT 10 (1.5) 6 (0.9) 7 (1.6 [0.8–3.2])
Primary safety end point, n(%) 2 (0.3) 6 (0.9) 2 (0.5 [0.1–1.6]) –
– Major bleeding event 1 (0.2) 2 (0.3) 1 (0.2 [0.0–1.3])
– Clinically relevant bleeding 1 (0.2 4 (0.6) 1 (0.2 [0.0–1.3])
Secondary safety end point, n(%) 22 (3.3) 29 (4.4) 18 (4.1 [2.6–6.3]) –
– Minor bleeding event 20 (3.0) 23 (3.5) 16 (3.6 [
2.2–5.8])
Let us evaluate whether or not our surrogate What Are the Results?
endpoint is valid. The patient has stated that she
is worried about “dying from a blood clot in her You now review the results of the Camporese
lung”; thus, PE-related mortality is your true end et al. [9] study, confident that DVT is a valid
goal. You wonder if DVT is a valid surrogate surrogate endpoint to look at. You turn your
endpoint in this case. Clearly, DVT as a surro- attention to the results section of the article.
gate endpoint for PE meets the first criterion: Interestingly, you notice that the authors report
there is a causal relationship between DVT and that among 877 patients with DVT symptoms
PE, and a PE can certainly be fatal. This causal (i.e., “suspected DVT”), only 16 had a confirmed
90 S. Ekhtiari et al.
DVT on ultrasound (1.8%) [9]. As well, the total including longer and more complex surgeries
number of patients with asymptomatic DVT such as ligament reconstruction. You note that the
(N = 32) is twice the number of patients with a study was conducted in Italy, and thus, the patient
symptomatic DVT (N = 16) [9]. After lamenting group may have had different risk factors and
the seemingly limited utility of symptoms in lifestyle patterns than your Canadian patient
guiding the diagnosis of DVT, you turn your group. Finally, you wonder about the applicabil-
attention to the data comparing the three groups. ity of LMWH—this study is nearly a decade old,
You note that there were zero deaths across the and there have been a number of newer, more
three groups. In addition, there was an identical effective anticoagulant medications developed
number of PEs across the three groups, though since then. Overall, you decide that the patient
the percentages are slightly different. Given that groups are relatively similar to your practice and
the primary endpoint was a composite outcome, patient, but you do have some reservations about
these individual outcomes are not directly com- the generalizability of the results.
pared. You note that the authors report a p-value
of 0.005 for their primary endpoint, with the
LMWH groups having fewer events than the Resolution of Clinical Scenario
GCS group. Finally, you look at the safety data:
the study reports that the rate of major and minor Having considered the study, you come to the
bleeding events across the three groups was following conclusions:
similar, though no direct statistical comparison
was performed. – The study was well-designed and well-
conducted.
– The sample size was large enough (as the
Are the Results Applicable to My authors performed a prospective sample size
Patients? calculation and met their target [9]) and
generalizable to your patients. Demographic
You are satisfied that using DVT as a surrogate data similar to your patient included: age,
endpoint for PE-related mortality is valid. You BMI, smoking status and family history of
are somewhat concerned about the fact that the venous thromboembolism. Furthermore, the
study reported composite endpoints, and that included patients had arthroscopic knee sur-
these included many cases of asymptomatic gery (38–44% meniscectomies) performed
DVT. Nonetheless, you turn your attention to the under regional anesthesia.
demographics of the study to decide whether the – The sample included patients on hormonal
results are applicable to your practice in general, therapies, though their outcomes were not
and this patient in particular. You notice that the separately analyzed or reported.
study had a ratio of approximate 1.6:1, males – The rate of the composite outcome, which
versus females enrolled, and that the patient included asymptomatic DVTs, was lower in
groups had a mean age of about 42 years. Based the 7-day LMWH group compared to the
on your knee arthroscopy patient population, this GCS group.
seems like a close representation. Given that your – The rate of adverse events was similar for
patient is female, however, the sex distribution LMWH compared to nonmedical treatment,
does make you somewhat cautious. Fortunately, though a direct statistical comparison was not
the authors do specifically mention that close to performed.
10% of the patients were using a “hormonal
compound”, such as the OCP. Unfortunately, Overall, you decide that you have a number of
however, these patients’ results were not sepa- concerns about the study which make it difficult
rately reported or analyzed. Interestingly, the to use this paper to help counsel your patient.
study includes a range of arthroscopic procedures, Ultimately, you decide that she is likely quite low
9 Surrogate Endpoints 91
risk, with OCP use being her only risk factor. As References
well, the composite outcome makes it difficult to
assess how much real clinical benefit LMWH 1. Barker RC, Marval P. Venous thromboembolism:
confers. risks and prevention. Contin Educ Anaesthesia, Crit
Your hunch is that LMWH is likely not nec- Care Pain. 2011;11(1):18–23. https://doi.org/10.
1093/bjaceaccp/mkq044.
essary after knee arthroscopy, particularly for
2. Trenor CC 3rd, Chung RJ, Michelson AD,
low risk patients. Thus, you decide to dig a little Neufeld EJ, Gordon CM, Laufer MR, et al. Hor-
bit deeper into the literature and find two very monal contraception and thrombotic risk: a multidis-
helpful articles: A recent meta-analysis of RCTs ciplinary approach. Pediatrics. 2011;127(2):347–57.
https://doi.org/10.1542/peds.2010-2221.
found that anticoagulant therapy did significantly 3. Robb MA, McInnes PM, Califf RM. Biomarkers and
decrease the risk of DVT, but not of symptomatic surrogate endpoints: developing common terminol-
VTE or PE [22]. The other is a large case-control ogy and definitions. JAMA—J Am Med Assoc.
study which found that the use of OCP in women 2016;315(11):1107–8. https://doi.org/10.1001/jama.
2016.2240.
over 46 years old had an odds ratio of 46.6 for 4. Nunes JPL. Statins and the cholesterol mortality
DVT [23]. In the end, you decide that while a PE, paradox. Scott Med J. 2017;62(1):19–23. https://doi.
particularly a fatal one seems quite unlikely, org/10.1177/0036933016681913.
anticoagulation for the prevention of DVT may 5. Aronson JK. Biomarkers and surrogate endpoints.
Br J Clin Pharmacol. 2005;59(5):491–4. https://doi.
be a reasonable choice in this case. You plan to org/10.1111/j.1365-2125.2005.02435.x.
bring this information back to the patient and 6. Bucher H, Cook D, Holbrook A, Guyatt G. Surrogate
decide on a course of action together with her. Outcomes. In: Guyatt G, Rennie D, Meade MO,
Cook DJ, editors. Users’ guides to the medical
literature: a manual for evidence-based clinical
practice, 3rd ed. USA: McGraw-Hill Education;
Conclusion 2015, p. 271–84. (evidence; 2015).
7. Hallén J. Troponin for the estimation of infarct size:
Surrogate endpoints are outcome measures that what have we learned? Cardiology. 2012;121
(3):204–12. https://doi.org/10.1159/000337113.
can be measured directly, objectively, and often 8. Sherwani SI, Khan HA, Ekhzaimy A, Masood A,
inexpensively. They are meant to serve as a more Sakharkar MK. Significance of HbA1c test in
practical substitute for true, patient-important diagnosis and prognosis of diabetic patients. Biomark
outcomes. When applied correctly, surrogate Insights. 2016;11:95–104. https://doi.org/10.4137/
Bmi.s38440.
endpoints are indispensable to conducting 9. Camporese G, Bernardi E, Prandoni P, Noventa F,
well-designed clinical trials. To ensure a surro- Verlato F, Simioni P, et al. Low-molecular-weight
gate endpoint is valid, it must meet two main heparin versus compression stockings for thrombo-
criteria: (1) have a direct causal relationship to prophylaxis after knee arthroscopy: a randomized
trial. Ann Intern Med. 2008;149(2):73–82. https://
the true outcome and (2) have been shown with doi.org/10.7326/0003-4819-149-2-200807150-
strong evidence to be correlated with change in 00003.
the true outcome. It is always important to con- 10. Fu DG. Cardiac arrhythmias: diagnosis, symptoms, and
sider confounding variables, and try to ensure treatments. Cell Biochem Biophys. 2015;73(2):291–6.
https://doi.org/10.1007/s12013-015-0626-4.
that they are not producing misleading results. 11. Pandya B, Spagnola J, Sheikh A, Karam B, Reddy
The downside of surrogate endpoints is that they Anuggu V, Khan A, et al. Anti-arrhythmic medica-
may mislead us on the direction and magnitude tions increase non-cardiac mortality—a meta-
of the risk-benefit analysis, and that they are not analysis of randomized control trials. J Arrhythmia.
2016;32(3):204–11. https://doi.org/10.1016/j.joa.2016.
well-suited to assess the safety of an intervention. 02.006.
When appraising any study, remember to ask: 12. Lantz B. The large sample size fallacy. Scand J
(1) is the study valid? (2) What are the results? Caring Sci. 2013;27(2):487–92. https://doi.org/10.
and (3) Can I apply the results to my practice? 1111/j.1471-6712.2012.01052.x.
92 S. Ekhtiari et al.
13. Weinreb RN, Aung T, Medeiros FA. The Patho- sleep. Orthop Traumatol Surg Res. 2011;97(2):139–
physiology and treatment of glaucoma. JAMA. 2014; 44. https://doi.org/10.1016/j.otsr.2010.12.003.
311(18):1901. https://doi.org/10.1001/jama.2014.3192. 20. Coleman AB, Lam DP, Soowal LN. Correlation,
14. Medeiros FA. Biomarkers and surrogate endpoints: necessity, and sufficiency: common errors in the
lessons learned from glaucoma. Investig Ophthalmol scientific reasoning of undergraduate students for
Vis Sci. 2017;58(6):BIO20–BIO26. https://doi.org/ interpreting experiments. Biochem Mol Biol Educ.
10.1167/iovs.17-21987. 2015;43(5):305–15. https://doi.org/10.1002/bmb.
15. Yudkin FJS, Lipska Robert KJ, Montori VM. The 20879.
idolatry of the surrogate. BMJ. 2012;344(7839). 21. Kelly J, Hunt BJ. Do anticoagulants improve survival
https://doi.org/10.1136/bmj.d7995. in patients presenting with venous thromboem-
16. Korff S, Katus HA, Giannitsis E. Differential diag- bolism? J Intern Med. 2003;254(6):527–39. https://
nosis of elevated troponins. Heart. 2006;92(7):987– doi.org/10.1111/j.1365-2796.2003.01206.x.
93. https://doi.org/10.1136/hrt.2005.071282. 22. Zheng G, Tang Q, Shang P, Pan XY, Liu HX. No
17. Botto F, Alonso-Coello P, Chan MT, Villar JC, effectiveness of anticoagulants for thromboprophy-
Xavier D, Srinathan S, et al. Myocardial injury after laxis after non-major knee arthroscopy: a systemic
noncardiac surgery: a large, international, prospective review and meta-analysis of randomized controlled
cohort study establishing diagnostic criteria, charac- trials. J Thromb Thrombolysis. 2018;45(4):562–70.
teristics, predictors, and 30-day outcomes. Anesthe- https://doi.org/10.1007/s11239-018-1638-x.
siology. 2014;120(3):564–78. https://doi.org/10. 23. van Adrichem RA, Nelissen RGHH, Schipper IB,
1097/ALN.0000000000000113. Rosendaal FR, Cannegieter SC. Risk of venous
18. Lu N, Misra D, Neogi T, Choi HK, Zhang Y. Total thrombosis after arthroscopy of the knee: results
joint arthroplasty and the risk of myocardial infarc- from a large population-based case-control study.
tion—a general population, Propensity score- J Thromb Haemost. 2015;13(8):1441–8. https://doi.
matched cohort study. Arthritis Rheumatol. 2015: org/10.1111/jth.12996.
n/a–n/a. https://doi.org/10.1002/art.39246.
19. Wylde V, Rooker J, Halliday L, Blom A. Acute
postoperative pain at rest after hip and knee arthro-
plasty: severity, sensory qualities and impact on
How to Assess an Article that Deals
with Health-Related Quality of Life 10
Achilles Thoma, Jenny Santos, Margherita Cadeddu,
Eric K. Duku and Charles H. Goldsmith
expectations, standards, and concerns” [1]. outlined by Banfield et al. in Chap. 5 and the
A subcategory of QoL, Health-Related Quality of Waltho et al. Users Guide to the Surgical Literature
Life (HRQL), is a multi-dimensional concept that [6], we searched the literature after creating a clin-
includes domains of physical, mental, emotional, ical research question based on the PICOT format
and social functioning [2]. Since around 1990, (see Chap. 4 on the clinical research question):
there has been an increasing trend to evaluate the
outcomes of health care interventions using – Population: Patients who underwent rectal
sophisticated HRQL scales (instruments, mea- cancer resection.
sures) [3, 4]. The same has been espoused for – Intervention: Early temporary ileostomy
surgical interventions [5]. closure.
With time, we see more and more articles in – Comparative Intervention: Late temporary
surgical literature where the results of novel ileostomy closure.
interventions are reported with HRQL scales – Outcome: HRQL, complications.
rather than the traditional physiologic outcomes. – Time Horizon: 12 months.
Decisions to adopt or reject novel interventions
are based on such reports. It is therefore important From the components of the PICOT format
for surgeons to be familiar with the appraisal of above, the research question is: “In patients who
such articles that measure HRQL before they undergo rectal cancer resection, does early tempo-
incorporate interventions into their practice and rary ileostomy closure lead to improved HRQL and
recommend them to their patients. fewer complications after 12 months”?
Using a filtered database, COCHRANE, we
performed a search using the key words rectal
cancer, surgical resection, RCT and Health-
Clinical Scenario
Related Quality of Life. We did not identify any
articles that were relevant to our research question.
At the last surgical oncology rounds, a surgical
We, therefore, proceeded to use an unfiltered
fellow asked his supervisor why she closed the
database, PubMed, and performed a search using
temporary ileostomy of a 68-year-old male patient
the same keywords and identified 11 articles.
after rectal cancer resection at 12 days in contrast to
After reviewing the titles and abstracts, only a few
her colleagues who tend to close it at 3 months after
seemed relevant. Articles were eliminated if they
the creation of the temporary stoma. The senior
did not address our question. This included studies
surgeon responded that in her opinion, the early
that did not look at HRQL as the primary out-
closure improved the patient’s QoL and was asso-
come, studies that were not comparing surgical
ciated with fewer complications. She challenged the
procedures, as well as studies that were not
surgical fellow to look at the literature evidence and
designed as a randomized controlled trial (RCT).
let her know if she should change her practice.
One of the remaining articles by Park et al.
(2018), which looked at the quality of life in a
randomized trial of early closure of temporary
Literature Search ileostomy after rectal resection for cancer (the
EASY trial), seems the most relevant [7]. In addi-
The ideal article addressing this surgical question tion to the study being based on a prior RCT study,
would be a meta-analysis of randomized con- the interventions were surgical, HRQL was used as
trolled trials (RCTs) comparing early versus late the main outcome and it was quite recent [8].
closure of the temporary ileostomy after rectal Therefore, this is the one appraised in the present
cancer resection. In the absence of such a study chapter. The appraisal of an article with HRQL as
design, one should look for a large single RCT an outcome follows the same format as other
that addresses the same question. Using the skills chapters in this book (see Box 1).
10 How to Assess an Article that Deals with Health-Related … 95
Box 1. Guidelines for the appraisal of an on imaging or some laboratory marker or just
article in the surgical literature that pur- simply by asking patients how they feel. The
ports to be reporting on Health-Related problem here is that we are not certain if we have
Quality of Life considered all aspects of health that patients
consider important. If one looks at the WHO
A. Are the results valid? definition of HRQL, one can see it is
multi-dimensional [1]. One way to address this
Primary guides problem is to administer HRQL questionnaires to
patients before and at various times after the
i. Have the surgical investigators measured surgical intervention and see if there has been a
aspects of patients’ lives that surgical change.
patients consider important? The development of HRQL questionnaires is
ii. Have important aspects of HRQL been an arduous process that may take years to
omitted? develop. Such an instrument requires the com-
iii. Are the HRQL instruments chosen valid, bined effort of methodologists, psychologists,
reliable and responsive? biostatisticians, patients, and other experts.
According to the U.S. Food and Drug Adminis-
Secondary guides tration (FDA), there are four measurable out-
comes; Patient-reported outcomes (PROs),
i. Were HRQL assessments appropriately
clinician-reported outcomes, observer-reported
timed to evaluate the effects of the surgical
outcomes, and performance outcomes [9].
intervention?
Streiner et al. [10] provide an excellent source
ii. If there were trade-offs between quantity
on how to design health-measurement scales. As
and quality of life, did the investigators
mentioned above, it takes a lot of effort to
use an economic analysis?
develop a new HRQL scale. Surgeons do not
have to develop such scales unless one does not
B. What were the results?
exist for a particularly common condition.
i. What was the magnitude of the effect on McDowell provides most of the known QOL
HRQL? questionnaires and surgeons are encouraged to
use this helpful resource [11]. The nomenclature
C. Will the results help me in caring for my of these HRQL instruments has evolved to
patients in my practice? Patient-Reported Outcomes (PROs), a term that
is used more frequently now [12]. Guidelines
i. Will the information from this study help have been developed since 2010 for the devel-
me inform my patients? opment of a new patient-reported (PRO) instru-
ment. For PROs or HRQL instruments to be
valid, they need to have proper psychometric
Primary Guides properties. By this, we mean that the PROs
reliably attach a numerical value to patients’
Have the Investigators Measured feelings so they can be assessed. If they are not
Aspects of Patients’ Lives that Patients constructed by the correct methods they are
Consider Important? invalid and unreliable. Some commonly used
valid HRQL instruments can be found in
As mentioned in the introduction, traditionally, Table 10.1. Although generic PROs provide
surgeons decide if an intervention is successful some important information about both health
based on the restoration of some anatomical status and health utility, they are not always the
defect, reverting laboratory data to normal range, most appropriate choice, if chosen as the sole
successful resection of malignant disease based HRQL instrument [13]. If the quality of life is
96 A. Thoma et al.
greatly affected by a specific disease, such as in EORTC questionnaires are ones that relate
cancer patients, a disease-specific instrument directly to cancer and colorectal cancer patients.
should be used [13]. As mentioned, conditions such as cancer can
In appraising the Park et al. article, we need to greatly impact one’s quality of life and would
assess whether the authors have measured require a disease-specific HRQL instrument. The
aspects of patients’ lives that the patients would investigators took this into account, ensuring
consider important [7]. A temporary iliostomy they chose the correct HRQL measures to assess
may reduce the risk of pelvic sepsis after anas- health status. However, they do not measure
tomotic dehiscence following rectal cancer health utility, an important component of HRQL,
resection. An iliostomy, however, carries with it with instruments such as the EQ-5D or the
certain risks such skin irritation, parastomal HUI-3. A utility instrument is useful since we
infection, leakage outside the appliance bag, can calculate quality-adjusted life years
parastomal hernia, stomal stenosis, stoma pro- (QALYs), an important component of cost-
lapse, and high-volume output that can lead to effectiveness analysis.
acute kidney injury. Certainly, all of the above Although there are only a select few instru-
can affect the quality of life of a patient who ments highlighted here, one should review what
undergoes this procedure. The theoretical the objectives of the study are, the condition they
advantage of late closure after the procedure is are planning to focus on and whether there are
the potential to ensure healing of the rectal colon any disease-specific measures for HRQL before
anastomosis, avoiding dehiscence and pelvic moving forward.
sepsis, and allowing adjuvant chemotherapy to
be undertaken with minimal delay. The disad-
vantages, on the other hand, are the potential Have Important Aspects of HRQL Been
complications related to the ileostomy site men- Omitted?
tioned above. Measuring the HRQL in a com-
parative intervention, which in this case is the Depending on the condition assessed, it is
timing of the closure of the ileostomy, seems important that clinical investigators remain
appropriate. unbiased in assessing HRQL. In other words,
In terms of the HRQL assessment itself, Park their assessment should be comprehensive.
and colleagues administered the Short Form Inclusion of some HRQL issues and exclusion of
Health Survey (SF-36) and two EORTC ques- others will bias the final results. A generic HRQL
tionnaires, QLQ-C30 and QLQ-CR29. The may not capture the specific differences of
SF-36 is a generic questionnaire and both the ileostomy reversal at early versus late time
10 How to Assess an Article that Deals with Health-Related … 97
periods. On the other hand, a cancer-specific or observers) reliability, high intra-observer (within
preferably “ileostomy related PRO”, if one exists the same observers) reliability, or both. These
should be able to discriminate between early and measures provide information about whether the
late reversal periods. It is for these reasons that in measure being used has a tendency to change
surgical trials we should administer three types of depending on who is conducting the test or when
questionnaires, as suggested by Guyatt et al., to it is being conducted. Ideally, we are after the
ensure that we are taking into account all aspects least variable inter- and intra-observer scores.
of the patients’ health and quality of life [14]. As This would mean that regardless of assessor and
mentioned, Park and colleagues did not include a time period, the measure under consideration is
utility measure. They included only the generic providing reliable information on which to base
(SF-36) and subscales (QLQ-C30, QLQ-CR29) conclusions.
of a condition-specific scale the EORTC. Inves- Test–retest reliability is measured by com-
tigators did, however, assess all domains of the paring scores from the same individual and then
SF-36 and QLQ-C30/CR29, ensuring a compre- calculating a correlation between the different
hensive assessment using the instruments they results. Park et al. did not assess test–retest reli-
did include. ability. This is important because it illustrates
Death is always a possibility with any surgical whether results are truly reliable based on whe-
procedure. Although death is rare in the closure ther the calculated correlation is strong. It is
of an ileostomy, it is important to assess if there important to keep in mind that a reliable test is
is difference in the mortality rate among the not always a valid one, but a valid test is always
comparative interventions. Park et al. reported reliable so it is necessary to evaluate both these
one death in each group and in each case, it was qualities of an assessment tool.
attributed to cancer rather than the ileostomy Last, responsiveness has two major aspects:
closure timing [7]. If patients were of an internal and external. Internal responsiveness
employable age, reporting on their ability to entails the ability of a measure to capture the
return to work would be an important issue for change in a participant’s outcome over a pre-
the patients. If carrying an ileostomy bag for a specified timeframe. In the case of Park et al.,
prolonged period of time interfered with a that would be 12 months. External responsive-
patient’s job then this would also be an important ness reflects the extent to which change in a
issue to consider. measure relates to a corresponding change in a
reference measure of clinical or health status [15,
16]. Responsiveness, therefore, tells us whether
Are the HRQL Instruments Chosen or not an assessment tool will be able to detect a
Valid, Reliable and Responsive? change in the outcome (HRQL) over time
(12 months).
For an HRQL instrument to be chosen, it must Park and colleagues use the SF-36, which has
meet three important preconditions; validity, re- validity and reliability that has been supported in
liability, and responsiveness to change. This clinical research [17, 18]. The responsiveness of
ensures that it is measuring HRQL (the outcome) the SF-36 has also been documented in relation
in a way that provides consistent, valid results. In to various common clinical conditions [19]. The
short, the validity of an instrument describes to other measures used by the investigators, the
what degree it measures what it was developed to EORTC QLQ-C30 [20] and QLQ-CR29 [21],
measure. also have evidence of being valid and reliable in
Reliability is assessed by conducting tests of colorectal cancer patients. We conclude that the
repeatability or reproducibility of the measure measures used for the measurement of HRQL by
under consideration. It can be classified as either Park et al. meet the guidelines of validity, relia-
having high interobserver (between different bility, and responsiveness to change.
98 A. Thoma et al.
The quantity versus quality of life is an important Investigators report the similarities or differences
consideration because as mentioned earlier, the between scores obtained from the questionnaires,
focus of surgical interventions in the past was rather than whether one treatment over the other
prolonging life. Over the last decades, it has begun resulted in better outcomes. However, these
to become clear that more years or months of life similarities or differences need to be reported and
does not necessarily translate to a “good” few interpreted in a meaningful way, so conclusions
years or months. Some patients may be willing to about HRQL of the patients following the treat-
forgo longevity if this is associated with pain. ments under evaluation can be made.
10 How to Assess an Article that Deals with Health-Related … 99
Park et al. found that SF-36 dimension scores similar to the late closure group at 12 months.
were similar between the treatment groups, with No statistically significant differences were seen
no differences in the physical and mental com- at 12 months in the dimensions of the
ponent scores. They did find some significant QLQ-CR29 questionnaire (Table 10.4). Signifi-
differences in various other dimension scores cant results are indicated in bold.
including; role physical (3 months), bodily pain The difference in medians observed in relation
(12 months), and mental health (12 months). At to the MCIDs for these dimension scores fall in
12 months, 52–85% of the patients scored higher the 5–10-point interval making them clinically
than the late closure group, with physical func- important even though not statistically important.
tioning scoring the highest among the dimen- The original RCT (the EASY trial) outlines
sions (Table 10.2). In terms of the other the sample size calculation (60 per group) and
measures, EORTC QLQ-C30 and QLQ-CR29, stated 80% power to detect a 62.5% reduction in
scores were comparable between early and late annual mean number of complications [8]. Park
closure groups. Emotional functioning was lower et al. concluded that no significant differences
in the early closure group at 3 and 6 months, but were observed in HRQL within 12 months after
rectal resection for cancer when early and late developers of these measures, and these results
closure of temporary ileostomy was compared. suggest the distributions of data are skewed.
This was based on 5–10-point difference as a The lack of statistically significant findings
little change and 10–20-point difference as a may also be due to the fact that the authors
moderate change [7]. We can conclude that the included many dimension scores in the model
results are valid as investigators had an adequate rather than choosing categories (such as mental
sample size. component score and physical component score).
The values presented in these tables seem to These are a few of the many things to consider
be quite high, with many intervals for the when using questionnaires like these ones.
dimension scores including a 100 (perfect state). For those of you who are statistically
Park et al. state that dimension scores were cal- inclined, you will notice a few things about
culated based on guidelines given by the Tables 10.2, 10.3 and 10.4. First, the original
Table 10.3 EORTC QLQ-C30 scores at 3, 6, and 12 months after rectal resection
Dimensions 3 months 6 months 12 months
Median (Q1–Q3) p Median (Q1–Q3) p Median (Q1–Q3) p
Global Quality of Life
Early 75 (50–83.3) 0.941 66.7 (50–83.3) 0.961 83.3 (50–91.7) 0.889
Late 66.7 (58.3–83.3) 66.7 (66.7–83.3) 83.3 (66.7–91.7)
Physical Functioning
Early 93.3 (73.3–100) 0.634 93.3 (80–100) 0.433 93.3 (73.3–100) 0.137
Late 93.3 (73.3–100) 93.3 (80–100) 100 (80–100)
Role Functioning
Early 83.3 (66.7–100) 0.066 100 (66.7–100) 0.503 100 (66.7–100) 0.793
Late 66.7 (50–100) 83.3 (66.7–100) 100 (66.7–100)
Emotional Functioning
Early 83.3 (66.7–100) 0.023 83.3 (66.7–100) 0.031 91.7 (66.7–100) 0.409
Late 91.7 (83.3–100) 91.7 (75–100) 91.7 (83.3–100)
Cognitive Functioning
Early 100 (83.3–100) 0.447 83.3 (83.3–100) 0.131 100 (66.7–100) 0.652
Late 100 (83.3–100) 100 (83.3–100) 100 (83.3–100)
Social Functioning
Early 83.3 (66.7–100) 0.583 83.3 (66.7–100) 0.882 83.3 (66.7–100) 0.142
Late 83.3 (66.7–100) 83.3 (66.7–100) 100 (66.7–100)
Table 10.4 EORTC QLQ-CR29 scores for functional scales at 3, 6, and 12 months after rectal resection
Dimensions 3 months 6 months 12 months
Median (Q1–Q3) p Median (Q1–Q3) p Median (Q1–Q3) p
Urinary Frequency
Early 16.7 (0–33.3) 0.323 16.7 (0–50) 0.353 16.7 (0–33.3) 0.268
Late 16.7 (0–50) 16.7 (8.3–41.7) 33.3 (0–50)
Stool Frequency
Early 33.3 (16.7–50) <0.001 33.3 (16.7–50) 0.068 33.3 (16.7–50) 0.611
Late 0 (0–16.7) 16.7 (0–66.7) 33.3 (16.7–50)
Body Image
Early 88.9 (66.7–100) 0.715 88.9 (77.8–100) 0.364 94.4 (77.8–100) 0.502
Late 77.8 (66.7–100) 88.9 (66.7–100) 100 (88.9–100)
10 How to Assess an Article that Deals with Health-Related … 101
tables presented by the authors use the abbre- that a difference of 5–10 points be considered a
viation of “i.q.r.” rather than “IQR” and the “little change” and a difference of 10–20 points a
values presented were not the IQR. An IQR is a “moderate change” [7]. These standards were
single number, which represents the difference chosen as the study took place in Sweden and
between the lower (Q1) and upper (Q3) quar- Denmark. When they compared their SF-36®
tiles. Therefore, Park et al. presented the scores with Swedish reference data, a general
dimension scores varying from the lower quar- improvement was seen during the 12-month
tile to the upper quartile. This was changed in follow-up interval.
Tables 10.2, 10.3, and 10.4 to indicate this.
Second, the authors interchanged mean and
median values in their results section and then Will the Results Help Me in Caring
presented median values for their tables. This for My Patients?
causes some confusion as to whether the authors
examined the differences in dimension scores Will the Information from This Study
based on the median values reported, as mean Help Me Inform My Patients?
values were not reported. This needs to be clar-
ified and stated explicitly for the reader. In the Surgeons know from their experience that
field of colorectal cancer patients and patients who undergo similar surgical interven-
health-related quality of life, mean values for tions can respond differently. To determine
measures such as the SF-36 and EORTC whether our patients will be helped by the results
QLQ-C30 are often reported, rather than the of the Park et al. study we would like to know if
median. However, as authors reported median the patients in the Park et al. study are similar to
and wanted to show the IQR, we can assume that our patient. We, therefore, look at the table that
there were outliers as reporting these statistics is describes the demographic characteristics of their
common when outliers are present in the data. study group. We look to see if the Park et al.
As previously mentioned, patient-reported study patients were similar in age to our patient
outcomes (PROs) have become increasingly uti- and any other prognostic factors. Our patient is
lized in clinical research. The questionnaires 68 years in age and the mean patient age in the
used in Park et al. (SF-36, EORTC QLQ-C30, Park et al. study was 67 years, which is similar to
and EORTC QLQ-CR29) are a great way to ours. If we believe that their patient group
measure PROs. Interpreting the scores of these (Scandinavian) is different from ours (North
questionnaires properly is immensely important, American), we may be skeptical of the general-
as this is the information that is used to base the izability of their findings. If we agree that there
decision of whether a novel treatment will be are no major differences in these two popula-
used over the current gold-standard treatment. tions, we should accept their findings. As we do
A way to interpret scores or responses is by using not have any evidence of major differences, we
a minimal clinical important difference (MCID) should accept their findings.
that has been supported in previous literature for
the surveys in question. The MCID has been
defined as “the smallest difference in score in the Resolution of the Scenario
domain of interest in which patients perceive as
beneficial” [24]. The results from the Park et al. [7] study indicate
Clinicians use the MCID to define the level of that there is no difference in the HRQL of
change an intervention creates to determine patients whether they undergo early or late clo-
whether that change is a clinically important sure of the ileostomy, even though the early
change in the population [24]. Park et al. reported closure was associated with fewer complications.
they used the suggestion from a Swedish study This clinical advantage had no effect on the
102 A. Thoma et al.
patients’ HRQL. The surgical fellow informs the 11. McDowell I. Measuring health: a guide to rating
senior surgeon that in fact what she doing is just scales and questionnaires. 3rd ed. Oxford: Oxford
University Press; 2006.
fine! 12. Deshpande PR, Rajan S, Sudeepthi Nazir CPA.
Patient-reported outcomes: a new era in clinical
research. Perspect Clin Res. 2011;2(4):137–44.
References 13. Litwin MS. Health-related quality of life. In: Pen-
son DF, Wei JT, editors. Chapter 13: Clinical
research for surgeons. Totowa, NJ: Humma Press;
1. World Health Organization. WHOQOL: Measuring 1998.
Quality of Life. [Internet]. Geneva; 2018 [cited 2018 14. Guyatt GH, Naylor CD, Juniper E. Users’ guides to
Jan 10]. Available from: http://www.who.int/ the medical literature XII. How to use articles about
healthinfo/survey/whoqol-qualityoflife/en/. health-related quality of life. JAMA. 1997; 277:
2. Office of Disease Prevention and Health Promotion. 1232–7.
Health-Related Quality of Life and Well-Being. 15. Thoma A, Cornacchi SD, Lovrics PJ, Goldsmith CH.
[Internet]. Washington, DC; 2010 [cited 2018 Jan Users’ guide to the surgical literature: how to assess
10]. Available from: https://www.healthypeople.gov/ an article on health-related quality of life. Can J Surg.
2020/about/foundation-health-measures/Health- 2008;51(3):215–24.
Related-Quality-of-Life-and-Well-Being. 16. Husted JA, Cook RJ, Farewell VT, Gladman DD.
3. Kirshner B, Guyatt G. A methodological framework Methods for assessing responsiveness: a critical
for assessing health indices. J Chronic Dis. review and recommendations. J Clin Epidemiol.
1985;38:27–36. 2000;53:459–68.
4. Wood-Dauphinee S. Quality of life assessment: 17. Stansfeld SA, Roberts R, Foot SP. Assessing the
recent trends in surgery. Can J Surg. 1996;39: validity of the SF-36 general health survey. Qual Life
368–72. Res. 1997;6(3):217–24.
5. Wood-Dauphinee S. Assessing quality of life in 18. Jenkinson C, Wright L, Coulter A. Criterion validity
clinical research: from where have we come and and reliability of the SF-36 in a population sample.
where are we going? J Clin Epidemiol. 1999;52: Qual Life Res. 1994;3:7–12.
355–63. 19. Garratt AM, Ruta DA, Abdalla MI, Russell IT. SF 36
6. Waltho D, Kaur MN, Haynes RB, Farrokhyar F, health survey questionnaire: II. Responsiveness to
Thoma A. Users’ guide to the surgical literature: hot changes in health status in four common clinical
to perform a high-quality literature search. Can J conditions. Qual Health Care. 1994;3(4):186–92.
Surg. 2015;58(5):349–58. 20. Ganesh V, Agarwal A, Popovic M, Bottomley A,
7. Park AK, Danielsen E, Angenete D, Bock AC, McDonald R, Vuong S, et al. Comparison of the
Marinez E, Haglind JE, et al. Quality of life in a FACT-C, EORTC QLQ-CR38, and QLQ-CR29
randomized trial of early closure of temporary quality of life questionnaires for patients with
ileostomy after rectal resection for cancer (EASY colorectal cancer: a literature review. Support Care
trial). Wiley Online Libr. 2018;105:244–51. Cancer. 2016;24(8):3661–8.
8. Danielsen AK, Park J, Jansen JE, Bock D, Skull- 21. EORTC. EORTC QLQ-C30. [Internet]. Brussels;
man S, Wedin A, et al. Early closure of a temporary 2017 [cited 2018 Jan 5]. Available from: http://
ileostomy in patients with rectal cancer: a Multicenter groups.eortc.be/qol/eortc-qlq-c30.
randomized controlled trial. Ann Surg. 2017;265 22. EuroQoL Group. EQ-5D-5L. [Internet]. Rotterdam;
(2):284–90. 2009 [cited 2018 Jan 5]. Available from: https://
9. US Food and Drug Administration. Clinical Outcome euroqol.org/eq-5d-instruments/eq-5d-5l-about/.
Assessment Qualification Program. [Internet]. Silver 23. Horsman J, Furlong W, Feeny D, Torrance G. The
Spring, Maryland; 2018 [cited 2018 Jan 10]. Avail- health utilities index (HUI): concepts, measurement
able from: https://www.fda.gov/Drugs/ properties and applications. Health Qual Life Out-
DevelopmentApprovalProcess/ comes. 2003;1(54):1–13.
DrugDevelopmentToolsQualificationProgram/ 24. Jaeschke R, Singer J, Guyatt GH. Measurement of
ucm284077.html. health status: ascertaining the minimal clinically
10. Streiner DL, Norman RG, Cairney J. Health mea- important difference. Control Clin Trials. 1989;10:
surement scales: a practical guide to their develop- 407–15.
ment and use. 5th ed. Oxford: Oxford University
Press; 2014.
Randomized Controlled Trial
Comparing Surgical Interventions 11
Max Solow, Raman Mundi, Vickas Khanna
and Mohit Bhandari
“If I had my initial fracture fixed with a different hip screw and 551 patients to receive cancellous
implant, would it still have failed?” Unsure ini- screws. The primary outcome of this study was
tially of how to answer this question, you decide the need for reoperation within 24 months. Other
to review the literature and assure her that at her important outcomes included mortality, fracture
next clinical appointment, you will provide her healing, complications, and health-related quality
with the best possible answer. of life scores. The mean length of follow-up was
633 days (standard deviation [SD] 208 days).
Reoperations within 24 months, mortality,
Literature Search fracture healing, implant failures, nonunions,
infections, medically related adverse advents,
You begin your literature search by creating a and health-related quality of life scores did not
well-defined research question that encompasses differ in either treatment arm. However, more
several aspects of the clinical scenario. Using the cases of avascular necrosis were observed in the
PICO(T) format [4], which incorporates infor- sliding hip screw group than in the cancellous
mation about the patient population (P), the in- screws group (50 patients [9%] vs. 28 patients
tervention (I), comparative interventions (C), the [5%]; HR 1.91, 1.06–3.44; p = 0.0319). Like-
outcome (O), and time period (T), you generate wise, prespecified subgroup analyses showed that
the following research question: In femoral neck sliding hip screws are favored in patients with
fracture patients, does fixation with a sliding hip displaced fractures, fractures at the base of the
screw lead to higher revision rates compared with femoral neck, and in patients who currently
other methods of fixation? Using Medline, the smoke.
National Library of Medicine’s PubMed database
[5], you enter “femoral neck fracture” AND
“reoperation” in the search field. You limit your Evaluating a Randomized Controlled
search to English articles, published in the last Trial
3 years, clinical trials, and human subjects. The
search yields five articles [6–10]. Three of the When reading a research article, it is always
articles do not compare fixation methods [7, 8, important to critique whether or not the study
10] and one article compares hemiarthroplasty to was carried out in a manner that would produce
internal fixation [6]. The article, “Fracture fixa- reliable results. Notably, three questions should
tion in the operative management of hip fractures be asked: Are the results valid? What are the
(FAITH): an international, multicentre random- results? And are the results applicable to my
ized controlled trial” compares two methods of practice (see Box 1) [11]?
femoral neck fracture fixation and seems to
address your clinical question.
Box 1. Important points to consider
when evaluating a surgical RCT
Summary of the Appraised Article
Are the results valid?
• Was the learning curve taken into
The FAITH trial [9] was conducted to compare
consideration?
the sliding hip screw to cancellous screw fixation
• How was randomization and allocation
for patients with low-energy femoral neck frac-
concealment performed?
tures. The study was an international, multicen-
• Who was blinded?
tre, and randomized controlled trial, which
• Were the patient groups similar to one
included 1108 patients across 81 clinical centers
another?
from 8 countries. Following randomization,
• How were patients’ data analyzed?
557 patients were assigned to receive a sliding
11 Randomized Controlled Trial Comparing Surgical Interventions 105
Fig. 11.1 Bias introduced when using a per-protocol more than double that of the control group. The
analysis versus an intention-to-treat analysis. Per-protocol intention-to-treat analysis analyzes patients in their orig-
analysis includes all those patients who received van- inal groups irrespective of the treatment they actually
comycin, irrespective of their initial assignment. With received. Under this model, we see that the event rates are
this analysis, the event rate in the experimental group is equal. R = randomization; TKA = total knee arthroplasty
in the ancef-only group would benefit from the postoperative thromboprophylaxis, postoperative
local vancomycin powder due to intraoperative weight-bearing status, and rehabilitation). For
complications. If 10 of the 100 patients in the example, differences in postoperative care could
initial experimental group develop an infection, introduce bias if one group were given extra
plus 5 of the 10 patients that switched group’s physiotherapy sessions. Standardization tries to
perioperatively go on to develop an infection, address any bias that may be introduced in
then the event rate would be 14%; however, the studies with multiple components by attempting
rate in the control group would be 6% to keep things as consistent as possible [31]. In
(Fig. 11.1). These values represent a spurious the FAITH trial [9], patient positioning, fracture
reduction in infection rates for the control reduction, and surgical exposure were left to the
group. Intention-to-treat analysis eliminates this surgeons’ discretion. However, surgeons were
potential bias by analyzing patients in an RCT given specific criteria for acceptability of post-
according to the treatment arm they were origi- fixation radiographic fracture alignment. The
nally assigned to, irrespective of the treatment authors also provide supplemental materials in
they actually received [29, 30]. In our example, if the appendix regarding procedural and rehabili-
patients were analyzed with an intention-to-treat tative standardization. Therefore, it is safe to
model the event rates would be 10/100 for both assume that the necessary steps were taken to
groups. Incorporating this method of analysis address any potential variability.
into a study eliminates any bias that may arise
from participant attrition or crossover. In the Sample Size Calculation and Follow-Up
FAITH trial [9], the authors state that an A predetermined sample size calculation is inte-
intention-to-treat analysis was employed. gral to the conduct of an RCT (see Chap. 29).
A study with too few participants may fail to
Treatment Standardization reach its objective by lacking capacity to answer
Standardization of interventions is important in the primary study question due to being statisti-
surgical trials to ensure treatment effects are not cally underpowered. Conversely, a study should
biased due to differential care between groups not simply recruit an excessive number of patients
outside of the main surgical intervention (i.e., as this would overpower the study and may result
preoperative antibiotics, perioperative care, in findings of statistical significance that are not
11 Randomized Controlled Trial Comparing Surgical Interventions 109
actually clinically important [32]. As such, RCTs common terminology used to report data are
should have a sample size calculation based upon relative risk (RR), relative risk reduction (RRR),
a determined minimally clinically important dif- absolute risk reduction (ARR), and the number
ference and desired study power [32]. Sample needed to treat (NNT) (Box 2). The RR describes
size calculations provide a study with adequate the probability of an event occurring in one
power to detect the minimally clinically important group of people versus another [36, 37] (see
differences for outcomes used in the calculation. Chap. 6 for further explanation). In our study, the
However, subgroup analysis is not included in RR would tell us the probability of reoperation
this calculation. within 24 months in patients receiving a sliding
Another caveat in any study is participant hip screw (experimental) versus patients receiv-
attrition. Patients that are lost to follow-up ing cancellous screws (control). The RRR mea-
threaten the validity of the study since we do sures how much risk is reduced in the
not know their outcomes, whether or not they experimental group versus the control group
died, or if demographic differences exist between [38]. Consider a hypothetical situation compar-
them and the study group [33]. While there is not ing stroke rates in hypertensive patients receiving
any hard and fast cutoff value at which attrition an intensive treatment regimen (experimental)
related bias becomes apparent, it is generally versus the standard of care (control). If 10% of
accepted that 5% loss to follow-up is of little the control group experienced a stroke, compared
worry, a loss of 20% should raise concern, and a to 5% of patients in the treatment group, it can be
loss between 5 and 20% may still produce bias said that the intensive treatment regimen resulted
but to a lesser degree [34]. To maintain the in a relative risk reduction of 50% (Table 11.2).
predetermined study power and avoid skewing The ARR describes the absolute difference in
study results, it is important to consider the event rates between the control and treatment
anticipated loss to follow-up when calculating a groups [38]. Using our hypothetical situation, the
trials minimum sample size in the planning phase ARR for strokes would be 5%. The NNT refers
of the study [35] (see Chap. 29 for further to the number of patients that would need to be
explanation). In the FAITH trial [9], the original treated in order to prevent one additional adverse
sample size calculation found that enrolment of event from occurring and can be thought of as
1500 patients would give the trial a study power the inverse of the ARR [39, 40]. Referring to our
of 81.5%. However, upon reanalysis of com- hypothetical example once more, for every 20
pleted follow-up data from the first 589 patients, patients treated with the intensive treatment reg-
it was found that a sample size of 1100 patients imen, 1 stroke would be prevented. In the FAITH
would provide 95.7% power to detect a relative trial [9], the authors report that reoperations of
risk reduction of 35%. The authors were able to any kind within 24 months were roughly equal in
recruit 1108 patients. The authors validate their the sliding hip screw group versus the cancellous
study by providing details of sample size calcu- screw group and there was no minimally clini-
lations and patient exclusion rationale in the cally important difference or statistical signifi-
appendix. cance (20% vs. 22%, p = 0.18); however,
implant removal took place significantly less
frequently in the sliding hip screw group than in
What Are the Results? the cancellous screws group although the find-
ings were not clinically significant (5% vs. 9%,
What Was the Impact p = 0.0009). The authors also report that there
from the Treatment? were no statistically or clinically significant dif-
Now that the study has been conducted, we want ferences between the two groups in terms of
to know what the results mean and their clinical mortality, fracture healing, complications, and
importance and statistical significance. Some health-related quality of life scores.
110 M. Solow et al.
Table 11.2 Sample calculations from the hypothetical unbeknownst to researchers that may impact the
hypertensive drug trial study. The best we can do is come up with a
Experimental Control group close estimate, known as the point estimate, of
group the true value that would lie within that ballpark.
Total 124 130 To communicate the point estimate, researchers
number provide a variety of values, known as the confi-
of
patients dence interval (CI), that specifies a probability
within which one can be confident the true value
Number 6 13
of lies [41, 42]. By convention, a 95% CI is gen-
strokes erally used, meaning we can be 95% certain that
Event x = 6/124 = 0.05 y = 13/130 = 0.10 the “true” value lies within the interval (see
rates (5%) (10%) Chap. 28 for more information).
Relative risk: x/y = 0.05/0.10 = 0.50 (50%) In the FAITH trial [9], results are reported
Relative risk reduction: (x − y)/y = (0.05 − 0.10)/ using hazard ratios (HR), the chance of an event
0.10 = 0.50 (50%) occurring in the treatment arm divided by the
Absolute risk reduction: x − y = 0.05 − 0.10 = 0.05 chance of the event occurring in the control arm
(5%) [43], CIs, and p values. They report an HR of
Number needed to treat: 1/(x − y) = 1/ 0.83 with a 95% CI of 0.63–1.09 and a p-value of
(0.05 − 0.10) = 20 0.18. This means that the patients who received a
Patients in the control group were 50% more likely to sliding hip screw are less likely to require a
experience a stroke (RR). Similarly, the experimental reoperation and that the true value for this rate
intervention reduced the risk of stroke by 50% (RRR).
Likewise, for every 100 patients treated with the
lies between 0.63 and 1.09. However, the dif-
experimental intervention, there would be 5 fewer ference in event rates between the two groups is
strokes (ARR). Furthermore, in order to prevent one not statistically or clinically significant.
stroke, 20 patients would need to be treated with the
experimental intervention (NNT).
What Now?
Generalizability
Box 2. Equations for common statistic
Before we decide to implement a particular in-
terminology
tervention, we need to assess how relevant the
Relative Risk ðRRÞ Absolute Risk Reduction ðARRÞ information is to our patient population. In our
RR ¼ xy AAR ¼ x y clinical scenario, our patient is a 74-year-old
Relative Risk Reduction ðRRRÞ Number Needed to Treat ðNNTÞ
RRR ¼ ðxyÞ NNT ¼ AAR 1 woman who sustained a femoral neck fracture
y
due to a fall. In the FAITH trial [9], they enrolled
patients who were 50 years or older with a
x—# of events in experimental
low-energy femoral neck fracture requiring
group/total # of patients in experimental
operative fixation. Looking at the table of patient
group.
baseline characteristics, we also see that the mean
y—# of events in control group/total #
age of patients was 72.1 years and that 61% of
of patients in control group.
the study participants were female, therefore
these findings could apply to our patient.
Clinical Relevance
How Precise Were the Results? It is important to consider whether or not the
It is impossible to know the “true” reduction to outcomes that the researchers examined are
reoperation rates within 24 months caused by the clinically relevant. For example, a study com-
use of a sliding hip screw because of variables paring two treatment modalities for rotator cuff
11 Randomized Controlled Trial Comparing Surgical Interventions 111
tears that looked at “return to sport time” would cancellous screws would be acceptable treatments
be clinically useful for a 20-year-old pitcher, but for low-energy femoral neck fractures in terms of
less relevant for a 74-year-old patient. In the the primary endpoints assessed (reoperation
FAITH trial [9], the outcomes assessed included within 24 months, death, complications, and
reoperation within 24 months, mortality, fracture quality of life). Although our patient in the clin-
healing, complications, and health-related quality ical scenario experienced an implant failure
of life scores. All of these outcome measures within a year, and will likely require reoperation,
would be relevant to both physicians and patients we can be confident in answering her that her
alike, as they are clinically important with failure likely was not due to the choice of implant.
important implications on the ultimate outcome.
For instance, revision surgeries tend to be more
technically complex, impose added expenses to
References
the health care system, and expose patients to
further risks and potential perioperative morbid-
1. National Insititue of Health (NIH). What we do:
ity [44]. budget [Internet]. 2018 Apr 11. [cited 2018 Aug 9].
Available from: https://www.nih.gov/about-nih/
What Does This Mean for Me what-we-do/budget#note.
2. Graham R, Mancer M, Wolman Miller D, Green-
as a Healthcare Provider?
field S, Steinberg E, editors. Clinical practice guide-
Now that we have assessed the evidence pro- lines we can trust. Washington, DC: The National
vided to us, how do we use it? For medical Academies Press; 2011. p. 290.
doctors, this may be straightforward because if a 3. Guyatt G, Cairns J, Churchill D, Cook D, Haynes B,
Hirsh J, et al. Evidence-based medicine. A new
study shows that drug A is more efficacious than
approach to teaching the practice of medicine.
drug B, they can simply begin prescribing drug JAMA. 1992;268(17):2420–5.
B. However, implementing new evidence poses 4. Richardson WS, Wilson MC, Nishikawa J, Hay-
potential challenges for surgeons. If a study ward RS. The well-built clinical question: a key to
evidence-based decisions. ACP J Club. 1995;123(3):
shows that a particular intervention produces
A12–3.
superior outcomes, how do they go about 5. Motschall E, Falck-Ytter Y. Searching the MED-
implementing it? Surgeons need to objectively LINE literature database through PubMed: a short
critique their expertise and proficiency with a guide. Onkologie. 2005;28(10):517–22.
6. Lu Q, Tang G, Zhao X, Guo S, Cai B, Li Q.
procedure and earnestly consider if they would Hemiarthroplasty versus internal fixation in
be happy with the level of care provided. It super-aged patients with undisplaced femoral neck
would be unethical for a surgeon to perform a fractures: a 5-year follow-up of randomized con-
procedure they are not familiar with as patients trolled trial. Arch Orthop Trauma Surg. 2017;137
(1):27–35.
would potentially experience a higher complica- 7. Bhandari M, Jin L, See K, Burge R, Gilchrist N,
tion rate [45–47]. If a surgeon is not comfortable Witvrouw R, Krohn KD, Warner MR, Ahmad QI,
performing the procedure they can: (1) refer the Mitlak B. Does teriparatide improve femoral neck
patient to a colleague, (2) seek additional train- fracture healing: results from a randomized
placebo-controlled trial. Clin Orthop Relat Res.
ing, or (3) perform a different procedure after 2016;474(5):1234–44.
considering the evidence for it. Any of these 8. Sprague S, Slobogean GP, Bogoch E, Petrisor B,
options would be sufficient and the decision Garibaldi A, O’Hara N, Bhandari M, FAITH Inves-
ultimately lies with each surgeon. tigators. Vitamin D use and health outcomes after
surgery for hip fracture. Orthopedics. 2017;40(5):
e868–75.
9. Fixation using Alternative Implants for the Treatment
Resolving Our Clinical Scenario of Hip fractures (FAITH) Investigators. Fracture
fixation in the operative management of hip fractures
(FAITH): an international, multicentre, randomised
After thoughtfully examining the information controlled trial. Lancet. 2017;389(10078):1519–27.
provided to us from the FAITH trial [9], it is fair 10. Watts CD, Houdek MT, Sems SA, Cross WW,
to conclude that either a sliding hip screw or Pagnano MW. Tranexamic acid safely reduced blood
112 M. Solow et al.
loss in hemi- and total hip arthroplasty for acute traumatic anterior dislocations of the shoulder:
femoral neck fracture: a randomized clinical trial. long-term evaluation. Arthroscopy. 2005;21(1):
J Orthop Trauma. 2017;31(7):345–51. 55–63.
11. Guyatt G, Rennie D, Meade MO, Cook DJ. User’s 28. Farrokhyar F, Bajammal S, Kahnamoui K, Bhan-
guides to the medical literature: essentials of dari M. Practical tips for surgical research. Ensuring
evidence-based clinical practice. 2nd ed. Hamilton, balanced groups in surgical trials. Can J Surg.
ON: McGraw-Hill Professional; 2008. p. 380. 2010;53(6):418–23.
12. Hopper AN, Jamison MH, Lewis WG. Learning 29. Herman A, Botser IB, Tenenbaum S, Chechick A.
curves in surgical practice. Postgrad Med J. 2007;83 Intention-to-treat analysis and accounting for missing
(986):777–9. data in orthopaedic randomized clinical trials. J Bone
13. Maruthappu M, Duclos A, Lipsitz SR, Orgill D, Joint Surg Am. 2009;91(9):2137–43.
Carty MJ. Surgical learning curves and operative 30. Montori VM, Guyatt GH. Intention-to-treat principle.
efficiency: a cross-specialty observational study. CMAJ. 2001;165(10):1339–41.
BMJ Open. 2015;5(3):e006679. 31. Blencowe NS, Cook JA, Pinkney T, Rogers C,
14. Devereaux PJ, Bhandari M, Clarke M, Montori VM, Reeves BC, Blazeby JM. Delivering successful
Cook DJ, Yusuf S, et al. Need for expertise based randomized controlled trials in surgery: methods to
randomised controlled trials. BMJ. 2005;330 optimize collaboration and study design. Clin Trials.
(7482):88. 2017;14(2):211–8.
15. Cook JA. The challenges faced in the design, conduct 32. Zhong B. How to calculate sample size in random-
and analysis of surgical randomised controlled trials. ized controlled trial? J Thorac Dis. 2009;1(1):51–4.
Trials. 2009;10:9. 33. Dumville JC, Torgerson DJ, Hewitt CE. Reporting
16. Cook JA, Elders A, Boachie C, Bassinga T, Fraser C, attrition in randomised controlled trials.
Altman DG, et al. A systematic review of the use of BMJ. 2006;332(7547):969–71.
an expertise-based randomised controlled trial 34. Fergusson D, Aaron SD, Guyatt G, Hébert
design. Trials. 2015;16:241. P. Post-randomisation exclusions: the intention to
17. Kim J, Shin W. How to do random allocation treat principle and excluding patients from analysis.
(randomization). Clin Orthop Surg. 2014;6(1):103–9. BMJ. 2002;325(7365):652–4.
18. Schulz KF. Subverting randomization in controlled 35. Maggard MA, O’Connell JB, Liu JH, Etzioni DA,
trials. JAMA. 1995;274(18):1456–8. Ko CY. Sample size calculations in surgery: are they
19. Boutron I, Estellat C, Guittet L, Dechartres A, done correctly? Surgery. 2003;134(2):275–9.
Sackett DL, Hróbjartsson A, et al. Methods of 36. Yelland LN, Salter AB, Ryan P. Relative risk
blinding in reports of randomized controlled trials estimation in randomized controlled trials: a com-
assessing pharmacologic treatments: a systematic parison of methods for independent observations.
review. PLoS Med. 2006;3(10):e425. Int J Biostat. 2011;7(1):1–31.
20. Kaptchuk TJ. Powerful placebo: the dark side of the 37. Sistrom CL, Garvan CW. Proportions, odds, and risk.
randomised controlled trial. Lancet. 1998;351 Radiology. 2004;230(1):12–9.
(9117):1722–5. 38. Irwig L, Irwig J, Trevena L, Sweet M. Smart health
21. Zhang W, Robertson J, Jones AC, Dieppe PA, choices: making sense of health advice. London:
Doherty M. The placebo effect and its determinants Hammersmith Press; 2008. p. 1–242.
in osteoarthritis: meta-analysis of randomised con- 39. Haas M, Schneider M, Vavrek D. Illustrating risk
trolled trials. Ann Rheum Dis. 2008;67(12):1716–23. difference and number needed to treat from a
22. van der Linden W. Pitfalls in randomized surgical randomized controlled trial of spinal manipulation
trials. Surgery. 1980;87(3):258–62. for cervicogenic headache. Chiropr Osteopat.
23. Prescott RJ, Counsell CE, Gillespie WJ, Grant AM, 2010;18:9.
Russell IT, Kiauka S, et al. Factors that limit the 40. Bender R. Calculating confidence intervals for the
quality, number and progress of randomised controlled number needed to treat. Control Clin Trials. 2001;22
trials. Health Technol Assess. 1999;3(20):1–143. (2):102–10.
24. Karanicolas PJ, Farrokhyar F, Bhandari M. Practical 41. Flechner L, Tseng TY. Understanding results:
tips for surgical research: blinding: who, what, when, P-values, confidence intervals, and number need to
why, how? Can J Surg. 2010;53(5):345–8. treat. Indian J Urol. 2011;27(4):532–5.
25. Miller LE, Stewart ME. The blind leading the blind: 42. Sedgwick P. Randomised controlled trials: inferring
use and misuse of blinding in randomized controlled significance of treatment effects based on confidence
trials. Contemp Clin Trials. 2011;32(2):240–3. intervals. BMJ. 2014;349:g5196.
26. Randelli P, Arrigoni P, Lubowitz JH, Cabitza P, 43. Trinquart L, Jacot J, Conner SC, Porcher R. Com-
Denti M. Randomization procedures in orthopaedic parison of treatment effects measured by the hazard
trials. Arthroscopy. 2008;24(7):834–8. ratio and by the ratio of restricted mean survival
27. Kirkley A, Werstine R, Ratjek A, Griffin S. Prospec- times in oncology randomized controlled trials.
tive randomized clinical trial comparing the effec- J Clin Oncol. 2016;34(15):1813–9.
tiveness of immediate arthroscopic stabilization 44. Bhandari M, Jon Smith, Larry E. Miller, Jon E.
versus immobilization and rehabilitation in first Block. Clinical and economic burden of revision
11 Randomized Controlled Trial Comparing Surgical Interventions 113
knee arthroplasty. Clin Med Insights Arthritis Mus- hospital volume and outcomes for shoulder arthro-
culoskelet Disord. 2012;(5):89–94. plasty. J Bone Joint Surg Am. 2004;86–A(3):496–
45. Hammond JW, Queale WS, Kim TK, McFarland EG. 505.
Surgeon experience and clinical and economic out- 47. Hervey SL, Purves HR, Guller U, Toth AP, Vail TP,
comes for shoulder arthroplasty. J Bone Joint Surg Pietrobon R. Provider volume of total knee arthro-
Am. 2003;85–A(12):2318–24. plasties and patient outcomes in the HCUP-
46. Jain N, Pietrobon R, Hocker S, Guller U, Shankar A, nationwide inpatient sample. J Bone Joint Surg
Higgins LD. The relationship between surgeon and Am. 2003;85–A(9):1775–83.
How to Assess a Pilot Trial in Surgery
12
Guowei Li, Gillian A. Lancaster and Lehana Thabane
A hip fracture leads to much pain, bleeding, For a long time, there has been no consensus on
immobility and subsequent complications. Early the differences in definition of pilot and feasi-
surgery may reduce morbidity and mortality due bility studies: some consider the terms pilot and
to hip fractures. However, most patients with a feasibility as being synonymous, while others
hip fracture have to wait a long time to receive argue that these two terms have related but dif-
surgery mainly because of delayed medical ferent meanings [1]. Multiple terms such as pilot
clearance and operation room access. Therefore, study, feasibility trial, pilot work, pilot trial and
you hope to run a multicenter randomized con- feasibility investigation, have been used without
trolled trial (RCT) to assess whether accelerated clear distinction among them [2]. However, in
care (i.e. accelerated medical clearance plus 2016, as part of the process of developing the
surgical access) would cause better outcomes CONSORT extension to pilot trials [3, 4],
compared with standard care (i.e. regular medical Eldridge et al. introduced a conceptual frame-
clearance plus surgery). But before you begin, work and published definitions of what consti-
you are uncertain about whether it is feasible to tutes a feasibility study and a pilot study.
perform such a large clinical trial. You are con- A feasibility study is viewed as an overarching
cerned that you may not be able to recruit a concept of uncertainty about aspects of a future
sufficient number of willing patients and whether study within which three distinct types of study
they will be able to complete the follow-up are identified: randomized pilot studies in which
assessments. Therefore, you decide to review the part or all of the aspects of the future study are
literature first to see if any pilot study assessed carried out on a smaller scale to see if it can be
the feasibility of the subsequent large trial. done; non-randomized studies which are similar
but exclude randomization of participants; and
other types of feasibility studies where some
element of the future study is being addressed
G. Li L. Thabane (&)
(e.g. intervention development or acceptability)
Department of Health Research Methods, Evidence,
and Impact, McMaster University, St. Joseph’s but no part of the future trial is being conducted
Healthcare Hamilton, Hamilton, Canada [4, 5]. Within this framework, the authors define
e-mail: thabanl@mcmaster.ca a feasibility study as a study asking ‘whether
G. A. Lancaster something can be done, should we proceed with
Institute of Primary Care and Health Sciences, Keele it, and if so, how’ and a pilot study as ‘a study in
University, Keele, UK
which a future RCT or part of it, is conducted on 10 months, indicating the feasibility of a further
a smaller scale’ [4]. Essentially, Eldridge et al. large-scale RCT to assess the effect of probiotics
appropriately note that ‘the corollary of these on VAP in patients in ICU [9].
definitions is that all pilot studies (including tri-
als) are feasibility studies but not all feasibility
studies are pilot studies’ [4]. Current Practice in Pilot Trials
in the Literature
suboptimal nature of the current practices of pilot principles of the Helsinki declaration, it is
studies in the literature. unethical to expose participants unnecessarily to
The IDEAL (Idea, Development, Exploration, the unknown risks of research [15]. An example
Assessment, and Long-term Follow-up) frame- can be found in Dingemans et al. study that
work was set up to provide guidance and rec- aimed to assess the feasibility of a new portable
ommendations for an integrated evaluation single-use negative pressure wound therapy
system for surgical studies [13]. The adoption of (NPWT) device in patients undergoing major
the IDEAL framework has been slow but is foot ankle surgery: ‘… However, as only one
gaining increased momentum in the surgical lit- prospective study on prophylactic negative
erature [14]. The IDEAL framework is currently pressure wound therapy in patients undergoing
being updated and is expected to explicitly major lower extremity fracture surgery is avail-
incorporate pilot studies in the future under the able, evidence on its beneficial effect is limited
‘exploration’ category. It is hoped that the and it is uncertain whether the extra expenses of
inclusion of pilot trials will address some of the the NPWT device are justified. Additionally,
inappropriate practices regarding the conduct and lately portable, single-use devices have been
interpretation of small studies in surgery. For developed, which allow early discharge without
example, in surgical trials comparing two tech- the need for specialized home care. It is however
niques, authors often conclude inappropriately unknown whether the good results observed with
that absence of evidence (i.e. p-value > 0.05 regular NPWT devices are equalled by these
based on the data) is equivalent to evidence of newly developed devices’ [16].
absence (i.e. in fact, no difference between the
two techniques) [14]. This incorrect interpreta-
tion may mislead researchers to thinking that Objective(s) and Outcomes(s)
further properly designed studies are not needed
to detect the differences between interventions. The primary objective(s) of a pilot study should
concern the feasibility of performing a subse-
quent main study. To address the feasibility
Key Considerations for Designing objective(s), pre-specified outcome measure-
a Pilot Trial ments should be carefully chosen. In general,
outcomes of pilot studies may include recruit-
Before conducting a pilot study, some key ele- ment rates, adherence levels, data completion and
ments should be carefully considered. Several variance estimates, among others. For instance,
key considerations we emphasize here include in Skoretz et al study, the ‘… primary objective
study rationale, objective(s), feasibility and was to determine the feasibility of using vali-
patient-centred outcome(s) , sample size justifi- dated and objective interpretation measures for
cation, appropriate methods of analysis and cri- videofluoroscopy in conjunction with nasendo-
teria of success. scopy to assess swallowing and upper airway
physiology on prospectively enrolled CV [car-
diovascular] surgery patients following pro-
Study Rationale longed intubation’ and the ‘… secondary
objective was to explore the tolerability and
It is important to show readers why a pilot trial is impact of this study on patients and nursing
needed before a future main RCT is conducted, practice’ [17]. Accordingly, their outcome mea-
given the current scientific background and surements included ‘recruitment rate, patient
available evidence. The rationale for a pilot study participation, task completion durations, and the
is generally to explore areas of uncertainty that inter-rater reliability of VFS [videofluoroscopic
need to be addressed before the large-scale trial swallowing study] measures using the intraclass
can be planned [4] because according to the correlation coefficient’.
118 G. Li et al.
Some pilot studies may be performed to estimates are based on a small sample [22]. Other
determine an appropriate endpoint for the main approaches in the literature include (1) discussion
study; therefore, researchers may not be able to with health professionals for further supplemen-
choose the most appropriate primary outcome for tal information, (2) creating a sample size table
the main study until the pilot trial is completed. with different variation estimates of an effect size
Of note, however, researchers should explicitly to emphasize the uncertainty around the outcome
list that identifying the primary outcome for the of interest, and (3) running simulation studies to
main study as one of the pilot study objectives. assess different scenarios and present a variety of
Likewise, exploring learning curve effects may results with estimates of variability [6].
also be an important objective in surgical pilot
studies. Learning curves are a well-known way to
increase surgical proficiency over time, which Methods of Analysis
consequently could distort comparisons between
treatment groups [18, 19]. How to incorporate Pilot studies should not primarily focus on
learning curve effects in surgical trials is an treatment effects and tests of effectiveness;
important issue that requires further investigation. instead, they are conducted to address the feasi-
bility of conducting a main RCT. Therefore,
statistical tests of significance may not be needed
Sample Size or necessary. Where hypothesis tests are per-
formed a sample size calculation should have
One key requirement for pilot studies is that they been carried out or some justification given. It is
should ensure a sufficient sample to provide also helpful to include a cautionary caveat about
informative messages for feasibility of conduct- the study being underpowered and therefore the
ing a large-scale RCT. For pilot trials, sample results should be treated as preliminary [10].
size calculations should always be based on the Findings from pilot studies are usually summa-
feasibility objectives and scientific rationale. In rized descriptively by means with standard
many cases, pilot studies give a justified rationale deviations, and counts with percentages for each
as to the number in the sample without per- treatment group separately. Moreover, it is not
forming a formal sample size calculation. In uncommon to show the results of pilot studies
other studies that have a primary objective of qualitatively, for example, by using narrative
achieving an estimated recruitment, retention, or descriptions. An example can be found in Forero
adherence rate, a desired degree of precision et al. study, in which ‘the feasibility outcomes
(variation, or confidence interval) around the were reported descriptively and narratively. For
estimated rate is usually needed to help calculate the clinical endpoints, only descriptive statistics,
the sample size. We would refer readers to mean (standard deviation) for continuous out-
Thabane et al. publication about how to use the comes and raw count (%) for categorical out-
confidence interval approach to estimate sample comes, were reported. Due to the nature of pilot
size for a pilot study where rates are to be esti- designs, we chose not to conduct any informative
mated [6]. statistical tests on the collected data’ [23].
Pilot studies may also provide some infor-
mation about estimating variation of an effect
size to inform the sample size calculation for a Criteria of Success
future main RCT. Readers can refer to relevant
methodological publications for more detail [20, Of note, pilot studies are not conducted to esti-
21]. Nevertheless, caution is needed because mate an effect size or compare different treatment
findings from pilot studies may easily mislead effects due to their small sample size and lack of
sample size calculations for a main trial if the generalizibility. However, it is good practice to
12 How to Assess a Pilot Trial in Surgery 119
draw up progression criteria before the pilot checklist items are a very useful reference guide
study begins that relate to objectives such as to consult at the design stage of the pilot study
recruitment rates or adherence rates. As a rule of and many examples are given in the paper. It is
thumb, the criteria of success should be related to also easily adapted to non-randomized studies by
the primary feasibility objective(s). Therefore, omitting the items that relate to randomization.
feasibility findings from pilot trials can inform
progression to a subsequent main RCT including
whether a large RCT should be planned as is Key Resources for Further References
(main trial feasible, no modification needed),
planned but with changes (main trial feasible, To assist with the implementation of pilot studies
protocol amendments needed), or not planned in surgery, Table 12.1 summarizes some key
(main trial not feasible, aborted completely). As resources for readers. The table covers the
mentioned above, aspects of the feasibility may aspects of surgical pilot studies including their
include procedural, resource- or general conduct (the what/why/how and common
management-related, scientific and methodolog- misconceptions), deficiencies in current practice,
ical considerations. In Pai et al. study, for the IDEAL framework and recommendations,
example, the primary feasibility objective was to CONSORT extension reporting guideline, ethical
determine whether implementing the SENTRY issues, surgical pilot trial examples and examples
(Strategies to Enhance Venous Thromboem- of published protocols. For instance, Thabane
bolism Prophylaxis in Hospitalized Medical et al. publication gives a straightforward tutorial
Patients) could be feasible and whether the on the what/why/how regarding a pilot trial; it
strategies could improve appropriate thrombo- also covers some aspects including the differ-
prophylaxis rates [24]. Therefore, the pilot study ences between pilot trials and proof-of-concept
was considered definitely feasible if ‘the pro- studies, some common confusions and miscon-
portion of at risk patients receiving appropriate ceptions and some frequently asked questions,
prophylaxis in the intervention hospitals versus among others [6]. Ethical issues of pilot studies
the usual care hospitals was > 25%’ [24]. have not been appropriately addressed in the
literature. We did a quick literature search of
ethics guidelines for research, but did not find
Reporting of a Pilot Trial any detailed information on (1) whether a pilot
study is ethical if its feasibility cannot be guar-
Regarding the reporting of pilot studies, we anteed, or (2) how to approach participants for
would refer readers to the published CONSORT their consent given the feasibility nature of pilot
(Consolidated Standards of Reporting Trials) trials. However, we would refer readers to Tha-
extension for pilot and feasibility trials [4]. The bane et al. publication [6] and the CONSORT
guideline provides instructions on how to trans- reporting guideline by Eldridge et al. [5] that
parently and adequately report abstracts and main proposed some solutions to such ethical issues.
texts of pilot studies with detailed examples and Examples of surgical pilot trials [25, 26], as well
explanations. It also compares the standard as examples of published protocols [27, 28], can
CONSORT checklist items with the extension also be found in Table 12.1.
checklist items for pilot studies, with some items
not applicable to pilot studies and some new
items added. We suggest researchers who plan to Returning to the Scenario
design, conduct and report a pilot study should
carefully refer to the reporting guideline and After reviewing the literature, you identified a
adhere to the checklist items as appropriate. It is pilot trial comparing the accelerated care with
important to note that the CONSORT extension standard care in patients with a hip fracture [29].
120 G. Li et al.
Table 12.1 Some key resources for readers’ further references to surgical pilot studies
Aspects of pilot studies Key referencea
Definition Eldridge, 2016 [4]; Eldridge, 2016 [5]
The what/why/how and common misconceptions Thabane, 2010 [6]
Deficiencies in current practice Lancaster, 2004 [10]; Arain, 2010 [11]
The IDEAL framework and recommendations McCulloch, 2009 [13]; McCulloch, 2018 [14]
CONSORT extension—Reporting guideline Eldridge, 2016 [4]
Ethical issues Thabane, 2010 [6]; Eldridge, 2016 [4]
More examples of surgical pilot trials Mason, 2015 [25]; Lim, 2018 [26]
Examples of published protocols of surgical pilot trials Snee, 2016 [27]; Kearney, 2017 [28]
a
Expressed as: first author, publication year [study reference]
Table 12.2 CONSORT checklist of information reported in the HIP ATTACK pilot triala
Section/Topic Item Checklist item Reported
No (Yes/No/NA)
Title and abstract
1a Identification as a pilot or feasibility randomized trial in the Yes
title
1b Structured summary of pilot trial design, methods, results and Yes
conclusions (for specific guidance see CONSORT abstract
extension for pilot trials)
Introduction
Background and objectives 2a Scientific background and explanation of rationale for future Yes
definitive trial, and reasons for randomized pilot trial
2b Specific objectives or research questions for pilot trial Yes
(continued)
12 How to Assess a Pilot Trial in Surgery 121
26. Lim CP, Roberts M, Chalhoub T, Waugh J, Delegate L. 28. Kearney RS, Parsons N, Mistry D, Young J,
Cadaveric surgery in core gynaecology training: a Brown J, O’Beirne-Elliman J, et al. A protocol for
feasibility study. Gynecol Surg. 2018;15(1):4. a feasibility randomised controlled trial to assess the
27. Snee M, McParland L, Collinson F, Lowe C, difference between functional bracing and plaster cast
Striha A, Baldwin D, et al. The SABRTooth for the treatment of ankle fractures. Pilot Feasibility
feasibility trial protocol: a study to determine the Stud. 2017;3(1):11.
feasibility and acceptability of conducting a phase III 29. Hip Fracture Accelerated Surgical Treatment and
randomised controlled trial comparing stereotactic Care Track (HIP ATTACK) Investigators. Acceler-
ablative radiotherapy (SABR) with surgery in ated care versus standard care among patients with
patients with peripheral stage I non-small cell lung hip fracture: the HIP ATTACK pilot trial. CMAJ Can
cancer (NSCLC) considered to be at higher risk of Med Assoc Journal (journal de l’Association medi-
complications from surgical resection. Pilot Feasibil- cale canadienne). 2014;186(1): E52–60.
ity Stud. 2016;2(1):5.
Non-inferiority Randomized
Controlled Trials 13
Yaad Shergill, Atefeh Noori, Ngai Chow
and Jason W. Busse
search terms “advanced ovarian cancer” and unacceptably worse” than a standard procedure
“primary chemotherapy” and “primary surgery” may be desirable when the new procedure is
and limit the search to the “English-language associated with fewer harms, favorable cost, or
studies carried out on human participants.” Your greater accessibility [6–8]. The rationale behind
search yields ten articles (Appendix), of which trialists’ choice of their prespecified non-inferior
two are randomized controlled trials (RCTs) [2, margin is thus critical to providing readers with
3]. The more recent 2015 CHORUS [2] trial, confidence in the validity of a study that has
published in the Lancet, directly relates to your concluded non-inferiority.
research question, and would seem to be the The US Food and Drug Administration has
study referred to at the conference you attended. suggested a strategy for setting non-inferiority
When you download the article and begin to read margins based on the smallest plausible benefit
through it, you notice that the trial is labeled as a of the existing standard therapy for the outcome
non-inferiority trial, and wonder how this differs under consideration [9]. This value is informed
from other trial designs. by identifying the most defensible estimate of
effect from existing trials (or ideally, a systematic
review) and the associated measure of precision;
Randomized Controlled Trials typically, the 95% confidence interval (95% CI).
The smallest plausible benefit is the lower
RCTs remain the gold standard for evaluating the boundary of the 95% CI—the value closest to no
effectiveness of surgical interventions, as suc- effect. A new treatment should be similarly
cessful randomization ensures that patients in beneficial to conclude non-inferiority, and this is
both treatment arms are prognostically similar at defined by many drug regulatory agencies as
baseline and any differences in outcome can be including at least 50% of the minimal plausible
attributed to the intervention. The experimental treatment effect [10].
therapy is considered superior to the control in- For instance, an existing surgical procedure
tervention when the null hypothesis is rejected may show an absolute reduction of 4% in overall
and shows a statistically significant difference in mortality, compared with nonsurgical care, with
favor of the new treatment [4, 5]. For more a 95% CI of 2–6%. The smallest plausible benefit
information on RCTs, please see Chap. 11. is then a 2% reduction in overall mortality; half
of this is 1%. If a non-inferiority trial testing an
alternative surgical procedure shows a treatment
Non-inferiority RCT Design effect with an associated measure of precision
that includes no more than a 1% increase in
Non-inferiority trials investigate whether a new overall mortality (e.g., a point estimate of no
surgical procedure is not worse than another difference, with a 95% CI of −1 to 1%), then
procedure by more than an acceptable amount. 50% of the 2% absolute reduction in absolute
The null hypothesis, which, if met will reject mortality is preserved and would satisfy the cri-
non-inferiority, if that the new surgical procedure teria for concluding non-inferiority. Estimates of
is worse than the standard intervention by greater precision associated with treatment effects that
than an acceptable amount (delta, Δ), where Δ is only include benefit versus the comparator
the non-inferiority margin. The alternate demonstrated superiority and estimates of preci-
hypothesis, which if met will demonstrate sion that exceeds the margin have shown inferi-
non-inferiority, is that the new surgical inter- ority to the comparator. Furthermore, 95% CIs
vention is not worse than standard intervention that include the non-inferiority margin are
for the condition by greater than Δ. A new sur- inconclusive (e.g., a more precise estimate may
gical procedure that is non-inferior or “not or may not be non-inferior) (Fig. 13.1).
13 Non-inferiority Randomized Controlled Trials 127
Fig. 13.1 Possible scenarios of observed treatment dif- new treatment is non-inferior in the sense already defined
ferences for adverse outcomes (harms) in non-inferiority but also inferior in the sense that a null treatment
trials. Bars indicate 2-sided 95% CIs. The dashed line at difference is excluded. E and F, if the 95% CI includes D
x = D refers to the non-inferiority margin, the light gray and zero, the difference is nonsignificant but the result
tinted region to the left of D refers to the zone of regarding non-inferiority is inconclusive. G, if the 95%
inferiority. A, if the CI lies to the left of zero, the new CI includes D and is to the right of the zero, the difference
treatment is superior. B and C, if the 95%CI lies to the is statistically significant but the result is inconclusive
left of D and includes zero, the new treatment is regarding possible inferiority of magnitude D or worse.
non-inferior but not shown to be superior. D, if the H, if the 95% CI is above D, the new treatment is inferior.
95% CI lies to the left of D and to the right of zero, the Adapted from Piaggio [11]
Were the novel and standard surgical interven- chemotherapy was non-inferior to primary sur-
tion groups prognostically similar at baseline? gery, a conclusion that rests—in part—on the
Was prognostic balance maintained as the trial assumption that the standard treatment (primary
progressed? surgery) was delivered in an optimal manner.
Similar to superiority RCTs, non-inferiority tri- The investigators reported that all surgery was
als can reduce the risk of bias by adequately ran- performed by accredited specialist gynecological
domizing patients to the different treatment arms; oncologists, who operated on at least 15 patients
concealment of allocation; blinding of participants, with ovarian cancer each year, and whose work
healthcare providers, investigators, and outcome was regularly peer-reviewed. The CHORUS trial
assessors; and minimizing loss to follow-up. Not all [2] predicted a 50% 3-year survival rate with
methodological safeguards against bias may be primary surgery, consistent with previous trials
possible in surgical trials; for example, surgeons [16], but found a much lower rate of 32%. This
cannot be blinded in regard to which procedure they difference may be explained by an older median
perform. Please see Chap. 11. age of CHORUS patients [2] (65 years), and the
In the CHORUS [2] study, participants were high rate of poorly differentiated tumors (77%),
centrally randomized by telephone using a min- but the lower than expected survival rate with
imisation method with a random element, and primary surgery does pose some risk for a mis-
stratified according to randomizing center, tumor leading finding of non-inferiority.
size, clinical FIGO stage, and prespecified
chemotherapy regimen. Central randomization Did investigators analyze patients according to
ensures concealment of allocation, and stratifying the surgical treatment they received, as well as to
randomization by factors that may be associated the groups to which they were assigned?
with prognosis provides additional reassurances Randomization aims to balance prognostic
that treatment groups are prognostically similar factors between interventions; however, patients
at baseline. Review of the CHORUS trial’s [2] may not receive their allocated treatment. Trial-
table of baseline characteristics provides reas- ists can address this issue by only analyzing
surance that randomization was successful those patients who adhered to the study protocol
(Table 13.1). Because of the nature of the inter- (a per-protocol analysis); however, this assumes
ventions (primary surgery vs. primary that patients who deviated from protocol were
chemotherapy), it was not possible to blind prognostically similar to those who did not—an
patients or clinicians; however, the primary out- assumption that is often not met, as non-adherent
come measure was overall survival which would patients are often sicker and destined to do
not be influenced by lack of blinding [15]. The worse. As such, per-protocol analyses are likely
authors reported no loss to follow-up for their to lead to an overestimation of treatment effects
primary outcome, and no more than 7% missing in superiority trials. Using an intention-to-treat
data for secondary outcomes. No loss to (ITT) analysis for superiority trials, where
follow-up over the course of the trial for the patients are analyzed according to the group to
primary outcome ensures that prognostic balance which they were randomized, provides a more
between treatment arms was maintained conservative estimate of treatment effects [17].
throughout the study. You conclude that the trial In non-inferiority trials, however, ITT analysis
is at low risk of bias. may lead to a misleading conclusion of
non-inferiority. Consider a non-inferiority trial in
Did investigators guard against an unwarranted which the novel treatment is actually inferior,
conclusion of non-inferiority? and patients are less adherent to the standard
The CHORUS trial [2] provided both surgery treatment. An ITT analysis that included all
and chemotherapy to both treatment arms; it was non-adherent patients would then reduce the
the order of approaches that were tested. apparent effectiveness of the standard therapy
The CHORUS trial [2] concluded that primary and facilitate the misleading conclusion of
130 Y. Shergill et al.
non-inferiority in comparison with the new and per-protocol analyses and found similar
approach. As such, a per-protocol analysis is the results.
more conservative approach for non-inferiority
trials. Investigators that present both ITT and Did investigators justify their non-inferiority
pre-protocol analyses provide further confidence margin?
in a conclusion of non-inferiority when the Non-inferiority trials are undertaken on the
results between both approaches are concordant. basis that a new approach is either safer, less
Of note, the choice of analysis may have limited costly, or more readily available, and the issue is
impact on results in most cases. A review of 117 not whether the new approach is more effective
non-inferiority comparisons in which both an than standard care, but only that it is similarly
ITT and per-protocol analysis were reported effective (not much worse). The CHORUS [2]
found differing conclusions in 7 (4%) of cases investigators reported that historical trials of pri-
[12]. The CHORUS study [2] reported both ITT mary surgery for advanced ovarian cancer have
13 Non-inferiority Randomized Controlled Trials 131
found a 3-year survival rate of 50%, and they Step 2: What Are the Results?
believed that most patients would be willing to Treatment of advanced ovarian cancer focusses
accept up to 6% less survival with primary on survival, but other important outcomes are
chemotherapy. This meant that non-inferiority quality of life and adverse events associated with
could be concluded if the upper limit of the 90% therapy. Pursuing chemotherapy before surgery
CI for the hazard ratio for the effect of primary may reduce tumor size, facilitate more successful
surgery versus primary chemotherapy on overall debulking surgery, and—with less extensive
mortality was less than 1.18. The authors’ of surgery required—reduce postoperative compli-
CHORUS selected their non-inferiority margin cation rates; however, delaying surgery may also
based on “consideration of the size of differences decrease survival rates.
noted in similar trials and clinical consensus” [2]. The CHORUS trial [2] found a nonsignificant
Their rationale is therefore vague, and it is not pattern for greater overall 3-year survival that
possible to derive the non-inferiority margin from favored primary chemotherapy versus primary
the details provided. Moreover, it is not clear surgery (34% vs. 32%), but they did not report an
whether or not patients with advanced ovarian associated measure of precision for the absolute
cancer would accept a 6% loss of survival at 3 difference of 2%. In an ITT analysis, the hazard
years follow-up. The information provided in the ratio (HR) for overall 3-year survival was 0.87
trial registry does not include any further details with an associated estimate of precision (95% CI:
(http://www.isrctn.com/ISRCTN74802813). 0.72–1.05) for which the upper boundary fell
This is lack of detail is, unfortunately, not below the non-inferiority margin of 1.18 (this
unusual. In their systematic review of 163 would correspond to scenario C in Fig. 13.1).
non-inferiority trials, Aberegg and colleagues A per-protocol analysis was consistent with these
found that only 25% provided clear justification results (HR 0.89; 95% CI: 0.78–1.01).
for their selected margin, 58% provided no The planned meta-analysis for overall survival
rationale, and 17% reported vague reasoning moved the point estimate closer to no effect, and
[12]. The choice of non-inferiority margin also improved precision (HR 0.93; 95% CI: 0.82–
informs the required sample size, and the 1.05) (Fig. 13.3). Improvement in quality of life,
550-patient CHORUS trial had 65% power for measured by the EORTC quality of life ques-
comparison between the treatment groups [2]. tionnaire and the ovarian cancer-specific quality
The authors addressed this concern by combining of life questionnaire, at 6 and 12 months was not
their results with another trial which would significantly different between treatment groups,
increase power to 90% [18]. but did favor primary chemotherapy (63% vs.
Fig. 13.3 Forest plot of overall survival in CHORUS [2] and EORTC studies [18]
132 Y. Shergill et al.
6. Hegazy MA, Hegazi RA, Elshafei MA, primary surgery and chemotherapy: an intergroup
Setit AE, Elshamy MR, Eltatoongy M, et al. study. Gynecol Oncol. 2006;100(1):133–8.
4. Bothwell LE, Greene JA, Podolsky SH, Jones DS.
Neoadjuvant chemotherapy versus primary In: Malina D, editor. Assessing the gold standard—
surgery in advanced ovarian carcinoma. lessons from the history of RCTs. N Engl J Med.
World J Surg Oncol. 2005;3:57. 2016;374(22):2175–81.
7. Alberts DS, Hannigan EV, Liu PY, Jiang C, 5. Altman DG, Schulz KF, Moher D, Egger M, David-
off F, Elbourne D, et al. The revised CONSORT
Wilczynski S, Copeland L, et al. Randomized statement for reporting randomized trials: explana-
trial of adjuvant intraperitoneal alpha inter- tion and elaboration. Ann Intern Med. 2001;134
feron in stage III ovarian cancer patients who (8):663–94.
have no evidence of disease after primary sur- 6. Hahn S. Understanding noninferiority trials. Korean J
Pediatr. 2012;55(11):403–7.
gery and chemotherapy: an intergroup study. 7. Temple R, Ellenberg SS. Placebo-controlled trials
Gynecol Oncol. 2006;100(1):133–8. and active-control trials in the evaluation of new
8. Eisenhauer EL, Tew WP, Levine DA, Licht- treatments. Part 1: Ethical and scientific issues. Ann
man SM, Brown CL, Aghajanian C, et al. Intern Med. 2000;133(6):455–63.
8. Ellenberg SS, Temple R. Placebo-controlled trials
Response and outcomes in elderly patients and active-control trials in the evaluation of new
with stages IIIC-IV ovarian cancer receiving treatments. Part 2: Practical issues and specific cases.
platinum–taxane chemotherapy. Gynecol Ann Intern Med. 2000;133(6):464–70.
Oncol. 2007;106(2):381–7. 9. U.S. Department of Health and Human Services
Food and Drug Administration Center for Drug
9. Drews F, Bertelli G, Lutchman-Singh K. Evaluation and Research, Center for Biologics Eval-
Management of advanced ovarian cancer in uation and Research. Non-inferiority clinical trials to
South West Wales—a comparison between establish effectiveness: guidance for industry. [inter-
net]. 2016. [cited 2018 July 5]. Available from
primary debulking surgery and primary
https://www.fda.gov/downloads/Drugs/Guidances/
chemotherapy treatment strategies in an uns- UCM202140.pdf.
elected, consecutive patient cohort. Cancer 10. Althunian TA, de Boer A, Groenwold RHH, Klun-
Epidemiol. 2017;49:85–91. gel OH. Defining the noninferiority margin and
analysing noninferiority: an overview. Br J Clin
10. Jarnagin WR, Dematteo RP, D’Angelica MI,
Pharmacol. 2017;83(8):1636–42.
Barakat RR, Chi DS. The addition of extensive 11. Piaggio G, Elbourne DR, Pocock SJ, Evans SJW,
upper abdominal surgery to achieve optimal Altman DG, For the CONSORT Group. Reporting of
cytoreduction improves survival in patients noninferiority and equivalence randomized trials.
JAMA. 2012;308(24):2594–604.
with stages IIIC-IV epithelial ovarian cancer.
12. Aberegg SK, Hersh AM, Samore MH. Empirical
Gynecol Oncol. 2006;103(3):1083–90. consequences of current recommendations for the
design and interpretation of noninferiority trials.
J Gen Intern Med. 2018;33(1):88–96.
13. Thoma A, Farrokhyar F, Bhandari M, Tandan V,
Evidence-Based Surgery Working Group. Users’
References guide to the surgical literature. How to assess a
randomized controlled trial in surgery. Can J Surg.
1. Cadeddu M, Farrokhyar F, Levis C, Cornacchi S, 2004;47(3):200–8.
Haines T, Thoma A. users’ guide to the surgical 14. Thoma A, Farrokhyar F, Waltho D, Braga LH,
literature. Understanding confidence intervals. Can J Sprague S, Goldsmith CH. Users’ guide to the
Surg. 2012;55(3):207–11. surgical literature: how to assess a noninferiority
2. Kehoe S, Hook J, Nankivell M, Jayson GC, Kitch- trial. Can J Surg. 2017;60(6):426–32.
ener H, Lopes T, et al. Primary chemotherapy versus 15. Anthon CT, Granholm A, Perner A, Laake JH,
primary surgery for newly diagnosed advanced Møller MH. No firm evidence that lack of blinding
ovarian cancer (CHORUS): an open-label, ran- affects estimates of mortality in randomized clinical
domised, controlled, non-inferiority trial. Lancet. trials of intensive care interventions: a systematic
2015;386(9990):249–57. review and meta-analysis. J Clin Epidemiol.
3. Alberts DS, Hannigan EV, Liu PY, Jiang C, Wilczyn- 2018;100:71–81.
ski S, Copeland L, et al. Randomized trial of adjuvant 16. Seidman JD, Yemelyanova A, Cosin JA, Smith A,
intraperitoneal alpha-interferon in stage III ovarian Kurman RJ. Survival rates for international federa-
cancer patients who have no evidence of disease after tion of gynecology and obstetrics stage III ovarian
134 Y. Shergill et al.
carcinoma by cell type. Int J Gynecol Cancer. 20. Shimokawa M, Kogawa T, Shimada T, Saito T,
2012;22(3):367–71. Kumagai H, Ohki M, et al. Overall survival and
17. Gupta SK. Intention-to-treat concept: a review. post-progression survival are potent endpoint in
Perspect Clin Res. 2011;2(3):109–12. phase III trials of second/third-line chemotherapy
18. Vergote I, Tropé CG, Amant F, Kristensen GB, for advanced or recurrent epithelial ovarian cancer.
Ehlen T, Johnson N, et al. Neoadjuvant chemother- J Cancer. 2018;9(5):872–9.
apy or primary surgery in stage IIIC or IV ovarian
cancer. N Engl J Med. 2010;363(10):943–53.
19. Herzog TJ, Armstrong DK, Brady MF, Coleman RL,
Einstein MH, Monk BJ, et al. Ovarian cancer
clinical.
Expertise-Based Randomized
Controlled Trials 14
Daniel Waltho, Kristen Davidge and Cagla Eskicioglu
and for the Evidence-Based Surgery Working Group
differences, or lack thereof, between treatment the surgeons to perform their preferred surgical
groups are more likely to be true. Thus, we will technique, which in turn increases validity and
be asking the following questions: feasibility of the trial.
To ensure validity in this category, one must
(i) Was the learning curve taken into ensure that the expertise-based protocol was
consideration? carried out appropriately. Therefore, this chapter
(ii) Were the subjects randomized? proposes the following sub-criteria that must be
(iii) Was the randomization concealed? fulfilled to determine validity:
(iv) Were the experimental and control group
similar in terms of prognostic factors? (a) Sufficient level of expertise in each treatment
(v) Were the subjects stratified? group objectively determined and clearly
(vi) Were subjects analysed in the group they stated
were initially randomized into at (b) Matching of equally skilled surgeons
enrolment? between each treatment group
(vii) Was follow-up complete? (c) Sufficient number of surgeons in each treat-
ment group
Was the Learning Curve Taken Into
Consideration? (a) Sufficient level of expertise in each treatment
group objectively determined and clearly
This aspect of validity is a common pitfall in stated
surgical RCTs and is a primary focus of opti- Level of surgeon expertise is at the crux of
mization in the EBRCT. When an RCT com- this RCT design. To conclude that the trial
paring two surgical interventions is performed results are valid, it must first be determined
without an expertise-based design, the experi- that the surgeons performing the treatment(s)
mental group is often a relatively novel proce- are adequately trained, such that resultant
dure, with which many surgeons may not have outcomes (i.e. Success, complications, etc.)
sufficient experience. The presence of a learning are not attributable to poor surgical tech-
curve will likely create a discrepancy between nique. Therefore, the study should provide
the surgeon’s proficiency and their ultimate an objective metric for each surgeon based
degree of success with the newer surgical treat- on expertise in their assigned operative
ment. This discrepancy may translate into a treatment. There is evidence to suggest that
systematic difference in outcomes between for many procedures, at least 5 years of
groups favouring the older procedure. experience is required to reach the plateau of
The EBRCT design assigns patients to an inter- the learning curve [2]. In the Mohtadi et al.
vention group where only surgeons who are well [4] trial, one surgeon in each group had
experienced in that specific surgical procedure 10 years of experience, while the remaining
perform the intervention. Further this design can surgeons in each group had 2–5 years of
match patients based on surgeon’s level of experience. No sensitivity analysis was done
experience. In other words, if one patient ran- on the study outcomes based on expertise
domized to the experimental treatment arm level. Moreover, the authors are assuming
receives their respective procedure by a surgeon that each surgeon has performed a similar
with a certain degree of experience in that tech- number of cases in their respective tech-
nique, he/she will be matched to a patient ran- nique. Whereas, a surgeon could, in theory,
domized to the comparison arm who receives have performed only one open procedure a
their respective procedure from another surgeon year for the past 10 years and another sur-
with a similar degree of experience in the com- geon could have performed 100 arthroscopic
parison technique. Therefore, this design allows procedures a year for 10 years. A more
14 Expertise-Based Randomized Controlled Trials 139
appropriate metric for measuring expertise group when the cases are frequent and need
may be the total number of a particular to be done on an emergent basis. In the case
procedure performed. of the Mohtadi et al. [4] trial, cases were
(b) Matching of equally skilled surgeons done on an elective basis and a total of 226
between each treatment group eligible patients were included in
Another critical component of the EBRCT expertise-based randomization between 2001
protocol is matching of surgeons based upon and 2008. Throughout the trial period, two
their level of expertise in their respective surgeons performed open repairs and three
treatment group. In the above sub-criterion, surgeons performed arthroscopic repairs.
we identify a diversity of skill level with the Only one patient, who was randomized to
surgeons involved in the trial. While it is open repair, crossed over. Therefore, we can
important to enlist surgeons in all treatment conclude that there were an adequate number
groups who have sufficient training to be of surgeons in both groups to support this
considered ‘experts’ in a given surgical trial.
procedure, these surgeons still exist on a
spectrum in terms of how seasoned they are Based on these criteria, we can conclude that a
in that technique. Therefore, further match- fairly robust expertise-based design was carried
ing of surgeons based on their level of ex- out throughout the trial, however caution must be
pertise is indicated to control for any taken when interpreting the results as true ex-
variation in outcomes that may be attributed pertise of the surgeons involved is not clearly
to how many of these procedures a surgeon reported.
has performed. In the Mohtadi et al. [4] trial,
surgeons were matched by experience based
on the approximate number of years they Were the Subjects Randomized?
were performing their preferred technique.
This ensures that for every patient random- In this trial, the allocation was determined using
ized to a surgeon who has performed the computer-generated, variable-block-size ran-
arthroscopic technique for a certain number domization. Block randomization is simply ran-
of years, a patient would also be randomized domization of participants within blocks such
to a surgeon who has performed the open that equal numbers of patients are assigned to
technique for the same number of years. each treatment. Variable-block-sizes decreases
However, as mentioned above, the number predictability in the allocation process by varying
of years may not necessarily correspond to the number of patients assigned to a given block
the number of procedures performed and throughout the allocation process. In the Mohtadi
matching would be more suitably performed et al. [4] trial, appropriate randomization is uti-
based on the later metric. lized. Because this trial is an expertise-based
(c) Sufficient number of surgeons in each treat- design, allocation determined both the technique
ment group and surgeon as discussed above. Patients would
To further optimize the execution of the then meet with their assigned surgeon, who dis-
EBRCT design, the trial must ensure an cussed the technique, confirmed eligibility, and
adequate number of surgeons in each treat- addressed any concerns. Randomization was,
ment group, thereby limiting the number of therefore, sufficiently executed in this trial.
incidences where patients may cross over to
the other group. The number of surgeons
required is contingent upon both the volume Was the Randomization Concealed?
and priority of cases. There is likely to be
increased difficulty with respect to the Above all, concealment of allocation should
availability of any surgeon in the desired apply to those enrolling a patient into the trial, as
140 D. Waltho et al.
it ensures that systematic selection bias associ- demographic data of each group. Ideally, no
ated with decisions to enrol certain patients into statistically significant differences should exist
each treatment arm is limited. In the Mohtadi between key demographic information that may
et al. [4] RCT, the allocation was concealed with change outcomes, such as age, gender or
consecutively numbered opaque envelopes. comorbidities (p > 0.05). According to the
Assuming no tampering occurred, we can be authors, there was no statistically significant
satisfied that allocation concealment was suffi- difference between groups in terms of age, gen-
cient for those enrolling patients initially. How- der, collision/contact sport involvement, number
ever, because the trial is based upon two of previous dislocations or the number of suture
significantly different surgical approaches, open anchors used intraoperatively. However, baseline
versus arthroscopic, which incorporate different characteristics differed significantly between
incisional patterns, blinding of patients is difficult treatment groups in two categories. First was the
to achieve. Mohtadi et al. [4] were unable to proportion of dominant shoulders being treated,
sufficiently blind the patients to the treatment wherein a significantly larger proportion of open
selection. Furthermore, the authors state that the repairs were done on the patient’s dominant
research assistant performing the clinical exami- shoulder. Second was the length of time from
nations for evaluation of treatments was aware of initial injury, wherein a significantly longer per-
the treating surgeon. This is due to both the iod from injury occurred with the open
difference in incisional patterns indicated above, group. A relative increase in the time from injury
as well as the expertise-based nature of surgeon with the open repairs could skew the results in
allocation. Finally, while concealment of alloca- favour of the arthroscopic procedure, as a greater
tion is impossible to achieve for the surgeons time until repair typically yields poorer results.
providing treatment, it is highly unlikely to Shoulder dominance is less clear, as a dominant
introduce bias in this case. In traditional surgical shoulder repair group may lead those patients to
RCTs, where the same surgeon performs both be more motivated or capable of post-operative
interventions, there is a concern that if a surgeon therapy and recovery or may also cause
strongly believes in the superiority of one of the post-operative injury due to over-exertion.
two interventions, then either consciously, or Moreover, because this primary outcome was
subconsciously, patients may be treated differ- based on a quality of life measurement, perceived
ently (for example obtaining more meticulous improvement in disease-specific quality of life
hemostasis) or introducing co-interventions. may be greater if surgery is performed on a
However, the use of the EBRCT study design dominant shoulder injury. In the Mohtadi et al.
avoids these concerns especially if this is a [4] trial, it is not clear whether this baseline
pragmatic expertise-based RCT. characteristic discrepancy affects validity. Need-
less to say, key patient demographics remain, for
the most part, uniform between groups.
Were the Experimental and Control
Groups Similar in Terms of Prognostic
Factors? Were the Subjects Stratified?
To ensure validity, we need to determine if dif- The further subdivision of subjects beyond the
ferences in outcomes between treatment groups treatment administered is called stratification.
may be attributable to differences between Stratification attempts to ensure equal allocation
patient populations in each group. Randomiza- of subgroups of participants based on a desire for
tion attempts to minimize this by limiting sys- uniform populations between groups as descri-
tematic biases that may cause certain patients to bed in the previous section. This exercise,
fall into one of the arms and not the other. Care therefore, limits systematic differences that may
should be taken to observe the initial skew results in favour of one treatment.
14 Expertise-Based Randomized Controlled Trials 141
Stratification can be based on gender, age or follow-up routine in their clinical practice. In
other demographic factors. In keeping with an addition to a sufficient follow-up period, a suffi-
expertise-based design, in the Mohtadi et al. [4] cient proportion of patients in the study should
trial, stratification occurred at the level of the complete follow-up to ensure validity. In the
surgeon, wherein surgeons of a similar level of Mohtadi et al. [4] RCT, follow-up occurred at 3
expertise in their respective technique were months, 6 months, 1 year and 2 years
matched. The authors comment on the random- post-operatively. The authors report that 19
ization of patients being stratified accordingly, patients in the open repair group and 14 patients
however do not specify the patient characteristics in the arthroscopic repair group were lost to
for which stratification was based upon. Based follow-up at 2 years. To optimize validity, a
on the demographic information discussed in the study should aim to have less than 20% of sub-
previous section, we can be satisfied with jects lost to follow-up [5]. In this case, 85% of
homogeneous strata between both treatment patients initially randomized remained in the trial
arms. throughout the 2-year follow-up. Therefore, we
can conclude that there are no major concerns
regarding follow-up in this RCT.
Were Subjects Analysed in the Group
They Were Initially Randomized
into at Enrolment? What are the Results of the Trial?
Intention-To-Treat (ITT) analysis is a method Once the validity of the trial has been estab-
wherein all patients enrolled and randomly allo- lished, we must determine the magnitude of the
cated to a treatment are analysed in that group to treatment effect. In this section, we will further
which they were randomized, regardless of understand the results of the trial by answering
deviations that occur after randomization the following questions:
(e.g. protocol violations, loss to follow-up/
withdrawal, patient non-compliance). This prac- (i) How were the results of the trial being
tice optimizes validity by ensuring that the study measured?
population remains uniform between study (ii) How large is the treatment effect?
groups, thus avoiding crossover of participants to
a group, to which they were not randomized. In
the Mohtadi et al. [4] RCT, all patients were How Were the Results of the Trial
analysed on an ITT basis so we can conclude that Being Measured?
validity has been optimized in this regard.
Study outcomes should be measured using an
objective tool that ideally has been appropriately
Was Follow-Up Complete? validated and shown to be reliable. The primary
outcome was disease-specific quality of life. The
To capture treatment effects, patients must be authors utilize a Patient-Reported Outcome (PRO)
followed up both frequently and for a sufficient measure to address this outcome. The Western
duration. Ideally, multiple patient follow-ups for Ontario Shoulder Instability Index (WOSI) was
a patient’s lifetime would provide the most chosen. It uses a 100 mm visual analog scale
accurate data, however feasibility must be con- response format. Patients completed the WOSI at
sidered. Therefore, the surgeon should identify baseline, at 3 and 6 months post-operatively, and at
the duration and frequency of follow-up that 1 and 2 years post-operatively. Evaluating the
would be practical based on the disease and original psychometric study on this PRO, we
treatment being studied. These follow-up periods determine that it is both validated and reliable in
usually correspond to the surgeon’s typical this category [6].
142 D. Waltho et al.
The secondary outcome was shoulder function, can only conclude that a difference between the
including both patient-reported and clinician- two procedures exists based on the recurrence of
reported measures. The American Shoulder and instability.
Elbow Surgeons (ASES) scale, a shoulder-specific
functional assessment tool is a PRO assessing
shoulder function. The score was determined Are the Results Applicable to Clinical
based on patient evaluation of pain, instability and Practice?
activities of daily living and utilized a 0–100 scale.
Once again, exploring the psychometric literature, According to Guyatt et al., in order to accept that
we are assured that this outcome is both valid and the evidence in a particular study is applicable to
reliable in the study population [7]. Furthermore, your patient(s), three critical questions must be
clinical assessment of shoulder range of motion addressed [8]:
was performed by trained, independent research
assistants using a standardized tool (i.e., gonio (i) Does the study represent my patient
meter). Clinical assessment was presumably also population?
used to measure adverse events/complications and (ii) Does the trial consider all relevant
recurrence rates. Based on this article, we can patient-important outcomes?
conclude that this method of clinical assessment is (iii) Do the benefits of the procedure outweigh
both valid and reliable. Recurrent instability was any potential risks and costs?
another secondary outcome, however details sur-
rounding how this was assessed were not disclosed
by the authors.
Does the Study Represent My Patient
Population?
How Large Is the Treatment Effect ?
Prior to applying the results of a study to your
In our clinical scenario, we ask which of the two patient(s), you should review the study popula-
treatments is superior. In order to answer our initial tion to ensure that the patient demographics are
clinical question, we must first determine whether in line with that of your patient(s). The authors
there are statistically significant differences presumably included all-comers with recurrent
between the two techniques. Assessing the Moh- traumatic anterior shoulder instability. However,
tadi et al. [4] article with respect to the primary those patients with glenoid fracture or bone loss
outcome, no statistically significant difference was seen on pre-operative X-ray were excluded.
found between the two treatments in terms of Assessing the demographic breakdown of
disease-specific quality of life (mean WOSI patients, mean age was 27.8 and 27.2 between
scores). Further, no statistically significant differ- the open and arthroscopic group, respectively,
ence was found between the two treatments in with an 82% male predominance in both groups.
terms of shoulder function (ASES scores and Based on these fundamental demographics, we
active ROM). However, recurrence rates at two can conclude that the study population is repre-
years were significantly different, with 11% in the sentative of our patient from the clinical scenario.
open group and 23% in the arthroscopic group Furthermore, approximately half of these patients
(p = 0.05). No statistical analysis was performed injured their shoulder in a contact sport, while a
based on adverse events/complications, presum- third to a half of the shoulder instability in the
ably because there were no clinically significant study involved the dominant shoulder. Finally,
results in these categories. Based on this data, we shoulder injuries were around the 6-year mark at
14 Expertise-Based Randomized Controlled Trials 143
the time of both treatments. Therefore, we are analysis within their study. Costs can be dichot-
confident that our patient’s clinical scenario is omized into direct and indirect and can be further
relatively well reflected in the study. divided into costs to the healthcare system, the
patient and to society. In this case, if we review
the literature, we come across a recent cost-
Does the Trial Consider All Relevant effectiveness study by Min et al. [9] According to
Patient-Important Outcomes? this study, the average direct cost of surgery for
the arthroscopic shoulder repair was $20,385 and
Looking at our clinical scenario, the main out- the average cost of surgery of the open repair was
come in question was quality of life. Indeed, this $21,389.10 The authors conclude that both the
is likely to be the most important outcome for the arthroscopic and the open are highly cost-
patient, as it provides a practical culmination of effective but the arthroscopic procedure is more
symptoms and functional aspects germane to the cost-effective due to a lower health utility state
condition and procedure. The Mohtadi et al. [4] after a failed open procedure [9]. However, this
RCT addresses disease-specific quality of life as study does not incorporate indirect costs to
a primary outcome. In addition, the authors patient or society and further costs are based on a
include shoulder function, including recurrent single institution. Further research would be
instability, and safety outcomes secondarily. indicated to determine true risk and benefit ratio,
Shoulder function provides appropriate data to however there appears to be similarity between
confirm the mechanism of any quality of life the groups so as not to dissuade the surgeon or
improvements that may have been found in the patient from either procedure.
study. Safety outcomes, including complications
and adverse outcomes, are critical in deciding
whether to adopt or endorse a particular surgical Conclusion
intervention. The authors presumably study all
complications and adverse outcomes throughout The expertise-based RCT is a promising answer to
the follow-up period. Overall, the included out- the challenges that are faced when applying the
comes are both clinically important and com- RCT design to surgical interventions. By obviat-
prehensive in assessing the two treatment arms. ing the issue of learning curve, the surgeon can be
more confident that differences between treatment
groups are not attributable to surgeon inexperi-
Do the Benefits of the Procedure ence. Furthermore, concerns of surgeons person-
Outweigh Any Potential Risks ally believing in the superiority of a specific
and Costs? surgical procedure and potentially introducing
bias, either consciously or subconsciously, is not
Another practical argument for deciding upon a an issue in this type of study design. However, care
treatment is the risks and costs to the patient, must still be taken when interpreting the results,
healthcare system and society. A significant ensuring that the RCT methodology is rigorous.
benefit in a particular treatment decision should The EBRCT design represents an ideal approach to
outweigh potentially undue risks to the patient or the surgical trial. Presently, this design is not
excessive costs that may be associated with a prominent in the literature, however appears to be
novel surgery. There were no significant com- growing more frequent in the past decade.
plications reported within this study. Both treat- Knowledge in how to identify an appropriately
ment arms had a small number of cases of designed EBRCT, and analyse and interpret the
transient nerve dysfunction which resolved methodology and results, will no doubt be critical
spontaneously. The authors did not include a cost to evidence-based surgery moving forward.
144 D. Waltho et al.
Table 15.1 Differences between traditional narrative reviews, systematic reviews and meta-analyses* [1, 4–7]
Narrative review Systematic review Meta-analysis
Research question Broad Focused Focused
Literature search Not reported, or not Explicit, reproducible Explicit, reproducible
comprehensive
Article selection Not reported Criterion-based, Criterion-based,
criteria complete complete
Quality assessment Not reported Methodology assessed Methodology assessed
Result summary Qualitative Qualitative Quantitative
scaphoid fracture. Radiographs demonstrate a to the following inclusion criteria: (1) English
non-displaced scaphoid waist fracture. You language articles, (2) systematic reviews or
communicate the results to the patient and dis- RCTs published within the last 10 years,
cuss potential management options. The patient (3) adult (age 18 years) population and
expresses that he is eager to quickly return to (4) articles that assess RTW outcomes following
work and asks whether surgery will shorten his surgical versus non-surgical management of
recovery time. You are uncertain whether cast acute non-displaced scaphoid fractures. You
immobilization or surgical fixation better quick- exclude the following: (1) nonrandomized ob-
ens return to work (RTW) outcomes. You place servational studies, and (2) studies which evalu-
the patient in a thumb spica splint and arrange ate outcomes following displaced scaphoid
follow-up to discuss the available evidence. fractures.
Searching the MEDLINE database from the
National Library of Medicine, you derive your
Finding the Evidence keywords from the clinical question. You enter
the terms ‘Surgical’ AND ‘Cast Immobilization’
This scenario can be structured in PICOT format AND ‘Scaphoid Fracture’ AND ‘Return to
as in Table 15.2. Work’; which yields 15 results—Appendix 1.
This can then be summarized into the fol- You limit the search to ‘Systematic Reviews’,
lowing research question: ‘Among adult patients narrowing it to two articles. One article [11]
with acute, non-displaced scaphoid waist frac- assesses the intervention of interest; however, it
tures, does operative management decrease the restricts its patient group to high demand manual
time to RTW compared to non-surgical man- workers between 16 and 40 years of age. There-
agement?’ A large systematic review and meta- fore, the patient group being studied is signifi-
analysis or a large randomized controlled trial are cantly younger and not consistent with the patient
preferable, as they lie at the top of the hierarchy described in the clinical scenario. The remaining
of scientific evidence [10]. article by Alnaeem et al. [12] is titled
To find the evidence to your question, you ‘A Systematic Review and Meta-Analysis
perform a literature search. You limit your search Examining the Differences Between
Table 15.3 Users’ guide for how to review systematic A clinically relevant question, as described by
reviews Thoma et al. [14] in their article ‘Forming the
Are the results valid? Research Question’, meets at least one of the
- Did the review address a sensible clinical question? following four criteria:
- Was the search for relevant studies detailed and
exhaustive?
- Was publication bias assessed? 1. The intervention is novel;
- Were the primary studies of high methodological 2. The intervention consumes large health care
quality? resources;
- Were assessments of studies reproducible?
3. There is a controversy on the effectiveness of
What are the results? the novel procedure as compared with the
- Were the results similar from study to study?
- What are the overall results of the review? existing procedure or;
- How precise were the results? 4. There is a large cost difference between two
Can I apply the results to patient care? prevailing interventions.
- How can I best interpret the results and apply them to
my patients? For a complete description on forming a sur-
- Were all patient-important outcomes considered? gical research question, please refer to Chap. 3.
- Is the benefit worth the potential costs and risks?
Alnaeem et al. [12] study addresses an
Adapted from Bhandari et al. [1]
appropriately focused clinical question with a
well-defined population (patients greater than
Nonsurgical Management and Percutaneous Fix- 15 years of age with isolated, acute,
ation of Minimally and Nondisplaced Scaphoid non-displaced or minimally displaced scaphoid
Fractures’ [12]. It adequately addresses the fractures), intervention (percutaneous or min-
question posed in the clinical scenario and you iopen screw fixation), control (non-operative
decide to conduct a critical appraisal. You use the treatment with cast immobilization), and pri-
following guide, adapted from Bhandari et al. [1], mary (time to return to work) and secondary
to help guide your analysis (Table 15.3). (time to union and complication rate) outcomes.
In addition, the answer to this clinical question
appears to be uncertain in the current body of
Are the Results Valid? literature and is therefore clinically relevant.
Therefore, you feel that Alnaeem et al. [12]
Did the Review Address a Sensible addresses a sensible clinical question, as per the
Clinical Question? criteria listed above [14].
A good research question is both focused and
clinically relevant [13]. A focused question Was the Search for Relevant Studies
yields a consistent treatment effect across the Detailed and Exhaustive?
range of Population, Intervention, Comparison, The ultimate goal of a search strategy for a sys-
Outcomes, and Time Horizons (the ‘PICOT tematic review is to include the entire body of
elements‘). Consider the hypothetical example of evidence that is relevant for a given research
a systematic review that pools the results from all question [15]. The published and unpublished
therapies, both operative and non-operative, for literature should be searched, including English
all types of upper extremity fractures, to generate and non-English journals [2]. The inclusion of
a single estimate of the effect on union rates. journals in multiple languages reduces the risk of
Clinicians would universally agree that this language bias, whereby studies with positive
question is too broad and the results would not be results are preferentially accepted to English
useful [6]. Conversely, asking a question with language journals, and therefore may result in an
very narrow inclusion and exclusion criteria may overestimation of treatment effect in the sys-
produce too few primary studies to review, and tematic review. It also strengthens the general-
decreases the generalizability of the results [2]. izability of the results [3, 7].
148 A. Copeland et al.
Most investigators will search common bib- (PubMed MEDLINE, Ovid MEDLINE,
liographic databases such as MEDLINE, EMBASE, SCOPUS, Cochrane central register
EMBASE and Cochrane Controlled Trials of controlled trials and Cochrane bone, joint and
Register (see the article by Haines et al. [2] for muscle trauma registry), listed their search terms
descriptions and uses of these databases). How- (‘scaphoid’, ‘navicular’ or ‘hand fracture’; ‘per-
ever, there is controversy regarding the inclusion cutaneous or screw fixation’; ‘conservative’,
of unpublished literature in systematic reviews. ‘immobilization’ or ‘cast’; ‘miniopen’ or ‘mini-
Since unpublished studies are more likely to be mally invasive’), and provided the complete
methodologically weak, their inclusion may search strategy in an Appendix 1, making it
compromise the validity of the meta-analysis. adequately reproducible. Additionally, Alnaeem
Conversely, omitting them may lead to publica- et al. [12] screened all articles using two inde-
tion bias (see the following section). Therefore, it pendent reviewers (‘H.A. and J.K.’), ultimately
is preferable to include unpublished studies in the limiting the potential for errors and minimizing
search strategy and subsequently perform sub- selection bias. However, you notice that the
group analysis to determine whether publication authors did not mention if their literature search
impacts the results. Unpublished studies can be was performed with the help of a medical
identified by hand searching conference pro- librarian. Furthermore, non-English articles and
ceedings and contacting experts in the field the unpublished literature were excluded and
[2, 6, 16]. Consulting a medical librarian helps there is no estimate of search completeness using
reassure the reader that the search was exhaustive CMR techniques.
[2, 3]. Publishing the search strategy also ensures
transparency and reproducibility. Was Publication Bias Assessed?
Moreover, investigators may also choose to Publication bias is ‘the selective publication of
evaluate the completeness of a literature search research findings based on the magnitude,
through the use of Capture-Mark-Recapture direction or statistical significance of the study
(CMR) techniques as a means to estimate the pop- result’ [18]. Small studies with negative results
ulation of articles available for a given topic, known are less frequently submitted or accepted for
as a horizon estimation [17]. While the specifics of publication. In a systematic review that excludes
CMR methodology may extend beyond the scope the unpublished literature, publication bias may,
of this chapter, it requires that study authors estab- therefore, lead to an overestimation of treatment
lish a priori search stopping criteria (% of total effect and potentially a false-positive result [2,
article estimate), perform a search of likely data- 16, 19, 20].
bases and complete screening to final inclusion, Publication bias is frequently explored with a
calculate the horizon estimate using CMR tech- funnel plot. As demonstrated by Tseng et al.
niques and subsequently compare article retrieval [21], this diagram plots the effect size (x axis)
with the established horizon estimate to determine against the sample size (y axis) of individual
whether the initial search stopping criteria is satis- studies. In studies with larger sample sizes,
fied [17]. This technique encourages authors to shown at the top of the plot, the scattered dots are
continue to search additional sources and therefore closer together representing higher precision.
guide efficient search strategies [17]. Smaller studies, at the bottom of the plot, are
Finally, two or more independent investiga- located more peripherally, representing over- or
tors should be involved in article selection and under-estimation of a treatment effect. If there is
data abstraction to minimize errors and limit no publication bias, the dots create a symmetri-
selection bias. In a recent meta-analysis of sys- cal, inverted funnel shape. If, however, there is a
tematic reviews, 48.4% of systematic reviews paucity of small studies with negative findings,
used two or more independent data extractors [7]. the plot will appear asymmetrical at the bottom,
Alnaeem et al. [12] used a defined time frame implying publication bias [21]. To complement
(1974–2015), identified their primary databases this qualitative representation, a statistical test
15 The Surgeon’s Guide to Systematic Review and Meta-Analysis 149
(e.g. rank correlation test) should be utilized, the 12-item MINORs score [25], which can be
though it has limited power where the used for the evaluation of nonrandomized trials,
meta-analysis is small (<25 studies) [4, 16, 20]. are common examples.
In the study by Alnaeem et al. [12], publica- While composite scoring tools are com-
tion bias was not addressed. mended for their ease of use, they are often
criticized for simply recording whether or not
Were the Primary Studies of High methodological safeguards (i.e. allocation con-
Methodological Quality? cealment) were reported rather than assess whe-
To avoid duplication and promote greater trans- ther they were performed appropriately within a
parency, the PROSPERO database was devel- trial (see Voineskos et al. [26]). Domain-based
oped in 2011 for authors of systematic reviews evaluations such as the Cochrane Collaboration’s
and meta-analyses to prospectively register re- Risk of Bias tool [27], assesses and reports each
search protocols where there are at least one quality component individually. This ‘compo-
health-related outcome [22]. It is recommended nent approach’ gives a more transparent analysis
that authors publish their protocols as it allows of methodological quality and should be utilized
readers to compare it to its completed review and [20, 27–30].
assess for any discrepancies, ultimately mini- Instead of excluding primary studies that are
mizing the opportunity for reporting bias [22]. found to have poor methodological quality,
Additionally, the Preferred Reporting Items investigators should perform a subgroup analysis
for Systematic reviews and Meta-Analysis to determine the impact of the quality score on
(PRISMA) statement [23] is a 27-item checklist the results [1].
and flow diagram that was originally published in In the chosen article by Alnaeem et al. [12]
2009 to help authors improve the reporting methodological quality of the primary studies
quality of their systematic reviews and meta- was appropriately assessed. For RCTs, both the
analyses. While not considered to be a quality Cochrane Collaboration’s tool and the Jadad
assessment instrument, the PRISMA statement score [24] were used; however, only the Jadad
can be referenced to critically appraise academic score was reported in the results section. For
review articles published within the observational studies, the MINORS score [25]
literature [23]. was used. Overall, the authors concluded that the
The quality of a systematic review is only as quality of included studies was ‘satisfactory’.
sound as that of its primary articles, thus the However, you notice that they did not use the
methodological quality of primary articles should quality scores to include or exclude any of the
be assessed. In one study, 35.3% of systematic studies, nor did they perform subgroup analysis
reviews published in the plastic surgery literature to compare the effect of methodological quality
evaluated the methodological quality of the pri- on treatment effect. Furthermore, assessing and
mary studies, and 30% factored this into their reporting the components of methodological
conclusions [7]. quality individually is more transparent and
At present, there is no single, universally useful than using scales that simply produce a
endorsed assessment instrument to evaluate the summative score [6, 26].
methodological quality of primary studies refer-
enced within systematic reviews and Were Assessments of Studies
meta-analyses [6]. While several composite Reproducible?
scoring tools exist (i.e. checklists and scales), the Systematic reviews must have a reproducible
three-item Jadad score [24], which can be used selection process. Eligibility criteria and data to
for the assessment of RCTs (i.e. randomization, be extracted should be listed. The selection pro-
double-blinding and participant withdrawal), and cess can be visually represented in the form of a
150 A. Copeland et al.
PRISMA flow diagram [31]. Each stage should While visual inspection of the Forest plot
be conducted by two or more independent gives the reader a rough estimate of hetero-
reviewers to minimize error and bias. Interob- geneity, statistical tests like the I squared statistic
server agreement, such as the kappa statistic, (I2) are important to quantify heterogeneity.
should then be measured. If there is good I squared (I2) describes the percentage of total
chance-correlated agreement between investiga- variation attributable to underlying differences in
tors, the reader will have more confidence in the effect (i.e. heterogeneity) rather than chance [32].
results of the review [6, 13]. In addition to the A value above 50% represents substantial
kappa statistic, it is useful to report the variables heterogeneity. Another test assessing hetero-
on which the reviewers most often disagreed. geneity is Cochrane’s Q test (measured with a
This adds another level of transparency to the chi-square test). A low p-value (<0.1) indicates
selection and data extraction process [21]. that the observed differences are unlikely to be
The process of study selection by Alnaeem et al. due to chance. Unfortunately, these tests may
[12] was systematic and the process was appro- lack power to detect statistically significant
priately documented in a PRISMA flow diagram heterogeneity when the review includes too few
(Appendix 2). Two independent reviewers were studies with few patients in each study
involved in title and abstract screening, full-text [2, 20, 21].
review, and quality assessment of primary studies In summary, widely different point estimates,
and disagreements were resolved by consensus. non-overlapping confidence intervals, a high
However, the degree of interobserver agreement I squared (I2) value, and a low p-value associated
(i.e. the kappa statistic) was not reported. with the chi square test should raise concern
about pooling results across studies into a single
summary estimate.
What Are the Results? If substantial heterogeneity exists between
studies, authors should look for explanations by
Were the Results Similar From Study To performing a subgroup analysis [16], since
Study? combining two differing subgroups could lead to
A meta-analysis creates a single summary esti- misleading conclusions [2, 3]. Other potential
mate of treatment effect by combining individual sources of heterogeneity may include differences
studies as though their sample size were all part in study groups, interventions, controls, outcome
of one larger study. However, the effect size of reporting, time horizons and research methodol-
each study should be similar enough to justify ogy. These hypotheses of potential sources of
combining them. Some variation is expected heterogeneity should be made a priori, to reduce
between studies due to chance. The question the likelihood of false-positive findings.
becomes, then to what extent are the observed Finally, the degree of heterogeneity between
differences in effect size greater than one would studies should direct the choice of statistical
expect by chance. effect estimate model. While random-effects
This question can be answered both visually, by models account for variation both within stud-
inspecting the point estimates and confidence ies and between studies, fixed-effects models
intervals on a Forest plot, and statistically, by using assume there is one true effect size underlying all
tests of heterogeneity. In the context of systematic studies and hence do not consider between-study
reviews and meta-analyses, heterogeneity refers to variation. They are, therefore, only appropriate
the differences in the effect sizes appreciated (in- for meta-analyses with a large number of
tervention outcomes) across included studies. homogenous studies (typically I squared (I2)
Specifically, similar point estimates and greatly <25%). If any degree of heterogeneity is sus-
overlapping confidence intervals indicate less pected, a random-effects model, while it will
heterogeneity and make the pooled summary produce larger confidence intervals, is the more
estimate more meaningful [6, 16]. conservative choice. The reader should be
15 The Surgeon’s Guide to Systematic Review and Meta-Analysis 151
suspicious of a fixed-effect model in the surgical represent the confidence interval (CI). The longer
literature, given the inevitable heterogeneity the whiskers, the less precise the estimate. If the
demonstrated by surgery [2, 16]. CI crosses the ‘line of no effect’, it is considered
Alnaeem et al. [12] chose the time to RTW as not statistically significant. This is also expressed
the primary end point. A random-effects model as a probability value (p value) in the ‘test for
was appropriately chosen. The individual studies overall effect’. The weighting of an individual
had similar point estimates and overlapping CIs. study is represented by the size of its box and is
Heterogeneity was tested using the chi-square typically determined by its sample size and the
test, which was statistically significant precision of its results [33]. Occasionally, studies
(p = 0.0008), and the I squared statistic (I2), are also weighted by their methodological quality
calculated at 79%. This suggests that 79% of the [1]. At the bottom of the Forest plot, the data are
observed variability cannot be explained by pooled from the weighted averages of the study
chance alone, indicating high heterogeneity. This results and is depicted as a diamond, where the
heterogeneity may be attributable to differences centre of the diamond indicates the overall effect
in the study groups (e.g. athletes may be more estimate and the width of the diamond depicts the
motivated to return to work), intervention overall CI [33].
(e.g. variations in surgical technique and post- Finally, the pooled effect size should ideally
operative regimen), control (e.g. different dura- be represented in units that are easy to interpret
tions of cast immobilization) or outcomes (e.g. and meaningful to clinicians and patients.
different methods of measuring return to work— Dichotomous outcomes should be reported as
insurance forms vs. patient reported). No attempt absolute risk reduction and number needed to
to identify reasons for heterogeneity with sub- treat (see Chap. 6). Continuous outcomes should
group analysis was reported in the Alnaeem et al. reference a patient-important effect size, such as
[12] article. minimal important difference [34].
In the selected study [12], ten articles were
What Are the Overall Results included in qualitative analysis and six were
of the Review? selected for meta-analysis, excluding four case
After completing the systematic review, if the series. Five of these six articles measured the
primary studies are of poor quality or are too primary outcome (time to return to work). The
heterogeneous, the data should be presented calculated pooled mean difference was
qualitatively. However, a meta-analysis is per- 38.64 days (95% CI, 27.9–49.48), indicating that
formed if the data are appropriate for pooling. patients who underwent percutaneous surgical
The rationale for pooling should be also stated. fixation returned to work 38.64 days earlier than
Not all studies need to be pooled. For exam- those who were treated non-operatively. There
ple, nonrandomized data could be included in the was no reference made to minimal important
systematic review but excluded from the difference, which makes the pooled mean dif-
meta-analysis. Alternatively, all studies could be ference of 38.64 days somewhat less meaningful;
included in the meta-analysis, but the RCTs however, one must consider the ‘importance’ of
could be analyzed separately from the non-RCTs all outcomes in the context of the individual
[3]. patient, ultimately considering their stated goals
Results from a meta-analysis are typically and beliefs.
displayed graphically as Forest plots (Fig. 15.1).
Outcomes for dichotomous variables are How Precise Were the Results?
expressed as ratios, while continuous outcome Precision of results is determined by the size of
measures are expressed as weighted or standard the CI around its point estimate. A small sample
mean differences. Individual study results are size and a wide standard deviation both widen
presented in rows. A box represents the study’s the CI. A narrow CI indicates a more precise
effect estimate and its associated whiskers summary estimate [6].
152 A. Copeland et al.
In the Alnaeem et al. article [12], the CI findings are applicable to his or her patient. After
informs us that we can be certain with 95% assessing the internal validity of the study, surgeons
probability that the true benefit of surgical fixa- should review treatment settings, inclusion/exclusion
tion is a faster return to work that lies between criteria, and patient characteristics to ensure adequate
27.9 and 49.48 days. This relatively narrow CI external validity that will demonstrate similar out-
indicates a precise result. It does not cross the comes to their patient population [21].
line of no effect and the p value is <0.00001, The study by Alnaeem et al. [12] included
indicating statistical significance. English articles published prior to 2015 with
n > 10, that assessed percutaneous fixation
(or miniopen technique) of isolated, acute,
Can I Apply the Results to My Patients? non-displaced scaphoid fractures in patients older
than 15 years of age. Your patient—a 55-year-old
How Can I Best Interpret the Results male commercial pilot presenting with an acute
and Apply Them to My Patients? non-displaced scaphoid waist fracture—fits the
While a systematic review or meta-analysis may inclusion criteria and study characteristics. The
support the use of a particular intervention, it is the study has appropriate external validity for the
responsibility of the surgeon to ensure that the results to be applied to your patient.
15 The Surgeon’s Guide to Systematic Review and Meta-Analysis 153
Were All Patient-Important Outcomes patient’s own stated goals and beliefs; while not
Considered? statistically significant, this may represent a clini-
Along with a defined primary outcome, other rel- cal and patient-important difference. Quality of life
evant outcomes must be considered when outcome measures were not assessed as they were
informing clinical decisions. Specifically, the inconsistently reported in the primary studies.
effectiveness of an intervention should be derived
by measuring ‘patient-important outcomes’—the Is the Benefit Worth the Potential Costs
clinical events deemed relevant to a specific patient and Risks?
population [5, 35]. These outcomes may include Either explicitly or implicitly, it is the responsi-
the occurrence of stroke, death or quality of bility of surgeons to make recommendations by
life outcomes. For example, potential patient- weighing the potential costs and benefits of an
important outcomes of percutaneous fixation for intervention in the context of the patient’s values
scaphoid fracture include the risk of infection, and preferences [6]. For example, a manual
reoperation, and neurovascular injury. A focused labourer may deem continued observation to be
review of the evidence may provide an accurate an unacceptable management strategy and may
result of the effect of fixation on each of these instead opt for prompt surgical intervention. As a
outcomes, but an informed clinical decision result, systematic reviews should avoid specific
requires that all outcomes be considered. As a care recommendations as they are unable to
result, the ideal systematic review will present a account for specific patient circumstances and
series of reviews for each patient-important out- values that can only be determined through direct
come [6, 20]. consultation with the patient [21].
A limitation of most systematic reviews and In the chosen example, the patient’s primary
meta-analyses remains that they frequently do not concern was returning to work. Complication
report all relevant patient-important outcomes [1]. rates and time to union should also be discussed.
While reasons for this may vary, one explanation is Overall, given that the results for RTW and
that individual studies often measure outcomes radiographic union statistically favour the surgi-
differently, or not at all, making it difficult to pool cal group with no significant difference in com-
and summarize results [1, 6]. To address this, the plication rate, one can infer based on the available
implementation of core outcome sets (COS) has evidence that surgical intervention provides a
been proposed. COS have been defined within the benefit. However, it is important to note the sig-
literature as an agreed upon set of outcomes that nificant heterogeneity for time to RTW and time
should be reported at a minimum in all studies to union, with I squared (I2) values of 79 and
assessing a particular condition [36, 37]. This ini- 88%, respectively. When this is considered in the
tiative, referred to as Core Outcome Measures in context of the patient’s expressed preference to
Effectiveness Trials (COMET), seeks to improve RTW, one can conclude that the potential benefits
homogeneity in outcome reporting to improve of surgical intervention may outweigh the costs
outcome-pooling in the form of systematic reviews and risks for this particular patient; however, the
and meta-analyses [32]. Authors and readers of results should be interpreted with caution and
systematic reviews are encouraged to review must be considered in the context of the patient’s
COMET resources to identify whether a set of ‘core own beliefs given the increased complication risk
outcomes’ exist for their condition of interest. associated with undergoing surgical intervention.
In addition to time to return to work, Alnaeem
et al. [12] also evaluated the adverse effects of
therapy, which is considered a patient-important Resolution of the Clinical Scenario
outcome. They found a higher complication rate in
the operative group (14% vs. 7%) but this was not At the follow-up appointment, you report your
statistically significant (p = 0.2). Again, this out- findings to the patient. You conclude that the
come must be considered in the context of the statistically significant results presented in the
154 A. Copeland et al.
study by Alnaeem et al. [12] have adequate percutaneous fixation of minimally and
internal and external validity and ultimately nondisplaced scaphoid fractures. J Hand
support the use of percutaneous pinning to Surg. 2016;41(12):1135–44.
improve RTW outcomes and union rates. You 6. Drexier M, Haim A, Pritsch T, Rosen-
inform the patient that surgical fixation demon- blatt Y. Isolated fractures of the scaphoid:
strated an increased prevalence of complications, classification, treatment and outcome.
although not statistically significant. The patient Harefuah. 2011;150(1):50–5.
reaffirms his desire to return to work as soon as 7. Bond C, Shin A, McBride M, Dao K. Per-
possible despite this risk and the decision is made cutaneous screw fixation or cast immobi-
to pursue surgical management. lization for nondisplaced scaphoid fractures.
J Bone Joint Surg Am. 2001;83(4):483–8.
8. Schädel-Höpfner M, Marent-Huber M,
Conclusion Sauerbier M, Pillukat T, Eisenschenk A,
Siebert H. Operative versus conservative
While systematic reviews and meta-analyses treatment of non-displaced fractures of the
represent the top of the level of evidence hier- scaphoid bone. Results of a controlled
archy, they are not without limitations. Conse- multicenter cohort study. Unfallchirurg.
quently, surgeons must possess the skills to 2010;113(10):804–13.
interpret the results of review articles to recog- 9. Fowler J, Ilyas A. Headless compression
nize the potential for bias and appropriately screw fixation of scaphoid fractures. Hand
assess their application to clinical practice. Clin. 2010;26(3):351–61.
10. Yinusa W, Adetan O, Odatuwa-Omagbemi
D, Eyo M. Bilateral simultaneous fracture of
Appendix 1: Search Results the carpal scaphoid successfully treated with
conservative cast immobilisation: a case
1. Arora R, Gschwentner M, Krappinger D, report. West Afr J Med. 2010;29(6):425–8.
Lutz M, Blauth M, Gabl M. Fixation of 11. Haddad F, Goddard N. Acute percutaneous
nondisplaced scaphoid fractures: making scaphoid fixation: a pilot study. J Bone
treatment cost effective. Arch Orthop Joint Surg. 1998;80(1):95–9.
Trauma Surg. 2006;127(1):39–46. 12. Soubeyrand M, Even J, Mansour C,
2. Schernberg F. Recent scaphoid fractures Gagey O, Molina V, Biau D. Cadaveric
(within the first three weeks). Chir Main. assessment of a new guidewire insertion
2005;24(3, 4):117–31. device for volar percutaneous fixation of
3. Geurts G, Van Riet R, Meermans G, Ver- nondisplaced scaphoid fracture. Injury.
streken F. Volar percutaneous transtrapezial 2009;40(6):645–51.
fixation of scaphoid waist fractures: surgical 13. Noaman H, Shiha A, Ibrahim A. Functional
technique. Acta Orthop Belg. 2012;78 outcomes of nonunion scaphoid fracture
(1):121–5. treated by pronator quadratus pedicled bone
4. Majeed H. Non-operative treatment versus graft. Ann Plast Surg. 2011;66(1):47–52.
percutaneous fixation for minimally dis- 14. McQueen M, Gelbke M, Wakefield A,
placed scaphoid waist fractures in high Will E, Gaebler C. Percutaneous screw
demand young manual workers. J Orthop fixation versus conservative treatment for
Traumatol. 2014;15(4):239–44. fractures of the waist of the scaphoid.
5. Alnaeem H, Aldekhayel S, Kanevsky J, J Bone Joint Surg Br. 2008;90(1):66–71.
Neel O. A systematic review and 15. Yin Z, Zhang J, Kan S, Wang P. Treatment
meta-analysis examining the differences of acute scaphoid fractures. Clin Orthop
between nonsurgical management and Relat Res. 2007;460(1):142–51.
15 The Surgeon’s Guide to Systematic Review and Meta-Analysis 155
(n = 41) (n = 31)
Studies included in
quantitative synthesis
(meta-analysis)
(n = 6)
7. Samargandi OA, Hasan H, Thoma A. Methodologic literature review and meta-analysis. J Urol. 2008;180
quality of systematic reviews published in the plastic (4):1249–56.
and reconstructive surgery literature: a systematic 22. Booth A, Clarke M, Dooley G, Ghersi D, Moher D,
review. Plast Reconstr Surg. 2016;137(1):225e–36e. Petticrew M, et al. PROSPERO at one year: an
8. Sprague S, McKay P, Thoma A. Study design and evaluation of its utility. Syst Rev. 2013;2(1):1–7.
hierarchy of evidence for surgical decision making. 23. Moher D, Liberati A, Tetzlaff J, Altman D. Preferred
Clin Plast Surg. 2008;35(2):195–205. reporting items for systematic reviews and
9. Egger M, Smith GD, Sterne JA. Uses and abuses of meta-analyses: the PRISMA statement. PLoS Med.
meta-analysis. Clin Med. 2001;1(6):478–84. 2009;6(7):e1000097.
10. Petrisor BA, Bhandari M. The hierarchy of evidence: 24. Jadad A, Moore R, Carroll D, Jenkinson C,
levels and grades of recommendation. Indian J Reynolds D, Gavaghan D, et al. Assessing the
Orthop. 2007;41(1):11–5. quality of reports of randomized clinical trials: is
11. Majeed H. Non-operative treatment versus percuta- blinding necessary? Control Clin Trials. 1996;17
neous fixation for minimally displaced scaphoid (1):1–12.
waist fractures in high demand young manual 25. Slim K, Nini E, Forestier D, Kwiatkowski F,
workers. J Orthop Traumatol. 2014;15(4):239–44. Panis Y, Chipponi J. Methodological index for
12. Alnaeem H, Aldekhayel S, Kanevsky J, Neel OF. non-randomized studies (MINORS): development
A systematic review and meta-analysis examining and validation of a new instrument. ANZ J Surg.
the differences between nonsurgical management and 2003;73(9):712–6.
percutaneous fixation of minimally and nondisplaced 26. Voineskos SH, Coroneos CJ, Ziolkowski NI,
scaphoid fractures. J Hand Surg. 2016;41(12):1 Kaur MN, Banfield L, Meade MO, et al. A systematic
135–44. review of surgical randomized controlled trials: part
13. Thoma A, Eaves FF III. What is wrong with I. Risk of bias and outcomes common pitfalls plastic
systematic reviews and meta-analyses: if you want surgeons can overcome. Plast Reconstr Surg.
the right answer, ask the right question! Aesthetic 2016;137(2):696–706.
Surg J. 2016;36(10):1198–201. 27. Higgins JP, Green S, editors. Cochrane handbook for
14. Thoma A, McKnight L, McKay P, Haines T. Form- systematic reviews of interventions (version 5.1. 0).
ing the research question. Clin Plast Surg. 2008;35 London, UK: The Cochrane Collaboration; 2016.
(2):189–93. 28. Jüni P, Altman DG, Egger M. Assessing the quality
15. Waltho D, Kaur MN, Haynes RB, Farrokhyar F, of controlled clinical trials. BMJ. 2001;323
Thoma A. Users’ guide to the surgical literature: how (7303):42–6.
to perform a high-quality literature search. Can J 29. Jüni P, Witschi A, Bloch R, Egger M. The hazards of
Surg. 2015;58(5):349–58. scoring the quality of clinical trials for meta-analysis.
16. Zlowodzki M, Poolman RW, Kerkhoffs GM, Tor- JAMA. 1999;282(11):1054–60.
netta P III, Bhandari M, International Evidence- 30. Moher D, Jadad AR, Nichol G, Penman M, Tug-
Based Orthopedic Surgery Working Group. How to well P, Walsh S. Assessing the quality of randomized
interpret a meta-analysis and judge its value as a controlled trials: an annotated bibliography of scales
guide for clinical practice. Acta Orthop. 2007;78 and checklists. Control Clin Trials. 1995;16(1):
(5):598–609. 62–73.
17. Lane D, Dykeman J, Ferri M, Goldsmith C, 31. Liberati A, Altman DG, Tetzlaff J, Mulrow C,
Stelfox H. Capture-mark-recapture as a tool for Gøtzsche PC, Ioannidis JP, et al. The PRISMA
estimating the number of articles available for statement for reporting systematic reviews and
systematic reviews in critical care medicine. J Crit meta-analyses of studies that evaluate health care
Care. 2013;28(4):469–75. interventions: explanation and elaboration. PLoS
18. Montori V, Smieja M, Guyatt G. Publication bias: a Med. 2009;6(7):e1000100.
brief review for clinicians. Mayo Clin Proc. 2007;75 32. Higgins JP, Thompson SG, Deeks JJ, Altman DG.
(12):1284–8. Measuring inconsistency in meta-analyses. BMJ. 2003;
19. Cook DJ, Guyatt GH, Ryan G, Clifton J, Bucking- 327(7414):557.
ham L, Willan A, et al. Should unpublished data be 33. Ried K. Interpreting and understanding meta-analysis
included in meta-analyses? Current convictions and graphs: a practical guide. Aust Fam Physician.
controversies. JAMA. 1993;269(21):2749–53. 2006;35(8):635–8.
20. Montori VM, Swiontkowski MF, Cook DJ. Method- 34. Evaniew N, van der Watt L, Bhandari M, Ghert M,
ologic issues in systematic reviews and Aleem I, Drew B, et al. Strategies to improve the
meta-analyses. Clin Orthop Relat Res. credibility of meta-analyses in spine surgery: a
2003;413:43–54. systematic survey. Spine J. 2015;15(9):2066–76.
21. Tseng TY, Dahm P, Poolman RW, Preminger GM, 35. Gallo L, Eskicioglu C, Braga LH, Farrokhyar F,
Canales BJ, Montori VM. How to use a systematic Thoma A. Users’ guide to the surgical literature: how
15 The Surgeon’s Guide to Systematic Review and Meta-Analysis 157
to assess an article using surrogate end points. Can J outcome set for research and audit studies in
Surg. 2017;60(4):280. reconstructive breast surgery. Br J Surg. 2015;102
36. Prinsen CA, Vohra S, Rose MR, King-Jones S, (11):1360–71.
Ishaque S, Bhaloo Z, et al. Core Outcome Measures 38. Preferred reporting items for systematic reviews and
in Effectiveness Trials (COMET) initiative: protocol meta-analyses: the PRISMA statement [Internet].
for an international Delphi study to achieve consen- Equator-network.org. 2018 [cited 2018 Sep 10].
sus on how to select outcome measurement instru- Available from: http://www.equator-network.org/
ments for outcomes included in a ‘core outcome set’. reporting-guidelines/prisma/.
Trials. 2014;15(1):247.
37. Potter S, Holcombe C, Ward JA, Blazeby JM,
BRAVO Steering Group. Development of a core
Prospective and Retrospective
Cohort Studies 16
Ramy Behman, Lev Bubis and Paul Karanicolas
your clinical diagnosis of adhesive small bowel aSBO. An outline of the study reported by
obstruction (aSBO). There is no indication for Zielinski and colleagues [7] is provided in
urgent surgical exploration, so you admit the Table 16.1.
patient for a trial of non-operative management. Using the study by Zielinski and colleagues
The following morning at handover, the attend- [7] as a case study, in the following chapter, we
ing surgeon asks if you administered a present a practical guide for the appraisal of co-
water-soluble contrast study. You explain that hort studies in surgery (Box 1). Our aim is to
you have never used this technique. The attend- provide practicing surgeons and trainees with a
ing surgeon suggests you review this interven- framework for the evaluation of the method-
tion, stating there is evidence of both diagnostic ological quality and applicability of cohort
and therapeutic benefit [6]. studies.
Table 16.1 Outline of the prospective cohort study conducted by Zielinski et al. [7]
Study element Summary
Population and setting 316 patients with aSBO without signs of bowel strangulation, hernia, history of
abdominal or pelvic malignancy, or abdominal surgery within 6 weeks presenting to
1 of 14 academic medical centres in the United States
Exposure/comparators and Exposure—aSBO treatment protocol with or without oral contrast challenge
allocation Allocation—selection of aSBO treatment protocol at discretion of treating surgeon
(except at institutions without availability of oral contrast agent)
Results Rate of surgical exploration—20.8% in oral contrast group versus 44.1% in
non-contrast group (p < 0.0001). On adjusted analysis receipt of oral contrast study
was associated with significantly lower odds of operative exploration (aOR 0.23, 95%
CI 0.12–0.43)
Overall complication rate—12.5% in oral contrast group versus 17.9% in
non-contrast group (p = 0.22)
Length of stay—median 4 days (Q1–Q3: 2–7 days) in oral contrast group versus
median 5 days (Q1–Q3: 2–13 days) in non-contrast group (p = 0.036)
Authors’ conclusions Patients managed with an aSBO protocol incorporating routine oral contrast studies
had lower rates of surgical exploration and shorter hospital length of stay, without an
increase in risk of complications compared to no contrast. Incorporation of an oral
contrast study is of benefit in the care of aSBO patients presenting without indications
for immediate surgical exploration
Abbreviations aSBO—adhesive small bowel obstruction; aOR—adjusted odds ratio; Q1—first quartile; Q3—third
quartile; AUROC—area under the receiver operating characteristic curve
for readers to evaluate to whom the study results collection, assessment of consistency (e.g.
apply. Sample size calculations based on antici- inter-rater reliability) is important to assure the
pated effect sizes in comparative effectiveness quality of data collection [16]. Zielinski et al. [7]
studies are valuable to ensure studies are ade- do not describe data collection methods used in
quately powered to detect clinically meaningful their study, nor do they describe whether any
difference between interventions, should they data was missing. Consequently, readers cannot
truly exist. In their article, Zielinski et al. [7] assess whether data collection methods or miss-
appropriately describe study inclusion and ing data impacts the validity of study results.
exclusion criteria, and provide a sample size
calculation based on historic rates of their pri- Do the Authors Provide Clear
mary outcome of surgical exploration for aSBO. and Justifiable Definitions of the Primary
and Secondary Outcomes?
Are the Data Sources Reliable? The outcomes measured in surgical cohort stud-
There should be a clear description of data sources. ies should address the key aspects of the clinical
In retrospective cohort studies, data may be question under study. When applicable, authors
obtained from primary sources (i.e. from chart should justify their designation of primary and
abstraction and prospective institutional data- secondary outcomes. Methods of outcome
bases) or secondary sources designed for alternate ascertainment must be carefully planned and
purposes (i.e. from administrative, demographic reported with sufficient clarity to be replicable.
and physician claims databases) [11]. Some stud- Zielinski et al. [7] did not explicitly specify pri-
ies have demonstrated poor accuracy of secondary mary and secondary outcomes. However, the
data for surgical outcomes assessment [12, 13]. authors provide a power calculation to determine
However, retrospective chart abstraction or cre- the sample size required to detect a significant
ation of institutional databases require a large difference in the rate of operative exploration
amount of resources and are unfeasible for large between the contrast and non-contrast groups.
population-based studies. For research using Other outcomes reported include length of hos-
administrative databases, validity may be bol- pital stay and complications. In addition, Zielin-
stered through conduct of preliminary studies ski et al. [7] report the diagnostic accuracy of the
assessing their accuracy [14, 15]. In prospective oral contrast challenge for prediction of need for
studies, data may be obtained via active data col- operative exploration.
lection processes specific to the study or through In studies of surgical interventions, the time-
linkage to routinely collected data from external frame of outcomes assessment is a key consid-
sources such as registries or population-based eration. A high proportion of postoperative
databases. For prospective studies using active complications are diagnosed only after hospital
data collection, reliability may be bolstered by discharge. Data sources that are limited to the
creating clear data collection protocols inclusive of index hospitalization may, therefore, substan-
detailed variable definitions prior to study conduct. tially undercount postoperative adverse events
[12, 17]. In this regard, a limitation of the study
Are the Data Collection Methods Sound? conducted by Zielinski et al. [7] is that the
Authors should report on the methods of data timeframe for assessment of the secondary out-
collection as well as the extent of missing data. comes of complications is not specified. It is
Missing data may compromise the validity of possible that study groups may have differing
study results. As such, considerable effort should risks of short-term (i.e. in-hospital) and longer
be made to minimize missing data and the term (i.e. 30-day) complications. Therefore, a
method of handling missing data in statistical clear description of timeframes for outcome
analyses should be described. In addition, assessment is imperative to allow readers to fully
where multiple contributors participate in data understand the study results.
16 Prospective and Retrospective Cohort Studies 163
of an analytical approach; these are discussed in multivariable regression may be used to account
greater detail in Chaps. 27–31. for these differences in baseline characteristics.
In the study by Zielinski et al. [7], the final The article by Zielinski et al. [7] reported the
paragraph of the Methods section describes the baseline characteristics of the two exposure
different variable types, how they were presented groups in Table 16.1, demonstrating that the
and which statistical methods were used to groups were similar. The authors included many
compare them. The authors clearly describe how of the important demographic, clinical and radi-
potentially subjective clinical and radiological ological characteristics for patients treated for
variables such as obstipation, mesenteric edema adhesive SBO. Several potentially relevant base-
and SB faeces sign were defined for this study. line characteristics were not described, including
The covariates selected by the authors for duration of symptoms, number of previous
multivariable analysis are valid to the question of admissions for SBO, admitting service (medical
interest, albeit not an exhaustive list of potential vs. surgical), and admission septic status.
confounders. Interestingly, the authors elect to It is unclear why the authors include CT
include CT findings in the multivariable model findings in Table 1 by Zielinski et al. [7] as
for operative exploration, but do not include ‘Baseline Characteristics’ but report physiologic
physiologic or biochemical findings in this model and biochemical findings separately. Physiologic
and do not give an explanation as to why these and biochemical findings are compared between
were excluded. While CT findings provide the two exposure groups separately in Table 2 by
information regarding the degree of bowel Zielinski et al. [7].
obstruction, physiologic and biochemical find-
ings also contribute valuable information that Are the Results and Relevant Statistics
guides clinical decision-making. Clearly Stated?
Results should be clearly described according to
the analytical approach outlined in the Methods
What Are the Results? section of the study. Readers of the study should
be able to identify which statistical test yielded
Are Baseline Characteristics the results being described. In addition to the
of the Exposure Groups Clearly Reported results, the study should report the appropriate
and Accounted For? statistics (e.g. p-value, standardized difference,
Cohort studies are inherently subject to selection confidence interval) in a manner that is appro-
bias, which may distort findings. A clear priate to the statistical test and is consistent
description of differences in baseline character- throughout the study.
istics between treatment groups provides impor- In the study by Zielinski et al. [7], univariate
tant contextual information with which analyses (proportions, comparison of
subsequent results may be interpreted. In the means/medians) were reported using p-values
article under study, the baseline characteristics and multivariable analyses were reported using
were similar in the two treatment groups, sug- odds ratios and 95% confidence intervals. The
gesting comparability. However, readers of ob- results are clearly stated and the reporting is
servational studies should not allow similarity of consistent. Readers are easily able to identify a
baseline characteristics to make them confident description of the corresponding analytical
in a study’s results, since unmeasured or unre- approach in the Methods section.
ported baseline characteristics may be different
between exposure groups. How Precise Were Estimates of Effect
When baseline characteristics differ signifi- Size?
cantly between exposure groups, there should Effect estimates may be reported in a variety of
be concern about selection bias. In these ways, depending on the outcomes of interest.
instances, analytic methods such as matching or Reporting can vary from simple differences in
166 R. Behman et al.
proportions to absolute/relative risk ratios to odds incorrectly reported the interquartile range,
ratios and hazard ratios. Confidence intervals which is a single number that represents the
should be reported to allow readers to determine difference between the first and third quartiles.
the precision of effect estimates; similarly, if
variability estimates were included the reader Are Results Robust to Planned
could have calculated confidence intervals to Sensitivity Analyses?
determine precision. Sensitivity analyses are used to evaluate the va-
Confidence intervals provide important infor- lidity of study results by testing whether an effect
mation with which effect sizes should be inter- is consistent under alternate study conditions
preted. Wide confidence intervals suggest a high [34]. By introducing or eliminating some of the
degree of uncertainty and that effects should be inherent uncertainty that existed in the original
interpreted with caution; narrow confidence study design, investigators can test how much of
intervals suggest a high degree of certainty of the the measured effect size was a result of the
reported effect size [31, 32]. P-values, while not independent variables of interest, and how much
specifically reflecting the precision of effect was a product of uncertainty or potential con-
sizes, are a measure of the type I error—the founders [34].
probability that an observed effect is due to A careful reading of the results of sensitivity
chance. Both confidence intervals and p-values analyses is critical. Results that differ substan-
are dependent on sample size and error [31–33]. tially from the primary analysis reduce confi-
In the article under study, the authors clearly dence in the conclusions. In this article, the
describe the key results, dividing these into authors did not perform sensitivity analyses.
diagnostic outcomes and therapeutic outcomes.
With respect to the diagnostic ability of Were Subgroup Analyses Used? If So,
water-soluble contrast studies, the authors report Were these Specified a Priori and is
the positive-predictive value, negative-predictive the Issue of Multiple Testing Addressed?
value, and estimate the area under the receiver Unlike sensitivity analyses, which are performed
operating characteristic (AUROC) curve (a to test assumptions that have been made by
measure of a test’s diagnostic ability) of investigators, subgroup analyses are used to
water-soluble contrast studies. The authors did determine if the relationship between an expo-
not report likelihood ratios, which should be sure and an outcome differs across levels of a
reported in evaluations of diagnostic ability in third variable [30]. To perform a subgroup
addition to positive- and negative-predictive analysis, the study population is divided into
values. The authors report the primary thera- levels of the third variable and patients are
peutic outcome (rates of surgical exploration) compared across groups. The subgroups should
using univariate and multivariable analyses, be specified prior to the analysis of data, and not
reporting proportions with chi-square p-values be the result of conclusions that are found during
and odds ratios with 95% confidence intervals, data analysis [35]. In the article under study, the
respectively. Rates of bowel resections and authors did not perform any subgroup analyses.
complications are also reported as proportions Readers of studies that include subgroup
with chi-square p-values and median length of analyses should be aware of issues with multi-
stay and median time to operative exploration are plicity in which, when multiple subgroup analy-
reported described with interquartile ranges and ses are performed, the probability of false positive
compared using Mann–Whitney U tests. It results increase [30]. For more information about
should be noted that the authors of this study subgroup analysis, please see Chap. 30.
16 Prospective and Retrospective Cohort Studies 167
How Can I Interpret these Results existing literature and how this study might fill
in the Context of My Clinical Practice? gaps in current knowledge. It should be noted
that selective reporting of literature that supports
Are the Results of the Study Placed a study’s findings is common and readers of
in the Broader Context of the Clinical cohort studies should be cautious. For example,
Question? Zielinski et al. [7] report in detail the findings of
The aim of cohort studies is to use a sample of studies that support their own, including the odds
patients for whom data are available to answer a ratios and confidence intervals for operative
clinical question that applies to the broader ref- exploration associated with the use of
erent patient population. The extent to which water-soluble contrast studies. In contrast, the
results from a study sample may be generalized authors do not describe the results of the studies
to the overall population is the external validity with conflicting results, including a Cochrane
or applicability of the study. A careful evaluation Systematic Review and Meta-Analysis that sug-
of study design, setting and analytical approach gests there is no difference in the rate of operative
is necessary to identify sources of bias and con- exploration associated with water-soluble con-
founding that may limit a study’s internal valid- trast studies [37].
ity—a prerequisite for external validity [36].
Investigators should frame their study in the Were All Clinically Important Outcomes
context of a larger clinical question and should Considered?
clearly indicate how their results may be gener- The effect of an intervention on any single out-
alized to the overall population that their cohort come is typically only a small part of the relevant
aims to represent. The readers of a study then clinical picture. Particularly in studies of surgical
have a responsibility to question whether it is interventions, important corollary outcomes
appropriate to do so. Differences in patient pop- invariably exist (e.g. complications rates,
ulation or setting may limit the applicability a peri-operative mortality, length of stay, read-
study’s results to the reader’s own practice. mission rate, quality-of-life). Without this
Unfortunately, Zielinski et al. [2] do not discuss broader clinical picture, application of a study’s
how closely their patient population and study results to a clinical setting is challenging and
setting reflect the population and setting of potentially dangerous. For example, in the article
patients with small bowel obstruction in general. by Zielinski et al. [2], the authors reported rates
of specific complications such as acute kidney
Is There a Description of the Existing injury and pneumonia as these were potentially
Literature? relevant outcomes for routine water-soluble
A description of the existing literature about a contrast use in aSBO. The authors do not report
clinical topic and a discussion of how the results on several important clinical outcomes, including
of a study fit into the body of literature provides peri-admission mortality, readmission and
important information to a reader. A study in recurrence of SBO.
which the findings are supported by previously
published studies increases confidence in the Are Limitations of the Study Clearly
conclusions, as the previously published studies Described?
may be performed in different settings and with Authors should clearly describe the limitations of
different patients. Alternatively, a study that their study and how these limitations may impact
contradicts the existing literature or for which the study’s interpretation and application.
there is a paucity of comparable studies should Potential limitations in any study include those
be considered with caution and the generaliz- associated with study setting and study design.
ability of its findings carefully explored. Issues such as selection bias, information/
The authors of the present study thoroughly measurement bias, and measured and unmea-
discuss how their findings correlate with the sured confounding are present to some extent in
168 R. Behman et al.
6. Ceresoli M, Coccolini F, Catena F, Montori G, Di 17. Bilimoria KY, Cohen ME, Ingraham AM, Ben-
Saverio S, Sartelli M, et al. Water-soluble contrast trem DJ, Richards K, Hall BL, et al. Effect of
agent in adhesive small bowel obstruction: a system- postdischarge morbidity and mortality on compar-
atic review and meta-analysis of diagnostic and isons of hospital surgical quality. Ann Surg.
therapeutic value. Am J Surg. 2016;211(6):1114–25. 2010;252(1):183–90.
7. Zielinski MD, Haddad NN, Cullinane DC, Inaba K, 18. Sackett DL. Bias in analytic research. J Chronic Dis.
Yeh DD, Wydo S, et al. Multi-institutional, prospec- 1979;32(1–2):51–63.
tive, observational study comparing the Gastrografin 19. Delgado-Rodriguez M, Llorca J. Bias. J Epidemiol
challenge versus standard treatment in adhesive small Community Health. 2004;58(8):635–41.
bowel obstruction. J Trauma Acute Care Surg. 20. Grimes DA, Schulz KF. Bias and causal associations
2017;83(1):47–54. in observational research. Lancet. 2002;359
8. Thoma A, Farrokhyar F, Bhandari M, Tandan V, (9302):248–52.
Evidence-Based Surgery Working G. Users’ guide to 21. Rothman KJ, Greenland S, Lash TL. Modern
the surgical literature. How to assess a randomized epidemiology, 3rd ed. Philadelphia: Wolters Kluwer
controlled trial in surgery. Can J Surg. 2004;47 Health/Lippincott Williams & Wilkins; 2008. 758 p.
(3):200–8. 22. Tripepi G, Jager KJ, Dekker FW, Wanner C, Zoc-
9. Cole P. The hypothesis generating machine. Epi- cali C. Bias in clinical research. Kidney Int. 2008;73
demiology. 1993;4(3):271–3. (2):148–53.
10. Loder E, Groves T, Macauley D. Registration of 23. Gail MH. Does cardiac transplantation prolong life?
observational studies. BMJ. 2010;340:c950. A reassessment. Ann Intern Med. 1972;76(5):815–7.
11. Registries for evaluating patient outcomes: a user’s 24. Tripepi G, Jager KJ, Dekker FW, Zoccali C. Selec-
guide. Rockville (MD): Agency for Healthcare tion bias and information bias in clinical research.
Research and Quality (US); 2014. Available from: Nephron Clin Pract. 2010;115(2):c94–9.
https://www.ncbi.nlm.nih.gov/books/NBK208611/. 25. Greenland S, Morgenstern H. Confounding in health
Accessed 20 Jul 2018. research. Ann Rev Public Health. 2001;22(1):189–
12. Lawson EH, Louie R, Zingmond DS, Brook RH, 212.
Hall BL, Han L, et al. A comparison of clinical 26. Mamdani M, Sykora K, Li P, Normand S-LT,
registry versus administrative claims data for report- Streiner DL, Austin PC, et al. Reader’s guide to
ing of 30-day surgical complications. Ann Surg. critical appraisal of cohort studies: 2. Assessing
2012;256(6):973–81. potential for confounding. BMJ. 2005;330
13. Cima RR, Lackore KA, Nehring SA, Cassivi SD, (7497):960–2.
Donohue JH, Deschamps C, et al. How best to 27. Fewell Z, Davey Smith G, Sterne JAC. The impact of
measure surgical quality? Comparison of the Agency residual and unmeasured confounding in epidemio-
for Healthcare Research and Quality Patient Safety logic studies: a simulation study. Am J Epidemiol.
Indicators (AHRQ-PSI) and the American College of 2007;166(6):646–55.
Surgeons National Surgical Quality Improvement 28. Normand S-LT, Sykora K, Li P, Mamdani M,
Program (ACS-NSQIP) postoperative adverse events Rochon PA, Anderson GM. Readers guide to critical
at a single institution. Surgery. 2011;150(5):943–9. appraisal of cohort studies: 3. Analytical strategies to
14. Mason SA, Nathens AB, Byrne JP, Fowler R, reduce confounding. BMJ. 2005;330(7498):1021–3.
Gonzalez A, Karanicolas PJ, et al. The accuracy of 29. Austin PC. The performance of different
burn diagnosis codes in health administrative data: a propensity-score methods for estimating differences
validation study. Burns. 2017;43(2):258–64. in proportions (risk differences or absolute risk
15. Lee DS, Stitt A, Wang X, Yu JS, Gurevich Y, reductions) in observational studies. Stat Med.
Kingsbury KJ, et al. Administrative hospitalization 2010;29(20):2137–48.
database validation of cardiac procedure codes. Med 30. Wang R, Lagakos SW, Ware JH, Hunter DJ,
Care. 2013;51(4):e22–6. Drazen JM. Statistics in medicine—reporting of
16. Shiloach M, Frencher SK Jr, Steeger JE, Rowell KS, subgroup analyses in clinical trials. N Eng J Med.
Bartzokis K, Tomeh MG, et al. Toward robust 2007;357:2189–94.
information: data quality and inter-rater reliability in 31. Nakagawa S, Cuthill IC. Effect size, confidence
the American College of Surgeons National Surgical interval and statistical significance: a practical guide
Quality Improvement Program. J Am Coll Surg. for biologists. Biol Rev Camb Philos Soc, 5 ed.
2010;210(1):6–16. 2007:82(4):591–605.
170 R. Behman et al.
32. Gardner MJ, Altman DG. Confidence intervals rather 35. Yusuf S, Wittes J, Probstfield J, Tyroler HA. Anal-
than P values: estimation rather than hypothesis ysis and interpretation of treatment effects in sub-
testing. Br Med J (Clin Res Ed). 1986;292 groups of patients in randomized clinical trials.
(6522):746–50. JAMA. 1991;266(1):93–8.
33. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, 36. Calder BJ, Phillips LW, Tybout AM. The concept of
Flint J, Robinson ESJ, et al. Power failure: why small external validity. J Consum Res. 1982;9(3):240–4.
sample size undermines the reliability of neuro- 37. Abbas S, Bissett IP, Parry BR. Oral water soluble
science. Nat Rev Neurosci. 2013;14(5):365–76. contrast for the management of adhesive small bowel
34. Schneeweiss S. Sensitivity analysis and external obstruction. Cochrane Database Syst Rev. 2007:(3):
adjustment for unmeasured confounders in epidemi- CD004651.
ologic database studies of therapeutics. Pharmacoepi-
demiol Drug Saf. 2006;15(5):291–303.
Case-Control Studies
17
Achilles Thoma, Jenny Santos, Jessica Murphy,
Eric K. Duku and Charles H. Goldsmith
(2) the association between smoking and lung on the association between chylothorax and
cancer [2] and (3) the association between the limb repeat paediatric cardiac surgery.
malformation, phocomelia, and thalidomide
ingestion during pregnancy [1].
Literature Search
“In paediatric patients undergoing cardiac sur- Thoma et al. [1] article provides appraisal ques-
gery, what is the risk of chylothorax following tions specific to case-control studies where harm
surgery?” OR “In paediatric patients undergoing to a patient is involved; these questions are
a redo cardiac surgery, what is the risk of chy- summarized in Box 17.1.
lothorax following surgery”.
Next, you use the above terms to perform a
Box 17.1. Guidelines for the Appraisal
thorough literature search (see Chap. 4). Using
of an Article in the Surgical Literature
the terms: “Paediatric” AND “Cardiac Surgery”
Using a Case-Control Study Design [1]
AND “Chylothorax” you perform a literature
search in the Cochrane Library to determine if
A. Are the results valid?
there are any acceptable randomized control tri-
als. The search yields two articles, however,
(i) Were cases and controls similar
neither are relevant to your research question (see
with respect to the indication or
Appendix 1). You try performing another litera-
circumstances that would lead
ture search on Cochrane using “Redo Cardiac
to exposure?
Surgery” in place of “Cardiac Surgery” and this
(ii) Were the circumstances and
search yields no results. Since your search on
methods for determining
Cochrane Library was not successful, you per-
exposure similar for cases and
form a literature search using PubMed using
controls?
“Paediatric” AND “Cardiac Surgery” AND
(iii) Was the correct temporal rela-
“Chylothorax”. Your search yields 84 publica-
tionship demonstrated?
tions, so you decide to narrow your search to
(iv) Was there a dose-response
those published in 2018 and utilize the “best
relationship?
match” sorting feature; this new filter yields nine
publications (see Appendix 2). Eight of these B. What were the results?
articles focused on the management of chy-
lothorax or elements of this condition. One of (i) How strong is the association
these articles, by Day et al. [3] titled “Chy- between exposure and outcome?
lothorax following paediatric cardiac surgery: A (ii) How precise was the estimate of
case-control study” seemed promising. It is a risk?
matched case-control study investigating the
association between numerous factors on the risk C. Will the results help me in caring
of chylothorax. for my patients in my practice?
Are the Results Valid? When selecting cases, Day et al. [3] used the
diagnostic criteria of (1) a pleural fluid lympho-
i. Were cases and controls similar with respect cyte count of >80% or (2) a triglyceride level >
to the indication or circumstances that would 1.1 mmol/L. Having predetermined diagnostic
lead to exposure? criteria helps to reduce misclassification bias, or
the error associated with misidentifying a
The study by Day et al. [3] reviewed health patient’s disease status [1]. The authors then
records of all paediatric patients who received found a matched control group that had under-
cardiac surgery at the Royal Children’s Hospital gone cardiac surgery within the same time period
(Australia). The authors used a 48-month window but did not develop chylothorax. For both the
from January 2008 to January 2012, as the inclu- case and control groups, the Cardiobase database
sion criteria for review. They authors identified from the Royal Children’s Hospital in Melbourne
121 cases (diagnosed with chylothorax). The was used.
comparison (control) group was formed by iden-
tifying 121 patients who had also received heart iii. Was the correct temporal relationship
surgery, within the same time period, but did not established?
develop chylothorax. The control group was mat-
ched to the cases by (1) Age (within three months if Establishing a temporal relationship, meaning
younger than one year of age or within one year for that the exposure occurred before the outcome, is
older patients); (2) Date of surgery (within one important in determining whether there is an
year); and (3) Sex. The matching of cases and association. In the Day et al. [3] study, the
controls is often performed in case-control studies authors do not give any information on the date
to better control for confounding (the confusion of of surgery as compared to the date of diagnosis.
the effect of a risk factor on the outcome of interest, The absence of these data may likely be due to
which is actually due to some other factor not the fact that the objective of this study was to
accounted for in the study) as well as to provide a determine the diagnosis of chylothorax
control group that better represents the target post-surgery, therefore, one would assume that
population. While the authors controlled for these the diagnosis date (outcome) would have come
three factors, they did not explicitly check, with after the re-do surgery date (exposure). While
statistical testing, to ensure that the matching you understand why this information may not be
process worked for these variables. provided, you feel it would have been helpful to
The objective of the Day et al. [3] study, was be given the dates of surgery and the date of
to explore the risk of developing chylothorax diagnosis. This information could have been
following surgery associated with a number of beneficial to compare both the impact of the
identified influencing factors. To explore this length of time between surgeries on risk, and the
association the cases and controls cannot be length of time past from surgery to diagnosis
matched for these factors; the following factors between the cases and controls.
have been identified as influencing the develop-
ment of chylothorax following surgery and were iv. Was there a dose-response relationship?
explored by Day et al. [3]: (1) genetic risk fac-
tors; (2) the annual hospital volume [4]; Determining a dose-response relationship can
(3) weight; (4) RACHS-1 score; (5) arch surgery; be difficult in surgery; it is much easier, for
(6) open or closed heart; (7) incision site; and example, to find a dose-response relationship
(8) surgery type (redo or virgin). between smoking (number of packs per year) and
lung cancer [1]. Previous work in the area of
ii. Were the circumstances and methods for cardiac surgery and chylothorax have suggested
determining exposure similar for the cases a possible dose-response relationship between
and controls? number of surgeries and the development of
17 Case-Control Studies 175
chylothorax. Both Ismail et al. [5] and Mery et al. in this manner does not account for potential
[4] found that there was an increased incidence confounding by other factors (genetic risk fac-
rate in redo cardiac surgery cases. In the Day tors, the annual hospital volume [4], weight at
et al. [3] article, results indicate that there is an time of surgery, RACHS-1 score, arch surgery,
increased risk with redo surgery; therefore this open or closed heart, and incision site [3].
may be suggesting that there is a dose-response Using the format of this basic 2 2 table, an
relationship in regard to number of interventions OR can be calculated using the following
and chylothorax risk. formula:
However, concluding a dose-response rela-
tionship the study by Day et al. [3] may not be a=b ad
Odds Ratio ¼ ¼
realistic as the authors did not use a measure c=d cb
detailing the specific number of redo surgeries in
the analysis. The redo surgery variable was In the Day et al. [3] article, the exposure of
classified as “none” or “at least 1 redo”. interest was re-do congenital cardiac surgery,
while the outcome of interest was postoperative
chylothorax. The authors included various risk
What Are the Results? factors for chylothorax following cardiac surgery
including the risk adjustment for congenital heart
i. How strong is the association between surgery (RACHS-1) score, arch surgery, open or
exposure and outcome? closed heart, incision site and incision type (redo
or virgin). A univariable OR is then calculated
In a case-control study, those individuals who for each risk factor. For example, Table 17.3
have the outcome of interest are known prior to illustrates what the 2 2 table for “Virgin sur-
the study beginning, therefore, a relative risk gery” versus “Redo surgery” would look like,
estimate cannot be used [1]. Instead, the odds using data from Day et al. [3], had the study been
ratio (OR) is more appropriate to measure the unmatched [3].
association of the exposure on the outcome [1].
ad
An OR represents a ratio of the odds having the Odds Ratio ¼
outcome in an “exposed” or case group relative cb
to the odds of having the outcome in an “unex- ð36Þð112Þ
¼
posed” or control group [6]. ð85Þð9Þ
Day and colleagues matched the cases to the ¼ 5:27
controls, and therefore a matched analysis
method was used to calculate the OR as opposed An OR of less than 1 represents a “protective”
to the standard method [1]. The standard effect of the risk factor increasing the odds of the
approach for the OR is calculated using the target group developing the target outcome [1]. In
number of individuals within both case and comparison, an OR of greater than 1 represents a
control groups who were or were not exposed to “hazardous” effect [1]. For the above-mentioned
any risk factor(s) of interest for an unmatched scenario, you are interested in the OR associated
study [1]. These numbers are best visualized by with the repeat surgery. Using a multivariable
using 2 2 tables. This type of table is appro- conditional logistic regression model, Day et al.
priate for unmatched studies because cases are [3] reported an OR of 20.7 (95% CI: 4.24–100)
independent from controls. Table 17.2 illustrates for redo surgeries. This means that the odds of
the basic format of a 2 2 table, to facilitate the chylothorax developing in those exposed to a
calculation of an OR for every risk factor (ex- redo surgery is 20.7 times greater than those who
posure) included in an unmatched study. It is also are having a surgery for the first time after con-
important to note that calculating an OR by hand trolling for other confounding variables.
176 A. Thoma et al.
However, as cases and controls were matched Table 17.2 Basic format of a 2 2 table used in
by date of surgery, age at surgery and sex and case-control studies
Day et al. [3] used a matched analysis, the basic Outcome
format of the 2 2 table above would be Exposure Yes (Cases) No (Controls)
slightly changed and the study results would Yes a b
have resulted from a different method of calcu- No c d
lation. In a matched case-control, numbers of
exposed and unexposed patients are illustrated
by how many PAIRS had different exposure
Table 17.3 Format of a 2 2 table that could have
statuses (Table 17.4). Contrary to the traditional been used in the Day et al. [3] study had the design been
method, cases and controls are no longer inde- unmatched
pendent groups and the unit of analysis is the Chylothorax
matched case-control pair.
Surgery type Cases Controls
In turn, the formula shown above for an un-
matched case-control study is also slightly Redo 36 9
changed to calculate the OR in a matched study: Virgin 85 112
b
Odds Ratio ¼
c Table 17.4 Format of a 2 2 table that should be used
in the Day et al. [3] study due to its matched design
In Table 17.4, the cells describe how many
Control pair member
patients, within each matched pair (121 in total)
were exposed or unexposed (redo or virgin sur- Case pair member Exposed Not exposed
gery). For a matched case-control design, cells b Exposed a b
and c are the only cells that would apply to Not exposed c d
calculate an OR, as you can be confident that the
odds of having the outcome due to the exposure
or not having the outcome due to not being
exposed would be comparable between cases and in this analysis [3] (this is often used in uni-
controls in a matched sample. variable models to determine which risk factors
In this particular situation, it is not possible to may be cofounding the results), it was one of the
calculate the OR that would represent the risk factors to be included in the subsequent
dependent nature of cases and controls as the multivariable model.
study does not provide information for matched In this multivariable model, used to account
pairs and just lists information by group for cases for confounding by other factors, Day et al. [3]
and controls. Again, calculating an OR in this found an OR of 20.7, between redo surgeries and
manner does not account for potential con- the risk of chylothorax. As the risk became
founding by other factors such as the risk factors greater in this model, this means that confound-
listed previously. ing factors had an effect on the relationship
Tables 17.2, 17.3 and 17.4 are purely for between chylothorax and “redo” surgeries, so it
illustrative purposes, to show how you might go was appropriate to include those factors in the
about getting a raw/unadjusted OR to measure multivariable model.
the risk between the exposure and outcome in
your study. This type of analysis is referred to as ii. How precise was the estimate of risk?
a univariable analysis. The univariable model in
the Day et al. [3] article yielded an OR of 10.0, The precision of an OR estimation can be
between redo surgeries and the risk of chy- judged by the confidence interval, which, is
lothorax. As this was significant at the 10% level strongly impacted by the sample size of a study.
17 Case-Control Studies 177
Day et al. [3] used two methods of calculating previously stated, the outcome of interest has
the OR, a univariable conditional logistic already occurred in case-control studies before
regression model and a multivariable conditional the onset of the investigation. Day et al. [3]
logistic regression model. When looking at the included records in this study up to 48-months
univariable model, the “redo” group had an OR from January 2008, which is equivalent to
of 10.0 with a 95% confidence interval of approximately 4 years, or January 2012. A study
3.05–32.8 [3]. In the multivariable model, the period of 3 years seems sufficient for the outcome
“redo” surgery group had an OR of 20.7 with a of interest; although some outcomes take time to
95% confidence interval of 4.24–100 [3]. As a develop, chylothorax has been found to develop
multivariable model accounts for other factors on average, eight days following surgery [8].
that can affect the relationship between an out- When comparing the current article to the
come and exposure, any conclusions should be literature, chylothorax studies seem to have a
based on the results from a multivariable model follow-up period varying from 2 to 7 years
(OR = 20.7; 95% CI: 4.24–100). [7–10]. Further, Day et al. [3] identify 121 cases,
The confidence intervals are quite wide, and making it the second largest study of its kind.
most likely due to the low number of patients Given the fact that the outcome has occurred
(n = 45) who had “redo” surgeries. While there prior to the commencement of the study, the
are no set criteria to determine if a confidence similar follow-up interval to other studies, and
interval is “too wide”, we can look at previous the large number of cases identified, you are
studies in the same area. Day et al. [3] state that confident in concluding that the follow-up in the
their study is the second largest study in the area Day et al. [3] study was sufficient.
identifying 121 cases of chylothorax, second
only to a multi-site study [3]. Additionally, one ii. Is the exposure similar to what might occur
should consider the fact that chylothorax is a in my patient?
relatively rare condition, therefore the absolute
risk is quite low, and there is a smaller chance of Where the exposure in the Day et al. [3] study
this event occurring within a specific time inter- is cardiac surgery, the exposure of interest in our
val. Therefore, while the estimate of the risk of scenario is redo heart surgery. Given that both
chylothorax may not be overly precise, due to the exposures (cardiac surgery or redo heart surgery)
small sample size, Day et al. [3] did have an are standardized procedures used for specific
adequate number of patients, as compared to diseases or in specific situations, there is no
other work in the area. Additionally, this shows reason to believe that there would be a difference
that “redo” surgeries are quite rare in this par- between the exposure of your own patient, and
ticular study. the patients in the Day et al. [3] article.
Fig. 17.1 Odds ratio and confidence interval for univariate and multivariable model
17 Case-Control Studies 179
Due to this low risk, the apparent commonality Resolution of the Scenario
and mandatory nature of second surgeries you
believe that there are benefits of this exposure to Based on the evidence presented in the Day et al.
the patient. [3] article, the Paediatric cardiac fellow presents
her conclusions to her staff. Chylothorax has
v. Were the patients in the appraised study odds 10 as likely to occur with repeat cardiac
similar to the patient in my practice? surgery than a single previous surgery. She also
acknowledges that the Day et al. [3] article also
In the Day et al. [3] study, the median age for incriminates the complexity of the cardiac sur-
the case and control groups was 0.23 and gery especially around the arch of the aorta as
0.25 years, respectively. The patient in our sce- carrying a higher risk for chylothorax. In this
nario is 3 months old (or 0.25 years), the authors regard, both staff were correct.
mention that those under 12 months of age are at
higher risk and therefore, the results can likely be
applied to our patient [3]. Additionally, this study Additional Information
was performed in Australia, which does not have for the Reader
any drastic differences in terms of healthcare,
access to services and lifestyle compared to the Appraising epidemiological studies usually
Canadian population [7]. Due to these similari- involves piecing together numerous components.
ties, you are confident that the information from The Graphical Appraisal Tool for Epidemiolog-
the Day et al. [3] article could be generalized to ical studies (GATE) framework can help to
your patient. organize these different elements of a study [12].
Figure 17.2 is our interpretation of this
framework, as it relates to the Day et al. [3] Chylothorax following paediatric cardiac
article. The original framework was developed surgery: a case-control study. Cardiol
by The Evidence-Based Medicine (EBM) Young. 2018;28(2):222–8.
Working group to help students conceptualize a 2. Waterhouse SG, Vergales JE, Con-
study as a whole, as well as by its individual away MR, Lee L. Predictive factors for
parts [4]. central line-associated bloodstream infec-
We have taken this framework and adjusted it tions in pediatric cardiac surgery patients
to represent the variables that were looked at for with chylothorax. Pediatr Crit Care Med.
this particular scenario. In our version of this 2018; Epub ahead of print.
framework, we used “Redo surgery” as the 3. Justice L, Buckely JR, Floh A, Horsley M,
exposure/intervention and “Virgin surgery” as Alten J, Anand V, Schwartz SM. Nutrition
the comparison. We then explain the odds ratio considerations in the pediatric cardiac
calculation to show the risk associated with intensive care unit patient. World J Pediatr
developing chylothorax (cases) following a Congenit Heart Surg. 2018;9(3):333–43.
“Redo surgery” (exposure). For this recreation of 4. Justice LB, Nelson DP, Palumbo J,
the framework, the time horizon was left out as Sawyer J, Patel MN, Byrnes JW. Efficacy
there was no specific time horizon, other than just and safety of recombinant tissue plasmino-
“following surgery” for this scenario. We gen activator for venous thrombosis after
encourage the reader to review the GATE paediatric heart surgery. Cardiol young.
framework as it offers a clear and concise format 2018;28(2):214–21.
to summarize and appraise epidemiological 5. Lo Rito M, AlRadi OO, Saedi A, Kotani Y,
studies [12]. Ben Sivarajan V, Russell JL, Cal-
darone CA, Van Arsdell GS, Honjo O.
Chylothorax and pleural effusion in con-
Appendix 1 temporary extracardiac fenestrated fontan
completion. J Thorac Cardiovasc Surg.
Search Results Using Cochrane Library: “Pae- 2018;155(5):2069–77.
diatric”, “Cardiac Surgery”, “Chylothorax” 6. Wu C, Wang Y, Pan X, Wu Y, Wang Q,
Li Y, An Y, Li H, Wang G, Dai J. Analysis
1. Das A, Shah PS. Octreotide for the treatment of of the etiology and treatment of chylothorax
chylothorax in neonates. Cochrane Database of in 119 pediatric patients in a single clinical
Systematic Reviews. 2010;9: CD006388. center. J Pediatr Surg. 2018. E-pub ahead of
2. Mosalli R, AlFaleh K. Prophylactic surgical print.
ligation of patent ductus arteriosus for pre- 7. Weissler JM, Cho EG, Koltz PF, Car-
vention of mortality and mobidity in extre- ney MJ, Itkin M, Laje P, Levin LS, Dori Y,
mely low birth weight infants (Review). Kanchwala SK, Kovach SJ. Lymphovenous
Cochrane Database of Systematic Reviews. anastomosis for the treatment of chylotho-
2008;1:CD006181. rax in infants: A novel microsurgical
approach to a devastating problem. Plast
Reconstr Surg. 2018;141(6):1502–7.
8. Muniz G, Hidalgo-Capos J, Valdivia-Tapia
Appendix 2 MDC, Shaikh N, Carreazo NY. Succesfful
management of chylothorax with etilefrine:
Search Results Using PubMed: “Paediatric”, Case report in 2 pediatric patients. Pedi-
“Cardiac Surgery”, “Chylothorax” atrics. 2018;141(5):e20163309.
9. Ok YK, Kim YH, Park CS. Surgical
1. Day TG, Zannino D, Golshevsky D, reconstruction for high-output chylothorax
d’Udekem Y, Brizard C, Cheyng MMH. associated with thrombo-occulusion of
17 Case-Control Studies 181
superior vena cava and left innominate vein 5. Ismail SR, Kabbani MS, Najm HK, Shaath GA,
in a neonate. Korean J Thorac Cardiovasc Jijeh AMZ, Hijazi OM. Impact of chylothorax on the
early post operative outcome after pediatric cardiac
Surg. 2018;51(3):202–4. surgery. J Saudi heart Assoc. 2014;26(2):87–92.
10. Winder MM, Eckhauser AW, Delgado- 6. Guyatt G, Rennie D, Meade MO, Cook DJ. Users
Corcoran C, Smout RJ, Marietta J, guides to the medical literature: a manual for
Bailly DK. A protocol to decrease postop- evidence-based clinical practice. 2nd ed. 2009 JAMA.
7. Hussey P, Anderson GF, Osborn R, Feek C,
erative chylous effusion duration in chil- McLaughlin V, Millar J, Epstein A. How does the
dren. Cardiol Young. 2018;28(6):816–25. quality of care compare in five countries? Health Aff.
2004;23(3):89–99.
8. Yeh J, Brown ER, Kellogg KA, Donohue JE, Yu S,
Gaies MG, Fifer CF, Hirsch JC, Aiyagari R. Utility
of a clinical practice guideline in treatment of
References chylothorax in the postoperative congenital heart
patient. Ann Thorac Surg. 2013;96:930–7.
1. Thoma A, Kaur MN, Farrokhyar F, Waltho D, 9. Chan EH, Russell JL, Williams WF, Van Arsdell GS,
Levis C, Lovrics P, Goldsmith CH. Users’ guide to Coles JG, McCrindle BW. Postoperative chylothorax
the surgical literature: how to assess an article about after cardiothoracic surgery in children. Ann of
harm. Can J Surg. 2016;59(5):351–7. Thoracic Surg. 2005;80(5):1864–70.
2. Song JW, Chung KC. Observational studies: cohort 10. Szumilas M. Explaining Odds Ratios. J Can Acad
and case-control studies. PRS. 2010;126(5):2234–42. Child Adolesc Pscyhiatry. 2010;19(3):227–9.
3. Day TG, Zannino D, Golshevsky D, d’Udekem Y, 11. Villa-Hincapie CA, Careeno-Jaimes M,
Brizard C, Cheung MMH. Chylothorax following Obando-Lopez CE, Camacho-Mackenzie J, Umaña--
paediatric cardiac surgery: a case–control study. Mallarino JP, Sandoval-Reyes NF. Risk factors for
Cardiol Young. 2018;28(2):222–8. mortality in reoperations for pediatric and congential
4. Mery CM, Moffett BS, Khan MS, Zhang W, heart surgery in a developing country. World J
Guzmán-Pruneda FA, Fraser CD Jr, Cabrera AG. Pediatr Congenit Heart Surg. 2017;8(4):425–9.
Incidence and treatment of chylothorax after cardiac 12. Jackson R, Ameratunga S, Broad J, Connor J, Lethaby A,
surgery in children: analysis of a large multi- Robb G, Wells S, Glasziou P, Heneghan C. The GATE
institution database. J Thorac Cardiovasc Surg. frame: critical appraisal with pictures. Evidence-Based
2014;147(3):678–86. Med. 2006;11(2):35–8.
Evaluating Case Series in Surgery
18
Christopher J. Coroneos and Brian Hyosuk Chin
where the esophagus is replaced with a pedicled Table 18.1 Key characteristics of the case series by Poh
jejunal conduit, and the perfusion to the most et al. [10]
proximal segment of jejunum is augmented with Characteristic Case series
microsurgical vessel anastomoses. You wonder if Objective “To review our experience and
the surgeons that presented the work have pub- technique of the supercharged
lished their techniques in a peer-reviewed jejunal flap for total esophageal
reconstruction”
journal.
Population n: 51
Esophageal cancer: 38 (75%)
Age (mean, min–max): 55 (28–74)
Literature Search Male sex: 36 (71%)
Active smoking: 6 (12%)
History of smoking: 32 (63%)
You visit PubMed.gov and perform a search with
Hypertension: 22 (43%)
the terms: “esophageal” AND “reconstruction” Other malignancy: 10 (20%)
AND “jejunum”. The search is limited to human Coronary artery disease: 7 (14%)
subjects, English language, and a publication Diabetes mellitus: 5 (10%)
COPD/asthma: 5 (10%)
date between 2008 and 2018. This yields 116
results and you identify several studies on eso- Intervention Supercharged jejunal reconstruction
Immediate: 34 (67%)
phageal reconstruction with free jejunal flaps. Delayed: 17 (33%)
There are four case series that interest you. Two Preoperative chemotherapy: 33
of the case series are from Japan with 11 and 24 (65%)
patients in the study [7, 8]. The third case series Preoperative radiation: 34 (67%)
is from the United Kingdom with 31 cases [9]. Study design Retrospective review of prospective
The final case-series published in the Annals of database
Not explicitly consecutive
Surgery by Poh et al. [10] from MD Anderson
Outcomes 30-day mortality: 0
Cancer Center (Houston, Texas) includes 51
Overall mortality: 2 (4%)
patients. Given that Poh et al. [10] includes the Flap re-exploration: 3 (6%)
greatest number of cases, a North American Flap failure: 2 (4%)
patient population, and is published in a ICU stay (mean, SD): 9.0 days
(11.7)
high-profile journal, you select this paper for
Length of stay (mean, SD):
review. Key characteristics of the case series are 21.5 days (14.0)
listed in Table 18.1. Regular diet: 44 (90%)
Discontinue tube feeds: 39 (80%) at
103 days (SD 81)
Appraisal of a Surgical Case Series Follow-up Mean 21.9 months (2–80)
Complications Overall: 33 (65%)
The appraisal of case series involves evaluating Fistula: 7 (14%)
Stricture: 5 (10%)
the validity of the study, interpreting the results,
and applying study findings clinically (Box 1).
18 Evaluating Case Series in Surgery 185
intervention, the reconstructive procedure simply reports that patients were identified
itself is a consistent time point of reference from a prospective database from 2000 to
for all patients. While a prospective study 2009 with a supercharged jejunal flap
would be ideal, recall and information bias within the defined period at MD Anderson
are minimized in this series. Cancer Center (Houston, Texas). There is
no discussion pertaining to the percentage
of patients agreeing to treatment, though
iii. Was patient recruitment consecutive?
with a cancer diagnosis this procedure is not
Inclusion of consecutive patients is the best
elective. However, an indication of the
indication to the reader that selection bias is
proportion of patients “fit” for this proce-
minimized. In case series where favorable
dure should be discussed.
patients are selected or “cherry-picked”,
results will likely overestimate positive
outcomes, namely, the success of a surgical iv. Was outcome assessment appropriate?
intervention [5]. For example, selecting As with other study designs, the primary
only American Society of Anesthesiologists outcome of a case series should be specific,
(ASA) 1 patients will result in decreased clinically relevant, measured at an appro-
morbidity and mortality. Reporting only priate time, and be important to patients;
ideal patients limits the external validity of these factors are indicative of the series’
the study, and readers cannot be confident clinical impact [14]. Readers can consider
the series represents practical effectiveness the modified hierarchy of patient-important
for all patients in real-world practice [13]. outcomes where Class I: mortality, Class II:
In a special case, for series where an inci- morbidity, Class III: symptoms/quality of
dence or estimate of risk is reported, it is life/functional status, Class IV: surrogate
important that all patients are reported for a [15, 16]. The highest-class clinically rele-
given time period, and represent a demo- vant outcome should be reported to have the
graphic patient sample. An accurate “de- largest impact on patient care [15, 16].
nominator” for the patient group must be Class I and II outcomes are considered
obtained. For example, a series of all births “hard” outcomes; they are dichotomous,
at a hospital are a demographic sample, and and least subjective (e.g., “life or limb”).
the study can provide the incidence of birth For example, a case series on melanoma, or
injury for the given patient group. Con- major burn care, can report mortality (Class
versely, a series of cases not resolving with I), a case series on early laparoscopic
conservative management and subsequently cholecystectomy for cholecystitis can mea-
referred to a specialist represents only a sure 30-day major complications (Class II),
portion of all cases, and patient group and a case series on breast reconstruction
metrics cannot be calculated. can report long-term quality of life with
In case series where consecutive patients are BREAST-Q [17] (Class III).
not included, specific inclusion and exclu- Appropriate timing of outcome assessment
sion criteria should be reported [5]; readers is equally important; the reader must con-
should interpret results accordingly. Fur- sider if sufficient time has accumulated to
ther, for case series reporting a high-risk assess patient survival, the success of a
surgical intervention, even consecutive procedure, or quality of life [5]. For exam-
cases will not represent all patients; some ple, 30 days may be appropriate for a series
discussion of the characteristics for patients assessing survival of acute trauma inter-
who agree to treatment should be included. ventions, whereas 2 years may be necessary
Poh et al. [10] do not specifically report for long-term functional outcomes of a
consecutive, or “all” patients. The series nerve reconstruction potentially requiring
18 Evaluating Case Series in Surgery 187
multiple procedures or long-term physical Poh et al. [10] comprehensively report the
therapy. Readers should look for a mini- details of a supercharged jejunal flap. The
mum follow-up period as a criterion for discussion focuses on three important aspects
case series inclusion [4]. Finally, possible of the reconstruction: “(1) selecting the
bias in outcome assessment should be con- appropriate jejunal segment, (2) choosing the
sidered. Blinding outcome assessment is optimal recipient vessels for microsurgical
frequently feasible in prospective designs anastomosis, and (3) creating a suitable con-
[14]. For retrospective designs, “hard” out- duit passageway.” Beyond the meticulous
comes are the most objective. description of the intervention, technical
Poh et al. [10] report a number of appro- refinements, pitfalls, and the “how” and
priate intra- and postoperative outcomes “why” of these critical steps are rationalized.
given the retrospective design. Mortality Further, authors comment on planning the
(Class I) is reported, as are dichotomous procedure between three teams, with steps
Class II outcomes of intensive care admis- that can be performed concurrently between
sion, reoperation, flap failure, overall major ablative and reconstructive surgeons.
complications (Table 18.1). Finally,
Class III outcomes of oral intake, achieving
regular diet and discontinuing tube feeds.
Mean follow-up time of 22 months is II. What Are the Results?
appropriate given the outcomes. The 2–80
month minimum and maximum follow-up
i. Are outcomes appropriately reported?
interval could include mortality; defining
The case series is a descriptive study design,
minimum follow-up excluding these cases
not analytic; no control group is present [4].
would improve their methodology and
As such, “cause and effect” relationships
reporting.
cannot be established. Readers should be
cautious that causal inferences between an
v. Is the intervention appropriately described? intervention and outcomes are not reported.
Most surgical case series describe a proce- Instead, outcomes can be discussed in a
dure; technical details of the intervention and descriptive manner, and hypotheses can be
perioperative management should be suffi- generated for testing in a higher level study
ciently described for readers to appropriately design [4]. For example, the first case series
interpret, and possibly reproduce reported of breast implant-associated anaplastic large
results [5]. Beyond the specific description of cell lymphomas (BI-ALCL) [19] could not
a surgical procedure, readers should evaluate demonstrate causation of silicone breast
for operative indications, preoperative implants; results suggested an association
workup/care (e.g., imaging investigations, with silicone implants, and suggested confir-
resuscitation protocols), perioperative care mation with higher level studies (e.g.,
(e.g., antibiotics, venous thromboembolism case-control study) . As discussed in Was
prophylaxis, necessary patient recruitment consecutive?, the series
instruments/disposables, dosages of unique also did not include a demographic patient
medications), and postoperative care (e.g., sample, and incidence of BI-ALCL could not
physical therapy protocols, imaging surveil- be estimated.
lance) [1]. Any cointerventions should be Readers should be cautious interpreting
described [18] (e.g., comorbid fracture repair statistics beyond simple descriptive statistics,
with intramedullary nail of lower extremity especially if p-values are present; authors
fractures, abdominoplasty with breast reduc- should be conservative in reporting case ser-
tion) [5]. ies results [4]. In select cases, authors may
188 C. J. Coroneos and B. H. Chin
must decide if patients are comparable to their Poh et al. [10] describe a patient group with
own. Case series typically have high external high prevalence of recurrent disease and
validity [5]. Inclusion and exclusion criteria secondary reconstructions after treatment
are not as stringent as in higher level study failures at other institutions. Patients simi-
designs (e.g., RCTs), and results can often be larly have a number of comorbidities. MD
applied to a wider spectrum of patients [4]. Anderson can be classified as a “quater-
Still, readers should consider the comorbidi- nary” setting, where patients are captured
ties and characteristics of the patients inclu- from a global base [22]. Expertise beyond
ded in the case series, especially prognostic the reconstructive team is necessary, with
variables (e.g., age, smoking status, tumor the series describing a 3-team surgical
stage) [1]. approach, postoperative ICU and ward
Poh et al. [10] comprehensively describe their management, and subsequent nutrition and
patient population. Table 18.1 describes barium swallow assessment.
baseline demographics, with high prevalence
of recurrent disease, smoking, major comor-
bidities, and chemoradiation. These prognos-
tic factors are, however, not surprising given Resolution of Clinical Scenario
the disease process and intervention described
in the case series. The age, sex, comorbidities, You review the case series and discuss it with
and presenting diagnosis of patients in the your colleagues in planning for this challenging
case series match those of your patient. case. Given that the stomach will be unavailable
for reconstruction, you agree a supercharged
jejunal flap is appropriate for reconstruction. The
ii. Is the setting similar to my own?
series is appropriately designed, with appropriate
In applying results, readers must be certain
outcome assessment for the procedure. You are
that the care setting is similar to their own.
impressed by the number of technical refine-
Case series are often single center, however,
ments, the comprehensive description of the
if outcomes are pooled with a multicenter
surgical steps, and the candid rationale for
design, the differences between centers
changes. Your hospital is not a quaternary level
should be defined [5]. Novel and innovative
center, but it is an academic center with onco-
interventions are often introduced by ter-
logic general surgeons, thoracic surgeons, and
tiary or even “quaternary” settings, where
microsurgical plastic surgeons who are all fel-
highly specialized care is delivered to what
lowship trained, and capable of undertaking the
amounts to a worldwide referral base. While
required supercharged jejunum procedure. You
academic centers typically have more
have a discussion with your patient regarding the
resources versus community settings,
risks of the procedure, the perioperative man-
patients are correspondingly more compli-
agement, postoperative diet, and expected
cated [5]. Multidisciplinary care is often
recovery. Given the complexity of the case, you
necessary for increasingly complex inter-
and your colleagues decide to proceed with a
ventions, for example, intraoperative anes-
two-stage surgical plan. On the first day of the
thesia management, expertise of other
procedure, the ablative teams will proceed with
surgical specialties, postoperative inpatient
an esophagectomy, and partial gastrectomy. At
monitoring, postoperative imaging surveil-
the conclusion of the ablative procedure, the
lance, and functional assessment, and
patient’s remaining stomach will be evaluated,
long-term physical therapy.
and the teams will decide if reconstruction with a
190 C. J. Coroneos and B. H. Chin
gastric conduit is possible. If this is not possible 4. Kooistra B, Dijkman B, Einhorn TA, Bhandari M.
(as predicted), reconstruction with a super- How to design a good case series. J Bone Joint Surg
Am [Internet]. 2009;91(Suppl 3):21–6. Available
charged jejunal flap will be performed the fol- from: http://www.ncbi.nlm.nih.gov/pubmed/
lowing day in a staged manner to minimize 19411496.
operating time. During the procedure, you will 5. Coroneos CJ, Ignacy TA, Thoma A. Designing and
incorporate the case series’ details in selecting reporting case series in plastic surgery. Plast Reconstr
Surg [Internet]. 2011;128(4):361e–8e. Available from:
and isolating a jejunal segment of adequate http://www.ncbi.nlm.nih.gov/pubmed/21544010.
length, creating a pathway through the chest to 6. Audigé L, Hanson B, Kopjar B. Issues in the
pass the jejunum, and finally dissecting a recip- planning and conduct of non-randomised studies.
ient artery and vein to perfuse the segment. Injury [Internet]. 2006;37(4):340–8. Available from:
http://www.ncbi.nlm.nih.gov/pubmed/16483576.
7. Okumura Y, Mori K, Yamagata Y, Fukuda T, Wada I,
Shimizu N, et al. Two-stage operation for thoracic
Conclusion esophageal cancer: esophagectomy and subsequent
reconstruction by a free jejunal flap. Surg Today
[Internet]. 2014;44(2):395–8. Available from: http://
Although case series are recognized as Level IV www.ncbi.nlm.nih.gov/pubmed/24292600.
in the hierarchy of evidence, they remain 8. Takeno S, Takahashi Y, Moroga T, Yamamoto S,
prevalent in the surgical literature. Case series is Kawahara K, Hirano T, et al. Reconstruction using a
the most common study design in surgical spe- free jejunal graft after resection of the cervical
esophagus for malignancy. Hepatogastroenterology
cialty research and remain predominant in [Internet]. 2013;60(128):1966–71. Available from:
reporting unique patient presentations, and http://www.ncbi.nlm.nih.gov/pubmed/24719936.
innovative surgical procedures. They are unique 9. Laing TA, Van Dam H, Rakshit K, Dilkes M,
in their high external validity, and fast and less Ghufoor K, Patel H. Free jejunum reconstruction of
upper esophageal defects. Microsurgery [Internet].
costly design. Alone, case series can establish the 2013;33(1):3–8. Available from: http://www.ncbi.
safety of an intervention, or the diagnostic nlm.nih.gov/pubmed/22821641.
accuracy of a test [4]. From a methodological 10. Poh M, Selber JC, Skoracki R, Walsh GL,
standpoint, they are important in demonstrating Yu P. Technical challenges of total esophageal
reconstruction using a supercharged jejunal
feasibility, and patient characteristics for higher flap. Ann Surg [Internet]. 2011;253(6):1122–9.
level study planning. With consideration of the Available from: http://www.ncbi.nlm.nih.gov/
topics discussed in this chapter, surgeons can pubmed/21587114.
11. Bernard T, Armstrong-Wells J, Goldenberg N. The
confidently assess the validity and applicability
institution-based prospective inception cohort study:
of a case series relevant to their practice. design, implementation, and quality assurance in
pediatric thrombosis and stroke research. Semin
Thromb Hemost [Internet]. 2012;39(1):10–4. Avail-
able from: http://www.thieme-connect.de/DOI/DOI?
References 10.1055/s-0032-1329551.
12. Sedgwick P. Prospective cohort studies: advantages
1. Agha RA, Fowler AJ, Rajmohan S, Barai I, Orgill DP, and disadvantages. BMJ [Internet]. 2013;347(1):
PROCESS Group. Preferred reporting of case series in f6726. Available from: http://www.bmj.com/cgi/doi/
surgery; the PROCESS guidelines. Int J Surg [Inter- 10.1136/bmj.f6726.
net]. 2016;36(Pt A):319–23. Available from: http:// 13. Sedgwick P. Bias in observational study designs:
www.ncbi.nlm.nih.gov/pubmed/27770639. prospective cohort studies. BMJ [Internet]. 2014;349
2. Chan K, Bhandari M. Three-minute critical appraisal (2):g7731. Available from: http://www.bmj.com/cgi/
of a case series article. Indian J Orthop [Internet]. doi/10.1136/bmj.g7731.
2011;45(2):103–4. Available from: http://www.ncbi. 14. Voineskos SH, Coroneos CJ, Ziolkowski NI,
nlm.nih.gov/pubmed/21430861. Kaur MN, Banfield L, Meade MO, et al. A systematic
3. Joyce KM, Joyce CW, Kelly JC, Kelly JL, Car- review of surgical randomized controlled trials: part
roll SM. Levels of evidence in the plastic surgery I. Risk of bias and outcomes: common pitfalls plastic
literature: a citation analysis of the top 50 “classic” surgeons can overcome. Plast Reconstr Surg [Inter-
papers. Arch Plast Surg [Internet]. 2015;42(4):411–8. net]. 2016;137(2):696–706. Available from: http://
Available from: http://www.ncbi.nlm.nih.gov/ www.ncbi.nlm.nih.gov/pubmed/26818309.
pubmed/26217560. 15. Bala MM, Akl EA, Sun X, Bassler D, Mertz D,
Mejza F, et al. Randomized trials published in higher
18 Evaluating Case Series in Surgery 191
vs. lower impact journals differ in design, conduct, [Internet]. 2008;300(17):2030–5. Available from:
and analysis. J Clin Epidemiol [Internet]. 2013;66 http://www.ncbi.nlm.nih.gov/pubmed/18984890.
(3):286–95. Available from: http://www.ncbi.nlm. 20. Akl EA, Briel M, You JJ, Lamontagne F, Gangji A,
nih.gov/pubmed/23347852. Cukierman-Yaffe T, et al. LOST to follow-up
16. Lubsen J, Kirwan B-A. Combined endpoints: can we Information in Trials (LOST-IT): a protocol on the
use them? Stat Med [Internet]. 2002;21(19):2959–70. potential impact. Trials [Internet]. 2009;10:40. Avail-
Available from: http://www.ncbi.nlm.nih.gov/ able from: http://www.ncbi.nlm.nih.gov/pubmed/
pubmed/12325112. 19519891.
17. Pusic AL, Klassen AF, Scott AM, Klok JA, 21. Schulz KF, Grimes DA, Altman DG, Hayes RJ. Blind-
Cordeiro PG, Cano SJ. Development of a new ing and exclusions after allocation in randomised
patient-reported outcome measure for breast surgery: controlled trials: survey of published parallel group
the BREAST-Q. Plast Reconstr Surg [Internet]. trials in obstetrics and gynaecology. BMJ [Internet].
2009;124(2):345–53. Available from: http://www. 1996;312(7033):742–4. Available from: http://www.
ncbi.nlm.nih.gov/pubmed/19644246. ncbi.nlm.nih.gov/pubmed/8605459.
18. Carey TS, Boden SD. A critical guide to case series 22. Van Allen EM, Lui VWY, Egloff AM, Goetz EM,
reports. Spine (Phila Pa 1976) [Internet]. 2003;28 Li H, Johnson JT, et al. Genomic correlate of
(15):1631–4. Available from: http://www.ncbi.nlm. exceptional erlotinib response in head and neck
nih.gov/pubmed/12897483. squamous cell carcinoma. JAMA Oncol [Internet].
19. de Jong D, Vasmel WLE, de Boer JP, Verhave G, 2015;1(2):238–44. Available from: http://www.ncbi.
Barbé E, Casparie MK, et al. Anaplastic large-cell nlm.nih.gov/pubmed/26181029.
lymphoma in women with breast implants. JAMA
Quality Improvement and Patient
Safety in Surgery 19
Martin A. Koyle and Jessica H. Hannick
Introduction Background
The practice of medicine continues to evolve, In essence, surgeons have always pursued the
due to advances, often disruptive in nature, in concept of gradual, progressive change leading to
basic science, pharmaceuticals and technology. ongoing opportunities for improvement. Our
However, the way we practice is also rapidly surgical results are audited to search for root
changing, as a result of the demands and causes of untoward outcomes caused by errors of
expectations of the environment(s) in which we omission and commission related to our inter-
work (government, private sector, academic, ventions. At the turn of the twentieth century,
hospital), and newer approaches for whom we Ernest Amory Codman began tracking and
care. The paternalistic model of care, where a reporting his ‘end results’, a concept that was
physician dictated diagnostic and management totally foreign to the established (Harvard)
plans, has become a shared decision model, community at Massachusetts General Hospital
wherein transparency and joint interaction have (MGH). Codman promoted the End Result Idea,
become the new norm. Most of us were brought through which he would document patient
up in an era of volume-based health care deliv- demographics and track their outcomes. He
ery. This is now the era of evidence-based suggested that it was our ethical duty to report
medicine, with personalized care and value has each surgeon’s results, good or bad, and identify
become the pivotal goals. Value (V)–based care, opportunities for improvement His desire to
will be evaluated based on performance, that is formally assess surgeon competence, rather than
on quality (Q) and cost ($) (V = Q/$) [2]. This promotion based on seniority, led to his dismissal
approach will likely reimburse or reward provi- from MGH. Defiantly, he built his own facility,
ders based on outcomes (quality) rather than on The End Result Hospital. There, he could mea-
number of cases performed. With skyrocketing sure efficiency and performance and present it to
healthcare costs, preventable errors have become the public [3, 4]. To the medical community at
a target for quality improvement and patient that time, it was heresy. Ultimately, the concept
safety (QIPS). of frank, scheduled discussion sessions, where
For surgeons, in particular, this constantly analysis of hospital reform and patient outcomes
changing playing field has to be balanced with occurs, has led to our now sacrosanct morbidity
patient safety concerns. Yet surgery has so and mortality (M and M) conferences. He and
many opportunities to be successful in QIPS other leaders of the newly formed American
initiatives due to the establishment of centrally College of Surgeons (ACS), pursued a goal of
based registries, standardization when possible, evaluation and standardization. This culminated
simulation training, and video critiquing of in the formation of the Joint Commission on
surgeries we perform. To many of us, QIPS is Accreditation, for which Codman is now given
viewed as top-down, where institutions promote much of the credit. Most recently, the ACS has
measures that we often consider nuisances and focused on QIPS by developing the National
hence disregard, that are focused on issues such Surgical Quality Improvement Program
as hand hygiene, start times, checklists, line, (NSQIP), which has rapidly expanded beyond
surgical site and catheter-associated urinary tract North America [5]. NSQIP allows validated risk
infections, falls and pressure sores, and adjusted outcomes of surgical procedures to be
ventilator-associated pneumonia. It, therefore, compared between reporting hospitals. Because
becomes incumbent upon us to define reliable individual hospitals are often challenged when it
(the right) and meaningful metrics to make comes to analyzing their surgical outcomes, it is
QIPS initiatives relevant and to provide difficult to pinpoint areas of concern to improve
evidence that indeed we are delivering quality problem areas. NSQIP provides a tool to allow
care. hospitals to focus on areas where improvement is
19 Quality Improvement and Patient Safety in Surgery 195
necessary, where complication rates can be to a special cause, knowing that new knowledge
improved. is likely to be present within the organization and
Quality Improvement and Patient Safety that there may be different individual drivers to
(QIPS) has now become a center of focus in all enhance motivation. A key to Deming’s thinking
facets of medicine. It has become a mandated was that robust leaders who could engage
component of every hospital’s annual report and stakeholders were mandatory, and that unwanted
a target for a broad variety of stakeholders, variation must be reduced. Kilo and IHI adopted
including payers and consumers (patients). The the model for improvement in which an aim
Institute of Medicine (IOM) has proposed six identifies a change concept to be improved
principles for improvement: safe, effective, within a defined time period [11]. Through a
patient-centered, efficient, timely and equitable series of cycles called P-D-S-A (Plan, Do, Study,
care [6]. There is even a structure for publishing, Act), a concept for change is addressed with an
the SQUIRE guidelines, as the increasing num- action plan, and an analysis of the effect of that
ber of medical journals either entirely devoted, or action is then performed to allow further subse-
at least partially devoted to this subject, publish quent opportunities. With each ensuing PDSA, or
new contributions [7]. change cycle, small adjustments can be made
with hopes of making escalating gains. PDSA
reiterates 3 fundamental questions in each cycle:
What is Quality Improvement? 1. What are we trying to accomplish? 2. How
will we know that a change is an improvement?
Although there are many interpretations of what 3. What changes can we make that will result in
QI is, Batalden and Davidoff have simplistically an improvement? Thus, through CPI, a provider,
defined it as ‘… combined and unceasing efforts organization, or system is more likely to improve
of everyone-healthcare professionals, patients on an ongoing basis. Other tools such as
and their families, researchers, payers, planners Lean/Six Sigma may be used in furthering CPI.
and educators-to make the changes that will lead By reducing waste, one is increasing value for
to better patient outcomes (health), better system potential stakeholders [12].
performance (care) and better professional
development (learning)’ [8]. Berwick and col-
leagues at the Institute for Healthcare Improve- Is QI Research?
ment (IHI) developed the triple aim of quality
healthcare as that of improving the experience of Both QI and traditional research are based on
care, improving the health of populations, and data. Although it may be folklore as to the
reducing per capita costs of healthcare [9]. originator of the statement, one of the most
Many of the current concepts of continuous famous quotations attributed to Deming (and for
process improvement (CPI) are based on the that matter others) is ‘In God we trust, all others
work of W. Edwards Deming, his ‘system of must bring data.’ Simplistically, research
profound knowledge’ and his ‘fourteen points of attempts to provide new knowledge while QI
management’ [10]. In QI, the four keys of the takes existing knowledge and seeks improve-
system of profound knowledge: appreciating the ment. Not that QI lacks rigor, but primarily it is
system, understanding variation, a theory of based on rapid cycle tests of change (PDSA).
knowledge (epistemology) and understanding Although statistical methodology can be used in
human behaviour/psychology, are pivotal in QI, data are best presented visually in real time
instituting sustained positive changes. Simplisti- using run charts or when control limits are
cally, this means viewing an organizational sys- required, Shewart (Control) charts. Because of
tem as a series of interrelated processes not silos the confusion and inconsistencies regarding the
where they share a common goal, understanding publication of QI Material, the SQUIRE guide-
that variation may be either due to a common or lines were introduced in 2008.
196 M. A. Koyle and J. H. Hannick
Unique to QI reporting in healthcare are the related to components of care and fate upon
SQUIRE guidelines (Standards for QUality recovery. The Institute of Medicine further pro-
Reporting Excellence) [7]. SQUIRE consists of a moted QIPS in their 2000 and 2001 publications
checklist composed of 19 items/4 categories, that To Err is Human and Crossing the Quality
authors should consider when submitting papers Chasm [5, 16]. The focus of these reports was
describing new QIPs and value knowledge in that systems need to be improved knowing that
healthcare. The original descriptions of SQUIRE human fallibility and error are innate. Thus,
stressed that these checklist items are common to much harm within healthcare was preventable.
scientific reporting but with modifications that With the escalation of healthcare costs, preven-
make them more applicable to the unique aspects tion of costly errors, while reducing the onus on
of medical QI work. It was acknowledged that it human blame, has become a goal as the road
was important to present a system such as this as towards HROs is travelled.
many peer-reviewed journals and stakeholders It has been estimated that approximately half
were unfamiliar with the design, implementation of all hospital complications occur within the
and reporting of QI projects [13]. In no way are operating room itself [17]. Surgery is unique, as
they meant to be exclusive of other guidelines it demands the combination of one’s surgical
and at times may be synergistic. skill, knowledge, and the judgment to apply the
right operation to the right patient at the right
time. Addressing the fact that surgery is a team
Discussion sport exacerbates this daunting challenge.
Although an operating room is as inanimate as a
Healthcare is a dynamic, complex entity that is cockpit or an aircraft carrier, the variability in the
unpredictable at best. Historically QIPS in supporting personnel (resident, fellow nursing,
healthcare has been driven by external forces anesthesiologist) are not always as reliably
(National Health Services (NHS), Centers for interchangeable as those members of the airline
Medicare and Medicaid Services (CMS), Joint crew or naval vessel. Although it is unlikely that
Commission on Accreditation of Hospital Orga- the surgeon him/herself will change during an
nizations (JCAHO). As the stakes and emphasis operation, commonly the other team players will.
in healthcare change and costs soar, many QIPS If anything the value of standardization and
initiatives are now central to organizations checklists should not be underscored. Commu-
(ACS), hospitals (Virginia Mason), and providers nication failures have been demonstrated to occur
(Intermountain Healthcare, Kaiser-Permanente). every 7–8 min during an operation [18]. Thus,
Healthcare has attempted to emulate other highly the longer the operation, with increasing likeli-
reliable organizations (HROs) such as aviation hood of greater personnel turnover, the more the
and the nuclear agencies, where safety is priori- opportunity for error increases. Surgical check-
tized and reduced errors are a priority. As value lists are now a decade old and have been
becomes more important, future research and implemented to reduce preventable surgically
leaders in the field will need to promote our related morbidity (and mortality) [19]. Concep-
knowledge of QIPS. In 1966, Donabeian identi- tually, this makes sense to assure that indeed it is
fied three approaches in the assessment of qual- the right patient who is consented for that
ity: structure, process and outcome [14]. To this scheduled operation, where laterality is con-
day, it remains one of the most frequently cited firmed as are issues of allergies and antibiotic
articles in healthcare and is a solid pillar in the administration. Urbach, using a large Canadian
evaluation of QI in healthcare [15]. In this sem- provincial database, has appropriately demon-
inal paper, structure referred not to the settings strated that there are indeed potential challenges
but to the qualifications of the providers in those with checklists when there is non-compliance
settings and administrative systems within. Pro- and inconsistencies with implementation
cess and outcome are more straightforward and [20, 21]. Although checking boxes per se may
19 Quality Improvement and Patient Safety in Surgery 197
not impact the occurrence of adverse events, patients in optimizing patient care. Sackett’s
improved communication amongst operating description of Evidence-Based Medicine (EBM)
room personnel, regardless of ‘status’, is to be of CPG’s, ‘The conscientious, explicit, and
encouraged. judicious use of current best evidence in making
Medicine as a whole, but surgery in particular, decisions about the care of individual patients’, is
has housed a culture of shame and blame where most apt [24]. Problematically, is that not all
an individual, rather than a system has been held CPG’s have been based on high-quality evidence
responsible for a given outcome (especially a which potentially can lead to less than ideal
complication). Lucien Leape, a well known results, and thus has led to scepticism regarding
pediatric surgeon who is now at the Harvard their implementation. CPGs should take volu-
School for Public Health, testified before the US minous or complex data that is not at our fin-
Congress in its hearings on health care quality gertips, and make such data more manageable for
improvement that, ‘The single greatest impedi- that individual patient, who is cared for by that
ment to error prevention in medicine is that we individual physician, at that given time. Hence in
punish people for making mistakes.’ As a result no way should CPG’s be construed as cookbook
of medical error, there is often more than just the medicine or the only option for that specific
patient as a victim, as it is easier to blame an patient or encounter. Sadly, defensive medicine
individual than a system. The IOM’s To Err is and ‘over testing’ are realities of the practice of
Human [16] focuses on the fundamental issue of medicine. This is costly to the system as a whole
human fallibility. In QIPS, James Reason’s Swiss and impacts the utility of CPGs and other pro-
Cheese Model has often been used as an analogy gressive initiatives such as ‘Choosing Wisely’
[22]. Reason uses several layers of Swiss cheese [25]. In QIPS, various tools have been developed
with latent and active holes in this model. In an to construct and to analyze CPGs. GRADE
ideal system, errors are trapped between the (Grades of Recommendation, Assessment,
layers, as the holes do not align. However, given Development and Evaluation) has been devel-
the worst-case scenario, the holes of each layer oped as a tool for constructing CPGs that
align, and this creates an opportunity for an enhances physician evaluation of the quality of
untoward event to occur. medical evidence so that recommendations can
Recognizing that human factors play an be applied most appropriately in patient care
important role in outcome, it is obvious that an [26]. It replaces letter recommendations regard-
understanding of other elements that either sup- ing quality with ‘strong’ or ‘weak/conditional’,
port or hinder an individual’s performance. As a ‘for’ or ‘against’ and avoids ‘expert opinion’ in
core element of QIPS, Human Factors Engi- the hierarchy of evidence pyramid, while recog-
neering, often called ergonomic or human nizing that even the highest evidence standard,
engineering uses physical and psychological SR, can be flawed. Recommendations using
principles to affect processes and systems, and GRADE ideally should be actionable and
promote a reduction in errors [23]. In the oper- avoid restating obvious facts and vagaries.
ating room setting, it has especially been appli- The AGREE II tool (Appraisal of Guidelines for
cable to medical devices, medication safety, and Research and Evaluation) has gained acceptance
Information Technology. as a standardized technique for the evaluation
Clinical Practice Guidelines (CPGs) have and comparison of existing CPGs [27]. It con-
increasingly become part of our daily practices in sists of 6 quality domains containing 23 items
an attempt to enhance the quality of care. To be used to quantify each CPG. It requires several
relevant, CPG’s should be based on systematic appraisers who independently score each domain
reviews (SR) of current best evidence regarding a of each guideline. Importantly each appraiser
given topic, that has been addressed by an also assesses the overall quality of the CPGs and
unbiased, multidisciplinary group of experts, so makes a recommendation as to whether that
that they can use to assist practitioners and their guideline should be used. CPGs should be used
198 M. A. Koyle and J. H. Hannick
judiciously, remembering that medical judgment change(s) to allow for opportunities to improve
and experience remain essential components of further [29]. The environment of the operating
our fostering shared decision making and optimal room is complex and each patient is unique. As
care. CPGs will increasingly be available to us. such the opportunity for error is heightened, as
As such, practitioners must understand how they are opportunities to reduce, or ideally, prevent
are developed and what their qualities and their such occurrences.
relevance are [28]. Electronic Health Records
likely will improve their availability, as well as
be an adjunct in improving compliance with their References
utilization.
1. Kim JK, Chua ME, Ming JM, Dos Santos J,
Zani-Ruttenstock E, Marson A, et al. A critical
review of recent clinical practice guidelines on
Resolution of the Clinical Scenario management of cryptorchidism. J Pediatr Surg.
2017. [Epub ahead of print].
After reading this single aforementioned article, 2. Institute of Medicine, Committee on Redesigning
the surgeon found that five major guidelines all Health Insurance Performance Measures, Payment,
and Performance Improvement Programs. Rewarding
recommended that referral of an infant with provider performance: aligning incentives in medicare.
presumed cryptorchidism should occur by Washington, DC: National Academies Press; 2016.
6 months of age [1]. The guidelines agreed that 3. Berwick DMEA. Codman and the rhetoric of battle: a
palpation under anesthesia followed by laparo- commentary. Milbank Q. 1989;67(2):262–7.
4. Donabedian A. The end results of health care: Ernest
scopic evaluation if the testis remained Codman’s contribution to quality assessment and
non-palpable should be the next step of care. All beyond. Milbank Q. 1989;67(2):233–56.
of these modern guidelines were in concordance, 5. Birkmeyer JD, Shahian DM, Dimick JB,
with rare exceptions, supporting that ultrasound Finlayson SR, Flum DR, Ko CY, et al. Blueprint
for a new American College of Surgeons: National
and other diagnostic imaging were not recom- Surgical Quality Improvement Program. J Am Coll
mended as an adjuvant is assessment, as imaging Surg. 2008;207(5):777–82.
rarely improve diagnostic accuracy, and does not 6. Institute of Medicine, Committee on Quality of
impact management. Given this information in Health Care in America. Crossing the quality chasm:
a new health system for the 21st century. Washing-
hand, the surgeon suggests that as the boy, who ton, DC: National Academies Press; 2001.
is approaching 6 months of age, should be 7. Davidoff F, Batalden P, Stevens D, Ogrinc G,
scheduled for exam under anesthesia and Mooney S, SQUIRE Development Group. Publica-
laparoscopy with potential orchidopexy as indi- tion guidelines for quality improvement in health
care: evolution of the SQUIRE project. Qual Saf
cated, under the care of a pediatric urologist or Health Care. 2008;17(Suppl 1):3–9.
pediatric surgeon, who performs cases such as 8. Batalden P, Davidoff F. What is “quality improve-
this on a routine basis. Therefore, the ultrasound ment” and how can it transform healthcare? Qual Saf
is deemed unnecessary and is cancelled. Health Care. 2007;16:2–3.
9. Berwick DM, Nolan TW, Whittington J. The triple
aim: care, health, and cost. Health Aff. 2008;27(3):
759–69.
Conclusion 10. Walton M. The Deming management method. New
York: Perigee Books; 1986.
11. Kilo CM. A framework for collaborative improve-
QIPS is more than checklists and evidence-based ment: lessons from the Institute of Healthcare
guidelines. There are many tools that can be used Improvement’s breakthrough series. Qual Manag
in QI that are applicable to promote a culture of Health Care. 1998;6(4):1–13.
change and safety, eliminate silos, understand the 12. Cima RR, Brown MJ, Hebl JR, Moore R, Rogers JC,
Kollengode A, et al. Use of lean and six sigma
basic problem(s) at hand, test changes, with methodology to improve operating room efficiency in
continuous study of outcomes and performance, a high volume tertiary-care academic medical center.
and to report pertinent findings to sustain J Am Coll Surg. 2011;213(1):83–92.
19 Quality Improvement and Patient Safety in Surgery 199
13. Davidoff F, Batalden P. Toward stronger evidence on 22. Reason J. Human error models and management. Br
quality improvement. Draft publication guidelines: Med J. 2000;320:768–70.
the beginning of a consensus project. Qual Saf Health 23. Vosper H, Hignett S, Bowie P. Twelve steps for
Care. 2005;14:319–25. embedding human factors and ergonomics principles
14. Donabedian A. Evaluating quality of medical care. in healthcare education. Med Teach. 2018;40(4):
Milbank Q. 1966;44:166–206. 357–63.
15. Ayanian JZ, Markel H. Donabedian’s lasting frame- 24. Sackett DL, Rosenberg WM, Gray JA, Haynes RB,
work for health care quality. New Eng J Med. Richardson WS. Evidence based medicine: what it is
2016;375:205–7. and what it isn’t. BMJ. 1996;312:71–2.
16. Kohn LT, Corrigan J, Donaldson MS. To err is 25. Welk B, Winick-Ng J, McClure JA, Lorenzo AJ,
human: building a safer health system. Washington, Kulkarni G, Ordon M. The impact of the choosing
DC: National Academies Press; 2000. wisely campaign in urology. Urology. 2018 Mar 20.
17. Gawande AA, Thomas EJ, Zinner MJ, Brennan TA. [Epub ahead of print].
The incidence and nature of surgical adverse events 26. Guyatt GH, Oxman AD, Vist G, Kunz R, Falck-Ytter
in Colorado and Utah in 1992. Surgery. 1999;126(1): Y, Alonso-Coello P, et al. GRADE: an emerging
66–75. consensus on rating quality of evidence and strength
18. Hu YY, Arriaga AF, Peyre SE, Corso KA, Roth EM, of recommendations. BMJ. 2008;336:924–6.
Greenberg CC. Deconstructing intraoperative com- 27. Brouwers MC, Kho ME, Browman GP, Burgers JS,
munication failures. J Surg Res. 2012;177(1):37–42. Cluzeau F, Fedar G, et al. AGREE II: advancing
19. Haynes AB, Weiser TG, Berry WR, Lipsitz SR, guideline development, reporting, and evaluation in
Breizat AH, Dellinger EP, et al. A surgical safety health care. CMAJ. 2010;182:839–42.
checklist to reduce morbidity and mortality in a global 28. Koyle MA. Truth and consequences—the issue with
population. N Engl J Med. 2009;360(5):491–9. standards and guidelines. Can Urol Assoc J. 2018;
20. Urbach DR, Govindarajan A, Saskin R, Wilton AS, 12(4):79–80.
Baxter NN. Introduction of surgical safety checklists 29. Koyle MA, Koyle LCC, Baker, GR. Quality
in Ontario, Canada. N Engl J Med. 2014;370(11): improvement and patient safety: reality and respon-
1029–38. sibility from Codman to today. J Pediatr Urol.
21. Leape LL. The checklist conundrum. N Engl J Med. 2018:14(1):16–9.
2014;370(11):1063–4.
Diagnostic Studies in Surgery
20
Stuart Archibald, Jessica Murphy, Achilles Thoma
and Charles H. Goldsmith
of those nodules which were not resected (Table 1 in the Horvath et al. [7] article).
remains uncertain, thus those studies do not have However, in the total thyroidectomy specimens
a consistently applied ‘gold standard’ against done for dominant nodules classified as TIRADS
which TIRADS was compared. In the recent 4 or 5, other non-dominant nodules were studied
study by Horvath et al. [7], consecutive eligible with histopathology, and this includes some
patients underwent US for nodule pattern and TIRADS 2 and 3 lesions, even though they may
TIRADS categorization of the dominant nodule, not have had FNAB. In this way, some TIRADS
as well as of any additional nodules which were 2 and 3 incidentally found nodules were studied
considered likely to be identifiable on gross with pathology. The radiologists who performed
examination of the surgical specimen. A subset the US and FNAB test were all highly experi-
of US images were used to determine inter- enced. A single experienced pathologist who
observer agreement. US-guided FNAB’s of performed the surgical pathology studies was
nodules in TIRADS categories 4 (including 4A, blinded to the US and TIRADS results. In total,
4B and 4C) and 5 were then carried out. FNAB 210 patients, and 520 nodules were studied,
was not done on TIRADS 2 and 3 because the Fig. 20.1 outlines the study design by Horvath
likelihood of cancer is 0 and <5%, respectively et al. [7].
Fig. 20.1 Flow chart of study design. Created using information from Horvath et al. [7]
204 S. Archibald et al.
Test Positive for Cancer True Positives (TP)* False Positives (FP)
a** b
* ‘TP’, ‘FP’, ‘FN’ and ‘TN’ are used to designate the cell
** ‘a’, ‘b’, ‘c’ and ‘d’ are used to designate the numerical values in a cell
LR + is calculated as : LR +
LR also can be applied to a single level out of
Probability of a positive test in those with cancer
¼ a multilevel data set, avoiding the need to col-
Probability of a positive test in those without cancer
sensitivity lapse the data into a 2 2 table which results in
¼
ð1 specificityÞ a loss of information. The data from Horvath
et al. [7] is shown for each level in Table 20.2.
LRl− answers the question: ‘What is the When an LR is 1.0, the pre-test probability
probability that this negative test will be found in (prevalence), and post-test probability are
a patient with the disease compared to the unchanged. When an LR is greater than 1.0, then
probability that this negative test will be found in the post-test probability of the diagnosis has been
a patient without the disease?’ increased from the pre-test probability (preva-
lence), and conversely when an LR is less than
LR is calculated as : 1.0, then the post-chance probability of the
Probability of a negative test in those with cancer diagnosis has been decreased from the pre-test
¼
probability of a negative test in those without cancer probability (prevalence). Generally, LRs greater
ð1 sensitivityÞ than 10 indicate large and likely conclusive
¼ :
specificity
changes in the pre-test probabilities and ‘rule in’
the condition of interest, whereas LRs less than
The above is applied to the results from 0.1 indicate large and likely conclusive changes
Horvath et al. [7], as follows: in the pre-test probabilities and ‘rule out’ the
206 S. Archibald et al.
Table 20.2 Results as reported by Horvath et al. [7] showing LR for each category
Test result (TIRADS Cancer No cancer Totals LR
categories) (shown by surgical pathology) (shown by surgical pathology)
5 86 1 87 72.7
4C 135 13 148 8.8
4B 49 29 78 1.4
4A 1 16 17 0.05
3 1 55 56 0.02
2 0 116 116 0.00
Totals 272 230 502
condition of interest. LRs between 2 and 5, or the known disease categories to get test perfor-
between 0.5 and 0.2 indicate modest but possibly mance measures. The term ‘sensitivity’ describes
important changes in the pre-test probabilities. the proportion of diseased individuals in the
LRs between 1 and 2, or between 0.5 and 1.0 population that are classified as having the dis-
indicate small and not usually conclusive changes ease using the test. In other words, sensitivity
in pre-test probabilities [8]. The additional describes the probability that the test is able to
advantages of LRs are that the value is not correctly diagnose a disease. Sensitivity is cal-
influenced by prevalence of the condition. LRs culated using the formula below:
can be used sequentially in several tests, which
a
follows the pattern of use in clinical practice. Sensitivity ¼
Thus, the pre-test probability and LR can give the ða þ cÞ
post-test probability of the first test, which serves
Specificity is the term used to describe the
as the pre-test probability with the LR to give the
proportion of those people without the disease
post-test probability on the second test, and so on.
who were correctly identified by a negative test
This can be done using mathematical formulae for
as non-diseased. Specificity is calculated using
each test, or more simply by applying a nomo-
the formula below:
gram to visually give the results [9]. This nomo-
gram, shown in Fig. 20.2, makes it easy to use d
cascading tests in this manner [8]. By placing a Specificity = :
ðb þ d Þ
ruler on the pre-test probability and aligning it
with the calculated LR we can obtain on the far Using results from the Horvath et al. [7] study
right of the nomogram the post-test probability of (Table 20.3), sensitivity and specificity would be
the disease. As TIRADS has multiple categories calculated as:
various cut-off points may be selected for deter-
mining sensitivity, specificity, positive predictive 270 270
Sensitivity ¼ ¼
value, negative predictive value, and likelihood ð270 þ 2Þ 272
ratios. The cut-off point shown in Table 20.3 in ¼ 0:993 or 99:3%
the Horvath et al. [7] article is at 4B; thus TIR-
ADS 2, 3, and 4A are considered ‘negative’ for and
cancer, and TIRADS 4B, 4C and 5 are considered
‘positive’ for cancer. 187 187
Specificity ¼ ¼
Sensitivity and specificity are commonly sta- ð187 þ 43Þ 230
ted measures of a diagnostic test, even though ¼ 0:813 or 81:3%
they reverse the process used clinically by using
20 Diagnostic Studies in Surgery 207
Table 20.3 Results as reported by Horvath et al. [7] grouping TIRADS categories. Pooled LR = 5.31
been made. The first suggestion was to same, as the probabilities quoted by Horvath
de-emphasize the criteria designed to character- et al. [7]. The use of different criteria for the
ize specific benign conditions because with most levels is a weakness of the present status of
thyroid nodules, the main goal is to identify those TIRADS. In any TIRADS system, the expertise
nodules which are malignant. The second was to of the US operator is important. The US opera-
reduce the number of ultrasound characteristics tors in published studies are usually the most
of the nodule to the five considered relevant for expert available, and this level of expertise may
determining risk of cancer: composition (solid, affect the generalizability of the results.
cystic, mixed), echogenicity (hyper, hypo, Once an US, using the TIRADS categories,
marked hypo), adverse margins (yes/no), adverse has been found to warrant FNAB (generally
calcifications (yes/no), and adverse shape considered for 4B and higher), then the patient is
(yes/no). These suggested modifications were usually sent for an US-guided FNAB, as was
used to define levels 4A, 4B, 4C and 5, where 4A done in the 2017 paper by Horvath et al. [7], and
has one of the US characteristics present, 4B has the results of this test are reported using the
two, and 4C has three or four [11]. The risk of commonly accepted Bethesda System for
malignancy associated with each category is Reporting Thyroid Cytopathology [13]. The
shown in Table 20.4. The effect of these modi- Bethesda System uses ‘diagnostic categories’
fications has been to simplify the TIRADS sys- designated by Roman numerals, rather than
tem and so making it easier to apply. An Arabic numerals as with TIRADS levels.
additional result is that ‘TIRADS’ may mean A meta-analysis of the Bethesda System by
slightly different things depending on whether Bongiovanni et al. [14] presents data that allow
one is using Horvath’s model [12], or one of the LRs to be calculated for the various Diagnostic
others. Categories. Thus, the diagnostic tests of US with
For instance, using the modified system by TIRADS Levels can be followed by FNAB with
Kwak et al. [11], the probabilities of malignancy Bethesda Diagnostic Categories, and by using
for each level are similar to, but not exactly the LRs in each test in the sequence, the probabilities
20 Diagnostic Studies in Surgery 209
Table 20.5 Bethesda diagnostic categories showing probabilities of cancer and likelihood ratios
Diagnostic categorya Risk of cancer (%) Likelihood ratio (LR)
VI 99.0 592
V 75.0 18.2
IV 30.0 2.6
III 14.0 1.1
Created using information from Bongiovanni et al. [14]
a
Diagnostic categories I and II are not shown as the risk of cancer is <5% [14]. None of these LRs can be used to ‘rule
out’, but V and VI clearly exceed 10, so could be used to ‘rule in’ cancer
performed. This is unavoidable for ethical rea- and the Horvath et al. [7] paper, the data give a
sons because to do otherwise would require that pooled LR of 5.31 (Table 20.3). Consider this
all patients have the reference standard per- option 1, which accepts that only some of the
formed knowing that those with TIRADS Levels TIRADS 2 and 3 (and likely a minority, although
2 and 3 have a very low probability of malig- we do not know) are included, thus not fully eval-
nancy. A sensitivity analysis using confidence uating the TIRADS system. What if other ways of
intervals could be used for those Levels. studying TIRADS are considered? A full evalua-
tion of TIRADS (consider this Option 2) would
Secondary guide #1(b): Are methods for per- require all of the TIRADS 2 and 3 be studied in the
forming the test described in sufficient detail to same way as TIRADS 4 and 5, operating on those
permit replication? patients knowing that the diagnosis of cancer is very
The performance of all of the tests, TIRADS, and low; this is not ethical. TIRADS 4 and 5 alone are
FNAB, are well-described with examples of each considered in the data by Horvath et al. [7] (Option
of the ten US levels in the case of TIRADS, the 3), the LRs are different as seen in Table 20.6. 4C is
equipment (US machine) for TIRADS, and lab- less likely to change the pre-test probability than
oratory process for FNAB to get the Bethesda was the case in option 1, because LR is now 2.3
Classification specified [7]. instead of 8.8, and 4B is reduced to 0.37 from 1.4.
Table 20.6 Calculated LR’s for each category of TIRADS 4 and 5 when TIRADS 2 and 3 are discarded (Option 3)
TIRADS level Cancer No cancer Totals LR
5 86 1 87 18.7
4C 135 13 148 2.3
4B 49 29 78 0.37
4A 1 16 17 0.01
271 59 330
Created with data from Horvath et al. [7]
212 S. Archibald et al.
be straightforward, unlike situations where nei- diagnostic test variables. Seeing that multiple
ther of these systems are currently used. diagnostic categories of Bethesda Classification
are reported is reassuring that the article is
3(b) Are the results applicable to my patient? comprehensive. Two of the categories, III and
We have assessed the validity of the Horvath IV, are shown here, with calculation of LRs
et al. article [7] and its results, and now want to (Tables 20.7 and 20.8).
know if it helps us in caring for the patient. The
pre-test probability of cancer is estimated at 3%,
and for TIRADS 4B application of the LR of LR ¼½c=ða þ cÞ=½d=ðb þ d Þ
5.31 gives a post-test probability of about 22% ¼ ð3=35Þ=ð101=119Þ¼ 0:1
(see Nomogram in Fig. 20.2). This is not enough
to exceed the patient’s threshold for surgery, This does not exceed 10, so it indicates a
although it would be high enough for most modest change in probabilities from pre-test to
surgeons. post-test.
Table 20.7 ThyroSeq version 3: performed on thyroid nodules subjected to FNAB, and read as ‘Bethesda III’
(follicular neoplasm or suspicious for a follicular neoplasm, probability of cancer 5–15%)
20 Diagnostic Studies in Surgery 213
Table 20.8 ThyroSeq version 3: Performed on thyroid nodules subjected to FNAB, and read as ‘Bethesdal IV’
(follicular neoplasm or suspicious for a follicular neoplasm, probability of cancer 15–3%)
Test result Cancer present Cancer not present Totals
Positive 34 16 50
Negative 1 42 43
35 58 93
in-hospital mortality after replacing the ascend- period of time. Furthermore, cohort studies can
ing aorta and aortic valve in a patient with be conducted prospectively—forward in time—
Marfan syndrome would need studies that report or retrospectively—data are collected on patients
outcomes in other Marfan patients. Such patient who received the intervention some time ago,
data are referred to as prognostic factors; clinical and we want to know their outcome [9]. Both
characteristics that are objectively measured and approaches have advantages and disadvantages.
can help predict patient outcomes [4]. Prospective designs are powerful for assessing
In the remainder of this chapter, we will focus incidence and investigating potential causes.
on identifying trustworthy prognostic informa- They measure the stage of disease prior to
tion and using the results in patient care. treatment, preventing the true effect from being
influenced by the knowledge of the outcome.
Most importantly, investigators using a
Study Designs for Prognostic Studies prospective cohort design can measure outcomes
more completely and accurately than a retro-
Before undertaking a literature search, it is spective design [10]. The disadvantages are high
important to understand what study designs best cost and inefficiency when studying rare out-
answer our clinical question. When determining comes. Alternatively, a retrospective study
the effectiveness of treatment, we consider ran- design is less costly and time-intensive. A retro-
domized controlled trials (RCTs) as high quality spective cohort study fundamentally has the
evidence and observational studies as lower same methodology as a prospective study, except
quality [5]. When determining prognosis, how- that it looks back in time. Subjects already had
ever, we might place more confidence in esti- their exposure to the variable of interest, had
mates of prognosis from observational studies baseline measurements recorded, and follow-up
than RCTs [6]. This is because RCTs include period completed. The main disadvantage of a
filters in their eligibility criteria (e.g., age retrospective study is that biases may affect the
restriction, comorbidity, drug intolerance, etc) selection of patients and recall of information.
that exclude patients relevant to the broader Further, we must use the data available, which
prognostic question of interest. Eligible patients may be incomplete, inaccurate, or measured in
may decline to participate in an RCT due to ways not suited to answer our question [11]. The
reasons related to their prognosis. One exception 2017 study by Goldstone et al. [12], which we
to this rule is large, simple, pragmatic trials with chose to help answer our patient’s questions, is a
broad eligibility criteria. For example, Stassano retrospective cohort study. The authors identified
et al. [7] randomized 310 patients to receive a patients who underwent surgical aortic valve
mechanical or bioprosthetic aortic valve. If our replacement from 1996 through 2013 and eval-
patient wanted to compare the effectiveness of uated their outcomes [12].
one valve type over the other, “Aortic valve Investigators can also use a “case-control”
replacement: a prospective randomized evalua- design to study prognosis. They evaluate patients
tion of mechanical versus biological valves in who have experienced an outcome of interest
patients ages 55–70 years” by Stassano et al. [7] (cases) and compare them to those without the
might be a good study to use. However, our outcome (controls) [13]. Investigators are then
patient wants to know his prognosis specifically able to assess the relative frequency of prognostic
after receiving a bioprosthetic valve. As such, a factors among both study groups. For example,
cohort of study is ideal in identifying associa- Sleder et al. [14] retrospectively evaluated
tions between our prognostic factors of interest socioeconomic and racial disparities among
and outcomes [8]. severe aortic stenosis (AS) patients receiving
In cohort studies (a type of observational transcatheter aortic valve replacement (TAVR) in
study), patients receive the intervention (i.e., they a case-control study. They compared 67 patients
are not randomized) and are observed over a with severe AS who underwent TAVR (cases) to
21 How to Assess a Prognostic Study 219
patients with severe AS who were not offered Table 21.1 User’s guide to surgical literature: guide to
TAVR (controls). They noted a statistically sig- an article about prognosis [16]
nificant income disparity between the “cases” I. Guide for validity (study methodology)
and “controls” [14]. By definition, case-control Step 1:
• Is the sample representative?
studies are retrospective and share the limitations – Were patients homogeneous with respect to their
of retrospective cohort studies. In addition, prognostic risk? Are they at a similar point in the
case-control studies provide relative odds of course of the disease?
outcomes, but not absolute risks. Case-control – If subgroups were identified, did investigators provide
estimates for clinically relevant subgroups and adjust
studies are particularly useful when the outcome for important prognostic factors?
of interest is rare or requires lengthy time to Step 2:
occur—beyond what would be feasible for a • Was follow-up sufficiently long and complete?
prospective study design [15]. • Were assessed outcomes objective and unbiased?
II. Understanding the results
Step 3: How likely are the outcomes to occur over time?
Step 4: How precise are the estimates of likelihood?
Literature Search III. Using the results to determine patient care
(applying the results to your patients)
Step 5: Can I apply the results to my practice or patient?
To identify relevant literature to our patient with
aortic stenosis, we entered relevant key words
into an established database of publications
(Please see Chap. 4: How to Perform a Literature study that deals with prognosis are shown in
Search). In our scenario, entering the keywords Table 21.1.
“biologic” AND “aortic valve” AND “replace-
ment”, along with activating filters for human,
full texts, and published within the last 2 years Appraising the Surgical Literature
identified 27 articles in PubMed. If an initial on Prognosis
search does not produce articles of interest, we
could employ other strategies such as querying Before applying the study results to our patient,
multiple databases (EMBASE and MEDLINE). we must critically appraise the Goldstone et al.
We can also review references of relevant pub- [12] publication. Key questions to ask are (1) Is
lications and recent textbooks (hand-searching) the sample representative of my patient? (2) Was
or reach out to field experts for guidance. Scan- the follow-up sufficiently long and complete?
ning through the most recent publications, we did (3) Were the assessed outcomes objective and
not identify any relevant RCTs or prospective unbiased? (4) How likely are the outcomes to
cohort study. Ten studies were excluded for occur over time? (5) How precise are the esti-
focusing on basic science, two were excluded mates of likelihood? (6) Can I apply the results to
because they were case studies, and four studies my practice or patient? Table 21.1 outlines these
excluded for evaluating outcomes with specific questions, and more, that are important when
repair types in comparison to valve replacement. interpreting and using prognostic studies [16].
The search revealed one promising article:
“Mechanical or Biological Prostheses for
Aortic-Valve and Mitral-Valve Replacement” by Is the Sample Representative?
Goldstone et al. [12]. This is a retrospective
cohort study that identified patients who under- Ideally, we would like to determine prognosis by
went surgical aortic valve replacement from 1996 studying an affected population from the onset of
through 2013 and evaluated their outcomes. This exposure to the end. However, we cannot study
study seems promising in answering our patient’s entire populations and must instead determine
questions. The steps in appraising a surgical the prognosis of a sample. As such, evaluating a
220 S. Gupta et al.
representative sample of the entire population is When evaluating the sample in the Goldstone
crucial for generalizable results [17]. If a sample et al. [12] study, we see that the authors obtained
population is systematically different from the all patient records from the California Office of
population of interest, it is not representative and Statewide Health Planning and Development
may be biased. Biased prognostic studies can (OSHPD) Patient Discharge, Emergency
result in over- or underestimating event rates. Department, and Ambulatory Surgery Center in
A common way to identify an unrepresentative the Patient Discharge, Emergency Department,
sample is to look for any systematic processes and Ambulatory Surgery Centre datasets [12].
that patients passed through before entering the This database includes all patients treated across
study, which may introduce bias; for example, California in different centers and regions [12].
patients with more complex or severe forms of The authors use the ICD-9-CM codes—objective
the disease are more likely to be referred for criteria—to define the inclusion of patients
tertiary care [3]. As a consequence, the prognosis undergoing, “primary aortic valve replacement,
in a tertiary center sample may differ from the or mitral valve with biologic prosthesis or
prognosis of the disease in general due to referral mechanical prosthesis”. They also state clear
bias. To determine whether a sample is a repre- exclusion criteria: not residing in California
sentative, look for a description of inclusion and during initial surgery, previous cardiac surgery,
exclusion criteria as well as a recruitment strat- multiple valve replacements, aortic valve repair,
egy. In addition, the authors should report mitral valve repair, and thoracic aortic surgery.
objective criteria for sample selection and disease They included a diverse group of patients and
diagnosis. Unclear or inadequate definition of stratified them by age. Mortality outcomes were
patients within a study increases the risk that the also analyzed in these strata and reported. We
sample will not be representative, and its results can confidently say that Goldstone et al. [12]
biased. This is achieved through precise and included a representative sample, with charac-
consistent inclusion criteria [18]. For instance, teristics similar to our patient’s.
we want studies reporting outcomes in a study
group with similar baseline characteristics as our
patient from the clinical scenario (63-year-old Was Follow-up Sufficiently Long
male with severe, symptomatic aortic stenosis). If and Complete?
a study included a diverse spectrum of patients
undergoing aortic valve replacement (a wide age Investigators should follow their patients for an
range, for example), a subgroup analysis of study adequate length of time to capture all
participants similar to our patients would suffice. patient-important outcomes. This is especially
In addition to similar disease severity, other important in determining long-term outcomes,
prognostic factors such as age, smoking history, where the risk of loss to follow-up is increased
diabetes, among others, should be considered. It [21]. A study reporting follow-up results of
is important to consider these prognostic factors patients at 5 years after surgery may be helpful,
in relation to each other to avoid making false but not adequate for our patient who is expected
conclusions. to live for another 10–15 years after his surgery.
In situations where a patient characteristic Losing participants to follow-up threatens the
impacts prognosis, a stratified analysis may be validity of prognostic studies and increases the
used. In a stratified analysis, participants are risk of bias as outcomes in those lost to follow-up
separated into groups (i.e., strata) by prognostic may differ from those who remain in the study
factors [19]. When a large number of variables [21]. For example, anticoagulation therapy is a
predict prognosis, stratified analyses are imprac- lifelong commitment for patients receiving a
tical and statistical methods, like regression, mechanical aortic valve [22]. In a study reporting
adjust for multiple variables, and determine the the survival of patients after valve replacement
strongest prognostic factors [20]. surgery, those lost to follow-up may not be
21 How to Assess a Prognostic Study 221
deceased but may have moved or followed up by reoperation. While mortality and reoperation are
another physician. The patients surviving and not objective outcomes, stroke and bleeding require
lost to follow-up are no longer representative of definitions. The authors appropriately define
the sample or even the entire study group. The stroke, bleeding, and reoperation by ICD-9 codes
threat to study validity is greatest when outcomes in table S4 of the supplementary appendix of
of interest are infrequent, and/or many partici- their article.
pants are lost to follow-up. Let us consider the
example of mechanical valve thrombosis. In a
study by Cannegeiter et al. [23], the incidence of How Likely are the Outcomes
mechanical valve thrombosis in patients receiv- to Occur Over Time?
ing appropriate anticoagulant therapy was 0.2 per
100 patient-years. If data for 10 participants were Results from prognostic studies help determine
lost to follow-up, and they all experienced valve the risk of events occurring over time [8]. If our
thrombosis, the true incidence of mechanical patient wanted to understand his 5-year survival
valve thrombosis would be higher than reported. after receiving a prosthetic valve, we would look
This carries important clinical implications, for a study that reports outcomes of patients
suggesting that valve thrombosis was underre- 5 years after their initial surgery. Goldstone et al.
ported due to loss of follow-up. One of the demonstrate that in patients between ages 55 and
weaknesses in the Goldstone et al. [12] study is 64 years, the probability of mortality at 5 years is
that while they report that the median follow-up approximately 0.1 [12]. If our patient asked
for patients receiving bioprosthetic aortic valve about changes in his chances of survival over
replacement was 5 years, they do not report time, we would show him a survival curve,
number lost to follow-up at 10 and 15 years. In which illustrates the incidence of events over
consequence, we cannot exclude the possibility time. To create a survival curve, events must be
of differential loss to follow-up—when the discrete (death, reoperation, hospitalizations),
dropout rate differs between treatment groups— and the precise time at which they occur must be
potentially introducing bias into the results. recorded [27]. According to Goldstone et al. [12],
Therefore, we must view their data on long-term probability of mortality at 15 years is approxi-
mortality with caution as the authors do not mately 0.4 [12], suggesting a fourfold increase in
report completeness of follow-up [24]. risk between 5 years and 15 years after the valve
replacement.
risk for stroke after the surgery lies between 12 Resolution of Clinical Scenario
and 23%. Put another way, if the study were
repeated 100 times, the incidence of stroke would Goldstone et al. [12] reported an unbiased
be between 12 and 23%, approximately 95 of assessment of risk in their cohort. The authors
those 100 times. In the Goldstone et al. [12] provided long-term mortality estimates in patients
paper, the authors report 15-year mortality as resembling our patient. There were some limita-
36.1% for patients 55–64 years of age, but do not tions related to study design and loss to follow-
report a confidence interval around this. As such, up, but we can apply the results to our patient and
we cannot estimate precision. However, they did answer his question about long-term prognosis.
report hazard ratios with mechanical valve Since the study included multiple surgeons’ out-
patients as a reference point (1.04, 95% CI comes, the long-term data reported can be com-
[0.91–1.18]) [12]. This implies that for patients pared with your surgical skills and capability.
between the ages of 55 and 64, risk of mortality Therefore, we can confidently tell our patient
at 15 years among patients with biological valves that his risk of mortality is approximately 36% at
or mechanical valves is not statistically signifi- 15 years after a bioprosthetic valve replacement.
cantly different, and likely not important clini- Based on the information provided to him, the
cally either, because the 95% CI includes 1.0. patient opted to undergo a surgical aortic valve
For more information on confidence intervals replacement with a biological valve.
please see Chap. 28.
References
Can I Apply the Results to my Practice
or Patient? 1. Moons KGM, Vergouwe Y, Grobbee DE, Alt-
man DG. Prognosis and prognostic research: what,
The authors should describe patient characteris- why, and how? BMJ. 2009;338:375.
tics and demographics in enough detail that we 2. Mack JW, Joffe S. Communicating about prognosis:
ethical responsibilities of pediatricians and parents.
can compare to our own patients and determine Pediatrics. 2014;133(Suppl 1):S24–30.
applicability. Further, surgical interventions 3. Randolph AG, Guyatt GH, Richardson WS. Progno-
change and evolve over time [30]. The surgical sis in the intensive care unit: finding accurate and
procedure and risks involved with aortic valve useful estimates for counseling patients. Crit Care
Med. 1998;26:767–72.
replacement in 1975 are not the same in 2018. 4. Italiano A. Prognostic or predictive? It’s time to get
With evolving valve types and cardiopulmonary back to definitions! J Clin Oncol. 2011;29:4718;
bypass technology, outcomes have improved author reply 4718–9.
dramatically [31]. Therefore, it is prudent that the 5. Barton S. Which clinical studies provide the best
evidence? The best RCT still trumps the best
study we review be current, and applicable to observational study. BMJ. 2000;321:255–6.
current practice. Goldstone et al. [12] recognized 6. Iorio A SF, Falavigna M. Use of GRADE for
the differences in surgical techniques and pref- assessment of evidence about prognosis: rating
erences over time. They reported patient enroll- confidence in estimates of event rates in broad
categories of patients. BMJ. 2015;350.
ment by study period; of the 3854 patients 7. Stassano P, Di Tommaso L, Monaco M, Iorio F,
receiving bioprosthetic aortic valve replacement, Pepino P, Spampinato N, et al. Aortic valve replace-
51.7% were enrolled from 2008 to 2013. 69.6% ment: a prospective randomized evaluation of
of the patients were male, and over 80% were mechanical versus biological valves in patients ages
55 to 70 years. J Am Coll Cardiol. 2009;54:1862–8.
white [12]. These characteristics of the study 8. Hansebout RR, Cornacchi SD, Haines T, Gold-
sample, along with stratification of outcomes by smith CH. How to use an article about prognosis.
age, make the study applicable to our patient. Can J Surg. 2009;52:328–36.
21 How to Assess a Prognostic Study 223
9. Gamble JM. An introduction to the fundamentals of 21. Akl EA, Briel M, You JJ, Sun X, Johnston BC,
cohort and case-control studies. Can J Hosp Pharm. Busse JW, et al. Potential impact on estimated
2014;67:366–72. treatment effects of information lost to follow-up in
10. Berger ML, Dreyer N, Anderson F, Towse A, randomised controlled trials (LOST-IT): systematic
Sedrakyan A, Normand SL. Prospective observa- review. BMJ. 2012;344:e2809.
tional studies to assess comparative effectiveness: the 22. Jaffer IH, Whitlock RP. A mechanical heart valve is
ISPOR good research practices task force report. the best choice. Heart Asia. 2016;8:62–4.
Value Health. 2012;15:217–30. 23. Cannegieter SC, Rosendaal FR, Wintzen AR, van der
11. Thiese MS. Observational and interventional study Meer FJ, Vandenbroucke JP, Briet E. Optimal oral
design types; an overview. Biochem Med (Zagreb). anticoagulant therapy in patients with mechanical
2014;24:199–210. heart valves. N Engl J Med. 1995;333:11–7.
12. Goldstone AB, Chiu P, Baiocchi M, Lingala B, 24. Carlson MD, Morrison RS. Study design, precision,
Patrick WL, Fischbein MP, et al. Mechanical or and validity in observational studies. J Palliat Med.
biologic prostheses for aortic-valve and mitral-valve 2009;12:77–82.
replacement. N Engl J Med. 2017;377:1847–57. 25. Walton MK, Powers JH 3rd, Hobart J, Patric D,
13. Mihailovic A, Bell CM, Urbach DR. Users’ guide to Marquis P, Vamvakas S, et al. Clinical outcome
the surgical literature. Case-control studies in surgi- assessments: conceptual foundation-report of the
cal journals. Can J Surg. 2005;48:148–51. ISPOR clinical outcomes assessment—emerging
14. Sleder A, Tackett S, Cerasale M, Mittal C, Isseh I, good practices for outcomes research task force.
Radjef R, et al. Socioeconomic and racial disparities: Value Health. 2015;18:741–52.
a case-control study of patients receiving tran- 26. Russell SD, Saval MA, Robbins JL, Ellestad MH,
scatheter aortic valve replacement for severe aortic Gottlieb SS, Handberg EM, et al. New York heart
stenosis. J Racial Ethn Health Disparities. association functional class predicts exercise param-
2017;4:1189–94. eters in the current era. Am Heart J. 2009;158:S24–
15. Song JW, Chung KC. Observational studies: cohort 30.
and case-control studies. Plast Reconstr Surg. 27. Rich JT, Neely JG, Paniello RC, Voelker CC,
2010;126:2234–42. Nussenbaum B, Wang EW. A practical guide to
16. Hansebout RR, Cornacchi SD, Haines T, Gold- understanding Kaplan-Meier curves. Otolaryngol
smith CH. User’s guide to the surgical literature: how Head Neck Surg. 2010;143:331–6.
to use an article about prognosis. Can J Surg. 28. du Prel JB, Hommel G, Rohrig B, Blettner M.
2009;52(4):328–36. Confidence interval or p-value?: part 4 of a series on
17. Banerjee A, Chaudhury S. Statistics without tears: evaluation of scientific publications. Dtsch Arztebl
populations and samples. Ind Psychia- Int. 2009;106:335–9.
try J. 2010;19:60–5. 29. Messe SR, Acker MA, Kasner SE, et al. Stroke after
18. Randolph AG CD, Guyatt G. Prognosis. In: Guyatt G, aortic valve surgery: results from a prospective
Rennie D, Meade MO, Cook DJ, editors. Users’ cohort. Circulation. 2014;129:2253–61.
guides to the medical literature: a manual for 30. Gawande A. Two hundred years of surgery. N Engl J
evidence-based clinical practice. New York, NY: Med. 2012;366:1716–23.
McGraw-Hill, 2018. 31. Maganti M, Rao V, Brister S, Ivanov J. Decreasing
19. Farrokhyar F, Bajammal S, Kahnamoui K, Bhan- mortality for coronary artery bypass surgery in
dari M. Practical tips for surgical research. Ensuring octogenarians. Can J Cardiol. 2009;25:e32–5.
balanced groups in surgical trials. Can J Surg.
2010;53:418–23.
20. Christenfeld NJ, Sloan RP, Carroll D, Greenland S.
Risk factors, confounding, and the illusion of statis-
tical control. Psychosom Med. 2004;66:868–75.
Decision Analysis and Surgery
22
Gloria M. Rockwell and Jessica Murphy
A junior plastic surgery resident assesses a Using PubMed, the residents performed a literature
50-year-old male patient who presents to the search (see Chap. 4) using the following search
clinic with worsening paresthesia in the ulnar terms: “Ulnar Nerve” AND “Surgical Treatment”
nerve distribution of the hand without important and “Systematic Review”, and a filter of years
weakness. The patient was seen by neurology (2015–2018); this search yielded over 300 articles.
three months previously and confirmed to have To make the task manageable the chief resident
moderate ulnar neuropathy at the elbow, advised suggested they narrow the search strategy and
to modify activity and wear an extension splint at consider using an article with a decision analytic
the elbow. He has done this but without any approach. They keep the same criteria for year, but
improvement in his symptoms. The junior resi- change their search terms to “Ulnar Nerve” AND
dent discusses the case with the chief resident, “Surgical Treatment” AND “Decision Analysis” ;
anticipating that a surgical intervention would be this search yields 10 articles (Appendix 1). The
required. The chief resident explains the patho- residents screen the titles and abstracts of the arti-
physiology of ulnar neuropathy at the elbow, and cles; three of the articles (Article numbers 5, 7, and
the four different types of surgery they have seen 8) have the term decision analysis in the title. Of
performed for this problem. The junior resident those three articles, one is in children (Article 5) and
asks the chief resident how to decide which is therefore not appropriate. The second article
technique is the best to perform for this patient, (Article 7) was an individual decision analysis. The
and in general. The chief resident is unsure and third article (Article 8) by Brauer and Graham [1]
both residents decide to do a literature search. was a proper decision analysis using evidence from
the literature, and the one likely to provide the
answer for them.
G. M. Rockwell (&)
Department of Surgery, Division of Plastic Surgery,
Background on Decision Analysis
University of Ottawa, Ottawa, ON, Canada in Surgery
e-mail: grockwell@toh.ca
J. Murphy Clinical decision-making involves weighing the
Department of Surgery, Division of Plastic Surgery, risks and benefits associated with the various
McMaster University, Hamilton, ON, Canada treatment alternatives available for the
e-mail: murphj11@mcmaster.ca
Fig. 22.1 Parts of a decision tree. Created using information from Richardson and Detsky [9]
were also included as chance nodes in the deci- further economic evaluations such as
sion tree. Figure 22.2 demonstrates a pictorial cost-effectiveness analyses (see Chap. 23).
schematic of the decision tree for the ulnar nerve When the decision tree is folded back on itself
at the elbow treatment surgical treatment the expected outcomes (utility, cost, QALY, etc.)
modalities. The four surgical choices included in can be calculated for each decision pathway by
the decision analysis are shown in Table 22.1. multiplying the probability of each branch in the
The value of the given outcomes can be tree with the final outcome for that pathway [4].
expressed in terms of utilities. The techniques to
measure utilities for different health states can
involve time trade-off, standard gamble, visual The Outcomes
analog scales measured directly from experts and
patients, or indirectly interpreted from similar All clinically relevant outcomes for the available
clinical scenarios with previously published management options must be identified, analyzed,
health utilities [4]. and incorporated into the decision tree for the tree
Disutility represents a transient health state to be an accurate representation of a clinical sce-
that can downgrade quality of life temporarily nario. Uncertainties (confidence intervals or other
and, for surgical interventions, could involve intervals) are identified during the literature sys-
peri-operative discomforts, inconvenience from tematic review. The sensitivity analysis of the tree
hospitalization and immobilization. Disutilities is repeated by varying either one variable or mul-
can be measured for short-term complications tiple variables, throughout their plausible inter-
occurring in each health state. Disutilities are vals, to determine the robustness of the decisions
subtracted from the utilities of that health state to derived from the decision analysis [6]. Conclu-
give an overall utility value for each specific sions are not as strong if they change during the
outcome [4]. Permanent undesirable conse- sensitivity analysis within the plausible interval of
quences decrease utility associated with a health a variable. Extreme values can be used in the
state and can be converted to Quality-Adjusted sensitivity analysis to debug the decision tree and
Life Years (QALY) for comparison to other help refine the design for obvious areas that give
health states, interventions, and even used in illogical results [4, 5, 14, 15].
228 G. M. Rockwell and J. Murphy
BO∆ NA
0.47
C
0.27 GO 0.930
0.53
Medial 29.791
0.961
Epicondylectomy BO∆ NA years
0.45
NC
0.73 GO 0.970
Cubital
Tunnel 0.55
Syndrome BO∆ NA
0.38
C
0.07 0.957
GO
Subcutaneous 0.62 30.039
0.969
TransposiƟon BO∆ NA years
0.38
NC
0.93 GO 0.972
0.62
BO 0.950
0.10
C
0.09 0.950
GO
Submuscular 0.90
TransposiƟon 0.950 0.965 29.925
BO years
0.10
NC
0.91 0.970
GO
0.90
Fig. 22.2 Note C Complication, NC No complication, endpoint, BOD Resulted in revision submuscular trans-
BO Bad outcome, GO Good outcome, QALYs position surgery, Circle nodes Chance event, Square
Quality-adjusted life years, BOh Was considered an nodes Decision to be made, Triangle nodes Outcome
22 Decision Analysis and Surgery 229
surgery with a longer recovery and potential for Table 22.3 Utility by health status
chronic disability (see Table 22.3) [11–13]. Health state Utility
Menopause symptom 0.99
Anti-hypertensive treatment side effects 0.95–0.99
Are the Results Valid?
Wrist arthrodesis 0.95
Before applying the decision analysis to the Kidney transplant 0.84
patient in question, it is important to determine if Hospital dialysis 0.57
the decision analytic model the authors formu- Severe angina 0.50
lated is applicable to the clinical scenario. This Adapted from [11–13]
means examining the methodology of the mod-
22 Decision Analysis and Surgery 231
el’s design, as to how the authors determined the outcome was defined as complete relief of all
outcomes, the methods used to combine the ev- sensory symptoms for greater than 2 years and all
idence, the sources and credibility of the utilities other results were considered bad outcomes.
chosen, as well as the potential impact of
uncertainty. Was an Explicit and Sensible Process Used to
Identify, Select, and Combine the Evidence into
Were All the Important Strategies and Outcomes Probabilities?
Included?
The methods used to combine the evidence in
The first step in appraising a decision analysis a decision analysis are predictive of the clinical
is determining whether the authors have included value of the final result. Providing the proper
all clinically relevant treatment strategies and methodological steps is necessary to determine
outcomes. The decision analysis by Brauer and applicability to a given patient and also to min-
Graham [1] reviews the literature of the four imize the introduction of bias. Brauer and Gra-
most common surgical interventions for ulnar ham [1] state that they performed a literature
nerve entrapment at the elbow. Brauer and Gra- review and that they were guided by results from
ham [1] tabulate the advantages and disadvan- various clinical trials and guidelines [2–5]. They
tages of each surgical intervention as well as the described the reasoning for defining the out-
peri-operative and postoperative care. They pro- comes as good outcome versus bad, and also data
vide a table of the variables determined from limitations and assumptions used in the model.
their review of various randomized controlled Unfortunately, their tabulated base case data for
trials comparing several of each of the operative probabilities, utilities and disutilities as well as
interventions. They clearly state the limitations of the sensitivity analysis and threshold values, did
the available studies, but do not explicitly not include confidence intervals or other intervals
describe the search criteria, nor provide tabulated to more easily allow the reader to analyze the
or confidence intervals or other intervals for each model and compare to their clinical case. Brauer
variable. The authors constructed a decision and Graham [1] report the use of SMLTREE (JP
model to fit the clinical scenario usually Hollenberg, Roslyn, NY, USA) software to
encountered in clinical practice; in this case, the construct the decision tree. Figure 22.2 showing
four different operative interventions the basic tree design moving from right to left,
(Table 22.1). The failure of an initial surgery demonstrates that after each decision node for the
either a direct release, or subcutaneous transpo- four various surgical options there was then a
sition, or medial epicondylectomy, would result chance node of complication (C) or no compli-
in the secondary surgical treatment of submus- cation (NC), with a further chance node of bad
cular transposition. Failure of submuscular outcome or good outcome as the final pathway
transposition was equated as the endpoint. state. They did summarize the risks from the
How well the structure of the model fits with literature to establish an overall rate but without
our clinical scenario depends on reviewing the the intervals being given. The papers analyzed by
decision tree. Figure 22.2 is the complete deci- Brauer and Graham [1] were explicitly given in
sion tree for their surgical management of ulnar their appendix. The authors chose base case
neuropathy at the elbow. There are four clinically failure rates of 10% for submuscular transposi-
important surgical options each with good out- tion (3–50%), 38% (10–73%) for anterior sub-
come (GO) or bad outcome (BO). The objective cutaneous transposition, 60% for subcutaneous
for the study was to establish a general policy for decompression and 45% for medial epi-
recommendations of treatment recognizing that condylectomy [1]. Their model base case is
this policy may not fit all situations. The out- biased towards a better outcome, by choosing a
comes of the decision analysis are labeled as lower initial base case complication rate of 10%
health states. In this decision analysis, a good within an interval of 3–50% from their literature
232 G. M. Rockwell and J. Murphy
review for submuscular transposition and 38% of quality of life ratings by patients; and from
from a range of 10–73% for anterior subcuta- large groups of people representing the general
neous transposition; with a poorer outcome with population.
simple decompression (60%) or medial epi- In the decision analysis by Brauer and Gra-
condylectomy (45%) [1]. Further analysis with a ham [1], they used indirect measures of pub-
one-way sensitivity analysis was performed to lished utilities from the quality of life outcomes
determine threshold values (Table 22.2). in upper extremity and hand disorders from
They assume the outcomes would not change patients with osteoarthritis of the wrist, as well as
with minor complications, except in the pathway from the American Medical Association Guides
for medial epicondylectomy, which could result to the Evaluation of Permanent Impairment [11–
in medial instability or ectopic bone formation 13]. They stated their chosen utilities of 0.99 for
and therefore the probability of a failure with a small scar without pain, 0.98 for a large scar,
complications was slightly higher than with no and 0.954 for chronic mild sensory or motor
complications. They considered most complica- disturbances [1]. They assumed the utility would
tions for the various surgical treatments to be be unchanged despite a minor temporary com-
transient and to have little effect on the failure of plication. The effect of most complications was
a treatment and therefore the probabilities of a measured through disutilities also taken from the
good or poor outcome were unchanged with literature and varying from 0.005 to 0.01 [1].
complications or no complications. This is an Disutilities would occur for either hospitaliza-
assumption that may not necessarily be true for tion, discomfort associated with the intervention,
all patient groups but they have explicitly length of postoperative immobilization and need
described this. Outcome estimates were trans- for rehabilitation postoperatively (Table 22.2).
formed into quantitative estimates, or probabili- The indirect values chosen may not represent
ties with zero being impossible and one being those of either patients or the general public. By
certain. The authors do highlight the data limi- comparison, a Cost-Utility Analysis study by
tations and assumptions used. Song et al. [14] directly measured utility values,
from family members and patients undergoing
Were the Utilities Obtained in an Explicit and surgical interventions for ulnar neuropathy, using
Sensible Way from Credible Sources? time trade-off techniques. Despite the differences
in either indirect calculations of utilities versus
The authors used a simple decision analysis direct measurement, Brauer and Graham [1] and
with two possible outcomes “good outcome Song et al. [14] had found similar expected
(complete resolution of all sensory deficits at utilities for direct decompression and anterior
2 years)” or “bad outcome.” In this model, util- subcutaneous transposition (0.98 and 0.98 for
ities were determined in comparison to similar Song et al. [14] versus 0.973 and 0.969 for
disabilities in the upper extremity, and other Brauer and Graham [1], respectively. Medial
disease states (Table 22.3) [11–13]. Several epicondylectomy and submuscular transposition
methods are available to measure utilities had a statistically significant difference between
directly; different methods use different scales the two authors (Brauer and Graham [1]: 0.961
with zero equaling death and one equaling per- and 0.965, vs. Song et al. [14] 0.88 and 0.82
fect health. These authors report sources of util- respectively). When compared to standardized
ities directly from standardized health states [25– health state utility measurements (Table 22.3) a
27]. Patient specific utilities can be measured utility of 0.84 is that of a kidney transplant. Thus
directly from the patient. Clinical policy utilities despite direct measurements from patients and
can be measured directly from large groups of family, undergoing the specific surgery for ulnar
patients with the disorder; from published studies neuropathy, and measured with time trade off,
22 Decision Analysis and Surgery 233
utilities measure by Song et al. [14] appear lower published evidence is often imprecise with wide
than expected. Utilities by Brauer and Graham intervals of estimates. Brauer and Graham [1]
[1] seem more logical when compared to those of report some of the limitations in the quality of the
other health states, though indirectly measured. studies in their review, with wide intervals of
success and failure. The less valid the methods or
Was the Potential Impact of Any Uncertainty in less precise the estimates from the literature the
the Evidence Determined? wider the intervals that must be included in the
sensitivity analysis to determine when a thresh-
Uncertainty in the evidence, and poor model old level is reached. Threshold levels occur when
design can lead to miscalculation of both the a change in a variable results in a change in the
utilities assigned to outcomes and the probabili- outcome conclusions would occur (Table 22.4).
ties of each health state. As in any critical ap- Sensitivity analysis is the systematic explo-
praisal of the literature, the highest quality of ration of uncertainty in the data to see what effect
evidence comes from well designed, adequately varying estimates or probabilities for risks, ben-
powered randomized controlled trials (RCTs), efits and utilities have on the expected outcome
and meta-analysis derived from such RCTs. and change of a clinical strategy. Performing a
Authors should acknowledge the quality of the sensitivity analysis is akin to exploring the best
literature in their review. Much of the uncertainty and worst case scenario. In this case, the authors
in decision-making arises from a lack of valid used a wide sensitivity analysis to determine
evidence in the literature. Even when present, threshold values between the two highest rated
Table 22.4 Results of one-way sensitivity analysis comparing simple decompression and subcutaneous transposition
Variable Base value probability
(Threshold)
Complication rate Simple Decompression 0.02 (NT)
Medial Epicondylectomy 0.27 (NT)
Anterior Subcutaneous Transposition 0.07 (NT)
Anterior Submuscular Transposition 0.09 (NT)
Probability of a bad outcome Simple Decompression/Complication 0.60 (0.82)
Medial Epicondylectomy/Complication 0.47 (NT)
Anterior Subcutaneous 0.38 (NT)
Transposition/Complication
Anterior Submuscular 0.10 (NT)
Transposition/Complication
Simple Decompression/No Complication 0.60 (NT)
Medial Epicondylectomy/No Complication 0.45 (NT)
Anterior Subcutaneous Transposition/No 0.38 (NT)
Complication
Anterior Submuscular Transposition/No 0.10 (0.82)
Complication
Disutility of individual Simple Decompression 0.005 (0.0158)
procedures Medial Epicondylectomy 0.01 (NT)
Anterior Subcutaneous Transposition 0.008 (0.001)
Anterior Submuscular Transposition 0.01 (0.0296)
NT no threshold
Adapted from Brauer and Graham [1]
234 G. M. Rockwell and J. Murphy
procedures, simple decompression, and subcuta- where there was bias for better outcomes with
neous transposition. The threshold, in this case, submuscular transposition and worse with direct
refers to the value beyond which the preferred decompression. The gain would be much more if
strategy shifted from simple decompression to a more realistic base case was calculated. This
subcutaneous transposition. decision analysis model has very small differ-
ences in outcome indirect utilities and perhaps
may be an over-simplification because transient
What are the Results? complications were not used in the outcomes.
In the baseline analysis does one strategy results How Strong is the Evidence in the Analysis?
in a clinically important gain for the patient? If
not is the result a toss-up? The strength of evidence used in creating a
The choice between various outcomes strate- decision analysis depends on the quality of the
gies is based on the outcome with the highest studies from which the probabilities and utilities
value of expected utility. One calculates this by were estimated. Ideally, every probability at each
folding back the decision tree probabilities from node in the decision tree is supported by precise
right to left for each outcome. The decision estimates from primary RCT data or
analyst chooses the scale on which to measure meta-analysis of high methodological quality.
the expected utilities to fit the clinical problem: Good decision analysis can still be performed
Quality of life, reduction of mortality or other with some imprecise or ambiguous data as long
gains in remaining life expectancy. Quality of as most of the data are good and the analysts
life and quantity of life can both be combined in explain any limitations and plan for sensitivity
calculating the QALYs or healthy year equiva- analysis accordingly. The weaker the evidence is
lents. The authors chose to present their out- used in the analysis the weaker the overall
comes in a table of expected utilities for each inference one to make from the results [3].
procedure (Fig. 22.2). The strength of the evidence cited by Brauer
and Graham [1] is good; this is reflected by their
Is the Difference Between the Strategies Clini- systematic review of the data from three ran-
cally Important? domized controlled studies, one meta-analysis,
and various other studies totaling 24 studies.
A difference means that any given patient may A Cochrane database systematic review of ulnar
gain more or less than others and this is where neuropathy at the elbow updated in 2016 and
using a policy decision analysis loses the speci- initially published in 2010 with a second update
ficity for an individual patient. A gain in QALYs in 2012 tabulated all available studies which
greater than 2–3 months is considered an showed that most trials had fewer than 60 par-
important gain as published by Naimark et al. ticipants and nine randomized control trials with
[12] and Tsevat et al. [13]. In the study by Brauer a total of only 587 participants [15]. Their
and Graham [1] they display expected utilities, meta-analysis concluded that there was at best
which, can be converted to QALYs assuming a moderate quality of evidence. It should be
50-year-old man with 31 years of remaining life remembered that this study was a deterministic
as determined from the literature where the mean analysis based on secondary data collected from
age of a patient with ulnar nerve neuropathy is the literature, rather than randomized controlled
52.5 years. This would give a net gain for the trial comparing all four surgical options under
base case of 1.45, 2.94, and 4.43 months com- consideration. Deterministic data uses indirect
paring subcutaneous decompression, to anterior information and requires a wider sensitivity
subcutaneous transposition, submuscular trans- analysis to test the robustness of conclusions.
position with medial epicondylectomy, respec- High quality directly measured data from RCTs
tively. Of note, this is based on the base case or meta-analysis of RCTs provide the strongest
22 Decision Analysis and Surgery 235
evidence on which clinical decisions can be arthrodesis which is likely more chronically
based. However, in such circumstances, where disabling than an ulnar nerve submuscular
strong data are readily available, the use of transposition at the elbow.
decision analytic modeling may not be necessary,
unless residual uncertainty exists.
Will the Results Help me in Caring
Could the Uncertainty in the Evidence Change for My Patient?
the Results?
Can the Results be Generalized to This Patient?
After interpreting the results in the base case The perspective, design, and assumptions used
analysis, it is critical to test how vulnerable the in creating a decision analysis become critical in
results are to the variation in probabilities and determining the generalizability of the model, and
key outcome measures. If minor changes in the in particular its relevance to the target population.
probability or effectiveness measures affect the The model must be based on clinical scenarios that
outcome in a way that would be clinically are similar, predictive and helpful to the actual
important to the patient in question, then the reality of the clinical question at hand. Probabili-
decision model may have errors in design and ties and utilities must reflect the values and risks
should be reconsidered. A sensitivity analysis is present in the decision-makers’ environment.
considered robust if changing the variables The applicability of a decision analysis for
throughout its possible intervals do not change policy, group, or patient-specific interventions
the strategy or conclusion. depends on the clinical characteristics of patients
In the Brauer and Graham [1] article, the for whom the analysis is intended. The analysis
authors explored the strength of their model should have a description of the patients’ con-
with a wide sensitivity analysis. The robustness ditions to allow for generalizability. If the
of their conclusions is demonstrated by the authors do not describe the samples used in the
clinically unlikely probability requiring the studies you can search inclusion and exclusion
failure of direct decompression to be greater criteria in the original references to see how
than 82% before subcutaneous transposition similar the characteristics are to your patient. If
would be the preferred treatment. The authors the analysis is designed for patients with differing
also found stable conclusions unless submus- characteristics than yours, you can assess the
cular transposition had a bad outcome greater sensitivity analysis for the variable used, and
than 82% (eight times their base case of 10%) intervals, to see where your patient may fall in
The sensitivity analysis recommending direct this current model.
decompression as the treatment of choice was In the decision analysis by Brauer and Gra-
also stable and robust unless the disutility tri- ham [1] the authors state their objective is to
pled to greater than that for subcutaneous create a general policy for recommended surgical
transposition and submuscular transposition treatment for moderate, sensory only ulnar neu-
which they stated would be clinically unlikely. ropathy at the elbow. They outline the advan-
Threshold levels were also encountered if the tages and disadvantages of the four procedures as
disutility of submuscular transposition was well as the assumptions used in this decision
greater than a factor of two times, which, would model. The clinical features match that of your
be similar to that of a complex wrist recon- patient as he does not have any motor loss or
struction with scaphoid excision and metacarpal other comorbidities.
236 G. M. Rockwell and J. Murphy
Do the Probability Estimates fit my Patient’s 0.98 for large scar no pain. The effect of com-
Clinical Features? plications was measured by disutilities also
measured from the literature. The authors
Every decision analysis should contain perti- describe explicitly their postoperative routine, the
nent positive and negative characteristics for the complications, and advantages and disadvantages
patient population to which it is applied. associated with each procedure to determine the
Reviewing the sensitivity analysis in the article disutilities measured. The authors do emphasize
will permit the gauging of similarities with the the threshold and robustness of their models from
patient in question. In this article, the authors their one-way sensitivity analysis.
clearly state the target population has moderate This decision analysis has limited generaliz-
sensory ulnar neuropathy at the elbow, no motor ability to our current patient due to the indirect
weakness, and unresponsive to conservative in- calculation of utilities determined from similar
terventions. The generalizability of this decision disease states in the literature, which are neither
analysis is therefore limited to this group of patient nor procedure specific. However, the
patients. broad sensitivity analysis demonstrated robust-
ness of their conclusions, despite a bias against
Do the Utilities Reflect How my Patients will the direct decompression.
Value the Decision Outcomes? A more recent study with directly measured
utilities for ulnar nerve at the elbow from family
Quality of life, despite having universally and patients with ulnar nerve entrapment using
accepted quantitative values, is strongly influ- time trade-off techniques demonstrated similar
enced by individual values and social context. utilities for direct decompression and anterior
Surgeon comfort, experience, and discussion of subcutaneous transposition at 0.98 but much
potential risks for various interventions can all lower utilities 0.82 for anterior submuscular
influence patients’ choices. In this decision transposition and 0.88 for medial epicondylec-
analysis all bad outcomes were treated with an tomy. Despite the more elaborate time trade-off
additional surgery in the form of a revision and direct measurement of utilities from patients
submuscular transposition. This may not be with the disease process in the study by Song
representative of the reality in any given clinical et al. [14], the overall conclusion on the choice of
scenario and may impact on decision nodes and direct decompression being the preferred treat-
thereby the validity of the model. If the decision ment option for primary cubital tunnel release
analysis is designed for an individual patient then was the same as the simpler decision analysis by
the utilities should be measured directly from that Brauer and Graham [1].
patient. If utilities are based on a large group of
patients or members of the public, then the utility
values may include those of your patient, but the Resolution of the Scenario
intervals of values must be very broad. One-way
or two-way sensitivity analysis can further see if After working through the above steps, the resi-
the value of your patient will affect the final dents had a much clearer understanding of the
decision. approach to surgical management of moderate
In this example, utilities were decided on by sensory ulnar neuropathy at the elbow. Despite
experts from published guidelines in the field of the absence of a well-powered randomized con-
upper extremity and hand surgery compared to trol trial comparing the four most common sur-
standard known utilities for wrist fusion recon- gical treatment strategies for ulnar nerve
struction and normal patients. Elsewhere, utilities decompression at the elbow, the decision analy-
were 0.95 for chronic sensorimotor disturbance sis by Brauer and Graham [1], demonstrated
mild in nature, 0.99 for small scar with no pain, fairly robustly that simple decompression is the
22 Decision Analysis and Surgery 237
preferred surgical treatment for ulnar neuropathy 9. Bartels RH. Ulnar neuropathy at the elbow:
at the elbow. The residents felt happy to discuss follow-up and prognostic factors determining
this management plan with their patient. outcome. Neurology. 2005;64(9):1664–5.
(author reply 1664–5).
10. Prpa B, Huddleston PM, An KN, Wood MB.
Appendix 1. Articles Located During Revascularization of nerve grafts: a qualita-
Literature Search tive and quantitative study of the soft-tissue
bed contributions to blood flow in canine
1. Wali AR, Park CC, Brown JM, Mandeville R. nerve grafts. J Hand Surg Am. 2002;27
Analyzing cost-effectiveness of ulnar and (6):1041–7.
median nerve transfers to regain forearm flex-
ion. Neurosurg Focus. 2017;42(3):E11.
2. Park SE, Bachman DR, O’Discoll SQ. The
safety of using proximal anteromedial portals References
in elbow arthroscopy with prior ulnar nerve
transposition. Arthroscopy. 2016;32(6): 1. Brauer CA, Graham B. The surgical treatment of
1003–9. cubital tunnel syndrome: a decision analysis. J Hand
Surg (Eur Vol). 2007;6(32E):654–62.
3. Nagi SS, Mahns DA. Mechanical allodynia in 2. Detsky AS, Naglie G, Krahn MD, Naimark D,
human glaborous skin mediated by Redelmeier DA. Primer on medical decision analysis:
low-threshold cutaneous mechanoreceptors part I—getting started. Med Decis Making.
with unmyelinated fibres. Exp Brain Res. 1997;17:123–5.
3. Detsky AS, Naglie G, Krahn MD, Redelmeier DA,
2013;231(2):139–51. Naimark D. Primer on medical decision analysis: part
4. Page MH, O’Connor D, Pitt V, 2—building a tree. Med Decis Making.
Massy-Westropp N. Exercise and mobilisa- 1997;17:126–35.
tion interventions for carpal tunnel syndrome. 4. Naglie G, Krahn MD, Naimark D, Redelmeier DA,
Detsky AS. Primer on medical decision analysis: part
Cochrane Database Syst Rev. 2012;13(6): 3—estimating probabilities and utilities. Med Decis
CD009899. Making. 1997;17:136–41.
5. Lee KM, Chung CY, Gwon DK, Sung KH, 5. Naimark D, Krahn MD, Naglie G, Redelmeier DA,
Kim TW, Choi IH, et al. Medial and lateral Detsky AS. Primer on medical decision analysis: part
5—working with Markow process. Med Decis
crossed pinning versus lateral pinning for Making. 1997;17:152–9.
supracondylar fractures of the humerus in 6. Mastracci TM, Thoma A, Farrokhyar F, Tandan VR,
children: decision analysis. J Pediatr Cina CS. User’s guide to the surgical literature: how
Orthop. 2012;32(2):131–8. to use a decision analysis. Can J Surg. 2007;50
(5):403–9.
6. Koscielniak-Nielsen ZJ, Frederiksen BS, 7. Davis Sears E, Chung KE. Decision analysis in
Rasmussen H, Hesselbjerg L. A comparison plastic surgery: a primer. Am Soc Plast Surg.
of ultrasound-guided supraclavicular and 2010;126(4):1373–80.
infraclavicular blocks for upper extremity 8. Krahn MD, Naglie G, Naimark D, Redelmeier DA,
Detsky AS. Primer on medical decision analysis: part
surgery. Acta Anaesthesiol Scan. 2009;53 4—analyzing the model and interpreting the results.
(5):620–6. Med Decis Making. 1997;17:142–51.
7. Gasco J. Surgical options for ulnar nerve 9. Richardson WS, Detsky AS. User’s guides to the
entrapment: an example of individualized medical literature. VII. How to use a clinical decision
analysis. A. Are the results of the study valid?
decision analysis. Hand (NY). 2009;4 Evidence-Based Medicine Working Group. JAMA.
(4):350–6 1995;273:1292–5.
8. Brauer CA, Graham B. The surgical treatment 10. Richardson WS, Detsky AS. User’s guides to the
of cubital tunnel syndrome: a decision analy- medical literature. VII. How to use a clinical decision
analysis B. What are the results and will they help me
sis. J Hand Surg Eur Vol. 2008;32(6):654–62 in caring for my patients? JAMA. 1995;273:1610–3.
238 G. M. Rockwell and J. Murphy
11. Torrance GW. Utility approach to measuring health 14. Song JW, Chung KC, Prosser LA. Treatment of ulnar
related quality of life. J Chronic Dis. 1987;40:593–600. neuropathy at the elbow: cost-utility analysis. J Hand
12. Naimark DM, Naglie G, Detsky AS. The meaning of Surg. 2012;37A:1617–29.
life expectancy: what is a clinically significant gain? 15. Caliandro P, La Torre G, Padue R, Giannini F,
J Gen Intern Med. 1994;9:702–7. Padua L. Treatment for ulnar neuropathy at the
13. Tsevat J, Weinstein MC, Williams L, Tosteson AN, elbow. Cochrane Database Syst Rev. 2016;15:
Goldman L. Expected gains in life expectancy for CD006839.
various coronary heart disease risk factor modifica-
tions. Circulation. 1991;83:1194–201.
Economic Evaluations in Surgery
23
Achilles Thoma, Feng Xie, Jenny Santos
and Charles H. Goldsmith
Surgeons of all specialties are constantly intro- value of the forgone benefits” because the resource
duced to new surgical techniques or approaches to is not available for its alternative use [1].
solve surgical problems. These innovations are New innovations in surgery are often touted by
disseminated via conferences, workshops, or pub- their proponents as being cost-effective with the
lications in specialty journals. The typical surgeon recommendation to adopt them in our practice and
faces difficulty in deciding whether to adopt a new patients. Unfortunately, the term cost-effective is
surgical innovation when conflicting opinions by more often than not misused in the surgical liter-
experts are presented. A surgeon needs to consider ature. For example, Ziolkowski et al. [2] found that
the opportunity cost when adopting a novel surgi- most economic evaluations published in plastic
cal intervention and abandoning one that they surgery and touted to be cost-effectiveness studies
usually use. Opportunity cost is defined as “the were simply cost comparisons.
A surgical technique or surgical approach to
be considered as “cost-effective” must have
integrated the costs and effectiveness [3, 4].
A. Thoma (&) J. Santos
Department of Surgery, Division of Plastic Surgery,
Economic analysis is a set of formal, quantitative
McMaster University, Hamilton, ON, Canada methods used to compare alternative strategies
e-mail: athoma@mcmaster.ca with respect to their resource use and their
J. Santos expected outcomes [4]. Economic evaluation is a
e-mail: santosj8@mcmaster.ca unique study design just as randomized con-
F. Xie trolled trial and case-control studies are. As most
Faculty of Health Sciences, McMaster University, surgeons do not have background training in
Hamilton, ON, Canada health economics, one can understand how
e-mail: fengxie@mcmaster.ca
important terminology such as “cost-effective”, is
A. Thoma F. Xie C. H. Goldsmith misused. This misuse, however, may have direct
Department of Health Research Methods, Evidence,
and Impact, McMaster University, Hamilton, ON,
consequences if surgeons adopt new techniques
Canada or surgical approaches, which are touted
C. H. Goldsmith
“cost-effective”, when in truth they are not;
Faculty of Health Sciences, Simon Fraser University, inefficient use of scarce healthcare resources is
Burnaby, BC, Canada one consequence. OECD data estimate that 20%
C. H. Goldsmith of health expenditure worldwide is wasted,
Department of Occupational Science and resulting in minimal-if any-improvement of
Occupational Therapy, Faculty of Medicine, health outcomes [5]. Although we do not have
The University of British Columbia, Vancouver,
BC, Canada
specific breakdown figures for surgery, we have
no reason to believe that surgery is immune to evaluation, it is often called a partial economic
wasteful practices. evaluation, Cost-Effectiveness analysis (CEA),
This chapter has three objectives. First, it will Cost–Utility Analysis (CUA), and Cost–Benefit
introduce surgeons to the terminology used in the Analysis (CBA) [4]. The main difference in
economic evaluations and demystify this impor- these analyses is how the outcome or conse-
tant study design. Second and most importantly, quences of the treatments under comparison are
it will help the reader appraise and understand an measured. The distinguishing features of these
article that purports to be an economic evaluation four economic evaluations have been summa-
in surgery. Third, it will hopefully stimulate rized in Table 23.1.
surgeons to consider “piggy-backing” economic There are two main types of methodologies
evaluations to their effectiveness studies. We used in economic evaluations. The first type is
believe that an economic evaluation alongside a the model-based and the second is trial-based
robust randomized control trial that compares a economic evaluations. In the first, the
novel surgical technique to a standard technique model-based evaluation (also known as Deter-
provides the best level of evidence to adopt or ministic Analysis or probabilistic analysis), a
reject a novel surgical intervention. We will model is built which in its simpler form is the
attempt to keep the mathematics to the bare decision tree which is explained in Chap. 22 of
minimum and make the chapter understandable this book. Primary data are usually derived from
and hopefully fun to read. the literature, for example, by pooling the evi-
dence (preferably through a systematic review of
the literature). From these pooled data, one
derives probabilities of complications or positive
Explanation of the Types health outcomes labeled as “health states”. These
of Economic Evaluations health states are then entered into a decision
and Terminology Used analysis tree, which in its most basic form will
look like the one illustrated in Fig. 23.1a.
There are four types of health economic evalua- Clinical investigators then proceed to estimate
tions; Cost analysis (CA): this is cost comparison the expected costs and expected benefits of the
study and usually not considered a full economic “health states” of the interventions under study
DC
ICER ¼
DE
Mean Costnovel surgical intervention
Mean Costcomparative surgical intervention
¼
Mean Effectivenessnovel surgical intervention
Mean Effectivenesscomparative surgical intervention
• ICER = $40,000/successful breast recon- by both specialties and the decision is easier
struction after mastectomy to make.
• ICER = $50, 00/successful knee replacement
making [3, 4]. Here, we calculate the Incremental Hypothesis testing is done via the p-value. In
Cost–Utility Ratio (ICUR). economic evaluations, one performs a sensitiv-
DC
ICUR ¼
DU
Mean Costnovel surgical intervention
Mean Costcomparative surgical intervention
¼ !
Mean QALYnovel surgical intervention
Mean QALYComparative intervention
(If the ICUR < $50,000/QALY, this is a ity analysis to determine how robust the con-
strong indication that the novel intervention clusions are [4]. After the main analysis is
should be adopted [8]). As a general guidance, it performed based on the point estimate, the
is probably better to use a Willingness-to-Pay investigators redo the analysis based on one or
(WTP) threshold as the $50,000 figure has no two standard deviations around the point esti-
theoretical basis and it is mostly used in the USA. mate of the costs or effectiveness or both, to see
whether or not the conclusions of the study
change. Probabilistic analyses with nonpara-
For Cost–Benefit Analysis metric “bootstrapping” with replacement are
used in more advanced economic analyses [4,
In this type of analysis, we attach a monetary 9]. The conclusions are reported as robust if all
value to the consequence of an intervention using calculations, main analysis, and sensitivity
a Willingness-to-Pay (WTP) approach. analyses favor the novel surgical intervention.
X
n
bi ðtÞ ci ðtÞ Now that you have mastered the important
NSBi ¼ terminology and principles, we will help you
t¼1 ð1 þ r Þt1
understand how to appraise and understand a
published economic evaluation report, by taking
where NSB = Net Social Benefit, t = year,
you through the following clinical scenario.
bi(t) = benefits derived in year t, ci(t) = costs
derived in year t and r = discount rate, n = life-
time of study.
Clinical Scenario
(This type of economic evaluation is not used
much in health care as it attaches a monetary
At the hospital orthopedic rounds, a young sur-
value to the effectiveness of a health state, which
geon asks his chief of service for an additional
health economists believe discriminates against
OR room for arthroscopic knee surgery. He
the poor) [4].
claims there is an increasing demand for this
useful procedure. His chief is skeptical because
Sensitivity Analysis what he read that the benefits of arthroscopic
surgery is controversial. He doubts he can per-
In clinical effectiveness studies, the results are suade the hospital administration to invest money
reported by a point estimate such as a mean and in this endeavor and asks his junior colleague to
a validity estimate such as a standard deviation. present supporting evidence that arthroscopic
244 A. Thoma et al.
surgery is cost-effective compared to just physi- surgery and nonoperative methods. Now, we
cal therapy from the patient’s point of view. proceed to appraise the economic evaluation to
determine whether the results are valid and
whether they are relevant to our practice.
Finding the Evidence
To identify the best evidence and inform his Are the Results Valid?
colleagues, the surgeon begins by conducting a
literature search, as described in Chap. 3, in this Did the Analysis Provide a Full
book and according to the “Users” guide to the Economic Comparison of Healthcare
surgical literature: how to perform a high-quality Strategies?
literature search” [10].
The effectiveness of a surgical intervention An economic analysis compares two or more
can be found in a high-quality Randomized healthcare interventions; in our case two surgical
Controlled Trial (RCT) or a meta-analysis of interventions to a surgical problem. Specifically, it
number of RCTs. In this case, an RCT that compares the costs and the consequences (out-
compares arthroscopic surgery to optimal non- comes) of these interventions. The Marsh et al. [11]
operative therapy, in which the investigators article is a full economic evaluation as they com-
coupled an economic evaluation, would provide pared the costs in dollars and the effectiveness of the
the best evidence of cost-effectiveness. The sur- interventions, arthroscopy and a nonoperative
geon follows the PICOT format, as described in method, using the Western Ontario McMaster
Chap. 4, for the identification of important key Ostearthritis Index (WOMAC) and Quality-
words used in the search process: Adjusted Life Years (QALYs). It seems this is a
full economic evaluation and we proceed to the next
Population: Patients with knee arthritis question.
Intervention: Arthroscopic surgery
Comparison: Physical Therapy (Physiotherapy or
Optimal nonsurgical therapy) Were Relevant Viewpoints
Outcome: Cost-effectiveness Considered?
Time horizon: At least a year follow-up
In an economic evaluation, there are a few pos-
We performed a literature search by choosing sible perspectives (viewpoints). By this, we mean
the filtered database, COCHRANE reviews and an who bears the costs associated with the use of the
unfiltered database, PubMed. With the PICOT new surgical procedures. In appraising an eco-
format in mind, we used the search strategy: “os- nomic evaluation specifically, we ask who is
teoarthritis (P) AND arthroscopic surgery (I) AND benefiting from this study? Is it the patient, the
physical therapy (C) AND cost-effectiveness hospital, the third-party payer or the society or
analysis (O)”. Between COCHRANE and others? There are occasions where a novel inter-
PubMed, we identified 7 articles. We excluded vention may be found to be cost-effective from
articles because they did not include one perspective (i.e., the hospital) but not neces-
cost-effectiveness analyses, if they were not recent sarily from another (i.e., the patient). For exam-
(last 10 years), or if they did not measure QALY. ple, the hospital may save money by discharging
One by Marsh et al. [11], which was listed in patients earlier after a surgical intervention com-
both databases, caught our attention. We read the pared to an older program when the patients are
abstract and we found it relevant to our purpose. hospitalized for a longer period but from the
This article includes both costs and effectiveness patient’s point of view, this may be more costly if
measured in QALYs and it is a full economic the patient’s spouse has to take time off work to
evaluation. It compares arthroscopic knee care for them during recovery at home.
23 Economic Evaluations in Surgery 245
In the Marsh et al. [11] study, we are told that Marsh et al. considered different baseline risks as
the investigators conducted the cost-effectiveness they performed a subgroup analysis [11]. This
analysis from the perspective of the Canadian included patients with less severe radiographic
healthcare payer and the societal perspectives. disease (KL grade 2) and patients reporting
These perspectives are legitimate; however, we mechanical symptoms of catching or locking.
were concerned that they did not mention the The clinical strategies they considered, arthro-
patient’s perspective. As the societal perspective scopic versus nonoperative seems appropriate.
also includes any out of pocket costs to the
patient such as physical therapy, medications or
assistive devices not covered by the provincial Were the Costs and Outcomes
insurance plan and indirect costs such as time off Properly Measured and Valued?
employment, we wonder why they did not
include explicitly the patient perspective. Their Was Clinical Effectiveness Established?
claim of taking a societal perspective may be
incomplete, as other aspects, such as childcare As mentioned in the introduction, there are two
and other familial obligations, could be different methodologically distinct types of economic
for the patient between the two interventions. evaluations, the model-based (deterministic)
The perspective taken in an economic evalu- method, and the trial-based (stochastic) method.
ation may depend on the question asked. The The preferred method is trial-based. In this type,
panel in cost-effectiveness in health strongly patient-derived data are extracted from a
recommended that the societal perspective is the well-executed RCT, in which the investigators
most important and should be considered, if collect costs from various perspectives parallel to
possible [3]. Marsh and colleagues did this but as the RCT. This type of economic evaluation
mentioned above, there may be some limitations provides high internal validity but at the expense
to it [11]. of external validity as the subjects in the study
may not be typical of community patients. If
multiple RCTs exist, one can then pool the
Were All Relevant Clinical Strategies results in meta-analysis thus increasing general-
Compared? izability because the pooled estimate of the
effectiveness is derived from a wider spectrum of
When conducting an economic evaluation, it patients. To improve generalizability, one may
important that the investigators compare all rele- relax the inclusion criteria in the RCT thus
vant strategies for the condition under investiga- including patients that represent the whole pop-
tion. For example, the investigators should be ulation thus making this a pragmatic RCT and
comparing a novel procedure to a standard one and pragmatic economic evaluation.
not one that is rarely used. In addition, investiga- The clinical effectiveness in the Marsh et al.
tors should be considering patients of different article was measured by the WOMAC scale,
baseline risks. For example, if a general surgeon is which is a validated osteoarthritis instrument
performing a CEA on a novel hernia repair in the with total scores varying from 0 to 2400, higher
military population, this may not be generalizable scores indicating more pain and stiffness and
to a retirement community population. If a CEA is reduced physical function [12–14].
performed comparing two approaches to hand Marsh et al. [11] also uses QALYs for per-
surgery, one should consider both the Workers forming a cost–utility analysis, the preferred type
Compensation Benefit (WCB) patient and the of economic evaluation by policy makers. To use
non-WCB patient population, as the WCB patients QALYs, health utilities (such as HRQL or
are considered high-risk patients. weights) are required [4, 7, 15]. Marsh et al. [11]
This is accomplished by doing a literature used the Standard Gamble Technique to estimate
review of the population at risk. It seems that health utilities. A health utility score is anchored
246 A. Thoma et al.
at 0 (death) and 1 (perfect health). The other activities. The indirect costs and out of pocket
measure taken into account to measure QALYs is costs combined are included in the “societal
the duration of the corresponding health state. perspective”. The out of pocket costs fall into the
We believe that the investigators measured the patient perspective, so we are perplexed as to
effectiveness of the competing interventions why Marsh et al. [11] were not explicit on using
appropriately. the patient perspective separately.
value of ICER/ICUR, the greater the cost to outcome (WOMAC and QALY), by WTP (up to
improve the outcome (patient health). Accep- the clinically relevant threshold of $100,000 per
tance or rejection of novel surgical technologies unit gained), and stratified by perspective (Health
is based on this. It also depends on the patient care and Societal).
and the circumstances at hand, such as frequency
of the intervention and ability of the healthcare
system to support it. Ultimately, some type of Are Estimates of Costs and Outcomes
threshold of acceptability will be decided upon Related to the Baseline Risk
by a consensus of experts. in the Treatment Population?
In the Marsh et al. [11] study, there was also
an estimate of the total cost for each patient over As patients who are considered “high risk” are
the 2-year follow-up, therefore discounting (ac- generally more likely to benefit from a treatment
counting for the difference in cost presently than those who are considered “low risk” , it is
versus the cost in the future) was not necessary imperative to divide the population to reflect
[11]. Therefore, it seems that costs and outcomes these groups to determine if there is in fact a
were appropriately integrated. difference in cost and benefit between groups.
Marsh et al. [11] included a table outlining the
baseline characteristics of the population by in-
Was Appropriate Allowance Made tervention group (operative vs. nonoperative) and
for Uncertainties in the Analysis? they were very similar between groups. However,
the investigators could have categorized the
It is imperative to determine whether the continuous variables (i.e., age and BMI). This
ICER/ICUR values actually represent cost- would provide a better sense of the different
effectiveness, as there are many values that can baseline risks or benefits of one treatment over the
fall around the mean. This can be accomplished other for older individuals or those considered
by recalculating the ICER/ICUR using both the “over-weight” or “obese”. Those who have less
best- and worst-case scenarios, referred to as a severe disease (KL grade 2) and patients reporting
sensitivity analysis. If the conclusion on mechanical symptoms of catching or locking may
cost-effectiveness stays the same, it can be not be the only groups to consider [11].
decided with more confidence that the treatment
is truly cost-effective. Marsh et al. included a
sensitivity analysis, in which they used either What Are the Results?
extreme of their 95% CI surrounding the mean
differences in WOMAC scores and QALY. They What Were the Incremental Costs
estimated ICER and ICUR values that assumed and Outcomes Between the Two
the highest possible treatment effect observed in Strategies?
their sample, favoring either added arthroscopy
or nonoperative treatments only, followed by Marsh et al. [11] found a statistically significant
Cost-Effectiveness Acceptability Curves difference in mean cost between groups, health-
(CEAC). We believe they satisfied this criterion. care payer perspective and societal perspective
The economic evaluation by Marsh et al. also (Table 23.2). In terms of outcomes, there were
includes the CEACs, which indicate the proba- differences in both outcomes WOMAC (favoring
bilities of an intervention being cost-effective at surgery) and QALY (favoring nonoperative)
various Willingness-to-Pay (WTP) values [11]. between intervention groups. However, these
WTP values represent the amount one is willing were not statistically significant (p = 0.87 and
to spend per one unit increase in WOMAC or p = 0.72, respectively; Table 23.2).
QALY value [16]. Marsh et al. [11] performed a The net benefit regression models (WOMAC
Net Benefit Regression model (NBR) for each and QALY) did not indicate arthroscopic surgery
248 A. Thoma et al.
effective but also more costly. It is here that effective and less costly. If it falls in cell 2, we
economic evaluations needs to be performed to reject it, as it is less effective and more costly. We
find if indeed the novel interventions are use a similar reasoning for the other scenarios. In
cost-effective. most cases of modern surgery however, most new
Another way of explaining these possibilities innovations are more costly and at the same time
is the cost-effectiveness plane shown in more effective. This is similar to a new interven-
Fig. 23.3, with the effectiveness on the x-axis and tion falling in the right upper quadrant of the
the cost on the y-axis. If the novel intervention cost-effectiveness plane (Fig. 23.3). It is under
falls in the right lower quadrant, we have a win– these circumstances that we need to perform a
win situation meaning it is more effective and CEA or CUA. In the past, if a new innovation had
less costly. Alternatively, if it falls in the left an ICUR of $20,000/QALY the recommendation
upper quadrant, we have a lose–lose situation was given to accept it [19]. In recent years, this
meaning it is less effective and costs more. The figure has increased to $50,000/QALY [8]. The
slope of this line will determine its acceptability above were proposed thresholds in the literature.
or not. This slope is a ratio, which is precisely A more official threshold is the one proposed by
what the ICER/ICUR represents. The National Institute for Health and Clinical
Excellence (NICE), which suggests anywhere
from $27,000 to $41,000 to accept a new inno-
Will the Results Help Me in Caring vation [20].
for My Patients? Marsh et al. [11] found that arthroscopy was,
in fact, less effective and more costly than
Are the Treatment Benefits Worth nonoperative approaches. Surgery in itself also
the Harms and Costs? carries risks not encountered in nonoperative
approaches, such as anesthetic complications,
When we compare two surgical interventions deep venous thrombosis, and pulmonary
(novel versus standard), surgeons need to under- embolism [11]. Although these are uncommon,
stand that there are nine possibilities, as explained they can have lethal outcomes. In deciding on
in Fig. 23.2. If the novel intervention falls in cell 1, which approach to take, one may consider these
we accept the novel intervention, as it is more risks.
252 A. Thoma et al.
reconstruction: a cost-effectiveness analysis. Plast 14. Davies GM, Watson DJ, Bellamy N. Comparison of
Reconstr Surg. 2004;113:1650–61. the responsiveness and relative effect size of the
7. Thoma A, McKnight L. Quality adjusted life year Western Ontario and McMaster Universities
(QALY) as a surgical outcome measure. A primer for Osteoarthritis Index and the short-form Medical
plastic surgeons. Plast Reconstr Surg. 2010;125 Outcomes Study Survey in a randomized, clinical
(4):1279–87. trial of osteoarthritis patients. Arthritis Care Res.
8. Neumann PJ, Cohen JT, Weinstein MC. Updating 1999;12:172–9.
COST-Effectiveness—the curious resilience of the 15. Whitehead SJ, Ali S. Health outcomes in economic
$50,000-per-QALY threshold. N Engl J Med. evaluation: the QALY and utilities. Br Med Bull.
2014;371(9):796–7. 2010;96(1):5–21.
9. Thoma A, Kaur MN, Tsoi B, Ziolkowski N, Duku E, 16. Hoch JS, Briggs AH, Willan AR. Something old,
Goldsmith CH. Cost-effectiveness analysis parallel to something new, something borrowed, something
a randomized controlled trial comparing vertical scar blue: a framework for the marriage of health
reduction and inverted T-shaped reduction mamma- econometrics and cost-effectiveness analysis. Health
plasty. Plast Reconstr Surg. 2014;134(6):1093–107. Econ. 2002;11:415–30.
10. Waltho DA, Kaur MN, Haynes RB, Farrokhyar F, 17. Thoma A, Sprague S, Tandan V. Users guide to the
Thoma A. Users’ guide to the surgical literature: how surgical literature: how to use an article on economic
to perform a high-quality literature search. Can J analysis. Can J Surg. 2001;44:347–54.
Surg. 2015;58:349–58. 18. Thoma A, McKnight L, Knight C. The use of
11. Marsh JD, Birmingham TB, Giffin JR, economic evaluation in hand surgery. Hand Clin.
Isaranuwatchai W, Hoch JS, Feagan BG, et al. 2009;25:113–23.
Cost-effectiveness analysis of arthroscopic surgery 19. Laupacis A, Feeny D, Detsky AS, Tugwell PX. How
compared with non-operative management for attractive does a new technology have to be to
osteoarthritis of the knee. BMJ Open. 2016;5:1–10. warrant adoption and utilization? Tentative guideli-
12. Bellamy N, Buchanan WW, Goldsmith CH, Camp- nes for using clinical and economic evaluations.
bell J, Stitt LW. Validation study of WOMAC: a CMAJ. 1992;146(4):473–81.
health status instrument for measuring clinically 20. McCabe C, Claxton K, Culyer AJ. The NICE
important patient relevant outcomes to antirheumatic cost-effectiveness threshold: what it is and what that
drug therapy in patients with osteoarthritis of the hip means. Pharmacoeconomics. 2008;26(9):733–44.
or knee. J Rheumatol. 1988;15:1833–40. 21. Husereau D, Drummond M, Petrou S, Carswell C,
13. Ehrich EW, Davies GM, Watson DJ, Bolognese Ja, Moher D, Greenberg D, et al. Consolidated Health
Seidenberg BC, Bellamy N. Minimal perceptible Economic Evaluation Reporting Standards
clinical improvement with the Western Ontario and (CHEERS) statement. Eur J Health Econ. 2013;14
McMaster Universities osteoarthritis index question- (3):367–72.
naire and global assessments in patients with
osteoarthritis. J Rheumatol. 2000;27:2635–41.
Studies Reporting Harm in Surgery
24
Robin McLeod
In this clinical scenario, both the benefits and and control groups should be similar with the
risks of NSAIDs require assessment. While this exception of the risk factor which is being
might not be considered a risk of surgery per se, studied.
the perioperative care of patients, which includes Case-control studies are often used to study
such interventions as pre-operative antibiotics, rare diseases or as a preliminary study where
DVT prevention and pain management, is an little is known about the association between the
essential part of the care of surgical patients. In risk factor and the disease of interest. One of the
this example, the use of post-operative NSAIDs, most notable use of a case control study was the
surgeons should be concerned because of the risk demonstration that smoking is associated with
of an anastomotic leak which is definitely a lung cancer [1]. For more information on
surgical complication. case-control studies see Chap. 17.
In line with the tenets of Evidence Based The other method for determining whether a
Surgery, the best evidence for determining harm suspected risk factor leads to an adverse effect is
is required as much as it is needed for assessing to perform a cohort study (level 2 evidence).
benefit. Ideally, evidence from a randomized Cohort studies are longitudinal studies in which
controlled trial (RCT) is the best evidence for all patients are followed from a defined time but
making decisions about harm. However, RCTs unlike RCTs, the patients are not randomized to
are generally designed to evaluate the effective- the two groups but instead patients “choose” the
ness of a treatment to improve patient outcomes group they are in. Cohort studies can be retro-
and only secondarily are they intended to mea- spective or prospective and are also used to
sure complications or adverse events. It is cer- assess the effect of exposure or protective factors
tainly not ethical to randomize patients to on outcome. The limitation of cohort studies is
determine which intervention is more harmful. In that patients are not randomized and even if they
addition, some harmful effects may occur infre- are well matched there may be some differences
quently, and thus, the sample size would have to which are not apparent or cannot be measured.
be very large to be able to make conclusions The key strengths and limitations of the above
about the frequency of the harmful effects. Some study designs are summarized in Table 24.1. For
side effects are not immediately apparent after more information on cohort studies see Chap. 16.
treatment and therefore follow-up of patients In this particular question, it may have been
participating in RCTs would have to be exten- possible to do a RCT to assess the risk of anas-
ded. For more information on RCTs see tomotic leak while assessing the efficacy of
Chaps. 11–14. NSAIDs in controlling post-operative pain when
Instead of using RCTs to ascertain the risk of NSAIDs were first introduced. However, when
harm, cohort and case-control studies are most these studies were performed, there was no
often used. Case-control studies are observa- suggestion that NSAIDs might be associated
tional studies which are often used in epidemi- with a risk of anastomotic leak. Now, it would be
ology. In these studies, individuals who have the somewhat unethical to do a study looking at
medical condition (harmful side effect in this harm as the primary outcome.
example) are identified (cases) and then a second
cohort of patients who do not have the condition
are matched to the cases. The advantage of these Finding the Evidence
studies is that they require fewer resources and
follow-up is not required but the evidence is To identify the best evidence, we performed
weaker than that of a RCT (level 3 evidence) (see a literature search. We first formulated our re-
Chap. 5. As well, the characteristics of the case search question using the PICOT (population,
24 Studies Reporting Harm in Surgery 257
intervention, comparison, outcome, time hori- systematic reviews, and while they would be
zon) format as seen below. appropriate for your research question you are
drawn to articles 4 and 6, which are cohort
• Population: males and females undergoing an studies. Both cohort studies utilize prospectively
elective colorectal resection with anastomosis collected data, however, a full text review reveals
• Intervention: NSAIDs that the study by Klein et al. [3] use a Danish
• Comparison: no NSAIDs prescribed Colorectal Cancer Group database that has been
• Outcome: anastomotic leak validated and has a completeness rate of over
• Time horizon: seven days following surgery 96%. You decide to go with the Klein et al. [3]
study entitled, postoperative use of non-steroidal
Using the PICOT format your research ques- anti-inflammatory drugs in patients with anasto-
tion would be: “Does the use of NSAIDs, as motic leakage requiring reoperation after col-
compared to no NSAIDS, lead to anastomotic orectal resection: cohort study based on
leaks in elective colorectal resection with anas- prospective data.
tomosis seven days following surgery?”
You perform a PubMed search based on the
above-mentioned PICOT terms, you enter: Study Appraisal
“colorectal surgery” AND “non-steroidal
anti-inflammatory drugs” AND “anastomotic To properly understand, and present the infor-
leak” AND “postoperative”. The original search mation to your colleague, you have elected to use
yields 8 article (Appendix 1). You do a title and an article by Thoma et al. [4] to appraise your
abstract review of each of the 8 articles; articles selected article. The questions used to appraise
2, 5 and 8 are not specific to your topic and an article about harm in surgery are shown in
article 7 is an abstract. Articles 1 and 3 are Box 1.
258 R. McLeod
Table 24.2 select patient characteristics this were a retrospective cohort study or a
Demographic variables Number of case-control study. We can assume that the
patients exposure (i.e.: whether patients received
Age Median 70 (62–77) NSAIDs) was well documented in the electronic
(interquartile medical records registry since all orders must be
range) put into this register. In addition, if NSAIDs were
Sex Male 1441 ordered into the electronic medical record, three
Female 1315 reviewers who were blinded as to whether the
ASA status I 637 patient had/did not have an anastomotic leak
II 1639 reviewed the charts to determine if the patients
III 428 actually received NSAIDs and if so, how many
doses.
IV 20
Comorbidities – –
Was the follow-up sufficiently complete?
Ischaemic Yes 341
heart disease The standard follow-up to assess adverse events
No 974
post-operatively in patients having colorectal
Hypertension Yes 834 surgery is 30 days. According to the literature,
No 478 anastomotic leakage is most commonly detected
Lung disease Yes 196 5–7 days following surgery [5]. In this study,
No 1984 there was complete follow-up in all but four
Diabetes Yes 259 patients at 30 days. You are therefore confident
No 1060 that the follow-up was an appropriate length, and
sufficiently complete.
Tumour T 1 161
stage 2 330
Was the correct temporal relationship
3 1720
demonstrated?
4 516
The authors do not report whether patients had
Resection Colonic 1988 used NSAIDs prior to surgery. While in some
Rectal 768 studies it would be important to document when
Procedure Open 1760 the exposure occurred, it is unlikely that NSAIDs
Laparoscopic 996 prior to surgery would have an effect because
NSAIDs have a short half-life.
However, there were differences in alcohol con- Was there a dose-response gradient?
sumption and smoking, intraoperative blood loss Regular post-operative consumption of NSAIDs
and transfusion and whether the procedure was was defined as ingestion of at least 50 mg of
performed open or laparoscopically. Known risk diclofenac per day or ibuprofen 800 mg per day
factors for anastomotic leakage include gender for at least two of the first seven post-operative
(higher in males), smoking, obesity, alcohol days [3]. In reality, 95% of patients in the
abuse and preoperative steroid use [5]. ibuprofen group ingested at least 1200 mg per
day and 99% in the diclofenac group ingested at
Were the circumstances and methods for deter- least 100 mg per day [3]. The authors did not do
mining exposure similar for patients and a dose-response analysis. However, in addition to
controls? the comparisons based on the definitions above,
Both circumstances and methods appear to be they reported that within the control group there
similar for both groups. In this prospective cohort were 231 patients (12.3%) who ingested less than
study, the likelihood of bias is much lower than if two days of NSAIDs. The anastomotic leak rate
260 R. McLeod
in this latter group was compared to those who How precise is the estimate of the risk?
received no NSAIDs and the anastomotic leak The precision of the risk is mainly affected by the
rates were similar (5.2% vs. 5.1%). sample size. The 95% CI will be wider with a
small sample size and narrower in studies with a
What are the Results? large sample size. In this analysis, the risk of
anastomotic leak for patients taking diclofenac
How strong is the association between exposure was 7.2 and based on the 95% CI, we can be
and outcome? 95% certain that the risk lies somewhere between
The authors report that the anastomotic leak rate 3.8 and 13.4. In other words, we can be quite
in the diclofenac group was 12.8%; in the certain that taking diclenofac is associated with
ibuprofen group 8.2%; and 5.1% in the control an increased risk of anastomotic leak. On the
group. Thus, the increased risk of an anastomotic other hand, the 95% confidence intervals for
leak in the diclofenac group was 7.8% (95% CI ibuprofen ranged from 0.8 to 2.9 so while the OR
3.9–12.8%) and in the ibuprofen group was 3.2% for ibuprofen is above 1.0, in fact, the range
(95% CI 1.0–5.7%). could be from 0.8 to 2.9 and ibuprofen usage
Odds Ratios (OR) measure how strong an might or might not be associated with an anas-
association there is between the exposure tomotic leak.
(NSAIDs) and outcome (anastomotic leak). An
Odds Ratio of greater than 1 indicates that the
risk of the outcome (anastomotic leak) is higher
when exposed to the risk factor (NSAID) How can I Apply the Result to my Patients or
whereas an OR equal to one indicates that there Clinical Practice
is no association.
In this study, multivariate logistic regression Were the patients in the appraised study similar
analyses were performed including the following to the patients in my practice?
risk factors: NSAID treatment, drug type, sex, The characteristics of the patients in this study
intraoperative transfusion, hospital and type of are representative of most patients you see for
resection (colonic or rectal operation. The colorectal cancer surgery. However, there are
authors report that the OR of an increased risk of some potential differences, although they likely
an anastomotic leak for patients taking diclofenac do not affect the results. First, 36% of the patients
was 7.2 (95% CI 3.8–13.4) and for those taking had laparoscopic surgery; this analysis included
ibuprofen the increased risk was 1.5 (95% CI patients who had surgery between 2006 and
0.8–2.9). In other words, the risk of an anasto- 2009; the proportion of patients having a
motic leak was increased 7.2-fold in patients laparoscopic approach would, most likely, be
taking diclofenac and 1.5 times in those taking higher now in most jurisdictions [6]. Second, the
ibuprofen (although this was not statistically outcome measure was patients who had an
significant because the CI crossed 1.0). anastomotic leak and required reoperation; the
Of note, the overall post-operative mortality proportion of patients who had an anastomotic
was 3.3% but in patients who had a leak, the leak and did not have a reoperation was not
postoperative mortality was 9.5%. However, reported. Assuming that there were patients who
post-operative mortality was not increased in had an anastomotic leak but did not require
either the diclofenac (4.1%) or ibuprofen groups surgery, the rate of anastomotic leaks would
(1.8%) compared to the control group (3.2%). likely be higher than expected. However, if there
24 Studies Reporting Harm in Surgery 261
is a difference in the management of anastomotic NNH, the anastomotic leak rate in the control
leaks in Denmark so most or all patients are group (5.1%) can be subtracted from the
reoperated, there may not be a difference in the diclofenac group leak rate: 12.8% minus 5.1%
rate of anastomotic leaks. Instead there may be a equals 7.1%. Then 100 divided by 7.1(%) is
difference in management since most patients 14.1. So, that means that the NNH is 14.1 or in
who develop an anastomotic leak are treated other words, for every 14 patients treated with
non-operatively in North America. diclofenac one patient would develop an anas-
tomotic leak requiring reoperation and for every
Is the exposure similar to what might occur in my 50 patients treated with ibuprofen one patient
patient? would develop an anastomotic leak requiring
A variety of NSAIDs are available and may be reoperation.
prescribed to patients having elective colorectal
surgery. This study suggests that further inves- Are there any benefits known to be associated
tigation is required to determine whether the risk with exposure?
of anastomotic leaks is lower in some NSAIDs After reviewing the evidence regarding harm, we
and therefore may be safe for patients undergo- must weigh the potential benefits against the
ing a colorectal resection. The exposure would potential adverse events with the use of
therefore be similar for any patient that was post-operative NSAID use. There are certainly
prescribed NSAIDs; therefore your patient would benefits to using NSAIDs for pain in patients
have the same risk of exposure as those patients having colorectal surgery. While this study did
in the Klein et al. [3] article. not present data on pain control and length of
post-operative ileus, others have shown that
What is the magnitude of the risk? NSAIDs are effective in decreasing pain
The Number Needed to Harm (NNH) is similar post-operatively and the need for opioids [8].
to the Number Needed to Treat (NNT) . How- However, this study does show that the risk of an
ever, while the NNT is a measure of how many anastomotic leak requiring surgery is also
patients need to be treated in order for one patient increased and it is well accepted that an anasto-
to benefit from the treatment, the NNH is a motic leak (especially those requiring
measure of how many people need to be treated re-operation) is one of the most common causes
(or exposed to a risk factor) for one person to of post-operative mortality following a colorectal
have a particular adverse effect [7]. To determine resection. Thus, this study provides evidence that
the clinical importance of the results, it is there needs to be caution in prescribing NSAIDs
worthwhile to calculate the NNH If both are post-operatively in patients who have had a
calculated, a low number for the NNT and a high colonic or rectal resection and anastomosis.
number for the NNH would be the optimal On the other hand, we also need more evidence
finding [7]. For more information on magnitude to understand the benefits and risks of NSAIDs
and precision of treatment effects, please see before our surgeon and anaesthesiologist who are
Chap. 6. making the ERAS guidelines can make strong
The NNH is conceptually and mathematically recommendations. For instance, this study
simple for studies where there are distinct showed that the risk was much lower in patients
exposed and unexposed groups. In this study, it receiving ibuprofen (NNH 50) than diclofenac
can be calculated by using the raw data for (NNH 14 patients) so perhaps depending on its
anastomotic leaks in the three groups or use the efficacy, ibuprofen may be recommended but
percentages. Using the latter, the anastomotic diclofenac should be avoided. To be included in
leak rates were 12.8% in the diclofenac group this evaluation, patients had to have taken
and 9.2% in the ibuprofen group compared to NSAIDs for at least two days in the first seven
5.1% in the control group. Thus, to calculate the days post-operatively. The authors did not do a
262 R. McLeod
temporal or dose related evaluation which might surgeon and anaesthesiologist decided to rec-
assist in making recommendations. Finally, we ommend that diclofenac should not be used for
also know that the risk of anastomotic leaks is pain control in patients having a colorectal
higher following rectal surgery than colonic resection and anastomosis. However, they also
resections so further evidence might change our noted that further studies are needed to under-
recommendation in these cohorts of patients. stand which NSAIDs are not associated with
anastomotic leaks and can be used with greater
So what should the surgeon and anaesthesi- confidence in this cohort of patients.
ologist recommend? As in many decisions in
surgery, there is a balance between the benefit
and the risk. However, given the seriousness of Appendix 1
an anastomotic leak, a cautious approach should
be taken in the recommendation of NSAIDS. 1. Modasi A, Pace D, Godwin M, Smith C,
Curtis B. NSAID administration post col-
orectal surgery increases anastomotic leak
Conclusion rate: systematic review/meta-analysis. Surg
Endosc. 2018; EPub ahead of print. https://
Surgeons are frequently relied upon to provide doi.org/10.1007/s00464-018-6355-1.
information about surgical procedures to patients 2. Gupta A, Bah M. NSAIDS in the treatment of
or, as in this scenario, hospital policies. In both postoperative pain. Curr Pain Headache
instances, decisions should be made based on the Rep. 2016;20(11):62.
best evidence, taking into consideration the 3. Slim K, Joris J, Beloeil H, Groupe Franco-
benefits and risks of surgery as well as the phone de Réhabilitation Améliorée après
patient’s wishes or demands. Virtually every Chirurgie (GRACE). Colonic anastomoses
surgical procedure is associated with some and non-steroidal anti-inflammatory drugs.
amount of harm. If there is strong evidence J Visc Surg. 2016;153(4):269–75.
regarding both the benefits and harm of the in- 4. Paulasir S, Kaoutzanis C, Welch KB, Van-
tervention, it is important that these outcomes are dewarker JF, Krapohl G, Lampman RM, et al.
discussed with patients. While the discussion Nonsteroidal anti-inflammatory drugs: do
should take into account the patients’ desires, it is they increase the risk of anastomotic leaks
important that the information is provided in a following colorectal operations? Dis Colon
way that can be understood by patients so they Rectum. 2015;58(9):870-7.
can make a decision. 5. Nepogodiev D, Chapman SJ, Glasbey JC,
Kelly M, Khatri C, Fitzgerald E, et al. Mul-
ticentre observational cohort study of
Resolution of the Scenario NSAIDs as risk factors for postoperative
adverse events in gastrointestinal surgery.
Although a RCT would provide stronger evi- 6. Klein M, Gögenur I, Rosenberg J. Postopera-
dence, the methodology of this prospective tive use of non-steroidal anti-inflammatory
cohort study is strong and there was no apparent drugs in patients with anastomotic leakage
bias between the two groups which could render requiring reoperation after colorectal resec-
the results invalid. There was a significantly tion: cohort study based on prospective data.
positive association between one of the NSAIDs 7. Kelin M. Postoperative non-steroidal
(diclofenac) and anastomotic leaks following anti-inflammatory drugs and colorectal anas-
colorectal surgery. Indeed, the OR was 7.2 (95% tomotic leakage. NSAIDs and anastomotic
CI 3.8–12.4). On the other hand, there was not a leakage. Dan Med J. 2012;59(3):B4420
significant association with ibuprofen (OR 1.5, 8. Klein M, Krarup PM, Kongsbak MB,
95% 0.8–2.9). Having this information, the Agren MS, Gögenru I, Jorgensen LN, et al.
24 Studies Reporting Harm in Surgery 263
Effect of postoperative diclofenac on anasto- colorectal resection: cohort study based on prospective
motic healing, skin wounds and subcutaneous data. BMJ. 2012;345:e6166.
4. Thoma A, Kaur MN, Farrokhyar F, Waltho D,
collagen accumulation: a randomized, blin- Levis C, Lovrics P, Goldsmith CH. Users’ guide to
ded, placebo-controlled, experimental study. the surgical literature: how to assess an article about
Eur Surg Res. 2012;48(2):73–8. harm in surgery. Can J Surg. 2016;559(5):351–7.
5. Daams F, Luyer M, Lange JF. Colorectal anastomotic
leakage: aspects of prevention, detection and treat-
ment. World J Gastroenterol. 2013;19(15):2293–7.
6. Pascual M, Salvans S, Pera M. Laparocopic colorectal
References surgery: current status and implementation of the latest
technological innovations. World J Gastroenterol.
2016;22(2):704–17.
1. Doll R, Hill AB. Mortality in relation to smoking ten years’ 7. Andrade C. The numbers needed to treat and the
observations of British doctors. BMJ. 1964;1:1460–7. numbers needed to harm (NNT, NNH) statistics: what
2. Levine MAH, Ioannidis J, Haines AT, Guyatt G. Harm they tell us and what they do not. J Clin Psychiatry.
(observational studies). In: Guyatt G, Rennie D, 2015;76(3):e330–3.
Meade MO, Cook DJ, editors. Users’ guide to the medical 8. Rutegârd J, Rutegârd M. Non-steroidal anti-
literature: a manual for evidence-based clinical practice. inflammatory drugs in colorectal surgery: a risk factor
2nd ed. New York: McGraw-Hill; 2008. p. 363–81. for anastomotic complications? World J Gastroenterol.
3. Klein M, Gogenur I, Rosenberg J. Postoperative use of 2012;4(12):278–80.
non-steroidal anti-inflammatory drugs in patients with
anastomotic leakage requiring reoperation after
Evaluating Surveys
and Questionnaires 25
in Surgical Research
i. Was there a sufficient response The survey should have a clearly defined target
rate? population. In an ideal scenario, the survey is
ii. Are appropriate statistical administered to the entire target population to
methods used? obtain the most complete and accurate data. This
iii. Is the reporting of the results is often impractical due to the large size of the
transparent? population or study constraints; a sample of the
iv. Are the conclusions target population is often surveyed (Fig. 25.1)
appropriate? [9]. The sampling frame can be defined as the
empirically measurable members of the target
III. Will the results change practice? population from which the sample is drawn
[7–10]. The sampling element refers to the
i. Are the results generalizable to respondents of the sampling frame whose
my practice? responses are collected and analyzed [10].
ii. Will the conclusions from this There may be individuals of the target population
survey change my that cannot be surveyed and thus fall outside the
practice/behavior? sampling frame. Thus, it is important to select a
268 B. H. Chin and C. J. Coroneos
Fig. 25.1 Illustrative diagram of sampling strategy used how it will affect the generalizability of the survey. Also
by Parvez et al. [5] for American general surgeons. When consider potential inclusion of individuals that are not part
selecting the sampling frame, consider individuals of the of the target population within the sampling frame
target population that may be potentially excluded and
specific sampling frame that best captures the Parvez et al. [5] contacted all currently prac-
target population of interest. ticing general surgeons in Canada from a list
The reader should consider if the sampling obtained from MedSelect and a random sample
frame appropriately represents the target popu- of American general surgeons from the American
lation, and how the individuals within the pop- Medical Association (Fig. 25.1). Given the re-
ulation are identified to be respondents. If a search question, the target population would be
membership registry is used, readers must con- general surgeons who perform breast cancer
sider if individuals outside of the target popula- surgery. It is difficult to evaluate if MedSelect is
tion (e.g., fellowship trainees, surgeons not in an appropriate representative sampling frame for
practice) may be included. Random sample Canadian general surgeons; the resource itself is
selection is based on probability designs which not described, nor how it compares to
include: simple random sampling, systematic the membership from a professional group (e.g.,
random sampling, stratified random sampling, the Canadian Association of General Surgeons),
and cluster sampling [10, 11]. Deliberate and it was not found with an Internet search. The
nonprobability-based sampling is utilized to sampling frame for American general surgeons is
study populations that may be difficult to iden- likely appropriate, but a justification of selecting
tify. Such designs include purposive sampling, the American Medical Association over surgical
quota sampling, chunk sampling, and snowball bodies such as the American College of Surgeons
sampling [10]. Ultimately, the degree to which or The American Society of General Surgeons
survey results can be generalized will depend on would strengthen the authors’ methodology.
how similar the sample is to the target population Furthermore, given that a random sample of
of interest. American general surgeons was surveyed, it
25 Evaluating Surveys and Questionnaires in Surgical Research 269
pilot testing with ten respondents and adjusted the results of a reliable questionnaire are con-
the wording of one question based on the sistent and reproducible.
responses. Validity is defined as the extent to which an
instrument (e.g., questionnaire) measures what it
vi. Was clinical sensibility testing performed? intends to measure. Face validity is a subjective
assessment of survey validity by experts and
While the focus of pretesting is primarily on the participants, often conducted during clinical
structure and understanding of individual ques- sensibility testing [7, 11]. Content validity is
tions, the goal of clinical sensibility testing is to formal expert evaluation of whether the ques-
evaluate the comprehensiveness, clarity, and tionnaire items accurately evaluate all aspects of
face validity of the questionnaire [7]. The pro- the topic (i.e., all pertinent clinical domains) [7,
cess evaluates how well the survey assesses the 11]. Construct validity is determined based on a
pertinent clinical domains identified during conceptual inference and relationship between
development. Clinical sensibility testing should the instrument and object being measured (e.g.,
be a structured assessment (e.g., use of a stan- IQ test for intelligence) [7, 11]. Lastly, criterion
dardized assessment form) by independent validity compares the results of survey items to a
evaluators. Parvez et al. [5] do not explicitly “gold standard” or accepted existing measures [7,
report clinical sensibility testing in their survey 11]. Validity assessments are important for future
development. use of a given survey in a specific group and
context. However, for most surveys, only face
vii. Was reliability and validity testing validity is warranted and it is completed during
performed? clinical sensibility testing.
Overall, reliability and validity testing are
Reliability and validity testing essentially answer important to ensure that the questionnaire can
the following questions: attain the true answer to a research question. This
is particularly vital when questions are used to
• Does my survey answer the question consis- infer conclusions for a different construct based
tently? (reliability) on a theoretical association or relationship. For
• Does my survey answer the correct question? example, a survey developed to evaluate nutri-
(validity) tional health can ask questions on diet habits
(e.g., how often do you eat vegetables?).
Reliability determines whether questions yield Although there is a scientific basis for the rela-
consistent answers over time and repeated tionship between diet and nutrition, it would be
administration [11, 33, 34]. A reliable test can important to empirically demonstrate that veg-
also accurately differentiate respondents [12, 33]. etable consumption is a valid measure of nutri-
The response to a given question should be alike tional health through the methods described
among respondents who feel similarly, and above. In contrast, face validity may be adequate
divergent among respondents who differ [12]. when a survey is used for a descriptive purpose
For interrater reliability, two independent as in a cross-sectional design.
respondents will have similar responses where The survey by Parvez et al. [5] does not report
expected. Furthermore, internal consistency is assessment of reliability or validity. Although
determined when several questions that address pretesting was completed, the absence of relia-
the same domain produce similar answers [33]. bility and validity testing may affect the credi-
In test–retest reliability, one individual will have bility of their results. However, this should also
consistent results when answering the same be considered in the context of its descriptive
question at different times [11]. In other words, objective and cross-sectional design.
272 B. H. Chin and C. J. Coroneos
II. What are the Results? Nonresponders and missing data can bias the
results of the survey and affect its generalizabil-
i. Was there a sufficient response rate? ity. Although strategies should be implemented
during the survey design to maximize the
High response rates provide confidence that the response rate, nonresponse can be partially
survey results are representative of the target addressed through statistical methods such as
population. This reduces potential bias between multiple imputation [36]. The process and
responders and nonresponders, as well as assumptions underpinning multiple imputation
improves the validity and generalizability of the are beyond the scope of this chapter, but readers
study [7]. Response rates of 60–70% are con- should consider if there was any attempt made to
sidered acceptable and demonstrate face validity statistically address missing data.
in the medical community [12, 35]. Given its Parvez et al. [5] completed a descriptive
importance, a proper sampling frame, selected cross-sectional survey comparing BCT patterns
sample, and techniques to enhance the response between Canadian and American general
rate are critical to achieving a high response rate. surgeons. Appropriate statistical analyses were
Parvez et al. [5] report response rates of 51% performed for categorical variables using
(730/1443) from Canadian general surgeons and chi-squared, and nonparametric ordinal variables
26% (372/1447) American general surgeons. using Mann-Whitney U test across seven
This represents the actual response rate. How- domains. Baseline comparison of characteristics
ever, only a proportion of responders performed of the Canadian and American general surgeons
breast surgery. Thus, the analyzed response rate was performed. Stepwise linear regression anal-
was 21% (302/1443) for Canada and 14% ysis was used to adjust for demographic variables
(198/1447) for the United States. An issue pre- that could influence group differences. Addi-
viously identified is that the sampling frame is tionally, despite the overall low response rate,
not precise to the target population (i.e., general there was no attempt to adjust for missing data.
surgeons who perform breast surgery). The result
is that the chosen sampling frame likely under- iii. Is the reporting of the results transparent?
estimates the response rate. A reader should
consider the generalizability of the results based The reported results of the survey should directly
on these factors. address the primary question. Sufficient detail is
required to ensure that the findings are clear and
ii. Were appropriate statistical methods used? transparent. The authors must also report methods
used to handle and analyze missing data. The
Surveys can collect descriptive (raw data) or conclusion and implications of the study should
explanatory (inference between constructs) data be discussed and align with the data presented.
[6, 7]. Descriptive surveys present factual data Finally, whether the sample surveyed is repre-
and estimate a parameter of interest. In contrast, sentative of the population should be considered.
explanatory surveys generate connections Parvez et al. [5] present results in relative fre-
between concepts to test a hypothesis. Surveys quencies (%), but there is insufficient raw data for
measuring one metric are considered unidimen- reanalysis if warranted. It is explicitly stated that
sional scales while those evaluating more than results were included and analyzed for individual
one metric are multidimensional scales [12]. questions even if the survey was partially com-
Sample size calculation is as important for sur- pleted. The domains are adequately presented in
veys as it is with other study designs, and is tables and graphical format, but could also display
influenced by the research question, hypotheses, the raw data that generated the percentile values.
and overall design. For readers who plan to Access to the survey questionnaire would be
conduct a survey, we recommend consulting a valuable in an appendix to appraise individual
biostatistician when developing your protocol. question stems and responses.
25 Evaluating Surveys and Questionnaires in Surgical Research 273
iv. Are the conclusions appropriate? proportion of responders performed breast sur-
gery: Canadian (21%, n = 302) versus American
The discussion should summarize the results and (14%, n = 198). Additionally, the 1466 Ameri-
implications in a concise manner that aligns with can surgeons surveyed was a random sample of
the study data. Limitations of the study, partic- the American Medical Association (sampling
ularly with respect to methodology, should be frame). The proportion of American general
explored. Notably, the impact of the nonresponse surgeons surveyed from the sampling frame is
rate to the validity of study findings should be not specified. This raises the question if the
discussed; unlike other forms of research, American sample is sufficiently representative
responders and nonresponders often cannot be and generalizable to the American surgeon pop-
compared. Finally, surveys are self-reported, and ulation. This impacts the validity of the differ-
may not accurately reflect “real-world” clinical ences highlighted between Canadian and
practice; results may reflect respondents’ inten- American surgeons.
tions or ideal practice instead of the care practi-
cally delivered. Parvez et al. [5] highlight several ii. Will the conclusions from this survey change
findings that demonstrate statistically significant my practice/behaviour?
variations in surgical practice, including deci-
sions that deviate from the current standard of The survey by Parvez et al. [5] demonstrates
care. Study limitations are appropriately dis- statistically significant differences in surgical
cussed with respect to low response rate, risk of practice between Canadian and American sur-
response bias and missing data. Unfortunately, geons. There is also statistically significant vari-
they do not discuss power calculation, nor the ation within the Canadian and American cohorts.
reliability and validity of the questionnaire. The The variations in the definition of a negative
authors indicate that the results may not be margin for ductal carcinoma in situ (DCIS) and
generalizable, but indicate a wide variety of recommending re-excision for <2 mm margin
practice patterns; this is an impetus for research demonstrates a lack of consensus. Although there
and standardization in BCT. is no clear practice changing recommendations,
findings highlight the need for a systematic
III. Will the Results Change Practice? standardization and guidelines for BCT.
21. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Inform Assoc [Internet]. 2000;7(4):416–25. Avail-
Stratford PW, Knol DL, et al. The COSMIN check- able from: http://www.ncbi.nlm.nih.gov/pubmed/
list for assessing the methodological quality of 10887169.
studies on measurement properties of health status 29. Asch DA, Jedrziewski MK, Christakis NA. Response
measurement instruments: an international Delphi rates to mail surveys published in medical journals.
study. Qual Life Res [Internet]. 2010;19(4):539–49. J Clin Epidemiol [Internet]. 1997;50(10):1129–36.
Available from: http://www.ncbi.nlm.nih.gov/pubmed/ Available from: http://www.ncbi.nlm.nih.gov/
20169472. pubmed/9368521.
22. Kim HL, Hollowell CM, Patel RV, Bales GT, 30. Cummings SM, Savitz LA, Konrad TR. Reported
Clayman RV, Gerber GS. Use of new technology response rates to mailed physician questionnaires.
in endourology and laparoscopy by American urol- Health Serv Res [Internet]. 2001;35(6):1347–55.
ogists: internet and postal survey. Urology [Internet]. Available from: http://www.ncbi.nlm.nih.gov/pubmed/
2000;56(5):760–65. Available from: http://www. 11221823.
ncbi.nlm.nih.gov/pubmed/11068295. 31. Sprague S, Quigley L, Bhandari M. Survey design in
23. Raziano DB, Jayadevappa R, Valenzula D, orthopaedic surgery: getting surgeons to respond.
Weiner M, Lavizzo-Mourey R. E-mail versus con- J Bone Joint Surg Am [Internet]. 2009;91(Suppl
ventional postal mail survey of geriatric chiefs. 3):27–34. Available from: http://www.ncbi.nlm.nih.
Gerontologist [Internet]. 2001;41(6):799–804. Avail- gov/pubmed/19411497.
able from: http://www.ncbi.nlm.nih.gov/pubmed/ 32. Bowden A, Fox-Rushby JA, Nyandieka L, Wan-
11723348. jau J. Methods for pre-testing and piloting survey
24. Braithwaite D, Emery J, De Lusignan S, Sutton S. questions: illustrations from the KENQOL survey of
Using the Internet to conduct surveys of health health-related quality of life. Health Policy Plan
professionals: a valid alternative? Fam Pract [Inter- [Internet]. 2002;17(3):322–30. Available from: http://
net]. 2003;20(5):545–51. Available from: http:// www.ncbi.nlm.nih.gov/pubmed/12135999.
www.ncbi.nlm.nih.gov/pubmed/14507796. 33. Streiner DL, Norman GR. Health measurement
25. Sierles FS. How to do research with self- scales: a practical guide to their development and
administered surveys. Acad Psychiatry [Internet]. use. 4th ed. Oxford: Oxford University Press; 2008.
2003;27(2):104–13. Available from: http://www. 34. Kirshner B, Guyatt G. A methodological framework
ncbi.nlm.nih.gov/pubmed/12824111. for assessing health indices. J Chronic Dis [Internet].
26. Fischbacher C, Chappel D, Edwards R, Summer- 1985;38(1):27–36. Available from: http://www.ncbi.
ton N. Health surveys via the Internet: quick and nlm.nih.gov/pubmed/3972947.
dirty or rapid and robust? J R Soc Med [Internet]. 35. Elder JP, Artz LM, Beaudin P, Carleton RA,
2000;93(7):356–59. Available from: http://www. Lasater TM, Peterson G, et al. Multivariate evalua-
ncbi.nlm.nih.gov/pubmed/10928022. tion of health attitudes and behaviors: development
27. McLean SA, Feldman JA. The impact of changes in and validation of a method for health promotion
HCFA documentation requirements on academic research. Prev Med (Baltim) [Internet]. 1985;14(1):
emergency medicine: results of a physician survey. 34–54. Available from: http://www.ncbi.nlm.nih.
Acad Emerg Med [Internet]. 2001;8(9):880–85. gov/pubmed/4034513.
Available from: http://www.ncbi.nlm.nih.gov/pubmed/ 36. Rubin DB. Multiple imputation of nonresponse in
11535480. surveys. Toronto: Wiley; 1987.
28. Schleyer TK, Forrest JL. Methods for the design and
administration of web-based surveys. J Am Med
Opinion Pieces in Surgery
26
M. Torchia, D. Austin and I. L. Gitajn
You are a young orthopedic surgeon in a com- Based on the above clinical scenario, you form a
munity practice. One night while on call, you are research question using the PICOT format:
consulted about a 56-year-old female who pre-
sents with a complex four-part proximal humerus Population: patients with displaced proximal
fracture after a low-speed bike accident. Further humerus fractures
imaging shows the fracture to have a mild degree Intervention: open reduction and internal fixation
of posteromedial comminution but no “head- (ORIF)
splitting” components. The fracture is closed, Comparative Intervention: nonoperative
there is no associated neurovascular compromise, management
and the patient has no other injuries. The patient Outcome: clinical outcome scores
is in good health and is very active, playing Time Horizon: any time after intervention.
tennis at least three times per week with her
husband and participating in aerobic exercise like Using the PICOT format your research ques-
biking and hiking at least once per week. While tion becomes: In patients with displaced proxi-
you believe that reconstructing the bone by open mal humerus fractures, does open reduction and
reduction and internal fixation (ORIF) may help internal fixation result in better outcome scores
the patient you want to ensure there is not a as compared to non-operative management after
nonoperative option. You turn to the literature to surgery? Using the following search terms:
examine the underlying evidence behind this “displaced proximal humerus fracture” AND
treatment modality in hopes of counseling your “ORIF” AND “nonoperative” AND “outcome”
patient regarding the best treatment plan. you perform a literature search in MEDLINE.
Your search yields three articles (Appendix 1);
the first article compares three different surgical
techniques; the second article is a survey
between surgeons and orthopedic traumatolo-
gists, however, does not include outcomes of
patients; and the third is in elderly patients. You
M. Torchia D. Austin I. L. Gitajn (&) are not confident that the three articles identified
Division of Orthopaedics, Dartmouth-Hitchcock, are appropriate for your patient. Feeling dis-
Lebanon, NH, USA couraged, you decide to bring up this clinical
e-mail: Ida.Leah.Gitajn@hitchcock.org
dilemma with colleagues, one of who directs you Such criticisms of evidence in particular, and
to an expert opinion article by Sperling et al. [1], EBM more broadly, may induce a false dichotomy
entitled; the difficult proximal humerus fracture: between those who would practice EBM to
tips and techniques to avoid complications and exclusion of clinical reality and those who would
improve results. The article provides opinions on rely on accumulated clinical experience to the
techniques to fix displaced proximal humerus exclusion of published evidence. Yet returning ad
fractures, especially those with displaced fontes, or “to the original sources”, to use a term
tuberosities and comminution, which your popularized by the Renaissance humanists, the
patient has. early descriptions of EBM viewed the clinical
experience as a necessary aspect of incorporating
external evidence into clinical practice. As Sackett
Expert Opinion in the Context [5] states, “Good doctors use both individual
of Evidence-Based Medicine clinical expertise and the best available external
and Surgery evidence, and neither alone is enough. Without
clinical expertise, practice risks becoming tyran-
Scrutiny of the evidence behind clinical deci- nized by evidence, for even excellent external
sions is becoming increasingly pervasive. While evidence may be inapplicable to or inappropriate
the term “evidence-based medicine” (EBM) was for an individual patient. Without current best
first attributed to Gordon H. Guyatt at McMaster evidence, practice risks becoming rapidly out of
University in 1991 [2], the origins of the term date, to the detriment of patients” [5]. Thus, even
can be traced back much further to the though “expert opinion” is relegated to the bottom
eighteenth-century European enlightenment per- of most evidence hierarchies [18–20], it remains an
iod [3]. Perhaps no example better illustrates the integral aspect to the practice of EBM, and even
thought of this era than the Scottish naval sur- more so when there is a gap in evidence regarding a
geon James Lind’s controlled trial of oranges and particular clinical situation.
lemons for the prevention of scurvy, published in
1753 [4].
A popular definition of EBM was proposed by What Is the Value of Expert Opinion?
what many consider the father of the field,
Sackett [5], “the conscientious, explicit, and Perspective in the presence of evidence gaps
judicious use of current best evidence in making One of the certainties of clinical medicine is
decisions about the care of individual patients”. uncertainty [21, 22]. Given that medical knowledge
As EBM grew in prominence and popularity [6], involves multiple “ways of knowing” [23], many in
most of the focus and subsequent criticism was medicine decry the elevation of EBM as a “new
aimed at the creation, content and application of paradigm” which replaces “intuition, unsystematic
what was deemed to be “evidence” [7]. Specif- clinical experience, and physiologic rationale”
ically, while double-blind, randomized placebo- [24]. Indeed, there is a large body of literature which
controlled trials were deemed to be the highest documents the role of personal experience and
standard in clinical research [8, 9], many judgment, rather than clinical practice guidelines or
researchers expressed concerns regarding the algorithms, when physicians make complex medi-
external validity of these trials [10–13] and cal decisions [25–29]. The aggregate of this litera-
quality of the studies [14–16]. Moreover, the ture points to the accumulation of clinical expertise
limit s of evidence in clinical practice was rec- as an iterative process of hypothesis generation,
ognized by Naylor [17], who articulated “grey discovery and refinement; not dissimilar to the
zones of clinical practice where the evidence scientific method that undergirds published evi-
about risk-benefit ratios of competing clinical dence. Seen from this perspective, expert “opinion”
options are incomplete or contradictory.” is an unfortunate misnomer.
26 Opinion Pieces in Surgery 279
Beyond the epistemological questions of what observations from years of clinical practice,
exactly physicians know and how they know it, expert opinion can provide a check to the
there is the additional consideration that there are potential “tyranny of evidence” [5], which, can
evidence gaps in nearly all aspects of medicine result when research is not subject to the realities
[30]. Understanding that published evidence will of day-to-day clinical care. Seen from the origi-
not and cannot answer all the nuances that nal perspective of EBM as a harmony of indi-
accompany each clinical situation; expert opinion vidual clinical expertise and the best available
can offer valuable perspective. This is particu- external evidence [5], turning to expert opinion
larly relevant in the surgical fields, where evi- in the context of flawed studies is consistent with
dence often focuses on the outcomes of the practicing EBM.
surgical treatment rather than the intraoperative
details on which many of the outcomes depend.
Hard-won clinical expertise, refined over many
years, and thousands of patient encounters, can What Are the Limitations of Expert
bring perspective and practicality to clinical Opinion?
decisions that exist at the fringes of published
evidence. Expert opinion has been shown to be unreliable
Expertise when available published research is when compared to accepted gold standards of
flawed evidence [39]. One potential reason for this
In 2005, Dr. John Ioannidis published a unreliability is the sheer volume of medical evi-
provocative article titled Why Most Published dence available. A MEDLINE search revealed
Research Findings Are False [31]. In it, the that 869,666 citations were added to the index in
author laid out a mathematical explanation for 2016, while another study estimated that the
why published research is more often false than global scientific output doubles every 9 years
true. In the same year, he published another [40]. While these figures are only rough point
article showing that of 49 highly cited research estimations, the orders of magnitude involve
articles (cited more than 1000 times) in three attesting to the difficulty of staying current in the
major medical journals, over a third were sub- age of scientific “information overload” [41].
sequently contradicted by future studies or found Another reason why expert opinion can be
effects stronger than those of future studies. unreliable is due to introduced bias from conflicts
Moreover, less than half (44%) of the studies had of interest. Although many physicians feel as if
been replicated [32]. In this way, the paper they can resist financial conflicts of interest [42],
refuted with real-world examples the mathemat- empirical evidence would imply otherwise [43–
ical model that was originally proposed [31, 32]. 46]. That financial conflicts of interest can
While Dr. Ioannidis’s results may be the most introduce bias into the research process is
rigorous to examine the dynamic of contradicted self-evident, and matters of opinion are particu-
medical research, his work echoes the concerns larly ripe for such manifestation. Yet conflicts of
that many investigators have previously raised; interest are not always so obvious. Clinicians
namely, that incorrect inferences are being drawn may be biased due to an ardent belief in a par-
based on a flawed understanding of statistics ticular medical theory or be particularly attached
[33–35], that selective reporting of results biases to their own findings [31]. Moreover, many
the published literature [36], and that pervasive experts who write expert opinion pieces in the
and underreported conflicts of interest in the literature also serve as editors and reviewers of
medical research can further increase bias [37, journals; such experts could have a perverse
38]. Ironically, these are many of the criticisms incentive to downplay or reject research that is
that have been levied at expert opinion. deemed contrary to their viewpoints, thus intro-
When faced with evidence that is flawed or with ducing bias into the peer-reviewed process and
conclusions that are grossly inconsistent with potentially propagating a false body of evidence.
280 M. Torchia et al.
Expert Opinion in the Context appraise and correctly interpret the published
of the Surgical Literature evidence. This task is complicated by the emer-
gence of advanced statistical methods employed
Surgical practice relies heavily on the motor on massive data sets, new statistical techniques
skills (technical ability) and intraoperative of combining results in meta-analyses, and the
decision-making (non-technical skills) of the opaque reporting of methodology in manuscripts.
surgeon in the operating room for the outcomes Thus, clinicians may be simultaneously aware of
that are commonly reported in the surgical liter- new research while lacking the skills to interpret
ature. Neither has been studied extensively in the and apply the evidence. This may partially
peer-reviewed literature, despite the intraopera- explain why the large regional variations in
tive environment being the location where often medical practice first documented decades ago
the most consequential acts and decisions are [47–50] are still observed [51–53] despite sig-
made that affect the outcomes which are repor- nificant changes in the evidence base in most
ted. For many surgical procedures, there are medical fields. Expert opinion can clarify the
several, perhaps even dozens, of different ways main findings of a study and how it applies to a
to achieve the same outcome, and the nuances of population of interest. Of course, this presumes
each method are rarely discussed in the formal that the expert has no ulterior motive or biases in
literature. Moreover, there are thousands of dif- interpreting the literature and also assumes that
ferent decisions that must be made during the the expert has the skills necessary to correctly
course of even the simplest operation. Expert interpret and apply the evidence. Nonetheless,
opinion in the surgical literature, perhaps to a with the emergence of ever-increasing statistical
greater degree than the medical literature, pro- and methodological complexity in published
vides a means to discuss and disseminate the studies, expert opinion can be used to guide
nuances of technique and decision-making that readers through potential pitfalls.
are both unique to the surgical field and rarely
able to be discussed within other venues.
Seen in this way, expert opinion provides a Does Expert Opinion Apply to Me?
way for surgeons who have years of clinical
experience, often with the most complex cases, Surgical outcomes are intimately related to the
to prioritize and clarify the goals of surgery and technical ability of the surgeon. Indeed, one study
the intraoperative decision-making processes. demonstrated that greater surgical skill as mea-
Given the extraordinary number of intraoperative sured by blinded, independent surgeons judging a
decisions that need to be made and the multiple videotape of surgeons conducting bariatric surgery
means to achieve the same goal, surgeons with was associated with fewer postoperative compli-
less experience can often feel overwhelmed by cations, readmission, and visits to the emergency
these demands. When covering the content of department [54]. Additionally, there are docu-
surgical technique and decision-making, there- mented associations between increased hospital
fore, expert opinion can provide clinically rele- and surgeon volume and better outcomes across
vant information. several surgical specialties [55–58]. Therefore, in
Beyond the topics of surgical technique or the surgical literature, not all evidence is equally
decision-making, expert opinion can also help to applicable to surgeons and practice environments.
guide other clinicians through the hazards of In a similar fashion to evaluating the external va-
interpreting research on their own. Beyond sim- lidity of a study, one must also ask whether expert
ply keeping up with the latest research findings, opinion is relevant to one’s particular surgical
physicians must also understand how to critically training and practice environment.
26 Opinion Pieces in Surgery 281
1. What relevant training does the author What relevant training does the author(s) have?
(s) have? The primary author is a fellowship-trained
2. What relevant clinical experience does shoulder and elbow surgeon. Therefore, the pri-
the author(s) have? mary author has formal subspecialty training on
the topic of the article. Additionally, all authors
Technical Knowledge of the article were invited by their primary pro-
fessional organization (the American Academy
3. Has the author(s) published material on of Orthopaedic Surgeons) to contribute to the
the subject matter in your clinical sce- article based on their reputation as experts in the
nario before? topic.
4. Has the author(s) previous publications What relevant clinical experience does the
shown successful outcomes in similar author(s) have?
cases as that in your clinical scenario? The primary author is an established academic
5. Has the author(s) reported the results shoulder and elbow surgeon with over 240
with a credible time horizon (i.e.: were publications on shoulder and elbow conditions.
the results measured early, late or at the Further, all authors, including the primary author,
appropriate time)? were recognized by their professional
282 M. Torchia et al.
organization for their expertise in the subject subject of the article. Although this disclosure does
matter under consideration and thus invited to not specify which author or authors received
contribute to the article. something of value, the disclosure is clearly visible
on the first page of the article.
Have they declared this or any other type of
Technical Knowledge
conflict of interest in previous publications?
A quick review of the literature does not reveal
Has the author(s) published material on the
other conflict of interest disclosures in previous
subject matter in your clinical scenario before?
publications.
Yes
Has the author(s) previous publications shown
successful outcomes in similar cases as that in
your clinical scenario? Resolution of Case
Yes
Has the author(s) previous publications shown After an extensive discussion with the patient
successful outcomes in similar cases as that in regarding her options, she elects to proceed with
your clinical scenario? ORIF of her proximal humerus fracture. Utilizing
Yes. The article in question extensively discusses techniques emphasized in the opinion paper you
the relevant literature and cites multiple articles found prior to the case, you are able to focus on
with variable follow-up time periods. providing stability to the posteromedial region of
metaphyseal bone and obtaining an anatomic
reduction of the greater and lesser tuberosity. The
Perception as an Expert patient tolerates the surgery without any com-
plications, and after a course of physical therapy
What is the perception of the author(s) by oth- can participate in all of her previous activities
ers? (i.e.: are they considered an expert by their without issue. At her latest follow-up one year
colleagues? Do people turn to them for an from surgery she is satisfied with her results and
opinion? Have they been invited as key speakers you discharge her from your care.
at specialty conferences?)
All the authors of the article were asked by their
specialty society to contribute their expertise to
the subject matter based on their reputation as Appendix 1: Articles Identified
experts in the field. The authors were therefore in Literature Search
selected from their peers as those with excep-
tional experience and knowledge that would be 1. LaMartina J 2nd, Christmas KN, Simon P,
valuable to share with the greater orthopaedic Streit JJ, Allert JW, Clark J, et al. Difficulty in
surgery community. decision making in the treatment of displaced
proximal humerus fractures: the effect of
uncertainty on surgical outcomes. J Shoulder
Conflict of Interest Elbow Surg. 2018;27(3):470–77.
2. Bhat SB, Secrist ES, Austin LS, Getz CL,
Is there any conflict of interest between the Krieg JC, Mehta S, et al. Displaced proximal
author(s) and any type of industry (i.e., humerus fractures in older patients: shoulder
Pharmaceuticals)? surgeons versus traumatologists. Orthopedics.
The article states that one or more of the authors or 2016;39(3):e509–13.
the departments with which they are affiliated have 3. Li Y, Zhao L, Zhu L, Li J, Chen A. Internal
received something of value from a commercial or fixation versus nonoperative treatment for
other party-related directly or indirectly to the displaced 3-part or 4-part proximal humeral
26 Opinion Pieces in Surgery 283
32. Ioannidis JP. Contradicted and initially stronger with reported results. J Arthroplasty. 2003;18(7
effects in highly cited clinical research. JAMA. Suppl 1):138–45.
2005;294(2):218–28. 47. Wennberg J, Gittelsohn A. Small area variations in
33. Sterne JA, Smith GD. Sifting the evidence—what’s health care delivery: a population-based health
wrong with significance tests? Phys Ther. 2001;81 information system can guide planning and regula-
(8):1464–9. tory decision-making. Science. 1973;182
34. Wacholder S, Chanock S, Garcia-Closas M, El (4117):1102–8.
Ghormli L, Rothman N. Assessing the probability 48. Wennberg JE, Freeman JL, Culp WJ. Are hospital
that a positive report is false: an approach for services rationed in New Haven or over-utilised in
molecular epidemiology studies. J Natl Cancer Inst. Boston? Lancet. 1987;1(8543):1185–9.
2004;96(6):434–42. 49. Wennberg JE, Gittelsohn A. Health care delivery in
35. Risch NJ. Searching for genetic determinants in the Maine I: patterns of use of common surgical
new millennium. Nature. 2000;405(6788):847–56. procedures. J Maine Med Assoc. 1975;66(5):123–49.
36. Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, 50. Wennberg JE, Gittelsohn A, Soule D. Health care
Altman DG. Empirical evidence for selective report- delivery in Maine II: conditions explaining hospital
ing of outcomes in randomized trials: comparison of admission. J Maine Med Assoc. 1975;66(10):255–
protocols to published articles. JAMA. 2004;291 61, 269.
(20):2457–65. 51. Birkmeyer JD, Reames BN, McCulloch P, Carr AJ,
37. Krimsky S, Rothenberg L, Stott P, Kyle G. Scientific Campbell WB, Wennberg JE. Understanding of
journals and their authors’ financial interests: a pilot regional variation in the use of surgery. Lancet.
study. Psychother Psychosom. 1998;67(4–5): 2013;382(9898):1121–9.
194–201. 52. Deyo RA, Mirza SK. Trends and variations in the use
38. Papanikolaou GN, Baltogianni MS, of spine surgery. Clin Orthop Relat Res.
Contopoulos-Ioannidis DG, Haidich AB, Gian- 2006;443:139–46.
nakakis IA, Ioannidis JP. Reporting of conflicts of 53. Song Y, Skinner J, Bynum J, Sutherland J,
interest in guidelines of preventive and therapeutic Wennberg JE, Fisher ES. Regional variations in
interventions. BMC Med Res Methodol. 2001;1(1):3. diagnostic practices. N Engl J Med. 2010;363(1):
39. Antman EM, Lau J, Kupelnick B, Mosteller F, 45–53.
Chalmers TC. A comparison of results of 54. Birkmeyer JD, Finks JF, O’Reilly A, Oerline M,
meta-analyses of randomized control trials and Carlin AM, Nunn AR, et al. Surgical skill and
recommendations of clinical experts. Treatments for complication rates after bariatric surgery. N Engl J
myocardial infarction. JAMA. 1992;268(2):240–8. Med. 2013;369(15):1434–42.
40. Bornmann L, Mutz R. Growth rates of modern 55. Katz JN, Losina E, Barrett J, Phillips CB,
science: a bibliometric analysis based on the number Mahomed NN, Lew RA, et al. Association between
of publications and cited references. ASIS&T. hospital and surgeon procedure volume and out-
2015;66(11):2215–22. comes of total hip replacement in the United States
41. Landhuis E. Scientific literature: information over- Medicare population. JBJS. 2001;83(11):1622–9.
load. Nature. 2016;535(7612):457–8. 56. Sosa JA, Bowman HM, Tielsch JM, Powe NR,
42. Boyd EA, Cho MK, Bero LA. Financial Gordon TA, Udelsman R. The importance of surgeon
conflict-of-interest policies in clinical research: issues experience for clinical and economic outcomes from
for clinical investigators. Acad Med. 2003;78 thyroidectomy. Ann Surg. 1998;228(3):320.
(8):769–74. 57. Kalkanis SN, Eskandar EN, Carter BS, Barker FG.
43. Wazana A. Physicians and the pharmaceutical Microvascular decompression surgery in the United
industry: is a gift ever just a gift? JAMA. 2000;283 States, 1996 to 2000: mortality rates, morbidity rates,
(3):373–80. and the effects of hospital and surgeon volumes.
44. Bekelman JE, Li Y, Gross CP. Scope and impact of Neurosurgery. 2003;52(6):1251–62.
financial conflicts of interest in biomedical research: 58. Birkmeyer JD, Sun Y, Wong SL, Stukel TA. Hospi-
a systematic review. JAMA. 2003;289(4):454–65. tal volume and late survival after cancer surgery. Ann
45. Dana J, Loewenstein G. A social science perspective Surg. 2007;245(5):777.
on gifts to physicians from industry. JAMA. 59. McCracken SG, Marsh JC. Practitioner expertise in
2003;290(2):252–5. evidence-based practice decision making. Res Soc
46. Ezzet KA. The prevalence of corporate funding in Work Pract. 2008;18(4):301–10.
adult lower extremity research and its correlation
Simple Statistical Tests and P Values
27
Charles H. Goldsmith, Eric K. Duku, Achilles Thoma
and Jessica Murphy
reduce the chance of postoperative complications ideas to make them more explicit for the math-
any time after surgery? ematically inclined reader.
To answer your research question, you con- The clinical article that the authors have
duct a literature search on June 10, 2018, using selected to illustrate the statistical principles is a
PubMed. From the PICOT format terms men- randomized controlled trial (RCT) with two arms
tioned above, you choose the following for your (or groups): control and intervention. The pri-
search: High Risk AND Abdominal Surgery mary outcome measure is complication rates in
AND Prehabilitation AND Complications; you the two groups. Two secondary outcome mea-
narrow your search to the last 10 years. Your sures are stated as the number of complications
search yields six articles (see Appendix 1). You per patient and severity of these complications.
screen the articles by title, and then by abstracts. The paper by Barberan-Garcia et al. [1] reports
Following this screening, you feel as though the other findings using other statistical tests and
article by Barberan-Garcia et al. [1] appears to be P Values; however, these are not described in
the best suited for the research question. This this chapter. Our discussion will focus on those
article focuses on a personalized prehabilitation used in comparing these two groups with
program for high-risk patients. It is recent (2018) descriptive statistics, test statistics with their
and it utilizes a randomized, blinded controlled names, and how they can be computed with
study design. The other articles found in your reliable statistical software. We will then repro-
search focus on general and digestive surgery duce these statistical test results and compare our
populations, sarcopenia, are two or more years results with those reported in the
old, or are systematic reviews; because of this, Barberan-Garcia et al. article [1]. Along the way,
you are confident in using the Barberan-Garcia we will comment on the assumptions that go with
et al. [1] article. these tests. We do not produce tests of these
assumptions, just whether the paper’s authors [1]
have provided enough detail to convince us that
Chapter Goal the tests used and conclusions drawn from them
support the findings we need to resolve the
The authors of this chapter would recommend clinical scenario. We include confidence intervals
that readers consider reading Chap. 29 as some (CIs) and P values to help interpret the findings
of the overlapping ideas are discussed there. in the appraised paper [1]. [For more informa-
Reporting of studies in the surgical literature tion, readers may consult Chap. 28 or work by
usually contains the use of statistical principles Cadeddu et al. [2].] Numbers, tables, and text
such as estimation and hypothesis testing. These will be used to describe the statistical output from
ideas are often displayed using equations and the cited software as well as how the software
mathematically oriented ways of doing calcula- does the computations. Graphs, although they
tions with a calculator; however, statistical soft- often can be helpful to our understanding of these
ware has been widely developed to do these ideas, will not be used.
calculations. The goal of this chapter is to
describe the statistical principles in English;
leaving the mathematical symbolism to a mini- Critical Evaluation of the Paper
mum. We have decided to keep our estimation for Reporting RCTs
and hypothesis testing to a few simple methods,
which are often seen in the surgical literature. We This chapter uses the validity checklist for scor-
leave the detailed statistical and mathematical ing meta-analysis articles by Cochrane [3] as it
descriptions to the statistics literature. Some of seemed to be most relevant to this scenario (see
the references listed throughout the chapter have Appendix 3). The interventions in our scenario
more detailed discussions of these statistical are not surgical; they compare prehabilitation to
27 Simple Statistical Tests and P Values 287
In Table 27.2, we list the test statistics, the null Table 27.3 Primary outcome counts with modified ITT
hypothesis, and their reference distributions for analysis
computing the associated P values that will be Complications\Arm I C Total
encountered in the resolution of the clinical
Yes 19 39 58
scenario. These tests can also be analyzed using
confidence intervals to test the null hypotheses No 43 24 67
listed. Both chi-square and Student’s t reference Total 62 63 125
significance) was set at a = 0.05 in Table 27.1. specifically required.] For the control group, the
The confidence coefficient is computed from the rate of complications is 39/63 = 0.619 or 61.9%;
a value by determining its compliment and for the intervention group, the rate of complica-
expressing it as a percentage, or (1 − a) 100%, in tions is 19/62 = 0.306 or 30.6%. The paper
this case, 95%. From Table 27.1, we also saw authors reported these as integer percentages: 62
that the authors reported two-sided tests; this also and 31%, Table 27.1. [Reporting to more deci-
means that the confidence intervals are two mal places is not reasonable for these sample
tailed, showing a minimum and a maximum as sizes; however, the extra decimal places should
two numbers, usually reported as (minimum to be used to do further calculations such as with a
maximum) to avoid using a dash (–) to separate relative risk. We use the statistical software to
them if the data could be reported as negative compute these here.]
numbers, such as with differences.
The chi-square test can be computed from a
2 2 contingency table illustrated in Could All of This Be Due to the Play
Table 27.3. The rows in the table designate of Chance?
complication counts (Yes) and non-complication
counts (No) with the bottom margin of the table To consider the role of chance in our results, the
showing total counts for each column. The first sample size calculation, Table 27.1, used by
two total counts in the last row come from the Barberan-Garcia et al. [1] is based on percent-
header in Table 4 of the Barberan-Garcia et al. ages, not proportions, Table 27.1. A background
[1] article. The columns in the table show I (in- complication rate of 30% using colorectal
tervention) counts, C (control) counts, as well as patients which could be considered as the control
total counts for each row of the table. group rate was used. Also, the reduction in the
The Yes row was derived from Table 4 of the rate was stated to be at least 20% and so this
appraised article [1], the total by adding implied either a difference in percentages of that
19 + 39 = 58. The total in the bottom right cor- size to be considered as a clinically important
ner is obtained by adding the two sample sizes minimum important difference (MID) [4], with
62 + 63 = 125, the number of patients that similar interpretations for ratios such as the rel-
Barberan-Garcia et al. [1] claims is the ative risk. From proportions (percentages), the
intention-to-treat (ITT) total. However, the variance and standard deviation can be com-
paper’s authors [1] have misused this term, it puted, so they are not needed to be stated sepa-
should be “Modified ITT” as it leaves out the 19 rately. Also, a and b are defined, so all four
patients that were randomized, Table 27.1, but Greek letters: a, b, d, and r are specified along
did not get surgery. We are NOT told details of with the two-sided or two-tailed tests [Ts],
these 19 patients as they were also left out of Table 27.1, and the study design (D). While the
Table 27.1 in the original article [1], and no software used to calculate the sample size needed
complications were recorded or presented for for the trial is specified, the outcome measure [O]
them; even though the intervention patients could and test statistic [Tt] were not specified among
have had complications resulting from the pre- the four tests that Barberan-Garcia et al. [1]
habilitation intervention, and the control patients planned to use, Table 27.1. Therefore, six of the
may have had complications in any event. eight criteria for a sample size calculation (Recall
Using the data in Table 27.3, the overall rate n = f (a, b, d, r, D, O, Ts, Tt) from Chap. 29)
of complications is 58/125 = 0.464 or 46.4%, were specified by the paper’s authors [1]. The
which the authors report as 46% but not as a paper’s authors also inflated the computer output
proportion, Table 27.1. [While it is simple to to accommodate 20% dropouts, to get the 70
convert proportions to percentages and vice needed per group. [Of the 140 required, the
versa, they can matter when you enter them into authors randomized 144.] For our discussion
statistical software when one or the other is here, we assume that the outcome (primary) was
290 C. H. Goldsmith et al.
the rate of complications and the test statistic was the rows that have the numerators/denominators
the chi-square test. for the percentages (or proportions) of compli-
cations, such as in Table 27.3. For a test statistic
with 1 DF, we expect numerical values in the
Computing the Test Statistic region of the mean = 1, so the 12.277 seems
bigger than the mean; it is in the right-hand tail of
The data in Table 27.3 were entered into Minitab what we would expect.
18 [5] using the “Chi-square Test for Associa- We can also use the data in Table 27.3 to do a
tion.” The output is shown in Table 27.4. [We do statistical test that compares two proportions (or
not show here that the output included the data in percentages) where we compute a difference
Table 27.3 which is useful to check that the data between the two proportions as the test statistic
entry step was correct.] and its associated CI. Here, the null hypothesis is
For this discussion, we will use the Pearson that the population difference between the two
method for computing the value of the chi-square arms is 0 (zero) and the test statistic has a sam-
test statistic. [If there is a choice in statistical pling distribution of the test statistic that is nor-
software output, surgical researchers should state mal with a mean of 0 (zero) and variance (and
which one was chosen and why.] In this case, the standard deviation) of 1 (one). However, since
value of the test statistic is 12.277, it was com- differences can be negative or positive, the
puted with 1 DF and the P value was 0.000. [It is sampling distribution has two tails that we could
important that researchers be careful with statis- decide to test a one-tail test to get a P value as 1P
tical software output.] The number of decimal or as P1, depending on which tail is to be set by
places printed can be modified in most software the alternative hypothesis. We will use both tails
packages. Minitab 18 [5] produces three decimal because we would like to detect either harm or
places by default. This does not mean that you benefit for the intervention of prehabilitation, i.e.,
should blindly report this P Value in your paper, compute P2. The sample size justification also
it should be reported as “< 0.001”, Table 27.1. appears to be claiming a MID of a least a 20%
[If you are interested in the other method for difference in complications rates to be credible as
computing the chi-square test statistic, Minitab well as detectable in the trial justification,
18 [5] has a help file to explain the differences Table 27.1. Because Barberan-Garcia et al. [1]
between the two methods.] did not mention which test statistic or the out-
come measure explicitly, one might surmise that
the paper’s authors meant the testing of equal
What Does the Number Mean proportions (percentages) of complications from
to a Reader? the two arms of the trial.
with 1 DF. This distribution is sometimes called being checked in any paper using these distri-
a reference distribution as P values are computed butions. For example, the assumptions for the
from this reference distribution considering the chi-square test statistic to have a chi-square ref-
numerical value that corresponds to the signifi- erence distribution for P value calculation are:
cance level specified for the study; here, two independent arms, each patient in each arm
a = 0.05. With the chi-square test statistic, both should be independent of each other, and the
positive and negative deviations are computed allocation of patients to arms should be random
from what would be expected by the null and have complete reporting. The first three seem
hypothesis. The “square” part of the chi-square to be satisfied; however, not the last one, as the
test statistic indeed squares computationally authors use a modified ITT, leaving out those 19
these numbers so both the large negative devia- patients who were randomized but did not get
tions and large positive deviations from the null surgery, Table 27.1. A sensitivity analysis could
hypothesis get relegated to the right-hand tail of be used in this case to compare the results of
the reference distribution. There are two tails to those randomized to those used in the modified
the set of test statistics, but they get forced into a ITT analysis; which we report later. Multiple
single right-hand tail of this reference distribu- imputation [6] is not sensible here because the
tion. Even so, from the original data there are two baseline variables for the 19 patients left out from
tails involved so this becomes a two-tail test, as the results were not available to be used for the
required by Barberan-Garcia et al. [1] and for our imputations.
clinical scenario. One way to see this is to have There is currently much controversy about
stated this as P2 to indicate a two-tail test, P1 as a using P values to judge the importance of re-
one-tail test in the right-hand part of the depar- search findings. Some of the debate is cited in the
ture from the null hypothesis and 1P as a one-tail references: [7–13]. Some of the remedies to help
test in the left-hand part of the departure from the include using CIs to express the study findings.
null hypothesis. So, in this case, we could just as Please see Chap. 28, for more information.
easily reported that P2 < 0.001, rather than the A second way of reporting the findings in
notation that the paper’s authors used. Table 27.3 is to use a null hypothesis related to
The chi-square reference distribution with 1 relative risk (RR) of complications in the inter-
DF has a mean of 1, variance of 2, and standard vention group compared to the control
deviation as the square root of 2 or 1.414 to 3 dp group. An RR compares the rate of complica-
(decimal place). So the computed test statistic of tions in the intervention group to the rate in the
12.277 seems to have many standard deviations control group using the ratio with intervention in
[(12.277 − 1)/1.414 = 7.97, around 8] above the the numerator and control in the denominator.
mean in the reference chi-square distribution. To The null hypothesis here is that the population
make this more specific, given the size of level of RR is 1 (one). [Be careful with the data entry into
significance of 0.05 and the area under the ref- the software used as the reverse for the numerator
erence distribution of one, the chance or proba- and denominator can give the inverse of RR as
bility that the reference distribution exceeds 1/RR instead.] To do the computation here, we
12.277 will be the area in the right-hand tail. are using statistical software called CIA for
Minitab 18 [5] has computed this as 0.000 in confidence interval analysis [14] to do the com-
Table 27.4 and it should be reported as putations with confidence intervals rather than
P2 < 0.001. Because this P2 value is less than P values. Using the 2 2 entries in Table 27.3,
the significance level, these complication rate we use “Relative Risk for parallel groups,” with
data are declared to be statistically significant. results are shown in Table 27.5.
Many times, there are standard assumptions To report these values like in the
that need to be met, if the sampling distribution Barberan-Garcia et al. [1] paper, Table 27.1, the
for the test statistic is well described by the ref- estimate and the boundaries of the 95% CI are
erence distribution. These should be reported as rounded to 1 dp as 0.5; 0.3–0.8. Using the
292 C. H. Goldsmith et al.
Table 27.5 Proportions for relative risk calculations Table 27.6 Secondary outcome: results per patient by
with Minitab [5] group using Minitab [5]
Statistic Estimate 95% Minitab Fisher Group n Mean SD SE mean++
CI P+ exact
Control 63 1.40 1.60 0.20
P+
Intervention 62 0.50 1.00 0.13
RR 0.495 0.325– 0.000 0.001
0.755 Total+ 125
Notes +Computed with proportions test in Minitab 18 [5], Notes +Added to the table by CHG with calculator
reported as P2 < 0.001 for the normal test of the computation with these entries. ++ is added by Minitab 18
difference in proportions as well as the Fisher exact test [5], not in Barberan-Garcia et al. [1]. The rest of the
as P2 = 0.001. However, the latter is not usually used results may have been rounded to 1 dp by
since all of the counts in Table 27.3 exceed 5 Barberan-Garcia et al. [1]; the raw data were not in the
appraised article [1] to check their veracity. Having a
distribution of the counts of complications for all patients
by group would have permitted their recomputation
output, we could choose normal approximation
of the logarithms of the RR null hypothesis test
and this can be reported P2 < 0.001 for it or one complication, the Yes row in Table 27.3.
P2 = 0.001 for the Fisher exact test. This com- They could be then tested with a two indepen-
pares favorably with Table 27.1, except for dent sample Student’s t test. Computations are
which test was used. Our computations would done in Minitab 18 [5]. However, the SDs are
agree with the Barberan-Garcia et al. [1] using both bigger than the means and suggests that the
the two proportions test statistic, but NOT with distribution of the number of complications is
the normal approximation since the Fisher exact skewed right. It is also possible that the SD for
test was not needed. The event rates were not rare the control group is larger than the intervention
or highly skewed where the Fisher exact test gets group so might violate the variance homogeneity
used. The reference distribution for the Fisher assumption of the usual Student’s t test. We
exact test is the hypergeometric distribution. The may have to use the Welsh variation of the
assumptions for the relative risk tests are the Student’s t test which does not assume variance
same as for the chi-square test; however, the homogeneity. [CHG’s rule without testing vari-
normal approximations for the test statistic ance homogeneity is doubling of one SD over the
depend on the proportions being between 0.2 and other, which is not true here.] The assumptions of
0.8 as they were here. Otherwise, the values randomness are still a problem, but independence
could be highly dependent on the distance away between and within the groups is still satisfied
from 0.5 in the tails of the proportion distribu- (Table 27.6).
tion. Sensitivity analysis shown in Appendix 2 The equal variance assumption for the Stu-
for the chi-square test should be similar for the dent’s t test computed the mean difference as
comparison of the ratio of proportions called RR 0.900, pooled SD of 1.337 with a 95% CI: 0.427–
as well as the Fisher exact test. The one corner 1.373. Since the null hypothesis here would be 0
shown to be bigger than a = 0.05 and so not (zero), it is clear that the 95% CI does not contain
statistically significant should be the same as the the null value. The pooled Student’s t test is 3.76
RR in Appendix 2, although we do not show with 123 DF and a P value of 0.000. This could be
these. reported as P2 < 0.001 and is lower than the sig-
Now, what about the secondary outcomes nificance level of a = 0.05, so the results are sta-
computed from the complications; their mean tistically significant. We were not given what
and standard deviations with a P value shown in Barberan-Garcia et al. [1] consider to be the MID
Table 27.1. These are called rates and are com- for this measure, so the clinical meaning may not
puted from the total count of complications be possible without other data.
divided by the sample size values per patient The Welsh t test does not assume that the
rather than whether or not a patient had at least variances or SDs of the two groups are equal, and
27 Simple Statistical Tests and P Values 293
so Minitab 18 [5] does not report the pooled discussion here uses two-tailed confidence
standard deviation in its output. The mean dif- intervals, although Minitab 18 [5] and CIA [14]
ference is still 0.900, with a 95% CI: 0.428– allow one-tailed confidence intervals if they
1.372. Since the null hypothesis here would also make sense. Readers are advised to read Strei-
be 0 (zero), it is clear that the 95% CI does not ner’s article [15] for a discussion of the reasoning
contain the null value. The Welsh t test is 3.78 for the choices between one- and two-tailed tests.
with 104 DF and a P value of 0.000. This could
be reported as P2 < 0.001 and is lower than the Software
significance level of a = 0.05, so the results are Below are three commonly used, and easily
statistically significant. accessible statistical software options that readers
The impact of the variance homogeneity may consider utilizing.
assumption is small. The 95% CIs are different in R: This software is free and is supported by
the third dp, and test statistics are different by statisticians all over the world under the creative
0.02, a small amount and the DF are quite dif- commons license. As this chapter is being writ-
ferent suggesting that the reference Student’s ten, it has survived for 25 years, currently con-
t distribution for computing the P value is dif- tains 12,900 expansion packages and 220,000
ferent; however, to 3 dp, the P values are both additional functions. As an example, 251 new
0.000 and regardless of which one was chosen it packages were added to the CRAN [16] network
still would be reported a P2 < 0.001. Hence, the in July 2018. All of these are available for any-
variance homogeneity assumption would not one to try for their personal and study usage.
make any difference to the interpretation of the Package authors tend to be very responsive to
tests. sensible questions. Readers wanting more infor-
Like the other tests, the two independent mation are invited to consider the references [6,
sample Student’s t test would have been nice to 21–23].
be able to do a sensitivity analysis to counter the CIA: This software is specifically for com-
modified ITT that was reported in the appraised putation of confidence intervals [14]. It comes as
article [1]. The one thing that we could do is to part of the book when purchased.
assume that all data from the 19 patients left out Minitab [5]: The current version of this soft-
of the analysis had some specific number of ware is 18.1. Academic and student versions
complications, which were not reported. available for those with appropriate academic
Last, the third way to consider the complica- affiliations.
tions was to analyze the severity data on the Those who choose to use software such as
complications. However, we were not given any Microsoft Excel, web, cell phone, and cloud apps
data on the severity except how they were coded for their statistical calculations should be wary of
with no reference citation of how. their poor computational validity as suggested by
Barberan-Garcia et al. [1] claimed that there was many different peer-reviewed publications [17–
no difference in the severity, Table 27.1. How- 20]. Your collaborative biostatistician should
ever, leaving out these 19 patients, this claim help you find the reviews of software in these as
does seem credible given the large differences different new devices and locations that you can
shown in enhanced aerobic capacity with DET, cite when you use such questionable sources to
P < 0.001 and the reduced length of stay in the support your surgical research.
ICU shown in Table 4 of the appraised article
[1], even though P = 0.078. Data should have Sensitivity Analysis
been provided to support this finding. A credible RCT needs to report all the data for all
patients randomized to conduct an ITT analysis,
Confidence Intervals which is considered the Gold Standard way to
Just like hypothesis tests and P values, confi- report study findings. When that does not happen
dence intervals can be one or two tailed. Our as in the Barberan-Garcia et al. [1] paper,
294 C. H. Goldsmith et al.
variations from ITT such as a modified ITT significance level, and the D of 14% is smaller
reported in the appraised paper [1] need to be than the stated 20% so this case is NOT statis-
supplemented with a sensitivity analysis for any tically significant, nor a clinically important
assumption that has not been met to make the finding.
statistical analysis valid. We conducted such a In summary, three of the four sensitivity
sensitivity analysis by comparing the modified analyses support the modified ITT analysis and
ITT reported in the appraised paper [1] and one of the four does not. The dashes in the e part
making some assumptions about the 19 patients of the tables were not computed, yet in the
randomized but removed from the tables in the right-hand corner of the 3 3 tables, there
Barberan-Garcia et al. [1] paper. For these 19 maybe other cases apart from c where the sen-
patients, we are not given any baseline data on sitivity data do not support the modified ITT
them in Table 27.1 of the original paper [1], so analysis. This suggests that the fact of the dis-
we do not know any of their baseline character- crepancy among these sensitivity analyses cre-
istics. This prevented us from doing multiple ated by the missing 19 patients casts serious
imputations of these data related to these 19 doubt about the clinical importance as well as the
patients to see if it had any impact on the study statistical significance of the prehabilitation
conclusions. Such imputations can be conducted reported in the Barberan-Garcia et al. paper [1].
using R software [6, 21–23] and should be used Although we do not show it here, we anticipate
to study the missing data impact on the conclu- that the relative risk analysis and the proportions
sions as outlined in the New England Journal of analyses would also NOT pass the sensitivity
Medicine [22] that summarizes a 2010 National analyses of these assumptions using tables simi-
Science Foundation report. A NEJM editorial lar to those in Appendix 2; however, using these
published with the article suggests NEJM does other test statistics.
not want to see manuscripts if this is not done; If we had access to Table 27.1 data at baseline
the surgical literature should take heed. for the 19 deleted patients, a strategy called
We have created a sensitivity analysis shown multiple imputation analysis could be conducted
in Appendix 2 that measures the veracity of the with the package called mice (multiple imputa-
group difference analysis done with the primary tion using chained equations) in R as outlined by
outcome of rates of complications as they are van Buuren [6]. What imputation does is produce
reported in Table 27.3. To do so, we create four models that are fitted to the available data using
different uses of whether the 19 patients would variables in the study to make multiple predic-
have an impact on the analysis and reporting by tions for all the missing data. Next, for each
including these patients as have complication imputed data set, do a complete planned analysis
Yes or No between the two arms of the trial data as described in a statistical analysis plan for the
making tables that contain all 144 patients ran- study, to measure the impact of missingness on
domized versus Table 27.3 which has 125 and the conclusions and this impact should be
excludes these 19 patients. Notice that all the reported as part of the Barberan-Garcia et al.
created tables have a total of 144 patients, with paper [1], as now some journals require [24].
73 in the intervention group and 71 in the control
group, exactly as randomized. The P values Multiplicity
created by the chi-square tests as done for When there is more than one source of data that
Table 27.3 show that three of these analyses reflect on a single outcome variable such as the
have P values that are below the level of sig- Barberan-Garcia et al. paper [1] with complica-
nificance, so these sensitivity analyses also show tions measured in three ways: counts, mean per
statistically significant findings, and effect sizes patient, and severity. It means a reader trying to
denoted by D that are larger than the MID of decide about a clinical scenario needs to consider
20%. However, one of the sensitivity analyses, c, which of the three to use in the resolution of the
showed P values that are larger than the scenario. We already know that the paper’s
27 Simple Statistical Tests and P Values 295
authors claimed that the counts was considered to trial, so might be the most important for the
be the primary outcome measure and presumably patient and discussions with the patient.
that is what was used in the sample size justifi- Barberan-Garcia et al. [1] report using a
cation, although not stated by the appraised paper chi-square test and a P value of 0.001, which was
authors [1]. What other choices could be made statistically significant at the 0.05 level, although
using this multiplicity of outcome measures. the authors did not report the value of the
Multiplicity leads to consideration of how one chi-square statistic. Using the same data as the
might use the risk of a Type I error called a, authors, we showed how to compute the
which we now know drives the P values to chi-square statistic and found the P value was
determine whether the test statistics used are indeed less than 0.001! A flaw in this thinking is
statistically significant. One common way is to that the authors thought they were doing an ITT
consider the three ways judged separately with analysis; however, they were incorrect because
the Bonferroni adjustment of the level used by they excluded 19 patients who were randomized
having each of the outcomes compared to a/ from their analysis. A subsequent sensitivity
3 = 0.05/3 = 0.0266. With our scenario, this will analysis was not totally convincing since one of
not work since the adjusted level has an the analysis suggested that the P value was
assumption of statistical independence, which is around 0.1 and so it would not be statistically
not true since there are three ways all of which significant. Thus, the benefit of the prehabilita-
are derived from the same outcome, and hence tion intervention would be 14% which was less
are highly correlated. Because of this, we chose than what paper’s authors had considered as the
NOT to make any level adjustment for the minimum important difference of 20% in the
interpretation of the test statistics. If all three of sample size justification. The appraised paper’s
these ways are to be judged simultaneously, then authors [1] also show the results of a relative risk
the proper way to do so with a multivariate test analysis that was reported with a 95% confidence
statistic, called Hotelling’s T-squared test; how- interval as their test statistic method. However,
ever, we could NOT consider doing this due to the numerical value of this test statistic was not
lack of data reporting by the paper’s authors. reported. The paper’s authors [1] report stated
However, there is statistical literature on the that the RR was 0.5, 95% CI: 0.3–0.8, P = 0.001.
multiplicity issue that is not followed well in the We were able to use the same data as the paper’s
health sciences journals, including those from authors and our calculation gave 0.495, 95% CI:
surgery. Readers are directed references [8, 25– 0.325–0.755, which when rounded to one deci-
29] for more information. Again, we would like mal place gave the same results as the authors
to emphasize that, if you are a surgical [1]; however, with a P value of less than 0.001,
researcher, your collaborative biostatistical col- just like we obtained with the chi-square test.
league can help with the selection of a suitable Note, both of these CIs excluded 1 (one) which is
method, and provide references to support what the value assumed in the null hypothesis. We
was used. were able to redo the sensitivity analysis with the
19 patients who were left out and came to the
Resolution of Clinical Scenario same conclusion. However, one of the four
Recall that in the PICOT we created, the outcome extremes computed had a confidence interval that
was complications. However, in the paper, there included 1 (one) from the null hypothesis. This
are three ways of considering complications: provides doubt about the veracity of the data
their rate, mean per patient, and their severity. shown by the authors; suggesting that there
The first of these was the primary outcome of the maybe no benefit to the patients.
296 C. H. Goldsmith et al.
For the mean complication per patient analy- Reporting Statistical Findings
sis, we were unable to do a sensitivity analysis, in the Surgical Literature
because we were not given any data on the 19
patients excluded, nor did we have their baseline When summarizing the information that was
characteristics to do a multiple imputation anal- reported in Table 27.1, many ideas were written
ysis. We were able to redo the analyses of the about statistical issues that displayed a lack of
means found in the paper. The analysis of the reporting clarity, and we often disagreed with
means reported in the Barberan-Garcia et al. [1] what Barberan-Garcia et al. [1] had said. The
paper was very similar to our computations. most serious issue from our viewpoint is the
For the severity of the complications, the claim that an “Intention-To-Treat” analysis was
authors report no data, Table 27.1, to support the done. This was NOT correct because the authors
sentence that they claimed there was no differ- [1] chose to ignore the 19 patients that were
ence in the severity of complications. This is not randomized to the prehabilitation arm or not, and
quite credible as the large increases are seen in reported on a total sample of 125 patients instead
physical performance in the intervention group of the 144 randomized. We also could not
as well as the much shorter ICU length of stay. identify from the author list in the appraised
We do not find this conclusion of the authors paper [1], a biostatistical collaborator. To rectify
credible. these concerns in the future, we suggest that
Finally, the fact that there are three different surgical researchers always have a highly trained
ways of expressing the complications; they statistical collaborator as part of their research
would be derived from the same patients and so team to help write about the statistical issues
would be highly correlated measures of outcome, more cogently. Readers interested in statistical
yet there was no mention of how these three reporting guidelines maybe interested in refer-
analyses would reflect on the conclusions of this ences [25, 30–34].
study. Certainly, individual P values would not
lead to the same conclusion for all three vari-
ables, even though that is what was reported, Appendix 1: Articles Identified
apart from severity. Here is where the multi- in the Literature Search
plicity of testing issue should be discussed by
Barberan-Garcia et al. [1] and how they proposed 1. Barberan-Garcia A, Ubré M, Roca J,
to handle it statistically, and if that had an Lacy AM, Burgos F, Risco R, Momblan D,
important impact on their conclusion for the Balust J, Blanco I, Martinez-Palli G. Person-
outcome of complications. alized prehabilitation in high-risk patients
Our overall conclusion is that the PICOT undergoing elective major abdominal sur-
question could not be answered by the results of gery: a randomized blinded controlled trial.
the appraised study [1]. Better evidence should Ann Surg. 2018;267(1):50–6. https://doi.org/
be found that does not have the same method- 10.1097/00000000000002293.
ological flaws as the paper we found seems to 2. Berkel AEM, Bongers BC, vanKamp M-JS,
have. Maybe a trial older than 2018 would have Kotte H, Weltevreden P, deJongh FHC,
used ITT properly and the complications would Eijvogel MMM, Wymenga ANM,
not have been coded in three different ways to Bigirwamungu-Bargeman M, vander Palen J,
confuse the reader. van Det MC, van Meeteren NLU, Klasse JM.
27 Simple Statistical Tests and P Values 297
The effects of prehabilitation versus usual Publication Date: 2018. Available from:
care to reduce postoperative complications in http://clinicaltrials.gov/show/nct02934230.
high-risk patients with colorectal cancer of 2016.
dysplasia scheduled for elective colorectal
resection: study protocol of a randomized
controlled trial. BMC Gastroenterol.
2018;18:29. https://doi.org/10.1186/s12876- Appendix 2: Sensitivity Analysis
018-0754-6. for Table 27.3 Using Chi-Square Test
3. Mayo NE, Feldman L, Scott S, Zavorsky G, and Minitab 18
Kim DJ, Charlebois P, Stein B, Carli F.
Impact of preoperative change in physical (a) Assume outcomes for missing patients were
function on postoperative recovery: argument all NO, for both arms
5. NCT02934230. The prehabilitation study: (d) Assume outcomes for missing patients were
exercise before surgery to improve patient all Yes, for both arms
function in people. [Registration]. Online
298 C. H. Goldsmith et al.
(e) Sensitivity analysis summary of all four G 2. Were all randomized participants No
tables above: think checkerboard! Using a analyzed in the group to which they
3 3 board were allocated?
H Are reports of the study free of Unsure
I No I Yes suggestion of selective outcome
reporting?
C No < 0.001, < 0.001, a – 0.097, 0.096, c
Other sources of potential bias:
– – –
I 1. Were the groups similar at Unsure
C Yes < 0.001, < 0.001, b – 0.003, 0.002, d baseline regarding the most
important prognostic indicators?
Entries are P2 values, letters indicate the J 2. Were co-interventions avoided or Unsure
tables above, and dashes (–) indicate parts of the similar?
sensitivity that we did not compute. Notice that K 3. Was the compliance acceptable in Unsure
the bolded cells, a, b, and d are all less than all groups?
a = 0.05, so are statistically significant, while c L 4. Was the timing of the outcome Unsure
cell that is not bolded so is not statistically assessment similar in all groups?
significant. M Is there a serious and fatal flaw Flawed
with this study?
(focus on the impact of selection
bias, information bias, reporting
Appendix 3: Risk of Bias Scoring errors, and confounding)
of the Barberan-Garcia et al. [1] Are other sources of potential bias Unsure
Paper unlikely?
- funding bias, conflict of interest
statement
Risk of Bias - outcome measure not valid
Collaboration’s tool for assessing risk of bias in 20. Melard G. On the accuracy of statistical procedures
randomised trials. BMJ. 2011;343:d5928. in Microsoft Excel 2010. Comput Stat. 2014;29
4. Schunemann HJ, Guyatt GH. Commentary—good- (5):1095–128.
bye M(C)ID! Hello MID, where do you come from? 21. The R Foundation. The R project for statistical
Health Serv Res. 2005;40:593–7. computing [Internet]. Available from: https://www.r-
5. Minitab Inc. Minitab [Internet]. [cited 2018 Sep 6]. project.org. Accessed 24 Jul 2018.
Available from: http://www.minitab.com/en-us/. 22. Pearson RK. Exploratory data analysis using R. Boca
6. van Buuren S. Flexible imputation of missing data, Raton FL: Chapman & Hall/CRC; 2018.
2nd ed. Boca Raton, FL: Chapman & Hall/CRC; 23. Thieme N. R generation. The story of a statistical
2018. programming language that became a subcultural
7. Goodman SN. p values, hypothesis tests, and likelihood: phenomenon. Significance. 2018;15(4):14–9.
implications for epidemiology of a neglected historical 24. Little RJ, D’Agostino R, Cohen ML, Dickersin K,
debate. Am J Epidemiol. 1993;137(5):485–96. Emerson SS, Farrar JT, et al. The prevention and
8. Fraser DAS, Reid N. Crisis in science? Or crisis in treatment of missing data in clinical trials. NEJM.
statistics! Mixed messages in statistics with impact 2012;367:1355–60.
on science. J Statist Res. 2016;48–50(1):1–9. 25. Mayo-Wilson E, Fusco N, Li T, Hong H, Canner JK,
9. Wasserstein RL, Lazar NA. The ASA’s statement on Dickersin K for the MUDS Investigators. Multiple
p-values: content, process, and purpose. Am Stat. outcomes and analyses in clinical trials create
2016;70(2):129–33. challenges for interpretation and research synthesis.
10. Mark DB, Lee KL, Harrell FE Jr. Understanding the J Clin Epidemiol. 2017;86:39–50.
role of P values and hypothesis tests in clinical 26. Perneger TV. What’s wrong with Bonferroni adjust-
research. JAMA Cardiol. 2016;1(9):1048–54. ments? BMJ. 1998;316(7137):1236–8.
11. Ioannidis JPA. The proposal to lower P values 27. Cook RJ, Farewell VT. Multiplicity considerations in
thresholds to .005. JAMA. 2018;319(14):1429–30. the design and analysis of clinical trials. J Roy Statist
12. Wild CJ, Pfannkuch M, Regan M. Towards more Soc A. 1996;159(1):93–110.
accessible conceptions of statistical inference. J Roy 28. US Department of Health and Human Services, Food
Statist Soc A. 2011;174(Part 2):1–23. and Drug Administration, Center for Drug Evalua-
13. Hurlbert SH, Lombardi CM. Lopsided reasoning on tion and Research (CDER), Center for Biologics
lopsided tests and multiple comparisons. Aust NZ J Evaluation and Research (CBER). Multiple end-
Stat. 2012;54(1):23–42. points in clinical trials [Internet]. [cited 2018 Sep 6].
14. Altman DG, Machin D, Bryant TN, Gardner MJ, Available from: https://www.fda.gov/downloads/
editors. Statistics with confidence, 2nd ed. BMJ drugs/guidancecomplianceregulatoryinformation/
Books; 2011. guidances/ucm536750.pdf.
15. Streiner DL. One-tailed and two-tailed tests. Statis- 29. Bland JM, Altman DG. Multiple significance tests;
tics commentary series; commentary #12. J Clin the Bonferroni method. Statistics notes. BMJ.
Psychopharmacol. 2015;35(6):628–9. 1995;310:170.
16. CRAN (Comprehensive R Archive Network). The 30. Livingston EH. Introducing the JAMA guide to
Comprehensive R Archive Network [Internet]. [cited statistics and methods. NEJM. 2014;312(1):35.
2018 Sep 6]. Available from: http://CRAN.R-project.org. 31. Aslan D, Sandberg S. Simple statistics in diagnostic
17. Wheatley M. 5 key considerations when choosing tests. J Med Biochem. 2007;26:309–13.
open source statistical software. Data informed. Big 32. Lydersen S. Statistical review: frequently given
data and analytics in the enterprise. http://data- comments. Ann Rheum Dis. 2015;74:323–5.
informed.com/author/nadaeum/ [It worked for CHG.]. 33. Lang TA, Altman DG. Basic statistical reporting for
18. Edwards H. A review of probability and statistics articles published in biomedical journals: the “Sta-
apps for mobile devices. In: Maker K, de Sousa B, tistical Analyses and Methods in the Published
Gould R, editors. Sustainability in statistics educa- Literature” or “the SAMPL guidelines”. In:
tion. Proceedings of the ninth international confer- Smart P, Maisonneuve H, Polderman A, editors.
ence on teaching statistics (ICOTS July). Flagstaff Science editors’ handbook. European Association of
AZ: Voorburg; 2014. p. 1–4. Science Editors; 2013. p. 1–8.
19. McCullough BD, Yalata AT. Spreadsheets in the 34. Wan X, Wang W, Liu J, Tong T. Estimating the
cloud—not ready yet. J Stat Softw. 2013;52(7). sample mean and standard deviations from the
Available from: https://www.jstatsoft.org/article/ sample size, median, range and/or interquartile range.
view/v052i07/v52i07.pdf. BMC Med Res Methodol. 2014;14:13.
Confidence Intervals
28
Jessica Bogach, Lawrence Mbuagbaw
and Margherita O. Cadeddu
Confidence Intervals (CI) have become essential relevant. To understand this in a clinical context,
in most journals for researchers to communicate we present a scenario to demonstrate key points
their study results. Understanding CIs gives relating to CIs.
clinicians the ability to assess both the results of
a study as well as the uncertainty surrounding
these results [1]. It is therefore crucial for sur- Clinical Scenario
geons to understand what CIs are, how they can
inform our assessment of study results and You are a General Surgeon scheduling an oper-
whether the study results are useful for our ation for an elective laparoscopic sigmoid
patients. resection for a patient with recurrent diverticulitis
Surgeons are familiar with statistical signifi- and a known colo-vesicular fistula. As per your
cance and the use of p-values but less so with routine practice, you have asked a Urologist to
CIs. Having a deeper understanding of what CIs place lighted ureteric catheters (UC). When
can tell us about study results will allow a sur- booking the case, a nurse asks if the catheters are
geon to better assess the certainty of the study necessary, as they add about 30 min to the pro-
results and whether they are likely to be clinically cedure time. Your understanding is that ureteric
catheters may not prevent injury but can help
identify a ureteric injury at the time of surgery.
You decide to perform a literature review to
satisfy yourself that what you are planning has
J. Bogach M. O. Cadeddu (&)
merit.
Department of Surgery, Faculty of Health Sciences, You develop a clinical question in the PICOT
McMaster University, 1280 Main Street West, format (See Chap. 3)
Hamilton, ON L8S 4K1, Canada
e-mail: mcadeddu@stjosham.on.ca P Patients undergoing elective sigmoid
J. Bogach resection for diverticulitis
e-mail: Jessica.bogach@medportal.ca; I Use of ureteric catheters (UC)
bogachj@mcmaster.ca C Surgery without UC
L. Mbuagbaw O Ureteric injury
Department of Health Research Methods, Evidence T Within 30 days from surgery.
and Impact, McMaster University, Biostatistics
Unit/FSORC, 50 Charlton Avenue East, St Joseph’s
Healthcare—Hamilton, 3rd Floor Martha Wing, Research question: In patients undergoing
Room H321, Hamilton, ON L8N 4A6, Canada elective sigmoid resection for diverticulitis, does
e-mail: mbuagblc@mcmaster.ca
results. For example, if wound infection rates are ‘Confidence intervals rather than P-values: esti-
being measured in 10% of the hospital’s surgical mation rather than hypothesis testing’ [11]. The
population, and each month a random sample of larger the sample size, the more the estimate
patients is measured, this estimate will vary each approximates the population and the smaller our
month around a central value. This creates a SD will be. For hypothesis testing and estima-
distribution of values. From each month, a mean tion, the SD is a key component in CI
wound infection rate can be calculated along calculation.
with a standard deviation (SD) which shows how
much the estimates varied from the mean infec-
tion rate [10]. The SD is the square root of the Why Are CI’s Different
variance of the estimate. The calculation used for from p-Values?
variance depends on the function you are using
(mean, odds ratio, etc.) to measure effect. Gard- Biomedical literature has a generally accepted
ner and Altman present the different variance and level of significance, with an alpha of 0.05.
Confidence Interval calculations in their paper Although this is convention, it is somewhat
304 J. Bogach et al.
arbitrary [11]. Using a p-value alone, we tend to with a certain degree of confidence and is called
dichotomize outcomes as significant or non- the Confidence Interval. In effect, point estimates
significant, even if p-values are very similar. For in a study are calculated and then CIs are gen-
example, with a p-value of 0.049, we would erated around that estimate to demonstrate the
reject a null hypothesis, but with a p-value of confidence that this interval contains the true
0.051, we would not reject it [12]. Walsh et al. value. CIs have traditionally been calculated for
[13] showed that in major randomized control 95% confidence, but they do not have to be.
trials published in high impact journals, it is often Some studies use 90% CIs or 99% CIs, but most
a single event that can push the p-value into studies typically use 95% CIs for results and
statistical significance. Additionally, even with a calculations.
statistically significant result, there is no infor- We must remember that the true population
mation about the magnitude of difference and value will never change. If the same study was
clinical relevance [11]. repeated, the predicted value and CI would
Confidence intervals do suffer from the same change depending on the results, sample size etc.
fragility, in that if the confidence interval covers If you did the experiment or study multiple times
our ‘null value’, we reject a statistically signifi- in the same population, you would expect that
cant difference. This interval can also be influ- the CI would cover the true value 95% of the
enced by single data points. However, by using time. Figure 28.1 demonstrates this concept.
confidence intervals, we are informed about the
direction and magnitude of the effect, and the
interval may contain clinically relevant effects.
Getting to Estimation
Table 28.3 Examples of 95% confidence intervals and their appropriate interpretation
Study Type of Null Result Reported Interpretation
result value confidence
interval (CI)
Coakley et al. [2] Odds 1 0.45 95% The use of PUCP decreases the
Ratio CI 0.25–0.81 risk of UI
Columbo et al. [17] Relative 1 0.96 95% CI There is no difference in bleeding
A Meta-analysis of the Impact of risk 0.76–1.22 rates requiring re-intervention
aspirin, clopidogrel, and dual with the use of aspirin
antiplatelet therapy on bleeding
complications in noncardiac
surgery [18]
Gianotti et al. [15] Relative 0 0.003 95% CI The use of oral carbohydrate load
Preoperative carbohydrate load difference −0.053 to does not decrease the risk of
versus placebo in major elective 0.059 wound infection
abdominal surgery (PROCY)
Koullouros et al. [19] Odds 1 1.27 Espin-Basany In the first study, there is no
The role of oral antibiotic Ratio et al. [18]: difference in surgical infection
prophylaxis in prevention of CI = 0.48– rate with the use of oral antibiotic
surgical site infection in 3.38 prophylaxis. In the second study,
colorectal surgery (Two studies there is a reduction in surgical
from the meta-analysis) infection rate. However, we
cannot infer any difference
between the two studies because
of the overlapping confidence
intervals without a formal
statistical test
0.30 Sadahiro
et al. [20]:
CI = 0.11–
0.79
Fig. 28.2 Results of 4 hypothetical studies comparing MCID is represented by the dashed line. For each study,
PUCP versus control and outcome on UI, where a 20% the dot represents the mean difference (point estimate) and
reduction in UI is the minimal clinical important differ- the line represent the boundaries of the 95% confidence
ence (MCID). The vertical straight line above the x-axis at intervals around the point estimate
zero represents the null hypothesis of no difference,
28 Confidence Intervals 307
Table 28.4 Factors that impact the width of your confidence interval
Narrows confidence interval Widens confidence interval
Large sample size Small sample size
Smaller standard error Larger standard error
Low degree of confidence (i.e. 90%) High degree of confidence (i.e. 99%)
Homogeneous data Large amount of variability
308 J. Bogach et al.
A frequent criticism of published surgical studies highest level of evidence. While RCTs are the
pertains to whether or not there is adequate preferred study design, they can be incredibly
sample size. In other words, readers are hesitant difficult to implement in surgery. Many reasons
that the chosen sample size would enable the trial have been given for the small number of RCTs
to detect a clinically meaningful difference performed in surgery, with recruitment difficul-
between surgical interventions, if one exists. This ties being one of the most common obstacles [1,
criticism applies to all study designs, however, is 2]. Regardless of these difficulties, when a RCT
particularly relevant to the randomized controlled proclaims that a new surgical innovation is
trial (RCT), which, is touted to produce the superior, or equal to, the standard approach
(standard care), we need to be certain that this
proclamation is valid.
This chapter will explain the methodological
J. Murphy A. Thoma (&) issues that pertain to sample size calculation and
Department of Surgery, Division of Plastic Surgery, provide the reader with the necessary skills to
McMaster University, Hamilton, ON, Canada
e-mail: athoma@mcmaster.ca
appraise an RCT based on its sample size.
Readers will find definitions to the bolded terms,
J. Murphy
e-mail: murphj11@mcmaster.ca
and more information regarding the factors that
influence power and sample size in Appendix 1.
E. K. Duku
Department of Psychiatry and Behavioral
Neurosciences, McMaster University, Hamilton,
ON, Canada Clinical Scenario
e-mail: duku@mcmaster.ca
E. K. Duku A. Thoma C. H. Goldsmith At the last minimal access surgery rounds, a
Department of Health Research Methods, Evidence junior surgical faculty member recommended
and Impact, McMaster University, Hamilton, ON,
Canada
that the service start using a preoperative
e-mail: cgoldsmi@sfu.ca ultrasound-guided transversus abdominis plane
C. H. Goldsmith
(TAP) block to control pain after laparoscopic
Faculty of Health Sciences, Simon Fraser University, surgery for colorectal cancer. She claimed that at
Burnaby, BC, Canada the center where she did her fellowship, TAP
C. H. Goldsmith blocks were used all the time.
Department of Occupational Science and A cynical senior surgeon commented that
Occupational Therapy, Faculty of Medicine, The such an approach is useless; it will only lengthen
University of British Columbia, Vancouver, BC,
Canada
the procedure, enrich the anesthetist, and do
Box 1. Factors Informing Sample Size In surgical research, the clinical scenario or
Factor Further research question informs both the outcome(s) to
information be measured and the target population to be
Outcome of interest Page 312 studied. In research studies, there is a primary
Minimal clinically important Page 313 and sometimes a secondary outcome. The pri-
difference (MCID) mary outcome is the endpoint that is of greatest
Standard deviation Page 313 importance, while the secondary outcome is used
(continued) to measure additional effects of the intervention
[3]. In the above-mentioned scenario, the primary
29 Power and Sample Size 313
outcome would be pain following surgery with expected in patients reporting minimal or no
and without the use of TAP Blocks; the target difference in the outcome of interest. Alterna-
population would be colorectal patients under- tively, a large effect size is expected in patients
going laparoscopic surgery. reporting large differences in the chosen out-
Once the outcome and the target population come. Where there is nothing in the literature, we
are determined, the next step is to decide on the can use the suggestions by Cohen that catego-
appropriate outcome measure (the scale or rizes effect sizes as proxies, i.e., small effect =
instrument) to be used. A literature review can be 0.2; medium effect size = 0.5 or large effect
used to ensure that the outcome measure is: size = 0.8 as proxies for our MCID [8]. The
(1) relevant; (2) meaningful; (3) reliable; formula for calculating effect size is shown
(4) valid and (5) responsive to change. Next, one below [7]
would determine the type of data measurement
the outcome measure would produce (nominal, ½MeanðXInitial Þ MeanðXFinal Þ
ES ¼
ordinal, interval or ratio) as this has a bearing on rbaseline
the type of statistical analysis that will be per-
formed. The outcome measure for the current X: Observed Value; Xinitial: values at baseline;
scenario would be the Numerical Rating Scale XFinal: values following treatment/procedure;
(NRS) pain score, which is continuous (interval) rbaseline : standard deviation of the baseline
in nature. values.
The next step is to review previous literature It is important to note that, in clinical research,
to determine the MCID (d) for your specific it is possible to have a positive or a negative
outcome measure. The concept of MCID was value for ES. Therefore, it is important to con-
introduced in 1989 by Jaeschke et al. MCID is sider your research question when interpreting
defined as the smallest difference in the outcome data. In other words, if we are comparing
of interest perceived generally by patients (or Health-Related Quality of Life (HRQOL) scores
physicians) as beneficial. In other words, the in a sample from baseline to 6 months following
MCID refers to the smallest amount of difference surgery, then we would expect to see an
in an outcome measure considered meaningful improvement in HRQOL, and therefore a nega-
[4–7]. For surgeons, it means the smallest dif- tive ES. For example, a baseline score of 50 on
ference between a novel and standard approach the HRQOL scale SF-36 that increases to 80 at
that will persuade them to adopt the novel in- 6 months will give a negative ES based on the
tervention. If no information on the MCID for above formula. However, if we are measuring
your outcome measure is available, the MCID pain, we would expect a decrease from baseline
could be computed using effect size (ES), defined to 6 months, and therefore a positive ES; for
as the number of standard deviation units of example, a preoperative pain score of 8 on a
change to be expected (ES = d/r). Visual Analog Scale (VAS) of 0–10 postopera-
For an RCT, effect size can be calculated as tively improves to a score of 3.
the difference in the means between two chosen With the MCID for your specific outcome in
times, divided by the standard deviation of scores mind, the research question is reframed into a
at the first time point [7]. We can also think of clinical hypothesis, a statement about the
the effect size in terms of the degree to which a expected relationship between two variables [9].
specified nonzero difference can be observed in In the above scenario, the clinical hypothesis
the population under study. The larger the value would address the suspected relationship
of ES, the greater the degree to which manifes- between pain and TAP blocks (“There will be an
tation of the difference is found in the population. important decrease in pain score for those
As noted by Cohen [8], the ES is a very patients who had TAP Blocks, as compared to
important factor in determining power and those who did not”). The clinical hypothesis can
required sample size. A small effect size can be then be stated as a statistical hypothesis, also
314 J. Murphy et al.
known as the null hypothesis, which states the and is denoted by the first Greek letter a [5, 10].
lack of relationship between the two variables For example, a level of significance of 0.05
(There will be no difference in pain scores indicates that we have a 5% risk of making a
between groups) [9]. The null hypothesis is conclusion that there is a difference even if there
essentially the hypothesis we would like to reject is no difference. If we feel it is very important
or nullify. that a Type I error is not made, then we can
Once you have formulated your hypotheses, choose a level of significance lower than 0.05,
you must determine the best way to answer your say 0.01. To reduce the chance of adopting in-
research question. As it is often difficult, or terventions or procedures that are not effective, it
impossible, to study everyone in a population, a is necessary for us to control our Type I error
subgroup (sample) of individuals from the pop- rate.
ulation can be drawn [10]. The population is a The other type of error associated with mak-
group of individuals within a geographic region ing inferences from a sample to the population is
or institution (e.g., city, country, hospital, or known as a Type II error, represented by the
school) [10]. The sample, which is also known as second Greek letter b. It is referred to as the
the study group, comprises those individuals that chance of not rejecting the null hypothesis when
will actually be a part of the study [10]. The aim it is not true, or the chance of not detecting a
of sampling is to gather a diverse group of difference when it exists [5, 10]. Type II error is
individuals from the population so that as many related to the concept of power, the chance of
individuals are represented as possible. A repre- rejecting the null hypothesis when it is not true
sentative sample allows more confidence when (Fig. 29.1) [5, 10]. This works out computa-
generalizing results found within the sample tionally to 1 − b and is the same as the chance of
back to the population of interest. detecting a difference when there is a difference
Generalizing the results from a sample back to to detect. For example, power of 0.80 or 80%
the population is referred to as statistical infer- indicates that if a difference exists between two
ence. Generalization is subject to some errors, groups in a study, there is a 20% chance the
one of which is making a false positive state- study will not detect it.
ment; this is known as Type I error [5, 10] (see Power should be measured or calculated
Fig. 29.1). Type I error occurs when the null before (a priori) a trial begins and not after. As
hypothesis is rejected, when in fact, it is true. In Farrokhyar et al. [5] explain, a priori power and
other words, if the results of a study indicate that sample size calculations can reduce the risk of
there is a difference between two treatment Type II error (false-negative result). Through
groups when in reality there is not; a Type I error conducting a power calculation a priori,
was committed. The chance of a Type I error is researchers are aware of the sample size required
referred to as the level of statistical significance to reach power and can then ensure that the
necessary number of participants are recruited. used in research, as we are usually not sure of the
When power is calculated a priori, researchers directionality of the relationship when designing
can be confident in the results of their study the study.
knowing that they had the power to detect a Once your null hypothesis is formulated and
difference if one existed. Conversely, if power is directionality decided on, you must decide on the
calculated post hoc (after the completion of a type of study design to use to test your hypoth-
study) and power was not adequate the esis. The most appropriate type of study design
researchers may have less confidence in their depends on: (1) The research question or goal of
results. For example, after having completed a the study; (2) Whether or not an intervention or
trial, researchers conclude no statistical differ- random allocation is needed; and (3) When the
ence in their outcome of interest between two outcomes are to be measured and what outcome
groups; however, if the study was found to have measure to use. Surgeon-researchers should be
inadequate power, these results may be false familiar with the different study designs and the
(Type II Error). hierarchy of evidence (see Chap. 5) comprehen-
The next steps are to determine the direc- sive web-sites on study designs are available and
tionality of our hypothesis and the variability in beyond the scope of this chapter [13–15].
the outcome measure of interest. Standard devi- Study design can impact sample size [5], for
ation, the dispersion of data, helps us to under- example, if the participants are being assigned
stand how different an individual’s score is from into different study groups using a random allo-
the others [11]. Standard deviation is represented cation mechanism, such as using a random
by the Greek letter r (in the population and s in number generator, this will impact the overall
the sample). Information on variability of speci- number of participants needed. In the case of
fic outcome measures, within the target popula- random allocation, one needs to properly deter-
tion, may be found in the literature. However, mine the overall required sample size, as well as
usually, it is the variability found within a sample how many participants are needed in each group
that is reported, not the variability of the entire so that the study has adequate power. The final
population. If this information is not available, a required sample size is also contingent on the
pilot study may need to be carried out to obtain allocation ratio, sample loss, or attrition during
this information. In addition to reviewing the the study. Thus, the computed sample size of the
literature for variability, the literature can also be study (n) needs to be increased by a factor of 1/
used to determine if your hypothesis needs to be (1 − p) to give us a sample of size n′ [16]. That
tested using one-tailed or two-tailed testing. is, n′ = n/(1 − p); where p is proportion lost, and
A one-tail test involves unidirectional testing of n′ is rounded up to the integer multiple of the
the hypothesis; meaning, it tests for the possi- number of arms or groups in the trial [16]. More
bility of a relationship in one direction and complicated study designs, such as those utiliz-
ignores the possibility of a relationship in the ing stratification through predetermined sub-
other direction [12]. For example, if we are groups (e.g., grouping by recruitment site or
interested in the improvement of Health-Related gender) require larger sample sizes to maintain
Quality of Life (HRQOL) with a new surgical power [5]. Calculation of power and sample size
intervention, we may expect that there will only for these types of study designs are beyond the
be a change in one direction (improvement). scope of this chapter. If readers would like more
However, by doing this, we may be ignoring the information on subgroups, please see Chap. 30
fact that this new intervention may be worsening more detailed information on the adjustments
HRQOL or even causing death in some situa- needed to calculate sample size in more
tions. A two-tail test, on the other hand, involves advanced study designs can be found elsewhere
a bidirectional testing of the hypothesis and tests (see Appendix 3).
for the possibility of a relationship in both Generally, the analysis of the study data is
directions [12]. Two-tailed tests are most often dependent on the study design, the study
316 J. Murphy et al.
Summary of the Appraised Article 24, 48, and 72 h after surgery. Full results are
by Oh et al. [21] available in the original publication [21];
Table 29.1 summarizes the data for pain at rest and
A sample size of 25 patients per group was obtained when coughing for postanesthetic recovery and
by the authors with the assumption that there would 24 h postoperatively. If readers wish to see full
be 30% difference in pain between patients measured results by Oh et al. [21], including pain scores for
using a Numeric Rating Scale (NRS). Allowing for 48 and 72 h, please see the original publication.
potential sample loss of 10%, the authors calculated
they would need 28 patients per group.
The authors report no statistically significant Questions Used to Appraise
differences between the two groups in age, Body
Mass Index (BMI), ASA classifications, surgical You believe that this is an appropriate article to
duration, anesthesia duration or comorbidities. answer the clinical dilemma posed in the sce-
However, there were statistically significant dif- nario; however, before presenting the evidence to
ferences reported in sex distribution between your supervisor and colleagues, you want to
groups (82:17.9% M:F in the TAP group and ensure its credibility. You uncover an article by
51.9:48.2% M:F in the control; p = 0.017), and Caddedu et al. [23] that provides a set of ques-
operation type (39.3:60.7% colon:rectum in the tions to appraise a randomized controlled trial
TAP group and 66.7:33.3% colon:rectum in the (RCT) based on power and sample size. The
control group; p = 0.042). While Oh et al. [21] Caddedu et al. [23] article has general questions
stated whether statistically significant differences on how to appraise an RCT, however, this
existed, the authors did not report if differences chapter is going to focus on those questions
between groups were clinically important. The specific to assessing power (Box 2).
inclusion of this information is common in
RCTs, however, as recommended by Altman
Box 2: Key Questions to Assess Power
[22], it is more impactful to state if there are
Within a Study [23]
clinically important differences between groups.
1. Was a power analysis performed?
Altman [22] states that it is not good practice to
2. Was the sample size calculation detailed for
infer from the lack of statistical significance that
the primary outcome?
the variable under question had no effect on the
3. Is the effect size clinically relevant?
outcome of interest. In other words, even though
subjective assessment requires prior knowledge 4. Would the stated difference in treatment effect
result in a change in practice?
of the prognostic impact of the variables of
5. Is the effect size precise and consistent with
interest, this should be used in place of statistical your clinical experience and previously
analysis to determine the similarity between two published trials?
randomized groups [22]. With this in mind, it 6. If no power analysis was completed, are the
would have been more effective for Oh et al. [21] results reported appropriately to estimate power?
to comment on the known prognostic importance 7. Are confidence intervals included so that the
of those variables compared between groups estimation of the treatment effect can be
(age, BMI, ASA, surgical and anesthesia dura- determined?
tion, and comorbidities).
There are no statistically significant differences
in NRS scores both at rest and on coughing at 1 h
postanesthetic recovery (PAR) although the mean Question 1: Was a power analysis performed?
score for patients in the TAP group was slightly
lower than that for the control group (see As power and sample size are directly related to each
Table 29.1). Postoperative results showed no sta- other, it is important to perform a power analysis
tistically significant differences in NRS scores at a priori. After reviewing the Oh et al. [21] article,
318 J. Murphy et al.
Table 29.1 Pain outcomes between patient groups at four follow-up periods [21]
Postanesthetic recovery 24 h post-operative
(PAR) mean, SD (POD1) mean, SD
Pain at rest TAP block (n = 28) 5.3, 1.9 4.0, 1.6
Control (n = 27) 5.9, 2.0 3.9, 1.7
a
p value 0.24 0.87
Pain when coughing TAP block (n = 28) 6.8, 1.9 5.6, 1.8
Control (n = 27) 7.2, 1.7 6.1, 1.6
p valuea 0.38 0.32
SD standard deviation, TAP transversus abdominis plane
a
Adjusted for gender and operation
Table restructured using information from Oh et al. [21]
you see that no power calculation was performed. The authors then state that a 30% decrease in
The authors state that they considered “a statistical pain scores, which, would equal 4.4, would be
power of 0.8”, however, they did not perform a considered “clinically significant”. As the
power analysis [21]. Using the information learned authors do not state where this 30% decrease was
above, without performing a power calculation a taken from you investigate and find that the
priori we do not know if there was an adequate Schwenk et al. [25] study mentions that their
sample collected to properly detect a difference if sample size was calculated to detect a 30%
one truly occurs. This discovery would be prob- decrease in pain scores. The problem is that the
lematic as if inadequate sample size was used the Schwenk et al. [25] study uses a different pain
study may not have enough power to detect a dif- score than Oh et al. [21] and therefore you are
ference if one truly occurred. With inadequate not convinced that this decrease in pain score
sample size, and consequently, power, validity of would truly be clinically relevant for this sample,
results could be questioned and resources to perform or pain measure.
the study could be potentially wasted. Continuing on with their sample size calcu-
lation, Oh et al. [21] decide to set the alpha at
Question 2: Was the sample size calculation 0.05 and set their power to 0.8, or 80%. Using
detailed for the primary outcome? these criteria, the authors determined that the
study was required to have 25 individuals in each
The primary outcome for the Oh et al. [21] study group. The authors then use a dropout rate of
was stated as pain following surgery, on coughing 10% and determined that 28 patients would be
postoperative day 1 (POD 1). To measure pain, needed in each group.
the authors used a numeric rating scale (NRS) on In regards to the hypothesis statement, “inves-
postoperative days (PODs) up to 3 days following tigating the effects of TAP blocks on pain after
surgery at rest, and on coughing [21]. surgery”, you feel that there is a lack of detail. For
Oh et al. [21] do not include their sample size example, the hypothesis does not state where the
calculation in the published article, instead the expected difference would be seen, i.e., would the
authors reference two previous studies [24, 25]. difference be seen between time points, within
The first study referenced by Oh et al. [21] states patients, or between the patient groups? Through a
that pain scores, at rest, 24 h following surgery review of the methods section, you see that Oh
was 4.3 [24]; the next resource states that pain et al. [21] are comparing the pain scores between
when coughing was two points higher than it was groups at POD1 on coughing. Table 29.2 illus-
at rest [25]. Oh et al. [21] use this previous trates the time points that Oh et al. [21] used for
information to determine that the pain score their calculation. You feel that the authors should
when coughing would be 6.3. have stated the hypothesis as “expecting a 30%
29 Power and Sample Size 319
Table 29.2 Comparing the mean pain scores between tap block and control patients across time points (analysis
performed)
NRS rest NRS cough
TAP block Control Test Tap block Control Test
PAR A B A versus B C D C versus D
POD1 E F E versus F G H G versus H
Table 29.3 Comparing change in pain scores between groups at each time point (suggested analysis)
NRS rest NRS cough
TAP block Control Test Tap block Control Test
PAR A B – C D –
POD1 E F – G H –
POD1-PAR E A ¼ DEA F B ¼ DFB DEA versus DFB G C ¼ DGC H D ¼ DHD DGC versus DHD
however, there was no reference given for this the difference and give an idea of the interval of
information and a 30% decrease in pain was not values where the true value is most likely to be
found [21]. Additionally, you do not believe that found [23]. Confidence intervals are a very
their hypothesis statement was correct (as stated important part of research and help the reader
in Question 2). As the hypothesis statement was to interpret the results of a study, for more infor-
ambiguous, you are not quite confident that the mation on confidence intervals please see
results can convince you to change your practice. Chap. 28.
The decision to change practice is subjective as
introducing TAP blocks into a practice would also
cause a delay in the treatment of the case. Resolution of the Scenario
Therefore, each surgeon would have to weigh the
pros and cons and determine if a possible Based on the evidence presented in the Oh et al.
decrease in pain is worth an increase in time and [21] article, and her own appraisal of the article,
resources for the case. the Fellow informed her program director that
she is retracting her original recommendation.
Question 5: Is the effect size precise and con- She recommends that they repeat the study but
sistent with your clinical experience and previ- this time use robust methodology to find a more
ously published trials? convincing answer. A more robust study could
include some of the following changes:
Oh et al. [21] did not state an effect size, simi-
larly, no reference was provided for the chosen • A clearly stated hypothesis that matches the
30% decrease in pain and therefore it is difficult research question and guides data analysis
to determine if the variables used by the authors – e.g., To investigate the difference in
are justified. change in pain scores on coughing
between the control and TAP block from
Question 6: If no power analysis was completed, PAR to POD1.
are the results reported appropriately to estimate • The authors should have performed a more
power? appropriate literature search to determine
criteria for clinically important differences
While the authors did not calculate power, the and attrition rates. Criteria should be used that
results reported by Oh et al. [21] do allow you to match the sample, the procedure, and the
conduct your own power analysis: outcome measure being used.
Using the mean difference of 0.5 in pain score – i.e., the 30% decrease used by Oh et al.
on coughing at POD1, standard deviation of 1.6, [21] originated in a study looking at a
alpha of 0.05 (two-sided test) and 28 patients per different intervention, and a different out-
group, the trial post hoc power calculation is come measure
0.231. – i.e., the 10% attrition rate seems to be chosen
at random, there was no stated evidence to
Question 7: Are confidence intervals included so support that this attrition rate was applica-
that the estimation of the treatment effect can be ble. The authors should have used previous
determined? literature or performed a feasibility study.
• The authors collect data for 24, 48, and 72 h
Oh et al. [21] did not provide confidence intervals following surgery, however, they report on
in their results or conclusions. Confidence inter- the 24 h results as the primary outcome and
vals are used alongside p-values to describe the consider the information for the other two
results seen in a study [23]. For example, while a p- time points as secondary outcomes
value will indicate if the difference between two • The authors should have utilized a biostatis-
treatments is statistically significant, the confi- tician to determine a proper sample size for
dence interval will indicate both the magnitude of the outcomes that they wanted to measure.
29 Power and Sample Size 321
Appendix 2: Using G*Power rate (b, power = 1 − b), but these can all be
to Compute Power [30] customized by the user as we have done. Addi-
tionally, the user must select the statistical test
Figure 29.2a shows a screenshot of G*Power being performed and whether it is a one or
with the majority of the factors summarized in two-tailed study. Figure 29.2–d gives examples
Table 29.1 highlighted. G*Power defaults to the of a G*Power calculation and how changing
most commonly selected effect size (d), test various factors impacts the required sample size.
statistic, Type I error rate (a), and Type II error This chapter does not explain the three output
20. The R Foundation. The R project for statistical colorectal resections. A prospective randomized trial.
computing [Internet] [accessed 2018 Jul 24]. Avail- Surg Endosc. 1998;12:1131–6.
able from: https://www.r-project.org/. 26. Malterud K, Dirk Siersma V, Dorrit Guassora A.
21. Oh T, Yim J, Kim J, Eom W, Lee S, Park S, et al. Sample size in qualitative interview studies: Guided
Effects of preoperative ultrasound-guided transversus by information power. Innov Methods. 2016;26
abdominis plane block on pain after laparoscopic (13):1753–60.
surgery for colorectal cancer: a double-blind random- 27. Sim J, Saunders B, Waterfield J, Kingstone T. Can
ized controlled trial. Surg Endosc. 2017;31(1):127–34. sample size in qualitative research be determined a
22. Altman DG. Comparability of randomised groups. priori? Int J Soc Res Methodol. 2018; EPub Ahead of
The Statistician. 1985;34:125–36. Print. https://doi.org/10.1080/13645579.2018.1454643.
23. Caddedu M, Farrokhyar F, Thoma A, Haines T, 28. Onwuegbuzie AJ, Leech NL. A call for qualitative
Garnett A, Goldsmith CH, et al. Users’ guide to the power analyses. Qual Quant. 2007;41:105–21.
surgical literature: how to assess power and sample 29. Centers for Disease Control and Prevention C.
size. Laparoscopic vs open appendectomy. Can J Health-related quality of life (HRQOL): measurement
Surg. 2008;51(6):476–82. properties: validity, reliability, and responsiveness
24. Kang SB, Park JW, Jeong SY, Nam BH, Choi HS, [Internet]. CDC; 2016 [cited 2018 Jun 12]. Available
Kim DW, et al. Open versus laparoscopic surgery for from: https://www.cdc.gov/hrqol/measurement.htm.
mid or low rectal cancer after neoadjuvant chemora- 30. Bucher A, Erdfelder E, Franz F, Lang AG. G*Power:
diotherpy (COREAN trial): short-term outcomes of statistical power analysis for Windows and Mac.
an open-label randomised controlled tiral. Lancet [Internet]; 2017. [cited 2018 Aug 12]. Available from
Oncol. 2010;11:637–45. http://www.gpower.hhu.de/.
25. Schwenk W, Bohm B, Muller JM. Postoperative pain
and fatigue after laparoscopic or conventional
Subgroup Analyses in Surgery
30
Alexandra Hatchell and Sophocles H. Voineskos
asks you about the planned method of surgical you only obtain reliable and valid evidence, you
wound closure. Even though you have reiterated limit your search with “randomized controlled
that the primary goal of this operation is to trials (RCT)”. This search produces two articles
remove her cancer, she seems fixated on the (Appendix 1). As you review the titles of the
potential postoperative cosmetic appearance of articles, you notice that the first article contains
her surgical scar. “subcuticular sutures versus staples for skin
You have noticed that many of your plastic closure after open gastrointestinal surgery” [4].
surgery colleagues frequently use subcuticular Since it appears this article will be directly
sutures to close surgical incisions and assume the comparing these two skin closure methods in
cosmetic results of these wounds are likely to be gastrointestinal surgery, you believe it should
quite good. Therefore, you wonder whether a provide some insight into your decision for your
subcuticular closure would result in a better patient with the upcoming right hemicolectomy.
surgical scar for this patient compared to closure You download the article for further reading and
with staples, while still minimizing the chance of assessment.
infection or other wound complications. You
decide to perform a literature search to answer
this question. Summary of the Appraised Article
Table 30.1 Demographic characteristics and primary outcomesa of patients receiving skin closure with either
subcuticular sutures or staples
Subcuticular sutures Staples Odds ratio (95% p-
(n = 562) (n = 518) CI) value
Median age (years; IQR) 68 (61–75) 68 (61–74)
Sex
Male [n (%)] 388 (69.0) 365 (70.5)
Female [n (%)] 174 (31.0) 153 (29.5)
Surgery
Upper gastrointestinal [n (%)] 385 (68.5) 417 (80.5)
Lower gastrointestinal [n (%)] 177 (31.5) 101 (19.5)
Wound complication rate 47 (8.4) 59 (11.5) 0.709 (0.474– 0.12
[n (%)] 1.062)
Surgical site infection [n (%)] 36 (6.4) 36 (7.0) 0.928 (0.558– 0.81
1.543)
Nonsurgical site infection 11 (2.0) 23 (4.5) 0.435 (0.189– 0.0238
[n (%)] 0.940)
Hypertrophic scar formation 93 (16.7) 111 (21.6) 0.726 (0.528– 0.0429
[n (%)] 0.998)
a
Adapted from Tsujinaka and colleagues [4]
upon the type of surgery performed (upper versus patients who had upper GI surgery (17.3% versus
lower GI surgery). Significance of subgroup tests 23.7%, p = 0.028).
was set at alpha = 0.05.
Intention-to-treat analyses (Tables 30.1 and
30.2) revealed wound complications occurred in Appraisal of the Selected Article
8.4% of patients who received subcuticular
sutures and 11.5% of patients who received sta- You find articles by Sun and colleagues [5]
ples (p = 0.12). In the subgroup of patients who and Dijkman and colleagues [6] that help you
had upper GI surgery, nonsurgical site infections appraise the subgroup analysis in Tsujinaka and
were lower for patients with subcuticular sutures colleagues’ RCT [4]. The credibility of their
compared to staples (1.6% versus 4.6%, subgroup analysis can now be critically analyzed
p = 0.015), but there were no differences in on a point-by-point basis (Box 1).
surgical site (superficial incision) infections
between groups (p = 0.53). In the subgroup of
Box 1. Criteria to Assess the Credibility
patients who had lower GI surgery, significantly
of a Subgroup Analysis
fewer patients who had subcuticular sutures
experienced overall wound complications
A. Are the results valid?
(10.2% versus 19.8%, p = 0.03) and specifically
surgical site infections (7.4% versus 15.8%, 1. Was the subgroup analysis based
p = 0.04) compared to those who had staples. on rationale indication?
Intention-to-treat analyses also revealed signifi- 2. Was the subgroup analysis prede-
cantly lower hypertrophic scar rates in the sub- fined a priori or carried out post hoc?
cuticular suture group compared to the staples 3. Was the subgroup analysis one of
group (16.7% versus 21.6%, p = 0.043). This a small number?
finding was also mirrored in the subgroup of
330 A. Hatchell and S. H. Voineskos
Table 30.2 Primary outcomesa of patients receiving upper or lower gastrointestinal surgery and skin closure with
either subcuticular sutures or staples
Upper gastrointestinal surgery Lower gastrointestinal surgery
Subcuticular Staples Odds p- Subcuticular Staples Odds p-
sutures (n = 413) ratio value sutures (n = 101) ratio value
(n = 382) (95% (n = 176) (95%
CI) CI)
Wound 29 (7.6) 39 (9.4) 0.788 0.38 18 (10.2) 20 (19.8) 0.463 0.0301
complication (0.459– (0.217–
rate [n (%)] 1.339) 0.978)
Surgical site 23 (6.0) 20 (4.8) 1.259 0.53 13 (7.4) 16 (15.8) 0.425 0.0399
infection [n (%)] (0.649– (0.179–
2.461) 0.992)
Nonsurgical site 6 (1.6) 19 (4.6) 0.331 0.0149 5 (2.8) 4 (4.0) 0.710 0.73
infection [n (%)] (0.107– (0.149–
0.875) 3.666)
Hypertrophic 66 (17.3) 98 (23.7) 0.672 0.0282 27 (15.3) 13 (12.9) 1.226 0.72
scar formation (0.465– (0.576–
[n (%)] 0.965) 2.729)
a
Adapted from Tsujinaka and colleagues [4]
4. Did the power calculation account 3. Does the emphasis of the discus-
for between-subgroup treatment sion and conclusion remain on
effects? overall treatment effect?
5. Was randomization stratified for
important subgroup variables? C. Will the results help me in caring
6. Can chance alone explain the for my patients in my practice?
subgroup difference? Were inter-
action tests used for assessing 1. Is the subgroup difference consis-
subgroup treatment effect tent across other studies?
interactions? 2. Is the subgroup effect or interac-
7. Were the significance levels of tion clinically important?
treatment effect interactions 3. Are the patients in the subgroup
adjusted for multiplicity? comparable to my patients?
8. Were subgroups checked for
comparability of prognostic
factors?
A. Are the results valid?
B. What were the results?
1. Was the subgroup analysis based on a
1. Are all performed subgroup anal- rationale indication?
yses reported?
2. Are the subgroup analyses repor- When critically appraising a subgroup analy-
ted as relative risk reductions? sis, it is important to ensure that the analysis is
based upon a logical rationale. If no logical
30 Subgroup Analyses in Surgery 331
rationale exists, it is difficult for a subgroup subgroups. However, they must have defined the
analysis to be of practical value regardless of subgroups a priori given that they stratified
whether the results are statistically significant according to the subgroup variable before ran-
[1, 7]. Certain patients may be expected to have domization. This stratification is a useful way to
different treatment effects due to the impact of reduce potential bias as it ensures, if the sample
differences in risk pertaining to a specific out- is large enough, that the patient characteristics
come or differences in pathophysiology. In these would be balanced in the two trial arms within
situations, it may be appropriate to separate these each subgroup stratum.
patients into subgroups. Unfortunately, Tsuji-
naka and colleagues [4] do not provide a ratio- 3. Was the subgroup analysis one of a small
nale for the choice of separating patients into the number?
upper and lower GI subgroups. The choice of
subgroups would have been more justified if the The chance of type I error, or falsely obtaining
authors had provided evidence-based hypotheses significant subgroup effects and interactions,
that the subgroups would differ due to certain increases as the number of subgroup analyses
characteristics, such as the potential differences increases [9–11]. The number of subgroup anal-
in contamination of upper and lower GI surgical yses is the product of the number of outcome
sites. analyses and the number of subgroups created.
To mitigate the risk of chance and sampling
2. Was the subgroup analysis predefined a priori error, the number of subgroups should be mini-
or carried out post hoc? mized. Furthermore, subgroup analyses should
be restricted to the primary outcome of the main
Subgroup analyses may be performed prior to RCT as well as the secondary outcomes appli-
or after the initiation of the study, but these cable to specific subgroups. Ensuring subgroup
analyses should be interpreted differently [1, 8]. analyses are succinct and meaningful will also
Authors should identify and define specific sub- aid the authors in delivering a clear message to
groups in the protocol of the RCT a priori, the readers regarding the findings of the study.
including when the trial is initially registered. Tsujinaka and colleagues [4] repeat the main
Furthermore, both the direction and magnitude of effect analyses of the primary (wound compli-
the hypothesized subgroups effect should also be cations) and secondary (hypertrophic scar for-
reported a priori. These key steps avoid post hoc mation) outcomes on the subgroups of patients
interpretations of the data that could bias the who had either upper or lower GI surgery.
authors’ conclusions and fit the data However, subgroups analyses are also applied to
retroactively. the six component outcomes of the overall
Although post hoc analyses may provide wound complication rate which increased the
important clinical information and considerations, number of outcomes analyzed from 2 to 8.
these analyses should be approached with caution Therefore, the results should be interpreted with
[1, 9]. Usually, post hoc analyses may be per- caution given the increased risk of type I error
formed due to the generation of new hypotheses if from the multiple subgroup analyses.
the original data analyzed produces unexpected
results. Since subgroup analyses are generated by 4. Did the power calculation account for
the trial data, the difference of treatment effect between-subgroup treatment effects?
between subgroups may actually be due to the in-
tervention rather than the subgroup characteristic The power of a trial is defined as the proba-
[8, 9]. Therefore, the credibility of post hoc sub- bility to detect a clinically important difference
group hypotheses is low [8, 9]. between two groups if one truly exists. Power is
Tsujinaka and colleagues [4] did not clearly positively associated with the magnitude of the
state in the article that they predefined the chosen treatment effect as well as the sample size of the
332 A. Hatchell and S. H. Voineskos
study. Usually, sample size calculations for randomization maintains the randomization of
RCTs are based upon a power of at least 80% to known and unknown prognostic variables
detect differences in treatment effect between the between treatment groups that may influence the
primary treatment groups. As a result, subgroup treatment effect [14]. Therefore, both type I and
analyses are often underpowered to detect dif- II errors may be reduced with stratification.
ferences in treatment effect due to the much Stratification is especially important to consider
smaller number of participants within each sub- in smaller trials, where the risk of differences in
group [6, 7, 9, 12]. To detect interactions of the prognostic variables and type I error can be high
same size and with the same power as the overall [14]. Since subgroups are by definition smaller
effect, sample sizes should be inflated by four than the larger treatment arm groups, clinicians
[13]. Furthermore, interaction effects are gener- should not assume that prognostic variables
ally smaller than treatment effects, and thus even between groups are similar at baseline unless
larger sample sizes would need to be obtained to stratification has taken place with randomization
demonstrate a statistically significant difference. [14]. Since randomization ensures subgroups are
The much larger sample sizes needed for reliable similar with the exception of treatment, valid
subgroup analyses are practically challenging to conclusions can be made about the treatment
achieve in the clinical setting. When subgroup efficacy within subgroups. Tsujinaka and col-
analyses are underpowered, the results might not leagues [4] performed stratified randomization
be reliable and should be interpreted with according to the subgroup variable (e.g., upper or
caution. lower GI surgery).
In the trial by Tsujinaka and colleagues [4],
the sample size was calculated for a power of 6. Can chance alone explain the subgroup dif-
80% to detect an overall treatment effect, but this ference? Were interaction tests used for
did not account for the subgroups. Since the assessing subgroup treatment effect
authors applied the same statistical tests for each interactions?
within-subgroup analysis, the same number of
patients calculated for the subcuticular suture and Investigators and clinicians may be misled by
staples groups (530) would have been needed for the underappreciated potential of chance to
each subgroup to obtain 80% power. Addition- influence the results of a trial [1, 9]. Tests for
ally, more patients should have been included in interaction can help determine whether the dif-
each subgroup to account for the smaller inter- ferences in effects may be explained by chance
action effect. These patient numbers were not alone, where the null hypothesis of this test is
achieved in the subgroups. Therefore, the prob- that no difference exists in the true effect between
ability of false-negative results was high for the subgroups. The lower the p-value, the more
subgroups and the results may not be considered likely the null hypothesis can be rejected.
as valid. As such, the conclusions drawn about Therefore, rather than just considering a thresh-
differences in wound complications and hyper- old of significance (e.g., alpha < 0.05), clinicians
trophic scar formation between the upper and may opt to consider the p-value where the null
lower GI subgroups should be critically hypothesis is increasingly likely to be false as the
questioned. criterion level of alpha gets progressively smaller
[11].
5. Was randomization stratified for important Tsujinaka and colleagues [4] analyzed the
subgroup variables? subgroups with logistic regression to assess for
statistical interactions between treatments.
When designing a RCT with predefined sub-
groups, it is important to stratify randomization 7. Were the significance levels of treatment
by the important subgroups [8]. Stratified effect interactions adjusted for multiplicity?
30 Subgroup Analyses in Surgery 333
8. Were the subgroups checked for compara- The magnitude of treatment effect can be
bility of prognostic factors? presented via either absolute or relative risk
reduction [6]. Absolute risk reduction (ARR)
Despite randomization and stratification (if pertains to the difference in absolute risk for a
applicable), differences in prognostic variables particular outcome between the treatment groups
may still exist between subgroups due to chance being analyzed. Meanwhile, relative risk reduc-
[6]. As such, it is helpful to evaluate the sub- tion (RRR) provides an estimation of risk that is
groups for differences in prognostic variables removed by the treatment of interest.
after randomization, especially if those prognostic RRR may be more appropriate to use when
variables are believed to bias the treatment effect. describing subgroup effects as RRR tends to be
If subgroups are not similar with regards to more similar across risk groups compared to
specific prognostic variables, the investigators ARR [16, 17]. For example, the ARR of a given
should clearly acknowledge this to caution read- treatment cannot be large if the patient already
ers regarding the interpretation of certain results. has a low baseline risk. Likewise, a patient with a
Tsujinaka and colleagues [4] did not comment high baseline risk may have a large ARR fol-
on differences in prognostic variables between lowing a particular treatment. Therefore,
subgroups. Therefore, this information is although the ARR may be different for these two
unknown. groups, the RRR may actually be similar in value
for both groups. Clearly, readers may conclude
B. What were the results? that there is a significant treatment effect between
subgroups if ARR is only considered, when, in
1. Are all performed subgroup analyses fact, similar RRR between subgroups indicate
reported? that no difference exists.
In the study by Tsujinaka and colleagues [4],
Since the probability of a positive finding due odds ratios were reported for the primary and
to chance increases with the number of subgroup secondary outcomes as well as for the subgroup
analyses performed [8, 11], the validity of a analyses (Tables 30.1 and 30.2). No risk reduc-
subgroup analysis is dependent on how many tions are reported for the subgroups, but a cal-
other subgroup analyses were performed but not culation can be made by the reader using the
reported [8]. Investigators may selectively report percentages from the tables.
334 A. Hatchell and S. H. Voineskos
3. Does the emphasis of the discussion and be very challenging to compare these studies and
conclusion remain on overall treatment achieve meaningful results [6].
effect? Tsujinaka and colleagues [4] report their RCT
to be the first to evaluate subcuticular sutures
Interestingly, industry-funded studies are versus staples for Class 2 (clean-contaminated)
more likely to perform and present subgroup surgery. Your review of the literature did not find
analyses when the primary outcomes have null any additional studies regarding this particular
findings [15]. Tsujinaka and colleagues [4] per- topic.
formed an industry-funded study with a null
finding that showed no differences in wound 2. Is the subgroup effect or interaction clinically
complications in patients receiving subcuticular important?
sutures compared to staples in GI surgery. As
suggested by Sun and colleagues [15], this null Readers may determine whether the observed
finding may have persuaded Tsujinaka and col- subgroup effects or interactions are clinically
leagues [4] to perform post hoc subgroup anal- important based on a variety of factors. For
yses to discover other potentially significant instance, the subgroup analysis should be based
results. on a rational indication, the treatment or inter-
The results of subgroup analyses are used in vention being evaluated should be frequently
more than 25% of RCTs to support the conclu- provided to patients, the subgroup variables
sions of the studies [18]. However, due to the should be commonly used, and the outcomes
exploratory nature of subgroup analyses, dis- presented should be clinically meaningful [6]. By
cussions regarding these results should be limited ensuring these factors exist, subgroup analyses
and should not affect the conclusion of a trial [2]. become clinically meaningful and applicable to
Tsujinaka and colleagues [4] used nearly half of the patient groups of the targeted audience (e.g.,
the text within the discussion to outline the surgeons). Ideally, the results of similar subgroup
results of the subgroup analyses. However, the analyses should also be replicated in future
authors remained true to the primary objective of studies.
their study and did not use the subgroup analyses Tsujinaka and colleagues [4] selected the
to influence the conclusion. Instead, the conclu- subgroups and performed the subgroup analyses
sion is clear and the emphasis is placed on the with clinically important results. Subcuticular
fact that subcuticular sutures are not superior to sutures and staples are both commonly used in
staples in GI surgery. multiple surgical specialties. Upper and lower GI
surgeries are commonly performed by gen-
C. Will the results help me in caring for my eral surgeons throughout the world. Furthermore,
patients in my practice? wound complications and scarring are important
patient outcomes in any surgical procedure.
1. Is the subgroup difference consistent Therefore, the authors presented subgroup anal-
across other studies? yses that were clinically important to the inten-
ded audience of surgeons.
Subgroup analyses are far more credible when
the subgroup interaction effect is consistent 3. Are the patients in the subgroup comparable
across multiple studies [9, 11]. However, it is to my patients?
difficult to compare subgroup analyses across
studies. Since subgroups can be small, a lack of To determine whether a subgroup of patients
power and unreliable results may be an issue [6]. presented in a study is comparable to your own
Furthermore, as it is common for surgical studies group of patients, readers should critically assess
to vary greatly with regards to study design, the patient characteristics within the subgroup of
populations, interventions, and outcomes, it can interest [6]. These characteristics are usually
30 Subgroup Analyses in Surgery 335
Kobayashi S, Yamasaki M, Akamaru Y, the surgical literature. J Bone Joint Surg Am.
Miyamoto A, Mizushima T, Shimizu J, 2011;93:e8.
Umeshita K, Ito T, Doki Y, Mori M. Subcuticular 12. Yusuf S, Wittes J, Probstfield J, Tyroler HA. Anal-
sutures versus staples for skin closure after open ysis and interpretation of treatment effects in sub-
gastrointestinal surgery: a phase 3, multicentre, groups of patients in randomized clinical trials.
open-label, randomised controlled trial. Lancet. JAMA. 1991;266:93–8.
2013;382:1105–12. 13. Brookes ST, Whitely E, Egger M, Smith GD,
5. Sun X, Briel M, Walter SD, Guyatt GH. Is a Mulheran PA, Peters TJ. Subgroup analyses in
subgroup effect believable? Updating criteria to randomized trials: risks of subgroup-specific analy-
evaluate the credibility of subgroup analyses. ses; power and sample size for the interaction test.
BMJ. 2010;340:c117. J Clin Epidemiol. 2004;57:229–36.
6. Dijkman B, Kooistra B, Bhandari M. Evidence-based 14. Kernan WN, Viscoli CM, Makuch RW, Brass LM,
surgery working group. How to work with a Horwitz RI. Stratified randomization for clinical
subgroup analysis. Can J Surg. 2009;52:515–22. trials. J Clin Epidemiol. 1999;52:19–26.
7. Oxman AD, Guyatt GH. A consumer’s guide to 15. Sun X, Briel M, Busse JW, You JJ, Akl EA, Mejza F,
subgroup analyses. Ann Intern Med. 1992;116: Bala MM, Bassler D, Mertz D, Diaz-Granados N,
78–84. Vandvik PO, Malaga G, Srinathan SK, Dahm P,
8. Rothwell PM. Treating individuals 2. Subgroup Johnston BC, Alonso-Coello P, Hassouneh B,
analysis in randomised controlled trials: importance, Truong J, Dattani ND, Walter SD, Heels-Ansdell
indications, and interpretation. Lancet D, Bhatnagar N, Altman DG, Guyatt GH. The
2005;365:176–86. influence of study characteristics on reporting of
9. Guyatt G, Wyer P, Ioannidis J. When to believe a subgroup analyses in randomised controlled trials:
subgroup analysis. In: Guyatt G, Drummond R, systematic review. BMJ. 2011;342:d1569.
Meade MO, et al., editors. User’s guides to the 16. Furukawa TA. From effect size into number needed
medical literature: a manual for evidence-based to treat. Lancet. 1999;353:1680.
clinical practice. 2nd ed. Toronto (ON): 17. Schmid CH, Lau J, McIntosh MW, Cappelleri JC.
McGraw-Hill; 2008. p. 571–93. An empirical study of the effect of the control rate as
10. Buyse ME. Analysis of clinical trial outcomes: some a predictor of treatment efficacy in meta-analysis of
comments on subgroup analyses. Control Clin Trials. clinical trials. Stat Med. 1998;17:1923–42.
1989;10:187S–94S. 18. Assmann SF, Pocock SJ, Enos LE, Kasten LE.
11. Sun X, Heels-Ansdell D, Sprague S, Bandhari M, Subgroup analysis and other (mis)uses of baseline
Walter SD, Sanders D, Schemitsch E, Tornetta P III, data in clinical trials. Lancet. 2000;355:1064–9.
Swiontkowski M, Guyatt G. Is a subgroup claim
believable? A user’s guide to subgroup analyses in
Introduction to Clinical Practice
Guidelines 31
Christopher J. Coroneos, Stavros A. Antoniou,
Ivan D. Florez and Melissa C. Brouwers
for all clinicians to think about the CPG ques- Lyman et al. [16] cite a multidisciplinary panel
tions and patient populations not solely with the composed of medical oncology, surgery, com-
lens determining if there is a direct alignment munity oncology, patient/advocacy representa-
with the clinical activities they might perform but tion, and guideline implementation under the
rather with the lens if the scope aligns with ASCO Clinical Practice Guidelines Committee
patients with whom they might be involved with (CPGC). A linked supplement discusses the role
somewhere in the course of the care trajectory. of specific members, content experts, and patient
For example, when considering the surgical representatives.
ablative and reconstructive aspects of breast
cancer, surgeons must consider the timing of
neoadjuvant/adjuvant chemotherapy and adju- iii. Is the CPG methodologically rigorous?
vant radiotherapy. Procedures must be timed
with the consideration of the entire scope of This is often seen as the most important CPG
cancer management, and adjuvant therapy if quality dimension in the AGREE II tool.
indicated may have a negative effect on imme- High-quality guidelines use systematic review as
diate reconstruction. In our ASCO CPG scenario, an evidentiary foundation. Chapter 15 discusses
Lyman et al. address questions on the role of the key issues and methodological steps of sys-
anticoagulation for VTE prophylaxis in different tematic reviews. High-quality study designs such
clinical contexts (hospitalized patients, ambula- as randomized controlled trials (RCTs) can make
tory patients, patients undergoing surgery) but the development of CPG recommendations more
also on the timing and duration of the regimen straightforward. However, RCTs are not required
and patients’ knowledge [16]. to make CPGs. Most important are the methods
by which the evidence base is constructed—how
studies are searched, selected, appraised, and
ii. Is stakeholder involvement considered? synthesized. Methods to reduce bias and provide
the guideline team a valid and credible source of
This AGREE II quality domain focuses on who is information on which to make judgments are
involved in the development of the guideline, who key, even in the absence of RCTs. High-quality
is the guideline designed for, and how patients are methods to interpret the evidence and come to
involved in the process. A multidisciplinary consensus on final recommendations ensure that
guideline development team comprised of clini- the results are transparent, credible, and fair. In
cians, methodologists, and patients ensures the absence of high level evidence provided by
appropriate methodological, content, and experi- RCTs, CPG developers may be tempted to lower
ential perspectives are “at the table”. This serves the threshold for providing strong recommenda-
as a strategy to mitigate bias, and enable debate tions. Guideline users need to assess the link
and the thoughtful contextualization of the evi- between recommendations and supporting evi-
dence to inform the recommendations. Patient dence before clinical implementation. GRADE
involvement is of great importance; authors rarely strategies to evaluate the body of evidence and
report patient-important outcomes in systematic the Evidence to Decision Framework are popular
reviews [19]. Having surgeons involved in methods that reflect the quality expectations and
guideline generation is important to the criteria in the “Rigour” domain [8, 22]. In the
group. While opinion leaders have a small but ASCO CPG example, Lyman et al. describe a
measurable effect on physicians [20], it is mag- thorough methodology, including literature
nified among surgeons [21]. Thus, surgeons’ search, rigorous study selection criteria, specific
participation can play a key role for the develop- data abstraction, evaluation of study quality, and
ment of the CPG but also for the adoption of the summarized data tables [16, 17]. For example,
recommendations. In our ASCO CPG example, RCTs and systematic reviews of RCTs including
342 C. J. Coroneos et al.
at least 50 patients per intervention were included stakeholders, and that are implementable. Rec-
in the development of this evidence-base. ommendations will be relevant if they address a
clinical/health problem that is important to the
target user (e.g., physicians, nurses), provide
iv. Is the CPG development process actionable guidance appropriate to their scope of
independent? practice and the patients they see, and will result
in clinical changes that are important to their
Editorial independence from the funding body patients. In the ASCO CPG [16], the recom-
and being explicit about the competing interests mendations specified the patient population and
of CPG authors are essential to maintaining the only considered research evidence that addressed
credibility of the document and reducing real or the primary outcome, prevention of VTE.
perceived conflicts of interest. Interest groups Consideration of the values of different
may differ in their motivations, and subsequently stakeholders is a defining feature of high-quality
in their interpretation of a body of evidence [23]. recommendations. Stakeholders include a broad
For example, a patient-focused cancer interest range of actors including target users (e.g., clin-
group may support the introduction of a new icians), patients, policy makers, and the CPG
population screening program, whereas public developers themselves (see Table 31.2, Values
health interest groups do not believe it is domain). For example, patients’ values and
cost-effective [7]. CPG panels should not only preferences can influence the acceptability of the
declare all the potential interests related to the recommended action, and the role of the patient
interventions or topics addressed in the CPG but in the shared decision-making process. Under-
also they should make explicit the way the standing CPG developer values provides users
identified conflicts were analyzed and handled. information about the relative importance of
Lastly, the potential impact of the CPG funding different factors (e.g., the priority of different
organization may have on the content of the outcomes) and how CPG developers managed
guideline. situation when values between other parties did
The ASCO CPG program has a publicly not align. This aspect was not explicitly addres-
available Competing Interests Policy [17]. The sed by Lyman et al. [16].
Lyman et al. [16] CPG provides links to the Finally, the benefits of CPGs will only be
policy and provides a complete listing of author realized if recommendations are put into practice.
competing interests. While there are not gold The implementability of CPG recommendations
standards for managing competing interests, the requires an alignment between the actual rec-
transparency provides users with information ommendations and the goal(s) of the CPG. For
from which to make judgments. Conflicts of example, are the recommendations intended to be
interest for individual authors are listed at the end adopted immediately or are they to be used to
of the Lyman et al. [16] CPG: none were leverage clinical policy or funding decisions with
employed by, or owned stock with a relevant the goal of eventual access to the care option.
corporation, five authors acted in a consulting or Tools to support their adoption also reflect
advisory role, three authors received honoraria, high-quality recommendations. The ASCO CPG
and five authors received research funding. has derivative products such as slide decks (for
clinicians and consumers), patient information
page, example order sets, and the like. These
What Are the Results? tools facilitate the operationalization of the rec-
ommendations in practice.
i. Are quality recommendations presented? Lyman et al. [16] provide relevant recom-
mendations addressing our clinical scenario,
High-quality recommendations are those that are including: (i) “3.1: All patients with malignant
clinically relevant, consider the values of disease undergoing major surgical intervention
31 Introduction to Clinical Practice Guidelines 343
should be considered for pharmacologic throm- adoption. Understanding factors that may
bopro-phylaxis with either UFH or LMWH impede or facilitate uptake and the provisions
unless contraindicated because of active bleed- of resources and tools to enable adoption are
ing or high bleeding risk”, (ii) “3.2: Prophylaxis part of an overall quality agenda; a point often
should be commenced preoperatively”, (iii) “3.4: overlooked in surgery [24]. Recommendations
A combined regimen of pharmacologic and should be structured for both academic or
mechanical prophylaxis may improve efficacy, community level practice, stratified for risk, and
especially in the highest-risk patients”, and integrated into normal workflow across different
(iv) “3.5: Pharmacologic thromboprophylaxis settings and context [25–28]. Adapting guide-
for patients undergoing major surgery for cancer lines to a local setting is important in over-
should be continued for at least 7–10 days. coming surgeon behavior [29]. These
Extended prophylaxis with LMWH for up to considerations are especially true when consid-
4 weeks postoperatively should be considered ering VTE prophylaxis [30]. Finally, readers
for… additional risk factors”. should remember that structured processes exist
to modify and tailor to an existing guideline to
a local setting [31]. The entire CPG does not
ii. How is the CPG presented? need to be adopted; the guideline can be
adapted to answer specific questions, and suit
Presentation is not a quality dimension intended needs and resources of a different setting
to make guideline “look pretty”; information and without undermining its validity. The ASCO
recommendations should be easily identifiable, CPG provides recommendations that are trans-
transparent, and explicit. Ensuring guideline ferable to a variety of practices that do not
users know what actions to take, for what require specific settings, resources, or additional
patients, in what circumstances, and with what personnel to implement. The interventions
caveats and qualifying factors provide the foun- described, both mechanical and pharmacologic,
dation for optimized care. For example, the use are available at any center [16, 17].
of “Bottom Line” section in the ASCO CPG
provides users with concise advice that is com-
prised of easy to find recommended clinical Resolution of Clinical Scenario
actions [16, 17]. For example, “Patients under-
going major cancer surgery should receive pro- You appraise the CPG by Lyman et al. and dis-
phylaxis starting before surgery and continuing cuss it with your colleagues at your next multi-
for at least 7–10 days”, and “Extending postop- disciplinary clinic [16, 17]. You believe that the
erative prophylaxis up to 4 weeks should be guideline is suitable for its use and the specific
considered in those with high-risk features”. The recommendations are applicable to your patient.
full report also provides a summary of original As per the recommendations, you decide to use
and updated recommendations for each question preoperative low molecular weight heparin as is
so that the reader can more clearly understand routine at your institution, but add concomitant
where changes to current clinical practices are mechanical prophylaxis with intraoperative lower
required. extremity sequential compression device. You
intend to continue your patient’s DVT prophy-
laxis while admitted postoperatively, again, as is
How Can I Apply the Results routine at your institution. However, you also
to Patient Care? plan to refer your patient for a hematology con-
sultation with the Thrombosis service specifically
Appropriate engagement, sound methods, and to manage DVT prophylaxis after hospital dis-
evidence-informed recommendations are charge for a total of four weeks. You intend to
important but not sufficient to facilitate discuss the guideline at your hospital’s next
344 C. J. Coroneos et al.
quality improvement rounds, with the intention of 4. Brouwers MC, Kho ME, Browman GP, Burgers JS,
adapting it locally for breast reconstruction Cluzeau F, Feder G, et al. Development of the
AGREE II, part 2: assessment of validity of items
patients in the future, including predefined criteria and tools to support application. Can Med
for hematology/thrombosis consultation, preop- Assoc J. 2010;182(10):E472–8.
erative dosing timing, specific pharmacologic 5. Brouwers MC, Rawski E, Spithoff K, Oliver TK,
prophylaxis medications and dosing, and prede- McGlynn, Schuster, et al. Inventory of cancer guide-
lines: a tool to advance the guideline enterprise and
fined duration of pharmacologic prophylaxis in improve the uptake of evidence. Expert Rev Pharma-
routine patients. coeconomics & Outcomes Res. 2011;11(2):151–61.
6. Antoniou SA, Tsokani S, Mavridis D, López-Cano
M, Antoniou GA, Stefanidis D, et al. Guideline
assessment project: filling the GAP in surgical
Conclusions guidelines. Ann Surg. 2018, https://doi.org/10.1097/
SLA.0000000000003036 (Epub ahead of print).
Considerable effort has been dedicated to creat- 7. Shaneyfelt TM, Mayo-Smith MF, Rothwangl J. Are
ing tools to support clinicians’ efforts to differ- guidelines following guidelines?: The methodologi-
cal quality of clinical practice guidelines in the
entiate between high and low-quality CPGs and peer-reviewed medical literature. JAMA. 1999;281
recommendations and to be able to create (20):1900–5.
high-quality CPGs. Surgeons need to remember 8. Guyatt GH, Oxman AD, Vist GE, Kunz R,
that CPGs are distinct from the compulsory steps Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an
emerging consensus on rating quality of evidence
in more familiar clinical pathways. Practically, and strength of recommendations. BMJ (Clin Res
they should be used as tools to facilitate surgical ed). 2008;336(7650):924–6.
decision-making, by interpreting the evidence for 9. Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G,
options, and balancing risks and benefits. From a Brozek J, et al. GRADE guidelines: 1. Introduction—
GRADE evidence profiles and summary of findings
broader system perspective in surgery, CPGs tables. J Clin Epidemiol. 2011;64(4):383–94.
influence policy by demonstrating strengths and 10. Brouwers MC, Kho ME, Browman GP, Burgers JS,
limitations in the evidence base, and defining Cluzeau F, Feder G, et al. Development of the
cost effective interventions, resource allocation, AGREE II, part 1: performance, usefulness and areas
for improvement. Can Med Assoc J. 2010;182
and quality indicators. In this chapter, we apply (10):1045–52.
two internationally recognized tools designed to 11. Brouwers MC, Kho ME, Browman GP, Burgers JS,
achieve this goal—the AGREE II and the Cluzeau F, Feder G, et al. AGREE II: advancing
guideline development, reporting and evaluation in
AGREE-REX. The surgical community is
health care. Can Med Assoc J. 2010;182(18):E839–42.
encouraged to take advantage of the existing 12. Liew NC, Alemany GV, Angchaisuksiri P, Bang
tools and to create opportunities for improvement S-M, Choi G, De DS, et al. Asian venous throm-
as they continue to build a strong evidence- boembolism guidelines: updated recommendations
for the prevention of venous thromboembolism. Int
informed culture.
Angiol: J Int Union Angiol. 2017;36(1):1–20.
13. Bang S-M, Jang MJ, Kim KH, Yhim H-Y, Kim Y-K,
Nam S-H, et al. Prevention of venous thromboem-
bolism: Korean society of thrombosis and hemostasis
References evidence-based clinical practice guidelines. J Korean
Med Sci. 2014;29(2):164–71.
1. Steinberg E, Greenfield S, Wolman DM, Mancher M, 14. Robson A, Sturman J, Williamson P, Conboy P,
Graham R. Clinical practice guidelines we can trust. Penney S, Wood H. Pre-treatment clinical assessment
National Academies Press; 2011. in head and neck cancer: United Kingdom National
2. Qaseem AF, Frode; Macbeth F, Ollenschläger G, multidisciplinary guidelines. J Laryngol Otol.
Phillips S, van der Wees P. Guidelines international 2016;130(S2):S13–22.
network: toward international standards for clinical 15. Piróg MM, Jach R, Undas A. Thromboprophylaxis in
practice guidelines. Ann Intern Med. 2012;156 women undergoing gynecological surgery or assisted
(7):525–31. reproductive techniques: new advances and chal-
3. Browman GP, Brouwers M, Fervers B, Sawka C. lenges. Ginekol Pol. 2016;87(11):773–9.
Population-based cancer control and the role of 16. Lyman GH, Bohlke K, Falanga A. Venous throm-
guidelines-Towards a “systems” approach. Cancer boembolism prophylaxis and treatment in patients
control. 2010;1. with cancer: American society of clinical oncology
31 Introduction to Clinical Practice Guidelines 345
clinical practice guideline update. J Oncol Pract. 24. Browman GP. Improving clinical practice guidelines
2015;11(3):e442–4. for the 21st century: attitudinal barriers and not
17. American Society of Clinical Oncology. Venous technology are the main challenges. Int J Technol
Thromboembolism Prophylaxis and Treatment in Assess Health Care. 2000;16(04):959–68.
Patients With Cancer Update, 2018 [July 30, 2018]. 25. Pai M. Electronic strategies to enhance venous
Available from: https://www.asco.org/practice- thromboprophylaxis in hospitalized medical patients:
guidelines/quality-guidelines/guidelines/supportive- a randomized controlled trial. 2012.
care-and-treatment-related-issues#/9911. 26. Lloyd NS, Douketis JD, Cheng J, Schünemann HJ,
18. Brouwers MC, Spithoff K, Florez ID, Kerkvliet K, Cook DJ, Thabane L, et al. Barriers and potential
Alonso-Coello P, Burgers J, et al. The development of solutions toward optimal prophylaxis against deep
the AGREE-REX (Appraisal of Guidelines Research vein thrombosis for hospitalized medical patients: a
and Evaluation—Recommendations excellence). survey of healthcare professionals. J Hosp Med.
A tool to assess the credibility, trustworthiness and 2012;7(1):28–34.
implementability of guidelines recommendations. 27. Heit JA, O’Fallon WM, Petterson TM, Lohse CM,
TBD. 2018;Manucript in preparation. Silverstein MD, Mohr DN, et al. Relative impact of
19. Agarwal A, Johnston BC, Vernooij RW, risk factors for deep vein thrombosis and pulmonary
Carrasco-Labra A, Brignardello-Petersen R, Neu- embolism. Arch Intern Med. 2002;162(11):1245–8.
mann I, et al. Authors seldom report the most 28. Burgers JS, Grol RP, Zaat JO, Spies TH, van der
patient-important outcomes and absolute effect mea- Bij AK, Mokkink HG. Characteristics of effective
sures in systematic review abstracts. J Clin Epi- clinical guidelines for general practice. Br J Gen
demiol. 2017;81:3–12. Pract. 2003;53(486):15–9.
20. Rogers EM. Diffusion of innovations. Simon and 29. Guadagnoli E, Soumerai SB, Gurwitz JH, Borbas C,
Schuster; 2010. Shapiro CL, Weeks JC, et al. Improving discussion
21. Young JM, Hollands MJ, Ward J, Holman CAJ. Role of surgical treatment options for patients with breast
for opinion leaders in promoting evidence-based cancer: local medical opinion leaders versus audit
surgery. Arch Surg. 2003;138(7):785–91. and performance feedback. Cancer Res Treat.
22. Alonso-Coello P, Schünemann HJ, Moberg J, 2000;61(2):171–5.
Brignardello-Petersen R, Akl EA, Davoli M, et al. 30. Lau BD, Haut ER. Practices to prevent venous
GRADE Evidence to Decision (EtD) frameworks: a thromboembolism: a brief review. BMJ Qual Saf.
systematic and transparent approach to making well 2013:bmjqs-2012-001782.
informed healthcare choices. 1: Introduction. bmj. 31. Fervers B, Burgers J, Voellinger R, Brouwers M,
2016;353:i2016. Browman G, Graham I, et al. Guideline adaptation:
23. Coulter I, Adams A, Shekelle P. Impact of varying an approach to enhance efficiency in guideline
panel membership on ratings of appropriateness in development and improve utilisation. BMJ Qual
consensus panels: a comparison of a multi-and single Saf. 2011:qshc. 2010.043257.
disciplinary panel. Health Serv Res. 1995;30(4):577.
Index
flow chart of study design, 203 European Organization for Research and Treatment of
important aspects of, 204 Cancer Scales, 76 See also EORTC trial
false negative, 204 Evaluating, literature on diagnostic test, see Case series;
false positive, 204 Surgical intervention, evaluating; Surgical
likelihood ratio, 204 research, evaluating surveys and questionnaires
true negative, 204 in
true positive, 204 Evidence-based medicine (EBM), 1, 9, 24, 37, 103, 194
nomogram for result, 207 of CPGs, 197
study summary, 202–203 expert opinion on, 278
Diagnostic tests, 41, 201, 202, 208 in surgery, 2, 4
characterizing, 207 Evidence-Based Medicine (EBM) Working group, 180
understanding, 204 Evidence-based resources, 31
Directed acyclic graph (DAG), 164 Evidence-based surgery (EBS), 1, 2, 9, 37, 47, 93, 143
Disability arm shoulder and hand (DASH), 62, 64 basis for decision making, 2
Quick(DASH), 64 cohort studies in, 159 See also Cohort studies
Distal radius fracture (DRF), 61, 66 definition, 9
post-DRF, 67 evidence-based approach, 9, 10
Document(ation), 1, 73, 194, 259, 278, 280, 337, 342 construction of clinically relevant, answerable
PRISMA flow diagram, 150, 155 question, 9–10, 11
Duke Clinical Research Institute (DCRI), 2 critically appraising literature, 10, 14–15
DynaMed Plus, 28, 32, 34 applying evidence to clinical practice, 11, 15
evaluating effectiveness and efficiency, 11, 15
planning and carrying out literature search, 10,
E 11–14
EASY trial, 94, 99 expert opinion in, 278
EBSCO CINAHL, 30 LOE, 2, 3
Economic analysis, 95, 239, 244 overall philosophy of, 6
Economic evaluation, 6, 46, 98, 227 See also Surgical Expert opinion, 51, 74, 76, 197
techniques, economic evaluations application, 280
Economic research, 37, 46, 47 in context of
EMBASE, 10, 26, 28, 30, 32, 33, 148, 219, 302, 308 evidence-based medicine and surgery, 278
Emotional functioning, 99, 100 surgical literature, 280
Endoscopic Carpal Tunnel Release (ECTR) technique, 21 limitations of, 279
End Result Idea, 194 mechanism-based reasoning, 40
Enhanced Recovery after Surgery (ERAS) program, 255, questions to appraise, 281
261 trusting, 281
EORTC trial, 97, 132 value of, 278–279
overall survival in, 131 Expertise-based randomized controlled trial (EBRCT),
QLQ-C30, 98, 99, 100, 101 136
QLQ-CR29, 98, 99, 100, 101 block randomization, 139
quality of life questionnaire, 131 concealment of allocation, 139–140
questionnaires, 96 considering learning curve, 138
Epistemonikos, 28, 34 equally skilled surgeons, 139
ePROVIDE (database), 63 guidelines to assess, 136
EQ-5D (EuroQOL 5-Dimension), 62, 63, 96 ITT analysis, 141
EQ-5D-5L (EuroQOL 5-Dimension 5-Level), 98 minimizing bias, 140
EQUATOR (Enhancing the QUAlity and Transparency prognostic factor, 140
Of health Research) Network, 5, 6, 252 time to treatment, 140
Equivalent-forms reliability, 76 research validity, 137–138
Essential Evidence Plus, 28, 32 resolution of scenario, 144
Estimation, 286, 304, 305, 317, 320 stratification, 140–141
horizon estimation, 148 sufficient level of expertise, 138–139
interval estimates, 104 sufficient number of surgeons, 139
OR estimation, 176 trial results, 141
outcomes and, 122 applicability to practice, 142–143
point estimates, 104 measurement, 141–142
of true values, 163 overweighing risks and costs, 143
Ethical, 18, 19, 119, 120, 122, 171, 194, 211, 256 patient-important outcomes, 143
352 Index
Hypothesis testing, 286, 287, 303, 304, 305 prospective cohort design, 41, 42
RCT study design, 41, 42
retrospective cohort design, 42
I Level II evidence, 38, 41, 43, 45, 46
IDEAL (Idea, Development, Exploration, Assessment, Level III evidence, 38, 40, 41, 43, 46
and Long-term follow-up), 117, 119, 120 Level IV evidence, 40, 41, 43, 46
Identifying concepts, 24, 35 Level V evidence, 41, 43, 46
Identifying search terms, 25 Levels of evidence (LOE), 2, 37
Ileostomy, 93, 94 CTFPHE’s LOE, 3
Incremental cost-effectiveness ratio (ICER), 241, 242, for diagnostic research, 41–43
246, 247, 248, 251, 252 clinical example, 43
Incremental cost-utility ratio (ICUR), 98, 243, 246, 247, for economic research, 44–46
248, 251, 252 clinical example, 46–47
Information bias, 163, 172, 183, 185, 186, 298 Oxford Centre for Evidence-Based Medicine, 48
Institute for Healthcare Improvement (IHI), 195 for prognostic research, 43–44
Institute of Medicine (IOM), 195 clinical example, 44
Intention to treat principle, 54–55 Sackett’s LOE, 3
hazard ratio (HR), 131 for therapeutic research, 37–41
ITT analysis, 107–108, 129, 131, 141, 289, 329 clinical example, 41
Internal consistency reliability, 76, 271 Lichtenstein tension-free method, 59
Internal responsiveness, 97 Likelihood ratio (LR), 204, 210, 211
Internal validity, 52, 154, 258 calculated, 211
International Society for Pharmacoeconomics and Out- for each category, 206
comes Research, 73 of negative test (LR–), 204, 205
Internet-derived expertise, 201 of positive test (LR+), 204, 205
Interrater reliability, 76, 271 Likert scale, 269, 270
Intervention, see Outcome measure, information on; Limits and search filters, 30–31
PICOT format; Randomized controlled trials Lind, James, 1
(RCTs); Surgical intervention, evaluating Literature search, 71–72, 88–89, 94, 104, 116–117,
Intra-class correlation coefficient (ICC), 76, 77 125–126, 137, 160, 172–173, 184, 193, 202,
Intraocular pressure (IOP), 87 219, 225, 265–266, 277–278, 285–286, 302,
Invasive lobular breast carcinoma, 71 328, 338
Item response theory (IRT), 77 applying to patients, 79
appraising on prognosis, 219–222
within cardiac surgery, 178
J decision analysis literature, 229
Joint Commission on Accreditation of Hospital Organi- existing literature, 167
zations (JCAHO), 196 expert opinion in surgical literature, 280
Journal of Bone and Joint Surgery (JBJS), 37, 38, 41, 46 reporting statistical findings in, 296
Journal of Clinical Oncology, 72 Low-molecular-weight heparin (LMWH), 89, 90, 91, 343
Journal of the American Medical Association (JAMA), 2,
116
M
Mann–Whitney U tests, 166, 272
K Marfan syndrome, 218
Keywords, 25, 26, 27, 30, 52, 94, 146, 219 McGill pain questionnaire (MPQ), 62, 64
Knee arthroscopy, 88, 90, 91 McMaster-Toronto Arthritis Patient Preference Disability
Knowledge Questionnaire (MACTAR), 78
clinical knowledge, 281–282 Mechanism-based reasoning, 40
technical knowledge, 281, 282 Medial epicondylectomy, 232
translation, new advances in surgery, 6 Medical Subject Headings (MeSH), 52, 71, 72, 193
non-MeSH, 52
MEDLINE, 30, 32, 33, 148, 219, 279
L MedSelect, 268
Lancet, 116, 126 Meta-analyses (MA), 38, 51, 145, 233 See also Sys-
Laryngoscopy, 202 tematic reviews (SR)
Level I evidence, 38, 41–43, 45, 46–47 of continuous outcome measures, 152
case-control design, 42 limitation of, 153, 154
case series design, 42, 43 Michigan hand questionnaire (MHQ), 62, 64
354 Index
W
U Well-defined research question, 104
Ulnar neuropathy, 225 Welsh t test, 292
Upper limb functional index (ULFI), 64 Wilcoxon tests, 287
UptoDate, 28, 31, 32 WOMAC index (Western Ontario and McMaster
Ureteric catheters (UC), 301 Universities Osteoarthritis Index), 246, 247,
Ureteric injury (UI), 301 248
US Food and Drug Administration, 5, 72, 95, 126 Work limitations questionnaire (WLQ), 62
See also Food and Drug Administration (FDA)
Users’ Guide for Surgical Literature, 137