
SYMPOSIUM
Year : 2007  |  Volume : 41  |  Issue : 1  |  Page : 32-36
Outcome measurements in orthopedics

Departments of Surgery, Divisions of Orthopedic Surgery, McMaster University, Hamilton, Ontario and University of Toronto, Toronto, Ontario, Canada



The choice of outcome measure in orthopedic clinical research studies is critical. The primary outcome measure for a study has several implications for its design and conduct, including: 1) sample size determination, 2) internal validity, 3) compliance and 4) cost. A thorough knowledge of outcome measures in orthopedic research is paramount to the conduct of a quality study. The decision to choose a continuous versus a dichotomous outcome has important implications for sample size. However, regardless of the type of outcome, investigators should always use the most 'patient-important' outcome and limit bias in its determination.

Keywords: Evidence-based medicine, outcomes, research

How to cite this article:
Bhandari M, Petrisor B, Schemitsch E. Outcome measurements in orthopedics. Indian J Orthop 2007;41:32-6

Types of Outcome Measures

Investigators have a variety of options when considering outcomes for their studies. Regardless of the specific outcome measure used, outcomes should be "patient-important" and as objective as possible. Mortality is one example of an important and objective outcome measure. However, the majority of orthopedic research focuses upon return to function or measures other than death. Thus, investigators should be familiar with instruments that measure patient function or quality of life. Jackowski and Guyatt[1] have summarized the key issues in the use of such measures [Table - 1]. One of the choices that investigators face when trying to identify an appropriate measure is whether to use generic or disease-specific instruments to measure health status.

A generic instrument is one that measures general health status inclusive of physical symptoms, function and emotional dimensions of health. An example of a generic instrument is the Short Form-36. A disadvantage of generic instruments, however, is that they may not be sensitive enough to detect small but important changes. Disease-specific measures are tailored to inquire about the specific physical, mental and social aspects of health affected by a disease (e.g. arthritis). An example of a disease-specific instrument is the Western Ontario McMaster Osteoarthritis Index.

The most commonly used generic instrument in the orthopedic surgical literature is the Short Form-36 (SF-36). The SF-36 is a multi-purpose, short-form health survey consisting of 36 questions.[2],[3] The SF-36 has proven useful in surveys of general and specific populations, comparing the relative burden of diseases and in differentiating the health benefits produced by a wide range of different treatments.[2],[3] The experience to date with the SF-36 has been documented in nearly 4,000 publications; citations for those published in 1988 through 2000 are documented in a bibliography covering the SF-36 and other instruments in the "SF" family of tools.[2],[3]

The SF-36 contains multi-item scales measuring eight domains: physical functioning (10 items); role physical (four items); bodily pain (two items); general health (five items); vitality (four items); social functioning (two items); role emotional (four items); and mental health (five items). The two summary measures of the SF-36 are the physical component summary and the mental component summary. Scores for the multi-item scales and the summary measures range from zero to 100, with 100 being the best and zero the worst possible score. The SF-36 takes less than 15 min to complete and can be self-administered or interview-administered. It is available in numerous languages. To use the SF-36, permission must be obtained through Quality Metric.

Utility or performance measures are a unique form of generic instrument that measure health status by quantifying wellness on a continuum anchored by death and optimum health. Assessment of health utility is rooted in decision theory, which models the decision-making process expected of rational individuals when faced with uncertain outcomes. Through placement on a continuum with anchors of death and full health, preference measurement provides a means to compare alternative interventions, patient populations and diseases and is particularly useful when attempting to measure the cost-effectiveness of competing interventions in which the cost of an intervention is related to the number of quality-adjusted life-years (QALYs) gained.

Limiting Bias in Outcomes Evaluation
[Table - 2][2]

Bias in the measurement of outcomes can be minimized by the use of validated outcome measures, objective outcome measures, blinded assessment of outcomes and independent adjudication of outcomes. Whenever possible, outcome assessment should be blinded: the outcome assessor should not be aware of the treatment allocation of the patient in a clinical study. In many surgical trials, however, blinding is impossible and investigators must use alternative methods to minimize bias. In such situations, the outcome measure can be independently adjudicated. By this, we mean that the outcome should be determined by an 'independent' person or group of individuals who are not otherwise involved in the study. The operating surgeon should not be the individual evaluating outcomes of his or her own patients. When outcomes (e.g. radiographic fracture healing) are subjective in their determination, independent adjudication by one or more persons is an excellent way to limit bias.

Sample Size Implications and Outcome Measures

Outcome measurement and sample size

This section focuses on the choice of an outcome measure and sample size. The statistical power of a study is the probability that it will find a difference between two treatments when one actually exists. By convention, investigators set the acceptable study power to 80% (i.e. a 20% chance of a false-negative result, or Type II error). Small studies are at risk of being underpowered (study power <80%). Surgeons must endeavor to optimize the study power when they anticipate a small sample size for their studies. The choice of the main outcome variables may play a crucial role in such circumstances.

Bhandari et al evaluated the impact of the choice of outcome variable on statistical power in trials of orthopedic trauma.[4] They hypothesized that small studies with continuous outcome variables (time to fracture union) would achieve higher estimates of study power than those that reported dichotomous outcome variables (% union rates). In a review of 196 RCTs published in 32 medical journals, Bhandari et al identified a total of 19,942 patients. Study sample sizes ranged from 10 to 662 patients. The vast majority of the studies were conducted at only one center (99.0% or 194/196) and focused upon interventions related to fracture repair (99.0% or 194/196). Fractures of the hip were the primary focus of over one-third of the included studies (34.2% or 67/196). These authors identified 76 studies (39%) with sample sizes of 50 patients or less. Two groups were formed: 29 studies reported continuous outcomes and 47 studies reported dichotomous outcomes. The mean sample size of the studies in each group was similar (P>0.05). Studies that reported continuous outcomes had significantly greater study power than those that reported dichotomous outcomes (P=0.042). Twice as many studies reporting continuous outcomes achieved conventionally acceptable study power (80% or more) as those reporting dichotomous outcomes (37% vs. 18.6%, P=0.04) [Figure - 1].

The power of a statistical test is typically a function of the magnitude of the treatment effect, the designated Type I error rate (α, the risk of a false-positive result) and the sample size (n). When designing a trial, investigators can decide upon the desired study power (typically 80%) and calculate the necessary sample size to achieve this goal. If investigators are conducting a post-hoc power analysis after the completion of the study, they will use the actual sample size to calculate the study's power.

Moher and colleagues identified 383 randomized trials published in the top medical journals JAMA, the New England Journal of Medicine and The Lancet. Although Moher et al did not compare statistical power with the type of outcome variable, they evaluated 70 trials with negative results and found that 68% lacked acceptable statistical power (80%).[5] Lochner and colleagues identified 117 randomized trials in orthopedics with a negative (nonsignificant) result and reported that over 90% lacked sufficient statistical power to make definitive conclusions.[6] Among the small randomized trials in this review, 78% were underpowered.

In conclusion, the prevalence of published studies that fail to meet acceptable standards of statistical power is widespread. Surgeons can limit this problem by carefully selecting the outcome variable to optimize the study power and obviate the need for large samples of patients.

Continuous variables are significantly better suited to improving statistical power in small trials than dichotomous variables.

Sample Size Calculation

Even at best, a sample size calculation is based upon the best available "guesstimate" of the difference between treatment groups.

Comparing two means (continuous variable)[7],[8],[9],[11],[12]

Let's consider a study that aims to compare pain scores between arthroplasty and internal fixation in patients with displaced hip fractures. Previous studies using the pain score have reported standard deviations of 12 points for trauma patients. Based upon previous studies, we want to be able to detect a difference of 5 points on this pain score between treatments. Thus, the number of patients required per treatment arm to obtain 80% study power (β=0.20) at a 0.05 alpha level of significance is as follows:

n1 = n2 = 2σ²(z(1-α/2) + z(1-β))² / Δ²

where:

n1 = sample size of Group one
n2 = sample size of Group two
Δ = difference in outcome parameter between groups (5 points)
σ = sample standard deviation (12 points)[12]
z(1-α/2) = z0.975 = 1.96 (for α=0.05)
z(1-β) = z0.80 = 0.84 (for β=0.2)

From the equation above, our proposed study will require 90 patients per treatment arm to have adequate study power: n1 = n2 = 2(12²)(1.96 + 0.84)² / 5² ≈ 90.
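This arithmetic can be sketched in Python. The helper name and the use of exact normal quantiles from the standard library (rather than the rounded 1.96 and 0.84) are choices of this sketch, not from the article:

```python
import math
from statistics import NormalDist

def n_per_arm_means(sigma, delta, alpha=0.05, power=0.80):
    """Patients per arm for comparing two means:
    n = 2 * sigma^2 * (z_(1-alpha/2) + z_(1-beta))^2 / delta^2"""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    return math.ceil(2 * sigma**2 * (z_a + z_b)**2 / delta**2)

# Hip-fracture pain-score example: sigma = 12 points, delta = 5 points
print(n_per_arm_means(12, 5))
```

With exact quantiles and rounding up, this gives 91 patients per arm; the article's figure of 90 reflects the rounded z-values 1.96 and 0.84.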

The study power for any given sample size can be calculated by rearranging the above formula to solve for the z-score:

z(1-β) = √(nΔ²/(2σ²)) − z(1-α/2)

The actual study power corresponding to the calculated z-score can be looked up in readily available statistical tables[6] or on the internet (keyword: "z-table"). From the above example, the z-score for a sample size of 90 patients per arm is 0.84 = √(90 × 5²/(2 × 12²)) − 1.96. The corresponding study power for a z-score of 0.84 is 80%.
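The reverse calculation can be sketched the same way; here `NormalDist().cdf` stands in for a printed z-table, and the function name is illustrative:

```python
import math
from statistics import NormalDist

def power_two_means(n, sigma, delta, alpha=0.05):
    """Post-hoc power for comparing two means with n patients per arm:
    z_(1-beta) = sqrt(n * delta^2 / (2 * sigma^2)) - z_(1-alpha/2)"""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = math.sqrt(n * delta**2 / (2 * sigma**2)) - z_a
    return NormalDist().cdf(z_b)  # Phi(z), i.e. the z-table lookup

# 90 patients per arm, sigma = 12, delta = 5 -> about 80% power
print(round(power_two_means(90, 12, 5), 2))
```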

Comparing binomial proportions (percentages for dichotomous variables)

Let's now assume that we wish to change our outcome measure to differences in secondary surgical procedures between operatively and nonoperatively treated ankle fractures. We consider a clinically important difference to be 5%. Based upon the previous literature, we estimate that the secondary surgical rates in operatively and nonoperatively treated ankles will be 5% and 10%, respectively. The number of patients required for our study can now be calculated as follows:

n1 = n2 = [√(2pmqm) × z(1-α/2) + √(p1q1 + p2q2) × z(1-β)]² / Δ²

where:

n1 = sample size of Group one
n2 = sample size of Group two
p1, p2 = sample probabilities (5% and 10%)
q1, q2 = 1 − p1, 1 − p2 (95% and 90%)
pm = (p1 + p2)/2 (7.5%)
qm = 1 − pm (92.5%)
Δ = difference = p2 − p1 (5%)
z(1-α/2) = z0.975 = 1.96 (for α=0.05)
z(1-β) = z0.80 = 0.84 (for β=0.2)

Thus, we need 433 patients per treatment arm to have adequate study power for our proposed trial.

n1 = n2 = [√(2 × 0.075 × 0.925) × 1.96 + √(0.05 × 0.95 + 0.1 × 0.9) × 0.84]² / 0.05² ≈ 433
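This calculation can also be sketched in Python. The function name is hypothetical; note that exact normal quantiles give a slightly larger n than the article's 433, which uses the rounded z-values 1.96 and 0.84:

```python
import math
from statistics import NormalDist

def n_per_arm_proportions(p1, p2, alpha=0.05, power=0.80):
    """Patients per arm for comparing two proportions:
    n = [sqrt(2*pm*qm)*z_(1-alpha/2) + sqrt(p1*q1 + p2*q2)*z_(1-beta)]^2 / delta^2"""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pm = (p1 + p2) / 2                      # mean of the two proportions
    delta = abs(p2 - p1)
    numerator = (math.sqrt(2 * pm * (1 - pm)) * z_a
                 + math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)) * z_b)
    return math.ceil(numerator**2 / delta**2)

# Ankle-fracture example: 5% vs. 10% secondary surgical rates
print(n_per_arm_proportions(0.05, 0.10))
```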

The study power for any given sample size can be calculated by rearranging the above formula to solve for the z-score:

z(1-β) = (√(nΔ²) − √(2pmqm) × z(1-α/2)) / √(p1q1 + p2q2)

From the above example, the z-score for a sample size of 433 patients per arm is 0.84 = (√(433 × 0.05²) − √(2 × 0.075 × 0.925) × 1.96) / √(0.05 × 0.95 + 0.1 × 0.9). The corresponding study power for a z-score of 0.84 is 80%.
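As before, this post-hoc power check can be sketched with a standard-library normal distribution in place of a z-table (the function name is illustrative):

```python
import math
from statistics import NormalDist

def power_two_proportions(n, p1, p2, alpha=0.05):
    """Post-hoc power for comparing two proportions with n patients per arm:
    z_(1-beta) = (sqrt(n*delta^2) - sqrt(2*pm*qm)*z_(1-alpha/2))
                 / sqrt(p1*q1 + p2*q2)"""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    pm = (p1 + p2) / 2
    delta = abs(p2 - p1)
    z_b = ((math.sqrt(n * delta**2) - math.sqrt(2 * pm * (1 - pm)) * z_a)
           / math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return NormalDist().cdf(z_b)

# 433 patients per arm, 5% vs. 10% event rates -> about 80% power
print(round(power_two_proportions(433, 0.05, 0.10), 2))
```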

Using confidence intervals for sample size calculation

It can also be useful to calculate the precision of a study based on the above sample size calculation. Precision is defined as the width of the 95% confidence interval (CI). Being 95% confident means that if we repeated the study an unlimited number of times, the true difference between groups would be included in the CI in 95% of the samples. For any power and clinically relevant or hypothesized difference (Δ), the predicted confidence interval can be calculated using this formula:

Predicted 95% CI = observed difference ± 0.7Δ0.80

Predicted precision = 2 × 0.7Δ0.80 = 1.4Δ0.80

where Δ0.80 = the true difference for which there is 80% power.

Often, choosing an expected difference between two groups can be arbitrary. An alternative method to determine an expected difference can be derived from using 95% confidence intervals. For example, rather than hypothesizing a 5% difference between operative and nonoperative treatment of ankle fractures, we might be more comfortable stating that we will not accept a confidence interval for an observed difference that is wider than 7%. Thus we can work backwards from our predicted confidence interval to calculate the expected difference between groups:

0.07 = 1.4Δ0.80

Δ0.80 = 0.07/1.4 = 0.05

Now we can use the sample size calculation for the proportions above to calculate the number of patients required for our study.
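The working-backwards step above amounts to inverting the precision formula; a minimal sketch, with a hypothetical helper name:

```python
def delta_from_ci_width(width):
    """Invert predicted precision = 1.4 * delta_0.80 to find the
    detectable difference implied by an acceptable 95% CI width."""
    return width / 1.4

# A 95% CI no wider than 7 percentage points implies delta = 5%,
# which can then be fed into the sample size formula for proportions.
print(round(delta_from_ci_width(0.07), 3))
```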

Calculating the precision illustrates the trade-off between the magnitude of the hypothesized or clinically relevant difference used in the sample size calculation and the likelihood of finding a statistically significant difference. Choosing a larger hypothesized difference decreases the required number of study subjects, but it also widens the predicted 95% confidence interval, which is then more likely to include 0 and therefore to yield a statistically nonsignificant result. While it is tempting to "hypothesize" a larger difference in the primary outcome parameter in order to decrease the required sample size, it is advisable to choose a realistic difference. A further benefit of calculating the predicted precision is that it may be easier for a nonstatistician to understand that the primary outcome parameter would lie within a specific range (in this example, 7%) than to deal with the more abstract concept of study power.

Depending on the number of study subjects required, investigators must evaluate the feasibility of a single-center versus a multi-center study and the length of the enrollment period. Finally, investigators should not confuse clinical significance with statistical significance: almost any difference will become statistically significant if enough study subjects are enrolled.

Conclusion

A thorough knowledge of outcome measures in orthopedic research is paramount to the conduct of a quality study. The decision to choose a continuous versus dichotomous outcome has important implications for sample size. However, regardless of the type of outcome, investigators should always use the most 'patient-important' outcome and limit bias in its determination.

References

1. Jackowski D, Guyatt G. A guide to health measurement. Clin Orthop Relat Res 2003;413:80-9.
2. McHorney CA, Ware JE Jr, Raczek AE. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993;31:247-63.
3. Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care 1992;30:473-83.
4. Bhandari M, Lochner H, Tornetta P 3rd. Effect of continuous versus dichotomous outcome variables on study power when sample sizes of orthopaedic randomized trials are small. Arch Orthop Trauma Surg 2002;122:96-8.
5. Moher D, Dulberg CS, Wells GA. Statistical power, sample size and their reporting in randomized controlled trials. JAMA 1994;272:122-4.
6. Lochner HV, Bhandari M, Tornetta P 3rd. Type-II error rates (beta errors) of randomized trials in orthopaedic trauma. J Bone Joint Surg Am 2001;83-A:1650-5.
7. Zlowodzki M, Bhandari M, Brown G, Cole P, Swiontkowski MF. Planning a randomized trial: Determining the study sample size. Tech Orthop 2004;19:72-6.
8. Bristol DR. Sample sizes for constructing confidence intervals and testing hypotheses. Stat Med 1989;8:803-11.
9. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121:200-6.
10. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 1. Hypothesis testing. Can Med Assoc J 1995;152:27-32.
11. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 2. Interpreting study results: Confidence intervals. Can Med Assoc J 1995;152:169-73.
12. Streiner DL. Sample size and power in psychiatric research. Can J Psychiatr 1990;35:616-20.

Correspondence Address:
Mohit Bhandari
Hamilton General Hospital, 7 North, Suite 727, 237 Barton St. East, Hamilton, Ontario, L8L 2X2

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/0019-5413.30523



[Figure - 1]


[Table - 1], [Table - 2]

