

Validating the Interpretations and Uses of Test Scores

Journal of Educational Measurement

Vol. 50, No. 1, Special Issue on Validity (Spring 2013), pp. 1-73

Published By: National Council on Measurement in Education

https://www.jstor.org/stable/23353796


Abstract

To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. An argument-based approach to validation suggests that the claims based on the test scores be outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses. Validation then can be thought of as an evaluation of the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions. In outlining the argument-based approach to validation, this paper makes eight general points. First, it is the proposed score interpretations and uses that are validated and not the test or the test scores. Second, the validity of a proposed interpretation or use depends on how well the evidence supports the claims being made. Third, more-ambitious claims require more support than less-ambitious claims. Fourth, more-ambitious claims (e.g., construct interpretations) tend to be more useful than less-ambitious claims, but they are also harder to validate. Fifth, interpretations and uses can change over time in response to new needs and new understandings leading to changes in the evidence needed for validation. Sixth, the evaluation of score uses requires an evaluation of the consequences of the proposed uses; negative consequences can render a score use unacceptable. Seventh, the rejection of a score use does not necessarily invalidate a prior, underlying score interpretation. Eighth, the validation of the score interpretation on which a score use is based does not validate the score use.

Journal Information

The Journal of Educational Measurement (JEM) is a quarterly journal that publishes original measurement research, reports on new measurement instruments, reviews of measurement publications, and reports about innovative measurement applications. The topics addressed are of interest to those concerned with the practice of measurement in field settings as well as researchers and measurement theorists. In addition to presenting new contributions to measurement theory and practice, JEM also serves as a vehicle for improving educational measurement applications in a variety of settings.

Publisher Information

The National Council on Measurement in Education (NCME) is a professional organization for individuals involved in assessment, evaluation, testing, and other aspects of educational measurement. Members are involved in the construction and use of standardized tests and performance-based assessment, assessment program design and implementation, and program evaluation. NCME is incorporated exclusively for scientific, educational, literary, and charitable purposes. These include: (1) the encouragement of scholarly efforts to advance the science of measurement and its applications in education and (2) the dissemination of knowledge about the theory, techniques, and instrumentation available for measurement; procedures appropriate to the interpretation and use of such techniques and instruments; and applications of educational measurement in individual and group contexts. NCME members include university faculty; test developers; state and federal testing and research directors; professional evaluators; testing specialists in business, industry, education, community programs, and other professions; licensure, certification, and credentialing professionals; graduate students from educational, psychological, and other measurement programs; and others involved in testing issues and practices.


This chapter provides an overview of considerations for the development of outcome measures for observational comparative effectiveness research (CER) studies, describes implications of the proposed outcomes for study design, and enumerates issues of bias that may arise in incorporating the ascertainment of outcomes into observational research, and means of evaluating, preventing and/or reducing these biases. Development of clear and objective outcome definitions that correspond to the nature of the hypothesized treatment effect and address the research questions of interest, along with validation of outcomes or use of standardized patient reported outcome (PRO) instruments validated for the population of interest, contribute to the internal validity of observational CER studies. Attention to collection of outcome data in an equivalent manner across treatment comparison groups is also required. Use of appropriate analytic methods suitable to the outcome measure and sensitivity analysis to address varying definitions of at least the primary study outcomes are needed to draw robust and reliable inferences. The chapter concludes with a checklist of guidance and key considerations for outcome determination and definitions for observational CER protocols.

Introduction

The selection of outcomes to include in observational comparative effectiveness research (CER) studies involves the consideration of multiple stakeholder viewpoints (provider, patient, payer, regulatory, industry, academic, and societal) and the intended use of the resulting evidence for decisionmaking. It is also dependent on the level of funding and scope of the study. These studies may focus on clinical outcomes, such as recurrence-free survival from cancer or coronary heart disease mortality; general health-related quality of life measures, such as the EQ-5D and the SF-36; disease-specific scales, like the Uterine Fibroid Symptom and Quality of Life questionnaire (UFS-QOL); and/or health resource utilization or cost measures. As with other experimental and observational research studies, the hypotheses or study questions of interest must be translated to one or more specific outcomes with clear definitions.

The choice of outcomes to include in a CER study will in turn drive other important design considerations such as the data source(s) from which the required information can be obtained (see chapter 8), the frequency and length of followup assessments to be included in the study following initial treatment, and the sample size, which is influenced by the expected frequency of the outcome in addition to the magnitude of relative treatment effects and scale of measurement.

In this chapter, we provide an overview of types of outcomes (with emphasis on those most relevant to observational CER studies); considerations in defining outcomes; the process of outcome ascertainment, measurement and validation; design and analysis considerations; and means to evaluate and address bias that may arise.

Conceptual Models of Health Outcomes

In considering the range of health outcomes that may be of interest to patients, health care providers, and other decisionmakers, key areas of focus are medical conditions, impact on health-related or general quality of life, and resource utilization. To address the interrelationships of these outcomes, some conceptual models have been put forth by researchers with a particular focus on health outcomes studies. Two such models are described here.

Wilson and Cleary proposed a conceptual model or taxonomy integrating concepts of biomedical patient outcomes and measures of health-related quality of life. The taxonomy is divided into five levels: biological and physiological factors, symptoms, functioning, general health perceptions, and overall quality of life.1 The authors discuss causal relationships between traditional clinical variables and measures of quality of life that address the complex interactions of biological and societal factors on health status, as summarized in Table 6.1.

Table 6.1

Wilson and Cleary's taxonomy of biomedical and health-related quality of life outcomes.

An alternative model, the ECHO (Economic, Clinical, Humanistic Outcomes) Model, was developed for planning health outcomes and pharmacoeconomic studies, and goes a step further than the Wilson and Cleary model in incorporating costs and economic outcomes and their interrelationships with clinical and humanistic outcomes (Figure 6.1).2 The ECHO model does not explicitly incorporate characteristics of the patient as an individual or psychosocial factors to the extent that the Wilson and Cleary model does, however.


Figure 6.1

The ECHO model. See Kozma CM, Reeder CE, Schultz RM. Economic, clinical, and humanistic outcomes: a planning model for pharmacoeconomic research. Clin Ther. 1993;15(6):1121-32. This figure is copyrighted by Elsevier Inc. and reprinted with permission.

As suggested by the complex interrelationships between different levels and types of health outcomes, different terminology and classifications may be used, and there are areas of overlap between the major categories of outcomes important to patients. In this chapter, we will discuss outcomes according to the broad categories of clinical, humanistic, and economic and utilization outcome measures.

Outcome Measurement Properties

The properties of outcome measures that are an integral part of an investigator's evaluation and selection of appropriate measures include reliability, validity, and variability. Reliability is the degree to which a score or other measure remains unchanged upon test and retest (when no change is expected), or across different interviewers or assessors. It is measured by statistics including kappa, and the inter- or intra-class correlation coefficient. Validity, broadly speaking, is the degree to which a measure assesses what it is intended to measure, and types of validity include face validity (the degree to which users or experts perceive that a measure is assessing what it is intended to measure), content validity (the extent to which a measure accurately and comprehensively measures what it is intended to measure), and construct validity (the degree to which an instrument accurately measures a nonphysical attribute or construct such as depression or anxiety, which is itself a means of summarizing or explaining different aspects of the entity being measured).3 Variability usually refers to the distribution of values associated with an outcome measure in the population of interest, with a broader distribution or range of values said to show more variability.

Responsiveness is another property usually discussed in the context of patient-reported outcomes (PROs) but extendable to other measures, representing the ability of a measure to detect change in an individual over time.

These measurement properties may affect the degree of measurement error or misclassification that an outcome measure is subject to, with the consideration that the properties themselves are specific to the population and setting in which the measures are used. Issues of misclassification and considerations in reducing this type of error are discussed further in the section on “avoidance of bias in study design.”
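As an illustration of the reliability statistics mentioned above, Cohen's kappa for two raters can be computed from first principles. The ratings below are hypothetical, and the function is a minimal sketch rather than a validated implementation:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same subjects."""
    n = len(rater_a)
    # Observed proportion of exact agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical severity ratings from two independent assessors
a = ["mild", "mild", "moderate", "severe", "moderate", "mild"]
b = ["mild", "moderate", "moderate", "severe", "moderate", "mild"]
print(round(cohens_kappa(a, b), 3))  # 0.739: substantial, but imperfect, agreement
```

A kappa near 1 indicates agreement well beyond chance; for continuous measures, the intraclass correlation coefficient plays the analogous role.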

Clinical Outcomes

Clinical outcomes are perhaps the most common category of outcome to be considered in CER studies. Medical treatments are developed and must demonstrate efficacy in preapproval clinical trials to prevent the occurrence of undesirable outcomes such as coronary events, osteoporosis, or death; to delay disease progression such as in rheumatoid arthritis; to hasten recovery or improve survival from disease, such as in cancer or H5N1 influenza; or to manage or reduce the burden of chronic diseases including diabetes, psoriasis, Parkinson's disease, and depression. Postapproval observational CER studies are often needed to compare newer treatments against the standard of care; to obtain real-world data on effectiveness as treatments are used in different medical care settings and broader patient populations than those studied in clinical trials; and to increase understanding of the relative benefits and risks of treatments by weighing quality of life, cost, and safety outcomes alongside clinical benefits. For observational studies, this category of outcome generally focuses on clinically meaningful outcomes such as time between disease flares; number of swollen, inflamed joints; or myocardial infarction. Feasibility considerations sometimes dictate the use of intermediate endpoints, which are discussed in further detail later in the chapter.

Definitions of Clinical Outcomes

Temporal Aspects

The nature of the disease state to be treated, the mechanism, and the intended effect of the treatment under study determine whether the clinical outcomes to be identified are incident (a first or new diagnosis of the condition of interest), prevalent (existing disease), or recurrent (new occurrence or exacerbation of disease in a patient who has a previous diagnosis of that condition). The disease of interest may be chronic (a long-term or permanent condition), acute (a condition with a clearly identifiable and rapid onset), transient (a condition that comes and goes), or episodic (a condition that comes and goes in episodes), or have more than one of these aspects.

Subjective Versus Objective Assessments

Most clinical outcomes involve a diagnosis or assessment by a health care provider. These may be recorded in a patient's medical record as part of routine care, coded as part of an electronic health record (EHR) or administrative billing system using coding systems such as ICD-9 or ICD-10, or collected specifically for a given study.

While there are varying degrees of subjectivity involved in most assessments by health care providers, objective measures are those that are not subject to a large degree of individual interpretation, and are likely to be reliably measured across patients in a study, by different health care providers, and over time. Laboratory tests may be considered objective measures in most cases and can be incorporated as part of a standard outcome definition to be used for a study when appropriate. Some clinical outcomes, such as all-cause mortality, can be ascertained directly and may be more reliable than measures that are subject to interpretation by individual health care providers, such as angina or depression.

Instruments have been developed to help standardize the assessment of some conditions for which a subjective clinical assessment might introduce unwanted variability. Consider the example of a study of a new psoriasis treatment. Psoriasis is a chronic skin condition that causes lesions affecting varying amounts of body surface area, with varying degrees of severity. While a physician may be able to assess improvement within an individual patient, a quantifiable measure that would be reproducible across patients and raters improves the information value of comparative trials and observational studies of psoriasis treatment effectiveness. An outcome assessment that relies on purely subjective assessments of improvement such as "Has the patient's condition improved a lot, a little, or not at all?" is vulnerable to measurement error that arises from subjective judgments or disagreement among clinicians about what comprises the individual categories and how to rate them, often resulting in low reproducibility or inter-rater reliability of the measure. In the psoriasis example, an improved measure of the outcome would be a standardized assessment of the severity and extent of disease expressed as percentage of affected body surface area, such as the Psoriasis Area Severity Index or PASI score.4 The PASI score requires rating the severity of target symptoms [erythema (E), infiltration (I), and desquamation (D)] and area of psoriatic involvement (A) for each of four main body areas [head (h), trunk (t), upper extremities (e), lower extremities (l)].
Target symptom severity is rated on a 0–4 scale; area of psoriatic involvement is rated on a 0–6 scale, with each numerical value representing a percentage of area involvement.4 The final calculated score ranges from 0 (no disease) to 72 (severe disease), with the score contribution of each body area weighted by its percentage of total body area (10, 20, 30, and 40% of body area for head, upper extremities, trunk, and lower extremities, respectively).4 Compared with subjective clinician assessment of overall performance, using changes in the PASI score increases reproducibility and comparability across studies that use the score.
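The PASI calculation described above can be made concrete in a short sketch. The patient ratings are hypothetical; the function simply applies the body-area weights given in the text:

```python
def pasi(regions):
    """Psoriasis Area Severity Index, ranging from 0 (no disease) to 72 (severe).

    regions maps each body area to (erythema, infiltration, desquamation, area),
    with symptom severities rated 0-4 and area of involvement rated 0-6.
    """
    # Standard weights: head 10%, upper extremities 20%, trunk 30%, lower 40%
    weights = {"head": 0.1, "upper": 0.2, "trunk": 0.3, "lower": 0.4}
    return round(
        sum(weights[r] * (e + i + d) * a for r, (e, i, d, a) in regions.items()),
        1,
    )

# Hypothetical ratings for one patient
patient = {
    "head":  (2, 1, 1, 2),
    "upper": (3, 2, 2, 3),
    "trunk": (1, 1, 0, 1),
    "lower": (3, 3, 2, 4),
}
print(pasi(patient))  # 18.4
```

Because the weights sum to 1 and each region's contribution is capped at (4+4+4) x 6 = 72, the overall score stays within the 0-72 range.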

Relatedly, the U.S. Food and Drug Administration (FDA) has provided input on types of Clinical Outcome Assessments (COAs) that may be considered for qualification for use in clinical trials, with the goal of increasing the reliability of such assessments within a specific context of use in drug development and regulatory decisionmaking, so that each assessment measures a specific concept with a specific interpretation. Contextual considerations include the specific disease of interest, target population, clinical trial design and objectives, regionality, and mode of administration. The types of COAs described are:5

  • Patient-reported outcome (PRO) assessment: A measurement based on a report that comes directly from the patient (i.e., the study subject) about the status of particular aspects of or events related to a patient's health condition. PROs are recorded without amendment or interpretation of the patient's response by a clinician or other observer. A PRO measurement can be recorded by the patient directly, or recorded by an interviewer, provided that the interviewer records the patient's response exactly.

  • Observer-reported outcome (ObsRO) assessment: An assessment that is determined by an observer who does not have a background of professional training that is relevant to the measurement being made, i.e., a nonclinician observer such as a teacher or caregiver. This type of assessment is often used when the patient is unable to self-report (e.g., infants, young children). An ObsRO assessment should only be used in the reporting of observable concepts (e.g., signs or behaviors); ObsROs cannot be validly used to directly assess symptoms (e.g., pain) or other unobservable concepts.

  • Clinician-reported outcome (ClinRO) assessment: An assessment that is determined by an observer with some recognized professional training that is relevant to the measurement being made.

Other considerations related to use of PROs for measurement of health-related quality of life and other concepts are addressed later on in this chapter.

Composite Endpoints

Some clinical outcomes are composed of a series of items, and are referred to as composite endpoints. A composite endpoint is often used when the individual events included in the score are rare, and/or when it makes biological and clinical sense to group them. The study power for a given sample size may be increased when such composite measures are used as compared with individual outcomes, since by grouping numerous types of events into a larger category, the composite endpoint will occur more frequently than any of the individual components. As desirable as this can be from a statistical point of view, challenges include the interpretation of composite outcomes that incorporate both safety and effectiveness, and achieving broader adoption of reproducible definitions that would enhance cross-study comparisons. For example, Kip and colleagues6 point out that there is no standard definition for MACE (major adverse cardiac events), a commonly used outcome in clinical cardiology research. They conducted analyses to demonstrate that varying definitions of composite endpoints, such as MACE, can lead to substantially different results and conclusions. The investigators utilized the DEScover registry patient population, a prospective observational registry of drug-eluting stent (DES) users, to evaluate differences in 1-year risk for three definitions of MACE in comparisons of patients with and without myocardial infarction (MI), and patients with multi-lesion stenting versus single-lesion stenting (stenting is also referred to as percutaneous coronary intervention, or PCI). The varying definitions of MACE included one related to safety only [composite of death, MI, and stent thrombosis (ST)], and two relating to both safety and effectiveness [composite of death, MI, ST, and either (1) target vessel revascularization (TVR) or (2) any repeat revascularization]. When comparing patients with and without acute MI, the three definitions of MACE yielded very different hazard ratios.
The safety-only definition of MACE yielded a hazard ratio of 1.75 (p<0.05), indicating that patients with acute MI were at greater risk of 1-year MACE. However, for the composite of safety and effectiveness endpoints, the risk of 1-year MACE was greatly attenuated and no longer statistically significant. Additionally, when comparing patients with single versus multiple lesions treated with PCI, the three definitions also yielded different results; while the safety-only composite endpoint demonstrated that there was no difference in 1-year MACE, adding TVR to the composite endpoint definition led to a hazard ratio of 1.4 (p<0.05) for multi-lesion PCI versus single-lesion PCI. This research serves as a cautionary tale for the creation and use of composite endpoints. Not only can varying definitions of composite endpoints such as MACE lead to substantially different results and conclusions; results must also be carefully interpreted, especially in the case where safety and effectiveness endpoints are combined.
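The sensitivity of results to the composite definition can be illustrated with a toy event-count sketch. The patient records and component sets below are hypothetical, loosely following the three MACE definitions discussed:

```python
# Hypothetical 1-year event indicators for five stent patients
patients = [
    {"death": 0, "mi": 0, "st": 0, "tvr": 1, "any_revasc": 1},
    {"death": 0, "mi": 1, "st": 0, "tvr": 0, "any_revasc": 0},
    {"death": 0, "mi": 0, "st": 0, "tvr": 0, "any_revasc": 1},
    {"death": 1, "mi": 0, "st": 0, "tvr": 0, "any_revasc": 0},
    {"death": 0, "mi": 0, "st": 0, "tvr": 0, "any_revasc": 0},
]

# Three composite definitions of MACE, from narrowest to broadest
definitions = {
    "safety only (death/MI/ST)":      ["death", "mi", "st"],
    "safety + TVR":                   ["death", "mi", "st", "tvr"],
    "safety + any revascularization": ["death", "mi", "st", "any_revasc"],
}

for name, components in definitions.items():
    # A patient counts as having MACE if any component event occurred
    n = sum(any(p[c] for c in components) for p in patients)
    print(f"{name}: {n}/{len(patients)} patients with MACE")
```

Even in this toy example the event proportion ranges from 2/5 to 4/5 depending on the definition; in a real comparison, the choice of definition can likewise shift effect estimates, as in the DEScover analyses.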

Intermediate Endpoints

The use of an intermediate or surrogate endpoint is more common in clinical trials than in observational studies. This type of endpoint is often a biological marker for the condition of interest, and may be used to reduce the followup period required to obtain results from a study of treatment effectiveness. An example would be the use of measures of serum lipids as endpoints in randomized trials of the effectiveness of statins, for which the major disease outcomes of interest to patients and physicians are a reduction in coronary heart disease incidence and mortality. The main advantages of intermediate endpoints are that the followup time required to observe possible effects of treatment on these outcomes may be substantially shorter than for the clinical outcome(s) of primary interest, and if they are measured on all patients, the number of outcomes for analysis may be larger. Much as with composite endpoints, using intermediate endpoints will increase study power for a given sample size as compared with outcomes that may be relatively rare, such as primary myocardial infarction. Surrogate or intermediate outcomes, however, may provide an incomplete picture of the benefits or risks. Treatment comparisons based on intermediate endpoints may differ in magnitude or direction from those based on major disease endpoints, as evidenced in a clinical trial of nifedipine versus placebo7-8 as well as other clinical trials of antihypertensive therapy.9 On one hand, nifedipine, a calcium channel blocker, was superior to placebo in reduction of onset of new coronary lesions; on the other hand, mortality was sixfold greater among patients who received nifedipine versus placebo.7
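The power advantage of a more frequent endpoint can be quantified with a standard two-proportion sample-size approximation. The event rates below are hypothetical, and the formula is the usual normal-approximation sketch, not the method of any cited study:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect event rates p1 vs. p2
    with a two-sided two-proportion z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Rare clinical endpoint (2% vs. 1%) versus a common intermediate
# endpoint showing the same halving of risk (40% vs. 20%)
print(n_per_arm(0.02, 0.01))  # over two thousand patients per arm
print(n_per_arm(0.40, 0.20))  # under one hundred per arm
```

The same relative treatment effect requires far fewer patients when the endpoint is common, which is precisely the statistical appeal of intermediate endpoints noted above.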

Freedman and colleagues have provided recommendations regarding the use of intermediate endpoints.10 Investigators should consider the degree to which the intermediate endpoint is reflective of the main outcome, as well as the degree to which effects of the intervention may be mediated through the intermediate endpoint. Psaty and colleagues have cautioned that because drugs have multiple effects, to the extent that a surrogate endpoint is likely to measure only a subset of those effects, results of studies based on surrogate endpoints may be a misleading substitute for major disease outcomes as a basis for choosing one therapy over another.9

Table 6.2. Clinical outcome definitions and objective measures

Conceptual | Temporal Aspects | Objective Measure
Incident invasive breast cancer | Incident | SEER or state cancer registry data
Myocardial infarction | Acute, transient (in regard to elevated troponin-I) | Review of laboratory test results for troponin and other cardiac enzymes for correspondence with a standard clinical definition
Psoriasis | Chronic, prevalent | Psoriasis Area Severity Index (PASI score) or percent body surface area assessment
Systemic lupus erythematosus (SLE) | Chronic condition with recurrent flares (episodes may have acute onset) | Systemic Lupus Erythematosus Disease Activity Index (SLEDAI)

Selection of Clinical Outcome Measures

Identification of a suitable measure of a clinical outcome for an observational CER study is a process in which various aspects of the nature of the disease or condition under study should be considered along with sources of information by which the required information may be feasibly and reliably obtained.

The choice of outcome measure may follow directly from the expected biological mechanism of action of the intervention(s) under study and its impact on specific medical conditions. For example, the medications tamoxifen and raloxifene are selective estrogen receptor modulators that act through binding to estrogen receptors to block the proliferative effect of estrogen on mammary tissue and reduce the long-term risk of primary and recurrent invasive and non-invasive breast cancer.11 Broader or narrower outcome definitions may be appropriate to specific research questions or designs. In some situations, however, the putative biologic mechanism may not be well understood. Nonetheless, studies addressing the clinical question of comparative effectiveness of treatment alternatives may still inform decisionmaking, and advances in understanding of the biological mechanism may follow discovery of an association through an observational CER study.

The selection of clinical outcome measures may be challenging when there are many clinical aspects that may be of interest, and a single measure or scale may not adequately capture the perspective of the clinician and patient. For example, in evaluating treatments or other interventions that may prolong the time between flares of systemic lupus erythematosus (SLE), researchers may use an index such as the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), which measures changes in disease activity, or the SLICC/ACR damage index, an instrument designed to assess accumulated damage since the onset of the disease.12-14 The SLEDAI has been tested in different populations and has demonstrated high reliability, evidence for validity, and responsiveness to change.15 Yet multiple clinical outcomes in addition to disease activity may be of interest in studying treatment effectiveness in SLE, such as reduction or increase in time to flare, reduction in corticosteroid use, or occurrence of serious acute manifestations (e.g., acute confusional state or acute transverse myelitis).16

Interactions With the Health Care System

For any medical condition, one should first determine the source of reporting or detection that may lead to initial contact with the medical system. The manner in which the patient presents for medical attention may provide insights as to data source(s) that may be useful in studying the condition. The decision whether to collect information directly from the physician, through medical record abstraction, directly from patients, and/or through use of electronic health records (EHRs) and/or administrative claims data will follow from this. For example, general hospital medical records are unlikely to provide the key components of an outcome such as respiratory failure, which requires information about use of mechanical ventilation. In contrast, hospital medical records are useful for the study of myocardial infarction, which must be assessed and treated in a hospital setting and is nearly always accompanied by an overnight stay. General practice physician office records and emergency department records may be useful in studying the incidence of influenza A or urticaria, with the choice between these sources depending on the severity of the condition. A prospective study may be required to collect clinical assessments of disease severity using a standard instrument, as these are not consistently recorded in medical practice and are not coded in administrative data sources. The chapter on data sources (chapter 8) provides additional information on selection of appropriate sources of data for an observational CER study.

Humanistic Outcomes

While outcomes of interest to patients generally include those of interest to physicians, payers, regulators, and others, they are often differentiated by two characteristics: (1) they are clinically meaningful with practical implications for disease recognition and management (i.e., patients generally have less interest in intermediate pathways with no clear clinical impact); and (2) they include reporting of outcomes based on a patient's unique perspective, e.g., patient-reported scales that indicate pain level, degree of functioning, etc. This section deals with measures of health-related quality of life (HRQoL) and the range of measures collectively described as patient-reported outcomes (PROs), which include measures of HRQoL. Other humanistic perspectives relevant to patients (e.g., economics, utilization of health services, etc.) are covered elsewhere.

Health-related quality of life (HRQoL) measures the impact of disease and treatment on the lives of patients and is defined as "the capacity to perform the usual daily activities for a person's age and major social role."17 HRQoL commonly includes physical functioning, psychological well-being, and social role functioning. This construct comprises outcomes from the patient perspective and is measured by asking patients or surrogate reporters about these outcomes.

HRQoL is an outcome increasingly used in randomized and non-randomized studies of health interventions, and as such FDA has provided clarifying definitions of HRQoL and of improvements in HRQoL. The FDA defines HRQoL as follows:

HRQL is a multidomain concept that represents the patient's general perception of the effect of illness and treatment on physical, psychological, and social aspects of life. Claiming a statistical and meaningful improvement in HRQL implies: (1) that all HRQL domains that are important to interpreting change in how the clinical trial's population feels or functions as a result of the targeted disease and its treatment were measured; (2) that a general improvement was demonstrated; and (3) that no decrement was demonstrated in any domain.18

Patient-Reported Outcomes

Patient-reported outcomes (PROs) include any outcomes that are based on data provided by patients or by people who can report on their behalf (proxies), as opposed to data from other sources.19 PROs refer to patient ratings and reports about any of several outcomes, including health status, health-related quality of life, quality of life defined more broadly, symptoms, functioning, satisfaction with care, and satisfaction with treatment. Patients can also report about their health behaviors, including adherence and health habits. Patients may be asked to directly report information about clinical outcomes or health care utilization and out-of-pocket costs when these are difficult to measure through other sources. The FDA defines a PRO as “a measurement based on a report that comes directly from the patient (i.e., study subject) about the status of a patient's health condition without amendment or interpretation of the patient's response by a clinician or anyone else. A PRO can be measured by self-report or by interview provided that the interviewer records only the patient's response.”18

In this section we focus mainly on the use of standard instruments for measurement of PROs, in domains including specific disease areas, health-related quality of life, and functioning. PRO measures may be designed to measure the current state of health of an individual or to measure a change in health state. PROs have similarities to other outcome variables measured in observational studies. They are measured with components of both random and systematic error (bias). To be most useful, it is important to have evidence about the reliability, validity, responsiveness, and interpretation of PRO measures, discussed further later in this section.

Types of Humanistic Outcome Measures

Generic Measures

Generic PRO questionnaires are measurement instruments designed to be used across different subgroups of individuals, and contain common domains that are relevant to almost all populations. They can be used to compare one population with another, or to compare scores in a specific population with normative scores. Many have been used for years, and have well established and well understood measurement properties.

Generic PRO questionnaires can focus on a comprehensive set of domains, or on a narrow range of domains such as symptoms or aspects of physical, mental, or social functioning. An example of a generic PRO measure is the Sickness Impact Profile (SIP), one of the oldest and most rigorously developed questionnaires, which measures 12 domains that are affected by illness.20 The SIP produces two subscale scores, one for physical and one for psychosocial health, and an overall score. Another questionnaire, the SF-36, measures eight domains: general health perceptions, pain, physical functioning, role functioning as limited by physical health, role functioning as limited by emotional problems, social functioning, mental health, and vitality.21 The SF-36 produces a Physical Component Score and a Mental Component Score.22 The EQ-5D is another generic measure of health-related quality of life, intended for self-completion, that generates a single index score. This scale defines health in terms of 5 dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression.

Each dimension has three response categories corresponding to no problem/some problem/extreme problem. Taken as a whole, the EQ-5D defines a total of 243 possible states, to which two further states (dead and unconscious) have been added.23 Another broadly used indicator of quality of life relates to the ability to work. The Work Productivity and Activity Impairment questionnaire (WPAI) was created as a patient-reported quantitative assessment of the amount of absenteeism, presenteeism, and daily activity impairment attributable to general health (WPAI:GH) or to a specific health problem (WPAI:SHP) (see below).24
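The EQ-5D's 243 states follow directly from its structure: five dimensions with three response levels each yield 3^5 combinations. A minimal sketch (the five-digit profile notation, one digit per dimension, is the EQ-5D convention; the enumeration itself is only illustrative):

```python
from itertools import product

# EQ-5D-3L structure: 5 dimensions, each with 3 response levels
# (1 = no problem, 2 = some problems, 3 = extreme problems).
LEVELS = (1, 2, 3)
N_DIMENSIONS = 5  # mobility, self-care, usual activities,
                  # pain/discomfort, anxiety/depression

# Each health state is a five-digit profile, e.g., "11111" (full health).
states = ["".join(map(str, combo))
          for combo in product(LEVELS, repeat=N_DIMENSIONS)]

print(len(states))             # 243, i.e., 3**5
print(states[0], states[-1])   # 11111 33333
```

The two added states (dead and unconscious) fall outside this profile system, which is why the instrument's full state count is 245.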

Examples of generic measures that assess a more restricted set of domains include the SCL-90 to measure symptoms,25 the Index of Activities of Daily Living to measure independence in basic functioning,26 the Psychological General Well-Being Index (PGWBI) to measure psychological well-being,27 and the Beck Depression Inventory to measure depressive symptoms.28

Disease- or Population-Specific Measures

Specific PRO questionnaires are sometimes referred to as “disease-specific.” While a questionnaire can be disease- or condition-specific (e.g., chronic heart failure), it can also be designed for use in a specific population (e.g., pediatric, geriatric), or for use to evaluate a specific treatment (e.g., renal dialysis). Specific questionnaires may be more sensitive to symptoms that are experienced by a particular group of patients. Thus, they are thought to detect differences and changes in scores when they occur in response to interventions.

Some specific measurement instruments assess multiple domains that are affected by a condition. For example, the Arthritis Impact Measurement Scales (AIMS) includes nine subscales that assess problems specific to the health-related quality of life of patients with rheumatoid arthritis and its treatments.29 The MOS-HIV Health Survey includes 10 domains that are salient for people with HIV and its treatments.30

Some of these measures take a modular approach, including a core measure that is used for assessment of a broader set of conditions, accompanied by modules that are specific to disease subtypes. For example, the FACIT and EORTC families of measures for evaluating cancer therapies each include a core module that is used for all cancer patients, and specific modules for each type of cancer, such as a module pertaining specifically to breast cancer.31-33

Other measures focus more narrowly on a few domains most likely to be affected by a disease, or most likely to improve with treatment. For example, the Headache Impact Test includes only six items.34 In contrast, other popular measures focus on symptoms that are affected by many diseases, such as the Brief Pain Inventory and the M.D. Anderson Symptom Inventory (MDASI), which measure the severity of pain and other symptoms and the impact of symptoms on function, and have been developed, refined, and validated in many languages and patient subgroups over three decades.35-36

It is possible, though not always advisable, to design a new PRO instrument for use in a specific study. The process of developing and testing a new PRO measure can be lengthy, generally requiring at least a year, and there is no guarantee that a new measure will work as well as more generic but better tested instruments. Nonetheless, it may be necessary in the case of an uncommon condition for which there are no existing PRO measures, for a specific cultural context that differs from those studied before, and/or to capture effects of new treatments that may require a different approach to measurement. However, when possible, it is still prudent in these cases to also include a PRO measure with evidence for reliability and validity, ideally in the target patient population, in case the newly designed instrument fails to work as intended. This approach also allows comparisons with the new measure, where the concepts being measured overlap, to assess its validity.

Item Response Theory (IRT) and Computer Adaptive Testing (CAT)

Item Response Theory (IRT) is a framework for the development of tests and measurement tools, and for the assessment of how well the tools work. Computer Adaptive Testing (CAT) represents an area of innovation in measuring PROs. CAT allows items to be selected to be administered so that questions are relevant to the respondent and targeted to the specific level of the individual, with the last response determining the next question that is asked. Behind the scenes, items are selected from “item banks,” comprising collections of dozens to hundreds of questions that represent the universe of potential levels of the dimension of interest, along with an indication of the relative difficulty or dysfunction that they represent. For example, the Patient-Reported Outcomes Measurement Information System (PROMIS) item bank for physical functioning includes 124 items that range in difficulty from getting out of bed to running several miles.37 This individualized administration can both enhance measurement precision and reduce respondent burden.38 Computer adaptive testing is based on IRT methods of scaling items and drawing subsets of items from a larger item bank.39 Considerations around adaptive testing involve balancing the benefit of tailoring the set of items and measurements to the specific individual with the risk of inappropriate targeting or classification if items answered incorrectly early on determine the later set of items to which a subject is able to respond. PROMIS40 is a major NIH initiative that leverages these desirable properties for PROs in clinical research and practice applications.
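The adaptive logic described above can be illustrated with a deliberately simplified sketch. It assumes a Rasch-type (1PL) scaling, in which the most informative item is the one whose difficulty is closest to the current ability estimate; the item bank, the bisection-style ability update, and the simulated respondent are all hypothetical illustrations, not the estimation methods PROMIS actually uses:

```python
def adaptive_test(item_bank, respond, n_items=5):
    """Minimal CAT loop under a Rasch-like (1PL) scaling: administer the
    unused item whose difficulty is closest to the current ability
    estimate (where item information peaks in the 1PL model), then nudge
    the estimate up or down with a shrinking, bisection-style step."""
    theta, step = 0.0, 1.0
    remaining = dict(item_bank)  # item name -> difficulty (logits)
    for _ in range(min(n_items, len(remaining))):
        name = min(remaining, key=lambda k: abs(remaining[k] - theta))
        difficulty = remaining.pop(name)
        theta += step if respond(name, difficulty) else -step
        step /= 2.0  # each response narrows the plausible ability range
    return theta

# Hypothetical physical-function item bank (difficulties in logits).
bank = {"get out of bed": -2.0, "walk one block": -1.0,
        "climb stairs": 0.0, "jog one mile": 1.5, "run several miles": 2.5}

# Simulated respondent who can do any activity easier than 0.5 logits.
estimate = adaptive_test(bank, lambda name, d: d < 0.5)
print(estimate)   # 0.6875 -- converging toward the respondent's true level
```

In five items the estimate homes in near the respondent's true level (0.5); a fixed-format questionnaire would need to administer the full bank to achieve comparable targeting, which is the respondent-burden advantage CAT exploits.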

Descriptive Versus Preference Format

Descriptive questionnaires ask about general or common domains and complaints, and usually provide multiple scores. Preference-based measures, generally referred to as utility measures, provide a single score, usually on a 0–1 scale, that represents the aggregate of multiple domains for an overall estimate of burden.

Most of the questionnaires familiar to clinical researchers fall into the category of descriptive measures, including all of those mentioned in the preceding paragraphs. Patients or other respondents are asked to indicate the extent to which descriptions of specific feelings, abilities, or behaviors apply to them. Utility measures are discussed further in the following section.

Other Attributes of PROs

Within each of the above options, there are several attributes of PRO instruments to consider. These include response format (numeric scales vs. verbal descriptors or visual analogue scales), the focus of what is being assessed (frequency, severity, impairment, all of the above), and recall period. Shorter, more recent recall periods more accurately capture the individual's actual experience, but may not provide as good an estimate of their typical activities or experiences. (For example, not everyone vacuums or has a headache every day.)

Content Validity

Content validity is the extent to which a PRO instrument covers the breadth and depth of salient issues for the intended group of patients. If a PRO instrument is not valid with respect to its content, there is an increased chance that it will fail to adequately capture the impact of an intervention. For example, in a study comparing the impact of different regimens for rheumatoid arthritis, a PRO that does not assess hand function could be judged to have poor content validity, and might fail to capture differences among therapies. The FDA treats content validity as of primary interest in assessing a PRO, with other measurement properties being secondary, and defines content validity as follows:

Evidence from qualitative research demonstrating that the instrument measures the concept of interest including evidence that the items and domains of an instrument are appropriate and comprehensive relative to its intended measurement concept, population, and use. Testing other measurement properties will not replace or rectify problems with content validity.18

Content validity is generally assessed qualitatively rather than statistically. It is important to understand and consider the population being studied, including their usual activities and problems, the condition (especially its impact on the patient's functioning), and the interventions being evaluated (including both their positive and adverse effects).

Responsiveness and Minimally Important Difference

Responsiveness is a measure of a PRO instrument's sensitivity to changes in health status or other outcome being measured. If a PRO is not sufficiently responsive, it may not provide adequate evidence of effectiveness in observational studies or clinical trials. Related to responsiveness is the minimally important difference that a PRO measure may detect. Both the patient's and the health care provider's perspectives are needed to determine if the minimally important difference detectable by an instrument is in fact of relevance to the patient's overall health status.41

Floor and Ceiling Effects

Poor content validity can also lead to a mismatch between the distribution of responses and the true distribution of the concept of interest in the population. For example, if questions in a PRO assessing the ability to perform physical activities are too “easy” relative to the level of ability in the population, then the PRO will not reflect the true distribution. This problem can present as a “ceiling” effect, where a large proportion of the sample clusters at the best possible score (e.g., reports no disability). Similarly, “floor” effects are seen when the questions are too “difficult” for the population, so that responses cluster at the worst possible score and again fail to reflect the true variability.
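A ceiling effect is easy to recognize in the response distribution itself. In this hypothetical example, a high-functioning sample answers a physical-function item that is too “easy” for them, and most responses pile up at the top score:

```python
from collections import Counter

# Hypothetical responses (0-10, 10 = best function) from a healthy,
# high-functioning sample given an item that is too "easy" for them.
responses = [10, 10, 10, 10, 10, 10, 10, 9, 8, 10]

counts = Counter(responses)
ceiling_share = counts[10] / len(responses)
print(ceiling_share)   # 0.8 -> 80% at the top score: a ceiling effect
```

With 80% of respondents at the maximum, the item cannot distinguish among most of the sample, and any improvement in this group would go undetected.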

Interpretation of PRO Scores

Clinicians and clinical researchers may be unfamiliar with how to interpret PRO scores. They may not understand or have reference to the usual distribution of scores of a particular PRO in a clinical or general population. Without knowledge of normal ranges, physicians may not know what score cutpoints indicate that action is warranted. Without reference values from a comparable population, researchers will not know whether an observed difference between two groups is meaningful, and whether a given change within or between groups is important. The task of understanding the meaning of scores is made more difficult by the fact that different PRO measurement tools tend to use different scoring systems. For most questionnaires, higher scores imply better health, but for some, a higher score is worse. Some scales are scored from 0 to 1, where 0=dead and 1=perfect health. Others are scored on a 0–100 scale, where 0 is simply the lowest attainable score (i.e., the respondent indicates the “worst” health state in response to all of the questions) and 100 is the highest. Still others are “normalized,” so that, for example, a score of 50 represents the mean score for the healthy or nondiseased population, with a standard deviation of 10 points. It is therefore crucial for researchers and users of PRO data to understand the scoring system being used for an instrument and its expected distributional properties.

For some PRO instruments, particularly generic questionnaires that have been applied to large groups of patients over many years, population norms have been collected and established. These can be used as reference points. Scoring also can be recalculated and “normalized” to a “T-score” so that a specific score (often 50 or 100) corresponds to the mean score for the population, and a specific number of points (often 5 or 10) corresponds to 1 standard deviation unit in that population.
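The T-score recalculation described above is a simple linear rescaling against the reference population's mean and standard deviation. A minimal sketch, using the common convention of mean 50 and SD 10; the reference-population values below are hypothetical:

```python
def to_t_score(raw, pop_mean, pop_sd, t_mean=50.0, t_sd=10.0):
    """Linear T-score transformation: re-express a raw score so that the
    reference population has mean t_mean and SD t_sd."""
    return t_mean + t_sd * (raw - pop_mean) / pop_sd

# Hypothetical reference population: mean raw score 24, SD 6.
print(to_t_score(24, 24, 6))   # 50.0 -> exactly at the population mean
print(to_t_score(30, 24, 6))   # 60.0 -> one SD above the mean
print(to_t_score(15, 24, 6))   # 35.0 -> 1.5 SD below the mean
```

Once scores are on this metric, a reader can interpret any value directly in SD units relative to the reference population, regardless of the instrument's original scoring range.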

Selection of a PRO Measure

There are a number of practical considerations to take into account when selecting PRO measures for use in a CER study. The measurement properties discussed in the preceding sections also require evaluation in all instances for the specific instrument selected, within a given population, setting, and intended purpose.

Population

It is important to understand the target population that will be completing the PRO assessment. These may range from individuals who can self-report, to individuals requiring the assistance of a proxy or medical professional (e.g., children, mentally or cognitively limited individuals, visually impaired individuals). Some respondents may be ambulatory individuals living in the community, whereas others may be inpatients or institutionalized individuals.

If a PRO questionnaire is to be used in non–English-speaking populations or in multiple languages, it is necessary to have versions appropriately adapted to language and culture. One should have evidence for the reliability and validity of the translated and culturally adapted version, as applied to the concerned population. One also should have data showing the comparability of performance across different language and cultural groups. This is of special importance when pooling data across language versions, as in a multinational clinical trial or registry study.

Burden

It is important to match the respondent burden created by a PRO instrument to the capacity of the population being studied. Patients with greater levels of illness or disability are less able to complete lengthy questionnaires. In some cases, the content or specific questions posed in a PRO may be upsetting or otherwise unacceptable to respondents. In other cases, a PRO questionnaire may be too cognitively demanding or written at a reading level above that of the intended population. The total burden of study-related data collection on patients and providers must also be considered, as an excessive number of forms to complete is likely to reduce compliance.

Cost and Copyright

Another practical consideration is the copyright status of a PRO being considered for use. Some PRO questionnaires are entirely in the public domain and are free for use. Others are copyrighted and require permission and/or the payment of fees for use. Some scales, such as the SF-12 and SF-36, require payment of fees for scoring.

Mode and Format of Administration

As noted above, there are various options for how a questionnaire should be administered and how the data should be captured, each method having both advantages and disadvantages. A PRO questionnaire can be (1) self-administered at the time of a clinical encounter, (2) administered by an interviewer at the time of a clinical encounter, (3) administered with computer assistance at the time of a clinical encounter, (4) self-administered by mail, (5) self-administered on-line, (6) interviewer-administered by telephone, or (7) computer-administered by telephone. Self-administration at the time of a clinical encounter requires little technology or up-front cost, but requires staff for supervision and data entry and can be difficult for respondents with limited literacy or sophistication. Face-to-face administration engages respondents and reduces their burden but requires trained interviewers. Computer-assisted administration provides an intermediate solution but also requires capital investment. Mailed surveys afford more privacy to respondents, but they generate mailing expenses and do not eliminate problems with literacy. Paper-based formats require data entry, scoring, and archiving and are prone to calculation errors. Online administration is relatively inexpensive, especially for large surveys, and surveys can be completed any time, but not all individuals have Internet access. Administration by live telephone interview is engaging and allows interviewer flexibility but is also expensive. “Cold calls” to potential study participants may result in low response rates, given the increased prevalence of caller ID screening systems and widespread skepticism about “telemarketing.”

Interactive voice response systems (or IVRS) can also be used to conduct telephone interviews, but it can be tedious to respond using the telephone key pad, and this format strikes some as impersonal.

Static Versus Dynamic Questionnaires

Static questionnaires employ a fixed set of questions and response options. They can be administered on paper, by interview, or through the Internet. Dynamic questionnaires select followup questions to administer based on the responses already given to previous questions. Because they are more efficient, more domains can be assessed for the same respondent burden.

Economic and Utilization Outcomes

While clinical outcomes represent the provider and professional perspective, and humanistic outcomes represent the patient perspective, economic outcomes, including measures of health resource utilization, represent the payer and societal perspective. In the United States, measures of cost and cost-effectiveness are often excluded from government-funded CER studies. However, these measures matter to a variety of stakeholders, such as payers and product manufacturers, and are routinely included in cost-effectiveness research in countries such as Australia, the United Kingdom, Canada, France, and Germany.42

Research questions addressing issues of cost-effectiveness and resource utilization may be formulated in a number of ways. Cost identification studies measure the cost of applying a specified treatment to a population under a certain set of conditions. These studies describe the cost incurred without comparison to alternative interventions.

Some cost identification studies describe the total costs of care for a particular population, whereas others isolate costs of care related to a specific condition; this latter approach requires that each episode of care be ascribed as having been related or unrelated to the illness of interest and involves substantial review.43 Cost-benefit studies express both costs and benefits in dollars or another currency. These studies compare the monetary costs of an intervention, relative to the standard of care, with the cost savings that result from the benefits of that treatment. In these studies, mortality is also assigned a dollar value, although techniques for assigning value to a human life are controversial. Cost-effectiveness is a relative concept; a cost-effectiveness analysis compares the costs and benefits of treatments in terms of a specified outcome, such as reduced mortality or morbidity, years of life saved, or infections averted.
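A cost-effectiveness comparison of this kind is often summarized as an incremental cost-effectiveness ratio (ICER): the extra cost of one treatment over another, divided by the extra benefit it produces. A minimal sketch; the cost and life-year figures below are hypothetical:

```python
def icer(cost_new, effect_new, cost_std, effect_std):
    """Incremental cost-effectiveness ratio: extra cost per extra unit
    of effect (e.g., per life-year gained) of the new treatment versus
    the standard of care."""
    return (cost_new - cost_std) / (effect_new - effect_std)

# Hypothetical: new treatment costs $50,000 and yields 4.5 life-years;
# standard care costs $20,000 and yields 4.0 life-years.
print(icer(50_000, 4.5, 20_000, 4.0))   # 60000.0 dollars per life-year gained
```

The resulting $60,000 per life-year gained can then be compared against a decision-maker's willingness-to-pay threshold for the outcome in question.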

Types of Health Resource Utilization and Cost Measures

Monetary Costs

Studies most often examine direct costs (i.e., the monetary costs of the medical treatments themselves, potentially including associated costs of administering treatment or conditions associated with treatment), but may also include measures of indirect costs (e.g., the costs of disability or loss of livelihood, both actual and potential). Multiple measures of costs are commonly included in any given study.

Health Resource Utilization

Measures of health resource utilization, such as number of inpatient or outpatient visits, total days of hospitalization in a given year, or number of days treated with IV antibiotics, are often used as efficient and easily interpretable proxies for cost. Actual costs depend on numerous factors (e.g., institutional overhead, volume discounts) and can be difficult to obtain, in part because negotiated prices are often confidential, reflecting business acumen in price negotiation. Costs may also vary by institution or location, such as the cost of a day in the hospital or of a medical procedure. Resource utilization measures may be preferred when a study is intended to yield results generalizable to health systems or reimbursement systems other than those under study, as they are not dependent on a particular reimbursement structure such as Medicare. Alternatively, a specific cost or reimbursement structure, such as the amount reimbursed by the Centers for Medicare and Medicaid Services (CMS) for specific treatment items, or average wholesale drug costs, may be applied to units of health resource use when conducting studies that pool data from different health systems.
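Applying a standard cost structure to counted resource units, as described above, reduces to a weighted sum of units times unit costs. A minimal sketch; the unit costs and utilization counts below are hypothetical, standing in for values from a fee schedule such as CMS reimbursement rates:

```python
# Hypothetical unit costs (e.g., from a standard fee schedule) applied
# to counted resource units, so results generalize across payers.
unit_costs = {"hospital_day": 2500.0,
              "outpatient_visit": 150.0,
              "ct_scan": 400.0}

# Hypothetical utilization for one patient over the study period.
utilization = {"hospital_day": 4, "outpatient_visit": 6, "ct_scan": 1}

total_cost = sum(n * unit_costs[item] for item, n in utilization.items())
print(total_cost)   # 11300.0
```

Because the same unit-cost table is applied to every patient and site, between-group differences reflect differences in resource use rather than in local prices.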

Utility and Preference-Based Measures

PROs and cost analyses intersect around the calculation of cost-utility. Utility measures are derived from economic and decision theory. The term utility refers to the value placed by the individual on a particular health state. Utility is summarized as a score ranging from 0.0 representing death to 1.0 representing perfect health.

In health economic analyses, utilities are used to justify devoting resources to a treatment. There are several widely used preference-based instruments that are used to estimate utility.

Preference measures are based on the fundamental concept that individuals or groups have reliable preferences about different health states. To evaluate those preferences, individuals rate a series of health states: for example, a person with specific levels of physical functioning (able to walk one block but not climb stairs), mental health (happy most of the time), and social role functioning (not able to work due to health). The task for the individual is to assign a degree of preference to that state directly. Methods for eliciting preferences include the Standard Gamble and Time Tradeoff;44-45 widely used preference-based instruments include the EQ-5D, also referred to as the EuroQol,23 the Health Utilities Index,46-47 and the Quality of Well-Being Scale.48

Quality-Adjusted Life Years (QALYs)

Utility scores associated with treatment can be used to weight the duration of life according to its quality, and thereby to generate QALYs. Utility scores are generally first ascertained directly in a sample of people with the condition in question, either cross-sectionally or over time within a clinical trial. Utility values are sometimes estimated indirectly using other sources of information about the health status of people in a population. The output produced by an intervention can then be calculated as the area under the curve of utility over time (the quality-adjusted survival curve).

For example, if the mean utility score for patients receiving antiretroviral treatment for HIV disease is 0.80, then the outcome for a treated group would be survival time multiplied by 0.80.
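For piecewise-constant utilities, the area-under-the-curve calculation reduces to a sum of duration-times-utility products. A minimal sketch, reusing the text's 0.80 utility figure; the 10-year survival and the second trajectory are hypothetical:

```python
def qalys(intervals):
    """QALYs as the area under the utility-over-time curve, computed
    for piecewise-constant utilities as a sum of years * utility."""
    return sum(years * utility for years, utility in intervals)

# The text's example: mean utility 0.80 on antiretroviral treatment,
# applied here to a hypothetical 10 years of survival.
print(round(qalys([(10.0, 0.80)]), 2))             # 8.0 QALYs

# A hypothetical two-phase trajectory: 2 years at 0.9, then 3 at 0.6.
print(round(qalys([(2.0, 0.9), (3.0, 0.6)]), 2))   # 3.6 QALYs
```

Note that the second trajectory yields fewer QALYs than 5 years of the first patient's survival would, even though the calendar time is the same: the utility weighting is doing the work.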

Disability-Adjusted Life Years (DALYs)

DALYs are another measure of overall disease burden expressed as the number of years lost to poor health, disability, or premature death.49 As with QALYs, mortality and morbidity are combined in a single metric. Potential years of life lost to premature death are supplemented with years of healthy life lost due to less than optimal health. Whereas 1 QALY corresponds to one year of life in optimal health, 1 DALY corresponds to one year of healthy life lost.

An important aspect of the calculation of DALYs is that the value assigned to each year of life depends on age. Years lived as a young adult are valued more highly than those spent as a young child or older adult, reflecting the different capacity for work productivity during different phases of life. DALYs are therefore estimated for different chronic illnesses by first calculating the age- and sex-adjusted incidence of disease. A DALY is calculated as the sum of the average years of life lost and the average years lived with a disability. For example, to estimate the years of healthy life lost in a region due to HIV/AIDS, one would first estimate the prevalence of the disease by age. The DALY value is then calculated by summing the average years of life lost and the average number of years lived with AIDS, weighted by a universal set of standard disability weights based on expert valuations.
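At its simplest, the DALY calculation described above combines years of life lost to premature death (YLL) with disability-weighted years lived with the condition (YLD). A minimal sketch; the figures and disability weight below are hypothetical, and the age weighting and discounting discussed in the text are omitted for clarity:

```python
def dalys(years_of_life_lost, years_lived_with_disability, disability_weight):
    """DALY = YLL + YLD: years lost to premature death plus years lived
    with the condition, weighted between 0 (full health) and 1 (death)."""
    yll = years_of_life_lost
    yld = years_lived_with_disability * disability_weight
    return yll + yld

# Hypothetical: 5 years of life lost to premature death plus 10 years
# lived with a condition assigned a disability weight of 0.3.
print(round(dalys(5.0, 10.0, 0.3), 2))   # 8.0 DALYs
```

Note the inverted orientation relative to QALYs: here the disability weight measures health *lost*, so a higher weight means a worse state, and interventions aim to reduce the DALY total.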

Selection of Resource Utilization and Cost Measures

The selection of measures of resource utilization or costs should correspond to the primary hypothesis in terms of the impact of an intervention. For example, will treatment reduce the need for hospitalization or result in a shorter length of stay? Or, will treatment or other intervention reduce complications that require hospitalization? Or, will a screening method reduce the total number of diagnostic procedures required per diagnosis?

It is useful to consider what types of costs are of interest to the investigators and to various stakeholders. Are total costs of interest, or costs associated with specific resources (e.g., prescription drug costs)? Are only direct costs being measured, or are you also interested in indirect costs such as those related to days lost from work?

When it is determined that results will be presented in terms of dollars rather than units of resources, several different methods can be applied. In the unusual case that an institution has a cost-accounting system, cost can be measured directly. In most cases, resource units are collected, and costs are assigned based on local or national average prices for the specific resources being considered, for example, reimbursement from CMS for a CT scan, or a hospital day. Application of an external standard cost system reduces variability in costs due to region, payer source, and other variables that might obscure the impact of the intervention in question.

Study Design and Analysis Considerations

Study Period and Length of Followup

In designing a study, the required study period and length of followup are determined by the time frame within which an intervention may be expected to affect the outcome of interest. A study comparing traditional with minimally invasive knee replacement surgery will need to follow subjects at least for the duration of the expected recovery time of 3 to 6 months or longer. The optimal duration of a study can be problematic when studying effects that may become manifest over a long time period, such as treatments to prevent or delay the onset of chronic disease. In these cases, data sources with a high degree of turnover of patients, such as administrative claims databases from managed care organizations, may not be suitable. For example, in the case of Alzheimer's disease, a record of health care is likely to be present in health insurance claims. However, as cognitive function declines, patients may lose the ability to work and may enter assisted care facilities, where utilization is not typically captured in large health insurance claims systems. Some studies may be undertaken for the purpose of determining how long an intervention can be expected to impact the outcome of interest. For example, various measures are used to aid in reducing obesity and in smoking cessation, and patients, health care providers, and payers are interested in knowing how long these interventions work (if at all), for whom, and in what situations.

Notwithstanding the limitations of intermediate endpoints (discussed in a preceding section), one of the main advantages of their use is the potential truncation of the required study followup period. Consider, for example, a study of the efficacy of the human papilloma virus vaccine, for which the major medical endpoint of interest is prevention of cervical cancer. The long latency period (more than 2 years, depending on the study population) and the relative infrequency of cervical cancer raise the possibility that intermediate endpoints should be used. Candidates might include new diagnoses of genital warts, or new diagnoses of the precancerous conditions cervical intraepithelial neoplasia (CIN) or vaginal intraepithelial neoplasia (VIN), which have shorter latency periods of less than 1 year or 2 years (minimum), respectively. Use of these endpoints would allow such a study to provide meaningful evidence informing the use of the HPV vaccine in a shorter timeframe, during which more patients might benefit from its use. Alternatively, if the vaccine is shown to be ineffective, this information could avoid years of unnecessary treatment and the associated costs as well as the costs of running a longer trial.

Avoidance of Bias in Study Design

Misclassification

The role of the researcher is to understand the extent and sources of misclassification in outcome measurement, and to try to reduce these as much as possible. To ensure comparability between treatment groups with as little misclassification (also referred to as measurement error) of outcomes as possible, a clear and objective (i.e., verifiable and not subject to individual interpretation insofar as possible) definition of the outcome of interest is needed. An unclear outcome definition can lead to misclassification and bias in the measure of treatment effectiveness. When the misclassification is nondifferential, or equivalent across treatment groups, the estimate of treatment effectiveness will be biased toward the null, reducing the apparent effectiveness of treatment, which may result in an erroneous conclusion that no effect (or one smaller than the true effect size) exists. When the misclassification differs systematically between treatment groups, it may distort the estimate of treatment effectiveness in either direction.
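The attenuation caused by nondifferential misclassification can be verified with a small numerical example. Assuming hypothetical true risks, and applying the same sensitivity and specificity of outcome ascertainment to both treatment groups:

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Apparent outcome risk after imperfect classification: true cases
    detected at `sensitivity`; non-cases falsely counted as cases at
    rate (1 - specificity)."""
    return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)

# Hypothetical true risks: 30% in the treated group, 10% in controls,
# so the true risk ratio is 3.0. The same sensitivity and specificity
# apply to both groups (nondifferential misclassification).
sens, spec = 0.8, 0.9
rr_true = 0.30 / 0.10
rr_obs = observed_risk(0.30, sens, spec) / observed_risk(0.10, sens, spec)

print(round(rr_true, 2))   # 3.0
print(round(rr_obs, 2))    # 1.82 -> attenuated toward the null (RR = 1)
```

Even with fairly good measurement (80% sensitivity, 90% specificity), the apparent risk ratio shrinks from 3.0 to about 1.8, illustrating how an unclear outcome definition can make an effective treatment look weaker than it is.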

For clinical outcomes, incorporation of an objective measure, such as a validated tool developed for use in clinical practice settings, or an adjudication panel that reviews outcomes against the predetermined definition of an event, increases the likelihood that outcomes will be measured and classified accurately and in a manner unlikely to vary according to who is doing the assessment. For PROs, measurement error can stem from several sources, including the way in which a question is worded and hence understood by a respondent, how the question is presented, the population being assessed, the literacy level of respondents, the language in which the questions are written, and the elements of culture that it represents.

To avoid differential misclassification of outcomes, care must also be taken to use the same methods of ascertainment and definitions of study outcomes whenever possible. For prospective or retrospective studies with contemporaneous comparators, this is usually not an issue, since it is most straightforward to utilize the same data sources and methods of outcome ascertainment for each comparison group. A threat to validity may arise in use of a historical comparison group, which may be used in certain circumstances. For example, this occurs when a new treatment largely displaces use of an older treatment within a given indication, but further evidence is needed for the comparative effectiveness of the newer and older treatments, such as enzyme replacement for lysosomal storage disorders. In such instances, use of the same or similar data sources and equivalent outcome definitions to the extent possible will reduce the likelihood of bias due to differential outcome ascertainment.

Other situations that may give rise to differential misclassification of outcomes include: when investigators are not blinded to the study hypothesis and “rule-out” diagnoses are more common in those with a particular exposure of interest; when screening for or detection of outcomes is more common or more aggressive with one treatment than another (i.e., surveillance bias; e.g., when liver function tests are preferentially performed in patients using a new drug compared with other treatments for that condition); and when loss to followup is related to the risk of experiencing the outcome. For example, once a safety signal has been identified and publicized, physicians are alerted and may look more proactively for the corresponding clinical signs and symptoms in treated patients. This effect can be even more pronounced for products subject to controlled distribution or Risk Evaluation and Mitigation Strategies (REMS). Consider clozapine, an antipsychotic used for schizophrenia that is subject to controlled distribution through a “no blood, no drug” monitoring program; the required blood testing was implemented to detect early development of agranulocytosis. When comparing patients treated with clozapine with those treated with other antipsychotics, those using clozapine may appear to have a worse safety profile with respect to this outcome simply because it is detected more completely.

Sensitivity analyses may be conducted in order to estimate the impact of different levels of differential or nondifferential misclassification on effect estimates from observational CER studies. These approaches are covered in detail in chapter 11.
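One simple form of such a sensitivity analysis can be sketched as follows: if the sensitivity and specificity of outcome ascertainment can be estimated (e.g., from a validation substudy), observed event proportions can be back-corrected using the identity p_obs = sens × p_true + (1 − spec) × (1 − p_true), solved for p_true. The numbers below are hypothetical:

```python
def corrected_risk(p_obs, sens, spec):
    """Back-correct an observed event proportion for assumed
    sensitivity and specificity of outcome ascertainment, using
        p_obs = sens * p_true + (1 - spec) * (1 - p_true)
    solved for p_true."""
    return (p_obs + spec - 1) / (sens + spec - 1)

# Hypothetical observed risks in two arms, assuming sens = 0.8, spec = 0.9:
for arm, p_obs in [("treated", 0.17), ("control", 0.135)]:
    print(arm, round(corrected_risk(p_obs, sens=0.8, spec=0.9), 3))
# treated -> 0.1, control -> 0.05: the corrected risk ratio returns to 2.0
```

In practice one would repeat the correction over a plausible range of sensitivity and specificity values to see how stable the effect estimate is.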

Validation and Adjudication

In some instances, additional information must be collected (usually from medical records) to validate the occurrence of the outcome of interest, including to exclude erroneous or “rule-out” diagnoses. This is particularly important for medical events identified in administrative claims databases, in which a diagnosis code associated with a medical encounter may represent a “rule-out” diagnosis or a condition that does not map to a specific diagnosis code. For some complex diagnoses, such as unstable angina, a standard clinical definition must be applied by an adjudication panel that has access to detailed records, including subjects' relevant medical history, symptomatic presentation, diagnostic work-up, and treatment. Methods of validation and adjudication of outcomes strengthen the internal validity, and therefore the evidence that can be drawn from a CER study, but they are resource-intensive.

Issues Specific to PROs

PROs are prone to several specific sources of bias. Self-reports of health status are likely to differ systematically from reports by surrogates, who, for example, are likely to report less pain than the individuals themselves.50 Some biases may be population-dependent. For example, some populations may be more prone to acquiescence bias (agreeing with the statements in a questionnaire) or social desirability bias (answering in a way that would cast the respondent in the best light).51 In some situations, however, a PRO may be the most useful marker of disease activity, such as with episodic conditions that cause short-duration disease flares, such as low back pain and gout, where patients may not present for health care immediately, if at all.

The goal of the researcher is to understand and reduce sources of bias, considering those most likely to apply in the specific population and topics under study. In the case of well-understood systematic biases, adjustments can be made so that distributions of responses are more consistent. In other cases, redesigning items and scales, for example by including both positively and negatively worded items, can reduce specific kinds of bias.

Missing data, an issue covered in more detail in chapter 10, pose a particular problem with PROs, since PRO data are usually not missing at random. Instead, respondents whose health is poorer are more likely to fail to complete an assessment. Another special case of missing data occurs when a patient dies and is therefore unable to complete an assessment. If this issue is not taken into account in the data analysis, and scores are recorded only for living patients, incorrect conclusions may be drawn. Strategies for handling this type of missing data include selection of an instrument that incorporates a score for death, such as the Sickness Impact Profile,20,52 or the Quality of Well-Being Scale,48 or use of an analytic strategy that allows for some missing values.

Failure to account for missing PRO data that are related to poor health or death will lead to an overestimate of the health of the population based on responses from subjects who do complete PRO forms. Therefore, in research using PROs, it is very important to understand the extent and pattern of missing data, both at the level of the individual as well as for specific items or scales on an instrument.53
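A toy example (with invented scores) shows the direction of this bias, alongside a crude version of the score-for-death strategy mentioned above:

```python
# Hypothetical follow-up PRO scores on a 0-100 scale (higher = better health).
# None marks an assessment that is missing because the patient died.
scores = [85, 78, None, 62, None, 90, 55, None, 70, 81]

# Naive analysis: average only the completed assessments.
observed = [s for s in scores if s is not None]
naive_mean = sum(observed) / len(observed)

# Crude adjustment: assign a floor score of 0 to deaths, analogous to
# instruments that incorporate an explicit score for death.
imputed = [s if s is not None else 0 for s in scores]
adjusted_mean = sum(imputed) / len(imputed)

print(round(naive_mean, 1))     # 74.4: overstates the cohort's health
print(round(adjusted_mean, 1))  # 52.1
```

The naive mean describes only survivors who completed the form; the adjusted mean, whatever its other limitations, at least reflects the whole cohort.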

A strategy for handling missing data should be put in place when developing the study protocol and analysis plans. Strategies that pertain to the use of PROs in research are discussed in further detail in publications such as the book by Fairclough and colleagues.

Analytic Considerations

Form of Outcome Measure and Analysis Approach

To a large extent, the form of the primary outcome of interest—that is, whether the outcome is measured and expressed as a dichotomous or polytomous categorical variable or a continuous variable, and whether it is to be measured at a single time point, measured repeatedly at fixed intervals, or measured repeatedly at varying time intervals—determines the appropriate statistical methods that may be applied in analysis. These topics are covered in detail in chapter 10.
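As a rough aide-mémoire (an illustrative mapping only, not a substitute for the detailed guidance in chapter 10), this correspondence can be sketched as:

```python
# Illustrative, non-exhaustive mapping from the form of the outcome
# measure to commonly used analysis approaches.
ANALYSIS_BY_OUTCOME = {
    "dichotomous": "logistic regression / risk ratio models",
    "polytomous": "multinomial or ordinal logistic regression",
    "continuous": "linear regression",
    "time-to-event": "Cox proportional hazards / survival models",
    "repeated measures": "mixed-effects or GEE models",
}

def suggest_analysis(outcome_form: str) -> str:
    """Return a typical analysis family for an outcome form."""
    return ANALYSIS_BY_OUTCOME.get(outcome_form, "see chapter 10")

print(suggest_analysis("time-to-event"))
```

The lookup is deliberately simplistic; in a real protocol, the choice also depends on confounding control, censoring, and the data source.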

Sensitivity Analysis

One of the key factors to address in planned sensitivity analyses for an observational CER study is how varying definitions of the study outcome, or related outcomes, will affect the measures of association from the study. These investigations include: assessing multiple related outcomes within a disease area (e.g., multiple measures of respiratory function such as FEV1, FEV1% predicted, and FVC in studies of asthma treatment effectiveness in children); assessing the effect of different cutoffs for dichotomized continuous outcome measures (e.g., the use of Systemic Lupus Erythematosus Disease Activity Index-2000 scores to define active disease in lupus treatment studies54); and using different sets of diagnosis codes to capture a condition, such as influenza and related respiratory conditions, in administrative data. These and other considerations for sensitivity analyses are covered in detail in chapter 11.
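The cutoff-variation idea can be sketched with invented disease-activity scores; the point is simply that the proportion classified as having the outcome, and hence any downstream effect estimate, shifts with the chosen definition:

```python
# Hypothetical disease-activity scores for a small cohort.
scores = [0, 2, 3, 4, 4, 5, 6, 8, 10, 12]

def proportion_active(scores, cutoff):
    """Proportion of subjects classified as having 'active disease'
    when the continuous score is dichotomized at `cutoff`."""
    return sum(s >= cutoff for s in scores) / len(scores)

# Sensitivity analysis over candidate cutoffs:
for cutoff in (3, 4, 6):
    print(cutoff, proportion_active(scores, cutoff))
# 3 -> 0.8, 4 -> 0.7, 6 -> 0.4
```

Reporting results under each candidate cutoff lets readers judge whether conclusions hinge on one particular outcome definition.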

Conclusion

Future Directions

Increased use of EHRs as a source of data for observational research, including registries, other types of observational studies, and specifically CER, has prompted initiatives to develop standardized definitions of key outcomes and other data elements for use across health systems and different EHR platforms, to facilitate comparisons between studies and pooling of data. The National Cardiovascular Research Infrastructure partnership between the American College of Cardiology and Duke Clinical Research Institute, which received American Recovery and Reinvestment Act funding to establish interoperable data standards based on the National Cardiovascular Data Registry, is an example of such a current activity.55

Summary

This chapter has provided an overview of considerations in development of outcome definitions for observational CER studies; has described implications of the nature of the proposed outcomes for the study design; and has enumerated issues of bias that may arise in incorporating the ascertainment of outcomes into observational research. It has also suggested means of preventing or reducing these biases.

Development of clear and objective outcome definitions that correspond to the nature of the hypothesized treatment effect and address the research questions of interest, along with validation of outcomes where warranted, or use of standardized PRO instruments validated for the population of interest, contributes to the internal validity of observational CER studies. Attention to collection of outcome data in an equivalent manner across treatment comparison groups is also required. Use of analytic methods suited to the outcome measure, and sensitivity analyses addressing varying definitions of at least the primary study outcomes, are needed to make inferences drawn from such studies more robust and reliable.

Checklist: Guidance and key considerations for outcome selection and measurement for an observational CER protocol

Guidance: Propose primary and secondary outcomes that directly correspond to research questions.
Key considerations:
- Followup period should be sufficient to observe hypothesized effects of treatment on primary and secondary outcomes.

Guidance: Provide clear and objective definitions of clinical outcomes.
Key considerations:
- Outcomes should reflect the hypothesized mechanism of effect of treatment, if known.
- Provide justification that the outcome is reliably ascertained without additional validation, when applicable and feasible, or propose validation and/or adjudication of endpoints.
- If an intermediate (surrogate) endpoint is proposed, provide justification for why the main disease outcome of interest is not being used, and show that the intermediate endpoint reflects the expected pathway of the effect of treatment on the main outcome of interest.

Guidance: Provide clear and relevant definitions of cost or health resource utilization outcomes.
Key considerations:
- Outcomes chosen should reflect the hypothesized effect of treatment on specific components of medical cost and/or resource utilization, if known.
- Outcomes should be measurable directly, or via proxy, from the data sources proposed for the study.
- For costs, consider proposing standard benchmark costs to be applied to units of resource utilization, especially when multiple health systems, payment systems, and/or geographic regions are included in the study population or data source.

Guidance: Describe a plan for use of a validated, standard instrument for measurement of patient-reported outcomes.
Key considerations:
- The instrument chosen should reflect the hypothesized effect of treatment on specific aspects of disease symptoms or treatment, or on quality of life, if known.
- Propose use of a standard instrument that has been validated in a population representative of the study population, when possible.
- Have the instrument validated for use in translation to other specific languages if it is intended to be used in those languages for the study, when possible.
- Have the instrument validated for the intended mode of administration, when possible.

Guidance: Address issues of bias expected to arise, and propose means of bias minimization.
Key considerations:
- Describe potential issues of bias, misclassification, and missing data that may be expected to occur with the proposed outcomes, including those specific to PRO data.
- Provide a plan for minimizing the potential bias, misclassification, and missing data issues identified.

Guidance: Propose analytic methods that correspond to the nature of the outcome measure (e.g., continuous, categorical [dichotomous, polytomous, or ordinal], repeated measures, time-to-event).

Guidance: Plan sensitivity analyses relating to expected questions that arise around the study outcomes.
Key considerations:
- Propose sensitivity analyses that address different relevant definitions of the study outcome(s) or multiple related outcomes (e.g., different measures of subclinical and clinical cardiovascular disease).

References

1.

Wilson IB, Cleary PD. Linking clinical variables with health-related quality of life. A conceptual model of patient outcomes. JAMA. 1995 Jan 4;273:59–65. [PubMed: 7996652]

2.

Kozma CM, Reeder CE, Schultz RM. Economic, clinical, and humanistic outcomes: a planning model for pharmacoeconomic research. Clin Ther. 1993;15(6):1121–32. [PubMed: 8111809]

3.

Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 4th ed. Oxford University Press; 2008.

4.

Fredriksson T, Pettersson U. Severe psoriasis--oral therapy with a new retinoid. Dermatologica. 1978;157(4):238–44. [PubMed: 357213]

5, 6.

Kip KE, Hollabaugh K, Marroquin OC, et al. The problem with composite end points in cardiovascular studies. The story of major adverse cardiac events and percutaneous coronary intervention. J Am Coll Cardiol. 2008;51:701–7. [PubMed: 18279733]

7.

Lichtlen PR, Hugenholtz PG, Rafflenbeul W, et al. Retardation of angiographic progression of coronary artery disease by nifedipine: results of the International Nifedipine Trial on Antiatherosclerotic Therapy (INTACT). Lancet. 1990;335:1109–13. [PubMed: 1971861]

8.

Psaty BM, Siscovick DS, Weiss NS, et al. Hypertension and outcomes research. From clinical trials to clinical epidemiology. Am J Hypertens. 1996;9:178–83. [PubMed: 8924268]

9.

Psaty BM, Lumley T. Surrogate end points and FDA approval: a tale of 2 lipid-altering drugs. JAMA. 2008;299(12):1474–6. [PubMed: 18364491]

10.

Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Stat Med. 1992 Jan 30;11(2):167–78. [PubMed: 1579756]

11.

Vogel VG, Costantino JP, Wickerham DL, et al. Update of the National Surgical Adjuvant Breast and Bowel Project Study of Tamoxifen and Raloxifene (STAR) P-2 trial: preventing breast cancer. Cancer Prev Res (Phila). 2010 Jun;3(6):696–706. [PMC free article: PMC2935331] [PubMed: 20404000]

12.

Gladman DD, Urowitz MB. The SLICC/ACR damage index: progress report and experience in the field. Lupus. 1999;8:632–7. [PubMed: 10568900]

13.

Bombardier C, Gladman DD, Urowitz MB, et al. the Committee on Prognosis Studies in SLE. Derivation of the SLEDAI: a disease activity index for lupus patients. Arthritis Rheum. 1992;35:630–40. [PubMed: 1599520]

14.

Gladman DD, Ibanez D, Urowitz MB. Systemic lupus erythematosus disease activity index 2000. J Rheumatol. 2002;29:288–91. [PubMed: 11838846]

15.

Griffiths B, Mosca M, Gordon C. Assessment of patients with systemic lupus erythematosus and the use of lupus disease activity indices. Best Pract Res Clin Rheumatol. 2005 Oct;19(5):685–708. [PubMed: 16150398]

16, 17.

Guyatt GH, Feeny DH, Patrick DL. Measuring health-related quality of life. Ann Intern Med. 1993 Apr 15;118(8):622–9. [PubMed: 8452328]

18, 19.

Acquadro C, Berzon R, Dubois D, et al. Incorporating the patient's perspective into drug development and communication: an ad hoc task force report of the Patient-Reported Outcomes (PRO) Harmonization Group meeting at the Food and Drug Administration, February 16, 2001. Value Health. 2003 Sep;6:522–31. [PubMed: 14627058]

20.

Bergner M, Bobbitt RA, Carter WB, et al. The Sickness Impact Profile: development and final revision of a health status measure. Med Care. 1981 Aug;19(8):787–805. [PubMed: 7278416]

21.

Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care. 1992 Jun;30(6):473–83. [PubMed: 1593914]

22.

Ware JE Jr, Kosinski M, Bayliss MS, et al. Comparison of methods for the scoring and statistical analysis of SF-36 health profile and summary measures: summary of results from the Medical Outcomes Study. Med Care. 1995 Apr;33(4 Suppl):AS264–79. [PubMed: 7723455]

23.

EuroQol--a new facility for the measurement of health-related quality of life. The EuroQol Group. Health Policy. 1990 Dec;16(3):199–208. [PubMed: 10109801]

24.

Reilly MC, Zbrozek AS, Dukes EM. The validity and reproducibility of a work productivity and activity impairment instrument. Pharmacoeconomics. 1993 Nov;4(5):353–65. [PubMed: 10146874]

25.

Derogatis LR, Cleary PA. Factorial invariance across gender for the primary symptom dimensions of the SCL-90. Br J Soc Clin Psychol. 1977 Nov;16(4):347–56. [PubMed: 588890]

26.

Katz S, Akpom CA. 12. Index of ADL. Med Care. 1976 May;14(5 Suppl):116–8. [PubMed: 132585]

27.

Dupuy HJ. The Psychological General Well-Being (PGWB) Index. In: Wenger NK, Mattson ME, Furberg CD, et al., editors. Assessment of Quality of Life in Clinical Trials of Cardiovascular Therapies. Chap 9. Le Jacq Publishing; 1984. pp. 170–83.

28.

Beck AT, Ward CH, Mendelson M, et al. An inventory for measuring depression. Arch Gen Psychiatry. 1961;4:561–71. [PubMed: 13688369]

29.

Meenan RF, Gertman PM, Mason JH. Measuring health status in arthritis. The arthritis impact measurement scales. Arthritis Rheum. 1980 Feb;23(2):146–52. [PubMed: 7362665]

30.

Wu AW, Revicki DA, Jacobson D, et al. Evidence for reliability, validity and usefulness of the Medical Outcomes Study HIV Health Survey (MOS-HIV). Qual Life Res. 1997 Aug;6(6):481–93. [PubMed: 9330549]

31.

Cella D, Nowinski CJ. Measuring quality of life in chronic illness: the functional assessment of chronic illness therapy measurement system. Arch Phys Med Rehabil. 2002 Dec;83(12 Suppl 2):S10–7. [PubMed: 12474167]

32.

Aaronson NK, Ahmedzai S, Bergman B, et al. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst. 1993 Mar 3;85(5):365–76. [PubMed: 8433390]

33.

Sprangers MA, Cull A, Groenvold M, et al. EORTC Quality of Life Study Group. The European Organization for Research and Treatment of Cancer approach to developing questionnaire modules: an update and overview. Qual Life Res. 1998 May;7(4):291–300. [PubMed: 9610213]

34.

Kosinski M, Bayliss MS, Bjorner JB, et al. A six-item short-form survey for measuring headache impact: the HIT-6. Qual Life Res. 2003 Dec;12(8):963–74. [PubMed: 14651415]

35.

Cleeland CS. Symptom burden: multiple symptoms and their impact as patient-reported outcomes. J Natl Cancer Inst Monogr. 2007;37:16–21. [PubMed: 17951226]

36.

Cleeland CS, Ryan KM. Pain assessment: global use of the Brief Pain Inventory. Ann Acad Med Singapore. 1994;23(2):129–38. [PubMed: 8080219]

37.

Hung M, Clegg DO, Greene T, et al. Evaluation of the PROMIS physical function item bank in orthopaedic patients. J Orthop Res. 2011;29(6):947–53. [PubMed: 21437962]

38.

Bjorner JB, Chang CH, Thissen D, et al. Developing tailored instruments: item banking and computerized adaptive assessment. Qual Life Res. 2007;16 Suppl 1:95–108. [PubMed: 17530450]

39.

Reise SP. Item response theory: fundamentals, applications, and promise in psychological research. Current Directions in Psychological Science. 2005 April;14(2):95–101.

40.

Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007 May;45:S22–S31. [PubMed: 17443115]

41.

Revicki DA, Cella D, Hays RD, et al. Responsiveness and minimal important differences for patient reported outcomes. Health Qual Life Outcomes. 2006;4:70. [PMC free article: PMC1586195] [PubMed: 17005038]

42.

Chalkidou K, Tunis S, Lopert R, et al. Comparative effectiveness research and evidence-based health policy: experience from four countries. Milbank Q. 2009 Jun;87(2):339–67. [PMC free article: PMC2881450] [PubMed: 19523121]

43.

Lanes SF, Lanza LL, Radensky, et al. Resource utilization and cost of care for rheumatoid arthritis and osteoarthritis in a managed care setting: the importance of drug and surgery costs. Arthritis and Rheumatism. 1997;40(8):1475–81. [PubMed: 9259428]

44.

Torrance GW. Measurement of health state utilities for economic appraisal. J Health Econ. 1986 Mar;5(1):1–30. [PubMed: 10311607]

45.

Torrance GW. Utility approach to measuring health-related quality of life. J Chronic Dis. 1987;40(6):593–603. [PubMed: 3298297]

46.

Feeny D, Furlong W, Boyle M, et al. Multi-attribute health status classification systems. Health Utilities Index. Pharmacoeconomics. 1995 Jun;7(6):490–502. [PubMed: 10155335]

47.

Feeny D, Furlong W, Saigal S, et al. Comparing directly measured standard gamble scores to HUI2 and HUI3 utility scores: group- and individual-level comparisons. Soc Sci Med. 2004 Feb;58(4):799–809. [PubMed: 14672594]

48.

Kaplan RM, Anderson JP. The General Health Policy Model: an integrated approach. In: Lenert L, Kaplan RM. Validity and interpretation of preference-based measures of health-related quality of life. Med Care. 2000 Sep;38:II138–II150. [PubMed: 10982099]

49.

Murray CJ. Quantifying the burden of disease: the technical basis for disability-adjusted life years. Bull World Health Organ. 1994;72(3):429–45. [PMC free article: PMC2486718] [PubMed: 8062401]

50.

Wilson KA, Dowling AJ, Abdolell M, et al. Perception of quality of life by patients, partners and treating physicians. Qual Life Res. 2000;9(9):1041–52. [PubMed: 11332225]

51.

Ross CK, Steward CA, Sinacore JM. A comparative study of seven measures of patient satisfaction. Med Care. 1995 Apr;33(4):392–406. [PubMed: 7731280]

52.

Bergner M, Bobbitt RA, Pollard WE, et al. The sickness impact profile: validation of a health status measure. Med Care. 1976 Jan;14:57–67. [PubMed: 950811]

53.

Fairclough DL. Design and Analysis of Quality of Life Studies in Clinical Trials. 2nd ed. Boca Raton: Chapman and Hall/CRC Press; 2010.

54.

Yee CS, Farewell VT, Isenberg DA, et al. The use of Systemic Lupus Erythematosus Disease Activity Index-2000 to define active disease and minimal clinically meaningful change based on data from a large cohort of systemic lupus erythematosus patients. Rheumatology (Oxford). 2011 May;50(5):982–8. [PMC free article: PMC3077910] [PubMed: 21245073]

55.

National Cardiovascular Research Infrastructure (NCRI). [February 3, 2012]. Available at: https://www.ncrinetwork.org/
