Designing Likert scales: eight guidelines

Within the Higher Education (HE) sector, we often use surveys to measure student aspirations, intermediate outcomes and specific constructs of interest, such as sense of belonging. Likert scales are one of the most commonly used survey designs. Although these scales may look very familiar, it is worth giving careful thought to their design, use and implementation to ensure that the data gathered offer meaningful insight. To simplify the process, we have created eight guidelines for Widening Participation (WP) teams and evaluators to follow when working with Likert scales.

This TASO webinar on validity and reliability may also be a useful resource for understanding the following guidelines.

Why use a Likert scale?

The Likert scale, created in 1932 by Rensis Likert, is one of the most popular types of survey scales. Its main purpose is to assess how much the respondent agrees or disagrees with questions or statements related to specific constructs, such as sense of belonging or self-esteem. Likert scales are ‘summated’ scales, meaning that respondents’ answers on each item are aggregated to obtain a multifaceted measurement of the construct of interest. In other words, Likert scales, when designed adequately, provide a simple and consistent measurement, which is particularly useful when assessing complex constructs.
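To make the ‘summated’ logic concrete, here is a minimal sketch in Python of how one respondent’s answers might be aggregated into a scale score. The five items, the 1-5 coding scheme and the responses are all hypothetical illustrations, not part of any published scale.

    # A minimal sketch of scoring a summated Likert scale.
    # Items, coding and responses are hypothetical illustrations.

    # One respondent's answers to a five-item scale,
    # coded 1 ('strongly disagree') to 5 ('strongly agree').
    responses = {
        "item_1": 4,
        "item_2": 5,
        "item_3": 3,
        "item_4": 4,
        "item_5": 5,
    }

    # The summated score aggregates all item answers; the mean
    # keeps the score on the original 1-5 metric.
    total_score = sum(responses.values())      # 21, possible range 5-25
    mean_score = total_score / len(responses)  # 4.2, possible range 1-5

    print(f"Total: {total_score}, Mean: {mean_score:.1f}")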

Likert scales are commonly used in the fields of psychology and education research. Outside of the research sphere, they are employed to collect opinions about services such as public facilities or helplines; most people will have encountered this kind of response scale when rating their experience in an airport terminal or similar setting.

How to design Likert scales

There are eight guidelines we think it’s useful to consider when designing a Likert scale.

1. Develop a clear understanding of what you want to measure

An obvious but essential first recommendation is to thoroughly review the existing literature on the construct of interest. This is necessary to develop a good definition, clearly positioned within the existing literature and related to what we are hoping to assess (Artino et al., 2014). For example, if you would like to design a new scale to measure self-esteem, it is wise to review the literature and existing self-esteem scales to understand how the construct relates to your research question. This preliminary research can prevent misconceptions, such as the idea that high self-esteem improves academic performance, which is discredited in the academic literature (Baumeister et al., 2003).

Survey designers should also investigate how the construct of interest has been measured in the past. It can sometimes be more efficient and reliable to adapt existing scales or items than to start a new questionnaire from scratch. This approach also increases consistency between studies and thus simplifies comparisons. That said, the specific context, sample and purpose of each study have to be taken into account.

2. Design easy-to-understand items

Each item on a scale is the representation of one part of the construct of interest. For example, the item – ‘on the whole, I am satisfied with myself’ – is one part of the construct self-esteem on the Rosenberg Self Esteem Scale (Rosenberg, 1965).

When drafting items, survey designers have to stay close to the theory and put themselves in respondents’ shoes. Their main objectives should be simplifying language and minimising ambiguity. In practice, this means:

  • Avoiding technical language (e.g., acronyms) and complex grammatical constructions that respondents might not understand quickly or accurately;
  • Avoiding ‘double-barrelled questions’ (i.e., questions using conjunctions such as and/or/but) where respondents are asked to answer two different questions in one, making it impossible for the researcher to know which one they are actually answering.

Even if they follow these principles, survey designers might still fail to detect misalignments between their intention and respondents’ interpretation. Cognitive interviewing can solve these issues by testing whether respondents correctly and consistently interpret the Likert scale. In practice, interviewees are asked to describe their thinking as they answer each question or after they complete the questionnaire (Peterson et al., 2017).

3. Decide on the number of items needed to measure each construct

Survey designers should keep in mind that there are practical limitations to the number of items that should be included in a questionnaire. Respondents have limited time and cognitive resources to allocate to a survey. Adding too many items can therefore decrease response rates and survey reliability.

That is why item selection is a crucial and challenging step in survey design. Nemoto & Beglar (2013) recommend piloting about ten to twelve items, and selecting the best performing six to eight. Other researchers indicate that the ‘sweet spot’ is around five items per construct (Weng, 2004; Nielsen et al., 2017).
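The article does not prescribe a particular statistic for identifying the ‘best performing’ items, but one common approach after a pilot is to rank items by their corrected item-total correlation. The Python sketch below uses randomly generated pilot data purely for illustration, so treat it as an assumption-laden example of one possible method, not a definitive procedure.

    # Sketch: rank piloted items by corrected item-total correlation
    # (each item's correlation with the sum of the remaining items)
    # and keep the strongest six. Pilot data are fabricated.
    import numpy as np

    rng = np.random.default_rng(0)
    n_respondents, n_items = 50, 10
    # Fake pilot responses on a 1-5 scale (rows = respondents).
    pilot = rng.integers(1, 6, size=(n_respondents, n_items))

    def corrected_item_total(data: np.ndarray) -> np.ndarray:
        """Correlation of each item with the total of the other items."""
        totals = data.sum(axis=1)
        return np.array([
            np.corrcoef(data[:, j], totals - data[:, j])[0, 1]
            for j in range(data.shape[1])
        ])

    correlations = corrected_item_total(pilot)
    # Keep the six items that track the rest of the scale most closely.
    kept = sorted(np.argsort(correlations)[::-1][:6].tolist())
    print("Items kept:", kept)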

4. Create a balanced scale

Respondents have a tendency to agree with declarative statements regardless of their content, a phenomenon referred to as acquiescence bias (Wright, 1975). There are several ways to mitigate this bias.

First, when selecting the items to include, survey designers should ensure that respondents have to use the scale’s entire response range to express a given opinion, minimising the risk that this bias affects results. For example, for a respondent to be identified as having low or high self-esteem, they should have to answer positively to some items and negatively to others.

Additionally, it is preferable to present the items as even-handed questions instead of statements and to avoid ‘agree-disagree’ response options (Fowler, 2009; Saris et al., 2010). Instead of asking respondents to rate their level of agreement, even-handed questions use verbally labelled response options that reinforce the underlying topic.

For example, it is preferable to use ‘Are you satisfied with yourself?’ rather than ‘Overall, I am satisfied with myself’. A suitable set of response labels could be: not at all satisfied, slightly satisfied, somewhat satisfied, quite satisfied and extremely satisfied.
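For analysis, each verbal label then needs a numeric code. Below is a minimal sketch assuming a simple 1-5 coding for the satisfaction labels above; the mapping is our own illustrative assumption, not part of any published scale.

    # Sketch: code item-specific verbal labels into numeric scores.
    # The 1-5 scheme is an illustrative assumption.
    SATISFACTION_OPTIONS = {
        "Not at all satisfied": 1,
        "Slightly satisfied": 2,
        "Somewhat satisfied": 3,
        "Quite satisfied": 4,
        "Extremely satisfied": 5,
    }

    def code_response(label: str) -> int:
        """Convert a respondent's chosen label to its numeric score."""
        return SATISFACTION_OPTIONS[label]

    print(code_response("Quite satisfied"))  # -> 4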

5. Avoid negatively worded items

For many years, researchers recommended that survey designers include both negatively and positively worded items in questionnaires (Nunnally, 1978; Spector, 1992). However, more recent studies have found that negatively worded items might create more validity and reliability issues than they solve, for various statistical and behavioural reasons (Allen, 2017).

For instance, the negative expression of a statement requires more complex cognitive processes, making it more challenging for respondents with lower reading abilities or general education to provide consistent responses (Swain et al., 2008).

Hence, evidence suggests that survey designers should phrase items with positive language. They can adopt alternative techniques to minimise biases and improve respondents’ attention. For instance, Instructional Manipulation Checks, i.e. items asking respondents to demonstrate that they are paying attention by choosing a particular response option (e.g., “Select ‘Strongly disagree’ if you are reading the items and paying attention”), can be added to Likert scales (Berinsky et al., 2014). Another technique consists of changing the direction of the response options every few items (i.e., alternating between options running from positive to negative and from negative to positive, left to right).
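To illustrate how both techniques play out at the data-cleaning stage, here is a short Python sketch using pandas. It assumes responses were recorded by on-screen position on a 1-5 scale; the column names, the IMC wording and the set of direction-flipped items are all hypothetical.

    # Sketch: apply an Instructional Manipulation Check, then re-align
    # items whose response options were displayed in reversed order.
    import pandas as pd

    SCALE_MIN, SCALE_MAX = 1, 5
    REVERSED_ITEMS = ["item_2", "item_4"]  # shown negative-to-positive

    df = pd.DataFrame({
        "item_1": [4, 5, 3],
        "item_2": [2, 1, 3],
        "item_3": [5, 4, 4],
        "item_4": [1, 2, 3],
        "imc": ["Strongly disagree", "Agree", "Strongly disagree"],
    })

    # 1. Keep only respondents who chose the instructed IMC option.
    df = df[df["imc"] == "Strongly disagree"].drop(columns="imc")

    # 2. Flip reversed items back onto the common direction so every
    #    item runs from negative (1) to positive (5) before scoring.
    for item in REVERSED_ITEMS:
        df[item] = SCALE_MIN + SCALE_MAX - df[item]

    print(df)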

6. Decide whether to include a midpoint

There is an ongoing debate regarding whether a midpoint should be included in Likert scales (Croasmun & Ostrom, 2011). This midpoint is commonly labelled ‘Undecided’, ‘Neither agree nor disagree’ or ‘Don’t know’. Including or excluding a midpoint determines whether the scale has an odd or even number of points.

There are statistical and behavioural arguments on each side. For instance, a midpoint can allow respondents to express indecision or neutrality (Johns, 2005) but might be misused (e.g., as an alternative to avoid reporting a potentially controversial opinion), whilst its absence can artificially force respondents to commit to a certain position.

Hence, the question that practitioners should seek to answer is not whether or not to include a midpoint, but rather when to add it or omit it. Chyung et al. (2017) present some evidence-based recommendations on the topic. On the one hand, they recommend including a midpoint when:

  • Respondents are familiar with the topic and should be allowed to express a neutral opinion;
  • The Likert scale has to be used as an interval scale for statistical analysis purposes.

On the other hand, a midpoint should be omitted when:

  • Respondents are unfamiliar or uncomfortable with the survey topic;
  • Respondents are not expected to have formed their opinion about the topic;
  • Social desirability bias may have a strong influence on responses;
  • Respondents are likely to put little effort into answering the survey.

More generally, this decision should be made by taking into account factors such as scale sensitivity (i.e., the smallest absolute amount of change that can be detected by a measurement), the age and education level of respondents, the choice of the midpoint label and the researcher’s preference (Chyung et al., 2017).

7. Label each anchor point on Likert scales

An anchor is a word or sentence that describes a position on a response scale. Research has shown that, on a Likert scale, it is best to use as many verbal labels as possible. For instance, for the question ‘How would you rate your mental health overall?’, the anchor points could be: excellent, very good, good, fair, bad and very bad. Full labelling is particularly important for discouraging respondents from becoming reliant on the midpoint when responding to a scale (Matell & Jacoby, 1971). We thus recommend labelling each anchor point to maximise data quality by reducing measurement error (Artino & Gehlbach, 2012). Thankfully, it is also what respondents prefer (Johns, 2010)!

8. Choose the number of points on the scale

Closely related is the debate around the number of points to include on a Likert scale. Although Likert originally designed five-point scales, there are no theoretical reasons to rule out different numbers.

Scales with fewer points are easier and quicker to use but do not provide as much information (Preston and Colman, 2000). Larger scales can be harder for respondents to interpret but provide researchers with more nuanced and accurate information about respondents’ answers.

While researchers have not agreed on a single optimal number of points to include on a Likert scale, most recommend between five and seven points (Preston & Colman, 2000; Chen et al., 2015). The decision is dependent upon the characteristics of the study, particularly the cognitive capacities of its respondents. For instance, Weijters et al. (2010) recommend using five-point scales for the general population, and seven-point scales for populations with relatively high levels of verbal skills and experience with questionnaires. Nemoto and Beglar (2013) encourage the use of four-point scales for young respondents.

Conclusion

In conclusion, there is often no single right way to build a Likert scale but rather a multitude of trade-offs to consider. However, survey designers should always try to make the most appropriate decisions contingent upon the project’s characteristics. When biases cannot be avoided or controlled for, it is important to acknowledge them explicitly.

It is beyond the scope of this article but worth noting that survey design is closely related to statistical data analysis. Understanding how to design a Likert scale that fits a specific type of data analysis requires statistical knowledge that is not provided in this article. TASO hopes to publish more information on this specific topic in the future.

Get in touch

If you have conducted research, or know of research, that could be added to this page, please get in touch with us.

References

Allen, M. ed., 2017. The SAGE encyclopedia of communication research methods. Sage Publications. http://dx.doi.org/10.4135/9781483381411.n264

Artino Jr, A.R. and Gehlbach, H., 2012. AM last page: Avoiding four visual-design pitfalls in survey development. Academic Medicine, 87(10), p.1452.

Artino Jr, A.R., La Rochelle, J.S., Dezee, K.J. and Gehlbach, H., 2014. Developing questionnaires for educational research: AMEE Guide No. 87. Medical teacher, 36(6), pp.463-474. https://doi.org/10.3109/0142159X.2014.889814

Baumeister, R.F., Campbell, J.D., Krueger, J.I. and Vohs, K.D., 2003. Does high self-esteem cause better performance, interpersonal success, happiness, or healthier lifestyles?. Psychological science in the public interest, 4(1), pp.1-44. https://doi.org/10.1111/1529-1006.01431

Berinsky, A.J., Margolis, M.F. and Sances, M.W., 2014. Separating the shirkers from the workers? Making sure respondents pay attention on self‐administered surveys. American Journal of Political Science, 58(3), pp.739-753. https://doi.org/10.1111/ajps.12081

Chen, H.M., Huang, M.F., Yeh, Y.C., Huang, W.H. and Chen, C.S., 2015. Effectiveness of coping strategies intervention on caregiver burden among caregivers of elderly patients with dementia. Psychogeriatrics, 15(1), pp.20-25. https://doi.org/10.1111/psyg.12071

Chyung, S.Y., Roberts, K., Swanson, I. and Hankinson, A., 2017. Evidence‐based survey design: The use of a midpoint on the Likert scale. Performance Improvement, 56(10), pp.15-23. https://doi.org/10.1002/pfi.21727

Croasmun, J.T. and Ostrom, L., 2011. Using Likert-type scales in the social sciences. Journal of Adult Education, 40(1), pp.19-22.

DeVellis, R.F., 2013. Scale development: Theory and applications (Vol. 26). Sage Publications.

Fowler Jr, F.J., 2009. Survey interviewing. In Survey Research Methods (Vol. 4, pp. 127-144). Thousand Oaks, CA: Sage Publications.

Gehlbach, H., 2010. The social side of school: Why teachers need social psychology. Educational Psychology Review, 22(3), pp.349-362. https://doi.org/10.1007/s10648-010-9138-3

Johns, R., 2010. Likert items and scales. Survey question bank: Methods fact sheet, 1(1), pp.11-28.

Joshi, A., Kale, S., Chandel, S. and Pal, D.K., 2015. Likert scale: Explored and explained. British Journal of Applied Science & Technology, 7(4), p.396. https://doi.org/10.9734/BJAST/2015/14975

Likert, R., 1932. A technique for the measurement of attitudes. Archives of Psychology, 140, pp.1-55.

Matell, M.S. and Jacoby, J., 1971. Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and validity. Educational and psychological measurement, 31(3), pp.657-674. https://doi.org/10.1177/001316447103100307

Nemoto, T. and Beglar, D., 2013. Likert-scale questionnaires. In JALT 2013 conference proceedings (pp. 1-8).

Nielsen, T., Makransky, G., Vang, M.L. and Dammeyer, J., 2017. How specific is specific self-efficacy? A construct validity study using Rasch measurement models. Studies in Educational Evaluation, 53, pp.87-97. https://doi.org/10.1016/j.stueduc.2017.04.003

Nunnally, J.C., 1978. Psychometric theory. New York: McGraw-Hill.

Preston, C.C. and Colman, A.M., 2000. Optimal number of response categories in rating scales: reliability, validity, discriminating power, and respondent preferences. Acta psychologica, 104(1), pp.1-15. https://doi.org/10.1016/S0001-6918(99)00050-5

Rosenberg, M., 1965. Rosenberg self-esteem scale (RSE). Acceptance and commitment therapy. Measures package, 61(52), p.18.

Saris, W., Revilla, M., Krosnick, J.A. and Shaeffer, E.M., 2010. Comparing questions with agree/disagree response options to questions with construct-specific response options. Survey Research Methods, 4(1), pp.61-79. http://dx.doi.org/10.18148/srm/2010.v4i1.2682

Spector, P.E., 1992. Summated rating scale construction: An introduction (Vol. 82). Sage.

Swain, S.D., Weathers, D. and Niedrich, R.W., 2008. Assessing three sources of misresponse to reversed Likert items. Journal of Marketing Research, 45(1), pp.116-131. https://doi.org/10.1509/jmkr.45.1.116

Weng, L.J., 2004. Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement, 64(6), pp.956-972. https://doi.org/10.1177/0013164404268674

Weijters, B., Cabooter, E. and Schillewaert, N., 2010. The effect of rating scale format on response styles: The number of response categories and response category labels. International Journal of Research in Marketing, 27(3), pp.236-247. https://doi.org/10.1016/j.ijresmar.2010.02.004

Wright, J.D., 1975. Does acquiescence bias the ‘Index of Political Efficacy?’. The Public Opinion Quarterly, 39(2), pp.219-226.
