1 Introduction
Confirmatory Factor Analysis (CFA) is a key method for assessing the validity of a measurement instrument through its internal structure (Bandalos 2018; Hughes 2018; Sireci and Sukin 2013). Validity is arguably the most crucial characteristic of a measurement model (Furr 2021), as it addresses the essential question of what measuring instruments truly assess (Bandalos 2018). This concern is closely linked with the classical definition of validity: the degree to which a test measures what it claims to measure (Bandalos 2018; Furr 2021; Sireci and Sukin 2013; Urbina 2014), aligning with the tripartite model still embraced by numerous scholars (Widodo 2018).
The tripartite model of validity frames the concept using three categories of evidence: content, criterion, and construct (Bandalos 2018). Content validity pertains to the adequacy and representativeness of test items relative to the domain or objective under investigation (Cohen, Schneider, and Tobin 2022). Criterion validity refers to the correlation between test outcomes and a relevant external criterion, such as performance on another measure or a future outcome (Cohen, Schneider, and Tobin 2022). Construct validity evaluates the test’s capacity to measure the theoretical construct it is intended to assess, taking into account related hypotheses and empirical data (Cohen, Schneider, and Tobin 2022).
Introduced in the American Psychological Association (APA) “Standards for Educational and Psychological Testing” in 1966, the tripartite concept of validity has been a cornerstone in the social sciences for decades (Bandalos 2018). However, its fragmented and confusing nature has led to widespread criticism, prompting a shift towards a more holistic view of validity (Sireci and Sukin 2013). This evolution was signified by the publication of the 1999 standards (AERA, APA, and NCME 1999), and further by the 2014 standards (AERA, APA, and NCME 2014), which redefined test validity in terms of the interpretations and uses of test scores (Furr 2021). Under this new paradigm, validation requires diverse theoretical and empirical evidence, recognizing validity as a unified concept – construct validity – encompassing various evidence sources for evaluating potential interpretations of test scores for specific purposes (Furr 2021; Urbina 2014).
Thus, key authorities in psychological assessment now define validity as the degree to which evidence and theory support the interpretations of test scores for their intended purposes (AERA, APA, and NCME 2014). Validity involves a comprehensive evaluation of how well empirical evidence and theoretical rationales uphold the conclusions and actions derived from test scores or other assessment types (Bandalos 2018; Furr 2021; Urbina 2014).
According to the 2014 Standards (AERA, APA, and NCME 2014), five types of validity evidence are critical: content, response processes, relations to external variables, consequences of test use, and internal structure. Content validity examines the extent to which test content represents the domain of interest, and only that domain (Furr 2021). Response-process evidence refers to the link between the construct and the details of how examinees actually respond (Sireci and Sukin 2013). Evidence based on external variables concerns the test’s correlation with other measures or constructs expected to be related, or unrelated, to the construct under evaluation (Furr 2021). Evidence based on the consequences of test use focuses on the positive or negative effects on the individuals or groups assessed (Bandalos 2018).
Evidence based on internal structure assesses how well the interactions among test items and their components align with the theoretical framework used to explain the outcomes of the measurement instrument (AERA, APA, and NCME 2014; Rios and Wells 2014). Sources of internal structural validity evidence may include analyses of reliability, dimensionality, and measurement invariance.
Reliability is gauged by internal consistency, reflecting i) the reproducibility of test scores under consistent conditions and ii) the proportion of observed score variance attributable to true score variance (Rios and Wells 2014). Dimensionality analysis verifies whether the interrelations among items support the inferences drawn from the measurement model’s scores, for instance, whether items intended to reflect a single dimension are in fact unidimensional (Rios and Wells 2014). Measurement invariance analysis assesses whether item properties remain consistent across specified groups, such as gender or ethnicity.
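In classical test theory notation, this second facet corresponds to the proportion of observed-score variance attributable to the true score:

$$
\rho_{XX'} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{X}} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{T} + \sigma^{2}_{E}},
$$

where $\sigma^{2}_{T}$, $\sigma^{2}_{E}$, and $\sigma^{2}_{X}$ denote true-score, error, and observed-score variance, respectively.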
CFA facilitates the integration of these diverse sources to substantiate the validity of the internal structure (Bandalos 2018; Flora and Flake 2017; Hughes 2018; Reeves and Marbach-Ad 2016; Rios and Wells 2014). In the applied social sciences, researchers often have a theoretical dimensional structure in mind (Sireci and Sukin 2013), and CFA is employed to test how well the hypothesized measurement model reproduces the observed data (Rios and Wells 2014).
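To make this concrete, a minimal lavaan specification of a hypothesized measurement model is sketched below; it uses lavaan’s built-in HolzingerSwineford1939 example data and illustrative factor names, not the data analyzed in this tutorial.

```r
# Minimal CFA sketch in lavaan (illustrative example, not the tutorial's data)
library(lavaan)

# Hypothesized measurement model: each item loads on a single latent factor
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'

# Fit the hypothesized structure to the observed data
fit <- cfa(model, data = HolzingerSwineford1939)

# Inspect global fit indices and standardized loadings
summary(fit, fit.measures = TRUE, standardized = TRUE)
```

The degree to which the hypothesized structure reproduces the observed covariances is then judged from the fit indices reported by summary().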
CFA constitutes a fundamental aspect of the covariance-based Structural Equation Modeling (SEM) framework (CB-SEM) (Brown 2023; Harrington 2009; Jackson, Gillaspy, and Purc-Stephenson 2009; Kline 2023; Nye 2022). SEM is a prevalent statistical approach in the applied social sciences (Hoyle 2023; Kline 2023), serving as a generalization of multiple regression and factor analysis (Hoyle 2023). This methodology facilitates the examination of complex relationships between variables and the consideration of measurement error, aligning with the requirements for measurement model validation (Hoyle 2023).
Applications of CFA present significant complexities (Crede and Harms 2019; Flake, Pek, and Hehman 2017; Flake and Fried 2020; Jackson, Gillaspy, and Purc-Stephenson 2009; Nye 2022; Rogers 2024), influenced by data structure, the measurement level of items, research goals, and other factors. CFA can proceed smoothly in scenarios involving unidimensional measurement models with continuous items and large samples, but may encounter challenges, such as reduced flexibility within the SEM framework, when dealing with multidimensional models with ordinal items and small sample sizes (Rogers 2024).
This leads to an important question: Can certain strategies within CFA applications simplify the process for social scientists seeking evidence of validity in the internal structure of a measurement model? This inquiry does not suggest that research objectives should conform to quantitative methods. Rather, research aims guide scientific inquiry, defining our learning targets and priorities. Quantitative methods serve as tools towards these ends, not as objectives themselves. They represent one among many tools available to researchers, with the study’s purpose dictating method selection (Pilcher and Cortazzi 2023).
However, as the scientific method is an ongoing journey of discovery, many questions, especially in Psychometrics concerning measurement model validation, remain open-ended. The lack of consensus on complex and varied topics suggests researchers should opt for paths offering maximal analytical flexibility, enabling exploration of diverse methodologies and solutions while keeping research objectives forefront (Price 2017).
A recurrent topic in Factor Analysis (FA) is how to handle the measurement level of scale items. Empirical studies (Rhemtulla, Brosseau-Liard, and Savalei 2012; Robitzsch 2022, 2020) advocate treating scales with five or more response options as continuous variables, showing that this treatment enhances CFA flexibility while still addressing validity evidence for the internal structure. The FA literature acknowledges the methodological dilemmas faced when dealing with binary and/or ordinal response items with fewer than five options (Rogers 2024, 2022).
For continuous scale items, the maximum likelihood (ML) estimator and its robust variants are applicable. For non-continuous items, estimators from the categorical least squares (cat-LS) family are recommended (Nye 2022; Rogers 2024, 2022). Though cat-LS estimators impose fewer assumptions on the data, they require larger sample sizes, more computational power, and greater researcher expertise (Robitzsch 2020).
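As an illustration of how this choice plays out in practice, the sketch below simulates single-factor data (the item names y1–y4 are arbitrary) and fits the same model under a robust ML estimator and under a cat-LS estimator (WLSMV) with the items declared as ordered; it is a sketch under these simulated conditions, not a prescription.

```r
# Sketch of estimator choice by measurement level (simulated data for illustration)
library(lavaan)

set.seed(123)
n   <- 500
eta <- rnorm(n)                                   # latent trait
dat <- as.data.frame(sapply(1:4, function(i) 0.7 * eta + rnorm(n, sd = 0.7)))
names(dat) <- paste0("y", 1:4)

model <- 'trait =~ y1 + y2 + y3 + y4'

# Items treated as continuous (e.g., five or more response categories): robust ML
fit_ml <- cfa(model, data = dat, estimator = "MLR")

# Items treated as ordinal (few categories): cat-LS estimation (WLSMV),
# declaring the items as ordered; here the items are coarsened into four categories
ord <- as.data.frame(lapply(dat, function(y) cut(y, breaks = 4, labels = FALSE)))
fit_cat <- cfa(model, data = ord, estimator = "WLSMV", ordered = names(ord))
```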
Assessing model fit is more challenging for cat-LS estimated models than for those estimated by ML, which are better established and more familiar to researchers (Rhemtulla, Brosseau-Liard, and Savalei 2012). Despite their increasing popularity, cat-LS methods are newer, less widely known, and less widely implemented in software (Rhemtulla, Brosseau-Liard, and Savalei 2012). Handling missing data remains straightforward under ML via the Full Information ML (FIML) method, but is problematic with ordinal data (Rogers 2024).
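The sketch below illustrates the ML-side convenience: missing values are injected artificially into lavaan’s built-in example data, and FIML is requested so that all available observations contribute to estimation (in lavaan, FIML is available for ML-family estimators).

```r
# Sketch of FIML handling of missing data under ML estimation (illustrative only)
library(lavaan)

dat <- HolzingerSwineford1939[, paste0("x", 1:6)]
set.seed(42)
dat[sample(nrow(dat), 30), "x2"] <- NA   # inject some artificial missingness

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'

# FIML uses all available data rather than discarding incomplete cases
fit_fiml <- cfa(model, data = dat, estimator = "MLR", missing = "fiml")
summary(fit_fiml, fit.measures = TRUE)
```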
Thus, treating items as continuous whenever it is defensible lets researchers exploit the full potential of the available software (Arbuckle 2019; Bentler and Wu 2020; Fox 2022; JASP Team 2023; Jöreskog and Sörbom 2022; Muthén and Muthén 2023; Neale et al. 2016; Ringle, Wende, and Becker 2022; Rosseel 2012; The jamovi project 2023) and sidestep many of the limitations in handling ordinal and nominal data that persist in some of these programs (Arbuckle 2019; Bentler and Wu 2020; Neale et al. 2016; Ringle, Wende, and Becker 2022).
This discussion is not intended to oversimplify, digress, or claim the superiority of one software package over another. Rather, it underscores a fundamental statistical principle: moving from nominal to ordinal and then to continuous (scale) measurement levels increases the flexibility of the statistical methods available. Empirical studies in CFA support these clarifications (Rhemtulla, Brosseau-Liard, and Savalei 2012; Robitzsch 2022, 2020).
This article assists applied social scientists in decision-making, from selecting a measurement model to comparing and updating models, with the aim of enhancing CFA flexibility. It addresses power analysis, data preprocessing, estimation procedures, and model modification from three angles: smart choices or recommended practices (Flake, Pek, and Hehman 2017; Nye 2022; Rogers 2024), pitfalls to avoid (Crede and Harms 2019; Rogers 2024), and essential reporting elements (Flake and Fried 2020; Jackson, Gillaspy, and Purc-Stephenson 2009; Rogers 2024).
The aim is to guide researchers through CFA so they can assess the underlying structure of measurement models without falling into common traps at any stage of the validation process. Early-stage decisions can preempt later limitations, whereas missteps may necessitate exploratory research or additional effort in subsequent phases.
Practically, this includes an R tutorial utilizing the lavaan package (Rosseel 2012), adhering to reproducibility, replicability, and transparency standards of the Open Science movement (Gilroy and Kaplan 2019; Kathawalla, Silverstein, and Syed 2021; Klein et al. 2018).
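By way of illustration only, and assuming the renv package as one possible option for dependency management (the tutorial’s own workflow may differ), reproducibility can be supported with a few standard R habits:

```r
# Minimal reproducibility sketch (assumed tooling; illustrative, not prescriptive)
set.seed(2024)       # fix the seed wherever simulation or resampling is involved
# renv::init()       # optional: create a project-local library and lockfile
# renv::snapshot()   # optional: record exact package versions after changes
sessionInfo()        # report R, platform, and package versions with the results
```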
Tutorial articles, following the FAIR principles (Findable, Accessible, Interoperable, and Reusable) (Wilkinson et al. 2016), play a vital role in promoting open science (Martins 2021; Mendes-Da-Silva 2023) by detailing important methods or application areas in an accessible yet comprehensive manner. This encourages adherence to best practices among researchers, minimizing the impact of publication bias towards positive results.
Beyond this introductory discussion, the tutorial is structured into three sections: a thorough review of recommended CFA practices, a real-world research application in the R ecosystem, and final considerations, following the tutorial-article format of Martins (2021). This approach, combined with workflow recommendations for reproducibility, aims to support the applied social sciences community in using CFA effectively (Martins 2021; Mendes-Da-Silva 2023).