2  Smart Choices in CFA

This paper presents a comprehensive approach to conducting a standard CFA within the applied social sciences, following the guidelines outlined by Rogers (2024). According to Rogers (2024), a typical CFA study seeks to fit a reflective common factor model with a predefined multifactor structure, established psychometric properties, and a maximum of five Likert-type response options. This scenario frequently occurs in research endeavors where the measurement model facilitates the examination of hypotheses derived from the structural model.

The initial phase in such research involves data preprocessing. Specifically, for categorical data, Rogers (2024) advises employing multiple imputation to handle missing data, taking into consideration the limitations posed by available software and methodologies. When a measurement model allows items to be treated as continuous variables, this challenge can instead be deferred to the estimation stage through the selection of an appropriate estimator (Robitzsch 2022).

This paper reinterprets the insights from Rogers (2024) for CFAs that accommodate continuous item treatment. Thus, a strategic choice involves opting for measurement models that permit this approach, thereby circumventing methodological hurdles (Robitzsch 2022, 2020) associated with binary and/or ordinal response items with up to four or five gradations. Such a decision influences various aspects of the research process, including the choice of software, power analysis, estimation techniques, criteria for model adjustment, and model comparisons. These choices, in turn, affect requirements concerning sample size, computational resources, and the researcher’s expertise (Robitzsch 2020).

Subsequent sections delve into themes previously summarized by Rogers (2024), specifically concerning CFAs with ordinal items. These themes are explored in terms of recommended practices (Jessica K. Flake, Pek, and Hehman 2017; Nye 2022; Rogers 2024), pitfalls to avoid (Crede and Harms 2019; Rogers 2024), and reporting guidelines (Jessica Kay Flake and Fried 2020; Jackson, Gillaspy, and Purc-Stephenson 2009; Rogers 2024), all within the context of selecting measurement models that accommodate continuous data interpretation.

Assuming that readers possess a foundational understanding of the topic, this paper omits certain technical details, directing readers to authoritative texts (Brown 2015; Kline 2023) and scholarly articles that provide an introduction to Covariance-Based Structural Equation Modeling (CB-SEM) (Davvetas et al. 2020; Shek and Yu 2014). The discussion is framed within the CB-SEM paradigm (Brown 2015; Jackson, Gillaspy, and Purc-Stephenson 2009; Kline 2023; Nye 2022), with a focus on CFA. The paper explicitly excludes discussions on measurement model modifications in Variance-Based SEM (VB-SEM), which are predominantly addressed in the literature on Partial Least Squares SEM (PLS-SEM) (Hair et al. 2022, 2017; Henseler 2021).

2.1 Measurement Model Selection

Selecting an appropriate measurement model is a critical initial step in the research process. For robust analysis, it is advisable to prioritize models that provide five or more ordinal response options. Research has shown that a higher number of response gradations enhances the ability to detect inaccurately defined models (Green et al. 1997; Maydeu-Olivares, Fairchild, and Hall 2017), even when using estimators designed for ordinal items (Xia and Yang 2018). This strategy also mitigates some of the methodological challenges associated with the analysis of ordinal data in CFA (Rhemtulla, Brosseau-Liard, and Savalei 2012; Robitzsch 2022, 2020).

When choosing a measurement scale, it is crucial to select ones that have been validated in the language of application and with the study’s target audience (Jessica K. Flake, Pek, and Hehman 2017). Avoid scales that are proprietary or specific to certain professions. An examination of your country’s Psychological Test Assessment System can be an effective starting point. If the desired scale is not found within these resources, consider looking into scales developed with the support of public institutions, non-governmental organizations, research centers, or universities, as these entities often invest significant resources in validating measurement models for broader public policy purposes.

An extensive literature review is essential for selecting a suitable measurement model. This should include consulting specialized journals, books, technical reports, and academic dissertations or theses. Schumacker, Wind, and Holmes (2021) provide a detailed guide for initiating this search. Consideration should also be given to systematic reviews or meta-analyses focusing on measurement models related to your topic of interest. It is important to review both the original articles on the scales and subsequent applications. Kline (2016) offers a useful checklist for assessing various measurement methods.

Incorporate control questions, such as requiring respondents to select “strongly agree” on specific items, and monitor survey response times to gauge participant engagement (Collier 2020).

Avoid adopting measurement models designed for narrow purposes or those lacking rigorous psychometric validation (Jessica Kay Flake and Fried 2020; Kline 2016). The mere existence of a scale does not ensure its validity (Jessica K. Flake, Pek, and Hehman 2017). Also, steer clear of seldom-used or outdated scales, as they may have compromised psychometric properties. Translating a scale from another language for immediate use without thorough translation and retranslation processes is inadvisable. Be cautious of overlooking alternative factorial structures (e.g., higher-order or bifactor models) that could potentially salvage the research if considered thoroughly (Crede and Harms 2019).

When selecting a scale, justify its choice by highlighting its strong psychometric properties, including previous empirical evidence of its application within the target population and its reliability and validity metrics (Jessica Kay Flake and Fried 2020; Jackson, Gillaspy, and Purc-Stephenson 2009; Kline 2016). If the scale has multiple potential factorial structures, provide a rationale for the chosen model to prevent the misuse of CFA for exploratory purposes (Jackson, Gillaspy, and Purc-Stephenson 2009).

Clearly specify the selected model and rationalize your choice by detailing its advantages over other theoretical models. Illustrating the models under consideration can further clarify your research approach (Jackson, Gillaspy, and Purc-Stephenson 2009). Finally, identify and explain any potential cross-loadings based on prior empirical evidence (Brown 2023; Nye 2022), ensuring a comprehensive and well-justified methodological foundation for your study.

2.2 Power Analysis

When addressing Power Analysis (PA) in CFA and SEM, it’s essential to move beyond general rules of thumb for determining sample sizes. Commonly cited guidelines suggesting minimum sizes or specific ratios of observations to parameters (e.g., 50, 100, 200, 300, 400, 500, 1000 for sample sizes or 20/1, 10/1, 5/1 for observation/parameter ratios) (Kline 2023; Kyriazos 2018) are based on controlled conditions that may not directly transfer to your study’s context.

Reliance on lower-bound sample sizes as a substitute for thorough PA risks inadequate power for detecting meaningful effects in your model (Westland 2010; Yilin Andre Wang 2023). Tools like Soper’s calculator (https://www.danielsoper.com/statcalc/), while popular and frequently cited (as of February 20, 2024, with almost four years of existence, it had collected more than 1,000 citations on Google Scholar), should not replace a tailored PA approach. Such calculators, despite their utility, may not fully accommodate the complexities and specific requirements of your research design (Kyriazos 2018; Feng and Hancock 2023; Moshagen and Bader 2023).

A modern perspective on sample size determination emphasizes customizing power calculations to fit the unique aspects of each study, incorporating specific research settings and questions (Feng and Hancock 2023; Moshagen and Bader 2023). This approach underscores that there is no universal sample size or minimum that applies across all research scenarios (Kline 2023).

Planning for PA should ideally precede data collection, enhancing the researcher’s understanding of the study and facilitating informed decisions regarding the measurement model based on existing literature and known population characteristics (Feng and Hancock 2023; Leite, Bandalos, and Shen 2023). A priori PA not only ensures adequate sample size for detecting the intended effects, minimizing Type II errors, but also aids in budgeting for data collection and enhancing overall research design (Feng and Hancock 2023).

PA in SEM can be approached analytically, using asymptotic theory, or through simulation methods. Analytical methods require specifying the effect size in relation to the non-centrality parameter, while simulated PA leverages a population model to empirically estimate power (Moshagen and Bader 2023; Feng and Hancock 2023). These approaches are applicable to assessing both global model fit and specific model parameters.

For CFA, evaluating the power related to the global fit of the measurement model is recommended (Nye 2022). Although analytical solutions have their limitations, they can serve as preliminary steps, complemented by simulation techniques for a more comprehensive PA (Feng and Hancock 2023; Moshagen and Bader 2023).

Several resources offer analytical solutions for global fit PA, including ShinyApps by Jak et al. (2021), Moshagen and Bader (2023), Y. Andre Wang and Rhemtulla (2021), and Zhang and Yuan (2018); one of these applications provides a comprehensive suite for Monte Carlo simulation (SMC) that accommodates missing data and non-normal distributions and facilitates model testing without extensive coding (Y. Andre Wang and Rhemtulla 2021). For an overview of these solutions and a discussion of analytical approaches, see Feng and Hancock (2023), Jak et al. (2021), Nye (2022), and Yilin Andre Wang (2023).

However, it is a smart decision to run an SMC for the PA of your CFA model using solutions that support the reproducibility and replicability of the results. To this end, even the analytical solutions that a researcher may use as a starting point are best carried out in the R environment, via the semTools package (Jak et al. 2021) or semPower 2 (Jobst, Bader, and Moshagen 2023; Moshagen and Bader 2023). The first option is compatible with lavaan syntax and is sufficient for most purposes; the second, although it includes SMC in some cases, has a steeper syntax learning curve.
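As a minimal sketch of the analytical route, assuming semTools and an RMSEA-based approach (the degrees of freedom and the RMSEA pair below are placeholders to be replaced with your model’s values):

```r
library(semTools)

# A priori sample size needed to distinguish close fit (RMSEA = .05) from a
# hypothesized misfit (RMSEA = .08) at alpha = .05 and power = .80;
# df = 53 is a placeholder for your model's degrees of freedom.
findRMSEAsamplesize(rmsea0 = 0.05, rmseaA = 0.08, df = 53,
                    power = 0.80, alpha = 0.05)

# Power achieved at a fixed sample size, for comparison.
findRMSEApower(rmsea0 = 0.05, rmseaA = 0.08, df = 53, n = 300, alpha = 0.05)
```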

For detailed and tailored PA, especially in complex models or unique study designs, the simsem package offers a robust solution, allowing for the relaxation of traditional assumptions and supporting the use of robust estimators. This package, which utilizes the familiar lavaan syntax, simplifies the learning curve for researchers already accustomed to SEM analyses, providing a user-friendly interface for conducting SMC (Pornprasertmanit et al. 2022).
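For illustration, a minimal simsem sketch follows; the population values (loadings of .70, residual variances of .51, a factor correlation of .30) and the two-factor, four-items-per-factor layout are assumptions to be replaced with values from prior studies and your own design.

```r
library(simsem)

# Population (data-generating) model with assumed parameter values.
pop <- '
  f1 =~ 0.7*y1 + 0.7*y2 + 0.7*y3 + 0.7*y4
  f2 =~ 0.7*y5 + 0.7*y6 + 0.7*y7 + 0.7*y8
  f1 ~~ 1*f1
  f2 ~~ 1*f2
  f1 ~~ 0.3*f2
  y1 ~~ 0.51*y1
  y2 ~~ 0.51*y2
  y3 ~~ 0.51*y3
  y4 ~~ 0.51*y4
  y5 ~~ 0.51*y5
  y6 ~~ 0.51*y6
  y7 ~~ 0.51*y7
  y8 ~~ 0.51*y8
'

# Analysis model in ordinary lavaan syntax.
analysis <- '
  f1 =~ y1 + y2 + y3 + y4
  f2 =~ y5 + y6 + y7 + y8
'

out <- sim(nRep = 1000, model = analysis, n = 300, generate = pop,
           lavaanfun = "cfa", estimator = "MLR", seed = 1234)
summaryFit(out)   # empirical behaviour of chi-square, CFI, RMSEA, SRMR
getPower(out)     # empirical power for individual parameters
```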

Publishing the sampling design and methodology enhances the reproducibility and replicability of research, contributing to the scientific community’s collective understanding and validation of measurement models (Jessica K. Flake, Pek, and Hehman 2017; Jessica Kay Flake et al. 2022; Jessica Kay Flake and Fried 2020; Leite, Bandalos, and Shen 2023). In the context of CFA, acknowledging the power limitations of your study can signal potential concerns for the broader inferences drawn from your research, emphasizing the importance of external validity and the relevance of the outcomes over mere precision (Leite, Bandalos, and Shen 2023).

2.3 Pre-processing

Upon gathering and tabulating the original data, ideally in non-binary (plain-text) formats such as CSV, TXT, or JSON, the first step in preprocessing should be to remove responses from participants who abandoned the study. It is worth revisiting these incomplete cases at the end of preprocessing, as they can offer insights into how missing data, outliers, and multicollinearity should be handled.

Incorporating control questions and measuring response time allows researchers to further refine their dataset by excluding participants who fail control items or complete the survey unusually quickly (Collier 2020). Calculating individual response variability (standard deviation) can identify respondents who may not have engaged meaningfully with the survey, indicated by minimal variation in their responses.
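For instance, assuming a raw data frame `raw` with hypothetical item columns it1 to it12, a directed-response item `attention_1` (keyed so that 5 = “strongly agree”), a completion-time variable `duration_sec`, and an illustrative 120-second threshold, a basic screen might look like this:

```r
items <- paste0("it", 1:12)                  # hypothetical item columns

# Per-respondent variability across the scale items; zero suggests straight-lining.
raw$resp_sd <- apply(raw[, items], 1, sd, na.rm = TRUE)

keep <- raw$attention_1 == 5 &               # passed the directed-response item
        raw$duration_sec >= 120 &            # not an implausibly fast completion
        raw$resp_sd > 0                      # some variation across items
clean <- raw[which(keep), ]                  # which() also drops NA comparisons
```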

These preliminary data cleaning steps are fundamental yet frequently overlooked in empirical research. They can significantly enhance data quality before more complex statistical analyses are undertaken. Visual and descriptive examination of the measurement model items is beneficial for any statistical investigation and should be considered standard practice.

While data transformation methods like linearization or normalization are available, they are generally not necessary given the robust estimation processes that can handle non-normal data (Brown 2015). Parceling items is also discouraged due to its potential to obscure underlying multidimensional structures (Brown 2015; Crede and Harms 2019).

Addressing missing data, outliers, and multicollinearity is critical. Single imputation methods should be avoided as they underestimate error variance and can lead to identification problems in your model (Enders 2023). For missing data under 5%, the impact may be minimal, but for higher rates, Full Information ML (FIML) or Multiple Imputation (MI) should be utilized, with FIML often being the most straightforward and effective choice for CFA (Brown 2015; Kline 2023).

FIML and MI are preferred for handling missing data because both produce consistent and efficient parameter estimates under comparable assumptions (Enders 2023; Kline 2023). FIML can also be adapted for non-normal data by pairing it with robust estimators (Brown 2015).
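As a quick check against the five-percent guideline, assuming the screened data frame `clean` and the hypothetical item columns from the earlier sketch:

```r
items <- paste0("it", 1:12)                    # hypothetical item columns

# Percentage of missing responses per item.
round(colMeans(is.na(clean[, items])) * 100, 1)

# FIML itself is requested later, at estimation, via missing = "fiml" in lavaan.
```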

Calculating the Variance Inflation Factor (VIF) helps identify items with problematic multicollinearity (VIF > 10), which should be addressed to prevent model convergence issues and misinterpretations (Kline 2016; Whittaker and Schumacker 2022). Reflective constructs in CFA require some level of item correlation but not to the extent that it causes statistical or validity concerns.
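A base-R sketch for item-level VIFs, regressing each item on the remaining items (again assuming the hypothetical columns it1 to it12):

```r
items <- paste0("it", 1:12)
X <- clean[, items]

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing item j on the others.
vif <- sapply(items, function(v) {
  r2 <- summary(lm(reformulate(setdiff(items, v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(sort(vif, decreasing = TRUE), 2)         # values above 10 deserve attention
```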

Focus on multivariate rather than univariate outliers, identifying them and deciding on their exclusion in light of sample characteristics. Reporting all data cleaning processes, including any loss of items and the strategies used to assess respondent engagement, is crucial for transparency. Additionally, documenting signs of multicollinearity and the software or packages used (with versions) enhances the reproducibility and credibility of the research (Jessica Kay Flake and Fried 2020; Jackson, Gillaspy, and Purc-Stephenson 2009).
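For multivariate outliers, a common heuristic is the Mahalanobis distance on the item scores compared against a conservative chi-square cutoff; the p < .001 criterion below is an assumption, not a universal rule.

```r
items <- paste0("it", 1:12)
X <- clean[, items]

md <- mahalanobis(X, center = colMeans(X, na.rm = TRUE),
                  cov = cov(X, use = "pairwise.complete.obs"))
cutoff <- qchisq(0.999, df = length(items))
which(md > cutoff)                             # candidate multivariate outliers
```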

Finally, making raw data public adheres to the principles of open science, promoting transparency and allowing for independent validation of research findings (Crede and Harms 2019; Jessica Kay Flake et al. 2022; Jessica Kay Flake and Fried 2020). This practice not only contributes to the scientific community’s collective knowledge base but also reinforces the integrity and reliability of the research conducted.

2.4 Estimation Process

In CFA with ordinal items, such as those involving Likert-type scales with up to five points, Rogers (2024) advocates for the use of estimators from the Ordinary Least Squares (OLS) family. Specifically, for smaller samples, the recommendation is to utilize the Unweighted Least Squares (ULS) in its robust form (RULS), and for larger samples, the Diagonally Weighted Least Squares (DWLS) in its robust version (RDWLS), citing substantial supporting research.

Despite this, empirical evidence (Rhemtulla, Brosseau-Liard, and Savalei 2012; Robitzsch 2022) and theoretical considerations (Robitzsch 2020) suggest that treating ordinal data as continuous can yield acceptable outcomes when the response options number five or more. Particularly with 6-7 categories, comparisons between methods under various conditions reveal little difference, and it is recommended to use a greater number of response alternatives (≥5) to enhance the power for detecting model misspecifications (Maydeu-Olivares, Fairchild, and Hall 2017).

The ML estimator, noted for its robustness to minor deviations from normality (Brown 2015), is further improved by robust variants such as MLR (which employs Huber-White standard errors and the Yuan-Bentler scaled \(\chi^2\)). This adjustment yields robust standard errors and adjusted test statistics, and MLR is widely applicable, including in scenarios with missing data (robust FIML, RFIML) or data that violate the independence-of-observations assumption (Brown 2015; Rosseel 2012). Comparative empirical studies have supported the effectiveness of MLR against alternative estimators (Bandalos 2014; Holgado-Tello, Morata-Ramirez, and García 2016; Li 2016; Nalbantoğlu-Yılmaz 2019; Yang and Liang 2013; Yang-Wallentin, Jöreskog, and Luo 2010).
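In lavaan, for example, a two-factor measurement model (the item names are hypothetical placeholders) can be fitted with MLR, which also accommodates missing data via robust FIML:

```r
library(lavaan)

model <- '
  f1 =~ it1 + it2 + it3 + it4 + it5 + it6
  f2 =~ it7 + it8 + it9 + it10 + it11 + it12
'

fit_mlr <- cfa(model, data = clean,
               estimator = "MLR",     # Huber-White SEs, Yuan-Bentler scaled chi-square
               missing   = "fiml")    # robust full-information ML for missing data
summary(fit_mlr, fit.measures = TRUE, standardized = TRUE)
```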

Researchers are advised to carefully describe and justify the chosen estimation method based on the data characteristics and the specific model being evaluated (Crede and Harms 2019). It is also critical to report any estimation challenges encountered, such as algorithm non-convergence or model misidentification (Nye 2022). In case of estimation difficulties, alternative approaches like MLM estimators (employing robust standard errors and Satorra-Bentler scaled \(\chi^2\)) or the default ML with non-parametric bootstrapping, as proposed by Bollen-Stine, can be considered. This latter approach is also capable of accommodating missing data (Brown 2015; Kline 2023).
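Continuing the sketch above, the alternatives mentioned here map onto lavaan options roughly as follows (note that MLM requires complete data):

```r
# MLM: robust SEs with the Satorra-Bentler scaled chi-square (listwise complete data).
fit_mlm <- cfa(model, data = clean, estimator = "MLM")

# Default ML with the Bollen-Stine non-parametric bootstrap for the test statistic.
fit_bs <- cfa(model, data = clean, estimator = "ML",
              se = "bootstrap", test = "Bollen.Stine", bootstrap = 1000)
```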

Additionally, it is important to clarify how the latent variables were scaled, whether by fixing a marker variable’s loading to 1 or by fixing the latent variances to 1 (Jackson, Gillaspy, and Purc-Stephenson 2009), and to provide both standardized and unstandardized parameter estimates (Nye 2022). These steps are crucial for ensuring transparency, reproducibility, and the ability to critically assess the validity of the CFA results.
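Both sets of estimates can be extracted in a single table; by default lavaan uses the marker-indicator method, whereas std.lv = TRUE fixes the latent variances to 1 instead:

```r
parameterEstimates(fit_mlr, standardized = TRUE)   # unstandardized plus std.all columns

# Alternative scaling: fix latent variances to 1 rather than a marker loading.
# fit_std <- cfa(model, data = clean, estimator = "MLR", missing = "fiml", std.lv = TRUE)
```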

2.5 Model Fit

In conducting CFA with ordinal items, such as Likert-type scales, it’s crucial to approach model evaluation with nuance and avoid reliance on rigid cutoff values for fit indices. Strict adherence to traditional cutoffs, whether more conservative (e.g., SRMR ≤ .06, RMSEA ≤ .06, CFI ≥ .95) or less conservative (e.g., RMSEA ≤ .08, CFI ≥ .90, SRMR ≤ .08), should not be the sole criterion for model acceptance (Xia and Yang 2019). These thresholds originated in simulation studies with specific configurations (up to three factors, fifteen items, factor loadings between 0.7 and 0.8) (West et al. 2023) and may not apply universally, given variation in the number of items, factors, model degrees of freedom, types of misfit, and the presence of missing data (Groskurth, Bluemke, and Lechner 2023; Niemand and Mai 2018; West et al. 2023).

Evaluation of global fit indices (SRMR, RMSEA, CFI) should be done in a collective manner, rather than fixating on any single index. A deviation from traditional cutoffs warrants further investigation into whether the discrepancy is attributable to data characteristics or limitations of the index, rather than indicating a fundamental model misspecification (Nye 2022). Interpreting fit indices as effect sizes can offer a more meaningful assessment of model fit, aligning with their original conceptualization (McNeish and Wolf 2023a; McNeish 2023b).
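With robust estimation, the scaled and robust versions of these indices can be inspected together; continuing the earlier lavaan sketch:

```r
fitMeasures(fit_mlr, c("chisq.scaled", "df.scaled", "pvalue.scaled",
                       "cfi.robust", "rmsea.robust", "srmr"))
```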

The SRMR is noted for its robustness across various conditions, including non-normality and different measurement levels of items. Pairing SRMR with CFI can help balance Type I and Type II errors, but reliance on alternative indices may increase the risk of Type I error (Mai, Niemand, and Kraus 2021; Niemand and Mai 2018).

Emerging methods like the Dynamic Fit Index (DFI) and Flexible Cutoffs (FCO) offer tailored approaches to evaluating global fit. DFI, based on simulation, provides model-specific cutoff points, adjusting simulations to match the empirical model’s characteristics (McNeish 2023a; McNeish and Wolf 2023b; McNeish and Wolf 2023a). FCO, while not requiring identification of a misspecified model like DFI, conservatively defines misfit, shifting focus from approximate to accurate fit (McNeish and Wolf 2023b).

For those hesitant to delve into simulation-based methods, Equivalence Testing (EQT) presents an alternative. EQT aligns with the analytical mindset of PA and incorporates DFI principles, challenging the conventional hypothesis testing framework by considering model specification and misspecification size control (Yuan et al. 2016).

When addressing reliability, Cronbach’s Alpha should not be the default measure due to its limitations. Instead, consider McDonald’s Omega or the Greatest Lower Bound (GLB) for a more accurate reliability assessment within the CFA context (Bell, Chalmers, and Flora 2023; Cho 2022; Dunn, Baguley, and Brunsden 2014; Flora 2020; Goodboy and Martin 2020; Green and Yang 2015; Hayes and Coutts 2020; Kalkbrenner 2023; McNeish 2018; Trizano-Hermosilla and Alvarado 2016).
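As a sketch under stated assumptions (a recent semTools release for model-based omega and the psych package for the GLB), continuing with the fitted model from earlier:

```r
library(semTools)

# Composite reliability (McDonald's omega) per factor from the fitted CFA;
# compRelSEM() assumes semTools >= 0.5-6.
compRelSEM(fit_mlr)

# Greatest lower bound estimated from the raw items via the psych package.
psych::glb.fa(clean[, paste0("it", 1:12)])
```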

Before modifying the model, first check for Heywood cases, that is, standardized factor loadings greater than one or negative variances (Nye 2022), and document the cutoffs chosen for evaluation. Tools and resources such as the DFI ShinyApp and the FCO package in R can facilitate the application of these advanced methodologies (McNeish and Wolf 2023a; Mai, Niemand, and Kraus 2021; Niemand and Mai 2018). Always report the corrected chi-square and degrees of freedom, alongside a minimum of three global fit indices (RMSEA, CFI, SRMR) and local fit measures, to provide a comprehensive view of model fit and adjustment decisions (Crede and Harms 2019; Jessica Kay Flake and Fried 2020).
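Continuing the earlier lavaan sketch, Heywood cases can be screened for directly in the parameter table:

```r
pe <- parameterEstimates(fit_mlr, standardized = TRUE)

subset(pe, op == "~~" & lhs == rhs & est < 0)      # negative (residual) variances
subset(pe, op == "=~" & abs(std.all) > 1)          # standardized loadings above one
```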

2.6 Model Comparisons and Modifications

Researchers embarking on CFA should avoid prematurely committing to a specific factor structure without thoroughly evaluating and comparing alternate configurations. It’s advisable to consider various potential structures early in the study design, ensuring the selected model is based on its merits relative to competing theories (Jackson, Gillaspy, and Purc-Stephenson 2009). Since models are inherently approximations of reality, adopting the most effective “working hypothesis” is a dynamic process, contingent on ongoing assessments against emerging alternatives (Preacher and Yaremych 2023).

Good models are characterized not only by their interpretability, simplicity, and generalizability but notably by their capacity to surpass competing models in critical aspects. This competitive advantage frames the selected theory as the prevailing hypothesis until a more compelling alternative is identified (Preacher and Yaremych 2023).

The evaluation of model fit should extend beyond isolated assessments using fit indices. A comprehensive approach involves comparing multiple models, each grounded in substantiated theories, to discern the most accurate representation of the underlying structure. This comparative analysis is preferred over singular model evaluations, fostering a more holistic understanding of the phenomena under study (Preacher and Yaremych 2023).

Applying the candidate models uniformly to the same dataset, with identical software and sample size, constrains the researcher’s analytical degrees of freedom and mitigates the risk of results manipulation. This standardized approach underpins a more rigorous and transparent investigative process (Preacher and Yaremych 2023).

Model selection is instrumental in pinpointing the most effective explanatory framework for the observed phenomena, enabling the dismissal of poorer-performing models while retaining promising ones for further exploration. This methodological flexibility enhances the depth of analysis, contributing to the advancement of knowledge within the social sciences (Preacher and Yaremych 2023).

Adjustments to a model, particularly in response to unsatisfactory fit indices, should be theoretically grounded and reflective of findings from prior research. Blind adherence to a pre-established model or making hasty modifications can adversely affect the structural model’s integrity. Thoughtful adjustments, potentially revisiting exploratory factor analysis (EFA) or considering Exploratory SEM (ESEM) for cross-loadings representation, are preferable to drastic changes that might shift the study from confirmatory to exploratory research (Brown 2023; Jessica K. Flake, Pek, and Hehman 2017; Jackson, Gillaspy, and Purc-Stephenson 2009; Crede and Harms 2019).

All modifications to the measurement model, especially those enhancing model fit, must be meticulously documented to maintain transparency and support reproducibility (Jessica Kay Flake and Fried 2020). Openly reporting these adjustments, including item exclusions and inter-item correlations, is vital for the scientific integrity of the research (Nye 2022; Jessica Kay Flake et al. 2022).

Regarding model comparison and selection, traditional fit indices (SRMR, RMSEA, CFI) have limitations for direct model comparisons. Adjusted chi-square tests and information criteria like AIC and BIC are more suitable for this purpose, balancing model fit and parsimony. These criteria, however, should be applied with an understanding of their constraints and complemented by theoretical judgment to inform model selection decisions (Preacher and Yaremych 2023; Brown 2015; Huang 2017; Lai 2020, 2021).
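For instance, continuing the lavaan sketch, a competing one-factor structure (hypothetical here) can be compared with the two-factor model through a scaled difference test and information criteria:

```r
model_1f <- '
  g =~ it1 + it2 + it3 + it4 + it5 + it6 +
       it7 + it8 + it9 + it10 + it11 + it12
'
fit_1f <- cfa(model_1f, data = clean, estimator = "MLR", missing = "fiml")

lavTestLRT(fit_mlr, fit_1f)     # scaled chi-square difference test
AIC(fit_mlr, fit_1f)
BIC(fit_mlr, fit_1f)
```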

Ultimately, model selection in SEM is a nuanced process, blending empirical evidence with theoretical insights. Researchers are encouraged to leverage a range of models based on theoretical foundations, ensuring that the eventual model selection is not solely determined by statistical criteria but is also informed by substantive theory and expertise (Preacher and Yaremych 2023). This balanced approach underscores the importance of theory-driven research in the social sciences, guiding the interpretation and application of findings derived from chosen models.