
Journal of the Society for Social Work and Research, Volume 1, Issue 2, 66-82

October 2010  ISSN 1948-822X  DOI: 10.5243/jsswr.2010.6

An Introduction to Using Multidimensional Item Response Theory to Assess Latent Factor Structures
Philip Osteen
University of Maryland

This study provides an introduction to the use of multidimensional item response theory (MIRT) analysis for assessing latent factor structure, and compares this statistical technique to confirmatory factor analysis (CFA) in the evaluation of an original measure developed to assess students' motivations for entering a social work community of practice. The Participation in a Social Work Community of Practice Scale (PSWCoP) was administered to 506 master's of social work students from 11 accredited graduate programs. The psychometric properties and latent factor structure of the scale are evaluated using MIRT and CFA techniques. Although designed as a 3-factor measure, analysis of model fit using both CFA and MIRT does not support this solution. Instead, analyses using both methods produce convergent results supporting a 4-factor solution. Discussion includes methodological implications for social work research, focusing on the extension of MIRT analysis to assessment of measurement invariance in differential item functioning, differential test functioning, and differential factor functioning.

Keywords: item response theory, factor analysis, psychometrics

Author note: Philip J. Osteen is an assistant professor in the University of Maryland School of Social Work. All correspondence concerning this article should be directed to posteen@ssw.umaryland.edu

In comparison to classical test theory (CTT), item response theory (IRT) is considered the standard, if not preferred, method for conducting psychometric evaluations of new and established measures (Embretson & Reise, 2000; Fries, Bruce, & Cella, 2005; Lord, 1980; Ware, Bjorner, & Kosinski, 2000). Dubbed the modern test theory, IRT is used across scientific disciplines, including psychology, education, nursing, and public health. Considered a superior method because of IRT's ability to overcome inherent limitations of CTT, IRT provides researchers with an array of statistical tools for assessing measure characteristics. Unfortunately, there is a resounding paucity of published research in social work using IRT. A review of measurement-based articles appearing in journals specific to the social work field published between 2000 and 2006 showed that fewer than 5% of studies used IRT analysis to evaluate the psychometric properties of new and existing measures (Unick & Stone, 2010). Unick and Stone hypothesized several reasons for the absence of IRT analyses from social work journals, one of which was a lack of familiarity with key conceptual and practical components of IRT. Regardless of the reasons underlying the absence of IRT-based analyses in the social work literature, the field of social work will benefit from researchers becoming more familiar with IRT methods and incorporating these analyses into social work-based measurement studies. Historically regarded as a method for evaluating latent skill and ability traits in education, the application of IRT to measures of affective latent traits is becoming more common and accepted. As outlined in this article, drawing on the strengths of IRT as an alternative to, or ideally in conjunction with, CTT analyses supports social work researchers' development of rigorously substantiated measures.
This article provides social work researchers with a basic overview of IRT and a demonstration of the utility of IRT as compared with CTT-based factor analysis, using actual data obtained with the implementation of a novel measure of professional motivations of master's of social work (MSW) students. Published studies comparing IRT and confirmatory factor analysis (CFA) have focused almost exclusively on assessing measurement invariance. This study takes a different approach in comparing IRT and CTT by applying these theories to the assessment of multidimensional latent factor structures.

IRT

IRT is based on the premise that only two elements are responsible for a person's response on any given item: the person's ability and the characteristics of the item (Bond & Fox, 2001). The most common IRT model, called the Rasch or one-parameter logistic model, assumes the probability of a given response is a function of the person's ability and the difficulty of the item (Bond & Fox, 2001). More complex IRT models estimate the probability of a given response based on additional item characteristics such as discrimination and guessing (Bond & Fox, 2001).


Derived from its early use in educational measurement, the term ability may seem mismatched to psychosocial constructs; thus, the term latent trait may be more intuitive, and references to level of ability are synonymous with level of the latent trait. The IRT model produces estimates for both of these elements by calculating item-difficulty parameters on the basis of the total number of persons who correctly answered an item, and person-trait parameters on the basis of the total number of items successfully answered (Bond & Fox, 2001). The assumptions underlying these estimates are (a) that a person with more of the trait will always have a greater likelihood of success than a person with less of the trait, and (b) that any person will have a greater likelihood of endorsing items requiring less of the trait than items requiring more of the trait (Müller, Sokol, & Overton, 1999). Samejima (1969) and Andrich (1978) extended this model to measures with polytomous response formats (i.e., Likert scales) by adding an estimate to account for the difficulty in crossing the threshold from one level of response to the next (e.g., moving from agree to strongly agree).

Scale Evaluation Using IRT

The basic unit of IRT is the item response function (IRF) or item characteristic curve. The relationship between a respondent's performance and the characteristics underlying item performance can be described by a monotonically increasing function called the item characteristic curve (ICC; Henard, 2000). The ICC is typically a sigmoid curve estimating the probability of a given response based on a person's level of latent trait. The shape of the ICC is determined by the item characteristics estimated in the model. The ICC in a three-parameter IRT model is derived using the formula

P(θ) = c + (1 - c) exp[a(θ - b - f)] / (1 + exp[a(θ - b - f)])

where P(θ), the probability of a response given a person's level of the latent trait, denoted by theta (θ), is a function of guessing (c parameter), item discrimination (a parameter), item difficulty (b parameter), and the category threshold (f) if using a polytomous response format. For the one-parameter IRT model, the guessing parameter, c, is constrained to zero, assuming little or no impact of guessing. For example, a person cannot guess the correct response to an item using a Likert scale because items are not scored as right or wrong. The item discrimination parameter, a, is set to 1 under the assumption that there is equal discrimination across items. In a one-parameter model, the probability of a response is determined only by the person's level of the latent trait and the difficulty of the item.
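The formula above can be made concrete with a short numerical sketch. The following Python helper is illustrative only and is not the software used in the study; the parameter names mirror the equation, and setting c = 0 and a = 1 recovers the one-parameter case.

    import numpy as np

    def icc(theta, a=1.0, b=0.0, c=0.0, f=0.0):
        """Probability of a given response at latent-trait level theta under the
        three-parameter logistic model; with c = 0 and a = 1 this reduces to the
        one-parameter (Rasch-type) model used in this study."""
        z = a * (theta - b - f)
        return c + (1.0 - c) * np.exp(z) / (1.0 + np.exp(z))

    theta = np.linspace(-3, 3, 7)
    print(icc(theta, b=0.5))                 # one-parameter curve for an item with difficulty 0.5
    print(icc(theta, a=1.5, b=0.5, c=0.2))   # the same item under a three-parameter model

At θ = b + f the predicted probability is c + (1 - c)/2, which equals .50 when c = 0; this is the sense in which item difficulty marks the trait level at which endorsement becomes more likely than not.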

Item difficulty is an indication of the level of the underlying trait that is needed to endorse or respond in a certain way to the item. For items on a rating scale, the IRF is a mathematical function describing the relation between where an individual falls on the continuum of a given construct such as motivation and the probability that he or she will give a particular response to a scale item designed to measure that construct (Reise, Ainsworth, & Haviland, 2005). The basic goal of IRT modeling is to create a sample-free measure.

Multidimensional item response theory, or MIRT, is an extension of IRT and is used to explore the underlying dimensionality of an IRT model. Advances in computer software (e.g., ConQuest, MULTILOG, and Mplus) allow for testing and evaluation of more complex multidimensional item response models and enable researchers to statistically compare competing dimensional models. ACER ConQuest 2.0 (Wu, Adams, & Wilson, 2008), the software used in this study, produces marginal maximum likelihood estimates for the parameters of the models. The fit of the models is ascertained by generalizations of the Wright and Masters (1982) residual-based methods. Alternative dimensional models are evaluated using a likelihood ratio chi-squared statistic (χ²LR; Barnes, Chard, Wolfe, Stassen, & Williams, 2007).

Core statistical output of an IRT analysis of a one-parameter rating scale model includes estimates of person latent trait, item difficulty, model fit, person-fit, item-fit, person reliability, item reliability, and step calibration. A two-parameter model would include estimates for item discrimination, and a three-parameter model would include an additional estimate for guessing. Person latent trait is an estimate of the underlying trait present for each respondent. Persons with high person-ability scores possess more of the underlying trait than persons with low scores. Item difficulty is an estimate of the level of underlying trait at which a person has a 50% probability of endorsing the item. Items with higher item-difficulty scores require a respondent to have more of the underlying trait to endorse or correctly respond to the item than items with lower item-difficulty scores. Consider a measure of reading comprehension. An item requiring a 12th grade reading level is more difficult than an item requiring a 6th grade reading level. The same concept applies to a measure of motivation; an item requiring a high amount of motivation is more difficult than an item requiring a low amount of motivation. This idea translates to the concept of person-ability or latent trait. A person who reads at a 12th grade level has more ability than a person who reads at a 6th grade level; a person who is more motivated has more of the latent trait than a person who is less motivated.


Analysis of item fit. Fit statistics in IRT analysis include infit and outfit mean square (MNSQ) statistics. Infit and outfit are statistical representations of how well the data match the prescriptions of the IRT model (Bond & Fox, 2001). Outfit statistics are based on conventional sums of squared standardized residuals, and infit statistics are based on information-weighted sums of squared standardized residuals (Bond & Fox, 2001). Infit and outfit have expected MNSQ values of 1.00; values greater than or less than 1 indicate the degree of variation from the expected score. For example, an item with an infit MNSQ of 1.33 (1.00 + .33) indicates 33% more variation in responses to that item than was predicted by the model. Mean infit and outfit values represent the degree of overall fit of the data to the model, but infit and outfit statistics are also available for assessing fit at the individual item level (item-fit) and the individual person level (person-fit). Item-fit refers to how well the IRT model explains the responses to a particular item (Embretson & Reise, 2000). Person-fit refers to the consistency of an individual's pattern of responses across items (Embretson & Reise, 2000).

One limitation of IRT is the need for large samples. No clear standards exist for minimum sample size, although Embretson and Reise (2000) briefly noted that a sample of 500 respondents was recommended, and cautioned that parameter estimations might become unstable with samples of less than 350 respondents. Reeve and Fayers (2005) suggested that useful information about item characteristics could be obtained with samples of as few as 250 respondents. One-parameter models may yield reliable estimates with as few as 50 to 100 respondents (Linacre, 1994). As the complexity of the IRT model increases and more parameters are estimated, sample size should increase accordingly. Smith, Schumacker, and Bush (1998) provided the following sample-size-dependent cutoffs for determining poor fit: misfit is evident when MNSQ infit or outfit values are larger than 1.3 for samples less than 500, 1.2 for samples between 500 and 1,000, and 1.1 for samples larger than 1,000 respondents. According to Adams and Khoo (1996), items with adequate fit will have weighted MNSQs between .75 and 1.33. Bond and Fox (2001) stated that items routinely accepted as having adequate fit will have t values between -2 and +2. According to Wilson (2005), when working with large sample sizes, the researcher can expect the t-statistic to show significant values for several items regardless of fit; therefore, Wilson suggested that the researcher consider items problematic only if items are identified as misfitting based on both the weighted MNSQ and t-statistic.
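To illustrate how the unweighted (outfit) and information-weighted (infit) mean squares described above are formed from standardized residuals, the following sketch computes both for a dichotomous Rasch-type model. It is a simplification for illustration; the PSWCoP uses a polytomous rating scale, for which the same logic applies category by category, and the function name and inputs are hypothetical.

    import numpy as np

    def infit_outfit(X, P):
        """Item-level infit and outfit mean squares for a dichotomous Rasch-type model.
        X: observed responses (persons x items, coded 0/1); P: model-expected probabilities."""
        W = P * (1 - P)                      # model variance of each response
        Z2 = (X - P) ** 2 / W                # squared standardized residuals
        outfit = Z2.mean(axis=0)             # unweighted mean square, per item
        infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)   # information-weighted mean square
        return infit, outfit

Both statistics have an expected value of 1.00; an item whose observed responses vary more than the model predicts pushes these values above 1, as in the MNSQ = 1.33 example above.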

For rating scale models, category thresholds are provided in the IRT analysis. A category threshold is the point at which the probability of endorsing one category is equal to the probability of endorsing a corresponding category one step away. Although thresholds are ideally equidistant, that characteristic is not necessarily the reality. Guidelines indicate that thresholds should be at least 1.4 logits but no more than 5 logits apart (Linacre, 1999). Logits are the scale units for the log odds transformation. When thresholds have small logits, response categories may be too similar and nondiscriminant. Conversely, when the threshold logit is large, response categories may be too dissimilar and far apart, indicating the need for more response options as intermediate points. Infit and outfit statistics are also available for step calibrations. Outfit MNSQ values greater than 2.0 indicate that a particular response category is introducing noise into the measurement process and should be evaluated as a candidate for collapsing with an adjacent category (Bond & Fox, 2001; Linacre, 1999).

In conjunction with the standard output of IRT analysis, MIRT analysis provides information about dimensionality, the underlying latent factor structure. ACER ConQuest 2.0 (Wu et al., 2008) software provides estimations of population parameters for the multidimensional model, which include factor means, factor variances, and factor covariances/correlations. ACER ConQuest 2.0 also produces maps of latent variable distributions and response model parameter estimates.

Analysis of nested models. Two models are considered to be nested if one is a subset of the second. Overall model fit of an IRT model is based on the deviance statistic, which follows a chi-square distribution. The deviance statistic changes as parameters are added to or deleted from the model, and changes in fit between nested models can be statistically tested. The chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit (Kline, 2005). The χ²D is calculated as the difference between the model chi-square (χ²M) values of two nested models using the same data; the df for the χ²D statistic is the difference in dfs for the two nested models. The χ²D statistic tests the null hypothesis of identical fit of the two models to the population. Failure to reject the null hypothesis means that the two models fit the population equally. When two nested models fit the population equally well, the more parsimonious model is generally considered the more favorable.
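The chi-square difference test described above is straightforward to carry out once the two model chi-squares (or deviances) and their dfs are known. The sketch below uses scipy; the function and the numbers in the usage line are illustrative, not values from this study.

    from scipy.stats import chi2

    def chi_square_difference(chisq_restricted, df_restricted, chisq_full, df_full):
        """Chi-square difference (likelihood ratio) test for two nested models.
        The restricted model has fewer free parameters and therefore more df."""
        delta_chisq = chisq_restricted - chisq_full
        delta_df = df_restricted - df_full
        p_value = chi2.sf(delta_chisq, delta_df)   # upper-tail probability
        return delta_chisq, delta_df, p_value

    # Illustrative values only: a restricted model with chi-square 200 on 48 df
    # versus a fuller model with chi-square 80 on 35 df.
    print(chi_square_difference(200.0, 48, 80.0, 35))

A nonsignificant result would favor the more parsimonious (restricted) model, as noted above.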


Scale Evaluation Using CFA

Factor analysis is a more traditional method for analyzing the underlying dimensionality of a set of observed variables. Derived from CTT, factor analysis includes a variety of statistical procedures for exploring the relationships among a set of observed variables with the intent of identifying a smaller number of factors, the unobserved latent variables thought to be responsible for these relationships among the observed variables (Tabachnik & Fidell, 2007). CFA is used primarily as a means of testing hypotheses about the latent structure underlying a set of observed data. A common and preferred method for conducting CFA is structural equation modeling (SEM). The term SEM refers to a family of statistical procedures for assessing the degree of fit between observed data and an a priori hypothetical model in which the researcher specifies the relevant variables, which variables affect other variables, and the direction of those effects. The two main goals of SEM analysis are to explore patterns of correlations among a set of variables, both observed and unobserved, and to explain as much variance as possible using the model specified by the researcher (Klem, 2000; Kline, 2005).

Analysis of SEM models. Analysis of SEM models is based on the fit of the observed variance-covariance matrix to the proposed model. Although maximum likelihood (ML) estimation is the common method for deriving parameter estimates, it is not the only estimation method available. ML estimation produces parameter estimates that minimize the discrepancies between the observed covariances in the data and those predicted by the specified SEM model (Kline, 2005). Parameters are characteristics of the population of interest; without making observations of the entire population, parameters cannot be known and must be estimated from sample statistics. ML estimation assumes interval level data, and alternative methods, such as weighted least squares estimation, should be used with dichotomous and ordinal level data. Guo, Perron, and Gillespie (2009) noted in their review of social work SEM publications that ML estimation was sometimes used and reported inappropriately.

Analysis of model fit. Kline (2005) defined model fit as how well the model as a whole explained the data. When a model is overidentified, it is expected that model fit will not be perfect; it is therefore necessary to determine the actual degree of model fit, and whether the model fit is statistically acceptable. Ideally, indicators should load only on the specific latent variable identified in the measurement model. This type of model can be tested by constraining the direct effects between indicators and other factors to zero. According to Kline (2005), indicators are expected to be correlated with all factors in CFA models, but they should have higher estimated correlations with the factors they are believed to measure (emphasis in original, p. 177). A measurement model with indicators loading only on a single factor is desirable but elusive in practice with real data.

Statistical comparison of models with cross-loadings to models without cross-loadings allows the researcher to make stronger assertions about the underlying latent variable structure of a measure. As Guo et al. (2009) noted, modified models allowing cross-loadings between items and factors have been frequently published in social work literature without fully explaining how they related to models without cross-loadings.

Analysis of nested models. As noted in the discussion of MIRT analysis, two models are considered to be nested if one is a subset of the second. Overall model fit based on the chi-square distribution will change as paths are added to or deleted from a model. Kline's (2005) chi-square difference statistic (χ²D) can be used to test the statistical significance of the change in model fit.

MIRT versus CFA

MIRT and CFA analyses can be used to assess the dimensionality or underlying latent variable structure of a measure. The choice of statistical procedures raises questions about differences between analyses, whether the results of the two analyses are consistent, and what information can be obtained from one analysis but not the other. IRT addresses two problems inherent in CTT. First, IRT overcomes the problem of item-person confounding found in CTT. IRT analysis yields estimates of item difficulties and person abilities that are independent of each other, whereas in CTT item difficulty is assessed as a function of the abilities of the sample, and the abilities of respondents are assessed as a function of item difficulty (Bond & Fox, 2001), a limitation that extends to CFA. Second, the use of ordinal level data (i.e., rating scales), which are routinely treated in statistical analyses as continuous, interval-level data, may violate the scale and distributional assumptions of CFA (Wirth & Edwards, 2007). Violating these assumptions may result in model parameters that are biased and impossible to interpret (Wirth & Edwards, 2007, p. 58; DiStefano, 2002). The logarithmic transformation of ordinal level raw data into interval level data in IRT analysis overcomes this problem.

IRT and CTT also differ in the treatment of the standard error of measurement. The standard error of measurement is an indication of variability in scores due to error. Under CTT, the standard error of measurement is averaged across persons in the sample or population and is specific to that sample or population. Under IRT, the standard error of measurement is considered to vary across scores in the same population and to be population-general (Embretson & Reise, 2000).


The IRT approach to the standard error of measurement offers the following benefits: (a) the precision of measurement can be evaluated at any level of the latent trait instead of averaged over trait levels as in CTT, and (b) the contribution of each item to the overall precision of the measure can be assessed and used in item selection (Hambleton & Swaminathan, 1985).

MIRT and CFA differ in the estimation of item fit. Whereas item fit is assessed through error variances, communalities, and factor loadings in CFA, item fit is assessed through unweighted (outfit) and weighted (infit) mean square errors in IRT analyses (Bond & Fox, 2001). Further, the treatment of the relationship between indicator and latent variable, which is constrained to a linear relationship in CFA, can be nonlinear in IRT (Greguras, 2005). CFA uses one number, the factor loading, to represent the relationship between the indicator and the latent variable across all levels of the latent variable; in IRT, the relationship between indicator and latent variable is given across the range of possible values for the latent variable (Greguras, 2005). Potential implications of these differences include inconsistencies in parameter estimates, indicator and factor structure, and model fit across MIRT and CFA analyses.

Both IRT and CFA provide statistical indicators of psychometric performance not available in the other analysis. Using the item information curve (IIC), IRT analysis allows the researcher to establish both item information functions (IIF) and test information functions (TIF). The IIF estimates the precision and reliability of individual items independent of other items on the measure; the TIF provides the same information for the total test or measure, which is a useful tool in comparing and equating multiple tests (Hambleton et al., 1991; Embretson & Reise, 2000). IRT for polytomous response formats also provides estimated category thresholds for the probability of endorsing a given response category as a function of the level of underlying trait. These indices of item and test performance and category thresholds are not available in CFA, in which item and test performance are conditional on the other items on the measure. Conversely, CFA offers a wide range of indices for evaluating model fit, whereas IRT is limited to the use of the χ² deviance statistic. Reise, Widaman, and Pugh (1993) explicitly identified the need for modification indices and additional model fit indicators for IRT analyses as a limitation.
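The item and test information functions mentioned above have a simple closed form for dichotomous one- and two-parameter models. The sketch below is illustrative; the specific item parameters are made up, and it shows how a conditional standard error of measurement follows directly from the test information function.

    import numpy as np

    def item_information(theta, a, b):
        """Fisher information of a dichotomous 2PL item at trait level theta;
        a = 1 gives the one-parameter (Rasch-type) case."""
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a ** 2 * p * (1 - p)

    def test_information(theta, a_vec, b_vec):
        """Test information function: the sum of the item information functions."""
        return sum(item_information(theta, a, b) for a, b in zip(a_vec, b_vec))

    theta = np.linspace(-3, 3, 61)
    tif = test_information(theta, a_vec=[1.0, 1.0, 1.0], b_vec=[-1.0, 0.0, 1.0])  # hypothetical items
    se_theta = 1.0 / np.sqrt(tif)   # conditional standard error of measurement at each theta

Plotting se_theta against theta shows measurement precision varying across the trait range, which is the point of contrast with the single averaged standard error of CTT.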

Participation in a Social Work Community of Practice Scale

Although the content of the Participation in a Social Work Community of Practice Scale (PSWCoP) is less important in the current discussion than the methodologies used to evaluate the scale, a brief overview will provide context for interpreting the results of the analyses. The PSWCoP scale is an assessment of students' motivations for entering a master's of social work (MSW) program as conceptualized in Wenger, McDermott, and Snyder's (2002) three-dimensional model of motivation for participation in a community of practice. Wenger et al. (2002) asserted that all communities of practice are comprised of three fundamental elements (p. 27): a domain of knowledge defining a set of issues; a community of people who care about the domain; and the shared practice developed to be effective in that domain. Some individuals are motivated to participate because they care about the domain and are interested in its development. Some individuals are motivated to participate because they value being part of a community as well as the interaction and sharing with others that is part of having a community. Finally, some individuals are motivated to participate by a desire to learn about the practice as a means of improving their own techniques and approaches. The PSWCoP was developed as a multidimensional measure of the latent constructs domain motivation, community motivation, and practice motivation (Table 1). Data were collected from a convenience sample of students enrolled in MSW programs using a cross-sectional survey design and compared to the three-factor model developed from Wenger et al.

Method

Participants

A convenience sample of 528 current MSW students was drawn from 11 social work programs accredited by the Council on Social Work Education (CSWE). Participants were enrolled during two separate recruitment periods. The first round of recruitment yielded a nonrandom sample of 268 students drawn from nine academic institutions. The second round of recruitment yielded a nonrandom sample of 260 students drawn from eight institutions. Six institutions participated in both rounds of data collection, three institutions participated in only the first round of data collection, and two institutions participated in only the second round of data collection. The response rate for the study could not be calculated because there was no way to determine the total number of students who received information about the study or had access to the online survey. Twenty-two cases (4.1%) were removed because of missing data, yielding a final sample of 506 students; listwise deletion was used given the extremely small amount of missing data. Data were collected on multiple student characteristics including age, gender, race/ethnicity, sexual orientation, religious affiliation, participation in religious activities, family socioeconomic status (SES), and enrollment status.


Table 1
Original Items on the Participation in a Social Work Community of Practice Scale

Community (C_1): My main interest for entering the MSW program was to be a part of a community of social workers.
Community (C_2): I wanted to attend a MSW program so that I could be around people with similar values to me.
Community (C_3): I chose a MSW program because I thought social work values were more similar to my values than those of other professions.
Community (C_4)*: There is more diversity of values among students than I expected.
Community (C_5)*: Before entering the program, I was worried about whether or not I would fit in with my peers.
Community (C_6)*: Learning about the social work profession is less important to me than being part of a community of social workers.
Practice (P_1): Without a MSW degree, I am not qualified to be a social worker.
Practice (P_2): A MSW degree is necessary to be a good social worker.
Practice (P_3): Learning new social work skills was not a motivating factor in my decision to enter the MSW program.
Practice (P_4): My main reason for entering the MSW program was to acquire knowledge and/or skills.
Practice (P_5)*: A MSW degree will give me more professional opportunities than other professional degrees.
Practice (P_6)*: Being around students with similar goals is less important to me than developing my skills as a social worker.
Practice (P_7)*: Learning how to be a social worker is more important to me than learning about the social work profession.
Domain (D_1): I find social work appealing because it is different than the type of work I have done in the past.
Domain (D_2): I decided to enroll in a MSW program to see if social work is a good fit for me.
Domain (D_3): I wanted to attend a MSW program so that I could learn about the social work profession.
Domain (D_4): Entering the MSW program allowed me to explore a new area of professional interest.
Domain (D_5): My main reason for entering the MSW program was to decide if social work is the right profession for me.
*Items deleted from the final version of the PSWCoP

The mean age of participants was 30.2 years (SD = 8.7 years). The majority of students were female (92%). The majority of the participants were Caucasian (82.6%), with 7.3% of students self-identifying as African American or Black; 4.1% as Hispanic; 1.8% as Asian/Pacific Islander; and 4.1% as a nonspecified race/ethnicity. Students identified their enrollment status as either part-time (19.5%), first year (32.7%), advanced standing (27%), or second year (20.8%).

Measures

Analyses were conducted on an original measure of students' motivations for entering a social work community of practice, defined as pursuing a MSW degree. The PSWCoP was developed and evaluated using steps outlined by Benson and Clark (1982) and DeVellis (2003).

The pilot measure contained 18 items designed to measure three constructs (domain, community, and practice). Items were measured on a 6-point rating scale from strongly disagree to strongly agree. Items from the pilot measure, organized by subscale, are listed in Table 1. In addition to items on the PSWCoP, students were asked to provide demographic information.

Procedures

Participants completed the PSWCoP survey as part of a larger study exploring the relationship between students' motivations to pursue the MSW degree, their attitudes about diversity and historically marginalized groups, and their endorsement of professional social work values as identified in the National Association of Social Workers (2009) Code of Ethics. This research was approved by the University of Denver Institutional Review Board prior to recruitment and data collection.


Recruitment consisted of a two-pronged approach: (a) an e-mail providing an overview of the study and a link to the online survey was sent to students currently enrolled in the MSW program; and (b) an announcement providing an overview of the study and a link to the online survey was posted to student-oriented informational Web sites. Interested participants were able to access the anonymous, online survey through www.surveymonkey.com, which is a frequently used online survey provider. Participants were presented with a project information sheet and were required to indicate their consent to participate by clicking on the appropriate response before being allowed to access the actual survey.

Results

Reliability of scores from the PSWCoP was assessed using both CTT and IRT methods. SPSS (v.16.0.0, 2007) was used to calculate internal consistency reliability (Cronbach's α; inter-item correlations). ACER ConQuest 2.0 (Wu et al., 2008) was used to assess item reliability. The dimensionality and factor structure of the PSWCoP were evaluated using both a MIRT and a CFA approach. ACER ConQuest 2.0 (Wu et al., 2008) was used to conduct the MIRT analysis, and LISREL 8.8 (Jöreskog & Sörbom, 2007) was used to conduct the CFA analysis. ACER ConQuest 2.0 was used to evaluate the PSWCoP with respect to estimates of levels of latent trait and item difficulty using a one-parameter logistic model. Assessment of the measure was based on model fit, person-fit, item fit, person reliability, item reliability, step calibration, and population parameters for the multidimensional model.

Item Selection

Items were identified for possible deletion from each subscale using Cronbach's alpha, IRT MNSQ infit/outfit results, and theory. Poorly performing items identified through statistical analyses were further assessed using conceptual and theoretical frameworks. A combination of results led to the removal of three items from the community subscale and three items from the practice subscale, but no items from the domain subscale (Table 1). Items C_6, P_6, and P_7 addressed relationships between types of motivations by asking respondents to rate whether one type of motivation was more important than another type. Quantitative differences between types of motivations were not addressed in community of practice theory, and therefore these items were deemed not applicable in the measurement of each type of motivation. Items C_4 and C_5 were deleted from the community subscale because these items specifically addressed relationships between respondents and peers.

Community-based motivation arises out of perceived value congruence between the individual and the practice (i.e., professional social work), and not between the individual and other members of the community of practice. All analyses indicated problems with the practice subscale, and ultimately EFA was used with this subscale only. The results of the EFA suggested items P_1 and P_2 formed one factor, and items P_3 and P_4 constituted a second factor. Item P_5 did not load on either factor and was deleted.

The results of the item selection process yielded two competing models. The first model consisted of three factors in which all items developed for the practice subscale were kept together; this model most closely reflected the original hypothetical model developed based on community of practice theory. The second model had four factors, with the items from the hypothesized practice subscale split into the two factors suggested by the EFA. Internal consistency for each of the subscales on the final version of the PSWCoP was assessed using Cronbach's alpha. Cronbach's alpha was 0.64 for scores from the domain subscale, 0.68 for scores from the community subscale, and 0.47 for scores from the practice subscale (three-factor model). Splitting the practice subscale into two factors yielded a Cronbach's alpha of 0.58 for scores from the skills subscale and 0.68 for scores from the competency subscale. Although ultimately indicative of a poor measure, low internal consistency did not prohibit the application and comparison of factor analysis using CFA and MIRT.

Factor Structure

CFA. CFA analyses of the PSWCoP were conducted using LISREL 8.8 (Jöreskog & Sörbom, 2007). The data collected using the PSWCoP were considered ordinal based on the 6-point rating scale. When data are considered ordinal, Jöreskog and Sörbom (2007) advocated the use of PRELIS to calculate asymptotic covariances and polychoric correlations of all items modeled, and LISREL or SIMPLIS with weighted least squares estimation to test the structure of the data. Failure to use these guidelines may result in underestimated parameters, biased standard errors, and an inflated chi-square (χ²) model fit statistic (Flora & Curran, 2004). The chi-square difference statistic (χ²D) was used to test the statistical significance of the change in model fit between nested models (Kline, 2005). The χ²D was calculated as the difference between the model chi-square (χ²M) values of nested models using the same data; the df for the χ²D statistic is the difference in dfs for nested models. The χ²D statistic tested the null hypothesis of identical fit of two models to the population. In all, three nested models were evaluated and compared sequentially: a four-factor model with cross-loadings served as the baseline model, followed by a four-factor model without cross-loadings, and a three-factor model without cross-loadings.


The four-factor model with cross-loadings was chosen as the baseline model because it was presumed to demonstrate the best fit, having the fewest degrees of freedom. The primary models of interest were then compared against this baseline to estimate the change in model fit.

Sun (2005) recommended considering fit indices in four categories: sample-based absolute fit indices, sample-based relative fit indices, population-based absolute indices, and population-based relative fit indices. Sample-based fit indices are indicators of observed discrepancies between the reproduced covariance matrix and the sample covariance matrix. Population-based fit indices are estimations of the difference between the reproduced covariance matrix and the unknown population covariance matrix. At a minimum, Kline (2005) recommended interpreting and reporting four indices: the model chi-square (sample-based), the Steiger-Lind root mean square error of approximation (RMSEA; population-based), the Bentler comparative fit index (CFI; population-based), and the standardized root mean square residual (SRMR; sample-based). In addition to these fit indices, this study examined the Akaike information criterion (AIC; sample-based) and the goodness-of-fit index (GFI; sample-based). According to Jackson, Gillaspy, and Purc-Stephenson (2009), a review of CFA journal articles published over the past decade identified these six fit indices as the most commonly reported.

The range of values indicating good fit of observed data to the measurement model varies depending on the specific fit index. The model chi-square statistic tests the null hypothesis that the model has perfect fit in the population. Degrees of freedom for the chi-square statistic are equal to the number of observations minus the number of parameters to be estimated. Given its sensitivity to sample size, the chi-square test is often statistically significant. Kline (2005) suggested using a normed chi-square statistic obtained by dividing chi-square by df; ideally, these values should be less than three. The SRMR is a measure of the differences between observed and predicted correlations; in a model with good fit, these residuals will be close to zero. Hu and Bentler (1999) suggested that a SRMR < 0.08 indicates good model fit. The AIC is an indicator of comparative fit across nested models with an adjustment for model complexity.

The AIC is not an indicator of fit for a specific model; instead, the model with the lowest AIC from among the set of nested models is considered to have the best fit. The GFI is an assessment of incremental change in fit with an adjustment for model complexity; values greater than 0.90 indicate good fit. The RMSEA fit index is a measure of the lack of fit of the researcher's model to the population covariance matrix and tests the null hypothesis that the researcher's model has close approximate fit in the population. According to Kline (2005), good models have an RMSEA less than 0.05 and models with RMSEA greater than 0.10 have poor fit, while Browne and Cudeck (1993) suggested that a RMSEA less than 0.08 represents acceptable fit. The CFI assesses the improvement in fit of the researcher's model over a baseline model that assumes zero covariances among observed variables; values greater than 0.90 represent acceptable model fit.

Four-Factor Model with Cross-Loadings. A baseline CFA model was constructed using the four latent variables domain, community, competency, and skills, and items were allowed to cross-load on factors based on modification indices of the LISREL output. Based on the six fit indices previously described, the overall fit of the model was good: χ² = 64.48, df = 35, p = .00175; RMSEA = 0.04 (90% CI [.03, .05]); CFI = 0.98; SRMR = 0.043; AIC = 150.48; GFI = 0.97. Note that this solution was mathematically derived, and as such, there was no conceptual justification for the cross-loadings of these multiple items. This model served as the baseline against which competing models were compared.

Four-Factor Model without Cross-Loadings. The next model to be evaluated used the same four factors as the previous model, but items were constrained to load on specific factors. The standardized solution for the four-factor model without cross-loadings is shown in Figure 1. Based on the six fit indices previously described, the overall fit of the model was acceptable: χ² = 185.82, df = 48, p < 0.001; RMSEA = 0.07 (90% CI [.06, .09]); CFI = 0.91; SRMR = 0.094; AIC = 245.82; GFI = 0.91. When compared with the four-factor model with cross-loadings, this model demonstrated a significant increase in model misfit, χ²D(13) = 121.04, p < .001. However, the fit indices as a whole did not indicate poor fit, and as a conceptually derived and theory-supported model, the four-factor model without cross-loadings was preferable over the four-factor model with cross-loadings.
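As a quick check on two of the indices reported above, the normed chi-square and the RMSEA point estimate can be computed directly from the model chi-square, its df, and the sample size. The helper below is a sketch using the standard formulas (LISREL's internal computation, e.g., its use of N versus N - 1, may differ slightly); the values shown are those reported for the four-factor model with cross-loadings.

    import numpy as np

    def normed_chi_square(chisq, df):
        """Model chi-square divided by its degrees of freedom (values below 3 preferred)."""
        return chisq / df

    def rmsea(chisq, df, n):
        """Point estimate of the root mean square error of approximation."""
        return np.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

    print(normed_chi_square(64.48, 35))   # approximately 1.84
    print(rmsea(64.48, 35, 506))          # approximately 0.04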


Figure 1. Standardized solution for the four-factor PSWCoP model.

Three-Factor Model without Cross-Loadings. The three-factor model corresponded to the original model of the PSWCoP (Figure 2). Three latent variables were included in this model: domain, community, and practice. Items were constrained to load on the factor for which they were designed. The four items originally developed for the practice subscale were constrained to load on a single latent variable, which represented a perfect correlation between the previously used latent variables competency and skills. Based on the six fit indices previously described, the overall fit of the model was poor: χ² = 359.90, df = 51, p < 0.001; RMSEA = 0.11 (90% CI [.10, .12]); CFI = 0.80; SRMR = 0.12; AIC = 413.90; GFI = 0.85. When compared with the four-factor model without cross-loadings, this model demonstrated a significant increase in model misfit, χ²D(3) = 174.38, p < .001. All of the fit statistics indicated that the data did not fit the model.

Figure 2. Standardized solution for three-factor PSWCoP model


Summary of CFA of the PSWCoP. A summary of fit indices across nested models is provided in Table 2. The model with the best overall fit was the four-factor model in which items were allowed to load across all factors. The fit of this model was good, but the model lacked conceptual support and was not interpretable with respect to the underlying latent structure of the PSWCoP. Although the four-factor model with constrained loadings had a significant increase in model misfit over the four-factor model with cross-loadings, the four-factor model with constrained loadings demonstrated acceptable fit.

The results of the CFA on the four-factor model without cross-loadings supported the hypothesis of a multidimensional measure because correlations between latent variables were computed and there were no significant correlations between any pair of latent variables (α = .01). The four-factor model with constrained loadings was compared with a three-factor model based on the originally proposed measurement model for the PSWCoP. The conceptual difference between the two models was the placement of the items developed for the practice subscale. Constraining these four items to load on a single latent variable resulted in a large increase in model misfit. All of the reported fit statistics indicated a model with poor fit.

Table 2
Comparison of Fit Indices across Nested Models

                           Model 1:          Model 2:                 Model 3:
                           4 Factor Model    Unidimensional           Unidimensional
                                             4 Factor Model*          3 Factor Model**
χ² (df)                    64.48 (35)        185.52 (48)              359.90 (51)
Normed χ² (χ²/df)          1.84              3.86                     7.05
p-value (model)            .002              < .001                   < .001
χ²D (df1 - df2)            --                121.04 (13)              174.38 (3)
p-value (model diff)       --                < .001                   < .001
RMSEA                      0.04              0.07                     0.11
RMSEA 90% CI               [.03, .05]        [.06, .09]               [.10, .12]
CFI                        0.98              0.91                     0.80
SRMR                       0.04              0.09                     0.12
AIC                        150.48            245.82                   413.90
GFI                        0.97              0.91                     0.85
*Compared to Model 1. **Compared to Model 2.

Multidimensional Item Response Theory Analysis

The PSWCoP data were then analyzed using a one-parameter IRT model in Winsteps 3.66.0 (Linacre, 2006) Rasch measurement software, and MIRT analyses were conducted using ACER ConQuest 2.0, generalized item-response modeling software (Wu et al., 2008). The parameters for guessing were all constrained to zero, and the parameters for item discrimination were assumed equal and set to one. A thorough psychometric evaluation should ideally utilize at least a two-parameter model, especially when considering established measures, as item discrimination parameters are rarely the same. However, Reeve and Fayers (2005) suggested that the one-parameter model with equal item discrimination parameters is acceptable in the development and revision phase of scale construction, as was the case with the PSWCoP. The first set of analyses evaluated item difficulty, item fit, and reliability for a unidimensional model. The second set of analyses explored the dimensionality of the PSWCoP by comparing four- and three-factor models. The third set of analyses evaluated item difficulty, item fit, and reliability for the multidimensional models.

Rasch measurement results. Winsteps 3.68.0 (Linacre, 2006) Rasch measurement software was used to assess item difficulty, fit, and reliability for a unidimensional model. For affective measures, item difficulty refers to the amount of the construct needed to endorse, or respond positively, to the item. As the PSWCoP was designed to measure motivation, item difficulty was the amount of motivation needed to respond positively to the question.
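For readers who want to see what is being estimated here, the following deliberately simplified joint maximum likelihood (JML) sketch fits a dichotomous Rasch model with numpy. It is illustrative only: the study's data are polytomous and were analyzed with Winsteps and ACER ConQuest, which use more robust estimation (e.g., marginal maximum likelihood), and JML requires dropping persons or items with perfect or zero scores.

    import numpy as np

    def rasch_jml(X, n_iter=200, tol=1e-4):
        """Toy joint maximum likelihood estimation for a dichotomous Rasch model.
        X: persons x items matrix of 0/1 responses with no perfect or zero scores."""
        n_persons, n_items = X.shape
        theta = np.zeros(n_persons)   # person latent trait estimates
        b = np.zeros(n_items)         # item difficulty estimates
        for _ in range(n_iter):
            P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected responses
            W = P * (1.0 - P)                                         # model variances
            theta_new = theta + (X - P).sum(axis=1) / W.sum(axis=1)   # Newton step for persons
            b_new = b - (X - P).sum(axis=0) / W.sum(axis=0)           # Newton step for items
            b_new -= b_new.mean()              # identify the scale by centering difficulties
            converged = max(np.abs(theta_new - theta).max(), np.abs(b_new - b).max()) < tol
            theta, b = theta_new, b_new
            if converged:
                break
        return theta, b

Items with larger estimated b require more of the latent trait to endorse, which is exactly how the item difficulties in Table 3 are read.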


Person-ability or latent trait was the amount of motivation a given student possessed. In general, the range of the latent trait of the sample and the range of item difficulties were the same, and the distributions of persons and items about the mean were relatively symmetrical, indicating a good match between the latent trait of students and the difficulty of endorsing items. Exact numerical values for item difficulty are provided in Table 3 and ranged from -1.05 to +0.94. Item difficulty was scaled according to the theta metric and indicated the level of the latent trait at which the probability of a given response to the item was .50. Theta (θ) is the level of the latent trait being measured, scaled with a mean of zero and a standard deviation of one. Negative values indicated items that were easier to endorse, and positive values indicated items that were harder to endorse.

Item fit is an indication of how well an item performs according to the underlying IRT model being tested, and it is based on the comparison of observed responses to expected responses for each item. Adams and Khoo (1996) suggested that items with good fit have infit scores between 0.75 and 1.33; Bond and Fox (2001) suggested that items with good fit have t values between -2 and +2. Table 3 provides the fit statistics for the items of the PSWCoP survey; according to this output, only item P_3 exceeded Bond and Fox's guideline, and no items exceeded Adams and Khoo's guideline.

Table 3
Rasch Analysis of Full Survey Item Difficulty and Fit

Item  Label   Est.    S.E.   Infit MNSQ   Infit ZSTD   Outfit MNSQ   Outfit ZSTD
1     C_1      0.30    .04    1.05          0.9          1.06           1.2
2     C_2      0.04    .04    0.93         -1.1          0.94          -1.0
3     C_3      0.05    .03    1.02          0.5          1.06           1.1
4     P_1     -0.56    .04    1.01          0.1          1.06           0.8
5     P_2      0.30    .04    0.98         -0.4          1.00           0.1
6     D_1      0.68    .04    0.94         -1.1          0.93          -1.1
7     D_2     -0.11    .04    0.91         -1.4          0.89          -1.6
8     D_3      0.24    .04    1.01          0.1          1.05           0.9
9     D_4     -0.33    .04    1.07          1.1          1.08           1.1
10    D_5      0.94    .04    0.97         -0.4          0.95          -0.7
11    P_3     -0.51    .04    1.17          2.1          1.35           4.0
12    P_4     -1.05    .06    0.93         -0.7          0.92          -0.9

IRT analysis produced an item reliability index indicating the extent to which item estimates would be consistent across different samples of respondents with similar abilities. High item reliability indicates that the ordering of items by difficulty will be somewhat consistent across samples. The reliability index of items for the PSWCoP pilot survey was 0.99 and indicated consistency in the ordering of items by difficulty. IRT analysis also produced a person-reliability index that indicated the extent of consistency in respondent ordering based on level of latent trait if given an equivalent set of items (Bond & Fox, 2001).

The reliability index of persons for the PSWCoP was 0.60 and indicated low consistency in ordering of persons by level of latent trait, which was possibly due to a constricted range of the latent trait in the sample or a constricted range of item difficulty.

MIRT factor structure. One of the core assumptions of IRT is unidimensionality; in other words, person-ability can be attributed to a single latent construct, and each item contributes to the measure of that construct (Bond & Fox, 2001). However, whether intended or not, item responses may be attributable to more than one latent construct. MIRT analyses allow the researcher to assess the dimensionality of the measure. Multidimensional models can be classified as either within items or between items (Adams, Wilson, & Wang, 1997). Within-items multidimensional models have items that can function as indicators of more than one dimension, and between-items multidimensional models have subsets of items that are mutually exclusive and measure only one dimension.
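The distinction can be pictured as an item-by-dimension design matrix. The toy matrices below describe a hypothetical 6-item, 2-dimension measure (not the PSWCoP) and show how a between-items structure assigns each item to exactly one dimension while a within-items structure lets some items indicate more than one.

    import numpy as np

    # Between-items structure: each item is an indicator of exactly one dimension.
    between_items = np.array([[1, 0],
                              [1, 0],
                              [1, 0],
                              [0, 1],
                              [0, 1],
                              [0, 1]])

    # Within-items structure: some items serve as indicators of both dimensions.
    within_items = np.array([[1, 0],
                             [1, 1],   # item 2 loads on both dimensions
                             [1, 0],
                             [0, 1],
                             [1, 1],   # item 5 loads on both dimensions
                             [0, 1]])

The PSWCoP models evaluated below are between-items models: each item is assigned to a single subscale.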


Competing multidimensional models can be evaluated based on changes in model deviance and the number of parameters estimated. A chi-square statistic is calculated as the difference in deviance (G²) between two nested models, with df equal to the difference in the number of parameters for the nested models. A statistically significant result indicates a difference in model fit. When a difference in fit is found, the model with the smallest deviance is selected; when a difference in model fit is not found, the more parsimonious model is selected.

The baseline MIRT model corresponded to the four-factor model with no cross-loadings estimated in the CFA (Figure 1). This baseline model was a between-items multidimensional model with items placed in mutually exclusive subsets. The four dimensions in the model were community, competency, domain, and skills. The baseline model fit statistic was G² = 17558.64 with 26 parameters. A three-dimensional, between-items multidimensional model, corresponding to the theoretical model of the PSWCoP (Figure 2), was tested against the baseline model. The three-dimensional model fit statistic was G² = 17728.83 with 22 parameters. When compared with the four-dimensional model, the change in model fit was statistically significant and indicated that the fit of the three-dimensional model was worse than the fit of the four-dimensional model (χ²(4) = 170.19, p < .001).

Table 4
Comparison of Model Fit Across Nested Models

                             Four Factor (Between)    Three Factor* (Between)
Deviance (G²)                17558.64                 17728.83
df                           26                       22
G²1 - G²2                    --                       -170.19
df1 - df2                    --                       4
(G²1 - G²2)/(df1 - df2)      --                       42.55
p-value                      --                       < .001
*Compared to the Four Factor, Between-Items Model

Based on the change in model fit between nested models, the four-dimensional, between-items model had the better fit. This model resulted in a more accurate reproduction of the probability of endorsing a specific level or step of an item for a person with a particular level of the latent trait (Reckase, 1997). Thus, the four-dimensional model yielded the greatest reduction in discrepancy between observed and expected responses.

Item difficulty. MIRT analyses yielded an item-person map by dimension. The output of the MIRT item-person map (Figure 3) provided a visual estimate of the latent trait in the sample, item difficulty, and each dimension. Items are ranked in the right-hand column by difficulty, with items at the top being more difficult than items at the bottom. Although the range of item difficulty was narrow, items were well dispersed around the mean. Each dimension or factor has its own column with estimates for respondents' abilities. Two inferences were made based on the MIRT item-person map. First, although the range of item difficulty was narrow, items appeared to be dispersed in terms of difficulty with a range of -0.81 to +0.84. Furthermore, regarding Dimensions 1, 2, and 3, the item difficulties appeared to be well matched to levels of the latent trait, though over a limited range of the construct as scaled via the theta metric. Second, based on the means of the dimensions, Dimension 2 (competency, x̄2 = 0.069) and Dimension 3 (domain, x̄3 = -0.074) did a better job of representing all levels of these types of motivation than the other two dimensions. The small positive mean of Dimension 1 (community, x̄1 = 0.335) indicated that students sampled for this study found it somewhat easier to endorse those items, whereas the large positive mean of Dimension 4 (skills, x̄4 = 1.42) indicated that students sampled for this study found it very easy to endorse those items.


Figure 3. MIRT latent variable item-person map.

Item fit. Table 5 summarizes the items' characteristics. In addition to the estimation of item difficulties, infit and outfit statistics are reported. Using Adams and Khoo's (1996) guideline, only item C_2 showed poor fit (MNSQ = 0.68). In contrast, Bond and Fox's (2001) guideline identified several items as having poor fit (based on a 95% CI for MNSQ): C_1, C_2, D_1, D_3, and D_4.

Table 5
Item Parameter Estimates for the Four-Dimensional Model

Item  Label   Est.     S.E.   Infit MNSQ   Infit ZSTD   Outfit MNSQ   Outfit ZSTD
1     C_1      0.40    0.03    0.77         -3.8          0.77          -4.5
2     C_2      0.21    0.03    0.68         -5.6          0.67          -6.5
3     C_3     -0.61*   0.04    1.02          0.4          1.04           0.6
4     P_1     -0.14    0.03    1.01          0.2          1.00           0.0
5     P_2      0.14*   0.03    0.96         -0.5          0.93          -1.1
6     D_1      0.11    0.03    1.21          3.0          1.18           3.0
7     D_2      0.51    0.03    1.02          0.4          1.04           0.7
8     D_3     -0.65    0.03    1.29          4.1          1.30           4.4
9     D_4     -0.81    0.03    1.17          2.5          1.22           3.1
10    D_5      0.84*   0.06    0.95         -0.7          0.98          -0.2
11    P_3_R    0.33    0.04    1.00         -0.0          1.02           0.4
12    P_4     -0.33*   0.04    0.99         -0.2          1.00          -0.0
*Indicates that a parameter estimate is constrained.


Discussion

The rigor and sophistication with which social workers conduct psychometric assessments can be strengthened. Guo et al. (2010) found that social workers underutilize CFA, and more generally SEM, analyses. Further, even when those approaches are used appropriately, considerable room remains for improvement in reporting (Guo et al., 2010). Similarly, Unick and Stone (2010) found the use of IRT analyses for psychometric evaluation was noticeably missing from the social work literature. Developing familiarity and proficiency with strong psychometric methods will empower social workers in developing and selecting appropriate measures for research, policy, and practice.

Integration of CFA and MIRT Results

The primary result from both the CFA and MIRT analyses was the establishment of the PSWCoP as a multidimensional measure. Both sets of analyses identified a four-factor model in which items loaded on a single factor as having the best model fit when compared with the three-factor model. In addition, both analytic strategies identified significant problems with the PSWCoP. Low subscale internal consistencies might be due to the small number of items for the community, skills, and competency subscales, as well as the inability to capture the complexity of different types of motivation for participating in a social work community of practice. CFA identified multiple items with high (>.7) error variances, and IRT analyses indicated poor fit for several items. Although the results of the analyses identified the PSWCoP as having limited utility, these poor psychometric properties did not prohibit CFA and MIRT analyses.

The CFA analysis was found to be more informative at the subscale level, whereas the MIRT analysis was found to be more informative at the item level. CFA was more informative regarding subscale composition and assessing associations among factors. The CFA analysis led to a final form of the PSWCoP with four subscales and beginning evidence supporting the factorial validity of the measure. As indicated by the nonsignificant correlations among factors, each subscale appeared to be tapping separate constructs. Although MIRT allows the researcher to model factor structure, this approach does not estimate relationships between factors. MIRT analyses were found to be more informative for assessing individual item performance. Item difficulty estimates were obtained for the PSWCoP as a whole and for each subscale. Items on the PSWCoP appeared to be a good match for the levels of latent trait of the respondents with regard to the community, domain, and competency factors, but too easy for the skills factor.

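The targeting conclusion above rests on comparing the distribution of person ability estimates with the distribution of item difficulties, which is what the item-person map in Figure 3 displays graphically. The sketch below illustrates that comparison for a single subscale; only the item difficulties are taken from Table 5 (the domain items), while the person estimates, bin width, and output format are hypothetical.

```python
import numpy as np

# Item difficulties (logits) for one subscale, taken from Table 5 (domain items D_1-D_5).
item_b = np.array([0.11, 0.51, -0.65, -0.81, 0.84])

# Hypothetical person ability estimates for the same subscale (simulated for illustration).
person_theta = np.random.default_rng(1).normal(loc=1.2, scale=0.8, size=500)

# Targeting check: a large positive gap means the items are "too easy" for the sample,
# the pattern reported for the skills factor.
gap = person_theta.mean() - item_b.mean()
print(f"mean ability - mean difficulty = {gap:.2f} logits")

# A crude text version of an item-person map: persons per half-logit bin beside item locations.
edges = np.arange(-3.0, 3.5, 0.5)
counts, _ = np.histogram(person_theta, bins=edges)
for lo, n in zip(edges[:-1], counts):
    here = " ".join(f"[{d:.2f}]" for d in item_b if lo <= d < lo + 0.5)
    print(f"{lo:5.1f} | {'#' * (n // 10):<12} {here}")
```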
Based on infit and outfit statistics, MIRT analyses identified additional items exhibiting poor fit as compared with the CFA. Specifically, two items on the community subscale had large standardized fit scores in the IRT analysis but displayed high factor loadings and low error variances in the CFA. The IRT analyses also provided estimates of the item information function and test information function, making it possible to obtain standard errors of measurement at specific levels of the latent trait instead of relying on the averaged standard error of measurement obtained from the CFA.

Strengths and Limitations

Reliance on a convenience sample is a significant limitation of this study. The extent to which participants in this study were representative of the larger population of MSW students could not be determined. Although IRT purports to generate sample-independent item characteristic estimates, the stability of these estimates is enhanced when the sample is heterogeneous with regard to the latent trait. It is possible that students who self-selected to complete the measure were overly similar.

A study strength is its contribution to the field of psychometric assessment. Previous studies comparing IRT and CFA have dealt almost exclusively with assessing measurement invariance across multiple samples (e.g., Meade & Lautenschlager, 2004; Raju, Laffitte, & Byrne, 2002; Reise et al., 1993). The current study addresses emerging issues in measurement theory by applying IRT analyses to multidimensional latent variable measures and comparing MIRT and CFA assessments of factor structure in a novel measure.

Implications for Social Work Research

In addition to the benefits of using IRT/MIRT analytic procedures outlined in this paper, the ability of these techniques to assess differential item functioning (DIF) and differential test functioning (DTF) is a major advantage over CTT methods. Wilson (1985) described DIF as an indication of whether an item performs the same for members of different groups who have the same level of the latent trait, whereas DTF concerns whether a set of items performs the same across different groups (Badia, Prieto, Roset, Díez-Pérez, & Herdman, 2002). If DIF/DTF exists, "respondents from the subgroups who share the same level of a latent trait do not have the same probability of endorsing a test item" (Embretson & Reise, 2000, p. 252). The ability to assess potential bias in items and tests provides a powerful method for developing culturally competent measures (Teresi, 2006). Valid comparisons between groups require measurement invariance, and IRT provides an additional tool for examining both items and tests.
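As a concrete illustration of the DIF idea, one common screening approach (not the analysis reported in this study) is to calibrate item difficulties separately in two groups and flag items whose difficulty shifts by more than a chosen threshold. The sketch below uses a rough PROX-style normal approximation on simulated dichotomous data; the function names, groups, and cutoff are hypothetical.

```python
import numpy as np

def prox_difficulties(X):
    """Rough Rasch item difficulties via a PROX-style normal approximation.

    X is a persons x items matrix of 0/1 responses. Estimates are centered so
    the two groups share a common metric for comparison.
    """
    p = np.clip(X.mean(axis=0), 0.01, 0.99)   # proportion endorsing each item
    b = np.log((1.0 - p) / p)                 # higher value = harder to endorse
    return b - b.mean()

def dif_contrast(X_reference, X_focal):
    """Difference in item difficulty between a focal and a reference group.

    Large absolute contrasts (a cutoff of roughly 0.5 logits is a common rule
    of thumb) suggest the item may function differently across groups.
    """
    return prox_difficulties(X_focal) - prox_difficulties(X_reference)

# Simulated illustration: two groups answering the same 12 dichotomized items.
rng = np.random.default_rng(2)
X_reference = (rng.random((300, 12)) < 0.6).astype(int)
X_focal = (rng.random((200, 12)) < 0.6).astype(int)

contrasts = dif_contrast(X_reference, X_focal)
print(np.round(contrasts, 2))
print("flagged items:", np.where(np.abs(contrasts) > 0.5)[0])
```

In practice, DIF is evaluated with formal procedures (e.g., likelihood-ratio or Mantel-Haenszel tests) that also account for sampling error in the contrasts.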

Additional benefits of IRT analyses include the ability to conduct test equating and to develop adaptive testing. The core question of test equating is the extent to which scores from two measures presumed to assess the same construct are comparable. For example, can the Beck Depression Inventory and the Center for Epidemiological Studies Depression Scale (CES-D) be equated? Adaptive testing allows the researcher to match specific items to different levels of ability to more finely discern a person's ability; persons estimated to have high ability may receive a different set of items than persons estimated to have low ability (a minimal sketch of this item-selection logic appears at the end of this section). With the increasing availability of statistical software for conducting MIRT analyses, the potential also exists for developing models with greater complexity for testing differential factor functioning (DFF). Akin to testing measurement invariance using CFA techniques, DFF analyses will provide researchers with an assessment of potential bias in the performance of factors (e.g., subscales) across groups.

A final consideration is choosing between the different psychometric strategies outlined in this paper. Ideally, both methods should be integrated. Doing so gives the researcher access to unique information available only from each analytic method, allows the researcher to compare common elements of both analyses, and minimizes the impact of each method's limitations. If applying both methods is not possible, theoretical and practical considerations can inform the decision. IRT is a stronger choice when data are dichotomous or ordinal because raw scores are transformed to an interval scale. If the relationship between items and factors is nonlinear or unknown, IRT will yield less biased estimates than CFA. If the construct to be measured is presumed to be unidimensional, IRT is a better strategy because of the additional information provided in the item analysis. Both MIRT and CFA are informative in assessing latent factor structures, but only CFA allows the researcher to estimate relationships between factors. Both strategies perform better with large sample sizes, but IRT is affected more negatively by smaller samples given the larger number of parameters being estimated. If possible, IRT/MIRT analysis should be limited to samples of 200 or more respondents. Conversely, IRT analyses yield stable results with very few items, whereas CTT reliability varies in part as a function of the number of items.
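Returning to the adaptive-testing and information-function points above, the following sketch shows the basic logic under a dichotomous Rasch model: item information peaks where item difficulty matches the current ability estimate, the next item can be chosen to maximize that information, and the conditional standard error of measurement follows directly from the accumulated test information. The item bank, ability estimate, and function names are hypothetical.

```python
import numpy as np

def item_information(theta, b):
    """Fisher information of a dichotomous Rasch item at ability level theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(b))))
    return p * (1.0 - p)

def conditional_sem(theta, administered_b):
    """Standard error of measurement at theta from the administered items' information."""
    total_information = item_information(theta, administered_b).sum()
    return 1.0 / np.sqrt(total_information)

def next_item(theta_hat, bank_b, administered):
    """Adaptive selection: the unused item with maximum information at the current estimate."""
    candidates = [i for i in range(len(bank_b)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, bank_b[i]))

# Hypothetical item bank (difficulties in logits) and a provisional ability estimate.
bank_b = np.array([-1.5, -0.8, -0.3, 0.0, 0.4, 0.9, 1.6])
administered = {3}          # indices of items already given
theta_hat = 0.5             # current ability estimate after those items

print("next item index:", next_item(theta_hat, bank_b, administered))
print("current SEM:", round(conditional_sem(theta_hat, bank_b[list(administered)]), 2))
```

The same information-function logic underlies the contrast noted earlier between IRT's trait-level-specific standard errors and the single averaged standard error of measurement available from the CFA.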

References

Adams, R. J., & Khoo, S. T. (1996). ACER Quest [Computer software]. Melbourne, Australia: ACER.
Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-24. doi:10.1177/0146621697211001
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Badia, X., Prieto, L., Roset, M., Díez-Pérez, A., & Herdman, M. (2002). Development of a short osteoporosis quality of life questionnaire by equating items from two existing instruments. Journal of Clinical Epidemiology, 55, 32-40. doi:10.1016/S0895-4356(01)00432-2
Benson, J., & Clark, F. (1982). A guide for instrument development and validation. American Journal of Occupational Therapy, 36, 789-800.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
DeVellis, R. F. (2003). Scale development: Theory and applications. Thousand Oaks, CA: Sage.
DiStefano, C. (2002). The impact of categorization with confirmatory factor analysis. Structural Equation Modeling, 9, 327-346. doi:10.1207/S15328007SEM0903_2
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9, 466-491. doi:10.1037/1082-989X.9.4.466
Fries, J., Bruce, B., & Cella, D. (2005). The promise of PROMIS: Using item response theory to improve assessment of patient-reported outcomes. Clinical and Experimental Rheumatology, 23(5, Suppl. 39), S53-S57.
Greguras, G. J. (2005). Managerial experience and the measurement equivalence of performance ratings. Journal of Business and Psychology, 19(3), 383-397. doi:10.1007/s10869-004-2234-y
Guo, B., Perron, B. E., & Gillespie, D. F. (2009). A systematic review of structural equation modeling in social work research. British Journal of Social Work, 39, 1556-1574. doi:10.1093/bjsw/bcn101
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer/Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Henard, D. H. (2000). Item response theory. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 67-98). Washington, DC: American Psychological Association.
Jöreskog, K. G., & Sörbom, D. (2007). LISREL 8.80 for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
Klem, L. (2000). Structural equation modeling. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 227-260). Washington, DC: American Psychological Association.
Kline, R. B. (2005). Principles and practice of structural equation modeling (2nd ed.). New York, NY: Guilford Press.
Linacre, J. K. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.
Linacre, J. K. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103-122.
Linacre, J. K. (2006). Winsteps Rasch measurement 3.68.0 [Computer software]. Chicago, IL: Author.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7(4), 361-388. doi:10.1177/1094428104268027
Müller, U., Sokol, B., & Overton, W. F. (1999). Developmental sequences in class reasoning and propositional reasoning. Journal of Experimental Child Psychology, 74, 69-106. doi:10.1006/jecp.1999.2510
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517-529. doi:10.1037/0021-9010.87.3.517

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: MESA Press.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-27. doi:10.1177/0146621697211002
Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evaluating questionnaire items and scale properties. In P. M. Fayers & R. D. Hays (Eds.), Assessing quality of life in clinical trials: Methods and practice (2nd ed., pp. 55-73). New York, NY: Oxford University Press.
Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals, applications, and promises in psychological research. Current Directions in Psychological Science, 14(2), 95-101.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566. doi:10.1037/0033-2909.114.3.552
Samejima, F. (1969). Estimation of latent ability using a response format of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from http://www.psychometrika.org/journal/online/MN17.pdf
Smith, R. M., Schumaker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2(1), 66-78.
SPSS. (2007). SPSS for Windows, Rel. 16.0.0 [Computer software]. Chicago, IL: SPSS Inc.
Sun, J. (2005). Assessing goodness of fit in confirmatory factor analysis. Measurement and Evaluation in Counseling and Development, 37(4), 240-256.
Swaminathan, H., & Gifford, J. A. (1979). Estimation of parameters in the three-parameter latent-trait model (Laboratory of Psychometric and Evaluation Research Report No. 90). Amherst: University of Massachusetts.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston, MA: Allyn and Bacon.
Teresi, J. A. (2006). Overview of quantitative measurement methods: Equivalence, invariance and differential item functioning in health applications. Medical Care, 44, S39-S49.

Unick, G. J., & Stone, S. (2010). State of modern measurement approaches in social work research literature. Social Work Research, 34(2), 94-101.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Medical Care, 38, 1173-1182. doi:10.1097/00005650-200009002-00011
Wenger, E., McDermott, R., & Snyder, W. M. (2002). Cultivating communities of practice. Boston, MA: Harvard Business School Press.

Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58-79. doi:10.1037/1082-989X.12.1.58
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.
Wu, M. L., Adams, R. J., Wilson, M., & Haldane, S. (2008). ACER ConQuest 2.0: Generalized item response modeling software [Computer software]. Hawthorn, Australia: ACER.
