Preface - As part of the Factor Analysis course I took this Winter Quarter, I had to write a 2-page definition of factor analysis as if I were writing for an encyclopedia being produced for those involved in quantitative research. This piece is an elaboration of that homework assignment, including all of the material I had to cut out for reasons of space and correcting certain things that the instructor pointed out to me. The first time any bit of technical terminology is used, I'll type it in ALL CAPS to help set it off (I used boldface in the original). A few of these terms don't get much explanation in context, so I put in an endnotes section to cover them.

FACTOR ANALYSIS - Put simply, factor analysis is a method for analyzing multiple measurements and looking for underlying causes for any relationships between the measurements. Put more simply, it's a way (but not the only way) to take a bunch of tests and see what they have in common. Of course, this simplification misses many important details, but it's a good start.

When comparing the results of several tests, one can find CORRELATIONS between the results of any two tests taken together, which suggests the two tests have something in common. The theory behind factor analysis is that the reason these two tests are correlated is that both are trying to get at an unmeasurable FACTOR...because the factor influences both tests, a subject's score on one test can tell you something about their score on the other test. A factor is a trait of some sort, like intelligence, emotional stability or artistic ability, which cannot be known exactly. Any test can only approximate a given factor, and usually contains measurements of more than one factor at a time. Methods of factor analysis try to figure out how much a particular factor influences a given test without actually knowing what the value of the factor is.
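The claim that a shared factor makes two tests correlate can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name, the loadings of 0.8 and 0.6, and the use of standard normal scores are all made up for illustration, and each test is scaled to have variance 1. Under those assumptions the correlation between the two tests should land close to the product of the two loadings (here 0.8 x 0.6 = 0.48), even though the factor itself is never looked at when computing the correlation.

```python
import numpy as np

def simulate_two_tests(loading_a, loading_b, n_subjects=20000, seed=0):
    """Simulate two test scores that both depend on one unmeasured factor.

    loading_a and loading_b are hypothetical factor loadings (between -1 and 1).
    """
    rng = np.random.default_rng(seed)
    factor = rng.standard_normal(n_subjects)   # the latent trait, never observed
    noise_a = rng.standard_normal(n_subjects)  # everything else affecting test A
    noise_b = rng.standard_normal(n_subjects)  # everything else affecting test B
    # Scale the noise so that each test score has variance 1.
    test_a = loading_a * factor + np.sqrt(1 - loading_a**2) * noise_a
    test_b = loading_b * factor + np.sqrt(1 - loading_b**2) * noise_b
    return test_a, test_b

test_a, test_b = simulate_two_tests(0.8, 0.6)
observed_r = np.corrcoef(test_a, test_b)[0, 1]
print(round(observed_r, 2))  # should be near 0.8 * 0.6 = 0.48
```

Setting either loading to zero in this sketch removes the shared influence, and the correlation between the two tests drops to roughly zero.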
So, in general, factor analysis tries to use what can be measured (called MANIFEST VARIABLES) to make some sort of statement about how the tests are linked to what cannot be measured (called factors or LATENT VARIABLES). This statement is in the form of FACTOR LOADINGS, a measure of how much someone's score on a test is influenced by their unmeasurable ability in a given factor. Specifically, factor analysis makes the assumption that a subject's deviation from the mean score on a test is due to three different types of unmeasurable factor:

COMMON FACTORS: These are factors which influence more than one test in a battery (group of tests). Finding the factor loadings of these latent variables is the main goal of most factor analysis methods.

SPECIFIC FACTORS: These are latent variables that only influence one test in a battery. A single specific factor may actually be the result of several traits, so long as none of these traits affects any other test. A specific factor may become a common factor if a new test, influenced by that factor, is added to the battery.

ERROR FACTORS: While common and specific factors represent things that presumably remain fairly constant in a subject across administrations of a particular test, error factors represent the effect of unreliable influences, such as the subject's mood, state of readiness or environment.

Because specific and error factors relate only to performance on a single test, they are usually grouped together as a single UNIQUE FACTOR for each test. Each specific factor only affects a single test, and each error factor only affects a specific administration of that test, so there is no correlation between unique factors.

Many methods of factor analysis exist, but all have in common that they try to separate the common factors from the unique factors so as to explain correlations between the manifest variables. The two major groups these methods are usually split into are EXPLORATORY and CONFIRMATORY factor analysis.
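The split into common and unique factors can be made concrete with a little arithmetic. The sketch below is not any particular package's method. It assumes the tests are standardized (variance 1) and that the common factors are uncorrelated with each other, and the loadings are invented for a hypothetical battery of four tests and two common factors. Under those assumptions, the correlation between two different tests comes entirely from the common factors, while the unique factors only fill in whatever variance of each test is left over.

```python
import numpy as np

# Hypothetical loadings: 4 tests (rows) on 2 common factors (columns).
loadings = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.1, 0.6],
    [0.0, 0.9],
])

def implied_correlations(loadings):
    """Correlation matrix implied by uncorrelated common factors plus unique factors."""
    common = loadings @ loadings.T   # part of each correlation due to common factors
    communality = np.diag(common)    # variance of each test due to common factors
    uniqueness = 1.0 - communality   # leftover variance: specific plus error
    # Unique factors are uncorrelated with everything else,
    # so they contribute only to the diagonal.
    return common + np.diag(uniqueness), uniqueness

R, uniqueness = implied_correlations(loadings)
print(np.round(R, 2))           # correlations between the four tests
print(np.round(uniqueness, 2))  # unique variance of each test
```

In this made-up battery, tests 1 and 2 correlate because both load on the first factor, while tests 1 and 4 share no common factor and so are uncorrelated. Factor analysis methods work in the opposite direction: they start from the observed correlations and try to recover loadings like these.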
Exploratory Factor Analysis involves attempting to find the model which best fits the available data, without the influence of any prior theories. Because there are infinitely many solutions which qualify as "best," separated only by ORTHOGONAL TRANSFORMATION or ROTATION, the results given by a single solution may not be interpretable, and often will look useless. The goal of rotation is to find a solution that can be interpreted. Software packages exist which will do all of this for you, but be warned: the most commonly coded method for obtaining a rotated solution has also been shown repeatedly to be highly dubious, and it survives due to a form of academic inertia. The best rotations are not orthogonal, but OBLIQUE, as they don't assume that factors are uncorrelated. Without getting too much further into material which can't be adequately treated in this short paper, suffice it to say that if your computer program offers a choice between orthogonal and oblique rotation (Varimax is the most common orthogonal rotation, Direct Quartimin a frequently used oblique one), pick oblique.

Aside from the question of rotation, exploratory methods also require the researcher to pick how many factors to keep. In general, adding more factors will make the model fit better, but will also make it harder to interpret and generally less useful. A number of rules of thumb exist and are even programmed into software packages, but consider how smart your average thumb is. The best advice is to use several criteria for picking the number of factors to keep, and analyze the problem in terms of the situation at hand. The final criterion is to look at the final, rotated solution and see if it can be interpreted sensibly.

Confirmatory Factor Analysis starts with some theory about how the factors should relate to the manifest variables, such as grouping all math tests under one factor and all language tests under another factor.
By doing this, it avoids both the need for rotation and the need to pick the number of factors after seeing the data. Extremely impractical to perform by hand, it can be done fairly easily by computers now, with the right software (SYSTAT for Windows 95 has a program in it called RAMONA which performs some confirmatory factor analysis). These methods, while not giving fits as good as exploratory factor analysis, yield statistical measures of how well the model fits. To use an analogy, exploratory factor analysis is like getting the mean very accurately, while confirmatory factor analysis is like getting the mean with slightly less accuracy, but also knowing the standard deviation. One should always devise models before running confirmatory factor analysis, since using the data to drive a confirmatory factor analysis (i.e. changing your model after seeing that it doesn't fit well) is cheating.

Finally, it should be noted that there is a method known as PRINCIPAL COMPONENTS ANALYSIS which shares many of the mathematical methods of factor analysis, but which should not be mistaken for factor analysis. Principal components analysis allows "unique factors" to be correlated, which means they're no longer actually unique factors, and the factor loadings obtained from the analysis will not explain the correlations between the manifest variables as accurately. Principal components analysis is a tool for reducing a large set of data to a few manageable numbers, not for explaining relations. Be very cautious when you see principal components analysis presented as being factor analysis: while it often gives good results, the fact that it violates the underlying assumptions of the factor analysis model is good reason to be careful.

Endnotes: Here's some elaboration on a few of the capitalized terms which weren't defined in the text above, and might not be clear from context to those who haven't studied statistics and linear algebra.
Let me know if you think this section should be added to.

Correlation: Suppose you have results of two tests on your class. The stronger the correlation between the two tests, the more accurately you can predict a student's score on test A based on the score on test B. Correlations are bounded between -1 and +1, with 0 meaning there's no linear relationship at all between the two sets of data. When two things are highly correlated (close to either +1 or -1), this doesn't necessarily mean that one causes the other, but it suggests that they at least have a common cause. Factors are a way of representing these common causes.

Orthogonal Transformation: In two dimensions, this just means rotating your axes. The idea is that if you graphed all your data, you hope it will cluster in such a way that you can rotate the axes and have each clump on an axis. For example, if you take students' results on their midterm as the x-axis and results on the final as the y-axis, you might find that most of the points lie on one line. You can rotate your axes so that one axis is along that line, and name the axis "Physics Ability" or something else appropriate (or even "ability to take tests well").

Oblique Transformation: When you get up to higher dimensions (more tests), you might get tight clusters of data which lie on lines that aren't orthogonal to each other. Trying to simply rotate axes will result in not catching all of these clusters very well. So you relax the requirement that the axes be orthogonal, and just move them to where they fit the data best. Since the axes represent factors, this means that you have factors which are correlated (the projection of one onto another is not zero), implying a deeper factor underlying the current set.

Dave Van Domelen
Physics Education Research Group
The Ohio State University
dvandom@pacific.mps.ohio-state.edu