Updated: Nov 15
In the past, personality tests were taken with pen and paper. The order of items in a test, although significant, could only be considered during test design. Researchers placed items in the order they felt was optimal, but once set, that order was not easy to change. But what if the order in which we present questions alters the validity of a test? With the help of computerized testing, we have taken a closer look at this question.
We wanted to find out whether the performance of a test is affected by the order in which questions are presented to participants. To do that, we used the Big 5 SAPA Personality Test by David M. Condon: an open-source version of the famous Big 5 personality test. The Big 5 measures five personality traits:
Openness (behavior related to how we seek diversity in experience and how we challenge our thinking)
Conscientiousness (behavior reflecting how much we invest ourselves in tasks we do)
Extraversion (behavior reflecting reactions to stimulation: extraverts need more of it and introverts need less)
Agreeableness (behavior related to getting along with people)
Neuroticism (behavior related to sensitivity and negative emotionality)
We picked a middle-length version of the test containing 81 personality questions. We applied this test in three different formats:
Fixed: the questions were presented in a fixed order for all study participants (which is the traditional presentation of the test).
Organized: the questions were grouped by which of the five Big 5 traits they belong to, but otherwise were randomized. In other words, all agreeableness questions were presented one after the other (in one group), all openness questions were presented one after the other (in a different group), and so on. That way questions relating to the same Big 5 factor were all grouped. However, the order of the groups and the order of each question within a group were randomized.
Randomized: each study participant received the questions in a random order (different for every participant). We figured this randomized order could serve as a point of reference against which to compare the other question orders.
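The three formats above can be sketched in a few lines of Python. This is only an illustration of the ordering logic as described, not the software used in the study: the function name `order_questions`, the `ITEMS` list, and the question ids are all hypothetical, and the demo uses two questions per trait rather than the real 81.

```python
import random

# Hypothetical mini item bank: (question_id, trait) pairs in the "fixed" order.
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]
ITEMS = [(f"q{i}", TRAITS[i % 5]) for i in range(10)]

def order_questions(items, fmt, rng):
    """Return the items in the order a participant would see them."""
    if fmt == "fixed":
        # Same order for everyone.
        return list(items)
    if fmt == "randomized":
        # A fresh shuffle of all questions for each participant.
        shuffled = list(items)
        rng.shuffle(shuffled)
        return shuffled
    if fmt == "organized":
        # Group by trait, then shuffle both the group order and
        # the question order inside each group.
        groups = {}
        for item in items:
            groups.setdefault(item[1], []).append(item)
        trait_order = list(groups)
        rng.shuffle(trait_order)
        ordered = []
        for trait in trait_order:
            block = groups[trait]
            rng.shuffle(block)
            ordered.extend(block)
        return ordered
    raise ValueError(f"unknown format: {fmt}")

if __name__ == "__main__":
    rng = random.Random()
    for fmt in ("fixed", "organized", "randomized"):
        print(fmt, [qid for qid, _ in order_questions(ITEMS, fmt, rng)])
```

Passing in the random generator makes it easy to give each participant their own order while still being able to reproduce a given order from a seed.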
To check the test's validity, we gave participants a short additional questionnaire recording ten basic "outcomes" related to their life situation. We asked:
how satisfied they are with their life in general
how satisfied they are with their romantic life
to what extent they identify as left vs. right politically
how religious they are
how many times a week they got drunk
how many times a week they showered
how high they believed their IQ score percentile is
how many times they donated to charity last year
how healthy they feel they are
how free they feel
We then investigated the extent to which we could predict these ten outcomes from participants' 5 personality scores (for each of the 3 different test orders). There were 703 participants in the study: 240 of them responded to the test with the fixed question order, 232 with the organized question order, and 231 with the randomized question order. All participants were based in the U.S. and were recruited using Positly.com.
Subscale Internal Consistency
We also checked the internal consistency of the test under each of the 3 question orders (by calculating Cronbach's alpha for each of the 5 subscales and then averaging those 5 numbers to produce an average internal consistency measure). Cronbach's alpha is a measure of a test's reliability that indicates how internally consistent the test is. The mean internal consistency (alpha) was 0.91 for the fixed test, 0.94 for the organized test, and 0.93 for the randomized test. The table below shows the alphas for each of the 5 factors for each question order.
Cronbach Alpha for each Big 5 subscale (for each question order)
The reliability of the organized test (0.94) and the randomized test (0.93) appeared to be higher than the reliability of the fixed test (0.91). The p-value for the comparison of Cronbach's alphas was below 0.01. The table above shows Cronbach's alphas for all tests as well as Cronbach's alpha for each of the five dimensions of the tests.
These results require further research: although we can say that the organized version of the test had higher internal consistency, we cannot conclude that this is the optimal approach to test administration. It is possible that presenting questions of one type together, as was done in the organized test, artificially inflates correlations (and hence alpha), since participants giving each answer may be considering the answers they have already given to the other questions in the group. This could potentially explain why the organized test has the highest alpha, though it does not explain why the randomized test has such a high alpha.
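For readers who want to reproduce the averaged alpha described above, here is a minimal sketch using only Python's standard library. It assumes each subscale's responses are stored as one list of item scores per participant, with reverse-keyed items already re-scored; the function names and the toy data are illustrative and are not taken from the study materials.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """Cronbach's alpha for one subscale.

    rows: one list of item scores per participant (same items, same order).
    """
    k = len(rows[0])  # number of items in the subscale
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])  # variance of sum scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def mean_alpha(subscales):
    """Average alpha over subscales; subscales maps trait name -> rows."""
    return mean(cronbach_alpha(rows) for rows in subscales.values())

if __name__ == "__main__":
    # Toy data: 3 participants, 2 items per subscale (hypothetical values).
    toy = {
        "extraversion": [[1, 2], [2, 1], [3, 3]],
        "neuroticism": [[1, 1], [2, 2], [3, 3]],
    }
    for trait, rows in toy.items():
        print(trait, round(cronbach_alpha(rows), 3))
    print("mean alpha", round(mean_alpha(toy), 3))
```

Because alpha rises whenever items within a subscale correlate more strongly, anything that pushes answers to neighboring items closer together will raise it, which is the inflation concern raised above for the organized format.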
It was important to establish whether the mean result on each of the 5 personality scales differed across the three test types. We ran an F-test for each of the subscales to see whether the diagnostic results of the tests depend on how the questions were organized. The results of the tests themselves, that is, the mean scores participants obtained on the five subscales, didn't differ significantly across the three test types (extraversion: F=2.42, p-value=0.09, neuroticism: F=0.28, p-value=0.75, conscientiousness: F=0.88, p-value=0.42, openness: F=0.48, p-value=0.62, agreeableness: F=0.32, p-value=0.73).
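An F-test comparing one subscale's mean across the three formats can be sketched as a one-way ANOVA. The version below assumes that this is the kind of F-test meant here (the article does not spell out the exact procedure), uses only the standard library, and returns the F statistic with its degrees of freedom. The function name and the toy scores are hypothetical.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic.

    groups: one list of subscale scores per test format.
    Returns (F, df_between, df_within).
    """
    k = len(groups)                      # number of formats
    n = sum(len(g) for g in groups)      # total participants
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

if __name__ == "__main__":
    # Toy extraversion scores for fixed, organized, randomized (made up).
    fixed = [1, 2, 3]
    organized = [2, 3, 4]
    randomized = [3, 4, 5]
    print(one_way_f([fixed, organized, randomized]))
```

With three formats and 703 participants, and assuming every participant contributes a score, the degrees of freedom would be 2 and 700. Turning F into a p-value requires the F distribution, which the standard library does not provide; SciPy's `scipy.stats.f_oneway` computes both in one call.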
We also wanted to check how effectively the 10 outcomes we measured can be predicted using each version of the test (a measure of external validity). By outcomes we mean ten important facts from the lives of our participants: how happy they are with their lives, how they assess their health, what political views they hold, whether they are religious, how free they feel, what IQ they think they have compared to other people, how often they shower, how often they drink, how often they donate to charity, and how happy they are with their romantic lives.
We wanted to see whether the scores from the three differently ordered tests predict these outcomes with more or less accuracy. To measure prediction accuracy, we ran a linear regression for each outcome.
Surprisingly, 8 out of 10 life outcomes were predicted better by the randomized test than by the fixed version, and 9 out of 10 were predicted better by the organized test than by the fixed version. Meanwhile, 6 out of 10 life outcomes were predicted better by the randomized test than by the organized version. See the table below, which contains all prediction accuracies.
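As a rough sketch of what one such regression involves, the code below fits an ordinary least squares model with an intercept and returns the multiple correlation R between fitted and observed values. It assumes an in-sample R computed from all trait scores at once, which may differ from the exact procedure used for the table; the function names and the toy numbers are hypothetical. In practice a library such as NumPy or statsmodels would do the fitting; plain Python is used here to keep the example self-contained.

```python
from math import sqrt

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x

def multiple_r(predictors, outcome):
    """In-sample multiple R of an OLS fit with intercept.

    predictors: one list of trait scores per participant.
    outcome: one outcome value per participant.
    """
    X = [[1.0] + list(row) for row in predictors]  # add intercept column
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, outcome)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    y_bar = sum(outcome) / len(outcome)
    ss_res = sum((y - f) ** 2 for y, f in zip(outcome, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in outcome)
    return sqrt(max(0.0, 1.0 - ss_res / ss_tot))

if __name__ == "__main__":
    # Toy example: one trait score predicting one outcome (made-up numbers).
    print(multiple_r([[1], [2], [3], [4]], [1, 3, 2, 4]))
```

Running this once per outcome and per question order, with each participant's five trait scores as the predictor row, would give a grid of R values comparable in shape to the table of prediction accuracies.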
Linear Regression R Scores that result from using all Big 5 subscales at once to predict each outcome (for each of the three question orders)
Hence we see that in this study the fixed test order had both the lowest internal consistency and the lowest predictive accuracy when compared to the organized and randomized order tests. This is surprising because fixed order tests are the standard way many assessments are administered.
One thing to note, however, is that a randomized test has an inherent disadvantage if the goal is to compare the answers of test takers to each other: each test taker has a somewhat different experience of the test, rather than the same experience as everyone else (as would be the case in a fixed order test).
It would require further investigation to see how well each of these types of tests would predict different aspects of life. The different ways of administering the test did not produce statistically significant differences in mean scores on the 5 subscales, but they may have produced differences in internal consistency and external validity. The results suggest that it may be possible to improve upon the standard method of administering this test.
Below are links to all of the materials for this study. If you run any further analysis on our data and reach some conclusions, please let us know!