Gender Differences or Gender Bias?

Author(s):  
Rachel A. Plouffe ◽  
Christopher Marcin Kowalski ◽  
Paul F. Tremblay ◽  
Donald H. Saklofske ◽  
Radosław Rogoza ◽  
...  

Abstract. Sadism, defined as the infliction of pain and suffering on others for pleasure or subjugation, has recently garnered substantial attention in the psychological research literature. The Assessment of Sadistic Personality (ASP) was developed to measure levels of everyday sadism and has been shown to possess excellent reliability and validity using classical test theory methods. However, it is not known how well ASP items discriminate between respondents of different trait levels, or which Likert categories are endorsed by persons of various trait levels. Additionally, individual items should be evaluated to ensure that men and women with similar levels of sadism have an equal probability of endorsing a response. The purpose of this research was to apply item response theory (IRT) and differential item functioning (DIF) analyses to investigate the item properties of the ASP across its English, Polish, and Italian versions. Overall, the results of the IRT analysis showed that, with the exception of Item 9, the ASP demonstrated sound item properties. The DIF analyses identified two items in each language version that showed DIF of practical significance across gender. Implications of these results are discussed.
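
For orientation, the kind of IRT model typically applied to Likert-type items such as the ASP's is a graded response model, in which an item's discrimination and thresholds jointly determine which category a respondent at a given trait level is most likely to endorse. The abstract does not report the ASP's estimates or name the exact model, so the base-R sketch below uses made-up parameters for a hypothetical five-category item; it is an illustration of the idea, not the authors' analysis.

```r
## Hedged sketch: graded response model category probabilities for one
## hypothetical Likert item (illustrative a and b values, not ASP estimates).
grm_category_probs <- function(theta, a, b) {
  # b: ordered thresholds between adjacent categories (length K - 1)
  K <- length(b) + 1
  p_ge <- cbind(1, plogis(a * outer(theta, b, "-")), 0)  # P(X >= k), k = 1..K+1
  p_ge[, 1:K] - p_ge[, 2:(K + 1)]                        # P(X = k), k = 1..K
}

theta <- seq(-3, 3, by = 0.5)
probs <- grm_category_probs(theta, a = 1.8, b = c(-0.5, 0.8, 1.9, 2.7))
round(cbind(theta, probs), 2)   # most likely response category at each trait level
```

Higher discrimination makes the category curves peak more sharply, which is what allows an item to separate respondents of different trait levels.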

2021 ◽  
Vol 104 (3) ◽  
pp. 003685042110283
Author(s):  
Meltem Yurtcu ◽  
Hülya Kelecioglu ◽  
Edward L Boone

Bayesian nonparametric (BNP) modelling can be used to obtain more detailed information in test equating studies and to increase the accuracy of equating by accounting for covariates. In this study, two covariates, one continuous and one discrete, were included in the equating under the BNP model. Equated scores were obtained with this model for a single-group design with a small sample. These equated scores were compared with those from the mean and linear equating methods of classical test theory. Of the three methods, the BNP model produced a score distribution closest to that of the target test. Obtaining accurate, low-error equating results even with small samples is what makes such equating studies valuable. Including covariates in classical test equating rests on assumptions that often cannot be met, especially with small groups; the BNP model is not subject to this limitation and is therefore more beneficial than frequentist methods. Because the BNP model yields distributions of equated scores conditional on booklet and covariate information, it also makes it possible to compare sub-categories, which can indicate the presence of differential item functioning (DIF). The BNP model can therefore be used actively in test equating studies while simultaneously providing an opportunity to examine the characteristics of individual participants. Thus, it allows test equating even in a small sample and yields values closer to the scores on the target test.
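
For reference, the two classical test theory baselines named above are simple moment-matching transformations. The base-R sketch below shows mean and linear equating for a single-group design using hypothetical form-X and form-Y scores (not the study's data); the BNP model itself is not reproduced here.

```r
## Classical baselines only: mean and linear equating of form X onto form Y.
mean_equate <- function(x, mu_x, mu_y) x - mu_x + mu_y

linear_equate <- function(x, mu_x, mu_y, sd_x, sd_y) {
  mu_y + (sd_y / sd_x) * (x - mu_x)   # match the mean and SD of the target form Y
}

set.seed(1)
x <- round(rnorm(50, mean = 24, sd = 5))   # hypothetical form-X scores, small single group
y <- round(rnorm(50, mean = 27, sd = 6))   # the same examinees' form-Y (target) scores

cbind(x         = head(x),
      mean_eq   = head(mean_equate(x, mean(x), mean(y))),
      linear_eq = round(head(linear_equate(x, mean(x), mean(y), sd(x), sd(y))), 1))
```

Mean equating shifts only the location of the score scale, while linear equating also matches the spread, which is why the two baselines can diverge noticeably in small samples.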


The purpose of this study was to examine differences in the sensitivity of three methods, IRT-Likelihood Ratio (IRT-LR), Mantel-Haenszel (MH), and Logistic Regression (LR), for detecting gender differential item functioning (DIF) on the National Mathematics Examination (Ujian Nasional: UN) for the 2014/2015 academic year in North Sumatera Province, Indonesia. A DIF item is unfair: it advantages test takers from one group and disadvantages those from another group even when they have the same ability. The presence of DIF was examined for grouping by gender, with men as the reference group (R) and women as the focal group (F). The study used a 3×1 experimental design with one factor (method) and three treatments, namely the three DIF detection methods. There were five packages of the 2015 UN Mathematics test (codes 1107, 2207, 3307, 4407, and 5507). Package 2207 was taken as the sample data, comprising 5,000 participants (3,067 women, 1,933 men) responding to 40 UN items. Item selection based on classical test theory (CTT) retained 32 of the 40 UN items, and selection based on item response theory (IRT) retained 18 items. Using R 3.3.3 and IRTLRDIF 2.0, 5 items were flagged as DIF by the IRT-Likelihood Ratio method (IRT-LR), 4 items by the Logistic Regression method (LR), and 3 items by the Mantel-Haenszel method (MH). A single round of DIF detection is not sufficient to test the sensitivity of the three methods, so six analysis groups were formed: (4400, 40), (4400, 32), (4400, 18), (3000, 40), (3000, 32), and (3000, 18); 40 random data sets (without repetition) were generated within each group, and DIF detection was conducted on the items of each data set. Although the model fit to the data was not ideal, the three-parameter logistic model (3PL) was chosen as the most suitable model. Tukey's HSD post hoc test showed that the IRT-LR method was more sensitive than the MH and LR methods in the (4400, 40) and (3000, 40) groups. IRT-LR was no longer more sensitive than LR in the (4400, 32) and (3000, 32) groups, but it remained more sensitive than MH. In the (4400, 18) and (3000, 18) groups, the IRT-LR method was more sensitive than LR, but not significantly more sensitive than MH. The LR method was consistently more sensitive than the MH method across all analysis groups.
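
Two of the three detection methods compared here have compact textbook forms. The base-R sketch below uses simulated responses (not the UN data or the authors' code) to show the Mantel-Haenszel test on score-stratified 2×2 tables and the nested logistic-regression comparison; the IRT-LR approach, which refits the 3PL model with and without equality constraints on the studied item's parameters, is not reproduced.

```r
## Hedged illustration of MH and LR DIF detection on one simulated item.
set.seed(2207)
n      <- 1000
group  <- rbinom(n, 1, 0.6)                       # 1 = focal group (women), 0 = reference (men)
theta  <- rnorm(n)                                # latent ability
item   <- rbinom(n, 1, plogis(1.2 * (theta - 0.1) - 0.5 * group))  # item with uniform DIF
total  <- rbinom(n, 39, plogis(theta)) + item     # crude matching variable (total score)

# Mantel-Haenszel: item-by-group 2x2 tables stratified by banded total score
stratum <- cut(total, breaks = 5)
mantelhaen.test(table(item, group, stratum))

# Logistic regression: does group (uniform DIF) or the group-by-score
# interaction (non-uniform DIF) explain responses beyond the matching score?
m0 <- glm(item ~ total, family = binomial)
m1 <- glm(item ~ total + group + total:group, family = binomial)
anova(m0, m1, test = "Chisq")
```

Repeating such a run over many resampled data sets, as the study does across its six analysis groups, is what allows the flagging rates of the methods to be compared statistically.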


2020 ◽  
Vol 64 (3) ◽  
pp. 219-237
Author(s):  
Brandon LeBeau ◽  
Susan G. Assouline ◽  
Duhita Mahatmya ◽  
Ann Lupkowski-Shoplik

This study investigated the application of item response theory (IRT) to expand the range of ability estimates for gifted (hereinafter referred to as high-achieving) students' performance on an above-level test. Using a sample of fourth- to sixth-grade high-achieving students (N = 1,893), we conducted a study to compare estimates from two measurement theories, classical test theory (CTT) and IRT. CTT and IRT make different assumptions about the analysis that impact the reliability and validity of the scores obtained from the test. IRT can also differentiate students across grades or within a grade by using the unique string of correct and incorrect answers each student produces while taking the test. This differentiation may have implications for identifying or classifying students who are ready for advanced coursework. We explore this differentiation for the Math, Reading, and Science tests, along with the impact the two measurement frameworks can have on the classification of students. Implications for academic talent identification with the talent search model and for the development of academic talent are discussed.
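
To illustrate the difference the abstract points to: under an IRT model, the ability estimate depends on which items were answered correctly, not only on how many. The base-R sketch below uses hypothetical 2PL item parameters (not the above-level test's) to compute maximum-likelihood theta estimates for two response patterns that share the same raw score.

```r
## Hedged sketch: maximum-likelihood ability estimation under a 2PL model
## with illustrative item parameters.
theta_mle <- function(resp, a, b) {
  loglik <- function(theta) {
    p <- plogis(a * (theta - b))
    sum(resp * log(p) + (1 - resp) * log(1 - p))
  }
  optimize(loglik, interval = c(-4, 4), maximum = TRUE)$maximum
}

a <- c(0.8, 1.0, 1.4, 1.8, 2.2)        # discriminations
b <- c(-1.5, -0.5, 0.0, 0.8, 1.6)      # difficulties (hardest items last)
theta_mle(c(1, 1, 1, 0, 0), a, b)      # raw score 3, easy items correct
theta_mle(c(0, 0, 1, 1, 1), a, b)      # raw score 3, hard items correct -> higher estimate
```

CTT would assign both students the same score of 3, which is the distinction that matters when classifying students for advanced coursework.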


Author(s):  
Ansgar Opitz ◽  
Moritz Heene ◽  
Frank Fischer

Abstract. A significant problem that assessments of scientific reasoning face at the level of higher education is the question of domain generality, that is, whether a test will produce biased results for students from different domains. This study applied three recently developed methods of analyzing differential item functioning (DIF) to evaluate the domain-generality assumption of a common scientific reasoning test. Additionally, we evaluated the usefulness of these new tree- and lasso-based methods for analyzing DIF and compared them with methods based on classical test theory. We gave the scientific reasoning test to 507 university students majoring in physics, biology, or medicine. All three DIF analysis methods indicated a domain bias present in about one-third of the items, mostly benefiting biology students. We did not find this bias using methods based on classical test theory; those methods indicated instead that all items were easier for physics students than for biology students. Thus, the tree- and lasso-based methods provide clear added value for test evaluation. Taken together, our analyses indicate that the scientific reasoning test is neither entirely domain-general nor entirely domain-specific. We advise against using it in high-stakes situations involving domain comparisons.
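
The contrast the abstract draws can be sketched with simulated data: a raw classical-test-theory difficulty comparison confounds group ability with item bias, whereas a DIF check conditions on overall performance. The R sketch below is only that generic contrast; the tree- and lasso-based procedures actually used in the study (which search over groups and penalize item-by-group effects) are not reproduced, and all variable names and data are hypothetical.

```r
## Hedged, simulated contrast between a CTT difficulty comparison and an
## ability-matched DIF check (not the scientific reasoning test data).
set.seed(42)
n      <- 500
domain <- factor(sample(c("physics", "biology", "medicine"), n, replace = TRUE))
theta  <- rnorm(n) + 0.4 * (domain == "physics")   # physics students stronger overall
item   <- rbinom(n, 1, plogis(theta + 0.6 * (domain == "biology")))  # item favors biology
total  <- rbinom(n, 20, plogis(theta)) + item      # overall test performance

# CTT view: proportion correct by domain (mixes ability differences with bias)
tapply(item, domain, mean)

# DIF view: does domain still matter once overall performance is matched?
anova(glm(item ~ total,          family = binomial),
      glm(item ~ total + domain, family = binomial), test = "Chisq")
```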


2011 ◽  
Vol 42 (1) ◽  
pp. 61-71 ◽  
Author(s):  
K. Klapheck ◽  
S. Nordmeyer ◽  
H. Cronjäger ◽  
D. Naber ◽  
T. Bock

Background: Clinical research on subjective determinants of recovery and health has increased, but no instrument has been developed to assess the subjective experience and meaning of psychoses. We have therefore constructed and validated the Subjective Sense in Psychosis Questionnaire (SUSE) to measure sense making in psychotic disorders. Method: SUSE was based on an item pool generated by professionals and patients. For pre-testing, 90 psychosis patients completed the instrument. Psychometric properties were assessed using methods of classical test theory. In the main study, SUSE was administered to a representative sample of 400 patients. Factor structure, reliability and validity were assessed, and confirmatory factor analyses (CFAs) were used to test subscale coherence and the adequacy of the hypothesized factor structure. Response effects due to clinical settings were tested using multilevel analyses. Results: The final version of SUSE comprises 34 items measuring distinct aspects of the experience and meaning of psychoses in a consistent overall model with six coherent subscales representing positive and negative meanings throughout the course of psychotic disorders. Multilevel analyses indicate independence from clinical context effects. Patients relating psychotic experiences to life events assessed their symptoms and prospects more positively. Of the patients, 76% assumed a relationship between their biography and the emergence of psychosis, 42% reported positive experiences of symptoms, and 74% ascribed positive consequences to their psychosis. Conclusions: SUSE has good psychometric qualities and offers an empirical approach to the subjective assessment of psychosis. The results highlight the significance of subjective meaning making in psychoses and support a more biographical and in-depth psychological orientation for treatment.
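
The abstract does not list SUSE's items or the names of its six subscales, so the confirmatory factor analysis below is only a hypothetical two-factor stand-in on simulated data, showing the kind of lavaan-style CFA workflow such a validation typically involves; it is not the authors' model, and the loadings and item names are invented.

```r
## Hedged CFA sketch (hypothetical structure, simulated data), assuming the
## lavaan package is available.
library(lavaan)

pop_model <- '
  positive_meaning =~ 0.7*p1 + 0.7*p2 + 0.6*p3 + 0.6*p4
  negative_meaning =~ 0.7*n1 + 0.6*n2 + 0.6*n3 + 0.5*n4
'
sim <- simulateData(pop_model, sample.nobs = 400)   # mirrors the N = 400 main sample size

fit_model <- '
  positive_meaning =~ p1 + p2 + p3 + p4
  negative_meaning =~ n1 + n2 + n3 + n4
'
fit <- cfa(fit_model, data = sim)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```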


Author(s):  
Stephanie J. Slater

The Test Of Astronomy STandards (TOAST) is a comprehensive assessment instrument designed to measure students' general astronomy content knowledge. Built upon the research embedded within a generation of astronomy assessments designed to measure single concepts, the TOAST is appropriate for measuring across an entire astronomy course. The TOAST's scientific content represents a consensus of expert opinion about what students should know from three different groups: the American Association for the Advancement of Science, the National Research Council, and the American Astronomical Society. The TOAST's reliability and validity are established by results from Cronbach's alpha and classical test theory analyses, a review for construct validity, testing for sensitivity to instruction, and numerous rounds of expert review. As such, the TOAST can be considered a valuable tool for classroom instructors and discipline-based education researchers in astronomy across a variety of learning environments.
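
Cronbach's alpha, the internal-consistency index cited above, reduces to a ratio of summed item variances to total-score variance. The base-R sketch below applies that formula to simulated 0/1 item scores (not TOAST data) purely to show the computation.

```r
## Hedged sketch: Cronbach's alpha on a simulated respondents-by-items matrix.
cronbach_alpha <- function(scores) {
  k <- ncol(scores)
  (k / (k - 1)) * (1 - sum(apply(scores, 2, var)) / var(rowSums(scores)))
}

set.seed(7)
theta  <- rnorm(200)                                  # hypothetical examinee ability
scores <- sapply(seq(-1.5, 1.5, length.out = 25),     # 25 items of varying difficulty
                 function(b) rbinom(200, 1, plogis(theta - b)))
cronbach_alpha(scores)
```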


Methodology ◽  
2011 ◽  
Vol 7 (3) ◽  
pp. 103-110 ◽  
Author(s):  
José Muñiz ◽  
Fernando Menéndez

The current availability of computers has led to the use of a new series of response formats that are an alternative to the classical dichotomous format, and to the recovery of other formats, such as the answer-until-correct (AUC) format, whose efficient administration requires this kind of technology. The goal of the present study was to determine whether the use of the AUC format improves test reliability and validity in comparison to the classical dichotomous format. Three samples of 174, 431, and 1,446 Spanish students from secondary education, professional training, and high school, aged between 13 and 20 years, were used. A 100-item test and a 25-item test assessing knowledge of Universal History were used, both administered over the Internet with the AUC format. There were 56 experimental conditions, resulting from the manipulation of eight scoring models and seven test lengths. The data were analyzed from the perspective of classical test theory and also with item response theory (IRT) models. Reliability and construct validity, analyzed from the classical perspective, did not seem to improve significantly when using the AUC format; however, when reliability was assessed with the information function obtained from IRT models, the advantages of the AUC format over the dichotomous format became clear. For low levels of the assessed trait, scores obtained with the AUC format provide more information than scores obtained with the dichotomous format. Lastly, these results are discussed, and the possibilities and limits of the AUC format in highly computerized psychological and educational contexts are analyzed.
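
The information-function comparison described above rests on the standard IRT identity that a dichotomous 2PL item contributes information a²P(θ)(1 − P(θ)) at trait level θ. The base-R sketch below uses illustrative parameters (not the History test's, and without the AUC scoring models) to show how item information is summed into a test information curve.

```r
## Hedged sketch: 2PL item and test information with illustrative parameters.
item_information <- function(theta, a, b) {
  p <- plogis(a * (theta - b))      # 2PL probability of a correct response
  a^2 * p * (1 - p)                 # Fisher information contributed by the item
}

theta <- seq(-3, 3, by = 1)
a <- c(1.0, 1.5, 2.0)               # illustrative discriminations
b <- c(-1.0, 0.0, 1.0)              # illustrative difficulties
test_info <- rowSums(mapply(function(ai, bi) item_information(theta, ai, bi), a, b))
round(cbind(theta, test_info), 2)   # where on the trait scale scores are most precise
```

Comparing such curves for AUC-scored and dichotomously scored versions of the same items is what reveals the AUC format's advantage at low trait levels.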


Author(s):  
David L. Streiner

This chapter discusses the two major theories underlying scale development: classical test theory, which has dominated the field for the past century, and item response theory, which is more recent. It begins by summarizing the history of measurement, first of physical and physiological parameters and later of intelligence. This is followed by the steps involved in developing a scale: creating the items, determining if they fully span the construct of interest while at the same time not including irrelevant content, and assessing the usability of the items (whether they are understood correctly, whether they are free of jargon, if they avoid negatively worded phrases, etc.). The chapter then describes how to establish the reliability and validity of the scale—what are called the psychometric properties of the scale. It concludes by discussing some of the shortcomings with classical test theory, how item response theory attempts to address them, and the degree to which it has been successful in this regard. This chapter should be useful for those who need to evaluate existing scales as well as for those wanting to develop new scales.

