Assessment of Item and Test parameters: Cosine Similarity Approach

The paper proposes new measures of difficulty and discriminating values of binary items and test consisting of such items and find their relationships including estimation of test error variance and thereby the test reliability, as per definition using cosine similarities. The measures use entire data. Difficulty value of test and item is defined as function of cosine of the angle between the observed score vector and the maximum possible score vector. Discriminating value of test and an item are taken as coefficient of variation (CV) of test score and item score respectively. Each ranges between 0 and 1 like difficulty value of test and an item. With increase in number of correct answer to an item, item difficulty curve increases and item discriminating curve decreases. The point of intersection of the two curves can be used for item deletion along with other criteria. Cronbach alpha was expressed and computed in terms of discriminating value of test and item. Relationship derived between test discriminating value and test reliability as per theoretical definition. Empirical verifications of proposed measures were undertaken. Future studies suggested.re to enter text.

Download Full-text

Improved Quality: Item and Test parameters

Health Sciences ◽

10.15342/hs.2020.267 ◽

2021 ◽

Vol 2020 ◽

Author(s):

Satyendra Nath Chakrabartty

Keyword(s):

Item Difficulty ◽

Test Reliability ◽

Data Driven ◽

Statistical Hypothesis ◽

Quality Item ◽

Test Parameters ◽

Type Test ◽

Angular Similarity ◽

Item Reliability

Introduction: Quality of a MCQ type test depends on qualities of the constituent items, assessed in terms of item reliability, item difficulty value, item discriminating value, etc. However, quality of a test involving reliability, validity, difficulty and discriminating values of the test etc. requires new approaches. Need is felt to find difficulty and discriminating values of an item and test using entire data and to derive relationships amongst them including relationship with test reliability to see impact of item deletion. Methods: Using angular similarity approach, measures proposed for item difficulty and item discriminating value, difficulty and discriminating value of test. Relationship derived between (i) difficulty value and discriminating value of item; (ii) difficulty value and discriminating value of a test (iii) test discriminating value and test reliability as per theoretical definition. Cronbach alpha was expressed using sum of item difficulty values and test discriminating value Results and Discussion: Each proposed measure ranges between 0 to 1. Discriminating value of test and item as coefficient of variation satisfy desired properties and facilitates population estimations. Intersection of item difficulty and item discriminating curves provides a data driven criterion for item deletion, impact of which on test reliability may be checked. In addition, the proposed measures facilitate testing of statistical hypothesis of departure of test reliability from unity, confidence interval of reliability, etc. Future problems suggested.

Download Full-text

Interpretation of the Standard Error of Measurement when True Scores and Error Scores on Mental Tests are Not Independent

Psychological Reports ◽

10.2466/pr0.1966.19.2.611 ◽

1966 ◽

Vol 19 (2) ◽

pp. 611-617 ◽

Cited By ~ 1

Author(s):

Donald W. Zimmerman ◽

Richard H. Williams

Keyword(s):

Standard Deviation ◽

Standard Error ◽

Test Score ◽

Test Reliability ◽

True Score ◽

Observed Score ◽

Standard Error Of Measurement ◽

True Scores ◽

Modified Equation ◽

Error Of Measurement

It is shown that for the case of non-independence of true scores and error scores interpretation of the standard error of measurement is modified in two ways. First, the standard deviation of the distribution of error scores is given by a modified equation. Second, the confidence interval for true score varies with the individual's observed score. It is shown that the equation, so=√[(N−O/a]+[so2(roō−roo)/roō]̄, where N is the number of items, O is the individual's observed score, a is the number of choices per item, so2 is observed variance, roo is test reliability as empirically determined, and roō is reliability for the case where only non-independent error is present, provides a more accurate interpretation of the test score of an individual.

Download Full-text

Validity and Reliability of the Sport Readiness Questionnaire Focused on Musculoskeletal Injuries

Asian Journal of Sports Medicine ◽

10.5812/asjsm.116188 ◽

2021 ◽

Vol In Press (In Press) ◽

Author(s):

Gianfranco Sganzerla ◽

Christianne de Faria Coelho Ravagnani ◽

Silvio Assis de Oliveira-Junior ◽

Fabricio Cesar de Paula Ravagnani

Keyword(s):

Internal Consistency ◽

Affirmative Answer ◽

Musculoskeletal Injuries ◽

Test Reliability ◽

Positive Finding ◽

Validity And Reliability ◽

Kappa Index ◽

Medical Settings ◽

Future Studies ◽

Sensitivity Specificity

Background: The pre-participation physical evaluation (PPE), which includes a musculoskeletal system evaluation, identifies factors that may be a risk for athletes while practicing sport. Thus, the Sport Readiness Questionnaire, focused on musculoskeletal injuries (MIR-Q) was developed to screen athletes at risk of future injuries or worsening pre-existing injuries during training or competition. However, the criterion-related validity and reliability of the MIR-Q have not yet been analyzed. Objectives: To test the criterion-related validity and reliability (internal consistency and test-retest) of the MIR-Q. Methods: One hundred and twenty adult athletes from different sports (17 women) completed the MIR-Q and underwent a physical orthopedic examination (POE) performed by an orthopedic physician. At least one affirmative answer on the MIR-Q, as well as one positive finding on the POE, was considered “a risk factor for sport injury”. The validity was assessed from sensitivity, specificity, and accuracy measurements. Internal consistency was obtained through the KR-20 test. Reliability was measured using the test-retest method in a 7-14-day interval with a sub-sample (n = 41) and verified by the Kappa index. Results: Eighty-one (67.5%) questionnaires contained positive responses. The sensitivity of the MIR-Q against POE was high (84.4%), while specificity and accuracy were considered moderate, with values of 42.7% and 58.0%, respectively. Internal consistency was moderate (KR-20 = 0.57), and test-retest was reduced (K = 0.30; P = 0.02). Conclusions: The MIR-Q was associated with high values of validity and low values of reliability. The questionnaire may be an alternative tool for musculoskeletal screening during PPE in limited medical settings (sports OR orthopedic physician) conditions. Future studies should investigate the predictive validity of the MIR-Q, and psychometric properties of the questionnaire with younger athletes.

Download Full-text

RE-engineered factory acceptance testing under the new normal

Built Environment Project and Asset Management ◽

10.1108/bepam-03-2021-0055 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Sanduni Peiris ◽

Nayanthara De Silva

Keyword(s):

Construction Industry ◽

Geographical Area ◽

Test Reliability ◽

Acceptance Testing ◽

Sri Lankan ◽

Content Type ◽

Future Studies ◽

Physical Touch ◽

Two Stages

Purpose Factory acceptance testing (FAT) in the construction industry has been severely hampered due to restrictions in cross-border travel resulting from the COVID-19 pandemic. Consequently, virtual FAT (vFAT) became a popular substitute for physical FAT. However, the credibility of vFAT is being questioned because it was adopted without much scrutiny. Hence, this study is aimed at investigating vFATs and re-engineering the FAT process to suit an effective vFAT environment.Design/methodology/approach A comprehensive literature search on FAT procedures was followed by two stages of expert interviews with eight leading subject experts and a case study. The findings were analysed using code-based content analysis on NVivo software.Findings Strengths of vFATs include “reduction in cost and time consumed”, “flexibility for more participants” and “faster orders”. Most emphasized weaknesses include “lack of reliability” and “lack of technology transfer”. vFAT has mostly increased test reliability by “improving accessibility” and has decreased reliability by “restricting physical touch and feel observation of the equipment”. A four-step vFAT process was developed with a noteworthy additional step called “Pre-FAT Meeting”.Research limitations/implications The scope of this study is limited to the Sri Lankan construction industry. Expansion of the geographical area of focus is recommended for future studies.Originality/value The findings of this study unveil a vFAT process, which is timely and beneficial for construction practitioners to optimize and enhance the effectiveness of vFATs which are currently conducted in a disarranged manner.

Download Full-text

Better Use of Scales as Measuring Instruments in Mental Disorders

10.47363/jnrrr/2020(2)117 ◽

2020 ◽

pp. 1-7

Author(s):

Satyendra Nath Chakrabartty ◽

Keyword(s):

Rating Scale ◽

Treatment Plan ◽

Zero Point ◽

Measuring Instruments ◽

Sample Population ◽

Item Score ◽

Distribution Method ◽

Test Parameters ◽

Set Up ◽

Different Response

Objective: Linearity implies high correlation but the converse is not true. For meaningful application of correlation and other descriptive and inferential analysis, checking of linearity, and assumptions of correlation including normality are needed The paper describes method of converting ordinal scores from Likert/ Rating scale to continuous, monotonic, equi-distant scores with fixed zero point and following normal distribution. Method: The method involves selection of weights for different response categories of different items so that weighted item score forms an arithmetic progression. Normalization of such score followed by further weights to items to ensure equal item-total correlation justifies addition of such converted scores. Results: Converted scores satisfying many desired properties can assess progress/deterioration of a patient over time and facilitate comparison, ranking, classification and assessing effectiveness of treatment plan. It also helps in computing reliability of Likert or Rating scale avoiding assumptions of Cronbach’s alpha. Conclusions: Converted scores will help the researchers and practitioners to find improved content validity and meaningfully undertake correlation and other analysis under parametric set up for deriving useful and valid conclusions about the sample, population and test parameters

Download Full-text

THE INFLUENCE OF EMOTIONAL AFFINITY TOWARD NATURE AND INSTITUTIONAL SUPPORT ON ENVIRONMENTAL RESPONSIBILITY BEHAVIOUR: A RELIABILITY TEST

EPRA International Journal of Multidisciplinary Research (IJMR) ◽

10.36713/epra3991 ◽

2020 ◽

pp. 72-75

Author(s):

Logeswari Uthama puthran ◽

Fais Ahmad ◽

Hazlinda Hassan

Keyword(s):

Climate Change ◽

Air Pollution ◽

Urban Sprawl ◽

Environmental Issues ◽

Test Reliability ◽

Environmental Responsibility ◽

Environmental Attitude ◽

Environmental Behaviour ◽

Future Studies ◽

Learning Institutions

Over the last few decades, environmental issues such as global warming, acid rain, air pollution, urban sprawl, waste disposal, ozone layer depletion, water pollution and climate change have facing concern among prominent world organisational annual meeting and occasions. Almost media covers the environmental and sustainability issues in their daily publications. Malaysia as a developing country starving to survive with annual climate change issues such as flash floods, haze, water and air pollution and increasing seasonal sicknesses. In conjunction, government, policymakers, learning institutions, and Non-Government Organisations (NGOs) play their role in educating people to attain sustainability lifestyle. Specifically, institution leaders as a change agent encourage adapting environmental behaviour to enhancing environmental attitude and behaviour among their stakeholders. Therefore, the purpose of this study is to test the reliability of environmental responsive behaviour, specifically among school leaders in Malaysia. For this study, 503 samples were used to test reliability. The findings indicated that all examined variables consistently reflect the construct it is measuring. Hence, adapted measurement items are reliable to use in future studies.

Download Full-text

Measuring language ability of students with compensatory multidimensional CAT: A post-hoc simulation study

Education and Information Technologies ◽

10.1007/s10639-021-10853-0 ◽

2022 ◽

Author(s):

Burhanettin Ozdemir ◽

Selahattin Gelbal

Keyword(s):

Error Variance ◽

Stopping Rule ◽

Language Ability ◽

Test Reliability ◽

Item Selection ◽

Ability Estimation ◽

Proficiency Tests ◽

Test Length ◽

Post Hoc ◽

Multidimensional Cat

AbstractThe computerized adaptive tests (CAT) apply an adaptive process in which the items are tailored to individuals' ability scores. The multidimensional CAT (MCAT) designs differ in terms of different item selection, ability estimation, and termination methods being used. This study aims at investigating the performance of the MCAT designs used to measure the language ability of students and to compare the results of MCAT designs with the outcomes of corresponding paper–pencil tests. For this purpose, items in the English Proficiency Tests (EPT) were used to create a multi-dimensional item pool that consists of 599 items. The performance of the MCAT designs was evaluated and compared based on the reliability coefficients, root means square error (RMSE), test-length, and root means squared difference (RMSD) statistics, respectively. Therefore, 36 different conditions were investigated in total. The results of the post-hoc simulation designs indicate that the MCAT designs with the A-optimality item selection method outperformed MCAT designs with other item selection methods by decreasing the test length and RMSD values without any sacrifice in test reliability. Additionally, the best error variance stopping rule for each MCAT algorithm with A-optimality item selection could be considered as 0.25 with 27.9 average test length and 30 items for the fixed test-length stopping rule for the Bayesian MAP method. Overall, MCAT designs tend to decrease the test length by 60 to 65 percent and provide ability estimations with higher precision compared to the traditional paper–pencil tests with 65 to 75 items. Therefore, it is suggested to use the A-optimality method for item selection and the Bayesian MAP method for ability estimation for the MCAT designs since the MCAT algorithm with these specifications shows better performance than others.

Download Full-text

Maximizing Power in Generalizability Studies Under Budget Constraints

Journal of Educational Statistics ◽

10.3102/10769986018002197 ◽

1993 ◽

Vol 18 (2) ◽

pp. 197-206 ◽

Cited By ~ 11

Author(s):

George A. Marcoulides

Keyword(s):

Error Control ◽

Resource Constraints ◽

Generalizability Theory ◽

Error Variance ◽

Test Reliability ◽

Simple Task ◽

Mean Error ◽

Measurement Design ◽

The Mean ◽

Statistical Issues

Generalizability theory provides a framework for examining the dependability of behavioral measurements. When designing generalizability studies, two important statistical issues are generally considered: power and measurement error. Control over power and error of measurement can be obtained by manipulation of sample size and/or test reliability. In generalizability theory, the mean error variance is an estimate that takes into account both these statistical issues. When limited resources are available, determining an optimal measurement design is not a simple task. This article presents a methodology for minimizing mean error variance in generalizability studies when resource constraints are imposed.

Download Full-text

Validation of a clinical critical thinking skills test in nursing

Journal of Educational Evaluation for Health Professions ◽

10.3352/jeehp.2015.12.1 ◽

2015 ◽

Vol 12 ◽

pp. 1 ◽

Cited By ~ 7

Author(s):

Sujin Shin ◽

Dukyoo Jung ◽

Sungeun Kim

Keyword(s):

College Students ◽

Critical Thinking ◽

Item Difficulty ◽

Thinking Skills ◽

Critical Thinking Skills ◽

Test Reliability ◽

School Students ◽

Response Process ◽

Convenience Sample ◽

Nursing College

Purpose: The purpose of this study was to develop a revised version of the clinical critical thinking skills test (CCTS) and to subsequently validate its performance. Methods: This study is a secondary analysis of the CCTS. Data were obtained from a convenience sample of 284 college students in June 2011. Thirty items were analyzed using item response theory and test reliability was assessed. Test-retest reliability was measured using the results of 20 nursing college and graduate school students in July 2013. The content validity of the revised items was analyzed by calculating the degree of agreement between instrument developer intention in item development and the judgments of six experts. To analyze response process validity, qualitative data related to the response processes of nine nursing college students obtained through cognitive interviews were analyzed. Results: Out of initial 30 items, 11 items were excluded after the analysis of difficulty and discrimination parameter. When the 19 items of the revised version of the CCTS were analyzed, levels of item difficulty were found to be relatively low and levels of discrimination were found to be appropriate or high. The degree of agreement between item developer intention and expert judgments equaled or exceeded 50%. Conclusion: From above results, evidence of the response process validity was demonstrated, indicating that subjects respondeds as intended by the test developer. The revised 19-item CCTS was found to have sufficient reliability and validity and will therefore represents a more convenient measurement of critical thinking ability.

Download Full-text

Some Psychometric Problems of the Matching Familiar Figures Test

Perceptual and Motor Skills ◽

10.2466/pms.1976.43.3.731 ◽

1976 ◽

Vol 43 (3) ◽

pp. 731-742 ◽

Cited By ~ 4

Author(s):

Hideo Kojima

Keyword(s):

Response Time ◽

Test Performance ◽

Item Difficulty ◽

Error Variance ◽

Intelligence Test ◽

Scoring Method ◽

Matching Test ◽

Matching Response ◽

Almost All ◽

Matching Familiar Figures Test

Administered were Kagan's Matching Familiar Figures Test and a group intelligence test to 151 male and 130 female Japanese 2nd graders, and detailed analyses of responses were made. While matching response time was high in internal consistency, errors were much less consistent. Variant positions were differentially selected by 4 groups, and the position where correct variants were placed partially accounted for “the error variance” of errors. But it seemed that errors of the matching test could not be made reliable enough simply by refining and lengthening the present version of Kagan's test. While slow-accurate, fast-accurate, and slow-inaccurate children adjusted their response time to item difficulty, fast-inaccurate children failed to do so. Almost all of these results were replicated in third and fifth graders. By the ordinary scoring method of intelligence test, the 4 matching groups differed from each other only in girls. But by adjusting the scores for errors, intelligence test performance came to correlate with the matching figures even in boys.

Download Full-text