Using Machine Learning Methods to Develop a Short Tree-Based Adaptive Classification Test: Case Study With a High-Dimensional Item Pool and Imbalanced Data

2020 ◽  
Vol 44 (7-8) ◽  
pp. 499-514
Author(s):  
Yi Zheng ◽  
Hyunjung Cheon ◽  
Charles M. Katz

This study explores advanced machine learning techniques to develop a short tree-based adaptive classification test from an existing lengthy instrument. A case study was carried out for an assessment of risk for juvenile delinquency. Two distinctive features of this case are that (a) the items in the original instrument measure a large number of distinct constructs, and (b) the target outcomes are of low prevalence, which yields imbalanced training data. Due to the high dimensionality of the items, traditional item response theory (IRT)-based adaptive testing approaches may not work well, whereas decision trees, developed in the machine learning discipline, present a promising alternative for adaptive tests. A cross-validation study was carried out to compare eight tree-based adaptive test constructions with five benchmark methods using data from a sample of 3,975 subjects. The findings reveal that the best-performing tree-based adaptive tests yielded better classification accuracy than the benchmark method of IRT scoring with optimal cutpoints, and comparable or better classification accuracy than the best benchmark method, random forest with balanced sampling. The competitive classification accuracy of the tree-based adaptive tests also comes with an over 30-fold reduction in instrument length, administering only 3 to 6 items to any individual. This study suggests that tree-based adaptive tests have enormous potential for shortening instruments that measure a large variety of constructs.
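The tree-based adaptive test described above can be sketched with a depth-limited decision tree: each split consumes one item response, so the tree depth caps the number of items any respondent is asked. This is an illustrative sketch on synthetic Likert-style data, not the study's instrument or fitted model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_items = 120                                   # high-dimensional item pool
X = rng.integers(0, 5, size=(1000, n_items))    # Likert-style item responses
y = (X[:, 3] + X[:, 57] > 6).astype(int)        # low-prevalence outcome (toy rule)

# class_weight="balanced" counters the imbalanced training data;
# max_depth=6 means no respondent answers more than 6 items
tree = DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=0)
tree.fit(X, y)

# The decision path for one respondent shows which items would be administered
node_ids = tree.decision_path(X[:1]).indices
items_asked = sorted({int(tree.tree_.feature[n]) for n in node_ids
                      if tree.tree_.feature[n] >= 0})
print(f"items administered: {items_asked} (at most 6 of {n_items})")
```

Because every path from root to leaf visits at most `max_depth` internal nodes, the administered-item count is bounded by construction, which is the source of the 30-fold length reduction.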

2020 ◽  
Author(s):  
Yosoon Choi ◽  
Jieun Baek ◽  
Jangwon Suh ◽  
Sung-Min Kim

In this study, we proposed a method to utilize a multi-sensor Unmanned Aerial System (UAS) for the exploration of hydrothermal alteration zones. We selected a study area (10 m × 20 m) composed mainly of andesite, located on the coast, with wide outcrops and well-developed structural and mineralization elements. Multi-sensor (visible, multispectral, thermal, and magnetic) data were acquired over the study area using the UAS and analyzed using machine learning techniques. For the machine learning analysis, we applied stratified random sampling to draw 1,000 training points from the hydrothermal zone and 1,000 from the non-hydrothermal zone identified through the field survey. The 2,000 labelled samples created for supervised learning were first split into 1,500 for training and 500 for testing; the 1,500 training samples were then split into 1,200 for training and 300 for validation. Five such training/validation splits were generated to enable cross-validation. Five machine learning techniques were applied to the training sets: k-Nearest Neighbors (k-NN), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Deep Neural Network (DNN). In the integrated analysis of the multi-sensor data, the RF and SVM techniques showed high classification accuracy of about 90%. Moreover, the integrated analysis using multi-sensor data showed higher classification accuracy for all five machine learning techniques than analyses using magnetic data or a single optical sensor alone.
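The stratified train/test split and cross-validated classifier comparison described above can be sketched as follows; the synthetic features and toy labels stand in for the actual multi-sensor rasters and field-survey labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8))                   # 2,000 labelled sample points
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)    # hydrothermal vs. not (toy label)

# 2,000 -> 1,500 train / 500 test, stratified by class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=500, stratify=y, random_state=1)

# Five-fold cross-validation on the training portion, per classifier
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, clf in [("k-NN", KNeighborsClassifier()),
                  ("RF", RandomForestClassifier(random_state=1)),
                  ("SVM", SVC())]:
    scores = cross_val_score(clf, X_tr, y_tr, cv=cv)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```

Stratification keeps the class ratio identical in every split, mirroring the stratified random sampling used in the study.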


2017 ◽  
Author(s):  
Reuben Binns ◽  
Michael Veale ◽  
Max Van Kleek ◽  
Nigel Shadbolt

The internet has become a central medium through which 'networked publics' express their opinions and engage in debate. Offensive comments and personal attacks can inhibit participation in these spaces. Automated content moderation aims to overcome this problem using machine learning classifiers trained on large corpora of texts manually annotated for offence. While such systems could help encourage more civil debate, they must navigate inherently normatively contestable boundaries, and are subject to the idiosyncratic norms of the human raters who provide the training data. An important objective for platforms implementing such measures might be to ensure that they are not unduly biased towards or against particular norms of offence. This paper provides some exploratory methods by which the normative biases of algorithmic content moderation systems can be measured, by way of a case study using an existing dataset of comments labelled for offence. We train classifiers on comments labelled by different demographic subsets (men and women) to understand how differences in conceptions of offence between these groups might affect the performance of the resulting models on various test sets. We conclude by discussing some of the ethical choices facing the implementers of algorithmic moderation systems, given various desired levels of diversity of viewpoints amongst discussion participants.
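The cross-group evaluation described here can be sketched as follows: train one classifier per annotator subgroup, then score each model against each group's held-out labels. The embeddings, group labels, and thresholds below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 10))                  # comment embeddings (toy)
score = X[:, 0] + 0.3 * X[:, 1]
labels = {"men":   (score > 0.5).astype(int),    # the two rater groups draw the
          "women": (score > 0.2).astype(int)}    # line of "offensive" differently

models, held_out = {}, {}
for group, y in labels.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
    models[group] = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    held_out[group] = (X_te, y_te)

# Cross-evaluation: rows = labels trained on, columns = labels tested on
for trained_on, model in models.items():
    for tested_on, (X_te, y_te) in held_out.items():
        print(f"trained on {trained_on:5s} -> tested on {tested_on:5s}: "
              f"accuracy {model.score(X_te, y_te):.3f}")
```

An asymmetry in the off-diagonal accuracies is one way of quantifying how much the two groups' conceptions of offence diverge.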


2019 ◽  
Vol 11 (5) ◽  
pp. 503 ◽  
Author(s):  
Sachit Rajbhandari ◽  
Jagannath Aryal ◽  
Jon Osborn ◽  
Arko Lucieer ◽  
Robert Musk

Ontology-driven Geographic Object-Based Image Analysis (O-GEOBIA) contributes to the identification of meaningful objects. In fusing data from multiple sensors, the number of feature variables increases and object identification becomes a challenging task. We propose a methodological contribution that extends feature variable characterisation. This method is illustrated with a case study in forest-type mapping in Tasmania, Australia. Satellite images, airborne LiDAR (Light Detection and Ranging) and expert photo-interpretation data are fused for feature extraction and classification. Two machine learning algorithms, Random Forest and Boruta, are used to identify important and relevant feature variables. A variogram is used to describe textural and spatial features. Different variogram features are used as input for rule-based classifications. The rule-based classifications employed (i) spectral features, (ii) vegetation indices, (iii) LiDAR features, and (iv) variogram features, and resulted in overall classification accuracies of 77.06%, 78.90%, 73.39% and 77.06%, respectively. Following data fusion, the use of combined feature variables resulted in a higher classification accuracy (81.65%). Using relevant features extracted with the Boruta algorithm, the classification accuracy was further improved (82.57%). The results demonstrate that the use of relevant variogram features together with spectral and LiDAR features resulted in improved classification accuracy.
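The feature-relevance step can be sketched with the core idea behind Boruta: append shuffled "shadow" copies of every feature, fit a Random Forest, and keep the features whose importance beats the best shadow. The real Boruta algorithm iterates this with statistical testing; the data here are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 6))               # e.g. spectral + LiDAR + variogram features
y = (X[:, 0] - X[:, 2] > 0).astype(int)     # only features 0 and 2 matter (toy)

# Shadow features: each column shuffled independently, destroying any
# relationship with the labels while keeping the marginal distribution
shadows = rng.permuted(X, axis=0)

rf = RandomForestClassifier(n_estimators=200, random_state=3)
rf.fit(np.hstack([X, shadows]), y)

real_importance = rf.feature_importances_[:6]
shadow_max = rf.feature_importances_[6:].max()
relevant = np.where(real_importance > shadow_max)[0]
print("features beating the best shadow:", relevant.tolist())
```

Because the shadows are informative by chance alone, their maximum importance acts as a data-driven noise floor for relevance.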


Author(s):  
Mohannad Elhamod ◽  
Kelly M. Diamond ◽  
A. Murat Maga ◽  
Yasin Bakis ◽  
Henry L. Bart ◽  
...  

Abstract
Fish species classification is an important task that is the foundation of many industrial, commercial, ecological, and scientific applications involving the study of fish distributions, dynamics, and evolution. While conventional approaches for this task use off-the-shelf machine learning (ML) methods such as existing Convolutional Neural Network (ConvNet) architectures, there is an opportunity to inform the ConvNet architecture using our knowledge of biological hierarchies among taxonomic classes. In this work, we propose infusing phylogenetic information into the model's training to guide its structure and the relationships among the extracted features. In our extensive experimental analyses, the proposed model, named Hierarchy-Guided Neural Network (HGNN), outperforms conventional ConvNet models in terms of classification accuracy under scarce training data conditions. We also observe that HGNN shows better resilience to adversarial occlusions, when some of the most informative patch regions of the image are intentionally blocked and their effect on classification accuracy is studied.
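One way to read "infusing phylogenetic information into training" is an auxiliary loss at a coarser taxonomic level computed alongside the species-level loss. The following is a minimal numerical sketch of that idea, not the HGNN architecture; the species-to-genus mapping, pooling rule, and weight are illustrative assumptions:

```python
import numpy as np

def softmax_xent(logits, target):
    # cross-entropy of a softmax distribution against an integer label
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[target])

species_to_genus = np.array([0, 0, 1, 1, 2])   # 5 species grouped into 3 genera

def hierarchy_loss(species_logits, species_label, alpha=0.5):
    # genus logits pooled from the species logits within each genus
    genus_logits = np.array([species_logits[species_to_genus == g].max()
                             for g in range(3)])
    genus_label = species_to_genus[species_label]
    return (softmax_xent(species_logits, species_label)
            + alpha * softmax_xent(genus_logits, genus_label))

logits = np.array([2.0, 0.1, -1.0, 0.3, 0.0])
print(f"loss for the true species: {hierarchy_loss(logits, 0):.3f}")
print(f"loss for a wrong species:  {hierarchy_loss(logits, 2):.3f}")
```

The auxiliary term penalizes confusions across genera more than confusions within a genus, pulling the learned features toward the biological hierarchy.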


Author(s):  
Mojtaba Haghighatlari ◽  
Ching-Yen Shih ◽  
Johannes Hachmann

The appropriate sampling of training data out of a potentially imbalanced data set is of critical importance for the development of robust and accurate machine learning models. A challenge that underpins this task is the partitioning of the data into groups of similar instances, and the analysis of the group populations. In molecular data sets, different groups of molecules may be hard to identify. However, if the distribution of a given data set is ignored, some of these groups may remain under-represented and the sampling biased, even if the data set is large. In this study, we use the example of the Harvard Clean Energy Project (CEP) data set to assess the challenges posed by imbalanced data and the impact that accounting for different groups during the selection of training data has on the quality of the resulting machine learning models. We employ a partitioning criterion based on the underlying rules for the CEP molecular library generation to identify groups of structurally similar compounds. First, we evaluate the performance of regression models that are trained globally (i.e., by randomly sampling the entire data set for training data). This traditional approach serves as the benchmark reference. We compare its results with those of models that are trained locally, i.e., within each of the identified molecular domains. We demonstrate that local models outperform the best reported global models by considerable margins and are more efficient in their training data needs. We propose a strategy to redesign training sets for the development of improved global models. While the resulting uniform training sets can successfully yield robust global models, we identify the distribution mismatch between feature representations of different molecular domains as a critical limitation for any further improvement. We take advantage of the discovered distribution shift and propose an ensemble of classification and regression models to achieve a generalized and reliable model that outperforms the state-of-the-art model trained on the CEP data set. Moreover, this study provides a benchmark for the development of future methodologies concerned with imbalanced chemical data.
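The global-versus-local comparison can be sketched on synthetic data with domain-specific structure; the domains, features, and targets below are stand-ins for the CEP partitions, not the actual chemistry:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(4)
domains = {}
for d, slope in enumerate([1.0, -2.0, 0.5]):          # three structural domains
    X = rng.normal(size=(300, 3))
    y = slope * X[:, 0] + 0.1 * rng.normal(size=300)  # domain-specific relation
    domains[d] = (X, y)

# Global model: trained on the pooled data, ignoring domain structure
X_all = np.vstack([X for X, _ in domains.values()])
y_all = np.concatenate([y for _, y in domains.values()])
global_model = LinearRegression().fit(X_all, y_all)

# Local models: one regressor per domain
for d, (X, y) in domains.items():
    local_model = LinearRegression().fit(X, y)
    print(f"domain {d}: global MAE {mean_absolute_error(y, global_model.predict(X)):.3f}, "
          f"local MAE {mean_absolute_error(y, local_model.predict(X)):.3f}")
```

Because the global fit averages the conflicting per-domain relationships, its per-domain error is far worse than that of the local models, which is the effect the study reports for the CEP domains.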


2021 ◽  
Vol 12 (1) ◽  
pp. 1-17
Author(s):  
Swati V. Narwane ◽  
Sudhir D. Sawarkar

Class imbalance is a major hurdle for machine learning-based systems. The data set is the backbone of machine learning, and it must be studied in order to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on data sets. The proposed methodology determines model accuracy across class distributions. To find possible solutions, the behaviour of imbalanced data sets was investigated. The study considers two case studies, with each data set divided from a balanced to an unbalanced class distribution. The data sets were evaluated with training and test data using standard machine learning algorithms. Model accuracy was measured for each class distribution on the training data set, and the built model was then tested on each individual binary class. Results show that addressing the class imbalance problem is essential for improving system performance. The study concludes that the system produces biased results due to the majority class. In future work, the multiclass imbalance problem can be studied using advanced algorithms.
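The bias the study reports can be reproduced in miniature: on an imbalanced set, overall accuracy looks fine while the minority class is largely misclassified, which is why testing each binary class separately matters. The data and model here are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(5)
n_maj, n_min = 950, 50                          # 95:5 class distribution
X = np.vstack([rng.normal(0, 1, (n_maj, 2)),
               rng.normal(1, 1, (n_min, 2))])   # heavy class overlap
y = np.array([0] * n_maj + [1] * n_min)

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)
print(f"overall accuracy: {accuracy_score(y, pred):.3f}")
print(f"majority recall:  {recall_score(y, pred, pos_label=0):.3f}")
print(f"minority recall:  {recall_score(y, pred, pos_label=1):.3f}")
```

The headline accuracy is dominated by the majority class, so per-class recall is the metric that exposes the bias.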


Author(s):  
Shivani Vasantbhai Vora ◽  
Rupa G. Mehta ◽  
Shreyas Kishorkumar Patel

Continuously evolving technology enhances creativity, simplifies our lives, and offers the possibility of anticipating and satisfying unmet needs. Understanding emotion is a crucial part of understanding human behavior, and machines must understand emotions deeply to be able to predict human needs. Most tweets carry the sentiment of their author, and the resulting data inherit an imbalanced class distribution. Most machine learning (ML) algorithms tend to become biased towards the majority classes. The imbalanced distribution of classes has gained extensive attention, as it poses many research challenges and demands efficient approaches for handling imbalanced data sets. The strategies used for balancing the class distribution in this case study are handling redundant data, resampling the training data, and data augmentation. Six methods related to these techniques were examined. Experiments on the Twitter dataset show that merging minority classes and the shuffle-sentence method outperform the other techniques.
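Two of the balancing strategies mentioned, resampling (random oversampling) and merging minority classes, can be sketched as follows; the emotion labels and counts are made up, not the Twitter dataset:

```python
import numpy as np

rng = np.random.default_rng(6)
labels = np.array(["joy"] * 800 + ["anger"] * 60 + ["fear"] * 40)
classes, counts = np.unique(labels, return_counts=True)

# Strategy 1: randomly oversample every class up to the majority count
target = counts.max()
balanced_idx = np.concatenate([
    rng.choice(np.where(labels == c)[0], size=target, replace=True)
    for c in classes])
print("after oversampling:",
      dict(zip(*np.unique(labels[balanced_idx], return_counts=True))))

# Strategy 2: merge the sparse minority classes into a single label
merged = np.where(np.isin(labels, ["anger", "fear"]), "negative", labels)
print("after merging:", dict(zip(*np.unique(merged, return_counts=True))))
```

Oversampling equalizes the class counts without discarding majority data, while merging trades label granularity for a less extreme imbalance.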


Author(s):  
Carlos Sáez ◽  
Nekane Romero ◽  
J Alberto Conejero ◽  
Juan M García-Gómez

Abstract
Objective: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning.
Materials and Methods: We used the publicly available nCov2019 dataset, which includes patient-level data from several countries. We aimed at discovering and classifying severity subgroups using symptoms and comorbidities.
Results: Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of the training data with respect to the model's target populations and increase model complexity at the risk of overfitting.
Conclusions: Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
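The kind of per-source check the authors call for can be sketched by comparing the label distribution of each data source before pooling sources for training; the countries and severity rates below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
sources = {"country_A": rng.binomial(1, 0.30, size=500),   # ~30% severe
           "country_B": rng.binomial(1, 0.05, size=500)}   # ~5% severe

# Per-source severity rate vs. the rate of the naively pooled data
rates = {s: y.mean() for s, y in sources.items()}
pooled = np.concatenate(list(sources.values())).mean()
for s, r in rates.items():
    print(f"{s}: severe rate {r:.2f}")
print(f"pooled: {pooled:.2f}  (represents neither source well)")
```

A large gap between per-source rates is the simplest signal of the source variability that can bias a model trained on the pooled data.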


Author(s):  
Dr. Kalaivazhi Vijayaragavan ◽  
S. Prakathi ◽  
S. Rajalakshmi ◽  
M Sandhiya

Machine learning is a subfield of artificial intelligence in which algorithms learn to make decisions from data, attempting to mimic human behavior. Classification is one of the most fundamental concepts in machine learning: the process of recognizing, understanding, and grouping ideas and objects into pre-set categories or sub-populations. Using pre-categorized training datasets, machine learning classifiers apply a variety of algorithms to assign future data to one of the predetermined categories. To improve classification accuracy, careful neural network design is regarded as an effective approach; such designs usually consist of a scaling layer, perceptron layers, and a probabilistic layer. In this paper, an enhanced model selection is evaluated with a training and testing strategy, and the classification accuracy is then predicted. Finally, the predicted classification accuracy is compared across two popular machine learning frameworks, PyTorch and TensorFlow. Results demonstrate that the proposed method predicts with higher accuracy. After deployment, the performance of the machine learning model was evaluated on the Iris data set.
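A framework-agnostic sketch of the evaluation described, a scaling step, a small perceptron network, and a held-out accuracy estimate on the Iris data, using scikit-learn only to keep the example self-contained (the paper itself compares PyTorch and TensorFlow models):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=8)

# Scaling layer followed by a small perceptron network, echoing the
# scaling / perceptron / probabilistic layer design mentioned above
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=8))
clf.fit(X_tr, y_tr)
print(f"held-out test accuracy: {clf.score(X_te, y_te):.3f}")
```

The same train/test protocol transfers directly to a PyTorch or TensorFlow model; only the classifier object changes.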

