Multi-Keyword Classification: A Case Study in Finnish Social Sciences Data Archive

In this paper, we consider the task of assigning relevant labels to studies in the social science domain. Manual labelling is an expensive process and prone to human error. Various multi-label text classification machine learning approaches have been proposed to resolve this problem. We introduce a dataset obtained from the Finnish Social Science Archive, comprising the metadata of 2968 research studies. The metadata of each study includes attributes such as the “abstract” and the “set of labels”. We used Bag of Words (BoW), TF-IDF term weighting and pretrained word embeddings obtained from FastText and BERT models to generate the text representations for each study’s abstract field. Our selection of multi-label classification methods includes a Naive approach, Multi-label k Nearest Neighbours (ML-kNN), Multi-Label Random Forest (ML-RF), X-BERT and Parabel. The methods were combined with the text representation techniques and their performance was evaluated on our dataset. We measured the classification accuracy of the combinations using Precision, Recall and F1 metrics. In addition, we used the Normalized Discounted Cumulative Gain to measure the label ranking performance of the selected methods combined with the text representation techniques. The results showed that the ML-RF model achieved a higher classification accuracy with the TF-IDF features and that, based on the ranking score, the Parabel model outperformed the other methods.
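The evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes micro-averaging for Precision, Recall and F1 and binary relevance for the Normalized Discounted Cumulative Gain; the abstract does not state which variants the authors used, and the function names and toy labels are hypothetical.

```python
import math

def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over all (study, label) pairs.

    Assumption: micro-averaging; the abstract does not say which averaging was used.
    """
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

def ndcg_at_k(true_labels, ranked_labels, k):
    """nDCG@k with binary relevance: a ranked label counts if it is in true_labels."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, label in enumerate(ranked_labels[:k])
              if label in true_labels)
    ideal_hits = min(k, len(true_labels))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Toy example with made-up labels for two studies.
true_sets = [{"health", "youth"}, {"elections"}]
pred_sets = [{"health", "work"}, {"elections"}]
precision, recall, f1 = micro_prf(true_sets, pred_sets)
ranking_score = ndcg_at_k({"health", "youth"}, ["health", "work", "youth"], k=3)
```

Set-based metrics judge only which labels were predicted, whereas nDCG also rewards placing the correct labels near the top of the ranked list. That distinction is how one method can lead on F1 while another leads on the ranking score.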

Evaluating classification accuracy for modern learning approaches

Statistics in Medicine ◽ 10.1002/sim.8103 ◽ 2019 ◽ Vol 38 (13) ◽ pp. 2477-2503 ◽ Cited By ~ 10

Author(s): Jialiang Li ◽ Ming Gao ◽ Ralph D'Agostino

Keyword(s): Classification Accuracy ◽ Learning Approaches ◽ Modern Learning

When 4 ≈ 10,000: The Power of Social Science Knowledge in Predictive Performance

Socius: Sociological Research for a Dynamic World ◽ 10.1177/2378023118811774 ◽ 2019 ◽ Vol 5 ◽ pp. 237802311881177 ◽ Cited By ~ 1

Author(s): Stephen McKay

Keyword(s): Social Science ◽ Data Science ◽ Predictive Performance ◽ Science Methods ◽ Learning Approaches ◽ Social Scientists ◽ Individual Variables ◽ Predicted Values ◽ Social Scientific ◽ New Variables

Computer science has devised leading methods for predicting variables; can social science compete? The author sets out a social scientific approach to the Fragile Families Challenge. Key insights included new variables constructed according to theory (e.g., a measure of shame relating to hardship), lagged values of the target variables, using predicted values of certain outcomes to inform others, and validated scales rather than individual variables. The models were competitive: a four-variable logistic regression model was placed second for predicting layoffs, narrowly beaten by a model using all the available variables (>10,000) and an ensemble of algorithms. Similarly, a relatively small random forest model (25 variables) was ranked seventh in predicting material hardship. However, a similar approach overfitted the prediction of grit. Machine learning approaches proved superior to linear regression for modeling the continuous outcomes. Overall, social scientists can contribute to predictive performance while benefiting from learning more about data science methods.

Automatic Genre Classification Using Fractional Fourier Transform Based Mel Frequency Cepstral Coefficient and Timbral Features

Archives of Acoustics ◽ 10.1515/aoa-2017-0024 ◽ 2017 ◽ Vol 42 (2) ◽ pp. 213-222 ◽ Cited By ~ 1

Author(s): Daulappa Guranna Bhalke ◽ Betsy Rajesh ◽ Dattatraya Shankar Bormane

Keyword(s): Fourier Transform ◽ Classification Accuracy ◽ Classical Music ◽ Fractional Fourier Transform ◽ Support Vector ◽ Spectral Kurtosis ◽ Genre Classification ◽ Nearest Neighbours ◽ Spectral Flux ◽ Mel Frequency Cepstral Coefficient

This paper presents the Automatic Genre Classification of Indian Tamil Music and Western Music using Timbral and Fractional Fourier Transform (FrFT) based Mel Frequency Cepstral Coefficient (MFCC) features. The classifier model for the proposed system has been built using K-NN (K-Nearest Neighbours) and Support Vector Machine (SVM). In this work, the performance of various features extracted from music excerpts has been analysed to identify the appropriate feature descriptors for the two major genres of Indian Tamil music, namely Classical music (Carnatic based devotional hymn compositions) and Folk music, and for the western genres of Rock and Classical music from the GTZAN dataset. The results for Tamil music have shown that the feature combination of Spectral Roll-off, Spectral Flux, Spectral Skewness and Spectral Kurtosis, combined with Fractional MFCC features, outperforms all other feature combinations, yielding a classification accuracy of 96.05%, compared to an accuracy of 84.21% with conventional MFCC. It has also been observed that the FrFT based MFCC efficiently classifies the two western genres of Rock and Classical music from the GTZAN dataset, with a classification accuracy of 96.25% compared to 80% with MFCC.

Insights into few shot learning approaches for image scene classification

PeerJ Computer Science ◽ 10.7717/peerj-cs.666 ◽ 2021 ◽ Vol 7 ◽ pp. e666

Author(s): Mohamed Soudy ◽ Yasmine Afify ◽ Nagwa Badr

Keyword(s): Classification Accuracy ◽ Large Data ◽ Image Understanding ◽ Research Area ◽ Optimal Performance ◽ Learning Approaches ◽ Learning Models ◽ Scene Classification ◽ Training Set ◽ Machine Learning Models

Image understanding and scene classification are keystone tasks in computer vision. The development of technologies and the profusion of existing datasets leave ample room for improvement in the image classification and recognition research area. Notwithstanding the optimal performance of existing machine learning models in image understanding and scene classification, there are still obstacles to overcome. All models are data-dependent and can only classify samples close to the training set. Moreover, these models require large data for training and learning. The first problem is solved by few-shot learning, which achieves optimal performance in object detection and classification but has received little attention in the scene classification task. Motivated by these findings, in this paper, we introduce two models for few-shot learning in scene classification. In order to trace the behavior of those models, we also introduce two datasets (MiniSun; MiniPlaces) for image scene classification. Experimental results show that the proposed models outperform the benchmark approaches with respect to classification accuracy.

Dcmd: Distance-based classification using mixture distributions on microbiome data

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1008799 ◽ 2021 ◽ Vol 17 (3) ◽ pp. e1008799

Author(s): Konstantin Shestopaloff ◽ Mei Dong ◽ Fan Gao ◽ Wei Xu

Keyword(s): Machine Learning ◽ Count Data ◽ Human Microbiome ◽ Mixture Distribution ◽ Classification Performance ◽ Study Data ◽ Mixture Distributions ◽ Learning Approaches ◽ Nearest Neighbours ◽ Microbiome Data

Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented into a k-means classification and k-nearest neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.

Scoring algorithms for a computer-based cognitive screening tool: An illustrative example of overfitting machine learning approaches and the impact on estimates of classification accuracy.

Psychological Assessment ◽ 10.1037/pas0000764 ◽ 2019 ◽ Vol 31 (11) ◽ pp. 1377-1382 ◽ Cited By ~ 2

Author(s): Jake Ursenbach ◽ Megan E. O'Connell ◽ Jennafer Neiser ◽ Mary C. Tierney ◽ Debra Morgan ◽ ...

Keyword(s): Machine Learning ◽ Classification Accuracy ◽ Screening Tool ◽ Learning Approaches ◽ Cognitive Screening ◽ Computer Based ◽ Scoring Algorithms ◽ The Impact

Empirical Analysis of Case-Based Reasoning and Other Prediction Methods in a Social Science Domain: Repeat Criminal Victimization

Case-Based Reasoning Research and Development - Lecture Notes in Computer Science ◽ 10.1007/3-540-45006-8_35 ◽ 2007 ◽ pp. 452-464 ◽ Cited By ~ 4

Author(s): Michael A. Redmond ◽ Cynthia Blackburn Line

Keyword(s): Social Science ◽ Empirical Analysis ◽ Prediction Methods ◽ Case Based Reasoning ◽ Criminal Victimization ◽ Science Domain ◽ Case Based

A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets

Applied Sciences ◽ 10.3390/app9214500 ◽ 2019 ◽ Vol 9 (21) ◽ pp. 4500 ◽ Cited By ~ 8

Author(s): Phung ◽ Rhee

Keyword(s): Machine Learning ◽ Deep Learning ◽ Classification Accuracy ◽ Complete Solution ◽ Large Data ◽ High Accuracy ◽ Learning Approaches ◽ Model Average ◽ Image Patches

Research on clouds has an enormous influence on sky sciences and related applications, and cloud classification plays an essential role in it. Much research has been conducted, including both traditional machine learning approaches and deep learning approaches. Compared with traditional machine learning approaches, deep learning approaches have achieved better results. However, most deep learning models need large data to train because of their large number of parameters, so they cannot reach high accuracy on small datasets. In this paper, we propose a complete solution for high-accuracy classification of cloud image patches on small datasets. Firstly, we designed a suitable convolutional neural network (CNN) model for small datasets. Secondly, we applied regularization techniques to increase generalization and avoid overfitting of the model. Finally, we introduce a model average ensemble to reduce the variance of prediction and increase the classification accuracy. We evaluate the proposed solution on the Singapore whole-sky imaging categories (SWIMCAT) dataset, where it demonstrates perfect classification accuracy for most classes, confirming the robustness of the proposed model.
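The model average ensemble mentioned in this abstract can be sketched as follows. The sketch assumes that each model outputs a class-probability vector for an input and that the vectors are averaged with equal weights; the paper's actual architecture and weighting are not given here, and the function names and numbers are hypothetical.

```python
def average_ensemble(prob_lists):
    """Average the class-probability vectors that several models give for one input.

    Assumption: equal weights for all models.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(probs[c] for probs in prob_lists) / n_models
            for c in range(n_classes)]

def predict_class(prob_lists):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_ensemble(prob_lists)
    return max(range(len(averaged)), key=averaged.__getitem__)

# Three hypothetical models scoring one image patch over two classes.
model_outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
averaged = average_ensemble(model_outputs)
winner = predict_class(model_outputs)
```

Averaging damps the effect of any single model's errors, which is the variance reduction the abstract refers to.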

A Survey On Missing Data in Machine Learning

10.21203/rs.3.rs-535520/v1 ◽ 2021

Author(s): Tlamelo Emmanuel ◽ Thabiso Maupong ◽ Dimane Mpoeleng ◽ Thabo Semong ◽ Mphago Banyatsang ◽ ...

Keyword(s): Machine Learning ◽ Missing Data ◽ Human Error ◽ Missing Values ◽ Nearest Neighbor ◽ Research Direction ◽ Machine Learning Techniques ◽ Future Research ◽ Learning Approaches ◽ K Nearest Neighbor

Machine learning has been the cornerstone of analysing and extracting information from data, and the problem of missing values is often encountered. Missing values arise under various mechanisms: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting missing values may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations and the kind of data they are most suitable for. Finally, we experiment with the K nearest neighbor and random forest imputation techniques on novel power plant induced fan data and offer some possible future research directions.
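The K nearest neighbor imputation mentioned in this abstract can be sketched in plain Python. This is only an illustration under stated assumptions: missing entries are marked with `None`, distance is the root mean squared difference over features observed in both rows, and the fill value is the unweighted mean of the k nearest donors. The survey does not specify these details, and `knn_impute` and the toy rows are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""
    def distance(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for dist, v in donors[:k] if dist != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Made-up sensor readings; the second row is missing its second feature.
readings = [[1.0, 2.0], [1.1, None], [0.9, 1.8], [5.0, 9.0]]
completed = knn_impute(readings, k=2)
```

Here the two rows closest to the incomplete one on the observed feature supply the fill value, so the distant fourth row does not influence it.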

Calibrating Wrist-Worn Accelerometers for Physical Activity Assessment in Preschoolers: Machine Learning Approaches (Preprint)

10.2196/preprints.16727 ◽ 2019

Author(s): Shiyu Li ◽ Jeffrey T Howard ◽ Erica T Sosa ◽ Alberto Cordova ◽ Deborah Parra-Medina ◽ ...

Keyword(s): Physical Activity ◽ Machine Learning ◽ Cluster Analysis ◽ Sedentary Behavior ◽ Classification Accuracy ◽ Characteristic Curve ◽ Learning Approaches ◽ Cut Points ◽ Activity Intensity ◽ Vector Magnitude

BACKGROUND: Physical activity (PA) level is associated with multiple health benefits during early childhood. However, inconsistency in the methods for quantifying PA levels among preschoolers remains a problem. OBJECTIVE: This study aimed to develop PA intensity cut points for wrist-worn accelerometers by using machine learning (ML) approaches to assess PA in preschoolers. METHODS: Wrist- and hip-derived acceleration data were collected simultaneously from 34 preschoolers on 3 consecutive preschool days. Two supervised ML models, receiver operating characteristic curve (ROC) and ordinal logistic regression (OLR), and one unsupervised ML model, k-means cluster analysis, were applied to establish wrist-worn accelerometer vector magnitude (VM) cut points to classify accelerometer counts into sedentary behavior, light PA (LPA), moderate PA (MPA), and vigorous PA (VPA). Physical activity intensity levels identified by hip-worn accelerometer VM cut points were used as the reference to train the supervised ML models. Vector magnitude counts were classified by intensity based on the three newly established wrist methods and the hip reference to examine classification accuracy. Daily estimates of PA were compared to the hip-reference criterion. RESULTS: In total, 3600 epochs with matched hip- and wrist-worn accelerometer VM counts were analyzed. The ML approaches performed differently in developing PA intensity cut points for wrist-worn accelerometers. Among the three ML models, k-means cluster analysis derived the following cut points: ≤2556 counts per minute (cpm) for sedentary behavior, 2557-7064 cpm for LPA, 7065-14532 cpm for MPA, and ≥14533 cpm for VPA; in addition, k-means cluster analysis had the highest classification accuracy, with more than 70% of the total epochs being classified into the correct PA categories, as examined by the hip reference. Additionally, the k-means cut points gave the most accurate estimates of sedentary behavior, LPA, and VPA relative to the hip reference. None of the three wrist methods were able to accurately assess MPA. CONCLUSIONS: This study demonstrates the potential of ML approaches in establishing cut points for wrist-worn accelerometers to assess PA in preschoolers. However, the findings from this study warrant additional validation studies.
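The k-means cut points reported in this abstract can be applied to an epoch's vector magnitude count as in the sketch below. The thresholds are the ones quoted above. The function names, the toy counts and the simple agreement measure are assumptions made for illustration, not the study's code.

```python
from bisect import bisect_left

# Upper bounds in counts per minute (cpm) for sedentary, LPA and MPA,
# taken from the k-means cut points quoted in the abstract.
CUT_POINTS = [2556, 7064, 14532]
LABELS = ["sedentary", "LPA", "MPA", "VPA"]

def classify_epoch(vm_cpm):
    """Map a wrist-worn vector magnitude count (cpm) to an intensity category."""
    # bisect_left keeps a count equal to an upper bound in the lower category.
    return LABELS[bisect_left(CUT_POINTS, vm_cpm)]

def agreement(wrist_counts, reference_labels):
    """Fraction of epochs whose wrist-based category matches the reference label."""
    hits = sum(classify_epoch(count) == label
               for count, label in zip(wrist_counts, reference_labels))
    return hits / len(reference_labels)

# Hypothetical epochs: wrist counts with made-up hip-reference categories.
wrist = [100, 3000, 8000, 20000]
hip_reference = ["sedentary", "LPA", "LPA", "VPA"]
share_correct = agreement(wrist, hip_reference)
```

An agreement fraction of this kind corresponds to the classification accuracy the abstract reports against the hip reference.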