Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches

We compare two supervised machine learning algorithms, Multinomial Naïve Bayes and Gradient Boosting, to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting dataset consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multi-label dataset is used to train the machine learning algorithms in different configurations. We deploy a multi-label classifier chaining model, allowing for an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data, so it can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social sciences publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social sciences documents.
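The classifier chaining idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a small hand-written Multinomial Naïve Bayes as the base learner, and the label names (`economics`, `politics`), the toy documents, and the `__label__` pseudo-token trick for passing earlier decisions down the chain are all assumptions made for illustration.

```python
import math
from collections import Counter


class BinaryNB:
    """Multinomial Naive Bayes for a single yes/no label (add-one smoothing)."""

    def fit(self, docs, targets):
        # docs: list of token lists; targets: list of 0/1 values
        self.vocab = {tok for doc in docs for tok in doc}
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = Counter(targets)
        for doc, y in zip(docs, targets):
            self.counts[y].update(doc)
        return self

    def predict(self, doc):
        total_docs = sum(self.n_docs.values())
        scores = {}
        for y in (0, 1):
            # Smoothed log prior plus smoothed log likelihood of each known token
            score = math.log((self.n_docs[y] + 1) / (total_docs + 2))
            denom = sum(self.counts[y].values()) + len(self.vocab)
            for tok in doc:
                if tok in self.vocab:
                    score += math.log((self.counts[y][tok] + 1) / denom)
            scores[y] = score
        return 1 if scores[1] > scores[0] else 0


class ClassifierChain:
    """One binary model per label; each model also sees the earlier labels."""

    def __init__(self, label_names):
        self.label_names = list(label_names)
        self.models = []

    def fit(self, docs, label_sets):
        self.models = []
        for i, name in enumerate(self.label_names):
            earlier = self.label_names[:i]
            # Training uses the true values of earlier labels as extra tokens
            augmented = [
                doc + [f"__label__{p}" for p in earlier if p in labels]
                for doc, labels in zip(docs, label_sets)
            ]
            targets = [1 if name in labels else 0 for labels in label_sets]
            self.models.append(BinaryNB().fit(augmented, targets))
        return self

    def predict(self, doc):
        features, assigned = list(doc), []
        for name, model in zip(self.label_names, self.models):
            if model.predict(features):
                assigned.append(name)
                # Prediction feeds the model's own earlier decisions forward
                features.append(f"__label__{name}")
        return assigned


# Hypothetical toy corpus: token lists with one or more labels each
docs = [
    ["market", "price", "trade"],
    ["election", "vote", "party"],
    ["market", "price", "inflation"],
    ["vote", "party", "parliament"],
    ["trade", "policy", "vote", "market"],
]
label_sets = [
    {"economics"}, {"politics"}, {"economics"}, {"politics"},
    {"economics", "politics"},
]

chain = ClassifierChain(["economics", "politics"]).fit(docs, label_sets)
print(chain.predict(["market", "price"]))
print(chain.predict(["trade", "policy", "vote", "market"]))
```

Because every label gets its own binary decision, a document can end up with zero, one, or several categories. The order of labels in the chain is a design choice that affects the results; the abstract does not say which order was used.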

A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions

Genes ◽

10.3390/genes11090985 ◽

2020 ◽

Vol 11 (9) ◽

pp. 985 ◽

Cited By ~ 2

Author(s):

Thomas Vanhaeren ◽

Federico Divina ◽

Miguel García-Torres ◽

Francisco Gómez-Vela ◽

Wim Vanhoof ◽

...

Keyword(s):

Machine Learning ◽

Transcription Factors ◽

Long Range ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

The Other ◽

Supervised Machine Learning ◽

Chromatin Interaction ◽

Gradient Boosting ◽

Chromatin Interactions

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data of chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative.
Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best suited algorithm for this task and highlights cell-type specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.

P.1.b.003 Supervised machine learning algorithms predict “correct” classification of retinal ganglia neuron subtypes

European Neuropsychopharmacology ◽

10.1016/s0924-977x(08)70261-x ◽

2008 ◽

Vol 18 ◽

pp. S218

Author(s):

S. Matthews ◽

H. Jelinek ◽

C.S. McLachlan ◽

I. Spence

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Correct Classification

Application of supervised machine learning algorithms in the classification of sagittal gait patterns of cerebral palsy children with spastic diplegia

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2019.01.009 ◽

2019 ◽

Vol 106 ◽

pp. 33-39 ◽

Cited By ~ 13

Author(s):

Yanxin Zhang ◽

Ye Ma

Keyword(s):

Machine Learning ◽

Cerebral Palsy ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Spastic Diplegia ◽

Gait Patterns

Advanced Supervised Machine Learning Algorithms for Efficient Electrofacies Classification of a Carbonate Reservoir in a Giant Southern Iraqi Oil Field

10.4043/30906-ms ◽

2020 ◽

Cited By ~ 1

Author(s):

Watheq J Al-Mudhafar

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Carbonate Reservoir ◽

Oil Field ◽

Machine Learning Algorithms ◽

Supervised Machine Learning

Application of supervised machine learning algorithms for the classification of regulatory RNA riboswitches

Briefings in Functional Genomics ◽

10.1093/bfgp/elw005 ◽

2016 ◽

pp. elw005 ◽

Cited By ~ 5

Author(s):

Swadha Singh ◽

Raghvendra Singh

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Regulatory RNA

A comparative study of supervised machine learning algorithms for the prediction of long-range chromatin interactions

10.1101/2020.06.09.141473 ◽

2020 ◽

Author(s):

Thomas Vanhaeren ◽

Federico Divina ◽

Miguel García-Torres ◽

Francisco Gómez-Vela ◽

Wim Vanhoof ◽

...

Keyword(s):

Machine Learning ◽

Transcription Factors ◽

Long Range ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

The Other ◽

Supervised Machine Learning ◽

Chromatin Interaction ◽

Gradient Boosting ◽

Chromatin Interactions

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data of chromatin interactions is not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to carry out performance comparison of existing methods. In this study, we use publicly available 1D sequencing signals to model chromatin interactions in two human cell lines and evaluate the prediction performance of five popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines and multi-layer perceptron. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other four algorithms, yielding accuracies of ~95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative.
Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best suited algorithm for this task and highlights cell-type specific binding of transcription factors at the anchors as important determinants of chromatin wiring.

Classification of hazelnut cultivars: comparison of DL4J and ensemble learning algorithms

Notulae Botanicae Horti Agrobotanici Cluj-Napoca ◽

10.15835/nbha48412041 ◽

2020 ◽

Vol 48 (4) ◽

pp. 2316-2327

Author(s):

Caner KOC ◽

Dilara GERDAN ◽

Maksut B. EMİNOĞLU ◽

Uğur YEGÜL ◽

Bulent KOC ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Random Forest ◽

Ensemble Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Performance Criteria ◽

Gradient Boosting ◽

Data Set

Classification of hazelnuts is one of the value-adding processes that increase the marketability and profitability of hazelnut production. While traditional classification methods are commonly used, machine learning and deep learning can be implemented to enhance hazelnut classification. This paper presents the results of a comparative study of machine learning frameworks to classify hazelnut (Corylus avellana L.) cultivars (‘Sivri’, ‘Kara’, ‘Tombul’) using DL4J and ensemble learning algorithms. For each cultivar, 50 samples were used for evaluation. Maximum length, width, compression strength, and weight of the hazelnuts were measured using a caliper and a force transducer. Gradient boosting machine (boosting), random forest (bagging), and DL4J feedforward (deep learning) algorithms were applied. The models were evaluated using 10-fold cross-validation. The classifier performance criteria of accuracy (%), error percentage (%), F-measure, Cohen’s kappa, recall, precision, and true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values are provided in the results section. The results showed classification accuracies of 94% for Gradient Boosting, 100% for Random Forest, and 94% for DL4J Feedforward algorithms.
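The evaluation setup described in this abstract (10-fold cross-validation and criteria derived from TP, FP, TN, and FN counts) can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names `k_fold_indices` and `binary_metrics` are hypothetical, the confusion counts are invented, and treating one cultivar against the other two as a binary problem is an assumption. The only figure taken from the abstract is the sample size (3 cultivars × 50 samples = 150).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def binary_metrics(tp, fp, tn, fn):
    """Standard performance criteria computed from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    # Cohen's kappa: observed agreement corrected for chance agreement
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# 150 samples, as implied by the abstract; each fold serves once as the test set
folds = k_fold_indices(150, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(150) if i not in held_out]
    # A model would be fitted on `train` and scored on `test_fold` here.

# Invented counts for one cultivar versus the rest, pooled over all folds
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a three-class problem such as this one, the per-class counts would typically be computed for each cultivar in turn and then averaged; the abstract does not state how the reported figures were aggregated.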

A Machine Learning Practice on NAS Dataset: Influence of Socioeconomic Factors on Student Performance

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1652.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 3272-3275

Keyword(s):

Machine Learning ◽

Student Performance ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Gradient Boosting ◽

Support Vector ◽

Nearest Neighbours ◽

Proactive Measures ◽

New Student

India’s population is enormous and diverse, which makes its education system very complex. Furthermore, students grow up in different environments for a variety of reasons. Over the years, several changes have been suggested and implemented by various stakeholders to improve the quality of education in schools. This paper presents a novel method to predict the performance of a new student by analysing historical student records: we explore the NAS dataset using cutting-edge machine learning algorithms to predict the grades of a new student so that proactive measures can be taken to help them succeed. A similar approach could be applied to an employee dataset to predict employee performance. Several supervised machine learning algorithms for classification were applied to the NAS dataset. Support Vector Machines and K-Nearest Neighbours did not produce results in reasonable time for the given dataset; the Gradient Boosting Classifier consistently outperformed all the other algorithms.
