Chinese Comma Disambiguation in Math Word Problems Using SMOTE and Random Forests

Natural language understanding technologies play an essential role in automatically solving math word problems. In machine understanding of Chinese math word problems, comma disambiguation, a binary learning problem with class imbalance, is a valuable instrument for transforming the problem statement into a structured representation. To resolve this problem, we applied the synthetic minority oversampling technique (SMOTE) and random forests to comma classification after jointly optimizing their hyperparameters. We propose a strict measure to evaluate the performance of deployed comma classification models on comma disambiguation in math word problems. To verify the effectiveness of random forest classifiers with SMOTE on comma disambiguation, we conducted two-stage experiments on two datasets with a collection of evaluation measures. Experimental results showed that random forest classifiers were significantly superior to baseline methods in Chinese comma disambiguation. SMOTE with hyperparameters optimized for the class distribution of each dataset is preferable to SMOTE with its default values. For practitioners, we suggest that the hyperparameters of a classification model be optimized again after the parameter settings of SMOTE have been changed.
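The oversampling step named in this abstract can be illustrated with a minimal sketch. It shows only the core SMOTE idea (a synthetic minority sample is an interpolation between a minority sample and one of its k nearest minority neighbours). The function name `smote`, the toy data, and the values of `k` and `seed` are assumptions made for illustration; this is not the authors' implementation, and a maintained library would normally be used in practice.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 4 minority samples, to be balanced against 10 majority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority_size = 10
new_points = smote(minority, n_new=majority_size - len(minority), k=3)
balanced_minority = minority + new_points
```

Here `k` is the kind of SMOTE hyperparameter the abstract refers to: changing it changes the training distribution the classifier sees, which is one reason to re-tune the classifier afterwards.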

Imbalanced Learning Based on Logistic Discrimination

Computational Intelligence and Neuroscience ◽

10.1155/2016/5423204 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Huaping Guo ◽

Weimei Zhi ◽

Hongbing Liu ◽

Mingliang Xu

Keyword(s):

Statistical Model ◽

Cost Function ◽

State Of The Art ◽

Class Imbalance ◽

Imbalanced Learning ◽

Learning Problem ◽

Logistic Discrimination ◽

Positive Class ◽

Negative Class ◽

Novel Method

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry. The problem concerns the performance of learning algorithms in the presence of data with severe class distribution skews. In this paper, we apply the well-known statistical model logistic discrimination to this problem and propose a novel method to improve its performance. To fully account for the class imbalance, we design a new cost function that takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing the proposed cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed method performs significantly better on recall, g-mean, f-measure, AUC, and accuracy.
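The evaluation measures listed in this abstract can be computed from confusion-matrix counts, as in the sketch below. It uses the standard definitions of recall, precision, g-mean, and f-measure; it does not reproduce the paper's cost function, which the abstract does not specify. The function name and the counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Standard measures from confusion counts (positive class = minority)."""
    recall = tp / (tp + fn) if tp + fn else 0.0       # accuracy on the positive class
    specificity = tn / (tn + fp) if tn + fp else 0.0  # accuracy on the negative class
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    g_mean = math.sqrt(recall * specificity)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Hypothetical data set: 10 positives and 90 negatives.
model = imbalance_metrics(tp=6, fp=9, tn=81, fn=4)
# A trivial classifier that always predicts the negative class.
trivial = imbalance_metrics(tp=0, fp=0, tn=90, fn=10)
```

On these made-up counts the trivial classifier has the higher accuracy but a g-mean and f-measure of zero, which illustrates why accuracy alone is a weak guide under class imbalance.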

Classification Models Using Decision Tree, Random Forest, and Moving Average Analysis

New Frontiers in Nanochemistry ◽

10.1201/9780429022951-6 ◽

2020 ◽

pp. 91-115

Author(s):

Rohit Dutt ◽

Harish Dureja ◽

A. K. Madan

Keyword(s):

Random Forest ◽

Decision Tree ◽

Moving Average ◽

Classification Models ◽

Average Analysis

Species-specific audio detection: A comparison of three template-based classification algorithms using random forests

10.7287/peerj.preprints.2713 ◽

2017 ◽

Author(s):

Carlos J Corrada Bravo ◽

Rafael Álvarez Berríos ◽

T. Mitchell Aide

Keyword(s):

Random Forest ◽

Random Forests ◽

Random Forest Classifier ◽

Classification Algorithms ◽

Statistical Features ◽

Web Based ◽

Average Accuracy ◽

Species Specific ◽

Web Based System

We developed a web-based, cloud-hosted system that allows users to archive, listen to, visualize, and annotate recordings. The system also provides tools to convert these annotations into datasets that can be used to train a computer to detect the presence or absence of a species. The algorithm used by the system was selected after comparing the accuracy and efficiency of three variants of a template-based classification. The algorithm computes a similarity vector by comparing a template of a species call with time increments across the spectrogram. Statistical features are extracted from this vector and used as input for a Random Forest classifier that predicts the presence or absence of the species in the recording. The fastest algorithm variant had the highest average accuracy and specificity; therefore, it was implemented in the ARBIMON web-based system.
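The similarity-vector step described above can be sketched as follows. The abstract does not say which similarity measure or which statistical features the three variants use, so the Pearson correlation and the four summary statistics here are assumptions, as are the function names and the toy spectrogram.

```python
import math
import statistics


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def similarity_vector(spectrogram, template):
    """Slide the template along the time axis; one similarity score per offset.

    Both arguments are lists of rows (frequency bins), each row a list over time.
    """
    n_time = len(spectrogram[0])
    width = len(template[0])
    flat_template = [v for row in template for v in row]
    scores = []
    for start in range(n_time - width + 1):
        window = [v for row in spectrogram for v in row[start:start + width]]
        scores.append(correlation(flat_template, window))
    return scores


def summary_features(scores):
    """Assumed feature set that a downstream classifier could take as input."""
    return {
        "mean": statistics.fmean(scores),
        "max": max(scores),
        "min": min(scores),
        "std": statistics.pstdev(scores),
    }


# Toy 2-bin spectrogram containing the template pattern at time offset 2.
template = [[0, 1], [1, 0]]
spectrogram = [[0, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
scores = similarity_vector(spectrogram, template)
features = summary_features(scores)
```

The resulting feature dictionary stands in for the input row that, according to the abstract, is passed to a Random Forest classifier.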

Data Mining for Potential Customer Segmentation in the Marketing Bank Dataset

JUITA Jurnal Informatika ◽

10.30595/juita.v9i1.7983 ◽

2021 ◽

Vol 9 (1) ◽

pp. 25

Author(s):

Maulida Ayu Fitriani ◽

Dany Candra Febrianto

Keyword(s):

Data Mining ◽

Random Forest ◽

Direct Marketing ◽

Class Imbalance ◽

Processing Technique ◽

Customer Segmentation ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Mining Methods ◽

Bank Marketing

Direct marketing is an effort made by the Bank to increase sales of its products and services, but the Bank sometimes has to contact a customer or prospective customer more than once to ascertain whether they are willing to subscribe to a product or service. To overcome this ineffective process, several data mining methods are proposed. This study compares several data mining methods (Naïve Bayes, K-NN, Random Forest, SVM, J48, and AdaBoost J48), applying the SMOTE pre-processing technique before classification in order to eliminate the class imbalance problem in the Bank Marketing dataset. The SMOTE + Random Forest method produced the highest accuracy in this study, 92.61%.

For Honor, for Toxicity

Proceedings of the ACM on Human-Computer Interaction ◽

10.1145/3474680 ◽

2021 ◽

Vol 5 (CHI PLAY) ◽

pp. 1-29

Author(s):

Alessandro Canossa ◽

Dmitry Salimov ◽

Ahmad Azadvar ◽

Casper Harteveld ◽

Georgios Yannakakis

Keyword(s):

Machine Learning ◽

Random Forest ◽

Random Forests ◽

Initial Study ◽

Unfair Advantage ◽

Offensive Behavior ◽

Forest Models ◽

Random Forest Models ◽

Action Type ◽

Degree Of Severity

Is it possible to detect toxicity in games just by observing in-game behavior? If so, what are the behavioral factors that will help machine learning to discover the unknown relationship between gameplay and toxic behavior? In this initial study, we examine whether it is possible to predict toxicity in the MOBA game For Honor by observing in-game behavior for players who have been labeled as toxic (i.e., players who have been sanctioned by Ubisoft community managers). We test our hypothesis of detecting toxicity through gameplay with a dataset of almost 1,800 sanctioned players, comparing these sanctioned players with unsanctioned players. Sanctioned players are defined by their toxic action type (offensive behavior vs. unfair advantage) and degree of severity (warned vs. banned). Our findings, based on supervised learning with random forests, suggest that it is not only possible to behaviorally distinguish sanctioned from unsanctioned players based on selected features of gameplay; it is also possible to predict both the sanction severity (warned vs. banned) and the sanction type (offensive behavior vs. unfair advantage). In particular, all random forest models predict toxicity, its severity, and its type with an accuracy of at least 82%, on average, on unseen players. This research shows that observing in-game behavior can support the work of community managers in moderating and possibly containing the burden of toxic behavior.

A Hybrid Predictive Model Integrating C4.5 and Decision Table Classifiers for Medical Data Sets

Research Anthology on Decision Support Systems and Decision Management in Healthcare, Business, and Engineering ◽

10.4018/978-1-7998-9023-2.ch016 ◽

2021 ◽

pp. 348-366

Author(s):

Amit Kumar ◽

Bikash Kanti Sarkar

Keyword(s):

Class Imbalance ◽

Decision Table ◽

Data Sets ◽

Classification Models ◽

Medical Practitioners ◽

Treatment Processes ◽

Disease Specificity ◽

Medical Domain ◽

Reliable Classification ◽

Intelligent Models

This article describes how data mining has recently been widely used to extract meaningful patterns from medical domain data sets, and how these patterns are then applied to clinical diagnosis. Accurate, precise, and reliable classification models significantly help medical practitioners improve the diagnosis, prognosis, and treatment processes of individual diseases. Although numerous intelligent models have been proposed in this respect, they still have several drawbacks, such as disease specificity, class imbalance, conflicts, and inadequate handling of the dimensionality of patient data. The present study attempts to design a hybrid prediction model for medical domain data sets by combining the decision tree based classifier (mainly C4.5) and the decision table based classifier (DT). The experimental results support these claims.

NIMG-37. PREDICTING SEIZURE IN GLIOMA PATIENTS USING A RANDOM FOREST CLASSIFIER TRAINED ON SEX-SPECIFIC AND MIXED COHORTS

Neuro-Oncology ◽

10.1093/neuonc/noz175.707 ◽

2019 ◽

Vol 21 (Supplement_6) ◽

pp. vi169-vi169

Author(s):

Aditya Khurana ◽

Sandra Johnston ◽

Paula Whitmire ◽

Sara Ranjbar ◽

Akanksha Sharma ◽

...

Keyword(s):

Random Forest ◽

Mixed Model ◽

Cross Validation ◽

Class Imbalance ◽

Random Forest Classifier ◽

Co Morbidity ◽

Pre Treatment ◽

Male Model ◽

Sensitivity Specificity ◽

Testing Set

PURPOSE: Brain tumor related epilepsy (BTE) is a major co-morbidity in patients with glioma. It is difficult to determine whether the use of anti-epileptic drugs is necessary. We attempted to build a machine-learning model to predict the probability of seizure presentation (SP) with glioma. METHODS: We trained a random forest classifier using the following variables: volumetric data of pre-treatment MR images (T1Gd and T2-FLAIR sequences), patient demographics (age; sex), and measurements of tumor proliferation (log(ρ)), invasiveness (log(D)) and their relative ratio (log(ρ/D)). Our cohort consisted of 221 patients in total. Using an 80-20 ratio, we used 176 patients (76 SP, 100 nSP) for training and the remaining 45 patients (19 SP, 26 nSP) for testing. We also trained on male-only and female-only cohorts to evaluate any sex differences in prediction. For training, 108 males (53 SP, 55 nSP) were used and 28 for testing (14 SP, 14 nSP). We used 72 females (21 SP, 49 nSP) for training and 15 (7 SP, 8 nSP) for testing. We corrected for class imbalance in the female cohort before training. Using 10-fold cross-validation and a separate testing set, we measured performance by ROC curve (AUC), accuracy, sensitivity, and specificity of predictions (average of folds in cross-validation). RESULTS: The female model achieved the highest AUC (0.853), followed by the mixed model (0.726) and the male model (0.651). In the validation set, the accuracy/sensitivity/specificity of the three cohorts were as follows: mixed (0.726/0.696/0.750), female (0.853/0.830/0.875), and male (0.651/0.577/0.722). The performance on the testing set, in terms of accuracy/sensitivity/specificity, was: mixed (0.733/0.74/0.73), female (0.8/0.57/1), and male (0.714/0.64/0.79). CONCLUSION: We found a negative correlation between seizure probability and the size and invasiveness of tumors. Our model shows promising performance on testing set data. Further cohort studies and training are warranted.
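The testing-set figures above can be cross-checked with a short calculation, since accuracy is the class-size-weighted average of sensitivity and specificity. The sketch below uses only the counts and rates quoted in the abstract. The function name is hypothetical, and small differences from the reported accuracies are expected because the published rates are rounded.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Accuracy as the class-size-weighted average of sensitivity and specificity."""
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)


# Cohort: (sensitivity, specificity, SP count, nSP count, reported accuracy),
# all copied from the testing-set results in the abstract.
test_sets = {
    "mixed": (0.74, 0.73, 19, 26, 0.733),
    "female": (0.57, 1.0, 7, 8, 0.8),
    "male": (0.64, 0.79, 14, 14, 0.714),
}

recomputed = {}
for name, (sens, spec, n_sp, n_nsp, reported) in test_sets.items():
    recomputed[name] = accuracy_from_rates(sens, spec, n_sp, n_nsp)
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {reported}")

# The 80-20 split of the mixed cohort: 80% of 221 rounded down, remainder for testing.
n_total = 221
n_train = int(0.8 * n_total)
n_test = n_total - n_train
```

The recomputed accuracies land within rounding distance of the reported ones, and the split sizes match the 176 training and 45 testing patients stated in the abstract.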

Classification study of solvation free energies of organic molecules using machine learning techniques

RSC Advances ◽

10.1039/c4ra07961b ◽

2014 ◽

Vol 4 (106) ◽

pp. 61624-61630 ◽

Cited By ~ 8

Author(s):

N. S. Hari Narayana Moorthy ◽

Silvia A. Martins ◽

Sergio F. Sousa ◽

Maria J. Ramos ◽

Pedro A. Fernandes

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Organic Molecules ◽

Machine Learning Techniques ◽

Support Vector ◽

Classification Models ◽

Free Energies ◽

Learning Techniques ◽

Solvation Free Energies

Classification models to predict the solvation free energies of organic molecules were developed using decision tree, random forest and support vector machine approaches and with MACCS fingerprints, MOE and PaDEL descriptors.

Components of Random Forests

Combinatorics Probability Computing ◽

10.1017/s0963548300000067 ◽

1992 ◽

Vol 1 (1) ◽

pp. 35-52 ◽

Cited By ~ 7

Author(s):

Tomasz Łuczak ◽

Boris Pittel

Keyword(s):

Phase Transition ◽

Random Forest ◽

Asymptotic Behaviour ◽

Random Graph ◽

Random Forests ◽

Limit Distribution ◽

The Family ◽

Supercritical Phase

A forest $\mathcal{F}(n, M)$ chosen uniformly from the family of all labelled unrooted forests with $n$ vertices and $M$ edges is studied. We show that, like the Erdős-Rényi random graph $G(n, M)$, the random forest exhibits three modes of asymptotic behaviour: subcritical, nearcritical and supercritical, with the phase transition at the point $M = n/2$. For each of the phases, we determine the limit distribution of the size of the $k$-th largest component of $\mathcal{F}(n, M)$. The similarity to the random graph is far from being complete. For instance, in the supercritical phase, the giant tree in $\mathcal{F}(n, M)$ grows roughly two times slower than the largest component of $G(n, M)$, and the second largest tree in $\mathcal{F}(n, M)$ is of the order $n^{2/3}$ for every $M = n/2 + s$, provided that $s^3 n^{-2} \to \infty$ and $s = o(n)$, while its counterpart in $G(n, M)$ is of the order $n^2 s^{-2} \log(s^3 n^{-2}) \ll n^{2/3}$.
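The quantity studied here, the size of the k-th largest component of a forest, can be computed for a given edge list with a union-find structure, as sketched below. The sketch does not sample a forest uniformly at random, which is what the paper analyses; the hand-picked edge list and the function name are hypothetical. It relies only on the fact that a forest with n vertices and M edges has exactly n - M components.

```python
def component_sizes(n, edges):
    """Return component sizes of a forest on vertices 0..n-1, largest first."""
    parent = list(range(n))
    size = [1] * n

    def find(v):
        # Follow parent links to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            raise ValueError(f"edge ({u}, {v}) closes a cycle: not a forest")
        if size[ru] < size[rv]:
            ru, rv = rv, ru  # attach the smaller tree under the larger one
        parent[rv] = ru
        size[ru] += size[rv]

    return sorted((size[v] for v in range(n) if find(v) == v), reverse=True)


# Hand-picked forest with n = 10 vertices and M = 5 edges, i.e. M = n/2.
n = 10
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (6, 7)]
sizes = component_sizes(n, edges)
k = 2
kth_largest = sizes[k - 1]
```

With a sampler for uniform random forests in place of the fixed edge list, the same function could be used to observe how the largest and second largest trees behave as M moves past n/2.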

Expertise in French health forums

Health Informatics Journal ◽

10.1177/1460458216682356 ◽

2016 ◽

Vol 25 (1) ◽

pp. 17-26 ◽

Cited By ~ 1

Author(s):

Amine Abdaoui ◽

Jérôme Azé ◽

Sandra Bringay ◽

Natalia Grabar ◽

Pascal Poncelet

Keyword(s):

Medical Students ◽

Random Forest ◽

Text Categorization ◽

Random Forest Classifier ◽

Classification Models ◽

High Quality ◽

Categorization Task ◽

Medical Role

A growing number of health websites hire medical experts (physicians, medical students, experienced volunteers, etc.) and explicitly indicate their medical role to signal that they provide high-quality answers. However, medical experts may participate in forum discussions even when their role is not officially indicated. Detecting posts written by medical experts facilitates quick access to posts that are more likely to be correct and informative. The main objective of this work is to learn classification models that can be used to detect posts written by medical experts in any health forum discussion. Two French health forums have been used to discover the best features and methods for this text categorization task. The results confirm that models learned on appropriate websites may be used efficiently on other websites (an F1-measure of more than 98% was obtained using a Random Forest classifier). A study of misclassified posts highlights the participation of medical experts in forum discussions even if their role is not explicitly indicated.
