Chinese Comma Disambiguation in Math Word Problems Using SMOTE and Random Forests

AI ◽  
2021 ◽  
Vol 2 (4) ◽  
pp. 738-755
Author(s):  
Jingxiu Huang ◽  
Qingtang Liu ◽  
Yunxiang Zheng ◽  
Linjing Wu

Natural language understanding technologies play an essential role in automatically solving math word problems. When a machine processes Chinese math word problems, comma disambiguation, which is a class-imbalanced binary learning problem, is a valuable instrument for transforming the problem statement into a structured representation. To address this problem, we applied the synthetic minority oversampling technique (SMOTE) and random forests to comma classification after jointly optimizing their hyperparameters. We propose a strict measure to evaluate the performance of deployed comma classification models on comma disambiguation in math word problems. To verify the effectiveness of random forest classifiers with SMOTE on comma disambiguation, we conducted two-stage experiments on two datasets with a collection of evaluation measures. Experimental results showed that random forest classifiers were significantly superior to baseline methods in Chinese comma disambiguation. SMOTE with hyperparameter settings optimized for the class distribution of each dataset is preferable to SMOTE with its default values. For practitioners, we suggest that the hyperparameters of a classification model be optimized again after the parameter settings of SMOTE have been changed.
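The oversampling step named in this abstract can be illustrated with a minimal pure-Python sketch of SMOTE-style interpolation: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the parameters `k` and `n_new`, and the toy data are illustrative assumptions; this is not the authors' implementation, and the abstract does not say which SMOTE settings were tuned.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; needs at least two minority points.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class (assumed data, not from the paper)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Changing `k` or `n_new` changes the training distribution the classifier sees, which is one plausible reading of why the abstract recommends re-optimizing the classifier's hyperparameters after the SMOTE settings change.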

2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Huaping Guo ◽  
Weimei Zhi ◽  
Hongbing Liu ◽  
Mingliang Xu

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry. The problem concerns the performance of learning algorithms in the presence of data with severely skewed class distributions. In this paper, we apply the well-known statistical model logistic discrimination to this problem and propose a novel method to improve its performance. To fully account for the class imbalance, we design a new cost function that takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing the proposed cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed method performs significantly better on the measures of recall, g-mean, F-measure, AUC, and accuracy.
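The evaluation measures listed here follow from the four confusion-matrix counts. The sketch below uses the standard definitions of recall, precision, g-mean, and F-measure; it does not reproduce the paper's cost function, which the abstract does not specify. The function name `imbalance_metrics` and the counts are hypothetical.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Standard measures for imbalanced binary classification (illustrative helper)."""
    recall = tp / (tp + fn)        # accuracy on the positive class
    specificity = tn / (tn + fp)   # accuracy on the negative class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(recall * specificity)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "recall": recall,
        "precision": precision,
        "g_mean": g_mean,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Assumed test set with 10 positives and 990 negatives, scored by a classifier
# that predicts "negative" for every example.
trivial = imbalance_metrics(tp=0, fn=10, fp=0, tn=990)
# Accuracy is 0.99 while g-mean and F-measure are both 0.
```

The trivial classifier shows why accuracy alone is misleading under severe skew, and why measures that depend on positive-class accuracy and precision are reported alongside it.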


2017 ◽  
Author(s):  
Carlos J Corrada Bravo ◽  
Rafael Álvarez Berríos ◽  
T. Mitchell Aide

We developed a web-based, cloud-hosted system that allows users to archive, listen to, visualize, and annotate recordings. The system also provides tools to convert these annotations into datasets that can be used to train a computer to detect the presence or absence of a species. The algorithm used by the system was selected after comparing the accuracy and efficiency of three variants of a template-based classification. The algorithm computes a similarity vector by comparing a template of a species call with time increments across the spectrogram. Statistical features are extracted from this vector and used as input for a Random Forest classifier that predicts presence or absence of the species in the recording. The fastest algorithm variant had the highest average accuracy and specificity; therefore, it was implemented in the ARBIMON web-based system.
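The template-matching step can be sketched as follows: slide the template across the spectrogram one time frame at a time, score each position, and summarize the resulting vector. The abstract does not state which similarity measure or which statistical features the system uses, so Pearson correlation and the features below are assumptions, as are all function names and the toy data.

```python
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def similarity_vector(spectrogram, template):
    """Both arguments are lists of time frames; each frame is a list of frequency-bin values."""
    width = len(template)
    flat_template = [v for frame in template for v in frame]
    sims = []
    for start in range(len(spectrogram) - width + 1):
        window = [v for frame in spectrogram[start:start + width] for v in frame]
        sims.append(pearson(flat_template, window))
    return sims

def summary_features(sims):
    """Assumed feature set that could feed a downstream classifier."""
    return {
        "max": max(sims),
        "mean": statistics.fmean(sims),
        "std": statistics.pstdev(sims),
        "argmax": sims.index(max(sims)),
    }

# Toy data: the template pattern occurs at time frames 2 and 3 of the spectrogram.
template = [[0.0, 1.0], [1.0, 0.0]]
spectrogram = [[0.1, 0.1], [0.2, 0.1], [0.0, 1.0], [1.0, 0.0], [0.1, 0.2]]
sims = similarity_vector(spectrogram, template)
features = summary_features(sims)
```

In this sketch the feature dictionary, rather than the raw spectrogram, would be the input row for a presence/absence classifier, which keeps the input size fixed regardless of recording length.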


2021 ◽  
Vol 9 (1) ◽  
pp. 25
Author(s):  
Maulida Ayu Fitriani ◽  
Dany Candra Febrianto

Direct marketing is an effort by a bank to increase sales of its products and services, but the bank sometimes has to contact a customer or prospective customer more than once to ascertain whether they are willing to subscribe to a product or service. To overcome this inefficient process, several data mining methods have been proposed. This study compares several data mining methods (Naïve Bayes, K-NN, Random Forest, SVM, J48, and AdaBoost J48), applying the SMOTE pre-processing technique before classification in order to eliminate the class imbalance problem in the Bank Marketing dataset. The SMOTE + Random Forest method produced the highest accuracy in this study, 92.61%.


2021 ◽  
Vol 5 (CHI PLAY) ◽  
pp. 1-29
Author(s):  
Alessandro Canossa ◽  
Dmitry Salimov ◽  
Ahmad Azadvar ◽  
Casper Harteveld ◽  
Georgios Yannakakis

Is it possible to detect toxicity in games just by observing in-game behavior? If so, what are the behavioral factors that will help machine learning to discover the unknown relationship between gameplay and toxic behavior? In this initial study, we examine whether it is possible to predict toxicity in the MOBA game For Honor by observing in-game behavior for players that have been labeled as toxic (i.e. players that have been sanctioned by Ubisoft community managers). We test our hypothesis of detecting toxicity through gameplay with a dataset of almost 1,800 sanctioned players, comparing these sanctioned players with unsanctioned players. Sanctioned players are defined by their toxic action type (offensive behavior vs. unfair advantage) and degree of severity (warned vs. banned). Our findings, based on supervised learning with random forests, suggest that it is not only possible to behaviorally distinguish sanctioned from unsanctioned players based on selected features of gameplay; it is also possible to predict both the sanction severity (warned vs. banned) and the sanction type (offensive behavior vs. unfair advantage). In particular, all random forest models predict toxicity, its severity, and type, with an accuracy of at least 82%, on average, on unseen players. This research shows that observing in-game behavior can support the work of community managers in moderating and possibly containing the burden of toxic behavior.


Author(s):  
Amit Kumar ◽  
Bikash Kanti Sarkar

This article describes how data mining has recently been widely used to extract meaningful patterns from medical domain data sets, patterns that are then applied to clinical diagnosis. Accurate, precise and reliable classification models significantly assist medical practitioners in improving the diagnosis, prognosis and treatment of individual diseases. Although numerous intelligent models have been proposed for this purpose, they still have several drawbacks, such as disease specificity, class imbalance, conflicting data, and inadequate handling of the dimensionality of patient data. The present study attempts to design a hybrid prediction model for medical domain data sets by combining a decision-tree-based classifier (mainly C4.5) and a decision-table-based classifier (DT). The experimental results support these claims.


2019 ◽  
Vol 21 (Supplement_6) ◽  
pp. vi169-vi169
Author(s):  
Aditya Khurana ◽  
Sandra Johnston ◽  
Paula Whitmire ◽  
Sara Ranjbar ◽  
Akanksha Sharma ◽  
...  

PURPOSE: Brain tumor related epilepsy (BTE) is a major co-morbidity in patients with glioma. It is difficult to determine whether the use of anti-epileptic drugs is necessary. We attempted to build a machine-learning model to predict the probability of seizure presentation (SP) with glioma. METHODS: We trained a random forest classifier using the following variables: volumetric data of pre-treatment MR images (T1Gd and T2-FLAIR sequences), patient demographics (age; sex), and measurements of tumor proliferation (log(ρ)), invasiveness (log(D)) and their relative ratio (log(ρ/D)). Our cohort consisted of 221 patients in total. Using an 80-20 ratio, we used 176 patients (76 SP, 100 nSP) for training and the remaining 45 patients (19 SP, 26 nSP) for testing. We also trained on male-only and female-only cohorts to evaluate any sex differences in prediction. For the male cohort, 108 patients (53 SP, 55 nSP) were used for training and 28 (14 SP, 14 nSP) for testing. For the female cohort, we used 72 patients (21 SP, 49 nSP) for training and 15 (7 SP, 8 nSP) for testing. We corrected for class imbalance in the female cohort before training. Using 10-fold cross-validation and a separate testing set, we measured performance by the area under the ROC curve (AUC), accuracy, sensitivity, and specificity of predictions (averaged over folds in cross-validation). RESULTS: The female model achieved the highest AUC (0.853), followed by the mixed model (0.726) and the male model (0.651). In the validation set, the accuracy/sensitivity/specificity of the three cohorts were as follows: mixed (0.726/0.696/0.750), female (0.853/0.830/0.875), and male (0.651/0.577/0.722). The performance on the testing set, in terms of accuracy/sensitivity/specificity, was: mixed (0.733/0.74/0.73), female (0.8/0.57/1), and male (0.714/0.64/0.79). CONCLUSION: We found a negative correlation between seizure probability and the size and invasiveness of tumors. Our model shows promising performance on testing set data. Further cohort studies and training are warranted.
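The reported test-set figures can be related back to patient counts with a short worked check. The class sizes (7 SP, 8 nSP in the female test set) come from the abstract; the split into correct and incorrect predictions below is an inference from the rounded rates, not a number the abstract reports, and the function name `rates` is hypothetical.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of SP patients predicted as SP
    specificity = tn / (tn + fp)   # fraction of nSP patients predicted as nSP
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Female test set: 7 SP and 8 nSP patients (from the abstract).
# ASSUMPTION: 4 of 7 SP and 8 of 8 nSP were classified correctly; these counts
# are inferred because they reproduce the reported 0.8/0.57/1.
acc, sens, spec = rates(tp=4, fn=3, tn=8, fp=0)
print(round(acc, 2), round(sens, 2), round(spec, 2))  # 0.8 0.57 1.0
```

With only 7 SP patients in this test set, one additional correct or incorrect prediction moves sensitivity by 1/7, roughly 0.14, which is worth keeping in mind when comparing the cohorts.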


RSC Advances ◽  
2014 ◽  
Vol 4 (106) ◽  
pp. 61624-61630 ◽  
Author(s):  
N. S. Hari Narayana Moorthy ◽  
Silvia A. Martins ◽  
Sergio F. Sousa ◽  
Maria J. Ramos ◽  
Pedro A. Fernandes

Classification models to predict the solvation free energies of organic molecules were developed using decision tree, random forest and support vector machine approaches and with MACCS fingerprints, MOE and PaDEL descriptors.


1992 ◽  
Vol 1 (1) ◽  
pp. 35-52 ◽  
Author(s):  
Tomasz Łuczak ◽  
Boris Pittel

A forest $\mathcal{F}(n, M)$ chosen uniformly from the family of all labelled unrooted forests with $n$ vertices and $M$ edges is studied. We show that, like the Erdős-Rényi random graph $G(n, M)$, the random forest exhibits three modes of asymptotic behaviour: subcritical, nearcritical and supercritical, with the phase transition at the point $M = n/2$. For each of the phases, we determine the limit distribution of the size of the $k$-th largest component of $\mathcal{F}(n, M)$. The similarity to the random graph is far from being complete. For instance, in the supercritical phase, the giant tree in $\mathcal{F}(n, M)$ grows roughly two times slower than the largest component of $G(n, M)$ and the second largest tree in $\mathcal{F}(n, M)$ is of the order $n^{2/3}$ for every $M = n/2 + s$, provided that $s^3 n^{-2} \to \infty$ and $s = o(n)$, while its counterpart in $G(n, M)$ is of the order $n^2 s^{-2} \log(s^3 n^{-2}) \ll n^{2/3}$.


2016 ◽  
Vol 25 (1) ◽  
pp. 17-26 ◽  
Author(s):  
Amine Abdaoui ◽  
Jérôme Azé ◽  
Sandra Bringay ◽  
Natalia Grabar ◽  
Pascal Poncelet

More and more health websites hire medical experts (physicians, medical students, experienced volunteers, etc.) and indicate explicitly their medical role in order to notify that they provide high-quality answers. However, medical experts may participate in forum discussions even when their role is not officially indicated. Detecting posts written by medical experts facilitates the quick access to posts that have more chances of being correct and informative. The main objective of this work is to learn classification models that can be used to detect posts written by medical experts in any health forum discussions. Two French health forums have been used to discover the best features and methods for this text categorization task. The obtained results confirm that models learned on appropriate websites may be used efficiently on other websites (more than 98% of F1-measure has been obtained using a Random Forest classifier). A study of misclassified posts highlights the participation of medical experts in forum discussions even if their role is not explicitly indicated.

