Binary Classification Algorithms for the Detection of Sparse Word Forms in New Indo-Aryan Languages

Author(s):  
Rafał Jaworski ◽  
Krzysztof Jassem ◽  
Krzysztof Stroński
Stats ◽  
2020 ◽  
Vol 3 (4) ◽  
pp. 427-443
Author(s):  
Gildas Tagny-Ngompé ◽  
Stéphane Mussard ◽  
Guillaume Zambrano ◽  
Sébastien Harispe ◽  
Jacky Montmain

This paper presents and compares several text classification models that can be used to extract the outcome of a judgment from justice decisions, i.e., legal documents summarizing the different rulings made by a judge. Such models can be used to gather important statistics about cases, e.g., success rates based on specific characteristics of cases’ parties or jurisdiction, and are therefore important for the development of judicial prediction, not to mention the study of law enforcement in general. We propose in particular the generalized Gini-PLS, which better considers the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. Modeling the studied task as supervised binary classification, we also introduce the LOGIT-Gini-PLS, suited to the explanation of a binary target variable. In addition, various technical aspects of the evaluated text classification approaches, which consist of combinations of judgment representations and classification algorithms, are studied using an annotated corpus of French justice decisions.


2005 ◽  
Vol 04 (02) ◽  
pp. 83-94
Author(s):  
Dursun Delen ◽  
Marilyn G. Kletke ◽  
Jin-Hwa Kim

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms are not scalable enough to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600,000 records for a seven-class classification problem) resulted in more accurate domain knowledge as compared to other prediction methods including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.


2020 ◽  
Vol 13 (1) ◽  
pp. 149-158
Author(s):  
A.K. Kovalev ◽  
Y.M. Kuznetsova ◽  
M.Y. Penkina ◽  
M.A. Stankevich ◽  
N.V. Chudova

Using a tool for automatic text analysis and machine learning methods developed at the Federal Research Center ‘Computer Science and Control’ of the Russian Academy of Sciences, the first results are obtained in the task of identifying text parameters specific to people with certain psychological characteristics. The tool for corpus linguistic and statistical research, based on relational-situational analysis, psycholinguistic indicators and dictionaries covering the vocabulary of emotional and rational assessment, allowed us to obtain values for 177 textual attributes of essays written by 486 subjects. Data on the degree of expression of the subjects’ characterological and personality traits were obtained with a number of psychological questionnaires. When processing the data, binary classification algorithms were used: the support vector machine (SVM) and the random forest method. The results allow us to draw conclusions about the prospects of using certain textual parameters in population psychodiagnostics and about the adequacy of the applied classification algorithms.
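The two classifiers named here are standard; a hedged sketch of the setup (a feature table of the stated shape, 486 essays by 177 textual attributes, with SVM and random forest evaluated by cross-validation) using scikit-learn and random stand-in data in place of the actual essay features and questionnaire labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(486, 177))              # stand-in: 177 attributes per essay
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # stand-in psychological label

# Compare the two binary classifiers under 5-fold cross-validation.
svm_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rf_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
```

With 177 features and fewer than 500 subjects, cross-validation (rather than a single train/test split) is the natural way to get stable accuracy estimates.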


Kybernetes ◽  
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sara Tavassoli ◽  
Hamidreza Koosha

Purpose
Customer churn prediction is one of the most well-known approaches to managing and improving customer retention. Machine learning techniques, especially classification algorithms, are very popular tools for predicting churners. In this paper, three ensemble classifiers based on bagging and boosting are proposed for customer churn prediction.

Design/methodology/approach
The first classifier, called boosted bagging, uses boosting on each bagging sample: before aggregating the final results of the bagging algorithm, the authors try to improve the prediction by applying a boosting algorithm to each bootstrap sample. The second proposed ensemble classifier, called bagged bagging, combines bagging with itself; in other words, the authors apply bagging to each sample of the bagging algorithm. Finally, the third approach uses bagging of neural networks trained with a genetic algorithm.

Findings
To examine the performance of all proposed ensemble classifiers, they are applied to two datasets. Numerical simulations illustrate that the proposed hybrid approaches outperform simple bagging and boosting as well as the base classifiers. In particular, bagged bagging provides high accuracy and precision.

Originality/value
Three novel ensemble classifiers based on bagging and boosting are proposed for customer churn prediction. The proposed approaches can be applied not only to customer churn prediction but also to any other binary classification problem.


Author(s):  
Anupama Jawale ◽  
Ganesh Magar

Human activity recognition is a rapidly growing area in healthcare systems. Applications include detecting falls, ambiguous activities, dangerous behavior, etc. It has become an important requirement for elderly patients and patients with neurological disorders. The devices involved, accelerometers and gyroscopes, generate large amounts of data. The accuracy of classification algorithms on these data depends heavily on the extraction and selection of data features. This research study extracts time-domain features based on statistical functions as well as rotational features around the three axes. Gyroscope data features are also used to enhance the accuracy achievable with accelerometer data. Three popular classification techniques are used to classify the accelerometer dataset into activity categories, treated as a binary classification (run = 1 / walk = 0). The results show that SVM and LDA, when used with rotation and gyroscope data, give the highest accuracy of 92.0%, whereas FDA achieves 84%.
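Time-domain feature extraction of the kind described (statistical functions per axis plus rotation-style angles) can be sketched as follows. The exact feature set and the angle formulation are assumptions, not the authors' specification; the window below is random stand-in data rather than real accelerometer readings:

```python
import numpy as np

def time_domain_features(window: np.ndarray) -> np.ndarray:
    """Statistical features for one window of tri-axial accelerometer data.

    `window` has shape (n_samples, 3) for the x, y, z axes.
    """
    magnitude = np.linalg.norm(window, axis=1)
    # Tilt-style rotation angles (one common formulation, assumed here).
    pitch = np.arctan2(window[:, 1], window[:, 2])
    roll = np.arctan2(window[:, 0], window[:, 2])
    return np.concatenate([
        window.mean(axis=0), window.std(axis=0),            # per-axis statistics
        [magnitude.mean(), magnitude.std(),                 # magnitude statistics
         pitch.mean(), roll.mean()],                        # rotational features
    ])

window = np.random.default_rng(0).normal(size=(128, 3))     # stand-in sensor window
features = time_domain_features(window)
```

Each window of raw samples collapses to one fixed-length feature vector, which is what the SVM, LDA and FDA classifiers then consume.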


2019 ◽  
Vol 13 (2) ◽  
pp. 47-66
Author(s):  
Martin Boldt ◽  
Kaavya Rekanar

In the present article, the authors investigate to what extent supervised binary classification can be used to distinguish between legitimate and rogue privacy policies posted on web pages. Fifteen classification algorithms are evaluated using a data set that consists of 100 privacy policies from legitimate websites (belonging to companies that top the Fortune Global 500 list) as well as 67 policies from rogue websites. A manual analysis of all policy content was performed, revealing clear statistical differences in terms of both length and adherence to seven general privacy principles. Privacy policies from legitimate companies show 98% adherence to the seven privacy principles, significantly higher than the 45% associated with rogue companies. Out of the 15 evaluated classification algorithms, Naïve Bayes Multinomial is the most suitable candidate for the problem at hand. Its models show the best performance, with an AUC measure of 0.90 (0.08), outperforming most of the other candidates in the statistical tests used.
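A multinomial Naïve Bayes text classifier of the kind found best here is a standard bag-of-words pipeline; a minimal sketch with scikit-learn, using invented toy policy snippets rather than the actual 167-policy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy stand-ins: legitimate (1) vs. rogue (0) privacy-policy snippets.
policies = [
    "we collect only the data needed and you may opt out at any time",
    "your data is shared with trusted partners under strict safeguards",
    "by using this site you grant us unlimited rights to all your data",
    "we may sell your information to any third party without notice",
]
labels = [1, 1, 0, 0]

# Bag-of-words term counts feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(policies, labels)
pred = model.predict(["we never sell your data and you may opt out"])
```

The multinomial event model works directly on word counts, which is why it pairs naturally with `CountVectorizer` and tends to do well on short policy texts.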


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
James M. Holt ◽  
◽  
Brandon Wilk ◽  
Camille L. Birch ◽  
Donna M. Brown ◽  
...  

Abstract
Background
When applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient’s phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.

Methods
We tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network.

Results
We treated the classifiers as variant prioritization systems and compared them to four variant prioritization algorithms and two single-measure controls. We showed that the trained classifiers outperformed all other tested methods, with the best classifiers ranking 72% of all reported variants and 94% of reported pathogenic variants in the top 20.

Conclusions
We demonstrated how freely available binary classification algorithms can be used to prioritize variants even in the presence of real-world variability. Furthermore, these classifiers outperformed all other tested methods, suggesting that they may be well suited for working with real rare disease patient datasets.
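Turning a binary classifier into a prioritization system, as described here, amounts to ranking candidates by the predicted probability of the positive class. A hedged sketch of that pattern with scikit-learn; the random features stand in for the variant annotations and phenotype information, and the classifier choice is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # stand-in variant + phenotype annotations
y = (X[:, 0] > 1.5).astype(int)        # stand-in "clinically reported" label

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Prioritization: score each candidate variant by its predicted probability
# of being reported, then rank candidates best-first.
candidates = rng.normal(size=(50, 10))
scores = clf.predict_proba(candidates)[:, 1]
ranking = np.argsort(scores)[::-1]     # indices of candidates, best first
top20 = ranking[:20]
```

The "top 20" metric in the Results section then corresponds to asking whether the truly reported variant lands within `top20` of this ranking.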

