Binary Classification Algorithms for the Detection of Sparse Word Forms in New Indo-Aryan Languages

Author(s):  
Rafał Jaworski ◽  
Krzysztof Jassem ◽  
Krzysztof Stroński
Stats ◽  
2020 ◽  
Vol 3 (4) ◽  
pp. 427-443
Author(s):  
Gildas Tagny-Ngompé ◽  
Stéphane Mussard ◽  
Guillaume Zambrano ◽  
Sébastien Harispe ◽  
Jacky Montmain

This paper presents and compares several text classification models that can be used to extract the outcome of a judgment from justice decisions, i.e., legal documents summarizing the different rulings made by a judge. Such models can be used to gather important statistics about cases, e.g., success rates based on specific characteristics of cases’ parties or jurisdiction, and are therefore important for the development of judicial prediction, not to mention the study of law enforcement in general. We propose in particular the generalized Gini-PLS, which better considers the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. Modeling the studied task as supervised binary classification, we also introduce the LOGIT-Gini-PLS, suited to the explanation of a binary target variable. In addition, various technical aspects of the evaluated text classification approaches, which consist of combinations of judgment representations and classification algorithms, are studied using an annotated corpus of French justice decisions.


2005 ◽  
Vol 04 (02) ◽  
pp. 83-94
Author(s):  
Dursun Delen ◽  
Marilyn G. Kletke ◽  
Jin-Hwa Kim

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms are not scalable enough to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600,000 records for a seven-class classification problem) resulted in more accurate domain knowledge as compared to other prediction methods including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.


2020 ◽  
Vol 13 (1) ◽  
pp. 149-158
Author(s):  
A.K. Kovalev ◽  
Y.M. Kuznetsova ◽  
M.Y. Penkina ◽  
M.A. Stankevich ◽  
N.V. Chudova

Using a tool for automatic text analysis and machine learning methods developed at the Federal Research Center ‘Computer Science and Control’ of the Russian Academy of Sciences, the first results are obtained in the task of identifying text parameters specific to people with certain psychological characteristics. The tool for corpus linguistic and statistical research, based on relational-situational analysis, psycholinguistic indicators and dictionaries covering the vocabulary of emotional and rational assessment, allowed us to obtain values for 177 textual attributes of essays written by 486 subjects. Data on the degree of expression of the subjects’ characterological and personality traits were obtained with a number of psychological questionnaires. When processing the data, binary classification algorithms were used: the support vector machine (SVM) and the random forest method. The results allow us to draw conclusions about the prospects of using certain textual parameters in population psychodiagnostics and about the adequacy of the applied classification algorithms.
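The two classifiers named here are standard; a hedged sketch of the setup (a feature table of the stated shape, 486 essays by 177 textual attributes, with SVM and random forest evaluated by cross-validation) using scikit-learn and random stand-in data in place of the actual essay features and questionnaire labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(486, 177))              # stand-in: 177 attributes per essay
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # stand-in psychological label

# Compare the two binary classifiers under 5-fold cross-validation.
svm_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rf_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
```

With 177 features and fewer than 500 subjects, cross-validation (rather than a single train/test split) is the natural way to get stable accuracy estimates.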


Kybernetes ◽  
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sara Tavassoli ◽  
Hamidreza Koosha

Purpose
Customer churn prediction is one of the most well-known approaches to managing and improving customer retention. Machine learning techniques, especially classification algorithms, are very popular tools for predicting churners. In this paper, three ensemble classifiers based on bagging and boosting are proposed for customer churn prediction.

Design/methodology/approach
The first classifier, called boosted bagging, uses boosting on each bagging sample: before aggregating the final results of the bagging algorithm, the authors try to improve the prediction by applying a boosting algorithm to each bootstrap sample. The second proposed ensemble classifier, called bagged bagging, combines bagging with itself; in other words, the authors apply bagging to each sample of the bagging algorithm. Finally, the third approach uses bagging of neural networks trained with a genetic algorithm.

Findings
To examine the performance of all proposed ensemble classifiers, they are applied to two datasets. Numerical simulations illustrate that the proposed hybrid approaches outperform simple bagging and boosting as well as the base classifiers. In particular, bagged bagging provides high accuracy and precision.

Originality/value
Three novel ensemble classifiers based on bagging and boosting are proposed for customer churn prediction. The proposed approaches can be applied not only to customer churn prediction but also to any other binary classification problem.


Author(s):  
Anupama Jawale ◽  
Ganesh Magar

Human activity recognition is a rapidly growing area in healthcare systems. Applications include detecting falls, ambiguous activities, dangerous behavior, etc. It has become an important requirement for elderly patients and patients with neurological disorders. The devices involved, accelerometers and gyroscopes, generate large amounts of data. The accuracy of classification algorithms on these data depends heavily on the extraction and selection of data features. This research study extracts time-domain features based on statistical functions as well as rotational features around the three axes. Gyroscope data features are also used to enhance the accuracy achievable with accelerometer data. Three popular classification techniques are used to classify the accelerometer dataset into activity categories, treated as a binary classification (run = 1 / walk = 0). The results show that SVM and LDA, when used with rotation and gyroscope data, give the highest accuracy of 92.0%, whereas FDA achieves 84%.
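Time-domain feature extraction of the kind described (statistical functions per axis plus rotation-style angles) can be sketched as follows. The exact feature set and the angle formulation are assumptions, not the authors' specification; the window below is random stand-in data rather than real accelerometer readings:

```python
import numpy as np

def time_domain_features(window: np.ndarray) -> np.ndarray:
    """Statistical features for one window of tri-axial accelerometer data.

    `window` has shape (n_samples, 3) for the x, y, z axes.
    """
    magnitude = np.linalg.norm(window, axis=1)
    # Tilt-style rotation angles (one common formulation, assumed here).
    pitch = np.arctan2(window[:, 1], window[:, 2])
    roll = np.arctan2(window[:, 0], window[:, 2])
    return np.concatenate([
        window.mean(axis=0), window.std(axis=0),            # per-axis statistics
        [magnitude.mean(), magnitude.std(),                 # magnitude statistics
         pitch.mean(), roll.mean()],                        # rotational features
    ])

window = np.random.default_rng(0).normal(size=(128, 3))     # stand-in sensor window
features = time_domain_features(window)
```

Each window of raw samples collapses to one fixed-length feature vector, which is what the SVM, LDA and FDA classifiers then consume.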


2019 ◽  
Vol 13 (2) ◽  
pp. 47-66
Author(s):  
Martin Boldt ◽  
Kaavya Rekanar

In the present article, the authors investigate to what extent supervised binary classification can be used to distinguish between legitimate and rogue privacy policies posted on web pages. Fifteen classification algorithms are evaluated using a data set that consists of 100 privacy policies from legitimate websites (belonging to companies that top the Fortune Global 500 list) as well as 67 policies from rogue websites. A manual analysis of all policy content was performed, revealing clear statistical differences in terms of both length and adherence to seven general privacy principles. Privacy policies from legitimate companies show 98% adherence to the seven privacy principles, significantly higher than the 45% associated with rogue companies. Out of the 15 evaluated classification algorithms, Naïve Bayes Multinomial is the most suitable candidate for the problem at hand. Its models show the best performance, with an AUC measure of 0.90 (0.08), outperforming most of the other candidates in the statistical tests used.
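A multinomial Naïve Bayes text classifier of the kind found best here is a standard bag-of-words pipeline; a minimal sketch with scikit-learn, using invented toy policy snippets rather than the actual 167-policy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy stand-ins: legitimate (1) vs. rogue (0) privacy-policy snippets.
policies = [
    "we collect only the data needed and you may opt out at any time",
    "your data is shared with trusted partners under strict safeguards",
    "by using this site you grant us unlimited rights to all your data",
    "we may sell your information to any third party without notice",
]
labels = [1, 1, 0, 0]

# Bag-of-words term counts feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(policies, labels)
pred = model.predict(["we never sell your data and you may opt out"])
```

The multinomial event model works directly on word counts, which is why it pairs naturally with `CountVectorizer` and tends to do well on short policy texts.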


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
James M. Holt ◽  
◽  
Brandon Wilk ◽  
Camille L. Birch ◽  
Donna M. Brown ◽  
...  

Abstract
Background
When applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient’s phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.

Methods
We tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network.

Results
We treated the classifiers as variant prioritization systems and compared them to four variant prioritization algorithms and two single-measure controls. We showed that the trained classifiers outperformed all other tested methods, with the best classifiers ranking 72% of all reported variants and 94% of reported pathogenic variants in the top 20.

Conclusions
We demonstrated how freely available binary classification algorithms can be used to prioritize variants even in the presence of real-world variability. Furthermore, these classifiers outperformed all other tested methods, suggesting that they may be well suited for working with real rare disease patient datasets.
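Turning a binary classifier into a prioritization system, as described here, amounts to ranking candidates by the predicted probability of the positive class. A hedged sketch of that pattern with scikit-learn; the random features stand in for the variant annotations and phenotype information, and the classifier choice is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # stand-in variant + phenotype annotations
y = (X[:, 0] > 1.5).astype(int)        # stand-in "clinically reported" label

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Prioritization: score each candidate variant by its predicted probability
# of being reported, then rank candidates best-first.
candidates = rng.normal(size=(50, 10))
scores = clf.predict_proba(candidates)[:, 1]
ranking = np.argsort(scores)[::-1]     # indices of candidates, best first
top20 = ranking[:20]
```

The "top 20" metric in the Results section then corresponds to asking whether the truly reported variant lands within `top20` of this ranking.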

