Performance of Various Machine Learning Classifiers on Small Datasets with Varying Dimensionalities: A Study

2016 ◽  
Vol 1 (1) ◽  
pp. 30-35 ◽  
Author(s):  
Sahil Sharma ◽  
Vinod Sharma

Classification is an important supervised learning technique used by many applications. One important factor on which a classifier's performance depends is the size of the dataset on which it is trained. In this manuscript the authors analyzed five classification techniques (decision trees, KNN, SVM, linear discriminant, and an ensemble method) in terms of AUC and predictive accuracy when trained on small datasets with different dimensionalities. The study used a dataset with 24 features and 400 instances (samples). The results showed that, in general, the ensemble method (boosted trees) performed better than the others, but its performance degraded slightly with reduced dimensionality.
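
For illustration, here is a minimal scikit-learn sketch of this kind of comparison. The 400-sample, 24-feature synthetic data mirror the setup described above, but the model settings are illustrative defaults, not the authors' exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Small dataset roughly mirroring the study's setup: 400 samples, 24 features.
X, y = make_classification(n_samples=400, n_features=24, n_informative=10,
                           random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True, random_state=0),
    "Linear discriminant": LinearDiscriminantAnalysis(),
    "Boosted trees": GradientBoostingClassifier(random_state=0),
}

# Cross-validated AUC for each of the five classifier families.
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```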

Computers ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 8 ◽  
Author(s):  
Marcus Lim ◽  
Azween Abdullah ◽  
NZ Jhanjhi ◽  
Mahadevan Supramaniam

Criminal network activities, which are usually secret and stealthy, present certain difficulties in conducting criminal network analysis (CNA) because of the lack of complete datasets. The collection of criminal activities data in these networks tends to be incomplete and inconsistent, which is reflected structurally in the criminal network in the form of missing nodes (actors) and links (relationships). Criminal networks are commonly analyzed using social network analysis (SNA) models. Most machine learning techniques that rely on the metrics of SNA models in the development of hidden or missing link prediction models utilize supervised learning. However, supervised learning usually requires the availability of a large dataset to train the link prediction model in order to achieve an optimum performance level. Therefore, this research is conducted to explore the application of deep reinforcement learning (DRL) in developing a criminal network hidden links prediction model from the reconstruction of a corrupted criminal network dataset. The experiment conducted on the model indicates that the dataset generated by the DRL model through self-play or self-simulation can be used to train the link prediction model. The DRL link prediction model exhibits a better performance than a conventional supervised machine learning technique, such as the gradient boosting machine (GBM) trained with a relatively smaller domain dataset.
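
As a point of reference for the supervised baseline mentioned above (a GBM trained on SNA-derived metrics), here is a hedged sketch on a synthetic graph. The node-pair features, the graph, and the sampling scheme are illustrative stand-ins, not the criminal-network data or the DRL model from the paper.

```python
import random

import networkx as nx
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

random.seed(0)
g = nx.barabasi_albert_graph(200, 3, seed=0)   # stand-in network

# Positive examples: existing links. Negatives: an equal number of non-links.
pos = list(g.edges())
neg = random.sample(list(nx.non_edges(g)), len(pos))

def pair_features(graph, u, v):
    """SNA-style features: common neighbours, Jaccard, preferential attachment."""
    cn = len(list(nx.common_neighbors(graph, u, v)))
    jc = next(nx.jaccard_coefficient(graph, [(u, v)]))[2]
    pa = graph.degree(u) * graph.degree(v)
    return [cn, jc, pa]

# For brevity, features are computed on the full graph; a careful setup would
# hide the held-out links before computing them.
X = [pair_features(g, u, v) for u, v in pos + neg]
y = [1] * len(pos) + [0] * len(neg)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("Link prediction AUC:", roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))
```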


2021 ◽  
Author(s):  
Jessica Röhner ◽  
Philipp Thoss ◽  
Astrid Schütz

Research has shown that even experts cannot detect faking above chance, but recent studies have suggested that machine learning may help in this endeavor. However, faking differs between faking conditions, previous efforts have not taken these differences into account, and faking indices have yet to be integrated into such approaches. We reanalyzed seven data sets (N = 1,039) with various faking conditions (high and low scores, different constructs, naïve and informed faking, faking with and without practice, different measures [self-reports vs. implicit association tests; IATs]). We investigated the extent to which and how machine learning classifiers could detect faking under these conditions and compared different input data (response patterns, scores, faking indices) and different classifiers (logistic regression, random forest, XGBoost). We also explored the features that classifiers used for detection. Our results show that machine learning has the potential to detect faking, but detection success varies between conditions from chance levels to 100%. There were differences in detection (e.g., detecting low-score faking was better than detecting high-score faking). For self-reports, response patterns and scores were comparable with regard to faking detection, whereas for IATs, faking indices and response patterns were superior to scores. Logistic regression and random forest worked about equally well and outperformed XGBoost. In most cases, classifiers used more than one feature (faking occurred over different pathways), and the features varied in their relevance. Our research supports the assumption of different faking processes and explains why detecting faking is a complex endeavor.
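
A rough sketch of this kind of classifier comparison, using synthetic placeholder data rather than the reanalyzed faking datasets; the hyperparameters are library defaults, not the authors' settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Placeholder "response pattern" data, not the reanalyzed faking datasets.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=0),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=0),
}

# Cross-validated detection accuracy for each classifier.
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {acc:.3f}")

# Which features does the forest rely on? Rank the importances.
rf = RandomForestClassifier(random_state=0).fit(X, y)
print("Most relevant features:", np.argsort(rf.feature_importances_)[::-1][:5])
```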


Author(s):  
Sumit Kumar ◽  
Sanlap Acharya

The prediction of stock prices has always been a very challenging problem for investors. Using machine learning techniques to predict stock prices is also one of the favourite topics for academics working in this domain. This chapter discusses five supervised learning techniques and two unsupervised learning techniques for stock price prediction and compares the performance of all the algorithms. Among the supervised learning techniques, the Long Short-Term Memory (LSTM) algorithm performed best, whereas among the unsupervised learning techniques, the Restricted Boltzmann Machine (RBM) performed best. RBM was found to perform even better than LSTM.
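
A minimal LSTM sketch for next-step price prediction, assuming TensorFlow/Keras; the windowing scheme, network size, and the random-walk price series are illustrative choices, not the chapter's configuration.

```python
import numpy as np
import tensorflow as tf

def make_windows(series, lookback=20):
    """Turn a 1-D price series into (lookback-step input, next value) pairs."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.array(X)[..., np.newaxis], np.array(y)

prices = np.cumsum(np.random.randn(500)) + 100.0   # placeholder price series
X, y = make_windows(prices)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1], 1)),
    tf.keras.layers.LSTM(32),      # single LSTM layer; size is arbitrary here
    tf.keras.layers.Dense(1),      # regression head: next-step price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print("Predicted next price:", float(model.predict(X[-1:], verbose=0)[0, 0]))
```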


Author(s):  
Bhekisipho Twala

The major objective of the paper is to investigate a new probabilistic supervised learning approach that incorporates "missingness" into the splitting criterion of a decision tree classifier at each attribute node, in terms of predictive accuracy for software development effort. The proposed approach is compared empirically with ten supervised learning methods (classifiers) that have mechanisms for dealing with missing values. Ten industrial datasets are utilized for this task. Overall, missing incorporated in attributes 3 is the top-performing strategy, followed by C4.5, missing incorporated in attributes, missing incorporated in attributes 2, linear discriminant analysis, and so on. Classification and regression trees and C4.5 performed well on data with high correlations among attributes, while k-nearest neighbour and support vector machines performed well on data with higher complexity (a limited number of instances). The worst-performing method is repeated incremental pruning to produce error reduction.
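
The paper's splitting criterion is not available in common libraries, but the general idea of handling missingness inside the learner (rather than imputing first) can be illustrated with scikit-learn's HistGradientBoostingClassifier, which routes NaN values during splitting. This is a rough analogue under that assumption, not the paper's classifier or data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[rng.random(X.shape) < 0.2] = np.nan    # knock out ~20% of the values

# Missingness handled inside the learner vs. mean imputation up front.
native = HistGradientBoostingClassifier(random_state=0)
imputed = make_pipeline(SimpleImputer(strategy="mean"),
                        HistGradientBoostingClassifier(random_state=0))

print("Native NaN handling:", cross_val_score(native, X, y, cv=5).mean().round(3))
print("Mean imputation:    ", cross_val_score(imputed, X, y, cv=5).mean().round(3))
```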


2019 ◽  
Vol 109 ◽  
pp. 65-70 ◽  
Author(s):  
Susan Athey ◽  
Mohsen Bayati ◽  
Guido Imbens ◽  
Zhaonan Qu

In many prediction problems researchers have found that combinations of prediction methods ("ensembles") perform better than individual methods. In this paper we apply these ideas to synthetic control type problems in panel data, where a number of conceptually quite different methods have been developed. We compare the predictive accuracy of three methods with an ensemble method and find that the latter dominates. These results show that ensemble methods are a practical and effective approach for the type of data configurations typically encountered in empirical work in economics, and that these methods deserve more attention.
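
A toy sketch of the ensemble idea: combine several forecasts with non-negative weights fitted on a validation window and compare hold-out error. The "methods" here are placeholder noisy forecasts, not the synthetic-control estimators compared in the paper.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 6, 120))                  # target outcome series
preds = np.stack([truth + rng.normal(0, s, 120)         # three noisy "methods"
                  for s in (0.1, 0.2, 0.3)], axis=1)

val, test = slice(0, 60), slice(60, 120)
w, _ = nnls(preds[val], truth[val])                     # non-negative weights
w = w / w.sum()                                         # normalize to sum to 1

rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
for i in range(preds.shape[1]):
    print(f"method {i}: hold-out RMSE = {rmse(preds[test, i], truth[test]):.3f}")
print(f"ensemble : hold-out RMSE = {rmse(preds[test] @ w, truth[test]):.3f}")
```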


Author(s):  
Maad M. Mijwil ◽  
Israa Ezzat Salem

Fraud detection in payments is a classification problem that aims to identify fraudulent transactions based on the information each transaction contains and on the premise that a fraudster's behaviour patterns differ significantly from those of the actual customer. In this context, the authors propose to implement machine learning classifiers (Naïve Bayes, C4.5 decision trees, and a Bagging ensemble learner) to predict the outcome of regular and fraudulent transactions. The performance of these classifiers is judged using the following measures: precision, recall, and precision-recall curve (PRC) area. The dataset, collected from the Kaggle platform, includes more than 297K credit-card transactions made between September 2013 and November 2017, of which 3293 are frauds. The PRC area of the machine learning classifiers is between 99.9% and 100%, which confirms that these classifiers are very good at identifying class 0 (regular transactions) in the dataset. The test results show that the best classifier is the C4.5 decision tree, which achieves the highest accuracy of 94.12% in predicting fraudulent transactions.
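
A hedged sketch of this evaluation with scikit-learn, using a heavily imbalanced synthetic dataset instead of the Kaggle credit-card data and a CART decision tree as a stand-in for C4.5 (which scikit-learn does not provide).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Heavily imbalanced placeholder data: roughly 1% "fraud".
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.99],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
}

# Precision, recall, and PR-curve area for each classifier.
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prc = average_precision_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: precision={precision_score(y_te, pred, zero_division=0):.3f} "
          f"recall={recall_score(y_te, pred):.3f} PRC area={prc:.3f}")
```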


Author(s):  
Hamza Turabieh ◽  
Ahmad S. Alghamdi

Wi-Fi technology is now found everywhere, both inside and outside buildings, and it enables indoor localization services (ILS). Determining indoor user location is a hard and complex problem. Several applications highlight the importance of indoor user localization, such as disaster management, health care zones, Internet of Things (IoT) applications, and public settlement planning. Measurements of Wi-Fi signal strength (i.e., the Received Signal Strength Indicator (RSSI)) can be used to determine indoor user location. In this paper, we propose a hybrid model that combines a wrapper feature selection algorithm with machine learning classifiers to determine indoor user location. We employed the Minimum Redundancy Maximum Relevance (mRMR) algorithm for feature selection, choosing the most active access points (APs) based on RSSI values. Six different machine learning classifiers were used in this work: Decision Tree (DT), Support Vector Machine (SVM), k-nearest neighbors (kNN), Linear Discriminant Analysis (LDA), Ensemble-Bagged Tree (EBaT), and Ensemble Boosted Tree (EBoT). We examined all classifiers on a public dataset obtained from the UCI repository. The obtained results show that EBoT outperforms all other classifiers in terms of accuracy.
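
A simplified sketch of this pipeline: a greedy mRMR-style selection (mutual information with the class label for relevance, penalized by the average mutual information with already-selected features for redundancy) followed by a boosted-tree classifier. The RSSI data are synthetic placeholders and the scoring rule is a simplification of full mRMR.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.model_selection import cross_val_score

# Placeholder "RSSI" data: 600 fingerprints, 20 access points (features).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)

def greedy_mrmr(X, y, k):
    """Greedily pick k features maximizing relevance minus average redundancy."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

aps = greedy_mrmr(X, y, k=8)                      # most informative "APs"
clf = GradientBoostingClassifier(random_state=0)  # boosted trees, like EBoT
acc = cross_val_score(clf, X[:, aps], y, cv=5).mean()
print("Selected APs:", aps, "mean accuracy:", round(acc, 3))
```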


2021 ◽  
Author(s):  
Dixita Mali ◽  
Kritika Kumawat ◽  
Gaurav Kumawat ◽  
Prasun Chakrabarti ◽  
Sandeep Poddar ◽  
...  

Abstract: Depression is a common mental health problem and a major cause of disability worldwide. The main purpose of this research was to determine how depression affects the life of an individual. It is a leading cause of morbidity and death. Over the last 50–60 years, a large number of studies have been published on various aspects of depression, including its impact. The main purpose of this research is to determine whether a person is suffering from depression or not. The depression dataset was taken from the Kaggle website. Supervised machine learning classifiers were applied to obtain the highest accuracy on the dataset: XGBoost Tree, Random Trees, Neural Network, SVM, Random Forest, C5.0, and Bayes Net. From the results, it is evident that the C5.0 classifier gives the highest accuracy, at 83.94%, and for each classifier the results were obtained without pre-processing.


Sensors ◽  
2018 ◽  
Vol 18 (9) ◽  
pp. 2845 ◽  
Author(s):  
Chi-Hsiang Huang ◽  
Chian Zeng ◽  
Yi-Chia Wang ◽  
Hsin-Yi Peng ◽  
Chia-Sheng Lin ◽  
...  

Lung cancer is the leading cause of cancer death around the world, and lung cancer screening remains challenging. This study aimed to develop a breath test for the detection of lung cancer using a chemical sensor array and a machine learning technique. We conducted a prospective study to enroll lung cancer cases and non-tumour controls between 2016 and 2018 and analysed alveolar air samples using carbon nanotube sensor arrays. A total of 117 cases and 199 controls were enrolled in the study, of which 72 subjects were excluded due to cancer at another site, benign lung tumours, metastatic lung cancer, carcinoma in situ, minimally invasive adenocarcinoma, prior chemotherapy, or other diseases. Subjects enrolled in 2016 and 2017 were used for model derivation and internal validation. The model was externally validated on subjects recruited in 2018. The diagnostic accuracy was assessed using the pathological reports as the reference standard. In the external validation, the areas under the receiver operating characteristic curve (AUCs) were 0.91 (95% CI = 0.79–1.00) for linear discriminant analysis and 0.90 (95% CI = 0.80–0.99) for the support vector machine technique. The combination of the sensor array technique and machine learning can detect lung cancer with high accuracy.
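
A sketch of the validation step: train LDA and an SVM on sensor-array-like features and report the test AUC with a bootstrap 95% confidence interval. The data are simulated, not the carbon-nanotube sensor readings used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Simulated "sensor array" readings: 16 channels per breath sample.
X, y = make_classification(n_samples=250, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def auc_with_ci(y_true, scores, n_boot=1000, seed=0):
    """Point AUC plus a bootstrap 95% confidence interval."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(set(y_true[idx])) < 2:          # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    return roc_auc_score(y_true, scores), np.percentile(aucs, [2.5, 97.5])

for name, model in {"LDA": LinearDiscriminantAnalysis(),
                    "SVM": SVC(probability=True, random_state=0)}.items():
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    auc, (lo, hi) = auc_with_ci(y_te, scores)
    print(f"{name}: AUC = {auc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```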


2021 ◽  
Author(s):  
Hayat Ali Shah

# Machine Learning Classifiers for Prediction of Pathway Module & Its Classes

We use the SMILES representation of query molecules to generate relevant fingerprints, which are then fed to the machine learning classifier ETC to produce binary labels for the corresponding pathway module and its classes. The details of the work are described in our paper.

A dataset of 6597 compounds was downloaded from KEGG: 4612 compounds are labelled as belonging (or not) to a pathway module in the metabolic pathway, and the remaining 1985 compounds belong to the module class prediction problem.

### Requirements
* Chemoinformatics tools
* Python
* scikit-learn
* RDKit
* Jupyter Notebook

### Usage
We provide two folders containing the classifier files, grid search for hyperparameter optimization, and the datasets (module, module classes).
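
A minimal sketch of this fingerprints-to-classifier pipeline, assuming that ETC refers to scikit-learn's extra trees classifier and using placeholder SMILES strings and labels rather than the KEGG compounds.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import ExtraTreesClassifier

# Placeholder molecules and labels (1 = belongs to a pathway module, 0 = not).
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CC(C)O", "O=C=O"]
labels = [1, 0, 1, 0, 1, 0]

def fingerprint(smi, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit-vector fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.array([fingerprint(s) for s in smiles])
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([fingerprint("CCCO")]))   # predict for a new query molecule
```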

