Diagnosis of vertebral column pathologies using concatenated resampling with machine learning algorithms

2021 ◽  
Vol 7 ◽  
pp. e547
Author(s):  
Aijaz Ahmad Reshi ◽  
Imran Ashraf ◽  
Furqan Rustam ◽  
Hina Fatima Shahzad ◽  
Arif Mehmood ◽  
...  

Medical diagnosis through the classification of biomedical attributes is one of the exponentially growing fields in bioinformatics. Although a large number of approaches have been presented in the past, the wide use and superior performance of machine learning (ML) methods in medical diagnosis call for significant attention to automatic diagnostic methods. This study proposes a novel approach called concatenated resampling (CR) to increase the efficacy of traditional ML algorithms. The performance is analyzed using four ML approaches, including tree-based ensemble methods and a linear machine learning approach, for the automatic diagnosis of intervertebral pathologies with increased accuracy. In addition, undersampling, oversampling, and the proposed CR technique have been applied to the imbalanced training dataset to analyze the impact of these techniques on the accuracy of each classification model. Extensive experiments have been conducted to compare the classification models using several metrics, including accuracy, precision, recall, and F1 score. A comparative analysis of the experimental results identifies the best-performing classifier together with the applied resampling technique. The results show that the extra trees classifier achieves an accuracy of 0.99 in combination with the proposed CR technique.
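The abstract does not spell out how CR combines the resamplers. One plausible reading, sketched below purely as an illustration (the function names and the 50/50 balance target are assumptions, not the authors' method), is to concatenate an undersampled majority set with an oversampled minority set into a single balanced training pool:

```python
import random

def oversample(minority, target_n, rng):
    # Randomly duplicate minority samples until the target size is reached.
    return minority + [rng.choice(minority) for _ in range(target_n - len(minority))]

def undersample(majority, target_n, rng):
    # Randomly drop majority samples down to the target size.
    return rng.sample(majority, target_n)

def concatenated_resampling(majority, minority, seed=0):
    """Concatenate an undersampled majority set and an oversampled
    minority set into one balanced training pool."""
    rng = random.Random(seed)
    mid = (len(majority) + len(minority)) // 2
    return undersample(majority, mid, rng) + oversample(minority, mid, rng)

balanced = concatenated_resampling([("maj", i) for i in range(90)],
                                   [("min", i) for i in range(10)])
print(len(balanced))  # 100 samples: half majority, half minority
```

Keeping the pool the same size as the original dataset (rather than matching the larger class) avoids the extreme duplication that pure oversampling causes on a 90/10 split.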

2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system—for example protein, DNA or RNA—the isoelectric point—commonly referred to as the pI—can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge—and thus the electrophoretic mobility—of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset and their resulting performance depends strongly on the quality of that data. In contrast to iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
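The titration-curve definition above lends itself to a short worked example: because net charge decreases monotonically with pH, the pI can be found by bisection over Henderson-Hasselbalch charge terms. The pKa values below are one common (EMBOSS-like) basis set, chosen only for illustration; as the benchmark stresses, predictions are sensitive to this choice:

```python
# Side-chain and terminal pKa values (EMBOSS-like set; exact values vary by scale).
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5, "Nterm": 8.6}            # basic groups
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "Cterm": 3.6}   # acidic groups

def net_charge(seq, ph):
    """Henderson-Hasselbalch net charge of a peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))
    for aa in seq:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    # Bisection: net charge is monotonically decreasing in pH,
    # so the zero crossing is the pI.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return round((lo + hi) / 2, 2)

print(isoelectric_point("ACDEFGHIK"))
```

This is the "iterative" style of calculation the benchmark compares against; the learning-based methods instead fit pKa-like parameters (or richer features) to experimental data.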


2020 ◽  
Vol 1500 ◽  
pp. 012131
Author(s):  
Firdaus ◽  
Andre Herviant Juliano ◽  
Naufal Rachmatullah ◽  
Sarifah Putri Rafflesia ◽  
Dinna Yunika Hardiyanti ◽  
...  

2021 ◽  
Vol 3 (4) ◽  
pp. 32-37
Author(s):  
J. Adassuriya ◽  
J. A. N. S. S. Jayasinghe ◽  
K. P. S. C. Jayaratne

Machine learning algorithms play an impressive role in modern technology and address automation problems in many fields, as these techniques can identify features with a sensitivity that humans and other programming techniques cannot match. In addition, the growing availability of data demands faster, more accurate, and more reliable automated methods for extracting, reformatting, preprocessing, and analyzing information in the world of science. Developing machine learning techniques to automate complex manual procedures is timely research in astrophysics, a field in which experts deal with large datasets every day. In this study, an automated classifier was built for six classes of variable stars with widely varying properties—Beta Cephei, Delta Scuti, Gamma Doradus, red giants, RR Lyrae, and RV Tauri—using features extracted from a training dataset of stellar light curves obtained from the Kepler mission. A Random Forest classification model was used as the machine learning model, with both periodic and non-periodic features extracted from the light curves as inputs. Our implementation achieved an accuracy of 86.5%, an average precision of 0.86, an average recall of 0.87, and an average F1 score of 0.86 on the testing dataset obtained from the Kepler mission.
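As a minimal sketch of the described setup (not the authors' actual feature set), the snippet below trains a scikit-learn Random Forest on toy periodic and non-periodic summary features computed from synthetic light curves; the two classes stand in for variable-star types with different period and amplitude regimes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def light_curve_features(period, amplitude, n=200):
    """Toy periodic and non-periodic features from a synthetic light curve."""
    t = np.linspace(0, 10, n)
    flux = amplitude * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.05, n)
    return [period, amplitude, flux.std(), flux.max() - flux.min(), np.median(flux)]

# Two toy classes with different period/amplitude regimes; real features
# would be extracted from Kepler light curves.
X = [light_curve_features(rng.uniform(0.5, 1.5), rng.uniform(0.1, 0.3)) for _ in range(100)]
y = [0] * 100
X += [light_curve_features(rng.uniform(3.0, 5.0), rng.uniform(0.8, 1.2)) for _ in range(100)]
y += [1] * 100

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```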


2019 ◽  
Vol 2019 (2) ◽  
pp. 47-65
Author(s):  
Balázs Pejó ◽  
Qiang Tang ◽  
Gergely Biczók

Abstract Machine learning algorithms have reached mainstream status and are widely deployed in many applications. The accuracy of such algorithms depends significantly on the size of the underlying training dataset; in reality, a small or medium-sized organization often does not have the necessary data to train a reasonably accurate model. For such organizations, a realistic solution is to train their machine learning models based on their joint dataset (the union of the individual ones). Unfortunately, privacy concerns prevent them from straightforwardly doing so. While a number of privacy-preserving solutions exist for collaborating organizations to securely aggregate the parameters in the process of training the models, we are not aware of any work that provides a rational framework for the participants to precisely balance the privacy loss and accuracy gain in their collaboration. In this paper, focusing on a two-player setting, we model the collaborative training process as a two-player game where each player aims to achieve higher accuracy while preserving the privacy of its own dataset. We introduce the notion of Price of Privacy, a novel approach for measuring the impact of privacy protection on accuracy in the proposed framework. Furthermore, we develop a game-theoretical model for different player types, and then either find or prove the existence of a Nash Equilibrium with regard to the strength of privacy protection for each player. Using recommendation systems as our main use case, we demonstrate how two players can make practical use of the proposed theoretical framework, including setting up the parameters and approximating the non-trivial Nash Equilibrium.
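A toy numerical sketch of the framework follows; the accuracy model, privacy-cost weight, and dataset sizes are invented for illustration (the paper's actual utility functions differ). It shows how two players can iterate best responses over a grid of privacy levels until the levels stop changing, approximating an equilibrium:

```python
def model_accuracy(n1, n2, p1, p2):
    # Toy accuracy model: privacy protection p in [0, 1] discounts each
    # player's data contribution; more data helps with diminishing returns.
    effective = (1 - p1) * n1 + (1 - p2) * n2
    return effective / (effective + 100)

def payoff(i, n1, n2, p1, p2, weight=0.2):
    # Utility = joint-model accuracy minus the privacy loss of one's own data.
    own_p = p1 if i == 1 else p2
    return model_accuracy(n1, n2, p1, p2) - weight * (1 - own_p)

LEVELS = [i / 20 for i in range(21)]  # privacy-level grid 0.0 .. 1.0

def best_response(i, n1, n2, other_p):
    if i == 1:
        return max(LEVELS, key=lambda p: payoff(1, n1, n2, p, other_p))
    return max(LEVELS, key=lambda p: payoff(2, n1, n2, other_p, p))

# Alternate best responses until the privacy levels reach a fixed point.
n1, n2, p1, p2 = 150, 60, 0.5, 0.5
for _ in range(50):
    new_p1 = best_response(1, n1, n2, p2)
    new_p2 = best_response(2, n1, n2, new_p1)
    if (new_p1, new_p2) == (p1, p2):
        break
    p1, p2 = new_p1, new_p2
print("approximate equilibrium privacy levels:", p1, p2)
```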


2021 ◽  
Vol 11 (7) ◽  
pp. 3130
Author(s):  
Janka Kabathova ◽  
Martin Drlik

Early and precise prediction of student dropout from available educational data is a widespread topic in the learning analytics research field. Despite the amount of research already carried out, progress is not significant and the problem persists at all levels of educational data. Even though various features have already been researched, it remains an open question which features are appropriate for different machine learning classifiers applied to the typically scarce educational data available at the e-learning course level. Therefore, the main goals of this research are to emphasize the importance of the data understanding and data gathering phases, stress the limitations of the available educational datasets, compare the performance of several machine learning classifiers, and show that even a limited set of features, available to teachers in an e-learning course, can predict student dropout with sufficient accuracy if the performance metrics are thoroughly considered. Data collected over four academic years were analyzed. The features selected in this study proved applicable to predicting course completers and non-completers. The prediction accuracy varied between 77% and 93% on unseen data from the next academic year. In addition to the frequently used performance metrics, the homogeneity of the machine learning classifiers was analyzed to counter the impact of the limited dataset size on the high performance-metric values obtained. The results showed that several machine learning algorithms can be successfully applied to a scarce educational dataset. At the same time, classification performance metrics should be thoroughly considered before deciding to deploy the best-performing classification model to predict potential dropout cases and design beneficial intervention mechanisms.
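A minimal sketch of the comparison methodology, using a synthetic stand-in for a scarce e-learning dataset and a few scikit-learn classifiers (the feature count, class balance, and classifier set are assumptions, not the study's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic stand-in for a scarce course-level dataset: a handful of
# teacher-visible features (e.g. logins, submissions, quiz score, forum posts).
X, y = make_classification(n_samples=120, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=7)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=7),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=7),
}
for name, clf in classifiers.items():
    # F1 is more informative than raw accuracy on an imbalanced dropout dataset.
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.2f} +/- {scores.std():.2f}")
```

Cross-validation on such a small sample also exposes the variance across folds, which is one way to see whether a high headline metric is an artifact of the limited dataset size.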


Electronics ◽  
2021 ◽  
Vol 10 (17) ◽  
pp. 2099
Author(s):  
Paweł Ziemba ◽  
Jarosław Becker ◽  
Aneta Becker ◽  
Aleksandra Radomska-Zalas ◽  
Mateusz Pawluk ◽  
...  

One of the important research problems in the context of financial institutions is the assessment of credit risk and the decision whether to grant or refuse a loan. Recently, machine learning-based methods have increasingly been employed to solve such problems. However, selecting an appropriate feature selection technique, sampling mechanism, and/or classifier for credit decision support is very challenging and can affect the quality of the loan recommendations. To address this challenging task, this article examines the effectiveness of various data science techniques for credit decision support. In particular, a processing pipeline was designed, consisting of methods for data resampling, feature discretization, feature selection, and binary classification. We suggest building decision models from pertinent methods for binary classification and feature selection, as well as data resampling and feature discretization. A feasibility analysis of the selected models was performed through rigorous experiments on real data describing clients' ability to repay loans. During the experiments, we analyzed the impact of feature selection on the results of binary classification, and the impact of data resampling with feature discretization on the results of feature selection and binary classification. The experimental evaluation showed that a correlation-based feature selection technique combined with a random forest classifier yields the best performance on the underlying problem.
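The described pipeline can be sketched with scikit-learn. The concrete methods below (quantile discretization, mutual-information feature selection, a random forest) are illustrative stand-ins, not the article's exact choices; a resampling step such as SMOTE would additionally require imbalanced-learn's pipeline so that resampling stays inside cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a credit dataset: 20 attributes, imbalanced classes
# (most applicants repay), mimicking the article's setting.
X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           weights=[0.8, 0.2], random_state=1)

pipeline = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")),
    ("select", SelectKBest(mutual_info_classif, k=8)),
    ("classify", RandomForestClassifier(n_estimators=200, random_state=1)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print(f"cross-validated F1: {scores.mean():.2f}")
```

Wrapping all steps in one `Pipeline` ensures discretization and feature selection are re-fit on each training fold, avoiding leakage into the evaluation folds.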


2019 ◽  
Vol 143 (8) ◽  
pp. 990-998 ◽  
Author(s):  
Min Yu ◽  
Lindsay A. L. Bazydlo ◽  
David E. Bruns ◽  
James H. Harrison

Context.— Turnaround time and productivity of clinical mass spectrometric (MS) testing are hampered by time-consuming manual review of the analytical quality of MS data before release of patient results. Objective.— To determine whether a classification model created by using standard machine learning algorithms can verify analytically acceptable MS results and thereby reduce manual review requirements. Design.— We obtained retrospective data from gas chromatography–MS analyses of 11-nor-9-carboxy-delta-9-tetrahydrocannabinol (THC-COOH) in 1267 urine samples. The data for each sample had been labeled previously as either analytically unacceptable or acceptable by manual review. The dataset was randomly split into training and test sets (848 and 419 samples, respectively), maintaining equal proportions of acceptable (90%) and unacceptable (10%) results in each set. We used stratified 10-fold cross-validation in assessing the abilities of 6 supervised machine learning algorithms to distinguish unacceptable from acceptable assay results in the training dataset. The classifier with the highest recall was used to build a final model, and its performance was evaluated against the test dataset. Results.— In comparison testing of the 6 classifiers, a model based on the Support Vector Machines algorithm yielded the highest recall and acceptable precision. After optimization, this model correctly identified all unacceptable results in the test dataset (100% recall) with a precision of 81%. Conclusions.— Automated data review identified all analytically unacceptable assays in the test dataset, while reducing the manual review requirement by about 87%. This automation strategy can focus manual review only on assays likely to be problematic, allowing improved throughput and turnaround time without reducing quality.
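A hedged sketch of the evaluation protocol on synthetic data with the article's 90/10 class balance (the features themselves are invented; the study's real inputs are analytical quality attributes of GC-MS assays):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import recall_score, precision_score

# Synthetic stand-in: ~90% acceptable (0) vs ~10% unacceptable (1) assays,
# mirroring the class proportions reported in the article.
X, y = make_classification(n_samples=1267, n_features=10, n_informative=6,
                           weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          stratify=y, random_state=3)

# Stratified 10-fold CV, scored on recall for the unacceptable class:
# a missed bad assay is costlier than an extra manual review.
svm = SVC(kernel="rbf", class_weight="balanced", random_state=3)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)
cv_recall = cross_val_score(svm, X_tr, y_tr, cv=cv, scoring="recall")
print(f"CV recall: {cv_recall.mean():.2f}")

svm.fit(X_tr, y_tr)
pred = svm.predict(X_te)
print(f"test recall: {recall_score(y_te, pred):.2f}, "
      f"precision: {precision_score(y_te, pred):.2f}")
```

Selecting on recall rather than accuracy reflects the article's deployment goal: only assays flagged as possibly unacceptable go to manual review, so false negatives must be driven toward zero even at some cost in precision.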


T-Comm ◽  
2021 ◽  
Vol 15 (9) ◽  
pp. 24-35
Author(s):  
Irina A. Krasnova

The paper analyzes the impact of the parameter settings of machine learning algorithms on the results of real-time traffic classification. The Random Forest and XGBoost algorithms are considered. A brief description of both methods, and of the metrics used to evaluate classification results, is given. Experimental studies are conducted on a database collected from a real network, separately for TCP and UDP flows. So that the results can be used in real time, a special feature matrix is built from the first 15 packets of each flow. The main Random Forest (RF) parameters to configure are the number of trees, the split criterion, the maximum number of features considered when constructing a split, the tree depth, and the minimum number of samples per node and per leaf. For XGBoost, the number of trees, the tree depth, the minimum number of samples per leaf, the fraction of features, and the fraction of samples used to build each tree are tuned. Increasing the number of trees raises accuracy up to a certain point, but, as shown in the article, it is important to make sure the model does not overfit. The remaining tree parameters are used to combat overfitting. On the dataset under study, eliminating overfitting increased classification accuracy for individual applications by 11-12% for Random Forest and by 12-19% for XGBoost. The results show that parameter tuning is a very important step in building a traffic classification model: it helps to combat overfitting and significantly increases the accuracy of the algorithm's predictions. In addition, it was shown that, when its parameters are properly configured, XGBoost, which is not very popular in traffic classification work, becomes a competitive algorithm and outperforms the widely used Random Forest.
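The tuning step can be sketched with a scikit-learn grid search over the Random Forest parameters named above (the grid values and the synthetic per-flow feature matrix are illustrative assumptions; the article's matrix is built from the first 15 packets of each flow):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the per-flow feature matrix: one row per flow,
# multiple application classes.
X, y = make_classification(n_samples=400, n_features=15, n_informative=8,
                           n_classes=3, random_state=5)

# The RF parameters named in the article: tree count, split criterion,
# features per split, depth, and minimum samples per leaf.
grid = {
    "n_estimators": [50, 150],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", None],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=5), grid, cv=3)
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.2f}")
```

Constraining depth and minimum leaf size is exactly the overfitting control the paper describes: the cross-validated score, not the training score, decides which configuration wins.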


Energies ◽  
2021 ◽  
Vol 14 (18) ◽  
pp. 5718
Author(s):  
Regelii Suassuna de Andrade Ferreira ◽  
Patrick Picher ◽  
Hassan Ezzaidi ◽  
Issouf Fofana

Frequency response analysis (FRA) is a powerful and widely used tool for condition assessment in power transformers. However, interpretation schemes are still challenging. Studies show that FRA data can be influenced by parameters other than winding deformation, including temperature. In this study, a machine-learning approach with temperature as an input attribute was used to objectively identify faults in FRA traces. To the best knowledge of the authors, this has not been reported in the literature. A single-phase transformer model was specifically designed and fabricated for use as a test object for the study. The model is unique in that it allows the non-destructive interchange of healthy and distorted winding sections and, hence, reproducible and repeatable FRA measurements. FRA measurements taken at temperatures ranging from −40 °C to 40 °C were used first to describe the impact of temperature on FRA traces and then to test the ability of the machine learning algorithms to discriminate between fault conditions and temperature variation. The results show that when temperature is not considered in the training dataset, the algorithm may misclassify healthy measurements, taken at different temperatures, as mechanical or electrical faults. However, once the influence of temperature was considered in the training set, the performance of the studied classifier was restored. The results indicate the feasibility of using the proposed approach to prevent misclassification based on temperature changes.
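A toy reproduction of the idea, with invented FRA summary features in which temperature shifts resonances slightly and a winding fault shifts them strongly (the feature model and magnitudes are assumptions, not the authors' measurements); appending temperature as an input attribute lets the classifier separate thermal drift from genuine deformation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)

def fra_features(temperature, fault):
    """Toy FRA summary features: temperature shifts resonant frequencies
    slightly, a winding fault shifts them much more strongly."""
    base = np.array([1.0, 2.0, 3.0])                    # nominal resonances
    shift = 0.002 * temperature + (0.3 if fault else 0.0)
    return np.concatenate([base + shift + rng.normal(0, 0.01, 3), [temperature]])

temps = rng.uniform(-40, 40, 300)
faults = rng.integers(0, 2, 300)
X = np.array([fra_features(t, f) for t, f in zip(temps, faults)])
y = faults

# Temperature is the last column of X, so the classifier can attribute
# small shifts to thermal drift instead of flagging them as faults.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=11, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=11).fit(X_tr, y_tr)
print(f"accuracy with temperature attribute: {clf.score(X_te, y_te):.2f}")
```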


Author(s):  
Y. Dileep Sean ◽  
D.D. Smith ◽  
V.S.P. Bitra ◽  
Vimala Bera ◽  
Sk. Nafeez Umar

Automated defect detection in fruits using computer vision and machine learning concepts has become a significant area of research. In this work, a working hardware prototype of a conveyor with a PC was designed, constructed, and implemented to analyze fruit quality. The prototype consists of low-cost microcontrollers, a USB camera, and a MATLAB user interface. The automated classification model accepts or rejects each fruit based on its quality, i.e., good (ripe or unripe) or bad. For the classification of fruit quality, machine learning algorithms such as Support Vector Machine, KNN, Random Forest classifier, Decision Tree classifier, and ANN are used. The dataset used in this work consists of the following fruit varieties: apple, orange, tomato, guava, lemon, and pomegranate. We trained, tested, and compared the performance of these five machine learning approaches and found that the ANN-based fruit detection performs best. The overall accuracy obtained by the ANN model on the dataset is 95.6%. In addition, the response time of the system is 50 seconds per fruit, which is low. Therefore, it will be very suitable and useful for small-scale industries and farmers to grow their business.
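A sketch of the five-classifier comparison on toy colour features: the mean R, G, B values below stand in for the image features extracted from the camera frames, and the data and resulting accuracies are synthetic, not the article's:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def sample(mean, n):
    # Toy mean-colour features (R, G, B) for n fruits of one quality class.
    return rng.normal(mean, 12, size=(n, 3))

X = np.vstack([sample([200, 60, 40], 80),   # ripe: red-dominated
               sample([90, 180, 60], 80),   # unripe: green-dominated
               sample([80, 60, 50], 80)])   # bad: dark / bruised
y = np.repeat([0, 1, 2], 80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2, stratify=y)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=2),
    "Decision Tree": DecisionTreeClassifier(random_state=2),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=2),
}
results = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```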

