Identifying novel transcript biomarkers for hepatocellular carcinoma (HCC) using RNA-Seq datasets and machine learning

Abstract. Background: Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death worldwide, owing to limitations in its prognosis. Current prognostic approaches include radiological examination and detection of serum biomarkers; however, both have limited efficiency and are ineffective for early prognosis. Given these limitations, we propose using RNA-Seq data to evaluate putative higher-accuracy biomarkers at the transcript level that could help in early prognosis. Methods: To identify such potential transcript biomarkers, RNA-Seq data for healthy liver and various HCC cell models were subjected to five machine learning algorithms: random forest, K-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. Several metrics, namely sensitivity, specificity, MCC, informedness, and AUC-ROC (except for support vector machine), were evaluated. The algorithms that produced the highest values for all metrics were chosen to extract the top features, which were then subjected to recursive feature elimination. Recursive feature elimination yielded the smallest set of features able to differentiate between the healthy and HCC cell models. Results: The metrics show that the known protein biomarkers for HCC perform worse than the complete transcriptomics data. Among the machine learning algorithms, random forest and support vector machine performed best. Applying recursive feature elimination to the top features of random forest and support vector machine selected three transcripts that had an accuracy of 0.97 and a kappa of 0.93. Of the three transcripts, two were protein coding (PARP2-202 and SPON2-203) and one was non-coding (CYREN-211). Lastly, we demonstrated that these three selected transcripts outperformed three randomly chosen transcripts (15,000 combinations); hence they were not chance findings and could be interesting candidates for new HCC biomarker development. Conclusion: Using RNA-Seq data combined with machine learning approaches can aid in finding novel transcript biomarkers. The three biomarkers identified (PARP2-202, SPON2-203, and CYREN-211) presented the highest accuracy among all transcripts in differentiating the healthy and HCC cell models. The machine learning pipeline developed in this study can be used on any RNA-Seq dataset to find novel transcript biomarkers. Code: www.github.com/rajinder4489/ML_biomarkers
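
The comparison against 15,000 random three-transcript combinations can be read as an empirical chance baseline. The following is a minimal sketch of that idea, not the authors' pipeline (their code is in the repository linked above). The function name `empirical_p_value` and the simulated random scores are hypothetical; only the selected accuracy of 0.97 and the count of 15,000 come from the abstract.

```python
import random

def empirical_p_value(selected_score, random_scores):
    """Fraction of random-combination scores at least as good as the selected one.

    Uses the common (k + 1) / (n + 1) correction so the estimate is never exactly 0.
    """
    at_least_as_good = sum(1 for s in random_scores if s >= selected_score)
    return (at_least_as_good + 1) / (len(random_scores) + 1)

# Hypothetical stand-in for the accuracies of 15,000 random three-transcript sets.
# In a real analysis each value would come from training and scoring a classifier.
rng = random.Random(0)
random_scores = [rng.uniform(0.40, 0.90) for _ in range(15_000)]

p = empirical_p_value(0.97, random_scores)
print(f"empirical p-value: {p:.6f}")
```

A small value here only says that few random sets match the selected set on this data; it does not replace validation on independent samples.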

Execution Assessment of Machine Learning Algorithms for Spam Profile Detection on Instagram

International Journal of Advanced Trends in Computer Science and Engineering, 2021, Vol 10 (3), pp. 1889-1894. DOI: 10.30534/ijatcse/2021/561032021

Keyword(s): Machine Learning, Support Vector Machine, Random Forest, Nearest Neighbor, Learning Algorithms, Machine Learning Algorithms, Support Vector, Learning Tools, Learning Models, K Nearest Neighbor

With every passing second, the social network community is growing rapidly; because of that, attackers have shown keen interest in these kinds of platforms and want to distribute mischievous content on them. With a focus on introducing new sets of characteristics and features for counteractive measures, a great deal of studies has researched the possibility of lessening malicious activities on social media networks. This research aimed to highlight features for identifying spammers on Instagram, and additional features were presented to improve the performance of different machine learning algorithms. The performance of different machine learning algorithms, namely Multilayer Perceptron (MLP), Random Forest (RF), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), was evaluated on the machine learning tools RapidMiner and WEKA. The results of this research show that Random Forest (RF) outperformed all other selected machine learning algorithms on both tools. Overall, Random Forest (RF) provided the best results on RapidMiner. These results are useful for researchers who are keen to build machine learning models to detect spamming activities on social network communities.

Detecting Real-Time Fall of Elderly People Using Machine Learning

International Journal for Research in Applied Science and Engineering Technology, 2021, Vol 9 (12), pp. 1913-1918. DOI: 10.22214/ijraset.2021.39635

Author(s): Prathima P

Keyword(s): Machine Learning, Support Vector Machine, Random Forest, Elderly People, Fall Detection, Machine Learning Algorithms, Supervised Machine Learning, Support Vector, False Alarms, Severe Injuries

Abstract: Falls are a significant national health issue for elderly people, generally resulting in severe injuries when the person lies on the floor for an extended period without any aid after a fall. Thus, elders need to be cared for very attentively. A supervised machine learning based fall detection approach using an accelerometer and a gyroscope is devised. The system detects falls by grouping different actions as fall or non-fall events, and the caretaker is alerted as soon as the person falls. The public dataset SisFall, with an efficient class of features, is used to identify falls. The Random Forest (RF) and Support Vector Machine (SVM) machine learning algorithms are employed to detect falls with fewer false alarms. The SVM algorithm obtains the highest accuracy, 99.23%, outperforming the RF algorithm. Keywords: Fall detection, Machine learning, Supervised classification, SisFall, Activities of daily living, Wearable sensors, Random Forest, Support Vector Machine

Prediction of non-muscle invasive bladder cancer recurrence using machine learning of quantitative nuclear features

Modern Pathology, 2021. DOI: 10.1038/s41379-021-00955-y

Author(s): Naoto Tokuyama, Akira Saito, Ryu Muraoka, Shuya Matsubara, Takeshi Hashimoto, ...

Keyword(s): Machine Learning, Support Vector Machine, Bladder Cancer, Random Forest, Machine Learning Algorithms, Invasive Bladder Cancer, Support Vector, Muscle Invasive Bladder Cancer, Nuclear Atypia, Muscle Invasive

Abstract: Non-muscle invasive bladder cancer (NMIBC) generally has a good prognosis; however, recurrence after transurethral resection (TUR), the standard primary treatment, is a major problem. Clinical management after TUR has been based on risk classification using clinicopathological factors, but these classifications are not complete. In this study, we attempted to predict early recurrence of NMIBC based on machine learning of quantitative morphological features. In general, structural, cellular, and nuclear atypia are evaluated to determine cancer atypia. However, since it is difficult to accurately quantify structural atypia from TUR specimens, we used only nuclear atypia and analyzed it using feature extraction followed by classification with the Support Vector Machine and Random Forest machine learning algorithms. For the analysis, data from 125 patients diagnosed with NMIBC were used; data from 95 patients were randomly selected for the training set, and data from 30 patients for the test set. The results showed that the support vector machine-based model predicted recurrence within 2 years after TUR with a probability of 90% and the random forest-based model with a probability of 86.7%. In the future, the system can be used to objectively predict NMIBC recurrence after TUR.

Interpolation of Instantaneous Air Temperature Using Geographical and MODIS Derived Variables with Machine Learning Techniques

2019. DOI: 10.20944/preprints201906.0008.v1

Author(s): Marcos Ruiz-Álvarez, Francisco Alonso-Sarría, Francisco Gomariz-Castillo

Keyword(s): Machine Learning, Support Vector Machine, Random Forest, Linear Regression, Air Temperature, Satellite Data, Multivariate Linear Regression, Machine Learning Algorithms, Machine Learning Techniques, Support Vector

Several methods have been tried to estimate air temperature using satellite imagery. In this paper, the results of two machine learning algorithms, Support Vector Machine and Random Forest, are compared with Multivariate Linear Regression, TVX and ordinary kriging. Several geographic, remote sensing and time variables are used as predictors. The validation is carried out using four different statistics on a daily basis, allowing the use of ANOVA to compare the results. The main conclusion is that Random Forest with residual kriging produces the best results (R$^2$=0.612 $\pm$ 0.019, NSE=0.578 $\pm$ 0.025, RMSE=1.068 $\pm$ 0.027, PBIAS=-0.172 $\pm$ 0.046), whereas TVX produces the least accurate results. The environmental conditions in the study area are not really suited to TVX; moreover, this method only takes into account satellite data. On the other hand, regression methods (Support Vector Machine, Random Forest and Multivariate Linear Regression) use several parameters that are easily calculated from a Digital Elevation Model, adding very little difficulty to the use of satellite data alone. The most important variables in the Random Forest model were satellite temperature, potential irradiation and cdayt, a cosine transformation of the Julian day.
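
Three of the validation statistics quoted above can be computed from paired observed and predicted temperatures. The sketch below uses the usual textbook definitions; the abstract does not say which convention the authors used, so the sign and percentage scaling of PBIAS here are assumptions, and the sample values are invented for illustration.

```python
import math

def rmse(obs, sim):
    """Root mean squared error between observed and simulated values."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread

def pbias(obs, sim):
    """Percent bias, assuming the convention 100 * sum(obs - sim) / sum(obs)."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

# Invented air temperatures in degrees Celsius, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

print(rmse(observed, predicted), nse(observed, predicted), pbias(observed, predicted))
```

An NSE of 1 means a perfect fit, while values at or below 0 mean the model does no better than predicting the observed mean.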

Machine learning in the diagnosis of Myocardial Infarction with Non-Obstructive Coronary Arteries

European Heart Journal, 2021, Vol 42 (Supplement_1). DOI: 10.1093/eurheartj/ehab724.3067

Author(s): M J Espinosa Pascual, P Vaquero Martinez, V Vaquero Martinez, J Lopez Pais, B Izquierdo Coronel, ...

Keyword(s): Machine Learning, Myocardial Infarction, Support Vector Machine, Logistic Regression, Random Forest, Obstructive Coronary Artery Disease, Learning Algorithms, Machine Learning Algorithms, Classification Model, Support Vector

Abstract. Introduction: Out of all patients admitted with Myocardial Infarction, 10 to 15% have Myocardial Infarction with Non-Obstructive Coronary Arteries (MINOCA). Classification algorithms based on deep learning substantially exceed traditional diagnostic algorithms. Therefore, numerous machine learning models have been proposed as useful tools for the detection of various pathologies, but to date no study has proposed a diagnostic algorithm for MINOCA. Purpose: The aim of this study was to estimate the diagnostic accuracy of several automated learning algorithms (Support-Vector Machine [SVM], Random Forest [RF] and Logistic Regression [LR]) in discriminating people suffering from MINOCA from those with Myocardial Infarction with Obstructive Coronary Artery Disease (MICAD) at the time of admission and before performing a coronary angiography, whether invasive or not. Methods: A Diagnostic Test Evaluation study was carried out applying the proposed algorithms to a database of 553 consecutive patients admitted to our hospital with Myocardial Infarction. According to the definitions of the 2016 ESC Position Paper on MINOCA, patients were classified into two groups: MICAD and MINOCA. Out of the total 553 patients, 214 were discarded due to the lack of complete data. The set of machine learning algorithms was trained on 244 patients (training sample: 75%) and tested on 80 patients (test sample: 25%). A total of 64 variables were available for each patient, including demographic, clinical and laboratory features before the angiographic procedure. Finally, the diagnostic precision of each architecture was assessed.
Results: The most accurate classification model was the Random Forest algorithm (Specificity [Sp] 0.88, Sensitivity [Se] 0.57, Negative Predictive Value [NPV] 0.93, Area Under the Curve [AUC] 0.85 [CI 0.83–0.88]), followed by the standard Logistic Regression (Sp 0.76, Se 0.57, NPV 0.92, AUC 0.74) and Support-Vector Machine (Sp 0.84, Se 0.38, NPV 0.90, AUC 0.78) (see graph). The variables that contributed the most to discriminating MINOCA from MICAD were the traditional cardiovascular risk factors, biomarkers of myocardial injury, hemoglobin and gender. Results were similar when the 19 patients with Takotsubo syndrome were excluded from the analysis. Conclusion: A prediction system for diagnosing MINOCA before performing coronary angiographies was developed using machine learning algorithms. Results show higher accuracy in diagnosing MINOCA than conventional statistical methods. This study supports the potential of machine learning algorithms in clinical cardiology. However, further studies are required in order to validate our results. Funding Acknowledgement: Type of funding sources: None. Figure: ROC curves of different algorithms.
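
Sensitivity, specificity and negative predictive value are linked through disease prevalence by Bayes' rule. The sketch below shows that relationship for the Random Forest figures quoted above. The prevalence of 12.5% is an assumption (the midpoint of the 10 to 15% range given in the Introduction); the abstract does not report the actual MINOCA share of the test sample.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """NPV = P(no disease | negative test), derived with Bayes' rule."""
    true_negative_rate = specificity * (1.0 - prevalence)
    false_negative_rate = (1.0 - sensitivity) * prevalence
    return true_negative_rate / (true_negative_rate + false_negative_rate)

# Se and Sp are the Random Forest values from the abstract;
# the prevalence is an assumed midpoint, not a reported figure.
npv = negative_predictive_value(sensitivity=0.57, specificity=0.88, prevalence=0.125)
print(f"NPV under the assumed prevalence: {npv:.3f}")
```

Under this assumed prevalence the result is close to the reported NPV of 0.93, which illustrates why a model with modest sensitivity can still show a high NPV when the condition is uncommon.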

A Comparative Study on Supervised Machine Learning Algorithms for Copper Recovery Quality Prediction in a Leaching Process

Sensors, 2021, Vol 21 (6), pp. 2119. DOI: 10.3390/s21062119

Author(s): Victor Flores, Claudio Leiva

Keyword(s): Neural Network, Machine Learning, Artificial Neural Network, Support Vector Machine, Random Forest, Mining Industry, Machine Learning Algorithms, Copper Recovery, Support Vector, Copper Mining

The copper mining industry is increasingly using artificial intelligence methods to improve copper production processes. Recent studies reveal the use of algorithms such as Artificial Neural Network, Support Vector Machine, and Random Forest, among others, to develop models for predicting product quality. Other studies compare the predictive models developed with these machine learning algorithms in the mining industry as a whole. However, few published copper mining studies compare the results of machine learning techniques for copper recovery prediction. This study makes a detailed comparison between three models for predicting copper recovery by leaching, using four datasets resulting from mining operations in Northern Chile. The algorithms used for developing the models were Random Forest, Support Vector Machine, and Artificial Neural Network. To validate these models, four indicators or values of merit were used: accuracy (acc), precision (p), recall (r), and Matthews correlation coefficient (mcc). This paper describes the dataset preparation and the refinement of the threshold values used for the predictive variable most influential on the class (the copper recovery). Results show a precision over 98.50% and identify the model with the best agreement between the predicted and the real values. Finally, the obtained models have the following mean values: acc = 0.943, p = 88.47, r = 0.995, and mcc = 0.232. These values are highly competitive when compared with those obtained in similar studies using other approaches in the context.
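
The four values of merit named above can all be derived from a binary confusion matrix. The sketch below uses the standard definitions and a set of hypothetical counts chosen to show how accuracy and recall can be high while the Matthews correlation coefficient stays low on imbalanced classes. The counts are invented and are not taken from the study's datasets.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, mcc) for a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, precision, recall, mcc

# Hypothetical, heavily imbalanced counts: 945 positives and only 55 negatives.
acc, p, r, mcc = confusion_metrics(tp=940, fp=50, fn=5, tn=5)
print(f"acc={acc:.3f} p={p:.3f} r={r:.3f} mcc={mcc:.3f}")
```

Because MCC uses all four cells of the matrix, it penalizes a model that rarely identifies the minority class, which accuracy and recall alone do not reveal.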

APLICAÇÃO DE MACHINE LEARNING NA IDENTIFICAÇÃO DE E-MAILS COMO SPAM (Application of Machine Learning to the Identification of E-mails as Spam)

Colloquium Exactarum, 2021, Vol 12 (3), pp. 31-38. DOI: 10.5747/ce.2020.v12.n3.e327

Author(s): Michelle Tais Garcia Furuya, Danielle Elis Garcia Furuya

Keyword(s): Machine Learning, Support Vector Machine, Random Forest, Nearest Neighbors, Machine Learning Algorithms, The Other, Support Vector, K Nearest Neighbors, Mail Service, E Mail

The e-mail service is one of the main tools used today and is an example of how technology facilitates the exchange of information. On the other hand, one of the biggest obstacles faced by e-mail services is spam, the name given to unsolicited messages received by a user. The application of machine learning has been gaining prominence in recent years as an alternative for efficient identification of spam. In this area, different algorithms can be evaluated to identify which one has the best performance. The aim of the study is to assess the ability of machine learning algorithms to correctly classify e-mails and to identify which algorithm obtains the greatest accuracy. The database used was taken from the Kaggle platform and the data were processed by the Orange software with four algorithms: Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Naive Bayes (NB). The data were divided into 80% for training and 20% for testing. The results show that Random Forest was the best performing algorithm, with 99% accuracy.
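
The 80/20 division described above can be sketched as a shuffled holdout split. The study performed this step in the Orange software, so the code below is only an illustration of the idea; the function name `holdout_split`, the seed, and the placeholder records are hypothetical.

```python
import random

def holdout_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of the items and cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for labelled e-mails.
emails = [f"email_{i}" for i in range(1000)]
train, test = holdout_split(emails)
print(len(train), len(test))  # 800 200
```

Fixing the seed keeps the split reproducible, so different algorithms are compared on exactly the same test messages.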

Using Machine Learning to Predict Heart Disease

WSEAS TRANSACTIONS ON BIOLOGY AND BIOMEDICINE, 2022, Vol 19, pp. 1-9. DOI: 10.37394/23208.2022.19.1

Author(s): Nikhil Bora, Sreedevi Gutta, Ahmad Hadaegh

Keyword(s): Machine Learning, Support Vector Machine, Heart Disease, Random Forest, Data Science, Learning Algorithm, Machine Learning Algorithms, Machine Learning Techniques, Support Vector, K Nearest Neighbor

Heart disease has become one of the leading causes of death worldwide and one of the most life-threatening diseases. Early prediction of heart disease will help reduce the death rate. Predicting heart disease has become one of the most difficult challenges in the medical sector in recent years. As per recent statistics, about one person dies from heart disease every minute. In healthcare, a massive amount of data is available, and data science is critical for analyzing it. This paper proposes heart disease prediction using different machine learning algorithms such as logistic regression, naïve Bayes, support vector machine, k-nearest neighbor (KNN), random forest, extreme gradient boost, etc. We used these machine learning techniques to predict the likelihood of a person getting heart disease on the basis of features (such as cholesterol, blood pressure, age, sex, etc.) extracted from the datasets. In our research we used two separate datasets. The first heart disease dataset was collected from the well-known UCI Machine Learning Repository and has 303 record instances with 14 attributes (13 features and one target); the second was collected from the Kaggle website and contains 1190 patient record instances with 11 features and one target. The second dataset is a combination of 5 popular datasets for heart disease. This study compares the accuracy of various machine learning techniques. For the first dataset, we obtained the highest accuracy, 92%, with Support Vector Machine (SVM). For the second dataset, Random Forest gave the highest accuracy, 94.12%. We then combined both datasets and obtained the highest accuracy, 93.31%, using Random Forest.

Predicting Inhibitors for Multidrug Resistance Associated Protein-2 Transporter by Machine Learning Approach

Combinatorial Chemistry & High Throughput Screening, 2018, Vol 21 (8), pp. 557-566. DOI: 10.2174/1386207321666181024104822. Cited By ~ 3

Author(s): Sahil Kharangarh, Hardeep Sandhu, Sujit Tangadpalliwar, Prabha Garg

Keyword(s): Machine Learning, Support Vector Machine, Multidrug Resistance, Random Forest, Predictive Models, Computational Study, Recursive Feature Elimination, Efflux Transporters, Support Vector, Efflux Transporter

Background: The efflux transporter multidrug resistance associated protein-2 belongs to the ATP-binding cassette superfamily, which plays an important role in multidrug resistance and drug-drug interactions. Efflux transporters are considered important targets for increasing the efficacy of drugs, and the importance of computational study of efflux transporters for predicting substrates, non-substrates, inhibitors and non-inhibitors is well documented. Previous work on predictive models for inhibitors of the multidrug resistance associated protein-2 efflux transporter showed that machine learning methods produced good results. Objective: The aim of the present work was to develop a machine learning predictive model to classify inhibitors and non-inhibitors of the multidrug resistance associated protein-2 transporter using a well refined dataset. Method: In this study, several machine learning algorithms were used to develop the predictive models, i.e. support vector machine, random forest and k-nearest neighbor. Methods such as variance threshold, SelectKBest, random forest, and recursive feature elimination were used to select among the features generated by PyDPI. A total of 239 molecules, consisting of 124 inhibitors and 115 non-inhibitors, were used for model development. Results: The best multidrug resistance associated protein-2 inhibitor model showed prediction accuracies of 0.76, 0.72 and 0.79 for the training, 5-fold cross-validation and external sets, respectively. Conclusion: The support vector machine model built on features selected using the recursive feature elimination method showed the best performance. The developed model can be used in the early stages of drug discovery for identifying inhibitors of the multidrug resistance associated protein-2 efflux transporter.
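
The 5-fold cross-validation mentioned in the Results can be sketched as a partition of sample indices into five held-out folds. This is a plain (unstratified) illustration, not the authors' procedure. It assumes, for the sake of the example, that all 239 molecules are cross-validated, which the abstract does not state, and the function name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# 239 is the molecule count from the abstract; treating all of them as
# cross-validated is an assumption made for this example.
splits = list(k_fold_indices(239, k=5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```

Each molecule is held out exactly once, and the cross-validated accuracy is the average over the five held-out folds. With 124 inhibitors and 115 non-inhibitors, a stratified variant that keeps the class ratio in each fold would be a natural refinement.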

Implementing Machine Learning Algorithms to Classify Postures and Forecast Motions When Using a Dynamic Chair

Sensors, 2022, Vol 22 (1), pp. 400. DOI: 10.3390/s22010400

Author(s): Ghazal Farhani, Yue Zhou, Patrick Danielson, Ana Luisa Trejos

Keyword(s): Machine Learning, Support Vector Machine, Random Forest, Learning Algorithms, High Accuracy, Machine Learning Algorithms, Support Vector, Lstm Network, Health Complications, Convolutional Lstm

Many modern jobs require long periods of sitting on a chair, which may result in serious health complications. Dynamic chairs have been proposed as alternatives to traditional chairs; however, previous studies have suggested that most users are not aware of their postures and do not take advantage of the increased range of motion offered by dynamic chairs. Building a system that identifies users’ postures in real time, and forecasts the next few postures, can bring awareness to the sitting behavior of each user. In this study, machine learning algorithms have been implemented to automatically classify users’ postures and forecast their next motions. The random forest, gradient decision tree, and support vector machine algorithms were used to classify postures. The evaluation of the trained classifiers indicated that they could successfully identify users’ postures with an accuracy above 90%. The algorithm can provide users with an accurate report of their sitting habits. A 1D-convolutional-LSTM network has also been implemented to forecast users’ future postures based on their previous motions; the model can forecast a user’s motions with high accuracy (97%). The ability of the algorithm to forecast future postures could be used to suggest alternative postures as needed.
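
Forecasting future postures from previous motions is usually framed as a sliding-window problem: each training example pairs a short history with the posture that follows it. The sketch below shows only that data-preparation step, not the 1D-convolutional-LSTM network itself. The window length, the posture labels, and the function name `make_windows` are hypothetical, since the abstract does not describe them.

```python
def make_windows(sequence, window=3, horizon=1):
    """Pair each run of `window` past items with the `horizon` items that follow."""
    inputs, targets = [], []
    for start in range(len(sequence) - window - horizon + 1):
        inputs.append(sequence[start:start + window])
        targets.append(sequence[start + window:start + window + horizon])
    return inputs, targets

# Invented posture labels; a real system would use sensor-derived features.
postures = ["upright", "lean_left", "lean_left", "upright", "lean_right", "upright"]
X, y = make_windows(postures, window=3, horizon=1)
for history, upcoming in zip(X, y):
    print(history, "->", upcoming)
```

A sequence model would then be trained to map each history in `X` to the corresponding entry in `y`.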
