Data mining of coronavirus: SARS-CoV-2, SARS-CoV and MERS-CoV

2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Jung Eun Huh ◽  
Seunghee Han ◽  
Taeseon Yoon

Abstract Objective: In this study we compare the amino acid and codon sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using different statistical programs to understand their characteristics. Specifically, we are interested in how differences in the amino acid and codon sequences can lead to different incubation and outbreak periods. Our initial question was how SARS-CoV-2 compares to other viruses in the coronavirus family, examined with the NCBI BLAST program and machine learning algorithms. Results: The experiments using BLAST, Apriori and Decision Tree showed that SARS-CoV-2 has high similarity with SARS-CoV and comparably low similarity with MERS-CoV. We then compared the codons of SARS-CoV-2 and MERS-CoV to examine the differences. Although the viruses appear very alike according to the BLAST and Apriori experiments, SVM showed that they can be effectively classified using non-linear kernels. The Decision Tree experiment revealed several remarkable properties of the SARS-CoV-2 amino acid sequence that cannot be found in the MERS-CoV amino acid sequence. The ultimate purpose of this paper is to minimize the damage to humanity from SARS-CoV-2. Hence, further studies can focus on comparing SARS-CoV-2 with other viruses that can also be transmitted during latent periods.
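A minimal sketch, not the authors' code, of the non-linear SVM classification step described above: an RBF-kernel SVC separating two groups of protein sequences. The amino acid composition features and the synthetic sequences are assumptions made for illustration; the study itself used SARS-CoV-2 and MERS-CoV sequences obtained via NCBI.

```python
# Hedged sketch: RBF-kernel SVM on amino acid composition features.
# Sequences below are synthetic placeholders, not real viral proteins.
import random
from collections import Counter
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in a protein sequence."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]

random.seed(0)
def fake_seq(enriched):
    """Generate a synthetic sequence slightly enriched in the given residues."""
    weights = [3 if aa in enriched else 1 for aa in AMINO_ACIDS]
    return "".join(random.choices(AMINO_ACIDS, weights=weights, k=300))

# Placeholder classes: 0 = SARS-CoV-2-like, 1 = MERS-CoV-like
sequences = [fake_seq("LVS") for _ in range(50)] + [fake_seq("KRN") for _ in range(50)]
labels = [0] * 50 + [1] * 50

X = [aa_composition(s) for s in sequences]
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # non-linear kernel, as in the abstract
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```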

2021 ◽  
Author(s):  
Jung Eun Huh ◽  
Seunghee Han ◽  
Taeseon Yoon

Abstract Objectives: SARS-CoV-2, SARS-CoV and MERS-CoV all belong to the Coronaviridae family. In this study we compare the amino acid and codon sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using different statistical programs to understand their characteristics. Specifically, we are interested in how differences in the amino acid and codon sequences lead to different incubation and outbreak periods. Results: Our initial question was how SARS-CoV-2 compares to other viruses in the coronavirus family. The experiments using BLAST, Apriori and Decision Tree showed that SARS-CoV-2 has high similarity with SARS-CoV and comparably low similarity with MERS-CoV. We then compared the codons of SARS-CoV-2 and MERS-CoV to examine the differences. Although the viruses appear very alike according to the BLAST and Apriori experiments, SVM showed that they can be effectively classified using non-linear kernels. The Decision Tree experiment revealed several remarkable properties of the SARS-CoV-2 amino acid sequence that cannot be found in the MERS-CoV amino acid sequence. The ultimate purpose of this paper is to minimize the damage to humanity from SARS-CoV-2. Hence, further studies can focus on comparing SARS-CoV-2 with other viruses that can also be transmitted during latent periods.


2021 ◽  
Vol 11 (9) ◽  
pp. 4251
Author(s):  
Jinsong Zhang ◽  
Shuai Zhang ◽  
Jianhua Zhang ◽  
Zhiliang Wang

In digital microfluidic experiments, droplet characteristics and flow patterns are generally identified and predicted by empirical methods, which are ill-suited to mining large amounts of data. In addition, because some human intervention is inevitable, inconsistent judgment standards make comparisons between different experiments cumbersome and almost impossible. In this paper, we used machine learning to build algorithms that automatically identify, judge, and predict flow patterns and droplet characteristics, turning the empirical judgment into an intelligent process. In contrast to the usual machine learning setups, a generalized variable system was introduced to describe the different geometric configurations of digital microfluidics. Specifically, Buckingham's π theorem was adopted to obtain multiple groups of dimensionless numbers as the input variables of the machine learning algorithms. In verification, the SVM and BPNN algorithms successfully classified and predicted the different flow patterns and droplet characteristics (length and frequency). Compared with the primitive parameter system, the dimensionless-number system was superior in predictive capability. The traditional dimensionless numbers selected for the machine learning algorithms should carry strong physical rather than merely mathematical meaning. Applying dimensionless numbers reduced the dimensionality of the system and the amount of computation without losing the information contained in the primitive parameters.
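As an illustration of the dimensionless-input idea, the hedged sketch below builds two classical dimensionless groups (the capillary number and the flow-rate ratio, chosen here as plausible examples rather than the paper's actual variable set) from primitive parameters and trains an SVM flow-pattern classifier on them; the synthetic labelling rule is purely for demonstration.

```python
# Hedged sketch: dimensionless features (Buckingham-pi style grouping) fed to an SVM.
# All parameter ranges, the labelling rule, and the pattern names are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
mu = rng.uniform(1e-3, 1e-2, n)        # continuous-phase viscosity [Pa s]
u = rng.uniform(1e-3, 1e-1, n)         # continuous-phase velocity [m/s]
sigma = rng.uniform(5e-3, 5e-2, n)     # interfacial tension [N/m]
q_d = rng.uniform(1e-10, 1e-8, n)      # dispersed-phase flow rate [m^3/s]
q_c = rng.uniform(1e-10, 1e-8, n)      # continuous-phase flow rate [m^3/s]

# Dimensionless groups built from the primitive variables
Ca = mu * u / sigma                    # capillary number
q_ratio = q_d / q_c                    # flow-rate ratio
X = np.column_stack([Ca, q_ratio])

# Hypothetical flow-pattern labels: 0 = dripping, 1 = jetting
score = Ca * (1 + q_ratio)
y = (score > np.median(score)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("flow-pattern classification accuracy:", clf.score(X_te, y_te))
```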


2021 ◽  
Vol 11 (15) ◽  
pp. 6728
Author(s):  
Muhammad Asfand Hafeez ◽  
Muhammad Rashid ◽  
Hassan Tariq ◽  
Zain Ul Abideen ◽  
Saud S. Alotaibi ◽  
...  

Classification and regression are major applications of machine learning algorithms, which are widely used to solve problems in numerous domains of engineering and computer science. Different classifiers based on optimizing the decision tree have been proposed; however, they are still evolving over time. This paper presents a novel and robust classifier based on decision tree and tabu search algorithms. To improve performance, the proposed algorithm constructs multiple decision trees while employing a tabu search algorithm to consistently monitor the leaf and decision nodes in the corresponding trees. Additionally, the tabu search algorithm is responsible for balancing the entropy of the corresponding decision trees. For training the model, we used the clinical data of COVID-19 patients to predict whether a patient is suffering from the disease. The experimental results were obtained with our proposed classifier using the scikit-learn library in Python. An extensive performance comparison with conventional supervised machine learning algorithms is presented using Big O and statistical analysis, along with a comparison to optimized state-of-the-art classifiers. The achieved accuracy of 98%, execution time of 55.6 ms and area under the receiver operating characteristic curve (AUROC) of 0.95 reveal that the proposed classifier algorithm is suitable for large datasets.
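The sketch below reproduces only the evaluation scaffolding implied above (accuracy, runtime in milliseconds, and AUROC around a plain scikit-learn decision tree); the tabu-search-guided construction of multiple trees is not reproduced, and the generated data is a stand-in for the COVID-19 clinical records.

```python
# Hedged sketch: baseline decision tree with the metrics reported in the abstract.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic placeholder for the clinical dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

start = time.perf_counter()
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"accuracy : {accuracy_score(y_te, y_pred):.3f}")
print(f"AUROC    : {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")
print(f"runtime  : {elapsed_ms:.1f} ms")
```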


2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 126-127
Author(s):  
Lucas S Lopes ◽  
Christine F Baes ◽  
Dan Tulpan ◽  
Luis Artur Loyola Chardulo ◽  
Otavio Machado Neto ◽  
...  

Abstract The aim of this project is to compare some state-of-the-art machine learning algorithms on the classification of steers finished in feedlots, based on performance, carcass and meat quality traits. Precise classification of animals allows for fast, real-time decision making in the animal food industry, such as culling or retention of herd animals. Beef production presents high variability in its numerous carcass and beef quality traits. Machine learning algorithms and software provide an opportunity to evaluate the interactions between traits to better classify animals. Four different treatment levels of wet distiller's grain were applied to 97 Angus-Nellore animals and used as features for the classification problem. The C4.5 decision tree, Naïve Bayes (NB), Random Forest (RF) and Multilayer Perceptron (MLP) artificial neural network algorithms were used to predict and classify the animals based on recorded trait measurements, which include initial and final weights, shear force and meat color. The top-performing classifier was the C4.5 decision tree algorithm, with a classification accuracy of 96.90%, while the RF, MLP and NB classifiers had accuracies of 55.67%, 39.17% and 29.89%, respectively. We observed that the final decision tree model constructed with C4.5 selected only the dry matter intake (DMI) feature as a differentiator. When DMI was removed, no other feature or combination of features was sufficiently strong to provide good prediction accuracies for any of the classifiers. In a follow-up study on a significantly larger sample, we plan to investigate why DMI is a more relevant parameter than the other measurements.
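For orientation, a minimal scikit-learn comparison in the spirit of the experiment above is sketched below; it assumes the four diet treatments act as the class labels (an interpretation of the setup above), substitutes CART for C4.5 (scikit-learn has no C4.5 implementation), and uses synthetic data in place of the 97-animal records, so the reported accuracies will not be reproduced.

```python
# Hedged sketch: comparing the four classifier families on synthetic tabular traits.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# 97 animals, 4 classes standing in for the diet treatment levels (assumption)
X, y = make_classification(n_samples=97, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)

models = {
    "Decision tree (CART)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:22s} mean CV accuracy = {scores.mean():.3f}")
```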


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Faizan Ullah ◽  
Qaisar Javaid ◽  
Abdu Salam ◽  
Masood Ahmad ◽  
Nadeem Sarwar ◽  
...  

Ransomware (RW) is a distinctive variety of malware that encrypts files or locks the user's system by taking their files hostage, which leads to huge financial losses for users. In this article, we propose a new model that extracts novel features from an RW dataset and performs classification of RW and benign files. The proposed model can detect a large number of RW samples from various families at runtime, scanning network, registry, and file-system activity throughout execution. API-call sequences were used to represent the behavior-based features of RW. The technique extracts a fourteen-element feature vector at runtime and analyzes it with online machine learning algorithms to predict RW. To validate effectiveness and scalability, we tested 78,550 recent malign and benign samples and compared the model with random forest and AdaBoost, reaching a testing accuracy of 99.56%.
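A rough sketch of the online-learning step described above: a fourteen-element behavior vector per sample (random placeholders standing in for API-call, registry, and file-system statistics) fed to an incrementally trained linear classifier via partial_fit. This illustrates the streaming idea only and is not the authors' pipeline.

```python
# Hedged sketch: incremental (online) training on streaming behaviour reports.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
N_FEATURES = 14                        # fourteen behaviour features, as in the abstract
classes = np.array([0, 1])             # 0 = benign, 1 = ransomware

clf = SGDClassifier(random_state=0)

# Simulate a stream of labelled behaviour reports arriving in mini-batches.
for batch in range(20):
    X_batch = rng.normal(size=(256, N_FEATURES))
    y_batch = (X_batch[:, :3].sum(axis=1) > 0).astype(int)   # toy labelling rule
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(1000, N_FEATURES))
y_test = (X_test[:, :3].sum(axis=1) > 0).astype(int)
print("streaming test accuracy:", clf.score(X_test, y_test))
```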


2021 ◽  
Vol 10 (1) ◽  
pp. 99
Author(s):  
Sajad Yousefi

Introduction: Heart disease is often associated with conditions such as arteries clogged by sediment accumulation, which causes chest pain and heart attacks. Many people die of heart disease annually. Most countries have a shortage of cardiovascular specialists and, consequently, a significant percentage of misdiagnoses occur. Hence, predicting this disease is a serious issue. Using machine learning models applied to a multidimensional dataset, this article aims to find the most efficient and accurate machine learning models for disease prediction. Material and Methods: Several algorithms were utilized to predict heart disease, most notably Decision Tree, Random Forest and KNN supervised machine learning. The algorithms were applied to a dataset taken from the UCI repository, comprising 294 samples of heart disease features. To enhance algorithm performance, these features were analyzed, and feature importance scores and cross validation were considered. Results: The algorithms were compared with each other based on the ROC curve and criteria such as accuracy, precision, sensitivity and F1 score evaluated for each model. For the Decision Tree algorithm, the accuracy and AUC ROC were 83% and 99%, respectively. The Logistic Regression algorithm, with an accuracy of 88% and AUC ROC of 91%, performed better than the other algorithms. Therefore, these techniques can help physicians predict heart disease patients and prescribe correctly. Conclusion: Machine learning techniques can be used in medicine to analyze data collections related to a disease and to predict it. The area under the ROC curve and evaluation criteria for a number of machine learning classification algorithms were compared to determine the most appropriate classifier for heart disease prediction. As a result of the evaluation, better performance was observed in both the Decision Tree and Logistic Regression models.
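A minimal sketch of this comparison workflow (cross-validated accuracy and ROC AUC for the models named above), using synthetic data as a stand-in for the 294-sample UCI heart-disease records; the reported figures will not be reproduced.

```python
# Hedged sketch: cross-validated accuracy and AUC for several classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 294-sample heart-disease dataset
X, y = make_classification(n_samples=294, n_features=13, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "roc_auc"])
    print(f"{name:20s} accuracy={cv['test_accuracy'].mean():.2f} "
          f"AUC={cv['test_roc_auc'].mean():.2f}")
```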


Author(s):  
Tanujit Chakraborty

Decision tree algorithms have been among the most popular algorithms for interpretable (transparent) machine learning since the early 1980s. On the other hand, deep learning methods have boosted the capacity of machine learning algorithms and are now being used for non-trivial applications in various applied domains. But training a fully connected deep feed-forward network by gradient-descent backpropagation is slow and requires arbitrary choices for the number of hidden units and layers. In this paper, we propose near-optimal neural regression trees, intended to be much faster than deep feed-forward networks and to remove the need to specify the number of hidden units in the hidden layers of the neural network in advance. The key idea is to construct a decision tree and then simulate the decision tree with a neural network. This work aims to build a mathematical formulation of neural trees and gain the complementary benefits of both sparse optimal decision trees and neural trees. We propose near-optimal sparse neural trees (NSNT), which are shown to be asymptotically consistent and robust in nature. Additionally, the proposed NSNT model obtains a fast rate of convergence, which is near-optimal up to some logarithmic factor. We comprehensively benchmark the proposed method on a sample of 80 datasets (40 classification and 40 regression datasets) from the UCI machine learning repository. We establish that the proposed method is likely to outperform the current state-of-the-art methods (random forest, XGBoost, optimal classification tree, and near-optimal nonlinear trees) for the majority of the datasets.
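The core idea, constructing a decision tree and then simulating it with a neural network, can be caricatured as follows: a CART tree is fitted first, and a small MLP whose width is taken from the tree's leaf count is trained to reproduce the tree's predictions. This is only a schematic of that first step under simplifying assumptions; the NSNT formulation, sparsity, and consistency results go far beyond it.

```python
# Hedged sketch: fit a tree, then let a small network mimic it, so the network
# width is derived from the tree rather than chosen arbitrarily in advance.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# "Simulate" the tree: the network is trained on the tree's outputs, with one
# hidden unit per leaf as a heuristic (an assumption, not the paper's rule).
n_hidden = tree.get_n_leaves()
net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=5000, random_state=0)
net.fit(X_tr, tree.predict(X_tr))

print("tree R^2 :", tree.score(X_te, y_te))
print("net  R^2 :", net.score(X_te, y_te))
```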


2019 ◽  
Author(s):  
Cheng-Sheng Yu ◽  
Yu-Jiun Lin ◽  
Chang-Hsien Lin ◽  
Sen-Te Wang ◽  
Shiyng-Yu Lin ◽  
...  

BACKGROUND Metabolic syndrome is a cluster of disorders that significantly influences the development and deterioration of numerous diseases. FibroScan is an ultrasound device that was recently shown to predict metabolic syndrome with moderate accuracy. However, previous research on predicting metabolic syndrome in subjects examined with FibroScan has been based mainly on conventional statistical models. Alternatively, machine learning, whereby a computer algorithm learns from prior experience, has better predictive performance than conventional statistical modeling. OBJECTIVE We aimed to evaluate the accuracy of different decision-tree machine learning algorithms for predicting the state of metabolic syndrome in self-paid health examination subjects who were examined with FibroScan. METHODS Multivariate logistic regression was conducted for every known risk factor of metabolic syndrome. Principal components analysis was used to visualize the distribution of metabolic syndrome patients. We further applied various statistical machine learning techniques to visualize and investigate the pattern and relationship between metabolic syndrome and several risk variables. RESULTS Obesity, serum glutamic-oxaloacetic transaminase, serum glutamic pyruvic transaminase, controlled attenuation parameter score, and glycated hemoglobin emerged as significant risk factors in multivariate logistic regression. The areas under the receiver operating characteristic curve for classification and regression trees and for the random forest were 0.831 and 0.904, respectively. CONCLUSIONS Machine learning technology facilitates the identification of metabolic syndrome in self-paid health examination subjects with high accuracy.
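A hedged illustration of the modelling comparison above (logistic regression, a classification tree, and a random forest compared by AUC), run on synthetic data standing in for the FibroScan examination records:

```python
# Hedged sketch: AUC comparison of the three model families named in the abstract.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for the health-examination data
X, y = make_classification(n_samples=1500, n_features=8, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("classification tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("random forest", RandomForestClassifier(random_state=0)),
]
for name, model in models:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:20s} AUC = {auc:.3f}")
```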


2020 ◽  
Author(s):  
Xueyan Li ◽  
Genshan Ma ◽  
Xiaobo Qian ◽  
Yamou Wu ◽  
Xiaochen Huang ◽  
...  

Abstract Background: We aimed to assess the performance of machine learning algorithms for predicting risk factors of postoperative ileus (POI) in patients who underwent laparoscopic colorectal surgery for malignant lesions. Methods: We conducted a retrospective observational study with a total of 637 patients at Suzhou Hospital of Nanjing Medical University. Four machine learning algorithms (logistic regression, decision tree, random forest, gradient boosting decision tree) were considered to predict risk factors of POI. The cases were randomly divided into training and testing data sets with a ratio of 8:2. The performance of each model was evaluated by area under the receiver operating characteristic curve (AUC), precision, recall and F1-score. Results: The morbidity of POI in this study was 19.15% (122/637). The gradient boosting decision tree reached the highest AUC (0.76) and was the best model for POI risk prediction. In addition, the importance matrix of the gradient boosting decision tree showed that the five most important variables were time to first passage of flatus, opioids during POD3, duration of surgery, height and weight. Conclusions: The gradient boosting decision tree was the optimal model for predicting the risk of POI in patients who underwent laparoscopic colorectal surgery for malignant lesions. The results of our study could be useful for clinical guidelines on POI risk prediction.
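A brief sketch of the gradient boosting step and the importance ranking described above, with synthetic placeholders for the clinical variables (the variable names and labelling rule are illustrative, not the study's data):

```python
# Hedged sketch: gradient boosting with AUC evaluation and feature importances.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
feature_names = ["time_to_first_flatus", "opioids_POD3", "surgery_duration",
                 "height", "weight", "age", "blood_loss"]   # illustrative names
X = rng.normal(size=(637, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=637) > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
gbdt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("AUC:", round(roc_auc_score(y_te, gbdt.predict_proba(X_te)[:, 1]), 3))
for name, imp in sorted(zip(feature_names, gbdt.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:22s} importance = {imp:.3f}")
```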


Biology ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 365
Author(s):  
Taha ValizadehAslani ◽  
Zhengqiao Zhao ◽  
Bahrad A. Sokhansanj ◽  
Gail L. Rosen

Machine learning algorithms can learn mechanisms of antimicrobial resistance (AMR) from DNA sequence data without any a priori information. Interpreting a trained machine learning model can be exploited to validate the model and to obtain new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy among feature extraction methods. In this study, we propose a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation than counting nucleotide k-mers and reaches the same or even better accuracy than the other methods. Additionally, we trained machine learning algorithms using the different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We built a new feature selection pipeline for extracting important features, so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that use only a small number of features and can predict resistance accurately.
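A minimal sketch of the oligopeptide (amino acid k-mer) counting idea; the sequences are toy placeholders, and the study's feature-selection pipeline and resistance models are not reproduced:

```python
# Hedged sketch: counting overlapping amino acid k-mers (oligopeptides).
from collections import Counter

def aa_kmer_counts(protein_seq, k=3):
    """Count overlapping amino acid k-mers (oligopeptides) in a protein sequence."""
    return Counter(protein_seq[i:i + k] for i in range(len(protein_seq) - k + 1))

# Toy protein fragments standing in for translated resistance-gene sequences.
examples = {
    "resistant": "MKTLLVAGAVAGAMKTLLVAG",
    "susceptible": "MNPQRSTVWYMNPQRSTVWY",
}
for label, seq in examples.items():
    counts = aa_kmer_counts(seq, k=3)
    print(label, counts.most_common(3))
```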

