Identification of Most Relevant Features for Classification of Francisella tularensis using Machine Learning

2020 ◽  
Vol 15 ◽  
Author(s):  
Fareed Ahmad ◽  
Amjad Farooq ◽  
Muhammad Usman Ghani Khan ◽  
Muhammad Zubair Shabbir ◽  
Masood Rabbani ◽  
...  

Background: Francisella tularensis is a stealth pathogen that is fatal to animals and humans. Its ease of propagation, coupled with its high capacity to cause illness and death, makes it a potential candidate for a biological weapon. Objective: Work on the pathogen's classification and on the factors affecting its prolonged persistence in soil has been limited to statistical measures. Machine learning, as an alternative to conventional analysis methods, may improve epidemiological modeling of this soil-borne pathogen. Method: The feature-ranking algorithms relief, correlation and oneR are used to rank soil attributes. The classification algorithms SVM, random forest, naive Bayes, logistic regression and MLP are then used to classify the soil attribute dataset into Francisella tularensis positive and negative soils. Results: The feature-ranking methods indicate that clay, nitrogen, organic matter, soluble salts, zinc, silt and nickel are the most significant attributes, while potassium, phosphorus, iron, calcium, copper, chromium and sand contribute least as risk factors for the persistence of the pathogen. Overall, clay is the most significant attribute and potassium the least. With relief-based feature ranking, classification accuracy was 84.35% for the multilayer perceptron (MLP), 82.99% for logistic regression, 80.27% for both SVM and random forest, and 78.23% for naive Bayes, which is better than with the other ranking methods. MLP outperforms the other classifiers, with accuracies of 84.35%, 82.99% and 81.63% for feature ranking using relief, correlation and oneR, respectively. Conclusion: These models can significantly improve accuracy and minimize the risk of incorrect classification. They can further help in controlling epidemics and thereby minimize the socio-economic impact on society.
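
The correlation-based ranking named in the Method can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the toy soil values, the function names `pearson` and `rank_features`, and the choice to rank by absolute Pearson correlation with the class label are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / sqrt(var_x * var_y)

def rank_features(columns, labels):
    """Return feature names sorted by |correlation with the label|, strongest first."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data (not from the paper): 1 = positive soil, 0 = negative soil.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "clay": [40, 38, 42, 20, 22, 18],   # tracks the label closely
    "potassium": [5, 9, 7, 6, 8, 7],    # uncorrelated with the label
}
ranking = rank_features(columns, labels)
print(ranking)
```

A ranking like this would typically be used to select the top attributes before training each classifier, so that accuracies can be compared across ranking methods.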

2019 ◽  
Vol 9 (9) ◽  
pp. 231 ◽  
Author(s):  
Attallah ◽  
Sharkas ◽  
Gadelkarim

Magnetic resonance imaging (MRI) is a common imaging technique used extensively to study human brain activity. Recently, it has also been used to scan the fetal brain. About 3 in every 1000 pregnant women carry a fetus with a brain abnormality, so early detection and classification are important. Machine learning techniques have great potential to aid the early detection of these abnormalities, which in turn could improve the diagnosis process and follow-up plans. Most research on classifying abnormal brains at an early age has focused on newborns and premature infants, with fewer studies focusing on images of fetuses. Those studies related fetal scans to scans after birth in order to detect and classify brain defects early in the neonatal period. This type of brain abnormality is named small for gestational age (SGA). This article proposes a novel framework for the classification of fetal brains at an early age (before the fetus is born). To the best of our knowledge, this is the first study to classify brain abnormalities of fetuses across a wide range of gestational ages (GAs). The study incorporates several machine learning classifiers: diagonal quadratic discriminant analysis (DQDA), K-nearest neighbour (K-NN), random forest, naïve Bayes, and radial basis function (RBF) neural network classifiers. Moreover, several bagging and AdaBoost ensemble models have been constructed using random forest, naïve Bayes, and RBF network classifiers. The performance of these ensembles has been compared with that of their individual models. Our results show that our novel approach can successfully identify and classify numerous types of defects within MRI images of the fetal brain at various GAs. Using the K-NN classifier, we achieved the highest classification accuracy and area under the receiver operating characteristic curve, 95.6% and 99% respectively. In addition, ensemble classifiers improved the results of their respective individual models.
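
For readers unfamiliar with the K-NN classifier reported as the best performer, a minimal sketch follows. It is an illustration under stated assumptions, not the study's implementation: the two-dimensional feature vectors, the class names, the function name `knn_predict`, and the use of Euclidean distance with k = 3 are all hypothetical choices.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors standing in for image-derived features.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["normal", "normal", "normal", "abnormal", "abnormal", "abnormal"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

With two classes, an odd k avoids tied votes; real image features would normally be scaled before distances are computed.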


2021 ◽  
Vol 2021 (1) ◽  
pp. 1012-1018
Author(s):  
Handy Geraldy ◽  
Lutfi Rahmatuti Maghfiroh

In carrying out its role as a data provider, Statistics Indonesia (Badan Pusat Statistik, BPS) offers the public access to BPS data. One of these services is the search feature on the BPS website. However, the search service does not yet meet consumer expectations. One way to meet those expectations is to improve search effectiveness so that results are more relevant to the user's intent. This study therefore aims to build a query classification function for the search engine and to test whether that function improves search effectiveness. The query classification function is built using a machine learning model. We compared five algorithms: SVM, Random Forest, Gradient Boosting, KNN, and Naive Bayes. Of these five, SVM produced the best model. The function was then implemented in the search engine, whose effectiveness was measured by precision and recall. The results show that the query classification function can narrow the search results for certain queries, thereby increasing precision. However, the query classification function does not affect recall.
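
The precision and recall effect described in this abstract can be illustrated with a small sketch. The document identifiers, the result lists and the function name `precision_recall` are hypothetical; the example only shows how narrowing a result list to the predicted query category can raise precision while leaving recall unchanged.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved items and a set of relevant items."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical search results for one query.
relevant = {"d1", "d2"}
unfiltered = ["d1", "d2", "d3", "d4"]   # plain search
filtered = ["d1", "d2"]                 # restricted to the predicted query category

print(precision_recall(unfiltered, relevant))
print(precision_recall(filtered, relevant))
```

In this toy case the filter removes only irrelevant documents, so precision doubles and recall stays the same; a filter that also dropped relevant documents would lower recall.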


Author(s):  
Anirudh Reddy Cingireddy ◽  
Robin Ghosh ◽  
Supratik Kar ◽  
Venkata Melapu ◽  
Sravanthi Joginipeli ◽  
...  

Frequent testing of the entire population would help to identify individuals with active COVID-19, including concealed carriers. Molecular tests, antigen tests, and antibody tests are being widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test take from a minimum of 3 hours to a maximum of 4 days to return results. To overcome this issue, the authors suggest using machine learning and data mining tools to screen large populations at a preliminary level. The ML tools could reduce the size of the population to be tested by 20 to 30%. In this study, they used a subset of features from full blood profiles drawn from patients at Israelita Albert Einstein hospital in Brazil. They evaluated the classification models KNN, logistic regression, XGBoost, naive Bayes, decision tree, random forest, support vector machine, and multilayer perceptron, using k-fold cross-validation to validate them. Naive Bayes, KNN, and random forest stand out as the most predictive, with 88% accuracy each.
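
The k-fold cross-validation used to validate the models can be sketched as an index split. This is a minimal illustration, assuming contiguous, unshuffled folds; the function name `k_fold_indices` is hypothetical, and the abstract does not state the value of k or whether the folds were shuffled or stratified.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every model is scored on data it was not trained on, and the k scores are averaged.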


2020 ◽  
Vol 8 (6) ◽  
pp. 1623-1630

As huge amounts of data accumulate, extracting the required information from what is available has become a challenge. Machine learning contributes to many fields. A fast-growing population has brought with it a wide range of diseases, which in turn has created a need for machine learning models that use patient datasets. Analysis of datasets from different sources indicates that cancer is among the most hazardous diseases and may cause the death of the sufferer. Surveys report that cancer can be nearly cured in its initial stages but may cause death in later stages. One of the major types of cancer is lung cancer. Its prediction depends heavily on past data, and the disease requires detection in its early stages. The proposed work uses a machine learning algorithm to group individuals' details into categories in order to predict, at an early stage, whether they are at risk of developing cancer. A random forest algorithm is implemented and achieves a higher efficiency, 97%, than KNN and Naive Bayes. KNN does not learn a model from the training data but uses the data directly for classification, and Naive Bayes yields inaccurate predictions. The proposed system predicts the chance of lung cancer at three levels: low, medium, and high. Early prediction of this kind can help reduce mortality rates.


2019 ◽  
Author(s):  
Thomas M. Kaiser ◽  
Pieter B. Burger

Machine learning continues to make significant advances in the prediction of desired properties in drug development. However, the efficacy of machine learning in these arenas relies on highly accurate and abundant data. These two limitations, high accuracy and abundance, are often taken together; however, insight into the dataset accuracy limitation of contemporary machine learning algorithms may show whether non-bench experimental sources of data can be used to generate useful machine learning models where there is a paucity of experimental data. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for the Naïve Bayes Network algorithm: on average, 39% error in the training set was required to lose predictivity on the test set. The Random Forest also tolerated a significant degree of categorical error in the training set, with an average of 29% error required to lose predictivity. The Probabilistic Neural Network algorithm, however, tolerated less categorical error, requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.
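
Introducing categorical error at a chosen proportion of a training set can be sketched as label flipping. This is a hedged illustration only: the binary active/inactive encoding, the function name `flip_labels`, the fixed seed and the uniform choice of which labels to flip are assumptions, and the abstract does not describe the authors' actual error-injection procedure.

```python
import random

def flip_labels(labels, error_rate, seed=0):
    """Return a copy of binary (0/1) labels with a fixed fraction flipped."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    n_flip = round(error_rate * len(labels))
    flip_positions = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_positions else y for i, y in enumerate(labels)]

# Hypothetical clean training labels: 1 = active, 0 = inactive.
clean = [1] * 50 + [0] * 50
noisy = flip_labels(clean, 0.29)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed, "of", len(clean), "labels flipped")
```

Training a model on `noisy` at increasing error rates and scoring it on an uncorrupted test set would trace out how accuracy decays as a function of error.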


2019 ◽  
Vol 9 (14) ◽  
pp. 2789 ◽  
Author(s):  
Sadaf Malik ◽  
Nadia Kanwal ◽  
Mamoona Naveed Asghar ◽  
Mohammad Ali A. Sadiq ◽  
Irfan Karamat ◽  
...  

Medical health systems have been concentrating on artificial intelligence techniques for speedy diagnosis. However, the recording of health data in a standard form still requires attention so that machine learning can be more accurate and reliable by considering multiple features. The aim of this study is to develop a general framework for recording diagnostic data in an international standard format to facilitate prediction of disease diagnosis based on symptoms using machine learning algorithms. Efforts were made to ensure error-free data entry by developing a user-friendly interface. Furthermore, multiple machine learning algorithms including Decision Tree, Random Forest, Naive Bayes and Neural Network algorithms were used to analyze patient data based on multiple features, including age, illness history and clinical observations. This data was formatted according to structured hierarchies designed by medical experts, whereas diagnosis was made as per the ICD-10 coding developed by the American Academy of Ophthalmology. Furthermore, the system is designed to evolve through self-learning by adding new classifications for both diagnosis and symptoms. The classification results from tree-based methods demonstrated that the proposed framework performs satisfactorily, given a sufficient amount of data. Owing to a structured data arrangement, the random forest and decision tree algorithms’ prediction rate is more than 90% as compared to more complex methods such as neural networks and the naïve Bayes algorithm.


2020 ◽  
Vol 13 (5) ◽  
pp. 901-908
Author(s):  
Somil Jain ◽  
Puneet Kumar

Background: Breast cancer causes a large number of deaths every year across the globe, and early detection and diagnosis of the disease, which is needed to reduce the number of deaths, is a challenging task. Nowadays, various machine learning and data mining techniques are used for medical diagnosis and have proven their mettle in predicting chronic diseases such as cancer, which can save the lives of patients suffering from such diseases. The major concern of this study is to determine the prediction accuracy of classification algorithms such as Support Vector Machine, J48, Naïve Bayes and Random Forest, and to suggest the best algorithm. Objective: The objective of this study is to assess the prediction accuracy of the classification algorithms in terms of efficiency and effectiveness. Methods: This paper provides a detailed analysis of classification algorithms such as Support Vector Machine, J48, Naïve Bayes and Random Forest in terms of their prediction accuracy, by applying 10-fold cross-validation to the Wisconsin Diagnostic Breast Cancer dataset using the open-source WEKA tool. Results: The study finds that Support Vector Machine achieved the highest prediction accuracy, 97.89%, with a low error rate of 0.14%. Conclusion: This paper provides a clear view of the performance of the classification algorithms in terms of their predictive ability, which can help medical practitioners diagnose chronic diseases such as breast cancer effectively.


2021 ◽  
Vol 9 (1) ◽  
pp. 5
Author(s):  
Haewon Byeon

This preliminary study used a stacking ensemble to explore the major factors that could predict depression in patients with Parkinson’s disease, and it presents baseline data for developing a nomogram prognostic index for predicting high-risk groups for depression among patients with Parkinson’s disease in the future. Depression, the outcome variable, was divided into “with depression” and “without depression” using the Geriatric Depression Scale-30 (GDS-30). This study developed nine machine learning models (ANN, random forest, naive Bayes, CART, ANN+LR, random forest+LR, naive Bayes+LR, CART+LR, and random forest+naive Bayes+CART+ANN+LR). The predictive performance (e.g., RMSE, IA, Ev) of each machine learning model was validated through 10-fold cross-validation. The analysis showed that random forest+LR had the best predictive performance: RMSE = 0.16, IA = 0.73, and Ev = 0.48. This study analyzed the normalized importance of the variables in the random forest+LR model (the final model) and confirmed that K-MMSE, K-MoCA, Global CDR, sum of boxes in CDR, total score of UPDRS, motor score of UPDRS, K-IADL, H and Y staging, Schwab and England ADL, and REM and RBD were the ten major variables with high weight among predictors of Parkinson’s disease with depression in South Korea. It is also necessary to develop interpretable machine learning in order to build a model for predicting depression in patients with Parkinson’s disease that can be used in the medical field.
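
Two of the performance measures named above can be computed as in the sketch below. RMSE is standard; for IA the abstract gives no formula, so this example assumes Willmott's index of agreement, and the observed and predicted values are hypothetical.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def index_of_agreement(observed, predicted):
    """Willmott's index of agreement (assumed definition of IA); 1.0 is perfect agreement."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    potential_error = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2
        for o, p in zip(observed, predicted)
    )
    if potential_error == 0:
        return 1.0  # every value equals the observed mean, so agreement is perfect
    return 1.0 - squared_error / potential_error

# Hypothetical outcomes (1 = with depression) and predicted probabilities.
observed = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.8, 0.1]
print(round(rmse(observed, predicted), 3))
print(round(index_of_agreement(observed, predicted), 3))
```

Lower RMSE and higher IA both indicate a better fit, which is how the nine models can be ranked after cross-validation.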

