Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment

The growing number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., the compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host phenotypes based on taxonomy-informed feature selection, to establish associations between the microbiome and disease states, is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, to infer host phenotypes to predict diseases, and to use microbial communities to stratify patients by their state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here relate mostly to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review, covering this broad topic, is aligned with the scoping review methodology. The manual identification of data sources has been complemented with: (1) an automated publication search through the digital libraries of three major publishers using a natural language processing (NLP) toolkit, and (2) an automated identification of relevant software repositories on GitHub, with the related research papers ranked using a learning-to-rank approach.

Literature on Applied Machine Learning in Metagenomic Classification: A Scoping Review

Biology ◽

10.3390/biology9120453 ◽

2020 ◽

Vol 9 (12) ◽

pp. 453

Author(s):

Petar Tonkovic ◽

Slobodan Kalajdziski ◽

Eftim Zdravevski ◽

Petre Lameski ◽

Roberto Corizzo ◽

...

Keyword(s):

Machine Learning ◽

Language Processing ◽

Scoping Review ◽

Digital Libraries ◽

Research Field ◽

Time Interval ◽

Research Papers ◽

Data Set ◽

Practical Applications ◽

Applied Machine Learning

Applied machine learning in bioinformatics is growing as computer science slowly invades all research spheres. With the arrival of modern next-generation DNA sequencing algorithms, metagenomics is becoming an increasingly interesting research field as it finds countless practical applications exploiting the vast amounts of generated data. This study aims to scope the scientific literature in the field of metagenomic classification in the time interval 2008–2019 and provide an evolutionary timeline of data processing and machine learning in this field. This study follows the scoping review methodology and PRISMA guidelines to identify and process the available literature. Natural Language Processing (NLP) is deployed to ensure efficient and exhaustive search of the literary corpus of three large digital libraries: IEEE, PubMed, and Springer. The search is based on keywords and properties looked up using the digital libraries’ search engines. The scoping review results reveal an increasing number of research papers related to metagenomic classification over the past decade. The research is mainly focused on metagenomic classifiers, identifying scope specific metrics for model evaluation, data set sanitization, and dimensionality reduction. Out of all of these subproblems, data preprocessing is the least researched with considerable potential for improvement.

Intelligent Detection of False Information in Arabic Tweets Utilizing Hybrid Harris Hawks Based Feature Selection and Machine Learning Models

Symmetry ◽

10.3390/sym13040556 ◽

2021 ◽

Vol 13 (4) ◽

pp. 556

Author(s):

Thaer Thaher ◽

Mahmoud Saheb ◽

Hamza Turabieh ◽

Hamouda Chantar

Keyword(s):

Machine Learning ◽

Social Media ◽

Feature Selection ◽

Language Processing ◽

User Profile ◽

Vital Role ◽

Classification Model ◽

Fake News ◽

False Information ◽

Social Media Platforms

Fake or false information on social media platforms is a significant challenge: it deliberately misleads users through rumors, propaganda, or deceptive information about a person, organization, or service. Twitter is one of the most widely used social media platforms, especially in the Arab region, where the number of users is steadily increasing, accompanied by an increase in the rate of fake news. This has drawn the attention of researchers seeking to provide a safe online environment free of misleading information. This paper proposes a smart classification model for the early detection of fake news in Arabic tweets utilizing Natural Language Processing (NLP) techniques, Machine Learning (ML) models, and the Harris Hawks Optimizer (HHO) as a wrapper-based feature selection approach. An Arabic Twitter corpus composed of 1862 previously annotated tweets was used to assess the efficiency of the proposed model. The Bag of Words (BoW) model is used with different term-weighting schemes for feature extraction. Eight well-known learning algorithms are investigated with varying combinations of features, including user-profile, content-based, and word features. The reported results show that Logistic Regression (LR) with the Term Frequency-Inverse Document Frequency (TF-IDF) model ranks best. Moreover, feature selection based on the binary HHO algorithm plays a vital role in reducing dimensionality, thereby enhancing the learning model's performance for fake news detection. Interestingly, the proposed BHHO-LR model yields an improvement of 5% compared with previous works on the same dataset.
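
As a rough illustration of the TF-IDF term weighting mentioned in the abstract above, the following minimal sketch computes weights over a bag-of-words representation. It assumes one common variant (relative term frequency multiplied by the natural log of N/df, with no smoothing); the abstract does not say which variant the authors used, and the function name `tf_idf` and the English toy documents are hypothetical placeholders, not data from the Arabic corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up documents used only to exercise the function.
docs = [
    "breaking news about the election".split(),
    "news about the weather".split(),
    "the election results".split(),
]
vectors = tf_idf(docs)
```

A term that occurs in every document (here "the") receives weight zero under this variant, while a term confined to one document is weighted most heavily. The resulting vectors could then be passed to a classifier such as logistic regression.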

Finding Warning Markers: Leveraging Natural Language Processing and Machine Learning Technologies to Detect Risk of School Violence (Preprint)

10.2196/preprints.15584 ◽

2019 ◽

Author(s):

Yizhao Ni ◽

Drew Barzman ◽

Alycia Bachtel ◽

Marcus Griffey ◽

Alexander Osborn ◽

...

Keyword(s):

Machine Learning ◽

Risk Assessment ◽

Feature Selection ◽

Natural Language Processing ◽

Protective Factors ◽

School Violence ◽

Language Processing ◽

Predictive Value ◽

Learning Technologies ◽

Linguistic Features

BACKGROUND School violence has a far-reaching effect, impacting the entire school population including staff, students and their families. Among youth attending the most violent schools, studies have reported higher dropout rates, poor school attendance, and poor scholastic achievement. It was noted that the largest crime-prevention results occurred when youth at elevated risk were given an individualized prevention program. However, much work is needed to establish an effective approach to identify at-risk subjects. OBJECTIVE In our earlier research, we developed a standardized risk assessment program to interview subjects, identify risk and protective factors, and evaluate risk for school violence. This study focused on developing natural language processing (NLP) and machine learning technologies to automate the risk assessment process. METHODS We prospectively recruited 131 students with behavioral concerns from 89 schools between 05/01/2015 and 04/30/2018. The subjects were interviewed with three innovative risk assessment scales, and their risk of violence was determined by pediatric psychiatrists based on clinical judgment. Leveraging NLP technologies, different types of linguistic features were extracted from the interview content. Machine learning classifiers were then applied to predict risk of school violence for individual subjects. A two-stage feature selection was implemented to identify violence-related predictors. The performance was validated on the psychiatrist-generated reference standard of risk levels, where positive predictive value (PPV), sensitivity (SEN), negative predictive value (NPV), specificity (SPEC) and area under the ROC curve (AUC) were assessed. RESULTS Compared to subjects' demographics and socioeconomic information, use of linguistic features significantly improved classifiers' predictive performance (P<0.01). The best-performing classifier with n-gram features achieved 86.5%/86.5%/85.7%/85.7%/94.0% (PPV/SEN/NPV/SPEC/AUC) on the cross-validation set and 83.3%/93.8%/91.7%/78.6%/94.6% (PPV/SEN/NPV/SPEC/AUC) on the test data. The feature selection process identified a set of predictors covering the discussion of subjects' thoughts, perspectives, behaviors, individual characteristics, peers and family dynamics, and protective factors. CONCLUSIONS By analyzing the content from subject interviews, the NLP and machine learning algorithms showed good capacity for detecting risk of school violence. The feature selection uncovered multiple warning markers that could deliver useful clinical insights to assist in personalizing interventions. Consequently, the developed approach offered the promise of an end-to-end computerized screening service for preventing school violence.
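
For readers unfamiliar with the four count-based metrics reported above, the sketch below shows how PPV, SEN, NPV and SPEC follow from a binary confusion matrix. The function name `screening_metrics` is hypothetical, and the counts are illustrative values chosen only because they round to the test-set percentages quoted in the abstract; the study's actual counts are not given there. AUC is omitted because it needs ranked classifier scores rather than counts.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute count-based screening metrics from a binary confusion matrix."""
    return {
        "PPV": tp / (tp + fp),   # positive predictive value (precision)
        "SEN": tp / (tp + fn),   # sensitivity (recall)
        "NPV": tn / (tn + fn),   # negative predictive value
        "SPEC": tn / (tn + fp),  # specificity
    }


# Hypothetical counts, not taken from the study.
m = screening_metrics(tp=15, fp=3, tn=11, fn=1)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, PPV is 15/18, SEN is 15/16, NPV is 11/12 and SPEC is 11/14, which shows how a classifier can combine high sensitivity with noticeably lower specificity.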

Sentiment Analysis for Social Media using SVM Classifier of Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1107.0789s419 ◽

2019 ◽

Vol 8 (9S4) ◽

pp. 39-47

Keyword(s):

Machine Learning ◽

Social Media ◽

Feature Selection ◽

Sentiment Analysis ◽

Language Processing ◽

Cuckoo Search ◽

Support Vector ◽

Svm Classifier ◽

Feature Selection Technique ◽

Performance Factors

Sentiment analysis is an area of natural language processing (NLP) and machine learning in which text is categorized into predefined classes, i.e., positive and negative. As the internet and social media both grow day by day, they now generate far more customer feedback than before. Text generated through social media, blogs, posts, product reviews, etc. has become a best-suited source of consumer sentiment, giving a good picture of opinion about a particular product. Features are an important input to the classification task: the better the features are optimized, the more accurate the results. Therefore, this research paper proposes a hybrid feature selection technique that combines particle swarm optimization (PSO) and cuckoo search. Due to the subjective nature of social media reviews, the hybrid feature selection technique outperforms the traditional technique. Performance factors such as F-measure, recall, precision, and accuracy are tested on a Twitter dataset using a Support Vector Machine (SVM) classifier and compared with a convolutional neural network. The experimental results, evaluated on different parameters, show that the proposed work outperforms the existing work.

Automated Identification of Substantial Changes in Construction Projects of Airport Improvement Program: Machine Learning and Natural Language Processing Comparative Analysis

Journal of Management in Engineering ◽

10.1061/(asce)me.1943-5479.0000959 ◽

2021 ◽

Vol 37 (6) ◽

pp. 04021062

Author(s):

Ramy Khalef ◽

Islam H. El-adaway

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Comparative Analysis ◽

Natural Language ◽

Language Processing ◽

Construction Projects ◽

Automated Identification ◽

Improvement Program

An Extensive Text Mining Study for the Turkish Language

Advances in Business Information Systems and Analytics - Natural Language Processing for Global and Local Business ◽

10.4018/978-1-7998-4240-8.ch012 ◽

2021 ◽

pp. 272-306

Author(s):

Durmuş Özkan Şahin ◽

Erdal Kılıç

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Text Mining ◽

Language Processing ◽

Information Gain ◽

Learning Algorithms ◽

Feature Selection Method ◽

Machine Learning Algorithms ◽

Classification Algorithms ◽

Chi Square

In this study, the authors give both theoretical and experimental information about text mining, which is one of the natural language processing topics. Three text mining problems, namely news classification, sentiment analysis, and author recognition, are discussed for Turkish. The authors aim to reduce the running time and increase the performance of machine learning algorithms. Four machine learning algorithms and two feature selection metrics are used to solve these text classification problems. The classification algorithms are random forest (RF), logistic regression (LR), naive Bayes (NB), and sequential minimal optimization (SMO). The chi-square and information gain metrics are used as the feature selection methods. The highest classification performance achieved in this study is 0.895 according to the F-measure metric. This result is obtained by using the SMO classifier and the information gain metric for news classification. This study is important in terms of comparing the performances of classification algorithms and feature selection methods.
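
To make the information gain metric named above concrete, here is a minimal sketch that scores a single binary term-presence feature against document class labels. It uses the standard entropy-based definition; the function names, labels and feature vectors are hypothetical toy values and are not drawn from the Turkish datasets used in the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(has_term, labels):
    """Reduction in label entropy from knowing whether a term is present."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(has_term, labels) if x == value]
        if subset:  # skip an empty split to avoid dividing by zero
            gain -= (len(subset) / total) * entropy(subset)
    return gain


# Toy example: four documents, two classes (made-up data).
labels = ["sports", "sports", "politics", "politics"]
perfect = [True, True, False, False]   # term appears only in sports documents
useless = [True, False, True, False]   # term is spread evenly across classes
```

In a feature selection step, each vocabulary term would be scored this way and only the top-ranked terms kept, which shrinks the input and so tends to shorten training time.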

Automated Identification of Disaster News for Crisis Management using Machine Learning and Natural Language Processing

2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) ◽

10.1109/icesc48915.2020.9156031 ◽

2020 ◽

Author(s):

Jayashree Domala ◽

Manmohan Dogra ◽

Vinit Masrani ◽

Dwayne Fernandes ◽

Kevin D'souza ◽

...

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Crisis Management ◽

Language Processing ◽

Automated Identification

Neural Feature Selection for Learning to Rank

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/978-3-030-72240-1_34 ◽

2021 ◽

pp. 342-349

Author(s):

Alberto Purpura ◽

Karolina Buchner ◽

Gianmaria Silvello ◽

Gian Antonio Susto

Keyword(s):

Machine Learning ◽

Feature Selection ◽

System Performance ◽

Large Scale ◽

Learning To Rank ◽

Research Area ◽

Model Complexity ◽

Learning Models ◽

Model Size ◽

Machine Learning Models

LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning models are employed to rank a set of items. In the past few years, neural LETOR approaches have become a competitive alternative to traditional ones like LambdaMART. However, the performance of neural architectures has grown in proportion to their complexity and size. This can be an obstacle to their adoption in large-scale search systems, where model size impacts latency and update time. For this reason, we propose an architecture-agnostic approach based on a neural LETOR model to reduce the size of its input by up to 60% without affecting the system performance. This approach also makes it possible to reduce a LETOR model's complexity and, therefore, its training and inference time by up to 50%.

Genome-Scale Metabolic Modeling of the Human Microbiome in the Era of Personalized Medicine

Annual Review of Microbiology ◽

10.1146/annurev-micro-060221-012134 ◽

2021 ◽

Vol 75 (1) ◽

Author(s):

Almut Heinken ◽

Arianna Basile ◽

Johannes Hertel ◽

Cyrille Thinnes ◽

Ines Thiele

Keyword(s):

Human Microbiome ◽

Metabolic Modeling ◽

Therapeutic Interventions ◽

Annual Review ◽

Publication Date ◽

Multivariate Statistical ◽

Disease Etiology ◽

Microbiome Composition ◽

Complementary Approach ◽

And Function

The human microbiome plays an important role in human health and disease. Meta-omics analyses provide indispensable data for linking changes in microbiome composition and function to disease etiology. Yet, the lack of a mechanistic understanding of, e.g., microbiome-metabolome links hampers the translation of these findings into effective, novel therapeutics. Here, we propose metabolic modeling of microbial communities through constraint-based reconstruction and analysis (COBRA) as a complementary approach to meta-omics analyses. First, we highlight the importance of microbial metabolism in cardiometabolic diseases, inflammatory bowel disease, colorectal cancer, Alzheimer disease, and Parkinson disease. Next, we demonstrate that microbial community modeling can stratify patients and controls, mechanistically link microbes with fecal metabolites altered in disease, and identify host pathways affected by the microbiome. Finally, we outline our vision for COBRA modeling combined with meta-omics analyses and multivariate statistical analyses to inform and guide clinical trials, yield testable hypotheses, and ultimately propose novel dietary and therapeutic interventions. Expected final online publication date for the Annual Review of Microbiology, Volume 75 is October 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
