Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge

Abstract Background: The composition of microbial communities can be location specific, and differences in taxon abundance between locations could help to unravel city-specific signatures and accurately predict sample origins. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world, and from samples from another 8 cities, were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. Results: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Across the three machine learning methods, average error rates of 11.6% and 30.0% were obtained for the main and mystery datasets, respectively. When samples from the main dataset were used to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples were correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The results of ANCOM, combined with the importance scores from the Random Forest (RF), indicated that the common “family” and “order” features of the main dataset and the common “order” features of the mystery dataset provided the most efficient information for prediction. Conclusions: The results of the classification suggested that the composition of the microbiomes was distinctive across the cities, which was also supported by the results from ANCOM and the importance scores from the RF.
The analysis used in this study can be of great help in the field of forensic science for efficiently predicting the origin of samples. The accuracy of the prediction could be improved with more samples and greater sequencing depth.

Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge

Biology Direct ◽

10.1186/s13062-020-00284-1 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Runzhi Zhang ◽

Alejandro R. Walker ◽

Susmita Datta

Keyword(s):

Machine Learning ◽

Error Rates ◽

Average Error ◽

Importance Score ◽

Learning Methods ◽

Principal Coordinates Analysis ◽

Principal Coordinates ◽

Total Variability ◽

The Common ◽

Feature Selecting

Abstract Background: The composition of microbial communities can be location specific, and differences in taxon abundance between locations could help to unravel city-specific signatures and accurately predict sample origins. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world, and from samples from another 8 cities, were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. Results: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Across the three machine learning methods, average error rates of 11.93% and 30.37% were obtained for the main and mystery datasets, respectively. When samples from the main dataset were used to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples were correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The results of ANCOM, combined with the importance scores from the Random Forest (RF), indicated that the common “family” and “order” features of the main dataset and the common “order” features of the mystery dataset provided the most efficient information for prediction. Conclusions: The results of the classification suggested that the composition of the microbiomes was distinctive across the cities and could be used to identify sample origins. This was also supported by the results from ANCOM and the importance scores from the RF.
In addition, the accuracy of the prediction could be improved with more samples and greater sequencing depth.
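
The statement that the first two PCoA axes explain nearly 60% of the total variability can be read as a ratio of eigenvalues. The sketch below is only an illustration of that arithmetic: the eigenvalues are hypothetical (not taken from the paper), the function name `explained_fraction` is invented here, and dropping negative eigenvalues is one common convention assumed for simplicity.

```python
def explained_fraction(eigenvalues, n_axes=2):
    """Fraction of total variability captured by the first n_axes PCoA axes."""
    # Assumption: only positive eigenvalues are treated as real axes;
    # negative ones (possible with non-Euclidean dissimilarities) are dropped.
    positive = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        raise ValueError("no positive eigenvalues")
    return sum(positive[:n_axes]) / total


# Hypothetical eigenvalues, chosen only so the first two axes carry 60%.
example_eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.9, 0.6]
fraction = explained_fraction(example_eigenvalues)
print(f"{fraction:.2f}")
```

With these made-up values the first two axes contribute 6.0 of a total of 10.0, so the remaining 40% of the variability is spread over the later axes, which is consistent with many cities overlapping in a two-dimensional plot.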

Feature Selection and Machine Learning Methods for Optimal Identification and Prediction of Subtypes in Parkinson's Disease

Computer Methods and Programs in Biomedicine ◽

10.1016/j.cmpb.2021.106131 ◽

2021 ◽

pp. 106131

Author(s):

Mohammad R. Salmanpour ◽

Mojtaba Shamsaei ◽

Arman Rahmim

Keyword(s):

Machine Learning ◽

Parkinson's Disease ◽

Feature Selection ◽

Learning Methods ◽

Machine Learning Methods

Improved Permeability Prediction of Porous Media by Feature Selection and Machine Learning Methods Comparison

Journal of Computing in Civil Engineering ◽

10.1061/(asce)cp.1943-5487.0000983 ◽

2022 ◽

Vol 36 (2) ◽

Author(s):

J. W. Tian ◽

Chongchong Qi ◽

Kang Peng ◽

Yingfeng Sun ◽

Zaher Mundher Yaseen

Keyword(s):

Machine Learning ◽

Porous Media ◽

Feature Selection ◽

Learning Methods ◽

Methods Comparison ◽

Machine Learning Methods ◽

Permeability Prediction

Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization

Procedia Technology ◽

10.1016/j.protcy.2013.12.254 ◽

2013 ◽

Vol 11 ◽

pp. 748-754 ◽

Cited By ~ 6

Author(s):

Hamood Alshalabi ◽

Sabrina Tiun ◽

Nazlia Omar ◽

Mohammed Albared

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Text Categorization ◽

Learning Methods ◽

Machine Learning Methods

Laser-induced breakdown spectroscopy for the classification of wood materials using machine learning methods combined with feature selection

Plasma Science and Technology ◽

10.1088/2058-6272/abf1ac ◽

2021 ◽

Author(s):

Xutai Cui ◽

Qianqian Wang ◽

Kai Wei ◽

Geer Teng ◽

Xiangjun Xu

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Laser Induced Breakdown Spectroscopy ◽

Learning Methods ◽

Breakdown Spectroscopy ◽

Machine Learning Methods ◽

Laser Induced Breakdown

Oral cancer prognosis based on clinicopathologic and genomic markers using a hybrid of feature selection and machine learning methods

BMC Bioinformatics ◽

10.1186/1471-2105-14-170 ◽

2013 ◽

Vol 14 (1) ◽

Cited By ~ 39

Author(s):

Siow-Wee Chang ◽

Sameem Abdul-Kareem ◽

Amir Feisal Merican ◽

Rosnah Binti Zain

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Oral Cancer ◽

Cancer Prognosis ◽

Learning Methods ◽

Machine Learning Methods ◽

Genomic Markers

Up-to-Date Feature Selection Methods for Scalable and Efficient Machine Learning

Efficiency and Scalability Methods for Computational Intellect ◽

10.4018/978-1-4666-3942-3.ch001 ◽

2013 ◽

pp. 1-26 ◽

Cited By ~ 2

Author(s):

Amparo Alonso-Betanzos ◽

Verónica Bolón-Canedo ◽

Diego Fernández-Francos ◽

Iago Porto-Díaz ◽

Noelia Sánchez-Maroño

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Real World ◽

Large Scale ◽

High Dimensionality ◽

Selection Methods ◽

Learning Methods ◽

Large Databases ◽

Efficient Machine ◽

Processing Techniques

With the advent of high dimensionality, machine learning researchers are now interested not only in accuracy, but also in scalability of algorithms. When dealing with large databases, pre-processing techniques are required to reduce input dimensionality and machine learning can take advantage of feature selection, which consists of selecting the relevant features and discarding irrelevant ones with a minimum degradation in performance. In this chapter, we will review the most up-to-date feature selection methods, focusing on their scalability properties. Moreover, we will show how these learning methods are enhanced when applied to large scale datasets and, finally, some examples of the application of feature selection in real world databases will be shown.
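
As a rough illustration of selecting relevant features and discarding irrelevant ones, the sketch below implements a simple filter: rank each feature by the absolute Pearson correlation with the label and keep the top k. This is a generic technique chosen here for brevity, not one of the methods reviewed in the chapter; the function names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def select_top_k(rows, labels, k):
    """Return the indices of the k features most correlated with the labels."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Toy data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is only weakly related to the label.
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.8],
]
labels = [0, 0, 1, 1]
selected = select_top_k(rows, labels, k=1)
print(selected)
```

Each feature is scored independently, so the cost grows linearly with the number of features, which is one reason filter methods of this kind are attractive when dimensionality is high.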

Gene Identification from Microarray Data for Diagnosis of Acute Myeloid and Lymphoblastic Leukemia Using a Sparse Gene Selection Method

Iranian Journal of Pediatric Hematology & Oncology ◽

10.18502/ijpho.v11i2.5838 ◽

2021 ◽

Author(s):

Razieh Sheikhpour ◽

Roohallah Fazli ◽

Sanaz Mehrabani

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Microarray Data ◽

Lymphoblastic Leukemia ◽

Feature Selection Method ◽

Selection Method ◽

Learning Methods ◽

Machine Learning Methods ◽

Acute Myeloid ◽

Sparse Feature Selection

Background: Microarray experiments can simultaneously measure the expression of thousands of genes, and identifying candidate genes from microarray data is important for cancer diagnosis. This study aimed to identify genes for the diagnosis of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) using a sparse feature selection method. Materials and Methods: In this descriptive study, microarray expression data for 7129 genes from 25 patients with AML and 47 patients with ALL were used. The important genes were identified using a sparse feature selection method, and AML and ALL tissues were then classified with machine learning methods such as support vector machine (SVM), Gaussian kernel density estimation based classifier (GKDEC), k-nearest neighbor (KNN), and linear discriminant classifier (LDC). Results: ALL and AML were diagnosed with 100% accuracy by the GKDEC and LDC classifiers using 8 genes selected by the sparse feature selection method. Moreover, the KNN classifier using 6 genes and the SVM classifier using 7 genes diagnosed AML and ALL with accuracies of 91.18% and 94.12%, respectively. The gene with the description “Paired-box protein PAX2 (PAX2) gene, exon 11 and complete CDs” was identified as the most important gene in the diagnosis of ALL and AML. Conclusion: The experimental results showed that AML and ALL can be diagnosed with high accuracy using sparse feature selection and machine learning methods. Investigating the expression of the genes selected in this study may be helpful in the diagnosis of ALL and AML.
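
The abstract does not say which sparse feature selection algorithm was used. One common route to sparsity is an L1 penalty, whose proximal step is soft-thresholding: small weights are set exactly to zero and the rest are shrunk. The sketch below shows only that step, as an assumption-laden illustration; the gene names, weights, and penalty value are hypothetical and unrelated to the study's data.

```python
def soft_threshold(weight, penalty):
    """Shrink a weight toward zero by `penalty`; weights within it become 0."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical per-gene weights from some linear model (illustration only).
weights = {"gene_a": 0.90, "gene_b": -0.05, "gene_c": 0.02, "gene_d": -0.60}
penalty = 0.10

shrunk = {gene: soft_threshold(w, penalty) for gene, w in weights.items()}
# Genes whose weight survives the shrinkage are the "selected" features.
selected_genes = [gene for gene, w in shrunk.items() if w != 0.0]
print(selected_genes)
```

Raising the penalty zeroes out more weights, so fewer genes are retained; this is the general mechanism by which a sparse method can reduce thousands of candidate genes to a handful.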
