Unraveling City-Specific Microbial Signatures and Identifying Sample Origins for the Data From CAMDA 2020 Metagenomic Geolocation Challenge

2021 ◽  
Vol 12 ◽  
Author(s):  
Runzhi Zhang ◽  
Dorothy Ellis ◽  
Alejandro R. Walker ◽  
Susmita Datta

The composition of microbial communities has been known to be location-specific. Investigating the microbial composition across different cities enables us to unravel city-specific microbial signatures and further predict the origin of unknown samples. As part of the CAMDA 2020 Metagenomic Geolocation Challenge, MetaSUB provided whole genome shotgun (WGS) metagenomics data from samples across 28 cities, along with non-microbial city data for 23 of these cities. In our solution to this challenge, we implemented feature selection, normalization, clustering and three machine learning methods to classify the cities based on their microbial compositions. Of the three methods, the multilayer perceptron obtained the best performance, with an error rate of 19.60% based on whether the correct city received the highest or second-highest number of votes for the test data contained in the main dataset. We then trained the model to predict the origins of samples from the mystery dataset by including these samples with the additional group label of “mystery.” The mystery dataset comprised samples collected from a subset of the cities in the main dataset as well as samples collected from new cities. For samples from cities that belonged to the main dataset, error rates ranged from 18.18% to 72.7%. For samples from new cities that did not belong to the main dataset, 57.7% of the test samples could be correctly labeled as “mystery” samples. Furthermore, we also predicted some of the non-microbial features for the mystery samples from the cities that did not belong to the main dataset, using a multi-output multilayer perceptron algorithm, to draw inferences and narrow the range of possible sample origins.
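A minimal sketch of the kind of pipeline described above, assuming a samples-by-taxa abundance matrix with city labels: normalization, feature selection, an MLP classifier, and the top-2 voting criterion for the error rate. The synthetic data, feature counts, and hyperparameters below are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the MetaSUB abundance table: 28 "cities", 600 taxa features.
X, y = make_classification(n_samples=1400, n_features=600, n_informative=60,
                           n_classes=28, n_clusters_per_class=1, random_state=0)

clf = Pipeline([
    ("scale", StandardScaler()),                # normalization step
    ("select", SelectKBest(f_classif, k=200)),  # keep the most informative taxa
    ("mlp", MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)

# Error rate under the criterion above: a test sample counts as correct if the
# true city receives the highest or second-highest predicted probability ("votes").
proba = clf.predict_proba(X_te)
top2 = np.argsort(proba, axis=1)[:, -2:]
class_index = {c: i for i, c in enumerate(clf.classes_)}
hits = [class_index[c] in row for c, row in zip(y_te, top2)]
print("top-2 error rate: %.2f%%" % (100 * (1 - np.mean(hits))))
```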

2020 ◽  
Author(s):  
Runzhi Zhang ◽  
Alejandro R. Walker ◽  
Susmita Datta

Abstract
Background: The composition of microbial communities can be location-specific, and the differing abundances of taxa across locations could help us unravel city-specific signatures and accurately predict sample origin locations. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world, together with samples from another 8 cities, were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets.
Results: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. The three machine learning methods obtained average error rates of 11.6% and 30.0% for the main and mystery datasets, respectively. Using the samples from the main dataset to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples could be correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were clearly separated in the PCoA plots. The results of ANCOM, combined with importance scores from the Random Forest, indicated that the common “family” and “order” levels of the main dataset and the common “order” level of the mystery dataset provided the most useful information for prediction.
Conclusions: The classification results suggest that the composition of the microbiomes was distinctive across the cities, which was also supported by the ANCOM results and the Random Forest importance scores. The analysis used in this study can be of great help in forensic science for efficiently predicting the origin of samples, and the accuracy of the prediction could be improved with more samples and greater sequencing depth.
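As one concrete piece of the analysis above, here is a small PCoA sketch on Bray-Curtis dissimilarities, assuming a samples-by-taxa count table; the random counts merely stand in for the MetaSUB profiles, and the implementation is plain classical multidimensional scaling rather than the authors' exact tooling.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(100, 300)).astype(float)  # 100 samples, 300 taxa (stand-in)

# Bray-Curtis dissimilarities between samples.
D = squareform(pdist(counts, metric="braycurtis"))

# Classical PCoA: double-center the squared distance matrix and eigendecompose.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Sample coordinates on the first two axes and the variability they explain.
coords = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))
explained = eigvals[:2] / eigvals[eigvals > 0].sum()
print("PCoA coordinates shape:", coords.shape)
print("axes 1-2 explain %.1f%% of the variability" % (100 * explained.sum()))
```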


Author(s):  
Annisa Nurul Puteri ◽  
Arizal Arizal ◽  
Andini Dani Achmad

Pre-processing is an important stage in data classification. Pre-processing prepares the data so that the applied classification technique produces high-quality, accurate patterns. One data pre-processing technique that is often used to determine the most influential attributes of a dataset is feature selection. The data used in this study is a customer data collection from a Portuguese banking institution in the UCI Machine Learning Repository. This study combines a correlation-based feature selection method with a Multilayer Perceptron Neural Network classification method. The aim of this study is to identify the most relevant and influential attributes of the dataset for predicting potential customers for term deposit offers. The study yields the 10 top-ranked attributes: duration, previous, contact, cons.price.idx, month, cons.cof.idx, age, job, marital, and housing. Classification using the selected attributes achieves a highest accuracy of 80.5% and a lowest accuracy of 79.1%.
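A rough sketch of the approach described above, under the assumption that "correlation-based" means ranking attributes by their absolute correlation with the target before training the MLP; the DataFrame, column names, and network size are synthetic stand-ins for the UCI bank-marketing data rather than the study's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(2000, 16)),
                  columns=[f"attr_{i}" for i in range(16)])  # stand-in attributes
y = (df["attr_0"] + 0.5 * df["attr_1"] + rng.normal(size=2000) > 0).astype(int)

# Rank attributes by absolute (point-biserial) correlation with the target
# and keep the ten highest-ranked ones.
corr = df.apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
top10 = corr.sort_values(ascending=False).head(10).index.tolist()
print("selected attributes:", top10)

X_tr, X_te, y_tr, y_te = train_test_split(df[top10], y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))
clf.fit(X_tr, y_tr)
print("accuracy on held-out data: %.3f" % clf.score(X_te, y_te))
```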


2020 ◽  
Vol 10 (19) ◽  
pp. 6896
Author(s):  
Paloma Tirado-Martin ◽  
Judith Liu-Jimenez ◽  
Jorge Sanchez-Casanova ◽  
Raul Sanchez-Reillo

Currently, machine learning techniques are successfully applied in biometrics, and in Electrocardiogram (ECG) biometrics specifically. However, few works deal with different physiological states in the user, which can produce significant heart rate variations, a key issue when working with ECG biometrics. Machine learning techniques simplify the feature extraction process, which can sometimes be reduced to a fixed segmentation. The database used here includes visits taken on two different days and under three different conditions (sitting down, standing up after exercise), which is not common in current public databases. These characteristics allow studying differences among users under different scenarios, which may affect the pattern in the acquired data. A Multilayer Perceptron (MLP) is used as a classifier to form a baseline, as it has a simple structure that has provided good results in the state of the art. This work studies its behavior in ECG verification using QRS complexes, finding its best hyperparameter configuration through tuning. The final performance is calculated considering different visits for enrolment and verification. Differentiation of the QRS complexes is also tested, as it is already required for detection, showing that a simple first differentiation gives good results compared to similar state-of-the-art works. Moreover, it also reduces the computational cost by avoiding complex transformations and using only one type of signal. When applying different numbers of complexes, the best results are obtained with 100 and 187 complexes at enrolment, yielding Equal Error Rates (EER) that range between 2.79–4.95% and 2.69–4.71%, respectively.
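The sketch below illustrates the verification setup in broad strokes: first-differenced QRS segments fed to an MLP, with the Equal Error Rate read off the ROC curve. The random segments, labels, and hyperparameters are placeholders, not the paper's database or tuned configuration.

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, seg_len = 2000, 100                    # 2000 labelled QRS segments of 100 samples each
segments = rng.normal(size=(n, seg_len))  # placeholder ECG segments
labels = rng.integers(0, 2, size=n)       # 1 = genuine claim, 0 = impostor claim

# Simple first differentiation of each complex, used instead of heavier transforms.
features = np.diff(segments, axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, stratify=labels, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp.fit(X_tr, y_tr)

# EER: the operating point where the false-accept and false-reject rates are equal.
scores = mlp.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fpr - fnr))]
print("EER: %.2f%%" % (100 * eer))
```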


2021 ◽  
Vol 17 (1) ◽  
pp. 19-24
Author(s):  
Siti Masturoh ◽  
Fitra Septia Nugraha ◽  
Siti Nurlela ◽  
M. Rangga Ramadhan Saelan ◽  
Daniati Uki Eka Saputri ◽  
...  

Telemarketing is a promotion method considered effective for promoting a product to consumers by telephone; in addition, telemarketing is easier to accept because of its direct nature of offering products to consumers. Telemarketing is also considered to help increase a company's revenue. Predicting the success of a bank's telemarketing campaigns can be addressed with machine learning techniques. The available historical data is a bank dataset of 45,211 instances with 17 features, analyzed using the multilayer perceptron (MLP) algorithm with resampling. Resampling is used to balance the imbalanced data, resulting in an accuracy of 90.18% and a ROC value of 0.89. Without resampling, the multilayer perceptron (MLP) algorithm achieves an accuracy of 88.6% and a ROC value of 0.88. Using resampled data is therefore more effective and results in higher accuracy.
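A minimal sketch of balancing the training data by resampling before fitting an MLP, in the spirit of the study above; the synthetic imbalanced data, the choice of simple minority-class upsampling, and the network size are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

# Imbalanced stand-in for the 45,211-instance, 17-feature bank dataset.
X, y = make_classification(n_samples=5000, n_features=17, weights=[0.88, 0.12],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resampling step: upsample the minority class in the training split only.
majority, minority = X_tr[y_tr == 0], X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp.fit(X_bal, y_bal)
proba = mlp.predict_proba(X_te)[:, 1]
print("accuracy: %.3f  ROC AUC: %.3f"
      % (accuracy_score(y_te, mlp.predict(X_te)), roc_auc_score(y_te, proba)))
```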


2019 ◽  
Vol 8 (2) ◽  
pp. 3316-3322

Huge amounts of healthcare data are produced every day by the various health care sectors. The accumulated data can be effectively analyzed to identify people's risk of chronic diseases. The process of predicting the presence or absence of disease, and of diagnosing various diseases, using historical medical data is known as health care analytics. Health care analytics will improve patient care and also support the practice of medical practitioners. Feature selection is considered a core aspect of machine learning that contributes greatly to the performance of the machine learning model. In this paper, symmetry-based feature subset selection is proposed to select the optimal features from the health care data that contribute to the prediction outcome. The multilayer perceptron (MLP) algorithm is used as a classifier, predicting the outcome using the features selected by the symmetry-based feature subset selection technique. The chronic disease datasets Diabetes, Cancer, Breast Cancer, and Heart Disease, obtained from the UCI repository, are used to conduct the experiment. The experimental results demonstrate that the proposed hybrid combination of the feature selection technique and the multilayer perceptron outperforms existing approaches in accuracy.
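If "symmetry-based" refers to symmetrical uncertainty (an entropy-normalized mutual information score), the selection step could look roughly like the sketch below; this interpretation, the dataset choice, and the number of retained features are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

def symmetrical_uncertainty(x_discrete, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    mi = mutual_info_score(x_discrete, y)
    hx = entropy(np.bincount(x_discrete))
    hy = entropy(np.bincount(y))
    return 2 * mi / (hx + hy) if (hx + hy) > 0 else 0.0

# One of the UCI-style chronic-disease datasets mentioned above (Breast Cancer).
X, y = load_breast_cancer(return_X_y=True)

# Discretize continuous features so entropies and mutual information are well defined.
disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
X_disc = disc.fit_transform(X).astype(int)

# Score every feature by symmetrical uncertainty and keep the 10 highest-ranked ones.
su = np.array([symmetrical_uncertainty(X_disc[:, j], y) for j in range(X.shape[1])])
keep = np.argsort(su)[::-1][:10]

X_tr, X_te, y_tr, y_te = train_test_split(X[:, keep], y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))
clf.fit(X_tr, y_tr)
print("accuracy with selected features: %.3f" % clf.score(X_te, y_te))
```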


Author(s):  
Timnit Gebru

This chapter discusses the role of race and gender in artificial intelligence (AI). The rapid permeation of AI into society has not been accompanied by a thorough investigation of the sociopolitical issues that cause certain groups of people to be harmed rather than advantaged by it. For instance, recent studies have shown that commercial automated facial analysis systems have much higher error rates for dark-skinned women, while having minimal errors on light-skinned men. Moreover, a 2016 ProPublica investigation uncovered that machine learning–based tools that assess crime recidivism rates in the United States are biased against African Americans. Other studies show that natural language–processing tools trained on news articles exhibit societal biases. While many technical solutions have been proposed to alleviate bias in machine learning systems, a holistic and multifaceted approach must be taken. This includes standardization bodies determining what types of systems can be used in which scenarios, making sure that automated decision tools are created by people from diverse backgrounds, and understanding the historical and political factors that disadvantage certain groups who are subjected to these tools.


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.
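A compact sketch of the post-processing idea above: fit a logistic regression, read feature importance from its coefficients, and check stability by refitting on bootstrap resamples. The synthetic data stands in for the IBM HR dataset, and the proposed "max-out" selection step itself is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced stand-in roughly the size of the IBM HR attrition data.
X, y = make_classification(n_samples=1470, n_features=12, weights=[0.84, 0.16],
                           random_state=0)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1-score: %.3f" % f1_score(y_te, model.predict(X_te)))

# Bootstrap the training data and refit to see how much each coefficient moves;
# a small standard deviation suggests a stable, better-generalizing model.
coefs = []
for b in range(200):
    Xb, yb = resample(X_tr, y_tr, random_state=b)
    coefs.append(LogisticRegression(max_iter=1000).fit(Xb, yb).coef_.ravel())
coefs = np.array(coefs)
print("coefficient means:", np.round(coefs.mean(axis=0), 3))
print("coefficient stds: ", np.round(coefs.std(axis=0), 3))
```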

