Same-Species Contamination Detection with Variant Calling Information from Next Generation Sequencing

2019 ◽  
Author(s):  
Tao Jiang ◽  
Martin Buchkovich ◽  
Alison Motsinger-Reif

Abstract

Motivation: Same-species contamination detection is an important quality control step in genetic data analysis. Compared with widely discussed cross-species contamination, same-species contamination is more challenging to detect, and there is a scarcity of methods to detect and correct for this quality control issue. Same-species contamination may be due to contamination by lab technicians or samples from other contributors. Here, we introduce a novel machine learning algorithm to detect same-species contamination in next generation sequencing data using support vector machines. Our approach uniquely detects such contamination using variant calling information stored in variant call format (VCF) files (either DNA or RNA), and importantly can differentiate between same-species contamination and mixtures of tumor and normal cells.

Methods: In the first stage of our approach, a change-point detection method is used to identify copy number variations or copy number aberrations (CNVs or CNAs) for filtering prior to testing for contamination. Next, single nucleotide polymorphism (SNP) data is used to test for same-species contamination using a support vector machine model. Based on the assumption that alternative allele frequencies in next generation sequencing follow the beta-binomial distribution, the deviation parameter ρ is estimated by maximum likelihood. All features of a radial basis function (RBF) kernel support vector machine (SVM) are generated using either publicly available or private training data. Lastly, the generated SVM is applied to the test data to detect contamination. If training data is not available, a default RBF kernel SVM model is used.

Results: We demonstrate the potential of our approach using simulation experiments, creating datasets with varying levels of contamination. The datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generated VCF files using variants identified in these data, and then evaluated the power and false positive rate of our approach to detect same-species contamination. Our simulation experiments show that our method can detect levels of contamination as low as 5% with reasonable false positive rates. On real data, the method achieves sensitivity above 99.99% and specificity of 90.24%, even in the presence of DNA degradation that has similar features to contaminated samples. Additionally, the approach can distinguish a mixture of tumor and normal cells from contamination. We provide an R software implementation of our approach as the defcon() function in the R package vanquish (Variant Quality Investigation Helper), available on CRAN.
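
The maximum likelihood step for ρ described in the Methods can be illustrated with a small sketch. This is not the vanquish implementation: the mean/overdispersion parametrization (α = μ(1 − ρ)/ρ, β = (1 − μ)(1 − ρ)/ρ), the fixed mean μ = 0.5 for heterozygous sites, the golden-section search, and the function names `betabinom_loglik`, `estimate_rho` and `simulate_counts` are all assumptions made for illustration.

```python
import math
import random


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_loglik(rho, mu, counts):
    """Beta-binomial log-likelihood of (alt_reads, total_reads) pairs.

    Assumed parametrization: mean mu, overdispersion rho in (0, 1).
    """
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    ll = 0.0
    for k, n in counts:
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        ll += log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)
    return ll


def estimate_rho(counts, mu=0.5, lo=1e-4, hi=0.999, tol=1e-6):
    """Maximize the log-likelihood over rho by golden-section search.

    Assumes the likelihood has a single peak on (lo, hi).
    """
    phi = (math.sqrt(5) - 1) / 2
    c = hi - phi * (hi - lo)
    d = lo + phi * (hi - lo)
    fc = betabinom_loglik(c, mu, counts)
    fd = betabinom_loglik(d, mu, counts)
    while hi - lo > tol:
        if fc > fd:  # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - phi * (hi - lo)
            fc = betabinom_loglik(c, mu, counts)
        else:        # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + phi * (hi - lo)
            fd = betabinom_loglik(d, mu, counts)
    return (lo + hi) / 2


def simulate_counts(rho, mu=0.5, n_sites=500, depth=60):
    """Draw synthetic (alt_reads, depth) pairs from a beta-binomial."""
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    counts = []
    for _ in range(n_sites):
        p = random.betavariate(a, b)
        k = sum(random.random() < p for _ in range(depth))
        counts.append((k, depth))
    return counts
```

In this sketch, a larger estimated ρ means allele fractions at heterozygous sites are spread more widely around 0.5 than clean binomial sampling would predict. How that estimate enters the SVM feature set is not specified in the abstract.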

Pathology ◽  
2020 ◽  
Vol 52 ◽  
pp. S108
Author(s):  
Dylan A. Mordaunt ◽  
Julien Soubrier ◽  
Song Gao ◽  
Lesley Rawlings ◽  
Jillian Nicholl ◽  
...  

Author(s):  
Christ Memory Sitorus ◽  
Adhi Rizal ◽  
Mohamad Jajuli

Ride-hailing services are booming thanks to internet technology, which is why many people call them online transportation. The large potential growth in online transportation users also raises the risk that user satisfaction declines, so companies are improving their service, both in the application and in the service provided by partner drivers. During each trip, the online transportation application records device movement data and sends it to the server. This data set is usually called telematics data, and it can yield substantial benefits once processed. In this study, the risk of online transportation trips is predicted from telematics data using the Support Vector Machine (SVM) algorithm. Because the data is raw telematics data, it is first processed with feature engineering to obtain 51 features, and an SVM with an RBF kernel is then trained with varying values of C. For each C value, K-fold cross-validation with k = 5 is used to split the data into training and testing sets. Each trial is evaluated by accuracy, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC); the best results are obtained at C = 100 and the worst at C = 0.001.
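
The evaluation protocol described above (5-fold cross-validation, scored by AUC) can be sketched in plain Python. The SVM fit itself is left out, and the helper names `k_fold_indices` and `auc` are hypothetical. The abstract does not say how the folds were shuffled, so the seeded shuffle here is an assumption.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example outscores a random negative one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

For each candidate C, a classifier would be trained on every `train` split and scored on the matching `test` split, with the five AUC values averaged before comparing C values.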


2019 ◽  
Vol 6 (5) ◽  
pp. 190001 ◽  
Author(s):  
Katherine E. Klug ◽  
Christian M. Jennings ◽  
Nicholas Lytal ◽  
Lingling An ◽  
Jeong-Yeol Yoon

A straightforward method for classifying heavy metal ions in water is proposed using statistical classification and clustering techniques from non-specific microparticle scattering data. A set of carboxylated polystyrene microparticles of sizes 0.91, 0.75 and 0.40 µm was mixed with the solutions of nine heavy metal ions and two control cations, and scattering measurements were collected at two angles optimized for scattering from non-aggregated and aggregated particles. Classification of these observations was conducted and compared among several machine learning techniques, including linear discriminant analysis, support vector machine analysis, K-means clustering and K-medians clustering. This study found the highest classification accuracy using the linear discriminant and support vector machine analysis, each reporting high classification rates for heavy metal ions with respect to the model. This may be attributed to moderate correlation between detection angle and particle size. These classification models provide reasonable discrimination between most ion species, with the highest distinction seen for Pb(II), Cd(II), Ni(II) and Co(II), followed by Fe(II) and Fe(III), potentially due to its known sorption with carboxyl groups. The support vector machine analysis was also applied to three different mixture solutions representing leaching from pipes and mine tailings, and showed good correlation with single-species data, specifically with Pb(II) and Ni(II). With more expansive training data and further processing, this method shows promise for low-cost and portable heavy metal identification and sensing.


2019 ◽  
Vol 21 (2) ◽  
pp. 307-317 ◽  
Author(s):  
Sounak Gupta ◽  
Chad M. Vanderbilt ◽  
Paolo Cotzia ◽  
Javier A. Arias-Stella ◽  
Jason C. Chang ◽  
...  

2017 ◽  
Vol 58 (11) ◽  
pp. 2202-2209 ◽  
Author(s):  
Michael A. Iacocca ◽  
Jian Wang ◽  
Jacqueline S. Dron ◽  
John F. Robinson ◽  
Adam D. McIntyre ◽  
...  
