Correcting the Estimation of Viral Taxa Distributions in Next-Generation Sequencing Data after Applying Artificial Neural Networks

Genes ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 1755
Author(s):  
Moritz Kohls ◽  
Magdalena Kircher ◽  
Jessica Krepel ◽  
Pamela Liebig ◽  
Klaus Jung

Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. Instead of a mapping approach, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities, and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database to generate training data. The prediction accuracy of the artificial neural network to classify test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
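The correction step described in this abstract can be illustrated with a minimal sketch. It assumes that the conditional probabilities P(predicted order | true order) are estimated from a labelled auxiliary set and that the true frequencies are then recovered from the predicted frequencies with a simple expectation-maximisation update. This is a generic technique chosen for illustration, not the estimator from the paper; the function names `estimate_confusion` and `correct_frequencies` are hypothetical.

```python
def estimate_confusion(true_labels, predicted_labels, classes):
    """Row-normalised confusion matrix: cond[i][j] = P(predicted=j | true=i).

    Assumes every class occurs at least once among true_labels.
    """
    index = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    return [[n / sum(row) for n in row] for row in counts]


def correct_frequencies(cond, predicted_freq, iterations=2000):
    """Estimate true class frequencies from predicted class frequencies.

    Illustrative EM update (an assumption, not the paper's formula):
    freq[i] <- freq[i] * sum_j predicted_freq[j] * cond[i][j] / mix[j],
    where mix[j] is the predicted frequency implied by the current estimate.
    """
    k = len(cond)
    freq = [1.0 / k] * k  # start from a uniform distribution
    for _ in range(iterations):
        mix = [sum(freq[i] * cond[i][j] for i in range(k)) for j in range(k)]
        freq = [
            freq[i] * sum(
                predicted_freq[j] * cond[i][j] / mix[j]
                for j in range(k) if mix[j] > 0
            )
            for i in range(k)
        ]
    return freq


if __name__ == "__main__":
    # Hypothetical conditional probabilities for three viral orders.
    cond = [[0.8, 0.1, 0.1],
            [0.2, 0.7, 0.1],
            [0.1, 0.2, 0.7]]
    # Frequencies of the classifier's raw predictions on the test set.
    predicted = [0.55, 0.29, 0.16]
    print(correct_frequencies(cond, predicted))
```

The update keeps every frequency non-negative and the vector summing to one, which a direct matrix inversion does not guarantee when the auxiliary estimates are noisy.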

2020 ◽  
Author(s):  
Moritz Kohls ◽  
Magdalena Kircher ◽  
Jessica Krepel ◽  
Pamela Liebig ◽  
Klaus Jung

Abstract Background: Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step for comparative metagenomics. For that purpose, sequencing reads are usually classified by mapping them against a database of known viral reference genomes. This fails, however, to classify reads from novel viruses and quasispecies whose reference sequences are not yet available in public databases. Methods: In order to circumvent the problem of a mapping approach with unknown viruses, the feasibility and performance of neural networks to classify sequencing reads to taxonomic classes is studied. For that purpose, taxonomy and genome data from the NCBI database are used to sample artificial reads from known viruses with known taxonomic attribution. Based on these training data, artificial neural networks are fitted and applied to classify single viral read sequences to different taxa. Model building includes different input features derived from artificial read sequences as possible predictors which are chosen by a feature selection method. Training, validation and test data are computed from these input features. To summarise classification results, a generalised confusion matrix is proposed which lists all possible misclassification combination frequencies. Two new formulas to statistically estimate taxa frequencies are introduced for studying the overall viral composition. Results: We found that the best taxonomic level supported by the NCBI database is that of viral orders. Prediction accuracy of the fitted models is evaluated on test data and classification results are summarised in a confusion matrix, from which diagnostic measures such as sensitivity and specificity as well as positive and negative predictive values are calculated. The prediction accuracy of the artificial neural net is considerably higher than for random classification and posterior estimation of taxa frequencies is closer to the true distribution in the training data than simple classification or mapping results. Conclusions: Neural networks are helpful to classify sequencing reads into viral orders and can be used to complement the results of mapping approaches. The machine learning approach is not limited to already known viruses. In addition, statistical estimations of taxa frequencies can be used for subsequent comparative metagenomics.
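The diagnostic measures named in this abstract can be derived from a multi-class confusion matrix by treating one class as "positive" and all others as "negative". The sketch below shows that one-vs-rest calculation under the assumption that `counts[i][j]` holds the number of reads of true class i predicted as class j; the function name `one_vs_rest_measures` and the example counts are hypothetical, not taken from the paper.

```python
def one_vs_rest_measures(counts, k):
    """Sensitivity, specificity, PPV and NPV for class k against all others.

    counts[i][j] = number of items of true class i predicted as class j.
    Assumes none of the four denominators is zero.
    """
    total = sum(sum(row) for row in counts)
    tp = counts[k][k]                          # class k correctly predicted
    fn = sum(counts[k]) - tp                   # class k predicted as another class
    fp = sum(row[k] for row in counts) - tp    # other classes predicted as k
    tn = total - tp - fn - fp                  # everything else
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts for three viral orders (rows: true, columns: predicted).
    counts = [[8, 1, 1],
              [2, 7, 1],
              [1, 2, 7]]
    for k in range(len(counts)):
        print(k, one_vs_rest_measures(counts, k))
```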


2019 ◽  
Author(s):  
René Janßen ◽  
Jakob Zabel ◽  
Uwe von Lukas ◽  
Matthias Labrenz

Abstract Artificial neural networks can be trained on complex data sets to detect, predict, or model specific aspects. The aim of this study was to train an artificial neural network to support environmental monitoring efforts in case of a contamination event by detecting induced changes in the microbial communities. The neural net was trained on taxonomic cluster count tables obtained via next-generation amplicon sequencing of water column samples originating from a lab microcosm incubation experiment conducted over 140 days to determine the effects of the herbicide glyphosate on succession within brackish-water microbial communities. Glyphosate-treated assemblages were classified correctly; a subsetting approach identified the clusters primarily responsible for this, permitting the reduction of input features. This study demonstrates the potential of artificial neural networks to predict indicator species in cases of glyphosate contamination. The results could empower the development of environmental monitoring strategies with applications limited to neither glyphosate nor amplicon sequence data. Highlights: An artificial neural net was able to identify glyphosate-affected microbial community assemblages based on next-generation sequencing data. Decision-relevant taxonomic clusters can be identified by a stochastic subsetting approach. Just a fraction of the clusters present is needed for classification. Filtering of input data improves classification.


2022 ◽  
pp. 1559-1575
Author(s):  
Mário Pereira Véstias

Machine learning is the study of algorithms and models for computing systems to do tasks based on pattern identification and inference. When it is difficult or infeasible to develop an algorithm to do a particular task, machine learning algorithms can provide an output based on previous training data. A well-known machine learning model is deep learning. The most recent deep learning models are based on artificial neural networks (ANN). There exist several types of artificial neural networks including the feedforward neural network, the Kohonen self-organizing neural network, the recurrent neural network, the convolutional neural network, the modular neural network, among others. This article focuses on convolutional neural networks with a description of the model, the training and inference processes and its applicability. It will also give an overview of the most used CNN models and what to expect from the next generation of CNN models.
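As a rough illustration of the operation that gives convolutional neural networks their name, the sketch below slides a small kernel over a 2D input and applies a ReLU activation. It is a plain-Python toy under stated assumptions ("valid" padding, stride 1, a single channel, cross-correlation as used in most deep learning libraries) and is not taken from the article; the names `conv2d_valid` and `relu` are illustrative.

```python
def conv2d_valid(image, kernel):
    """Single-channel 2D cross-correlation with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise rectified linear unit: negative values become zero."""
    return [[max(0, v) for v in row] for row in feature_map]


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    kernel = [[0, 1],
              [1, 0]]
    print(relu(conv2d_valid(image, kernel)))
```

In a trained network the kernel values are learned parameters, and many such kernels are applied in parallel to produce multiple feature maps.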


2017 ◽  
Author(s):  
Sungsoo Park ◽  
Bonggun Shin ◽  
Yoonjung Choi ◽  
Kilsoo Kang ◽  
Keunsoo Kang

Abstract Motivation: Next-generation sequencing (NGS), which allows the simultaneous sequencing of billions of DNA fragments, has revolutionized how we study genomics and molecular biology by generating genome-wide molecular maps of molecules of interest. For example, an NGS-based transcriptomic assay called RNA-seq can be used to estimate the abundance of approximately 190,000 transcripts together. As the cost of next-generation sequencing sharply declines, researchers in many fields have been conducting research using NGS. The amount of information produced by NGS has made it difficult for researchers to choose the optimal set of target genes (or genomic loci). Results: We have sought to resolve this issue by developing a neural network-based feature (gene) selection algorithm called Wx. The Wx algorithm ranks genes based on the discriminative index (DI) score that represents the classification power for distinguishing given groups. With a gene list ranked by DI score, researchers can intuitively select the optimal set of genes from the highest-ranking ones. We applied the Wx algorithm to a TCGA pan-cancer gene-expression cohort to identify an optimal set of gene-expression biomarker (universal gene-expression biomarkers) candidates that can distinguish cancer samples from normal samples for 12 different types of cancer. The 14 gene-expression biomarker candidates identified by Wx were comparable to or outperformed previously reported universal gene expression biomarkers, highlighting the usefulness of the Wx algorithm for next-generation sequencing data. Thus, we anticipate that the Wx algorithm can complement current state-of-the-art analytical applications for the identification of biomarker candidates as an alternative method. Availability: https://github.com/deargen/[email protected] Supplementary information: Supplementary data are available online.


2019 ◽  
Vol 14 (1) ◽  
pp. 58-79 ◽  
Author(s):  
Gaetano Bosurgi ◽  
Orazio Pellegrino ◽  
Giuseppe Sollazzo

Artificial Neural Networks represent useful tools for several engineering issues. Although they were adopted in several pavement-engineering problems for performance evaluation, their application on pavement structural performance evaluation appears to be remarkable. It is conceivable that defining a proper Artificial Neural Network for estimating structural performance in asphalt pavements from measurements performed through quick and economic surveys produces significant savings for road agencies and improves maintenance planning. However, the architecture of such an Artificial Neural Network must be optimised, to improve the final accuracy and provide a reliable technique for enriching decision-making tools. In this paper, the influence on the final quality of different features conditioning the network architecture has been examined, for maximising the resulting quality and, consequently, the final benefits of the methodology. In particular, input factor quality (structural, traffic, climatic), “homogeneity” of training data records and the actual net topology have been investigated. Finally, these results further prove the approach efficiency, for improving Pavement Management Systems and reducing deflection survey frequency, with remarkable savings for road agencies.


2021 ◽  
Vol 12 ◽  
Author(s):  
Tihao Huang ◽  
Junqing Li ◽  
Baoxian Jia ◽  
Hongyan Sang

Copy number variation (CNV) is defined as repetitions or deletions of genomic segments of 1 Kb to 5 Mb, and is a major trigger for human disease. The high-throughput and low-cost characteristics of next-generation sequencing technology provide the possibility of the detection of CNVs in the whole genome, and also greatly improve the clinical practicability of next-generation sequencing (NGS) testing. However, current methods for the detection of CNVs are easily affected by sequencing and mapping errors, and uneven distribution of reads. In this paper, we propose an improved approach, CNV-MEANN, for the detection of CNVs, involving changing the structure of the neural network used in the MFCNV method. This method has three differences relative to the MFCNV method: (1) it utilizes a new feature, mapping quality, to replace two features in MFCNV, (2) it considers the influence of the loss categories of CNV on disease prediction, and refines the output structure, and (3) it uses a mind evolutionary algorithm to optimize the backpropagation neural network model, and calculates individual scores for each genome bin to predict CNVs. Using both simulated and real datasets, we tested the performance of CNV-MEANN and compared its performance with those of seven widely used CNV detection methods. Experimental results demonstrated that the CNV-MEANN approach outperformed other methods with respect to sensitivity, precision, and F1-score. The proposed method was able to detect many CNVs that other approaches could not, and it reduced the boundary bias. CNV-MEANN is expected to be an effective method for the analysis of changes in CNVs in the genome.


GigaScience ◽  
2021 ◽  
Vol 10 (7) ◽  
Author(s):  
Michael D Linderman ◽  
Crystal Paudyal ◽  
Musab Shakeel ◽  
William Kelley ◽  
Ali Bashir ◽  
...  

Abstract Background Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant and sample-specific biases. Results We introduce NPSV, a machine learning–based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. Conclusions Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a “black box” that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.


Author(s):  
Anna Olegovna Chupakova ◽  
Sergey Vital'evich Gudin ◽  
Renat Shamil'evich Khabibulin

The article highlights the significant growth of industrial capacity and production automation, which requires the responsible person to make effective management decisions. It outlines important achievements in applying artificial neural networks in various fields of activity and in decision support systems that analyse and process information. It reviews publications on training artificial neural networks and on their efficient application to classification, prediction and control problems. The most common neural network structures, their advantages and disadvantages, and the methods used to create training data sets are studied. A comparative analysis is carried out of various artificial neural network structures, the effectiveness of existing training methods and the prospects for their use. The most suitable neural network topology is identified for fire safety management at production facilities as part of an active decision support system. Based on the analysis, the most common and effective training methods are identified, which are appropriate for developing and training various types of neural networks. The use of the technology is justified by its potential to reduce data processing errors and the financial cost of ensuring security, and by the possibility of using neural networks to optimise decision support systems.



