Denoising of Aligned Genomic Data

Scientific Reports ◽

10.1038/s41598-019-51418-z ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Irena Fischer-Hwang ◽

Idoia Ochoa ◽

Tsachy Weissman ◽

Mikel Hernaez

Keyword(s):

State Of The Art ◽

Variant Calling ◽

Genomic Data ◽

Clinical Settings ◽

Data Sets ◽

Sequencing Data ◽

Denoising Method ◽

Data Set ◽

Genomic Data Analysis ◽

Variant Identification

Abstract Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.
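The abstract reports its gains as additional true variants and eliminated erroneous variants. As a rough illustration of how such counts could be derived, the sketch below compares call sets against a truth set. The function name `call_metrics`, the tuple encoding of a variant, and the toy data are all assumptions made for this example; they are not taken from SAMDUDE or from the paper's evaluation pipeline.

```python
def call_metrics(calls, truth):
    """Return (true positives, false positives, precision, recall) for a call set."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called variants that are in the truth set
    fp = len(calls - truth)   # called variants that are not in the truth set
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    return tp, fp, precision, recall


# Hypothetical toy data: each variant is (chromosome, position, ref, alt).
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
raw_calls = {("chr1", 100, "A", "G"), ("chr2", 77, "T", "C")}
denoised_calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}

raw = call_metrics(raw_calls, truth)
denoised = call_metrics(denoised_calls, truth)

# Additional true variants found, and erroneous calls eliminated, after denoising.
gained_true = denoised[0] - raw[0]
removed_false = raw[1] - denoised[1]
```

In this toy case, denoising gains one true variant and removes one false call. A real comparison would normally also normalize variant representations before matching, which this sketch does not attempt.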

Bayesian Classification of Microbial Communities Based on 16S rRNA Metagenomic Data

10.1101/340653 ◽

2018 ◽

Cited By ~ 1

Author(s):

Arghavan Bahadorinejad ◽

Ivan Ivanov ◽

Johanna W Lampe ◽

Meredith AJ Hullar ◽

Robert S Chapkin ◽

...

Keyword(s):

16S rRNA ◽

Sample Size ◽

Microbial Communities ◽

State Of The Art ◽

Metagenomic Data ◽

Data Sets ◽

Sequencing Data ◽

Sample Data

Abstract We propose a Bayesian method for the classification of 16S rRNA metagenomic profiles of bacterial abundance, by introducing a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructing a prior distribution from sample data, calculating the posterior distribution in closed form, and deriving an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive with or superior to the other methods. In particular, when the ratio of sample size to dimensionality is small, the proposed method can vastly outperform the others. Author summary Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation/progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with a small ratio of the number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.
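To make the idea of a closed-form posterior concrete, here is a minimal sketch of a plain Dirichlet-multinomial classifier. It is a simplification, not the authors' Poisson-Dirichlet-Multinomial model or their OBC: it assumes a symmetric noninformative prior (`prior=1.0`), equal class probabilities, and pooled training counts per class. The names `log_dirmult`, `classify`, and the toy OTU counts are hypothetical.

```python
from math import lgamma


def log_dirmult(x, a):
    """Log Dirichlet-multinomial probability of count vector x under parameters a."""
    n, total = sum(x), sum(a)
    out = lgamma(n + 1) - sum(lgamma(xj + 1) for xj in x)  # multinomial coefficient
    out += lgamma(total) - lgamma(total + n)
    out += sum(lgamma(aj + xj) - lgamma(aj) for aj, xj in zip(a, x))
    return out


def classify(x, class_counts, prior=1.0):
    """Pick the class whose posterior predictive assigns x the highest probability."""
    scores = {}
    for label, samples in class_counts.items():
        pooled = [sum(col) for col in zip(*samples)]  # per-OTU training counts
        posterior = [prior + c for c in pooled]       # conjugate Dirichlet update
        scores[label] = log_dirmult(x, posterior)
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three OTUs, two training samples per class.
training = {
    "healthy": [[30, 5, 1], [28, 6, 2]],
    "tumor": [[4, 20, 12], [6, 18, 10]],
}
label, scores = classify([25, 7, 2], training)
```

Because the Dirichlet prior is conjugate to the multinomial, the posterior update is just an addition of counts. This is what keeps the computation cheap even when the number of OTUs is large relative to the sample size.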

A unified haplotype-based method for accurate and comprehensive variant calling

10.1101/456103 ◽

2018 ◽

Cited By ~ 3

Author(s):

Daniel P Cooke ◽

David C Wedge ◽

Gerton Lunter

Keyword(s):

De Novo ◽

Variant Calling ◽

Normal Sample ◽

Sequencing Data ◽

Somatic Variation ◽

Data Set ◽

Small Complex ◽

Physical Linkage ◽

Germline Variation ◽

Almost All

Haplotype-based variant callers, which consider physical linkage between variant sites, are currently among the best tools for germline variation discovery and genotyping from short-read sequencing data. However, almost all such tools were designed specifically for detecting common germline variation in diploid populations, and give sub-optimal results in other scenarios. Here we present Octopus, a versatile haplotype-based variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. We show that Octopus accurately calls de novo mutations in parent-offspring trios and germline variants in individuals, including SNVs, indels, and small complex replacements such as microinversions. In addition, using a carefully designed synthetic-tumour data set derived from clean sequencing data from a sample with known germline haplotypes, and observed mutations in a large cohort of tumour samples, we show that Octopus accurately characterizes germline and somatic variation in tumours, both with and without a paired normal sample. Sequencing reads and prior information are combined to phase called genotypes of arbitrary ploidy, including those with somatic mutations. Octopus also outputs realigned evidence BAMs to aid validation and interpretation.

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

BMC Genomics ◽

10.1186/s12864-020-07038-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Da Xu ◽

Jialin Zhang ◽

Hanxiao Xu ◽

Yusen Zhang ◽

Wei Chen ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Therapeutic Targets ◽

Genomic Data ◽

Data Sets ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Model Learning ◽

Data Set ◽

Multi Scale

Abstract Background The small number of samples and the curse of dimensionality hamper the application of deep learning techniques for disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory because they are limited to unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets. Some current feature selection methods suffer from low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW using gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm and a feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that better prediction results can be attained with the identified biomarkers on the independent LUAD data set, and we also constructed a drug-target network which may be useful for LUAD therapy.
Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other different kinds of data sets.

The effects of a globin blocker on the resolution of 3’mRNA sequencing data in porcine blood

BMC Genomics ◽

10.1186/s12864-019-6122-2 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Kyu-Sang Lim ◽

Qian Dong ◽

Pamela Moll ◽

Jana Vitkovska ◽

Gregor Wiktorin ◽

...

Keyword(s):

Gene Expression ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Data Sets ◽

Globin Genes ◽

Sequencing Data ◽

Globin mRNA ◽

Data Set ◽

mRNA Sequencing ◽

Porcine Blood

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of globin mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of globin mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of globin reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-globin genes. The highest evaluated concentration (C1) of the GB resulted in the largest reduction of globin reads compared to the NGB (from 56.4 to 10.1%). The second highest concentration C2, which showed very similar globin depletion rates (12%) as C1 but a better correlation of the expression of non-globin genes between NGB and GB (r = 0.98), allowed the expression of an additional 1295 non-globin genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of globin reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing globin reads, in particular for HBB, similar to results from data set 1. Data set 3 (n = 84) revealed that the proportion of globin reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). 
Conclusions The effect of the GB on reducing the proportion of globin reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-globin mRNA, the GB for QuantSeq has an advantage that it does not require an additional step prior to or during library creation. Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.

A Novel Method to Detect Bias in Short Read NGS Data

Journal of Integrative Bioinformatics ◽

10.1515/jib-2017-0025 ◽

2017 ◽

Vol 14 (3) ◽

Cited By ~ 1

Author(s):

Jamie Alnasir ◽

Hugh P. Shanahan

Keyword(s):

Biological Significance ◽

GC Content ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Sequencing Data ◽

Data Set ◽

Short Read ◽

Novel Method ◽

Type Data ◽

NGS Data

Abstract Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set in Drosophila indicates a variation due to motif GC content that is more significant than that found due to exon GC content. The software is available online and could be applied for cross-experiment transcriptome data analysis in eukaryotes.

A review of forecasting techniques for large datasets

National Institute Economic Review ◽

10.1177/0027950108089682 ◽

2008 ◽

Vol 203 ◽

pp. 109-115 ◽

Cited By ~ 4

Author(s):

Jana Eklund ◽

George Kapetanios

Keyword(s):

State Of The Art ◽

Large Data ◽

Large Datasets ◽

Large Data Sets ◽

Multiple Models ◽

Small Subset ◽

Data Sets ◽

Single Model ◽

Data Set ◽

Forecasting Techniques

This paper aims to provide a brief and relatively non-technical overview of state-of-the-art forecasting with large data sets. We classify existing methods into four groups depending on whether data sets are used wholly or partly, whether a single model or multiple models are used and whether a small subset or the whole data set is being forecast. In particular, we provide brief descriptions of the methods and short recommendations where appropriate, without going into detailed discussions of their merits or demerits.

Modified Deep Neural Networks for Dog Breeds Identification

10.20944/preprints201812.0232.v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Aydin Ayanzadeh ◽

Sahand Vahidnia

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

State Of The Art ◽

The State ◽

Fine Tuning ◽

Test Accuracy ◽

Data Sets ◽

Data Set

In this paper, we leverage state-of-the-art models trained on ImageNet data sets. We use the pre-trained models and learned weights to extract features from the dog breeds identification data set. Afterwards, we apply fine-tuning and data augmentation to increase test accuracy in the classification of dog breeds. The performance of the proposed approaches is compared with state-of-the-art ImageNet models such as ResNet-50, DenseNet-121, DenseNet-169 and GoogleNet. We achieved 89.66%, 85.37%, 84.01% and 82.08% test accuracy respectively, which shows the superior performance of the proposed method over previous works on the Stanford dog breeds data set.

Chemical biology-whole genome engineering datasets predict new antibacterial combinations

Microbial Genomics ◽

10.1099/mgen.0.000718 ◽

2021 ◽

Vol 7 (12) ◽

Author(s):

Arthur K. Turner ◽

Muhammad Yasir ◽

Sarah Bastkowski ◽

Andrea Telatin ◽

Andrew Page ◽

...

Keyword(s):

Chemical Biology ◽

Genome Engineering ◽

Genomic Data ◽

Inducible Promoter ◽

Data Sets ◽

Whole Genome ◽

Data Set ◽

Content Type ◽

Chemical Genomic ◽

Increased Susceptibility

Trimethoprim and sulfamethoxazole are used commonly together as cotrimoxazole for the treatment of urinary tract and other infections. The evolution of resistance to these and other antibacterials threatens therapeutic options for clinicians. We generated and analysed a chemical-biology-whole-genome data set to predict new targets for antibacterial combinations with trimethoprim and sulfamethoxazole. For this we used a large transposon mutant library in Escherichia coli BW25113 where an outward-transcribing inducible promoter was engineered into one end of the transposon. This approach allows regulated expression of adjacent genes in addition to gene inactivation at transposon insertion sites, a methodology that has been called TraDIS-Xpress. These chemical genomic data sets identified mechanisms for both reduced and increased susceptibility to trimethoprim and sulfamethoxazole. The data identified that over-expression of FolA reduced trimethoprim susceptibility, a known mechanism for reduced susceptibility. In addition, transposon insertions into the genes tdk, deoR, ybbC, hha, ldcA, wbbK and waaS increased susceptibility to trimethoprim and likewise for rsmH, fadR, ddlB, nlpI and prc with sulfamethoxazole, while insertions in ispD, uspC, minC, minD, yebK, truD and umpG increased susceptibility to both these antibiotics. Two of these genes’ products, Tdk and IspD, are inhibited by AZT and fosmidomycin respectively, antibiotics that are known to synergise with trimethoprim. Thus, the data identified two known targets and several new target candidates for the development of co-drugs that synergise with trimethoprim, sulfamethoxazole or cotrimoxazole. We demonstrate that the TraDIS-Xpress technology can be used to generate information-rich chemical-genomic data sets that can be used for antibacterial development.

Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data

10.1101/390195 ◽

2018 ◽

Cited By ~ 4

Author(s):

Janko Tackmann ◽

João Frederico Matias Rodrigues ◽

Christian von Mering

Keyword(s):

Graphical Models ◽

Large Scale ◽

Study Data ◽

Microbial Interactions ◽

Data Sets ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Data Set ◽

Seamless Integration

Abstract The recent explosion of metagenomic sequencing data opens the door towards the modeling of microbial ecosystems in unprecedented detail. In particular, co-occurrence-based prediction of ecological interactions could strongly benefit from this development. However, current methods fall short on several fronts: univariate tools do not distinguish between direct and indirect interactions, resulting in excessive false positives, while approaches with better resolution are so far computationally highly limited. Furthermore, confounding variables typical for cross-study data sets are rarely addressed. We present FlashWeave, a new approach based on a flexible Probabilistic Graphical Models framework to infer highly resolved direct microbial interactions from massive heterogeneous microbial abundance data sets with seamless integration of metadata. On a variety of benchmarks, FlashWeave outperforms state-of-the-art methods by several orders of magnitude in terms of speed while generally providing increased accuracy. We apply FlashWeave to a cross-study data set of 69 818 publicly available human gut samples, resulting in one of the largest and most diverse models of microbial interactions in the human gut to date.
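The distinction between direct and indirect interactions can be illustrated with a first-order partial correlation. The sketch below is a generic textbook example and does not reproduce FlashWeave's algorithm; the function name `partial_corr` and the correlation values are assumptions chosen for illustration. Taxon A and taxon C appear correlated only because both track taxon B.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical pairwise correlations between abundances of three taxa.
r_ab, r_bc, r_ac = 0.8, 0.8, 0.64

# A univariate (pairwise) tool would report an A-C association of 0.64.
# Conditioning on B shows that no direct A-C association remains.
direct_ac = partial_corr(r_ac, r_ab, r_bc)
```

Here `direct_ac` is approximately zero, so the pairwise A-C signal would be a false positive if it were read as a direct interaction.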
