A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments

Author(s):  
Jochen Kruppa ◽  
Frank Kramer ◽  
Tim Beißbarth ◽  
Klaus Jung

Abstract As part of the data processing of high-throughput sequencing experiments, count data are produced that represent the number of reads mapping to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. Artificial count data are therefore required for evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments. Although some methods for the generation of artificial sequencing count data have been proposed, all of them either simulate single sequencing runs, thus omitting the correlation structure between the individual genomic features, or are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and round these continuous data in order to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for the simulation of counts for defined subsets of features such as individual pathways or GO categories.
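
The core idea of the abstract above (draw from a multivariate normal distribution, then round to counts) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `simulate_correlated_counts`, the example mean vector and covariance matrix, and the clipping of negative values to zero are all assumptions made for illustration, and the shrinkage estimation step mentioned in the abstract is not shown.

```python
import numpy as np


def simulate_correlated_counts(mean, cov, n_samples, seed=0):
    """Draw correlated continuous values and round them to counts.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # One row per simulated sample, one column per feature.
    continuous = rng.multivariate_normal(mean, cov, size=n_samples)
    # Round to the nearest integer to obtain discrete values.
    counts = np.rint(continuous)
    # Assumption: negative values are clipped to zero, since counts
    # cannot be negative. The abstract does not say how this is handled.
    counts = np.clip(counts, 0, None)
    return counts.astype(np.int64)


# Made-up parameters for three features of one hypothetical feature subset.
mean = np.array([50.0, 80.0, 30.0])
cov = np.array([
    [25.0, 15.0, 5.0],
    [15.0, 36.0, 6.0],
    [5.0, 6.0, 16.0],
])

counts = simulate_correlated_counts(mean, cov, n_samples=2000)

# Compare the correlation of the rounded counts with the target correlation
# implied by `cov` (0.5 between the first two features).
empirical_corr = np.corrcoef(counts, rowvar=False)
```

Comparing `empirical_corr` with the correlation implied by `cov` gives a rough view of how much rounding distorts the target correlation structure, which is the effect the abstract says motivates the use of shrinkage estimators.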

2014 ◽  
Vol 11 (6) ◽  
pp. 683-688 ◽  
Author(s):  
Jacob M Tome ◽  
Abdullah Ozer ◽  
John M Pagano ◽  
Dan Gheba ◽  
Gary P Schroth ◽  
...  

2009 ◽  
Vol 37 (6) ◽  
pp. 1278-1280 ◽  
Author(s):  
Jernej Ule

UV-cross-linking and RNase protection, combined with high-throughput sequencing, have provided global maps of RNA sites bound by individual proteins or ribosomes. Using a stringent purification protocol, UV-CLIP (UV-cross-linking and immunoprecipitation) was able to identify intronic and exonic sites bound by splicing regulators in mouse brain tissue. Ribosome profiling has been used to quantify ribosome density on budding yeast mRNAs under different environmental conditions. Post-transcriptional regulation in neurons requires high spatial and temporal precision, as is evident from the role of localized translational control in synaptic plasticity. It remains to be seen if the high-throughput methods can be applied quantitatively to study the dynamics of RNP (ribonucleoprotein) remodelling in specific neuronal populations during the neurodegenerative process. It is certain, however, that applications of new biochemical techniques followed by high-throughput sequencing will continue to provide important insights into the mechanisms of neuronal post-transcriptional regulation.


2015 ◽  
Vol 10 (8) ◽  
pp. 1212-1233 ◽  
Author(s):  
Abdullah Ozer ◽  
Jacob M Tome ◽  
Robin C Friedman ◽  
Dan Gheba ◽  
Gary P Schroth ◽  
...  

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gwenna Breton ◽  
Anna C. V. Johansson ◽  
Per Sjödin ◽  
Carina M. Schlebusch ◽  
Mattias Jakobsson

Abstract Background Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. Sequencing technologies and bioinformatic tools are abundant, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its “Best Practices” bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification.
Results We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that data processing varies widely across studies and that the GATK “Best Practices” are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals. We compared the pipelines in terms of the count of called variants and the overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called.
Conclusions We conclude that applying the GATK “Best Practices” pipeline, including its recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage of > 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.


2019 ◽  
Author(s):  
Julia Zinkus-Boltz ◽  
Craig Devalk ◽  
Bryan Dickinson

Protein-protein interactions (PPIs) are critical for organizing molecules in a cell and mediating signaling pathways. Dysregulation of PPIs is often a key driver of disease. To better understand the biophysical basis of such disease processes, and to potentially target them, it is critical to understand the molecular determinants of PPIs. Deep mutational scanning (DMS) facilitates the acquisition of large amounts of biochemical data by coupling selection with high-throughput sequencing (HTS). The challenging and labor-intensive design and optimization of a relevant selection platform for DMS, however, limits the use of powerful directed evolution and selection approaches. To address this limitation, we designed a versatile new phage-assisted continuous selection (PACS) system using our proximity-dependent split RNA polymerase (RNAP) biosensors, with the aim of greatly simplifying and streamlining the design of a new selection platform for PPIs. After characterization and validation using the model KRAS/RAF PPI, we generated a library of RAF variants and subjected them to PACS and DMS. Our HTS data revealed that amino acid (aa) positions 66, 84, and 89 on RAF, key residues in the KRAS/RAF PPI, are intolerant to mutations. We also identified a subset of residues with broad aa substitution tolerance: aa positions 52, 55, 76, and 79. Due to the plug-and-play nature of RNAP biosensors, this method can easily be extended to other PPIs. More broadly, this method, along with others under development, supports bringing evolutionary and high-throughput approaches to bear on biochemical problems, moving towards a more comprehensive understanding of sequence-function relationships in proteins.


2019 ◽  
Vol 48 (3) ◽  
pp. e15-e15 ◽  
Author(s):  
Ibrahim Avsar Ilik ◽  
Tugce Aktas ◽  
Daniel Maticzka ◽  
Rolf Backofen ◽  
Asifa Akhtar

Abstract Determination of the in vivo binding sites of RNA-binding proteins (RBPs) is paramount to understanding their function and how they affect different aspects of gene regulation. With hundreds of RNA-binding proteins identified in human cells, a flexible, high-resolution, high-throughput, highly multiplexible and radioactivity-free method to determine their binding sites has not been described to date. Here we report FLASH (Fast Ligation of RNA after some sort of Affinity Purification for High-throughput Sequencing), which uses a special adapter design and an optimized protocol to determine protein–RNA interactions in living cells. The entire FLASH protocol, starting from cells on plates to a sequencing library, takes 1.5 days. We demonstrate the flexibility, speed and versatility of FLASH by using it to determine RNA targets of both tagged and endogenously expressed proteins under diverse conditions in vivo.


2021 ◽  
Author(s):  
Ana Lechuga ◽  
Cédric Lood ◽  
Mónica Berjón-Otero ◽  
Alicia Del Prado ◽  
Jeroen Wagemans ◽  
...  

Bacillus virus Bam35 is the model Betatectivirus and member of the Tectiviridae family, which is composed of tailless, icosahedral, and membrane-containing bacteriophages. The interest in these viruses has greatly increased in recent years as they are thought to be an evolutionary link between diverse groups of prokaryotic and eukaryotic viruses. Additionally, betatectiviruses infect bacteria of the Bacillus cereus group, known for their applications in industry and notorious since it contains many pathogens. Here, we present the first protein-protein interactions network for a tectivirus-host system by studying the Bam35-Bacillus thuringiensis model using a novel approach that integrates the traditional yeast two-hybrid system and Illumina high-throughput sequencing. We generated and thoroughly analyzed a genomic library of Bam35’s host B. thuringiensis HER1410 and screened interactions with all the viral proteins using different combinations of bait-prey couples. In total, this screen resulted in the detection of over 4,000 potential interactions, of which 183 high-confidence interactions were defined as part of the core virus-host interactome. Overall, host metabolism proteins and peptidases are particularly enriched within the detected interactions, distinguishing this host-phage system from the other reported host-phage protein-protein interaction networks (PPIs). Our approach also suggests biological roles for several Bam35 proteins of unknown function, resulting in a better understanding of the Bam35-B. thuringiensis interaction at the molecular level.


F1000Research ◽  
2017 ◽  
Vol 6 ◽  
pp. 1138 ◽  
Author(s):  
Lance E. Palmer ◽  
Mitchell J. Weiss ◽  
Vikram R. Paralkar

YODEL is a peak calling software for analyzing RNA sequencing data generated by High-Throughput Sequencing of RNA isolated by Crosslinking Immunoprecipitation (HITS-CLIP; also known as CLIP-SEQ), a method to identify RNA-protein interactions genome-wide. We designed YODEL to analyze HITS-CLIP experiments, in which Argonaute proteins are immunoprecipitated, followed by sequencing of the associated RNA in order to identify bound microRNAs and their mRNA targets. The HITS-CLIP sequenced reads are mapped to the genome, and then read peaks are visualized where clustered sets of reads map to the same region. Several peak calling algorithms have been developed to define the boundaries of these peaks. In contrast to other peak callers for HITS-CLIP data, such as Piranha, YODEL does not map the starts of reads to fixed interval bins, but instead uses a heuristic approach to iteratively find the tallest point within a set of clustered reads and examine bases upstream and downstream of that point until a peak has been determined. This allows the peak boundary to be defined more precisely than coordinates that are multiples of the bin size. Per-sample peak counts are also generated by YODEL, which quickly enables downstream differential representation analysis. YODEL is available at https://github.com/LancePalmerStJude/YODEL/.
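
The general idea of finding the tallest point and extending from it, as described in the abstract above, can be sketched as follows. This is not YODEL's actual algorithm or code: the function `call_peaks`, the parameters `min_height` and `drop_fraction`, and the stopping rule (extend while coverage stays at or above a fraction of the summit height) are hypothetical choices made only to illustrate the contrast with fixed-bin approaches.

```python
def call_peaks(coverage, min_height=5, drop_fraction=0.5):
    """Greedy tallest-point-first peak finder on per-base read coverage.

    Hypothetical sketch; the stopping rule and thresholds are assumptions,
    not YODEL's published criteria. Returns (start, end, height) tuples
    with 0-based, inclusive coordinates.
    """
    cov = list(coverage)
    used = [False] * len(cov)  # positions already assigned to a peak
    peaks = []
    while True:
        # Find the tallest position not yet assigned to any peak.
        summit = -1
        for i, depth in enumerate(cov):
            if not used[i] and (summit == -1 or depth > cov[summit]):
                summit = i
        if summit == -1 or cov[summit] < min_height:
            break
        threshold = cov[summit] * drop_fraction
        # Examine bases upstream of the summit.
        start = summit
        while start - 1 >= 0 and not used[start - 1] and cov[start - 1] >= threshold:
            start -= 1
        # Examine bases downstream of the summit.
        end = summit
        while end + 1 < len(cov) and not used[end + 1] and cov[end + 1] >= threshold:
            end += 1
        for i in range(start, end + 1):
            used[i] = True
        peaks.append((start, end, cov[summit]))
    return peaks


# Made-up coverage profile with two clusters of reads.
coverage = [0, 1, 2, 8, 10, 9, 3, 1, 0, 0, 2, 6, 7, 6, 2, 0]
peaks = call_peaks(coverage)
```

For this made-up profile, `peaks` is `[(3, 5, 10), (11, 13, 7)]`. Because the boundaries are found base by base, they are not constrained to multiples of a bin size.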

