Turning vice into virtue: Using Batch-Effects to Detect Errors in Large Genomic Datasets

AbstractIt is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors, where about 1% of the higher-frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are less often found than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset. Our results show that batch effects can be turned into a virtue by using the resulting variation in large scale datasets to detect systematic errors.

Download Full-text

Correcting for experiment-specific variability in expression compendia can remove underlying signals

GigaScience ◽

10.1093/gigascience/giaa117 ◽

2020 ◽

Vol 9 (11) ◽

Author(s):

Alexandra J Lee ◽

YoSon Park ◽

Georgia Doing ◽

Deborah A Hogan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Large Scale ◽

Original Signal ◽

Batch Effects ◽

Technical Variability ◽

The Past ◽

Statistical Correction ◽

Before And After ◽

Data Collections ◽

Biological Patterns

Abstract Motivation In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. Objective We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. Method We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. Results The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. Conclusion When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.

Download Full-text

Illuminating the Plant Rhabdovirus Landscape through Metatranscriptomics Data

Viruses ◽

10.3390/v13071304 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1304

Author(s):

Nicolás Bejerman ◽

Ralf G. Dietzgen ◽

Humberto Debat

Keyword(s):

Plant Species ◽

High Throughput Sequencing ◽

Plant Viruses ◽

The Novel ◽

Coding Regions ◽

Public Data ◽

Invaluable Tool ◽

Sequencing Platforms ◽

Viral Sequences ◽

Plant Rhabdovirus

Rhabdoviruses infect a large number of plant species and cause significant crop diseases. They have a negative-sense, single-stranded unsegmented or bisegmented RNA genome. The number of plant-associated rhabdovirid sequences has grown in the last few years in concert with the extensive use of high-throughput sequencing platforms. Here, we report the discovery of 27 novel rhabdovirus genomes associated with 25 different host plant species and one insect, which were hidden in public databases. These viral sequences were identified through homology searches in more than 3000 plant and insect transcriptomes from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) using known plant rhabdovirus sequences as the query. The identification, assembly and curation of raw SRA reads resulted in sixteen viral genome sequences with full-length coding regions and ten partial genomes. Highlights of the obtained sequences include viruses with unique and novel genome organizations among known plant rhabdoviruses. Phylogenetic analysis showed that thirteen of the novel viruses were related to cytorhabdoviruses, one to alphanucleorhabdoviruses, five to betanucleorhabdoviruses, one to dichorhaviruses and seven to varicosaviruses. These findings resulted in the most complete phylogeny of plant rhabdoviruses to date and shed new light on the phylogenetic relationships and evolutionary landscape of this group of plant viruses. Furthermore, this study provided additional evidence for the complexity and diversity of plant rhabdovirus genomes and demonstrated that analyzing SRA public data provides an invaluable tool to accelerate virus discovery, gain evolutionary insights and refine virus taxonomy.

Download Full-text

Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study

PLoS ONE ◽

10.1371/journal.pone.0167742 ◽

2017 ◽

Vol 12 (1) ◽

pp. e0167742 ◽

Cited By ~ 14

Author(s):

Paul S. de Vries ◽

Maria Sabater-Lleal ◽

Daniel I. Chasman ◽

Stella Trompet ◽

Tarunveer S. Ahluwalia ◽

...

Keyword(s):

Association Study ◽

Large Scale ◽

Genome Wide Association Study ◽

Genome Wide Association ◽

1000 Genomes ◽

Genome Wide

Download Full-text

tHapMix: simulating tumour samples through haplotype mixtures

10.1101/057414 ◽

2016 ◽

Author(s):

Sergii Ivakhno ◽

Camilla Colombo ◽

Stephen Tanner ◽

Philip Tedder ◽

Stefano Berri ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Variant Calling ◽

Copy Number Variant ◽

Supplementary Information ◽

Genome Diversity ◽

Simulation Framework ◽

Somatic Genome ◽

Copy Number Changes ◽

Sequencing Platforms

AbstractMotivationLarge-scale rearrangements and copy number changes combined with different modes of cloevolution create extensive somatic genome diversity, making it difficult to develop versatile and scalable oriant calling tools and create well-calibrated benchmarks.ResultsWe developed a new simulation framework tHapMix that enables the creation of tumour samples with different ploidy, purity and polyclonality features. It easily scales to simulation of hundreds of somatic genomes, while re-use of real read data preserves noise and biases present in sequencing platforms. We further demonstrate tHapMix utility by creating a simulated set of 140 somatic genomes and showing how it can be used in training and testing of somatic copy number variant calling tools.Availability and implementationtHapMix is distributed under an open source license and can be downloaded from https://github.com/Illumina/[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

The Differential Astrometric Reference Frame on short timescales in the Gaia Era

Proceedings of the International Astronomical Union ◽

10.1017/s174392131700583x ◽

2017 ◽

Vol 12 (S330) ◽

pp. 79-80

Author(s):

Ummi Abbas ◽

Beatrice Bucciarelli ◽

Mario G. Lattanzi ◽

Mariateresa Crosta ◽

Mario Gai ◽

...

Keyword(s):

Reference Frame ◽

Large Scale ◽

Total Error ◽

Systematic Errors ◽

Field Of View ◽

Small Field ◽

Inertial Reference Frame ◽

Error Budget ◽

G 14 ◽

High Cadence

AbstractWe use methods of differential astrometry to construct a small field inertial reference frame stable at the micro-arcsecond level. Using Gaia measurements of field angles we look at the influence of the number of reference stars and the stars magnitude as well as astrometric systematics on the total error budget with the help of Gaia-like simulations around the Ecliptic Pole in a differential astrometric scenario. We find that the systematic errors are modeled and reliably estimated to the μas level even in fields with a modest number of 37 stars with G <13 mag over a 0.24 sq. degrees field of view for short timescales of the order of a day for a perfect instrument and with high-cadence observations. Accounting for large-scale calibrations by including the geometric instrument model over such short timescales requires fainter stars down to G=14 mag without diminishing the accuracy of the reference frame.

Download Full-text

Abstract 2722: Large Scale Mutation Discovery Screen Identifies Functionally Variant UGDH Alleles in Patients with Atrioventricular Valve Defects

Circulation ◽

10.1161/circ.116.suppl_16.ii_604 ◽

2007 ◽

Vol 116 (suppl_16) ◽

Author(s):

Jeroen Bakkers ◽

Sonja Chocron ◽

Victor Gouriev ◽

Kelly Smith ◽

Ronald Lekanne dit Deprez ◽

...

Keyword(s):

Candidate Genes ◽

Congenital Heart Defects ◽

Congenital Heart ◽

Large Scale ◽

Heart Defects ◽

Glucose Dehydrogenase ◽

Model Organisms ◽

Missense Mutations ◽

Coding Regions

Background: Congenital heart defects are the most common birth defects. Although genetic dispositions are believed to cause CHDs, only few genes have been identified that harbour mutations causing such defects. Studies in model organisms have identified many essential genes for cardiac development. UDP-glucose dehydrogenase (UGDH) enzymatic activity is required for the signal transduction of FGF and Wnt ligands and zebrafish jekyll/ugdh mutations lack AV valves. Methods and Results: From literature candidate genes were selected that are essential for AV canal-, septum-, and valve formation. By large scale sequencing we analysed the coding regions of 36 candidate genes in 192 patients with reported AVSDs. As a result we identified 457 genetic variations of which 207 variants are in flanking non-coding regions, 156 variants are in coding regions but silent and 94 variants are non-synonymous variants that alter the protein sequence. Comparison with the available databases such as HapMap and screening 350 control individuals resulted in the validation of 49 non-synonomous missense mutations in 23 genes only present in the patient group. These included novel GATA4 missense mutations (R285C and M224V) located in the highly conserved DNA binding domains, which by in vitro analysis significantly reduce transcriptional activity of the protein. Three patients with mitral valvar prolapse and mitral regurgitation were identified with novel missense mutations in the UDP-glucose dehydrogenase (UGDH) gene (R141C and E416D). In vitro experiments demonstrated a negative affect on enzyme activity and stability by a change in protein conformation. Furthermore, experiments in zebrafish jekyll/ugdh mutants showed that UGDH R141C and UGDH E416D couldn’t rescue the defects in AV formation demonstrating an inactivating effect of these missense mutations in vivo. Conclusions: A model organism based candidate gene screen in CHD patients resulted in the identification of novel functional missense mutations in the UGDH gene not previously implicated in congenital heart defects.

Download Full-text

N6-methyladenosine modification modulate in the heat stress of sheep (Ovis aries)

10.21203/rs.2.9885/v1 ◽

2019 ◽

Author(s):

Zengkui Lu ◽

Huihua Wang ◽

Youji Ma ◽

Mingxing Chu ◽

Kai Quan ◽

...

Keyword(s):

Heat Stress ◽

Signaling Pathways ◽

Stress Responses ◽

Large Scale ◽

Ovis Aries ◽

Stop Codons ◽

Coding Regions ◽

Mrna Methylation ◽

Response To Stress ◽

And Control

Abstract Background: Intensive and large-scale development of the sheep industry and increases in global temperature are increasingly exposing sheep to heat stress. N6-methyladenosine (m6A) mRNA methylation varies in response to stress, and can link external stress with complex transcriptional and post-transcriptional processes. However, no m6A mRNA methylation map has been obtained for sheep, nor is it known what effect this has on regulating heat stress in sheep. Results: A total of 8,306 and 12,958 m6A peaks were detected in heat stress and control groups, respectively, with 2,697 and 5,494 genes associated with each. Peaks were mainly enriched in coding regions and near stop codons with classical RRACH motifs. Methylation levels of heat stress and control sheep were higher near stop codons, although methylation was significantly lower in heat stress sheep. GO revealed that differential m6A-containing genes were mainly enriched in the nucleus and were involved in several stress responses and substance metabolism processes. KEGG pathway analysis found that differential m6A-containing genes were significantly enriched in Rap1, FoxO, MAPK, and other signaling pathways of the stress response, and TGF-beta, AMPK, Wnt, and other signaling pathways involved in fat metabolism. These m6A-modified genes were moderately expressed in both heat stress and control sheep, and the enrichment of m6A modification was significantly negatively correlated with gene expression. Conclusions: Our results showed that m6A mRNA methylation modifications regulate heat stress in sheep, and it also provided a new way for the study of animal response to heat stress.

Download Full-text

Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽

10.1042/etls20190007 ◽

2019 ◽

Vol 3 (4) ◽

pp. 399-409 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Jae Hoon Sul

Keyword(s):

Quality Control ◽

Genome Sequencing ◽

Genetic Variants ◽

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Computational Approaches ◽

Sequencing Errors ◽

Human Genome Sequencing ◽

Number Of Individuals

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.

Download Full-text

Technological advancements and their importance for nematode identification

SOIL ◽

10.5194/soil-2-257-2016 ◽

2016 ◽

Vol 2 (2) ◽

pp. 257-270 ◽

Cited By ~ 6

Author(s):

Mohammed Ahmed ◽

Melanie Sapp ◽

Thomas Prior ◽

Gerrit Karssen ◽

Matthew Alan Back

Keyword(s):

High Throughput ◽

Crop Production ◽

Large Scale ◽

High Throughput Sequencing ◽

Biological Indicators ◽

Rapid Identification ◽

Community Studies ◽

Terrestrial Environments ◽

Traditional Taxonomy ◽

Sequencing Platforms

Abstract. Nematodes represent a species-rich and morphologically diverse group of metazoans known to inhabit both aquatic and terrestrial environments. Their role as biological indicators and as key players in nutrient cycling has been well documented. Some plant-parasitic species are also known to cause significant losses to crop production. In spite of this, there still exists a huge gap in our knowledge of their diversity due to the enormity of time and expertise often involved in characterising species using phenotypic features. Molecular methodology provides useful means of complementing the limited number of reliable diagnostic characters available for morphology-based identification. We discuss herein some of the limitations of traditional taxonomy and how molecular methodologies, especially the use of high-throughput sequencing, have assisted in carrying out large-scale nematode community studies and characterisation of phytonematodes through rapid identification of multiple taxa. We also provide brief descriptions of some the current and almost-outdated high-throughput sequencing platforms and their applications in both plant nematology and soil ecology.

Download Full-text

Standing out in a networked communication context: Toward a network contingency model of public attention

New Media & Society ◽

10.1177/1461444820939445 ◽

2020 ◽

pp. 146144482093944

Author(s):

Aimei Yang ◽

Adam J Saffer

Keyword(s):

Social Media ◽

Large Scale ◽

Cost Effective ◽

Machine Learning Techniques ◽

Public Attention ◽

Combine Data ◽

Contingency Model ◽

Learning Techniques ◽

Community Strategy ◽

Communication Context

Social media can offer strategic communicators cost-effective opportunities to reach millions of individuals. However, in practice it can be difficult to be heard in these crowded digital spaces. This study takes a strategic network perspective and draws from recent research in network science to propose the network contingency model of public attention. This model argues that in the networked social-mediated environment, an organization’s ability to attract public attention on social media is contingent on its ability to fit its network position with the network structure of the communication context. To test the model, we combine data mining, social network analysis, and machine-learning techniques to analyze a large-scale Twitter discussion network. The results of our analysis of Twitter discussion around the refugee crisis in 2016 suggest that in high core-periphery network contexts, “star” positions were most influential whereas in low core-periphery network contexts, a “community” strategy is crucial to attracting public attention.

Download Full-text