BAMboozle removes genetic variation from human sequence data for open data sharing

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Christoph Ziegenhain ◽  
Rickard Sandberg

Abstract The risks associated with re-identification of human genetic data are severely limiting open data sharing in life sciences, even in studies where donor-related genetic variant information is not of primary interest. Here, we developed BAMboozle, a versatile tool to eliminate critical types of sensitive genetic information in human sequence data by reverting aligned reads to the genome reference sequence. Applying BAMboozle to functional genomics data, such as single-cell RNA-seq (scRNA-seq) and scATAC-seq datasets, confirmed the removal of donor-related single nucleotide polymorphisms (SNPs) and indels in a manner that did not disclose the altered positions. Importantly, BAMboozle only removes the genetic sequence variants of the sample (i.e., donor) while preserving other important aspects of the raw sequence data. For example, BAMboozled scRNA-seq data contained accurate cell-type associated gene expression signatures and splice kinetic information, and could be used for methods benchmarking. Altogether, BAMboozle efficiently removes genetic variation in aligned sequence data, which represents a step forward towards open data sharing in many areas of genomics where the genetic variant information is not of primary interest.
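The core idea in this abstract, reverting an aligned read to the reference sequence, can be illustrated with a minimal Python sketch. This is not BAMboozle's actual implementation: the function name `revert_to_reference`, the 0-based `pos` convention, and the restriction to the CIGAR operations M, =, X, I, and D are all assumptions made for illustration. The sketch also lets the read length change when indels are removed, a simplification that a real tool would presumably need to handle so that altered positions are not disclosed.

```python
import re

# One CIGAR element: a length followed by a single operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def revert_to_reference(reference: str, pos: int, cigar: str) -> tuple[str, str]:
    """Rewrite one aligned read so that it carries only reference bases.

    Hypothetical helper for illustration only.
    reference: the reference sequence of the contig the read maps to
    pos:       0-based leftmost aligned reference position (assumption)
    cigar:     CIGAR string of the original alignment
    Returns the sanitized (sequence, cigar) pair.
    """
    if not cigar or CIGAR_RE.sub("", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")

    ref_span = 0
    for length, op in CIGAR_RE.findall(cigar):
        if op in "M=XD":
            # Matches, mismatches and deletions all consume reference bases;
            # mismatches (SNPs) and deletions are replaced by reference bases.
            ref_span += int(length)
        elif op == "I":
            # Inserted bases have no reference counterpart and are dropped.
            continue
        else:
            # Clipping, padding and spliced alignments are out of scope here.
            raise ValueError(f"unsupported CIGAR operation: {op}")

    end = pos + ref_span
    if ref_span == 0 or pos < 0 or end > len(reference):
        raise ValueError("alignment does not fit on the reference")

    sequence = reference[pos:end]
    return sequence, f"{len(sequence)}M"


if __name__ == "__main__":
    ref = "ACGTACGTACGT"
    # A read with one insertion and one 2-base deletion relative to ref.
    print(revert_to_reference(ref, 2, "3M1I2M2D1M"))
```

Because the returned sequence is copied from the reference, any SNP, insertion, or deletion carried by the original read is absent from the output, while the read's mapping position is preserved.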

2021 ◽  
Author(s):  
Christoph Ziegenhain ◽  
Rickard Sandberg

Abstract The risks associated with re-identification of human genetic data are severely limiting open data sharing in life sciences. Here, we developed anonymizeBAM, a versatile tool for the anonymization of genetic variant information present in sequence data. Applying anonymizeBAM to single-cell RNA-seq and ATAC-seq datasets confirmed the complete removal of donor-related genetic information. Therefore, the accurate generation of de-identified sequence data will re-enable open sharing in sequencing-based studies for improved transparency, reproducibility, and innovation.


2019 ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.


2016 ◽  
Author(s):  
Chris Wymant ◽  
François Blanquart ◽  
Astrid Gall ◽  
Margreet Bakker ◽  
Daniela Bezemer ◽  
...  

Abstract Next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of rapid between- and within-host evolution may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by effectively aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to preprocess reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We use shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read data produced with the Illumina platform, for 65 existing publicly available samples and 50 new samples. We show the systematic superiority of mapping to shiver’s constructed reference over mapping the same reads to the standard reference HXB2: an average of 29 bases per sample are called differently, of which 98.5% are supported by higher coverage. We also provide a practical guide to working with imperfect contigs.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract We introduce Aquila, a new approach to variant discovery in personal genomes, which is critical for uncovering the genetic contributions to health and disease. Aquila uses a reference sequence and linked-read data to generate a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. The contigs of the assemblies from our libraries cover >95% of the human reference genome, with over 98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective approach that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.


2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 477-477
Author(s):  
Leah K Treffer ◽  
Edward S Rice ◽  
Anna M Fuller ◽  
Samuel Cutler ◽  
Jessica L Petersen

Abstract Domestic yak (Bos grunniens) are bovids native to the Asian Qinghai-Tibetan Plateau. Studies of Asian yak have revealed that introgression with domestic cattle has contributed to the evolution of the species. When imported to North America (NA), some hybridization with B. taurus did occur. The objective of this study was to use mitochondrial (mt) DNA sequence data to better understand the mtDNA origin of NA yak and their relationship to Asian yak and related species. The complete mtDNA sequence of 14 individuals (12 NA yak, 1 Tibetan yak, 1 Tibetan B. indicus) was generated and compared with sequences of similar species from GenBank (B. indicus, B. grunniens (Chinese), B. taurus, B. gaurus, B. primigenius, B. frontalis, Bison bison, and Ovis aries). Individuals were aligned to the B. grunniens reference genome (ARS_UNL_BGru_maternal_1.0), which was also included in the analyses. The mtDNA genes were annotated using the ARS-UCD1.2 cattle sequence as a reference. Ten unique NA yak haplotypes were identified, which a haplotype network separated into two clusters. Variation among the NA haplotypes included 93 nonsynonymous single nucleotide polymorphisms. A maximum likelihood tree including all taxa was made using IQtree after the data were partitioned into twenty-two subgroups using PartitionFinder2. Notably, six NA yak haplotypes formed a clade with B. indicus; the other four haplotypes grouped with B. grunniens and fell as a sister clade to bison, gaur and gayal. These data demonstrate two mitochondrial origins of NA yak with genetic variation in protein coding genes. Although these data suggest yak introgression with B. indicus, it appears to date prior to importation into NA. In addition to contributing to our understanding of the species history, these results suggest the two major mtDNA haplotypes in NA yak may functionally differ. Characterization of the impact of these differences on cellular function is currently underway.


Pathogens ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 363
Author(s):  
Sulochana K. Wasala ◽  
Dana K. Howe ◽  
Louise-Marie Dandurand ◽  
Inga A. Zasada ◽  
Dee R. Denver

Globodera pallida is among the most significant plant-parasitic nematodes worldwide, causing major damage to potato production. Since it was discovered in Idaho in 2006, eradication efforts have aimed to contain and eradicate G. pallida through phytosanitary action and soil fumigation. In this study, we investigated genome-wide patterns of G. pallida genetic variation across Idaho fields to evaluate whether the infestation resulted from a single or multiple introduction(s) and to investigate potential evolutionary responses since the time of infestation. A total of 53 G. pallida samples (~1,042,000 individuals) were collected and analyzed, representing five different fields in Idaho, a greenhouse population, and a field in Scotland that was used for external comparison. According to genome-wide allele frequency and fixation index (Fst) analyses, most of the genetic variation was shared among the G. pallida populations in Idaho fields pre-fumigation, indicating that the infestation likely resulted from a single introduction. Temporal patterns of genome-wide polymorphisms involving (1) pre-fumigation field samples collected in 2007 and 2014 and (2) pre- and post-fumigation samples revealed nucleotide variants (SNPs, single-nucleotide polymorphisms) with significantly differentiated allele frequencies indicating genetic differentiation. This study provides insights into the genetic origins and adaptive potential of G. pallida invading new environments.
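The fixation index (Fst) mentioned in this abstract can be illustrated with the textbook heterozygosity-based definition for a single biallelic locus, Fst = (H_T - H_S) / H_T. The sketch below is only that textbook formula, assuming equally sized subpopulations; it is not necessarily the estimator the authors used, and the function name `fst_biallelic` is hypothetical.

```python
from typing import Sequence


def fst_biallelic(freqs: Sequence[float]) -> float:
    """Wright's Fst for one biallelic locus (illustrative textbook form).

    freqs: frequency of the same allele in each subpopulation,
           with all subpopulations weighted equally (an assumption).
    """
    if len(freqs) < 2:
        raise ValueError("need at least two subpopulations")
    if any(p < 0.0 or p > 1.0 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")

    # Expected heterozygosity of the pooled (total) population.
    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # The locus is fixed everywhere, so there is no variation to partition.
        return 0.0

    # Mean expected heterozygosity within subpopulations.
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_biallelic([0.5, 0.5]))  # identical frequencies
    print(fst_biallelic([0.0, 1.0]))  # alternative alleles fixed
```

Under this definition, values near 0 mean the subpopulations share most of their variation, which is the pattern the abstract interprets as evidence for a single introduction.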


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelly E. Williams ◽  
Damian M. Menning ◽  
Eric J. Wald ◽  
Sandra L. Talbot ◽  
Kumi L. Rattenbury ◽  
...  

Abstract Objectives Dall’s sheep (Ovis dalli dalli) are important herbivores in the mountainous ecosystems of northwestern North America, and recent declines in some populations have sparked concern. Our aim was to improve capabilities for fecal metabarcoding diet analysis of Dall’s sheep and other herbivores by contributing new sequence data for arctic and alpine plants. This expanded reference library will provide critical reference sequence data that will facilitate metabarcoding diet analysis of Dall’s sheep and thus improve understanding of plant-animal interactions in a region undergoing rapid climate change. Data description We provide sequences for the chloroplast rbcL gene of 16 arctic-alpine vascular plant species that are known to comprise the diet of Dall’s sheep. These sequences contribute to a growing reference library that can be used in diet studies of arctic herbivores.


2021 ◽  
pp. 002203452110202
Author(s):  
F. Schwendicke ◽  
J. Krois

Data are a key resource for modern societies and expected to improve quality, accessibility, affordability, safety, and equity of health care. Dental care and research are currently transforming into what we term data dentistry, with 3 main applications: 1) Medical data analysis uses deep learning, allowing one to master unprecedented amounts of data (language, speech, imagery) and put them to productive use. 2) Data-enriched clinical care integrates data from individual (e.g., demographic, social, clinical and omics data, consumer data), setting (e.g., geospatial, environmental, provider-related data), and systems level (payer or regulatory data to characterize input, throughput, output, and outcomes of health care) to provide a comprehensive and continuous real-time assessment of biologic perturbations, individual behaviors, and context. Such care may contribute to a deeper understanding of health and disease and a more precise, personalized, predictive, and preventive care. 3) Data for research include open research data and data sharing, allowing one to appraise, benchmark, pool, replicate, and reuse data. Concerns about and confidence in data-driven applications, stakeholders’ and systems’ capabilities, and lack of data standardization and harmonization currently limit the development and implementation of data dentistry. Aspects of bias and data-user interaction require attention. Action items for the dental community circle around increasing data availability, refinement, and usage; demonstrating safety, value, and usefulness of applications; educating the dental workforce and consumers; providing performant and standardized infrastructure and processes; and incentivizing and adopting open data and data sharing.


Author(s):  
Di Xian ◽  
Peng Zhang ◽  
Ling Gao ◽  
Ruijing Sun ◽  
Haizhen Zhang ◽  
...  

Abstract Following the progress of satellite data assimilation in the 1990s, the combination of meteorological satellites and numerical models has changed the way scientists understand the earth. With the evolution of numerical weather prediction models and earth system models, meteorological satellites will play a more important role in earth sciences in the future. As part of the space-based infrastructure, the Fengyun (FY) meteorological satellites have contributed to earth science sustainability studies through an open data policy and stable data quality since the first launch of the FY-1A satellite in 1988. The capability of earth system monitoring was greatly enhanced after the second-generation polar orbiting FY-3 satellites and geostationary orbiting FY-4 satellites were developed. Meanwhile, the quality of the products generated from the FY-3 and FY-4 satellites is comparable to the well-known MODIS products. FY satellite data has been utilized broadly in weather forecasting, climate and climate change investigations, environmental disaster monitoring, etc. This article reviews the instruments mounted on the FY satellites. Sensor-dependent level 1 products (radiance data) and inversion algorithm-dependent level 2 products (geophysical parameters) are introduced. As an example, some typical geophysical parameters, such as wildfires, lightning, vegetation indices, aerosol products, soil moisture, and precipitation estimation have been demonstrated and validated by in-situ observations and other well-known satellite products. To help users access the FY products, a set of data sharing systems has been developed and operated. The newly developed data sharing system based on cloud technology has been illustrated to improve the efficiency of data delivery.

