High-quality, genome-wide SNP genotypic data for pedigreed germplasm of the diploid outbreeding species apple, peach, and sweet cherry through a common workflow

2019 ◽  
Author(s):  
Stijn Vanderzande ◽  
Nicholas P Howard ◽  
Lichun Cai ◽  
Cassia Da Silva Linge ◽  
Laima Antanaviciute ◽  
...  

Abstract: High-quality genotypic data are a requirement for many genetic analyses. For any crop, errors in genotype calls, phasing of markers, linkage maps, pedigree records, and unnoticed variation in ploidy levels can lead to spurious marker-locus-trait associations and incorrect origin assignment of alleles to individuals. High-throughput genotyping requires automated scoring, as manual inspection of thousands of scored loci is too time-consuming. However, automated SNP scoring can result in errors that should be corrected to ensure recorded genotypic data are accurate and thereby ensure confidence in downstream genetic analyses. To enable quick identification of errors in a large genotypic data set, we have developed a comprehensive workflow. This multiple-step workflow is based on inheritance principles and on removal of markers and individuals that do not follow these principles, as demonstrated here for apple, peach, and sweet cherry. Genotypic data were obtained on pedigreed germplasm using 6–9K SNP arrays for each crop, and a subset of well-performing SNPs was created using ASSIsT. Use of correct (and corrected) pedigree records readily identified violations of simple inheritance principles in the genotypic data, streamlined with FlexQTL™ software. Retained SNPs were grouped into haploblocks to increase the information content of single alleles and reduce the computational power needed in downstream genetic analyses. Haploblock borders were defined by recombination locations detected in ancestral generations of cultivars and selections. Another round of inheritance-checking was conducted for haploblock alleles (i.e., haplotypes). High-quality genotypic data sets were created using this workflow for pedigreed collections representing the U.S. breeding germplasm of apple, peach, and sweet cherry evaluated within the RosBREED project. These data sets contain 3855, 4005, and 1617 SNPs spread over 932, 103, and 196 haploblocks in apple, peach, and sweet cherry, respectively. The highly curated phased SNP and haplotype data sets, as well as the raw iScan data, of germplasm in the apple, peach, and sweet cherry Crop Reference Sets are available through the Genome Database for Rosaceae.
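The inheritance-checking step lends itself to a simple illustration. The sketch below is not the published workflow (which relies on ASSIsT and FlexQTL™); it only shows the kind of Mendelian check involved, for biallelic SNP calls coded 0/1/2 in parent-offspring trios. Marker names and genotype codes are hypothetical.

```python
# Compatible offspring genotypes for each ordered (parent1, parent2) pair,
# with genotypes coded as copies of the alternate allele (0, 1, 2).
VALID = {
    (0, 0): {0},
    (0, 1): {0, 1}, (1, 0): {0, 1},
    (0, 2): {1},    (2, 0): {1},
    (1, 1): {0, 1, 2},
    (1, 2): {1, 2}, (2, 1): {1, 2},
    (2, 2): {2},
}

def mendelian_conflicts(trio_calls):
    """trio_calls: dict marker -> (parent1, parent2, offspring); None = missing call."""
    conflicts = []
    for marker, (p1, p2, off) in trio_calls.items():
        if None in (p1, p2, off):
            continue  # missing calls are skipped, not counted as errors
        if off not in VALID[(p1, p2)]:
            conflicts.append(marker)
    return conflicts

# 'SNP_3' is impossible: two homozygous-reference parents cannot give a heterozygote.
calls = {"SNP_1": (0, 1, 1), "SNP_2": (2, 2, 2), "SNP_3": (0, 0, 1)}
print(mendelian_conflicts(calls))  # ['SNP_3']
```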

Author(s):  
Avinash Navlani ◽  
V. B. Gupta

In the last couple of decades, clustering has become a crucial research problem in the data mining community. Clustering refers to partitioning data objects, such as records and documents, into groups or clusters of similar characteristics. Because clustering is unsupervised learning, there is no single solution that suits every problem. Complex data sets often require explanation through multiple clusterings: a data set can contain more than one pattern, and each pattern may be interesting from a different perspective. Traditional clustering approaches, however, generate only a single clustering. Alternative clustering aims to find several distinct groupings of the data set such that each grouping is of high quality and differs from the others. This chapter gives an overall view of alternative clustering: its various approaches, related work, a comparison with easily confused terms such as subspace, multi-view, and ensemble clustering, as well as applications, issues, and challenges.
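As an illustration of the general idea, the sketch below implements one simple alternative-clustering strategy from the literature (orthogonal projection after an initial k-means run); it is not one of the chapter's specific algorithms, and the data and parameters are placeholders.

```python
# After a first k-means clustering, project the data onto the subspace
# orthogonal to the fitted centroids and cluster again, so the second
# grouping captures structure the first one ignored.
import numpy as np
from sklearn.cluster import KMeans

def alternative_clustering(X, k, random_state=0):
    first = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    # Orthonormal basis for the span of the first clustering's centroids.
    Q, _ = np.linalg.qr(first.cluster_centers_.T)
    # Remove the component of each point lying in that span.
    X_residual = X - (X @ Q) @ Q.T
    second = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_residual)
    return first.labels_, second.labels_

X = np.random.default_rng(1).normal(size=(200, 5))
labels_a, labels_b = alternative_clustering(X, k=3)
```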


Geophysics ◽  
1993 ◽  
Vol 58 (9) ◽  
pp. 1281-1296 ◽  
Author(s):  
V. J. S. Grauch

The magnetic data set compiled for the Decade of North American Geology (DNAG) project presents an important digital data base that can be used to examine the North American crust. The data represent a patchwork from many individual airborne and marine magnetic surveys. However, the portion of data for the conterminous U.S. has problems that limit the resolution and use of the data. Now that the data are available in digital form, it is important to describe the data limitations more specifically than before. The primary problem is caused by datum shifts between individual survey boundaries. In the western U.S., the DNAG data are generally shifted less than 100 nT. In the eastern U.S., the DNAG data may be shifted by as much as 300 nT and contain regionally shifted areas with wavelengths on the order of 800 to 1400 km. The worst case is the artificial low centered over Kentucky and Tennessee produced by a series of datum shifts. A second significant problem is a lack of anomaly resolution that arises primarily from using survey data that are too widely spaced compared to the flight heights above magnetic sources. Unfortunately, these are the only data available for much of the U.S. Another problem is produced by the lack of a common observation surface between individual pieces of the U.S. DNAG data. The height disparities introduce variations in spatial frequency content that are unrelated to the magnetization of rocks. The spectral effects of datum shifts and the variation of spatial frequency content due to height disparities were estimated for the DNAG data for the conterminous U.S. As a general guideline for digital filtering, the most reliable features in the U.S. DNAG data have wavelengths roughly between 170 and 500 km, or anomaly half‐widths between 85 and 250 km. High‐quality, large‐region magnetic data sets have become increasingly important to meet exploration and scientific objectives. The acquisition of a new national magnetic data set with higher quality at a greater range of wavelengths is clearly in order. The best approach is to refly much of the U.S. with common specifications and reduction procedures. At the very least, magnetic data sets should be remerged digitally using available or newly flown long‐distance flight‐line data to adjust survey levels. In any case, national coordination is required to produce a consistent, high‐quality national magnetic map.
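The filtering guideline can be sketched as a wavelength band-pass applied to a gridded anomaly field. The example below is a rough illustration, not the processing actually used for the DNAG compilation; the grid spacing and synthetic data are assumptions.

```python
# Keep features of a gridded anomaly field with wavelengths between ~170 and ~500 km.
import numpy as np

def bandpass_wavelength(grid, dx_km, wl_min_km=170.0, wl_max_km=500.0):
    ny, nx = grid.shape
    kx = np.fft.fftfreq(nx, d=dx_km)          # spatial frequency, cycles per km
    ky = np.fft.fftfreq(ny, d=dx_km)
    kr = np.sqrt(kx[None, :] ** 2 + ky[:, None] ** 2)
    with np.errstate(divide="ignore"):
        wavelength = np.where(kr > 0, 1.0 / kr, np.inf)
    keep = (wavelength >= wl_min_km) & (wavelength <= wl_max_km)
    return np.real(np.fft.ifft2(np.fft.fft2(grid) * keep))

# Synthetic 2 km grid, 512 x 512 cells (~1024 km on a side)
anomaly = np.random.default_rng(0).normal(size=(512, 512))
filtered = bandpass_wavelength(anomaly, dx_km=2.0)
```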


2016 ◽  
Author(s):  
Dorothee C. E. Bakker ◽  
Benjamin Pfeil ◽  
Camilla S. Landa ◽  
Nicolas Metzl ◽  
Kevin M. O'Brien ◽  
...  

Abstract. The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.5 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.4 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. Quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) "Living Data" publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).
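A minimal sketch of what an automated range check during upload might look like is given below; the bounds, column name, and data frame are illustrative assumptions, not SOCAT's actual limits or code.

```python
# Flag surface-water fCO2 values outside a plausible window.
import pandas as pd

FCO2_MIN, FCO2_MAX = 0.0, 1000.0  # microatmospheres; hypothetical bounds

def range_check(df, column="fCO2_uatm"):
    """Return the input frame with a boolean 'range_ok' QC column added."""
    df = df.copy()
    df["range_ok"] = df[column].between(FCO2_MIN, FCO2_MAX) & df[column].notna()
    return df

cruise = pd.DataFrame({"fCO2_uatm": [345.2, 410.8, -5.0, 1523.7, None]})
print(range_check(cruise))
```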


2019 ◽  
Author(s):  
Jacob Schreiber ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract
Motivation: Recent efforts to describe the human epigenome have yielded thousands of uniformly processed epigenomic and transcriptomic data sets. These data sets characterize a rich variety of biological activity in hundreds of human cell lines and tissues ("biosamples"). Understanding these data sets, and specifically how they differ across biosamples, can help explain many cellular mechanisms, particularly those driving development and disease. However, due primarily to cost, the total number of assays that can be performed is limited. Previously described imputation approaches, such as Avocado, have sought to overcome this limitation by predicting genome-wide epigenomics experiments using learned associations among available epigenomic data sets. However, these previous imputations have focused primarily on measurements of histone modification and chromatin accessibility, despite other biological activity being crucially important.
Results: We applied Avocado to a data set of 3,814 tracks of data derived from the ENCODE compendium, spanning 400 human biosamples and 84 assays. The resulting imputations cover measurements of chromatin accessibility, histone modification, transcription, and protein binding. We demonstrate the quality of these imputations by comprehensively evaluating the model's predictions and by showing significant improvements in protein binding performance compared to the top models in an ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model, achieving high accuracy at predicting protein binding, even with only a single track of training data.
Availability: Tutorials and source code are available under an Apache 2.0 license at https://github.com/jmschrei/
Contact: [email protected] or [email protected]
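The factorization idea behind imputation models such as Avocado can be sketched in a few lines. The toy example below learns biosample and assay embeddings for a single genomic position and predicts missing combinations; the real model is a deep tensor factorization across genomic positions as well, so this is only a conceptual illustration with made-up dimensions.

```python
# Learn low-dimensional embeddings so their dot product reconstructs observed
# signal, then predict the (biosample, assay) combinations never measured.
import numpy as np

rng = np.random.default_rng(0)
n_biosamples, n_assays, k = 6, 4, 2
signal = rng.random((n_biosamples, n_assays))
observed = rng.random(signal.shape) < 0.7          # mask of performed experiments

B = rng.normal(scale=0.1, size=(n_biosamples, k))  # biosample embeddings
A = rng.normal(scale=0.1, size=(n_assays, k))      # assay embeddings

lr = 0.05
for _ in range(2000):
    pred = B @ A.T
    err = np.where(observed, pred - signal, 0.0)   # gradient only on observed cells
    B -= lr * err @ A
    A -= lr * err.T @ B

imputed = B @ A.T
print(np.abs(imputed - signal)[~observed].mean())  # error on held-out combinations
```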


2019 ◽  
Author(s):  
Inken Wohlers ◽  
Axel Künstner ◽  
Matthias Munz ◽  
Michael Olbrich ◽  
Anke Fähnrich ◽  
...  

Abstract: The human genome is composed of chromosomal DNA sequences consisting of bases A, C, G and T – the blueprint to implement the molecular functions that are the basis of every individual's life. Deciphering the first human genome was a consortium effort that took more than a decade and came at considerable cost. With the latest technological advances, determining an individual's entire personal genome with manageable cost and effort has come within reach. Although the benefits of the all-encompassing genetic information that entire genomes provide are manifold, only a small number of de novo assembled human genomes have been reported to date [1–3], and few have been complemented with population-based genetic variation [4], which is particularly important for North Africans who are not represented in current genome-wide data sets [5–7]. Here, we combine long- and short-read whole-genome next-generation sequencing data with recent assembly approaches into the first de novo assembly of the genome of an Egyptian individual. The resulting assembly demonstrates well-balanced quality metrics and is complemented with high-quality variant phasing via linked reads into haploblocks, which we can associate with gene expression changes in blood. To construct an Egyptian genome reference, we further assayed genome-wide genetic variation occurring in the Egyptian population within a representative cohort of 110 Egyptian individuals. We show that differences in allele frequencies and linkage disequilibrium between Egyptians and Europeans may compromise the transferability of European ancestry-based genetic disease risk and polygenic scores, substantiating the need for multi-ethnic genetic studies and corresponding genome references. The Egyptian genome reference represents a comprehensive population data set based on a high-quality personal genome. It is a proof of concept to be considered by the many national and international genome initiatives underway. More importantly, we anticipate that the Egyptian genome reference will be a valuable resource for precision medicine targeting the Egyptian population and beyond.
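The population-level comparison described here boils down to allele-frequency and linkage-disequilibrium summaries. The sketch below computes per-SNP allele-frequency differences between two cohorts and pairwise r² within a cohort from 0/1/2 dosage matrices; the cohort sizes and random genotypes are placeholders, not the study's data.

```python
import numpy as np

def allele_freq(genotypes):
    """genotypes: (individuals x SNPs) matrix of 0/1/2 alternate-allele counts."""
    return genotypes.mean(axis=0) / 2.0

def r2_matrix(genotypes):
    """Pairwise r^2 between SNP dosage vectors, a common LD summary."""
    corr = np.corrcoef(genotypes, rowvar=False)
    return corr ** 2

rng = np.random.default_rng(42)
cohort_a = rng.integers(0, 3, size=(110, 50))   # e.g. one population sample
cohort_b = rng.integers(0, 3, size=(110, 50))   # e.g. a reference population

freq_diff = allele_freq(cohort_a) - allele_freq(cohort_b)
print(freq_diff.max(), r2_matrix(cohort_a)[0, 1])
```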


2021 ◽  
Vol 12 ◽  
Author(s):  
Akio Onogi ◽  
Daisuke Sekine ◽  
Akito Kaga ◽  
Satoshi Nakano ◽  
Tetsuya Yamada ◽  
...  

It is not yet fully understood which environmental stimuli cause genotype-by-environment (G × E) interactions in real fields, when these interactions occur, or which genes react to them. Large-scale multi-environment data sets are attractive data sources for these questions because they span a wide range of environmental conditions. Here we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set consisting of 25,158 records collected in 52 environments. ECGC identified which meteorological factors shaped the G × E interactions in six traits, including yield, flowering time, and protein content, and when these factors were involved. For example, it revealed the relevance of precipitation around sowing dates and hours of sunshine just before maturity to the interactions observed for yield. Moreover, genome-wide association mapping on the sensitivities to the identified stimuli discovered candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to bring novel insights into the G × E interactions observed in fields.
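A much-simplified reading of the ECGC idea is that genetic correlations between environment pairs shrink as the environments differ more in the responsible covariate. The sketch below approximates genetic correlations with phenotypic correlations of shared genotypes and regresses them on a covariate difference; the data frames, column names, and this correlation proxy are assumptions rather than the published method.

```python
import itertools
import numpy as np
import pandas as pd

def covariate_effect(pheno, env_covariate):
    """pheno: DataFrame (genotypes x environments) of trait values.
    env_covariate: Series indexed by environment (e.g. precipitation sum)."""
    rows = []
    for e1, e2 in itertools.combinations(pheno.columns, 2):
        shared = pheno[[e1, e2]].dropna()
        if len(shared) < 10:
            continue
        rows.append({
            "corr": shared[e1].corr(shared[e2]),
            "cov_diff": abs(env_covariate[e1] - env_covariate[e2]),
        })
    pairs = pd.DataFrame(rows)
    # Slope of correlation on covariate difference: strongly negative slopes
    # point at covariates plausibly driving the G x E interaction.
    return np.polyfit(pairs["cov_diff"], pairs["corr"], 1)[0]

rng = np.random.default_rng(0)
envs = [f"env{i}" for i in range(6)]
pheno = pd.DataFrame(rng.normal(size=(30, 6)), columns=envs)
cov = pd.Series(rng.uniform(0, 100, size=6), index=envs)
print(covariate_effect(pheno, cov))
```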


2020 ◽  
Author(s):  
Haley Hunter-Zinck ◽  
Yunling Shi ◽  
Man Li ◽  
Bryan R. Gorman ◽  
Sun-Gou Ji ◽  
...  

Abstract: The Million Veteran Program (MVP), initiated by the Department of Veterans Affairs (VA), aims to collect consented biosamples from at least one million Veterans. Presently, blood samples have been collected from over 800,000 enrolled participants. The size and diversity of the MVP cohort, as well as the availability of extensive VA electronic health records, make it a promising resource for precision medicine. MVP is conducting array-based genotyping to provide a genome-wide scan of the entire cohort, in parallel with whole genome sequencing, methylation, and other omics assays. Here, we present the design and performance of the MVP 1.0 custom Axiom® array, which was designed and developed as a single assay to be used across the multi-ethnic MVP cohort. A unified genetic quality control analysis was developed and conducted on an initial tranche of 485,856 individuals, leading to a high-quality dataset of 459,777 unique individuals. In total, 668,418 genetic markers passed quality control and showed high-quality genotypes not only for common variants but also for rare variants. We confirmed the substantial ancestral diversity of MVP, with nearly 30% non-European individuals, surpassing other large biobanks. We also demonstrated the quality of the MVP dataset by replicating established genetic associations with height in individuals of European American and African American ancestry. The current data set has been made available to approved MVP researchers for genome-wide association studies and other downstream analyses. Further data releases will be available for analysis as recruitment at the VA continues and the cohort expands both in size and diversity.
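Marker-level quality control of the kind described usually combines call-rate, minor-allele-frequency, and Hardy-Weinberg filters. The sketch below applies such filters to a 0/1/2 dosage matrix; the thresholds are common defaults, not the MVP pipeline's actual settings.

```python
import numpy as np
from scipy.stats import chi2

def marker_qc(genotypes, min_call_rate=0.98, min_maf=0.01, hwe_p=1e-6):
    """genotypes: (individuals x SNPs) array of 0/1/2 dosages, np.nan = missing."""
    keep = np.ones(genotypes.shape[1], dtype=bool)

    call_rate = 1.0 - np.isnan(genotypes).mean(axis=0)
    keep &= call_rate >= min_call_rate

    p = np.nanmean(genotypes, axis=0) / 2.0          # alternate-allele frequency
    maf = np.minimum(p, 1.0 - p)
    keep &= maf >= min_maf

    # 1-d.o.f. Hardy-Weinberg chi-square via the heterozygote deviation:
    # chi2 = n * (obs_het - exp_het)^2 / exp_het^2
    n = (~np.isnan(genotypes)).sum(axis=0)
    obs_het = np.sum(genotypes == 1, axis=0) / n
    exp_het = 2.0 * p * (1.0 - p)
    with np.errstate(divide="ignore", invalid="ignore"):
        stat = n * (obs_het - exp_het) ** 2 / exp_het ** 2
    keep &= chi2.sf(stat, df=1) > hwe_p
    return keep

# Simulated cohort in Hardy-Weinberg equilibrium: most markers should pass.
rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.5, size=100)
G = rng.binomial(2, p_true, size=(500, 100)).astype(float)
print(marker_qc(G).sum(), "of", G.shape[1], "markers pass")
```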


PLoS ONE ◽  
2019 ◽  
Vol 14 (6) ◽  
pp. e0210928 ◽  
Author(s):  
Stijn Vanderzande ◽  
Nicholas P. Howard ◽  
Lichun Cai ◽  
Cassia Da Silva Linge ◽  
Laima Antanaviciute ◽  
...  


1994 ◽  
Vol 27 (5) ◽  
pp. 722-726 ◽  
Author(s):  
J. Grochowski ◽  
P. Serda ◽  
K. S. Wilson ◽  
Z. Dauter

Two data sets were collected on single crystals of hexamethylenetetramine (urotropin): one using a four-circle diffractometer with Cu Kα radiation, the other using an imaging-plate two-dimensional detector with a Mo Kα source and the rotation method. Both data sets extend to the same limit of sin θ/λ = 0.62 Å⁻¹, corresponding to a resolution of 0.81 Å. Different processing protocols were employed for the two sets of data. Structure refinements carried out separately with each data set led to equivalent results of comparable accuracy. The imaging-plate scanner was able to provide X-ray data of high quality in a significantly shorter time than the diffractometer.
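The quoted resolution follows directly from Bragg's law: with d = λ/(2 sin θ), the limit sin θ/λ = 0.62 Å⁻¹ gives d_min = 1/(2 × 0.62) ≈ 0.81 Å. A two-line check:

```python
# Resolution implied by sin(theta)/lambda = 0.62 1/Angstrom, via Bragg's law.
s_max = 0.62                 # (sin theta) / lambda, in 1/Angstrom
d_min = 1.0 / (2.0 * s_max)  # d = lambda / (2 sin theta)
print(f"d_min = {d_min:.2f} Angstrom")  # -> 0.81, matching the quoted resolution
```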

