VariantStore: an index for large-scale genomic variant search

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Prashant Pandey ◽  
Yinjie Gao ◽  
Carl Kingsford

Abstract
Efficiently scaling genomic variant search indexes to thousands of samples is computationally challenging because multiple coordinate systems must be supported to avoid reference bias. We present VariantStore, a system that indexes genomic variants from multiple samples using a variation graph and enables variant queries across any sample-specific coordinate system. We show the scalability of VariantStore by indexing genomic variants from the TCGA project in 4 h and the 1000 Genomes Project in 3 h. Querying for variants in a gene takes between 0.002 and 3 seconds while using memory equal to only 10% of the size of the full representation.

2019 ◽  
Author(s):  
Prashant Pandey ◽  
Yinjie Gao ◽  
Carl Kingsford

Abstract
The ability to efficiently query genomic variants from thousands of samples is critical to achieving the full potential of many medical and scientific applications such as personalized medicine. Performing variant queries based on coordinates in the reference or sample sequences is at the core of these applications. Efficiently supporting variant queries across thousands of samples is computationally challenging. Most solutions support queries only in reference coordinates, and those that support queries in sample-specific coordinates do not scale to data containing more than a few thousand samples. We present VariantStore, a system for efficiently indexing and querying genomic variants and their sequences in either the reference or sample-specific coordinate systems. We show the scalability of VariantStore by indexing genomic variants from the TCGA-BRCA project, containing 8640 samples and 5M variants, in 4 h, and the 1000 Genomes Project, containing 2500 samples and 924M variants, in 3 h. Querying for variants in a gene takes between 0.002 and 3 seconds while using memory equal to only 10% of the size of the full representation.
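A note on what "sample-specific coordinate system" means in practice: indels in a sample shift all downstream positions relative to the reference. The following sketch is only a hypothetical illustration of that offset arithmetic, not VariantStore's variation-graph index; the variant tuples and example values are assumptions made for the example.

```python
# Illustration only (not VariantStore's index): convert a reference coordinate
# to a sample-specific coordinate given that sample's variants.
# Each variant is (ref_pos, ref_len, alt_len); SNVs have ref_len == alt_len.

def ref_to_sample(ref_pos, variants):
    """Shift ref_pos by the net indel length of all variants ending before it."""
    offset = 0
    for pos, ref_len, alt_len in sorted(variants):
        if pos + ref_len <= ref_pos:
            offset += alt_len - ref_len   # insertions push right, deletions pull left
        else:
            break
    return ref_pos + offset

# Hypothetical sample: a 2-bp deletion at reference position 100 and a
# 3-bp insertion at position 250.
sample_variants = [(100, 3, 1), (250, 1, 4)]
print(ref_to_sample(300, sample_variants))  # 300 - 2 + 3 = 301
```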


2016 ◽  
Vol 10 (2) ◽  
Author(s):  
Xianwen Yu ◽  
Huiqing Wang ◽  
Jinling Wang

Abstract
When producing large-scale maps (scales larger than 1:2000) in cities or towns, obstruction by buildings makes measuring mapping control points a difficult and heavy task. To avoid measuring the mapping control points and to shorten fieldwork time, this paper proposes a quick mapping method. The method adjusts many free survey blocks together and transforms the points from all free blocks into the same coordinate system. The entire survey area is divided into free blocks, and connection points are set on the boundaries between blocks. An independent coordinate system for every free block is established via completely free station technology, and the coordinates of the connection points, detail points and control points of every free block in its independent coordinate system are obtained from poly-directional open traverses. Error equations are established from the connection points and solved jointly to obtain the transformation parameters. All points are then transformed from the independent coordinate systems into a transitional coordinate system using these parameters. Several control points are then measured by GPS in a geodetic coordinate system, after which all points are transformed from the transitional coordinate system into the geodetic coordinate system. The paper presents the implementation process and mathematical formulas of the new method in detail and gives a formula for estimating the precision of surveys. An example demonstrates that the precision of the new method can meet the needs of large-scale mapping.
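The core adjustment step, estimating the parameters that map each free block's independent coordinates onto a common (transitional) frame from shared connection points, can be written as a least-squares 2D similarity (four-parameter Helmert) fit. The sketch below is a minimal version of that idea, not the paper's exact formulation; the point values and the linearized parameterization (a = s·cosθ, b = s·sinθ, plus translations tx, ty) are illustrative assumptions.

```python
import numpy as np

def fit_2d_similarity(src, dst):
    """Least-squares 2D similarity transform, linearized as
    x' = a*x - b*y + tx,  y' = b*x + a*y + ty  (a = s*cos(theta), b = s*sin(theta))."""
    A, L = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, -y, 1, 0]); L.append(xp)
        A.append([y,  x, 0, 1]); L.append(yp)
    a, b, tx, ty = np.linalg.lstsq(np.array(A), np.array(L), rcond=None)[0]
    return a, b, tx, ty

def apply_2d_similarity(params, pts):
    a, b, tx, ty = params
    return [(a * x - b * y + tx, b * x + a * y + ty) for x, y in pts]

# Connection points known both in block A's independent system and in the
# transitional system; the fitted parameters then transform all of block A.
block_a_conn = [(0.0, 0.0), (100.0, 0.0), (0.0, 80.0)]
transitional_conn = [(500.0, 200.0), (599.8, 206.1), (495.2, 279.9)]
params = fit_2d_similarity(block_a_conn, transitional_conn)
print(apply_2d_similarity(params, [(50.0, 40.0)]))
```

In the paper's setting the error equations for all blocks are solved jointly; the sketch fits one block at a time only to keep the example short.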


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Huiguang Yi ◽  
Yanling Lin ◽  
Chengqi Lin ◽  
Wenfei Jin

Abstract
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique that is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacteria whole genome sequencing (WGS) runs from the NCBI Sequence Read Archive and find misidentification or contamination in 6164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
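The abstract does not describe Kssd's decomposition itself, so the sketch below shows only the generic idea such tools build on: subsample each genome's k-mer set with a fixed hash rule and estimate similarity (here, Jaccard) from the small sketches. The hash function, k, and sampling fraction are illustrative assumptions and are not Kssd's algorithm.

```python
import hashlib
import random

def kmer_sketch(seq, k=16, fraction=1 / 64):
    """Keep only k-mers whose 32-bit hash falls below a fixed threshold."""
    threshold = int(fraction * 2**32)
    sketch = set()
    for i in range(len(seq) - k + 1):
        h = int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=4).digest(), "big")
        if h < threshold:
            sketch.add(h)
    return sketch

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy genomes: the second shares its first 15 kb with the first.
random.seed(0)
g1 = "".join(random.choice("ACGT") for _ in range(20_000))
g2 = g1[:15_000] + "".join(random.choice("ACGT") for _ in range(5_000))
print(round(jaccard(kmer_sketch(g1), kmer_sketch(g2)), 3))
```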


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Kshitij Srivastava ◽  
Anne-Sophie Fratzscher ◽  
Bo Lan ◽  
Willy Albert Flegel

Abstract
Background: Clinically effective and safe genotyping relies on correct reference sequences, often represented by haplotypes. The 1000 Genomes Project recorded individual genotypes across 26 different populations and, using computerized genotype phasing, reported haplotype data. In contrast, we identified long reference sequences by analyzing the homozygous genomic regions in this online database, a concept that has rarely been reported since next-generation sequencing data became available.
Study design and methods: Phased genotype data for an 80.6 kb region of chromosome 1 were downloaded for all 2,504 unrelated individuals of the 1000 Genomes Project Phase 3 cohort. The data were centered on the ACKR1 gene and bordered by the CADM3 and FCER1A genes. Individuals with heterozygosity at a single site or with complete homozygosity allowed unambiguous assignment of an ACKR1 haplotype. A computer algorithm was developed to extract these haplotypes from the 1000 Genomes Project in an automated fashion. A manual analysis validated the data extracted by the algorithm.
Results: We confirmed 902 ACKR1 haplotypes of varying lengths, the longest at 80,584 nucleotides and the shortest at 1,901 nucleotides. The combined length of the haplotype sequences comprised 19,895,388 nucleotides, with a median of 16,014 nucleotides. Based on our approach, all haplotypes can be considered experimentally confirmed and not affected by the known errors of computerized genotype phasing.
Conclusions: Tracts of homozygosity can provide definitive reference sequences for any gene. They are particularly useful when observed in unrelated individuals of large-scale sequence databases. As a proof of principle, we explored the 1000 Genomes Project database for ACKR1 gene data and mined long haplotypes. These haplotypes are useful for high-throughput analysis with next-generation sequencing. Our approach is scalable, using automated bioinformatics tools, and can be applied to any gene.
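The selection rule described above, keeping only individuals whose phased genotypes across the region are fully homozygous or heterozygous at a single site so that a haplotype can be read off unambiguously, can be sketched as a single pass over a phased VCF. The code below is an assumed reimplementation of that idea for illustration, not the authors' published algorithm; the file name is a placeholder.

```python
import gzip

def unambiguous_haplotype_samples(vcf_path):
    """Count heterozygous sites per sample in a phased VCF slice; samples with
    0 or 1 heterozygous sites allow unambiguous haplotype assignment."""
    samples, het_counts = [], {}
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("##"):
                continue
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]
                het_counts = {s: 0 for s in samples}
                continue
            for sample, gt_field in zip(samples, fields[9:]):
                gt = gt_field.split(":")[0]                 # e.g. "0|1"
                alleles = gt.replace("/", "|").split("|")
                if len(set(alleles)) > 1:
                    het_counts[sample] += 1
    return {s: n for s, n in het_counts.items() if n <= 1}

# Placeholder input: a phased VCF slice covering the CADM3-ACKR1-FCER1A region.
print(len(unambiguous_haplotype_samples("ackr1_region.vcf.gz")))
```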


2019 ◽  
Vol 35 (22) ◽  
pp. 4851-4853 ◽  
Author(s):  
Mihir A Kamat ◽  
James A Blackshaw ◽  
Robin Young ◽  
Praveen Surendran ◽  
Stephen Burgess ◽  
...  

Abstract
Summary: PhenoScanner is a curated database of publicly available results from large-scale genetic association studies in humans. This online tool facilitates ‘phenome scans’, where genetic variants are cross-referenced for association with many phenotypes of different types. Here we present a major update of PhenoScanner (‘PhenoScanner V2’), including over 150 million genetic variants and more than 65 billion associations (compared to 350 million associations in PhenoScanner V1) with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers. The query options have been extended to include searches by genes, genomic regions and phenotypes, as well as for genetic variants. All variants are positionally annotated using the Variant Effect Predictor and the phenotypes are mapped to Experimental Factor Ontology terms. Linkage disequilibrium statistics from the 1000 Genomes Project can be used to search for phenotype associations with proxy variants.
Availability and implementation: PhenoScanner V2 is available at www.phenoscanner.medschl.cam.ac.uk.
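Conceptually, a phenome scan is a cross-reference of query variants against a large table of variant-phenotype associations. The sketch below illustrates only that concept against a hypothetical local TSV export; it does not use PhenoScanner's actual web interface or API, and the column names and threshold are assumptions.

```python
import csv

def phenome_scan(variant_ids, associations_tsv, p_threshold=5e-8):
    """Return (rsid, trait, beta, p) rows for query variants passing a p-value cutoff.
    Assumes a hypothetical TSV with columns: rsid, trait, beta, p."""
    hits = []
    with open(associations_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["rsid"] in variant_ids and float(row["p"]) < p_threshold:
                hits.append((row["rsid"], row["trait"], float(row["beta"]), float(row["p"])))
    return hits

# Hypothetical association table exported from a GWAS results resource.
for hit in phenome_scan({"rs12345", "rs67890"}, "associations.tsv"):
    print(hit)
```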


2021 ◽  
Author(s):  
Nae-Chyun Chen ◽  
Alexey Kolesnikov ◽  
Sidharth Goel ◽  
Taedong Yun ◽  
Pi-Chuan Chang ◽  
...  

Large-scale population variant data are often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling and are often limited to filtering, which trades recall for precision. In this study, we modify DeepVariant to add a new channel encoding population allele frequencies from the 1000 Genomes Project. We show that this model reduces variant calling errors, improving both precision and recall. We assess the impact of using population-specific or diverse reference panels. We achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches the sample's ancestry. Finally, we show that this benefit generalizes to samples whose ancestry differs from that of the training data, even when that ancestry is also excluded from the reference panel.
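The central idea, feeding a population allele-frequency signal to the caller as an additional image channel next to the read-pileup channels, can be illustrated with a toy encoding. The sketch below is not DeepVariant's actual channel implementation; the 0-255 scaling and the tensor shapes are illustrative assumptions.

```python
import numpy as np

def allele_frequency_channel(candidate_af, height=100, width=221):
    """Encode a candidate's population allele frequency as a constant uint8
    channel with the same shape as the read-pileup channels."""
    value = int(round(min(max(candidate_af, 0.0), 1.0) * 255))
    return np.full((height, width), value, dtype=np.uint8)

# Stack the extra channel onto placeholder pileup channels (bases, quality, ...).
pileup = np.zeros((100, 221, 6), dtype=np.uint8)     # toy pileup tensor
af_chan = allele_frequency_channel(0.37)             # AF taken from a reference panel
example = np.concatenate([pileup, af_chan[..., None]], axis=-1)
print(example.shape)                                 # (100, 221, 7)
```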


1975 ◽  
Vol 26 ◽  
pp. 87-92
Author(s):  
P. L. Bender

Abstract
Five important geodynamical quantities which are closely linked are: 1) motions of points on the Earth’s surface; 2) polar motion; 3) changes in UT1-UTC; 4) nutation; and 5) motion of the geocenter. For each of these we expect to achieve measurements in the near future which have an accuracy of 1 to 3 cm or 0.3 to 1 milliarcsec.

From a metrological point of view, one can say simply: “Measure each quantity against whichever coordinate system you can make the most accurate measurements with respect to”. I believe that this statement should serve as a guiding principle for the recommendations of the colloquium. However, it also is important that the coordinate systems help to provide a clear separation between the different phenomena of interest, and correspond closely to the conceptual definitions in terms of which geophysicists think about the phenomena.

In any discussion of angular motion in space, both a “body-fixed” system and a “space-fixed” system are used. Some relevant types of coordinate systems, reference directions, or reference points which have been considered are: 1) celestial systems based on optical star catalogs, distant galaxies, radio source catalogs, or the Moon and inner planets; 2) the Earth’s axis of rotation, which defines a line through the Earth as well as a celestial reference direction; 3) the geocenter; and 4) “quasi-Earth-fixed” coordinate systems.

When a geophysicist discusses UT1 and polar motion, he usually is thinking of the angular motion of the main part of the mantle with respect to an inertial frame and to the direction of the spin axis. Since the velocities of relative motion in most of the mantle are expected to be extremely small, even if “substantial” deep convection is occurring, the conceptual “quasi-Earth-fixed” reference frame seems well defined. Methods for realizing a close approximation to this frame fortunately exist. Hopefully, this colloquium will recommend procedures for establishing and maintaining such a system for use in geodynamics. Motion of points on the Earth’s surface and of the geocenter can be measured against such a system with the full accuracy of the new techniques.

The situation with respect to celestial reference frames is different. The various measurement techniques give changes in the orientation of the Earth relative to different systems, so that we would like to know the relative motions of the systems in order to compare the results. However, there does not appear to be a need for defining any new system. Subjective figures of merit for the various systems depend on both the accuracy with which measurements can be made against them and the degree to which they can be related to inertial systems.

The main coordinate system requirement related to the 5 geodynamic quantities discussed in this talk is thus the establishment and maintenance of a “quasi-Earth-fixed” coordinate system which closely approximates the motion of the main part of the mantle. Changes in the orientation of this system with respect to the various celestial systems can be determined by both the new and the conventional techniques, provided that some knowledge of changes in the local vertical is available. Changes in the axis of rotation and in the geocenter with respect to this system also can be obtained, as well as measurements of nutation.


1975 ◽  
Vol 26 ◽  
pp. 21-26

An ideal definition of a reference coordinate system should meet the following general requirements:
1. It should be as conceptually simple as possible, so its philosophy is well understood by the users.
2. It should imply as few physical assumptions as possible. Wherever they are necessary, such assumptions should be of a very general character and, in particular, they should not be dependent upon detailed astronomical and geophysical theories.
3. It should suggest a materialization that is dynamically stable and is accessible to observations with the required accuracy.


2020 ◽  
Vol 962 (8) ◽  
pp. 24-37
Author(s):  
V.E. Tereshchenko

The article proposes a technique for relating a global kinematic reference system to a local static realization of that system maintained by a regional network of continuously operating reference stations (CORS). Using the regional CORS network located in the Novosibirsk Region (CORS NSO) as an example, the relation parameters between the global reference system WGS-84 and its local static realization by the CORS NSO network are calculated at the epoch at which the station coordinates were fixed in the catalog. In this technique, the main parameters to be determined are the velocity of displacement of one system's origin relative to the other and the rotation rates of the coordinate axes of one system relative to the other, since the time evolution of most stations in the Russian Federation is not currently provided. The article shows that the scale factor does not always need to be considered when determining the relation between coordinate systems. The technique also allows errors in the coordinates of the CORS network in the global coordinate system to be detected and compensated. A systematic error in determining and fixing the CORS NSO coordinates in the global coordinate system was detected; most of this error falls on the height component and reaches 12 cm. The proposed technique creates the conditions for practical use of the Precise Point Positioning (PPP) method in some regions of the Russian Federation. It also ensures that PPP results are consistent with the results of the other high-precision post-processing methods most commonly used in the Russian Federation.
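The frame relation estimated in the article, a translation rate and axis rotation rates (with the scale rate optionally omitted) between a kinematic global frame and a static local realization, has the form of a time-dependent Helmert transformation. The sketch below applies such a transformation for given parameters; the parameter values, epochs, and small-angle formulation are illustrative assumptions, not the CORS NSO results.

```python
import numpy as np

def propagate_frame(xyz, epoch, ref_epoch, t_rate, r_rate, scale_rate=0.0):
    """Time-dependent Helmert-style transformation with small rotation angles:
    X' = X + dt * (T_rate + S_rate * X + W_rate x X), dt in years."""
    dt = epoch - ref_epoch
    xyz = np.asarray(xyz, dtype=float)
    rotation = np.cross(np.asarray(r_rate, dtype=float), xyz)   # small-angle term
    return xyz + dt * (np.asarray(t_rate, dtype=float) + scale_rate * xyz + rotation)

# Illustrative values only: ECEF coordinates in metres, translation rate in m/yr,
# rotation rates in rad/yr.
station = [454_000.0, 3_638_000.0, 5_202_000.0]
t_rate = [0.001, -0.0005, 0.0008]
r_rate = [2e-9, -1e-9, 3e-9]
print(propagate_frame(station, epoch=2024.0, ref_epoch=2010.0,
                      t_rate=t_rate, r_rate=r_rate))
```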

