Trellis for efficient data and task management in the VA Million Veteran Program

2021
Vol 11 (1)
Author(s):  
Paul Billing Ross ◽  
Jina Song ◽  
Philip S. Tsao ◽  
Cuiping Pan

Abstract
Biomedical studies have grown in scale and now yield large quantities of data, yet efficient data processing remains a challenge. Here we present Trellis, a cloud-based data and task management framework that completely automates the process from data ingestion to result presentation, while tracking data lineage, facilitating information queries, and supporting fault tolerance and scalability. Using a graph database to coordinate the state of the data processing workflows and a scalable microservice architecture to perform bioinformatics tasks, Trellis has enabled efficient variant calling on 100,000 human genomes collected in the VA Million Veteran Program.
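The abstract describes a graph database that coordinates workflow state while microservices execute the bioinformatics tasks. The following is a minimal sketch of that pattern, assuming a Neo4j-style property graph; the node labels, properties, and Cypher statements are illustrative assumptions, not Trellis's actual schema.

```python
# Minimal sketch: recording workflow state in a property graph (illustrative only).
# Assumes a local Neo4j instance; the (:Sample)/(:Job) labels and properties are
# hypothetical and not taken from the Trellis schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def record_job(tx, sample_id, task, status):
    # Attach a job node to the sample it processes, preserving lineage as an edge.
    tx.run(
        """
        MERGE (s:Sample {id: $sample_id})
        CREATE (j:Job {task: $task, status: $status})
        CREATE (s)-[:HAS_JOB]->(j)
        """,
        sample_id=sample_id, task=task, status=status,
    )

with driver.session() as session:
    # A task-dispatching microservice might record state transitions like this.
    session.execute_write(record_job, "SAMPLE-001", "variant_calling", "RUNNING")

driver.close()
```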

2018
Author(s):  
Michael F. Lin ◽  
Ohad Rodeh ◽  
John Penn ◽  
Xiaodong Bai ◽  
Jeffrey G. Reid ◽  
...  

Abstract
As ever-larger cohorts of human genomes are collected in pursuit of genotype/phenotype associations, sequencing informatics must scale up to yield complete and accurate genotypes from vast raw datasets. Joint variant calling, a data processing step entailing simultaneous analysis of all participants sequenced, exhibits this scaling challenge acutely. We present GLnexus (GL, Genotype Likelihood), a system for joint variant calling designed to scale up to the largest foreseeable human cohorts. GLnexus combines scalable joint calling algorithms with a persistent database that grows efficiently as additional participants are sequenced. We validate GLnexus using 50,000 exomes to show it produces comparable or better results than existing methods, at a fraction of the computational cost and with better scaling. We provide a standalone open-source version of GLnexus and a DNAnexus cloud-native deployment supporting very large projects, which has been employed for cohorts of >240,000 exomes and >22,000 whole genomes.
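GLnexus is distributed with an open-source command-line tool that joint-calls a set of per-sample gVCFs into a single project BCF. The snippet below is a minimal sketch of driving it from Python; the `glnexus_cli` binary name and `--config DeepVariant` preset follow the public documentation, and the input and output paths are hypothetical.

```python
# Minimal sketch: invoking the open-source GLnexus CLI to joint-call per-sample
# gVCFs. The --config preset and file paths are assumptions; check them against
# the GLnexus version you install.
import glob
import subprocess

gvcfs = sorted(glob.glob("gvcfs/*.g.vcf.gz"))  # per-sample gVCFs (hypothetical path)

with open("cohort.bcf", "wb") as out:
    # GLnexus writes the joint-called BCF to stdout, so redirect it to a file.
    subprocess.run(
        ["glnexus_cli", "--config", "DeepVariant", *gvcfs],
        stdout=out,
        check=True,
    )
```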


Author(s):  
Man Tianxing ◽  
Nataly Zhukova ◽  
Alexander Vodyaho ◽  
Tin Tun Aung

Extracting knowledge from data streams received from observed objects through data mining is required in various domains. However, there is little guidance on which techniques can or should be used in which contexts. Meta mining technology can help build data processing workflows from knowledge models that take into account the specific features of the objects. This paper proposes a meta mining ontology framework that allows selecting algorithms for specific data mining tasks and building suitable processes. The proposed ontology is constructed from existing ontologies and is extended with an ontology of data characteristics and task requirements. Unlike existing ontologies, the proposed ontology describes the overall data mining process, can be used to build data processing workflows in various domains, and has low computational complexity compared to others. The authors developed an ontology merging method and a sub-ontology extraction method, which are implemented on top of the OWL API by extracting and integrating the relevant axioms.
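The paper implements ontology merging and sub-ontology extraction on the Java OWL API; as a rough Python analogue (a substitution, not the authors' code), the sketch below uses owlready2 with hypothetical ontology files and class names.

```python
# Minimal sketch: merging two ontologies and extracting a sub-ontology around a
# seed class. owlready2 stands in for the Java OWL API used in the paper; the
# ontology IRIs and the class name are hypothetical.
from owlready2 import get_ontology

base = get_ontology("file://dm_process.owl").load()             # data mining process ontology
extra = get_ontology("file://data_characteristics.owl").load()  # data/task-requirement ontology

# "Merge" by importing the second ontology so its axioms are visible alongside
# the base ontology when the combined model is queried or reasoned over.
base.imported_ontologies.append(extra)
base.save(file="merged.owl")

# Extract a sub-ontology: keep a seed class together with its ancestors and
# descendants (a simple locality-style selection of relevant classes).
seed = base.search_one(iri="*ClusteringAlgorithm")  # hypothetical class
relevant = set(seed.ancestors()) | set(seed.descendants())
print(f"{len(relevant)} classes retained around {seed.name}")
```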


2018
Author(s):  
Allison A. Regier ◽  
Yossi Farjoun ◽  
David Larson ◽  
Olga Krasheninina ◽  
Hyun Min Kang ◽  
...  

Abstract
Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce “functionally equivalent” (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results – including single nucleotide (SNV), insertion/deletion (indel) and structural variation (SV) – and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide “big-data” human genetics studies.
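The claim that independent FE pipelines produce nearly identical calls can be checked, at a coarse level, by comparing the variant sites each pipeline reports for the same sample. The sketch below is one simple way to do that with pysam; the file names are hypothetical and this is not the evaluation code used in the study.

```python
# Minimal sketch: Jaccard concordance of biallelic SNV sites between VCFs from
# two pipelines run on the same sample. File names are hypothetical.
import pysam

def snv_sites(path):
    """Collect biallelic SNV sites as (chrom, pos, ref, alt) tuples."""
    sites = set()
    with pysam.VariantFile(path) as vcf:
        for rec in vcf:
            if rec.alts and len(rec.alts) == 1 and len(rec.ref) == 1 and len(rec.alts[0]) == 1:
                sites.add((rec.chrom, rec.pos, rec.ref, rec.alts[0]))
    return sites

a = snv_sites("pipeline_A.sample1.vcf.gz")
b = snv_sites("pipeline_B.sample1.vcf.gz")
print(f"SNV site concordance (Jaccard): {len(a & b) / len(a | b):.4f}")
```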


Author(s):  
Sebastian Götz ◽  
Thomas Ilsche ◽  
Jorge Cardoso ◽  
Josef Spillner ◽  
Uwe Aßmann ◽  
...  

Database technology is highly developed for the many uses to which it is put; yet tomorrow will bring new challenges and demands that it is ill-equipped to meet. The rigors and demands of the current Information Age push information systems toward more universal solutions that are not predicated on proprietary, commercially driven designs. In the Information Age, the need for greater data processing capability grows inherent with the times, and in the Digital Age it is assumed that this increased data processing will continue to be carried out by discrete electronic computing systems in the many forms they will take. The continued development of more efficient data models, and of the database systems designed to leverage them, will carry forward the culmination of the current era and the dawning of new endeavors for human curiosity and our willingness to learn and explore ever further into the beyond. Tackling these issues is the direct purpose of the LISA Universal Informationbase System (the LISA Informationbase): to integrate data of diverse forms into a semi-ubiquitous structure and to increase the automation of information content for use by its patrons within a powerful database management technology. This chapter surveys this driving technology and its applications, covering Stage-I, Stage-II, and Stage-III of the NITA Methodology in its development process.

