A computational framework to analyze human genomes

2019 ◽  
Vol 35 (2) ◽  
pp. 105-118
Author(s):  
Vinh Le

The advent of genomic technologies has ushered in the current genomic era. Large-scale human genome projects have produced a huge amount of genomic data. Analyzing human genomes is a challenging task comprising a number of key steps, from short-read alignment through variant calling to variant annotation. In this paper, the state-of-the-art computational methods and databases for each step are analyzed to suggest a practical and efficient guideline for whole human genome analyses. This paper also discusses frameworks to combine variants from various genome analysis pipelines to obtain reliable variant sets. Finally, we address the advantages as well as the discordances of widely used variant annotation methods for evaluating the clinical significance of variants. The review will empower bioinformaticians to efficiently perform human genome analyses and, more importantly, help genetic consultants understand and properly interpret mutations for clinical purposes.
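
As a concrete illustration of the steps described, the sketch below chains a typical short-read alignment and variant-calling pipeline. It assumes bwa, samtools, and gatk are installed and on PATH; the file names (GRCh38.fa, sample_R1.fastq.gz, etc.) are illustrative, not drawn from the paper.

```python
# Minimal sketch of a short-read alignment -> variant-calling pipeline.
# Assumes bwa, samtools, and gatk are on PATH and the reference is already
# indexed (bwa index / samtools faidx); all file names are illustrative.
import subprocess

REF = "GRCh38.fa"
R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# 1. Align reads with BWA-MEM and coordinate-sort the output in one stream.
bwa = subprocess.Popen(["bwa", "mem", "-t", "8", REF, R1, R2],
                       stdout=subprocess.PIPE)
subprocess.run(["samtools", "sort", "-@", "4", "-o", "sample.sorted.bam", "-"],
               stdin=bwa.stdout, check=True)
bwa.stdout.close()
bwa.wait()

# 2. Index the BAM so the caller can random-access it.
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)

# 3. Call small variants with GATK HaplotypeCaller.
subprocess.run(["gatk", "HaplotypeCaller", "-R", REF,
                "-I", "sample.sorted.bam", "-O", "sample.vcf.gz"], check=True)
```

Variant annotation (e.g., with VEP or ANNOVAR) would follow as a fourth stage, and the combination frameworks the paper discusses would merge the VCFs produced by several such pipelines.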

2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazi Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. We therefore expect these data to be useful for revealing novel information about the human genome and for improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
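
By way of example, benchmarking a call set against one of these truth sets reduces, at its crudest, to a position-and-allele comparison. The toy sketch below illustrates the idea; production benchmarking should use haplotype-aware comparison tools such as hap.py, and the file names here are illustrative.

```python
# Toy concordance check of a call set against a GIAB truth VCF.
# A rough sketch only: real benchmarking uses haplotype-aware tools.
import gzip

def load_variants(path):
    """Collect (chrom, pos, ref, alt) tuples from a bgzipped VCF."""
    variants = set()
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            variants.add((chrom, pos, ref, alt))
    return variants

truth = load_variants("giab_truth.vcf.gz")   # illustrative file names
calls = load_variants("my_calls.vcf.gz")

tp = len(truth & calls)                      # exact position+allele matches
print(f"recall={tp / len(truth):.3f} precision={tp / len(calls):.3f}")
```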


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Michael D. Linderman ◽  
Davin Chia ◽  
Forrest Wallace ◽  
Frank A. Nothaft

Background: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole-exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.

Results: DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared-memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2,535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster.

Conclusions: We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.
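
For a flavour of how such a computation maps onto Spark, here is a minimal PySpark sketch of the PCA-normalization stage that XHMM applies to the read-depth matrix before HMM segmentation. This is an illustration under assumed inputs, not DECA's actual code; the input path and the choice of k are invented.

```python
# Sketch of XHMM-style PCA normalization of a read-depth matrix on Spark.
# Not DECA's implementation; input path and k are illustrative.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("xhmm-sketch").getOrCreate()
sc = spark.sparkContext

# One line per sample: mean read depth for every exome target, tab-separated.
rows = (sc.textFile("depth_matrix.tsv")
          .map(lambda line: Vectors.dense([float(x) for x in line.split("\t")])))

# Compute the top-k principal components, which capture systematic effects
# (batch, GC bias) that XHMM removes before segmentation.
k = 5
pcs_np = RowMatrix(rows).computePrincipalComponents(k).toArray()  # (targets, k)

# Project each sample's depth vector out of the top-k PC subspace in parallel.
normalized = rows.map(lambda v: v.toArray() - pcs_np @ (pcs_np.T @ v.toArray()))
normalized.map(lambda a: "\t".join(map(str, a))).saveAsTextFile("normalized_depths")
```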


2016 ◽  
Author(s):  
Ari Löytynoja ◽  
Nick Goldman

Resequencing efforts are uncovering the extent of genetic variation in humans and provide data to study the evolutionary processes shaping our genome. One recurring puzzle in both intra- and inter-species studies is the high frequency of complex mutations comprising multiple nearby base substitutions or insertions and deletions. We devised a generalized mutation model of template switching during replication that extends existing models of genome rearrangement, and used this to study the role of template-switch events in the origin of such mutation clusters. Applied to the human genome, our model detects thousands of template-switch events during the evolution of human and chimpanzee from their common ancestor, and hundreds of events between two independently sequenced human genomes. While many of these are consistent with the template-switch mechanism previously proposed for bacteria but not thought significant in higher organisms, our model also identifies new types of mutations that create short inversions, some flanked by paired inverted repeats. The local template-switch process can create numerous complex mutation patterns, including hairpin-loop structures, and explains multi-nucleotide mutations and compensatory substitutions without invoking positive selection, complicated and speculative mechanisms, or implausible coincidence. Clustered sequence differences are challenging for mapping and variant-calling methods, and we show that detecting mutation clusters with current resequencing methodologies is difficult and that many erroneous variant annotations exist in human reference data. Template-switch events such as those we have uncovered may have been neglected as an explanation for complex mutations because of biases in commonly used analyses. Incorporating our model into reference-based analysis pipelines and comparing de novo-assembled genomes will lead to an improved understanding of genome variation and evolution.
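
To make the inversion signature concrete, the toy function below tests whether an observed cluster of mismatches is explained by the reverse complement of the local reference, the hallmark of the short inversions the model detects. This is a simplification (the published model infers the template-switch points themselves), and the sequences are made up.

```python
# Toy check for a template-switch signature: does a cluster of mismatches
# match the reverse complement of the local reference?
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMP)[::-1]

def looks_inverted(ref_segment, alt_segment):
    """True if the observed segment equals the reverse complement of the
    reference segment, as expected for a short local inversion."""
    return alt_segment == revcomp(ref_segment)

print(looks_inverted("ACCTG", "CAGGT"))  # True: CAGGT is the revcomp of ACCTG
```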


2019 ◽  
Author(s):  
Ankit Kumar Pathak ◽  
Ashwin Kumar Jainarayanan ◽  
Samir Kumar Brahmachari

With large-scale human genome and exome sequencing, much of the focus has been on the variation present in genomes and its association with various diseases. Given this emphasis on variation, less attention has been paid to genes that are invariant in the population. Here we analyze the exomes of 60,706 individuals from the ExAC database to identify population-specific invariant genes. Of the 1,336 genes identified as invariant in at least one population, 423 were mostly invariant (allele frequency less than 0.001) across all populations. Forty-six of these invariant genes showed absolute invariance in all populations. Most of these common invariant genes have homologs in primates, rodents, and placental mammals, while 8 were unique to the human genome and 3 still had unknown functions. Surprisingly, a majority were found to be X-linked, and around 50% of these genes were not expressed in any tissue. Functional analysis showed that the invariant genes are involved not only in fundamental functions such as transcription and translation but also in various developmental processes. Variations in many of these invariant genes were found to be associated with cancer, developmental diseases, and dominant genetic disorders.
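
A minimal sketch of the invariance filter described above, assuming a flat variant table with gene, population, and allele-frequency columns; the file and column names are illustrative, not ExAC's actual schema:

```python
# Sketch of the population-invariance filter: a gene counts as "mostly
# invariant" in a population if none of its variants exceeds allele
# frequency 0.001. File and column names are illustrative.
import pandas as pd

variants = pd.read_csv("exac_variants.tsv", sep="\t")  # gene, population, allele_freq

# Highest allele frequency per gene in each population; genes with no
# observed variants in a population are treated as frequency 0.
max_af = (variants.groupby(["gene", "population"])["allele_freq"]
                  .max().unstack().fillna(0.0))

mostly_invariant = max_af[(max_af < 0.001).all(axis=1)]   # every population
absolutely_invariant = max_af[(max_af == 0.0).all(axis=1)]
print(len(mostly_invariant), "mostly invariant;",
      len(absolutely_invariant), "absolutely invariant")
```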


2013 ◽  
Author(s):  
Claudia Gonzaga-Jauregui

Current genome-wide technologies allow interrogation and exploration of the human genome as never before. Next-generation sequencing (NGS) technologies, along with high-resolution single nucleotide polymorphism (SNP) arrays and array comparative genomic hybridization (aCGH), enable assessment of human genome variation at the finest resolution, from base-pair changes such as single nucleotide variants (SNVs) to large copy-number variants (CNVs). The application of these genomic technologies in the clinical setting has also enabled the molecular characterization of genetic disorders and an understanding of the biological functions of more genes in human development, disease, and health. This review summarizes and discusses the current approaches and platforms available for high-throughput human genome analyses, the steps involved in these methodologies from sample preparation to data analysis, and their applications and limitations.


2017 ◽  
Author(s):  
Sergei Yakneen ◽  
Sebastian M. Waszak ◽  
Michael Gertz ◽  
Jan O. Korbel

We present Butler, a computational framework developed in the context of the international Pan-cancer Analysis of Whole Genomes (PCAWG) project [1] to overcome the challenges of orchestrating analyses of thousands of human genomes on the cloud. Butler operates equally well on public and academic clouds. This highly flexible framework facilitates management of virtual cloud infrastructure, software configuration, and genomics workflow development, and provides unique capabilities in workflow execution management. By comprehensively collecting and analysing metrics and logs, performing anomaly detection, and providing notification and cluster self-healing, Butler enables large-scale analytical processing of human genomes with 43% increased throughput compared to prior setups. Butler was key to delivering the germline genetic variant call-sets for the 2,834 cancer genomes analysed by PCAWG [1].
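
The monitor-detect-heal loop at the heart of such a system can be caricatured in a few lines. The sketch below is a generic illustration, not Butler's API; the hosts, metric, and threshold are invented.

```python
# Generic sketch of a monitor -> detect -> heal loop for a cloud fleet.
# Not Butler's API; hostnames, metrics, and thresholds are illustrative.
import random
import time

FLEET = ["worker-1", "worker-2", "worker-3"]

def collect_metrics(host):
    # Placeholder for a real metrics agent; returns a synthetic load value.
    return {"load": random.uniform(0.0, 2.0)}

def is_anomalous(metrics):
    # Trivial threshold rule standing in for real anomaly detection.
    return metrics["load"] > 1.5

def heal(host):
    # Stand-in for cluster self-healing: restart the VM, re-queue its tasks.
    print(f"restarting {host} and re-queuing its workflow tasks")

for _ in range(3):                      # one monitoring sweep per iteration
    for host in FLEET:
        if is_anomalous(collect_metrics(host)):
            heal(host)
    time.sleep(1)
```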


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Mohammadreza Yaghoobi ◽  
Krzysztof S. Stopka ◽  
Aaditya Lakshmanan ◽  
Veera Sundararaghavan ◽  
John E. Allison ◽  
...  

The PRISMS-Fatigue open-source framework for simulation-based analysis of microstructural influences on fatigue resistance for polycrystalline metals and alloys is presented here. The framework uses the crystal plasticity finite element method as its microstructure analysis tool and provides a highly efficient, scalable, flexible, and easy-to-use ICME community platform. The PRISMS-Fatigue framework is linked to different open-source software packages to instantiate microstructures, compute the material response, and assess fatigue indicator parameters. The performance of PRISMS-Fatigue is benchmarked against a similar framework implemented using ABAQUS. Results indicate that the multilevel parallelism scheme of PRISMS-Fatigue is more efficient and scalable than ABAQUS for large-scale fatigue simulations. The performance and flexibility of this framework are demonstrated with various examples that assess the driving force for fatigue crack formation in microstructures with different crystallographic textures, grain morphologies, and grain counts, and under different multiaxial strain states, strain magnitudes, and boundary conditions.
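
As an example of the kind of fatigue indicator parameter (FIP) such frameworks evaluate per grain or element, the sketch below computes the widely used Fatemi-Socie FIP. The numerical inputs are invented for illustration and are not PRISMS-Fatigue output.

```python
# Sketch of the Fatemi-Socie fatigue indicator parameter:
#   FIP = (max plastic shear strain range / 2) * (1 + k * sigma_n / sigma_y)
# Inputs are illustrative numbers, not PRISMS-Fatigue output.
def fatemi_socie_fip(delta_gamma_max, sigma_n_max, sigma_y, k=0.5):
    """delta_gamma_max: max plastic shear strain range on the critical plane;
    sigma_n_max: peak stress normal to that plane; sigma_y: yield strength."""
    return 0.5 * delta_gamma_max * (1.0 + k * sigma_n_max / sigma_y)

print(fatemi_socie_fip(delta_gamma_max=2.0e-3, sigma_n_max=150.0, sigma_y=500.0))
```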

