GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets

Abstract Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient’s individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine’s main objective—ensuring the optimum diagnosis, treatment and prognosis for each individual—investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data—and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review conform to six specific inclusion parameters that are: (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) accessible through a website or collaboration with investigators and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).

Download Full-text

Speeding Up Large-Scale Next Generation Sequencing Data Analysis with pBWA

Journal of Applied Bioinformatics & Computational Biology ◽

10.4172/2329-9533.1000101 ◽

2017 ◽

Vol 01 (01) ◽

Cited By ~ 4

Author(s):

Darren Peters ◽

Xuemei Luo ◽

Ke Qiu ◽

Ping Liang

Keyword(s):

Data Analysis ◽

Next Generation Sequencing ◽

Large Scale ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing ◽

Sequencing Data Analysis

Download Full-text

NGSPERL: a semi-automated framework for large scale next generation sequencing data analysis

International Journal of Computational Biology and Drug Design ◽

10.1504/ijcbdd.2015.072082 ◽

2015 ◽

Vol 8 (3) ◽

pp. 203

Author(s):

Quanhu Sheng ◽

Shilin Zhao ◽

Mingsheng Guo ◽

Yu Shyr

Keyword(s):

Data Analysis ◽

Next Generation Sequencing ◽

Large Scale ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing ◽

Sequencing Data Analysis

Download Full-text

Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data

PROTEOMICS ◽

10.1002/pmic.201400206 ◽

2014 ◽

Vol 14 (23-24) ◽

pp. 2719-2730 ◽

Cited By ~ 45

Author(s):

Sunghee Woo ◽

Seong Won Cha ◽

Seungjin Na ◽

Clark Guest ◽

Tao Liu ◽

...

Keyword(s):

Next Generation Sequencing ◽

Large Scale ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Download Full-text

rmvPFBAM: Removing Primers from BAM Files Based on Amplicon-Based Next-Generation Sequencing and Cloud Computing When Analyzing Personal Genome Data

Scientific Programming ◽

10.1155/2021/6536470 ◽

2021 ◽

Vol 2021 ◽

pp. 1-6

Author(s):

Yanjun Ma

Keyword(s):

Next Generation Sequencing ◽

False Positive ◽

Large Scale ◽

Genomic Data ◽

Next Generation Sequencing Data ◽

Personal Genome ◽

Next Generation ◽

Sequencing Data ◽

Personal Genomic ◽

Generation Sequencing

Personal genomic data constitute one important part of personal health data. However, due to the large amount of personal genomic data obtained by the next-generation sequencing technology, special tools are needed to analyze these data. In this article, we will explore a tool analyzing cloud-based large-scale genome sequencing data. Analyzing and identifying genomic variations from amplicon-based next-generation sequencing data are necessary for the clinical diagnosis and treatment of cancer patients. When processing the amplicon-based next-generation sequencing data, one essential step is removing primer sequences from the reads to avoid detecting false-positive mutations introduced by nonspecific primer binding and primer extension reactions. At present, the removing primer tools usually discard primer sequences from the FASTQ file instead of BAM file, but this method could cause some downstream analysis problems. Only one tool (BAMClipper) removes primer sequences from BAM files, but it only modified the CIGAR value of the BAM file, and false-positive mutations falling in the primer region could still be detected based on its processed BAM file. So, we developed one cutting primer tool (rmvPFBAM) removing primer sequences from the BAM file, and the mutations detected based on the processed BAM file by rmvPFBAM are highly credible. Besides that, rmvPFBAM runs faster than other tools, such as cutPrimers and BAMClipper.

Download Full-text

Parkour LIMS: facilitating high-quality sample preparation in next generation sequencing

10.1101/338533 ◽

2018 ◽

Author(s):

E. Anatskiy ◽

D.P. Ryan ◽

B. Grüning ◽

L. Arrigoni ◽

T. Manke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sample Preparation ◽

Web Application ◽

Quality Criteria ◽

Turnaround Time ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Report Generation ◽

Generation Sequencing

AbstractSummaryThis paper presents Parkour, a software package for sample processing and quality management of next generation sequencing data and samples. Starting with user requests, Parkour allows tracking and assessing samples based on predefined quality criteria through different stages of the sample preparation workflow. Ideally suited for academic core laboratories, the software aims to maximize efficiency and reduce turnaround time by intelligent sample grouping and a clear assignment of staff to work units. Tools for automated invoicing, interactive statistics on facility usage and simple report generation minimize administrative tasks. Provided as a web application, Parkour is a convenient tool for both deep sequencing service users and laboratory personal. A set of web APIs allow coordinated information sharing with local and remote bioinformaticians. The flexible structure allows workflow customization and simple addition of new features as well as the expansion to other domains.Availability and implementationThe code and documentation are available at https://github.com/maxplanck-ie/[email protected]

Download Full-text

Compression of Next-Generation Sequencing Data and of DNA Digital Files

Algorithms ◽

10.3390/a13060151 ◽

2020 ◽

Vol 13 (6) ◽

pp. 151

Author(s):

Bruno Carpentieri

Keyword(s):

Next Generation Sequencing ◽

Dna Sequences ◽

Network Traffic ◽

Large Scale ◽

Genomic Data ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

The increase in memory and in network traffic used and caused by new sequenced biological data has recently deeply grown. Genomic projects such as HapMap and 1000 Genomes have contributed to the very large rise of databases and network traffic related to genomic data and to the development of new efficient technologies. The large-scale sequencing of samples of DNA has brought new attention and produced new research, and thus the interest in the scientific community for genomic data has greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support the research in genomics. In this paper, we analyze different approaches for compressing digital files generated by Next-Generation Sequencing tools containing nucleotide sequences, and we discuss and evaluate the compression performance of generic compression algorithms by confronting them with a specific system designed by Jones et al. specifically for genomic file compression: Quip. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we only consider the relevant DNA data and experimentally evaluate its performances.

Download Full-text

PAPNC, a novel method to calculate nucleotide diversity from large scale next generation sequencing data

Journal of Virological Methods ◽

10.1016/j.jviromet.2014.03.008 ◽

2014 ◽

Vol 203 ◽

pp. 73-80 ◽

Cited By ~ 10

Author(s):

Wei Shao ◽

Mary F. Kearney ◽

Valerie F. Boltz ◽

Jonathan E. Spindler ◽

John W. Mellors ◽

...

Keyword(s):

Next Generation Sequencing ◽

Large Scale ◽

Nucleotide Diversity ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Novel Method ◽

Generation Sequencing

Download Full-text

Faculty Opinions recommendation of VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718272765.793499663 ◽

2014 ◽

Author(s):

Gary Bader ◽

Mohamed Helmy

Keyword(s):

Next Generation Sequencing ◽

Network Analysis ◽

Next Generation Sequencing Data ◽

Cancer Genes ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Download Full-text

Faculty Opinions recommendation of Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727775566.793536095 ◽

2017 ◽

Author(s):

Steve Pereira

Keyword(s):

Pancreatic Cancer ◽

Next Generation Sequencing ◽

Precision Medicine ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Assisted Analysis ◽

Generation Sequencing

Download Full-text

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ngs Data ◽

Generation Sequencing

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below we show that our method works well and clearly outperforms all the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate that all perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C ++ in a multi-threaded software and is freely available on Github https://github.com/KHanghoj/NGSremix.

Download Full-text