GTX.Digest.VCF: an online NGS data interpretation system based on intelligent gene ranking and large-scale text mining

Abstract Background An important task in the interpretation of sequencing data is to highlight pathogenic genes (or detrimental variants) in the field of Mendelian diseases. It is still challenging despite the recent rapid development of genomics and bioinformatics. A typical interpretation workflow includes annotation, filtration, manual inspection and literature review. Those steps are time-consuming and error-prone in the absence of systematic support. Therefore, we developed GTX.Digest.VCF, an online DNA sequencing interpretation system, which prioritizes genes and variants for novel disease-gene relation discovery and integrates text mining results to provide literature evidence for the discovery. Its phenotype-driven ranking and biological data mining approach significantly speed up the whole interpretation process. Results The GTX.Digest.VCF system is freely available as a web portal at http://vcf.gtxlab.com for academic research. Evaluation on the DDD project dataset demonstrates an accuracy of 77% (235 out of 305 cases) for top-50 genes and an accuracy of 41.6% (127 out of 305 cases) for top-5 genes. Conclusions GTX.Digest.VCF provides an intelligent web portal for genomics data interpretation via the integration of bioinformatics tools, distributed parallel computing, biomedical text mining. It can facilitate the application of genomic analytics in clinical research and practices.

Download Full-text

Sketch distance-based clustering of chromosomes for large genome database compression

BMC Genomics ◽

10.1186/s12864-019-6310-0 ◽

2019 ◽

Vol 20 (S10) ◽

Author(s):

Tao Tang ◽

Yuansheng Liu ◽

Buzhong Zhang ◽

Benyue Su ◽

Jinyan Li

Keyword(s):

Compression Ratio ◽

Large Scale ◽

Rapid Development ◽

Rice Genome ◽

Sequencing Data ◽

Genome Database ◽

Compression Algorithms ◽

Large Genome ◽

Reference Selection ◽

Reference Genomes

Abstract Background The rapid development of Next-Generation Sequencing technologies enables sequencing genomes with low cost. The dramatically increasing amount of sequencing data raised crucial needs for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance on compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers a series of issues such as difficult reference selection and remarkable performance variation. Results We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance, and uses the centroid sequence of each cluster as the reference genome for an outstanding reference-based compression of the remaining genomes in each cluster. A final reference is then selected from these reference genomes for the compression of the remaining reference genomes. Our method significantly improved the performance of the-state-of-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain can reach up to 20-30% in most cases for the datasets from NCBI, the 1000 Human Genomes Project and the 3000 Rice Genomes Project. The best improvement boosts the performance from 351.74 compression folds to 443.51 folds. Conclusions The compression ratio of reference-based compression on large scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress large genome database.

Download Full-text

Biomarker Identification Using Text Mining

Computational and Mathematical Methods in Medicine ◽

10.1155/2012/135780 ◽

2012 ◽

Vol 2012 ◽

pp. 1-4 ◽

Cited By ~ 3

Author(s):

Hui Li ◽

Chunmei Liu

Keyword(s):

Text Mining ◽

Large Scale ◽

Finite State Machine ◽

State Machine ◽

Biological Data ◽

Molecular Biomarkers ◽

Biomarker Identification ◽

Pubmed Database ◽

Finite State

Identifying molecular biomarkers has become one of the important tasks for scientists to assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we proposed a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we used a finite state machine to identify the biomarkers. Our method of text mining provides a highly reliable approach to discover the biomarkers in the PubMed database.

Download Full-text

Streamlining Data-Intensive Biology With Workflow Systems

10.1101/2020.06.30.178673 ◽

2020 ◽

Cited By ~ 1

Author(s):

Taylor Reiter ◽

Phillip T. Brooks ◽

Luiz Irber ◽

Shannon E.K. Joslin ◽

Charles M. Reid ◽

...

Keyword(s):

Data Analysis ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Open Science ◽

Biological Data ◽

Data Generation ◽

Biological Sequence ◽

Sequencing Data ◽

Workflow Systems

AbstractAs the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.Author SummaryWe present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.

Download Full-text

Streamlining data-intensive biology with workflow systems

GigaScience ◽

10.1093/gigascience/giaa140 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Taylor Reiter ◽

Phillip T Brooks† ◽

Luiz Irber† ◽

Shannon E K Joslin† ◽

Charles M Reid† ◽

...

Keyword(s):

Data Analysis ◽

Large Scale ◽

High Throughput Sequencing ◽

Biological Data ◽

Data Generation ◽

Sequencing Data ◽

Workflow Systems ◽

Data Intensive ◽

High Throughput Sequencing Data ◽

Project Data

Download Full-text

Compression of Next-Generation Sequencing Data and of DNA Digital Files

Algorithms ◽

10.3390/a13060151 ◽

2020 ◽

Vol 13 (6) ◽

pp. 151

Author(s):

Bruno Carpentieri

Keyword(s):

Next Generation Sequencing ◽

Dna Sequences ◽

Network Traffic ◽

Large Scale ◽

Genomic Data ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

The increase in memory and in network traffic used and caused by new sequenced biological data has recently deeply grown. Genomic projects such as HapMap and 1000 Genomes have contributed to the very large rise of databases and network traffic related to genomic data and to the development of new efficient technologies. The large-scale sequencing of samples of DNA has brought new attention and produced new research, and thus the interest in the scientific community for genomic data has greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support the research in genomics. In this paper, we analyze different approaches for compressing digital files generated by Next-Generation Sequencing tools containing nucleotide sequences, and we discuss and evaluate the compression performance of generic compression algorithms by confronting them with a specific system designed by Jones et al. specifically for genomic file compression: Quip. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we only consider the relevant DNA data and experimentally evaluate its performances.

Download Full-text

CNSA: a data repository for archiving omics data

10.1101/2020.04.07.030833 ◽

2020 ◽

Author(s):

Xueqin Guo ◽

Fengzhen Chen ◽

Fei Gao ◽

Ling Li ◽

Ke Liu ◽

...

Keyword(s):

High Throughput Sequencing ◽

Analytical Data ◽

Academic Research ◽

Biological Data ◽

Data Repository ◽

Free Access ◽

Omics Data ◽

Sequencing Data ◽

Data Standards ◽

Sample Information

AbstractWith the application and development of high-throughput sequencing technology in life and health sciences, massive multi-dimensional biological data brings the problem of efficient management and utilization. Database development and biocuration are the prerequisites for the reuse of these big data. Here, relying on China National GeneBank (CNGB), we present CNGB Sequence Archive (CNSA) for archiving omics data, including raw sequencing data and its analytical data and related metadata which are organized into six objects, namely Project, Sample, Experiment, Run, Assembly, and Variation at present. Moreover, CNSA has created the correlation model of living samples, sample information, and analytical data on some projects, so that all data can be traced throughout the life cycle from the living sample to the sample information to the analytical data. Complying with the data standards commonly used in the life sciences, CNSA is committed to building a comprehensive and curated data repository for the storage, management and sharing of omics data, improving the data standards, and providing free access to open data resources for worldwide scientific communities to support academic research and the bio-industry. Database URL: https://db.cngb.org/cnsa/

Download Full-text

DDBJ update: streamlining submission and access of human data

Nucleic Acids Research ◽

10.1093/nar/gkaa982 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D71-D75

Author(s):

Asami Fukuda ◽

Yuichi Kodama ◽

Jun Mashima ◽

Takatomo Fujisawa ◽

Osamu Ogasawara

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Group Structure ◽

Biological Data ◽

Sequencing Data ◽

The Public ◽

Human Data ◽

Phenotype Data ◽

Sequencing Platforms ◽

Authorized Access

Abstract The Bioinformation and DDBJ Center (DDBJ Center, https://www.ddbj.nig.ac.jp) provides databases that capture, preserve and disseminate diverse biological data to support research in the life sciences. This center collects nucleotide sequences with annotations, raw sequencing data, and alignment information from high-throughput sequencing platforms, and study and sample information, in collaboration with the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI). This collaborative framework is known as the International Nucleotide Sequence Database Collaboration (INSDC). In collaboration with the National Bioscience Database Center (NBDC), the DDBJ Center also provides a controlled-access database, the Japanese Genotype–phenotype Archive (JGA), which archives and distributes human genotype and phenotype data, requiring authorized access. The NBDC formulates guidelines and policies for sharing human data and reviews data submission and use applications. To streamline all of the processes at NBDC and JGA, we have integrated the two systems by introducing a unified login platform with a group structure in September 2020. In addition to the public databases, the DDBJ Center provides a computer resource, the NIG supercomputer, for domestic researchers to analyze large-scale genomic data. This report describes updates to the services of the DDBJ Center, focusing on the NBDC and JGA system enhancements.

Download Full-text

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Download Full-text

A Comparison of Perception on Creativity between Academic Research and Social Big Data Using Text-Mining Techniques

Korean Society for Creativity Education ◽

10.36358/jce.2020.20.3.47 ◽

2020 ◽

Vol 20 (3) ◽

pp. 47-67

Author(s):

Eunbyul Cho ◽

Jiyeon Min ◽

Soowon Park

Keyword(s):

Big Data ◽

Text Mining ◽

Academic Research ◽

Social Big Data

Download Full-text

Study of Biochemical and Clinical Markers in Steatohepatitis Related to Obesity

Revista de Chimie ◽

10.37358/rc.18.6.6355 ◽

2018 ◽

Vol 69 (6) ◽

pp. 1501-1505

Author(s):

Roxana Maria Livadariu ◽

Radu Danila ◽

Lidia Ionescu ◽

Delia Ciobanu ◽

Daniel Timofte

Keyword(s):

Liver Disease ◽

Liver Biopsy ◽

Large Scale ◽

Biological Data ◽

Obese Patients ◽

Clinical Markers ◽

Mean Values ◽

Nonalcoholic Fatty Liver ◽

Obstructive Sleep ◽

Increased Risk

Nonalcoholic fatty liver disease (NAFLD) is highly associated to obesity and comprises several liver diseases, from simple steatosis to steatohepatitis (NASH) with increased risk of developing progressive liver fibrosis, cirrhosis and hepatocellular carcinoma. Liver biopsy is the gold standard in diagnosing the disease, but it cannot be used in a large scale. The aim of the study was the assessment of some non-invasive clinical and biological markers in relation to the progressive forms of NAFLD. We performed a prospective study on 64 obese patients successively hospitalised for bariatric surgery in our Surgical Unit. Patients with history of alcohol consumption, chronic hepatitis B or C, other chronic liver disease or patients undergoing hepatotoxic drug use were excluded. All patients underwent liver biopsy during sleeve gastrectomy. NAFLD was present in 100% of the patients: hepatic steatosis (38%), NASH with the two forms: with fibrosis (31%) and without fibrosis (20%), cumulating 51%; 7 patients had NASH with vanished steatosis. NASH with fibrosis statistically correlated with metabolic syndrome (p = 0.036), DM II (p = 0.01) and obstructive sleep apnea (p = 0.02). Waist circumference was significantly higher in the steatohepatitis groups (both with and without fibrosis), each 10 cm increase increasing the risk of steatohepatitis (p = 0.007). The mean values of serum fibrinogen and CRP were significantly higher in patients having the progressive forms of NAFLD. Simple clinical and biological data available to the practitioner in medicine can be used to identify obese patients at high risk of NASH, aiming to direct them to specialized medical centers.

Download Full-text