IGD: high-performance search for large-scale genomic interval datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa1062 ◽

2020 ◽

Author(s):

Jianglin Feng ◽

Nathan C Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Abstract:

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.

Availability: https://github.com/databio/IGD


REVA as a Well-curated Database for Human Expression-modulating Variants

10.1101/2021.02.24.432622 ◽

2021 ◽

Author(s):

Yu Wang ◽

Fang-Yuan Shi ◽

Yu Liang ◽

Ge Gao

Keyword(s):

Large Scale ◽

Regulatory Mechanism ◽

State Of The Art ◽

Scale Analysis ◽

Computational Tools ◽

Functional Annotations ◽

Link Type ◽

Large Scale Analysis ◽

Multiple State ◽

Limited Sensitivity

Abstract: More than 80% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database of over 11.8 million experimentally tested noncoding variants with expression-modulating potential. We provide 2424 functional annotations that can be used to pinpoint plausible regulatory mechanisms of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality, experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for the noncoding variants community. REVA is available at http://reva.gao-lab.org.


Augmented Interval List: a novel data structure for efficient genomic interval search

Bioinformatics ◽

10.1093/bioinformatics/btz407 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4907-4911 ◽

Cited By ~ 8

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Supplementary Information ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Scalable Methods

Abstract:

Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.

Results: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.

Availability and implementation: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org.

Supplementary information: Supplementary data are available at Bioinformatics online.
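The construction described in this abstract (sort by start coordinate, then augment with the running maximum end) can be illustrated with a minimal sketch. This is not the authors' implementation: it keeps a single sorted list and omits the decomposition into sublists, so its scan can perform more extra comparisons on deeply nested data. The class name `RunningMaxIntervalList` and the half-open `[start, end)` coordinate convention are assumptions made for the example.

```python
from bisect import bisect_left
from itertools import accumulate


class RunningMaxIntervalList:
    """Sorted interval list augmented with the running maximum end.

    Simplified illustration only: the decomposition into several
    sublists described in the abstract is NOT implemented here.
    Intervals are assumed to be half-open: [start, end).
    """

    def __init__(self, intervals):
        # Sort by start coordinate (ties broken by end).
        self.intervals = sorted(intervals)
        self.starts = [start for start, _ in self.intervals]
        # max_end[i] = largest end among intervals[0..i].
        self.max_end = list(accumulate((end for _, end in self.intervals), max))

    def query(self, q_start, q_end):
        """Return all stored intervals overlapping [q_start, q_end)."""
        hits = []
        # Binary search: last interval whose start is < q_end.
        i = bisect_left(self.starts, q_end) - 1
        # Scan left; stop once no interval at or before i can reach q_start.
        while i >= 0 and self.max_end[i] > q_start:
            start, end = self.intervals[i]
            if end > q_start:
                hits.append((start, end))
            i -= 1
        hits.reverse()  # report hits in sorted order
        return hits


if __name__ == "__main__":
    index = RunningMaxIntervalList([(6, 10), (1, 5), (12, 15), (3, 4), (8, 9)])
    print(index.query(4, 7))
```

The binary search accounts for the logarithmic term in the stated query time, and the running maximum lets the backward scan stop early. Intervals that are visited but do not overlap, such as a short interval nested inside a longer one, are the kind of extra comparisons the abstract's m term refers to.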


Analysis of Normal Levels of Urine and Plasma Free Glycosaminoglycans in Adults

10.1101/2021.05.21.445098 ◽

2021 ◽

Author(s):

Sinisa Bratulic ◽

Angelo Limeta ◽

Francesca Maccari ◽

Fabio Galeotti ◽

Nicola Volpi ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Blood Chemistry ◽

Sulfated Polysaccharides ◽

Reference Ranges ◽

Reactive Protein ◽

Non Invasive ◽

Critical Resource ◽

Biomarker Research

Plasma and urine glycosaminoglycans (GAGs), long linear sulfated polysaccharides, have been recognized as potential non-invasive biomarkers for several diseases. However, owing to the analytical complexity associated with the measurement of GAG concentration and disaccharide composition (or GAGome), a reference study of the normal healthy GAGome is currently missing. Here, we prospectively enrolled 270 healthy adults and analyzed their urine and plasma free GAGomes using a standardized ultra-high-performance liquid chromatography coupled with triple-quadrupole tandem mass spectrometry (UHPLC-MS/MS) method together with comprehensive demographic and blood chemistry biomarker data. Free GAGomes did not correlate with age or with any of the 25 blood chemistry biomarkers, except for a few plasma disaccharides that correlated with hemoglobin and C-reactive protein. However, free GAGome levels were generally higher in males. Partitioned by gender, we established reference ranges for all plasma and urine free chondroitin sulfate (CS), heparan sulfate (HS), and hyaluronic acid (HA) disaccharides. Our study is the first large-scale determination of reference ranges for normal plasma and urine free GAGomes and represents a critical resource for biomarker research.


A large-scale analysis of bioinformatics code on GitHub

10.1101/321919 ◽

2018 ◽

Author(s):

Pamela H Russell ◽

Rachel L Johnson ◽

Shreyas Ananthan ◽

Benjamin Harnke ◽

Nichole E Carlson

Keyword(s):

Large Scale ◽

Source Code ◽

The State ◽

Scale Analysis ◽

Development Activity ◽

High Profile ◽

Link Type ◽

Large Scale Analysis ◽

Bioinformatics Software ◽

Bioinformatics Community

Abstract: In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/githubbioinformatics. Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.

Author summary: We present, to our knowledge, the first large-scale analysis of bioinformatics source code. The purpose of our work is to contribute data to the growing conversation in the bioinformatics community around reproducibility, code quality, and software usability. We analyze a large collection of bioinformatics software projects, identifying relationships between code properties, development activity, developer communities, and software impact. Throughout the work, we compare the large set of projects to a small set of highly popular bioinformatics tools, highlighting features associated with high-profile projects. We make our data and code publicly available to enable others to build upon our analysis or generate new datasets. The significance of our work is to (1) contribute a large base of knowledge to the bioinformatics community about the state of their software, (2) contribute tools and resources enabling the community to conduct their own analyses, and (3) demonstrate that it is possible to systematically analyze large volumes of bioinformatics code. This work and the provided resources will enable a more effective, data-driven conversation around software practices in the bioinformatics community.


Augmented Interval List: a novel data structure for efficient genomic interval search

10.1101/593657 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C. Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Genomic Data Analysis ◽

Scalable Methods

Abstract:

Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.

Results: We present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4%–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.

Availability: An implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.


PISCES: a package for rapid quantitation and quality control of large scale mRNA-seq datasets

10.1101/2020.12.01.390575 ◽

2020 ◽

Author(s):

Matthew D. Shirley ◽

Viveksagar K. Radhakrishna ◽

Javad Golji ◽

Joshua M. Korn

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

High Performance ◽

Large Scale ◽

Differential Expression Analysis ◽

Link Type ◽

File Formats ◽

Comparison Groups ◽

Reproducible Analysis ◽

High Performance Computing Cluster

Abstract: PISCES eases processing of large mRNA-seq experiments by encouraging capture of metadata using simple textual file formats, processing samples on either a single machine or in parallel on a high-performance computing (HPC) cluster, validating sample identity using genetic fingerprinting, and summarizing all outputs in analysis-ready data matrices. PISCES consists of two modules: 1) compute cluster-aware analysis of individual mRNA-seq libraries including species detection, SNP genotyping, library geometry detection, and quantitation using salmon, and 2) gene-level transcript aggregation, transcriptional and read-based QC, TMM normalization and differential expression analysis of multiple libraries to produce data ready for visualization and further analysis. PISCES is implemented as a Python 3 package and is bundled with all necessary dependencies to enable reproducible analysis and easy deployment. JSON configuration files are used to build and identify transcriptome indices, and CSV files are used to supply sample metadata and to define comparison groups for differential expression analysis using DESeq2. PISCES builds on many existing open-source tools, and releases of PISCES are available on GitHub or the Python Package Index (PyPI).


High-Performance Computing for Large-Scale Analysis, Optimization, and Control

Journal of Aerospace Engineering ◽

10.1061/(asce)0893-1321(2000)13:1(1) ◽

2000 ◽

Vol 13 (1) ◽

pp. 1-10 ◽

Cited By ~ 35

Author(s):

Hojjat Adeli

Keyword(s):

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Scale Analysis ◽

Large Scale Analysis ◽

Optimization And Control ◽

And Control ◽

Performance Computing


Evaluating institutional open access performance: Methodology, challenges and assessment

10.1101/2020.03.19.998336 ◽

2020 ◽

Author(s):

Chun-Kai Huang ◽

Cameron Neylon ◽

Richard Hosking ◽

Lucy Montgomery ◽

Katie Wilson ◽

...

Keyword(s):

Open Access ◽

Latin American ◽

High Performance ◽

Large Scale ◽

Scale Analysis ◽

Reporting Standard ◽

Grass Roots ◽

Large Scale Analysis ◽

Substantial Progress ◽

Gold Open Access

Abstract: Open Access to research outputs is becoming rapidly more important to the global research community and society. Changes are driven by funder mandates, institutional policy, grass-roots advocacy and culture change. It has been challenging to provide a robust, transparent and updateable analysis of progress towards open access that can inform these interventions, particularly at the institutional level. Here we propose a minimum reporting standard and present a large-scale analysis of open access progress across 1,207 institutions world-wide that shows substantial progress being made. The analysis detects responses that coincide with policy and funding interventions. Among the striking results are the high performance of Latin American and African universities, particularly for gold open access, whereas overall open access levels in Europe and North America are driven by repository-mediated access. We present a top-100 of global universities with the world’s leading institutions achieving around 80% open access for 2017 publications.


The Structure and Properties of MoSi₂ Thin Film in MOS Process

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s1431927600001379 ◽

1980 ◽

Vol 38 ◽

pp. 326-327

Author(s):

C.K. Wu ◽

P. Chang ◽

N. Godinho

Keyword(s):

Thin Film ◽

Integrated Circuits ◽

High Performance ◽

Large Scale ◽

Process Development ◽

Structure And Properties ◽

Metal Silicides ◽

High Oxidation ◽

Important Approach ◽

High Oxidation Resistance

Recently, the use of refractory metal silicides as low resistivity, high temperature and high oxidation resistance gate materials in large scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study on the structure and properties of molybdenum silicide thin film and its applicability to high performance LSI fabrication.
