IGD: high-performance search for large-scale genomic interval datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa1062 ◽

2020 ◽

Author(s):

Jianglin Feng ◽

Nathan C Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one-hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.

Availability: https://github.com/databio/IGD
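The abstract names linear binning but does not describe it. The general idea of fixed-width binning can be sketched as follows. This is a minimal illustration, not the IGD implementation: the bin width, the function names `build_bins` and `search`, and the in-memory dictionary layout are all assumptions made for the example. Intervals are treated as half-open `[start, end)`.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # assumed bin width in base pairs, chosen only for illustration


def build_bins(intervals, bin_size=BIN_SIZE):
    """Map (chrom, bin number) -> indices of the intervals touching that bin."""
    bins = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(intervals):
        # An interval spanning several bins is registered in each of them.
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[(chrom, b)].append(idx)
    return bins


def search(bins, intervals, chrom, q_start, q_end, bin_size=BIN_SIZE):
    """Return sorted indices of intervals overlapping [q_start, q_end) on chrom."""
    hits = set()  # a set removes duplicates when an interval sits in several bins
    for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
        for idx in bins.get((chrom, b), ()):
            _, s, e = intervals[idx]
            if s < q_end and e > q_start:
                hits.add(idx)
    return sorted(hits)


intervals = [("chr1", 100, 200), ("chr1", 16_000, 17_000), ("chr2", 100, 200)]
bins = build_bins(intervals)
print(search(bins, intervals, "chr1", 150, 16_500))
```

Because a query only inspects the bins it touches, the work per query depends on local interval density rather than on the total size of the database.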


IGD: high-performance search for large-scale genomic interval datasets

10.1101/2020.06.08.139758 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Nathan C. Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Link Type ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one-hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.

Availability: https://github.com/databio/IGD


Augmented Interval List: a novel data structure for efficient genomic interval search

Bioinformatics ◽

10.1093/bioinformatics/btz407 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4907-4911 ◽

Cited By ~ 8

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Supplementary Information ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Scalable Methods

Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.

Results: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂ N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.

Availability and implementation: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org.

Supplementary information: Supplementary data are available at Bioinformatics online.
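The construction and query steps described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: it keeps a single sorted list and omits the decomposition into sublists, intervals are treated as half-open `[start, end)`, and the names `build` and `query` are invented for the example.

```python
from bisect import bisect_left


def build(intervals):
    """Sort by start and augment with the running maximum of interval ends."""
    ivs = sorted(intervals)
    starts = [s for s, _ in ivs]
    max_end = []
    running = float("-inf")
    for _, e in ivs:
        running = max(running, e)
        max_end.append(running)
    return ivs, starts, max_end


def query(index, q_start, q_end):
    """Enumerate intervals overlapping [q_start, q_end)."""
    ivs, starts, max_end = index
    # Binary search: the last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Scan left; once the running maximum end is at or below q_start,
    # no interval further to the left can overlap the query.
    while i >= 0 and max_end[i] > q_start:
        s, e = ivs[i]
        if e > q_start:
            hits.append((s, e))
        i -= 1
    return hits[::-1]


index = build([(1, 5), (3, 8), (10, 12), (2, 20)])
print(query(index, 9, 11))
```

In this sketch the long interval `(2, 20)` keeps the running maximum high, so the scan also has to examine `(3, 8)` even though it does not overlap the query. Comparisons of this kind are the extra term m in the stated query time, and the abstract's decomposition into approximately flattened sublists is the step this sketch leaves out.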


Analysis of Normal Levels of Urine and Plasma Free Glycosaminoglycans in Adults

10.1101/2021.05.21.445098 ◽

2021 ◽

Author(s):

Sinisa Bratulic ◽

Angelo Limeta ◽

Francesca Maccari ◽

Fabio Galeotti ◽

Nicola Volpi ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Blood Chemistry ◽

Sulfated Polysaccharides ◽

Reference Ranges ◽

Reactive Protein ◽

Non Invasive ◽

Critical Resource ◽

Biomarker Research

Plasma and urine glycosaminoglycans (GAGs), long linear sulfated polysaccharides, have been recognized as potential non-invasive biomarkers for several diseases. However, owing to the analytical complexity associated with the measurement of GAG concentration and disaccharide composition (or GAGome), a reference study of the normal healthy GAGome is currently missing. Here, we prospectively enrolled 270 healthy adults and analyzed their urine and plasma free GAGomes using a standardized ultra-high-performance liquid chromatography coupled with triple-quadrupole tandem mass spectrometry (UHPLC-MS/MS) method together with comprehensive demographic and blood chemistry biomarker data. Free GAGomes correlated neither with age nor with any of the 25 blood chemistry biomarkers, except for a few plasma disaccharides that correlated with hemoglobin and C-reactive protein. However, free GAGome levels were generally higher in males. We established reference ranges, partitioned by gender, for all plasma and urine free chondroitin sulfate (CS), heparan sulfate (HS), and hyaluronic acid (HA) disaccharides. Our study is the first large-scale determination of reference ranges for normal plasma and urine free GAGomes and represents a critical resource for biomarker research.


Augmented Interval List: a novel data structure for efficient genomic interval search

10.1101/593657 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C. Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Genomic Data Analysis ◽

Scalable Methods

Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.

Results: We present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂ N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.

Availability: An implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.


High-Performance Computing for Large-Scale Analysis, Optimization, and Control

Journal of Aerospace Engineering ◽

10.1061/(asce)0893-1321(2000)13:1(1) ◽

2000 ◽

Vol 13 (1) ◽

pp. 1-10 ◽

Cited By ~ 35

Author(s):

Hojjat Adeli

Keyword(s):

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Scale Analysis ◽

Large Scale Analysis ◽

Optimization And Control ◽

And Control ◽

Performance Computing


Evaluating institutional open access performance: Methodology, challenges and assessment

10.1101/2020.03.19.998336 ◽

2020 ◽

Author(s):

Chun-Kai Huang ◽

Cameron Neylon ◽

Richard Hosking ◽

Lucy Montgomery ◽

Katie Wilson ◽

...

Keyword(s):

Open Access ◽

Latin American ◽

High Performance ◽

Large Scale ◽

Scale Analysis ◽

Reporting Standard ◽

Grass Roots ◽

Large Scale Analysis ◽

Substantial Progress ◽

Gold Open Access

Open Access to research outputs is becoming rapidly more important to the global research community and society. Changes are driven by funder mandates, institutional policy, grass-roots advocacy and culture change. It has been challenging to provide a robust, transparent and updateable analysis of progress towards open access that can inform these interventions, particularly at the institutional level. Here we propose a minimum reporting standard and present a large-scale analysis of open access progress across 1,207 institutions worldwide that shows substantial progress being made. The analysis detects responses that coincide with policy and funding interventions. Among the striking results are the high performance of Latin American and African universities, particularly for gold open access, whereas overall open access levels in Europe and North America are driven by repository-mediated access. We present a top-100 of global universities with the world’s leading institutions achieving around 80% open access for 2017 publications.


The Structure and Properties of MoSi2 Thin Film in MOS Process

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s1431927600001379 ◽

1980 ◽

Vol 38 ◽

pp. 326-327

Author(s):

C.K. Wu ◽

P. Chang ◽

N. Godinho

Keyword(s):

Thin Film ◽

Integrated Circuits ◽

High Performance ◽

Large Scale ◽

Process Development ◽

Structure And Properties ◽

Metal Silicides ◽

High Oxidation ◽

Important Approach ◽

High Oxidation Resistance

Recently, the use of refractory metal silicides as low resistivity, high temperature and high oxidation resistance gate materials in large scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study on the structure and properties of molybdenum silicide thin film and its applicability to high performance LSI fabrication.


RECOMMENDATIONS FOR THE CHOICE OF MILKING INSTALLATIONS IN LOOSE HOUSING SYSTEMS OF COWS

Molochnoe i miasnoe skotovodstvo ◽

10.33943/mms.2020.12.24.001 ◽

2020 ◽

Author(s):

В.В. ГОРДЕЕВ ◽

В.Е. ХАЗАНОВ

Keyword(s):

Dairy Cows ◽

High Performance ◽

Large Scale ◽

Dairy Farms ◽

Economic Indicators ◽

Technical Level ◽

Housing Systems ◽

Working Shift ◽

Technical And Economic Indicators

Choosing the type and size of a milking installation requires taking into account the maximum planned number of dairy cows, the size of the technological group, the number of milkings per day, the duration of one milking, and the length of the operators' working shift. An analysis of the technical and economic indicators of the currently most common types of milking installations of the same technical level shows that the Carousel installation has the best specific indicators, while the Herringbone installation requires higher labour and capital inputs. The Parallel installation falls in between. Based on throughput and the required number of operators, Herringbone is recommended for farms with up to 600 dairy cows, Parallel for farms with no more than 1200 dairy cows, and Carousel for farms with more than 1200 dairy cows. Carousel is the most practical, high-performance and easily automated system for parlour milking, and therefore the most promising one, especially on large-scale dairy farms.
