REVA as a Well-curated Database for Human Expression-modulating Variants

Mapping Intimacies ◽

10.1101/2021.02.24.432622 ◽

2021 ◽

Author(s):

Yu Wang ◽

Fang-Yuan Shi ◽

Yu Liang ◽

Ge Gao

Keyword(s):

Large Scale ◽

Regulatory Mechanism ◽

State Of The Art ◽

Scale Analysis ◽

Computational Tools ◽

Functional Annotations ◽

Link Type ◽

Large Scale Analysis ◽

Multiple State ◽

Limited Sensitivity

Abstract More than 80% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database of over 11.8 million experimentally tested noncoding variants with expression-modulating potential. We provided 2424 functional annotations that can be used to pinpoint plausible regulatory mechanisms of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality experimentally tested expression-modulating variants with extensive functional annotations, which will be useful to the noncoding variant community. REVA is available at http://reva.gao-lab.org.
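For readers unfamiliar with the metric, sensitivity is the fraction of experimentally confirmed positive variants that a tool also flags, TP / (TP + FN). The following is a minimal sketch of that calculation only; the function name, variant identifiers, and labels are hypothetical and are not taken from REVA or from any of the benchmarked tools.

```python
def sensitivity(truth, predicted):
    """Fraction of experimentally positive variants recovered: TP / (TP + FN).

    truth     -- dict mapping variant ID -> True if experimentally positive
    predicted -- set of variant IDs that the tool calls positive
    """
    positives = [v for v, is_pos in truth.items() if is_pos]
    if not positives:
        raise ValueError("no experimentally positive variants to score against")
    true_pos = sum(1 for v in positives if v in predicted)
    return true_pos / len(positives)


# Hypothetical toy labels (illustrative only, not REVA data).
truth = {"var1": True, "var2": True, "var3": True, "var4": False, "var5": True}
predicted = {"var1", "var4"}

print(sensitivity(truth, predicted))
```

A tool can score well on overall accuracy and still miss most true positives, which is why sensitivity is reported separately.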


Advances in decomposing complex metabolite mixtures using substructure- and network-based computational metabolomics approaches

Natural Product Reports ◽

10.1039/d1np00023c ◽

2021 ◽

Author(s):

Mehdi A. Beniddir ◽

Kyo Bin Kang ◽

Grégory Genta-Jouve ◽

Florian Huber ◽

Simon Rogers ◽

...

Keyword(s):

Natural Product ◽

Large Scale ◽

Scale Analysis ◽

Computational Tools ◽

Natural Product Discovery ◽

Large Scale Analysis ◽

Metabolite Annotation

This review highlights the key computational tools and emerging strategies for metabolite annotation, and discusses how these advances will enable integrated large-scale analysis to accelerate natural product discovery.


Virtual Grid Engine: a simulated grid engine environment for large-scale supercomputers

BMC Bioinformatics ◽

10.1186/s12859-019-3085-x ◽

2019 ◽

Vol 20 (S16) ◽

Author(s):

Satoshi Ito ◽

Masaaki Yadome ◽

Tatsuo Nishiki ◽

Shigeru Ishiduki ◽

Hikaru Inoue ◽

...

Keyword(s):

Large Scale ◽

State Of The Art ◽

Massively Parallel ◽

Scale Analysis ◽

Large Scale Analysis ◽

Scientific Results ◽

Time Required ◽

Parallel Supercomputers ◽

Virtual Grid ◽

Processing Service

Abstract Background Supercomputers have become indispensable infrastructure in science and industry. In particular, most state-of-the-art scientific results rely on massively parallel supercomputers ranked in the TOP500. However, their use is still limited in bioinformatics because they do not provide the asynchronous parallel processing service of Grid Engine. To encourage the use of massively parallel supercomputers in bioinformatics, we developed middleware called Virtual Grid Engine (VGE), which enables software pipelines to automatically perform their tasks as MPI programs. Result We conducted basic tests to measure the time VGE requires to assign jobs to workers. The results showed that the overhead of the employed algorithm was 246 microseconds and that our software can manage thousands of jobs smoothly on the K computer. We also ran a practical test from the bioinformatics field, consisting of two tasks: splitting the input FASTQ data and aligning it with BWA. This calculation used 25,055 nodes (2,000,440 cores) and finished in three hours. Conclusion We identified four important requirements for this kind of software: a non-privileged server program, multiple job handling, dependency control, and usability. We carefully designed for and checked all of these requirements. The software fulfilled all of them and achieved good performance in a large-scale analysis.
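The "multiple job handling" and "dependency control" requirements can be illustrated with a small scheduler that starts each job only after the jobs it depends on have finished. This is a sketch under stated assumptions, not VGE's implementation: a thread pool from the Python standard library stands in for MPI workers, and the function `run_with_dependencies` and the job names are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_dependencies(jobs, deps, max_workers=4):
    """Run each job only after all of its dependencies have finished.

    jobs -- dict mapping job name -> zero-argument callable
    deps -- dict mapping job name -> set of job names it depends on
    Returns the job names in completion order.
    """
    remaining = {name: set(deps.get(name, ())) for name in jobs}
    finished = []
    running = {}  # future -> job name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every waiting job whose dependencies are all done.
            done_names = set(finished)
            ready = [n for n, d in remaining.items() if d <= done_names]
            for name in ready:
                running[pool.submit(jobs[name])] = name
                del remaining[name]
            if not running:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Block until at least one running job completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                fut.result()  # re-raise any exception from the worker
                finished.append(running.pop(fut))
    return finished


# Hypothetical pipeline shaped like the split-then-align test described above.
log = []
jobs = {
    "split": lambda: log.append("split"),
    "align_1": lambda: log.append("align_1"),
    "align_2": lambda: log.append("align_2"),
}
deps = {"align_1": {"split"}, "align_2": {"split"}}
order = run_with_dependencies(jobs, deps)
```

Here the two alignment jobs may run concurrently and finish in either order, but neither starts before the split job has completed.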


A large-scale analysis of bioinformatics code on GitHub

10.1101/321919 ◽

2018 ◽

Author(s):

Pamela H Russell ◽

Rachel L Johnson ◽

Shreyas Ananthan ◽

Benjamin Harnke ◽

Nichole E Carlson

Keyword(s):

Large Scale ◽

Source Code ◽

The State ◽

Scale Analysis ◽

Development Activity ◽

High Profile ◽

Link Type ◽

Large Scale Analysis ◽

Bioinformatics Software ◽

Bioinformatics Community

Abstract In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/githubbioinformatics. Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.

Author summary We present, to our knowledge, the first large-scale analysis of bioinformatics source code. The purpose of our work is to contribute data to the growing conversation in the bioinformatics community around reproducibility, code quality, and software usability. We analyze a large collection of bioinformatics software projects, identifying relationships between code properties, development activity, developer communities, and software impact. Throughout the work, we compare the large set of projects to a small set of highly popular bioinformatics tools, highlighting features associated with high-profile projects. We make our data and code publicly available to enable others to build upon our analysis or generate new datasets. The significance of our work is to (1) contribute a large base of knowledge to the bioinformatics community about the state of their software, (2) contribute tools and resources enabling the community to conduct their own analyses, and (3) demonstrate that it is possible to systematically analyze large volumes of bioinformatics code. This work and the provided resources will enable a more effective, data-driven conversation around software practices in the bioinformatics community.


CellProfiler 4: Improvements in Speed, Utility and Usability

10.1101/2021.06.30.450416 ◽

2021 ◽

Author(s):

David R Stirling ◽

Madison J Swain-Bowden ◽

Alice M Lucas ◽

Anne E. Carpenter ◽

Beth A Cimini ◽

...

Keyword(s):

Image Analysis ◽

User Interface ◽

Large Scale ◽

Source Image ◽

Scale Analysis ◽

Analysis Program ◽

Computational Tools ◽

Large Scale Analysis ◽

Free Open Source ◽

Microscopy Images

CellProfiler is a free, open source image analysis program which enables researchers to generate modular pipelines with which to process microscopy images into interpretable measurements. Here we describe CellProfiler 4, a new version of this software which has been ported to the Python 3 language. Based on user feedback, we have made several user interface refinements to improve the usability of the software. We introduced new modules to expand the capabilities of the software. We also evaluated performance and made targeted optimisations to reduce the time and cost associated with running common large-scale analysis pipelines. This release will ensure that researchers will have continued access to CellProfiler's powerful computational tools in the coming years.


CellProfiler 4: improvements in speed, utility and usability

BMC Bioinformatics ◽

10.1186/s12859-021-04344-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

David R. Stirling ◽

Madison J. Swain-Bowden ◽

Alice M. Lucas ◽

Anne E. Carpenter ◽

Beth A. Cimini ◽

...

Keyword(s):

Large Scale ◽

Biological Information ◽

Source Image ◽

Scale Analysis ◽

Imaging Data ◽

Computational Tools ◽

Large Scale Analysis ◽

Improved Performance ◽

Free Open Source ◽

Microscopy Images

Abstract Background Imaging data contains a substantial amount of information which can be difficult to evaluate by eye. With the expansion of high throughput microscopy methodologies producing increasingly large datasets, automated and objective analysis of the resulting images is essential to effectively extract biological information from this data. CellProfiler is a free, open source image analysis program which enables researchers to generate modular pipelines with which to process microscopy images into interpretable measurements. Results Herein we describe CellProfiler 4, a new version of this software with expanded functionality. Based on user feedback, we have made several user interface refinements to improve the usability of the software. We introduced new modules to expand the capabilities of the software. We also evaluated performance and made targeted optimizations to reduce the time and cost associated with running common large-scale analysis pipelines. Conclusions CellProfiler 4 provides significantly improved performance in complex workflows compared to previous versions. This release will ensure that researchers will have continued access to CellProfiler’s powerful computational tools in the coming years.
