Knowledge Base Commons (KBCommons) v1.1: a universal framework for multi-omics data integration and biological discoveries

BMC Genomics ◽  
2019 ◽  
Vol 20 (S11) ◽  
Author(s):  
Shuai Zeng ◽  
Zhen Lyu ◽  
Siva Ratna Kumari Narisetti ◽  
Dong Xu ◽  
Trupti Joshi

Abstract
Background: Knowledge Base Commons (KBCommons) v1.1 is a universal, all-inclusive web-based framework providing generic functionalities for storing, sharing, analyzing, exploring, integrating and visualizing multiple organisms’ genomics and integrative omics data. KBCommons is designed and developed to integrate diverse multi-level omics data and to support biological discoveries for all species via a common platform.
Methods: KBCommons has four modules: data storage, data processing, data access, and a web interface for data management and retrieval. It provides a comprehensive framework for creating new plant-specific, animal-specific, virus-specific, bacteria-specific or human-disease-specific knowledge bases (KBs), for adding new genome versions and additional multi-omics data to existing KBs, and for exploring existing datasets within current KBs.
Results: KBCommons offers an array of tools for data visualization and analytics, including multiple gene/metabolite search, gene family/Pfam/Panther function annotation search, miRNA/metabolite/trait/SNP search, differential gene expression analysis, and bulk data download. It contains a highly reliable data privilege management system that makes it easy to release users’ data publicly and to share private or pre-publication data safely and securely with members of their collaborative groups. It allows users to conduct data analysis using our in-house workflow functionalities, which are linked to XSEDE high performance computing resources. Using KBCommons’ intuitive web interface, users can easily retrieve genomic data, multi-omics data and workflow analysis results according to their requirements and interests.
Conclusions: KBCommons addresses the need of many diverse research communities for a comprehensive multi-level omics web resource for data retrieval, sharing, analysis and visualization. KBCommons can be publicly accessed for all organisms at http://kbcommons.org/.
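
As a concrete illustration of the kind of differential gene expression comparison such tools perform, the sketch below computes log2 fold changes and Welch's t-test p-values between two sample groups. It is a minimal, generic example: the gene names and count matrix are invented, and it is not KBCommons' internal implementation.

```python
# Minimal sketch of a differential-expression comparison between two
# sample groups. Gene IDs and expression values are invented.
import numpy as np
from scipy import stats

genes = ["GeneA", "GeneB", "GeneC"]             # hypothetical gene IDs
control = np.array([[120.0, 98.0, 101.0],       # rows: genes, cols: replicates
                    [45.0, 52.0, 49.0],
                    [300.0, 280.0, 310.0]])
treated = np.array([[240.0, 230.0, 251.0],
                    [44.0, 50.0, 47.0],
                    [150.0, 160.0, 149.0]])

for i, gene in enumerate(genes):
    # log2 fold change of mean expression (pseudo-count avoids log(0))
    lfc = np.log2((treated[i].mean() + 1.0) / (control[i].mean() + 1.0))
    # Welch's t-test: does not assume equal variance between groups
    _, p = stats.ttest_ind(treated[i], control[i], equal_var=False)
    print(f"{gene}: log2FC={lfc:+.2f}, p={p:.3f}")
```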

Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

The latest advances in network and distributed system technologies now allow the integration of a vast variety of services with almost unlimited processing power, using large amounts of data. Sharing of resources is often viewed as the key goal of distributed systems, and in this context the sharing of stored data appears as the most important aspect of distributed resource sharing. Scientific applications are the first to take advantage of such environments, as the requirements of current and future high performance computing experiments are pressing in terms of the ever higher volumes of data to be stored and managed. While these new environments offer huge opportunities for large-scale distributed data storage and management, they also raise important technical challenges that need to be addressed. The ability to support persistent storage of data on behalf of users, the consistent distribution of up-to-date data, the reliable replication of fast-changing datasets and the efficient management of large data transfers are just some of these challenges. In this chapter we discuss whether the existing distributed computing infrastructure is adequate for supporting the required data storage and management functionalities. We highlight the issues raised by storing data over large distributed environments and discuss recent research efforts dealing with the challenges of data retrieval, replication and fast data transfers. The interaction of data management with other data-sensitive emerging technologies, such as workflow management, is also addressed.


2016 ◽  
Author(s):  
Mathurin Dorel ◽  
Emmanuel Barillot ◽  
Andrei Zinovyev ◽  
Inna Kuperstein

Abstract
Human diseases such as cancer are routinely characterized by high-throughput molecular technologies, and multi-level omics data are accumulating in public databases at an increasing rate. Retrieval and visualization of these data in the context of molecular network maps can provide insights into the pattern of molecular functions encompassed by an omics profile. To make this task easy, we developed NaviCom, a Python package and web platform for visualizing multi-level omics data on top of biological network maps. NaviCom bridges the gap between cBioPortal, the most widely used resource of large-scale cancer omics data, and NaviCell, a data visualization web service that contains several molecular network map collections. NaviCom proposes several standardized modes of data display on top of molecular network maps, allowing users to address specific biological questions. We illustrate how users can easily create interactive network-based cancer molecular portraits via the NaviCom web interface using the maps of the Atlas of Cancer Signaling Network (ACSN) and other maps. Analysis of these molecular portraits can help in formulating scientific hypotheses on the molecular mechanisms deregulated in the studied disease.
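
The sketch below illustrates the general idea of projecting an omics profile onto a molecular network map, using networkx for illustration only; it is not NaviCom's actual API, and the toy network and expression values are invented.

```python
# Conceptual sketch of overlaying an omics profile on a molecular network,
# in the spirit of NaviCom's display on NaviCell maps. All data invented.
import networkx as nx

# A toy signaling network: edges are directed molecular interactions.
network = nx.DiGraph([("EGFR", "RAS"), ("RAS", "RAF"), ("RAF", "ERK")])

# A toy expression profile (e.g. tumor-vs-normal log2 fold changes).
expression = {"EGFR": 2.1, "RAS": 0.3, "RAF": -0.2, "ERK": 1.5}

# Attach the profile to the map and report visibly deregulated nodes.
nx.set_node_attributes(network, expression, name="log2fc")
for node, lfc in network.nodes(data="log2fc"):
    status = "up" if lfc > 1 else ("down" if lfc < -1 else "unchanged")
    print(f"{node}: log2FC={lfc:+.1f} ({status})")
```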


2018 ◽  
Author(s):  
Thomas G. Close ◽  
Phillip G. D. Ward ◽  
Francesco Sforazzini ◽  
Wojtek Goscinski ◽  
Zhaolin Chen ◽  
...  

Abstract
Mastering the “arcana of neuroimaging analysis”, the obscure knowledge required to apply an appropriate combination of software tools and parameters to analyse a given neuroimaging dataset, is a time-consuming process. It is therefore not typically feasible to invest the additional effort required to generalise workflow implementations to accommodate the various acquisition parameters, data storage conventions and computing environments in use at different research sites, which limits the reusability of published workflows.
We present a novel software framework, Abstraction of Repository-Centric ANAlysis (Arcana), which enables the development of complex, “end-to-end” workflows that are adaptable to new analyses and portable to a wide range of computing infrastructures. Analysis templates for specific image types (e.g. MRI contrast) are implemented as Python classes, which define a range of potential derivatives and analysis methods. Arcana retrieves data from imaging repositories, which can be BIDS datasets, XNAT instances or plain directories, and stores selected derivatives and associated provenance back into a repository for reuse by subsequent analyses. Workflows are constructed using Nipype and can be executed on local workstations or in high performance computing environments. Generic analysis methods can be consolidated within common base classes to facilitate code reuse and collaborative development, and specialised for study-specific requirements via class inheritance. Arcana provides a framework in which to develop unified neuroimaging workflows that can be reused across a wide range of research studies and sites.
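
The class-inheritance pattern the abstract describes can be sketched as below. The class and method names are hypothetical, invented to illustrate the design; they are not Arcana's real classes.

```python
# Hypothetical sketch of the class-based analysis-template pattern Arcana
# describes: a base class consolidating generic methods, specialised via
# inheritance for study-specific needs. NOT Arcana's real API.
class MriStudy:
    """Base template: derivatives any MRI contrast can produce."""

    def __init__(self, repository_path):
        self.repository_path = repository_path  # BIDS dir, XNAT URL, ...

    def brain_mask(self):
        # Generic skull-stripping step shared by all MRI analyses.
        return f"mask derived from data in {self.repository_path}"


class FunctionalMriStudy(MriStudy):
    """Specialisation: adds fMRI-specific derivatives via inheritance."""

    def motion_corrected_series(self):
        # Reuses the inherited brain_mask() inside an fMRI-only pipeline.
        return f"motion-corrected series using {self.brain_mask()}"


study = FunctionalMriStudy("/data/bids_dataset")
print(study.motion_corrected_series())
```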


2021 ◽  
Author(s):  
Zhehao Xu ◽  
Xiao Su ◽  
Sicong Hua ◽  
Jiwei Zhai ◽  
Sannian Song ◽  
...  

Abstract
For high-performance data centers, massive data transfer, reliable data storage and emerging in-memory computing require memory technology that combines fast access, large capacity and persistence. In phase-change memory, the Sb-rich compounds Sb7Te3 and GeSb6Te have demonstrated fast switching speed and a considerable difference in phase-transition temperature. A multilayer structure is built from the two compounds to reach three non-volatile resistance states. Sequential phase transitions as a function of temperature are confirmed to produce the distinct resistance states with sufficient thermal stability. After verifying nanoscale confinement for the integrated Sb7Te3/GeSb6Te multilayer thin film, T-shaped PCM cells are fabricated and two SET operations are executed with 40 ns pulses, demonstrating good potential as a multi-level PCM candidate.


2017 ◽  
Author(s):  
Siva Ratna Kumari Narisetti

Multi-level 'OMICS' data integration for multiple organisms has been one of the major challenges in the era of advanced next-generation sequencing and high-performance technologies. Biological data are being produced at a tremendous pace now that high-throughput sequencing is available at low cost and high speed. However, these data are often stored separately across different web resources according to data type and organism, making them difficult to find and integrate. Many websites store data of different types and display them in pie charts or plain-text format, but limit themselves to a single fixed organism. Web-based multi-omics analysis is an efficient and easy way of analyzing such data, but existing resources are of little help to researchers working with other organisms or with complex data. Complex multi-omics data require extensive data management, exhaustive computational analysis, and effective integration to provide a one-stop, interactive, web-based portal to browse, access, analyze, integrate and share knowledge about genomics and molecular mechanisms, with ultimate links to phenotypes and traits for many different organisms.
To achieve this, we have developed Knowledge Base Commons (KBCommons), a platform that automates the process of establishing a database and making tools available for an organism via a dedicated web resource. KBCommons currently supports four categories: Plants and Crops; Animals and Pets; Humans and Diseases; and Microbes and Viruses. It has four main functionalities: Browse KBCommons, Contribute to KB, Add version to KB, and Create a new KB. Using KBCommons, data from research groups working on different organisms can be shared and accessed by all. KBCommons is an automated framework built on the popular and widely used Laravel PHP framework, which handles complex and diverse biological datasets efficiently.
In the Browse KBCommons section, all existing organisms are displayed under each category, including those that can serve as model organisms. KBCommons also displays each organism's logo along with its existing versions, giving detailed information on all existing organisms. Users can browse an organism's existing data on its dedicated page with tools such as BLAST, Multiple Sequence Alignment and Motif Sampler, and can visualize gene expression and differential expression data via pie charts and plain text. Add version to KB and Create a new KB follow similar steps, with users supplying the corresponding data in each section. When a particular organism of interest does not yet exist, the user can create a new knowledge base for it with six essential files: the genome sequence, protein-coding sequences (amino acid), gene-coding sequences (nucleotide), spliced mRNA transcripts, mRNA sequences in GFF3, and a functional annotation file. If an organism already exists, Add version to KB lets the user add a new version to the existing KB with the same six essential files. Through Contribute to KB, users can upload multi-omics data including transcriptomics (RNA-Seq and microarray), proteomics (mass spectrometry and 2D gel), and epigenomics (bisulfite sequencing, methylation array, and MBD-Seq array). We support gene expression, protein expression and methylation data, as well as differential expression comparisons, for each data type. We also support additional entities including miRNA/sRNA, metabolites, SNPs/GWAS, plant introduction lines/animal strains, and phenotypes/traits/diseases.
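
The "Create a new KB" step above requires six essential files; a sketch of what an upload check for them might look like follows. The file names and the validation routine are invented for illustration, since KBCommons' actual upload validation is not described here.

```python
# Hypothetical sketch of validating the six essential files before a new
# KB is created. File names invented; not KBCommons' actual code.
from pathlib import Path

ESSENTIAL_FILES = {
    "genome.fa": "genome sequence",
    "proteins.fa": "protein-coding sequences (amino acid)",
    "genes.fa": "gene-coding sequences (nucleotide)",
    "transcripts.fa": "spliced mRNA transcripts",
    "mrna.gff3": "mRNA sequences in GFF3",
    "functions.tsv": "functional annotation",
}

def validate_kb_upload(upload_dir: str) -> list:
    """Return the missing essential files; empty if all are present."""
    base = Path(upload_dir)
    return [f"{name} ({desc})" for name, desc in ESSENTIAL_FILES.items()
            if not (base / name).is_file()]

missing = validate_kb_upload("/tmp/new_kb_upload")
print("ready to create KB" if not missing else f"missing: {missing}")
```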


2017 ◽  
Vol 10 (4) ◽  
pp. 16
Author(s):  
Haifeng Jiang ◽  
Chang Wan

This paper introduces a method for realizing a dynamic interface and designs a database storage model based on XML field technology, enabling convenient data storage and retrieval under arbitrary combinations of conditions, and examines how to improve retrieval speed in this kind of storage model. A business system usually needs to provide information entry and retrieval functions, so software designers must design the appropriate entry items, input interface and retrieval functions for each business system, spending much time on repetitive work; engineers must then maintain the entry items as needs change. Dynamic interface technology lets users customize the input items themselves, reducing this repetitive work. It comprises two parts: database storage and high-performance data retrieval. This paper explores a storage model based on an XML database to realize common and efficient storage, and discusses how to improve retrieval speed in such a storage model.
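
One plausible realization of the XML-field storage idea is sketched below: user-defined entry items are serialized into a single XML column, then retrieved by XPath over that column, so the table schema never changes when users add input items. This is an illustration under those assumptions, not the paper's exact design; table and field names are invented.

```python
# Minimal sketch of XML-field storage: dynamic entry items live inside one
# XML text column, queried by XPath. Not the paper's exact implementation.
import sqlite3
from lxml import etree

conn = sqlite3.connect(":memory:")
# XML documents are stored as TEXT; the relational schema stays fixed.
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, fields TEXT)")

conn.execute("INSERT INTO records (fields) VALUES (?)",
             ("<record><name>Li Wei</name><dept>Sales</dept></record>",))

# Combination-condition retrieval: filter rows by XPath over the XML field.
for rec_id, xml_text in conn.execute("SELECT id, fields FROM records"):
    doc = etree.fromstring(xml_text)
    if doc.xpath("/record[dept='Sales']"):
        print(rec_id, doc.xpath("string(/record/name)"))
```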


2020 ◽  
Vol 10 (1) ◽  
pp. 357-368
Author(s):  
Farzam Matinfar

Abstract
This paper introduces Wikipedia as an extensive knowledge base that provides additional information about a great number of web resources in the semantic web, and shows how RDF web resources in the web of data can be linked to this encyclopedia. Given an input web resource, the designed system identifies its topic and links it to the corresponding Wikipedia article. To perform this task, we use the core labeling properties in the web of data to select candidate Wikipedia articles for a web resource. Finally, a knowledge-based approach is used to identify the most appropriate article in the Wikipedia database. Evaluation shows the high performance of the designed system.
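
A simplified sketch of the candidate-selection step, assuming rdfs:label and skos:prefLabel as the core labeling properties, is shown below. The toy triples and the title heuristic are invented for illustration and simplify whatever the paper's full procedure is.

```python
# Sketch: read labeling properties of an RDF resource and derive candidate
# Wikipedia article titles from them. Toy data; simplified heuristic.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, SKOS

EX = Namespace("http://example.org/")

g = Graph()
resource = URIRef(EX["resource/42"])
g.add((resource, RDFS.label, Literal("Semantic Web")))
g.add((resource, SKOS.prefLabel, Literal("Web of Data")))

# Collect label values from the core labeling properties...
labels = {str(o) for p in (RDFS.label, SKOS.prefLabel)
          for o in g.objects(resource, p)}

# ...and turn each into a candidate Wikipedia article URL.
for label in sorted(labels):
    title = label.replace(" ", "_")
    print(f"candidate: https://en.wikipedia.org/wiki/{title}")
```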


2021 ◽  
Vol 251 ◽  
pp. 02066
Author(s):  
Javier López-Gómez ◽  
Jakob Blomer

Over the last two decades, ROOT TTree has been used to store over one exabyte of High-Energy Physics (HEP) events. The TTree columnar on-disk layout has proven ideal for analyses of HEP data, which typically require access to many events but only a subset of the information stored for each of them. Future colliders, and particularly the HL-LHC, will bring an increase of at least one order of magnitude in the volume of generated data. The use of modern storage hardware, such as low-latency high-bandwidth NVMe devices and distributed object stores, therefore becomes more important. However, TTree was not designed to optimally exploit modern hardware and may become a bottleneck for data retrieval. The ROOT RNTuple I/O system aims to overcome TTree’s limitations and to provide improved efficiency on modern storage systems. In this paper, we extend RNTuple with a backend that uses Intel DAOS as the underlying storage, demonstrating that the RNTuple architecture can accommodate high-performance object stores. From the user’s perspective, data can be accessed with minimal changes to the code, that is, by replacing a filesystem path with a DAOS URI. Our performance evaluation shows that the new backend can be used for realistic analyses, while outperforming the compatibility solution provided by the DAOS project.
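
The user-facing idea, selecting the storage backend from the URI scheme so that only the path string changes, can be sketched conceptually as below. This mimics the dispatch in plain Python and is not ROOT's RNTuple implementation; the function and its return values are invented.

```python
# Conceptual sketch: the storage backend is chosen by URI scheme, so moving
# from a filesystem file to DAOS means changing only the location string.
from urllib.parse import urlparse

def open_ntuple(location: str) -> str:
    scheme = urlparse(location).scheme
    if scheme == "daos":
        # e.g. "daos://pool-label/container-label" -> DAOS object store
        return f"opening via DAOS backend: {location}"
    # No scheme -> ordinary filesystem-backed file
    return f"opening via file backend: {location}"

print(open_ntuple("data/events.root"))
print(open_ntuple("daos://pool/container"))
```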


2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service enables researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation-intensive, so it remains a challenge to satisfy the high performance requirements of genome-wide association studies (GWAS).
Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance.
Method: We design and implement multi-level parallelization comprising job-level, process-level and thread-level parallelization, enabled by job scheduling management, the message passing interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. This multi-level design exploits multi-machine/multi-core architectures to improve the performance of genotype imputation.
Results: Experimental results show that our proposed method outperforms a Hadoop-based implementation of genotype imputation. Moreover, experiments on supercomputers show that it significantly shortens the execution time, improving the performance of genotype imputation.
Conclusion: The proposed multi-level parallelization, when deployed as an imputation service, will facilitate bioinformatics researchers in Singapore in conducting genotype imputation and enhancing association studies.
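
A structural sketch of the process- and thread-level layers described in the Method is given below, using mpi4py for the process level and a thread pool in place of OpenMP. The imputation itself is stubbed out, and the chunk sizes and names are invented; this shows the parallelization shape, not the authors' implementation.

```python
# Sketch of multi-level parallelization: MPI ranks take chunk partitions,
# each rank processes its windows with a thread pool, and results are
# gathered and concatenated on rank 0. Run with an MPI launcher, e.g.:
#   mpirun -n 4 python impute.py
from concurrent.futures import ThreadPoolExecutor
from mpi4py import MPI

def impute_window(window_id):
    # Placeholder for the real per-window imputation computation.
    return f"imputed-window-{window_id}"

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Process level: chunk partition, one slice of windows per MPI rank.
all_windows = list(range(16))
my_windows = all_windows[rank::size]

# Thread level: process this rank's windows concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    my_results = list(pool.map(impute_window, my_windows))

# Data concatenation: gather per-rank results on rank 0.
gathered = comm.gather(my_results, root=0)
if rank == 0:
    print([r for chunk in gathered for r in chunk])
```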

