Knowledge Base Commons (KBCommons) v1.1: a universal framework for multi-omics data integration and biological discoveries

BMC Genomics ◽  
2019 ◽  
Vol 20 (S11) ◽  
Author(s):  
Shuai Zeng ◽  
Zhen Lyu ◽  
Siva Ratna Kumari Narisetti ◽  
Dong Xu ◽  
Trupti Joshi

Abstract
Background: Knowledge Base Commons (KBCommons) v1.1 is a universal, all-inclusive web-based framework providing generic functionalities for storing, sharing, analyzing, exploring, integrating and visualizing multiple organisms’ genomics and integrative omics data. KBCommons is designed and developed to integrate diverse multi-level omics data and to support biological discoveries for all species via a common platform.
Methods: KBCommons has four modules: data storage, data processing, data access, and a web interface for data management and retrieval. It provides a comprehensive framework for creating new plant-specific, animal-specific, virus-specific, bacteria-specific or human-disease-specific knowledge bases (KBs), for adding new genome versions and additional multi-omics data to existing KBs, and for exploring existing datasets within current KBs.
Results: KBCommons offers an array of tools for data visualization and analytics, including multiple gene/metabolite search, gene family/Pfam/Panther function annotation search, miRNA/metabolite/trait/SNP search, differential gene expression analysis, and bulk data download. It contains a highly reliable data privilege management system that makes it easy to release users’ data publicly and to share private or pre-publication data safely and securely with members of their collaborative groups. It allows users to conduct data analysis using our in-house workflow functionalities, which are linked to XSEDE high performance computing resources. Using KBCommons’ intuitive web interface, users can easily retrieve genomic data, multi-omics data and workflow analysis results according to their requirements and interests.
Conclusions: KBCommons addresses the need of many diverse research communities for a comprehensive multi-level omics web resource for data retrieval, sharing, analysis and visualization. KBCommons can be publicly accessed for all organisms at http://kbcommons.org/.
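
As a concrete illustration of the kind of differential gene expression comparison such tools perform, the sketch below computes log2 fold changes and Welch's t-test p-values between two sample groups. It is a minimal, generic example: the gene names and count matrix are invented, and it is not KBCommons' internal implementation.

```python
# Minimal sketch of a differential-expression comparison between two
# sample groups. Gene IDs and expression values are invented.
import numpy as np
from scipy import stats

genes = ["GeneA", "GeneB", "GeneC"]             # hypothetical gene IDs
control = np.array([[120.0, 98.0, 101.0],       # rows: genes, cols: replicates
                    [45.0, 52.0, 49.0],
                    [300.0, 280.0, 310.0]])
treated = np.array([[240.0, 230.0, 251.0],
                    [44.0, 50.0, 47.0],
                    [150.0, 160.0, 149.0]])

for i, gene in enumerate(genes):
    # log2 fold change of mean expression (pseudo-count avoids log(0))
    lfc = np.log2((treated[i].mean() + 1.0) / (control[i].mean() + 1.0))
    # Welch's t-test: does not assume equal variance between groups
    _, p = stats.ttest_ind(treated[i], control[i], equal_var=False)
    print(f"{gene}: log2FC={lfc:+.2f}, p={p:.3f}")
```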

Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

The latest advances in network and distributed system technologies now allow the integration of a vast variety of services with almost unlimited processing power, using large amounts of data. Sharing of resources is often viewed as the key goal of distributed systems, and in this context the sharing of stored data appears as the most important aspect of distributed resource sharing. Scientific applications are the first to take advantage of such environments, as the requirements of current and future high performance computing experiments are pressing in terms of the ever higher volumes of data to be stored and managed. While these new environments offer huge opportunities for large-scale distributed data storage and management, they also raise important technical challenges that need to be addressed. The ability to support persistent storage of data on behalf of users, the consistent distribution of up-to-date data, the reliable replication of fast-changing datasets and the efficient management of large data transfers are just some of these challenges. In this chapter we discuss whether the existing distributed computing infrastructure is adequate for supporting the required data storage and management functionalities. We highlight the issues raised by storing data over large distributed environments and discuss recent research efforts dealing with the challenges of data retrieval, replication and fast data transfers. The interaction of data management with other data-sensitive emerging technologies, such as workflow management, is also addressed.


2016 ◽  
Author(s):  
Mathurin Dorel ◽  
Emmanuel Barillot ◽  
Andrei Zinovyev ◽  
Inna Kuperstein

Abstract
Human diseases such as cancer are routinely characterized by high-throughput molecular technologies, and multi-level omics data are accumulating in public databases at an increasing rate. Retrieval and visualization of these data in the context of molecular network maps can provide insights into the pattern of molecular functions encompassed by an omics profile. To make this task easy, we developed NaviCom, a Python package and web platform for visualizing multi-level omics data on top of biological network maps. NaviCom bridges the gap between cBioPortal, the most widely used resource of large-scale cancer omics data, and NaviCell, a data visualization web service that contains several molecular network map collections. NaviCom proposes several standardized modes of data display on top of molecular network maps, allowing users to address specific biological questions. We illustrate how users can easily create interactive network-based cancer molecular portraits via the NaviCom web interface using the maps of the Atlas of Cancer Signaling Network (ACSN) and other maps. Analysis of these molecular portraits can help in formulating scientific hypotheses on the molecular mechanisms deregulated in the studied disease.
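
The sketch below illustrates the general idea of projecting an omics profile onto a molecular network map, using networkx for illustration only; it is not NaviCom's actual API, and the toy network and expression values are invented.

```python
# Conceptual sketch of overlaying an omics profile on a molecular network,
# in the spirit of NaviCom's display on NaviCell maps. All data invented.
import networkx as nx

# A toy signaling network: edges are directed molecular interactions.
network = nx.DiGraph([("EGFR", "RAS"), ("RAS", "RAF"), ("RAF", "ERK")])

# A toy expression profile (e.g. tumor-vs-normal log2 fold changes).
expression = {"EGFR": 2.1, "RAS": 0.3, "RAF": -0.2, "ERK": 1.5}

# Attach the profile to the map and report visibly deregulated nodes.
nx.set_node_attributes(network, expression, name="log2fc")
for node, lfc in network.nodes(data="log2fc"):
    status = "up" if lfc > 1 else ("down" if lfc < -1 else "unchanged")
    print(f"{node}: log2FC={lfc:+.1f} ({status})")
```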


2018 ◽  
Author(s):  
Thomas G. Close ◽  
Phillip G. D. Ward ◽  
Francesco Sforazzini ◽  
Wojtek Goscinski ◽  
Zhaolin Chen ◽  
...  

Abstract
Mastering the “arcana of neuroimaging analysis”, the obscure knowledge required to apply an appropriate combination of software tools and parameters to analyse a given neuroimaging dataset, is a time-consuming process. It is therefore not typically feasible to invest the additional effort required to generalise workflow implementations to accommodate the various acquisition parameters, data storage conventions and computing environments in use at different research sites, which limits the reusability of published workflows.
We present a novel software framework, Abstraction of Repository-Centric ANAlysis (Arcana), which enables the development of complex, “end-to-end” workflows that are adaptable to new analyses and portable to a wide range of computing infrastructures. Analysis templates for specific image types (e.g. MRI contrast) are implemented as Python classes, which define a range of potential derivatives and analysis methods. Arcana retrieves data from imaging repositories, which can be BIDS datasets, XNAT instances or plain directories, and stores selected derivatives and associated provenance back into a repository for reuse by subsequent analyses. Workflows are constructed using Nipype and can be executed on local workstations or in high performance computing environments. Generic analysis methods can be consolidated within common base classes to facilitate code reuse and collaborative development, and specialised for study-specific requirements via class inheritance. Arcana provides a framework in which to develop unified neuroimaging workflows that can be reused across a wide range of research studies and sites.
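
The class-inheritance pattern the abstract describes can be sketched as below. The class and method names are hypothetical, invented to illustrate the design; they are not Arcana's real classes.

```python
# Hypothetical sketch of the class-based analysis-template pattern Arcana
# describes: a base class consolidating generic methods, specialised via
# inheritance for study-specific needs. NOT Arcana's real API.
class MriStudy:
    """Base template: derivatives any MRI contrast can produce."""

    def __init__(self, repository_path):
        self.repository_path = repository_path  # BIDS dir, XNAT URL, ...

    def brain_mask(self):
        # Generic skull-stripping step shared by all MRI analyses.
        return f"mask derived from data in {self.repository_path}"


class FunctionalMriStudy(MriStudy):
    """Specialisation: adds fMRI-specific derivatives via inheritance."""

    def motion_corrected_series(self):
        # Reuses the inherited brain_mask() inside an fMRI-only pipeline.
        return f"motion-corrected series using {self.brain_mask()}"


study = FunctionalMriStudy("/data/bids_dataset")
print(study.motion_corrected_series())
```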


2021 ◽  
Author(s):  
Zhehao Xu ◽  
Xiao Su ◽  
Sicong Hua ◽  
Jiwei Zhai ◽  
Sannian Song ◽  
...  

Abstract
For high-performance data centers, massive data transfer, reliable data storage and emerging in-memory computing require memory technology that combines fast access, large capacity and persistence. In phase-change memory, the Sb-rich compounds Sb7Te3 and GeSb6Te have demonstrated fast switching speed and a considerable difference in phase-transition temperature. A multilayer structure is built from the two compounds to reach three non-volatile resistance states. Sequential phase transitions as a function of temperature are confirmed to produce the distinct resistance states with sufficient thermal stability. After verifying nanoscale confinement for the integrated Sb7Te3/GeSb6Te multilayer thin film, T-shaped PCM cells are fabricated and two SET operations are executed with 40 ns pulses, demonstrating good potential as a multi-level PCM candidate.


2017 ◽  
Author(s):  
Siva Ratna Kumari Narisetti

Multi-level 'OMICS' data integration for multiple organisms has been one of the major challenges in the era of advanced next-generation sequencing and high-performance technologies. Biological data are being produced at a tremendous pace now that high-throughput sequencing is available at low cost and high speed. However, these data are often stored separately across different web resources according to data type and organism, making them difficult to find and integrate. Many websites store data of different types and display them in pie charts or plain-text format, but limit themselves to a single fixed organism. Web-based multi-omics analysis is an efficient and easy way of analyzing such data, but existing resources are of little help to researchers working with other organisms or with complex data. Complex multi-omics data require extensive data management, exhaustive computational analysis, and effective integration to provide a one-stop, interactive, web-based portal to browse, access, analyze, integrate and share knowledge about genomics and molecular mechanisms, with ultimate links to phenotypes and traits for many different organisms.
To achieve this, we have developed Knowledge Base Commons (KBCommons), a platform that automates the process of establishing a database and making tools available for an organism via a dedicated web resource. KBCommons currently supports four categories: Plants and Crops; Animals and Pets; Humans and Diseases; and Microbes and Viruses. It has four main functionalities: Browse KBCommons, Contribute to KB, Add version to KB, and Create a new KB. Using KBCommons, data from research groups working on different organisms can be shared and accessed by all. KBCommons is an automated framework built on the popular and widely used Laravel PHP framework, which handles complex and diverse biological datasets efficiently.
In the Browse KBCommons section, all existing organisms are displayed under each category, including those that can serve as model organisms. KBCommons also displays each organism's logo along with its existing versions, giving detailed information on all existing organisms. Users can browse an organism's existing data on its dedicated page with tools such as BLAST, Multiple Sequence Alignment and Motif Sampler, and can visualize gene expression and differential expression data via pie charts and plain text. Add version to KB and Create a new KB follow similar steps, with users supplying the corresponding data in each section. When a particular organism of interest does not yet exist, the user can create a new knowledge base for it with six essential files: the genome sequence, protein-coding sequences (amino acid), gene-coding sequences (nucleotide), spliced mRNA transcripts, mRNA sequences in GFF3, and a functional annotation file. If an organism already exists, Add version to KB lets the user add a new version to the existing KB with the same six essential files. Through Contribute to KB, users can upload multi-omics data including transcriptomics (RNA-Seq and microarray), proteomics (mass spectrometry and 2D gel), and epigenomics (bisulfite sequencing, methylation array, and MBD-Seq array). We support gene expression, protein expression and methylation data, as well as differential expression comparisons, for each data type. We also support additional entities including miRNA/sRNA, metabolites, SNPs/GWAS, plant introduction lines/animal strains, and phenotypes/traits/diseases.
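
The "Create a new KB" step above requires six essential files; a sketch of what an upload check for them might look like follows. The file names and the validation routine are invented for illustration, since KBCommons' actual upload validation is not described here.

```python
# Hypothetical sketch of validating the six essential files before a new
# KB is created. File names invented; not KBCommons' actual code.
from pathlib import Path

ESSENTIAL_FILES = {
    "genome.fa": "genome sequence",
    "proteins.fa": "protein-coding sequences (amino acid)",
    "genes.fa": "gene-coding sequences (nucleotide)",
    "transcripts.fa": "spliced mRNA transcripts",
    "mrna.gff3": "mRNA sequences in GFF3",
    "functions.tsv": "functional annotation",
}

def validate_kb_upload(upload_dir: str) -> list:
    """Return the missing essential files; empty if all are present."""
    base = Path(upload_dir)
    return [f"{name} ({desc})" for name, desc in ESSENTIAL_FILES.items()
            if not (base / name).is_file()]

missing = validate_kb_upload("/tmp/new_kb_upload")
print("ready to create KB" if not missing else f"missing: {missing}")
```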


2017 ◽  
Vol 10 (4) ◽  
pp. 16
Author(s):  
Haifeng Jiang ◽  
Chang Wan

This paper introduces a method for realizing a dynamic interface and designs a database storage model based on XML field technology, enabling convenient data storage and retrieval under arbitrary combinations of conditions, and examines how to improve retrieval speed in this kind of storage model. A business system usually needs to provide information entry and retrieval functions, so software designers must design the appropriate entry items, input interface and retrieval functions for each business system, spending much time on repetitive work; engineers must then maintain the entry items as needs change. Dynamic interface technology lets users customize the input items themselves, reducing this repetitive work. It comprises two parts: database storage and high-performance data retrieval. This paper explores a storage model based on an XML database to realize common and efficient storage, and discusses how to improve retrieval speed in such a storage model.
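
One plausible realization of the XML-field storage idea is sketched below: user-defined entry items are serialized into a single XML column, then retrieved by XPath over that column, so the table schema never changes when users add input items. This is an illustration under those assumptions, not the paper's exact design; table and field names are invented.

```python
# Minimal sketch of XML-field storage: dynamic entry items live inside one
# XML text column, queried by XPath. Not the paper's exact implementation.
import sqlite3
from lxml import etree

conn = sqlite3.connect(":memory:")
# XML documents are stored as TEXT; the relational schema stays fixed.
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, fields TEXT)")

conn.execute("INSERT INTO records (fields) VALUES (?)",
             ("<record><name>Li Wei</name><dept>Sales</dept></record>",))

# Combination-condition retrieval: filter rows by XPath over the XML field.
for rec_id, xml_text in conn.execute("SELECT id, fields FROM records"):
    doc = etree.fromstring(xml_text)
    if doc.xpath("/record[dept='Sales']"):
        print(rec_id, doc.xpath("string(/record/name)"))
```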


2020 ◽  
Vol 10 (1) ◽  
pp. 357-368
Author(s):  
Farzam Matinfar

Abstract
This paper introduces Wikipedia as an extensive knowledge base that provides additional information about a great number of web resources in the semantic web, and shows how RDF web resources in the web of data can be linked to this encyclopedia. Given an input web resource, the designed system identifies its topic and links it to the corresponding Wikipedia article. To perform this task, we use the core labeling properties in the web of data to select candidate Wikipedia articles for a web resource. Finally, a knowledge-based approach is used to identify the most appropriate article in the Wikipedia database. Evaluation shows the high performance of the designed system.
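
A simplified sketch of the candidate-selection step, assuming rdfs:label and skos:prefLabel as the core labeling properties, is shown below. The toy triples and the title heuristic are invented for illustration and simplify whatever the paper's full procedure is.

```python
# Sketch: read labeling properties of an RDF resource and derive candidate
# Wikipedia article titles from them. Toy data; simplified heuristic.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, SKOS

EX = Namespace("http://example.org/")

g = Graph()
resource = URIRef(EX["resource/42"])
g.add((resource, RDFS.label, Literal("Semantic Web")))
g.add((resource, SKOS.prefLabel, Literal("Web of Data")))

# Collect label values from the core labeling properties...
labels = {str(o) for p in (RDFS.label, SKOS.prefLabel)
          for o in g.objects(resource, p)}

# ...and turn each into a candidate Wikipedia article URL.
for label in sorted(labels):
    title = label.replace(" ", "_")
    print(f"candidate: https://en.wikipedia.org/wiki/{title}")
```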


2021 ◽  
Vol 251 ◽  
pp. 02066
Author(s):  
Javier López-Gómez ◽  
Jakob Blomer

Over the last two decades, ROOT TTree has been used to store over one exabyte of High-Energy Physics (HEP) events. The TTree columnar on-disk layout has proven ideal for analyses of HEP data, which typically require access to many events but only a subset of the information stored for each of them. Future colliders, and particularly the HL-LHC, will bring an increase of at least one order of magnitude in the volume of generated data. The use of modern storage hardware, such as low-latency high-bandwidth NVMe devices and distributed object stores, therefore becomes more important. However, TTree was not designed to optimally exploit modern hardware and may become a bottleneck for data retrieval. The ROOT RNTuple I/O system aims to overcome TTree’s limitations and to provide improved efficiency on modern storage systems. In this paper, we extend RNTuple with a backend that uses Intel DAOS as the underlying storage, demonstrating that the RNTuple architecture can accommodate high-performance object stores. From the user’s perspective, data can be accessed with minimal changes to the code, that is, by replacing a filesystem path with a DAOS URI. Our performance evaluation shows that the new backend can be used for realistic analyses, while outperforming the compatibility solution provided by the DAOS project.
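
The user-facing idea, selecting the storage backend from the URI scheme so that only the path string changes, can be sketched conceptually as below. This mimics the dispatch in plain Python and is not ROOT's RNTuple implementation; the function and its return values are invented.

```python
# Conceptual sketch: the storage backend is chosen by URI scheme, so moving
# from a filesystem file to DAOS means changing only the location string.
from urllib.parse import urlparse

def open_ntuple(location: str) -> str:
    scheme = urlparse(location).scheme
    if scheme == "daos":
        # e.g. "daos://pool-label/container-label" -> DAOS object store
        return f"opening via DAOS backend: {location}"
    # No scheme -> ordinary filesystem-backed file
    return f"opening via file backend: {location}"

print(open_ntuple("data/events.root"))
print(open_ntuple("daos://pool/container"))
```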


2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service enables researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation-intensive, so it remains a challenge to satisfy the high performance requirements of genome-wide association studies (GWAS).
Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance.
Method: We design and implement multi-level parallelization comprising job-level, process-level and thread-level parallelization, enabled by job scheduling management, the message passing interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. This multi-level design exploits multi-machine/multi-core architectures to improve the performance of genotype imputation.
Results: Experimental results show that our proposed method outperforms a Hadoop-based implementation of genotype imputation. Moreover, experiments on supercomputers show that it significantly shortens the execution time, improving the performance of genotype imputation.
Conclusion: The proposed multi-level parallelization, when deployed as an imputation service, will facilitate bioinformatics researchers in Singapore in conducting genotype imputation and enhancing association studies.
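
A structural sketch of the process- and thread-level layers described in the Method is given below, using mpi4py for the process level and a thread pool in place of OpenMP. The imputation itself is stubbed out, and the chunk sizes and names are invented; this shows the parallelization shape, not the authors' implementation.

```python
# Sketch of multi-level parallelization: MPI ranks take chunk partitions,
# each rank processes its windows with a thread pool, and results are
# gathered and concatenated on rank 0. Run with an MPI launcher, e.g.:
#   mpirun -n 4 python impute.py
from concurrent.futures import ThreadPoolExecutor
from mpi4py import MPI

def impute_window(window_id):
    # Placeholder for the real per-window imputation computation.
    return f"imputed-window-{window_id}"

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Process level: chunk partition, one slice of windows per MPI rank.
all_windows = list(range(16))
my_windows = all_windows[rank::size]

# Thread level: process this rank's windows concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    my_results = list(pool.map(impute_window, my_windows))

# Data concatenation: gather per-rank results on rank 0.
gathered = comm.gather(my_results, root=0)
if rank == 0:
    print([r for chunk in gathered for r in chunk])
```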

