PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets

Abstract

Background: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

Results: We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.

Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
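The idea of a single set-oriented query that applies implicitly to every sample of a dataset, combining metadata selection with region manipulation, can be illustrated with a toy sketch in plain Python. This is not the PyGMQL API: the names `Region`, `Sample`, `select_samples` and `count_overlaps` are hypothetical and exist only for this illustration, and the in-memory lists stand in for what the abstract describes as distributed processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic interval (hypothetical type, half-open coordinates)."""
    chrom: str
    start: int  # inclusive
    stop: int   # exclusive


@dataclass
class Sample:
    """One experiment: its metadata plus its list of regions."""
    metadata: dict
    regions: list


def select_samples(dataset, **conditions):
    """Keep the samples whose metadata match every key/value condition."""
    return [
        s for s in dataset
        if all(s.metadata.get(k) == v for k, v in conditions.items())
    ]


def overlaps(a, b):
    """True when two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def count_overlaps(reference, dataset):
    """For each sample, count its regions overlapping each reference region."""
    return [
        {ref: sum(overlaps(ref, r) for r in s.regions) for ref in reference}
        for s in dataset
    ]


# Made-up data for illustration only.
genes = [Region("chr1", 100, 200), Region("chr2", 0, 50)]
dataset = [
    Sample({"cell": "K562"}, [Region("chr1", 150, 160),
                              Region("chr1", 190, 250),
                              Region("chr2", 60, 70)]),
    Sample({"cell": "HeLa"}, [Region("chr2", 10, 20)]),
]

k562 = select_samples(dataset, cell="K562")
counts = count_overlaps(genes, k562)
```

The same two calls work unchanged whether the dataset holds two samples or thousands, which is the property the abstract emphasizes.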


Graph Cutting in Image Processing Handling with Biological Data Analysis

Advances in Intelligent Systems and Computing - Information Technology, Systems Research, and Computational Physics
10.1007/978-3-030-18058-4_16
2019, pp. 203-216

Author(s): Mária Ždímalová, Tomáš Bohumel, Katarína Plachá-Gregorovská, Peter Weismann, Hisham El Falougy

Keyword(s): Image Processing, Data Analysis, Biological Data, Biological Data Analysis


Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Computational Intelligence Methods for Bioinformatics and Biostatistics - Lecture Notes in Computer Science
10.1007/978-3-319-24462-4_22
2015, pp. 259-272
Cited By ~ 1

Author(s): Lars Ailo Bongo, Edvard Pedersen, Martin Ernstsen

Keyword(s): Data Analysis, Biological Data, Data Intensive Computing, Infrastructure Systems, Data Intensive, Biological Data Analysis, Computing Infrastructure


SPDE: A Multi-functional Software for Sequence Processing and Data Extraction

10.1101/2020.11.08.373720
2020
Cited By ~ 1

Author(s): Dong Xu, Zhuchou Lu, Kangming Jin, Wenmin Qiu, Guirong Qiao, ...

Keyword(s): Data Extraction, Single Gene, Biological Data, Sequence Processing, Reverse Complement, Biological Data Analysis, Basic Functions, Functional Software, Genome Information, Ncbi Blast

Abstract: Efficiently extracting information from biological big data can be a huge challenge for people (especially those who lack programming skills). We developed Sequence Processing and Data Extraction (SPDE) as an integrated tool for sequence processing and data extraction for gene family and omics analyses. Currently, SPDE has seven modules comprising 100 basic functions that range from single gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction. All SPDE functions can be used without the need for programming or command lines. The SPDE interface has enough prompt information to help users run SPDE without barriers. In addition to its own functions, SPDE also incorporates publicly available analysis tools (such as NCBI-blast, HMMER, Primer3 and SAMtools), thereby making SPDE a comprehensive bioinformatics platform for big biological data analysis.

Availability: SPDE was built using Python and can be run on 32-bit, 64-bit Windows and macOS systems. It is open-source software that can be downloaded from https://github.com/simon19891216/[email protected]
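Two of the single-gene operations named above, reverse complement and translation, can be sketched in a few lines of Python. This is a minimal illustration and not SPDE's own code: the function names are hypothetical, only unambiguous A/C/G/T bases are handled, and the standard genetic code is assumed.

```python
# Hypothetical helpers for illustration; not taken from SPDE.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, listed with bases in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons in frame 0; '*' marks a stop codon.

    Codons containing characters other than A/C/G/T become 'X';
    a trailing partial codon is ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(reverse_complement("ATGC"))  # GCAT
print(translate("ATGGCCTAA"))      # MA*
```

Building the codon table from a 64-character string keeps the sketch short; a production tool would also need to handle ambiguity codes and alternative genetic codes.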


Innovative Formalism for Biological Data Analysis

Encyclopedia of Information Science and Technology, Fourth Edition
10.4018/978-1-5225-2255-3.ch158
2018, pp. 1814-1824

Author(s): Calin Ciufudean

Keyword(s): Information Technology, Data Analysis, Medical Devices, Fuzzy Set, Biological Data, Fuzzy Operators, Biological Data Analysis, Biological Signals, Design And Manufacture, Gathering Data

Modern medical devices involve information technology (IT) based on electronic structures for data and signal sensing and gathering, data and signal transmission, and data and signal processing, in order to help medical staff diagnose, treat and monitor the evolution of patients. Focusing on biological signal processing, we note that the numerical processing of information delivered by sensors is of significant importance for the sound and optimal design and manufacture of modern medical devices. For this approach we consider fuzzy sets as a formalism for analyzing biological signal processing, and we propose to accomplish this goal by developing fuzzy operators for filtering the noise in biological signal measurements. We exemplify this approach on neurological measurements performed with an electroencephalograph (EEG).
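The abstract does not specify the fuzzy operators it proposes, so the following is only a generic sketch of how a fuzzy membership function can drive noise filtering, not the author's method. It assumes a sliding window, a triangular membership function expressing "close to the window median", and a membership-weighted average; the function names and the `half_window` and `spread` parameters are hypothetical.

```python
from statistics import median


def triangular_membership(distance: float, spread: float) -> float:
    """Degree in [0, 1] to which a sample is 'close to' the reference value."""
    return max(0.0, 1.0 - abs(distance) / spread)


def fuzzy_filter(signal, half_window=2, spread=1.0):
    """Replace each sample with a membership-weighted average of its window.

    Samples far from the window median get low or zero weight, so isolated
    spikes contribute little to the output.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    n = len(signal)
    filtered = []
    for i in range(n):
        window = signal[max(0, i - half_window):min(n, i + half_window + 1)]
        center = median(window)
        weights = [triangular_membership(x - center, spread) for x in window]
        total = sum(weights)
        if total > 0:
            filtered.append(sum(w * x for w, x in zip(weights, window)) / total)
        else:
            # No sample is close to the median; fall back to the median itself.
            filtered.append(center)
    return filtered


# Made-up trace with one spike, for illustration only.
trace = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
smoothed = fuzzy_filter(trace, half_window=2, spread=1.0)
```

Here `spread` sets how tolerant the filter is: a larger value lets more distant samples influence the average, while a smaller one behaves more like a median filter.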
