SPDE: A Multi-functional Software for Sequence Processing and Data Extraction

Author(s):  
Dong Xu ◽  
Zhuchou Lu ◽  
Kangming Jin ◽  
Wenmin Qiu ◽  
Guirong Qiao ◽  
...  

Abstract Efficiently extracting information from biological big data can be a huge challenge, especially for people who lack programming skills. We developed Sequence Processing and Data Extraction (SPDE) as an integrated tool for sequence processing and data extraction in gene family and omics analyses. Currently, SPDE has seven modules comprising 100 basic functions that range from single-gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction. All SPDE functions can be used without programming or command lines, and the SPDE interface provides enough prompt information to help users run SPDE without barriers. In addition to its own functions, SPDE also incorporates publicly available analysis tools (such as NCBI-BLAST, HMMER, Primer3 and SAMtools), thereby making SPDE a comprehensive bioinformatics platform for big biological data analysis. Availability SPDE was built using Python and can be run on 32-bit and 64-bit Windows and on macOS. It is open-source software that can be downloaded from https://github.com/simon19891216/[email protected]
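Two of the single-gene operations the abstract lists, reverse complement and translation, can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical helper names, not SPDE's actual implementation, and the codon table is abbreviated for brevity.

```python
# Minimal sketches of two single-gene operations of the kind SPDE offers.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, abbreviated to the codons used in the example.
CODON_TABLE = {"ATG": "M", "TTT": "F", "AAA": "K", "TAA": "*"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq: str) -> str:
    """Translate codons to amino acids, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `reverse_complement("ATGC")` yields `"GCAT"`, and `translate("ATGTTTAAATAA")` yields `"MFK"`.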

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Luca Nanni ◽  
Pietro Pinoli ◽  
Arif Canakoglu ◽  
Stefano Ceri

Abstract Background With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged to build efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments; moreover, they lack proper support for metadata manipulation. Results We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way; outsourced processing can address cloud-based installations of the GMQL engine. Conclusions PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
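The core pattern PyGMQL scales out, selecting datasets by metadata and then operating on their region data, can be illustrated in plain Python. This sketch deliberately does not use the PyGMQL API; the dataset layout, helper names and metadata keys are all illustrative assumptions.

```python
# Plain-Python illustration (not the PyGMQL API) of metadata-driven
# selection followed by region filtering over a collection of datasets.
datasets = [
    {"meta": {"assay": "ChIP-seq", "cell": "K562"},
     "regions": [("chr1", 100, 200), ("chr2", 50, 80)]},
    {"meta": {"assay": "RNA-seq", "cell": "K562"},
     "regions": [("chr1", 300, 400)]},
]

def select(datasets, **meta_filter):
    """Keep only datasets whose metadata matches every key/value pair."""
    return [d for d in datasets
            if all(d["meta"].get(k) == v for k, v in meta_filter.items())]

def filter_regions(datasets, chrom):
    """Collect the regions on the given chromosome across datasets."""
    return [r for d in datasets for r in d["regions"] if r[0] == chrom]

chip = select(datasets, assay="ChIP-seq")
chr1_regions = filter_regions(chip, "chr1")
```

In PyGMQL the same select-then-operate pattern applies implicitly to thousands of files, with execution delegated to Spark, locally or on a remote GMQL service.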


2021 ◽  

Abstract R is an open-source statistical environment modelled after the previously widely used commercial programs S and S-Plus; in addition to powerful statistical analysis tools, it also provides powerful graphics output. Beyond its statistical and graphical capabilities, R is a programming language suitable for medium-sized projects. This book presents a set of studies that collectively cover almost all the R operations needed by beginners analysing their own data, up to perhaps the early years of a PhD. Although the chapters are organized around topics such as graphing, classical statistical tests, statistical modelling, mapping and text parsing, the examples have been chosen largely from real scientific studies at the appropriate level, and within each chapter more R functions are covered than are strictly necessary to get a p-value or a graph. R comes with around a thousand base functions, which are automatically installed when R is downloaded; this book covers those of most relevance to biological data analysis, modelling and graphics. Throughout each chapter, the functions introduced and used in that chapter are summarized in Tool Boxes. The book also shows the user how to adapt and write their own code and functions. A selection of base functions relevant to graphics that are not necessarily covered in the main text are described in Appendix 1, and additional housekeeping functions in Appendix 2.


Author(s):  
Mária Ždímalová ◽  
Tomáš Bohumel ◽  
Katarína Plachá-Gregorovská ◽  
Peter Weismann ◽  
Hisham El Falougy

2019 ◽  
Author(s):  
Alessandro Greco ◽  
Jon Sanchez Valle ◽  
Vera Pancaldi ◽  
Anaïs Baudot ◽  
Emmanuel Barillot ◽  
...  

Abstract Matrix Factorization (MF) is an established paradigm for large-scale biological data analysis with tremendous potential in computational biology. We here challenge MF to depict the molecular bases of epidemiologically described Disease-Disease (DD) relationships. As a use case, we focus on the inverse comorbidity association between Alzheimer's disease (AD) and lung cancer (LC), described as a lower-than-expected probability of developing LC in AD patients. To date, the molecular mechanisms underlying DD relationships remain poorly explained, and their better characterization might offer unprecedented clinical opportunities. To this end, we extend our previously designed MF-based framework for the molecular characterization of DD relationships. Considering AD-LC inverse comorbidity as a case study, we highlight multiple molecular mechanisms, among them the previously identified immune system and mitochondrial metabolism. We then discriminate mechanisms specific to LC from those shared with other cancers through a pan-cancer analysis. Additionally, new candidate molecular players, such as Estrogen Receptor (ER), CDH1 and HDAC, are pinpointed as factors that might underlie the inverse relationship, opening the way to new investigations. Finally, some lung cancer subtype-specific factors are also detected, suggesting the existence of heterogeneity across patients also in the context of inverse comorbidity.
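The MF paradigm the abstract builds on can be illustrated with a toy non-negative matrix factorization using the classical Lee-Seung multiplicative updates. This is a generic sketch of the technique, not the authors' framework; matrix sizes, iteration count and seed are arbitrary choices.

```python
import random

def nmf(V, k, iters=500, seed=0):
    """Factor a non-negative m x n matrix V into W (m x k) and H (k x n)
    by Lee-Seung multiplicative updates minimizing squared error."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        WH = [[sum(W[i][a] * H[a][j] for a in range(k)) for j in range(n)]
              for i in range(m)]
        # H <- H * (W^T V) / (W^T W H), element-wise
        for a in range(k):
            for j in range(n):
                num = sum(W[i][a] * V[i][j] for i in range(m))
                den = sum(W[i][a] * WH[i][j] for i in range(m)) + 1e-9
                H[a][j] *= num / den
        WH = [[sum(W[i][a] * H[a][j] for a in range(k)) for j in range(n)]
              for i in range(m)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        for i in range(m):
            for a in range(k):
                num = sum(H[a][j] * V[i][j] for j in range(n))
                den = sum(H[a][j] * WH[i][j] for j in range(n)) + 1e-9
                W[i][a] *= num / den
    return W, H
```

Applied to a gene-by-sample expression matrix, the columns of W play the role of the molecular components (metagenes) that the authors interpret against the AD-LC comorbidity.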


Author(s):  
Francisco Andres Rivera-Quiroz ◽  
Jeremy Miller

Traditional taxonomic publications have served as a biological data repository, accumulating vast amounts of data on species diversity, geographical and temporal distributions, ecological interactions, and taxonomic relations, among many other types of information. However, the fragmented nature of taxonomic literature has made this data difficult to access and use to its full potential. Current anthropogenic impact on biodiversity demands faster knowledge generation, but also making better use of what we already have; this could help us make better-informed decisions about conservation and resource management. In recent years, several efforts have been made to make taxonomic literature more mobilized and accessible. These include online publications, open access journals, the digitization of old paper literature, and improved availability through specialized online repositories such as the Biodiversity Heritage Library (BHL) and the World Spider Catalog (WSC), among others. Although easy to share, PDF publications still have most of their biodiversity data embedded in strings of text, making them less dynamic and difficult or impossible to read and analyze without a human interpreter. Recently developed tools such as GoldenGATE-Imagine (GGI) allow transforming PDFs into XML files in which taxonomically relevant data are extracted and categorized. These data can then be aggregated in databases such as Plazi TreatmentBank, where they can be re-explored, queried and analyzed. Here we combined several of these cybertaxonomic tools to test the data extraction process for one potential application: the design and planning of an expedition to collect fresh material in the field. We targeted the ground spider Teutamus politus and other related species from the Teutamus group (TG) (Araneae; Liocranidae). These spiders are known from South East Asia and have been cataloged in the family Liocranidae; however, their relations, biology and evolution are still poorly understood.
We marked up 56 publications that contained taxonomic treatments with specimen records for the Liocranidae. Of these publications, 20 contained information on members of the TG. Geographical distributions and occurrences of 90 TG species were analyzed based on 1,309 specimen records. These data were used to design our field collection in a way that allowed us to optimize the collection of adult specimens of our target taxa. The TG genera were most common in Indonesia, Thailand and Malaysia. Of these, Thailand was the second richest but had the most records of T. politus. The seasonal distribution of TG specimens in Thailand suggested June and July as the best time for collecting adults. Based on these analyses, we decided to sample from mid-July to mid-August 2018 in the three Thai provinces that combined the most records of TG species and T. politus. Relying on the results of our literature analyses and using standard collection methods for ground spiders, we captured at least one specimen of every TG genus reported for Thailand. Our one-month expedition captured 231 TG spiders; of these, T. politus was the most abundant species, with 188 specimens (95 adults). By comparison, a total of 196 specimens of the TG and 66 of T. politus had been reported for the same provinces in the last 40 years. Our sampling greatly increased the number of available specimens, especially for the genera Teutamus and Oedignatha. We also extended the known distribution of Oedignatha and Sesieutes within Thailand. These results illustrate the relevance of making biodiversity data contained within taxonomic treatments accessible and reusable. They also exemplify one potential use of taxonomic legacy data: to more efficiently use existing biodiversity data to fill knowledge gaps. A similar approach can be used to study neglected or interesting taxa and geographic areas, generating better biodiversity documentation that could aid in decision making, management and conservation.
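The seasonal analysis described above, tallying adult specimen records per collection month to choose a field season, reduces to a simple aggregation. The records below are illustrative stand-ins, not the actual 1,309-record dataset, and the field names are assumptions.

```python
from collections import Counter

# Hypothetical specimen records extracted from marked-up treatments.
records = [
    {"species": "Teutamus politus", "month": 6, "adults": 3},
    {"species": "Teutamus politus", "month": 7, "adults": 5},
    {"species": "Oedignatha sp.", "month": 7, "adults": 2},
    {"species": "Teutamus politus", "month": 12, "adults": 0},
]

def adults_per_month(records):
    """Sum adult counts per collection month across all records."""
    totals = Counter()
    for r in records:
        totals[r["month"]] += r["adults"]
    return totals

# The month with the most adult records suggests the best sampling window.
best_month = adults_per_month(records).most_common(1)[0][0]
```

On the real TreatmentBank extraction, the same tally is what pointed the authors to June-July as the peak for adult TG specimens in Thailand.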


Author(s):  
Calin Ciufudean

Modern medical devices involve information technology (IT) based on electronic structures for sensing and gathering data and signals, for their transmission, and for their processing, in order to help medical staff diagnose, treat and monitor the evolution of patients. Focusing on biological signal processing, we note that the numerical processing of information delivered by sensors is significantly important for the fair and optimal design and manufacture of modern medical devices. In this approach we consider fuzzy sets as a formalism for the analysis of biological signal processing, and we accomplish this goal by developing fuzzy operators for filtering the noise of biological signal measurements. We exemplify this approach on neurological measurements performed with an electroencephalograph (EEG).
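One common shape for such a fuzzy noise-filtering operator is a membership-weighted local mean: samples close to the local median get membership near 1, while outliers are discounted. This is a minimal sketch of the general idea, assuming a Gaussian membership function and a window radius of 1; it is not the chapter's specific operator.

```python
import math

def gaussian_membership(d, width=1.0):
    """Fuzzy membership: deviations near zero map to weights near 1."""
    return math.exp(-(d / width) ** 2)

def fuzzy_filter(signal, radius=1, width=1.0):
    """Replace each sample by a membership-weighted mean of its window,
    where membership is computed from the deviation to the window median."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        med = sorted(window)[len(window) // 2]
        weights = [gaussian_membership(x - med, width) for x in window]
        out.append(sum(w * x for w, x in zip(weights, window)) / sum(weights))
    return out
```

On an EEG trace, a transient artifact such as the spike in `[0, 0, 10, 0, 0]` is strongly suppressed, while flat segments pass through unchanged.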


Data Mining ◽  
2013 ◽  
pp. 1019-1042
Author(s):  
Pratibha Rani ◽  
Vikram Pudi

The rapid progress of computational biology, biotechnology, and bioinformatics in the last two decades has led to the accumulation of tremendous amounts of biological data that demand in-depth analysis. Data mining methods have been applied successfully to analyze this data. An important problem in biological data analysis is to classify a newly discovered sequence, such as a protein or DNA sequence, based on its important features and functions, using the collection of available sequences. In this chapter, we study this problem and present two Bayesian classifiers, RBNBC (Rani & Pudi, 2008a) and REBMEC (Rani & Pudi, 2008c). The algorithms used in these classifiers incorporate repeated occurrences of subsequences within each sequence (Rani, 2008). Specifically, the Repeat Based Naive Bayes Classifier (RBNBC) uses a novel formulation of Naive Bayes, and the second classifier, the Repeat Based Maximum Entropy Classifier (REBMEC), uses a novel framework based on the classical Generalized Iterative Scaling (GIS) algorithm.
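The core idea of counting repeated subsequence occurrences and feeding them into a Naive Bayes model can be sketched with overlapping k-mer counts and Laplace smoothing. This is a generic illustration of the approach, not the RBNBC formulation from the chapter; the k-mer length and smoothing scheme are assumptions.

```python
import math
from collections import Counter

def kmer_counts(seq, k=2):
    """Count every overlapping k-mer, keeping repeated occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(labelled_seqs, k=2):
    """Aggregate k-mer counts per class from (label, sequence) pairs."""
    model = {}
    for label, seq in labelled_seqs:
        model.setdefault(label, Counter()).update(kmer_counts(seq, k))
    return model

def classify(model, seq, k=2):
    """Pick the class maximizing the Laplace-smoothed log-likelihood,
    weighting each k-mer by its repeat count in the query sequence."""
    vocab = set().union(*model.values())
    best, best_score = None, -math.inf
    for label, counts in model.items():
        total = sum(counts.values())
        score = sum(n * math.log((counts[km] + 1) / (total + len(vocab)))
                    for km, n in kmer_counts(seq, k).items())
        if score > best_score:
            best, best_score = label, score
    return best
```

Trained on an AT-rich and a GC-rich class, the classifier assigns `"ATAT"` to the AT-rich class; RBNBC refines this scheme with its own treatment of repeat probabilities.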


Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on full dimensions. As the dimensionality of the space grows, these algorithms lose their efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances based on full dimensions is not meaningful in high-dimensional space, since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in such dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensions. Projected clustering has been proposed recently to deal effectively with high dimensionalities. The objectives of projected clustering algorithms are to find clusters and their relevant dimensions. Instead of projecting the entire dataset onto the same subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
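The distance-concentration effect cited from Beyer et al. (1999) is easy to demonstrate by simulation: for uniformly distributed points, the ratio between the farthest and nearest distances to a query point shrinks toward 1 as dimensionality grows. The point counts and seed below are arbitrary choices for the sketch.

```python
import math
import random

def contrast(dim, n_points=200, seed=1):
    """Farthest/nearest distance ratio from the origin for uniform
    random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return max(dists) / min(dists)

# In 2 dimensions the ratio is large; in 1000 dimensions it is close to 1,
# so full-dimensional nearest-neighbor distances lose their meaning.
low_dim_ratio, high_dim_ratio = contrast(2), contrast(1000)
```

This collapse of distance contrast is exactly why projected clustering searches for cluster-specific subspaces instead of computing similarity over all dimensions.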

