SPDE: A Multi-functional Software for Sequence Processing and Data Extraction

Author(s):  
Dong Xu ◽  
Zhuchou Lu ◽  
Kangming Jin ◽  
Wenmin Qiu ◽  
Guirong Qiao ◽  
...  

Abstract Efficiently extracting information from biological big data can be a huge challenge, especially for people who lack programming skills. We developed Sequence Processing and Data Extraction (SPDE) as an integrated tool for sequence processing and data extraction in gene family and omics analyses. Currently, SPDE has seven modules comprising 100 basic functions that range from single-gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction. All SPDE functions can be used without programming or command lines, and the SPDE interface provides enough prompt information to help users run SPDE without barriers. In addition to its own functions, SPDE also incorporates publicly available analysis tools (such as NCBI-BLAST, HMMER, Primer3 and SAMtools), thereby making SPDE a comprehensive bioinformatics platform for big biological data analysis. Availability SPDE was built using Python and can be run on 32-bit and 64-bit Windows and on macOS. It is open-source software that can be downloaded from https://github.com/simon19891216/[email protected]
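Two of the single-gene operations the abstract lists, reverse complement and translation, can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical helper names, not SPDE's actual implementation, and the codon table is abbreviated for brevity.

```python
# Minimal sketches of two single-gene operations of the kind SPDE offers.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, abbreviated to the codons used in the example.
CODON_TABLE = {"ATG": "M", "TTT": "F", "AAA": "K", "TAA": "*"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq: str) -> str:
    """Translate codons to amino acids, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `reverse_complement("ATGC")` yields `"GCAT"`, and `translate("ATGTTTAAATAA")` yields `"MFK"`.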

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Luca Nanni ◽  
Pietro Pinoli ◽  
Arif Canakoglu ◽  
Stefano Ceri

Abstract Background With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged to build efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments; moreover, they lack proper support for metadata manipulation. Results We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way; outsourced processing can address cloud-based installations of the GMQL engine. Conclusions PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
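The core pattern PyGMQL scales out, selecting datasets by metadata and then operating on their region data, can be illustrated in plain Python. This sketch deliberately does not use the PyGMQL API; the dataset layout, helper names and metadata keys are all illustrative assumptions.

```python
# Plain-Python illustration (not the PyGMQL API) of metadata-driven
# selection followed by region filtering over a collection of datasets.
datasets = [
    {"meta": {"assay": "ChIP-seq", "cell": "K562"},
     "regions": [("chr1", 100, 200), ("chr2", 50, 80)]},
    {"meta": {"assay": "RNA-seq", "cell": "K562"},
     "regions": [("chr1", 300, 400)]},
]

def select(datasets, **meta_filter):
    """Keep only datasets whose metadata matches every key/value pair."""
    return [d for d in datasets
            if all(d["meta"].get(k) == v for k, v in meta_filter.items())]

def filter_regions(datasets, chrom):
    """Collect the regions on the given chromosome across datasets."""
    return [r for d in datasets for r in d["regions"] if r[0] == chrom]

chip = select(datasets, assay="ChIP-seq")
chr1_regions = filter_regions(chip, "chr1")
```

In PyGMQL the same select-then-operate pattern applies implicitly to thousands of files, with execution delegated to Spark, locally or on a remote GMQL service.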


2021 ◽  

Abstract R is an open-source statistical environment modelled after the previously widely used commercial programs S and S-Plus; in addition to powerful statistical analysis tools, it also provides powerful graphics output. Beyond its statistical and graphical capabilities, R is a programming language suitable for medium-sized projects. This book presents a set of studies that collectively cover almost all the R operations needed by beginners analysing their own data, up to perhaps the early years of a PhD. Although the chapters are organized around topics such as graphing, classical statistical tests, statistical modelling, mapping and text parsing, the examples have been chosen largely from real scientific studies at the appropriate level, and within each chapter more R functions are covered than are strictly necessary to get a p-value or a graph. R comes with around a thousand base functions, which are automatically installed when R is downloaded; this book covers those of most relevance to biological data analysis, modelling and graphics. Throughout each chapter, the functions introduced and used in that chapter are summarized in Tool Boxes. The book also shows the user how to adapt and write their own code and functions. A selection of base functions relevant to graphics that are not necessarily covered in the main text are described in Appendix 1, and additional housekeeping functions in Appendix 2.


Author(s):  
Mária Ždímalová ◽  
Tomáš Bohumel ◽  
Katarína Plachá-Gregorovská ◽  
Peter Weismann ◽  
Hisham El Falougy

2019 ◽  
Author(s):  
Alessandro Greco ◽  
Jon Sanchez Valle ◽  
Vera Pancaldi ◽  
Anaïs Baudot ◽  
Emmanuel Barillot ◽  
...  

Abstract Matrix Factorization (MF) is an established paradigm for large-scale biological data analysis with tremendous potential in computational biology. We here challenge MF to depict the molecular bases of epidemiologically described Disease-Disease (DD) relationships. As a use case, we focus on the inverse comorbidity association between Alzheimer's disease (AD) and lung cancer (LC), described as a lower-than-expected probability of developing LC in AD patients. To date, the molecular mechanisms underlying DD relationships remain poorly explained, and their better characterization might offer unprecedented clinical opportunities. To this end, we extend our previously designed MF-based framework for the molecular characterization of DD relationships. Considering AD-LC inverse comorbidity as a case study, we highlight multiple molecular mechanisms, among them the previously identified immune system and mitochondrial metabolism. We then discriminate mechanisms specific to LC from those shared with other cancers through a pan-cancer analysis. Additionally, new candidate molecular players, such as Estrogen Receptor (ER), CDH1 and HDAC, are pinpointed as factors that might underlie the inverse relationship, opening the way to new investigations. Finally, some lung cancer subtype-specific factors are also detected, suggesting the existence of heterogeneity across patients also in the context of inverse comorbidity.
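The MF paradigm the abstract builds on can be illustrated with a toy non-negative matrix factorization using the classical Lee-Seung multiplicative updates. This is a generic sketch of the technique, not the authors' framework; matrix sizes, iteration count and seed are arbitrary choices.

```python
import random

def nmf(V, k, iters=500, seed=0):
    """Factor a non-negative m x n matrix V into W (m x k) and H (k x n)
    by Lee-Seung multiplicative updates minimizing squared error."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        WH = [[sum(W[i][a] * H[a][j] for a in range(k)) for j in range(n)]
              for i in range(m)]
        # H <- H * (W^T V) / (W^T W H), element-wise
        for a in range(k):
            for j in range(n):
                num = sum(W[i][a] * V[i][j] for i in range(m))
                den = sum(W[i][a] * WH[i][j] for i in range(m)) + 1e-9
                H[a][j] *= num / den
        WH = [[sum(W[i][a] * H[a][j] for a in range(k)) for j in range(n)]
              for i in range(m)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        for i in range(m):
            for a in range(k):
                num = sum(H[a][j] * V[i][j] for j in range(n))
                den = sum(H[a][j] * WH[i][j] for j in range(n)) + 1e-9
                W[i][a] *= num / den
    return W, H
```

Applied to a gene-by-sample expression matrix, the columns of W play the role of the molecular components (metagenes) that the authors interpret against the AD-LC comorbidity.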


Author(s):  
Francisco Andres Rivera-Quiroz ◽  
Jeremy Miller

Traditional taxonomic publications have served as a biological data repository, accumulating vast amounts of data on species diversity, geographical and temporal distributions, ecological interactions, and taxonomic relations, among many other types of information. However, the fragmented nature of taxonomic literature has made this data difficult to access and use to its full potential. Current anthropogenic impact on biodiversity demands faster knowledge generation, but also making better use of what we already have; this could help us make better-informed decisions about conservation and resource management. In recent years, several efforts have been made to make taxonomic literature more mobilized and accessible. These include online publications, open access journals, the digitization of old paper literature, and improved availability through specialized online repositories such as the Biodiversity Heritage Library (BHL) and the World Spider Catalog (WSC), among others. Although easy to share, PDF publications still have most of their biodiversity data embedded in strings of text, making them less dynamic and difficult or impossible to read and analyze without a human interpreter. Recently developed tools such as GoldenGATE-Imagine (GGI) allow transforming PDFs into XML files in which taxonomically relevant data are extracted and categorized. These data can then be aggregated in databases such as Plazi TreatmentBank, where they can be re-explored, queried and analyzed. Here we combined several of these cybertaxonomic tools to test the data extraction process for one potential application: the design and planning of an expedition to collect fresh material in the field. We targeted the ground spider Teutamus politus and other related species from the Teutamus group (TG) (Araneae; Liocranidae). These spiders are known from South East Asia and have been cataloged in the family Liocranidae; however, their relations, biology and evolution are still poorly understood.
We marked up 56 publications that contained taxonomic treatments with specimen records for the Liocranidae. Of these publications, 20 contained information on members of the TG. Geographical distributions and occurrences of 90 TG species were analyzed based on 1,309 specimen records. These data were used to design our field collection in a way that allowed us to optimize the collection of adult specimens of our target taxa. The TG genera were most common in Indonesia, Thailand and Malaysia. Of these, Thailand was the second richest but had the most records of T. politus. The seasonal distribution of TG specimens in Thailand suggested June and July as the best time for collecting adults. Based on these analyses, we decided to sample from mid-July to mid-August 2018 in the three Thai provinces that combined the most records of TG species and T. politus. Relying on the results of our literature analyses and using standard collection methods for ground spiders, we captured at least one specimen of every TG genus reported for Thailand. Our one-month expedition captured 231 TG spiders; of these, T. politus was the most abundant species, with 188 specimens (95 adults). By comparison, a total of 196 specimens of the TG and 66 of T. politus had been reported for the same provinces in the last 40 years. Our sampling greatly increased the number of available specimens, especially for the genera Teutamus and Oedignatha. We also extended the known distribution of Oedignatha and Sesieutes within Thailand. These results illustrate the relevance of making biodiversity data contained within taxonomic treatments accessible and reusable. They also exemplify one potential use of taxonomic legacy data: to more efficiently use existing biodiversity data to fill knowledge gaps. A similar approach can be used to study neglected or interesting taxa and geographic areas, generating better biodiversity documentation that could aid in decision making, management and conservation.
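The seasonal analysis described above, tallying adult specimen records per collection month to choose a field season, reduces to a simple aggregation. The records below are illustrative stand-ins, not the actual 1,309-record dataset, and the field names are assumptions.

```python
from collections import Counter

# Hypothetical specimen records extracted from marked-up treatments.
records = [
    {"species": "Teutamus politus", "month": 6, "adults": 3},
    {"species": "Teutamus politus", "month": 7, "adults": 5},
    {"species": "Oedignatha sp.", "month": 7, "adults": 2},
    {"species": "Teutamus politus", "month": 12, "adults": 0},
]

def adults_per_month(records):
    """Sum adult counts per collection month across all records."""
    totals = Counter()
    for r in records:
        totals[r["month"]] += r["adults"]
    return totals

# The month with the most adult records suggests the best sampling window.
best_month = adults_per_month(records).most_common(1)[0][0]
```

On the real TreatmentBank extraction, the same tally is what pointed the authors to June-July as the peak for adult TG specimens in Thailand.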


Author(s):  
Calin Ciufudean

Modern medical devices involve information technology (IT) based on electronic structures for sensing and gathering data and signals, for their transmission, and for their processing, in order to help medical staff diagnose, treat and monitor the evolution of patients. Focusing on biological signal processing, we note that the numerical processing of information delivered by sensors is significantly important for the fair and optimal design and manufacture of modern medical devices. In this approach we consider fuzzy sets as a formalism for the analysis of biological signal processing, and we accomplish this goal by developing fuzzy operators for filtering the noise of biological signal measurements. We exemplify this approach on neurological measurements performed with an electroencephalograph (EEG).
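One common shape for such a fuzzy noise-filtering operator is a membership-weighted local mean: samples close to the local median get membership near 1, while outliers are discounted. This is a minimal sketch of the general idea, assuming a Gaussian membership function and a window radius of 1; it is not the chapter's specific operator.

```python
import math

def gaussian_membership(d, width=1.0):
    """Fuzzy membership: deviations near zero map to weights near 1."""
    return math.exp(-(d / width) ** 2)

def fuzzy_filter(signal, radius=1, width=1.0):
    """Replace each sample by a membership-weighted mean of its window,
    where membership is computed from the deviation to the window median."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        med = sorted(window)[len(window) // 2]
        weights = [gaussian_membership(x - med, width) for x in window]
        out.append(sum(w * x for w, x in zip(weights, window)) / sum(weights))
    return out
```

On an EEG trace, a transient artifact such as the spike in `[0, 0, 10, 0, 0]` is strongly suppressed, while flat segments pass through unchanged.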


Data Mining ◽  
2013 ◽  
pp. 1019-1042
Author(s):  
Pratibha Rani ◽  
Vikram Pudi

The rapid progress of computational biology, biotechnology, and bioinformatics in the last two decades has led to the accumulation of tremendous amounts of biological data that demand in-depth analysis. Data mining methods have been applied successfully to analyze this data. An important problem in biological data analysis is to classify a newly discovered sequence, such as a protein or DNA sequence, based on its important features and functions, using the collection of available sequences. In this chapter, we study this problem and present two Bayesian classifiers, RBNBC (Rani & Pudi, 2008a) and REBMEC (Rani & Pudi, 2008c). The algorithms used in these classifiers incorporate repeated occurrences of subsequences within each sequence (Rani, 2008). Specifically, the Repeat Based Naive Bayes Classifier (RBNBC) uses a novel formulation of Naive Bayes, and the second classifier, the Repeat Based Maximum Entropy Classifier (REBMEC), uses a novel framework based on the classical Generalized Iterative Scaling (GIS) algorithm.
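The core idea of counting repeated subsequence occurrences and feeding them into a Naive Bayes model can be sketched with overlapping k-mer counts and Laplace smoothing. This is a generic illustration of the approach, not the RBNBC formulation from the chapter; the k-mer length and smoothing scheme are assumptions.

```python
import math
from collections import Counter

def kmer_counts(seq, k=2):
    """Count every overlapping k-mer, keeping repeated occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(labelled_seqs, k=2):
    """Aggregate k-mer counts per class from (label, sequence) pairs."""
    model = {}
    for label, seq in labelled_seqs:
        model.setdefault(label, Counter()).update(kmer_counts(seq, k))
    return model

def classify(model, seq, k=2):
    """Pick the class maximizing the Laplace-smoothed log-likelihood,
    weighting each k-mer by its repeat count in the query sequence."""
    vocab = set().union(*model.values())
    best, best_score = None, -math.inf
    for label, counts in model.items():
        total = sum(counts.values())
        score = sum(n * math.log((counts[km] + 1) / (total + len(vocab)))
                    for km, n in kmer_counts(seq, k).items())
        if score > best_score:
            best, best_score = label, score
    return best
```

Trained on an AT-rich and a GC-rich class, the classifier assigns `"ATAT"` to the AT-rich class; RBNBC refines this scheme with its own treatment of repeat probabilities.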


Author(s):  
Ping Deng ◽  
Qingkai Ma ◽  
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on full dimensions. As the dimensionality of the space grows, these algorithms lose their efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances based on full dimensions is not meaningful in high-dimensional space, since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in such dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensions. Projected clustering has been proposed recently to deal effectively with high dimensionalities. The objectives of projected clustering algorithms are to find clusters and their relevant dimensions. Instead of projecting the entire dataset onto the same subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
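The distance-concentration effect cited from Beyer et al. (1999) is easy to demonstrate by simulation: for uniformly distributed points, the ratio between the farthest and nearest distances to a query point shrinks toward 1 as dimensionality grows. The point counts and seed below are arbitrary choices for the sketch.

```python
import math
import random

def contrast(dim, n_points=200, seed=1):
    """Farthest/nearest distance ratio from the origin for uniform
    random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return max(dists) / min(dists)

# In 2 dimensions the ratio is large; in 1000 dimensions it is close to 1,
# so full-dimensional nearest-neighbor distances lose their meaning.
low_dim_ratio, high_dim_ratio = contrast(2), contrast(1000)
```

This collapse of distance contrast is exactly why projected clustering searches for cluster-specific subspaces instead of computing similarity over all dimensions.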

