CosinorPy: a python package for cosinor-based rhythmometry

Abstract Background Even though several computational methods for rhythmicity detection and analysis of biological data have been proposed in recent years, classical trigonometric regression based on cosinor still has several advantages over these methods and is still widely used. Different software packages for cosinor-based rhythmometry exist, but lack certain functionalities and require data in different, non-unified input formats. Results We present CosinorPy, a Python implementation of cosinor-based methods for rhythmicity detection and analysis. CosinorPy merges and extends the functionalities of existing cosinor packages. It supports the analysis of rhythmic data using single- or multi-component cosinor models, automatic selection of the best model, population-mean cosinor regression, and differential rhythmicity assessment. Moreover, it implements functions that can be used in a design of experiments, a synthetic data generator, and import and export of data in different formats. Conclusion CosinorPy is an easy-to-use Python package for straightforward detection and analysis of rhythmicity requiring minimal statistical knowledge, and produces publication-ready figures. Its code, examples, and documentation are available to download from https://github.com/mmoskon/CosinorPy. CosinorPy can be installed manually or by using pip, the package manager for Python packages. The implementation reported in this paper corresponds to the software release v1.1.

Download Full-text

G-Tric: generating three-way synthetic datasets with triclustering solutions

BMC Bioinformatics ◽

10.1186/s12859-020-03925-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

João Lobo ◽

Rui Henriques ◽

Sara C. Madeira

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Three Dimensions ◽

Additional Advantage ◽

Urban Dynamics ◽

Data Generator ◽

Real World Datasets ◽

Synthetic Datasets

Abstract Background Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations $$\times$$ × features $$\times$$ × contexts). With increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground-truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real 3-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled, by defining the amount of missing, noise or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties was generated and made available, highlighting G-Tric’s potential to advance triclustering state-of-the-art by easing the process of evaluating the quality of new triclustering approaches.

Download Full-text

UniBioDicts: Unified access to Biological Dictionaries

Bioinformatics ◽

10.1093/bioinformatics/btaa1065 ◽

2020 ◽

Author(s):

John Zobolas ◽

Vasundra Touré ◽

Martin Kuiper ◽

Steven Vercruysse

Keyword(s):

User Interface ◽

Life Science ◽

Biological Data ◽

Supplementary Information ◽

Supplementary Data ◽

Query Interface ◽

Controlled Vocabularies ◽

Search String ◽

Software Packages ◽

The Right

Abstract Summary We present a set of software packages that provide uniform access to diverse biological vocabulary resources that are instrumental for current biocuration efforts and tools. The Unified Biological Dictionaries (UniBioDicts or UBDs) provide a single query-interface for accessing the online API services of leading biological data providers. Given a search string, UBDs return a list of matching term, identifier and metadata units from databases (e.g. UniProt), controlled vocabularies (e.g. PSI-MI) and ontologies (e.g. GO, via BioPortal). This functionality can be connected to input fields (user-interface components) that offer autocomplete lookup for these dictionaries. UBDs create a unified gateway for accessing life science concepts, helping curators find annotation terms across resources (based on descriptive metadata and unambiguous identifiers), and helping data users search and retrieve the right query terms. Availability and implementation The UBDs are available through npm and the code is available in the GitHub organisation UniBioDicts (https://github.com/UniBioDicts) under the Affero GPL license. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Practical R for biologists: an introduction

10.1079/9781789245349.0000 ◽

2021 ◽

Keyword(s):

Statistical Tests ◽

Statistical Modelling ◽

Biological Data ◽

Early Years ◽

Main Text ◽

Biological Data Analysis ◽

Base Functions ◽

Almost All ◽

Selection Of ◽

R Functions

Abstract R is an open-source statistical environment modelled after the previously widely used commercial programs S and S-Plus, but in addition to powerful statistical analysis tools, it also provides powerful graphics outputs. In addition to its statistical and graphical capabilities, R is a programming language suitable for medium-sized projects. This book presents a set of studies that collectively represent almost all the R operations that beginners, analysing their own data up to perhaps the early years of doing a PhD, need. Although the chapters are organized around topics such as graphing, classical statistical tests, statistical modelling, mapping and text parsing, examples have been chosen based largely on real scientific studies at the appropriate level and within each the use of more R functions is nearly always covered than are simply necessary just to get a p-value or a graph. R comes with around a thousand base functions which are automatically installed when R is downloaded. This book covers the use of those of most relevance to biological data analysis, modelling and graphics. Throughout each chapter, the functions introduced and used in that chapter are summarized in Tool Boxes. The book also shows the user how to adapt and write their own code and functions. A selection of base functions relevant to graphics that are not necessarily covered in the main text are described in Appendix 1, and additional housekeeping functions in Appendix 2.

Download Full-text

Selection of Optimal Method of Software Release Time Incorporating Imperfect Debugging

2018 4th International Conference on Computational Intelligence & Communication Technology (CICT) ◽

10.1109/ciact.2018.8480133 ◽

2018 ◽

Cited By ~ 2

Author(s):

Shubhra Gautam ◽

Deepak Kumar ◽

L.M. Patnaik

Keyword(s):

Release Time ◽

Optimal Method ◽

Imperfect Debugging ◽

Software Release ◽

Selection Of

Download Full-text

BSMA-Gen: A Parallel Synthetic Data Generator for Social Media Timeline Structures

Database Systems for Advanced Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-319-05813-9_40 ◽

2014 ◽

pp. 539-542

Author(s):

Chengcheng Yu ◽

Fan Xia ◽

Qunyan Zhang ◽

Haixin Ma ◽

Weining Qian ◽

...

Keyword(s):

Social Media ◽

Synthetic Data ◽

Data Generator

Download Full-text

A multi-site cutting device implements efficiently the divide-and-conquer strategy in tumor sampling

F1000Research ◽

10.12688/f1000research.9091.1 ◽

2016 ◽

Vol 5 ◽

pp. 1587 ◽

Cited By ~ 2

Author(s):

Jose I. Lopez ◽

Jesus M. Cortes

Keyword(s):

Synthetic Data ◽

Daily Practice ◽

Divide And Conquer ◽

Intratumor Heterogeneity ◽

Pathology Laboratory ◽

Simple Method ◽

French Fry ◽

Histological Processing ◽

Tumor Sampling ◽

Selection Of

We recently showed that in order to detect intra-tumor heterogeneity a Divide-and-Conquer (DAC) strategy of tumor sampling outperforms current routine protocols. This paper is a continuation of this work, but here we focus on DAC implementation in the Pathology Laboratory. In particular, we describe a new simple method that makes use of a cutting grid device and is applied to clear cell renal cell carcinomas for DAC implementation. This method assures a thorough sampling of large surgical specimens, facilitates the demonstration of intratumor heterogeneity, and saves time to pathologists in the daily practice. The method involves the following steps: 1. Thin slicing of the tumor (by hand or machine), 2. Application of a cutting grid to the slices (e.g., a French fry cutter), resulting in multiple tissue cubes with fixed position within the slice, 3. Selection of tissue cubes for analysis, and finally, 4. Inclusion of selected cubes into a cassette for histological processing (with about eight tissue fragments within each cassette). Thus, using our approach in a 10 cm in-diameter-tumor we generate 80 tumor tissue fragments placed in 10 cassettes and, notably, in a tenth of time. Eighty samples obtained across all the regions of the tumor will assure a much higher performance in detecting intratumor heterogeneity, as proved recently with synthetic data.

Download Full-text

Denoising large-scale biological data using network filters

10.21203/rs.3.rs-66071/v2 ◽

2021 ◽

Author(s):

Andrew J Kavran ◽

Aaron Clauset

Keyword(s):

Large Scale ◽

Synthetic Data ◽

Interaction Network ◽

Learning Task ◽

Biological Data ◽

Data Sets ◽

Proteomics Data ◽

Life History Variation ◽

Wide Range ◽

Underlying Processes

Abstract Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation.Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “ﬁltered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network ﬁlter may be applied to an entire system, or the system may be ﬁrst decomposed into distinct modules and a diﬀerent ﬁlter applied to each. Applied to synthetic data with known network structure and signal, network ﬁlters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network ﬁltering prior to training increases accuracy up to 43% compared to using unﬁltered data.Conclusions: Network ﬁlters are a general way to denoise biological data and can account for both correlation and anti-correlation between diﬀerent measurements. Furthermore, we ﬁnd that partitioning a network prior to ﬁltering can signiﬁcantly reduce errors in networks with heterogenous data and correlation patterns, and this approach outperforms existing diﬀusion based methods. Our results on proteomics data indicate the broad potential utility of network ﬁlters to applications in systems biology.

Download Full-text

A Partial Optimization Approach for Privacy Preserving Frequent Itemset Mining

International Journal of Computational Models and Algorithms in Medicine ◽

10.4018/jcmam.2010072002 ◽

2010 ◽

Vol 1 (1) ◽

pp. 19-33

Author(s):

Shibnath Mukherjee ◽

Aryya Gangopadhyay ◽

Zhiyuan Chen

Keyword(s):

Low Cost ◽

Synthetic Data ◽

Frequent Itemset ◽

Optimization Approach ◽

Data Generator ◽

Hidden Cost ◽

Potential Benefits ◽

The Difference ◽

The Given ◽

Optimal Set

While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, would result in hiding sensitive patterns while reducing the accidental hiding of legitimate patterns and the damage done to the database as much as possible. Their methodology allows the user to adjust their preference on the weights assigned to benefits in terms of the number of restrictive patterns hidden, cost in terms of the number of legitimate patterns hidden, and damage to the database in terms of the difference between marginal frequencies of items for the original and sanitized databases. Most approaches in solving the given problem found in literature are all-heuristic based without formal treatment for optimality. While in a few work, ILP has been used previously as a formal optimization approach, the novelty of this method is the extremely low cost-complexity model in contrast to the others. They implement our methodology in C and C++ and ran several experiments with synthetic data generated with the IBM synthetic data generator. The experiments show excellent results when compared to those in the literature.

Download Full-text