SpiderSeqR: an R package for crawling the web of high-throughput multi-omic data repositories for data-sets and annotation

2020 ◽  
Author(s):  
Anna M. Sozanska ◽  
Charles Fletcher ◽  
Dóra Bihary ◽  
Shamith A. Samarajiwa

More than three decades ago, the microarray revolution brought high-throughput data generation capability to biology and medicine. Subsequently, the emergence of massively parallel sequencing technologies led to many big-data initiatives such as the Human Genome Project and the Encyclopedia of DNA Elements (ENCODE) project. These, in combination with cheaper, faster massively parallel DNA sequencing capabilities, have democratised multi-omic (genomic, transcriptomic, translatomic and epigenomic) data generation, leading to a data deluge in biomedicine. While some of these data-sets are trapped in inaccessible silos, the vast majority are stored in public data resources and controlled-access data repositories, enabling their wider use (or misuse). Currently, most peer-reviewed publications require the data-set associated with a study to be deposited in one of these public data repositories. However, clunky, difficult-to-use interfaces and subpar or incomplete annotation prevent the discovery, searching and filtering of these multi-omic data and hinder their re-purposing in other use cases. In addition, the proliferation of a multitude of different data repositories, with partially redundant storage of similar data, is yet another obstacle to their continued usefulness. Similarly, interfaces where annotation is spread across multiple web pages, the use of accession identifiers with ambiguous or multiple interpretations, and a lack of good curation make these data-sets difficult to use. We have produced SpiderSeqR, an R package whose main features include integration between the NCBI GEO and SRA databases, enabling a unified search of SRA and GEO data-sets and their associated annotations, conversion between database accessions, convenient filtering of results, and saving of past queries for future use. All of the above features aim to promote data reuse, facilitating new discoveries and maximising the potential of existing data-sets. Availability: https://github.com/ss-lab-cancerunit/SpiderSeqR
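To make the described features concrete, a minimal R sketch is shown below. The function names (startSpiderSeqR, searchAnywhere, convertAccession), their arguments, the search term and the accession are assumptions based on the package README rather than a verified API.

```r
# Hypothetical usage sketch of SpiderSeqR; function names and arguments are
# assumed from the package README and may differ from the released API.
library(SpiderSeqR)

# Set up local copies of the GEO/SRA metadata databases in the working directory
startSpiderSeqR(path = getwd())

# Unified search across SRA and GEO records and their annotations
# (illustrative query)
res <- searchAnywhere("STAT1 ChIP-seq")

# Convert between database accessions (placeholder accession shown)
convertAccession("GSM0000001")
```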

2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure that those results can be reproduced and compared with results obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following in-demand functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licencing issues and user rights. To demonstrate the proposed functionality, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at https://rdata.4spam.group to facilitate the understanding of this study.


2019 ◽  
Vol 2 (2) ◽  
pp. 169-187 ◽  
Author(s):  
Ruben C. Arslan

Data documentation in psychology lags behind not only many other disciplines, but also basic standards of usefulness. Psychological scientists often prefer to invest the time and effort that would be necessary to document existing data well in other duties, such as writing and collecting more data. Codebooks therefore tend to be unstandardized and stored in proprietary formats, and they are rarely properly indexed in search engines. This means that rich data sets are sometimes used only once—by their creators—and left to disappear into oblivion. Even if they can find an existing data set, researchers are unlikely to publish analyses based on it if they cannot be confident that they understand it well enough. My codebook package makes it easier to generate rich metadata in human- and machine-readable codebooks. It uses metadata from existing sources and automates some tedious tasks, such as documenting psychological scales and reliabilities, summarizing descriptive statistics, and identifying patterns of missingness. The codebook R package and Web app make it possible to generate a rich codebook in a few minutes and just three clicks. Over time, its use could lead to psychological data becoming findable, accessible, interoperable, and reusable, thereby reducing research waste and benefiting both its users and the scientific community as a whole.
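As an illustration of the workflow the abstract describes, here is a minimal R sketch; new_codebook_rmd() and codebook() are recalled from the codebook package documentation and should be treated as assumptions about the exact API, and the data file name is a placeholder.

```r
# Sketch of generating a codebook with the 'codebook' R package.
# Function names are assumed from the package documentation and may differ
# between versions.
library(codebook)

# Create an RMarkdown template that renders a full codebook when knitted
# (the "few minutes and three clicks" workflow mentioned in the abstract).
new_codebook_rmd()

# Inside the generated .Rmd, the central call is along these lines:
#   codebook_data <- rio::import("my_survey.sav")   # hypothetical data file
#   codebook(codebook_data)
# which summarises descriptive statistics, scales and reliabilities, and
# missingness patterns, and embeds machine-readable metadata in the report.
```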


2020 ◽  
Author(s):  
Oleksii Nikolaienko ◽  
Per Eystein Lønning ◽  
Stian Knappskog

Motivation: With recent advances in the field of epigenetics, the focus is widening from large and frequent disease- or phenotype-related methylation signatures to rare alterations transmitted mitotically or transgenerationally (constitutional epimutations). Emerging evidence indicates that such constitutional alterations, albeit occurring at a low mosaic level, may confer risk of disease later in life. Given their inherently low incidence rate and mosaic nature, there is a need for bioinformatic tools specifically designed to analyse such events. Results: We have developed a method (ramr) to identify aberrantly methylated DNA regions (AMRs). ramr can be applied to methylation data obtained by array or next-generation sequencing techniques to discover AMRs associated with elevated risk of cancer as well as other diseases. We assessed the accuracy and performance of ramr and confirmed its applicability for the analysis of large public data sets. Using ramr we identified aberrantly methylated regions that are known to be, or may potentially be, associated with the development of colorectal cancer, and provided functional annotation of AMRs that arise at early developmental stages. Availability and implementation: The R package is freely available at https://github.com/BBCG/ramr
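A minimal R sketch of calling ramr follows; getAMR() and its argument names and defaults are recalled from the package README, so treat them as assumptions, and the input object is only described, not constructed.

```r
# Hypothetical sketch of identifying aberrantly methylated regions with ramr.
# getAMR() and its argument names are assumptions and should be checked
# against the package documentation.
library(ramr)

# beta.granges: assumed input format, a GRanges object of per-CpG beta values
# with one metadata column per sample (e.g. prepared from minfi::getBeta()).
amrs <- getAMR(beta.granges,
               ramr.method = "beta",  # assumed: model the beta-value distribution
               qval.cutoff = 1e-3)    # assumed significance threshold

amrs  # GRanges of candidate AMRs, annotated with the affected sample
```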


2011 ◽  
Vol 16 (4) ◽  
pp. 415-426 ◽  
Author(s):  
Stephan C. Schürer ◽  
Uma Vempati ◽  
Robin Smith ◽  
Mark Southern ◽  
Vance Lemmon

High-throughput screening data repositories, such as PubChem, represent valuable resources for the development of small-molecule chemical probes and can serve as entry points for drug discovery programs. Although the loose data format offered by PubChem allows for great flexibility, important annotations, such as the assay format and technologies employed, are not explicitly indexed. The authors have previously developed a BioAssay Ontology (BAO) and curated more than 350 assays with standardized BAO terms. Here they describe the use of BAO annotations to analyze a large set of assays that employ luciferase- and β-lactamase–based technologies. They identified promiscuous chemotypes pertaining to different subcategories of assays and specific mechanisms by which these chemotypes interfere in reporter gene assays. Results show that the data in PubChem can be used to identify promiscuous compounds that interfere nonspecifically with particular technologies. Furthermore, they show that BAO is a valuable toolset for the identification of related assays and for the systematic generation of insights that are beyond the scope of individual assays or screening campaigns.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 672 ◽  
Author(s):  
Ben Busby ◽  
Matthew Lesko ◽  
Lisa Federer ◽  

In genomics, bioinformatics and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types.  The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps.  The only two rules for the NCBI-assisted hackathons run so far are that 1) data either must be housed in public data repositories or be deposited to such repositories shortly after the hackathon’s conclusion, and 2) all software comprising the final pipeline must be open-source or open-use.  Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event.  Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published at https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.


2018 ◽  
Author(s):  
Carlos Martínez-Mira ◽  
Ana Conesa ◽  
Sonia Tarazona

Motivation: As new integrative methodologies are being developed to analyse multi-omic experiments, validation strategies are required for benchmarking. In silico approaches such as simulated data are popular as they are fast and cheap. However, few tools are available for creating synthetic multi-omic data sets. Results: MOSim is a new R package for easily simulating multi-omic experiments consisting of gene expression data, other regulatory omics and the regulatory relationships between them. MOSim supports different experimental designs, including time-series data. Availability: The package is freely available under the GPL-3 license from the Bitbucket repository (https://bitbucket.org/ConesaLab/mosim/). Contact: [email protected]. Supplementary information: Supplementary material is available at bioRxiv online.
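A minimal R sketch of simulating a time-course multi-omic experiment is shown below; mosim(), omicResults() and omicSettings(), along with their arguments, are recalled from the package vignette and should be treated as assumptions.

```r
# Hypothetical sketch of a MOSim simulation; function and argument names are
# assumed from the package vignette and may not match the current release.
library(MOSim)

sim <- mosim(omics = c("RNA-seq", "miRNA-seq"),  # regulated + regulator omic
             times = c(0, 12, 24),               # time-series design
             numberReps = 3)                     # replicates per time point

rna_counts <- omicResults(sim, "RNA-seq")  # simulated count matrix
settings   <- omicSettings(sim)            # regulator-target relationships used
```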


2015 ◽  
Vol 24 (02) ◽  
pp. 1540008 ◽  
Author(s):  
Albert Weichselbraun ◽  
Daniel Streiff ◽  
Arno Scharl

Linking named entities to structured knowledge sources paves the way for state-of-the-art Web intelligence applications which assign sentiment to the correct entities, identify trends, and reveal relations between organizations, persons and products. For this purpose, this paper introduces Recognyze, a named entity linking component that uses background knowledge obtained from linked data repositories, and outlines the process of transforming heterogeneous data silos within an organization into a linked enterprise data repository that draws upon popular linked open data vocabularies to foster interoperability with public data sets. The presented examples use comprehensive real-world data sets from Orell Füssli Business Information, Switzerland's largest business information provider. The linked data repository created from these data sets comprises more than nine million triples on companies, their contact information, key people, products and brands. We identify the major challenges of tapping into such sources for named entity linking, and describe the data pre-processing techniques required to use and integrate such data sets, with a special focus on disambiguation and ranking algorithms. Finally, we conduct a comprehensive evaluation based on business news from the Neue Zürcher Zeitung and AWP Financial News to illustrate how these techniques improve the performance of the Recognyze named entity linking component.


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6695 ◽  
Author(s):  
Andrea Garretto ◽  
Thomas Hatzopoulos ◽  
Catherine Putonti

Metagenomics has enabled sequencing of viral communities from a myriad of different environments. Viral metagenomic studies routinely uncover sequences with no recognizable homology to known coding regions or genomes. Nevertheless, complete viral genomes have been constructed directly from complex community metagenomes, often through tedious manual curation. To address this, we developed the software tool virMine to identify viral genomes from raw reads representative of viral or mixed (viral and bacterial) communities. virMine automates sequence read quality control, assembly, and annotation. Researchers can easily refine their search for a specific study system and/or feature(s) of interest. In contrast to other viral genome detection tools that often rely on the recognition of viral signature sequences, virMine is not restricted by the insufficient representation of viral diversity in public data repositories. Rather, viral genomes are identified through an iterative approach, first omitting non-viral sequences. Thus, both relatives of previously characterized viruses and novel species can be detected, including both eukaryotic viruses and bacteriophages. Here we present virMine and its analysis of synthetic communities as well as metagenomic data sets from three distinctly different environments: the gut microbiota, the urinary microbiota, and freshwater viromes. Several new viral genomes were identified and annotated, thus contributing to our understanding of viral genetic diversity in these three environments.


2021 ◽  
Author(s):  
Renato Augusto Ferreira Lima ◽  
Andrea Sanchez-Tapia ◽  
Sara R. Mortara ◽  
Hans ter Steege ◽  
Marinez F. Siqueira

Species records from biological collections are becoming increasingly available online. This unprecedented availability of records has largely supported recent studies in taxonomy, biogeography, macro-ecology and biodiversity conservation. Biological collections vary in their documentation and notation standards, which have changed through time. For different reasons, neither collections nor data repositories perform the editing, formatting and standardization of the data, leaving these tasks to the final users of the species records (e.g. taxonomists, ecologists and conservationists). These tasks are challenging, particularly when working with millions of records from hundreds of biological collections. To help collection curators and final users perform those tasks, we introduce plantR, an open-source package that provides a comprehensive toolbox to manage species records from biological collections. The package is accompanied by a proposed reproducible workflow for managing this type of data in taxonomy, ecology and biodiversity conservation. It is implemented in R and designed to handle relatively large data sets as fast as possible. Although initially designed to handle plant species records, many of the plantR features also apply to other groups of organisms, given that the data structure is similar. The plantR workflow includes tools to (1) download records from different data repositories, (2) standardize typical fields associated with species records, (3) validate the locality, geographical coordinates, taxonomic nomenclature and species identifications, including the retrieval of duplicates across collections, and (4) summarize and export records, including the construction of species checklists with vouchers. Other R packages provide tools to tackle some of the workflow steps described above, but in addition to its new features and resources for data editing and validation, the greatest strength of plantR is that it provides a comprehensive and user-friendly workflow in a single environment, performing all tasks from data retrieval to export. Thus, plantR can help researchers to better assess data quality and avoid data leakage in a wide variety of studies using species records.
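The four workflow steps can be pictured with the short R sketch below; all function names are assumptions recalled from the plantR documentation, and the species name is a placeholder, so the exact calls may differ.

```r
# Hypothetical sketch of the plantR workflow; function names are assumptions
# based on the package documentation and may differ from the released API.
library(plantR)

# (1) download records from a public repository (placeholder species name)
occs <- rgbif2(species = "Euterpe edulis")

# (2) standardize typical fields (collector, locality, coordinates, taxonomy)
occs <- formatDwc(gbif_data = occs)
occs <- formatOcc(occs)
occs <- formatLoc(occs)
occs <- formatCoord(occs)
occs <- formatTax(occs)

# (3) validate localities, coordinates, taxonomy and duplicates across collections
occs <- validateLoc(occs)
occs <- validateCoord(occs)
occs <- validateTax(occs)
occs <- validateDup(occs)

# (4) summarize and export, e.g. a species checklist with vouchers
summaryData(occs)
checkList(occs)
```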

