The Molecular Data Organization for Publication (MDOP) R package to aid the upload of data to shared databases

2020
Vol 8
Author(s):  
Robert Young ◽  
Jiaojia Yu ◽  
Marie-José Cote ◽  
Robert Hanner

Molecular identification methods, such as DNA barcoding, rely on centralized databases populated with morphologically identified individuals and their referential nucleotide sequence records. As molecular identification approaches have expanded into fields such as food fraud, environmental surveys, and border surveillance, the need for diverse international data sets has grown. Although central data repositories, like the Barcode of Life Data Systems (BOLD), provide workarounds for formatting data for upload, these workarounds can be taxing on researchers with few resources and limited funding. To address these concerns, we present the Molecular Data Organization for Publication (MDOP) R package to assist researchers in uploading data to public databases. To illustrate the use of these scripts, we use the BOLD system as an example. The main intent of this work is to ease the movement of data from academic, governmental, and other institutional computer systems to public repositories, where they can better contribute to the global DNA barcoding initiative and other global molecular data efforts.
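To give a concrete sense of the data-movement task MDOP addresses, here is a minimal R sketch of reshaping institutional records into a BOLD-style specimen sheet. The column names and the helper function are illustrative assumptions, not the actual MDOP API.

```r
# Illustrative sketch only: map local column names onto the kind of fields a
# BOLD specimen sheet expects. The columns and helper are assumptions, not
# the MDOP package's API.
records <- data.frame(
  field_id  = c("FG001", "FG002"),
  species   = c("Perca flavescens", "Sander vitreus"),
  collector = c("R. Young", "J. Yu"),
  stringsAsFactors = FALSE
)

to_bold_specimen <- function(df) {
  data.frame(
    `Sample ID`  = df$field_id,
    `Species`    = df$species,
    `Collectors` = df$collector,
    check.names = FALSE
  )
}

# Write a CSV ready to be checked against the repository's upload template
write.csv(to_bold_specimen(records), "bold_specimen_sheet.csv", row.names = FALSE)
```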

2020
Author(s):  
Anna M. Sozanska ◽  
Charles Fletcher ◽  
Dóra Bihary ◽  
Shamith A. Samarajiwa

More than three decades ago, the microarray revolution brought high-throughput data generation capability to biology and medicine. Subsequently, the emergence of massively parallel sequencing technologies led to many big-data initiatives such as the Human Genome Project and the Encyclopedia of DNA Elements (ENCODE) project. These, in combination with cheaper, faster massively parallel DNA sequencing, have democratised multi-omic (genomic, transcriptomic, translatomic and epigenomic) data generation, leading to a data deluge in biomedicine. While some of these data sets are trapped in inaccessible silos, the vast majority are stored in public data resources and controlled-access data repositories, enabling their wider use (or misuse). Currently, most peer-reviewed publications require that the data set associated with a study be deposited in one of these public data repositories. However, clunky, difficult-to-use interfaces and subpar or incomplete annotation hinder the discovery, search and filtering of these multi-omic data and their re-purposing in other use cases. In addition, the proliferation of a multitude of different data repositories, with partially redundant storage of similar data, is yet another obstacle to their continued usefulness. Similarly, interfaces where annotation is spread across multiple web pages, accession identifiers with ambiguous and multiple interpretations, and a lack of good curation make these data sets difficult to use. We have produced SpiderSeqR, an R package whose main features include integration between the NCBI GEO and SRA databases, enabling a unified search of SRA and GEO data sets and associated annotations, conversion between database accessions, convenient filtering of results, and saving past queries for future use. All of the above features aim to promote data reuse, facilitating new discoveries and maximising the potential of existing data sets.
Availability: https://github.com/ss-lab-cancerunit/SpiderSeqR
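A sketch of the kind of unified query SpiderSeqR is described as enabling follows. The function names mirror the features stated above, but their exact signatures are assumptions, not verified against the released package.

```r
# Sketch under assumptions: function names follow the features described in
# the abstract, but signatures may differ from the actual SpiderSeqR API.
library(SpiderSeqR)

# Set up local copies of the GEO/SRA metadata databases
startSpiderSeqR(path = tempdir())

# One query searched across both SRA and GEO annotation at once
res <- searchAnywhere("ChIP-seq AND H3K27ac AND Homo sapiens")

# Convert between database accessions (e.g. a GEO sample to its SRA runs)
convertAccession("GSM1173367")
```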


2021
Author(s):  
Renato Augusto Ferreira Lima ◽  
Andrea Sanchez-Tapia ◽  
Sara R. Mortara ◽  
Hans ter Steege ◽  
Marinez F. Siqueira

Species records from biological collections are becoming increasingly available online. This unprecedented availability of records has largely supported recent studies in taxonomy, biogeography, macroecology, and biodiversity conservation. Biological collections vary in their documentation and notation standards, which have changed through time. For different reasons, neither collections nor data repositories perform the editing, formatting and standardization of the data, leaving these tasks to the final users of the species records (e.g. taxonomists, ecologists and conservationists). These tasks are challenging, particularly when working with millions of records from hundreds of biological collections. To help collection curators and final users perform those tasks, we introduce plantR, an open-source package that provides a comprehensive toolbox for managing species records from biological collections. The package is accompanied by a proposed reproducible workflow for managing this type of data in taxonomy, ecology and biodiversity conservation. It is implemented in R and designed to handle relatively large data sets as fast as possible. Although initially designed to handle plant species records, many plantR features also apply to other groups of organisms, provided that the data structure is similar. The plantR workflow includes tools to (1) download records from different data repositories, (2) standardize typical fields associated with species records, (3) validate the locality, geographical coordinates, taxonomic nomenclature and species identifications, including the retrieval of duplicates across collections, and (4) summarize and export records, including the construction of species checklists with vouchers. Other R packages provide tools to tackle some of the workflow steps described above; but beyond its new data editing and validation features, the greatest strength of plantR is providing a comprehensive and user-friendly workflow in a single environment, performing all tasks from data retrieval to export. Thus, plantR can help researchers better assess data quality and avoid data leakage in a wide variety of studies using species records.
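The four-step workflow reads naturally as a pipeline. Below is a minimal sketch under assumptions: the function names mirror the steps described above but are not a verified rendering of the plantR API.

```r
# Hypothetical sketch of the four-step workflow described above; function
# names are assumptions, not a verified plantR API.
library(plantR)

occs <- rgbif2(species = "Euterpe edulis")  # (1) download records (here, from GBIF)
occs <- formatDwc(gbif_data = occs)         # (2) standardize typical fields
occs <- validateLoc(occs)                   # (3) validate localities...
occs <- validateCoord(occs)                 #     ...geographical coordinates...
occs <- validateTax(occs)                   #     ...and taxonomy/identifications
summ <- summaryData(occs)                   # (4) summarize before export
```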


2020
Vol 10 (1)
Author(s):  
Javier Fernández-López ◽  
M. Teresa Telleria ◽  
Margarita Dueñas ◽  
Mara Laguna-Castro ◽  
Klaus Schliep ◽  
...  

The use of different sources of evidence has been recommended for species delimitation analyses that aim to solve taxonomic issues. In this study, we use a maximum likelihood framework to combine morphological and molecular traits to study the case of Xylodon australis (Hymenochaetales, Basidiomycota), using the locate.yeti function from the phytools R package. Xylodon australis has been considered a single species distributed across Australia, New Zealand and Patagonia. Multi-locus phylogenetic analyses were conducted to unmask the actual diversity within X. australis, as well as its relationships with its closest relatives. To assess the taxonomic position of each clade, the locate.yeti function was used to place the X. australis type material, for which no molecular data were available, in a molecular phylogeny using continuous morphological traits. Two different species were distinguished under the X. australis name, one from Australia–New Zealand and the other from Patagonia. In addition, a close relationship with Xylodon lenis, a species from Southeast Asia, was confirmed for the Patagonian clade. We discuss the implications of our results for the biogeographical history of this genus and evaluate the potential of this method for use with historical collections for which molecular data are not available.
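Since locate.yeti is a documented phytools function, its use can be sketched directly; the data below are simulated rather than the Xylodon matrices used in the study.

```r
# Minimal reproducible sketch of phytools::locate.yeti with simulated data:
# place a taxon that lacks molecular data into a phylogeny using only
# continuous (e.g. morphological) traits.
library(phytools)

set.seed(1)
tree <- pbtree(n = 20)          # stand-in for a molecular phylogeny
X    <- fastBM(tree, nsim = 5)  # five continuous traits scored for ALL taxa

# Pretend tip "t1" is the type specimen with no sequence data:
pruned <- drop.tip(tree, "t1")

# Estimate, by maximum likelihood, where the trait-only taxon attaches
placed <- locate.yeti(pruned, X, method = "ML")
plotTree(placed)
```

The appeal of this design is that the trait matrix carries the placement signal, so a type specimen known only from morphology can still be positioned on a tree built from sequences of other specimens.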


2021
Vol 22 (1)
Author(s):  
Yance Feng ◽  
Lei M. Li

Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable.

Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in the pairwise normalization. The proposed method (MUREN) is compared with existing tools on standard data sets. Our criterion for goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package; the code, under license GPL-3, is available on GitHub (github.com/hippo-yf/MUREN) and on conda (anaconda.org/hippo-yf/r-muren).

Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
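The reliance of the pairwise step on least trimmed squares can be illustrated in a few lines. This is a toy demonstration of LTS-based scaling between one sample and one reference, not the MUREN implementation itself.

```r
# Toy illustration of the idea behind the pairwise step: a robust least
# trimmed squares (LTS) fit between log-counts lets "housekeeping-like" genes
# set the normalization line, while truly differentiated genes are trimmed.
# Simulated data; this is not MUREN's actual code.
library(MASS)

set.seed(42)
ref  <- rpois(2000, lambda = 100)          # reference sample
samp <- rpois(2000, lambda = 100) * 1.5    # same biology, 1.5x sequencing depth
samp[1:200] <- samp[1:200] * 4             # a block of genuinely up-regulated genes

logr <- log2(ref + 1)
logs <- log2(samp + 1)

fit <- ltsreg(logs ~ logr)                 # LTS trims the largest residuals
normalized <- (logs - coef(fit)[1]) / coef(fit)[2]  # map sample onto reference scale
```

Because LTS discards the most extreme residuals from the fit, the up-regulated block does not drag the normalization line, which is the asymmetry-preserving behaviour the abstract emphasizes.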


2021
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure the possibility of reproducing the results and comparing them with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study has analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following desirable functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities, and (4) providing protection mechanisms for licensing issues and user rights. To demonstrate these functionalities, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at https://rdata.4spam.group to facilitate understanding of this study.


2017
Vol 12 (7)
pp. 851-855
Author(s):  
Louis Passfield ◽  
James G. Hopker

This paper explores the notion that the availability and analysis of large data sets have the capacity to improve practice and change the nature of science in the sport and exercise setting. The increasing use of data and information technology in sport is giving rise to this change. Web sites hold large data repositories, and the development of wearable technology, mobile phone applications, and related instruments for monitoring physical activity, training, and competition provides large data sets of extensive and detailed measurements. Innovative approaches conceived to more fully exploit these large data sets could provide a basis for more objective evaluation of coaching strategies and new approaches to how science is conducted. An emerging discipline, sports analytics, could help overcome some of the challenges involved in obtaining knowledge and wisdom from these large data sets. Examples where large data sets have been analyzed, to evaluate the career development of elite cyclists and to characterize and optimize the training load of well-trained runners, are discussed. Careful verification of large data sets is time-consuming and imperative before useful conclusions can be drawn; consequently, it is recommended that prospective studies be preferred over retrospective analyses of data. It is concluded that rigorous analysis of large data sets could enhance our knowledge in the sport and exercise sciences, inform competitive strategies, and allow innovative new research and findings.


Author(s):  
Qian Tang ◽  
Qi Luo ◽  
Qian Duan ◽  
Lei Deng ◽  
Renyi Zhang

Nowadays, global fish consumption continues to rise along with continuous population growth, which has led to the overfishing of fishery resources. High-value fish that are overfished are often substituted with other fish, so the accurate identification of fish products in the market is a problem worthy of attention. In this study, full DNA barcoding (FDB) and mini DNA barcoding (MDB) were used to detect fraud in fish products in Guiyang, Guizhou province, China. The molecular identification results showed that 39 of the 191 samples were not consistent with their labels. The mislabelling rates of fresh, frozen, cooked and canned fish products were 11.70%, 20.00%, 34.09% and 50.00%, respectively. The average Kimura 2-parameter (K2P) distances of MDB were 0.27% within species and 5.41% within genera, while the average distances of FDB were 0.17% within species and 6.17% within genera. Commercial fraud was noticeable in this study: most of the high-priced fish were replaced with low-priced fish of similar appearance. Our study indicates that DNA barcoding is a valid tool for the identification of fish products that can inform conservation and monitoring efforts, and confirms MDB as a reliable tool for fish products.
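For reference, Kimura 2-parameter distances like those reported above are typically computed in R with the ape package; the FASTA file name below is a placeholder.

```r
# Sketch of a standard K2P distance computation with ape; the FASTA file name
# is a placeholder and the sequences are assumed to be aligned.
library(ape)

seqs <- read.FASTA("fish_coi_barcodes.fasta")
aln  <- as.matrix(seqs)   # works when all sequences have equal (aligned) length

# model = "K80" is the Kimura 2-parameter model
d <- dist.dna(aln, model = "K80", pairwise.deletion = TRUE)
summary(as.vector(d))
```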


Author(s):  
Bishal Dhar ◽  
Mohua Chakraborty ◽  
Madhurima Chakraborty ◽  
Sorokhaibam Malvika ◽  
N. Neelima Devi ◽  
...  
