Reproducible Statistical Analysis in Microarray Profiling Studies

2006 ◽  
Vol 45 (02) ◽  
pp. 139-145
Author(s):  
M. Ruschhaupt ◽  
W. Huber ◽  
U. Mansmann

Summary
Objectives: Microarrays are a recent biotechnology that offers the hope of improved cancer classification. A number of publications have presented clinically promising results by combining this new kind of biological data with specifically designed algorithmic approaches. However, reproducing published results in this domain is harder than it may seem.
Methods: This paper presents examples, discusses the problems hidden in the published analyses, and demonstrates a strategy to improve the situation based on the vignette technology available from the R and Bioconductor projects.
Results: The compendium is discussed as a tool for achieving reproducible calculations and for offering an extensible computational framework. A compendium is a document that bundles primary data, processing methods (computational code), derived data, and statistical output with textual documentation and conclusions. It is interactive in the sense that it allows the processing options to be modified, new data to be plugged in, and further algorithms and visualizations to be inserted.
Conclusions: Due to the complexity of the algorithms, the size of the data sets, and the limitations of the printed-paper medium, it is usually not possible to report all the minutiae of the data processing and statistical computations. The technique of a compendium allows a complete critical assessment of a complex analysis.
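
As a minimal sketch of the compendium idea (not the authors' original document), the following shows how a vignette-style analysis can be rebuilt from a single source file with standard R tooling; the file name compendium.Rnw is hypothetical.

    # A compendium bundles text, code, and data in one executable source file.
    # "compendium.Rnw" is a hypothetical Sweave document whose R code chunks
    # read the primary data, run the analysis, and produce figures and tables.
    utils::Sweave("compendium.Rnw")     # executes the code chunks, writes compendium.tex
    tools::texi2pdf("compendium.tex")   # typesets the article with the freshly computed results
    # Modifying a processing option or plugging in a new data set only requires
    # editing the .Rnw source and re-running these two calls.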

2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research work-flows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open-source tools, including Bioconductor, Rmarkdown, git version control, and R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze, and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
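
A minimal sketch of the work-flow described above, assuming the DataPackageR API as published (datapackage_skeleton() and package_build()); the script name, object name, and package name are hypothetical, and argument names may differ across package versions.

    # Sketch: wrap a slow preprocessing step into a versioned, documented data package.
    library(DataPackageR)

    # "process_assay.Rmd" is a hypothetical processing script that reads raw files
    # and creates an analysis-ready object named "assay_data" in its environment.
    datapackage_skeleton(
      name           = "StudyData",          # name of the generated R data package
      path           = tempdir(),            # where to create the package skeleton
      code_files     = "process_assay.Rmd",  # processing code, preserved as a package vignette
      r_object_names = "assay_data"          # objects to store, document, and checksum
    )

    # Runs the processing code, records checksums and the data version,
    # and builds an installable package that analysts can simply library().
    package_build(file.path(tempdir(), "StudyData"))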


2018 ◽  
Author(s):  
Greg Finak ◽  
Bryan T. Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

Abstract
A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research work-flows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open-source tools, including Bioconductor, Rmarkdown, git version control, and R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze, and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.


Author(s):  
D. M. Nazarov

The article describes teaching methods for the course "Information Technologies" for future bachelors in "Economics", "Management", "Finance", and "Business Informatics", aimed at developing the students' metasubject competencies through the use of data processing tools in the R language. The metasubject essence of the work is to update traditional economic knowledge and skills through different forms of presenting the same data sets. As part of the laboratory work described in the article, future bachelors learn to use the basic tools of the R language and acquire specific skills in R-Studio using the example of processing currency exchange data. The description of the methods follows the traditional Key-by-Key technology, which is widely used in teaching information technology.
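
As a hedged illustration of the kind of laboratory exercise described (the file rates.csv and its column names are invented for this sketch, not taken from the article), a few basic R commands of the sort students would run in R-Studio:

    # Read a hypothetical table of daily currency exchange rates
    rates <- read.csv("rates.csv", stringsAsFactors = FALSE)   # columns: date, usd_rub, eur_rub
    rates$date <- as.Date(rates$date)

    # The same data set presented in two different forms:
    summary(rates[c("usd_rub", "eur_rub")])          # descriptive statistics of the rates
    plot(rates$date, rates$usd_rub, type = "l",
         xlab = "Date", ylab = "USD exchange rate")  # time-series plot of one currency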


2020 ◽  
Vol 21 (S18) ◽  
Author(s):  
Sudipta Acharya ◽  
Laizhong Cui ◽  
Yi Pan

Abstract
Background: In recent years, the utilization of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature or gene selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm that exploits knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in the epidemiology of specific populations.
Results: In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. Accordingly, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, and protein sequence), along with gene expression values, are collectively utilized to design two different views. UMVMO-select aims to reduce the gene space without compromising, or only minimally compromising, sample classification efficiency, and it determines relevant and non-redundant gene markers from three benchmark cancer gene expression data sets.
Conclusion: A thorough comparative analysis has been performed against five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The obtained results reveal the superiority of the proposed method. The reported results are also validated through a biological significance test and heatmap plotting.
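
The sketch below is not UMVMO-select itself; it is only a much-simplified, single-view stand-in for the underlying idea of clustering genes and keeping one representative per cluster as a non-redundant marker, run on synthetic data.

    # Simplified single-view stand-in (NOT the authors' multi-view multi-objective method):
    # cluster genes on expression and keep the gene closest to each cluster centre.
    set.seed(1)
    expr <- matrix(rnorm(200 * 20), nrow = 200)          # hypothetical 200 genes x 20 samples
    rownames(expr) <- paste0("gene", seq_len(200))

    km <- kmeans(expr, centers = 10)                     # group co-expressed genes
    markers <- sapply(seq_len(10), function(k) {
      members <- which(km$cluster == k)
      centre  <- matrix(km$centers[k, ], length(members), ncol(expr), byrow = TRUE)
      d <- rowSums((expr[members, , drop = FALSE] - centre)^2)
      names(members)[which.min(d)]                       # representative (least redundant) gene
    })
    markers                                              # candidate non-redundant marker genes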


2021 ◽  
pp. 000276422110216
Author(s):  
Kazimierz M. Slomczynski ◽  
Irina Tomescu-Dubrow ◽  
Ilona Wysmulek

This article proposes a new approach to analyzing protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world's nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets that were not a priori designed to be comparable. However, very few scholars systematically examine the impact of survey data quality on substantive results. We argue that the variation in source data, especially deviations from standards of survey documentation, data processing, and computer files—proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use—is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of measures of survey quality on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.
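
The data frame, column names, and effect sizes below are synthetic and invented purely to illustrate the kind of incremental-variance comparison described above; they are not the Survey Data Recycling data or model.

    # How much extra intersurvey variance do data-quality measures explain
    # beyond questionnaire-item differences? (toy illustration)
    set.seed(2)
    n <- 500                                              # hypothetical survey-level records
    surveys <- data.frame(
      item_wording      = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
      doc_quality       = runif(n),                       # survey documentation score
      processing_errors = rpois(n, 2),                    # data-processing error count
      file_quality      = runif(n)                        # computer-file quality score
    )
    surveys$prop_demonstrations <- plogis(-1 + 0.3 * surveys$doc_quality -
                                          0.05 * surveys$processing_errors + rnorm(n, sd = 0.3))

    base <- lm(prop_demonstrations ~ item_wording, data = surveys)             # items only
    full <- update(base, . ~ . + doc_quality + processing_errors + file_quality)
    summary(full)$r.squared - summary(base)$r.squared     # additional variance explained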


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract
Dealing with the sheer size and complexity of today's massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck in such modern distributed computing environments is that some of the worker nodes may run slowly. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of the data set, accuracy, computational load (or data redundancy), and straggler toleration in this framework.
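
As a hedged illustration of the encoded optimization idea, consider the standard least-squares example from the coded-computing literature (not necessarily the exact setting analyzed in this paper), with data matrix A, targets b, and parameters x: the data are multiplied by a redundancy-adding encoding matrix S and the solver works on the encoded problem instead.

    \[
      \min_{x}\ \|Ax - b\|_2^2
      \quad\longrightarrow\quad
      \min_{x}\ \|S(Ax - b)\|_2^2,
      \qquad S \in \mathbb{R}^{m \times n},\ m = \beta n,\ \beta > 1,\ S^{\top} S \approx I_n .
    \]

The rows of the encoded data (SA, Sb) are distributed across worker nodes; because they carry redundancy, the aggregator can ignore the slowest responders and still recover a near-optimal solution, with the redundancy factor β governing the trade-off between straggler toleration, accuracy, and computational load.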


2014 ◽  
Vol 11 (2) ◽  
pp. 68-79
Author(s):  
Matthias Klapperstück ◽  
Falk Schreiber

Summary
The visualization of biological data has gained increasing importance in recent years. A large number of methods and software tools are available that visualize biological data, including the combination of measured experimental data and biological networks. With the growing size of networks, their handling and exploration becomes a challenging task for the user. In addition, scientists are interested in investigating not just a single kind of network, but combinations of different types of networks, such as metabolic, gene regulatory, and protein interaction networks. Therefore, fast access, abstract and dynamic views, and intuitive exploratory methods should be provided to search and extract information from the networks. This paper introduces a conceptual framework for handling and combining multiple network sources that enables abstract viewing and exploration of large data sets, including additional experimental data. It introduces a three-tier structure that links network data to multiple network views, discusses a proof-of-concept implementation, and shows a specific visualization method for combining metabolic and gene regulatory networks in an example.
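
The toy sketch below is not the paper's implementation; it only illustrates, with the igraph package, the general idea of keeping several typed network sources in one integrated structure from which restricted views can be derived. All node and edge names are invented.

    library(igraph)

    # Two hypothetical network sources for the same organism
    metabolic  <- data.frame(from = c("glucose", "g6p"), to = c("g6p", "f6p"), type = "metabolic")
    regulatory <- data.frame(from = c("tf1", "tf1"),     to = c("hxk", "pfk"), type = "regulatory")

    # Tier 1: one integrated graph holding all sources, with edges tagged by their origin
    g <- graph_from_data_frame(rbind(metabolic, regulatory), directed = TRUE)

    # Derived view: restrict to a single network type on demand
    regulatory_view <- subgraph.edges(g, E(g)[E(g)$type == "regulatory"])
    plot(regulatory_view)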


Author(s):  
И.В. Бычков ◽  
Г.М. Ружников ◽  
В.В. Парамонов ◽  
А.С. Шумилов ◽  
Р.К. Фёдоров

The paper considers an infrastructural approach to spatial data processing for territorial development management, based on the service-oriented paradigm, OGC standards, web technologies, WPS services, and a geoportal. The development of territories is a multi-dimensional and multi-aspect process characterized by large volumes of financial, natural-resource, social, ecological, and economic data. These data are highly localized and poorly coordinated, which limits their complex analysis and use. One method of processing large volumes of data is an information-analytical environment. The architecture and implementation of an information-analytical environment for territorial development in the form of a Geoportal are presented. The Geoportal provides its users with software tools for exchanging spatial and thematic data, as well as OGC-based distributed services for data processing. Implementing data processing and storage as services located on distributed servers simplifies their updating and maintenance; it also enables publishing and makes processing a more open and controlled process. The Geoportal consists of the following modules: the content management system Calipso (user interface, user management, data visualization), the RDBMS PostgreSQL with a spatial data processing extension, services for entering and editing relational data, a subsystem for launching and executing WPS services, and spatial data processing services deployed in a local cloud environment. The article argues for the necessity of the infrastructural approach when creating an information-analytical environment for territory management, which involves large volumes of spatial and thematic data stored in various formats, and for the application of the service-oriented paradigm, OGC standards, web technologies, the Geoportal, and distributed WPS services. The developed software system was tested on a number of tasks that arise during territory development.
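
A hedged sketch of how a client might query such a distributed processing service through the standard OGC WPS interface; the endpoint URL is hypothetical and not taken from the article.

    library(httr)

    # Hypothetical WPS endpoint exposed by the geoportal
    wps <- "https://geoportal.example.org/wps"

    # Standard OGC WPS operation: ask the service which processes it offers
    resp <- GET(wps, query = list(service = "WPS",
                                  request = "GetCapabilities",
                                  version = "1.0.0"))
    stop_for_status(resp)
    capabilities <- content(resp, as = "text", encoding = "UTF-8")  # XML listing the available processes
    cat(substr(capabilities, 1, 400))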


2021 ◽  
Vol 2 (3) ◽  
pp. 59
Author(s):  
Susanti Krismon ◽  
Syukri Iska

This article discusses the implementation of wages in agriculture in Nagari Bukit Kandung, X Koto Diatas Subdistrict, Solok Regency, from the perspective of muamalah fiqh. The type of research is field research. The data sources consist of primary data from farmers and farm laborers (8 farmers and 4 farm laborers), while the secondary data were obtained from documents related to this research, in the form of the Nagari Bukit Kandung Profile, which could provide information to strengthen the primary data. The data collection techniques used are observation, interviews, and documentation. The data processing is qualitative. Based on the results of this study, the practice of agricultural wages in Nagari Bukit Kandung, X Koto Diatas Subdistrict, Solok Regency, is that farm laborers ask for their wages to be paid in advance, before they carry out their work, without any prior agreement that wages would be paid up front. Because the wages are paid in advance, many farm laborers do not work as the farmers expect, and some are not on time in doing the work that should be done. According to the muamalah fiqh review, this implementation of wages is not permitted, because there is an element of gharar in the contract and one party, the owner of the fields, is disadvantaged by the contract.

