Reproducible Statistical Analysis in Microarray Profiling Studies

2006 ◽  
Vol 45 (02) ◽  
pp. 139-145
Author(s):  
M. Ruschhaupt ◽  
W. Huber ◽  
U. Mansmann

Summary
Objectives: Microarrays are a recent biotechnology that offers the hope of improved cancer classification. A number of publications have presented clinically promising results by combining this new kind of biological data with specifically designed algorithmic approaches. However, reproducing published results in this domain is harder than it may seem.
Methods: This paper presents examples, discusses the problems hidden in the published analyses, and demonstrates a strategy to improve the situation based on the vignette technology available from the R and Bioconductor projects.
Results: The compendium is discussed as a tool for achieving reproducible calculations and for offering an extensible computational framework. A compendium is a document that bundles primary data, processing methods (computational code), derived data, and statistical output with textual documentation and conclusions. It is interactive in the sense that it allows the processing options to be modified, new data to be plugged in, and further algorithms and visualizations to be inserted.
Conclusions: Due to the complexity of the algorithms, the size of the data sets, and the limitations of the printed-paper medium, it is usually not possible to report all the minutiae of the data processing and statistical computations. The technique of a compendium allows a complete critical assessment of a complex analysis.
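
As a minimal sketch of the compendium idea (not the authors' original document), the following shows how a vignette-style analysis can be rebuilt from a single source file with standard R tooling; the file name compendium.Rnw is hypothetical.

    # A compendium bundles text, code, and data in one executable source file.
    # "compendium.Rnw" is a hypothetical Sweave document whose R code chunks
    # read the primary data, run the analysis, and produce figures and tables.
    utils::Sweave("compendium.Rnw")     # executes the code chunks, writes compendium.tex
    tools::texi2pdf("compendium.tex")   # typesets the article with the freshly computed results
    # Modifying a processing option or plugging in a new data set only requires
    # editing the .Rnw source and re-running these two calls.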

2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research work-flows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open-source tools, including Bioconductor, Rmarkdown, git version control, and R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze, and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
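
A minimal sketch of the work-flow described above, assuming the DataPackageR API as published (datapackage_skeleton() and package_build()); the script name, object name, and package name are hypothetical, and argument names may differ across package versions.

    # Sketch: wrap a slow preprocessing step into a versioned, documented data package.
    library(DataPackageR)

    # "process_assay.Rmd" is a hypothetical processing script that reads raw files
    # and creates an analysis-ready object named "assay_data" in its environment.
    datapackage_skeleton(
      name           = "StudyData",          # name of the generated R data package
      path           = tempdir(),            # where to create the package skeleton
      code_files     = "process_assay.Rmd",  # processing code, preserved as a package vignette
      r_object_names = "assay_data"          # objects to store, document, and checksum
    )

    # Runs the processing code, records checksums and the data version,
    # and builds an installable package that analysts can simply library().
    package_build(file.path(tempdir(), "StudyData"))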


2018 ◽  
Author(s):  
Greg Finak ◽  
Bryan T. Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

Abstract
A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research work-flows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual's existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open-source tools, including Bioconductor, Rmarkdown, git version control, and R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze, and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.


Author(s):  
D. M. Nazarov

The article describes teaching methods for the course "Information Technologies" for future bachelors in "Economics", "Management", "Finance", and "Business Informatics", aimed at developing the students' metasubject competencies through the use of data processing tools in the R language. The metasubject essence of the work is to update traditional economic knowledge and skills through different forms of presenting the same data sets. As part of the laboratory work described in the article, future bachelors learn to use the basic tools of the R language and acquire specific skills in R-Studio using the example of processing currency exchange data. The description of the methods follows the traditional Key-by-Key technology, which is widely used in teaching information technology.
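
As a hedged illustration of the kind of laboratory exercise described (the file rates.csv and its column names are invented for this sketch, not taken from the article), a few basic R commands of the sort students would run in R-Studio:

    # Read a hypothetical table of daily currency exchange rates
    rates <- read.csv("rates.csv", stringsAsFactors = FALSE)   # columns: date, usd_rub, eur_rub
    rates$date <- as.Date(rates$date)

    # The same data set presented in two different forms:
    summary(rates[c("usd_rub", "eur_rub")])          # descriptive statistics of the rates
    plot(rates$date, rates$usd_rub, type = "l",
         xlab = "Date", ylab = "USD exchange rate")  # time-series plot of one currency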


2020 ◽  
Vol 21 (S18) ◽  
Author(s):  
Sudipta Acharya ◽  
Laizhong Cui ◽  
Yi Pan

Abstract
Background: In recent years, the utilization of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature or gene selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm that exploits knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in the epidemiology of specific populations.
Results: In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. Accordingly, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, and protein sequence), along with gene expression values, are collectively utilized to design two different views. UMVMO-select aims to reduce the gene space without compromising, or only minimally compromising, sample classification efficiency, and it determines relevant and non-redundant gene markers from three benchmark cancer gene expression data sets.
Conclusion: A thorough comparative analysis has been performed against five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The obtained results reveal the superiority of the proposed method. The reported results are also validated through a biological significance test and heatmap plotting.
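
The sketch below is not UMVMO-select itself; it is only a much-simplified, single-view stand-in for the underlying idea of clustering genes and keeping one representative per cluster as a non-redundant marker, run on synthetic data.

    # Simplified single-view stand-in (NOT the authors' multi-view multi-objective method):
    # cluster genes on expression and keep the gene closest to each cluster centre.
    set.seed(1)
    expr <- matrix(rnorm(200 * 20), nrow = 200)          # hypothetical 200 genes x 20 samples
    rownames(expr) <- paste0("gene", seq_len(200))

    km <- kmeans(expr, centers = 10)                     # group co-expressed genes
    markers <- sapply(seq_len(10), function(k) {
      members <- which(km$cluster == k)
      centre  <- matrix(km$centers[k, ], length(members), ncol(expr), byrow = TRUE)
      d <- rowSums((expr[members, , drop = FALSE] - centre)^2)
      names(members)[which.min(d)]                       # representative (least redundant) gene
    })
    markers                                              # candidate non-redundant marker genes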


2021 ◽  
pp. 000276422110216
Author(s):  
Kazimierz M. Slomczynski ◽  
Irina Tomescu-Dubrow ◽  
Ilona Wysmulek

This article proposes a new approach to analyzing protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world's nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets that were not a priori designed to be comparable. However, very few scholars systematically examine the impact of survey data quality on substantive results. We argue that the variation in source data, especially deviations from standards of survey documentation, data processing, and computer files—proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use—is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of measures of survey quality on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.
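
The data frame, column names, and effect sizes below are synthetic and invented purely to illustrate the kind of incremental-variance comparison described above; they are not the Survey Data Recycling data or model.

    # How much extra intersurvey variance do data-quality measures explain
    # beyond questionnaire-item differences? (toy illustration)
    set.seed(2)
    n <- 500                                              # hypothetical survey-level records
    surveys <- data.frame(
      item_wording      = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
      doc_quality       = runif(n),                       # survey documentation score
      processing_errors = rpois(n, 2),                    # data-processing error count
      file_quality      = runif(n)                        # computer-file quality score
    )
    surveys$prop_demonstrations <- plogis(-1 + 0.3 * surveys$doc_quality -
                                          0.05 * surveys$processing_errors + rnorm(n, sd = 0.3))

    base <- lm(prop_demonstrations ~ item_wording, data = surveys)             # items only
    full <- update(base, . ~ . + doc_quality + processing_errors + file_quality)
    summary(full)$r.squared - summary(base)$r.squared     # additional variance explained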


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract
Dealing with the sheer size and complexity of today's massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck in such modern distributed computing environments is that some of the worker nodes may run slowly. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of the data set, accuracy, computational load (or data redundancy), and straggler toleration in this framework.
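
As a hedged illustration of the encoded optimization idea, consider the standard least-squares example from the coded-computing literature (not necessarily the exact setting analyzed in this paper), with data matrix A, targets b, and parameters x: the data are multiplied by a redundancy-adding encoding matrix S and the solver works on the encoded problem instead.

    \[
      \min_{x}\ \|Ax - b\|_2^2
      \quad\longrightarrow\quad
      \min_{x}\ \|S(Ax - b)\|_2^2,
      \qquad S \in \mathbb{R}^{m \times n},\ m = \beta n,\ \beta > 1,\ S^{\top} S \approx I_n .
    \]

The rows of the encoded data (SA, Sb) are distributed across worker nodes; because they carry redundancy, the aggregator can ignore the slowest responders and still recover a near-optimal solution, with the redundancy factor β governing the trade-off between straggler toleration, accuracy, and computational load.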


2014 ◽  
Vol 11 (2) ◽  
pp. 68-79
Author(s):  
Matthias Klapperstück ◽  
Falk Schreiber

Summary
The visualization of biological data has gained increasing importance in recent years. A large number of methods and software tools are available that visualize biological data, including the combination of measured experimental data and biological networks. With the growing size of networks, their handling and exploration becomes a challenging task for the user. In addition, scientists are interested in investigating not just a single kind of network, but combinations of different types of networks, such as metabolic, gene regulatory, and protein interaction networks. Therefore, fast access, abstract and dynamic views, and intuitive exploratory methods should be provided to search and extract information from the networks. This paper introduces a conceptual framework for handling and combining multiple network sources that enables abstract viewing and exploration of large data sets, including additional experimental data. It introduces a three-tier structure that links network data to multiple network views, discusses a proof-of-concept implementation, and shows a specific visualization method for combining metabolic and gene regulatory networks in an example.
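
The toy sketch below is not the paper's implementation; it only illustrates, with the igraph package, the general idea of keeping several typed network sources in one integrated structure from which restricted views can be derived. All node and edge names are invented.

    library(igraph)

    # Two hypothetical network sources for the same organism
    metabolic  <- data.frame(from = c("glucose", "g6p"), to = c("g6p", "f6p"), type = "metabolic")
    regulatory <- data.frame(from = c("tf1", "tf1"),     to = c("hxk", "pfk"), type = "regulatory")

    # Tier 1: one integrated graph holding all sources, with edges tagged by their origin
    g <- graph_from_data_frame(rbind(metabolic, regulatory), directed = TRUE)

    # Derived view: restrict to a single network type on demand
    regulatory_view <- subgraph.edges(g, E(g)[E(g)$type == "regulatory"])
    plot(regulatory_view)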


Author(s):  
И.В. Бычков ◽  
Г.М. Ружников ◽  
В.В. Парамонов ◽  
А.С. Шумилов ◽  
Р.К. Фёдоров

The paper considers an infrastructural approach to spatial data processing for territorial development management, based on the service-oriented paradigm, OGC standards, web technologies, WPS services, and a geoportal. The development of territories is a multi-dimensional and multi-aspect process characterized by large volumes of financial, natural-resource, social, ecological, and economic data. These data are highly localized and poorly coordinated, which limits their complex analysis and use. One method of processing large volumes of data is an information-analytical environment. The architecture and implementation of an information-analytical environment for territorial development in the form of a Geoportal are presented. The Geoportal provides its users with software tools for exchanging spatial and thematic data, as well as OGC-based distributed services for data processing. Implementing data processing and storage as services located on distributed servers simplifies their updating and maintenance; it also enables publishing and makes processing a more open and controlled process. The Geoportal consists of the following modules: the content management system Calipso (user interface, user management, data visualization), the RDBMS PostgreSQL with a spatial data processing extension, services for entering and editing relational data, a subsystem for launching and executing WPS services, and spatial data processing services deployed in a local cloud environment. The article argues for the necessity of the infrastructural approach when creating an information-analytical environment for territory management, which involves large volumes of spatial and thematic data stored in various formats, and for the application of the service-oriented paradigm, OGC standards, web technologies, the Geoportal, and distributed WPS services. The developed software system was tested on a number of tasks that arise during territory development.
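
A hedged sketch of how a client might query such a distributed processing service through the standard OGC WPS interface; the endpoint URL is hypothetical and not taken from the article.

    library(httr)

    # Hypothetical WPS endpoint exposed by the geoportal
    wps <- "https://geoportal.example.org/wps"

    # Standard OGC WPS operation: ask the service which processes it offers
    resp <- GET(wps, query = list(service = "WPS",
                                  request = "GetCapabilities",
                                  version = "1.0.0"))
    stop_for_status(resp)
    capabilities <- content(resp, as = "text", encoding = "UTF-8")  # XML listing the available processes
    cat(substr(capabilities, 1, 400))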


2021 ◽  
Vol 2 (3) ◽  
pp. 59
Author(s):  
Susanti Krismon ◽  
Syukri Iska

This article discusses the implementation of wages in agriculture in Nagari Bukit Kandung, X Koto Diatas Subdistrict, Solok Regency, from the perspective of muamalah fiqh. The type of research is field research. The data sources consist of primary data from farmers and farm laborers (8 farmers and 4 farm laborers), while the secondary data were obtained from documents related to this research, in the form of the Nagari Bukit Kandung Profile, which could provide information to strengthen the primary data. The data collection techniques used are observation, interviews, and documentation. The data processing is qualitative. Based on the results of this study, the practice of agricultural wages in Nagari Bukit Kandung, X Koto Diatas Subdistrict, Solok Regency, is that farm laborers ask for their wages to be paid in advance, before they carry out their work, without any prior agreement that wages would be paid up front. Because the wages are paid in advance, many farm laborers do not work as the farmers expect, and some are not on time in doing the work that should be done. According to the muamalah fiqh review, this implementation of wages is not permitted, because there is an element of gharar in the contract and one party, the owner of the fields, is disadvantaged by the contract.

