Design and Application of a Containerized Hybrid Transaction Processing and Data Analysis Framework

2018 ◽  
Vol 10 (3) ◽  
pp. 76-90
Author(s):  
Ye Tao ◽  
Xiaodong Wang ◽  
Xiaowei Xu

This article describes how rapidly growing data volumes require systems capable of handling massive, heterogeneous, unstructured data sets, whereas most mature transaction processing systems are built on relational databases and structured data. The authors design a hybrid development framework that offers greater scalability and flexibility for data analysis and reporting while preserving compatibility with, and links to, the legacy platforms on which transactional business logic runs. Data, service, and user interface layers are implemented as a toolset stack for developing applications that provide information retrieval, data processing, analysis, and visualization. A healthcare data integration use case, in which information is collected and aggregated from diverse sources, is presented as an example. The workflow and a simulation of data processing and visualization are also discussed to validate the effectiveness of the proposed framework.

2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ’omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
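For orientation, the sketch below illustrates the intended usage pattern: data processing code is wrapped in a package skeleton, built into a versioned data package, and the processed objects are then loaded like any other R data set. The package, file, and object names (MyStudyData, preprocess.Rmd, analysis_ready) are illustrative assumptions, and argument names may differ between DataPackageR versions.

# A minimal sketch of the DataPackageR workflow (names are illustrative).
library(DataPackageR)

# 1. Create a data-package skeleton whose processing script ("preprocess.Rmd")
#    turns raw files into the object `analysis_ready`.
datapackage_skeleton(
  name           = "MyStudyData",        # hypothetical package name
  path           = tempdir(),
  code_files     = "preprocess.Rmd",     # processing code kept as a vignette
  r_object_names = "analysis_ready"      # object(s) the script must produce
)

# 2. Build the package: runs the processing code, documents and checksums the
#    resulting data objects, and records the rendered vignette.
package_build(file.path(tempdir(), "MyStudyData"))

# 3. Analysts install the built package and load the versioned data, e.g.
#    library(MyStudyData); data(analysis_ready)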


Today's technological advancements have made it easier for researchers to collect and organize various forms of healthcare data, and data is an integral part of healthcare analytics. Drug discovery driven by clinical data analytics represents an important breakthrough for computational approaches in healthcare systems, and healthcare analysis also provides better value for money. Managing healthcare data is challenging because roughly 80% of it is unstructured, including handwritten documents, images, and computer-generated clinical reports such as MRI, ECG, and CT scans. This paper summarizes work carried out by scientists and researchers in healthcare domains, focusing on clinical data analysis for the period 2013 to 2019. The survey is organized around the data sets used, the techniques and methods applied, the tools adopted, and the key findings in clinical data analysis. The overall objective is to identify current challenges, trends, and gaps in clinical data analysis, pursued through a bibliometric survey and a novel summarization of the key findings.


BMC Genomics ◽  
2020 ◽  
Vol 21 (S10) ◽  
Author(s):  
Tanveer Ahmad ◽  
Nauman Ahmed ◽  
Zaid Al-Ars ◽  
H. Peter Hofstee

Background: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data, which must be processed efficiently for downstream analyses. Computing systems need these large amounts of data close to the processor (with low latency) for fast and efficient processing, yet existing workflows depend heavily on disk storage and access, so processing this data incurs large disk I/O overheads. Previously, the cost, volatility, and other physical constraints of DRAM made it infeasible to keep large working data sets in memory. Recent developments in storage-class and non-volatile memory technologies, however, allow computing systems to hold large data sets in memory and process them directly there, avoiding disk I/O bottlenecks. To exploit such memory systems efficiently, data must be placed in memory in a suitable format and accessed at high throughput, avoiding (de)serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient big data analytics. This allows genomics applications written in different programming languages to communicate in memory without accessing disk storage and without (de)serialization and copy overheads.
Implementation: We integrate the Apache Arrow in-memory Sequence Alignment/Map (SAM) format and its shared-memory object store library into widely used genomics high-throughput data processing applications such as BWA-MEM, Picard, and GATK to allow in-memory communication between these applications. This also lets us exploit the cache locality of tabular data and parallel processing capabilities through shared-memory objects.
Results: Our implementation shows that adopting the in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses thanks to high cache-locality exploitation, and parallel scalability through shared-memory objects. The implementation targets the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare existing in-memory data placement and sharing techniques such as ramDisk and Unix pipes and show that the columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of the variant calling workflows, and speedups of 1.45x and 1.27x for these data sets, respectively, compared with the second fastest workflow. In some individual tools, particularly sorting, duplicate removal, and base quality score recalibration, the speedup is even more promising.
Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.
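The ArrowSAM integration itself lives inside BWA-MEM, Picard, and GATK, but the core idea — representing SAM records as a language-independent Arrow columnar table that other processes can read without row-by-row (de)serialization — can be sketched in a few lines with the arrow R package. The columns and file name below are illustrative assumptions, not the actual ArrowSAM schema or API.

# A minimal sketch of columnar SAM-like data with Apache Arrow
# (illustrative columns only; not the ArrowSAM implementation).
library(arrow)

reads <- data.frame(
  qname = c("read_001", "read_002"),
  flag  = c(99L, 147L),
  rname = c("chr1", "chr1"),
  pos   = c(10468L, 10500L),
  mapq  = c(60L, 60L),
  cigar = c("100M", "100M"),
  stringsAsFactors = FALSE
)

# Write the table in the Arrow IPC (Feather v2) format: columnar and
# language-independent, so a Python/C++/Java tool can map the same file.
write_feather(reads, "aligned_reads.arrow")

# A downstream step memory-maps the file and reads whole columns at once,
# avoiding per-record parsing of a text SAM file.
tbl <- read_feather("aligned_reads.arrow", as_data_frame = FALSE)
print(tbl$schema)
print(tbl$pos)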


2008 ◽  
Vol 30 (4) ◽  
pp. 323-335
Author(s):  
Jordy Coffa ◽  
Mark A. van de Wiel ◽  
Begoña Diosdado ◽  
Beatriz Carvalho ◽  
Jan Schouten ◽  
...  

Background: Multiplex Ligation-dependent Probe Amplification (MLPA) is a rapid, simple, reliable, and customized method for detecting copy number changes of individual genes at high resolution, and it allows high-throughput analysis. The technique is typically applied to study specific genes in large sample series. The large amount of data, dissimilarities in PCR efficiency among the different probe amplification products, and sample-to-sample variation pose a challenge to data analysis and interpretation. We therefore set out to develop an MLPA data analysis strategy and tool that is simple to use while still taking the above-mentioned sources of variation into account.
Materials and Methods: MLPAnalyzer was developed in Visual Basic for Applications and can accept a large number of file formats directly from capillary sequencing systems. Sizes of all MLPA probe signals are determined and filtered, quality control steps are performed, and size-related variation in peak intensity is corrected for. DNA copy number ratios of test samples are computed and displayed in a table view, and a set of comprehensive figures is generated. To validate this approach, MLPA reactions were performed with a dedicated MLPA mix on six different colorectal cancer cell lines. The generated data were normalized using our program and the results were compared to previously obtained array-CGH results using both statistical methods and visual examination.
Results and Discussion: Visual examination of bar graphs and direct ratios for both techniques showed very similar results, while the average Pearson product-moment correlation over all MLPA probes was 0.42. Our results thus show that automated MLPA data processing following our suggested strategy may be of significant use, especially when handling large MLPA data sets, when samples are of different quality, or when interpretation of MLPA electropherograms is too complex. It remains important, however, to recognize that automated MLPA data processing may only be successful when a dedicated experimental setup is also considered.
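MLPAnalyzer itself is implemented in Visual Basic for Applications; purely to make the normalization idea concrete, the R sketch below shows a generic two-step MLPA ratio calculation (within-sample normalization against reference probes, then comparison to a control sample). Probe names, intensities, and column layout are invented for illustration and are not MLPAnalyzer's internal procedure.

# Generic MLPA-style copy number ratios (illustrative values, not MLPAnalyzer).
peaks <- data.frame(
  probe   = c("GENE1", "GENE2", "REF1", "REF2"),
  is_ref  = c(FALSE, FALSE, TRUE, TRUE),
  test    = c(1200, 450, 980, 1010),   # peak intensities, test sample
  control = c(800,  820, 1000, 990)    # peak intensities, control sample
)

# Step 1: within-sample normalization -- divide each probe by the mean of the
# reference probes in that sample, correcting overall signal differences.
norm <- sapply(c("test", "control"),
               function(s) peaks[[s]] / mean(peaks[[s]][peaks$is_ref]))

# Step 2: copy number ratio of test vs. control; ~1 suggests two copies,
# ~0.5 a deletion, ~1.5 a duplication.
data.frame(probe = peaks$probe,
           ratio = round(norm[, "test"] / norm[, "control"], 2))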


Author(s):  
Changhong Jing ◽  
Wenjie Liu ◽  
Jintao Gao ◽  
Ouya Pei

Data processing can be roughly divided into two categories: online transaction processing (OLTP) and online analytical processing (OLAP). OLTP is the main application of traditional relational databases and covers basic day-to-day transaction processing, such as recording bank transactions. OLAP is the main application of data warehouse systems; it supports more complex data analysis operations, focuses on decision support, and provides intuitive, easy-to-understand analysis results. As the amount of data processed by enterprises continues to increase, distributed databases have gradually replaced stand-alone databases and become mainstream. However, the workloads currently supported by distributed databases are mainly OLTP applications and lack OLAP support. This paper proposes an HTAP (hybrid transactional/analytical processing) implementation for the distributed database CBase, which provides OLAP analysis capability for CBase and can readily handle analysis of large amounts of data.


Author(s):  
D. M. Nazarov

The article describes teaching methods for the course "Information Technologies" for future bachelors in the programs "Economics", "Management", "Finance", and "Business Informatics", aimed at developing students' metasubject competencies through the use of data processing tools in the R language. The metasubject essence of the work is to revisit traditional economic knowledge and skills through different forms of presentation of the same data sets. In the laboratory work described in the article, future bachelors learn to use the basic tools of the R language and acquire specific skills in R-Studio using the example of processing currency exchange rate data. The methods are described in the form of the traditional Key-by-Key technology, which is widely used in teaching information technologies.
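As an illustration of the kind of exercise described, the sketch below processes daily exchange-rate data in base R, producing several presentations of the same data set: a numeric summary, a monthly aggregation, and a time-series plot. The file name and column names (date, usd_rate) are assumptions for this example, not the actual course materials.

# Illustrative exchange-rate exercise in base R (file and columns assumed).
rates <- read.csv("exchange_rates.csv", stringsAsFactors = FALSE)  # columns: date, usd_rate
rates$date <- as.Date(rates$date)

# One data set, several presentations:
summary(rates$usd_rate)                                   # numeric summary

monthly <- aggregate(usd_rate ~ format(date, "%Y-%m"),    # monthly averages
                     data = rates, FUN = mean)
names(monthly) <- c("month", "mean_rate")
print(monthly)

plot(rates$date, rates$usd_rate, type = "l",              # time-series plot
     xlab = "Date", ylab = "Exchange rate",
     main = "Daily exchange rate")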

