Big Data Analytics Framework for Real-Time Genome Analysis: A Comprehensive Approach

2019 ◽  
Vol 16 (8) ◽  
pp. 3419-3427
Author(s):  
Shishir K. Shandilya ◽  
S. Sountharrajan ◽  
Smita Shandilya ◽  
E. Suganya

Big Data technologies have become well accepted in recent years in biomedical and genome informatics. They are capable of processing gigantic, heterogeneous genome information with good precision and recall. With the rapid advancement of computation and storage technologies, the cost of acquiring and processing genomic data has decreased significantly. Upcoming sequencing platforms will produce vast amounts of data, which will imperatively require high-performance systems for on-demand analysis with time-bound efficiency. Recent bioinformatics tools are able to exploit the novel features of Hadoop in a more flexible way. In particular, big data technologies such as MapReduce and Hive provide a high-speed computational environment for the analysis of petabyte-scale datasets, which has drawn the attention of bio-scientists toward using big data applications to automate the entire genome analysis. The proposed framework is built on MapReduce and Java on an extended Hadoop platform to achieve parallel Big Data analysis. It assists the bioinformatics community by providing a comprehensive solution for Descriptive, Comparative, Exploratory, Inferential, Predictive and Causal Analysis of genome data. The proposed framework is user-friendly, fully customizable, scalable and suited to comprehensive real-time genome analysis, from data acquisition to predictive sequence analysis.
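
The abstract does not include the framework's code; purely as orientation, the sketch below shows what one MapReduce stage of such a pipeline might look like, written as a Hadoop Streaming job in Python rather than the authors' Java implementation. The k-mer length, input layout and launch command are assumptions, not taken from the paper.

```python
#!/usr/bin/env python3
"""Illustrative Hadoop Streaming job: count k-mers in sequencing reads.

A minimal sketch only -- not the framework proposed in the paper. It assumes
reads arrive one sequence per line on stdin (e.g. pre-extracted from
FASTA/FASTQ) and that the script is shipped to the cluster via Hadoop Streaming.
"""
import sys

K = 21  # assumed k-mer length


def mapper():
    # Map phase: emit (k-mer, 1) for every k-mer in every read.
    for line in sys.stdin:
        read = line.strip().upper()
        if not read or read.startswith((">", "@")):
            continue  # skip FASTA/FASTQ header lines
        for i in range(len(read) - K + 1):
            print(f"{read[i:i + K]}\t1")


def reducer():
    # Reduce phase: Hadoop delivers keys sorted, so sum consecutive runs.
    current, count = None, 0
    for line in sys.stdin:
        kmer, _, n = line.rstrip("\n").partition("\t")
        if kmer != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = kmer, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    # Run as "kmer_count.py map" for the mapper and "kmer_count.py reduce"
    # for the reducer when passed to hadoop-streaming via -mapper/-reducer.
    reducer() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else mapper()
```

Under these assumptions the job would be submitted with something like `hadoop jar hadoop-streaming.jar -files kmer_count.py -mapper "kmer_count.py map" -reducer "kmer_count.py reduce" -input reads/ -output kmer_counts/`, and a Hive table could then be defined over the output directory for downstream comparative queries.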

Author(s):  
Armando Fandango ◽  
William Rivera

Scientific Big Data being gathered at exascale needs to be stored, retrieved and manipulated. The storage stack for scientific Big Data includes a file system at the system level for physical organization of the data, and a file format and input/output (I/O) system at the application level for logical organization of the data, both of a high-performance variety suited to exascale. The high-performance file system is designed for concurrent access, high-speed transmission and fault tolerance. High-performance file formats and I/O are designed to give parallel and distributed applications easy and fast access to Big Data. These specialized file formats make it easier to store and access Big Data for scientific visualization and predictive analytics. This chapter provides a brief review of the characteristics of high-performance file systems such as Lustre and GPFS, and high-performance file formats and I/O such as HDF5, NetCDF, MPI-IO, and HDFS.
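
To make the application-level layer concrete, here is a minimal Python sketch using the h5py binding to HDF5; the file name, dataset layout and chunk shape are illustrative assumptions, not taken from the chapter.

```python
# Minimal HDF5 sketch with h5py (names and sizes are illustrative only).
import numpy as np
import h5py

data = np.random.rand(1_000_000, 8)  # stand-in for a large scientific array

# Write: chunking and gzip compression are what make later partial,
# parallel-friendly reads of very large datasets practical.
with h5py.File("simulation.h5", "w") as f:
    dset = f.create_dataset(
        "timestep_0/field",
        data=data,
        chunks=(10_000, 8),      # chunk shape chosen to match the access pattern
        compression="gzip",
    )
    dset.attrs["units"] = "kelvin"  # self-describing metadata travels with the data

# Read back only a slice -- the library fetches just the chunks it needs.
with h5py.File("simulation.h5", "r") as f:
    window = f["timestep_0/field"][500_000:500_100, :]
    print(window.shape, f["timestep_0/field"].attrs["units"])
```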


In the current scenario, huge amounts of data are being generated at high speed from heterogeneous sources such as social networks, business applications, the government sector, marketing, health-care systems, sensors and machine log data. Big Data has therefore been chosen as one of the upcoming areas of research by several industries. In this paper, the author presents a wide collection of literature that has been reviewed and analyzed. The paper emphasizes Big Data technologies, applications and challenges, and presents a comparative study of the architectures, methodologies, tools and survey results proposed by various researchers.


Electronics ◽  
2021 ◽  
Vol 10 (19) ◽  
pp. 2322
Author(s):  
Xiaofei Ma ◽  
Xuan Liu ◽  
Xinxing Li ◽  
Yunfei Ma

With the rapid development of the Internet of Things (IoT), big data analytics has been widely used in the sports field. In this paper, a lightweight, self-powered sensor based on a triboelectric nanogenerator (TENG) for big data analytics in sports is demonstrated. The weight of each sensing unit is ~0.4 g, and the friction material consists of polyaniline (PANI) and polytetrafluoroethylene (PTFE). Based on the TENG, the device converts small amounts of mechanical energy into electrical signals that contain information about the hitting position and hitting velocity of table tennis balls. By collecting data from daily table tennis training in real time, a personalized training program can be adjusted. A practical application is exhibited in which table tennis information is collected in real time; based on these data, coaches can develop personalized training for amateurs to enhance their hand control and improve their table tennis skills. This work opens up a new direction in intelligent athletic facilities and big data analytics.
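
The paper does not describe its signal-processing pipeline, so the following Python sketch is purely illustrative of how a TENG voltage trace could be reduced to hit events: the sampling rate, detection threshold and the use of peak amplitude as a velocity proxy are all assumptions made for this sketch.

```python
# Illustrative only: reduce a (simulated) TENG voltage trace to hit events.
# Sampling rate, threshold and the amplitude-as-velocity-proxy are assumptions.
import numpy as np
from scipy.signal import find_peaks

FS = 10_000  # assumed sampling rate in Hz

# Simulate a 2-second trace with three hits of increasing strength.
t = np.arange(0, 2.0, 1 / FS)
trace = 0.02 * np.random.randn(t.size)
for hit_time, amp in [(0.4, 0.8), (1.0, 1.5), (1.6, 2.3)]:
    trace += amp * np.exp(-((t - hit_time) ** 2) / (2 * 0.002 ** 2))

# Detect hits as voltage peaks above a noise threshold, at most one per 100 ms.
peaks, props = find_peaks(trace, height=0.3, distance=int(0.1 * FS))

for idx, height in zip(peaks, props["peak_heights"]):
    # Peak amplitude is used here as a crude proxy for hitting velocity;
    # a real system would calibrate amplitude against measured ball speed.
    print(f"hit at t={t[idx]:.3f} s, peak voltage {height:.2f} V")
```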


Author(s):  
Lidong Wang

Visualization with graphs is popular in the data analysis of Information Technology (IT) networks, or computer networks. An IT network is often modelled as a graph, with hosts as nodes and traffic flows as edges. General visualization methods are introduced in this paper, and applications and technology progress of visualization in IT network analysis, as well as big data in IT network visualization, are presented. The challenges of visualization and Big Data analytics in IT network visualization are also discussed. Big Data analytics with High Performance Computing (HPC) techniques, especially Graphics Processing Units (GPUs), helps accelerate IT network analysis and visualization.
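
As a small illustration of the hosts-as-nodes, flows-as-edges model, the Python sketch below builds and draws a traffic graph with networkx and matplotlib; the flow records are invented, and a production system at big data scale would offload layout and rendering to GPU-accelerated libraries along the lines the paper discusses.

```python
# Minimal sketch of the "hosts as nodes, traffic flows as edges" model
# using networkx; the flow records below are invented for illustration.
import networkx as nx
import matplotlib.pyplot as plt

# (source host, destination host, bytes transferred)
flows = [
    ("10.0.0.1", "10.0.0.7", 12_500),
    ("10.0.0.2", "10.0.0.7", 480_000),
    ("10.0.0.7", "192.168.1.5", 3_200_000),
    ("10.0.0.3", "10.0.0.1", 9_100),
]

G = nx.DiGraph()
for src, dst, nbytes in flows:
    # Accumulate traffic volume on the edge if the same flow repeats.
    if G.has_edge(src, dst):
        G[src][dst]["bytes"] += nbytes
    else:
        G.add_edge(src, dst, bytes=nbytes)

# Scale edge width by traffic volume so heavy talkers stand out visually.
widths = [0.5 + G[u][v]["bytes"] / 1_000_000 for u, v in G.edges]
pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos, width=widths, node_color="lightsteelblue")
plt.axis("off")
plt.savefig("network_flows.png", dpi=150)
```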


2020 ◽  
Author(s):  
Marcus H. Hansen ◽  
Anita T. Simonsen ◽  
Hans B. Ommen ◽  
Charlotte G. Nyvold

Background: Rapid and practical DNA-sequencing processing has become essential for modern biomedical laboratories, especially in the fields of cancer, pathology and genetics. While sequencing turnaround time has been, and still is, a bottleneck in research and diagnostics, the field of bioinformatics is moving at a rapid pace, both in terms of hardware and software development. Here, we benchmarked the local performance of three of the most important Spark-enabled Genome Analysis Toolkit 4 (GATK4) tools in a targeted sequencing workflow: duplicate marking, base quality score recalibration (BQSR) and variant calling on targeted DNA sequencing, using a modest hyperthreaded 12-core single CPU and a high-speed PCI Express solid-state drive.

Results: Compared to the previous GATK version, the Spark-enabled BQSR and HaplotypeCaller make more efficient use of the available CPU cores and outperform the earlier GATK3.8 version with an order-of-magnitude reduction in processing time to analysis-ready variants, whereas MarkDuplicatesSpark was found to be three times as fast. Furthermore, HaplotypeCallerSpark and BQSRPipelineSpark were significantly faster than the equivalent GATK4 standard tools, with a combined ∼86% reduction in execution time, reaching a median rate of ten million processed bases per second, and duplicate-marking time was reduced by ∼42%. The called variants were in close agreement between the Spark and non-Spark versions, with an overall concordance of 98%. In this setup, the tools were also highly efficient when compared to execution on a small 72-virtual-CPU/18-node Google Cloud cluster.

Conclusion: GATK4 offers practical parallelization possibilities for DNA sequence processing, and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark-based GATK variant calling is several times faster than the previous GATK3.8 multithreading on the same multi-core, single-CPU configuration. The improved opportunities for parallel computation have implications not only for high-performance clusters, but also for modest laboratory or research workstations performing targeted sequencing analysis, such as exome, panel or amplicon sequencing.
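
For readers who want to reproduce a similar local benchmark, the sketch below chains the three Spark-enabled steps from Python; the file names are placeholders, and the exact flag spelling and the `--spark-master` option should be checked against the GATK4 documentation for the installed version.

```python
# Hedged sketch of the benchmarked workflow (duplicate marking -> BQSR ->
# variant calling) driven from Python. File names are placeholders and the
# exact GATK4 flag spelling should be verified against the installed version;
# Spark options after "--" run the tools on local cores rather than a cluster.
import subprocess

SPARK_LOCAL = ["--", "--spark-master", "local[12]"]  # assumed 12-core CPU


def run(args):
    print(" ".join(args))
    subprocess.run(args, check=True)


run(["gatk", "MarkDuplicatesSpark",
     "-I", "sample.sorted.bam",
     "-O", "sample.md.bam",
     *SPARK_LOCAL])

run(["gatk", "BQSRPipelineSpark",
     "-I", "sample.md.bam",
     "-R", "reference.fasta",
     "--known-sites", "known_variants.vcf.gz",
     "-O", "sample.bqsr.bam",
     *SPARK_LOCAL])

run(["gatk", "HaplotypeCallerSpark",
     "-I", "sample.bqsr.bam",
     "-R", "reference.fasta",
     "-O", "sample.vcf.gz",
     *SPARK_LOCAL])
```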


Author(s):  
Amir A. Khwaja

The big data explosion has already happened, and the situation is only going to worsen with such a high number of data sources and high-end technology prevalent everywhere, generating data at a frantic pace. One of the most important aspects of big data is being able to capture, process, and analyze data as it is happening, in real time, to allow real-time business decisions. Alternative approaches must be investigated, especially ones based on highly parallel and real-time computation for big data processing. The chapter presents RealSpec, a real-time specification language that may be used for modeling big data analytics owing to inherent language features needed for real-time big data processing, such as concurrent processes, multi-threading, resource modeling, timing constraints, and exception handling. The chapter provides an overview of RealSpec and applies the language to a detailed big data event-recognition case study to demonstrate its applicability to big data framework and analytics modeling.
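
RealSpec's own syntax is not reproduced in this summary, so the sketch below is deliberately not RealSpec: it is a plain Python illustration of the ingredients the chapter highlights, namely concurrent processing, a per-event timing constraint and exception handling, applied to a toy event-recognition stream.

```python
# Not RealSpec: a plain-Python illustration of the ingredients the chapter
# lists for real-time big data processing -- concurrent workers, a timing
# constraint (deadline) per event, and explicit exception handling.
import queue
import threading
import time

DEADLINE_S = 0.050  # assumed 50 ms budget per event

events = queue.Queue()


def recognize(event):
    # Stand-in for the event-recognition analytic.
    return "alert" if event["value"] > 0.9 else "ok"


def worker():
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            break
        start = time.monotonic()
        try:
            label = recognize(event)
            elapsed = time.monotonic() - start
            if elapsed > DEADLINE_S:
                raise TimeoutError(f"deadline missed by {elapsed - DEADLINE_S:.3f} s")
            print(f"event {event['id']}: {label}")
        except Exception as exc:  # exception handling keeps the stream alive
            print(f"event {event['id']} dropped: {exc}")
        finally:
            events.task_done()


threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

for i in range(10):  # toy event stream
    events.put({"id": i, "value": i / 9})
events.join()
for _ in threads:
    events.put(None)  # tell each worker to exit
```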

