scholarly journals Referential DNA Data Compression using Hadoop Map Reduce Framework

2019 ◽  
Vol 17 (2) ◽  
pp. 207-214
Author(s):  
Raju Bhukya ◽  
Sumit Deshmuk

The indispensable knowledge of Deoxyribonucleic Acid (DNA) sequences and sharply reducing cost of the DNA sequencing techniques has attracted numerous researchers in the field of Genetics. These sequences are getting available at an exponential rate leading to the bulging size of molecular biology databases making large disk arrays and compute clusters inevitable for analysis.In this paper, we proposed referential DNA data compression using hadoop MapReduce Framework to process humongous amount of genetic data in distributed environment on high performance compute clusters. Our method has successfully achieved a better balance between compression ratio and the amount of time required for DNA data compression as compared to other Referential DNA Data Compression methods.

2015 ◽  
Vol 71 (9) ◽  
pp. 3525-3548 ◽  
Author(s):  
Sangwhan Moon ◽  
Jaehwan Lee ◽  
Xiling Sun ◽  
Yang-suk Kee

2008 ◽  
Vol 59 (7) ◽  
Author(s):  
Corina Samoila ◽  
Alfa Xenia Lupea ◽  
Andrei Anghel ◽  
Marilena Motoc ◽  
Gabriela Otiman ◽  
...  

Denaturing High Performance Liquid Chromatography (DHPLC) is a relatively new method used for screening DNA sequences, characterized by high capacity to detect mutations/polymorphisms. This study is focused on the Transgenomic WAVETM DNA Fragment Analysis (based on DHPLC separation method) of a 485 bp fragment from human EC-SOD gene promoter in order to detect single nucleotide polymorphism (SNPs) associated with atherosclerosis and risk factors of cardiovascular disease. The fragment of interest was amplified by PCR reaction and analyzed by DHPLC in 100 healthy subjects and 70 patients characterized by atheroma. No different melting profiles were detected for the analyzed DNA samples. A combination of computational methods was used to predict putative transcription factors in the fragment of interest. Several putative transcription factors binding sites from the Ets-1 oncogene family: ETS member Elk-1, polyomavirus enhancer activator-3 (PEA3), protein C-Ets-1 (Ets-1), GABP: GA binding protein (GABP), Spi-1 and Spi-B/PU.1 related transcription factors, from the Krueppel-like family: Gut-enriched Krueppel-like factor (GKLF), Erythroid Krueppel-like factor (EKLF), Basic Krueppel-like factor (BKLF), GC box and myeloid zinc finger protein MZF-1 were identified in the evolutionary conserved regions. The bioinformatics results need to be investigated further in others studies by experimental approaches.


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.


Genetics ◽  
2000 ◽  
Vol 155 (4) ◽  
pp. 2011-2014 ◽  
Author(s):  
Richard R Hudson

Abstract A new statistic for detecting genetic differentiation of subpopulations is described. The statistic can be calculated when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers. Using a symmetric island model, and assuming an infinite-sites model of mutation, it is found that the new statistic is as powerful or more powerful than previously proposed statistics for a wide range of parameter values.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.


2014 ◽  
Vol 28 (07) ◽  
pp. 1450056 ◽  
Author(s):  
Hua-Lin Cai ◽  
Yi Yang ◽  
Yi-Han Zhang ◽  
Chang-Jian Zhou ◽  
Cang-Ran Guo ◽  
...  

In this paper, a surface acoustic wave (SAW) biosensor with gold delay area on LiNbO 3 substrate detecting DNA sequences is proposed. By well-designed device parameters of the SAW sensor, it achieves a high performance for highly sensitive detection of target DNA. In addition, an effective biological treatment method for DNA immobilization and abundant experimental verification of the sensing effect have made it a reliable device in DNA detection. The loading mass of the probe and target DNA sequences is obtained from the frequency shifts, which are big enough in this work due to an effective biological treatment. The experimental results show that the biosensor has a high sensitivity of 1.2 pg/ml/Hz and high selectivity characteristic is also verified by the few responses of other substances. In combination with wireless transceiver, we develop a wireless receiving and processing system that can directly display the detection results.


Author(s):  
U.S.N. Raju ◽  
Irlanki Sandeep ◽  
Nattam Sai Karthik ◽  
Rayapudi Siva Praveen ◽  
Mayank Singh Sachan

Sign in / Sign up

Export Citation Format

Share Document