kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers

2019 ◽  
Vol 35 (23) ◽  
pp. 4871-4878
Author(s):  
Peng Jiang ◽  
Jie Luo ◽  
Yiqi Wang ◽  
Pingji Deng ◽  
Bertil Schmidt ◽  
...  

Abstract Motivation K-mers along with their frequencies have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive study of k-mer counting. However, the output of k-mer counters is itself large; very often, it is too large to fit into main memory, severely limiting its usability. Results We introduce a novel idea for encoding k-mers together with their frequencies that achieves both substantial memory savings and efficient retrieval. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers in coupled-bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that, with 7 hash functions, the average memory-saving ratio over all 31-mers is as high as 13.81 compared with the raw input. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. Availability and implementation The source code of our algorithm is available at github.com/lzhLab/kmcEx. Supplementary information Supplementary data are available at Bioinformatics online.
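To make the coupled-bit-array idea concrete, here is a minimal Python sketch (illustrative only, not the authors' kmcEx implementation; names such as CoupledFilter and num_hashes are invented): one Bloom-filter-style bit array answers k-mer membership, while a parallel counter array holds an approximate frequency per hash slot.

```python
import hashlib

class CoupledFilter:
    """Coupled-bit-array sketch: a membership bit array plus a parallel
    frequency-encoding array, indexed by the same hash slots."""

    def __init__(self, num_bits, num_hashes=7, counter_bits=8):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # k-mer representation
        self.counts = [0] * num_bits                 # frequency encoding
        self.max_count = (1 << counter_bits) - 1

    def _slots(self, kmer):
        # Double hashing: derive k slot indices from one digest.
        d = hashlib.sha256(kmer.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, kmer, count):
        for s in self._slots(kmer):
            self.bits[s // 8] |= 1 << (s % 8)
            self.counts[s] = max(self.counts[s], min(count, self.max_count))

    def query(self, kmer):
        slots = self._slots(kmer)
        if all((self.bits[s // 8] >> (s % 8)) & 1 for s in slots):
            # Minimum over slots (count-min style): hash collisions can
            # only overestimate a frequency, never underestimate it.
            return min(self.counts[s] for s in slots)
        return 0  # definitely absent
```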

Author(s):  
Jun Wang ◽  
Pu-Feng Du ◽  
Xin-Yu Xue ◽  
Guang-Ping Li ◽  
Yuan-Ke Zhou ◽  
...  

Abstract Summary Many efforts have been made to develop bioinformatics algorithms that predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features selected by heuristic or iterative methods. In this paper, we developed VisFeature, a software tool that allows users to intuitively visualize and analyze statistical features of all types of biological sequences, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignment and statistical feature generation functions. Availability and implementation VisFeature is a desktop application implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (24) ◽  
pp. 5146-5154 ◽  
Author(s):  
Joanna Zyla ◽  
Michal Marczyk ◽  
Teresa Domaszewska ◽  
Stefan H E Kaufmann ◽  
Joanna Polanska ◽  
...  

Abstract Motivation Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies. Results We evaluated eight established algorithms and one novel algorithm, Coincident Extreme Ranks in Numerical Observations (CERNO), for reproducibility, sensitivity, prioritization, false-positive rate and computational time. CERNO is a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to ranking metrics, as well as to sample and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking, Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility. Availability and implementation The tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html), and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are available in the KEGGdzPathwaysGEO and KEGGandMetacoreDzPathwaysGEO R packages and the GEO repository. Supplementary information Supplementary data are available at Bioinformatics online.
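To illustrate the modified Fisher P-value integration that CERNO builds on, the sketch below computes the rank-based statistic -2 * sum(ln(R_i / N)) over a gene set and compares it to a chi-squared distribution with 2 * |set| degrees of freedom; the actual tmod implementation may differ in details.

```python
import math
from scipy.stats import chi2

def cerno_pvalue(gene_set, ranked_genes):
    """Fisher-style rank integration: genes of the set sitting near the
    top of the ranked list contribute large -2*ln(rank/N) terms."""
    n = len(ranked_genes)
    rank_of = {g: i + 1 for i, g in enumerate(ranked_genes)}
    in_set = [g for g in gene_set if g in rank_of]
    stat = -2.0 * sum(math.log(rank_of[g] / n) for g in in_set)
    return chi2.sf(stat, df=2 * len(in_set))

# A set concentrated near the top of a 1000-gene ranking scores a small p.
ranking = [f"g{i}" for i in range(1, 1001)]
print(cerno_pvalue({"g1", "g3", "g7", "g20"}, ranking))
```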


2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Siyu Lin ◽  
Hao Wu

Cyber-physical systems (CPSs) connect with the physical world via communication networks, which significantly increases the security risks they face. To protect sensitive data, secure forwarding is an essential component of CPSs. However, CPSs have high-dimensional, multiattribute and multilevel security requirements due to their significantly increased scale and diversity, which places high demands on the query and storage of secure forwarding information. To tackle these challenges, we propose a practical secure data forwarding scheme for CPSs. Considering the limited storage capability and computational power of entities, we adopt a Bloom filter to store the secure forwarding information of each entity, which achieves a good balance between storage consumption and query delay. Furthermore, a novel link-based Bloom filter construction method is designed to reduce the false-positive rate during Bloom filter construction. Finally, the effects of the false-positive rate on the performance of Bloom filter-based secure forwarding under different routing policies are discussed.
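For context on the storage/false-positive trade-off discussed above, the following sketch applies the standard Bloom filter sizing formulas (textbook results, not the paper's link-based construction):

```python
import math

def bloom_fpr(n_items, m_bits, k_hashes):
    """Classical false-positive estimate: p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

def optimal_k(n_items, m_bits):
    """Number of hash functions minimizing p: k = (m/n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))

# Sizing a per-entity filter: 10,000 forwarding entries in 12 KiB.
n, m = 10_000, 12 * 1024 * 8
k = optimal_k(n, m)
print(k, bloom_fpr(n, m, k))  # more bits per item => lower query error
```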


2010 ◽  
Vol 110 (21) ◽  
pp. 944-949 ◽  
Author(s):  
Ken Christensen ◽  
Allen Roginsky ◽  
Miguel Jimeno

2019 ◽  
Vol 35 (17) ◽  
pp. 3046-3054 ◽  
Author(s):  
Anastasia Gurinovich ◽  
Harold Bae ◽  
John J Farrell ◽  
Stacy L Andersen ◽  
Stefano Monti ◽  
...  

Abstract Motivation Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype across populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. The ability to identify population-specific effects of genetic variants is especially important in studies that would eventually lead to diagnostic tests or drug discovery. Results In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework for directly analyzing genotype data without prior knowledge of subjects' ethnicities. It combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable, low false-positive rate (∼4%) and a high true-positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype. Availability and implementation PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster, with instructions on its installation and usage. Supplementary information Supplementary data are available at Bioinformatics online.
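A heavily simplified Python sketch of the pipeline just described (PopCluster itself is implemented in R with PLINK and Eigensoft, and additionally parses the clustering tree bottom-up to merge clusters with indistinguishable effects; every name below is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def per_cluster_effects(genotypes, variant, phenotype, n_clusters=4):
    """genotypes: individuals x variants matrix (0/1/2) for ancestry PCs;
    variant: the tested variant's dosages; phenotype: 0/1 case status."""
    pcs = PCA(n_components=10).fit_transform(genotypes)
    tree = linkage(pcs, method="ward")
    labels = fcluster(tree, n_clusters, criterion="maxclust")
    effects = {}
    for c in np.unique(labels):
        mask = labels == c
        if len(np.unique(phenotype[mask])) < 2:
            continue  # a cluster needs both cases and controls
        model = LogisticRegression().fit(variant[mask].reshape(-1, 1),
                                         phenotype[mask])
        effects[int(c)] = float(model.coef_[0, 0])  # log-odds ratio
    return effects  # diverging per-cluster effects flag heterogeneity
```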


In software development, software quality analysis plays a considerable role. During software testing, quality analysis is performed to predict defects in the code efficiently. Because of the complicated structure of software projects, code examination has become a demanding issue that must be addressed at the initial stage of testing to achieve quality-improved results. To resolve these issues, Stochastic Gaussian Neighbor Embedding based Probit Regressive Reweight Boost Classification (SGNE-PRRBC) is introduced for an accurate quality prediction system built on proficient code examination. The SGNE-PRRBC technique takes a number of program files as input for software quality analysis through feature selection and classification. Initially, the program files are taken from the dataset (DS). After collecting the files, the Gaussian distributive stochastic neighbor embedding technique selects the features (i.e. code metrics) based on distance similarity. With the aid of the Pearson correlative probit regressed reweight boost technique, the program files are then classified. The boosting algorithm creates 'm' weak classifiers (i.e. Pearson correlative probit regressions) to categorize the input program files as normal or defective by analyzing the source code and the chosen metrics. The results of the weak learners are then combined into a strong classifier by minimizing the out-of-sample error with a gradient descent function. This enhances the accuracy of quality prediction and lessens the false-positive rate (FPR). Experimental analysis is performed with various metrics, namely accuracy, FPR and computation time (CT), over varying numbers of program files. The experimental results show that the SGNE-PRRBC technique achieves better accuracy, CT and FPR than conventional methods.
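For illustration, the sketch below shows the generic reweight-boosting pattern the paragraph describes: 'm' weak learners trained on reweighted samples and merged into a strong classifier. Logistic regression stands in for probit regression, and all names are hypothetical rather than the paper's SGNE-PRRBC code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for probit

def reweight_boost(X, y, m=10):
    """AdaBoost-style reweighting: misclassified program files gain
    weight, so each next weak learner focuses on them. y is 0/1."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(m):
        clf = LogisticRegression().fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak learner's vote weight
        w *= np.exp(np.where(pred == y, -alpha, alpha))
        w /= w.sum()
        learners.append(clf)
        alphas.append(alpha)

    def predict(Xq):
        votes = sum(a * (2 * c.predict(Xq) - 1)
                    for a, c in zip(alphas, learners))
        return (votes > 0).astype(int)         # 1 = defective, 0 = normal
    return predict
```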


2020 ◽  
Vol 36 (8) ◽  
pp. 2492-2499
Author(s):  
Yifan Ji ◽  
Chang Yu ◽  
Hong Zhang

Abstract Motivation Tumor and adjacent normal RNA samples are commonly used to screen differentially expressed genes between normal and tumor samples or among tumor subtypes. Such a paired-sample design avoids numerous confounders in differential expression (DE) analysis, but the cellular contamination of tumor samples can be an important source of noise and confounding, which can both inflate the false-positive rate and deflate the true-positive rate. Existing DE tools for next-generation RNA-seq data either do not account for cellular contamination or become computationally expensive as sample sizes grow. Results A novel linear model was proposed to avoid the problem that could arise from tumor–normal correlation in paired samples. A statistically robust and computationally very fast DE analysis procedure, contamDE-lm, was developed based on this model to account for cellular contamination, boosting DE analysis power by reducing individual residual variances using gene-wise information. The advantages of contamDE-lm over some state-of-the-art methods (limma and DESeq2) were evaluated through applications to simulated data, the TCGA database and the Gene Expression Omnibus (GEO) database. Availability and implementation The proposed method contamDE-lm is implemented in an updated R package contamDE (version 2.0), which is freely available at https://github.com/zhanghfd/contamDE. Supplementary information Supplementary data are available at Bioinformatics online.
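As a hedged illustration of the kind of per-gene linear model described above (an assumed form, not contamDE-lm's exact specification), the sketch regresses paired tumor-normal log-expression differences on an estimated contamination proportion and shrinks residual variances toward their gene-wise mean:

```python
import numpy as np

def paired_de_stats(log_tumor, log_normal, contam, shrink=0.5):
    """Assumed model: d_gi = mu_g + beta_g * contam_i + eps_gi, where
    d_gi is the paired log-expression difference for gene g in sample i
    and contam_i is that sample's estimated contamination proportion."""
    d = log_tumor - log_normal                           # genes x samples
    X = np.column_stack([np.ones_like(contam), contam])  # samples x 2
    beta, *_ = np.linalg.lstsq(X, d.T, rcond=None)       # 2 x genes
    resid = d.T - X @ beta
    s2 = resid.var(axis=0, ddof=2)                       # per-gene variance
    s2_shrunk = shrink * s2.mean() + (1 - shrink) * s2   # gene-wise pooling
    c00 = np.linalg.inv(X.T @ X)[0, 0]
    return beta[0] / np.sqrt(s2_shrunk * c00)            # t-like DE statistic
```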


2016 ◽  
Vol 2016 ◽  
pp. 1-7 ◽  
Author(s):  
Hazalila Kamaludin ◽  
Hairulnizam Mahdin ◽  
Jemal H. Abawajy

Radio Frequency Identification (RFID)-enabled systems are evolving in many applications that need to know the physical location of objects, such as supply chain management. Naturally, RFID systems create large volumes of duplicate data. As duplicate data wastes communication, processing and storage resources and delays decision-making, filtering duplicates from RFID data streams is an important and challenging problem. Existing Bloom filter-based approaches for filtering duplicate RFID data streams are complex and slow because they use multiple hash functions. In this paper, we propose an approach for filtering duplicate data from RFID data streams that is based on a modified Bloom filter and uses only a single hash function. We performed an extensive empirical study of the proposed approach and compared it against the Bloom Filter, d-Left Time Bloom Filter and Count Bloom Filter approaches. The results show that the proposed approach outperforms the baseline approaches in terms of false-positive rate, execution time and true-positive rate.
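To illustrate the single-hash idea (the paper's exact modification is not reproduced here), the sketch below lets one hash computation both locate a table slot and produce a short fingerprint stored in it, so a duplicate read is detected and a new read recorded in a single step:

```python
import zlib

class SingleHashDedup:
    """A matching fingerprint in the slot flags a duplicate; a differing
    one evicts the older read, suiting a bounded-memory tag stream."""

    def __init__(self, n_slots=1 << 16):
        self.n = n_slots
        self.slots = [0] * n_slots          # 0 = empty slot

    def is_duplicate(self, tag_read):
        h = zlib.crc32(tag_read.encode())   # the single hash function
        slot, fp = h % self.n, (h // self.n) or 1
        if self.slots[slot] == fp:
            return True
        self.slots[slot] = fp
        return False

f = SingleHashDedup()
stream = ["tagA", "tagB", "tagA", "tagC", "tagB"]
print([r for r in stream if not f.is_duplicate(r)])  # ['tagA', 'tagB', 'tagC']
```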


2021 ◽  
Vol 17 (4) ◽  
pp. 1-23
Author(s):  
Datong Zhang ◽  
Yuhui Deng ◽  
Yi Zhou ◽  
Yifeng Zhu ◽  
Xiao Qin

Data deduplication techniques construct an index of fingerprint entries to identify and eliminate duplicated copies of repeating data. Disk-based index lookup and the data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally employ containers that store contiguous chunks together with their fingerprints to preserve data locality and alleviate these two issues, but this remains inadequate. To address them, we propose a container-utilization-based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries and useless fingerprint entries. A container whose utilization is smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. Among the remaining fingerprint entries, one that matches any fingerprint of forthcoming backup chunks is classified as a fragmented fingerprint entry; otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of it. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature that treats fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) it avoids disk accesses when identifying unique chunks and fragmented chunks; (ii) it slashes the false-positive rate of the integrated Bloom filter. These salient features push EHID into the high-efficiency mode. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% on the Linux and FSL datasets, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 on the Linux dataset and reduces the average disk I/O traffic by up to 66.21% on the FSL dataset. EHID also marginally improves the system's restore performance.
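A minimal sketch of the three-way index classification described above (names are illustrative, not the authors' implementation). EHID's refinement is then to map only the hot fingerprints into its Bloom filter, which shrinks the number of inserted items and hence the filter's false-positive rate:

```python
def classify_entries(index, utilization, upcoming_fps, threshold=0.5):
    """index: fingerprint -> container id;
    utilization: container id -> fraction of the container still referenced;
    upcoming_fps: fingerprints expected in the forthcoming backup."""
    hot, fragmented, useless = {}, {}, {}
    for fp, cid in index.items():
        if utilization[cid] >= threshold:   # non-sparse container
            hot[fp] = cid                   # keep; map into the Bloom filter
        elif fp in upcoming_fps:
            fragmented[fp] = cid            # rewrite these chunks as unique
        else:
            useless[fp] = cid               # distill out: evict from memory
    return hot, fragmented, useless
```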

