A four-dimensional analysis of partitioned approximate filters

2021 ◽  
Vol 14 (11) ◽  
pp. 2355-2368
Author(s):  
Tobias Schmidt ◽  
Maximilian Bandle ◽  
Jana Giceva

With today's data deluge, approximate filters are particularly attractive to avoid expensive operations like remote data/disk accesses. Among the many filter variants available, it is non-trivial to find the most suitable one and its optimal configuration for a specific use-case. We provide open-source implementations for the most relevant filters (Bloom, Cuckoo, Morton, and Xor filters) and compare them in four key dimensions: false-positive rate, space consumption, build throughput, and lookup throughput. We improve upon existing state-of-the-art implementations with a new optimization, radix partitioning, which boosts the build and lookup throughput for large filters by up to 9x and 5x, respectively. Our in-depth evaluation first studies the impact of all available optimizations separately before combining them to determine the optimal filter for specific use-cases. While register-blocked Bloom filters offer the highest throughput, the new Xor filters are best suited when optimizing for small filter sizes or low false-positive rates.
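
The sketch below is a minimal Python illustration of the radix-partitioning idea, not the authors' optimized C++ implementation: a few high-order key bits select a small, cache-friendly partition, and all hashing happens inside that partition. Partition sizes, the hash function, and bit widths are illustrative assumptions.

```python
# Hedged sketch: a Bloom filter whose bit array is split into radix partitions
# selected by the key's high bits, so each partition stays small during
# bulk builds and lookups.  Not the paper's register-blocked/SIMD variants.
import hashlib

class PartitionedBloom:
    def __init__(self, bits_per_partition=1 << 16, partition_bits=8, k=4):
        self.k = k                                # hash functions per key
        self.partition_bits = partition_bits      # number of radix bits
        self.m = bits_per_partition               # bits per partition
        self.parts = [0] * (1 << partition_bits)  # each partition is an int bitmap

    def _hashes(self, key: int):
        h = hashlib.blake2b(key.to_bytes(8, "little")).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "little") % self.m

    def _partition(self, key: int) -> int:
        # radix partitioning: route the key by its top `partition_bits` bits
        return (key >> (64 - self.partition_bits)) & ((1 << self.partition_bits) - 1)

    def add(self, key: int):
        p = self._partition(key)
        for pos in self._hashes(key):
            self.parts[p] |= 1 << pos

    def __contains__(self, key: int) -> bool:
        p = self._partition(key)
        return all((self.parts[p] >> pos) & 1 for pos in self._hashes(key))

bf = PartitionedBloom()
bf.add(123456789)
assert 123456789 in bf
```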

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Ginette Lafit ◽  
Francis Tuerlinckx ◽  
Inez Myin-Germeys ◽  
Eva Ceulemans

Abstract Gaussian Graphical Models (GGMs) are extensively used in many research areas, such as genomics, proteomics, neuroimaging, and psychology, to study the partial correlation structure of a set of variables. This structure is visualized by drawing an undirected network, in which the variables constitute the nodes and the partial correlations the edges. In many applications, it makes sense to impose sparsity (i.e., some of the partial correlations are forced to zero) because sparsity is theoretically meaningful and/or because it improves the predictive accuracy of the fitted model. However, as we will show by means of extensive simulations, state-of-the-art estimation approaches for imposing sparsity on GGMs, such as the Graphical lasso, ℓ1-regularized nodewise regression, and joint sparse regression, fall short because they often yield too many false positives (i.e., partial correlations that are not properly set to zero). In this paper we present a new estimation approach that allows the false positive rate to be controlled better. Our approach consists of two steps: First, we estimate an undirected network using one of the three state-of-the-art estimation approaches. Second, we try to detect the false positives by flagging the partial correlations that are smaller in absolute value than a given threshold, which is determined through cross-validation; the flagged correlations are set to zero. Applying this new approach to the same simulated data shows that it indeed performs better. We also illustrate our approach by using it to estimate (1) a gene regulatory network for breast cancer data, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum and (3) a symptom network of patients with PTSD.
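
A minimal sketch of the two-step idea in Python, assuming the first step uses the graphical lasso (here via scikit-learn's GraphicalLassoCV); the threshold `t` is a placeholder for the value the authors select by cross-validation, whose exact criterion is not reproduced here.

```python
# Hedged sketch of the paper's thresholding step, assuming step 1 produced a
# sparse precision matrix via the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))            # toy data: 200 samples, 10 variables

prec = GraphicalLassoCV().fit(X).precision_   # step 1: sparse precision matrix
d = np.sqrt(np.diag(prec))
pcor = -prec / np.outer(d, d)                 # partial correlations from the precision
np.fill_diagonal(pcor, 0.0)

t = 0.1                                       # placeholder for the CV-chosen threshold
pcor_thresholded = np.where(np.abs(pcor) < t, 0.0, pcor)   # step 2: flag & zero small edges
```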


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
A Rosier ◽  
E Crespin ◽  
A Lazarus ◽  
G Laurent ◽  
A Menet ◽  
...  

Abstract Background Implantable Loop Recorders (ILRs) are increasingly used and generate a high workload for timely adjudication of ECG recordings. In particular, the excessive false positive rate leads to a significant review burden. Purpose A novel machine learning algorithm was developed to reclassify ILR episodes in order to decrease the false positive rate by 80% while maintaining 99% sensitivity. This study aims to evaluate the impact of this algorithm on reducing the number of abnormal episodes reported in Medtronic ILRs. Methods Among 20 European centers, all Medtronic ILR patients were enrolled during the second half of 2020. Using a remote monitoring platform, every ILR-transmitted episode was collected and anonymised. For every ILR-detected episode with a transmitted ECG, the new algorithm reclassified it, applying the same labels as the ILR (asystole, brady, AT/AF, VT, artifact, normal). We measured the number of episodes identified as false positive and reclassified as normal by the algorithm, and their proportion among all episodes. Results In 370 patients, ILRs recorded 3755 episodes, including 305 patient-triggered and 629 with no ECG transmitted. 2821 episodes were analyzed by the novel algorithm, which reclassified 1227 episodes as normal rhythm. These reclassified episodes accounted for 43% of analyzed episodes and 32.6% of all episodes recorded. Conclusion A novel machine learning algorithm significantly reduces the quantity of episodes flagged as abnormal and typically reviewed by healthcare professionals. Funding Acknowledgement Type of funding sources: None. Figure 1: ILR episodes analysis.
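
The reported proportions can be rechecked with a short computation (values taken from the abstract above):

```python
# Quick check of the proportions reported above.
total_recorded = 3755
analyzed = 2821                    # episodes with a transmitted ECG analyzed by the algorithm
reclassified_normal = 1227

print(reclassified_normal / analyzed)          # ~0.435 -> "43% of analyzed episodes"
print(reclassified_normal / total_recorded)    # ~0.327 -> "32.6% of all episodes recorded"
```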


2017 ◽  
Author(s):  
Harry Crane

A recent proposal to "redefine statistical significance" (Benjamin et al., Nature Human Behaviour, 2017) claims that false positive rates "would immediately improve" by factors greater than two and replication rates would double simply by changing the conventional cutoff for 'statistical significance' from P<0.05 to P<0.005. I analyze the veracity of these claims, focusing especially on how Benjamin et al. neglect the effects of P-hacking in assessing the impact of their proposal. My analysis shows that once P-hacking is accounted for, the perceived benefits of the lower threshold all but disappear, prompting two main conclusions: (i) The claimed improvements to false positive rate and replication rate in Benjamin et al. (2017) are exaggerated and misleading. (ii) There are plausible scenarios under which the lower cutoff will make the replication crisis worse.
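
For context, the sketch below reproduces the standard false-positive-rate calculation that underlies the Benjamin et al. claim; it is not Crane's full model. The power and prior probability of a true effect are illustrative assumptions, and the last call hints at how an inflated effective alpha from P-hacking can shrink the apparent gain.

```python
# Hedged sketch: share of significant findings that are false positives, given
# a significance threshold, statistical power, and the prior probability that a
# tested hypothesis is true.
def false_positive_rate(alpha, power=0.8, prior_true=0.1):
    fp = alpha * (1 - prior_true)          # expected false positives
    tp = power * prior_true                # expected true positives
    return fp / (fp + tp)

print(false_positive_rate(0.05))    # ~0.36 at the conventional cutoff
print(false_positive_rate(0.005))   # ~0.05 at the proposed stricter cutoff

# Crane's point, roughly: P-hacking inflates the effective alpha above the
# nominal one.  E.g. with a hypothetical effective alpha of 0.02 at the
# stricter cutoff, much of the apparent improvement disappears:
print(false_positive_rate(0.02))    # ~0.18, well above the naive ~0.05
```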


2020 ◽  
Vol 2020 (1) ◽  
pp. 235-255 ◽  
Author(s):  
Tobias Pulls ◽  
Rasmus Dahlberg

Abstract Website Fingerprinting (WF) attacks are a subset of traffic analysis attacks where a local passive attacker attempts to infer which websites a target victim is visiting over an encrypted tunnel, such as the anonymity network Tor. We introduce the security notion of a Website Oracle (WO) that gives a WF attacker the capability to determine whether a particular monitored website was among the websites visited by Tor clients at the time of a victim's trace. Our simulations show that combining a WO with a WF attack (which we refer to as a WF+WO attack) significantly reduces false positives for about half of all website visits and for the vast majority of websites visited over Tor. The measured false positive rate is on the order of one false positive per million classified website traces for websites around Alexa rank 10,000. Less popular monitored websites show orders of magnitude lower false positive rates. We argue that WOs are inherent to the setting of anonymity networks and should be an assumed capability of attackers when assessing WF attacks and defenses. Sources of WOs are abundant and available to a wide range of realistic attackers, e.g., due to the use of DNS, OCSP, and real-time bidding for online advertisement on the Internet, as well as the abundance of middleboxes and access logs. Access to a WO indicates that the evaluation of WF defenses in the open world should focus on the highest possible recall an attacker can achieve. Our simulations show that augmenting the Deep Fingerprinting WF attack by Sirinam et al. [60] with access to a WO significantly improves the attack against five state-of-the-art WF defenses, rendering some of them largely ineffective in this new WF+WO setting.
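
A minimal sketch of the WF+WO combination as described above; the interfaces classifier.predict and oracle.visited are hypothetical, not the authors' simulator. The oracle vetoes any "monitored" verdict for a site it cannot confirm was visited in the relevant window, which is what removes most false positives.

```python
# Hedged sketch of a WF+WO attack decision: accept the fingerprinting
# classifier's verdict only when the Website Oracle confirms the predicted
# site was actually visited during the trace's time window.
def wf_plus_wo(classifier, oracle, trace, window):
    site = classifier.predict(trace)          # WF attack: guess the visited monitored site
    if site is None:                          # trace classified as unmonitored
        return None
    if not oracle.visited(site, window):      # WO check: was this site visited at all?
        return None                           # oracle veto -> classifier false positive removed
    return site                               # both agree -> report the monitored site
```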


2021 ◽  
Author(s):  
Lucas Robidou ◽  
Pierre Peterlongo

Approximate membership query (AMQ) structures such as Cuckoo filters or Bloom filters are widely used for representing large sets of elements. Their lightweight space usage explains their success, as they are essentially the only way to scale to hundreds of billions or trillions of elements. However, they suffer by nature from unavoidable false-positive calls that bias downstream analyses of methods using these data structures. In this work we propose a simple strategy, and its implementation, for reducing the false-positive rate of any AMQ data structure indexing k-mers (words of length k). The method we propose, called findere, speeds up queries by a factor of two and decreases the false-positive rate by two orders of magnitude. This is achieved on the fly at query time, without modifying the original indexing data structure, without generating false-negative calls and with no memory overhead. With no drawback, this method, as simple as it is effective, reduces either the false-positive rate or the space required to represent a set for a given user-defined false-positive rate.
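
The abstract does not spell out the mechanism, but a plausible reading is sketched below: queries are made for longer K-mers, and a K-mer is reported present only if all of its constituent k-mers are present in the underlying AMQ, which drives the false-positive probability down multiplicatively, while any negative k-mer lets the scan skip ahead (the source of the speed-up). Function and parameter names are assumptions, not findere's API.

```python
# Hedged sketch of the assumed strategy, not the exact findere implementation.
def query_kmers(amq_contains, sequence, k, K):
    """Return a list of (start, is_positive) for every K-mer of `sequence` (K >= k)."""
    s = K - k + 1                                  # number of small k-mers per K-mer
    results = []
    i = 0
    while i + K <= len(sequence):
        positive = True
        for j in range(i, i + s):
            if not amq_contains(sequence[j:j + k]):
                positive = False
                skip = j + 1                       # every K-mer covering position j is negative
                break
        if positive:
            results.append((i, True))
            i += 1
        else:
            for p in range(i, min(skip, len(sequence) - K + 1)):
                results.append((p, False))
            i = skip                               # skip ahead: source of the speed-up
    return results

# Toy usage with a plain Python set standing in for the AMQ (k = 3, K = 5):
index = {"ACG", "CGT", "GTA"}
print(query_kmers(index.__contains__, "ACGTAC", k=3, K=5))   # [(0, True), (1, False)]
```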


2008 ◽  
Vol 108 (4) ◽  
pp. 210-213 ◽  
Author(s):  
Prosenjit Bose ◽  
Hua Guo ◽  
Evangelos Kranakis ◽  
Anil Maheshwari ◽  
Pat Morin ◽  
...  

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
K. Sujatha ◽  
V. Udayarani

Purpose The purpose of this paper is to improve privacy in healthcare datasets that hold sensitive information. Preventing privacy disclosure while still providing relevant information to legitimate users are competing goals. Moreover, the swift evolution of big data has brought considerable convenience to many aspects of life. Propagation and information sharing are two main facets of the big data era. Despite several research works on these aspects, as data grows incrementally, the likelihood of privacy leakage also expands substantially through the various benefits big data provides. Hence, safeguarding data privacy in such a complicated environment has become a major challenge. Design/methodology/approach In this study, a method called deep restricted additive homomorphic ElGamal privacy preservation (DR-AHEPP) is proposed to preserve the privacy of data even in the case of incremental data. An entropy-based differential privacy quasi-identification algorithm and the DR-AHEPP algorithm are designed to obtain, respectively, a privacy-preserved minimum falsified quasi-identifier set and computationally efficient privacy-preserved data. Findings Analysis results on the Diabetes 130-US hospitals dataset illustrate that the proposed DR-AHEPP method preserves privacy on incremental data more effectively than existing methods. A comparative analysis against state-of-the-art works is performed with the objective of minimizing information loss, false positive rate and execution time while achieving higher accuracy. Originality/value The paper demonstrates better performance on the Diabetes 130-US hospitals dataset, achieving high accuracy with low information loss and false positive rate. The results illustrate that the proposed method increases accuracy by 4% and reduces the false positive rate and information loss by 25% and 35%, respectively, compared to state-of-the-art works.
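
For background, the sketch below shows the standard additively homomorphic (exponential) ElGamal construction that the DR-AHEPP name builds on; the paper's own scheme, including the entropy-based quasi-identification step, is not reproduced, and the toy parameters are assumptions.

```python
# Hedged sketch of additively homomorphic (exponential) ElGamal, a standard
# building block; not the paper's DR-AHEPP construction.
import random

# Toy parameters only -- real deployments use a large safe prime or an elliptic curve.
p = 2 ** 127 - 1          # modulus (a Mersenne prime, fine for a toy example)
g = 3                     # group element (assumed toy value)

x = random.randrange(2, p - 1)        # private key
h = pow(g, x, p)                      # public key

def encrypt(m):
    r = random.randrange(2, p - 1)
    return (pow(g, r, p), (pow(g, m, p) * pow(h, r, p)) % p)   # (g^r, g^m * h^r)

def add(ct1, ct2):
    # Multiplying ciphertexts component-wise adds the plaintexts in the exponent.
    return ((ct1[0] * ct2[0]) % p, (ct1[1] * ct2[1]) % p)

def decrypt(ct, max_m=10_000):
    gm = (ct[1] * pow(ct[0], p - 1 - x, p)) % p     # recover g^m = c2 / c1^x
    for m in range(max_m):                          # brute-force the small discrete log
        if pow(g, m, p) == gm:
            return m
    raise ValueError("plaintext out of range")

print(decrypt(add(encrypt(17), encrypt(25))))       # -> 42
```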


2019 ◽  
Vol 104 (6) ◽  
pp. 822-826
Author(s):  
Zhichao Wu ◽  
Felipe A Medeiros ◽  
Robert N Weinreb ◽  
Christopher A Girkin ◽  
Linda M Zangwill

Purpose This study aimed to evaluate the specificity of commonly used cluster criteria for defining the presence of glaucomatous visual field abnormalities and the impact of variations in the criterion used. Methods This is an observational study including 607 eyes from 384 healthy participants, and 501 eyes from 345 participants with glaucoma, with at least two reliable 24–2 visual field tests. An abnormal visual field cluster was defined as the presence of ≥3 contiguous abnormal locations. Variations in this definition were evaluated and included (1) whether abnormalities were based on total deviation and/or pattern deviation values; (2) the probability cut-off for defining an abnormal location; and (3) whether abnormalities were required to be repeatable (within the same hemifield or at the same locations) or not. These definitions were also compared against pattern standard deviation (PSD) values. Results False-positive rates of the various cluster criteria ranged between 9% and 46% depending on the specific definitions used. Only definitions that required abnormalities to be repeatable at the same locations achieved a false-positive rate of ≤6%. The various cluster criteria generally performed similarly to or worse than the PSD values at detecting eyes with glaucoma. Conclusions Commonly used visual field cluster criteria have high false-positive rates that vary widely depending on the definition used. These findings highlight the need to carefully consider the criteria used when designing and interpreting glaucoma clinical studies. Trial registration number NCT00221923.
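
A minimal sketch of the baseline cluster criterion (≥3 contiguous abnormal locations): contiguity is taken here as 4-neighbour adjacency on the test grid, which is an assumption; the study's exact adjacency rule, hemifield restriction and repeatability requirements are not reproduced.

```python
# Hedged sketch: flag a visual field as abnormal when at least `min_size`
# contiguous abnormal test locations exist (4-neighbour adjacency assumed).
def has_abnormal_cluster(abnormal_points, min_size=3):
    """abnormal_points: set of (row, col) grid positions flagged abnormal."""
    remaining = set(abnormal_points)
    while remaining:
        stack = [remaining.pop()]          # start a new connected component
        size = 0
        while stack:
            r, c = stack.pop()
            size += 1
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    stack.append(nb)
        if size >= min_size:
            return True
    return False

print(has_abnormal_cluster({(2, 3), (2, 4), (3, 4)}))   # True: three contiguous locations
print(has_abnormal_cluster({(0, 0), (5, 5)}))           # False: no cluster of size >= 3
```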


2020 ◽  
Vol 7 (2) ◽  
pp. 190831
Author(s):  
Luis Morís Fernández ◽  
Miguel A. Vadillo

In the present article, we explore the influence of undisclosed flexibility in the analysis of reaction times (RTs). RTs entail some degrees of freedom of their own, due to their skewed distribution, the potential presence of outliers and the availability of different methods to deal with these issues. Moreover, these degrees of freedom are usually not considered part of the analysis itself, but preprocessing steps that are contingent on the data. We analysed the impact of these degrees of freedom on the false-positive rate using simulations over real and simulated data. When several preprocessing methods are used in combination, the false-positive rate can easily rise to 17%. This figure becomes more concerning if we consider that more degrees of freedom lie further down the analysis pipeline, potentially making the final false-positive rate much higher.
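
A minimal simulation in the spirit of the one described: two "conditions" are drawn from the same skewed RT distribution, several outlier-handling rules are tried, and a finding is counted whenever any of them reaches p < .05. The specific rules, sample sizes and distribution parameters are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch: undisclosed flexibility in RT preprocessing inflates the
# false-positive rate above the nominal 5% even when there is no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def preprocess(rts, rule):
    if rule == "none":
        return rts
    if rule == "sd2.5":
        return rts[np.abs(rts - rts.mean()) < 2.5 * rts.std()]   # SD-based trimming
    if rule == "cutoff":
        return rts[(rts > 200) & (rts < 2000)]                   # fixed cutoffs in ms
    if rule == "log":
        return np.log(rts)                                       # transform instead of trim
    raise ValueError(rule)

rules = ["none", "sd2.5", "cutoff", "log"]
false_positives = 0
n_sims = 2000
for _ in range(n_sims):
    a = rng.lognormal(mean=6.2, sigma=0.4, size=30)   # RT-like data, identical
    b = rng.lognormal(mean=6.2, sigma=0.4, size=30)   # distribution in both conditions
    pvals = [stats.ttest_ind(preprocess(a, r), preprocess(b, r)).pvalue for r in rules]
    false_positives += min(pvals) < 0.05              # report whichever rule "works"

print(false_positives / n_sims)                       # noticeably above the nominal 0.05
```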

