A four-dimensional analysis of partitioned approximate filters

2021 ◽  
Vol 14 (11) ◽  
pp. 2355-2368
Author(s):  
Tobias Schmidt ◽  
Maximilian Bandle ◽  
Jana Giceva

With today's data deluge, approximate filters are particularly attractive to avoid expensive operations like remote data/disk accesses. Among the many filter variants available, it is non-trivial to find the most suitable one and its optimal configuration for a specific use-case. We provide open-source implementations for the most relevant filters (Bloom, Cuckoo, Morton, and Xor filters) and compare them in four key dimensions: false-positive rate, space consumption, build throughput, and lookup throughput. We improve upon existing state-of-the-art implementations with a new optimization, radix partitioning, which boosts the build and lookup throughput for large filters by up to 9x and 5x, respectively. Our in-depth evaluation first studies the impact of all available optimizations separately before combining them to determine the optimal filter for specific use-cases. While register-blocked Bloom filters offer the highest throughput, the new Xor filters are best suited when optimizing for small filter sizes or low false-positive rates.
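
The sketch below is a minimal Python illustration of the radix-partitioning idea, not the authors' optimized C++ implementation: a few high-order key bits select a small, cache-friendly partition, and all hashing happens inside that partition. Partition sizes, the hash function, and bit widths are illustrative assumptions.

```python
# Hedged sketch: a Bloom filter whose bit array is split into radix partitions
# selected by the key's high bits, so each partition stays small during
# bulk builds and lookups.  Not the paper's register-blocked/SIMD variants.
import hashlib

class PartitionedBloom:
    def __init__(self, bits_per_partition=1 << 16, partition_bits=8, k=4):
        self.k = k                                # hash functions per key
        self.partition_bits = partition_bits      # number of radix bits
        self.m = bits_per_partition               # bits per partition
        self.parts = [0] * (1 << partition_bits)  # each partition is an int bitmap

    def _hashes(self, key: int):
        h = hashlib.blake2b(key.to_bytes(8, "little")).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "little") % self.m

    def _partition(self, key: int) -> int:
        # radix partitioning: route the key by its top `partition_bits` bits
        return (key >> (64 - self.partition_bits)) & ((1 << self.partition_bits) - 1)

    def add(self, key: int):
        p = self._partition(key)
        for pos in self._hashes(key):
            self.parts[p] |= 1 << pos

    def __contains__(self, key: int) -> bool:
        p = self._partition(key)
        return all((self.parts[p] >> pos) & 1 for pos in self._hashes(key))

bf = PartitionedBloom()
bf.add(123456789)
assert 123456789 in bf
```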

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Ginette Lafit ◽  
Francis Tuerlinckx ◽  
Inez Myin-Germeys ◽  
Eva Ceulemans

Abstract Gaussian Graphical Models (GGMs) are extensively used in many research areas, such as genomics, proteomics, neuroimaging, and psychology, to study the partial correlation structure of a set of variables. This structure is visualized by drawing an undirected network, in which the variables constitute the nodes and the partial correlations the edges. In many applications, it makes sense to impose sparsity (i.e., some of the partial correlations are forced to zero) because sparsity is theoretically meaningful and/or because it improves the predictive accuracy of the fitted model. However, as we will show by means of extensive simulations, state-of-the-art estimation approaches for imposing sparsity on GGMs, such as the Graphical lasso, ℓ1-regularized nodewise regression, and joint sparse regression, fall short because they often yield too many false positives (i.e., partial correlations that are not properly set to zero). In this paper we present a new estimation approach that allows the false positive rate to be controlled better. Our approach consists of two steps: First, we estimate an undirected network using one of the three state-of-the-art estimation approaches. Second, we try to detect the false positives by flagging the partial correlations that are smaller in absolute value than a given threshold, which is determined through cross-validation; the flagged correlations are set to zero. Applying this new approach to the same simulated data shows that it indeed performs better. We also illustrate our approach by using it to estimate (1) a gene regulatory network for breast cancer data, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum and (3) a symptom network of patients with PTSD.
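
A minimal sketch of the two-step idea in Python, assuming the first step uses the graphical lasso (here via scikit-learn's GraphicalLassoCV); the threshold `t` is a placeholder for the value the authors select by cross-validation, whose exact criterion is not reproduced here.

```python
# Hedged sketch of the paper's thresholding step, assuming step 1 produced a
# sparse precision matrix via the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))            # toy data: 200 samples, 10 variables

prec = GraphicalLassoCV().fit(X).precision_   # step 1: sparse precision matrix
d = np.sqrt(np.diag(prec))
pcor = -prec / np.outer(d, d)                 # partial correlations from the precision
np.fill_diagonal(pcor, 0.0)

t = 0.1                                       # placeholder for the CV-chosen threshold
pcor_thresholded = np.where(np.abs(pcor) < t, 0.0, pcor)   # step 2: flag & zero small edges
```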


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
A Rosier ◽  
E Crespin ◽  
A Lazarus ◽  
G Laurent ◽  
A Menet ◽  
...  

Abstract Background Implantable Loop Recorders (ILRs) are increasingly used and generate a high workload for timely adjudication of ECG recordings. In particular, the excessive false positive rate leads to a significant review burden. Purpose A novel machine learning algorithm was developed to reclassify ILR episodes in order to decrease the false positive rate by 80% while maintaining 99% sensitivity. This study aims to evaluate the impact of this algorithm on reducing the number of abnormal episodes reported in Medtronic ILRs. Methods Among 20 European centers, all Medtronic ILR patients were enrolled during the second half of 2020. Using a remote monitoring platform, every ILR-transmitted episode was collected and anonymised. For every ILR-detected episode with a transmitted ECG, the new algorithm reclassified it, applying the same labels as the ILR (asystole, brady, AT/AF, VT, artifact, normal). We measured the number of episodes identified as false positive and reclassified as normal by the algorithm, and their proportion among all episodes. Results In 370 patients, ILRs recorded 3755 episodes, including 305 patient-triggered and 629 with no ECG transmitted. 2821 episodes were analyzed by the novel algorithm, which reclassified 1227 episodes as normal rhythm. These reclassified episodes accounted for 43% of analyzed episodes and 32.6% of all episodes recorded. Conclusion A novel machine learning algorithm significantly reduces the quantity of episodes flagged as abnormal and typically reviewed by healthcare professionals. Funding Acknowledgement Type of funding sources: None. Figure 1: ILR episodes analysis.
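
The reported proportions can be rechecked with a short computation (values taken from the abstract above):

```python
# Quick check of the proportions reported above.
total_recorded = 3755
analyzed = 2821                    # episodes with a transmitted ECG analyzed by the algorithm
reclassified_normal = 1227

print(reclassified_normal / analyzed)          # ~0.435 -> "43% of analyzed episodes"
print(reclassified_normal / total_recorded)    # ~0.327 -> "32.6% of all episodes recorded"
```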


2017 ◽  
Author(s):  
Harry Crane

A recent proposal to "redefine statistical significance" (Benjamin et al., Nature Human Behaviour, 2017) claims that false positive rates "would immediately improve" by factors greater than two and replication rates would double simply by changing the conventional cutoff for 'statistical significance' from P<0.05 to P<0.005. I analyze the veracity of these claims, focusing especially on how Benjamin et al. neglect the effects of P-hacking in assessing the impact of their proposal. My analysis shows that once P-hacking is accounted for, the perceived benefits of the lower threshold all but disappear, prompting two main conclusions: (i) The claimed improvements to false positive rate and replication rate in Benjamin et al. (2017) are exaggerated and misleading. (ii) There are plausible scenarios under which the lower cutoff will make the replication crisis worse.
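
For context, the sketch below reproduces the standard false-positive-rate calculation that underlies the Benjamin et al. claim; it is not Crane's full model. The power and prior probability of a true effect are illustrative assumptions, and the last call hints at how an inflated effective alpha from P-hacking can shrink the apparent gain.

```python
# Hedged sketch: share of significant findings that are false positives, given
# a significance threshold, statistical power, and the prior probability that a
# tested hypothesis is true.
def false_positive_rate(alpha, power=0.8, prior_true=0.1):
    fp = alpha * (1 - prior_true)          # expected false positives
    tp = power * prior_true                # expected true positives
    return fp / (fp + tp)

print(false_positive_rate(0.05))    # ~0.36 at the conventional cutoff
print(false_positive_rate(0.005))   # ~0.05 at the proposed stricter cutoff

# Crane's point, roughly: P-hacking inflates the effective alpha above the
# nominal one.  E.g. with a hypothetical effective alpha of 0.02 at the
# stricter cutoff, much of the apparent improvement disappears:
print(false_positive_rate(0.02))    # ~0.18, well above the naive ~0.05
```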


2020 ◽  
Vol 2020 (1) ◽  
pp. 235-255 ◽  
Author(s):  
Tobias Pulls ◽  
Rasmus Dahlberg

Abstract Website Fingerprinting (WF) attacks are a subset of traffic analysis attacks where a local passive attacker attempts to infer which websites a target victim is visiting over an encrypted tunnel, such as the anonymity network Tor. We introduce the security notion of a Website Oracle (WO) that gives a WF attacker the capability to determine whether a particular monitored website was among the websites visited by Tor clients at the time of a victim's trace. Our simulations show that combining a WO with a WF attack (which we refer to as a WF+WO attack) significantly reduces false positives for about half of all website visits and for the vast majority of websites visited over Tor. The measured false positive rate is on the order of one false positive per million classified website traces for websites around Alexa rank 10,000. Less popular monitored websites show orders of magnitude lower false positive rates. We argue that WOs are inherent to the setting of anonymity networks and should be an assumed capability of attackers when assessing WF attacks and defenses. Sources of WOs are abundant and available to a wide range of realistic attackers, e.g., due to the use of DNS, OCSP, and real-time bidding for online advertisement on the Internet, as well as the abundance of middleboxes and access logs. Access to a WO indicates that the evaluation of WF defenses in the open world should focus on the highest possible recall an attacker can achieve. Our simulations show that augmenting the Deep Fingerprinting WF attack by Sirinam et al. [60] with access to a WO significantly improves the attack against five state-of-the-art WF defenses, rendering some of them largely ineffective in this new WF+WO setting.
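
A minimal sketch of the WF+WO combination as described above; the interfaces classifier.predict and oracle.visited are hypothetical, not the authors' simulator. The oracle vetoes any "monitored" verdict for a site it cannot confirm was visited in the relevant window, which is what removes most false positives.

```python
# Hedged sketch of a WF+WO attack decision: accept the fingerprinting
# classifier's verdict only when the Website Oracle confirms the predicted
# site was actually visited during the trace's time window.
def wf_plus_wo(classifier, oracle, trace, window):
    site = classifier.predict(trace)          # WF attack: guess the visited monitored site
    if site is None:                          # trace classified as unmonitored
        return None
    if not oracle.visited(site, window):      # WO check: was this site visited at all?
        return None                           # oracle veto -> classifier false positive removed
    return site                               # both agree -> report the monitored site
```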


2021 ◽  
Author(s):  
Lucas Robidou ◽  
Pierre Peterlongo

Approximate membership query (AMQ) structures such as Cuckoo filters or Bloom filters are widely used for representing large sets of elements. Their lightweight space usage explains their success, as they are essentially the only way to scale to hundreds of billions or trillions of elements. However, they suffer by nature from unavoidable false-positive calls that bias downstream analyses of methods using these data structures. In this work we propose a simple strategy, and its implementation, for reducing the false-positive rate of any AMQ data structure indexing k-mers (words of length k). The method we propose, called findere, speeds up queries by a factor of two and decreases the false-positive rate by two orders of magnitude. This is achieved on the fly at query time, without modifying the original indexing data structure, without generating false-negative calls and with no memory overhead. With no drawback, this method, as simple as it is effective, reduces either the false-positive rate or the space required to represent a set for a given user-defined false-positive rate.
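
The abstract does not spell out the mechanism, but a plausible reading is sketched below: queries are made for longer K-mers, and a K-mer is reported present only if all of its constituent k-mers are present in the underlying AMQ, which drives the false-positive probability down multiplicatively, while any negative k-mer lets the scan skip ahead (the source of the speed-up). Function and parameter names are assumptions, not findere's API.

```python
# Hedged sketch of the assumed strategy, not the exact findere implementation.
def query_kmers(amq_contains, sequence, k, K):
    """Return a list of (start, is_positive) for every K-mer of `sequence` (K >= k)."""
    s = K - k + 1                                  # number of small k-mers per K-mer
    results = []
    i = 0
    while i + K <= len(sequence):
        positive = True
        for j in range(i, i + s):
            if not amq_contains(sequence[j:j + k]):
                positive = False
                skip = j + 1                       # every K-mer covering position j is negative
                break
        if positive:
            results.append((i, True))
            i += 1
        else:
            for p in range(i, min(skip, len(sequence) - K + 1)):
                results.append((p, False))
            i = skip                               # skip ahead: source of the speed-up
    return results

# Toy usage with a plain Python set standing in for the AMQ (k = 3, K = 5):
index = {"ACG", "CGT", "GTA"}
print(query_kmers(index.__contains__, "ACGTAC", k=3, K=5))   # [(0, True), (1, False)]
```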


2008 ◽  
Vol 108 (4) ◽  
pp. 210-213 ◽  
Author(s):  
Prosenjit Bose ◽  
Hua Guo ◽  
Evangelos Kranakis ◽  
Anil Maheshwari ◽  
Pat Morin ◽  
...  

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
K. Sujatha ◽  
V. Udayarani

Purpose The purpose of this paper is to improve privacy in healthcare datasets that hold sensitive information. Preventing privacy disclosure while still providing relevant information to legitimate users are competing goals. Moreover, the swift evolution of big data has brought considerable convenience to many aspects of life. Propagation and information sharing are two main facets of the big data era. Despite several research works on these aspects, as data grows incrementally, the likelihood of privacy leakage also expands substantially through the various benefits big data provides. Hence, safeguarding data privacy in such a complicated environment has become a major challenge. Design/methodology/approach In this study, a method called deep restricted additive homomorphic ElGamal privacy preservation (DR-AHEPP) is proposed to preserve the privacy of data even in the case of incremental data. An entropy-based differential privacy quasi-identification algorithm and the DR-AHEPP algorithm are designed to obtain, respectively, a privacy-preserved minimum falsified quasi-identifier set and computationally efficient privacy-preserved data. Findings Analysis results on the Diabetes 130-US hospitals dataset illustrate that the proposed DR-AHEPP method preserves privacy on incremental data more effectively than existing methods. A comparative analysis against state-of-the-art works is performed with the objective of minimizing information loss, false positive rate and execution time while achieving higher accuracy. Originality/value The paper demonstrates better performance on the Diabetes 130-US hospitals dataset, achieving high accuracy with low information loss and false positive rate. The results illustrate that the proposed method increases accuracy by 4% and reduces the false positive rate and information loss by 25% and 35%, respectively, compared to state-of-the-art works.
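
For background, the sketch below shows the standard additively homomorphic (exponential) ElGamal construction that the DR-AHEPP name builds on; the paper's own scheme, including the entropy-based quasi-identification step, is not reproduced, and the toy parameters are assumptions.

```python
# Hedged sketch of additively homomorphic (exponential) ElGamal, a standard
# building block; not the paper's DR-AHEPP construction.
import random

# Toy parameters only -- real deployments use a large safe prime or an elliptic curve.
p = 2 ** 127 - 1          # modulus (a Mersenne prime, fine for a toy example)
g = 3                     # group element (assumed toy value)

x = random.randrange(2, p - 1)        # private key
h = pow(g, x, p)                      # public key

def encrypt(m):
    r = random.randrange(2, p - 1)
    return (pow(g, r, p), (pow(g, m, p) * pow(h, r, p)) % p)   # (g^r, g^m * h^r)

def add(ct1, ct2):
    # Multiplying ciphertexts component-wise adds the plaintexts in the exponent.
    return ((ct1[0] * ct2[0]) % p, (ct1[1] * ct2[1]) % p)

def decrypt(ct, max_m=10_000):
    gm = (ct[1] * pow(ct[0], p - 1 - x, p)) % p     # recover g^m = c2 / c1^x
    for m in range(max_m):                          # brute-force the small discrete log
        if pow(g, m, p) == gm:
            return m
    raise ValueError("plaintext out of range")

print(decrypt(add(encrypt(17), encrypt(25))))       # -> 42
```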


2019 ◽  
Vol 104 (6) ◽  
pp. 822-826
Author(s):  
Zhichao Wu ◽  
Felipe A Medeiros ◽  
Robert N Weinreb ◽  
Christopher A Girkin ◽  
Linda M Zangwill

Purpose This study aimed to evaluate the specificity of commonly used cluster criteria for defining the presence of glaucomatous visual field abnormalities and the impact of variations in the criterion used. Methods This is an observational study including 607 eyes from 384 healthy participants, and 501 eyes from 345 participants with glaucoma, with at least two reliable 24–2 visual field tests. An abnormal visual field cluster was defined as the presence of ≥3 contiguous abnormal locations. Variations in this definition were evaluated and included (1) whether abnormalities were based on total deviation and/or pattern deviation values; (2) the probability cut-off for defining an abnormal location; and (3) whether abnormalities were required to be repeatable (within the same hemifield or at the same locations) or not. These definitions were also compared against pattern standard deviation (PSD) values. Results False-positive rates of the various cluster criteria ranged between 9% and 46% depending on the specific definitions used. Only definitions that required abnormalities to be repeatable at the same locations achieved a false-positive rate of ≤6%. The various cluster criteria generally performed similarly to or worse than the PSD values at detecting eyes with glaucoma. Conclusions Commonly used visual field cluster criteria have high false-positive rates that vary widely depending on the definition used. These findings highlight the need to carefully consider the criteria used when designing and interpreting glaucoma clinical studies. Trial registration number NCT00221923.
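
A minimal sketch of the baseline cluster criterion (≥3 contiguous abnormal locations): contiguity is taken here as 4-neighbour adjacency on the test grid, which is an assumption; the study's exact adjacency rule, hemifield restriction and repeatability requirements are not reproduced.

```python
# Hedged sketch: flag a visual field as abnormal when at least `min_size`
# contiguous abnormal test locations exist (4-neighbour adjacency assumed).
def has_abnormal_cluster(abnormal_points, min_size=3):
    """abnormal_points: set of (row, col) grid positions flagged abnormal."""
    remaining = set(abnormal_points)
    while remaining:
        stack = [remaining.pop()]          # start a new connected component
        size = 0
        while stack:
            r, c = stack.pop()
            size += 1
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    stack.append(nb)
        if size >= min_size:
            return True
    return False

print(has_abnormal_cluster({(2, 3), (2, 4), (3, 4)}))   # True: three contiguous locations
print(has_abnormal_cluster({(0, 0), (5, 5)}))           # False: no cluster of size >= 3
```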


2020 ◽  
Vol 7 (2) ◽  
pp. 190831
Author(s):  
Luis Morís Fernández ◽  
Miguel A. Vadillo

In the present article, we explore the influence of undisclosed flexibility in the analysis of reaction times (RTs). RTs entail some degrees of freedom of their own, due to their skewed distribution, the potential presence of outliers and the availability of different methods to deal with these issues. Moreover, these degrees of freedom are usually not considered part of the analysis itself, but preprocessing steps that are contingent on the data. We analysed the impact of these degrees of freedom on the false-positive rate using simulations over real and simulated data. When several preprocessing methods are used in combination, the false-positive rate can easily rise to 17%. This figure becomes more concerning if we consider that more degrees of freedom lie further down the analysis pipeline, potentially making the final false-positive rate much higher.
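
A minimal simulation in the spirit of the one described: two "conditions" are drawn from the same skewed RT distribution, several outlier-handling rules are tried, and a finding is counted whenever any of them reaches p < .05. The specific rules, sample sizes and distribution parameters are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch: undisclosed flexibility in RT preprocessing inflates the
# false-positive rate above the nominal 5% even when there is no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def preprocess(rts, rule):
    if rule == "none":
        return rts
    if rule == "sd2.5":
        return rts[np.abs(rts - rts.mean()) < 2.5 * rts.std()]   # SD-based trimming
    if rule == "cutoff":
        return rts[(rts > 200) & (rts < 2000)]                   # fixed cutoffs in ms
    if rule == "log":
        return np.log(rts)                                       # transform instead of trim
    raise ValueError(rule)

rules = ["none", "sd2.5", "cutoff", "log"]
false_positives = 0
n_sims = 2000
for _ in range(n_sims):
    a = rng.lognormal(mean=6.2, sigma=0.4, size=30)   # RT-like data, identical
    b = rng.lognormal(mean=6.2, sigma=0.4, size=30)   # distribution in both conditions
    pvals = [stats.ttest_ind(preprocess(a, r), preprocess(b, r)).pvalue for r in rules]
    false_positives += min(pvals) < 0.05              # report whichever rule "works"

print(false_positives / n_sims)                       # noticeably above the nominal 0.05
```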

