Toward optimal fingerprint indexing for large scale genomics

Motivation: To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index grows. Results: We present NIQKI, a novel structure using well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, that can index any kind of fingerprint and provides optimal queries, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a matter of days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe that this approach can lead to tremendous improvement, allowing fast queries that scale to extensive genomic databases. Availability and implementation: We wrote the NIQKI index as an open-source C++ library under the AGPL3 license, available at https://github.com/Malfoy/NIQKI. It is designed as a user-friendly tool and comes with usage samples.
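The idea of querying fingerprints through an inverted index, rather than scanning every stored sketch, can be illustrated with a minimal Python sketch. This is not the NIQKI implementation: it uses plain truncated MinHash fingerprints instead of (h,m)-HMH, and the names `sketch`, `InvertedIndex`, `size`, and `bits` are hypothetical choices made for the illustration.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers of a sequence (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(seq, k=5, size=16, bits=16):
    """Plain MinHash sketch: one truncated minimum per seed (illustrative only)."""
    mask = (1 << bits) - 1
    ks = kmers(seq, k)
    return [min(hash64(km, seed) for km in ks) & mask for seed in range(size)]


class InvertedIndex:
    """Maps (position, fingerprint value) to the genomes holding that value."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, name, fingerprints):
        for pos, value in enumerate(fingerprints):
            self.postings[(pos, value)].append(name)

    def query(self, fingerprints, min_hits=1):
        """Count shared fingerprints per genome by visiting only matching lists."""
        hits = defaultdict(int)
        for pos, value in enumerate(fingerprints):
            for name in self.postings.get((pos, value), ()):
                hits[name] += 1
        return {name: c for name, c in hits.items() if c >= min_hits}
```

A query touches one posting list per fingerprint position, so its cost depends on the sketch size and the number of matches reported, not on the total number of indexed genomes.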

Identification of and Correction for Publication Bias: Comment

10.31222/osf.io/dh87m ◽

2019 ◽

Author(s):

Amanda Kvarven ◽

Eirik Strømland ◽

Magnus Johannesson

Keyword(s):

Publication Bias ◽

False Positive ◽

Large Scale ◽

Meta Analysis ◽

False Positive Rate ◽

Effect Sizes ◽

Replication Studies ◽

Moderate Reduction ◽

Positive Rate ◽

Meta Analyses

Andrews & Kasy (2019) propose an approach for adjusting effect sizes in meta-analysis for publication bias. We use the Andrews-Kasy estimator to adjust the result of 15 meta-analyses and compare the adjusted results to 15 large-scale multiple labs replication studies estimating the same effects. The pre-registered replications provide precisely estimated effect sizes, which do not suffer from publication bias. The Andrews-Kasy approach leads to a moderate reduction of the inflated effect sizes in the meta-analyses. However, the approach still overestimates effect sizes by a factor of about two or more and has an estimated false positive rate of between 57% and 100%.

Efficient Detection of Large-Scale Multimedia Network Information Data Anomalies Based on the Rule-Extracting Matrix Algorithm

Advances in Multimedia ◽

10.1155/2021/3299891 ◽

2021 ◽

Vol 2021 ◽

pp. 1-7

Author(s):

Jie Zhao

Keyword(s):

Large Scale ◽

False Positive Rate ◽

Detection Accuracy ◽

Network Information ◽

Matrix Algorithm ◽

Efficient Detection ◽

Sample Data ◽

Multimedia Social Networks ◽

Positive Rate ◽

Rule Extracting

With the continuous development of multimedia social networks, online public opinion information is becoming increasingly popular. The rule-extracting matrix algorithm can effectively improve the probability of information data to be tested. Network information data anomaly detection is realized through probability calculation, in which the prior probability is calculated, so as to detect abnormally high network data. Practical results show that the rule-extracting matrix algorithm can effectively control the false positive rate of sample data, improves detection accuracy, and delivers efficient detection performance.

A Universal, Genomewide GuideFinder for CRISPR/Cas9 Targeting in Microbial Genomes

mSphere ◽

10.1128/msphere.00086-20 ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

Michelle Spoto ◽

Changhui Guan ◽

Elizabeth Fleming ◽

Julia Oh

Keyword(s):

Gene Function ◽

Large Scale ◽

Essential Gene ◽

Bacterial Species ◽

Bacterial Genome ◽

Model Organisms ◽

Design Parameters ◽

Bacterial Genomes ◽

Wide Range ◽

User Friendly

ABSTRACT The CRISPR/Cas system has significant potential to facilitate gene editing in a variety of bacterial species. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) represent modifications of the CRISPR/Cas9 system utilizing a catalytically inactive Cas9 protein for transcription repression and activation, respectively. While CRISPRi and CRISPRa have tremendous potential to systematically investigate gene function in bacteria, few programs are specifically tailored to identify guides in draft bacterial genomes genomewide. Furthermore, few programs offer open-source code with flexible design parameters for bacterial targeting. To address these limitations, we created GuideFinder, a customizable, user-friendly program that can design guides for any annotated bacterial genome. GuideFinder designs guides from NGG protospacer-adjacent motif (PAM) sites for any number of genes by the use of an annotated genome and FASTA file input by the user. Guides are filtered according to user-defined design parameters and removed if they contain any off-target matches. Iteration with lowered parameter thresholds allows the program to design guides for genes that did not produce guides with the more stringent parameters, one of several features unique to GuideFinder. GuideFinder can also identify paired guides for targeting multiplicity, whose validity we tested experimentally. GuideFinder has been tested on a variety of diverse bacterial genomes, finding guides for 95% of genes on average. Moreover, guides designed by the program are functionally useful—focusing on CRISPRi as a potential application—as demonstrated by essential gene knockdown in two staphylococcal species. Through the large-scale generation of guides, this open-access software will improve accessibility to CRISPR/Cas studies of a variety of bacterial species. 
IMPORTANCE With the explosion in our understanding of human and environmental microbial diversity, corresponding efforts to understand gene function in these organisms are strongly needed. CRISPR/Cas9 technology has revolutionized interrogation of gene function in a wide variety of model organisms. Efficient CRISPR guide design is required for systematic gene targeting. However, existing tools are not adapted for the broad needs of microbial targeting, which include extraordinary species and subspecies genetic diversity, the overwhelming majority of which is characterized by draft genomes. In addition, flexibility in guide design parameters is important to consider the wide range of factors that can affect guide efficacy, many of which can be species and strain specific. We designed GuideFinder, a customizable, user-friendly program that addresses the limitations of existing software and that can design guides for any annotated bacterial genome with numerous features that facilitate guide design in a wide variety of microorganisms.
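The core step described in the abstract, enumerating candidate guides next to NGG PAM sites and filtering them by design parameters, can be sketched as follows. This is an illustrative Python sketch, not GuideFinder's code: the function names, the 20-nt guide length default, and the GC-content bounds `min_gc` and `max_gc` are assumptions, and off-target checking is omitted.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_guides(seq, guide_len=20, min_gc=0.0, max_gc=1.0):
    """List (strand, offset, guide) for guides directly 5' of an NGG PAM.

    The offset is counted in the orientation of the scanned strand, so for
    the "-" strand it refers to the reverse-complemented sequence.
    """
    seq = seq.upper()
    guides = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        # i is the position of the N in NGG; the guide ends just before it.
        for i in range(guide_len, len(s) - 2):
            if s[i + 1:i + 3] == "GG":
                guide = s[i - guide_len:i]
                if min_gc <= gc_fraction(guide) <= max_gc:
                    guides.append((strand, i - guide_len, guide))
    return guides
```

Rerunning `find_guides` with relaxed bounds for genes that returned no guides would mirror the iteration with lowered thresholds that the abstract mentions.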

A Partial Correlation Screening Approach for Controlling the False Positive Rate in Sparse Gaussian Graphical Models

Scientific Reports ◽

10.1038/s41598-019-53795-x ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Ginette Lafit ◽

Francis Tuerlinckx ◽

Inez Myin-Germeys ◽

Eva Ceulemans

Keyword(s):

Graphical Models ◽

False Positive ◽

Partial Correlation ◽

State Of The Art ◽

False Positive Rate ◽

False Positives ◽

Gaussian Graphical Models ◽

Undirected Network ◽

Partial Correlations ◽

Positive Rate

Abstract Gaussian Graphical Models (GGMs) are extensively used in many research areas, such as genomics, proteomics, neuroimaging, and psychology, to study the partial correlation structure of a set of variables. This structure is visualized by drawing an undirected network, in which the variables constitute the nodes and the partial correlations the edges. In many applications, it makes sense to impose sparsity (i.e., some of the partial correlations are forced to zero) because sparsity is theoretically meaningful and/or because it improves the predictive accuracy of the fitted model. However, as we will show by means of extensive simulations, state-of-the-art estimation approaches for imposing sparsity on GGMs, such as the Graphical lasso, ℓ1 regularized nodewise regression, and joint sparse regression, fall short because they often yield too many false positives (i.e., partial correlations that are not properly set to zero). In this paper we present a new estimation approach that allows better control of the false positive rate. Our approach consists of two steps: First, we estimate an undirected network using one of the three state-of-the-art estimation approaches. Second, we try to detect the false positives by flagging the partial correlations that are smaller in absolute value than a given threshold, which is determined through cross-validation; the flagged correlations are set to zero. Applying this new approach to the same simulated data shows that it indeed performs better. We also illustrate our approach by using it to estimate (1) a gene regulatory network for breast cancer data, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum, and (3) a symptom network of patients with PTSD.
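The second step of the approach, zeroing partial correlations below a threshold, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the precision matrix is taken as already estimated by some first-step method, the threshold `tau` is passed in directly rather than chosen by cross-validation as in the paper, and the function names are hypothetical. It relies on the standard identity that the partial correlation between variables i and j equals the negated precision entry divided by the square root of the product of the two diagonal entries.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of lists) to partial correlations.

    Uses rho_ij = -p_ij / sqrt(p_ii * p_jj); the diagonal is left at zero
    because a node has no edge to itself.
    """
    p = len(precision)
    rho = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                rho[i][j] = -precision[i][j] / math.sqrt(
                    precision[i][i] * precision[j][j]
                )
    return rho


def threshold_edges(rho, tau):
    """Set to zero every partial correlation smaller than tau in absolute value."""
    return [[v if abs(v) >= tau else 0.0 for v in row] for row in rho]
```

In practice `tau` would be selected on held-out data; the fixed value here only shows the mechanics of the screening step.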

Bloom Filter-Based Secure Data Forwarding in Large-Scale Cyber-Physical Systems

Mathematical Problems in Engineering ◽

10.1155/2015/150512 ◽

2015 ◽

Vol 2015 ◽

pp. 1-12

Author(s):

Siyu Lin ◽

Hao Wu

Keyword(s):

False Positive ◽

Large Scale ◽

False Positive Rate ◽

Bloom Filter ◽

Cyber Physical Systems ◽

Security Requirements ◽

Data Forwarding ◽

Physical Systems ◽

Secure Data ◽

Positive Rate

Cyber-physical systems (CPSs) connect with the physical world via communication networks, which significantly increases their security risks. To protect sensitive data, secure forwarding is an essential component of CPSs. However, because of their significantly increased scale and diversity, CPSs have high-dimensional, multiattribute, and multilevel security requirements, which place high demands on the query and storage of secure forwarding information. To tackle these challenges, we propose a practical secure data forwarding scheme for CPSs. Considering the limited storage capability and computational power of entities, we adopt a Bloom filter to store the secure forwarding information for each entity, which achieves a good balance between storage consumption and query delay. Furthermore, a novel link-based Bloom filter construction method is designed to reduce the false positive rate during Bloom filter construction. Finally, the effects of the false positive rate on the performance of Bloom filter-based secure forwarding with different routing policies are discussed.
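For readers unfamiliar with the data structure, the trade-off between storage and false positives can be seen in a minimal standard Bloom filter. The Python sketch below is not the link-based construction proposed in the paper: the class, its parameters `n_bits` and `n_hashes`, and the SHA-256-based hashing are assumptions chosen for illustration. The helper `expected_fpr` uses the usual approximation (1 - e^(-kn/m))^k for k hash functions, n inserted items, and m bits.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter over string items (illustrative only)."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by hashing the item with a seed prefix.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def expected_fpr(n_items, n_bits, n_hashes):
    """Approximate false positive rate of a standard Bloom filter."""
    return (1.0 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes
```

A forwarding entity could, for example, add the identifiers of permitted next hops and test membership before forwarding; a membership test never misses an inserted item, but may occasionally accept one that was never added.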

iSUMO - integrative prediction of functionally relevant SUMOylation events

10.1101/056564 ◽

2016 ◽

Author(s):

Xiaotong Yao ◽

Shuvadeep Maity ◽

Shashank Gandhi ◽

Marcin Imielenski ◽

Christine Vogel

Keyword(s):

Protein Interactions ◽

Large Scale ◽

Rna Binding ◽

Rna Binding Proteins ◽

False Positive Rate ◽

Protein Protein Interactions ◽

Cellular Functions ◽

Positive Rate ◽

Protein Nucleic Acid ◽

Scale Experiment

Abstract Post-translational modifications by the Small Ubiquitin-like Modifier (SUMO) are essential for diverse cellular functions. Large-scale experiments and sequence-based predictions have identified thousands of SUMOylated proteins. However, the overlap between the datasets is small, suggesting many false positives with low functional relevance. Therefore, we integrated ~800 sequence features and protein characteristics such as cellular function and protein-protein interactions in a machine learning approach to score likely functional SUMOylation events (iSUMO). iSUMO is trained on a total of 24 large-scale datasets, and it predicts 2,291 and 706 SUMO targets in human and yeast, respectively. These estimates are five times higher than what existing sequence-based tools predict at the same 5% false positive rate. Protein-protein and protein-nucleic acid interactions are highly predictive of protein SUMOylation, supporting a role of the modification in protein complex formation. We note the marked prevalence of SUMOylation amongst RNA-binding proteins. We validate iSUMO predictions by experimental or other evidence. iSUMO therefore represents a comprehensive tool to identify high-confidence, functional SUMOylation events for human and yeast.

Deep Learning Model Improves Radiologists’ Performance in Detection and Classification of Breast Lesions

10.21203/rs.3.rs-746374/v1 ◽

2021 ◽

Author(s):

Ying-Shi Sun ◽

Yu-Hong Qu ◽

Dong Wang ◽

Yi Li ◽

Lin Ye ◽

...

Keyword(s):

Artificial Intelligence ◽

Deep Learning ◽

Roc Curve ◽

False Positive ◽

Large Scale ◽

False Positive Rate ◽

Training Dataset ◽

Validation Dataset ◽

Breast Lesions ◽

Positive Rate

Abstract Background: Computer-aided diagnosis using deep learning algorithms has been initially applied in the field of mammography, but there is no large-scale clinical application. Methods: This study proposed to develop and verify an artificial intelligence model based on mammography. Firstly, retrospectively collected mammograms from six centers were randomized to a training dataset and a validation dataset for establishing the model. Secondly, the model was tested by comparing 12 radiologists’ performance with and without it. Finally, prospectively collected multicenter mammograms were diagnosed by radiologists with the model. The detection and diagnostic capabilities were evaluated using the free-response receiver operating characteristic (FROC) curve and ROC curve. Results: The sensitivity of the model for detecting lesions after matching was 0.908 at a false positive rate of 0.25 in unilateral images. The area under the ROC curve (AUC) for distinguishing benign from malignant lesions was 0.855 (95% CI: 0.830, 0.880). The performance of the 12 radiologists with the model was higher than that of radiologists alone (AUC: 0.852 vs. 0.808, P = 0.005). The mean reading time with the model was shorter than that of reading alone (80.18 s vs. 62.28 s, P = 0.03). In prospective application, the sensitivity of detection reached 0.887 at a false positive rate of 0.25; the AUC of radiologists with the model was 0.983 (95% CI: 0.978, 0.988), with sensitivity, specificity, PPV, and NPV of 94.36%, 98.07%, 87.76%, and 99.09%, respectively. Conclusions: The artificial intelligence model exhibits high accuracy for detecting and diagnosing breast lesions, improves diagnostic accuracy, and saves time. Trial registration: NCT, NCT03708978. Registered 17 April 2018, https://register.clinicaltrials.gov/prs/app/ NCT03708978
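The diagnostic metrics reported above all derive from the four cells of a confusion matrix. The short Python sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts.

    tp/fn: malignant cases called positive/negative;
    tn/fp: benign cases called negative/positive.
    """
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }


# Hypothetical counts, for illustration only.
example = diagnostic_metrics(tp=90, fp=20, tn=80, fn=10)
```

Sensitivity and specificity describe the test itself, whereas PPV and NPV also depend on how common malignant lesions are in the evaluated population.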

RiceRelativesGD: a genomic database of rice relatives for rice research

Database ◽

10.1093/database/baz110 ◽

2019 ◽

Vol 2019 ◽

Author(s):

Lingfeng Mao ◽

Meihong Chen ◽

Qinjie Chu ◽

Lei Jia ◽

Most Humaira Sultana ◽

...

Keyword(s):

Large Scale ◽

Genetic Resource ◽

Oryza Sativa L ◽

Specific Gene ◽

Cultivated Rice ◽

Biotic And Abiotic Stresses ◽

Genomic Databases ◽

Genomic Database ◽

User Friendly ◽

External Stimuli

Abstract Rice (Oryza sativa L.) is one of the most important crops worldwide. Its relatives, including phylogenetically related species of rice and paddy weeds with a similar ecological niche, can provide crucial genetic resources (such as resistance to biotic and abiotic stresses and high photosynthetic efficiency) for rice research. Although many rice genomic databases have been constructed, a database providing large-scale curated genomic data from rice relatives and offering specific gene resources is still lacking. Here, we present RiceRelativesGD, a user-friendly genomic database of rice relatives. RiceRelativesGD integrates large-scale genomic resources from 2 cultivated rice and 11 rice relatives, including 208 321 specific genes and 13 643 genes related to photosynthesis and responsive to external stimuli. Diverse bioinformatics tools are embedded in the database, which allow users to search, visualize and download the information of interest. To our knowledge, this is the first genomic database providing a centralized genetic resource of rice relatives. RiceRelativesGD will serve as a significant and comprehensive knowledgebase for the rice community.

Wearable Sensors for Monitoring Human Motion: A Review on Mechanisms, Materials, and Challenges

SLAS TECHNOLOGY Translating Life Sciences Innovation ◽

10.1177/2472630319891128 ◽

2019 ◽

Vol 25 (1) ◽

pp. 9-24 ◽

Cited By ~ 4

Author(s):

S. Zohreh Homayounfar ◽

Trisha L. Andrew

Keyword(s):

Large Scale ◽

Large Range ◽

State Of The Art ◽

Wearable Sensors ◽

Human Locomotion ◽

Human Motion ◽

Human Gait ◽

Wearable Electronics ◽

New Horizons ◽

User Friendly

The emergence of flexible wearable electronics as a new platform for accurate, unobtrusive, user-friendly, and longitudinal sensing has opened new horizons for personalized assistive tools for monitoring human locomotion and physiological signals. Herein, we survey recent advances in methodologies and materials involved in unobtrusively sensing a medium to large range of applied pressures and motions, such as those encountered in large-scale body and limb movements or posture detection. We discuss three commonly used methodologies in human gait studies: inertial, optical, and angular sensors. Next, we survey the various kinds of electromechanical devices (piezoresistive, piezoelectric, capacitive, triboelectric, and transistive) that are incorporated into these sensor systems; define the key metrics used to quantitate, compare, and optimize the efficiency of these technologies; and highlight state-of-the-art examples. In the end, we provide the readers with guidelines and perspectives to address the current challenges of the field.

Deep learning-based detection and segmentation of diffusion abnormalities in acute ischemic stroke

Communications Medicine ◽

10.1038/s43856-021-00062-8 ◽

2021 ◽

Vol 1 (1) ◽

Author(s):

Chin-Fu Liu ◽

Johnny Hsu ◽

Xin Xu ◽

Sandhya Ramachandran ◽

Victor Wang ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

False Positive Rate ◽

Learning Networks ◽

Large Dataset ◽

Lower False Positive Rate ◽

Proposed Model ◽

Positive Rate ◽

Total Agreement ◽

Clinical And Translational Research

Abstract Background: Accessible tools to efficiently detect and segment diffusion abnormalities in acute strokes are highly anticipated by the clinical and research communities. Methods: We developed a tool with deep learning networks trained and tested on a large dataset of 2,348 clinical diffusion-weighted MRIs of patients with acute and sub-acute ischemic strokes, and further tested for generalization on 280 MRIs of an external dataset (STIR). Results: Our proposed model outperforms generic networks and DeepMedic, particularly in small lesions, with a lower false positive rate, balanced precision and sensitivity, and robustness to data perturbations (e.g., artefacts, low resolution, technical heterogeneity). The agreement with human delineation rivals the inter-evaluator agreement; the automated lesion quantification of volume and contrast has virtually total agreement with human quantification. Conclusion: Our tool is fast, public, and accessible to non-experts, has minimal computational requirements, and detects and segments lesions via a single command line. Therefore, it fulfills the conditions to perform large-scale, reliable, and reproducible clinical and translational research.
