Challenges in Peptide-Spectrum Matching: a Robust and Reproducible Statistical Framework for Removing Low-Accuracy, High-Scoring Hits

2019
Author(s):
Shane L. Hubler
Praveen Kumar
Subina Mehta
Caleb Easterly
James E. Johnson
...

Abstract
Workflows for large-scale mass spectrometry (MS)-based shotgun proteomics can potentially lead to costly errors in the form of incorrect peptide-spectrum matches (PSMs). To improve the robustness of these workflows, we have investigated the use of the precursor mass discrepancy (PMD) to detect and filter potentially false PSMs that nonetheless have a high confidence score. We identified and addressed three cases of unexpected bias in PMD results: time of acquisition within an LC-MS run, decoy PSMs, and peptide length. We created a post-analysis Bayesian confidence measure based on score and PMD, called PMD-FDR. We tested PMD-FDR on four datasets across three types of MS-based proteomics projects: standard (single organism; reference database), proteogenomics (single organism; customized genome-based database plus reference), and metaproteomics (microorganism community; customized conglomerate database). On a ground-truth dataset and other representative data, PMD-FDR was able to detect 60-80% of likely incorrect PSMs (false hits) while losing only 5% of correct PSMs (true hits). PMD-FDR can also be used to evaluate data quality for results generated by different experimental PSM-generating workflows, assisting in method development. Going forward, PMD-FDR should provide detection of high-scoring but likely false hits, aiding applications that rely heavily on accurate PSMs, such as proteogenomics and metaproteomics.
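The filtering idea can be sketched minimally: compute each PSM's precursor mass discrepancy in parts per million and flag hits whose discrepancy is anomalous. This is a toy illustration only; the field names and fixed tolerance are hypothetical, and the actual PMD-FDR combines PMD with the search-engine score in a Bayesian model rather than applying a hard cutoff.

```python
def ppm_discrepancy(observed_mz, theoretical_mz):
    """Precursor mass discrepancy (PMD) in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def pmd_filter(psms, max_abs_ppm=10.0):
    """Keep PSMs whose |PMD| is within the tolerance (hypothetical hard cutoff)."""
    return [p for p in psms
            if abs(ppm_discrepancy(p["obs_mz"], p["theo_mz"])) <= max_abs_ppm]

psms = [
    {"id": "A", "obs_mz": 500.2505, "theo_mz": 500.2500},  # ~1 ppm: plausible hit
    {"id": "B", "obs_mz": 500.2600, "theo_mz": 500.2500},  # ~20 ppm: likely false hit
]
kept = pmd_filter(psms)  # only "A" survives
```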

2018
pp. 1726-1745
Author(s):
Dawei Li
Mooi Choo Chuah

Many state-of-the-art image retrieval systems include a re-ranking step that refines the initial ranking list to improve retrieval accuracy. In this paper, we present a novel two-stage k-NN re-ranking algorithm. In stage one, we generate an expanded list of candidate database images for re-ranking so that lower-ranked ground-truth images are included and re-ranked. In stage two, we re-rank the list of candidate images using a confidence score calculated with rRBO, a newly proposed ranking-list similarity measure. In addition, we propose the rLoCATe image feature, which captures robust color and texture information on salient image patches and shows superior performance in the image retrieval task. We evaluate the proposed re-ranking algorithm on various initial ranking lists created using both SIFT and rLoCATe on two popular benchmark datasets, along with a large-scale one-million-image distractor dataset. The results show that our proposed algorithm is not sensitive to different parameter configurations, and it outperforms existing k-NN re-ranking methods.
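The rRBO measure itself is not specified in the abstract, but its name suggests it builds on rank-biased overlap (RBO), a top-weighted similarity between ranked lists. A minimal sketch of the standard truncated RBO sum (parameter names assumed, not taken from the paper):

```python
def rbo(list_a, list_b, p=0.9, depth=10):
    """Rank-biased overlap, truncated at `depth`: top-weighted agreement of two
    ranked lists. `p` controls how steeply weight decays with rank."""
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))  # agreement at depth d
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

identical = rbo(["a", "b", "c"], ["a", "b", "c"], depth=3)  # high similarity
disjoint = rbo(["a", "b", "c"], ["x", "y", "z"], depth=3)   # no overlap -> 0
```

Because early ranks receive geometrically larger weight, two lists that agree at the top score higher than lists that agree only near the tail, which is the behavior a re-ranking confidence score needs.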



Author(s):  
A. V. Ponomarev

Introduction: Large-scale human-computer systems that involve people of various skills and motivation in information processing are currently used in a wide spectrum of applications. An acute problem in such systems is assessing the expected quality of each contributor, for example, in order to penalize incompetent or inaccurate contributors and to promote diligent ones. Purpose: To develop a method of assessing a contributor's expected quality in community tagging systems, using only the generally unreliable and incomplete information provided by contributors (with ground truth tags unknown). Results: A mathematical model of community image tagging (including a model of a contributor) is proposed, along with a method of assessing a contributor's expected quality. The method compares the tag sets provided by different contributors for the same images; it is a modification of the pairwise comparison method with the preference relation replaced by a special domination characteristic. Expected contributor quality is evaluated as a positive eigenvector of the pairwise domination characteristic matrix. Community tagging simulation has confirmed that the proposed method adequately estimates the expected quality of contributors to a community tagging system (provided that contributor behavior fits the proposed model). Practical relevance: The results can be used in the development of systems based on coordinated community efforts (primarily, community tagging systems).
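The eigenvector step can be sketched with power iteration: for a matrix with strictly positive entries, repeated multiplication converges to the positive leading (Perron) eigenvector. The domination matrix below is hypothetical, invented purely for illustration; the paper's actual domination characteristic is computed from compared tag sets.

```python
def quality_scores(domination, iters=100):
    """Expected contributor quality as the leading (Perron) eigenvector of a
    positive pairwise domination matrix, computed by power iteration."""
    n = len(domination)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(domination[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        v = [x / s for x in w]  # normalize so scores sum to 1
    return v

# Hypothetical domination matrix for three contributors:
# entry [i][j] measures how strongly contributor i dominates contributor j.
D = [[1.0, 2.0, 3.0],
     [0.5, 1.0, 2.0],
     [0.3, 0.5, 1.0]]
scores = quality_scores(D)  # contributor 0 dominates, so it scores highest
```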


2021
Vol 22 (1)
Author(s):
Daniel E. Runcie
Jiayi Qu
Hao Cheng
Lorin Crawford

Abstract
Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present , a statistical framework and associated software package for mixed-model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.
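For context, the multivariate linear mixed model referred to here can be written in standard notation (not taken verbatim from the paper) as:

```latex
% n individuals, t traits. Y: n x t phenotype matrix, X: fixed-effect design,
% B: fixed effects, Z: genetic design, U: random genetic values, E: residuals.
Y = XB + ZU + E, \qquad
\operatorname{vec}(U) \sim N\!\left(0,\; G \otimes K\right), \qquad
\operatorname{vec}(E) \sim N\!\left(0,\; R \otimes I_n\right)
```

where K is the kinship matrix and G and R are the t-by-t genetic and residual covariance matrices. The number of covariance parameters grows on the order of t squared, which is why the model becomes fragile beyond a handful of traits.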


2021
pp. 17-28
Author(s):
A. V. Gochakov
O. Yu. Antokhina
V. N. Krupchatnikov
Yu. V. Martynova
...

Many large-scale dynamic phenomena in the Earth's atmosphere are associated with the processes of propagation and breaking of Rossby waves. A new method for identifying Rossby wave breaking (RWB) is proposed. It is based on the detection of breaking centers by analyzing the shape of the contours of potential vorticity or temperature on quasimaterial surfaces: isentropic and iserthelic (surfaces of constant Ertel potential vorticity (PV)), with further clustering of RWB centers into larger regions. The method is applied to a set of constant PV levels (0.3 to 9.8 PVU with a step of 0.5 PVU) at the potential temperature level of 350 K for 12:00 UTC. The ERA-Interim reanalysis data from 1979 to 2019 are used for the method development. The type of RWB (cyclonic/anticyclonic), its area, and its center are determined by analyzing the vortex geometry at each PV level for every day. The RWBs obtained at this stage are designated as elementary breakings. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was applied to all elementary breakings for each month. As a result, a graphic dataset describing the locations and dynamics of RWBs for every month from 1979 to 2019 is formed. The RWB frequency is also evaluated for each longitude, taking into account the duration of each RWB and the number of levels involved, as well as the anomalies of these parameters.
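The clustering step groups nearby elementary breaking centers into larger regions, with unclustered centers treated as noise. A minimal stdlib stand-in for this DBSCAN-style density grouping (the coordinates, `eps`, and `min_pts` values are illustrative, not from the study):

```python
import math

def cluster_centers(points, eps=3.0, min_pts=2):
    """Greedy density grouping: points within `eps` of a cluster member join
    that cluster; clusters smaller than `min_pts` are marked noise (-1).
    A simplified stand-in for the DBSCAN algorithm used in the study."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        stack, members = [i], []
        labels[i] = cluster_id
        while stack:  # expand the cluster from point i
            j = stack.pop()
            members.append(j)
            for k in range(len(points)):
                if labels[k] is None and math.dist(points[j], points[k]) <= eps:
                    labels[k] = cluster_id
                    stack.append(k)
        if len(members) < min_pts:
            for j in members:
                labels[j] = -1  # too small: noise
        else:
            cluster_id += 1
    return labels

centers = [(10.0, 55.0), (11.0, 54.5), (10.5, 55.5),  # one breaking region
           (120.0, 40.0), (121.0, 41.0),              # another region
           (300.0, 70.0)]                             # isolated center -> noise
labels = cluster_centers(centers)
```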


Author(s):  
Sharon E. Nicholson
Douglas Klotter
Adam T. Hartman

Abstract
This article examined rainfall enhancement over Lake Victoria. Estimates of over-lake rainfall were compared with rainfall in the surrounding lake catchment. Four satellite products were initially tested against estimates based on gauges or water balance models. These included TRMM 3B43, IMERG V06 Final Run (IMERG-F), CHIRPS2, and PERSIANN-CDR. There was agreement among the satellite products for catchment rainfall but a large disparity among them for over-lake rainfall. IMERG-F was clearly an outlier, exceeding the estimate from TRMM 3B43 by 36%. The overestimation by IMERG-F was likely related to passive microwave assessments of strong convection, such as prevails over Lake Victoria. Overall, TRMM 3B43 showed the best agreement with the "ground truth" and was used in further analyses. Over-lake rainfall was found to be enhanced compared to catchment rainfall in all months. During the March-to-May long rains the enhancement varied between 40% and 50%. During the October-to-December short rains the enhancement varied between 33% and 44%. Even during the two dry seasons the enhancement was at least 20%, and over 50% in some months. While the magnitude of enhancement varied from month to month, the seasonal cycle was essentially the same for over-lake and catchment rainfall, suggesting that the dominant influence on over-lake rainfall is the large-scale environment. The association with Mesoscale Convective Systems (MCSs) was also evaluated. The similarity of the spatial patterns of rainfall and MCS count each month suggested that these systems produced a major share of rainfall over the lake. Similarity in interannual variability further supported this conclusion.


Author(s):  
Maggie Hess

Purpose: Intraventricular hemorrhage (IVH) affects nearly 15% of preterm infants and can lead to ventricular dilation and cognitive impairment. MR-guided focused ultrasound surgery (MRgFUS) is being investigated to ablate IVH clots. This procedure requires accurate, fast, and consistent quantification of ventricle and clot volumes. Methods: We developed a semi-autonomous segmentation (SAS) algorithm for measuring changes in ventricle and clot volumes. Images are normalized, and ventricle and clot masks are registered to the images. Voxels of the registered masks and voxels obtained by thresholding the normalized images are used as seed points for competitive region growing, which provides the final segmentation. The user selects the areas of interest for correspondence after thresholding, and these selections are the final seeds for region growing. SAS was evaluated on an IVH porcine model. Results: SAS was compared to ground-truth manual segmentation (MS) for accuracy, efficiency, and consistency. Accuracy was determined by comparing clot and ventricle volumes produced by SAS and MS; in a two one-sided tests (TOST) procedure, SAS and MS were found to be significantly equivalent (p < 0.01). SAS was on average 15 times faster than MS (p < 0.01). Consistency was determined by repeated segmentation of the same image by both SAS and manual methods; SAS was significantly more consistent than MS (p < 0.05). Conclusion: SAS is a viable method to quantify the IVH clot and the lateral brain ventricles, and it is being used in a large-scale porcine study of MRgFUS treatment of IVH clot lysis.
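The seeded region-growing step can be illustrated on a toy 1-D intensity array (the actual SAS grows competing regions in 3-D MR volumes; the single fixed tolerance and the data below are hypothetical):

```python
def region_grow(image, seeds, tol=10):
    """Grow a region from seed indices, adding neighbors whose intensity
    differs from the seed mean by at most `tol`."""
    seed_mean = sum(image[s] for s in seeds) / len(seeds)
    region = set(seeds)
    frontier = list(seeds)
    while frontier:
        i = frontier.pop()
        for j in (i - 1, i + 1):  # 1-D neighborhood
            if 0 <= j < len(image) and j not in region \
                    and abs(image[j] - seed_mean) <= tol:
                region.add(j)
                frontier.append(j)
    return sorted(region)

image = [100, 102, 98, 150, 155, 99, 101]  # bright "clot" span at indices 3-4
region = region_grow(image, seeds=[0])     # grows until blocked by the bright span
```

Growth from index 0 stops at the bright span, so indices 5 and 6 stay outside the region even though their intensities match the seed; in the competitive version, a second region seeded inside the bright span would claim those voxels it reaches first.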


2020
Vol 1 (2)
pp. 101-123
Author(s):
Hiroaki Shiokawa
Yasunori Futamura

This paper addresses the problem of finding clusters in graph-structured data such as Web graphs and social networks. Graph clustering is a fundamental technique for understanding the structures present in such complex graphs. In the Web and data mining communities, modularity-based graph clustering algorithms are successfully used in many applications. However, it is difficult for modularity-based methods to find fine-grained clusters hidden in large-scale graphs; the methods fail to reproduce the ground truth. In this paper, we present a novel modularity-based algorithm, CAV, that shows better clustering results than the traditional algorithm. The proposed algorithm incorporates cohesiveness-aware vector partitioning into graph spectral analysis to improve clustering accuracy. This paper also presents a novel efficient algorithm, P-CAV, for further improving the clustering speed of CAV; P-CAV is an extension of CAV that utilizes thread-based parallelization on a many-core CPU. Our extensive experiments on synthetic and public datasets demonstrate the performance superiority of our approaches over state-of-the-art approaches.
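Modularity, the objective these methods optimize, can be computed directly from its definition: Q sums, over same-cluster node pairs, the observed edges minus the edges expected under a random degree-preserving null model. A minimal illustration (CAV's cohesiveness-aware vector partitioning itself is not reproduced here):

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).
    adj: symmetric 0/1 adjacency matrix (list of lists);
    communities: list mapping node index -> cluster id."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)   # 2m: each undirected edge counted twice
    deg = [sum(row) for row in adj]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m

# Two triangles joined by one edge; the natural split is one triangle per cluster.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
q_split = modularity(adj, [0, 0, 0, 1, 1, 1])   # natural two-cluster partition
q_single = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one cluster -> Q = 0
```

The natural split scores 5/14 while the trivial single-cluster partition scores 0, matching the intuition that modularity rewards partitions denser than the random expectation.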


2020
Vol 36 (10)
pp. 3011-3017
Author(s):
Olga Mineeva
Mateo Rojas-Carulla
Ruth E Ley
Bernhard Schölkopf
Nicholas D Youngblut

Abstract
Motivation: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state of the art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information: Supplementary data are available at Bioinformatics online.


2018
Author(s):
Florian Meier
Andreas-David Brunner
Scarlet Koch
Heiner Koch
Markus Lubeck
...

Abstract
In bottom-up proteomics, peptides are separated by liquid chromatography with elution peak widths in the range of seconds, while mass spectra are acquired in about 100 microseconds with time-of-flight (TOF) instruments. This allows adding ion mobility as a third dimension of separation. Among several formats, trapped ion mobility spectrometry (TIMS) is attractive due to its small size, low voltage requirements, and high efficiency of ion utilization. We have recently demonstrated a scan mode termed parallel accumulation - serial fragmentation (PASEF), which multiplies the sequencing speed without any loss in sensitivity (Meier et al., PMID: 26538118). Here we introduce the timsTOF Pro instrument, which optimally implements online PASEF. It features an orthogonal ion path into the ion mobility device, limiting the amount of debris entering the instrument and making it very robust in daily operation. We investigate different precursor selection schemes for shotgun proteomics to optimally allocate in excess of 100 fragmentation events per second. More than 800,000 fragmentation spectra in standard 120 min LC runs are easily achievable, which can be used for near-exhaustive precursor selection in complex mixtures or for re-sequencing weak precursors. MaxQuant identified more than 6,400 proteins in single-run HeLa analyses without matching to a library, and with high quantitative reproducibility (R > 0.97). Online PASEF achieves remarkable sensitivity, with more than 2,900 proteins identified in 30 min runs of only 10 ng HeLa digest. We also show that highly reproducible collisional cross sections can be acquired on a large scale (R > 0.99). PASEF on the timsTOF Pro is a valuable addition to the technological toolbox in proteomics, with a number of unique operating modes that are only beginning to be explored.

