Challenges in Peptide-Spectrum Matching: a Robust and Reproducible Statistical Framework for Removing Low-Accuracy, High-Scoring Hits

2019
Author(s):
Shane L. Hubler
Praveen Kumar
Subina Mehta
Caleb Easterly
James E. Johnson
...

Abstract
Workflows for large-scale mass spectrometry (MS)-based shotgun proteomics can potentially lead to costly errors in the form of incorrect peptide-spectrum matches (PSMs). To improve the robustness of these workflows, we have investigated the use of the precursor mass discrepancy (PMD) to detect and filter potentially false PSMs that nonetheless have a high confidence score. We identified and addressed three cases of unexpected bias in PMD results: time of acquisition within an LC-MS run, decoy PSMs, and peptide length. We created a post-analysis Bayesian confidence measure based on score and PMD, called PMD-FDR. We tested PMD-FDR on four datasets across three types of MS-based proteomics projects: standard (single organism; reference database), proteogenomics (single organism; customized genome-based database plus reference), and metaproteomics (microorganism community; customized conglomerate database). On a ground-truth dataset and other representative data, PMD-FDR was able to detect 60-80% of likely incorrect PSMs (false hits) while losing only 5% of correct PSMs (true hits). PMD-FDR can also be used to evaluate data quality for results generated by different experimental PSM-generating workflows, assisting in method development. Going forward, PMD-FDR should provide detection of high-scoring but likely false hits, aiding applications that rely heavily on accurate PSMs, such as proteogenomics and metaproteomics.
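The filtering idea can be sketched minimally: compute each PSM's precursor mass discrepancy in parts per million and flag hits whose discrepancy is anomalous. This is a toy illustration only; the field names and fixed tolerance are hypothetical, and the actual PMD-FDR combines PMD with the search-engine score in a Bayesian model rather than applying a hard cutoff.

```python
def ppm_discrepancy(observed_mz, theoretical_mz):
    """Precursor mass discrepancy (PMD) in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def pmd_filter(psms, max_abs_ppm=10.0):
    """Keep PSMs whose |PMD| is within the tolerance (hypothetical hard cutoff)."""
    return [p for p in psms
            if abs(ppm_discrepancy(p["obs_mz"], p["theo_mz"])) <= max_abs_ppm]

psms = [
    {"id": "A", "obs_mz": 500.2505, "theo_mz": 500.2500},  # ~1 ppm: plausible hit
    {"id": "B", "obs_mz": 500.2600, "theo_mz": 500.2500},  # ~20 ppm: likely false hit
]
kept = pmd_filter(psms)  # only "A" survives
```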

2018
pp. 1726-1745
Author(s):
Dawei Li
Mooi Choo Chuah

Many state-of-the-art image retrieval systems include a re-ranking step that refines the initial ranking list to improve retrieval accuracy. In this paper, we present a novel two-stage k-NN re-ranking algorithm. In stage one, we generate an expanded list of candidate database images for re-ranking so that lower-ranked ground-truth images are included and re-ranked. In stage two, we re-rank the list of candidate images using a confidence score calculated with rRBO, a newly proposed ranking-list similarity measure. In addition, we propose the rLoCATe image feature, which captures robust color and texture information on salient image patches and shows superior performance in the image retrieval task. We evaluate the proposed re-ranking algorithm on various initial ranking lists created using both SIFT and rLoCATe on two popular benchmark datasets, along with a large-scale one-million-image distractor dataset. The results show that our proposed algorithm is not sensitive to different parameter configurations, and it outperforms existing k-NN re-ranking methods.
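The rRBO measure itself is not specified in the abstract, but its name suggests it builds on rank-biased overlap (RBO), a top-weighted similarity between ranked lists. A minimal sketch of the standard truncated RBO sum (parameter names assumed, not taken from the paper):

```python
def rbo(list_a, list_b, p=0.9, depth=10):
    """Rank-biased overlap, truncated at `depth`: top-weighted agreement of two
    ranked lists. `p` controls how steeply weight decays with rank."""
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))  # agreement at depth d
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

identical = rbo(["a", "b", "c"], ["a", "b", "c"], depth=3)  # high similarity
disjoint = rbo(["a", "b", "c"], ["x", "y", "z"], depth=3)   # no overlap -> 0
```

Because early ranks receive geometrically larger weight, two lists that agree at the top score higher than lists that agree only near the tail, which is the behavior a re-ranking confidence score needs.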



Author(s):  
A. V. Ponomarev

Introduction: Large-scale human-computer systems that involve people of various skills and motivation in information processing are currently used in a wide spectrum of applications. An acute problem in such systems is assessing the expected quality of each contributor, for example, in order to penalize incompetent or inaccurate contributors and to promote diligent ones. Purpose: To develop a method of assessing a contributor's expected quality in community tagging systems, using only the generally unreliable and incomplete information provided by contributors (with ground truth tags unknown). Results: A mathematical model of community image tagging (including a model of a contributor) is proposed, along with a method of assessing a contributor's expected quality. The method compares the tag sets provided by different contributors for the same images; it is a modification of the pairwise comparison method with the preference relation replaced by a special domination characteristic. Expected contributor quality is evaluated as a positive eigenvector of the pairwise domination characteristic matrix. Community tagging simulation has confirmed that the proposed method adequately estimates the expected quality of contributors to a community tagging system (provided that contributor behavior fits the proposed model). Practical relevance: The results can be used in the development of systems based on coordinated community efforts (primarily, community tagging systems).
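The eigenvector step can be sketched with power iteration: for a matrix with strictly positive entries, repeated multiplication converges to the positive leading (Perron) eigenvector. The domination matrix below is hypothetical, invented purely for illustration; the paper's actual domination characteristic is computed from compared tag sets.

```python
def quality_scores(domination, iters=100):
    """Expected contributor quality as the leading (Perron) eigenvector of a
    positive pairwise domination matrix, computed by power iteration."""
    n = len(domination)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(domination[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        v = [x / s for x in w]  # normalize so scores sum to 1
    return v

# Hypothetical domination matrix for three contributors:
# entry [i][j] measures how strongly contributor i dominates contributor j.
D = [[1.0, 2.0, 3.0],
     [0.5, 1.0, 2.0],
     [0.3, 0.5, 1.0]]
scores = quality_scores(D)  # contributor 0 dominates, so it scores highest
```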


2021
Vol 22 (1)
Author(s):
Daniel E. Runcie
Jiayi Qu
Hao Cheng
Lorin Crawford

Abstract
Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present , a statistical framework and associated software package for mixed-model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.
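For context, the multivariate linear mixed model referred to here can be written in standard notation (not taken verbatim from the paper) as:

```latex
% n individuals, t traits. Y: n x t phenotype matrix, X: fixed-effect design,
% B: fixed effects, Z: genetic design, U: random genetic values, E: residuals.
Y = XB + ZU + E, \qquad
\operatorname{vec}(U) \sim N\!\left(0,\; G \otimes K\right), \qquad
\operatorname{vec}(E) \sim N\!\left(0,\; R \otimes I_n\right)
```

where K is the kinship matrix and G and R are the t-by-t genetic and residual covariance matrices. The number of covariance parameters grows on the order of t squared, which is why the model becomes fragile beyond a handful of traits.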


2021
pp. 17-28
Author(s):
A. V. Gochakov
O. Yu. Antokhina
V. N. Krupchatnikov
Yu. V. Martynova
...

Many large-scale dynamic phenomena in the Earth's atmosphere are associated with the processes of propagation and breaking of Rossby waves. A new method for identifying Rossby wave breaking (RWB) is proposed. It is based on the detection of breaking centers by analyzing the shape of the contours of potential vorticity or temperature on quasimaterial surfaces: isentropic and iserthelic (surfaces of constant Ertel potential vorticity (PV)), with further clustering of RWB centers into larger regions. The method is applied to a set of constant PV levels (0.3 to 9.8 PVU with a step of 0.5 PVU) at the potential temperature level of 350 K for 12:00 UTC. The ERA-Interim reanalysis data from 1979 to 2019 are used for the method development. The type of RWB (cyclonic/anticyclonic), its area, and its center are determined by analyzing the vortex geometry at each PV level for every day. The RWBs obtained at this stage are designated as elementary breakings. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was applied to all elementary breakings for each month. As a result, a graphic dataset describing the locations and dynamics of RWBs for every month from 1979 to 2019 is formed. The RWB frequency is also evaluated for each longitude, taking into account the duration of each RWB and the number of levels involved, as well as the anomalies of these parameters.
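The clustering step groups nearby elementary breaking centers into larger regions, with unclustered centers treated as noise. A minimal stdlib stand-in for this DBSCAN-style density grouping (the coordinates, `eps`, and `min_pts` values are illustrative, not from the study):

```python
import math

def cluster_centers(points, eps=3.0, min_pts=2):
    """Greedy density grouping: points within `eps` of a cluster member join
    that cluster; clusters smaller than `min_pts` are marked noise (-1).
    A simplified stand-in for the DBSCAN algorithm used in the study."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        stack, members = [i], []
        labels[i] = cluster_id
        while stack:  # expand the cluster from point i
            j = stack.pop()
            members.append(j)
            for k in range(len(points)):
                if labels[k] is None and math.dist(points[j], points[k]) <= eps:
                    labels[k] = cluster_id
                    stack.append(k)
        if len(members) < min_pts:
            for j in members:
                labels[j] = -1  # too small: noise
        else:
            cluster_id += 1
    return labels

centers = [(10.0, 55.0), (11.0, 54.5), (10.5, 55.5),  # one breaking region
           (120.0, 40.0), (121.0, 41.0),              # another region
           (300.0, 70.0)]                             # isolated center -> noise
labels = cluster_centers(centers)
```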


Author(s):  
Sharon E. Nicholson
Douglas Klotter
Adam T. Hartman

Abstract
This article examined rainfall enhancement over Lake Victoria. Estimates of over-lake rainfall were compared with rainfall in the surrounding lake catchment. Four satellite products were initially tested against estimates based on gauges or water balance models. These included TRMM 3B43, IMERG V06 Final Run (IMERG-F), CHIRPS2, and PERSIANN-CDR. There was agreement among the satellite products for catchment rainfall but a large disparity among them for over-lake rainfall. IMERG-F was clearly an outlier, exceeding the estimate from TRMM 3B43 by 36%. The overestimation by IMERG-F was likely related to passive microwave assessments of strong convection, such as prevails over Lake Victoria. Overall, TRMM 3B43 showed the best agreement with the "ground truth" and was used in further analyses. Over-lake rainfall was found to be enhanced compared to catchment rainfall in all months. During the March-to-May long rains the enhancement varied between 40% and 50%. During the October-to-December short rains the enhancement varied between 33% and 44%. Even during the two dry seasons the enhancement was at least 20%, and over 50% in some months. While the magnitude of enhancement varied from month to month, the seasonal cycle was essentially the same for over-lake and catchment rainfall, suggesting that the dominant influence on over-lake rainfall is the large-scale environment. The association with Mesoscale Convective Systems (MCSs) was also evaluated. The similarity of the spatial patterns of rainfall and MCS count each month suggested that these systems produced a major share of rainfall over the lake. Similarity in interannual variability further supported this conclusion.


Author(s):  
Maggie Hess

Purpose: Intraventricular hemorrhage (IVH) affects nearly 15% of preterm infants and can lead to ventricular dilation and cognitive impairment. MR-guided focused ultrasound surgery (MRgFUS) is being investigated to ablate IVH clots. This procedure requires accurate, fast, and consistent quantification of ventricle and clot volumes. Methods: We developed a semi-autonomous segmentation (SAS) algorithm for measuring changes in ventricle and clot volumes. Images are normalized, and ventricle and clot masks are registered to the images. Voxels of the registered masks and voxels obtained by thresholding the normalized images are used as seed points for competitive region growing, which provides the final segmentation. The user selects the areas of interest for correspondence after thresholding, and these selections are the final seeds for region growing. SAS was evaluated on an IVH porcine model. Results: SAS was compared to ground-truth manual segmentation (MS) for accuracy, efficiency, and consistency. Accuracy was determined by comparing clot and ventricle volumes produced by SAS and MS; in a two one-sided tests (TOST) procedure, SAS and MS were found to be significantly equivalent (p < 0.01). SAS was on average 15 times faster than MS (p < 0.01). Consistency was determined by repeated segmentation of the same image by both SAS and manual methods; SAS was significantly more consistent than MS (p < 0.05). Conclusion: SAS is a viable method to quantify the IVH clot and the lateral brain ventricles, and it is being used in a large-scale porcine study of MRgFUS treatment of IVH clot lysis.
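The seeded region-growing step can be illustrated on a toy 1-D intensity array (the actual SAS grows competing regions in 3-D MR volumes; the single fixed tolerance and the data below are hypothetical):

```python
def region_grow(image, seeds, tol=10):
    """Grow a region from seed indices, adding neighbors whose intensity
    differs from the seed mean by at most `tol`."""
    seed_mean = sum(image[s] for s in seeds) / len(seeds)
    region = set(seeds)
    frontier = list(seeds)
    while frontier:
        i = frontier.pop()
        for j in (i - 1, i + 1):  # 1-D neighborhood
            if 0 <= j < len(image) and j not in region \
                    and abs(image[j] - seed_mean) <= tol:
                region.add(j)
                frontier.append(j)
    return sorted(region)

image = [100, 102, 98, 150, 155, 99, 101]  # bright "clot" span at indices 3-4
region = region_grow(image, seeds=[0])     # grows until blocked by the bright span
```

Growth from index 0 stops at the bright span, so indices 5 and 6 stay outside the region even though their intensities match the seed; in the competitive version, a second region seeded inside the bright span would claim those voxels it reaches first.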


2020
Vol 1 (2)
pp. 101-123
Author(s):
Hiroaki Shiokawa
Yasunori Futamura

This paper addresses the problem of finding clusters in graph-structured data such as Web graphs and social networks. Graph clustering is a fundamental technique for understanding the structures present in such complex graphs. In the Web and data mining communities, modularity-based graph clustering algorithms are successfully used in many applications. However, it is difficult for modularity-based methods to find fine-grained clusters hidden in large-scale graphs; the methods fail to reproduce the ground truth. In this paper, we present a novel modularity-based algorithm, CAV, that shows better clustering results than the traditional algorithm. The proposed algorithm incorporates cohesiveness-aware vector partitioning into graph spectral analysis to improve clustering accuracy. This paper also presents a novel efficient algorithm, P-CAV, for further improving the clustering speed of CAV; P-CAV is an extension of CAV that utilizes thread-based parallelization on a many-core CPU. Our extensive experiments on synthetic and public datasets demonstrate the performance superiority of our approaches over state-of-the-art approaches.
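Modularity, the objective these methods optimize, can be computed directly from its definition: Q sums, over same-cluster node pairs, the observed edges minus the edges expected under a random degree-preserving null model. A minimal illustration (CAV's cohesiveness-aware vector partitioning itself is not reproduced here):

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).
    adj: symmetric 0/1 adjacency matrix (list of lists);
    communities: list mapping node index -> cluster id."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)   # 2m: each undirected edge counted twice
    deg = [sum(row) for row in adj]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m

# Two triangles joined by one edge; the natural split is one triangle per cluster.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
q_split = modularity(adj, [0, 0, 0, 1, 1, 1])   # natural two-cluster partition
q_single = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one cluster -> Q = 0
```

The natural split scores 5/14 while the trivial single-cluster partition scores 0, matching the intuition that modularity rewards partitions denser than the random expectation.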


2020
Vol 36 (10)
pp. 3011-3017
Author(s):
Olga Mineeva
Mateo Rojas-Carulla
Ruth E Ley
Bernhard Schölkopf
Nicholas D Youngblut

Abstract
Motivation: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state of the art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information: Supplementary data are available at Bioinformatics online.


2018
Author(s):
Florian Meier
Andreas-David Brunner
Scarlet Koch
Heiner Koch
Markus Lubeck
...

Abstract
In bottom-up proteomics, peptides are separated by liquid chromatography with elution peak widths in the range of seconds, while mass spectra are acquired in about 100 microseconds with time-of-flight (TOF) instruments. This allows adding ion mobility as a third dimension of separation. Among several formats, trapped ion mobility spectrometry (TIMS) is attractive due to its small size, low voltage requirements, and high efficiency of ion utilization. We have recently demonstrated a scan mode termed parallel accumulation - serial fragmentation (PASEF), which multiplies the sequencing speed without any loss in sensitivity (Meier et al., PMID: 26538118). Here we introduce the timsTOF Pro instrument, which optimally implements online PASEF. It features an orthogonal ion path into the ion mobility device, limiting the amount of debris entering the instrument and making it very robust in daily operation. We investigate different precursor selection schemes for shotgun proteomics to optimally allocate in excess of 100 fragmentation events per second. More than 800,000 fragmentation spectra in standard 120 min LC runs are easily achievable, which can be used for near-exhaustive precursor selection in complex mixtures or for re-sequencing weak precursors. MaxQuant identified more than 6,400 proteins in single-run HeLa analyses without matching to a library, and with high quantitative reproducibility (R > 0.97). Online PASEF achieves remarkable sensitivity, with more than 2,900 proteins identified in 30 min runs of only 10 ng HeLa digest. We also show that highly reproducible collisional cross sections can be acquired on a large scale (R > 0.99). PASEF on the timsTOF Pro is a valuable addition to the technological toolbox in proteomics, with a number of unique operating modes that are only beginning to be explored.

