scholarly journals Targeted realignment of LC-MS profiles by neighbor-wise compound-specific graphical time warping with misalignment detection

2020 ◽  
Vol 36 (9) ◽  
pp. 2862-2871
Author(s):  
Chiung-Ting Wu ◽  
Yizhi Wang ◽  
Yinxue Wang ◽  
Timothy Ebbels ◽  
Ibrahim Karaman ◽  
...  

Abstract Motivation Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning. Results Here, we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or uncorrected using existing methods. The ncGTW software tool is developed currently as a plug-in to detect and realign misaligned features present in standard XCMS output. Availability and implementation An R package of ncGTW is freely available at Bioconductor and https://github.com/ChiungTingWu/ncGTW. A detailed user’s manual and a vignette are provided within the package. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Chiung-Ting Wu ◽  
David M. Herrington ◽  
Yizhi Wang ◽  
Timothy Ebbels ◽  
Ibrahim Karaman ◽  
...  

AbstractMotivationLiquid chromatography - mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot completely correct misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning.ResultsHere we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or uncorrected using existing methods. The ncGTW software tool is developed currently as a plug-in to the XCMS package.Availability and ImplementationAn R package of ncGTW is freely available at https://github.com/ChiungTingWu/ncGTW. A detailed user’s manual and a vignette are provided within the [email protected], [email protected] informationSupplementary data are available at Bioinformatics online.


Author(s):  
Zachary B Abrams ◽  
Dwayne G Tally ◽  
Lynne V Abruzzo ◽  
Kevin R Coombes

Abstract Summary Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. Availability and Implementation Freely available at https://CRAN.R-project.org/package=RCytoGPS. The code for the underlying CytoGPS software can be found at https://github.com/i2-wustl/CytoGPS. Supplementary information There is no supplementary data.


2019 ◽  
Author(s):  
L Cao ◽  
C Clish ◽  
FB Hu ◽  
MA Martínez-González ◽  
C Razquin ◽  
...  

AbstractMotivationLarge-scale untargeted metabolomics experiments lead to detection of thousands of novel metabolic features as well as false positive artifacts. With the incorporation of pooled QC samples and corresponding bioinformatics algorithms, those measurement artifacts can be well quality controlled. However, it is impracticable for all the studies to apply such experimental design.ResultsWe introduce a post-alignment quality control method called genuMet, which is solely based on injection order of biological samples to identify potential false metabolic features. In terms of the missing pattern of metabolic signals, genuMet can reach over 95% true negative rate and 85% true positive rate with suitable parameters, compared with the algorithm utilizing pooled QC samples. genu-Met makes it possible for studies without pooled QC samples to reduce false metabolic signals and perform robust statistical analysis.Availability and implementationgenuMet is implemented in a R package and available on https://github.com/liucaomics/genuMet under GPL-v2 license.ContactLiming Liang: [email protected] informationSupplementary data are available at ….


2019 ◽  
Author(s):  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Suli Li ◽  
Kevin R. Coombes

AbstractSummaryUnsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection.Availability and ImplementationMercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html)[email protected] informationSupplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (19) ◽  
pp. 3861-3863 ◽  
Author(s):  
Chuang Li ◽  
Kenli Li ◽  
Tao Chen ◽  
Yunping Zhu ◽  
Qiang He

Abstract Summary Tandem mass spectrometry based database searching is a widely acknowledged and adopted method that identifies peptide sequence in shotgun proteomics. However, database searching is extremely computationally expensive, which can take days even weeks to process a large spectra dataset. To address this critical issue, this paper presents SW-Tandem, a new tool for large-scale peptide sequencing. SW-Tandem parallelizes the spectrum dot product scoring algorithm and leverages the advantages of Sunway TaihuLight, the No. 1 supercomputer in the world in 2017. Sunway TaihuLight is powered by the brand new many-core SW26010 processors and provides a peak computation performance greater than 100PFlops. To fully utilize the Sunway TaihuLights capacity, SW-Tandem employs three mechanisms to accelerate large-scale peptide identification, memory-access optimizations, double buffering and vectorization. The results of experiments conducted on multiple datasets demonstrate the performance of SW-Tandem against three state-of-the-art tools for peptide identification, including X!! Tandem, MR-Tandem and MSFragger. In addition, it shows high scalability in the experiments on extremely large datasets sized up to 12 GB. Availability and implementation SW-Tandem is an open source software tool implemented in C++. The source code and the parameter settings are available at https://github.com/Logic09/SW-Tandem. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Grégoire Versmée ◽  
Laura Versmée ◽  
Mikaël Dusenne ◽  
Niloofar Jalali ◽  
Paul Avillach

Abstract Summary Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets with more than 1,000 studies. However, navigating the website and understanding the relationship between the studies are not easy tasks. Moreover, the decryption of the files is a complex procedure. In this study we propose the dbgap2x R package that covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP. Availability and implementation dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server and with a Notebook example is available at https://hub.docker.com/r/gversmee/dbgap2x. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (11) ◽  
pp. 1901-1906 ◽  
Author(s):  
Mary D Fortune ◽  
Chris Wallace

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method, produces very similar output to that from simulating individual genotypes with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (19) ◽  
pp. 4960-4962
Author(s):  
Lauren Marazzi ◽  
Andrew Gainer-Dewar ◽  
Paola Vera-Licona

Abstract Summary OCSANA+ is a Cytoscape app for identifying nodes to drive the system toward a desired long-term behavior, prioritizing combinations of interventions in large-scale complex networks, and estimating the effects of node perturbations in signaling networks, all based on the analysis of the network’s structure. OCSANA+ includes an update to optimal combinations of interventions from network analysis software tool with cutting-edge and rigorously tested algorithms, together with recently developed structure-based control algorithms for non-linear systems and an algorithm for estimating signal flow. All these algorithms are based on the network’s topology. OCSANA+ is implemented as a Cytoscape app to enable a user interface for running analyses and visualizing results. Availability and implementation OCSANA+ app and its tutorial can be downloaded from the Cytoscape App Store or https://veraliconaresearchgroup.github.io/OCSANA-Plus/. The source code and computations are available in https://github.com/VeraLiconaResearchGroup/OCSANA-Plus_SourceCode. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Bo Wang ◽  
Daniele Ramazzotti ◽  
Luca De Sano ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
...  

AbstractMotivationWe here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization.Availability and ImplementationSIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected] InformationSupplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Mary D. Fortune ◽  
Chris Wallace

AbstractMotivationMethods for analysis of GWAS summary statistics have encouraged data sharing and democratised the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some “truth” is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study.ResultsWe have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method, produces very similar output to that from simulating individual genotypes with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis.Availability and ImplementationOur method is available under a GPL license as an R package from http://github.com/chr1swallace/[email protected] InformationSupplementary Information is appended.


Sign in / Sign up

Export Citation Format

Share Document