Cloud Computing Based Immunopeptidomics Utilizing Community Curated Variant Libraries Simplifies and Improves Neo-Antigen Discovery in Metastatic Melanoma

Cancers ◽  
2021 ◽  
Vol 13 (15) ◽  
pp. 3754
Author(s):  
Amol Prakash ◽  
Keira E. Mahoney ◽  
Benjamin C. Orsburn

Unique peptide neo-antigens presented on the cell surface are attractive targets for researchers in nearly all areas of personalized medicine. Cells presenting peptides with mutated or other non-canonical sequences can be utilized for both targeted therapies and diagnostics. Today’s state-of-the-art pipelines utilize complementary proteogenomic approaches in which RNA or ribosomal sequencing data help to create libraries against which tandem mass spectrometry data can be compared. In this study, we present an alternative approach whereby cloud computing is utilized to power neo-antigen searches against community-curated databases containing more than 7 million human sequence variants. Using these expansive databases of high-quality sequences as a reference, we reanalyze the original data from two previously reported studies to identify neo-antigen targets in metastatic melanoma. Using our approach, we identify 79 percent of the non-canonical peptides reported by previous genomic analyses of these files. Furthermore, we report 18-fold more non-canonical peptides than previously described. The novel neo-antigens we report herein can be corroborated by secondary analyses, such as high predicted binding affinity when evaluated with well-established tools such as NetMHC. Finally, we report 738 non-canonical peptides shared by at least five patient samples, and 3258 shared across the two studies. This illustrates the depth of data that is present but typically missed by proteogenomic approaches with lower statistical power. This large list of shared peptides across the two studies, their annotation, non-canonical origin, as well as the MS/MS spectra from the two studies, are made available on a web portal for community analysis.
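As a minimal sketch of the kind of corroboration step described above, predicted binding ranks can be used to filter candidate non-canonical peptides. The peptide list, the percent-rank values and the 2% cutoff below are illustrative assumptions, not output from the authors' actual pipeline.

```python
# Minimal sketch: keep non-canonical peptides whose predicted MHC binding looks
# plausible. The (peptide, percent_rank) pairs and the 2% cutoff are
# illustrative assumptions, not the authors' actual pipeline output.
predicted = [
    ("KLDEVFTAK", 0.4),   # hypothetical NetMHC-style percent rank
    ("SLYNTVATL", 1.7),
    ("GILGFVFTL", 12.5),
]

RANK_CUTOFF = 2.0  # a commonly used "binder" threshold (percent rank)

candidate_neoantigens = [(pep, rank) for pep, rank in predicted if rank <= RANK_CUTOFF]

for pep, rank in candidate_neoantigens:
    print(f"{pep}\tpredicted %rank {rank:.1f} -> retained")
```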

Author(s):  
Lauren V. Alteio ◽  
Joana Séneca ◽  
Alberto Canarini ◽  
Roey Angel ◽  
Ksenia Guseva ◽  
...  

Microbial community analysis via marker gene amplicon sequencing has become a routine method in the field of soil research. In this perspective, we discuss technical challenges and limitations of amplicon sequencing studies in soil and present statistical and experimental approaches that can help address the spatio-temporal complexity of soil and the high diversity of organisms therein. We illustrate the impact of compositionality on the interpretation of relative abundance data and discuss the effects of sample replication on statistical power in soil community analysis. Additionally, we argue for the need for increased study reproducibility and data availability, as well as for complementary techniques for generating deeper ecological insights into microbial roles in soil ecosystems and our understanding thereof. At this stage, we call upon researchers and specialized soil journals to consider the current state of data analysis, interpretation and availability in order to improve the rigor of future studies.
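To illustrate the compositionality point, a centred log-ratio (CLR) transform is one standard way to move relative-abundance data off the simplex before computing correlations or ordinations. The count table below is made up for illustration and is not from the study.

```python
import numpy as np

# Toy OTU count table: rows = samples, columns = taxa (made-up numbers).
counts = np.array([
    [120,  30,  50,   0],
    [ 80,  60,  10,   5],
    [200,  10,  90,  20],
], dtype=float)

# Add a pseudocount so zeros do not break the log, then close to proportions.
pseudo = counts + 0.5
props = pseudo / pseudo.sum(axis=1, keepdims=True)

# Centred log-ratio transform: log proportion minus its per-sample mean log.
clr = np.log(props) - np.log(props).mean(axis=1, keepdims=True)

print(clr.round(2))
```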


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 1453 ◽  
Author(s):  
Zeeshan Ahmed ◽  
Thomas Dandekar

Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments, e.g., PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medical imaging such as electroencephalography (EEG), magnetoencephalography (MEG), electrocardiography (ECG) and positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in the scientific and medical communities, as they play a vital role in providing major original data and experimental and computational results in concise form. One major challenge in implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product-line-architecture-based bioinformatics tool, ‘Mining Scientific Literature (MSL)’, which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures, and extraction of embedded text from all kinds of biological and biomedical figures using Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system’s output in different formats including text, PDF, XML and image files. Hence, MSL is an easy-to-install and easy-to-use analysis tool to interpret published scientific literature in PDF format.
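The OCR step for figure-embedded text can be sketched in a few lines; this is not the MSL implementation, only a minimal illustration assuming a figure image extracted from a PDF is already saved on disk and that Tesseract with the pytesseract and Pillow packages is installed.

```python
# Minimal sketch of the OCR step only, not the MSL tool itself.
# Assumes a figure already extracted from a PDF is saved as 'figure1.png'.
from PIL import Image
import pytesseract

figure = Image.open("figure1.png").convert("L")   # greyscale often helps OCR
embedded_text = pytesseract.image_to_string(figure)

print(embedded_text)
```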


2018 ◽  
Author(s):  
Ruth Stoney ◽  
Jean-Mark Schwartz ◽  
David L Robertson ◽  
Goran Nenadic

Background: The consolidation of pathway databases, such as KEGG [1], Reactome [2] and ConsensusPathDB [3], has generated widespread biological interest; however, the issue of pathway redundancy impedes the use of these consolidated datasets. Attempts to reduce this redundancy have focused on visualizing pathway overlap or merging pathways, but the resulting pathways may be of heterogeneous sizes and cover multiple biological functions. Efforts have also been made to deal with redundancy in pathway data by consolidating enriched pathways into a number of clusters or concepts. We present an alternative approach, which generates pathway subsets capable of covering all genes present within either pathway databases or enrichment results, generating substantial reductions in redundancy. Results: We propose a method that uses set cover to reduce pathway redundancy without merging pathways. The proposed approach considers three objectives: removal of pathway redundancy, control of pathway size, and coverage of the gene set. By applying set cover to the ConsensusPathDB dataset we were able to produce a reduced set of pathways representing 100% of the genes in the original data set with 74% less redundancy, or 95% of the genes with 88% less redundancy. We also developed an algorithm to simplify enrichment data and applied it to a set of enriched osteoarthritis pathways, revealing that, within the top ten pathways, five were redundant subsets of more enriched pathways. Applying set cover to the enrichment results removed these redundant pathways, allowing more informative pathways to take their place. Conclusion: Our method provides an alternative approach for handling pathway redundancy while ensuring that the pathways are of homogeneous size and gene coverage is maximised. Pathways are not altered from their original form, allowing biological knowledge regarding the data set to be directly applicable. We demonstrate the ability of the algorithms to prioritise redundancy reduction, pathway size control or gene set coverage. The application of set cover to pathway enrichment results produces an optimised summary of the pathways that best represent the differentially regulated gene set.
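The usual way to solve set cover instances like this is a greedy approximation. The sketch below, with made-up pathway names and gene identifiers, illustrates selecting pathways until a gene set is covered; the published method additionally weighs pathway size and redundancy, which this sketch does not attempt.

```python
# Greedy set cover sketch: pick pathways until the target gene set is covered.
# Pathway names and gene identifiers are made up for illustration.
pathways = {
    "Pathway_A": {"TP53", "MDM2", "CDKN1A", "ATM"},
    "Pathway_B": {"TP53", "MDM2"},                    # redundant subset of A
    "Pathway_C": {"IL6", "STAT3", "JAK2"},
    "Pathway_D": {"ATM", "JAK2", "BRCA1"},
}
target_genes = {"TP53", "MDM2", "CDKN1A", "ATM", "IL6", "STAT3", "JAK2", "BRCA1"}

uncovered = set(target_genes)
selected = []
while uncovered:
    # Choose the pathway covering the most still-uncovered genes.
    best = max(pathways, key=lambda p: len(pathways[p] & uncovered))
    gained = pathways[best] & uncovered
    if not gained:
        break  # remaining genes appear in no pathway
    selected.append(best)
    uncovered -= gained

print("Selected pathways:", selected)
print("Genes left uncovered:", uncovered)
```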


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e13032-e13032 ◽  
Author(s):  
Anton Buzdin ◽  
Andrew Garazha ◽  
Maxim Sorokin ◽  
Alex Glusker ◽  
Alexey Aleshin ◽  
...  

e13032 Background: Intracellular molecular pathways (IMPs) control all major events in the living cell. They are considered hotspots in contemporary oncology because knowledge of IMP activation is essential for understanding the mechanisms of molecular pathogenesis. Profiling IMPs requires RNA-seq data for tumors and for a collection of reference normal tissues. However, there is currently a shortage of such profiles for normal tissues from healthy human donors uniformly profiled in a single series of experiments. Access to GTEx, the largest dataset of normal profiles, is only partly available through dbGaP. In the TCGA database, normal samples are adjacent to surgically removed tumors and may be affected by tumor-linked growth factors, inflammation and altered vascularization. ENCODE datasets were obtained from autopsies of normal tissues, but they cannot form statistically significant reference groups. Methods: Tissue samples representing 20 organs were taken post-mortem, no later than 36 hours after death, from healthy human donors killed in road accidents; blood samples were taken from healthy volunteers. Gene expression was profiled in RNA-seq experiments using the same reagents, equipment and protocols. Bioinformatic algorithms for IMP analysis were developed and validated using experimental and public gene expression datasets. Results: From the original sequencing data we constructed the largest fully open reference expression database of normal human tissues, comprising 465 profiles, termed the Oncobox Atlas of Normal Tissue Expression (ANTE; original data: GSE120795). We next developed a method termed Oncobox for interrogating activation of IMPs in human cancers. It includes modules for expression data harmonization and comparison and an algorithm for automatic annotation of molecular pathways. The Oncobox system enables accurate scoring of thousands of molecular pathways using RNA-seq data. Oncobox pathway analysis is also applicable to quantitative proteomics and microRNA data in oncology. Conclusions: The Oncobox system can be used for a plethora of applications in cancer research, including finding differentially regulated genes and IMPs, and for the discovery of new pathway-related diagnostic and prognostic biomarkers.
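A generic pathway-level score can be sketched as an aggregate of tumour-versus-normal expression ratios over the genes in a pathway. The sketch below is not the Oncobox scoring formula; the gene names, reference values and pathway definition are illustrative assumptions.

```python
import numpy as np

# Generic sketch of a pathway-level score from tumour-vs-normal expression
# ratios. This is NOT the exact Oncobox formula; values are illustrative.
normal_reference = {"EGFR": 50.0, "KRAS": 80.0, "MAPK1": 120.0, "MYC": 40.0}
tumour_sample    = {"EGFR": 400.0, "KRAS": 95.0, "MAPK1": 300.0, "MYC": 160.0}

pathway_genes = ["EGFR", "KRAS", "MAPK1", "MYC"]

log_ratios = [np.log2(tumour_sample[g] / normal_reference[g]) for g in pathway_genes]
pathway_score = float(np.mean(log_ratios))

print(f"Mean log2 tumour/normal ratio over the pathway: {pathway_score:.2f}")
```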


2017 ◽  
Author(s):  
Dongfang Wang ◽  
Jin Gu

Single cell RNA sequencing (scRNA-seq) is a powerful technique for analyzing transcriptomic heterogeneity at the single-cell level. Finding an effective low-dimensional representation and visualization of the original data is an important step for studying cell sub-populations and lineages from scRNA-seq data. scRNA-seq data are much noisier than traditional bulk RNA-seq: at the single-cell level, transcriptional fluctuations are much larger than those averaged over a cell population, and the low amount of RNA transcripts increases the rate of technical dropout events. In this study, we propose VASC (deep Variational Autoencoder for scRNA-seq data), a deep multi-layer generative model, for the unsupervised dimension reduction and visualization of scRNA-seq data. It can explicitly model dropout events and find nonlinear hierarchical feature representations of the original data. Tested on twenty datasets, VASC shows superior performance in most cases and broader dataset compatibility compared with four state-of-the-art dimension reduction methods. In a case study of pre-implantation embryos, VASC successfully re-establishes the cell dynamics and identifies several candidate marker genes associated with early embryo development.
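The core idea of a variational autoencoder for dimension reduction can be sketched in a few dozen lines. The PyTorch code below is a minimal sketch, not the published VASC architecture (which adds an explicit dropout/zero-inflation layer); layer sizes, the toy data and the training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal VAE sketch: reduce an expression matrix to two latent dimensions.
# NOT the published VASC model; sizes and data are illustrative.
class MiniVAE(nn.Module):
    def __init__(self, n_genes, latent_dim=2, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes), nn.Sigmoid(),  # expects inputs scaled to [0, 1]
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_term = nn.functional.mse_loss(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term

# Toy usage: 100 cells x 500 genes of fake, already-normalized data.
x = torch.rand(100, 500)
model = MiniVAE(n_genes=500)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):  # a handful of steps just to show the loop
    recon, mu, logvar = model(x)
    loss = vae_loss(x, recon, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

embedding = model(x)[1].detach()  # latent means serve as the 2-D embedding
print(embedding.shape)            # torch.Size([100, 2])
```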


2018 ◽  
Author(s):  
Andy Lin ◽  
J. Jeffry Howbert ◽  
William Stafford Noble

To achieve accurate assignment of peptide sequences to observed fragmentation spectra, a shotgun proteomics database search tool must make good use of the very high resolution information produced by state-of-the-art mass spectrometers. However, making use of this information while also ensuring that the search engine’s scores are well calibrated—i.e., that the score assigned to one spectrum can be meaningfully compared to the score assigned to a different spectrum—has proven to be challenging. Here, we describe a database search score function, the “residue evidence” (res-ev) score, that achieves both of these goals simultaneously. We also demonstrate how to combine calibrated res-ev scores with calibrated XCorr scores to produce a “combined p-value” score function. We provide a benchmark consisting of four mass spectrometry data sets, which we use to compare the combined p-value to the score functions used by several existing search engines. Our results suggest that the combined p-value achieves state-of-the-art performance, generally outperforming MS Amanda and Morpheus and performing comparably to MS-GF+. The res-ev and combined p-value score functions are freely available as part of the Tide search engine in the Crux mass spectrometry toolkit (http://crux.ms).
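As a generic illustration of combining two calibrated per-spectrum p-values, Fisher's method is one standard approach; the paper's "combined p-value" uses its own calibration of res-ev and XCorr scores, which is not reproduced here. The p-values below are hypothetical.

```python
from scipy import stats

# Generic illustration only: Fisher's method for combining two independent
# per-spectrum p-values. Not the paper's own combination procedure.
p_resev, p_xcorr = 0.003, 0.04   # hypothetical calibrated p-values

statistic, p_combined = stats.combine_pvalues([p_resev, p_xcorr], method="fisher")
print(f"combined p-value: {p_combined:.4g}")
```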


2021 ◽  
Author(s):  
Laura Y. Zhou ◽  
Fei Zou ◽  
Wei Sun

Recent developments in cancer immunotherapy have opened unprecedented avenues to eliminate tumor cells using the human immune system. Cancer vaccines composed of neoantigens, or peptides unique to tumor cells due to somatic mutations, have emerged as a promising approach to activate or strengthen the immune response against cancer. A key step in identifying neoantigens is computationally predicting which somatically mutated peptides are presented on the cell surface by a human leukocyte antigen (HLA). Computational prediction relies on large amounts of high-quality training data, such as mass spectrometry data of peptides presented by one of several HLAs in living cells. We developed a complete pipeline to prioritize neoantigens for cancer vaccines. A key step of our pipeline is PEPPRMINT (PEPtide PResentation using a MIxture model and Neural neTwork), a model designed to exploit mass spectrometry data to predict peptide presentation by HLAs. We applied our pipeline to DNA sequencing data of 60 melanoma patients and identified a group of neoantigens that were more immunogenic in tumor cells than in normal cells. Additionally, the neoantigen burden estimated by PEPPRMINT was significantly associated with activity of the immune system, suggesting that these neoantigens could induce an immune response.
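One early step of such a pipeline, enumerating the candidate 9-mer peptides that overlap a somatic missense mutation, can be sketched directly; the protein sequence and the mutation below are made up, and this is not the PEPPRMINT model itself.

```python
# Sketch of candidate peptide generation: every 9-mer overlapping a somatic
# missense mutation. Sequence and mutation position are hypothetical.
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE"   # hypothetical protein sequence
mut_pos = 12                                   # 0-based index of mutated residue
mut_aa = "D"                                   # mutant amino acid

mutated = protein[:mut_pos] + mut_aa + protein[mut_pos + 1:]

k = 9
candidates = [
    mutated[start:start + k]
    for start in range(max(0, mut_pos - k + 1), min(mut_pos, len(mutated) - k) + 1)
]

print(candidates)  # every 9-mer containing the mutated residue
```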


2020 ◽  
Author(s):  
Bo Chen ◽  
Wei Xu

Background: Existing models for assessing microbiome sequencing data, such as operational taxonomic unit (OTU) counts, can only test predictors' effects on OTUs. There is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationships into models to evaluate longitudinal OTU measures. Results: We propose a novel approach to estimate OTU correlations based on their taxonomic structure, and apply this correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors' effects and OTU correlations. We develop a two-part Microbiome Taxonomic Longitudinal Correlation (MTLC) model for multivariate zero-inflated OTU outcomes based on the GEE framework. In addition, longitudinal and other types of repeated OTU measures are integrated into the MTLC model. Conclusions: Extensive simulations have been conducted to evaluate the performance of the MTLC method. Compared with existing methods, the MTLC method shows robust and consistent estimation and improved statistical power for testing predictors' effects. Lastly, we demonstrate the proposed method by applying it to a human microbiome study evaluating the effect of obesity in twins.
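For readers unfamiliar with the GEE framework, a baseline fit with an exchangeable working correlation can be sketched with statsmodels; the MTLC taxonomy-based correlation structure and two-part zero-inflation model are not part of statsmodels, and the data frame below is simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Baseline GEE with an exchangeable working correlation, shown only to
# illustrate the modelling framework used by MTLC. Simulated data.
rng = np.random.default_rng(0)
n_subjects, n_visits = 30, 3
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_visits),
    "visit": np.tile(np.arange(n_visits), n_subjects),
    "bmi": rng.normal(25, 4, n_subjects * n_visits),
})
df["otu_count"] = rng.poisson(lam=np.exp(0.5 + 0.05 * df["bmi"]))

model = smf.gee(
    "otu_count ~ bmi + visit",
    groups="subject",
    data=df,
    family=sm.families.Poisson(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
```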


2020 ◽  
Vol 26 (1) ◽  
pp. 78-83
Author(s):  
Demet Cidem Dogan ◽  
Huseyin Altindis

With the introduction of smart things into our lives, cloud computing is used in many different areas and changes the way we communicate. However, cloud computing should guarantee complete security assurance in terms of privacy protection, confidentiality, and integrity. In this paper, a Homomorphic Encryption Scheme based on Elliptic Curve Cryptography (HES-ECC) is proposed for secure data transfer and storage. The scheme stores data in the cloud after encrypting it. When calculations such as addition or multiplication are applied to the encrypted data in the cloud, the results carry over to the original data without any decryption process. Thus, the cloud server is only able to access the encrypted data in order to perform the required computations and fulfill the actions requested by the user. Hence, the storage and transmission security of the data are ensured. The proposed public-key HES-ECC is designed using a modified Weil pairing for encryption and the additive homomorphic property. HES-ECC also uses bilinear pairing for the multiplicative homomorphic property. The security of the encryption scheme and its homomorphic aspects is based on the hardness of the Elliptic Curve Discrete Logarithm Problem (ECDLP), the Weil Diffie-Hellman Problem (WDHP), and the Bilinear Diffie-Hellman Problem (BDHP).
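To make the additive homomorphic property concrete, the toy sketch below uses textbook EC ElGamal on a tiny curve: adding two ciphertexts component-wise decrypts to the sum of the plaintexts. This is not the HES-ECC construction (which relies on a modified Weil pairing and pairings for multiplication); the curve, keys and messages are illustrative only.

```python
# Toy additively homomorphic EC ElGamal on the textbook curve
# y^2 = x^3 + 2x + 2 over GF(17), generator G = (5, 1) of prime order 19.
# NOT the HES-ECC scheme; it only illustrates ciphertext addition.
import random

P_MOD, A, ORDER = 17, 2, 19
G = (5, 1)
INF = None  # point at infinity

def ec_add(p, q):
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p == q:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    return (x3, (s * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, p):
    result, addend = INF, p
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

def encrypt(m, pub):
    k = random.randrange(1, ORDER)
    return ec_mul(k, G), ec_add(ec_mul(m, G), ec_mul(k, pub))

def decrypt(cipher, priv):
    c1, c2 = cipher
    point = ec_add(c2, ec_mul(ORDER - priv, c1))  # c2 - priv*c1
    for m in range(ORDER):                        # brute-force tiny discrete log
        if ec_mul(m, G) == point:
            return m

priv = 7
pub = ec_mul(priv, G)
ct_a, ct_b = encrypt(3, pub), encrypt(5, pub)
ct_sum = (ec_add(ct_a[0], ct_b[0]), ec_add(ct_a[1], ct_b[1]))
print(decrypt(ct_sum, priv))  # 8 == 3 + 5, recovered without decrypting ct_a or ct_b
```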


2020 ◽  
Author(s):  
Ellen S. Cameron ◽  
Philip J. Schmidt ◽  
Benjamin J.-M. Tremblay ◽  
Monica B. Emelko ◽  
Kirsten M. Müller

The application of amplicon sequencing in water research provides a rapid and sensitive technique for microbial community analysis in a variety of environments, ranging from freshwater lakes to water and wastewater treatment plants. It has revolutionized our ability to study DNA collected from environmental samples by eliminating the challenges associated with lab cultivation and taxonomic identification. DNA sequencing data consist of discrete counts of sequence reads, the total number of which is the library size. Samples may have different library sizes and thus a normalization technique is required to compare them meaningfully. The process of randomly subsampling sequences to a selected normalized library size from the sample library—rarefying—is one such normalization technique. However, rarefying has been criticized as a normalization technique because data can be omitted through the exclusion of either excess sequences or entire samples, depending on the rarefied library size selected. Although it has been suggested that rarefying should be avoided altogether, we propose that repeatedly rarefying enables (i) characterization of the variation introduced to diversity analyses by this random subsampling and (ii) selection of smaller library sizes where necessary to incorporate all samples in the analysis. Rarefying may be a statistically valid normalization technique, but researchers should evaluate their data to make appropriate decisions regarding library size selection and subsampling type. The impact of normalized library size selection and of rarefying with or without replacement on diversity analyses was evaluated herein.
Highlights
▪ Amplicon sequencing technology for environmental water samples is reviewed
▪ Sequencing data must be normalized to allow comparison in diversity analyses
▪ Rarefying normalizes library sizes by subsampling from observed sequences
▪ Criticisms of data loss through rarefying can be resolved by rarefying repeatedly
▪ Rarefying repeatedly characterizes errors introduced by subsampling sequences
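Repeated rarefying can be sketched as drawing many without-replacement subsamples at a chosen library size and summarizing the spread this introduces into a diversity metric. The count vector, library size and use of Shannon diversity below are illustrative assumptions, not the study's data or exact workflow.

```python
import numpy as np

# Sketch: rarefy one sample repeatedly (without replacement) to a chosen
# library size and summarise the subsampling variation in Shannon diversity.
rng = np.random.default_rng(42)
observed_counts = np.array([500, 250, 120, 60, 40, 20, 8, 2])  # made-up ASV counts
rarefied_depth = 300
n_repeats = 1000

def shannon(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

diversities = []
for _ in range(n_repeats):
    subsample = rng.multivariate_hypergeometric(observed_counts, rarefied_depth)
    diversities.append(shannon(subsample))

print(f"Shannon diversity: mean {np.mean(diversities):.3f}, sd {np.std(diversities):.4f}")
```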

