Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships

2020 ◽  
Author(s):  
Florian Huber ◽  
Lars Ridder ◽  
Stefan Verhoeven ◽  
Jurriaan H. Spaaks ◽  
Faruk Diblen ◽  
...  

Abstract Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses such as library matching and molecular networking. Although weaknesses in the relationship between spectral similarity scores and the true structural similarities have been described, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm -- Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data to derive abstract spectral embeddings that can be used to assess spectral similarities. Using data derived from GNPS MS/MS libraries including spectra for nearly 13,000 unique molecules, we show how Spec2Vec scores correlate better with structural similarity than cosine-based scores. We demonstrate the advantages of Spec2Vec in library matching and molecular networking. Spec2Vec is computationally more scalable, allowing structural analogue searches in large databases within seconds.
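The idea of comparing spectra through embeddings can be illustrated with a minimal sketch. It assumes that each binned peak is treated as a "word" with a learned vector and that a spectrum embedding is an intensity-weighted sum of those vectors; the abstract does not spell out these details, and the vectors, peak names, and spectra below are hypothetical toy values, not output of a trained model.

```python
import math

# Hypothetical 3-dimensional "word" vectors for binned peaks. In a real setting
# these would be learned by a Word2Vec-style model on a large spectral library.
PEAK_VECTORS = {
    "peak@105.0": [0.9, 0.1, 0.0],
    "peak@77.0": [0.8, 0.2, 0.1],
    "peak@152.1": [0.0, 0.9, 0.4],
    "peak@43.0": [0.1, 0.0, 0.9],
}


def spectrum_embedding(peaks):
    """Intensity-weighted sum of peak vectors; peaks is a list of (mz, intensity)."""
    dim = len(next(iter(PEAK_VECTORS.values())))
    embedding = [0.0] * dim
    for mz, intensity in peaks:
        vector = PEAK_VECTORS.get(f"peak@{mz:.1f}")
        if vector is None:
            continue  # peaks unknown to the toy vocabulary are skipped
        for i, value in enumerate(vector):
            embedding[i] += intensity * value
    return embedding


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy spectra: A and B share only one peak, but their other peaks have nearby vectors.
spectrum_a = [(105.0, 1.0), (77.0, 0.6)]
spectrum_b = [(77.0, 1.0), (152.1, 0.2)]
spectrum_c = [(43.0, 1.0), (152.1, 0.5)]

sim_ab = cosine_similarity(spectrum_embedding(spectrum_a), spectrum_embedding(spectrum_b))
sim_ac = cosine_similarity(spectrum_embedding(spectrum_a), spectrum_embedding(spectrum_c))
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

Because each spectrum is reduced to one fixed-length vector, comparing a query against a large library comes down to many cheap vector comparisons, which is one plausible reading of the scalability claim.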

2021 ◽  
Vol 17 (2) ◽  
pp. e1008724 ◽  
Author(s):  
Florian Huber ◽  
Lars Ridder ◽  
Stefan Verhoeven ◽  
Jurriaan H. Spaaks ◽  
Faruk Diblen ◽  
...  

Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses such as library matching and molecular networking. Although weaknesses in the relationship between spectral similarity scores and the true structural similarities have been described, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm—Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data to derive abstract spectral embeddings that can be used to assess spectral similarities. Using data derived from GNPS MS/MS libraries including spectra for nearly 13,000 unique molecules, we show how Spec2Vec scores correlate better with structural similarity than cosine-based scores. We demonstrate the advantages of Spec2Vec in library matching and molecular networking. Spec2Vec is computationally more scalable allowing structural analogue searches in large databases within seconds.


2021 ◽  
Author(s):  
Florian Huber ◽  
Sven van der Burg ◽  
Justin J.J. van der Hooft ◽  
Lars Ridder

Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are considered characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of >100,000 mass spectra of about 15,000 unique known compounds, MS2DeepScore learns to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model's prediction uncertainty. On 3,600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and predicts Tanimoto scores with a root mean squared error of about 0.15. The prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. We demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity metrics have great potential for a range of metabolomics data processing pipelines.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Florian Huber ◽  
Sven van der Burg ◽  
Justin J. J. van der Hooft ◽  
Lars Ridder

Abstract Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of >100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model’s prediction uncertainty. On 3600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. We also demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.
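The evaluation described above (a root mean squared error over all predictions, then over an uncertainty-filtered subset) can be sketched as follows. This is an illustration only: the repeated predictions stand in for Monte-Carlo Dropout passes, the mean and standard deviation are assumed stand-ins for whatever aggregation and uncertainty measure the authors use, and all numbers and the threshold are made up.

```python
import math
import statistics


def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical repeated predictions per spectrum pair (one inner list per pair),
# standing in for several stochastic forward passes of a model with dropout.
sampled_predictions = [
    [0.82, 0.80, 0.84],
    [0.30, 0.55, 0.10],
    [0.15, 0.17, 0.13],
    [0.60, 0.20, 0.85],
]
# Hypothetical reference Tanimoto scores for the same pairs.
reference_scores = [0.85, 0.60, 0.10, 0.20]

# Assumption: aggregate by the mean and use the spread as the uncertainty.
mean_predictions = [statistics.mean(s) for s in sampled_predictions]
uncertainties = [statistics.pstdev(s) for s in sampled_predictions]

overall_rmse = rmse(mean_predictions, reference_scores)

# Keep only pairs whose predictions agree closely across passes.
threshold = 0.05  # illustrative cut-off, not taken from the paper
confident = [i for i, u in enumerate(uncertainties) if u <= threshold]
confident_rmse = rmse(
    [mean_predictions[i] for i in confident],
    [reference_scores[i] for i in confident],
)

print(f"all pairs: {overall_rmse:.3f}, confident subset: {confident_rmse:.3f}")
```

In this toy data the pairs with a small spread also happen to have small errors, so the filtered error is lower; whether that holds in practice depends on how well the uncertainty estimate tracks the actual error.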


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Lena Y. E. Ekaney ◽  
Donatus B. Eni ◽  
Fidele Ntie-Kang

Abstract The relation that exists between the structure of a compound and its function is an integral part of chemoinformatics. The similarity principle states that “structurally similar molecules tend to have similar properties and similar molecules exert similar biological activities”. The similarity of the molecules can be studied either at the structure level or at the descriptor level (properties level). Generally, the objective of chemical similarity measures is to enhance prediction of the biological activities of molecules. In this article, an overview of various methods used to compare the similarity between metabolite structures has been provided, including two-dimensional (2D) and three-dimensional (3D) approaches. The focus has been on method descriptions, e.g. fingerprint-based similarity, in which the molecules under study are first fragmented and their fingerprints are computed; 2D structural similarity by comparing the Tanimoto coefficients and Euclidean distances; and physicochemical property descriptor-based similarity methods. The similarity between molecules could also be measured by using data mining (clustering) techniques, e.g. by using virtual screening (VS)-based similarity methods. In this approach, the molecules with the desired descriptors and/or structures are screened from large databases. Lastly, SMILES-based chemical similarity search is an important method for exact structure search, substructure search and descriptor similarity. The use of a particular method depends upon the requirements of the researcher.
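The two measures named above have standard definitions: for binary fingerprints A and B the Tanimoto coefficient is |A ∩ B| / |A ∪ B|, and the Euclidean distance is the usual root of summed squared differences between descriptor vectors. A minimal sketch follows; the fingerprints (sets of "on" bit positions) and descriptor values are hypothetical and do not come from a real cheminformatics toolkit.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def euclidean_distance(x, y):
    """Euclidean distance between two equally long descriptor vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical fingerprints: 3 shared bits, 6 distinct bits in total.
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8, 9, 12}
print(tanimoto(fp_1, fp_2))  # 3 / 6 = 0.5

# Hypothetical descriptor vectors (for example, molecular weight and logP).
desc_1 = [180.16, 1.2]
desc_2 = [194.19, 0.9]
print(euclidean_distance(desc_1, desc_2))
```

Note that unscaled descriptors let the largest-valued property dominate the Euclidean distance, so descriptor-based comparisons are usually run on normalized values.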


Author(s):  
Sahil Gupta ◽  
Eugene Saltanov ◽  
Igor Pioro

Canada, among many other countries, is pursuing the development of next generation (Generation IV) nuclear-reactor concepts. One of the main objectives of Generation-IV concepts is to achieve high thermal efficiencies (45–50%). It has been proposed to use SuperCritical Fluids (SCFs) as the heat-transfer medium in Gen IV reactor design concepts such as the SuperCritical Water-cooled Reactor (SCWR). An important aspect of developing SCF applications in novel Gen IV Nuclear Power Plant (NPP) designs is to understand the thermodynamic behavior and the prediction of Heat Transfer Coefficients (HTCs) at supercritical (SC) conditions. To calculate forced convection HTCs for simple geometries, a number of empirical 1-D correlations have been proposed using dimensional analysis. These 1-D HTC correlations are developed by applying data-fitting techniques to a model equation with dimensionless terms and can be used for rudimentary calculations. Using similar statistical techniques, three correlations were proposed by Gupta et al. [1] for Heat Transfer (HT) in SCCO2. These SCCO2 correlations were developed at the University of Ontario Institute of Technology (Canada) by using a large set of experimental SCCO2 data (∼4,000 data-points) obtained at the Chalk River Laboratories (CRL) AECL. These correlations predict HTC values with an accuracy of ±30% and wall temperatures with an accuracy of ±20% for the analyzed dataset. Since these correlations were developed using data from a single source, CRL (AECL), they can be limited in their range of applicability. To investigate the practical applicability of these SCCO2 correlations, it was imperative to perform a thorough error analysis by checking their results against a set of independent SCCO2 tube data. In this paper SCCO2 data are compiled from various sources and within various experimental flow conditions. HTC and wall-temperature values for these data points are calculated using updated correlations presented in [1] and compared to the experimental values. Error analysis is then shown for these datasets to obtain a sense of the applicability of these updated SCCO2 correlations.
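The workflow described above, computing an HTC from a dimensionless correlation and then checking it against measurements, can be sketched as follows. The correlations of Gupta et al. [1] are not given in this abstract, so the classical Dittus-Boelter form Nu = 0.023 Re^0.8 Pr^0.4 is used purely as a stand-in; it is not the authors' correlation and is not expected to be accurate near supercritical conditions. The predicted and measured values are made-up numbers for illustration.

```python
def dittus_boelter_htc(reynolds, prandtl, conductivity, diameter):
    """HTC in W/(m^2 K) from Nu = 0.023 Re^0.8 Pr^0.4 (stand-in correlation)."""
    nusselt = 0.023 * reynolds ** 0.8 * prandtl ** 0.4
    return nusselt * conductivity / diameter


def relative_errors(predicted, measured):
    """Signed relative error (predicted - measured) / measured for each point."""
    return [(p - m) / m for p, m in zip(predicted, measured)]


def fraction_within(errors, tolerance):
    """Share of points whose absolute relative error is within the tolerance."""
    return sum(abs(e) <= tolerance for e in errors) / len(errors)


# Made-up HTC values in W/(m^2 K), only to show the bookkeeping.
predicted_htc = [1100.0, 1500.0, 800.0]
measured_htc = [1000.0, 1000.0, 1000.0]

errors = relative_errors(predicted_htc, measured_htc)
share = fraction_within(errors, 0.30)  # the ±30% band quoted for HTC
print(f"{share:.0%} of points fall within ±30%")
```

The same two helper functions apply unchanged to wall temperatures with a ±20% band.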


2016 ◽  
Vol 14 (3) ◽  
pp. 0-0 ◽  
Author(s):  
Giuseppe Falvo D’Urso Labate ◽  
Francesco Baino ◽  
Mara Terzini ◽  
Alberto Audenino ◽  
Chiara Vitale-Brovarone ◽  
...  

2021 ◽  
Author(s):  
Jiaming Zeng ◽  
Michael F. Gensheimer ◽  
Daniel L. Rubin ◽  
Susan Athey ◽  
Ross D. Shachter

Abstract In medicine, randomized clinical trials (RCT) are the gold standard for informing treatment decisions. Observational comparative effectiveness research (CER) is often plagued by selection bias, and expert-selected covariates may not be sufficient to adjust for confounding. We explore how the unstructured clinical text in electronic medical records (EMR) can be used to reduce selection bias and improve medical practice. We develop a method based on natural language processing to uncover interpretable potential confounders from the clinical text. We validate our method by comparing the hazard ratio (HR) from survival analysis with and without the confounders against the results from established RCTs. We apply our method to four study cohorts built from localized prostate and lung cancer datasets from the Stanford Cancer Institute Research Database and show that our method adjusts the HR estimate towards the RCT results. We further confirm that the uncovered terms can be interpreted by an oncologist as potential confounders. This research helps enable more credible causal inference using data from EMRs, offers a transparent way to improve the design of observational CER, and could inform high-stakes medical decisions. Our method can also be applied to studies within and beyond medicine to extract important information from observational data to support decisions.


2017 ◽  
Author(s):  
Dat Duong ◽  
Wasi Uddin Ahmad ◽  
Eleazar Eskin ◽  
Kai-Wei Chang ◽  
Jingyi Jessica Li

Abstract The Gene Ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. In this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce two new solutions for this problem, by focusing instead on the definitions of the GO terms. We apply neural network based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second indirectly depends on the GO tree. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. The word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for the word-ordering within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely one definition entails another. We validate our methods in two ways. In the first experiment, we test the model’s ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model in identifying orthologs from randomly-matched genes in human, mouse, and fly. In both experiments, a hybrid of NLP and GO-tree based method achieves the best classification accuracy. Availability: github.com/datduong/NLPMethods2CompareGOterms
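The first approach (definitions as unordered word sets, with word similarity taken from an embedding) can be illustrated with a small sketch. The abstract does not state how word-level similarities are combined, so the symmetric "average best match" used here is an assumption, and the toy word vectors are hypothetical rather than taken from the authors' model.

```python
import math

# Hypothetical toy word embedding; a real model would map words into N dimensions.
WORD_VECTORS = {
    "protein": [1.0, 0.0, 0.0],
    "binding": [0.9, 0.3, 0.0],
    "transport": [0.0, 1.0, 0.2],
    "membrane": [0.1, 0.9, 0.3],
    "kinase": [0.8, 0.1, 0.4],
}


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def directed_similarity(words_a, words_b):
    """Average, over words in A, of the best cosine match among words in B."""
    best = [
        max(cosine(WORD_VECTORS[w], WORD_VECTORS[v]) for v in words_b)
        for w in words_a
    ]
    return sum(best) / len(best)


def definition_similarity(definition_a, definition_b):
    """Symmetric set-based similarity; word order is ignored by construction."""
    words_a = {w for w in definition_a.lower().split() if w in WORD_VECTORS}
    words_b = {w for w in definition_b.lower().split() if w in WORD_VECTORS}
    if not words_a or not words_b:
        return 0.0  # no known words to compare
    forward = directed_similarity(words_a, words_b)
    backward = directed_similarity(words_b, words_a)
    return 0.5 * (forward + backward)


close = definition_similarity("protein binding", "kinase binding")
far = definition_similarity("protein binding", "membrane transport")
print(f"close: {close:.3f}, far: {far:.3f}")
```

Because the definitions are reduced to sets, "protein binding" and "binding protein" score identically, which is exactly the limitation the second, order-aware approach is meant to address.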


2021 ◽  
Author(s):  
Fabian Braesemann ◽  
Fabian Stephany ◽  
Leonie Neuhäuser ◽  
Niklas Stoehr ◽  
Philipp Darius ◽  
...  

Abstract The global spread of Covid-19 has caused major economic disruptions. Governments around the world provide considerable financial support to mitigate the economic downturn. However, effective policy responses require reliable data on the economic consequences of the corona pandemic. We propose the CoRisk-Index: a real-time economic indicator of Covid-19 related risk assessments by industry. Using data mining, we analyse all reports from US companies filed since January 2020, representing more than a third of all US employees. We construct two measures - the number of 'corona' words in each report and the average text negativity of the sentences mentioning corona in each industry - that are aggregated in the CoRisk-Index. The index correlates with U.S. unemployment data and preempts stock market losses of February 2020. Moreover, thanks to topic modelling and natural language processing techniques, the CoRisk data provides unique granularity with regard to the particular contexts of the crisis and the concerns of individual industries about them. The data presented here help researchers and decision makers to measure the previously unobserved risk awareness of industries with regard to Covid-19, bridging the quantification gap between highly volatile stock market dynamics and long-term macro-economic figures. For immediate access to the data, we provide all findings and raw data on an interactive online dashboard in real time.
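The two measures named above can be sketched on a single report. The abstract does not define the keyword list, the negativity score, or how the measures are aggregated into the index, so everything concrete here is an assumption: the term set, the tiny negative-word lexicon, the share-of-negative-tokens score, and the sample text are all hypothetical.

```python
import re

# Hypothetical keyword list and toy sentiment lexicon (not the authors' resources).
CORONA_TERMS = {"corona", "coronavirus", "covid", "covid-19"}
NEGATIVE_WORDS = {"loss", "risk", "decline", "disruption", "adverse"}


def tokenize(text):
    """Lowercase tokens of letters, digits, and hyphens."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def corona_word_count(report):
    """Number of tokens in the report that match a corona keyword."""
    return sum(1 for token in tokenize(report) if token in CORONA_TERMS)


def corona_sentence_negativity(report):
    """Mean share of negative tokens over sentences that mention a corona keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", report)
    shares = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if tokens and any(token in CORONA_TERMS for token in tokens):
            negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
            shares.append(negative / len(tokens))
    return sum(shares) / len(shares) if shares else 0.0


sample_report = (
    "The coronavirus outbreak may cause adverse supply disruption. "
    "Revenue grew this quarter. "
    "Covid-19 poses a risk to demand."
)

count = corona_word_count(sample_report)
negativity = corona_sentence_negativity(sample_report)
print(count, round(negativity, 3))
```

Averaging these per-report values over all reports in an industry would give industry-level figures of the kind the abstract describes.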


2019 ◽  
Vol 5 (1) ◽  
pp. 11-42
Author(s):  
Teodor Petrič

Abstract In this paper psycholinguistic and emotional properties of 619 German idiomatic expressions are explored. The list of idiomatic expressions has been adapted from Citron et al. (2015), who have used it with German native speakers. In our study the same idioms were evaluated by Slovene learners of German as a foreign language. Our participants rated each idiom for emotional valence, emotional arousal, familiarity, concreteness, ambiguity (literality), semantic transparency and figurativeness. They also had the task to describe the meaning of the German idioms and to rate their confidence about the attributed meaning. The aims of our study were (1) to provide descriptive norms for psycholinguistic and affective properties of a large set of idioms in German as a second language, (2) to explore the relationships between psycholinguistic and affective properties of idioms in German as a second language, and (3) to compare the ratings of the German native speakers studied in Citron et al. (2015) with the ratings of the Slovene second language learners from our study. On the one hand, the results of the Slovene participants show many similarities with those of the German native speakers; on the other hand, they show a slight positivity bias and slightly shallower emotional processing of the German idioms. Our study provides data that could be useful for future studies investigating the role of affect in figurative language in a second language setting (methodology, translation science, language technology).

