MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra

Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity, but this approach is strongly limited by a generally poor correlation between the two metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of >100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model's prediction uncertainty. On 3,600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. We also demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra.
Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.
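The two quantities the abstract evaluates against can be illustrated with a minimal Python sketch: the Tanimoto score between two molecular fingerprints, and the root mean squared error between predicted and reference scores. The fingerprints here are toy sets of "on" bit indices and the predicted values are invented; none of this is taken from the MS2DeepScore code base.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score for two fingerprints given as sets of 'on' bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(bits_a & bits_b) / union


def rmse(predicted, reference):
    """Root mean squared error between two equally long sequences of scores."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Toy fingerprints (assumed values): 3 shared bits out of 6 distinct bits.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}
reference_score = tanimoto(fp_a, fp_b)

# Invented predictions compared against invented reference scores.
error = rmse([0.6, 0.2], [0.5, 0.4])
```

An RMSE of about 0.15, as reported above, is on the same 0 to 1 scale as the Tanimoto score itself.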

MS2DeepScore - a novel deep learning similarity measure for mass fragmentation spectrum comparisons

10.1101/2021.04.18.440324 ◽

2021 ◽

Author(s):

Florian Huber ◽

Sven van der Burg ◽

Justin J.J. van der Hooft ◽

Lars Ridder

Keyword(s):

Mean Squared Error ◽

Similarity Measures ◽

Structural Similarity ◽

Metabolomics Data ◽

Spectral Similarity ◽

Mass Spectral ◽

Root Mean Squared Error ◽

Prediction Uncertainty ◽

Squared Error ◽

Mass Fragmentation

Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are considered characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of >100,000 mass spectra of about 15,000 unique known compounds, MS2DeepScore learns to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model's prediction uncertainty. On 3,600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and predicts Tanimoto scores with a root mean squared error of about 0.15. The prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. We demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity metrics have great potential for a range of metabolomics data processing pipelines.
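The idea of sampling several model variants and using their spread as an uncertainty estimate can be sketched in Python as follows. This is only an illustration under stated assumptions: `noisy_predict` is a hypothetical stand-in for one stochastic forward pass (for example, a network evaluated with dropout left on), and the mean and standard deviation are one possible choice of summary statistics; the abstract does not say which statistics or thresholds the authors use.

```python
import random
import statistics


def noisy_predict(center, noise, rng):
    """Hypothetical stand-in for one stochastic forward pass; returns a score in [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(center, noise)))


def ensemble_predict(center, noise, n_samples, rng):
    """Repeat the stochastic prediction and summarise it as (mean, standard deviation)."""
    samples = [noisy_predict(center, noise, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


rng = random.Random(0)

# Invented spectrum pairs: (score the toy model centres on, how noisy the model is).
pairs = [(0.9, 0.02), (0.5, 0.20), (0.1, 0.03)]
results = [ensemble_predict(center, noise, 50, rng) for center, noise in pairs]

# Keep only predictions whose spread falls below an assumed threshold.
MAX_SPREAD = 0.1
confident = [(mean, spread) for mean, spread in results if spread < MAX_SPREAD]
```

Filtering on the spread trades coverage for accuracy: fewer pairs are scored, but the ones that remain are those on which the sampled models agree.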

Use of reflectance spectroscopy to estimate the organic carbon and CaCO3 contents of soils

Agrokémia és Talajtan ◽

10.1556/agrokem.60.2012.2.5 ◽

2012 ◽

Vol 61 (2) ◽

pp. 277-290 ◽

Cited By ~ 1

Author(s):

Ádám Csorba ◽

Vince Láng ◽

László Fenyvesi ◽

Erika Michéli

Keyword(s):

Organic Carbon ◽

Least Squares ◽

Partial Least Squares ◽

Partial Least Squares Regression ◽

Mean Squared Error ◽

Reflectance Spectroscopy ◽

Least Squares Regression ◽

Root Mean Squared Error ◽

Squared Error

There is a growing demand for the development and application of technologies and methods that enable rapid, cost-effective and environmentally friendly soil data collection and evaluation. Reflectance spectroscopy, which is based on reflectance measurements in the visible (VIS) and near-infrared (NIR) range of the electromagnetic spectrum (350–2500 nm), meets these demands. Because the reflectance spectrum recorded from soils is very rich in information, and many soil constituents have a characteristic spectral "fingerprint" in the range studied, a large number of key soil parameters can be determined simultaneously from a single curve. In this paper we present the first steps of a methodological development, built on the principles of reflectance spectroscopy, aimed at determining the composition of soils. We built and tested predictive models based on multivariate mathematical-statistical methods (partial least squares regression, PLSR) that allow the organic carbon and CaCO3 contents of soils to be estimated. Testing of the models showed that the procedure gave high R² values for both soil parameters [R²(organic carbon) = 0.815; R²(CaCO3) = 0.907]. The root mean squared error (RMSE), which indicates the accuracy of the estimate, can be considered moderate for both parameters [RMSE(organic carbon) = 0.467; RMSE(CaCO3) = 3.508], and could be improved significantly by standardizing the reflectance measurement protocols. Based on our investigations we concluded that the combined use of reflectance spectroscopy and multivariate chemometric procedures can provide a rapid and cost-effective method of data collection and evaluation.

Comparative study of the pencil-and-paper and digital formats of the Spanish DARS scale

Acta Neuropsychiatrica ◽

10.1017/neu.2021.45 ◽

2021 ◽

pp. 1-21

Author(s):

Elsa Arrua-Duarte ◽

Marta Migoya-Borja ◽

Igor Barahona ◽

Lena C. Quilty ◽

Sakina J. Rizvi ◽

...

Keyword(s):

Rating Scale ◽

Mean Squared Error ◽

Intraclass Correlation ◽

Test Validity ◽

Wilcoxon Test ◽

Digital Version ◽

Root Mean Squared Error ◽

Squared Error ◽

Digital Format ◽

Paper And Pencil

Objective: The Dimensional Anhedonia Rating Scale (DARS) is a recently validated questionnaire for assessing anhedonia. In this work we study the equivalence between the traditional paper-and-pencil format and the digital format of the DARS. Methods: 69 patients completed the DARS in both paper-based and digital versions. We assessed differences between formats (Wilcoxon test), validity of the scales (Kappa and Intraclass Correlation Coefficients), and reliability (Cronbach's alpha and Guttman's coefficient). We calculated the Comparative Fit Index and the Root Mean Squared Error associated with the proposed one-factor structure. Results: Total scores were higher for the paper-based format. Significant differences between the two formats were found for three items. The weighted Kappa coefficient was approximately 0.40 for most of the items. Internal consistency was greater than 0.94, and the Intraclass Correlation Coefficient was 0.95 for the digital version and 0.94 for the paper-and-pencil version (F = 16.7, p < 0.001). The Comparative Fit Index was 0.97 for both the digital and the paper-and-pencil DARS, and the Root Mean Squared Error was 0.11 for the digital DARS and 0.10 for the paper-and-pencil DARS. Conclusion: The digital DARS is consistent in many respects with the paper-and-pencil questionnaire, but equivalence with this format cannot be assumed without caution.
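For readers unfamiliar with the reliability statistic named in the Methods, Cronbach's alpha for k items is k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the total score). The Python sketch below shows that standard formula on invented responses; it is not the authors' analysis code, and the data do not come from the study.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item (column) across respondents.
    item_variances = [statistics.variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's total score.
    total_variance = statistics.variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented answers from four respondents to three items.
toy_responses = [[3, 4, 3], [2, 2, 3], [4, 4, 4], [1, 2, 1]]
alpha = cronbach_alpha(toy_responses)

# Items that move together perfectly give the maximum value of 1.
perfect = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```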

Prediksi Indeks Harga Saham Gabungan (IHSG) Menggunakan Algoritma Neural Network [Predicting the Indonesia Composite Index (IHSG) Using a Neural Network Algorithm]

Jurnal Edukasi dan Penelitian Informatika (JEPIN) ◽

10.26418/jp.v4i1.25384 ◽

2018 ◽

Vol 4 (1) ◽

pp. 24

Author(s):

Imam Halimi ◽

Wahyu Andhyka Kusuma

Keyword(s):

Neural Network ◽

Data Mining ◽

Linear Regression ◽

Mean Squared Error ◽

Composite Index ◽

T Test ◽

Sliding Windows ◽

Root Mean Squared Error ◽

Squared Error

Stock investment is something people are familiar with, both in hearing about it and in doing it. There are many kinds of stocks in Indonesia, one of which is the Indeks Harga Saham Gabungan (IHSG), known in English as the Indonesia Composite Index, ICI, or IDX Composite. The IHSG is an important parameter to consider when investing, given that it is a composite of stocks. This study aims to predict the movement of the IHSG with data mining techniques using a neural network algorithm, compared against a linear regression algorithm, so that the result can serve as a reference for investors who are about to invest. The results are Root Mean Squared Error (RMSE) values, together with an additional label holding the predicted figures, obtained after validation using sliding windows validation. The best result is reported for the tests using the neural network algorithm, namely 37.786 with windowing and 13.597 without windowing; for the linear regression algorithm the values were 35.026 with windowing and 12.657 without windowing. A subsequent T-Test showed that the difference between the neural network and linear regression is not significant, with the same T-Test value of 1.000 for the tests with and without windowing.
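The abstract mentions two related ideas: windowing (turning a price series into lagged inputs) and sliding-window validation (evaluating on blocks that always follow the training block in time). A minimal Python sketch of both is given below. The function names, window sizes and the toy series are assumptions made for illustration; the abstract does not describe the tool or parameters the authors used.

```python
def make_windows(series, window_size):
    """Pair each run of `window_size` consecutive values with the value that follows it."""
    return [
        (series[i:i + window_size], series[i + window_size])
        for i in range(len(series) - window_size)
    ]


def sliding_window_splits(n_points, train_size, test_size, step):
    """Yield (train_indices, test_indices); each test block directly follows its training block."""
    start = 0
    while start + train_size + test_size <= n_points:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step


# Toy index values (invented).
toy_series = [1, 2, 3, 4, 5]
windows = make_windows(toy_series, window_size=2)

# Three points for training, one for testing, moving forward one point at a time.
splits = list(sliding_window_splits(n_points=6, train_size=3, test_size=1, step=1))
```

Keeping the test block strictly after the training block avoids using future values to predict the past, which an ordinary shuffled cross-validation would allow.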

Two-Run Genetic Programming for Predicting Slump Flow of Concrete

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.590.321 ◽

2014 ◽

Vol 590 ◽

pp. 321-325

Author(s):

Li Chen ◽

Chang Huan Kou ◽

Kuan Ting Chen ◽

Shih Wei Ma

Keyword(s):

Nonlinear Systems ◽

Genetic Programming ◽

High Performance ◽

Mean Squared Error ◽

High Performance Concrete ◽

Local Optimum ◽

Root Mean Squared Error ◽

Squared Error ◽

Slump Flow ◽

Low Dimensional

This study proposes a two-run genetic programming (GP) approach to estimate the slump flow of high-performance concrete (HPC) from several significant concrete ingredients. GP optimizes functions and their associated coefficients simultaneously and is suited to automatically discovering relationships in nonlinear systems. Basic-GP usually suffers from premature convergence: it fails to find satisfactory solutions and performs well only on low-dimensional problems. It was therefore extended with an automatic incremental procedure to improve its search ability and avoid local optima. The results demonstrate that two-run GP generates an accurate formula and achieves a 7.5% improvement in root mean squared error (RMSE) over Basic-GP when predicting the slump flow of HPC.

Evaluation of GEOS-Simulated L-Band Microwave Brightness Temperature Using Aquarius Observations over Non-Frozen Land across North America

Remote Sensing ◽

10.3390/rs12183098 ◽

2020 ◽

Vol 12 (18) ◽

pp. 3098

Author(s):

Jongmin Park ◽

Barton A. Forman ◽

Rolf H. Reichle ◽

Gabrielle De Lannoy ◽

Saad B. Tarik

Keyword(s):

North America ◽

Brightness Temperature ◽

Mean Squared Error ◽

Transfer Model ◽

Vegetation Types ◽

Large Variability ◽

Root Mean Squared Error ◽

Squared Error ◽

L Band ◽

Soil Hydraulic

L-band brightness temperature (Tb) is one of the key remotely-sensed variables that provides information regarding surface soil moisture conditions. In order to harness the information in Tb observations, a radiative transfer model (RTM) is investigated for eventual inclusion into a data assimilation framework. In this study, Tb estimates from the RTM implemented in the NASA Goddard Earth Observing System (GEOS) were evaluated against the nearly four-year record of daily Tb observations collected by L-band radiometers onboard the Aquarius satellite. Statistics between the modeled and observed Tb were computed over North America as a function of soil hydraulic properties and vegetation types. Overall, statistics showed good agreement between the modeled and observed Tb with a relatively low, domain-average bias (0.79 K (ascending) and −2.79 K (descending)), root mean squared error (11.0 K (ascending) and 11.7 K (descending)), and unbiased root mean squared error (8.14 K (ascending) and 8.28 K (descending)). In terms of soil hydraulic parameters, large porosity and large wilting point both lead to high uncertainty in modeled Tb due to the large variability in dielectric constant and surface roughness used by the RTM. The performance of the RTM as a function of vegetation type suggests better agreement in regions with broadleaf deciduous and needleleaf forests while grassland regions exhibited the worst accuracy amongst the five different vegetation types.
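The three statistics quoted above are related: for a single series of model-minus-observation differences, the squared unbiased RMSE equals the squared RMSE minus the squared bias. The Python sketch below illustrates this on invented brightness temperatures; the function names are illustrative and not taken from GEOS or the paper. The identity holds for one series, so domain averages of each statistic, such as those quoted in the abstract, need not satisfy it.

```python
import math


def bias(model, obs):
    """Mean of the model-minus-observation differences."""
    return sum(m - o for m, o in zip(model, obs)) / len(model)


def rmse(model, obs):
    """Root mean squared error of the differences."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(model))


def ubrmse(model, obs):
    """RMSE after removing the mean difference (bias) from every difference."""
    b = bias(model, obs)
    return math.sqrt(sum(((m - o) - b) ** 2 for m, o in zip(model, obs)) / len(model))


# Invented modeled and observed brightness temperatures in kelvin.
tb_model = [271.0, 268.5, 275.0, 280.0]
tb_obs = [270.0, 270.0, 272.0, 281.0]

b = bias(tb_model, tb_obs)
r = rmse(tb_model, tb_obs)
u = ubrmse(tb_model, tb_obs)
```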

Combined Multilateration with Machine Learning for Enhanced Aircraft Localization

Proceedings ◽

10.3390/proceedings2020059002 ◽

2020 ◽

Vol 59 (1) ◽

pp. 2

Author(s):

Benoit Figuet ◽

Raphael Monstein ◽

Michael Felux

Keyword(s):

Machine Learning ◽

Sensitivity Analysis ◽

Real World ◽

Mean Squared Error ◽

Accurate Estimate ◽

Regression Technique ◽

Gradient Boosting ◽

Root Mean Squared Error ◽

Squared Error ◽

Using Data

In this paper, we present an aircraft localization solution developed in the context of the Aircraft Localization Competition and applied to the OpenSky Network real-world ADS-B data. The developed solution is based on a combination of machine learning and multilateration using data provided by time-synchronized ground receivers. A gradient boosting regression technique is used to obtain an estimate of the geometric altitude of the aircraft, as well as a first guess of the 2D aircraft position. Then, a triplet-wise and an all-in-view multilateration technique are implemented to obtain an accurate estimate of the aircraft latitude and longitude. A sensitivity analysis of the accuracy as a function of the number of receivers is conducted and used to optimize the proposed solution. The obtained predictions have a 2D root mean squared error below 25 m and a geometric altitude error below 35 m.

Forecasting Oil Price Using Web-based Sentiment Analysis

Energies ◽

10.3390/en12224291 ◽

2019 ◽

Vol 12 (22) ◽

pp. 4291 ◽

Cited By ~ 2

Author(s):

Lu-Tao Zhao ◽

Guan-Rong Zeng ◽

Wen-Jing Wang ◽

Zhi-Gang Zhang

Keyword(s):

Sentiment Analysis ◽

Mean Squared Error ◽

Oil Price ◽

Future Research ◽

Price Forecasting ◽

Web Based ◽

Market Sentiment ◽

Root Mean Squared Error ◽

Strong Intensity ◽

Squared Error

International oil price forecasting is a complex and important issue in energy economics research. In this paper, a new model based on web-based sentiment analysis is proposed. For the oil market, sentiment analysis is used to extract key information from web texts from four perspectives: compound, negative, neutral, and positive sentiment. These are constructed as features and fed into oil price forecasting models together with the oil price itself. Finally, we analyze the effect from various views and report some interesting findings. The results show that the root mean squared error can be reduced by about 0.2 and the error variance by 0.2, which means that both accuracy and stability are improved. Furthermore, we find that different types of sentiment can all improve performance, but by similar amounts. Last but not least, text with strong intensity can better support oil price forecasting than weaker text: the root mean squared error can be reduced by up to 0.5, and the number of bad cases is reduced by 20%, indicating that text with strong intensity can correct the original oil price forecast. We believe that our research will play a strong supporting role in future research on using web information for oil price forecasting.

Mass spectral similarity for untargeted metabolomics data analysis of complex mixtures

International Journal of Mass Spectrometry ◽

10.1016/j.ijms.2014.06.005 ◽

2015 ◽

Vol 377 ◽

pp. 719-727 ◽

Cited By ~ 48

Author(s):

Neha Garg ◽

Clifford A. Kapono ◽

Yan Wei Lim ◽

Nobuhiro Koyama ◽

Mark J.A. Vermeij ◽

...

Keyword(s):

Data Analysis ◽

Complex Mixtures ◽

Untargeted Metabolomics ◽

Metabolomics Data ◽

Spectral Similarity ◽

Mass Spectral

A Hybrid Recommender Algorithm Based on an Improved Similarity Method

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.475-476.978 ◽

2013 ◽

Vol 475-476 ◽

pp. 978-982 ◽

Cited By ~ 3

Author(s):

Rui Ping Song ◽

Bo Wang ◽

Guo Ming Huang ◽

Qi Dong Liu ◽

Rong Jing Hu ◽

...

Keyword(s):

Mean Squared Error ◽

Similarity Measures ◽

Absolute Error ◽

Squared Error ◽

Coverage Accuracy ◽

The Mean ◽

Recommender Algorithm ◽

Hybrid Recommender ◽

Important Measurement ◽

Similarity Method

Recommendation systems have achieved widespread success in e-commerce. There are several evaluation metrics for recommender systems, such as accuracy, diversity, computational efficiency and coverage. Accuracy is one of the most important measurement criteria. In this paper, to improve accuracy, we propose a hybrid recommender algorithm based on an improved similarity method (ISM), combining demographic recommendation techniques and user-based collaborative filtering (CF) algorithms. Experiments on the MovieLens dataset compare the proposed approach with other classical similarity measures. The Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) values show the superiority of the proposed algorithm.
