METHOD FOR DETERMINING THE SEMANTIC SIMILARITY OF ARBITRARY LENGTH TEXTS USING THE TRANSFORMERS MODELS

2021 ◽  
Vol 5 (2) ◽  
pp. 126-130
Author(s):  
Сергій Олізаренко ◽  
В’ячеслав Радченко

The paper presents a method for determining the semantic similarity of texts of arbitrary length based on their vector representations. These representations are obtained with a multilingual Transformer model, and the task of determining semantic similarity is cast as a classification problem over pairs of text sequences using that model. A comparative analysis was performed to select the Transformer model best suited to this class of problems. The main stages of the method are: fine-tuning the Transformer model on the second pretraining objective (sentence prediction), and selecting and implementing a summarization method for text sequences longer than 512 (1024) tokens, so that semantic similarity can be determined for texts of arbitrary length.
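The chunk-and-compare idea behind handling over-long inputs can be sketched as follows. This is a toy illustration only: bag-of-words vectors stand in for the Transformer embeddings, and all function names are illustrative, not from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse vectors (dicts: term -> count).
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def chunk(tokens, max_len=512):
    # Split an over-long token sequence into model-sized windows.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def long_text_similarity(a, b, max_len=512):
    # Embed each chunk (here a bag-of-words Counter stands in for a
    # Transformer sentence vector), score every chunk pair, and average:
    # one simple way to reduce arbitrary-length comparison to
    # fixed-length comparisons.
    chunks_a = chunk(a.lower().split(), max_len)
    chunks_b = chunk(b.lower().split(), max_len)
    scores = [cosine(Counter(ca), Counter(cb))
              for ca in chunks_a for cb in chunks_b]
    return sum(scores) / len(scores)
```

A real implementation would replace the Counter vectors with pooled Transformer embeddings (or a summary of each chunk), which is the substitution the paper's summarization stage addresses.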

BMC Genomics ◽  
2019 ◽  
Vol 20 (S9) ◽  
Author(s):  
Xiaoshi Zhong ◽  
Rama Kaalia ◽  
Jagath C. Rajapakse

Abstract Background Semantic similarity between Gene Ontology (GO) terms is a fundamental measure for many bioinformatics applications, such as determining functional similarity between genes or proteins. Most previous research exploited information content to estimate the semantic similarity between GO terms; recently, some research exploited word embeddings to learn vector representations for GO terms from a large-scale corpus. In this paper, we propose a novel method, named GO2Vec, that exploits graph embeddings to learn vector representations for GO terms from the GO graph. GO2Vec combines information from both the GO graph and GO annotations, and its learned vectors can be applied to a variety of bioinformatics applications, such as calculating functional similarity between proteins and predicting protein-protein interactions. Results We conducted two kinds of experiments to evaluate the quality of GO2Vec: (1) functional similarity between proteins on the Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) dataset and (2) prediction of protein-protein interactions on the Yeast and Human datasets from the STRING database. Experimental results demonstrate the effectiveness of GO2Vec over the information content-based measures and the word embedding-based measures. Conclusion Our experimental results demonstrate the effectiveness of using graph embeddings to learn vector representations from undirected GO and GOA graphs. Our results also demonstrate that GO annotations provide useful information for computing the similarity between GO terms and between proteins.
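Once term vectors have been learned from the graph, protein-level similarity reduces to aggregating term vectors and comparing them. A minimal sketch, assuming mean aggregation and cosine similarity (the exact aggregation in GO2Vec may differ; the embeddings below are made-up 2-d toys):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def protein_vector(go_terms, term_vec):
    # Represent a protein by the mean of its annotated GO term vectors
    # (one common aggregation scheme).
    vecs = [term_vec[t] for t in go_terms]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

# Toy 2-d embeddings standing in for vectors learned from the GO graph.
term_vec = {
    "GO:0008150": [1.0, 0.0],
    "GO:0003674": [0.8, 0.6],
    "GO:0005575": [0.0, 1.0],
}
p1 = protein_vector(["GO:0008150", "GO:0003674"], term_vec)
p2 = protein_vector(["GO:0003674", "GO:0005575"], term_vec)
sim = cosine(p1, p2)  # functional similarity proxy for the two proteins
```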


2021 ◽  
Vol 2142 (1) ◽  
pp. 012013
Author(s):  
A S Nazdryukhin ◽  
A M Fedrak ◽  
N A Radeev

Abstract This work presents the results of applying self-normalizing neural networks with automatic hyperparameter selection, TabNet, and NODE to the problem of tabular data classification. A method for automatic hyperparameter selection was implemented. Testing was carried out with the open-source OpenML AutoML Benchmark framework. As part of the work, a comparative analysis against seven classification methods was carried out, and experiments were run on 39 datasets with 5 methods. NODE showed the best results among the methods considered and outperformed the standard methods on four datasets.
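Automatic hyperparameter selection, as benchmarked here, amounts to searching candidate configurations and keeping the one that validates best. A minimal random-search sketch with a hypothetical one-parameter threshold "model" (purely illustrative, not any of the benchmarked methods):

```python
import random

def accuracy(model, data):
    # Fraction of (x, y) pairs the model labels correctly.
    return sum(model(x) == y for x, y in data) / len(data)

def threshold_model(t):
    # A one-parameter "model": classify by thresholding the feature.
    return lambda x: int(x > t)

def random_search(val, n_trials=50, seed=0):
    # Automatic hyperparameter selection in miniature: sample candidate
    # hyperparameters, keep the one with the best validation accuracy.
    rng = random.Random(seed)
    best_t, best_acc = None, -1.0
    for _ in range(n_trials):
        t = rng.uniform(0.0, 1.0)
        acc = accuracy(threshold_model(t), val)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

val = [(0.1, 0), (0.2, 0), (0.7, 1), (0.9, 1)]
t, acc = random_search(val)
```

Real AutoML systems refine this loop with smarter samplers (e.g. Bayesian optimization) and cross-validation, but the select-by-validation-score structure is the same.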


2021 ◽  
Author(s):  
Abdul Wahab ◽  
Rafet Sifa

In this paper, we propose a new model named DIBERT, which stands for Dependency Injected Bidirectional Encoder Representations from Transformers. DIBERT is a variation of BERT with an additional third objective called Parent Prediction (PP), alongside Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). PP injects the syntactic structure of a dependency tree while pre-training DIBERT, which produces syntax-aware generic representations. We use the WikiText-103 benchmark dataset to pre-train both BERT-Base and DIBERT. After fine-tuning, we observe that DIBERT performs better than BERT-Base on various downstream tasks, including semantic similarity, natural language inference, and sentiment analysis.
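The Parent Prediction objective can be illustrated as a per-token classification over sentence positions: for each token, the model scores which position is its dependency-tree parent, trained with cross-entropy. A toy numeric sketch (the logits, tree, and loss combination are made-up assumptions, not DIBERT's actual values):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def parent_prediction_loss(logits_per_token, parent_index):
    # For each token i, logits_per_token[i][j] scores position j as
    # the dependency parent of token i; the loss is mean cross-entropy
    # against the gold parent indices.
    total = 0.0
    for logits, parent in zip(logits_per_token, parent_index):
        probs = softmax(logits)
        total += -math.log(probs[parent])
    return total / len(parent_index)

# Toy 3-token sentence; each row scores candidate parent positions.
logits = [[2.0, 0.1, 0.1],
          [0.1, 0.1, 2.0],
          [0.1, 2.0, 0.1]]
parents = [0, 2, 1]  # hypothetical gold parent index per token
pp = parent_prediction_loss(logits, parents)
# The full pre-training loss would combine the three objectives,
# e.g. loss = mlm_loss + nsp_loss + pp_loss (weighting is an assumption).
```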


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Sebastian Otálora ◽  
Niccolò Marini ◽  
Henning Müller ◽  
Manfredo Atzori

Abstract Background One challenge in training deep convolutional neural network (CNN) models with whole slide images (WSIs) is providing the required large number of costly, manually annotated image regions. Strategies to alleviate the scarcity of annotated data include transfer learning, data augmentation, and training the models with less expensive image-level annotations (weakly supervised learning). However, it is not clear how to combine transfer learning in a CNN model when different data sources are available for training, or how to leverage the combination of large amounts of weakly annotated images with a set of local region annotations. This paper aims to evaluate CNN training strategies based on transfer learning that leverage the combination of weak and strong annotations in heterogeneous data sources. The trade-off between classification performance and annotation effort is explored by evaluating a CNN that learns from strong labels (region annotations) and is later fine-tuned on a dataset with less expensive weak (image-level) labels. Results As expected, model performance on strongly annotated data steadily increases with the percentage of strong annotations used, reaching a performance comparable to pathologists (κ = 0.691 ± 0.02). Nevertheless, performance drops sharply in the WSI classification scenario (κ = 0.307 ± 0.133) and remains lower regardless of the number of annotations used. Performance increases when the model is fine-tuned for the task of Gleason scoring with the weak WSI labels (κ = 0.528 ± 0.05). Conclusion Combining weak and strong supervision improves on strong supervision alone in the classification of Gleason patterns using tissue microarrays (TMAs) and WSI regions.
Our results point to effective strategies for training CNN models that combine few annotated data and heterogeneous data sources. In the controlled TMA scenario, performance increases with the number of annotations used to train the model. Nevertheless, performance is hindered when the trained TMA model is applied directly to the more challenging WSI classification problem. This demonstrates that a good pre-trained model for prostate cancer TMA image classification may lead to the best downstream model if fine-tuned on the WSI target dataset. The source code for reproducing the experiments is available at: https://github.com/ilmaro8/Digital_Pathology_Transfer_Learning
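The κ values reported above are Cohen's kappa, i.e. rater agreement corrected for chance. A minimal self-contained implementation (the example labels are made-up, not the study's data):

```python
from collections import Counter

def cohens_kappa(a, b):
    # Cohen's kappa between two label sequences: observed agreement
    # minus chance agreement, normalized by (1 - chance agreement).
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical Gleason scores from a model and a pathologist.
model        = [3, 3, 4, 4, 5, 3, 4, 5]
pathologist  = [3, 3, 4, 5, 5, 3, 4, 4]
kappa = cohens_kappa(model, pathologist)
```

Kappa is preferred over raw accuracy here because Gleason classes are imbalanced, so two raters can agree often by chance alone.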


2014 ◽  
Vol 52 (1-2) ◽  
pp. 61-70
Author(s):  
S. Vorslova ◽  
J. Golushko ◽  
S. Galushko ◽  
A. Viksna

Abstract We report our experience with retention-parameter prediction for highly polar and charged analytes in a reversed-phase high-performance liquid chromatographic method. The solvatic retention model was used to predict the retention of phenylisothiocyanate derivatives of 25 natural amino acids under gradient elution conditions. Retention factors were calculated from the molecular parameters of the analyte structures and from the column and eluent characteristics. A step-by-step method, which includes a first-guess prediction of initial conditions from the structural formula and fine-tuning of the retention model parameters using data from successive runs, can substantially reduce method development time.
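The "fine-tuning from successive runs" step can be sketched with the much simpler linear solvent-strength approximation, log10 k = log10 kw − S·φ, fitted by least squares to scouting-run data. This is a stand-in, not the solvatic model the paper uses; the run data below are invented:

```python
import math

def fit_lss(phis, ks):
    # Least-squares fit of the simplified linear solvent-strength model
    #   log10 k = log10 kw - S * phi
    # from measured retention factors ks at organic fractions phis.
    ys = [math.log10(k) for k in ks]
    n = len(phis)
    mx, my = sum(phis) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in phis)
    sxy = sum((x - mx) * (y - my) for x, y in zip(phis, ys))
    slope = sxy / sxx
    return my - slope * mx, -slope  # (log10 kw, S)

def predict_k(log_kw, S, phi):
    # Predicted retention factor at a given organic fraction phi.
    return 10 ** (log_kw - S * phi)

# Two hypothetical scouting runs refine the first-guess parameters.
log_kw, S = fit_lss([0.3, 0.5], [10.0, 1.0])
```

Each additional run adds a (φ, k) point and tightens the fit, which is the sense in which successive runs "fine-tune" the model parameters.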


Vestnik MEI ◽  
2021 ◽  
pp. 117-127
Author(s):  
Oleg V. Bartenyev

Various text models used in solving natural language processing problems are considered. The text models are used to perform document classification, the results of which then serve to estimate the comparative effectiveness of the models. Each model is evaluated by the minimum of the two classification accuracy values obtained on the evaluation and training sets. A multilayer perceptron with one hidden layer is used as the classifier. The classifier input receives a real-valued vector representing the document; at its output, the classifier produces a prediction of the document's class. Depending on the text model used, the input vector is determined either by text frequency characteristics or by distributed vector representations of the pre-trained text model's tokens. The results demonstrate the advantage of models based on the Transformer architecture over the other models used in the study, e.g., the word2vec, doc2vec, and fastText models.
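The evaluation rule, scoring each text model by the worse of its two accuracies, can be sketched directly. The accuracy figures below are hypothetical, chosen only to show how the rule penalizes a model that overfits the training set:

```python
def model_score(train_acc, eval_acc):
    # Score a text model by the minimum of its training and evaluation
    # accuracies, so a model that only does well on training data ranks low.
    return min(train_acc, eval_acc)

# Hypothetical (train accuracy, evaluation accuracy) per text model.
results = {
    "tf-idf":      (0.97, 0.88),   # overfits: high train, lower eval
    "word2vec":    (0.93, 0.90),
    "transformer": (0.95, 0.94),
}
ranking = sorted(results, key=lambda m: model_score(*results[m]), reverse=True)
```

Under this rule the gap between the two accuracies matters as much as their absolute level, which is why it is a reasonable proxy for generalization.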


2018 ◽  
Vol 251 ◽  
pp. 05015
Author(s):  
Lidiia Shershova ◽  
Irina Nuzhina ◽  
Evgeny Kurochkin

The aim of the survey is to study the employment of graduates of the «Construction» programme at IKBFU for the period 2017-2018. The methods of systematic, logical, and comparative analysis were used, along with the results of a public opinion survey on the employment of «Construction» graduates and the authors' own research into employers' preferences and the needs of the region's construction industry, taking into account the development of new construction technologies. The aspects and content of the curricula that determine priorities in training personnel for the region's construction industry are disclosed. Graduate employment indexes are analysed as a criterion of the effectiveness of an educational institution. It is shown that practical orientation is an integral part of bachelor training, which is a special feature of personnel training for the construction industry in the Kaliningrad region. The results of graduate employment monitoring are published, and priority profiles of personnel training for the region's construction industry are underlined. The comparative analysis could become an important basis for fine-tuning territorial labour and employment policy.
