METHOD FOR DETERMINING THE SEMANTIC SIMILARITY OF ARBITRARY LENGTH TEXTS USING THE TRANSFORMERS MODELS

2021 ◽  
Vol 5 (2) ◽  
pp. 126-130
Author(s):  
Сергій Олізаренко ◽  
В’ячеслав Радченко

The paper presents a method for determining the semantic similarity of texts of arbitrary length based on their vector representations. These representations are obtained with a multilingual Transformer model, and the task of determining semantic similarity is cast as a classification problem over pairs of text sequences using that model. A comparative analysis was performed to select the Transformer model best suited to this class of problems. The main stages of the method are: fine-tuning the Transformer model on the second pretraining objective (sentence prediction), and selecting and implementing a summarization method for text sequences longer than 512 (1024) tokens, so that semantic similarity can be determined for texts of arbitrary length.
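The chunk-and-compare idea behind handling over-long inputs can be sketched as follows. This is a toy illustration only: bag-of-words vectors stand in for the Transformer embeddings, and all function names are illustrative, not from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse vectors (dicts: term -> count).
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def chunk(tokens, max_len=512):
    # Split an over-long token sequence into model-sized windows.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def long_text_similarity(a, b, max_len=512):
    # Embed each chunk (here a bag-of-words Counter stands in for a
    # Transformer sentence vector), score every chunk pair, and average:
    # one simple way to reduce arbitrary-length comparison to
    # fixed-length comparisons.
    chunks_a = chunk(a.lower().split(), max_len)
    chunks_b = chunk(b.lower().split(), max_len)
    scores = [cosine(Counter(ca), Counter(cb))
              for ca in chunks_a for cb in chunks_b]
    return sum(scores) / len(scores)
```

A real implementation would replace the Counter vectors with pooled Transformer embeddings (or a summary of each chunk), which is the substitution the paper's summarization stage addresses.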

BMC Genomics ◽  
2019 ◽  
Vol 20 (S9) ◽  
Author(s):  
Xiaoshi Zhong ◽  
Rama Kaalia ◽  
Jagath C. Rajapakse

Abstract Background Semantic similarity between Gene Ontology (GO) terms is a fundamental measure for many bioinformatics applications, such as determining functional similarity between genes or proteins. Most previous research exploited information content to estimate the semantic similarity between GO terms; recently, some research exploited word embeddings to learn vector representations for GO terms from a large-scale corpus. In this paper, we propose a novel method, named GO2Vec, that exploits graph embeddings to learn vector representations for GO terms from the GO graph. GO2Vec combines information from both the GO graph and GO annotations, and its learned vectors can be applied to a variety of bioinformatics applications, such as calculating functional similarity between proteins and predicting protein-protein interactions. Results We conducted two kinds of experiments to evaluate the quality of GO2Vec: (1) functional similarity between proteins on the Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) dataset and (2) prediction of protein-protein interactions on the Yeast and Human datasets from the STRING database. Experimental results demonstrate the effectiveness of GO2Vec over the information content-based measures and the word embedding-based measures. Conclusion Our experimental results demonstrate the effectiveness of using graph embeddings to learn vector representations from undirected GO and GOA graphs. Our results also demonstrate that GO annotations provide useful information for computing the similarity between GO terms and between proteins.
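Once term vectors have been learned from the graph, protein-level similarity reduces to aggregating term vectors and comparing them. A minimal sketch, assuming mean aggregation and cosine similarity (the exact aggregation in GO2Vec may differ; the embeddings below are made-up 2-d toys):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def protein_vector(go_terms, term_vec):
    # Represent a protein by the mean of its annotated GO term vectors
    # (one common aggregation scheme).
    vecs = [term_vec[t] for t in go_terms]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

# Toy 2-d embeddings standing in for vectors learned from the GO graph.
term_vec = {
    "GO:0008150": [1.0, 0.0],
    "GO:0003674": [0.8, 0.6],
    "GO:0005575": [0.0, 1.0],
}
p1 = protein_vector(["GO:0008150", "GO:0003674"], term_vec)
p2 = protein_vector(["GO:0003674", "GO:0005575"], term_vec)
sim = cosine(p1, p2)  # functional similarity proxy for the two proteins
```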


2021 ◽  
Vol 2142 (1) ◽  
pp. 012013
Author(s):  
A S Nazdryukhin ◽  
A M Fedrak ◽  
N A Radeev

Abstract This work presents the results of applying self-normalizing neural networks with automatic hyperparameter selection, TabNet, and NODE to the problem of tabular data classification. A method for automatic hyperparameter selection was implemented. Testing was carried out with the open-source OpenML AutoML Benchmark framework. As part of the work, a comparative analysis against seven classification methods was carried out, and experiments were run on 39 datasets with 5 methods. NODE showed the best results among the methods considered and outperformed the standard methods on four datasets.
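Automatic hyperparameter selection, as benchmarked here, amounts to searching candidate configurations and keeping the one that validates best. A minimal random-search sketch with a hypothetical one-parameter threshold "model" (purely illustrative, not any of the benchmarked methods):

```python
import random

def accuracy(model, data):
    # Fraction of (x, y) pairs the model labels correctly.
    return sum(model(x) == y for x, y in data) / len(data)

def threshold_model(t):
    # A one-parameter "model": classify by thresholding the feature.
    return lambda x: int(x > t)

def random_search(val, n_trials=50, seed=0):
    # Automatic hyperparameter selection in miniature: sample candidate
    # hyperparameters, keep the one with the best validation accuracy.
    rng = random.Random(seed)
    best_t, best_acc = None, -1.0
    for _ in range(n_trials):
        t = rng.uniform(0.0, 1.0)
        acc = accuracy(threshold_model(t), val)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

val = [(0.1, 0), (0.2, 0), (0.7, 1), (0.9, 1)]
t, acc = random_search(val)
```

Real AutoML systems refine this loop with smarter samplers (e.g. Bayesian optimization) and cross-validation, but the select-by-validation-score structure is the same.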


2021 ◽  
Author(s):  
Abdul Wahab ◽  
Rafet Sifa

In this paper, we propose a new model named DIBERT, which stands for Dependency Injected Bidirectional Encoder Representations from Transformers. DIBERT is a variation of BERT with an additional third objective called Parent Prediction (PP), alongside Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). PP injects the syntactic structure of a dependency tree while pre-training DIBERT, which produces syntax-aware generic representations. We use the WikiText-103 benchmark dataset to pre-train both BERT-Base and DIBERT. After fine-tuning, we observe that DIBERT performs better than BERT-Base on various downstream tasks, including semantic similarity, natural language inference, and sentiment analysis.
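The Parent Prediction objective can be illustrated as a per-token classification over sentence positions: for each token, the model scores which position is its dependency-tree parent, trained with cross-entropy. A toy numeric sketch (the logits, tree, and loss combination are made-up assumptions, not DIBERT's actual values):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def parent_prediction_loss(logits_per_token, parent_index):
    # For each token i, logits_per_token[i][j] scores position j as
    # the dependency parent of token i; the loss is mean cross-entropy
    # against the gold parent indices.
    total = 0.0
    for logits, parent in zip(logits_per_token, parent_index):
        probs = softmax(logits)
        total += -math.log(probs[parent])
    return total / len(parent_index)

# Toy 3-token sentence; each row scores candidate parent positions.
logits = [[2.0, 0.1, 0.1],
          [0.1, 0.1, 2.0],
          [0.1, 2.0, 0.1]]
parents = [0, 2, 1]  # hypothetical gold parent index per token
pp = parent_prediction_loss(logits, parents)
# The full pre-training loss would combine the three objectives,
# e.g. loss = mlm_loss + nsp_loss + pp_loss (weighting is an assumption).
```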


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Sebastian Otálora ◽  
Niccolò Marini ◽  
Henning Müller ◽  
Manfredo Atzori

Abstract Background One challenge in training deep convolutional neural network (CNN) models with whole slide images (WSIs) is providing the required large number of costly, manually annotated image regions. Strategies to alleviate the scarcity of annotated data include transfer learning, data augmentation, and training the models with less expensive image-level annotations (weakly supervised learning). However, it is not clear how to combine transfer learning in a CNN model when different data sources are available for training, or how to leverage the combination of large amounts of weakly annotated images with a set of local region annotations. This paper aims to evaluate CNN training strategies based on transfer learning that leverage the combination of weak and strong annotations in heterogeneous data sources. The trade-off between classification performance and annotation effort is explored by evaluating a CNN that learns from strong labels (region annotations) and is later fine-tuned on a dataset with less expensive weak (image-level) labels. Results As expected, model performance on strongly annotated data steadily increases with the percentage of strong annotations used, reaching a performance comparable to pathologists (κ = 0.691 ± 0.02). Nevertheless, performance drops sharply in the WSI classification scenario (κ = 0.307 ± 0.133) and remains lower regardless of the number of annotations used. Performance increases when the model is fine-tuned for the task of Gleason scoring with the weak WSI labels (κ = 0.528 ± 0.05). Conclusion Combining weak and strong supervision improves on strong supervision alone in the classification of Gleason patterns using tissue microarrays (TMAs) and WSI regions.
Our results point to effective strategies for training CNN models that combine few annotated data and heterogeneous data sources. In the controlled TMA scenario, performance increases with the number of annotations used to train the model. Nevertheless, performance is hindered when the trained TMA model is applied directly to the more challenging WSI classification problem. This demonstrates that a good pre-trained model for prostate cancer TMA image classification may lead to the best downstream model if fine-tuned on the WSI target dataset. The source code for reproducing the experiments is available at: https://github.com/ilmaro8/Digital_Pathology_Transfer_Learning
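The κ values reported above are Cohen's kappa, i.e. rater agreement corrected for chance. A minimal self-contained implementation (the example labels are made-up, not the study's data):

```python
from collections import Counter

def cohens_kappa(a, b):
    # Cohen's kappa between two label sequences: observed agreement
    # minus chance agreement, normalized by (1 - chance agreement).
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical Gleason scores from a model and a pathologist.
model        = [3, 3, 4, 4, 5, 3, 4, 5]
pathologist  = [3, 3, 4, 5, 5, 3, 4, 4]
kappa = cohens_kappa(model, pathologist)
```

Kappa is preferred over raw accuracy here because Gleason classes are imbalanced, so two raters can agree often by chance alone.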


2014 ◽  
Vol 52 (1-2) ◽  
pp. 61-70
Author(s):  
S. Vorslova ◽  
J. Golushko ◽  
S. Galushko ◽  
A. Viksna

Abstract We report our experience with retention-parameter prediction for highly polar and charged analytes in a reversed-phase high-performance liquid chromatographic method. The solvatic retention model was used to predict the retention of phenylisothiocyanate derivatives of 25 natural amino acids under gradient elution conditions. Retention factors were calculated from the molecular parameters of the analyte structures and from the column and eluent characteristics. A step-by-step method, which includes a first-guess prediction of initial conditions from the structural formula and fine-tuning of the retention model parameters using data from successive runs, can substantially reduce method development time.
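The "fine-tuning from successive runs" step can be sketched with the much simpler linear solvent-strength approximation, log10 k = log10 kw − S·φ, fitted by least squares to scouting-run data. This is a stand-in, not the solvatic model the paper uses; the run data below are invented:

```python
import math

def fit_lss(phis, ks):
    # Least-squares fit of the simplified linear solvent-strength model
    #   log10 k = log10 kw - S * phi
    # from measured retention factors ks at organic fractions phis.
    ys = [math.log10(k) for k in ks]
    n = len(phis)
    mx, my = sum(phis) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in phis)
    sxy = sum((x - mx) * (y - my) for x, y in zip(phis, ys))
    slope = sxy / sxx
    return my - slope * mx, -slope  # (log10 kw, S)

def predict_k(log_kw, S, phi):
    # Predicted retention factor at a given organic fraction phi.
    return 10 ** (log_kw - S * phi)

# Two hypothetical scouting runs refine the first-guess parameters.
log_kw, S = fit_lss([0.3, 0.5], [10.0, 1.0])
```

Each additional run adds a (φ, k) point and tightens the fit, which is the sense in which successive runs "fine-tune" the model parameters.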


Vestnik MEI ◽  
2021 ◽  
pp. 117-127
Author(s):  
Oleg V. Bartenyev

Various text models used in solving natural language processing problems are considered. The text models are used to perform document classification, the results of which then serve to estimate the comparative effectiveness of the models. Each model is evaluated by the minimum of the two classification accuracy values obtained on the evaluation and training sets. A multilayer perceptron with one hidden layer is used as the classifier. The classifier input receives a real-valued vector representing the document; at its output, the classifier produces a prediction of the document's class. Depending on the text model used, the input vector is determined either by text frequency characteristics or by distributed vector representations of the pre-trained text model's tokens. The results demonstrate the advantage of models based on the Transformer architecture over the other models used in the study, e.g., the word2vec, doc2vec, and fastText models.
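The evaluation rule, scoring each text model by the worse of its two accuracies, can be sketched directly. The accuracy figures below are hypothetical, chosen only to show how the rule penalizes a model that overfits the training set:

```python
def model_score(train_acc, eval_acc):
    # Score a text model by the minimum of its training and evaluation
    # accuracies, so a model that only does well on training data ranks low.
    return min(train_acc, eval_acc)

# Hypothetical (train accuracy, evaluation accuracy) per text model.
results = {
    "tf-idf":      (0.97, 0.88),   # overfits: high train, lower eval
    "word2vec":    (0.93, 0.90),
    "transformer": (0.95, 0.94),
}
ranking = sorted(results, key=lambda m: model_score(*results[m]), reverse=True)
```

Under this rule the gap between the two accuracies matters as much as their absolute level, which is why it is a reasonable proxy for generalization.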


2018 ◽  
Vol 251 ◽  
pp. 05015
Author(s):  
Lidiia Shershova ◽  
Irina Nuzhina ◽  
Evgeny Kurochkin

The aim of the survey is to study the employment of graduates of the «Construction» programme at IKBFU for the period 2017-2018. The methods of systematic, logical, and comparative analysis were used, along with the results of a public opinion survey on the employment of «Construction» graduates and the authors' own research into employers' preferences and the needs of the region's construction industry, taking into account the development of new construction technologies. The aspects and content of the curricula that determine priorities in training personnel for the region's construction industry are disclosed. Graduate employment indexes are analysed as a criterion of the effectiveness of an educational institution. It is shown that practical orientation is an integral part of bachelor training, which is a special feature of personnel training for the construction industry in the Kaliningrad region. The results of graduate employment monitoring are published, and priority profiles of personnel training for the region's construction industry are underlined. The comparative analysis could become an important basis for fine-tuning territorial labour and employment policy.
