language representation
Recently Published Documents


TOTAL DOCUMENTS

315
(FIVE YEARS 158)

H-INDEX

22
(FIVE YEARS 4)

2022 ◽  
Vol 16 (2) ◽  
pp. 1-26
Author(s):  
Riccardo Cantini ◽  
Fabrizio Marozzo ◽  
Giovanni Bruno ◽  
Paolo Trunfio

The growing use of microblogging platforms generates a huge volume of posts that require effective methods for classification and search. On Twitter and other social media platforms, users exploit hashtags to facilitate the search, categorization, and spread of posts. Choosing appropriate hashtags for a post is not always easy, so posts are often published without hashtags or with poorly chosen ones. To address this issue, we propose a new model, called HASHET (HAshtag recommendation using Sentence-to-Hashtag Embedding Translation), aimed at suggesting a relevant set of hashtags for a given post. HASHET is based on two independent latent spaces for embedding the text of a post and the hashtags it contains. A mapping process based on a multi-layer perceptron is then used to learn a translation from the semantic features of the text to the latent representation of its hashtags. We evaluated the effectiveness of two language representation models for sentence embedding and tested different search strategies for semantic expansion, finding that the combined use of BERT (Bidirectional Encoder Representations from Transformers) and a global expansion strategy leads to the best recommendation results. HASHET has been evaluated on two real-world case studies related to the 2016 United States presidential election and the COVID-19 pandemic. The results reveal the effectiveness of HASHET in predicting one or more correct hashtags, with an average F-score of up to 0.82 and a recommendation hit-rate of up to 0.92. Our approach has been compared to the most relevant techniques in the literature (generative models, unsupervised models, and attention-based supervised models), achieving up to a 15% improvement in F-score for the hashtag recommendation task and 9% for the topic discovery task.
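As a rough illustration of HASHET's final retrieval step (a sketch, not the authors' code; the toy vectors and names below are invented), once the MLP has translated a post's sentence embedding into the hashtag latent space, recommendation reduces to a nearest-neighbor search by cosine similarity:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_hashtags(translated_vec, hashtag_space, k=2):
    """Rank hashtags by cosine similarity to the translated embedding."""
    ranked = sorted(hashtag_space.items(),
                    key=lambda item: cosine(translated_vec, item[1]),
                    reverse=True)
    return [tag for tag, _ in ranked[:k]]

# Toy hashtag latent space (invented; in HASHET these come from a
# hashtag embedding model trained on the corpus).
hashtag_space = {
    "#covid19":  np.array([0.9, 0.1, 0.0]),
    "#vaccine":  np.array([0.8, 0.3, 0.1]),
    "#election": np.array([0.0, 0.2, 0.9]),
}

# Hypothetical vector produced by the MLP for a pandemic-related post.
translated = np.array([0.85, 0.2, 0.05])
print(recommend_hashtags(translated, hashtag_space, k=2))
# → ['#covid19', '#vaccine']
```

A global expansion strategy, as evaluated in the paper, would widen this search around the translated vector rather than stop at the top-k neighbors.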


2022 ◽  
Author(s):  
Loïc Kwate Dassi ◽  
Matteo Manica ◽  
Daniel Probst ◽  
Philippe Schwaller ◽  
Yves Gaetan Nana Teukam ◽  
...  

The first decade of genome sequencing saw a surge in the characterization of proteins with unknown functionality. Even so, the function of more than 20% of proteins in well-studied model animals has yet to be identified, making the discovery of their active sites one of biology's greatest puzzles. Herein, we apply a Transformer architecture to a language representation of bio-catalyzed chemical reactions to learn the signal underlying substrate-active site atomic interactions. The language representation comprises a reaction in the simplified molecular-input line-entry system (SMILES) for substrates and products, complemented with the amino acid (AA) sequence of the enzyme. We demonstrate that, by creating a custom tokenizer and a score based on attention values, we can capture the substrate-active site interaction signal and use it to determine the position of the active site in unknown protein sequences, unraveling complicated 3D interactions from 1D representations alone. This approach exhibits remarkable results and can recover, with no supervision, 31.51% of the active site when co-crystallized substrate-enzyme structures are taken as ground truth, vastly outperforming approaches based on sequence similarity alone. Our findings are further corroborated by docking simulations on the 3D structures of a few enzymes. This work confirms the unprecedented impact of natural language processing, and more specifically of the Transformer architecture, on domain-specific languages, paving the way to effective solutions for protein functional characterization and bio-catalysis engineering.
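A minimal sketch of the attention-based scoring idea (invented numbers and residue names; the paper's actual tokenizer and model are more involved): given an attention matrix between substrate tokens and amino-acid tokens, each residue can be scored by the total attention it receives from the substrate, and the top-scoring residues predicted as the active site:

```python
import numpy as np

def active_site_scores(attention, aa_positions):
    """Sum, per amino-acid position, the attention received from substrate tokens.

    attention: (n_substrate_tokens, n_aa_tokens) matrix of attention weights.
    aa_positions: labels for the AA tokens (e.g. residue identifiers).
    """
    scores = attention.sum(axis=0)
    return dict(zip(aa_positions, scores))

def predict_active_site(attention, aa_positions, top_k=2):
    """Return the top_k residues with the highest aggregated attention."""
    scores = active_site_scores(attention, aa_positions)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy attention from 3 substrate SMILES tokens to 5 residues (invented).
attn = np.array([
    [0.05, 0.40, 0.05, 0.45, 0.05],
    [0.10, 0.35, 0.05, 0.40, 0.10],
    [0.05, 0.30, 0.10, 0.50, 0.05],
])
residues = ["S10", "H57", "G80", "D102", "A150"]
print(predict_active_site(attn, residues, top_k=2))
# → ['D102', 'H57']
```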


Author(s):  
Ekaterina V. Sharapova

The paper is dedicated to the analysis of irregular intensifying constructions in Fyodor Dostoevsky's individual style, more specifically of irregular lexical combinations consisting of an intensifier and a main word. The meaning of intensifiers is entirely focused on expressing the intensity of an attribute, action, or state denoted by the main word. Dostoevsky's writing style tends to form certain types of irregular lexical combinations that include intensifiers. Through a case study of the intensifiers ochen, goryachii / goryacho, izo vsekh sil, etc., the article describes several main types of irregular combinations with a view to identifying the semantic effects of irregular combinability. The analysis covers the complete corpus of Dostoevsky's texts (fiction, journalism, private correspondence, business correspondence, and documents). The first discriminant mark of an irregular combination is a semantic contradiction between its terms; the violation of idiomatic compatibility rules is the second feature taken into account. Since collocational rules in modern Russian differ considerably from the language norms of the 19th century, the subcorpus of 19th-century texts of the Russian National Corpus serves to test irregular idiomatic combinability. The article shows how irregular combinability causes contextual meaning shifts, profiling peripheral semantic features of the main word or imposing on it an atypical and contradictory semantics. In irregular lexical combinations, the language representation changes due to the author's original reinterpretation of the referential situation. Irregular lexical combinations with intensifiers are therefore a distinctive feature of Dostoevsky's individual writing style, revealing his original conceptualization of the world.
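The corpus-based test of combinability described above can be sketched programmatically (a toy illustration with invented, transliterated sentences; the actual study queries the Russian National Corpus): a combination is flagged as irregular if the intensifier-word bigram is unattested in the reference subcorpus:

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs of a sentence."""
    return list(zip(tokens, tokens[1:]))

# Toy 19th-century reference "corpus" (invented example sentences).
reference_corpus = [
    "on ochen ustal".split(),
    "ona goryacho lyubila ego".split(),
]
reference_bigrams = Counter(bg for sent in reference_corpus
                            for bg in bigrams(sent))

def is_irregular(intensifier, word):
    """A combination is treated as irregular if unattested in the corpus."""
    return reference_bigrams[(intensifier, word)] == 0

print(is_irregular("goryacho", "lyubila"))  # attested → False
print(is_irregular("goryacho", "molchal"))  # unattested → True
```

A real study would of course add lemmatization and frequency thresholds rather than require exact bigram matches.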


NeuroImage ◽  
2021 ◽  
pp. 118811
Author(s):  
Laura V. Cuaya ◽  
Raúl Hernández-Pérez ◽  
Marianna Boros ◽  
Andrea Deme ◽  
Attila Andics

2021 ◽  
Author(s):  
Eliane Maria De Bortoli Fávero ◽  
Dalcimar Casanova

The application of Natural Language Processing (NLP) has achieved a high level of relevance in several areas. In the field of software engineering (SE), NLP applications are based on the classification of similar texts (e.g., software requirements) and are applied in tasks such as estimating software effort and selecting human resources. Classifying software requirements is a complex task, given the informality and complexity inherent in the texts produced during the software development process. Pre-trained embedding models are a viable alternative, considering the low volume of labeled textual data in software engineering and the limited quality of those data. Although there is much research on applying word embeddings in several areas, to date no studies are known to have explored their application in creating a model specific to the SE domain. Thus, this article proposes a contextualized embedding model, called BERT_SE, which allows the recognition of specific and relevant terms in the context of SE. BERT_SE was assessed on the software requirements classification task, demonstrating an average improvement rate of 13% over the BERT_base model made available by the authors of BERT. The code and pre-trained models are available at https://github.com/elianedb.
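To make the downstream task concrete (a hedged sketch, not the authors' pipeline; toy 2-D vectors stand in for BERT_SE sentence embeddings), classifying requirements with fixed embeddings can be reduced to, for example, nearest-centroid matching:

```python
import numpy as np

def centroid(vectors):
    """Mean vector of a list of embeddings."""
    return np.mean(vectors, axis=0)

def classify(embedding, class_centroids):
    """Assign a requirement embedding to the class with the nearest centroid."""
    return min(class_centroids,
               key=lambda label: np.linalg.norm(embedding - class_centroids[label]))

# Toy labeled embeddings for two requirement classes (invented).
train = {
    "functional":     [np.array([1.0, 0.1]), np.array([0.9, 0.2])],
    "non-functional": [np.array([0.1, 1.0]), np.array([0.2, 0.9])],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}

# Embedding of a new requirement, e.g. "The system shall log every login attempt."
new_req = np.array([0.8, 0.3])
print(classify(new_req, centroids))
# → functional
```

The gain reported for BERT_SE would show up in this setting as embeddings that separate the requirement classes more cleanly than BERT_base's.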



2021 ◽  
Vol 21 (S9) ◽  
Author(s):  
Dongfang Li ◽  
Ying Xiong ◽  
Baotian Hu ◽  
Buzhou Tang ◽  
Weihua Peng ◽  
...  

Abstract Background Drug repurposing seeks new indications for approved drugs and is essential for investigating new uses of approved or investigational drugs. The Active Gene Annotation Corpus (AGAC), annotated by human experts, was developed to support knowledge discovery for drug repurposing. The AGAC track of the BioNLP Open Shared Tasks, which uses this corpus, was organized at EMNLP-BioNLP 2019; its "selective annotation" attribute makes the AGAC track more challenging than traditional sequence labeling tasks. In this work, we present our methods for trigger word detection (Task 1) and thematic role identification (Task 2) in the AGAC track. As a step toward drug repurposing research, our work can also be applied to large-scale automatic extraction of knowledge from medical text. Methods To meet the challenges of the two tasks, we treat Task 1 as medical named entity recognition (NER), targeting molecular phenomena related to gene mutation, and Task 2 as a relation extraction task that captures the thematic roles between entities. We exploit pre-trained biomedical language representation models (e.g., BioBERT) in an information extraction pipeline for mutation-disease knowledge collection from PubMed. Moreover, we design the fine-tuning framework using a multi-task learning technique and extra features. We further investigate different approaches to consolidating and transferring knowledge from varying sources and report the performance of our model on the AGAC corpus. Our approach is based on fine-tuning BERT, BioBERT, NCBI BERT, and ClinicalBERT with multi-task learning. Further experiments show the effectiveness of knowledge transfer and of ensembling the models of the two tasks. We compare the performance of various algorithms and conduct an ablation study on the development set of Task 1 to examine the contribution of each component of our method.
Results Compared with competitor methods, our model obtained the highest Precision (0.63), Recall (0.56), and F-score (0.60) in Task 1, ranking first. It outperformed the baseline method provided by the organizers by 0.10 in F-score. The model shares the same encoding layers for the named entity recognition and relation extraction parts, and we obtained the second-highest F-score (0.25) in Task 2 with a simple but effective framework. Conclusions Experimental results on the AGAC benchmark of genes with active mutation-centric function changes show that integrating pre-trained biomedical language representation models (i.e., BERT, NCBI BERT, ClinicalBERT, BioBERT) into an information extraction pipeline with multi-task learning improves the ability to collect mutation-disease knowledge from PubMed.
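To illustrate the sequence-labeling view of Task 1 (a generic BIO-tag decoder, not the authors' code; the tokens and tag labels below are invented for the example), trigger-word spans can be recovered from token-level model predictions like so:

```python
def decode_bio(tokens, tags):
    """Convert BIO tags into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["BRCA1", "mutation", "increases", "cancer", "risk"]
tags   = ["B-Gene", "B-Var",   "B-Reg",     "O",      "O"]
print(decode_bio(tokens, tags))
# → [('BRCA1', 'Gene'), ('mutation', 'Var'), ('increases', 'Reg')]
```

In the described pipeline, the tags would come from the shared BERT-based encoder, and the relation extraction head (Task 2) would then assign thematic roles between the decoded spans.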

