A Study of Analogical Density in Various Corpora at Various Granularity

Information ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 314
Author(s):  
Rashel Fam ◽  
Yves Lepage

In this paper, we inspect the theoretical problem of counting the number of analogies between sentences contained in a text. Based on this, we measure the analogical density of the text. We focus on analogy at the sentence level, based on the level of form rather than on the level of semantics. Experiments are carried out on two different corpora in six European languages known to have various levels of morphological richness. Corpora are tokenised using several tokenisation schemes: character, sub-word and word. For the sub-word tokenisation scheme, we employ two popular sub-word models: the unigram language model and byte-pair encoding. The results show that a corpus with a higher Type-Token Ratio tends to have a higher analogical density. We also observe that masking tokens based on their frequency helps to increase the analogical density. As for the tokenisation scheme, the results show that analogical density decreases from the character level to the word level. However, this does not hold when tokens are masked based on their frequencies. We find that tokenising the sentences using sub-word models and masking the least frequent tokens increases analogical density.
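
For readers unfamiliar with formal analogies, a minimal sketch of the counting problem follows. It assumes Lepage-style string analogies and uses only a well-known necessary condition (length and character-count balance across the diagonals) as the test; full verification and efficient enumeration are beyond a toy example, so the density reported is over candidate analogies only.

```python
from collections import Counter
from itertools import combinations

def may_be_analogy(a: str, b: str, c: str, d: str) -> bool:
    """Necessary condition for the formal analogy a : b :: c : d on strings:
    lengths and per-character counts must balance across the two diagonals.
    This is only a filter, not a full verification."""
    if len(a) + len(d) != len(b) + len(c):
        return False
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)

def analogical_density(sentences: list[str]) -> float:
    """Ratio of candidate analogies to 4-tuples examined.
    Brute force over pairs of pairs; suitable for toy corpora only."""
    pairs = list(combinations(range(len(sentences)), 2))
    hits, total = 0, 0
    for (i, j), (k, l) in combinations(pairs, 2):
        total += 1
        if may_be_analogy(sentences[i], sentences[j], sentences[k], sentences[l]):
            hits += 1
    return hits / total if total else 0.0

print(analogical_density(["walk", "walked", "talk", "talked", "jump"]))
```
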

Author(s):  
Kelvin Guu ◽  
Tatsunori B. Hashimoto ◽  
Yonatan Oren ◽  
Percy Liang

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.
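
A toy sketch of the two-stage generative process may help to fix ideas. The substitution table below stands in for the learned neural editor and the latent edit vector, so everything here is illustrative rather than the authors' model.

```python
import random

# Toy illustration of the prototype-then-edit generative process:
# p(x) = sum over prototypes x' of p(x') * p(x | x', z), where z is a
# latent edit vector. Here the "edit model" is a stand-in that swaps
# words from a small substitution table; the paper learns it neurally.
CORPUS = ["the food was great", "the service was slow"]
SUBS = {"great": ["amazing", "fantastic"], "slow": ["terrible", "quick"],
        "food": ["pasta", "coffee"]}

def sample_sentence(rng: random.Random) -> str:
    prototype = rng.choice(CORPUS)         # step 1: sample a prototype
    words = prototype.split()
    i = rng.randrange(len(words))          # step 2: sample a latent edit
    if words[i] in SUBS:                   # step 3: apply the edit
        words[i] = rng.choice(SUBS[words[i]])
    return " ".join(words)

rng = random.Random(0)
for _ in range(3):
    print(sample_sentence(rng))
```
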


2001 ◽  
Vol 6 (3) ◽  
pp. 43-53 ◽  
Author(s):  
Primoz Jakopin

In this paper, a language model based on probabilities of text n-grams is used as a measure of distance between Slovenian and 15 other European languages. During the construction of the model, a Huffman tree is generated from all the n-grams (n = 1 to 32, frequency 2 or more) in the training corpus of Slovenian literary texts (2.7 million words), and appropriate Huffman codes are computed for every leaf in the tree. To apply the model to a new text sample, the sample is cut into n-grams (1–32) in such a way that the sum of the model's Huffman code lengths over all the obtained n-grams is minimal. The above model, applied to all 16 translations of Plato's Republic from the TELRI CD-ROM, produced the following language order (average coding length in bits per character): Slovenian (2.37), Serbo-Croatian (3.77), Croatian (3.84), Bulgarian (3.96), Czech (4.10), Polish (4.32), Russian (4.46), Slovak (4.46), Latvian (4.74), Lithuanian (4.94), English (5.40), French (5.67), German (5.69), Romanian (5.76), Finnish (6.11), and Hungarian (6.47).
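
The segmentation step lends itself to a short dynamic-programming sketch: given a table of code lengths for known n-grams, find the cut of the sample minimising the total code length. The table below is hypothetical, standing in for the lengths read off the Huffman tree built over the Slovenian training corpus.

```python
import math

# best[i] is the minimal total code length (in bits) for the first
# i characters of the sample, considering all known n-grams ending at i.
def min_coding_length(text: str, code_len: dict[str, float],
                      max_n: int = 32) -> float:
    best = [math.inf] * (len(text) + 1)
    best[0] = 0.0
    for i in range(1, len(text) + 1):
        for n in range(1, min(max_n, i) + 1):
            gram = text[i - n:i]
            if gram in code_len:
                best[i] = min(best[i], best[i - n] + code_len[gram])
    return best[len(text)]

# Hypothetical code lengths; a real table comes from the Huffman tree.
table = {"a": 5.0, "b": 6.0, "ab": 7.0, "ba": 8.0, "aba": 9.0}
sample = "abab"
print(min_coding_length(sample, table) / len(sample), "bits per character")
```
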




2021 ◽  
pp. 318-330
Author(s):  
Melika Golestani ◽  
Seyedeh Zahra Razavi ◽  
Zeinab Borhanifard ◽  
Farnaz Tahmasebian ◽  
Hesham Faili

2021 ◽  
Vol 9 (5) ◽  
pp. 306-309
Author(s):  
Bavishi Hilloni ◽  
Debalina Nandy

The COVID-19 literature has grown at a rapid pace, and the artificial intelligence community, along with researchers all over the globe, has a responsibility to help the medical community. The CORD-19 dataset contains various articles about COVID-19, SARS-CoV-2, and related coronaviruses. Due to the massive size of this literature, it is difficult to find relevant and accurate pieces of information. Question answering systems have been built by taking pre-trained models and fine-tuning them using BERT Transformers. BERT is a language model that learns powerful representations from token-level and sentence-level training. Variants of BERT such as ALBERT, DistilBERT, RoBERTa, and SciBERT, along with BioSentVec, can be effective in training the model, as they help to improve accuracy and increase training speed. This article also provides information on using SPECTER document-level relatedness, such as the CORD-19 embeddings, for pre-training a Transformer language model. This article will help in building a question answering model to facilitate research and save lives in the fight against COVID-19.
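
As a rough illustration of the kind of system described, here is a minimal extractive question answering sketch using the Hugging Face transformers pipeline. The checkpoint and the passage are assumptions, not the authors' setup; a biomedical variant could be swapped in.

```python
from transformers import pipeline

# Generic SQuAD2-tuned checkpoint as a stand-in for a domain model.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = ("SARS-CoV-2 is the coronavirus that causes COVID-19. "
           "It spreads primarily through respiratory droplets.")
result = qa(question="How does SARS-CoV-2 spread?", context=context)
print(result["answer"], f"(score={result['score']:.2f})")
```
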


2020 ◽  
Author(s):  
Li Mingzheng ◽  
Chen Lei ◽  
Zhao Jing ◽  
Li Qiang

Abstract A huge number of stock reviews have appeared on the Internet due to its rapid development; therefore, sentiment analysis of stock reviews has profound significance for the study of the financial market. Due to the lack of a large amount of labeled data, the accuracy of existing sentiment analysis of Chinese stock reviews remains to be further improved. In this paper, a sentiment analysis algorithm for Chinese stock reviews based on BERT is proposed, and it improves the accuracy of sentiment classification. The algorithm uses the BERT pre-trained language model to produce sentence-level representations of stock reviews, and then feeds the obtained feature vectors into a classifier layer for classification. In the experiments, we show our method achieves nearly 8% and 9% improvements in F1 over TextCNN and TextRNN, respectively. Our model obtains the best results via fine-tuning, which is proved to be effective for Chinese stock review sentiment analysis.
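
A minimal sketch of the described setup, sentence-level BERT encoding followed by a classification layer, might look as follows. The checkpoint, label set, and example review are assumptions; the classification head is randomly initialised until fine-tuned on labeled reviews, which is the step the paper's results depend on.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Generic Chinese BERT checkpoint as a stand-in for the paper's model.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # assumed labels: 0 = bearish, 1 = bullish

review = "这只股票走势强劲，值得买入。"  # "This stock is trending strongly; worth buying."
inputs = tokenizer(review, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits   # untrained head: fine-tune before use
print("predicted label:", logits.argmax(dim=-1).item())
```
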


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Yufeng Sun ◽  
Fengbao Yang ◽  
Xiaoxia Wang ◽  
Hongsong Dong

The automatic generation of draft procuratorial suggestions requires extracting descriptions of illegal facts, administrative omissions, descriptions of laws and regulations, and other information from case documents. Existing deep learning methods have mainly focused on context-free word embeddings when addressing legal domain-specific extractive summarization tasks; these cannot achieve a good semantic understanding of the text, which in turn leads to adverse summarization performance. To this end, we propose BERTSLCA, a novel method based on deep contextualized embeddings, to conduct the extractive summarization task. The model is mainly based on BERTSUM, a variant of BERT. First, the input document is fed into BERTSUM to obtain sentence-level embeddings. Then, we design an extracting architecture that captures long-range dependencies between sentences using Bi-directional Long Short-Term Memory (Bi-LSTM) units; at the end of the architecture, three cascaded convolution kernels with different sizes extract relationships between adjacent sentences. Finally, we introduce an attention mechanism to strengthen the ability to distinguish the importance of different sentences. To the best of our knowledge, this is the first work to use a pretrained language model for extractive summarization tasks in the field of Chinese judicial litigation. Experimental results on public interest litigation data and the CAIL 2020 dataset demonstrate that the proposed method achieves competitive performance.
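
A rough PyTorch sketch of the extractor head described above: sentence embeddings (e.g., from BERTSUM) pass through a Bi-LSTM, then three cascaded 1-D convolutions of different kernel sizes, then additive attention scores each sentence. Layer sizes and kernel sizes are illustrative guesses, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ExtractorHead(nn.Module):
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.Sequential(  # cascaded kernels of sizes 1, 3, 5
            nn.Conv1d(2 * hidden, hidden, kernel_size=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
        )
        self.attn = nn.Linear(hidden, 1)  # additive attention -> importance

    def forward(self, sent_embs):          # (batch, n_sents, dim)
        h, _ = self.bilstm(sent_embs)      # (batch, n_sents, 2*hidden)
        h = self.convs(h.transpose(1, 2)).transpose(1, 2)
        return self.attn(h).squeeze(-1)    # (batch, n_sents) sentence scores

scores = ExtractorHead()(torch.randn(2, 10, 768))
print(scores.shape)  # torch.Size([2, 10])
```
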


10.2196/23357 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e23357
Author(s):  
Ying Xiong ◽  
Shuai Chen ◽  
Qingcai Chen ◽  
Jun Yan ◽  
Buzhou Tang

Background With the popularity of electronic health records (EHRs), the quality of health care has improved. However, EHRs have also introduced problems, such as the growing use of copy-and-paste and templates, resulting in EHRs of low content quality. In order to minimize data redundancy in different documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity among clinical text snippets. Objective In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. Methods We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) a character-level representation module based on a convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) a sentence-level representation module that adopts a pretrained language model, bidirectional encoder representations from transformers (BERT), to encode clinical text snippet pairs; and (3) an entity-level representation module to model clinical entity information in clinical text snippets. In the case of entity-level representation, we compare 2 methods. One encodes entities by the entity-type label sequence corresponding to the text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). Results We conduct experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge for model performance evaluation. The model using only BERT for text snippet pair encoding achieved a Pearson correlation coefficient (PCC) of 0.848. When character-level representation and entity-level representation are individually added into our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation are added into our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). Conclusions Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.
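
A toy sketch of the multi-level matching idea: a BERT [CLS] vector for the snippet pair is concatenated with a character-level CNN feature, and a small regression head predicts the similarity score. The checkpoint, dimensions, and character vocabulary are assumptions, and the entity-level module is omitted; this is not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class STSModel(nn.Module):
    def __init__(self, char_vocab=128, char_dim=32, char_feats=64):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_feats, kernel_size=3, padding=1)
        self.head = nn.Linear(self.bert.config.hidden_size + char_feats, 1)

    def forward(self, enc, char_ids):
        cls = self.bert(**enc).last_hidden_state[:, 0]   # sentence level
        ch = self.char_cnn(self.char_emb(char_ids).transpose(1, 2))
        ch = ch.max(dim=-1).values                       # character level
        return self.head(torch.cat([cls, ch], dim=-1)).squeeze(-1)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("patient denies chest pain", "no chest pain reported",
          return_tensors="pt")
chars = torch.tensor([[ord(c) % 128 for c in "patient denies chest pain"]])
print(STSModel()(enc, chars))  # unscaled score; train with MSE against gold
```
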


2020 ◽  
Vol 63 (7) ◽  
pp. 2281-2292
Author(s):  
Ying Zhao ◽  
Xinchun Wu ◽  
Hongjun Chen ◽  
Peng Sun ◽  
Ruibo Xie ◽  
...  

Purpose This exploratory study aimed to investigate the potential impact of sentence-level comprehension and sentence-level fluency on passage comprehension of deaf students in elementary school. Method A total of 159 deaf students, 65 students (mean age = 13.46 years) in Grades 3 and 4 and 94 students (mean age = 14.95 years) in Grades 5 and 6, were assessed for nonverbal intelligence, vocabulary knowledge, sentence-level comprehension, sentence-level fluency, and passage comprehension. Group differences were examined using t tests, whereas the predictive and mediating mechanisms were examined using regression modeling. Results The regression analyses showed that the effect of sentence-level comprehension on passage comprehension was not significant, whereas sentence-level fluency was an independent predictor in Grades 3–4. Sentence-level comprehension and fluency contributed significant variance to passage comprehension in Grades 5–6. Sentence-level fluency fully mediated the influence of sentence-level comprehension on passage comprehension in Grades 3–4, playing a partial mediating role in Grades 5–6. Conclusions The relative contributions of sentence-level comprehension and fluency to deaf students' passage comprehension varied, and sentence-level fluency mediated the relationship between sentence-level comprehension and passage comprehension.
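
For readers unfamiliar with mediation analysis, a Baron–Kenny style check on synthetic data illustrates the kind of regression modeling reported here. All coefficients and data below are made up and bear no relation to the study's measurements.

```python
import numpy as np
import statsmodels.api as sm

# Does fluency (M) mediate the effect of comprehension (X) on passage
# comprehension (Y)? Synthetic data with a mostly indirect effect.
rng = np.random.default_rng(0)
n = 159
X = rng.normal(size=n)                                  # sentence-level comprehension
M = 0.6 * X + rng.normal(scale=0.8, size=n)             # sentence-level fluency
Y = 0.5 * M + 0.1 * X + rng.normal(scale=0.8, size=n)   # passage comprehension

total = sm.OLS(Y, sm.add_constant(X)).fit()                          # c path
direct = sm.OLS(Y, sm.add_constant(np.column_stack([X, M]))).fit()   # c' and b
print("total effect c:   %.3f" % total.params[1])
print("direct effect c': %.3f" % direct.params[1])
print("mediator path b:  %.3f" % direct.params[2])
# Full mediation: c' near zero while b is significant; partial: both remain.
```
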

