Unsupervised learning of semantic representation for documents with the law of total probability

2017 ◽  
Vol 24 (4) ◽  
pp. 491-522 ◽  
Author(s):  
YANG WEI ◽  
JINMAO WEI ◽  
ZHENGLU YANG

Abstract The semantic information of documents needs to be represented because it is the basis for many applications, such as document summarization, web search, and text analysis. Although many studies have explored this problem by enriching document vectors with the relatedness of the words involved, the performance remains far from satisfactory because the physical boundaries of documents hinder the evaluation of relatedness between words. To address this problem, we propose an effective approach that further infers the implicit relatedness between words via their common related words. To avoid overestimating the implicit relatedness, we restrict the inference in terms of the marginal probabilities of the words, based on the law of total probability. The proposed method effectively measures the relatedness between words, which we confirm theoretically and experimentally. Thorough evaluation on real datasets shows that the proposed method achieves significant improvement on document clustering over state-of-the-art methods.
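As a rough illustration of the idea (not the authors' actual algorithm), the following Python sketch infers a missing conditional relatedness P(w_i | w_j) by chaining through common related words, then rescales each row so that the law of total probability, P(w_i) = Σ_j P(w_i | w_j) P(w_j), is not violated. The function name and the toy normalization are ours.

```python
import numpy as np

# Toy sketch: infer implicit relatedness between words that never co-occur,
# via their common related words, capped so that the inferred conditional
# probabilities stay consistent with the law of total probability:
#   P(w_i) = sum_j P(w_i | w_j) * P(w_j)
# This is an illustration of the general idea, not the paper's algorithm.

def infer_implicit_relatedness(P_cond, P_marg):
    """P_cond[i, j] ~ P(w_i | w_j) from direct co-occurrence; P_marg[i] ~ P(w_i)."""
    n = P_cond.shape[0]
    inferred = P_cond.copy()
    for i in range(n):
        for j in range(n):
            if i != j and P_cond[i, j] == 0.0:
                # Relate w_i and w_j through every common related word w_k.
                inferred[i, j] = np.sum(P_cond[i, :] * P_cond[:, j])
        # Rescale row i so the total-probability identity is not exceeded:
        # sum_j inferred[i, j] * P_marg[j] must not surpass P_marg[i].
        mass = np.dot(inferred[i, :], P_marg)
        if mass > P_marg[i] > 0.0:
            inferred[i, :] *= P_marg[i] / mass
    return inferred
```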

2021 ◽  
Vol 21 (S9) ◽  
Author(s):  
Yinyu Lan ◽  
Shizhu He ◽  
Kang Liu ◽  
Xiangrong Zeng ◽  
Shengping Liu ◽  
...  

Abstract Background Knowledge graphs (KGs), especially medical knowledge graphs, are often significantly incomplete, creating a demand for medical knowledge graph completion (MedKGC). MedKGC finds new facts based on the existing knowledge in a KG. Path-based knowledge reasoning is one of the most important approaches to this task and has received great attention in recent years because of its high performance and interpretability. Traditional methods such as the path ranking algorithm take the paths between an entity pair as atomic features. However, medical KGs are very sparse, which makes it difficult to model effective semantic representations for extremely sparse path features. This sparsity is mainly reflected in the long-tailed distributions of entities and paths. Moreover, previous methods merely consider the context structure of the paths in the knowledge graph and ignore the textual semantics of the symbols in a path. Their performance therefore cannot be further improved, owing to entity sparseness and path sparseness. Methods To address these issues, this paper proposes two novel path-based reasoning methods, which adopt the textual semantic information of entities and paths for MedKGC to solve the sparsity issues of entities and paths, respectively. Using the pre-trained model BERT to combine the textual semantic representations of entities and relations, we model the task of symbolic reasoning in a medical KG as numerical computation over textual semantic representations. Results Experimental results on a publicly available, authoritative Chinese symptom knowledge graph demonstrate that the proposed method significantly outperforms state-of-the-art path-based knowledge graph reasoning methods, improving average performance by 5.83% across all relations. Conclusions In this paper, we propose two new knowledge graph reasoning algorithms that adopt the textual semantic information of entities and paths and can effectively alleviate the sparsity problems of entities and paths in MedKGC. To the best of our knowledge, this is the first method to use pre-trained language models and textual path representations for medical knowledge reasoning. Our method can complete the impaired symptom knowledge graph in an interpretable way, and it outperforms state-of-the-art path-based reasoning methods.
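A minimal sketch of the textual-path idea, assuming a generic BERT encoder from the Hugging Face transformers library: a KG path is verbalized as text, encoded, and scored against a relation vector. The verbalization scheme and the cosine-similarity scoring head are illustrative placeholders, not the paper's actual MedKGC architecture.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch: represent a KG path as text and score whether it supports a target
# relation. The verbalization and scoring head here are illustrative; the
# paper's actual models are more elaborate.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_path(entities, relations):
    """Verbalize a path (e1 -r1-> e2 -r2-> ...) and return its [CLS] vector."""
    parts = [entities[0]]
    for rel, ent in zip(relations, entities[1:]):
        parts += [rel, ent]
    inputs = tokenizer(" ".join(parts), return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] embedding, shape (1, 768)

def path_supports_relation(path_vec, rel_vec):
    """Cosine similarity as a toy plausibility score for the target relation."""
    return torch.cosine_similarity(path_vec, rel_vec).item()
```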


2021 ◽  
Vol 14 (11) ◽  
pp. 2445-2458
Author(s):  
Valerio Cetorelli ◽  
Paolo Atzeni ◽  
Valter Crescenzi ◽  
Franco Milicchio

We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large, templated websites and therefore at effectively tackling Web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in the literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extracting their data. Finally, we present an unsupervised learning algorithm that induces a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. Experiments on consolidated benchmarks show that the approach substantially improves on the state of the art.
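The grammar formalism itself is defined in the paper; the toy Python sketch below only illustrates the underlying intuition that invariant template fragments ("landmarks") delimit the variant regions holding the data to extract. The function name and example page are ours, not the paper's system.

```python
# Toy illustration of the landmark intuition: invariant template fragments
# ("landmarks") delimit the page regions that hold the data to extract.
# This is not the paper's grammar formalism, just the underlying idea.

def extract_between_landmarks(html, landmarks):
    """Return the text fragments found between consecutive landmarks."""
    fields, pos = [], 0
    for left, right in zip(landmarks, landmarks[1:]):
        start = html.index(left, pos) + len(left)
        end = html.index(right, start)
        fields.append(html[start:end].strip())
        pos = end
    return fields

page = "<li><b>Title:</b> Landmark grammars <i>Year:</i> 2021</li>"
print(extract_between_landmarks(page, ["<b>Title:</b>", "<i>Year:</i>", "</li>"]))
# ['Landmark grammars', '2021']
```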


2021 ◽  
Vol 10 (2) ◽  
pp. 42-60
Author(s):  
Khadidja Chettah ◽  
Amer Draa

Automatic text summarization has recently become a key instrument for reducing the huge quantity of textual data. In this paper, the authors propose a quantum-inspired genetic algorithm (QGA) for extractive single-document summarization. The QGA is used inside a fully automated system as an optimizer that searches for the best combination of sentences to include in the final summary. The presented approach is compared with 11 reference methods, including supervised and unsupervised summarization techniques. The authors evaluate the performance of the proposed approach on the DUC 2001 and DUC 2002 datasets using the ROUGE-1 and ROUGE-2 evaluation metrics. The results show that the proposal can compete with other state-of-the-art methods: it is ranked first out of 12, outperforming all other algorithms.
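For intuition, here is a minimal Python sketch of how a quantum-inspired genetic algorithm can drive sentence selection: each sentence gets a qubit-like angle whose squared sine is its selection probability, a "measurement" samples a candidate summary, and a rotation step pulls the angles toward the best candidate found so far. The fitness function, rotation step, and parameters are placeholders, not the paper's settings.

```python
import numpy as np

# Sketch of a quantum-inspired genetic algorithm (QGA) for extractive
# summarization. The fitness function and all parameters are placeholders.
rng = np.random.default_rng(0)

def qga_select(scores, n_iters=200, delta=0.05):
    """scores[i]: relevance of sentence i (placeholder fitness = score sum)."""
    theta = np.full(len(scores), np.pi / 4)          # equal superposition
    best_mask, best_fit = None, -np.inf
    for _ in range(n_iters):
        probs = np.sin(theta) ** 2                   # P(select sentence i)
        mask = rng.random(len(scores)) < probs       # "measurement"
        fit = scores[mask].sum() - 0.1 * mask.sum()  # reward relevance, penalize length
        if fit > best_fit:
            best_mask, best_fit = mask, fit
        # Rotate each qubit-like angle toward the best solution's bit value.
        theta += delta * np.where(best_mask, 1, -1)
        theta = np.clip(theta, 0.01, np.pi / 2 - 0.01)
    return best_mask

summary_mask = qga_select(np.array([0.9, 0.2, 0.7, 0.1]))
```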


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4595 ◽  
Author(s):  
Clara Gomez ◽  
Alejandra C. Hernandez ◽  
Ramon Barber

Exploration of unknown environments is a fundamental problem in autonomous robotics that deals with the complexity of autonomously traversing an unknown area while acquiring the most important information about the environment. In this work, a mobile robot exploration algorithm for indoor environments is proposed. It combines frontier-based concepts with behavior-based strategies to build a topological representation of the environment. Frontier-based approaches assume that, to gain the most information about an environment, the robot has to move to the regions on the boundary between open space and unexplored space. The novelty of this work lies in the semantic classification of frontiers and in frontier selection according to a cost–utility function. In addition, a probabilistic loop-closure algorithm is proposed to handle cyclic situations. The system outputs a topological map of the free areas of the environment for further navigation. Finally, simulated and real-world experiments have been carried out; their results, together with comparisons against other state-of-the-art algorithms, show the feasibility of the proposed exploration algorithm and the improvement it offers in execution time and travelled distance.
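As a sketch of the frontier-detection and cost–utility machinery (with generic placeholder terms; the paper's function additionally weighs the semantic class of each frontier), consider the following Python fragment operating on an occupancy grid.

```python
import numpy as np

# Sketch of frontier detection and cost-utility selection on an occupancy
# grid (0 = free, 1 = occupied, -1 = unknown). The utility and cost terms
# are generic placeholders, not the paper's semantic cost-utility function.

def find_frontiers(grid):
    """Free cells adjacent to at least one unknown cell."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != 0:
                continue
            neighbors = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if (neighbors == -1).any():
                frontiers.append((r, c))
    return frontiers

def best_frontier(frontiers, robot, grid):
    """Pick the frontier maximizing utility (unknown area nearby) minus cost."""
    def score(f):
        r, c = f
        window = grid[max(r - 3, 0):r + 4, max(c - 3, 0):c + 4]
        utility = (window == -1).sum()                 # expected information gain
        cost = abs(r - robot[0]) + abs(c - robot[1])   # Manhattan travel cost
        return utility - 0.5 * cost
    return max(frontiers, key=score)
```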


2020 ◽  
Vol 16 (1) ◽  
pp. 24-32
Author(s):  
Rubyantara Jalu Permana ◽  
Sonny Eli Zaluchu

The literal differences found in the text of Exodus 34 verses 1 and 28 can trigger accusations that the Bible is inconsistent. In the Christian view, however, the Bible is a book that can neither be wrong nor err. Evangelical Christian belief asserts that the Bible contains God's word and is God's word itself. If there are differences and inconsistencies in the Bible, does that indicate that the truth of the Christian scriptures has low credibility? This study aims to answer that question through a hermeneutic and theological analysis of the differences between the texts of Exodus 34 verses 1 and 28, concerning who actually wrote the two new tablets: God, as stated in verse 1, or Moses, as read in verse 28. In addition to textual analysis, the author also uses a source-critical approach and theological concepts. The analysis shows that verse 28 in fact legitimizes verse 1: God himself wrote the law. This perspective also confirms that the search for the meaning of texts in context does not merely involve a grammatical approach.


2021 ◽  
Vol 47 (1) ◽  
pp. 17-56
Author(s):  
Marcus Galdia

Abstract This essay is a survey of the methods applied and topics scrutinized in legal-linguistic studies. It starts with an elucidation of the epistemic interest that led to the emergence, and the subsequent expansion, of the mainstream legal-linguistic knowledge at our disposal today. The essay thus focuses on the development of problem awareness in the emerging legal-linguistic studies, as well as on the research results that might be perceived as the state of the art in mainstream legal linguistics. Some methodologically innovative tilts and twists that enrich and inspire contemporary legal linguistics are considered as well. Essentially, this essay traces the conceptual landscape in which the paradigms of legal-linguistic studies came about. This landscape extends from research into the isolated words of law and the style used by jurists to the scrutiny of legal texts and legal discourses in all their socio-linguistic complexity. Within this broad frame of reference, many achievements in legal-linguistic studies are mentioned in order to sketch the consequences of the processes in which legal-linguistic paradigms take shape. The author concludes with a vision of legal linguistics, called pragmatic legal linguistics, as the newest stage in the intellectual enterprise that aims to pierce the language of the law and, by doing so, to understand law better.


Author(s):  
Xiang Lisa Li ◽  
Jason Eisner

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.
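A minimal PyTorch sketch of the continuous VIB variant: a Gaussian encoder compresses an embedding x into a low-dimensional stochastic code z, and training adds beta times the KL divergence from a standard normal prior to the task loss. Dimensions and sampling details are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sketch of the continuous variational information bottleneck (VIB) idea:
# compress a word embedding x into a low-dimensional stochastic code z,
# trained with (task loss) + beta * KL(q(z|x) || N(0, I)). Sizes and beta
# are illustrative placeholders.

class VIBCompressor(nn.Module):
    def __init__(self, in_dim=1024, z_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        # KL divergence between q(z|x) and the standard normal prior.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl

vib = VIBCompressor()
x = torch.randn(32, 1024)   # batch of contextual word embeddings
z, kl = vib(x)              # z feeds the parser; kl joins the training loss
```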


Mathematics ◽  
2021 ◽  
Vol 9 (19) ◽  
pp. 2502
Author(s):  
Natalia Vanetik ◽  
Marina Litvak

Definitions are extremely important for the efficient learning of new material. In particular, mathematical definitions are necessary for understanding mathematics-related areas. Automated extraction of definitions could be very useful for automatically indexing educational materials, building taxonomies of relevant concepts, and more. For definitions that are contained within a single sentence, this problem can be viewed as binary classification of sentences into definitions and non-definitions. In this paper, we focus on the automatic detection of one-sentence definitions in mathematical and general texts. We experiment with different classification models, arranged in an ensemble and applied to a sentence representation containing syntactic and semantic information. Our ensemble model is applied to data balanced with oversampling. Our experiments demonstrate the superiority of our approach over state-of-the-art methods in both the general and mathematical domains.
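A minimal scikit-learn sketch of such a pipeline, with TF-IDF standing in for the paper's richer syntactic/semantic sentence representation; the ensemble members and the naive oversampling routine are our placeholders.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Sketch of one-sentence definition detection as binary classification with
# an ensemble and oversampling. TF-IDF stands in for the paper's richer
# sentence representation; members and settings are illustrative.

sentences = ["A prime is an integer greater than 1 with no proper divisors.",
             "We ran all experiments on the training split."]
labels = [1, 0]  # 1 = definition, 0 = non-definition

def oversample(X, y):
    """Naive oversampling: re-draw minority-class examples until balanced."""
    pos = [(x, l) for x, l in zip(X, y) if l == 1]
    neg = [(x, l) for x, l in zip(X, y) if l == 0]
    minority, majority = sorted([pos, neg], key=len)
    minority = resample(minority, n_samples=len(majority), random_state=0)
    X2, y2 = zip(*(majority + minority))
    return list(X2), list(y2)

X, y = oversample(sentences, labels)
ensemble = VotingClassifier([("lr", LogisticRegression()),
                             ("svm", LinearSVC()),
                             ("dt", DecisionTreeClassifier())])
model = make_pipeline(TfidfVectorizer(), ensemble).fit(X, y)
```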


2021 ◽  
Vol 29 (2) ◽  
pp. 859
Author(s):  
Márcio De Souza Dias ◽  
Ariani Di Felippo ◽  
Amanda Pontes Rassi ◽  
Paula Cristina Figueira Cardoso ◽  
Fernando Antônio Asevedo Nóbrega ◽  
...  

Abstract: Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigate the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To this end, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on this typology, which showed that some linguistic problems are significantly more recurrent than others. This corpus annotation may thus support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material) but also linguistically well structured.

Keywords: automatic summarization; multi-document summary; linguistic problem; corpus annotation.

