Improving Topic Coherence Using Entity Extraction Denoising

2018 ◽  
Vol 110 (1) ◽  
pp. 85-101 ◽  
Author(s):  
Ronald Cardenas ◽  
Kevin Bello ◽  
Alberto Coronado ◽  
Elizabeth Villota

Abstract Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method, and the evaluation of this model is an interesting problem in its own right. Topic interpretability measures have been developed in recent years as a more natural option for topic quality evaluation, emulating human perception of coherence with word-set correlation scores. In this paper, we show experimental evidence that topic coherence scores improve when the training corpus is restricted to the relevant information in each document, obtained by entity recognition. We experiment with job advertisement data and find that with this approach topic models improve interpretability by about 40 percentage points on average. Our analysis also reveals that, using the extracted text chunks, some redundant topics are merged while others are split into more skill-specific topics. Fine-grained topics observed in models trained on the whole text are preserved.
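The word-set correlation scores mentioned above are typically pairwise NPMI averages over a topic's top words, with co-occurrence counted at the document level. A minimal self-contained sketch of that measure (a toy illustration, not the authors' exact experimental setup):

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average pairwise NPMI over a topic's top words.

    Probabilities are document-level co-occurrence frequencies, as in
    standard topic-interpretability measures. Scores range from -1
    (words never co-occur) to +1 (words always co-occur).
    """
    docsets = [set(d) for d in documents]
    n = len(docsets)

    def p(*words):
        return sum(all(w in ds for w in words) for ds in docsets) / n

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0:
            scores.append(-1.0)  # pair never co-occurs: minimum NPMI
            continue
        pmi = math.log(p_ij / (p(wi) * p(wj)))
        scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)
```

Restricting the corpus to entity chunks, as the paper does, raises this score when the retained words co-occur more consistently across documents.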

Energies ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 1497
Author(s):  
Chankook Park ◽  
Minkyu Kim

It is important to examine in detail how the distribution of academic research topics related to renewable energy is structured, and which topics are likely to receive new attention in the future, so that scientists can contribute to the development of renewable energy. This study uses advanced probabilistic topic modeling to statistically examine the temporal changes of renewable energy topics in academic abstracts from 2010–2019, and explores the properties of the topics from the perspective of future signs such as weak signals. Among strong signals, methods for optimally integrating renewable energy into the power grid receive great attention. Among weak signals, interest in large-capacity energy storage systems such as hydrogen, supercapacitors, and compressed air energy storage shows a high rate of increase. Not-strong-but-well-known signals include comprehensive topics such as renewable energy potential, barriers, and policies. The approach of this study is applicable not only to renewable energy but also to other subjects.
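The strong/weak/well-known signal categories above can be read as quadrants of a topic's frequency versus its growth rate. A minimal sketch of that quadrant scheme, a common futures-studies convention (the thresholds and labels here are assumptions, not the paper's exact method):

```python
def classify_signal(freq, growth, freq_median, growth_median):
    """Place a topic on a signal map by its yearly frequency and growth rate.

    High growth + high frequency  -> strong signal
    High growth + low frequency   -> weak signal (emerging)
    Low growth  + high frequency  -> well-known signal
    Low growth  + low frequency   -> latent signal
    """
    if growth > growth_median:
        return "strong signal" if freq > freq_median else "weak signal"
    return "well-known signal" if freq > freq_median else "latent signal"
```

Under this reading, hydrogen storage would sit in the low-frequency, high-growth quadrant, matching the paper's "weak signal" finding.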


2014 ◽  
Vol 40 (2) ◽  
pp. 469-510 ◽  
Author(s):  
Khaled Shaalan

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make NER a challenge. Improving an Arabic NER component directly improves the overall performance of the NLP system that embeds it. This article describes the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, a review of the state of the art of Arabic NER research is discussed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.
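The standard evaluation metrics mentioned above are entity-level precision, recall, and F1, where an entity counts only on an exact (type, span) match, the usual CoNLL-style convention. A minimal sketch:

```python
def entity_f1(gold, predicted):
    """Strict entity-level F1: an entity is correct only if both its
    type and its exact text span match a gold entity."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This strict matching is what makes Arabic NER evaluation unforgiving: a boundary error caused by an attached clitic counts as both a false positive and a false negative.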


2021 ◽  
Vol 2021 (2) ◽  
pp. 88-110
Author(s):  
Duc Bui ◽  
Kang G. Shin ◽  
Jong-Min Choi ◽  
Junbum Shin

Abstract Privacy policies are documents, required by law and regulation, that notify users of the collection, use, and sharing of their personal information by services or applications. While the extraction of personal data objects and the practices applied to them is one of the fundamental steps in automated policy analysis, it remains challenging due to the complex policy statements written in vague legal language. Prior work is limited by small or generated datasets and manually created rules. We formulate the extraction of fine-grained personal data phrases and the corresponding data collection or sharing practices as a sequence-labeling problem that can be solved by an entity-recognition model. We create a large dataset with 4.1k sentences (97k tokens) and 2.6k annotated fine-grained data practices from 30 real-world privacy policies to train and evaluate neural networks. We present a fully automated system, called PI-Extract, which accurately extracts privacy practices with a neural model and outperforms strong rule-based baselines by a large margin. We conduct a user study on the effects of data practice annotation, which highlights and describes the data practices extracted by PI-Extract to help users better understand privacy-policy documents. Our experimental evaluation shows that the annotation significantly improves users' reading comprehension of policy texts, as indicated by a 26.6% increase in the average total reading score.
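Framing the task as sequence labeling means each token gets a BIO tag and contiguous tagged runs are decoded back into practice phrases. A minimal decoder sketch (the tag names such as `COLLECT` are illustrative, not PI-Extract's actual label set):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags (e.g. B-COLLECT / I-COLLECT / O) into
    (practice_label, phrase) pairs."""
    spans = []
    cur_toks, cur_label = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_toks:  # close the previous entity
                spans.append((cur_label, " ".join(cur_toks)))
            cur_toks, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_toks.append(tok)  # continue the current entity
        else:
            if cur_toks:
                spans.append((cur_label, " ".join(cur_toks)))
            cur_toks, cur_label = [], None
    if cur_toks:
        spans.append((cur_label, " ".join(cur_toks)))
    return spans
```

The neural model predicts the tag sequence; this decoding step is what turns token labels into the fine-grained data-practice phrases that get highlighted for users.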


2013 ◽  
Vol 48 (2) ◽  
pp. 307-343 ◽  
Author(s):  
Bart Desmet ◽  
Véronique Hoste

2018 ◽  
Vol 11 (9) ◽  
pp. 3781-3794 ◽  
Author(s):  
Joy Merwin Monteiro ◽  
Jeremy McGibbon ◽  
Rodrigo Caballero

Abstract. sympl (System for Modelling Planets) and climt (Climate Modelling and Diagnostics Toolkit) are an attempt to rethink climate modelling frameworks from the ground up. The aim is to use expressive data structures available in the scientific Python ecosystem, along with best practices in software design, to allow scientists to easily and reliably combine model components to represent the climate system at a desired level of complexity, and to enable users to fully understand what the model is doing. sympl is a framework which formulates the model in terms of a state that gets evolved forward in time, or modified within a specific time, by well-defined components. sympl's design facilitates building models that are self-documenting, are highly interoperable, and provide fine-grained control over model components and behaviour. sympl components contain all relevant information about the input they expect and the output they provide. Components are designed to be easily interchanged, even when they rely on different units or array configurations. sympl provides basic functions and objects which could be used in any type of Earth system model. climt is an Earth system modelling toolkit that contains scientific components built using sympl base objects. These include both pure Python components and wrapped Fortran libraries. climt provides functionality requiring model-specific assumptions, such as state initialization and grid configuration. climt's programming interface is designed to be easy to use and thus appealing to a wide audience. Model building, configuration and execution are performed through a Python script (or Jupyter Notebook), enabling researchers to build an end-to-end Python-based pipeline with popular Python data analysis and visualization tools.
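The state-plus-components pattern described above can be sketched in a few lines. This is a schematic illustration of the design idea only; the class and method names are hypothetical and do not reproduce sympl's real API:

```python
# Schematic of the pattern: components declare their inputs, return
# tendencies for a shared state dictionary, and a stepper combines them.
class RelaxationComponent:
    """Toy prognostic component: relaxes temperature toward zero.

    Like the paper's components, it declares what it reads from the
    state and returns time tendencies rather than mutating the state.
    """
    inputs = ("temperature",)

    def __call__(self, state):
        return {"temperature": -0.1 * state["temperature"]}

def step(state, components, dt):
    """Advance the state by one explicit-Euler step using the summed
    tendencies of all components; the state itself is never mutated."""
    tendencies = {}
    for comp in components:
        for name, tend in comp(state).items():
            tendencies[name] = tendencies.get(name, 0.0) + tend
    return {name: value + dt * tendencies.get(name, 0.0)
            for name, value in state.items()}
```

Keeping components pure (state in, tendencies out) is what makes them self-documenting and interchangeable, since the stepper needs no knowledge of any component's internals.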


2021 ◽  
Vol 48 (3) ◽  
pp. 231-247
Author(s):  
Xu Tan ◽  
Xiaoxi Luo ◽  
Xiaoguang Wang ◽  
Hongyu Wang ◽  
Xilong Hou

Digital images of cultural heritage (CH) contain rich semantic information. However, today's semantic representations of CH images fail to fully reveal the content entities and context within these vital surrogates. This paper draws on the fields of image research and digital humanities to propose a systematic methodology and a technical route for semantic enrichment of CH digital images. This new methodology systematically applies a series of procedures including semantic annotation, entity-based enrichment, establishing internal relations, event-centric enrichment, defining hierarchy relations between properties, text annotation, and finally named entity recognition, in order to provide fine-grained contextual semantic content disclosure. The feasibility and advantages of the proposed semantic enrichment methods are demonstrated via a visual display platform for CH digital images built to represent the Wutai Mountain Map, a typical Dunhuang mural. This study shows that semantic enrichment offers a promising new model for exposing content at a fine-grained level and establishing a rich semantic network centered on the content of CH digital images.
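The entity-based enrichment step amounts to linking each recognized entity to the image surrogate as a subject-predicate-object statement. A minimal sketch (the `depicts` predicate follows common linked-data practice; the function and vocabulary are illustrative, not the paper's platform):

```python
def entity_triples(image_id, entities):
    """Entity-based enrichment sketch: emit one (subject, predicate,
    object) statement per recognized entity, typing the object by its
    entity class so it can later be linked into a semantic network."""
    return [(image_id, "depicts", f"{etype}:{name}")
            for name, etype in entities]
```

Later steps (internal relations, event-centric enrichment) would add further predicates between the entity nodes created here.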


2019 ◽  
Vol 62 (12) ◽  
pp. 1849-1862
Author(s):  
San Ling ◽  
Khoa Nguyen ◽  
Huaxiong Wang ◽  
Juanyang Zhang

Abstract Efficient user revocation is a necessary but challenging problem in many multi-user cryptosystems. Among known approaches, server-aided revocation yields a promising solution, because it allows the major workloads of system users to be outsourced to a computationally powerful third party, called the server, whose only requirement is to carry out the computations correctly. Such a revocation mechanism was considered in the settings of identity-based encryption and attribute-based encryption by Qin et al. (2015, ESORICS) and Cui et al. (2016, ESORICS), respectively. In this work, we consider the server-aided revocation mechanism in the more elaborate setting of predicate encryption (PE). The latter, introduced by Katz et al. (2008, EUROCRYPT), provides fine-grained and role-based access to encrypted data and can be viewed as a generalization of identity-based and attribute-based encryption. Our contribution is two-fold. First, we formalize the model of server-aided revocable PE (SR-PE), with rigorous definitions and security notions. Our model can be seen as a non-trivial adaptation of Cui et al.'s work to the PE context. Second, we put forward a lattice-based instantiation of SR-PE. The scheme employs the PE scheme of Agrawal et al. (2011, ASIACRYPT) and the complete subtree method of Naor et al. (2001, CRYPTO) as its two main ingredients, which work smoothly together thanks to a few additional techniques. Our scheme is proven secure in the standard model (in a selective manner), based on the hardness of the learning with errors problem.
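The complete subtree method of Naor et al. places users at the leaves of a binary tree and, given a revoked set, covers the remaining users with the maximal subtrees containing no revoked leaf; key material is then published only for those subtree roots. A minimal combinatorial sketch, with leaf-index ranges standing in for tree-node labels (the cryptographic key assignment is omitted):

```python
def complete_subtree_cover(lo, hi, revoked):
    """Return the maximal full subtrees (as inclusive leaf ranges) of the
    binary tree over leaves lo..hi whose leaves are all non-revoked.

    Revoked users fall outside every covered subtree, so they cannot
    decrypt; the cover has O(r * log(n/r)) subtrees for r revocations.
    """
    if not any(lo <= r <= hi for r in revoked):
        return [(lo, hi)]          # whole subtree is clean: cover it
    if lo == hi:
        return []                  # single revoked leaf: cover nothing
    mid = (lo + hi) // 2
    return (complete_subtree_cover(lo, mid, revoked)
            + complete_subtree_cover(mid + 1, hi, revoked))
```

In the SR-PE setting, the server holds the per-subtree update material, so each non-revoked user's heavy work reduces to interacting with the server for the one subtree that covers them.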


Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 1986
Author(s):  
Liguo Yao ◽  
Haisong Huang ◽  
Kuan-Wei Wang ◽  
Shih-Huan Chen ◽  
Qiaoqiao Xiong

Manufacturing text often exists as unlabeled data; its entities are fine-grained and difficult to extract. These problems keep the knowledge utilization rate of the manufacturing industry low. This paper proposes a novel Chinese fine-grained NER (named entity recognition) method based on symmetry lightweight deep multinetwork collaboration (ALBERT-AttBiLSTM-CRF) and model transfer considering active learning (MTAL), to study fine-grained named entity recognition for Chinese text types with few labeled examples. The method is divided into two stages. In the first stage, ALBERT-AttBiLSTM-CRF was applied to the CLUENER2020 dataset (a public dataset) to obtain a pretrained model; the experiments show that the model obtains an F1 score of 0.8962, better than the best baseline algorithm by 9.2%. In the second stage, the pretrained model was transferred to the Manufacturing-NER dataset (our dataset), and we used an active learning strategy to optimize the model. The final F1 score on Manufacturing-NER was 0.8931 after model transfer (up from 0.8576 before), an improvement of 3.55%. Our method effectively transfers existing knowledge from public source data to scientific target data, addressing named entity recognition with scarce labeled domain data, and proves its effectiveness.
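The active-learning step in the second stage selects which unlabeled manufacturing sentences are worth annotating. A minimal least-confidence sketch (the scoring function is a generic illustration, not the paper's exact MTAL criterion):

```python
def select_for_labeling(confidence_by_sentence, k):
    """Least-confidence active learning: pick the k unlabeled sentences
    the model is least sure about, so annotation effort goes where the
    transferred model is weakest.

    confidence_by_sentence maps a sentence id to the model's confidence
    in its own prediction (e.g. probability of the best tag sequence).
    """
    ranked = sorted(confidence_by_sentence, key=confidence_by_sentence.get)
    return ranked[:k]
```

Iterating this select-annotate-retrain loop is what closes most of the gap between the 0.8576 zero-transfer score and the final 0.8931.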

