unstructured text: Recently Published Documents

TOTAL DOCUMENTS: 335 (five years: 137)
H-INDEX: 18 (five years: 4)

2022, Vol 40 (1), pp. 1-44
Author(s): Longxuan Ma, Mingda Li, Wei-Nan Zhang, Jiapeng Li, Ting Liu

Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), for example by generating informative or stylized responses or controlling conversation topics. In this article, we study open-domain DSs that use unstructured text as an external knowledge source (Unstructured Text Enhanced Dialogue Systems, UTEDS). The existence of unstructured text entails distinctions between UTEDS and traditional data-driven DSs, and we aim to analyze these differences. We first define the concepts related to UTEDS, then summarize the recently released datasets and models. We categorize UTEDS into Retrieval and Generative models and introduce them from the perspective of model components. The retrieval models consist of Fusion, Matching, and Ranking modules, while the generative models comprise Dialogue and Knowledge Encoding, Knowledge Selection (KS), and Response Generation modules. We further summarize the evaluation methods utilized in UTEDS and analyze the current models' performance. Finally, we discuss the future development trends of UTEDS, hoping to inspire new research in this field.


Author(s): Runumi Devi, Deepti Mehrotra, Sana Ben Abdallah Ben Lamine

Electronic Health Record (EHR) systems in healthcare organisations are primarily maintained in isolation from each other, which makes interoperability of the unstructured (text) data stored in these systems challenging in the healthcare domain. Similar information may be described using different terminologies by different applications; this can be avoided by transforming the content into the Resource Description Framework (RDF) model, which is interoperable amongst organisations. RDF requires a document's contents to be translated into a repository of triples (subject, predicate, object) known as RDF statements. Natural Language Processing (NLP) techniques can help derive actionable insights from these text data and create triples for RDF model generation. This paper discusses two NLP-based approaches to generating RDF models from unstructured patient documents: one based on a dependency-structure parser and one based on a constituent (phrase) structure parser. Models generated by both approaches are evaluated in two respects: the exhaustiveness of the represented knowledge and the model generation time. The precision measure is used to compute the models' exhaustiveness in terms of the number of facts that are transformed into RDF representations.
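As a concrete illustration of the target representation, the sketch below serializes hand-written (subject, predicate, object) facts as RDF N-Triples. The namespace URI, the helper function, and the example facts are hypothetical stand-ins for the parser-extracted output described in the abstract, not the paper's actual pipeline.

```python
EX = "http://example.org/ehr/"  # hypothetical namespace, not from the paper

def to_ntriple(subject, predicate, obj):
    """Render one extracted fact as an N-Triples statement with a literal object."""
    return f'<{EX}{subject}> <{EX}{predicate}> "{obj}" .'

# Facts a dependency- or constituent-structure parser might have extracted
facts = [
    ("patient42", "hasDiagnosis", "type 2 diabetes"),
    ("patient42", "prescribed", "metformin"),
]

model = "\n".join(to_ntriple(s, p, o) for s, p, o in facts)
print(model)
```

In a real system the literal objects would often be typed or replaced by entity URIs; the point here is only the triple-per-statement shape that RDF requires.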


2021, Vol 17, pp. 1201-1209
Author(s): Frederick F. Patacsil, Jennifer M. Parrone, Christine Lourrine Tablatin, Michael Acosta

Cyberbullying has become one of the major threats in our society today because of the massive damage it can cause, not only to the cyber world and internet-based business but also to the lives of many people. The sole purpose of cyberbullying is to hurt and humiliate someone by posting and sending threats online. However, recognizing cyberbullying has proved to be a hard and challenging task for information technologists. The main objective of this study is to analyze and decode the ambiguity of the human language used in cyberbullying Lesbian, Gay, Bisexual, Transgender and Queer or Questioning (LGBTQ) victims, and to detect patterns and trends in the results to produce meaning and knowledge. This study utilizes an unsupervised associative text-analysis technique to extract the relevant information from the unstructured text of cyberbullying messages. Furthermore, cyberbullying incidence patterns are analyzed by recognizing the relationships and meaning between cyberbullying keywords and other words to generate knowledge discovery. "Fuck" and "shit" account for almost half of all cyberbullying words and appear in more than 75% of the dataset, making them the most frequently used words. Further, the combinations "shit" + "hate" + "fuck" and "shit" + "stupid", both with positive lift values, showed the highest chance of togetherness, that is, of being used together to cyberbully. These word combinations were considered abusive, since swearing is always considered rude when it is used to intimidate or humiliate someone. The output and results of this study will contribute to formulating future interventions to combat cyberbullying. Furthermore, the results can be utilized as a model in the development of a cyberbullying detection application based on the text relations / associations of words in comments, replies, blog discussions, and discussion groups across social networks.
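For readers unfamiliar with the lift measure behind the "togetherness" results, here is a minimal, self-contained illustration; the message set and word pair are invented, and this is a sketch of the measure itself, not the study's actual pipeline.

```python
def lift(messages, w1, w2):
    """Lift of two words: P(both) / (P(w1) * P(w2)) over a set of messages."""
    n = len(messages)
    p1 = sum(w1 in m for m in messages) / n
    p2 = sum(w2 in m for m in messages) / n
    p12 = sum(w1 in m and w2 in m for m in messages) / n
    return p12 / (p1 * p2)

# Toy messages as sets of words (invented for demonstration)
msgs = [
    {"shit", "hate"},
    {"shit", "hate"},
    {"shit", "stupid"},
    {"nice", "day"},
]

# lift > 1 means the pair co-occurs more often than independence predicts
print(round(lift(msgs, "shit", "hate"), 2))
```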


2021, Vol 3, pp. 4
Author(s): Tai-Danae Bradley, Yiannis Vlassopoulos

This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance but also, quite crucially, because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: what mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: how can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.
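A hedged sketch of what "enriched over probabilities" can mean, in notation chosen here rather than taken verbatim from the paper: take the unit interval with multiplication as the enriching base, and let the hom-object record, informally, the probability that one text extends to another.

```latex
% Sketch (our notation): enrich over the monoidal poset ([0,1], \times, 1).
% Objects are texts x, y, ...; the hom-object \mathcal{C}(x,y) \in [0,1]
% is, informally, the probability that x extends to y. Enriched
% composition and identities then reduce to the inequalities
\[
  \mathcal{C}(x,y)\cdot\mathcal{C}(y,z) \;\le\; \mathcal{C}(x,z),
  \qquad
  1 \;\le\; \mathcal{C}(x,x).
\]
```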


2021
Author(s): Yung-Chun Chang, Yu-Wen Chiu, Ting-Wu Chuang

BACKGROUND: Globalization and environmental changes have increased the emergence and re-emergence of infectious diseases worldwide. The collaboration of regional infectious disease surveillance systems is critical but difficult to achieve because of the different transparency levels of health information sharing systems among countries. ProMED-mail is the most comprehensive expert-curated platform providing rich outbreak information on humans, animals, and plants from different countries. However, owing to the unstructured text content of its reports, they are difficult to analyze for further applications. We therefore set out to automatically summarize the alerting articles from ProMED-mail. In this research, we propose a text summarization method that uses natural language processing to automatically extract important sentences from alert articles in ProMED emails and generate summaries of dengue outbreaks in Southeast Asia. Our method can be used to capture crucial information quickly and support decisions for epidemic surveillance.

OBJECTIVE: To generate automatic summaries of the unstructured text content of reports.

METHODS: Our materials come from the ProMED-mail website and span the period from 1994 to 2019. The collected data were annotated by professionals to establish a unique Taiwan dengue corpus, achieving almost perfect inter-annotator agreement (Cohen's kappa of 0.90). To generate ProMED-mail summaries, we developed a dual-channel bidirectional long short-term memory (BiLSTM) model with an attention mechanism that infuses latent syntactic features to identify crucial sentences in the alerting articles.

RESULTS: Our method is superior to many well-known machine learning and neural network approaches in identifying important sentences, achieving a macro-average F1-score of 93%. Moreover, the method successfully extracts key information about dengue fever outbreaks in ProMED-mail and helps researchers and public health practitioners capture important summaries quickly. Besides verifying the model, we also recruited five professional experts and five students from related fields for a satisfaction survey on the generated summaries. The results showed that 83.6% of the summaries received high satisfaction ratings.

CONCLUSIONS: The proposed approach successfully fuses latent syntactic features into a deep neural network to analyze syntactic, semantic, and content information in the text, then exploits the derived information to identify the crucial sentences in ProMED-mail. The experimental results show that the proposed method is effective and outperforms the comparison approaches. In addition, our method demonstrates the potential of summary generation from ProMED-mail: when a new alerting article arrives, public health decision makers can quickly identify the outbreak information in a lengthy article and deliver immediate responses for disease control and prevention.

CLINICALTRIAL: NA
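The extractive-summarization idea can be illustrated with a toy, pure-Python sketch: per-sentence relevance scores (which the real model derives from the dual-channel BiLSTM with attention) are normalized with a softmax, and the top-weighted sentences form the summary. The sentences and scores below are invented.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

sentences = ["Dengue cases rose sharply in June.",
             "The weather was humid.",
             "Officials confirmed 120 new infections."]
scores = [2.1, 0.3, 1.8]  # hypothetical model outputs, not real predictions

weights = softmax(scores)
# Keep the two highest-weighted sentences as the extractive summary
summary = [s for s, w in sorted(zip(sentences, weights),
                                key=lambda p: -p[1])[:2]]
print(summary)
```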


2021
Author(s): Susanne Brogaard Krogh, Tue Secher Jensen, Nanna Rolving, Malene Laursen, Janus Nikolaj Laust Thomsen, ...

Abstract
Background: A number of papers highlight the extent to which low back pain (LBP) is generally mismanaged, especially regarding overuse of magnetic resonance imaging (MRI). International guidelines do not recommend routine imaging, including MRI, and seek to guide clinicians to refer for imaging only on specific indications. Despite this, several studies show an increase in the use of MRI among patients with LBP and an imbalance between appropriate and inappropriate use of MRI for LBP. This study aimed to investigate to what extent referrals from general practice for lumbar MRI complied with clinical guideline recommendations in a Danish setting.

Materials and methods: All referrals for lumbar MRI from general practitioners in the Central Denmark Region for diagnostic imaging at a public regional hospital from 2014 to 2018 were included. A modified version of the American College of Radiology (ACR) Imaging Appropriateness Criteria for LBP was used to classify referrals as appropriate or inappropriate, based on the unstructured text of the GPs' referrals. Appropriate referrals included fractures, cancer, symptoms persisting after more than 6 weeks of non-surgical treatment, previous surgery, candidacy for surgery, or suspicion of cauda equina. Inappropriate referrals were sub-classified as lacking information about previous non-surgical treatment or about its duration.

Results: Of the 3,772 retrieved referrals for MRI of the lumbar spine, 55% were selected and a total of 2,051 referrals were categorised. Approximately one quarter (24.5%) were categorised as appropriate and 75.5% as inappropriate. Of the inappropriate referrals, 51% lacked information about previous non-surgical treatment and 49% had no information about the duration of non-surgical treatment. Apart from minor yearly fluctuations, there was no change in the distribution of appropriate and inappropriate MRI referrals from 2014 to 2018.

Conclusion: The majority (75.5%) of lumbar MRI referrals from general practitioners did not fulfil the ACR Imaging Appropriateness Criteria for LBP based on the unstructured text of the referrals. Referrers need to include all guideline-relevant information in referrals for imaging. More research is needed to determine whether this is due to patients not fulfilling guideline recommendations or simply to the content of the referrals.


2021, Vol 10 (10), pp. 710
Author(s): Erum Haris, Keng Hoon Gan

Travel blogs are a significant source for modeling human travel behavior and characterizing tourist destinations, owing to their rich geospatial and thematic content. However, the bulk of unstructured text requires extensive processing for an efficient transformation of data into knowledge. Existing works have studied tourist places, but their results lack a coherent outline and visualization of the semantic knowledge associated with tourist attractions. Hence, this work proposes place-semantics extraction based on a fusion of content analysis and natural language processing (NLP) techniques. A weighted-sum equation model is then employed to construct a points-of-interest graph (POI graph) that integrates the extracted semantics with conventional frequency-based weighting of tourist spots and routes. The framework distills and visualizes massive blog text in a comprehensible manner, helping individuals make travel decisions and enabling tourism managers to devise effective destination planning and management strategies.
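A minimal sketch of what a weighted-sum edge score for such a POI graph could look like: a convex combination of a conventional frequency weight and a semantic weight extracted from blog text. The mixing weight alpha and the input scores are illustrative assumptions, not values from the paper.

```python
def edge_weight(freq, semantic, alpha=0.6):
    """Convex combination: alpha * frequency weight + (1 - alpha) * semantic weight.

    Both inputs are assumed normalized to [0, 1]; alpha = 0.6 is an
    arbitrary illustrative choice."""
    return alpha * freq + (1 - alpha) * semantic

# Hypothetical route "Old Town -> Harbor": visited often, moderately
# praised in blog text
print(round(edge_weight(freq=0.9, semantic=0.5), 2))
```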


2021, Vol 32 (4), pp. 48-64
Author(s): Chenyang Bu, Xingchen Yu, Yan Hong, Tingting Jiang

The automatic construction of knowledge graphs (KGs) from multiple data sources has received increasing attention. The automatic construction process inevitably introduces considerable noise, especially when constructing KGs from unstructured text. The noise in a KG can be divided into two categories: factual noise and low-quality noise. Factual noise refers to factually incorrect yet plausible triples that meet the requirements of ontology constraints. For example, the plausible triple <New_York, IsCapitalOf, America> satisfies the constraints that the head entity "New_York" is a city and the tail entity "America" is a country, yet it is factually wrong. Low-quality noise denotes the obvious errors commonly created by information extraction processes; this study focuses on entity type errors. Most existing approaches concentrate on refining an existing KG, assuming that the type information of most entities or the ontology information of the KG is known in advance. However, such methods may not be suitable at the start of a KG's construction. Therefore, the authors propose an effective framework for eliminating entity type errors. The experimental results demonstrate the effectiveness of the proposed method.
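A toy illustration of the distinction between the two noise categories, assuming a simple relation type signature. The type table and helper are invented in the spirit of the <New_York, IsCapitalOf, America> example, not the authors' framework.

```python
# Hypothetical entity-type table and relation signature
entity_types = {"New_York": "city", "America": "country", "Hudson": "river"}

# relation -> (required head type, required tail type)
signatures = {"IsCapitalOf": ("city", "country")}

def satisfies_signature(head, relation, tail):
    """True iff the triple's head and tail match the relation's type signature."""
    h_req, t_req = signatures[relation]
    return entity_types.get(head) == h_req and entity_types.get(tail) == t_req

# Plausible triple: passes the ontology constraint, yet is factually
# wrong -- "factual noise" that a type check alone cannot catch.
print(satisfies_signature("New_York", "IsCapitalOf", "America"))  # True
# Entity type error ("low-quality noise"): a river cannot head IsCapitalOf.
print(satisfies_signature("Hudson", "IsCapitalOf", "America"))    # False
```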


2021, Vol 8 (1)
Author(s): Edwin Camilleri, Shah Jahan Miah

Abstract: In this research, various concepts from network theory and topic modelling are combined to produce a temporal network of associated topics. The solution is presented as a step-by-step process that facilitates the evaluation of latent topics in unstructured text, as well as of the domain area the textual documents are sourced from. In addition to making shifts and changes in the structural properties of a given corpus visible, non-stationary classes of co-occurring topics are determined, and trends in topic prevalence, positioning, and association patterns are evaluated over time. These capabilities extend the insights fostered by stand-alone topic modelling outputs by ensuring that latent topics are not only identified and summarized, but also more systematically interpreted, analysed, and explained, in a transparent and reliable way.
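One way such a temporal network construction could be sketched, under invented topic labels and documents: count topic pairs that co-occur in a document, keeping a separate edge list per time slice so structural shifts between slices become visible.

```python
from collections import Counter
from itertools import combinations

# (year, topics-per-document) pairs; labels invented for illustration
docs = [
    (2020, ["economy", "health"]),
    (2020, ["health", "policy"]),
    (2021, ["economy", "policy"]),
    (2021, ["economy", "policy"]),
]

networks = {}  # year -> Counter mapping topic pairs to edge weights
for year, topics in docs:
    edges = networks.setdefault(year, Counter())
    for pair in combinations(sorted(topics), 2):
        edges[pair] += 1

# Edge ("economy", "policy") strengthens in the 2021 slice
print(networks[2021][("economy", "policy")])  # 2
```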

