unstructured text: Recently Published Documents

TOTAL DOCUMENTS: 335 (five years: 137)
H-INDEX: 18 (five years: 4)

2022, Vol 40 (1), pp. 1-44
Author(s): Longxuan Ma, Mingda Li, Wei-Nan Zhang, Jiapeng Li, Ting Liu

Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), for example by generating informative or stylized responses or controlling conversation topics. In this article, we study open-domain DSs that use unstructured text as an external knowledge source (Unstructured Text Enhanced Dialogue Systems, UTEDS). The existence of unstructured text entails distinctions between UTEDS and traditional data-driven DSs, and we aim to analyze these differences. We first define the concepts related to UTEDS, then summarize the recently released datasets and models. We categorize UTEDS into Retrieval and Generative models and introduce them from the perspective of model components. The retrieval models consist of Fusion, Matching, and Ranking modules, while the generative models comprise Dialogue and Knowledge Encoding, Knowledge Selection (KS), and Response Generation modules. We further summarize the evaluation methods utilized in UTEDS and analyze the current models' performance. Finally, we discuss the future development trends of UTEDS, hoping to inspire new research in this field.


Author(s): Runumi Devi, Deepti Mehrotra, Sana Ben Abdallah Ben Lamine

Electronic Health Record (EHR) systems in healthcare organisations are primarily maintained in isolation from each other, which makes interoperability of the unstructured (text) data stored in these systems challenging in the healthcare domain. Similar information may be described using different terminologies by different applications; this can be avoided by transforming the content into the Resource Description Framework (RDF) model, which is interoperable amongst organisations. RDF requires a document's contents to be translated into a repository of triples (subject, predicate, object) known as RDF statements. Natural Language Processing (NLP) techniques can help derive actionable insights from these text data and create triples for RDF model generation. This paper discusses two NLP-based approaches to generating RDF models from unstructured patient documents: one based on a dependency-structure parser and one based on a constituent (phrase) structure parser. Models generated by both approaches are evaluated in two respects: the exhaustiveness of the represented knowledge and the model generation time. The precision measure is used to compute the models' exhaustiveness in terms of the number of facts that are transformed into RDF representations.
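As a concrete illustration of the target representation, the sketch below serializes hand-written (subject, predicate, object) facts as RDF N-Triples. The namespace URI, the helper function, and the example facts are hypothetical stand-ins for the parser-extracted output described in the abstract, not the paper's actual pipeline.

```python
EX = "http://example.org/ehr/"  # hypothetical namespace, not from the paper

def to_ntriple(subject, predicate, obj):
    """Render one extracted fact as an N-Triples statement with a literal object."""
    return f'<{EX}{subject}> <{EX}{predicate}> "{obj}" .'

# Facts a dependency- or constituent-structure parser might have extracted
facts = [
    ("patient42", "hasDiagnosis", "type 2 diabetes"),
    ("patient42", "prescribed", "metformin"),
]

model = "\n".join(to_ntriple(s, p, o) for s, p, o in facts)
print(model)
```

In a real system the literal objects would often be typed or replaced by entity URIs; the point here is only the triple-per-statement shape that RDF requires.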


2021, Vol 17, pp. 1201-1209
Author(s): Frederick F. Patacsil, Jennifer M. Parrone, Christine Lourrine Tablatin, Michael Acosta

Cyberbullying has become one of the major threats in our society today because of the massive damage it can cause, not only to the cyber world and internet-based business but also to the lives of many people. The sole purpose of cyberbullying is to hurt and humiliate someone by posting and sending threats online. However, recognizing cyberbullying has proved to be a hard and challenging task for information technologists. The main objective of this study is to analyze and decode the ambiguity of the human language used in cyberbullying Lesbian, Gay, Bisexual, Transgender and Queer or Questioning (LGBTQ) victims, and to detect patterns and trends in the results to produce meaning and knowledge. This study utilizes an unsupervised associative text-analysis technique to extract the relevant information from the unstructured text of cyberbullying messages. Furthermore, cyberbullying incidence patterns are analyzed by recognizing the relationships and meaning between cyberbullying keywords and other words to generate knowledge discovery. "Fuck" and "shit" account for almost half of all cyberbullying words and appear in more than 75% of the dataset, making them the most frequently used words. Further, the combinations "shit" + "hate" + "fuck" and "shit" + "stupid", both with positive lift values, showed the highest chance of togetherness, that is, of being used together to cyberbully. These word combinations were considered abusive, since swearing is always considered rude when it is used to intimidate or humiliate someone. The output and results of this study will contribute to formulating future interventions to combat cyberbullying. Furthermore, the results can be utilized as a model in the development of a cyberbullying detection application based on the text relations / associations of words in comments, replies, blog discussions, and discussion groups across social networks.
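For readers unfamiliar with the lift measure behind the "togetherness" results, here is a minimal, self-contained illustration; the message set and word pair are invented, and this is a sketch of the measure itself, not the study's actual pipeline.

```python
def lift(messages, w1, w2):
    """Lift of two words: P(both) / (P(w1) * P(w2)) over a set of messages."""
    n = len(messages)
    p1 = sum(w1 in m for m in messages) / n
    p2 = sum(w2 in m for m in messages) / n
    p12 = sum(w1 in m and w2 in m for m in messages) / n
    return p12 / (p1 * p2)

# Toy messages as sets of words (invented for demonstration)
msgs = [
    {"shit", "hate"},
    {"shit", "hate"},
    {"shit", "stupid"},
    {"nice", "day"},
]

# lift > 1 means the pair co-occurs more often than independence predicts
print(round(lift(msgs, "shit", "hate"), 2))
```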


2021, Vol 3, pp. 4
Author(s): Tai-Danae Bradley, Yiannis Vlassopoulos

This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance but also, quite crucially, because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: what mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: how can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.
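A hedged sketch of what "enriched over probabilities" can mean, in notation chosen here rather than taken verbatim from the paper: take the unit interval with multiplication as the enriching base, and let the hom-object record, informally, the probability that one text extends to another.

```latex
% Sketch (our notation): enrich over the monoidal poset ([0,1], \times, 1).
% Objects are texts x, y, ...; the hom-object \mathcal{C}(x,y) \in [0,1]
% is, informally, the probability that x extends to y. Enriched
% composition and identities then reduce to the inequalities
\[
  \mathcal{C}(x,y)\cdot\mathcal{C}(y,z) \;\le\; \mathcal{C}(x,z),
  \qquad
  1 \;\le\; \mathcal{C}(x,x).
\]
```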


2021
Author(s): Yung-Chun Chang, Yu-Wen Chiu, Ting-Wu Chuang

BACKGROUND: Globalization and environmental changes have increased the emergence and re-emergence of infectious diseases worldwide. The collaboration of regional infectious disease surveillance systems is critical but difficult to achieve because of the different transparency levels of health information sharing systems among countries. ProMED-mail is the most comprehensive expert-curated platform providing rich outbreak information on humans, animals, and plants from different countries. However, owing to the unstructured text content of its reports, they are difficult to analyze for further applications. We therefore set out to automatically summarize the alerting articles from ProMED-mail. In this research, we propose a text summarization method that uses natural language processing to automatically extract important sentences from alert articles in ProMED emails and generate summaries of dengue outbreaks in Southeast Asia. Our method can be used to capture crucial information quickly and support decisions for epidemic surveillance.

OBJECTIVE: To generate automatic summaries of the unstructured text content of reports.

METHODS: Our materials come from the ProMED-mail website and span the period from 1994 to 2019. The collected data were annotated by professionals to establish a unique Taiwan dengue corpus, achieving almost perfect inter-annotator agreement (Cohen's kappa of 0.90). To generate ProMED-mail summaries, we developed a dual-channel bidirectional long short-term memory (BiLSTM) model with an attention mechanism that infuses latent syntactic features to identify crucial sentences in the alerting articles.

RESULTS: Our method is superior to many well-known machine learning and neural network approaches in identifying important sentences, achieving a macro-average F1-score of 93%. Moreover, the method successfully extracts key information about dengue fever outbreaks in ProMED-mail and helps researchers and public health practitioners capture important summaries quickly. Besides verifying the model, we also recruited five professional experts and five students from related fields for a satisfaction survey on the generated summaries. The results showed that 83.6% of the summaries received high satisfaction ratings.

CONCLUSIONS: The proposed approach successfully fuses latent syntactic features into a deep neural network to analyze syntactic, semantic, and content information in the text, then exploits the derived information to identify the crucial sentences in ProMED-mail. The experimental results show that the proposed method is effective and outperforms the comparison approaches. In addition, our method demonstrates the potential of summary generation from ProMED-mail: when a new alerting article arrives, public health decision makers can quickly identify the outbreak information in a lengthy article and deliver immediate responses for disease control and prevention.

CLINICALTRIAL: NA
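The extractive-summarization idea can be illustrated with a toy, pure-Python sketch: per-sentence relevance scores (which the real model derives from the dual-channel BiLSTM with attention) are normalized with a softmax, and the top-weighted sentences form the summary. The sentences and scores below are invented.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

sentences = ["Dengue cases rose sharply in June.",
             "The weather was humid.",
             "Officials confirmed 120 new infections."]
scores = [2.1, 0.3, 1.8]  # hypothetical model outputs, not real predictions

weights = softmax(scores)
# Keep the two highest-weighted sentences as the extractive summary
summary = [s for s, w in sorted(zip(sentences, weights),
                                key=lambda p: -p[1])[:2]]
print(summary)
```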


2021
Author(s): Susanne Brogaard Krogh, Tue Secher Jensen, Nanna Rolving, Malene Laursen, Janus Nikolaj Laust Thomsen, ...

Abstract
Background: A number of papers highlight the extent to which low back pain (LBP) is generally mismanaged, especially regarding overuse of magnetic resonance imaging (MRI). International guidelines do not recommend routine imaging, including MRI, and seek to guide clinicians to refer for imaging only on specific indications. Despite this, several studies show an increase in the use of MRI among patients with LBP and an imbalance between appropriate and inappropriate use of MRI for LBP. This study aimed to investigate to what extent referrals from general practice for lumbar MRI complied with clinical guideline recommendations in a Danish setting.

Materials and methods: All referrals for lumbar MRI from general practitioners in the Central Denmark Region for diagnostic imaging at a public regional hospital from 2014 to 2018 were included. A modified version of the American College of Radiology (ACR) Imaging Appropriateness Criteria for LBP was used to classify referrals as appropriate or inappropriate, based on the unstructured text of the GPs' referrals. Appropriate referrals included fractures, cancer, symptoms persisting after more than 6 weeks of non-surgical treatment, previous surgery, candidacy for surgery, or suspicion of cauda equina. Inappropriate referrals were sub-classified as lacking information about previous non-surgical treatment or about its duration.

Results: Of the 3,772 retrieved referrals for MRI of the lumbar spine, 55% were selected and a total of 2,051 referrals were categorised. Approximately one quarter (24.5%) were categorised as appropriate and 75.5% as inappropriate. Of the inappropriate referrals, 51% lacked information about previous non-surgical treatment and 49% had no information about the duration of non-surgical treatment. Apart from minor yearly fluctuations, there was no change in the distribution of appropriate and inappropriate MRI referrals from 2014 to 2018.

Conclusion: The majority (75.5%) of lumbar MRI referrals from general practitioners did not fulfil the ACR Imaging Appropriateness Criteria for LBP based on the unstructured text of the referrals. Referrers need to include all guideline-relevant information in referrals for imaging. More research is needed to determine whether this is due to patients not fulfilling guideline recommendations or simply to the content of the referrals.


2021, Vol 10 (10), pp. 710
Author(s): Erum Haris, Keng Hoon Gan

Travel blogs are a significant source for modeling human travel behavior and characterizing tourist destinations, owing to their rich geospatial and thematic content. However, the bulk of unstructured text requires extensive processing for an efficient transformation of data into knowledge. Existing works have studied tourist places, but their results lack a coherent outline and visualization of the semantic knowledge associated with tourist attractions. Hence, this work proposes place-semantics extraction based on a fusion of content analysis and natural language processing (NLP) techniques. A weighted-sum equation model is then employed to construct a points-of-interest graph (POI graph) that integrates the extracted semantics with conventional frequency-based weighting of tourist spots and routes. The framework distills and visualizes massive blog text in a comprehensible manner, helping individuals make travel decisions and enabling tourism managers to devise effective destination planning and management strategies.
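A minimal sketch of what a weighted-sum edge score for such a POI graph could look like: a convex combination of a conventional frequency weight and a semantic weight extracted from blog text. The mixing weight alpha and the input scores are illustrative assumptions, not values from the paper.

```python
def edge_weight(freq, semantic, alpha=0.6):
    """Convex combination: alpha * frequency weight + (1 - alpha) * semantic weight.

    Both inputs are assumed normalized to [0, 1]; alpha = 0.6 is an
    arbitrary illustrative choice."""
    return alpha * freq + (1 - alpha) * semantic

# Hypothetical route "Old Town -> Harbor": visited often, moderately
# praised in blog text
print(round(edge_weight(freq=0.9, semantic=0.5), 2))
```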


2021, Vol 32 (4), pp. 48-64
Author(s): Chenyang Bu, Xingchen Yu, Yan Hong, Tingting Jiang

The automatic construction of knowledge graphs (KGs) from multiple data sources has received increasing attention. The automatic construction process inevitably introduces considerable noise, especially when constructing KGs from unstructured text. The noise in a KG can be divided into two categories: factual noise and low-quality noise. Factual noise refers to factually incorrect yet plausible triples that meet the requirements of ontology constraints. For example, the plausible triple <New_York, IsCapitalOf, America> satisfies the constraints that the head entity "New_York" is a city and the tail entity "America" is a country, yet it is factually wrong. Low-quality noise denotes the obvious errors commonly created by information extraction processes; this study focuses on entity type errors. Most existing approaches concentrate on refining an existing KG, assuming that the type information of most entities or the ontology information of the KG is known in advance. However, such methods may not be suitable at the start of a KG's construction. Therefore, the authors propose an effective framework for eliminating entity type errors. The experimental results demonstrate the effectiveness of the proposed method.
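A toy illustration of the distinction between the two noise categories, assuming a simple relation type signature. The type table and helper are invented in the spirit of the <New_York, IsCapitalOf, America> example, not the authors' framework.

```python
# Hypothetical entity-type table and relation signature
entity_types = {"New_York": "city", "America": "country", "Hudson": "river"}

# relation -> (required head type, required tail type)
signatures = {"IsCapitalOf": ("city", "country")}

def satisfies_signature(head, relation, tail):
    """True iff the triple's head and tail match the relation's type signature."""
    h_req, t_req = signatures[relation]
    return entity_types.get(head) == h_req and entity_types.get(tail) == t_req

# Plausible triple: passes the ontology constraint, yet is factually
# wrong -- "factual noise" that a type check alone cannot catch.
print(satisfies_signature("New_York", "IsCapitalOf", "America"))  # True
# Entity type error ("low-quality noise"): a river cannot head IsCapitalOf.
print(satisfies_signature("Hudson", "IsCapitalOf", "America"))    # False
```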


2021, Vol 8 (1)
Author(s): Edwin Camilleri, Shah Jahan Miah

Abstract: In this research, various concepts from network theory and topic modelling are combined to produce a temporal network of associated topics. The solution is presented as a step-by-step process that facilitates the evaluation of latent topics in unstructured text, as well as of the domain area the textual documents are sourced from. In addition to making shifts and changes in the structural properties of a given corpus visible, non-stationary classes of co-occurring topics are determined, and trends in topic prevalence, positioning, and association patterns are evaluated over time. These capabilities extend the insights fostered by stand-alone topic modelling outputs by ensuring that latent topics are not only identified and summarized, but also more systematically interpreted, analysed, and explained, in a transparent and reliable way.
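One way such a temporal network construction could be sketched, under invented topic labels and documents: count topic pairs that co-occur in a document, keeping a separate edge list per time slice so structural shifts between slices become visible.

```python
from collections import Counter
from itertools import combinations

# (year, topics-per-document) pairs; labels invented for illustration
docs = [
    (2020, ["economy", "health"]),
    (2020, ["health", "policy"]),
    (2021, ["economy", "policy"]),
    (2021, ["economy", "policy"]),
]

networks = {}  # year -> Counter mapping topic pairs to edge weights
for year, topics in docs:
    edges = networks.setdefault(year, Counter())
    for pair in combinations(sorted(topics), 2):
        edges[pair] += 1

# Edge ("economy", "policy") strengthens in the 2021 slice
print(networks[2021][("economy", "policy")])  # 2
```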

