Relation extraction in structured and unstructured data: a comparative investigation on smartphone titles in the e-commerce domain

Mapping Intimacies ◽

10.5753/stil.2021.17789 ◽

2021 ◽

Author(s):

João Gabriel Melo Barbirato ◽

Livy Real ◽

Helena de Medeiros Caseli

Keyword(s):

Relation Extraction ◽

Comparative Investigation ◽

Structured Data ◽

Unstructured Data ◽

Relational Information ◽

Textual Data

As large amounts of unstructured data are generated on a regular basis, expressing or storing knowledge in a useful way remains a challenge. In this context, Relation Extraction (RE) is the task of automatically identifying relationships in unstructured textual data. We investigated relation extraction on unstructured e-commerce data from the smartphone domain, using a BERT model fine-tuned for this task. We conducted two experiments to assess how much relational information can be extracted from product sheets (structured data) and product titles (unstructured data), and a third experiment to compare both. The analysis shows that extracting relations from a title can retrieve correct relations that are not evident in the related sheet.

GREG: A Global Level Relation Extraction with Knowledge Graph Embedding

Applied Sciences ◽

10.3390/app10031181 ◽

2020 ◽

Vol 10 (3) ◽

pp. 1181 ◽

Cited By ~ 1

Author(s):

Kuekyeng Kim ◽

Yuna Hur ◽

Gyeongmin Kim ◽

Heuiseok Lim

Keyword(s):

Local Level ◽

Relation Extraction ◽

Graph Embedding ◽

Structured Data ◽

Unstructured Data ◽

Knowledge Graph ◽

Global Level ◽

Proposed Model ◽

Vector Representations ◽

Document Level

In an age overflowing with information, converting unstructured data into structured data is a vital task. Currently, most relation extraction modules focus on extracting local mention-level relations, usually from short volumes of text. However, in most cases, the most important relations are those described at length and in detail. In this research, we propose GREG: a Global level Relation Extractor model using knowledge graph embeddings for document-level inputs. The model uses vector representations of mention-level ‘local’ relations to construct knowledge graphs that can represent the input document. The knowledge graph is then used to predict global level relations from documents or large bodies of text. The proposed model is divided into two modules that are synchronized during training, so that each module deals with local relations and global relations separately. This allows the model to avoid the loss of information that occurs when too much information is compressed into small representations during global level relation extraction. Through evaluation, we show that the proposed model consistently yields high performance in predicting both global level and local level relations.

Named Entity Recognition and Relation Extraction

ACM Computing Surveys ◽

10.1145/3445965 ◽

2021 ◽

Vol 54 (1) ◽

pp. 1-39

Author(s):

Zara Nasar ◽

Syed Waqar Jaffry ◽

Muhammad Kamran Malik

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Named Entity Recognition ◽

Relation Extraction ◽

The State ◽

Entity Recognition ◽

Joint Models ◽

Named Entity ◽

Textual Data ◽

Benchmark Datasets

With the advent of Web 2.0, many online platforms produce massive amounts of textual data. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from this data. One approach to harnessing this unstructured textual data effectively is to transform it into structured text. Hence, this study presents an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, Named Entity Recognition and Relation Extraction are the main topics addressed in this review. The former deals with the identification of named entities, and the latter deals with the problem of extracting relations between sets of entities. This study covers early approaches as well as the developments made to date using machine learning models. The survey findings indicate that deep-learning-based hybrid and joint models currently dominate the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators, such as Twitter and other social forums, are not available. This scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of state-of-the-art techniques are offline and computationally expensive. Last, with the increasing focus on deep-learning frameworks, there is a need to understand and explain the underlying processes in deep architectures.

A Framework for the Forensic Investigation of Unstructured Email Relationship Data

International Journal of Digital Crime and Forensics ◽

10.4018/jdcf.2011070101 ◽

2011 ◽

Vol 3 (3) ◽

pp. 1-18 ◽

Cited By ~ 11

Author(s):

John Haggerty ◽

Alexander J. Karran ◽

David J. Lamb ◽

Mark Taylor

Keyword(s):

Forensic Analysis ◽

Structured Data ◽

Unstructured Data ◽

Forensic Investigation ◽

Qualitative Information ◽

Data Set ◽

Analysis Techniques ◽

Potential Sources ◽

Network Identification ◽

Digital Investigation

The continued reliance on email communications ensures that it remains a major source of evidence during a digital investigation. Emails comprise both structured and unstructured data. Structured data provides qualitative information to the forensics examiner and is typically viewed through existing tools. Unstructured data is more complex as it comprises information associated with social networks, such as relationships within the network, identification of key actors and power relations, and there are currently no standardised tools for its forensic analysis. This paper posits a framework for the forensic investigation of email data. In particular, it focuses on the triage and analysis of unstructured data to identify key actors and relationships within an email network. This paper demonstrates the applicability of the approach by applying relevant stages of the framework to the Enron email corpus. The paper illustrates the advantage of triaging this data to identify (and discount) actors and potential sources of further evidence. It then applies social network analysis techniques to key actors within the data set. This paper posits that visualisation of unstructured data can greatly aid the examiner in their analysis of evidence discovered during an investigation.
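
The idea of identifying key actors in an email network can be illustrated with a minimal sketch. It is not the framework from the paper: it assumes that (sender, recipients) pairs have already been extracted from the messages, and it uses plain degree (number of distinct contacts) as a stand-in for the social network analysis techniques the abstract mentions. The function name `key_actors` and the sample names are hypothetical.

```python
def key_actors(messages, top_n=3):
    """Rank actors by how many distinct people they exchanged email with.

    messages: iterable of (sender, [recipients]) pairs (assumed input format).
    """
    contacts = {}
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient == sender:
                continue  # ignore self-addressed mail
            # Treat each exchange as an undirected link between two actors
            contacts.setdefault(sender, set()).add(recipient)
            contacts.setdefault(recipient, set()).add(sender)
    degree = {actor: len(peers) for actor, peers in contacts.items()}
    # Highest degree first; ties broken alphabetically for stable output
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy data, invented for illustration only
sample_messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("dave", ["alice"]),
    ("carol", ["bob"]),
]
ranked_actors = key_actors(sample_messages)
```

Actors with few contacts fall to the bottom of such a ranking, which loosely mirrors the triage step of discounting peripheral actors before deeper analysis.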

Enhancing the management of unstructured data in e-learning systems using MongoDB

Program electronic library and information systems ◽

10.1108/prog-11-2013-0063 ◽

2015 ◽

Vol 49 (1) ◽

pp. 91-114 ◽

Cited By ~ 5

Author(s):

Milorad Pantelija Stevic ◽

Branko Milosavljevic ◽

Branko Rade Perisic

Keyword(s):

Service Oriented Architecture ◽

Structured Data ◽

Learning Systems ◽

Unstructured Data ◽

Development Effort ◽

Content Type ◽

Learning Platform ◽

Service Oriented ◽

E Learning ◽

Learning Platforms

Purpose – Current e-learning platforms are based on relational database management systems (RDBMS) and are well suited for handling structured data. However, e-learning solutions are also expected to handle unstructured data efficiently. The purpose of this paper is to show an alternative to current solutions for unstructured data management. Design/methodology/approach – The current repository-based solution for file management was compared to a MongoDB architecture according to their functionalities and characteristics. This included several categories: data integrity, hardware acquisition, processing files, availability, handling concurrent users, partition tolerance, disaster recovery, backup policies and scalability. Findings – This paper shows that it is possible to improve e-learning platform capabilities by implementing a hybrid database architecture that incorporates an RDBMS for handling structured data and the MongoDB database system for handling unstructured data. Research limitations/implications – The study shows an acceptable adoption of MongoDB inside a service-oriented architecture (SOA) for enhancing e-learning solutions. Practical implications – This research enables efficient file handling not only for e-learning systems, but also for any system where file handling is needed. Originality/value – It is expected that future single/joint e-learning initiatives will need to manage huge numbers of files and will require an effective file handling solution. A new architecture for file handling is offered in this paper: it differs from current solutions because it is less expensive, more efficient, more flexible and requires less administrative and development effort to build and maintain.

The Impact of Big Data on Security

Big Data ◽

10.4018/978-1-4666-9840-6.ch068 ◽

2016 ◽

pp. 1495-1518

Author(s):

Mohammad Alaa Hussain Al-Hamami

Keyword(s):

Social Media ◽

Big Data ◽

Management System ◽

Database Management ◽

Database Systems ◽

Structured Data ◽

Database Management System ◽

Unstructured Data ◽

And Behavior ◽

The Impact

Big Data is comprised systems, to remain competitive by techniques emerging due to Big Data. Big Data includes structured, semi-structured and unstructured data. Structured data are data formatted for use in a database management system. Semi-structured and unstructured data include all types of unformatted data, including multimedia and social media content. Among practitioners and applied researchers, the reaction to data available through blogs, Twitter, Facebook, or other social media can be described as a “data rush” promising new insights about consumers' choices and behavior and many other issues. In the past, Big Data was used only by very large organizations, governments and large enterprises that had the ability to create their own infrastructure for hosting and mining large amounts of data. This chapter shows that Big Data environments need to be protected using the same rigorous security strategies applied to traditional database systems.

Learning to Transfer Relational Representations through Analogy

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.330110015 ◽

2019 ◽

Vol 33 ◽

pp. 10015-10016

Author(s):

Gaetano Rossiello ◽

Alfio Gliozzo ◽

Michael Glass

Keyword(s):

State Of The Art ◽

Relation Extraction ◽

Knowledge Bases ◽

The State ◽

Large Set ◽

Relational Information ◽

Siamese Network ◽

Distant Supervision ◽

Novel Approach ◽

Art Methods

We propose a novel approach to learn representations of relations expressed by their textual mentions. In our assumption, if two pairs of entities belong to the same relation, then those two pairs are analogous. We collect a large set of analogous pairs by matching triples in knowledge bases with web-scale corpora through distant supervision. This dataset is adopted to train a hierarchical siamese network in order to learn entity-entity embeddings which encode relational information through the different linguistic paraphrasing expressing the same relation. The model can be used to generate pre-trained embeddings which provide a valuable signal when integrated into an existing neural-based model by outperforming the state-of-the-art methods on a relation extraction task.

Biomedical Relation Extraction Using Distant Supervision

Scientific Programming ◽

10.1155/2020/8893749 ◽

2020 ◽

Vol 2020 ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Nada Boudjellal ◽

Huaping Zhang ◽

Asif Khan ◽

Arshad Ahmad

Keyword(s):

Big Data ◽

Information Extraction ◽

State Of The Art ◽

Relation Extraction ◽

Knowledge Bases ◽

Structured Data ◽

Distant Supervision ◽

Future Challenges ◽

Unstructured Information ◽

Biomedical Relation Extraction

With the accelerating growth of big data, especially in the healthcare area, information extraction is needed more than ever, for it can convert unstructured information into easily interpretable structured data. Relation extraction is the second of the two important tasks of information extraction. This study presents an overview of relation extraction using distant supervision, providing a generalized architecture of this task based on the state-of-the-art work that proposed this method. It also surveys the methods used in the literature on this topic, with a description of the different knowledge bases and corpora used in the process, which can be helpful for beginner practitioners seeking knowledge on this subject. Moreover, the limitations of the proposed approaches and future challenges are highlighted, and possible solutions are proposed.
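
The core labeling heuristic behind distant supervision can be sketched in a few lines. This is a simplified illustration rather than the architecture surveyed in the paper: it assumes a knowledge base given as (subject, relation, object) triples and labels any sentence that mentions both entities of a triple with that triple's relation. The function name `distant_label`, the toy triples, and the sentences are all hypothetical.

```python
def distant_label(sentences, kb_triples):
    """Label sentences with KB relations whose two entities both occur in them.

    Matching is naive, case-insensitive substring search (an assumption made
    for brevity; real systems use entity recognition and linking).
    """
    labeled = []
    for sentence in sentences:
        text = sentence.lower()
        for subj, relation, obj in kb_triples:
            if subj.lower() in text and obj.lower() in text:
                labeled.append((sentence, subj, obj, relation))
    return labeled


# Toy knowledge base and corpus, invented for illustration only
toy_kb = [
    ("aspirin", "treats", "headache"),
    ("metformin", "treats", "diabetes"),
]
toy_sentences = [
    "Aspirin is commonly taken for headache relief.",
    "Metformin was discussed at the conference.",
    "Patients with diabetes often receive metformin.",
]
training_examples = distant_label(toy_sentences, toy_kb)
```

Because a sentence can mention both entities without expressing the relation, labels produced this way are noisy, which is one of the limitations such surveys typically discuss.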

TLabel

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2016100103 ◽

2016 ◽

Vol 12 (4) ◽

pp. 54-74 ◽

Cited By ~ 1

Author(s):

Lamia Oukid ◽

Omar Boussaid ◽

Nadjia Benblidia ◽

Fadila Bentayeb

Keyword(s):

Text Categorization ◽

Numerical Data ◽

Semantic Content ◽

Aggregation Operators ◽

Structured Data ◽

Wide Range ◽

Textual Data ◽

On Line ◽

Analytical Processing ◽

On Line Analytical Processing

Data Warehousing technologies and On-Line Analytical Processing (OLAP) feature a wide range of techniques for the analysis of structured data. However, these techniques are inadequate when it comes to analyzing textual data. Indeed, classical aggregation operators have earned their spurs in the online analysis of numerical data, but are unsuitable for the analysis of textual data. To alleviate this shortcoming, on-line analytical processing in text cubes requires new analysis operators adapted to textual data. In this paper, the authors propose a new aggregation operator named Text Label (TLabel), based on text categorization. Their operator aggregates textual data in several classes of documents. Each class is associated with a label that represents the semantic content of the textual data of the class. TLabel is founded on a tailoring of text mining techniques to OLAP. To validate their operator, the authors perform an experimental study and the preliminary results show the interest of their approach for Text OLAP.

Unstructured clinical documentation reflecting cognitive and behavioral dysfunction: toward an EHR-based phenotype for cognitive impairment

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocy070 ◽

2018 ◽

Vol 25 (9) ◽

pp. 1206-1212

Author(s):

Andrea L Gilmore-Bykovskyi ◽

Laura M Block ◽

Lily Walljasper ◽

Nikki Hill ◽

Carey Gleason ◽

...

Keyword(s):

Cognitive Impairment ◽

Acute Care ◽

Structured Data ◽

Narrative Text ◽

Unstructured Data ◽

Cognitive Behavioral ◽

Clinical Documentation ◽

Behavioral Dysfunction ◽

Increased Risk ◽

Data Elements

Despite increased risk for negative outcomes, cognitive impairment (CI) is greatly under-detected during hospitalization. While automated EHR-based phenotypes have the potential to improve recognition of CI, they are hindered by widespread under-diagnosis of underlying etiologies such as dementia, which limits the utility of more precise structured data elements. This study examined unstructured data on symptoms of CI in the acute-care EHRs of hip fracture and stroke patients with dementia from two hospitals. Clinician reviewers identified and classified unstructured EHR data using standardized criteria. Relevant narrative text was descriptively characterized and evaluated for key terminology. Most patient EHRs (90%) had narrative text reflecting cognitive and/or behavioral dysfunction common in CI that was reliably classified (κ 0.82). The majority of statements reflected vague descriptions of cognitive/behavioral dysfunction as opposed to diagnostic terminology. Findings from this preliminary derivation study suggest that clinicians use specific terminology in unstructured EHR fields to describe common symptoms of CI. This terminology can inform the design of EHR-based phenotypes for CI and merits further investigation in more diverse, robustly characterized samples.
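
For readers unfamiliar with the κ statistic reported above, the following sketch shows how an agreement score of this kind can be computed. It assumes Cohen's kappa for two reviewers, which the abstract does not state explicitly; the function name `cohens_kappa` and the example labels are hypothetical and unrelated to the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: share of items both raters labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Expected chance agreement from each rater's label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters use a single identical label
    return (observed - expected) / (1 - expected)


# Invented labels for ten statements; the raters disagree on the last one
rater_a = ["CI", "CI", "none", "none", "CI", "none", "CI", "none", "CI", "none"]
rater_b = ["CI", "CI", "none", "none", "CI", "none", "CI", "none", "CI", "CI"]
kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters agree on 9 of 10 items (observed agreement 0.9) while chance agreement is 0.5, so kappa is about 0.8.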

Mining Local Specialties for Travelers by Leveraging Structured and Unstructured Data

Advances in Multimedia ◽

10.1155/2012/987124 ◽

2012 ◽

Vol 2012 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Kai Jiang ◽

Like Liu ◽

Rong Xiao ◽

Nenghai Yu

Keyword(s):

Daily Life ◽

Large Data ◽

Structured Data ◽

Unstructured Data ◽

User Generated Content ◽

Data Set ◽

Large Cities ◽

Mining Algorithm ◽

Large Data Set ◽

Ranking Score

Recently, many local review websites such as Yelp have emerged, greatly facilitating everyday activities such as cuisine hunting. However, they fail to meet travelers' demands, because travelers are more concerned with a city's local specialties than with its highly ranked restaurants. To solve this problem, this paper presents a local specialty mining algorithm that utilizes both the structured data from local review websites and the unstructured user-generated content (UGC) from community Q&A websites and travelogues. The proposed algorithm extracts dish names from local review data to build a document for each city, and applies the tf-idf weighting algorithm to these documents to rank dishes. Dish-city correlations are calculated from unstructured UGC and combined with the tf-idf ranking score to discover local specialties. Finally, duplicates in the local specialty mining results are merged. A recommendation service is built to present local specialties to travelers, along with the specialties' associated restaurants, Q&A threads, and travelogues. Experiments on a large data set show that the proposed algorithm achieves good performance and that, compared to using local review data alone, leveraging unstructured UGC substantially boosts mining performance, especially in large cities.
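
The tf-idf ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it covers only the per-city tf-idf ranking (the combination with dish-city correlations from UGC is omitted), it assumes dish names have already been extracted, and it uses one common tf-idf variant (relative term frequency times the log of inverse document frequency). The function name `rank_dishes_by_tfidf` and the sample data are hypothetical.

```python
import math
from collections import Counter


def rank_dishes_by_tfidf(city_dishes):
    """Rank dishes per city, treating each city's dish mentions as one document.

    city_dishes: dict mapping a city name to a list of dish mentions
    (assumed to be pre-extracted from review data).
    """
    n_cities = len(city_dishes)
    # Document frequency: number of city documents that mention each dish
    doc_freq = Counter()
    for dishes in city_dishes.values():
        doc_freq.update(set(dishes))

    rankings = {}
    for city, dishes in city_dishes.items():
        counts = Counter(dishes)
        total = len(dishes)
        scores = {}
        for dish, count in counts.items():
            tf = count / total
            idf = math.log(n_cities / doc_freq[dish])
            scores[dish] = tf * idf
        # Highest score first; ties broken alphabetically for stable output
        rankings[city] = sorted(scores, key=lambda d: (-scores[d], d))
    return rankings


# Toy mentions, invented for illustration only
sample_mentions = {
    "Chengdu": ["hotpot", "hotpot", "mapo tofu", "noodles"],
    "Beijing": ["roast duck", "roast duck", "noodles", "hotpot"],
    "Xi'an": ["roujiamo", "noodles", "roujiamo"],
}
dish_rankings = rank_dishes_by_tfidf(sample_mentions)
```

A dish mentioned in every city (here "noodles") gets an idf of zero and sinks to the bottom, which is the property that lets tf-idf favor dishes distinctive to one city over dishes that are merely popular everywhere.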