Relation extraction in structured and unstructured data: a comparative investigation on smartphone titles in the e-commerce domain

2021 ◽  
Author(s):  
João Gabriel Melo Barbirato ◽  
Livy Real ◽  
Helena de Medeiros Caseli

As large amounts of unstructured data are generated on a regular basis, expressing or storing knowledge in a way that is useful remains a challenge. In this context, Relation Extraction (RE) is the task of automatically identifying relationships in unstructured textual data. We therefore investigated relation extraction on unstructured e-commerce data from the smartphone domain, using a BERT model fine-tuned for this task. We conducted two experiments to assess how much relational information can be extracted from product sheets (structured data) and product titles (unstructured data), and a third experiment to compare both. Analysis shows that extracting relations within a title can retrieve correct relations that are not evident on the related sheet.

2020 ◽  
Vol 10 (3) ◽  
pp. 1181 ◽  
Author(s):  
Kuekyeng Kim ◽  
Yuna Hur ◽  
Gyeongmin Kim ◽  
Heuiseok Lim

In an age overflowing with information, converting unstructured data into structured data is a vital task. Currently, most relation extraction modules focus on extracting local mention-level relations, usually from short volumes of text. However, in most cases, the most important relations are those that are described at length and in detail. In this research, we propose GREG: a Global level Relation Extractor model using knowledge graph embeddings for document-level inputs. The model uses vector representations of mention-level ‘local’ relations to construct knowledge graphs that can represent the input document. The knowledge graph is then used to predict global level relations from documents or large bodies of text. The proposed model is divided into two modules, which are synchronized during training. Each module is thus designed to deal with local relations and global relations separately. This allows the model to avoid the loss of information that occurs when too much information is compressed into smaller representations during global level relation extraction. Through evaluation, we have shown that the proposed model consistently yields high performance in predicting both global level and local level relations.


2021 ◽  
Vol 54 (1) ◽  
pp. 1-39
Author(s):  
Zara Nasar ◽  
Syed Waqar Jaffry ◽  
Muhammad Kamran Malik

With the advent of Web 2.0, there exist many online platforms that result in massive textual-data production. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from this data. One approach towards effective harnessing of this unstructured textual data could be its transformation into structured text. Hence, this study aims to present an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, Named Entity Recognition and Relation Extraction are the main topics addressed in this review study. The former deals with the identification of named entities, and the latter deals with the problem of extracting relations between sets of entities. This study covers early approaches as well as the developments made up till now using machine learning models. Survey findings conclude that deep-learning-based hybrid and joint models currently dominate the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators such as Twitter and other social forums are not available. This scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of the state-of-the-art techniques are offline and computationally expensive. Last, with increasing focus on deep-learning frameworks, there is a need to understand and explain the underlying processes in deep architectures.


2011 ◽  
Vol 3 (3) ◽  
pp. 1-18 ◽  
Author(s):  
John Haggerty ◽  
Alexander J. Karran ◽  
David J. Lamb ◽  
Mark Taylor

The continued reliance on email communications ensures that it remains a major source of evidence during a digital investigation. Emails comprise both structured and unstructured data. Structured data provides qualitative information to the forensics examiner and is typically viewed through existing tools. Unstructured data is more complex as it comprises information associated with social networks, such as relationships within the network, identification of key actors and power relations, and there are currently no standardised tools for its forensic analysis. This paper posits a framework for the forensic investigation of email data. In particular, it focuses on the triage and analysis of unstructured data to identify key actors and relationships within an email network. This paper demonstrates the applicability of the approach by applying relevant stages of the framework to the Enron email corpus. The paper illustrates the advantage of triaging this data to identify (and discount) actors and potential sources of further evidence. It then applies social network analysis techniques to key actors within the data set. This paper posits that visualisation of unstructured data can greatly aid the examiner in their analysis of evidence discovered during an investigation.
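One of the simplest social network analysis measures for spotting key actors in an email network is degree centrality. The sketch below is illustrative only and is not the framework described in the abstract: the function name `degree_centrality`, the (sender, recipient) input format, and the sample addresses are all assumptions made for the example.

```python
from collections import defaultdict


def degree_centrality(messages):
    """Return normalized degree centrality for each actor in an email network.

    `messages` is an iterable of (sender, recipient) pairs (a hypothetical
    input format). Two actors are linked if at least one message passed
    between them, in either direction.
    """
    neighbours = defaultdict(set)
    for sender, recipient in messages:
        if sender == recipient:
            continue  # ignore self-addressed mail
        neighbours[sender].add(recipient)
        neighbours[recipient].add(sender)

    n = len(neighbours)
    if n < 2:
        return {actor: 0.0 for actor in neighbours}
    # Divide by the maximum possible number of contacts (n - 1).
    return {actor: len(peers) / (n - 1) for actor, peers in neighbours.items()}


# Made-up messages, not taken from any real corpus.
messages = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
]
centrality = degree_centrality(messages)
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

Actors at the top of `ranked` are candidates for closer examination, while those with very low scores could be discounted during triage.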


2015 ◽  
Vol 49 (1) ◽  
pp. 91-114 ◽  
Author(s):  
Milorad Pantelija Stevic ◽  
Branko Milosavljevic ◽  
Branko Rade Perisic

Purpose – Current e-learning platforms are based on relational database management systems (RDBMS) and are well suited for handling structured data. However, e-learning solutions are expected to handle unstructured data efficiently as well. The purpose of this paper is to show an alternative to current solutions for unstructured data management. Design/methodology/approach – The current repository-based solution for file management was compared to the MongoDB architecture according to their functionalities and characteristics. This included several categories: data integrity, hardware acquisition, processing files, availability, handling concurrent users, partition tolerance, disaster recovery, backup policies and scalability. Findings – This paper shows that it is possible to improve e-learning platform capabilities by implementing a hybrid database architecture that incorporates an RDBMS for handling structured data and the MongoDB database system for handling unstructured data. Research limitations/implications – The study shows an acceptable adoption of MongoDB inside a service-oriented architecture (SOA) for enhancing e-learning solutions. Practical implications – This research enables efficient file handling not only for e-learning systems, but also for any system where file handling is needed. Originality/value – It is expected that future single/joint e-learning initiatives will need to manage huge numbers of files and will require an effective file-handling solution. A new architecture for file handling is offered in this paper: it differs from current solutions in being less expensive, more efficient, more flexible and requiring less administrative and development effort to build and maintain.


Big Data ◽  
2016 ◽  
pp. 1495-1518
Author(s):  
Mohammad Alaa Hussain Al-Hamami

Big Data is comprised systems, to remain competitive by techniques emerging due to Big Data. Big Data includes structured, semi-structured and unstructured data. Structured data are those data formatted for use in a database management system. Semi-structured and unstructured data include all types of unformatted data, including multimedia and social media content. Among practitioners and applied researchers, the reaction to data available through blogs, Twitter, Facebook, or other social media can be described as a “data rush” promising new insights about consumers' choices and behavior and many other issues. In the past, Big Data has been used only by very large organizations, governments and large enterprises that have the ability to create their own infrastructure for hosting and mining large amounts of data. This chapter will show the requirements for Big Data environments to be protected using the same rigorous security strategies applied to traditional database systems.


Author(s):  
Gaetano Rossiello ◽  
Alfio Gliozzo ◽  
Michael Glass

We propose a novel approach to learn representations of relations expressed by their textual mentions. In our assumption, if two pairs of entities belong to the same relation, then those two pairs are analogous. We collect a large set of analogous pairs by matching triples in knowledge bases with web-scale corpora through distant supervision. This dataset is adopted to train a hierarchical siamese network in order to learn entity-entity embeddings which encode relational information through the different linguistic paraphrasing expressing the same relation. The model can be used to generate pre-trained embeddings which provide a valuable signal when integrated into an existing neural-based model by outperforming the state-of-the-art methods on a relation extraction task.


2020 ◽  
Vol 2020 ◽  
pp. 1-9 ◽  
Author(s):  
Nada Boudjellal ◽  
Huaping Zhang ◽  
Asif Khan ◽  
Arshad Ahmad

With the accelerating growth of big data, especially in the healthcare area, information extraction is needed more than ever, for it can convert unstructured information into easily interpretable structured data. Relation extraction is the second of the two important tasks of information extraction. This study presents an overview of relation extraction using distant supervision, providing a generalized architecture of this task based on the state-of-the-art work that proposed this method. Besides, it surveys the methods used in the literature targeting this topic, with a description of the different knowledge bases used in the process along with the corpora, which can be helpful for beginner practitioners seeking knowledge on this subject. Moreover, the limitations of the proposed approaches and future challenges are highlighted, and possible solutions are proposed.
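The core labeling heuristic behind distant supervision can be sketched in a few lines: any sentence that mentions both entities of a knowledge-base triple is assumed to express that triple's relation. The code below is a minimal illustration under that assumption, not the architecture from the survey; the function name `distant_label`, the toy triple and the sentences are hypothetical.

```python
def distant_label(sentences, kb_triples):
    """Label sentences with relations from a knowledge base.

    A sentence receives the relation of a (head, relation, tail) triple
    whenever it contains both the head and the tail as substrings.
    This naive matching is what makes the resulting labels noisy.
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for head, relation, tail in kb_triples:
            if head.lower() in lowered and tail.lower() in lowered:
                labeled.append((sentence, head, tail, relation))
    return labeled


# Toy knowledge base and corpus, invented for illustration.
kb_triples = [("aspirin", "treats", "headache")]
sentences = [
    "Aspirin is commonly used to relieve headache.",
    "Some patients report headache after stopping aspirin.",
    "Ibuprofen reduces fever.",
]
examples = distant_label(sentences, kb_triples)
```

Both of the first two sentences are labeled `treats`, although only the first actually expresses that relation. Such wrongly labeled instances are a typical limitation of the distant supervision assumption.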


2016 ◽  
Vol 12 (4) ◽  
pp. 54-74 ◽  
Author(s):  
Lamia Oukid ◽  
Omar Boussaid ◽  
Nadjia Benblidia ◽  
Fadila Bentayeb

Data Warehousing technologies and On-Line Analytical Processing (OLAP) feature a wide range of techniques for the analysis of structured data. However, these techniques are inadequate when it comes to analyzing textual data. Indeed, classical aggregation operators have earned their spurs in the online analysis of numerical data, but are unsuitable for the analysis of textual data. To alleviate this shortcoming, on-line analytical processing in text cubes requires new analysis operators adapted to textual data. In this paper, the authors propose a new aggregation operator named Text Label (TLabel), based on text categorization. Their operator aggregates textual data in several classes of documents. Each class is associated with a label that represents the semantic content of the textual data of the class. TLabel is founded on a tailoring of text mining techniques to OLAP. To validate their operator, the authors perform an experimental study and the preliminary results show the interest of their approach for Text OLAP.


2018 ◽  
Vol 25 (9) ◽  
pp. 1206-1212
Author(s):  
Andrea L Gilmore-Bykovskyi ◽  
Laura M Block ◽  
Lily Walljasper ◽  
Nikki Hill ◽  
Carey Gleason ◽  
...  

Despite increased risk for negative outcomes, cognitive impairment (CI) is greatly under-detected during hospitalization. While automated EHR-based phenotypes have potential to improve recognition of CI, they are hindered by widespread under-diagnosis of underlying etiologies such as dementia, which limits the utility of more precise structured data elements. This study examined unstructured data on symptoms of CI in the acute-care EHRs of hip fracture and stroke patients with dementia from two hospitals. Clinician reviewers identified and classified unstructured EHR data using standardized criteria. Relevant narrative text was descriptively characterized and evaluated for key terminology. Most patient EHRs (90%) had narrative text reflecting cognitive and/or behavioral dysfunction common in CI that was reliably classified (κ 0.82). The majority of statements reflected vague descriptions of cognitive/behavioral dysfunction as opposed to diagnostic terminology. Findings from this preliminary derivation study suggest that clinicians use specific terminology in unstructured EHR fields to describe common symptoms of CI. This terminology can inform the design of EHR-based phenotypes for CI and merits further investigation in more diverse, robustly characterized samples.


2012 ◽  
Vol 2012 ◽  
pp. 1-9 ◽  
Author(s):  
Kai Jiang ◽  
Like Liu ◽  
Rong Xiao ◽  
Nenghai Yu

Recently, many local review websites such as Yelp have emerged, greatly facilitating everyday activities such as cuisine hunting. However, they fail to meet travelers' demands, because travelers are more concerned with a city's local specialties than with its highly ranked restaurants. To solve this problem, this paper presents a local specialty mining algorithm, which utilizes both the structured data from local review websites and the unstructured user-generated content (UGC) from community Q&A websites and travelogues. The proposed algorithm extracts dish names from local review data to build a document for each city, and applies the tf-idf weighting algorithm to these documents to rank dishes. Dish-city correlations are calculated from unstructured UGC and combined with the tf-idf ranking score to discover local specialties. Finally, duplicates in the local specialty mining results are merged. A recommendation service is built to present local specialties to travelers, along with the specialties' associated restaurants, Q&A threads, and travelogues. Experiments on a large data set show that the proposed algorithm achieves good performance and that, compared to using local review data alone, leveraging unstructured UGC substantially improves mining performance, especially in large cities.
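The tf-idf step can be illustrated with a small sketch in which each city is a document made of the dish names extracted from its reviews. This is only a plain tf-idf ranking under assumed inputs, not the paper's full algorithm (it omits the dish-city correlation from UGC and the duplicate merging); the function name `tfidf_rank` and the dish counts are invented for the example.

```python
import math
from collections import Counter


def tfidf_rank(city_dishes):
    """Rank dishes per city by tf-idf.

    `city_dishes` maps a city name to the list of dish mentions found in
    that city's reviews (a hypothetical input format). Dishes mentioned
    in every city get an idf of zero and sink to the bottom.
    """
    n_cities = len(city_dishes)
    doc_freq = Counter()
    for dishes in city_dishes.values():
        doc_freq.update(set(dishes))  # count each dish once per city

    rankings = {}
    for city, dishes in city_dishes.items():
        counts = Counter(dishes)
        total = len(dishes)
        scores = {
            dish: (count / total) * math.log(n_cities / doc_freq[dish])
            for dish, count in counts.items()
        }
        rankings[city] = sorted(
            scores.items(), key=lambda item: item[1], reverse=True
        )
    return rankings


# Made-up mention lists, for illustration only.
city_dishes = {
    "Chengdu": ["hotpot", "hotpot", "mapo tofu", "mapo tofu", "mapo tofu", "fried rice"],
    "Beijing": ["roast duck", "roast duck", "hotpot", "fried rice"],
    "Guangzhou": ["dim sum", "dim sum", "fried rice"],
}
rankings = tfidf_rank(city_dishes)
```

In this toy data, "fried rice" appears in every city and therefore scores zero everywhere, while dishes concentrated in a single city rise to the top of that city's ranking.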

