Creating Relational Data from Unstructured and Ungrammatical Data Sources

2008 ◽  
Vol 31 ◽  
pp. 543-590 ◽  
Author(s):  
M. Michelson ◽  
C. A. Knoblock

In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes querying and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically, making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate for this task because it relies on clues from the data, such as structure or natural language, neither of which is found in posts. Furthermore, traditional IE does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a "reference set." The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently, and outperforms state-of-the-art extraction systems on the extraction task from posts.
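As a rough illustration of the first step, aligning a post to reference-set records can be sketched as a string-similarity match: score the post against every record and let the best match above a threshold supply the schema and standard attribute values. The Jaccard token overlap and threshold below are illustrative assumptions, not the paper's learned record-linkage method.

```python
# A minimal sketch of reference-set alignment for posts. The similarity
# measure and threshold are illustrative; the paper learns the matching.

def tokens(s):
    return set(s.lower().split())

def jaccard(a, b):
    return len(a & b) / (len(a | b) or 1)

def align(post, reference_set, threshold=0.3):
    """Return the best-matching reference record for a post, or None."""
    post_toks = tokens(post)
    best, best_score = None, threshold
    for record in reference_set:  # record: dict of attribute -> standard value
        score = jaccard(post_toks, tokens(" ".join(record.values())))
        if score > best_score:
            best, best_score = record, score
    return best

# Example: a Craigslist-style car post matched against a tiny reference set.
cars = [{"make": "Honda", "model": "Civic", "trim": "EX"},
        {"make": "Honda", "model": "Accord", "trim": "LX"}]
print(align("93 honda civic ex runs great $1200 obo", cars))
# -> {'make': 'Honda', 'model': 'Civic', 'trim': 'EX'}
```

The matched record then defines the schema over the post and contributes clean, standardized values for querying and integration.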

2019 ◽  
Vol 11 ◽  
pp. 184797901989077 ◽  
Author(s):  
Kiran Adnan ◽  
Rehan Akbar

During the recent era of big data, a huge volume of unstructured data is being produced in various forms such as audio, video, images, text, and animation. Making effective use of these unstructured big data is a laborious and tedious task. Information extraction (IE) systems help to extract useful information from this large variety of unstructured data. Several techniques and methods have been presented for IE from unstructured data. However, most studies on IE from unstructured data are limited to a single data type such as text, image, audio, or video. This article reviews existing IE techniques along with their subtasks, limitations, and challenges across the variety of unstructured data, highlighting the impact of unstructured big data on IE techniques. To the best of our knowledge, no comprehensive study has been conducted to investigate the limitations of existing IE techniques across the variety of unstructured big data. The objective of the structured review presented in this article is twofold. First, it presents an overview of IE techniques for the variety of unstructured data, such as text, image, audio, and video, in one place. Second, it investigates the limitations of these existing IE techniques due to the heterogeneity, dimensionality, and volume of unstructured big data. The review finds that advanced IE techniques, particularly for multifaceted unstructured big data sets, are urgently required by organizations to manage big data and derive strategic information. Further, potential solutions are presented to improve unstructured big data IE systems for future research. These solutions will help to increase the efficiency and effectiveness of the data analytics process in terms of context-aware analytics systems, data-driven decision-making, and knowledge management.


Transformation is the second step of the ETL (extract, transform, load) process responsible for populating a data warehouse. Its role is to set up several operations that clean, format and unify the types and data coming from multiple, heterogeneous data sources. The goal is to make the data conform to the schema of the data warehouse, avoiding ambiguity problems during data storage and analytical operations. Transforming data coming from structured, semi-structured and unstructured sources needs two levels of treatment: the first is schema-to-schema transformation, which produces a unified schema for all selected data sources; the second is data-to-data transformation, which unifies all the types and values gathered. To set up these steps, this paper proposes a process for switching from one database schema to another as part of the schema-to-schema transformation, and a meta-model based on the MDA approach to describe the main data-to-data transformation operations. The results of our transformations support loading the data into one of the four NoSQL schemas so as to best meet the constraints and requirements of Big Data.
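As a rough illustration of the two treatment levels, the sketch below first maps each source's field names onto a hypothetical unified schema (schema-to-schema) and then normalizes value types and formats (data-to-data). The mapping tables and cleaning functions are illustrative assumptions, not the paper's MDA meta-model.

```python
# A minimal sketch of the two transformation levels, under assumed
# source schemas, a hypothetical unified schema, and toy cleaners.

from datetime import datetime

# Level 1: schema-to-schema -- map each source's field names onto one
# unified schema so heterogeneous sources share a common structure.
SCHEMA_MAPPINGS = {
    "crm_db":   {"cust_name": "customer", "dt": "date", "amt": "amount"},
    "web_logs": {"user": "customer", "timestamp": "date", "total": "amount"},
}

# Level 2: data-to-data -- unify the types and formats of the values.
CLEANERS = {
    "customer": lambda v: str(v).strip().title(),
    "date":     lambda v: datetime.fromisoformat(str(v)).date().isoformat(),
    "amount":   lambda v: round(float(v), 2),
}

def transform(record, source):
    """Apply schema-to-schema, then data-to-data transformation."""
    mapping = SCHEMA_MAPPINGS[source]
    unified = {mapping[k]: v for k, v in record.items() if k in mapping}
    return {k: CLEANERS[k](v) for k, v in unified.items()}

print(transform({"cust_name": " jane doe ", "dt": "2021-03-05", "amt": "19.9"},
                "crm_db"))
# -> {'customer': 'Jane Doe', 'date': '2021-03-05', 'amount': 19.9}
```

The unified, cleaned records are then ready to be loaded into whichever of the target NoSQL schemas best fits the workload.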


Author(s):  
Xuan Wu ◽  
Qing-Guo Chen ◽  
Yao Hu ◽  
Dengbao Wang ◽  
Xiaodong Chang ◽  
...  

Multi-view multi-label learning serves as an important framework for learning from objects with diverse representations and rich semantics. Existing multi-view multi-label learning techniques focus on exploiting a shared subspace for fusing multi-view representations, while helpful view-specific information for discriminative modeling is usually ignored. In this paper, a novel multi-view multi-label learning approach named SIMM is proposed which leverages shared subspace exploitation and view-specific information extraction. For shared subspace exploitation, SIMM jointly minimizes a confusion adversarial loss and a multi-label loss to utilize the information shared by all views. For view-specific information extraction, SIMM enforces an orthogonal constraint w.r.t. the shared subspace to utilize view-specific discriminative information. Extensive experiments on real-world data sets clearly show the favorable performance of SIMM against other state-of-the-art multi-view multi-label learning approaches.
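The orthogonal constraint can be sketched as a penalty on the correlation between the shared and view-specific representations. The PyTorch snippet below is a minimal illustration under assumed tensor shapes and an assumed loss weight, not SIMM's full architecture.

```python
# A minimal PyTorch sketch of an orthogonality penalty between a shared
# representation S and a view-specific representation V: the squared
# Frobenius norm of S^T V. Shapes and the 1e-3 weight are assumptions.

import torch

def orthogonal_penalty(shared, view_specific):
    """||S^T V||_F^2 pushes view-specific features to carry information
    complementary to the shared subspace."""
    return torch.norm(shared.t() @ view_specific, p="fro") ** 2

# Toy batch: 32 samples, 64-dim shared and view-specific embeddings.
S = torch.randn(32, 64, requires_grad=True)
V = torch.randn(32, 64, requires_grad=True)

multi_label_loss = torch.tensor(0.0)   # stand-in for the multi-label loss
adversarial_loss = torch.tensor(0.0)   # stand-in for the confusion loss
total = multi_label_loss + adversarial_loss + 1e-3 * orthogonal_penalty(S, V)
total.backward()                       # gradients flow into S and V
```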


2021 ◽  
pp. 1-12
Author(s):  
Luyi Bai ◽  
Nan Li ◽  
Lishuang Liu ◽  
Xuesong Hao

With the rapid development of environmental, meteorological and marine data management, fuzzy spatiotemporal data has received considerable attention. Even though some progress has been made on querying such data, some problems remain unsolved. Semantic and structural heterogeneity may exist among different data sources, which leads to incomplete results. In addition, users' query intentions and conditions are often ambiguous. This paper proposes a fuzzy spatiotemporal data semantic model. Based on this model, relational data and XML data are mapped to RDF local semantic models, which are then converted into an RDF global semantic model. Existing methods mainly convert relational data to RDF Schema directly; our approach instead converts relational data to XML Schema and then to RDF, exploiting the semi-structured nature of XML Schema to resolve the structural heterogeneity between data sources. The integration process enables global queries against different data sources. In the proposed query algorithms, the input query conditions are converted into exact queries before the results are returned. Finally, this paper carries out extensive experiments, calculates the recall, precision and F-score of the experimental results, and compares them with other state-of-the-art query methods, showing the importance of the proposed data integration method and the effectiveness of the proposed query method.
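The relational-to-XML-to-RDF conversion chain can be sketched as below. The element names, namespace, and URI scheme are hypothetical, and the snippet illustrates only the conversion path, not the paper's full semantic model.

```python
# A minimal sketch of relational -> XML -> RDF conversion using
# xml.etree and rdflib. The namespace and naming scheme are assumptions.

import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/st#")   # hypothetical namespace

def row_to_xml(table, row):
    """Relational row -> XML element (the intermediate, semi-structured form)."""
    elem = ET.Element(table, id=str(row["id"]))
    for col, val in row.items():
        if col != "id":
            ET.SubElement(elem, col).text = str(val)
    return elem

def xml_to_rdf(elem, graph):
    """XML element -> RDF triples in the global semantic model."""
    subject = URIRef(EX[f"{elem.tag}/{elem.get('id')}"])
    for child in elem:
        graph.add((subject, EX[child.tag], Literal(child.text)))
    return graph

g = Graph()
row = {"id": 1, "region": "Bohai Sea", "temp": "21.5", "time": "2020-07-01"}
xml_to_rdf(row_to_xml("observation", row), g)
print(g.serialize(format="turtle"))
```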


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of SVD++, a state-of-the-art recommendation algorithm that inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported here is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach TrustSVD achieves better accuracy than ten other counterparts and can better handle the aforementioned issues.
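The TrustSVD predictor extends the SVD++ rating rule with the implicit influence of trusted users. The NumPy sketch below illustrates that prediction rule with randomly initialized, untrained parameters and illustrative dimensions; a real system learns these parameters by gradient descent.

```python
# A minimal NumPy sketch of the TrustSVD prediction rule:
#   r_hat(u, j) = mu + b_u + b_j
#                 + q_j . (p_u + |I_u|^-1/2 * sum_{i in I_u} y_i
#                              + |T_u|^-1/2 * sum_{v in T_u} w_v)
# where I_u are items rated by u and T_u are users trusted by u.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 5, 8, 4
mu = 3.5                                   # global rating mean
b_u, b_j = rng.normal(0, 0.1, n_users), rng.normal(0, 0.1, n_items)
P = rng.normal(0, 0.1, (n_users, k))       # explicit user factors
Q = rng.normal(0, 0.1, (n_items, k))       # item factors
Y = rng.normal(0, 0.1, (n_items, k))       # implicit item-feedback factors
W = rng.normal(0, 0.1, (n_users, k))       # implicit trusted-user factors

def predict(u, j, rated_items, trusted_users):
    p = P[u].copy()
    if rated_items:
        p += Y[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    if trusted_users:
        p += W[trusted_users].sum(axis=0) / np.sqrt(len(trusted_users))
    return mu + b_u[u] + b_j[j] + Q[j] @ p

print(predict(0, 2, rated_items=[1, 3, 5], trusted_users=[2, 4]))
```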


2021 ◽  
Vol 29 ◽  
pp. 115-124
Author(s):  
Xinlu Wang ◽  
Ahmed A.F. Saif ◽  
Dayou Liu ◽  
Yungang Zhu ◽  
Jon Atli Benediktsson

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence; pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching algorithm with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of text, and the window slides along the given packed text, matching against the stored packed patterns. RESULTS: Three data sets were used to test the performance of the proposed algorithm, which proved more efficient than its competitors because its operations map closely to machine-level instructions. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms state-of-the-art methods and is especially effective for DNA sequences.
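Word packing of this kind is typically realized with bit-parallel matching, where each text character is consumed by a few bitwise machine operations. The sketch below shows the classic Shift-And scheme with wildcard support as an illustration of the idea, not the paper's exact algorithm.

```python
# A minimal sketch of bit-parallel (Shift-And) matching with wildcards
# over the DNA alphabet. Each pattern position is packed into one bit
# of a machine word, so one text character costs a shift, an OR and an
# AND. This is the classic scheme, not the paper's optimized algorithm.

def shift_and_search(pattern, text, wildcard="?"):
    """Yield start positions of matches; '?' in the pattern matches any base."""
    m = len(pattern)
    masks = {c: 0 for c in "ACGT"}
    for i, p in enumerate(pattern):
        for c in "ACGT":
            if p == wildcard or p == c:
                masks[c] |= 1 << i          # bit i set: position i accepts c
    state, accept = 0, 1 << (m - 1)
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            yield pos - m + 1

print(list(shift_and_search("AC?T", "GGACGTTACTTACAT")))
# -> [2, 7, 11]
```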


2021 ◽  
Vol 11 (6) ◽  
pp. 2511
Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R. Muhammad Atif Azad

This research presents Gradient Boosted Tree High Importance Path Snippets (gbt-HIPS), a novel, heuristic method for explaining gradient boosted tree (GBT) classification models by extracting a single classification rule (CR) from the ensemble of decision trees that make up the GBT model. This CR contains the most statistically important boundary values of the input space as antecedent terms. The CR represents a hyper-rectangle of the input space inside which the GBT model is, very reliably, classifying all instances with the same class label as the explanandum instance. In a benchmark test using nine data sets and five competing state-of-the-art methods, gbt-HIPS offered the best trade-off between coverage (0.16–0.75) and precision (0.85–0.98). Unlike competing methods, gbt-HIPS is also demonstrably guarded against under- and over-fitting. A further distinguishing feature of our method is that, unlike much prior work, our explanations also provide counterfactual detail in accordance with widely accepted recommendations for what makes a good explanation.
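As a simplified illustration of turning decision paths into a hyper-rectangle rule, the sketch below walks one instance through each tree of a scikit-learn GBT and merges the split conditions into per-feature bounds. It omits gbt-HIPS's statistical importance scoring and pruning heuristics, so it should be read only as a picture of the raw material the method works with.

```python
# A simplified sketch: collect the split conditions along one instance's
# decision paths in a scikit-learn GBT and merge them into per-feature
# (lower, upper) bounds, i.e. a hyper-rectangle antecedent. This is not
# the full gbt-HIPS heuristic, just the path-extraction building block.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
gbt = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

def path_rule(gbt, x):
    """Merge every split on x's decision paths into feature bounds."""
    lower, upper = {}, {}
    for tree in gbt.estimators_.ravel():
        t = tree.tree_
        node = 0
        while t.children_left[node] != -1:          # descend until a leaf
            f, thr = t.feature[node], t.threshold[node]
            if x[f] <= thr:
                upper[f] = min(upper.get(f, np.inf), thr)
                node = t.children_left[node]
            else:
                lower[f] = max(lower.get(f, -np.inf), thr)
                node = t.children_right[node]
    return {f: (lower.get(f, -np.inf), upper.get(f, np.inf))
            for f in sorted(set(lower) | set(upper))}

print(path_rule(gbt, X[0]))   # feature index -> (lower, upper) bounds
```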


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. The method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are fed into a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. Experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
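The propagation of sentiment into the word vectors can be sketched as fine-tuning an embedding layer through a small classifier over linearly weighted document vectors. The vocabulary size, dimensions, and uniform weighting below are illustrative assumptions, not the paper's two weighting schemes.

```python
# A minimal PyTorch sketch of the EWE idea: build a document vector by
# linearly weighting word vectors, classify it for sentiment, and let
# gradients flow back into the trainable embeddings so emotional
# polarity is written into the word vectors themselves.

import torch
import torch.nn as nn

vocab_size, embed_dim, n_classes = 1000, 50, 2
pretrained = torch.randn(vocab_size, embed_dim)      # stand-in for word2vec/GloVe

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
classifier = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(),
                           nn.Linear(32, n_classes))

def doc_vector(token_ids, weights):
    """Linear weighting of word vectors into one document vector."""
    vecs = embedding(token_ids)                      # (n_words, embed_dim)
    return (weights.unsqueeze(1) * vecs).sum(dim=0)

tokens = torch.tensor([4, 17, 233])                  # toy document
weights = torch.full((3,), 1 / 3)                    # uniform weighting
logits = classifier(doc_vector(tokens, weights))
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
loss.backward()                                      # gradients reach the embeddings
```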

