Knowledge Extraction from Unstructured Data on the Web

Author(s):  
Wei Emma Zhang ◽  
Quan Z. Sheng
2020 ◽  
Vol 5 (4) ◽  
pp. 43-55
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Cinzia Daraio ◽  
Antonio Laureti Palma ◽  
Giulio Perani ◽  
...  

Purpose – The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information automatically extracted can potentially be updated more frequently than once per year, and is safe from manipulation or misinterpretation. Moreover, this approach allows flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities.

Design/methodology/approach – Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators has been extracted from the universities' websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL database in a semi-structured form, so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All of the above was used to perform a cluster analysis of 79 Italian universities based on structural and digital indicators.

Findings – The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to them.

Research limitations – The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications – The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value – This work applies to university websites, for the first time, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
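The pipeline described above — scrape a page, store it as a semi-structured document, derive an indicator from it — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the indicator (`link_density`) and the document fields are assumptions chosen for the example, and a real system would persist the documents in an actual NoSQL store rather than an in-memory dict.

```python
import json
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collect simple webometric signals from one HTML page:
    outgoing links and the amount of visible text."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def scrape_to_document(university, html):
    """Turn raw HTML into a semi-structured JSON document, as it
    might be stored in a NoSQL collection for later text mining."""
    parser = PageStats()
    parser.feed(html)
    return {
        "university": university,
        "n_links": len(parser.links),
        "text_chars": parser.text_chars,
        # Hypothetical dissemination indicator: links per 1,000 characters.
        "link_density": round(1000 * len(parser.links) / max(parser.text_chars, 1), 2),
    }

html = ('<html><body><p>Research and teaching.</p>'
        '<a href="/courses">Courses</a><a href="/phd">PhD</a></body></html>')
doc = scrape_to_document("Example University", html)
print(json.dumps(doc, indent=2))
```

Because each page becomes one JSON document, new indicators can later be computed over the stored collection without re-scraping, which is the flexibility the abstract points to.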


Author(s):  
Emrah Inan ◽  
Burak Yonyul ◽  
Fatih Tekbacak

Most of the data on the web is unstructured, and it must be transformed into a machine-operable structure. It is therefore appropriate to convert unstructured data into a structured form according to the requirements, and to store those data in different data models depending on the use case. As requirements and their types increase, no single approach can serve them all, so it is not suitable to use a single storage technology to meet every storage requirement. Managing stores with various types of schemas in a joint and integrated manner is called 'multistore' or 'polystore' in the database literature. In this paper, the Entity Linking task is leveraged to transform texts into well-formed data, and this data is managed in an integrated environment of different data models. Finally, this integrated big data environment is queried and examined to demonstrate the method.
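The two steps the abstract combines — entity linking to turn free text into well-formed records, then routing those records into different data models — can be illustrated with a toy sketch. The dictionary-based linker and the `dbpedia:` identifiers below are illustrative assumptions; real entity linking uses candidate generation and disambiguation models, and the "polystore" here is simulated with two in-memory structures.

```python
# Toy knowledge base: surface form -> canonical entity URI (illustrative only).
KB = {
    "barack obama": "dbpedia:Barack_Obama",
    "hawaii": "dbpedia:Hawaii",
}

def link_entities(text):
    """Naive dictionary-based entity linking: scan the text for known
    surface forms and emit structured (mention, uri, offset) records."""
    lowered = text.lower()
    mentions = []
    for surface, uri in KB.items():
        start = lowered.find(surface)
        if start != -1:
            mentions.append({
                "mention": text[start:start + len(surface)],
                "uri": uri,
                "offset": start,
            })
    return sorted(mentions, key=lambda m: m["offset"])

text = "Barack Obama was born in Hawaii."
records = link_entities(text)

# Polystore idea: the same well-formed records feed two different models.
document_store = {r["uri"]: r for r in records}               # document/key-value model
relational_rows = [(r["uri"], r["offset"]) for r in records]  # tabular/relational model
```

Queries can then pick whichever store suits them: point lookups by URI hit the document store, while joins or aggregations run over the relational rows.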


Author(s):  
Wafaa A. Al-Rabayah ◽  
Ahmad Al-Zyoud

Sentiment analysis is the process of determining the polarity (i.e. positive, negative or neutral) of a given text. The vastly increased amount of information available on the web, especially on social media, creates a challenge: it must be retrieved and analyzed in a timely manner. Timely analysis of unstructured data gives businesses a competitive advantage through a better understanding of their customers' needs and preferences. This literature review covers a number of studies on sentiment analysis and examines the connection between sentiment analysis of social network content and customer retention. We focus on sentiment analysis and discuss concepts related to the field, the most important relevant studies and their results, methods of application, where it can be applied, and its business applications. Finally, we discuss how sentiment analysis can improve customer retention based on retrieved data.
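The polarity classification the abstract defines can be sketched with the simplest approach in the literature, a lexicon-based scorer. The word lists below are a tiny illustrative stand-in; practical systems use full lexicons (e.g. VADER) or trained classifiers, and handle negation and intensity, which this sketch ignores.

```python
# Minimal illustrative lexicons (assumed; real lexicons contain thousands of entries).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def polarity(text):
    """Classify a text as positive / negative / neutral by counting
    lexicon hits: score = (#positive words) - (#negative words)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Applied to customer reviews at scale, even a scorer this simple yields a polarity time series per customer segment, which is the kind of signal retention analyses build on.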



Author(s):  
Caterina Paola Venditti ◽  
Paolo Mele

In the era of digital archaeology, the communication of archaeological data, contexts, and work can be enhanced by cloud computing, AI, and other emergent technologies. The authors explore the most recent and effective examples, ranging from intrinsic capabilities of AI, i.e. the capacity to sense, comprehend and act, to their application in communication both among specialists of the archaeological sector and from them to other audiences. The chapter also provides a high-level overview of solutions for extracting knowledge from large volumes of structured and unstructured data, making it available through software applications that perform automated tasks. Archaeologists must be ready to go down into the trenches and communicate their studies with a deep awareness of the opportunities offered by these technologies, and with adequate skills to master them.


2018 ◽  
Vol 7 (4.19) ◽  
pp. 1041
Author(s):  
Santosh V. Chobe ◽  
Dr. Shirish S. Sane

There is an explosive growth of information on the Internet, which makes the extraction of relevant data from various sources a difficult task for its users. Therefore, Information Extraction (IE) systems are needed to transform Web pages into databases. Relevant information in Web documents can be extracted using information extraction techniques and presented in a structured format. By applying these techniques, information can be extracted from structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user's perspective.
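The core transformation IE tools perform — Web page in, database records out — can be shown with a small wrapper-style sketch. The listing format and the pattern below are hypothetical; real IE tools induce or let users define such extraction rules per site, and also handle nested and irregular layouts.

```python
import re

# Hypothetical pattern for a simple semi-structured listing: "Name - $price".
PATTERN = re.compile(r"(?P<name>[\w ]+) - \$(?P<price>\d+\.\d{2})")

def extract(lines):
    """Wrapper-style information extraction: apply the pattern to each
    line of a page and emit database-ready records, skipping the rest."""
    records = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            records.append({
                "name": match.group("name").strip(),
                "price": float(match.group("price")),
            })
    return records

page = ["Blue Mug - $4.99", "about us", "Desk Lamp - $18.50"]
rows = extract(page)
```

The non-matching line is silently dropped, which mirrors how extraction rules separate the relevant content of a page from navigation and boilerplate.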


2018 ◽  
Vol 5 (4) ◽  
pp. 61-73
Author(s):  
Tanushri Banerjee ◽  
Arindam Banerjee

This article evaluates online grocery shopping websites catering to customers primarily in India. The evaluation has been carried out in three parts using RapidMiner. In part A, the authors study the similarity of the content residing on the grocery shopping websites: using unstructured data from the homepages and the keywords specified for the websites, they compute a cosine similarity index among them. In part B, the authors analyse customer reviews from the websites and mine association rules from them; studying the resulting rules, they attempt to identify the attributes that drive customer happiness. In part C, the authors document the web traffic metric parameters (attributes) measured by search engine optimization (SEO) tool websites, and create a correlation matrix to determine the parameters that significantly impact per-day revenue for the websites.
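The cosine similarity index used in part A compares two texts as term-frequency vectors. A minimal sketch follows; the homepage snippets are invented for illustration, and a production pipeline (as in RapidMiner) would add tokenization, stop-word removal and TF-IDF weighting before the cosine step.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts:
    dot(a, b) / (|a| * |b|), in [0, 1] for non-negative counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented homepage keyword snippets for two hypothetical grocery sites.
home_a = "fresh grocery delivery fruits vegetables"
home_b = "grocery delivery fresh fruits online"
score = cosine_similarity(home_a, home_b)
```

Here four of the five terms overlap, so the score is high (0.8); computing this pairwise over all sites yields the similarity matrix the study clusters on.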


Author(s):  
JESÚS CARDEÑOSA ◽  
EDMUNDO TOVAR

Many websites are, in general, poorly structured, and their users are not able to find the information they need. That is why many papers propose techniques for finding the right information for a user, most of which focus on searching the whole Internet. Often, the owner of a website provides incomplete or imprecise information of low usefulness to the user. Restructuring the information is often enough to detect gaps, inconsistencies and imprecisions; however, this is normally very difficult to do without degrading the performance of the website. The authors have developed a novel application that exploits the existing information in a website in a more profitable way, restructuring the information without the intervention of the content provider. This paper describes the authors' experience during their participation in the European Commission ESPRIT 29158 FLEX Project.


Author(s):  
Caio Saraiva Coneglian ◽  
Elvis Fusco

The data available on the Web is growing exponentially, providing information of high added value to organizations. Such information can be spread across diverse sources and varied formats, such as videos and photos on social media. However, unstructured data presents great difficulty for information retrieval and does not efficiently meet the informational needs of users, because of problems in understanding the meaning of documents stored on the Web. In the context of an Information Retrieval architecture, this research aims at the implementation of a semantic extraction agent for the Web that enables the location, treatment and retrieval of information, in a Big Data setting, from the most varied informational sources. This agent serves as the basis for informational environments that support the Information Retrieval process, using an ontology to add semantics to retrieval and to the presentation of results, thus better meeting users' needs.
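One concrete way an ontology adds semantics to retrieval, as the abstract describes, is query expansion: a query for a class also matches documents about its subclasses. The tiny ontology and document set below are invented for illustration and are not from the paper; real systems would use OWL/RDF ontologies and a proper index.

```python
# Tiny illustrative ontology: class -> direct subclasses (assumed, not the paper's).
ONTOLOGY = {
    "media": ["video", "photo"],
    "video": [],
    "photo": [],
}

def expand_query(term, ontology=ONTOLOGY):
    """Ontology-based query expansion: collect the term plus all of its
    subclasses, transitively, so a class query covers its instances."""
    terms, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in terms:
            terms.add(current)
            stack.extend(ontology.get(current, []))
    return terms

# Invented mini document collection.
docs = {1: "conference video recording", 2: "team photo archive", 3: "quarterly report"}

def retrieve(query):
    """Return ids of documents containing the query term or any subclass term."""
    wanted = expand_query(query)
    return [doc_id for doc_id, text in docs.items() if wanted & set(text.split())]
```

A plain keyword search for "media" would return nothing here; with the ontology, documents 1 and 2 match, which is the semantic gain the agent is built to provide.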

