Knowledge Extraction from Unstructured Data on the Web

Author(s):  
Wei Emma Zhang ◽  
Quan Z. Sheng
2020 ◽  
Vol 5 (4) ◽  
pp. 43-55
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Cinzia Daraio ◽  
Antonio Laureti Palma ◽  
Giulio Perani ◽  
...  

Purpose – The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information automatically extracted can potentially be updated more frequently than once per year, and is safe from manipulation or misinterpretation. Moreover, this approach allows flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities.

Design/methodology/approach – Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators has been extracted from the universities' websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL database in a semi-structured form, so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All of the above was used to perform a cluster analysis of 79 Italian universities based on structural and digital indicators.

Findings – The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to them.

Research limitations – The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications – The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value – This work applies to university websites, for the first time, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
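The pipeline described above — scrape a page, store it as a semi-structured document, derive an indicator from it — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the indicator (`link_density`) and the document fields are assumptions chosen for the example, and a real system would persist the documents in an actual NoSQL store rather than an in-memory dict.

```python
import json
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collect simple webometric signals from one HTML page:
    outgoing links and the amount of visible text."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def scrape_to_document(university, html):
    """Turn raw HTML into a semi-structured JSON document, as it
    might be stored in a NoSQL collection for later text mining."""
    parser = PageStats()
    parser.feed(html)
    return {
        "university": university,
        "n_links": len(parser.links),
        "text_chars": parser.text_chars,
        # Hypothetical dissemination indicator: links per 1,000 characters.
        "link_density": round(1000 * len(parser.links) / max(parser.text_chars, 1), 2),
    }

html = ('<html><body><p>Research and teaching.</p>'
        '<a href="/courses">Courses</a><a href="/phd">PhD</a></body></html>')
doc = scrape_to_document("Example University", html)
print(json.dumps(doc, indent=2))
```

Because each page becomes one JSON document, new indicators can later be computed over the stored collection without re-scraping, which is the flexibility the abstract points to.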


Author(s):  
Emrah Inan ◽  
Burak Yonyul ◽  
Fatih Tekbacak

Most of the data on the web is unstructured, and it must be transformed into a machine-operable structure. It is therefore appropriate to convert unstructured data into a structured form according to the requirements, and to store those data in different data models depending on the use case. As requirements and their types increase, no single approach can serve them all, so it is not suitable to use a single storage technology to meet every storage requirement. Managing stores with various types of schemas in a joint and integrated manner is called 'multistore' or 'polystore' in the database literature. In this paper, the Entity Linking task is leveraged to transform texts into well-formed data, and this data is managed in an integrated environment of different data models. Finally, this integrated big data environment is queried and examined to demonstrate the method.
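The two steps the abstract combines — entity linking to turn free text into well-formed records, then routing those records into different data models — can be illustrated with a toy sketch. The dictionary-based linker and the `dbpedia:` identifiers below are illustrative assumptions; real entity linking uses candidate generation and disambiguation models, and the "polystore" here is simulated with two in-memory structures.

```python
# Toy knowledge base: surface form -> canonical entity URI (illustrative only).
KB = {
    "barack obama": "dbpedia:Barack_Obama",
    "hawaii": "dbpedia:Hawaii",
}

def link_entities(text):
    """Naive dictionary-based entity linking: scan the text for known
    surface forms and emit structured (mention, uri, offset) records."""
    lowered = text.lower()
    mentions = []
    for surface, uri in KB.items():
        start = lowered.find(surface)
        if start != -1:
            mentions.append({
                "mention": text[start:start + len(surface)],
                "uri": uri,
                "offset": start,
            })
    return sorted(mentions, key=lambda m: m["offset"])

text = "Barack Obama was born in Hawaii."
records = link_entities(text)

# Polystore idea: the same well-formed records feed two different models.
document_store = {r["uri"]: r for r in records}               # document/key-value model
relational_rows = [(r["uri"], r["offset"]) for r in records]  # tabular/relational model
```

Queries can then pick whichever store suits them: point lookups by URI hit the document store, while joins or aggregations run over the relational rows.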


Author(s):  
Wafaa A. Al-Rabayah ◽  
Ahmad Al-Zyoud

Sentiment analysis is the process of determining the polarity (i.e. positive, negative or neutral) of a given text. The vastly increased amount of information available on the web, especially on social media, creates a challenge: it must be retrieved and analyzed in a timely manner. Timely analysis of unstructured data gives businesses a competitive advantage through a better understanding of their customers' needs and preferences. This literature review covers a number of studies on sentiment analysis and examines the connection between sentiment analysis of social network content and customer retention. We focus on sentiment analysis and discuss concepts related to the field, the most important relevant studies and their results, methods of application, where it can be applied, and its business applications. Finally, we discuss how sentiment analysis can improve customer retention based on retrieved data.
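The polarity classification the abstract defines can be sketched with the simplest approach in the literature, a lexicon-based scorer. The word lists below are a tiny illustrative stand-in; practical systems use full lexicons (e.g. VADER) or trained classifiers, and handle negation and intensity, which this sketch ignores.

```python
# Minimal illustrative lexicons (assumed; real lexicons contain thousands of entries).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def polarity(text):
    """Classify a text as positive / negative / neutral by counting
    lexicon hits: score = (#positive words) - (#negative words)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Applied to customer reviews at scale, even a scorer this simple yields a polarity time series per customer segment, which is the kind of signal retention analyses build on.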



Author(s):  
Caterina Paola Venditti ◽  
Paolo Mele

In the era of digital archaeology, the communication of archaeological data, contexts, and work can be enhanced by cloud computing, AI, and other emergent technologies. The authors explore the most recent and effective examples, ranging from intrinsic capabilities of AI, i.e. the capacity to sense, comprehend and act, to their application in communication both among specialists of the archaeological sector and from them to other audiences. The chapter also provides a high-level overview of solutions for extracting knowledge from large volumes of structured and unstructured data, making it available through software applications that perform automated tasks. Archaeologists must be ready to go down into the trenches and communicate their studies with a deep awareness of the opportunities offered by these technologies, and with adequate skills to master them.


2018 ◽  
Vol 7 (4.19) ◽  
pp. 1041
Author(s):  
Santosh V. Chobe ◽  
Dr. Shirish S. Sane

There is an explosive growth of information on the Internet, which makes the extraction of relevant data from various sources a difficult task for its users. Therefore, Information Extraction (IE) systems are needed to transform Web pages into databases. Relevant information in Web documents can be extracted using information extraction techniques and presented in a structured format. By applying these techniques, information can be extracted from structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user's perspective.
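The core transformation IE tools perform — Web page in, database records out — can be shown with a small wrapper-style sketch. The listing format and the pattern below are hypothetical; real IE tools induce or let users define such extraction rules per site, and also handle nested and irregular layouts.

```python
import re

# Hypothetical pattern for a simple semi-structured listing: "Name - $price".
PATTERN = re.compile(r"(?P<name>[\w ]+) - \$(?P<price>\d+\.\d{2})")

def extract(lines):
    """Wrapper-style information extraction: apply the pattern to each
    line of a page and emit database-ready records, skipping the rest."""
    records = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            records.append({
                "name": match.group("name").strip(),
                "price": float(match.group("price")),
            })
    return records

page = ["Blue Mug - $4.99", "about us", "Desk Lamp - $18.50"]
rows = extract(page)
```

The non-matching line is silently dropped, which mirrors how extraction rules separate the relevant content of a page from navigation and boilerplate.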


2018 ◽  
Vol 5 (4) ◽  
pp. 61-73
Author(s):  
Tanushri Banerjee ◽  
Arindam Banerjee

This article evaluates online grocery shopping websites catering to customers primarily in India. The evaluation has been carried out in three parts using RapidMiner. In part A, the authors study the similarity of the content residing on the grocery shopping websites: using unstructured data from the homepages and the keywords specified for the websites, they compute a cosine similarity index among them. In part B, the authors analyse customer reviews from the websites and mine association rules from them; studying the resulting rules, they attempt to identify the attributes that drive customer happiness. In part C, the authors document the web traffic metric parameters (attributes) measured by search engine optimization (SEO) tool websites, and create a correlation matrix to determine the parameters that significantly impact per-day revenue for the websites.
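The cosine similarity index used in part A compares two texts as term-frequency vectors. A minimal sketch follows; the homepage snippets are invented for illustration, and a production pipeline (as in RapidMiner) would add tokenization, stop-word removal and TF-IDF weighting before the cosine step.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts:
    dot(a, b) / (|a| * |b|), in [0, 1] for non-negative counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented homepage keyword snippets for two hypothetical grocery sites.
home_a = "fresh grocery delivery fruits vegetables"
home_b = "grocery delivery fresh fruits online"
score = cosine_similarity(home_a, home_b)
```

Here four of the five terms overlap, so the score is high (0.8); computing this pairwise over all sites yields the similarity matrix the study clusters on.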


Author(s):  
JESÚS CARDEÑOSA ◽  
EDMUNDO TOVAR

Many websites are, in general, poorly structured, and their users are not able to find the information they need. That is why many papers propose techniques for finding the right information for a user, most of which focus on searching the whole Internet. Often, the owner of a website provides incomplete or imprecise information of low usefulness to the user. Restructuring the information is often enough to detect gaps, inconsistencies and imprecisions; however, this is normally very difficult to do without degrading the performance of the website. The authors have developed a novel application that exploits the existing information in a website in a more profitable way, restructuring the information without the intervention of the content provider. This paper describes the authors' experience during their participation in the European Commission ESPRIT 29158 FLEX Project.


Author(s):  
Caio Saraiva Coneglian ◽  
Elvis Fusco

The data available on the Web is growing exponentially, providing information of high added value to organizations. Such information can be spread across diverse sources and varied formats, such as videos and photos on social media. However, unstructured data presents great difficulty for information retrieval and does not efficiently meet the informational needs of users, because of problems in understanding the meaning of documents stored on the Web. In the context of an Information Retrieval architecture, this research aims at the implementation of a semantic extraction agent for the Web that enables the location, treatment and retrieval of information, in a Big Data setting, from the most varied informational sources. This agent serves as the basis for informational environments that support the Information Retrieval process, using an ontology to add semantics to retrieval and to the presentation of results, thus better meeting users' needs.
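One concrete way an ontology adds semantics to retrieval, as the abstract describes, is query expansion: a query for a class also matches documents about its subclasses. The tiny ontology and document set below are invented for illustration and are not from the paper; real systems would use OWL/RDF ontologies and a proper index.

```python
# Tiny illustrative ontology: class -> direct subclasses (assumed, not the paper's).
ONTOLOGY = {
    "media": ["video", "photo"],
    "video": [],
    "photo": [],
}

def expand_query(term, ontology=ONTOLOGY):
    """Ontology-based query expansion: collect the term plus all of its
    subclasses, transitively, so a class query covers its instances."""
    terms, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in terms:
            terms.add(current)
            stack.extend(ontology.get(current, []))
    return terms

# Invented mini document collection.
docs = {1: "conference video recording", 2: "team photo archive", 3: "quarterly report"}

def retrieve(query):
    """Return ids of documents containing the query term or any subclass term."""
    wanted = expand_query(query)
    return [doc_id for doc_id, text in docs.items() if wanted & set(text.split())]
```

A plain keyword search for "media" would return nothing here; with the ontology, documents 1 and 2 match, which is the semantic gain the agent is built to provide.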

