Inducing Schema.org markup from Natural Language Context

10.29007/fvc9 ◽  
2019 ◽  
Author(s):  
Gautam Kishore Shahi ◽  
Durgesh Nandini ◽  
Sushma Kumari

Schema.org creates, supports, and maintains schemas for structured data on web pages. For a non-technical author, it is difficult to publish content in a structured format. This work presents an automated way of inducing Schema.org markup from the natural language content of web pages by applying knowledge-base creation techniques. Web Data Commons was used as the dataset, and the scope of the experimental part was limited to RDFa. The approach was implemented using the knowledge-graph building techniques Knowledge Vault and KnowMore.
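
The paper gives no code, but the final step of such a pipeline, serializing extracted facts as Schema.org markup, can be sketched roughly as follows. The `facts_to_rdfa` helper and the Book example are hypothetical, and the upstream extraction (in the style of Knowledge Vault and KnowMore) is assumed to have already produced the facts.

```python
# Hypothetical sketch: render facts already extracted from page text as
# RDFa-annotated HTML using the Schema.org vocabulary. The extraction
# itself is assumed to have happened upstream.
from html import escape

def facts_to_rdfa(type_name: str, facts: dict) -> str:
    """Render a flat property->value dict as an RDFa-annotated <div>."""
    lines = [f'<div vocab="https://schema.org/" typeof="{escape(type_name)}">']
    for prop, value in facts.items():
        lines.append(f'  <span property="{escape(prop)}">{escape(str(value))}</span>')
    lines.append("</div>")
    return "\n".join(lines)

print(facts_to_rdfa("Book", {"name": "Moby-Dick", "author": "Herman Melville"}))
```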

Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting the information a user requires from web pages. This information typically consists of semi-structured data rather than data in a structured format, and the extraction targets web documents in HTML. Nowadays, most people use web data extractors because the volume of information involved makes manual extraction slow and complicated. In this paper we present WEIDJ, an approach for extracting images from the web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using the DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes as input both a web address and the extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search over visual blocks in each web page to find images. Our approach targets three levels of extraction: a single web page, multiple web pages, and the whole web site. Extensive experiments on several biodiversity web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ on a single web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
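
As a rough illustration of the DOM-plus-JSON idea (not the authors' WEIDJ implementation), the following Python sketch walks the DOM of a template-based HTML page, collects `<img>` nodes, and emits them as JSON objects; the sample page is invented.

```python
# Illustrative sketch: parse an HTML page, walk the DOM for <img> nodes,
# and emit the harvested images as JSON objects.
import json
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.images.append({"src": attrs.get("src"), "alt": attrs.get("alt", "")})

html_page = '<html><body><img src="orchid.jpg" alt="Orchid"></body></html>'
collector = ImageCollector()
collector.feed(html_page)
print(json.dumps(collector.images, indent=2))
```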


Author(s):  
Heiko Paulheim ◽  
Christian Bizer

Linked Data on the Web is created either from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types to enhance the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither algorithm uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate and scalable. Both algorithms were used in building the DBpedia 3.9 release: with SDType, 3.4 million missing type statements were added, while with SDValidate, 13,000 erroneous RDF statements were removed from the knowledge base.
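
The SDType intuition can be illustrated with a toy example: each property a resource uses votes for likely types, weighted by that property's type distribution in the data itself. The distributions and threshold below are made-up numbers, not figures from the paper.

```python
# Toy illustration of the SDType idea: average per-property type
# distributions to score candidate types for a resource.
from collections import defaultdict

# Hypothetical P(type | subject uses property), as would be learned
# from the knowledge base itself:
type_dist = {
    "dbo:author":    {"dbo:Book": 0.7, "dbo:Film": 0.2, "dbo:Software": 0.1},
    "dbo:publisher": {"dbo:Book": 0.6, "dbo:Software": 0.4},
}

def predict_types(properties, threshold=0.4):
    scores = defaultdict(float)
    for p in properties:
        for t, w in type_dist.get(p, {}).items():
            scores[t] += w / len(properties)   # average the votes
    return {t: s for t, s in scores.items() if s >= threshold}

print(predict_types(["dbo:author", "dbo:publisher"]))
# -> {'dbo:Book': 0.65}
```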


Author(s):  
Ahmed El Azab ◽  
Mahmood A. Mahmood ◽  
Abd El-Aziz

Web usage mining techniques and applications across industries remain exploratory and, despite an increase in academic research, it is still a challenge to analyze the web in a way that quantitatively captures web users' common interests and characterizes their underlying tasks. This chapter addresses the problem of how to support web usage mining techniques and applications across industries by combining the language of web pages with the algorithms used in web data mining. Existing research in web usage mining tends to focus on finding out how each technique can be applied in different industry fields. However, there is little evidence that researchers have approached the issue of web usage mining across industries. Consequently, the aim of this chapter is to provide an overview of how web usage mining techniques and applications across industries can be supported.


2014 ◽  
Vol 1079-1080 ◽  
pp. 601-603
Author(s):  
Dan Yang

The popularity of the network rests on the transmission of information, and with the development of electronic information technology, society's reliance on data continues to deepen. If we want to obtain the information we need from this mass of information, we must mine the Web for it. Previously, web information was encoded in HTML, whose structure is poor, so Web data mining struggled to meet the needs of such search tasks. In this context the XML language emerged: it has good hierarchy and structure, organizes web page information better, and plays a valuable role in data mining, remedying many of the deficiencies of HTML. This paper first introduces XML and Web data mining, and on this basis analyzes XML-based Web data mining applications.
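
A small sketch of the point being made: XML's regular structure lets fields be addressed by path rather than scraped out of presentational markup. The job-listing document below is hypothetical.

```python
# Illustrative sketch: with XML, mining reduces to addressing fields by
# element path instead of parsing presentational HTML.
import xml.etree.ElementTree as ET

xml_page = """
<jobs>
  <job><title>Data Engineer</title><city>Shenzhen</city></job>
  <job><title>Web Miner</title><city>Beijing</city></job>
</jobs>
"""

root = ET.fromstring(xml_page)
for job in root.findall("job"):
    print(job.findtext("title"), "-", job.findtext("city"))
```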


2007 ◽  
Vol 16 (05) ◽  
pp. 793-828 ◽  
Author(s):  
JUAN D. VELÁSQUEZ ◽  
VASILE PALADE

Understanding web users' browsing behaviour in order to adapt a web site to the needs of a particular user is a key issue for many commercial companies that do their business over the Internet. This paper presents the implementation of a Knowledge Base (KB) for building web-based computerized recommender systems. The Knowledge Base consists of a Pattern Repository, which contains patterns extracted from web logs and web pages by applying various web mining tools, and a Rule Repository, which contains rules that describe how the discovered patterns are used to build navigation or web site modification recommendations. The paper also focuses on testing the effectiveness of the proposed online and offline recommendations. An extensive real-world experiment is carried out on the web site of a bank.
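
The two-repository design can be caricatured in a few lines: mined patterns live in one store, and rules in the other decide which pattern triggers which navigation recommendation. Everything below (the pattern shapes, the rule, the bank pages) is our invented illustration, not the authors' system.

```python
# Schematic of the Pattern Repository / Rule Repository split.
patterns = [  # hypothetical patterns mined from web logs
    {"id": "p1", "pages": ["/loans", "/loans/rates"], "support": 0.12},
    {"id": "p2", "pages": ["/cards"], "support": 0.05},
]

rules = [  # each rule: (condition over a pattern, recommendation builder)
    (lambda p: "/loans" in p["pages"] and p["support"] > 0.1,
     lambda p: {"recommend": "/loans/simulator", "because": p["id"]}),
]

def recommendations(session_page):
    for p in patterns:
        if session_page in p["pages"]:
            for cond, build in rules:
                if cond(p):
                    yield build(p)

print(list(recommendations("/loans")))
```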


2019 ◽  
Author(s):  
Lucas van der Deijl ◽  
Antal van den Bosch ◽  
Roel Smeets

Literary history is no longer written in books alone. As literary reception thrives in blogs, Wikipedia entries, Amazon reviews, and Goodreads profiles, the Web has become a key platform for the exchange of information on literature. Although conventional printed media in the field—academic monographs, literary supplements, and magazines—may still claim the highest authority, online media presumably provide the first (and possibly the only) source for many readers casually interested in literary history. Wikipedia offers quick and free answers to readers' questions and the range of topics described in its entries dramatically exceeds the volume any printed encyclopedia could possibly cover. While an important share of this expanding knowledge base about literature is produced bottom-up (user based and crowd-sourced), search engines such as Google have become brokers in this online economy of knowledge, organizing information on the Web for its users. Similar to the printed literary histories, search engines prioritize certain information sources over others when ranking and sorting Web pages; as such, their search algorithms create hierarchies of books, authors, and periods.


2019 ◽  
Vol 1 (3) ◽  
pp. 238-270 ◽  
Author(s):  
Lei Ji ◽  
Yujing Wang ◽  
Botian Shi ◽  
Dawei Zhang ◽  
Zhongyuan Wang ◽  
...  

Knowledge is important for text-related applications. In this paper, we introduce Microsoft Concept Graph, a knowledge graph engine that provides concept tagging APIs to facilitate the understanding of human languages. Microsoft Concept Graph is built upon Probase, a universal probabilistic taxonomy consisting of instances and concepts mined from the Web. We start by introducing the construction of the knowledge graph through iterative semantic extraction and taxonomy construction procedures, which extract 2.7 million concepts from 1.68 billion Web pages. We then use conceptualization models to represent text in the concept space to empower text-related applications, such as topic search, query recommendation, Web table understanding and Ads relevance. Since its release in 2016, Microsoft Concept Graph has received more than 100,000 pageviews, 2 million API calls and 3,000 registered downloads from 50,000 visitors over 64 countries.
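
The conceptualization step can be illustrated in the spirit of Probase: map a surface term into concept space via P(concept | instance), estimated from instance-concept co-occurrence counts mined from patterns such as "C such as I". The counts below are illustrative, not values from Microsoft Concept Graph.

```python
# Toy conceptualization: estimate P(concept | instance) from hypothetical
# co-occurrence counts and use it to place a term in concept space.
cooccurrence = {
    "python": {"programming language": 900, "snake": 100},
    "apple":  {"company": 600, "fruit": 400},
}

def conceptualize(instance):
    counts = cooccurrence.get(instance.lower(), {})
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

print(conceptualize("Python"))
# -> {'programming language': 0.9, 'snake': 0.1}
```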


Author(s):  
Louis Massey ◽  
Wilson Wong

This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will always have limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector-space and ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification that deviates radically from current techniques used for document classification, text clustering, and concept discovery is proposed. This approach is inspired by human cognition, which allows ‘meaning’ to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift allows for the exploitation rather than avoidance of dependence between terms to derive meaning without the complexity introduced by conventional natural language processing techniques. Using the unstructured texts in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies that are required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.
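
One possible reading of the activation-and-decay mechanism, sketched with invented term associations (our illustration, not the chapter's code): incoming terms activate related terms retrieved from the Web, all activation decays at each step, and whatever survives suggests the topic.

```python
# Bare-bones activation-and-decay sketch: terms boost themselves and
# their (hypothetical) Web-derived associates, then everything decays.
DECAY = 0.8

related = {  # hypothetical term associations harvested from web pages
    "neuron": ["brain", "network"],
    "brain": ["cognition"],
}

activation = {}
for term in ["neuron", "brain", "neuron"]:
    # decay everything, then boost the incoming term and its associates
    activation = {t: a * DECAY for t, a in activation.items()}
    for t in [term] + related.get(term, []):
        activation[t] = activation.get(t, 0.0) + 1.0

topic = sorted(activation, key=activation.get, reverse=True)
print(topic[:3])
```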


2012 ◽  
Vol 5 (2) ◽  
pp. 63
Author(s):  
Dewi Wisnu Wardani

Current research on question answering usually draws answers only from unstructured text resources, such as collections of news articles or web pages. According to our observations of Yahoo!Answer, users sometimes ask complex natural language questions that contain both structured and unstructured features. Generally, answering such complex questions requires considering not only unstructured but also structured resources. In this work, we propose a new idea for improving the accuracy of answers to complex questions by recognizing the structured and unstructured features of questions and integrating both kinds of resources on the Web. Our framework consists of three parts: Question Analysis, Resource Discovery, and Analysis of the Relevant Answer. In Question Analysis we make a few assumptions and try to find the structured and unstructured features of the questions. In Resource Discovery we integrate structured data (a relational database) and unstructured data (web pages) to take advantage of both kinds of data in reaching the correct answers. In the Relevant Answer part we find the best top fragments from the context of the relevant web pages, then compute a matching score between the results from the structured and the unstructured data, and finally use a QA template to reformulate the questions.
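
The score-matching step can be sketched as a simple weighted combination of candidate answers from the structured and unstructured sources; the question, the candidates, the scores, and the `alpha` weight below are all invented for illustration.

```python
# Simplified score matching: agreement between structured (DB) and
# unstructured (web fragment) candidates boosts the final answer score.
structured = {"Who founded Kompas?": [("Jakob Oetama", 0.9)]}
unstructured = {"Who founded Kompas?": [("Jakob Oetama", 0.6), ("P. K. Ojong", 0.5)]}

def answer(question, alpha=0.5):
    scores = {}
    for cand, s in structured.get(question, []):
        scores[cand] = scores.get(cand, 0.0) + alpha * s
    for cand, s in unstructured.get(question, []):
        scores[cand] = scores.get(cand, 0.0) + (1 - alpha) * s
    return max(scores, key=scores.get) if scores else None

print(answer("Who founded Kompas?"))   # -> Jakob Oetama
```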


Author(s):  
David Karger

The evolving Web has seen ever-growing use of structured data, thanks to the way it enhances information authoring, querying, visualization and sharing. To date, however, most structured data authoring and management tools have been oriented towards programmers and Web developers. End users have been left behind, unable to leverage structured data for information management and communication as well as professionals. In this paper, I will argue that many of the benefits of structured data management can be provided to end users as well. I will describe an approach and tools that allow end users to define their own schemas (without knowing what a schema is), manage data and author (not program) interactive Web visualizations of that data using the Web tools with which they are already familiar, such as plain Web pages, blogs, wikis and WYSIWYG document editors. I will describe our experience deploying these tools and some lessons relevant to their future evolution.

