Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

2006 ◽  
Vol 32 (3) ◽  
pp. 295-340 ◽  
Author(s):  
Christoph Ringlstetter ◽  
Klaus U. Schulz ◽  
Stoyan Mihov

Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, one has to face the problem that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
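A minimal sketch of the corpus-cleaning idea: estimate a page's orthographic error rate with a dictionary lookup and drop pages above a threshold. The tokenizer, word list, and threshold below are illustrative assumptions, not the authors' actual detection method.

```python
import re

def error_rate(text, dictionary):
    """Fraction of alphabetic tokens not found in the dictionary."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in dictionary)
    return unknown / len(tokens)

def filter_pages(pages, dictionary, threshold=0.05):
    """Keep pages whose estimated orthographic error rate is below the threshold."""
    return [p for p in pages if error_rate(p, dictionary) < threshold]

# Illustrative usage with a toy dictionary and a loose threshold.
dictionary = {"the", "web", "is", "a", "large", "corpus", "of", "text"}
pages = ["The Web is a large corpus of text", "Teh Wbe si a lrage crpous"]
print(filter_pages(pages, dictionary, threshold=0.3))
```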

2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

The volume of information available on the Web is growing significantly day by day. Information on the Web comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, where it is semi-structured, and the information required for a given context is often scattered across different web documents. It is difficult to analyze such large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various sources on the Web and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
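A toy sketch of how the three stages of such a framework could be chained, assuming the `requests` and `beautifulsoup4` packages; the seed URL is hypothetical and the keyword-count step merely stands in for the data-mining and reporting stage.

```python
from collections import Counter

import requests
from bs4 import BeautifulSoup

def crawl(urls):
    """Fetch raw HTML for each URL (crawling stage)."""
    return {url: requests.get(url, timeout=10).text for url in urls}

def extract_text(html):
    """Strip tags and return visible text (information-extraction stage)."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

def analyse(texts):
    """Very small stand-in for the mining stage: term frequencies across pages."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in text.split() if w.isalpha())
    return counts.most_common(20)

if __name__ == "__main__":
    pages = crawl(["https://example.com"])  # hypothetical seed URL
    report = analyse(extract_text(html) for html in pages.values())
    print(report)
```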


2018 ◽  
Vol 7 (4.19) ◽  
pp. 1041
Author(s):  
Santosh V. Chobe ◽  
Dr. Shirish S. Sane

There is an explosive growth of information on the Internet, which makes extraction of relevant data from various sources a difficult task for its users. Therefore, to transform web pages into databases, Information Extraction (IE) systems are needed. Relevant information in Web documents can be extracted using information extraction and presented in a structured format. By applying information extraction techniques, information can be extracted from structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user's perspective.


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting user-required information from web pages. This information consists of semi-structured data rather than data in a structured format, and extraction involves web documents in HTML format. Nowadays, most people use web data extractors because the amount of information involved makes manual extraction time-consuming and complicated. In this paper we present WEIDJ, an approach to extract images from the Web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies the DOM to build the structure of a page and uses JSON as the programming environment. The extraction process takes as input both a web address and an extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and a whole web site. Extensive experiments on several biodiversity web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ for a single web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
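A simplified illustration of the DOM-plus-JSON idea, assuming the `beautifulsoup4` package as the DOM parser: collect every `<img>` node together with its nearest enclosing block and serialize the result as JSON. This is not the authors' WEIDJ algorithm, only a sketch of the general approach.

```python
import json

from bs4 import BeautifulSoup

def extract_images(html):
    """Collect every <img> with the id of its nearest enclosing <div> block as JSON."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for img in soup.find_all("img"):
        block = img.find_parent("div")  # nearest "visual block" in this sketch
        records.append({
            "block_id": block.get("id") if block else None,
            "src": img.get("src"),
            "alt": img.get("alt", ""),
        })
    return json.dumps(records, indent=2)

# Illustrative template-like fragment.
html = '<div id="gallery"><img src="orchid.jpg" alt="orchid"></div><div><p>no image</p></div>'
print(extract_images(html))
```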


Author(s):  
Louis Massey ◽  
Wilson Wong

This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will always have limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector space and ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification that deviates radically from current techniques used for document classification, text clustering, and concept discovery is proposed. This approach, inspired by human cognition, allows 'meaning' to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift allows dependence between terms to be exploited rather than avoided in deriving meaning, without the complexity introduced by conventional natural language processing techniques. Using the unstructured text in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.


Author(s):  
ALI SELAMAT ◽  
ZHI SAM LEE ◽  
MOHD AIZAINI MAAROF ◽  
SITI MARIYAM SHAMSUDDIN

In this paper, an improved web page classification method (IWPCM) that uses neural networks to identify the illicit contents of web pages is proposed. The IWPCM approach is based on improved feature selection for web pages using class-based feature vectors (CPBF). The CPBF feature selection approach assigns higher weight to terms that are important for illicit web documents and reduces the dependency on the weights of less important terms from normal web documents. The IWPCM approach has been examined using the modified term-weighting scheme and compared with several traditional term-weighting schemes on non-illicit and illicit web contents available from the web. Precision, recall, and the F1 measure have been used to evaluate the effectiveness of the proposed approach. The experimental results show that the proposed improved term-weighting scheme is able to identify the non-illicit and illicit web contents in the experimental datasets.
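A hedged sketch of class-conditioned term weighting: terms that occur far more often in the illicit class than in the normal class receive higher weights. The ratio-based score and smoothing constant below are illustrative, not the paper's CPBF formula.

```python
from collections import Counter

def class_term_weights(illicit_docs, normal_docs, smoothing=1.0):
    """Weight each term by its smoothed frequency ratio between the two classes."""
    illicit = Counter(w for d in illicit_docs for w in d.lower().split())
    normal = Counter(w for d in normal_docs for w in d.lower().split())
    vocab = set(illicit) | set(normal)
    return {
        term: (illicit[term] + smoothing) / (normal[term] + smoothing)
        for term in vocab
    }

# Toy usage: terms tied to the illicit class float to the top.
weights = class_term_weights(
    ["buy illegal goods here", "illegal goods cheap"],
    ["weather report for today", "cheap flights today"],
)
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:5])
```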


Author(s):  
Hammad Majeed ◽  
Firoza Erum

The Internet is growing fast, with millions of web pages containing information on every topic. The data placed on the Internet is not organized, which makes the search process difficult. Classifying web pages into predefined classes can improve the organization of this data. In this chapter, a semantics-based technique is presented to classify a text corpus with high accuracy. The technique uses some well-known pre-processing techniques such as word stemming, term frequency, and degree of uniqueness. In addition, a new semantic similarity measure is computed between different terms. The authors believe that semantic-similarity-based comparison, in addition to syntactic matching, makes the classification process significantly more accurate. The proposed technique is tested on a benchmark dataset and the results are compared with already published results. The obtained results are significantly better, and are achieved using a small, highly relevant feature set.
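As an illustration of comparing terms semantically rather than by exact string match, the sketch below uses WordNet path similarity from NLTK (requires the `nltk` package and its `wordnet` corpus); the chapter's own similarity measure may differ.

```python
from nltk.corpus import wordnet as wn

def semantic_similarity(term_a, term_b):
    """Best path similarity over all noun senses of the two terms (0.0 if unknown)."""
    synsets_a = wn.synsets(term_a, pos=wn.NOUN)
    synsets_b = wn.synsets(term_b, pos=wn.NOUN)
    scores = [a.path_similarity(b) for a in synsets_a for b in synsets_b]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

print(semantic_similarity("car", "automobile"))  # identical concepts -> 1.0
print(semantic_similarity("car", "banana"))      # unrelated terms -> low score
```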


2016 ◽  
Vol 10 (04) ◽  
pp. 503-525
Author(s):  
Mehdi Allahyari ◽  
Krys Kochut

The volume of documents and online resources on the Web has been increasing significantly for many years, and effectively organizing this huge amount of information has become a challenging problem. Tagging is a mechanism for aggregating information and a great step towards the Semantic Web vision. Tagging aims to organize, summarize, share, and search Web resources in an effective way. One important problem facing tagging systems is automatically determining the most appropriate tags for Web documents. In this paper, we propose a probabilistic topic model that incorporates DBpedia knowledge for tagging Web pages and online documents with the topics discovered in them. Our method is based on the integration of the DBpedia hierarchical category network with statistical topic models, where DBpedia categories are treated as topics. We have conducted extensive experiments on two different datasets to demonstrate the effectiveness of our method.
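A toy sketch of the tagging idea: score a document against term sets attached to DBpedia-style categories and return the best-scoring categories as tags. The paper integrates the DBpedia category hierarchy into a probabilistic topic model; the overlap score and the category term sets below are purely illustrative assumptions.

```python
def tag_document(text, category_terms, top_k=2):
    """Rank categories by the fraction of their terms that appear in the document."""
    tokens = set(text.lower().split())
    scores = {
        category: len(tokens & terms) / (len(terms) or 1)
        for category, terms in category_terms.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical category-to-term mapping standing in for DBpedia categories.
category_terms = {
    "Machine_learning": {"model", "training", "classifier", "features"},
    "Databases": {"query", "index", "sql", "transaction"},
}
print(tag_document("We train a classifier on labelled features", category_terms))
```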


2020 ◽  
Author(s):  
Muralidhar Pantula ◽  
K S Kuppusamy

Evaluating the readability of web documents has gained attention due to several factors, such as improving the effectiveness of writing and reaching a wider spectrum of audience. Current practice in this direction relies on several statistical measures to evaluate the readability of a document. In this paper, we propose a machine learning-based model to compute the readability of web pages. The minimum educational standard required (grade level) to understand the contents of a web page is also computed. The proposed model classifies web pages as highly readable, readable, or less readable using a specified feature set. To classify a web page into the aforementioned categories, we incorporate features such as sentence count, word count, syllable count, type-token ratio, and lexical ambiguity. To increase the usability of the proposed model, we have developed an accessible browser extension that assesses every web page loaded into the browser.
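A sketch of the feature set named above (sentence count, word count, syllable count, type-token ratio), with the standard Flesch-Kincaid grade level added as a familiar statistical baseline; the paper itself feeds such features into a machine-learning classifier, and the syllable counter here is only a rough heuristic.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    features = {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "syllable_count": syllables,
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }
    # Flesch-Kincaid grade level, included as an illustrative baseline only.
    features["fk_grade"] = (
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59
    )
    return features

print(readability_features("Readability matters. Short sentences help readers."))
```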


2011 ◽  
Vol 403-408 ◽  
pp. 1008-1013 ◽  
Author(s):  
Divya Ragatha Venkata ◽  
Deepika Kulshreshtha

In this paper, we put forward a technique for keeping web pages up-to-date, later used by a search engine to serve end-user queries. A major part of the Web is dynamic, and hence a need arises to constantly update the changed web documents in a search engine's repository. We use a client-server architecture for crawling the Web and propose a technique for detecting changes in a web page based on the content of any images present in the web document. Once an image embedded in the web document is identified as changed, the previous copy of the web document in the search engine's database/repository is replaced with the changed one.
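A minimal sketch of image-based change detection: hash the bytes of every image referenced by a page and compare the resulting set of hashes against the copy stored in the repository. The fetching helpers assume the `requests` and `beautifulsoup4` packages and are hypothetical; they are not the paper's client-server crawler.

```python
import hashlib
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def image_fingerprints(page_url):
    """Return a set of SHA-256 hashes, one per image embedded in the page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    fingerprints = set()
    for img in soup.find_all("img", src=True):
        data = requests.get(urljoin(page_url, img["src"]), timeout=10).content
        fingerprints.add(hashlib.sha256(data).hexdigest())
    return fingerprints

def page_changed(page_url, stored_fingerprints):
    """True if any embedded image differs from the stored snapshot."""
    return image_fingerprints(page_url) != set(stored_fingerprints)
```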


Author(s):  
Marek Reformat ◽  
Ronald R. Yager ◽  
Zhan Li

The concept of the Semantic Web (Berners-Lee, 2001) introduces a new form of knowledge representation: an ontology. An ontology is a partially ordered set of words and concepts of a specific domain, and allows for defining different kinds of relationships existing among concepts. Such an approach promises the formation of an environment where information is easily accessible and understandable for any system, application, and/or human. A hierarchy of concepts (Yager, 2000) is a different and very interesting form of knowledge representation. The graph-like structure of the hierarchy provides a user with a suitable tool for identifying a variety of associations among concepts. These associations express the user's perceptions of relations among concepts, and lead to representing definitions of concepts in a human-like way. The Internet has become an overwhelming repository of documents. This enormous store of information will only be used effectively when users are equipped with systems capable of finding related documents quickly and correctly. The proposed work addresses that issue. It offers an approach that combines a hierarchy of concepts with ontologies for the task of identifying web documents in the environment of the Semantic Web. A user provides a simple query in the form of a hierarchy that only partially “describes” the documents (s)he wants to retrieve from the web. The hierarchy is treated as a “seed” representing the user's initial knowledge about the concepts covered by the required documents. Ontologies are treated as supplementary knowledge bases: they are used to instantiate the hierarchy with concrete information, as well as to enhance it with new concepts initially unknown to the user. The proposed approach is used to design a prototype system for document identification in the web environment. The description of the system and the results of preliminary experiments are presented.
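An illustrative sketch of the retrieval idea under stated assumptions: a user-supplied concept hierarchy is expanded with related concepts taken from an ontology, and documents are ranked by how many of the resulting concepts they mention. The hierarchy, the ontology table, and the documents below are all hypothetical.

```python
def expand_hierarchy(hierarchy, ontology):
    """Flatten the hierarchy and add related concepts found in the ontology."""
    concepts = set()
    stack = [hierarchy]
    while stack:
        node = stack.pop()
        for concept, children in node.items():
            concepts.add(concept)
            concepts.update(ontology.get(concept, []))
            stack.append(children)
    return concepts

def rank_documents(documents, concepts):
    """Order documents by the number of concepts they mention."""
    scores = {
        doc_id: sum(1 for c in concepts if c in text.lower())
        for doc_id, text in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

hierarchy = {"travel": {"lodging": {}, "transport": {}}}
ontology = {"lodging": ["hotel", "hostel"], "transport": ["flight", "train"]}
documents = {"d1": "Cheap hotel and train tickets", "d2": "Local election results"}
print(rank_documents(documents, expand_hierarchy(hierarchy, ontology)))
```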

