Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

2006 ◽  
Vol 32 (3) ◽  
pp. 295-340 ◽  
Author(s):  
Christoph Ringlstetter ◽  
Klaus U. Schulz ◽  
Stoyan Mihov

Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, one has to face the problem that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
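A minimal sketch of the corpus-cleaning idea: estimate a page's orthographic error rate with a dictionary lookup and drop pages above a threshold. The tokenizer, word list, and threshold below are illustrative assumptions, not the authors' actual detection method.

```python
import re

def error_rate(text, dictionary):
    """Fraction of alphabetic tokens not found in the dictionary."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in dictionary)
    return unknown / len(tokens)

def filter_pages(pages, dictionary, threshold=0.05):
    """Keep pages whose estimated orthographic error rate is below the threshold."""
    return [p for p in pages if error_rate(p, dictionary) < threshold]

# Illustrative usage with a toy dictionary and a loose threshold.
dictionary = {"the", "web", "is", "a", "large", "corpus", "of", "text"}
pages = ["The Web is a large corpus of text", "Teh Wbe si a lrage crpous"]
print(filter_pages(pages, dictionary, threshold=0.3))
```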

2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

The volume of information available on the Web is growing significantly day by day. Information on the Web comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, where it is semi-structured, and the information required for a given context is often scattered across different web documents. It is difficult to analyze such large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various sources on the Web and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
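A toy sketch of how the three stages of such a framework could be chained, assuming the `requests` and `beautifulsoup4` packages; the seed URL is hypothetical and the keyword-count step merely stands in for the data-mining and reporting stage.

```python
from collections import Counter

import requests
from bs4 import BeautifulSoup

def crawl(urls):
    """Fetch raw HTML for each URL (crawling stage)."""
    return {url: requests.get(url, timeout=10).text for url in urls}

def extract_text(html):
    """Strip tags and return visible text (information-extraction stage)."""
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

def analyse(texts):
    """Very small stand-in for the mining stage: term frequencies across pages."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in text.split() if w.isalpha())
    return counts.most_common(20)

if __name__ == "__main__":
    pages = crawl(["https://example.com"])  # hypothetical seed URL
    report = analyse(extract_text(html) for html in pages.values())
    print(report)
```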


2018 ◽  
Vol 7 (4.19) ◽  
pp. 1041
Author(s):  
Santosh V. Chobe ◽  
Dr. Shirish S. Sane

There is an explosive growth of information on the Internet, which makes extraction of relevant data from various sources a difficult task for its users. Therefore, to transform web pages into databases, Information Extraction (IE) systems are needed. Relevant information in Web documents can be extracted using information extraction and presented in a structured format. By applying information extraction techniques, information can be extracted from structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user's perspective.


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting user-required information from web pages. This information consists of semi-structured data rather than data in a structured format, and extraction involves web documents in HTML format. Nowadays, most people use web data extractors because the amount of information involved makes manual extraction time-consuming and complicated. In this paper we present WEIDJ, an approach to extract images from the Web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies the DOM to build the structure of a page and uses JSON as the programming environment. The extraction process takes as input both a web address and an extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and a whole web site. Extensive experiments on several biodiversity web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ for a single web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
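A simplified illustration of the DOM-plus-JSON idea, assuming the `beautifulsoup4` package as the DOM parser: collect every `<img>` node together with its nearest enclosing block and serialize the result as JSON. This is not the authors' WEIDJ algorithm, only a sketch of the general approach.

```python
import json

from bs4 import BeautifulSoup

def extract_images(html):
    """Collect every <img> with the id of its nearest enclosing <div> block as JSON."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for img in soup.find_all("img"):
        block = img.find_parent("div")  # nearest "visual block" in this sketch
        records.append({
            "block_id": block.get("id") if block else None,
            "src": img.get("src"),
            "alt": img.get("alt", ""),
        })
    return json.dumps(records, indent=2)

# Illustrative template-like fragment.
html = '<div id="gallery"><img src="orchid.jpg" alt="orchid"></div><div><p>no image</p></div>'
print(extract_images(html))
```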


Author(s):  
Louis Massey ◽  
Wilson Wong

This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will always have limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector space and ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification that deviates radically from current techniques used for document classification, text clustering, and concept discovery is proposed. This approach, inspired by human cognition, allows 'meaning' to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift allows dependence between terms to be exploited rather than avoided in deriving meaning, without the complexity introduced by conventional natural language processing techniques. Using the unstructured text in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.


Author(s):  
ALI SELAMAT ◽  
ZHI SAM LEE ◽  
MOHD AIZAINI MAAROF ◽  
SITI MARIYAM SHAMSUDDIN

In this paper, an improved web page classification method (IWPCM) that uses neural networks to identify the illicit contents of web pages is proposed. The IWPCM approach is based on improved feature selection for web pages using class-based feature vectors (CPBF). The CPBF feature selection approach assigns higher weight to terms that are important for illicit web documents and reduces the dependency on the weights of less important terms from normal web documents. The IWPCM approach has been examined using the modified term-weighting scheme and compared with several traditional term-weighting schemes on non-illicit and illicit web contents available from the web. Precision, recall, and the F1 measure have been used to evaluate the effectiveness of the proposed approach. The experimental results show that the proposed improved term-weighting scheme is able to identify the non-illicit and illicit web contents in the experimental datasets.
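A hedged sketch of class-conditioned term weighting: terms that occur far more often in the illicit class than in the normal class receive higher weights. The ratio-based score and smoothing constant below are illustrative, not the paper's CPBF formula.

```python
from collections import Counter

def class_term_weights(illicit_docs, normal_docs, smoothing=1.0):
    """Weight each term by its smoothed frequency ratio between the two classes."""
    illicit = Counter(w for d in illicit_docs for w in d.lower().split())
    normal = Counter(w for d in normal_docs for w in d.lower().split())
    vocab = set(illicit) | set(normal)
    return {
        term: (illicit[term] + smoothing) / (normal[term] + smoothing)
        for term in vocab
    }

# Toy usage: terms tied to the illicit class float to the top.
weights = class_term_weights(
    ["buy illegal goods here", "illegal goods cheap"],
    ["weather report for today", "cheap flights today"],
)
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:5])
```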


Author(s):  
Hammad Majeed ◽  
Firoza Erum

The Internet is growing fast, with millions of web pages containing information on every topic. The data placed on the Internet is not organized, which makes the search process difficult. Classifying web pages into predefined classes can improve the organization of this data. In this chapter, a semantics-based technique is presented to classify a text corpus with high accuracy. The technique uses some well-known pre-processing techniques such as word stemming, term frequency, and degree of uniqueness. In addition, a new semantic similarity measure is computed between different terms. The authors believe that semantic-similarity-based comparison, in addition to syntactic matching, makes the classification process significantly more accurate. The proposed technique is tested on a benchmark dataset and the results are compared with already published results. The obtained results are significantly better, and are achieved using a small, highly relevant feature set.
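As an illustration of comparing terms semantically rather than by exact string match, the sketch below uses WordNet path similarity from NLTK (requires the `nltk` package and its `wordnet` corpus); the chapter's own similarity measure may differ.

```python
from nltk.corpus import wordnet as wn

def semantic_similarity(term_a, term_b):
    """Best path similarity over all noun senses of the two terms (0.0 if unknown)."""
    synsets_a = wn.synsets(term_a, pos=wn.NOUN)
    synsets_b = wn.synsets(term_b, pos=wn.NOUN)
    scores = [a.path_similarity(b) for a in synsets_a for b in synsets_b]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

print(semantic_similarity("car", "automobile"))  # identical concepts -> 1.0
print(semantic_similarity("car", "banana"))      # unrelated terms -> low score
```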


2016 ◽  
Vol 10 (04) ◽  
pp. 503-525
Author(s):  
Mehdi Allahyari ◽  
Krys Kochut

The volume of documents and online resources on the Web has been increasing significantly for many years, and effectively organizing this huge amount of information has become a challenging problem. Tagging is a mechanism for aggregating information and a great step towards the Semantic Web vision. Tagging aims to organize, summarize, share, and search Web resources in an effective way. One important problem facing tagging systems is automatically determining the most appropriate tags for Web documents. In this paper, we propose a probabilistic topic model that incorporates DBpedia knowledge for tagging Web pages and online documents with the topics discovered in them. Our method is based on the integration of the DBpedia hierarchical category network with statistical topic models, where DBpedia categories are treated as topics. We have conducted extensive experiments on two different datasets to demonstrate the effectiveness of our method.
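A toy sketch of the tagging idea: score a document against term sets attached to DBpedia-style categories and return the best-scoring categories as tags. The paper integrates the DBpedia category hierarchy into a probabilistic topic model; the overlap score and the category term sets below are purely illustrative assumptions.

```python
def tag_document(text, category_terms, top_k=2):
    """Rank categories by the fraction of their terms that appear in the document."""
    tokens = set(text.lower().split())
    scores = {
        category: len(tokens & terms) / (len(terms) or 1)
        for category, terms in category_terms.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical category-to-term mapping standing in for DBpedia categories.
category_terms = {
    "Machine_learning": {"model", "training", "classifier", "features"},
    "Databases": {"query", "index", "sql", "transaction"},
}
print(tag_document("We train a classifier on labelled features", category_terms))
```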


2020 ◽  
Author(s):  
Muralidhar Pantula ◽  
K S Kuppusamy

Evaluating the readability of web documents has gained attention due to several factors, such as improving the effectiveness of writing and reaching a wider spectrum of audience. Current practice in this direction relies on several statistical measures to evaluate the readability of a document. In this paper, we propose a machine learning-based model to compute the readability of web pages. The minimum educational standard required (grade level) to understand the contents of a web page is also computed. The proposed model classifies web pages as highly readable, readable, or less readable using a specified feature set. To classify a web page into the aforementioned categories, we incorporate features such as sentence count, word count, syllable count, type-token ratio, and lexical ambiguity. To increase the usability of the proposed model, we have developed an accessible browser extension that assesses every web page loaded into the browser.
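A sketch of the feature set named above (sentence count, word count, syllable count, type-token ratio), with the standard Flesch-Kincaid grade level added as a familiar statistical baseline; the paper itself feeds such features into a machine-learning classifier, and the syllable counter here is only a rough heuristic.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    features = {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "syllable_count": syllables,
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }
    # Flesch-Kincaid grade level, included as an illustrative baseline only.
    features["fk_grade"] = (
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59
    )
    return features

print(readability_features("Readability matters. Short sentences help readers."))
```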


2011 ◽  
Vol 403-408 ◽  
pp. 1008-1013 ◽  
Author(s):  
Divya Ragatha Venkata ◽  
Deepika Kulshreshtha

In this paper, we put forward a technique for keeping web pages up-to-date, later used by a search engine to serve end-user queries. A major part of the Web is dynamic, and hence a need arises to constantly update the changed web documents in a search engine's repository. We use a client-server architecture for crawling the Web and propose a technique for detecting changes in a web page based on the content of any images present in the web document. Once an image embedded in the web document is identified as changed, the previous copy of the web document in the search engine's database/repository is replaced with the changed one.
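A minimal sketch of image-based change detection: hash the bytes of every image referenced by a page and compare the resulting set of hashes against the copy stored in the repository. The fetching helpers assume the `requests` and `beautifulsoup4` packages and are hypothetical; they are not the paper's client-server crawler.

```python
import hashlib
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def image_fingerprints(page_url):
    """Return a set of SHA-256 hashes, one per image embedded in the page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    fingerprints = set()
    for img in soup.find_all("img", src=True):
        data = requests.get(urljoin(page_url, img["src"]), timeout=10).content
        fingerprints.add(hashlib.sha256(data).hexdigest())
    return fingerprints

def page_changed(page_url, stored_fingerprints):
    """True if any embedded image differs from the stored snapshot."""
    return image_fingerprints(page_url) != set(stored_fingerprints)
```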


Author(s):  
Marek Reformat ◽  
Ronald R. Yager ◽  
Zhan Li

The concept of the Semantic Web (Berners-Lee, 2001) introduces a new form of knowledge representation: an ontology. An ontology is a partially ordered set of words and concepts of a specific domain, and allows for defining different kinds of relationships existing among concepts. Such an approach promises the formation of an environment where information is easily accessible and understandable for any system, application, and/or human. A hierarchy of concepts (Yager, 2000) is a different and very interesting form of knowledge representation. The graph-like structure of the hierarchy provides a user with a suitable tool for identifying a variety of associations among concepts. These associations express the user's perceptions of relations among concepts, and lead to representing definitions of concepts in a human-like way. The Internet has become an overwhelming repository of documents. This enormous store of information will only be used effectively when users are equipped with systems capable of finding related documents quickly and correctly. The proposed work addresses that issue. It offers an approach that combines a hierarchy of concepts with ontologies for the task of identifying web documents in the environment of the Semantic Web. A user provides a simple query in the form of a hierarchy that only partially “describes” the documents (s)he wants to retrieve from the web. The hierarchy is treated as a “seed” representing the user's initial knowledge about the concepts covered by the required documents. Ontologies are treated as supplementary knowledge bases: they are used to instantiate the hierarchy with concrete information, as well as to enhance it with new concepts initially unknown to the user. The proposed approach is used to design a prototype system for document identification in the web environment. The description of the system and the results of preliminary experiments are presented.
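An illustrative sketch of the retrieval idea under stated assumptions: a user-supplied concept hierarchy is expanded with related concepts taken from an ontology, and documents are ranked by how many of the resulting concepts they mention. The hierarchy, the ontology table, and the documents below are all hypothetical.

```python
def expand_hierarchy(hierarchy, ontology):
    """Flatten the hierarchy and add related concepts found in the ontology."""
    concepts = set()
    stack = [hierarchy]
    while stack:
        node = stack.pop()
        for concept, children in node.items():
            concepts.add(concept)
            concepts.update(ontology.get(concept, []))
            stack.append(children)
    return concepts

def rank_documents(documents, concepts):
    """Order documents by the number of concepts they mention."""
    scores = {
        doc_id: sum(1 for c in concepts if c in text.lower())
        for doc_id, text in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

hierarchy = {"travel": {"lodging": {}, "transport": {}}}
ontology = {"lodging": ["hotel", "hostel"], "transport": ["flight", "train"]}
documents = {"d1": "Cheap hotel and train tickets", "d2": "Local election results"}
print(rank_documents(documents, expand_hierarchy(hierarchy, ontology)))
```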

