Phrase-based text representation for managing the Web documents

Author(s):  
R. Sharma ◽  
S. Raman


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web is growing significantly. Web information comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, which are semi-structured, yet the information required for a given context is often scattered across different web documents. It is difficult to analyze large volumes of semi-structured information presented in web pages and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to analyze the extracted data effectively. The proposed framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few example applications. The framework has been implemented and tested for effectiveness, and the results are promising.
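As a concrete illustration of the crawl, extract, consolidate, and analyze stages such a framework chains together, here is a minimal Python sketch. The library choices (requests, BeautifulSoup, pandas) and all function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a crawl -> extract -> consolidate -> report pipeline.
# Libraries and names are illustrative, not the paper's actual system.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def crawl(urls):
    """Fetch raw HTML for each seed URL."""
    for url in urls:
        resp = requests.get(url, timeout=10)
        if resp.ok:
            yield url, resp.text

def extract(url, html):
    """Pull semi-structured fields (here: title and headings) from a page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    return {
        "url": url,
        "title": title,
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])],
    }

def consolidate_and_report(records):
    """Merge per-page records into one table and summarize it."""
    df = pd.DataFrame(records)
    df["n_headings"] = df["headings"].str.len()
    return df.sort_values("n_headings", ascending=False)

if __name__ == "__main__":
    seeds = ["https://example.com"]  # hypothetical seed list
    records = [extract(u, html) for u, html in crawl(seeds)]
    print(consolidate_and_report(records))
```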


2009 ◽  
Vol 54 (1) ◽  
pp. 181-188 ◽  
Author(s):  
Tayebeh Mosavi Miangah

Abstract: In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
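The parallel concordancing operation described above reduces to a lookup over aligned sentence pairs. The following is a minimal Python sketch under that assumption; the two-pair corpus and all names are invented for illustration.

```python
# Minimal parallel concordancer over an aligned English-Persian corpus.
# The aligned_pairs data and names are illustrative assumptions.
import re

# Hypothetical aligned corpus: (English sentence, Persian sentence) pairs.
aligned_pairs = [
    ("The book is on the table.", "کتاب روی میز است."),
    ("She reads a book every week.", "او هر هفته یک کتاب می‌خواند."),
]

def concordance(query, pairs, source=0):
    """Return every aligned pair whose source-side sentence contains `query`."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [pair for pair in pairs if pattern.search(pair[source])]

for en, fa in concordance("book", aligned_pairs):
    print(f"{en}  <->  {fa}")
```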


Author(s):  
Barbara Catania ◽  
Elena Ferrari

The Web is characterized by a huge number of very heterogeneous data sources that differ both in media support and in format representation. In this scenario, there is a need for an integrated approach to querying heterogeneous Web documents. To this end, XML can play an important role since it is becoming a standard for data representation and exchange over the Web. Due to its flexibility, XML is currently being used as an interface language over the Web, by which (part of) document sources are represented and exported. Under this assumption, the problem of querying heterogeneous sources can be reduced to the problem of querying XML data sources. In this chapter, we first survey the most relevant query languages for XML data proposed both by the scientific community and by standardization committees, e.g., the W3C, mainly focusing on their expressive power. Then, we investigate how typical Information Retrieval concepts, such as ranking, similarity-based search, and profile-based search, can be applied to XML query languages. Commercial products based on the considered approaches are then briefly surveyed. Finally, we conclude the chapter by providing an overview of the most promising research trends in the field.
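To make the core selection mechanism shared by these XML query languages concrete, here is a brief Python sketch that evaluates an XPath expression over an XML document with lxml. The catalog document and element names are invented for illustration.

```python
# Querying an XML data source with XPath via lxml.
# The document and element names are illustrative only.
from lxml import etree

xml = b"""
<catalog>
  <book year="2001"><title>XML Basics</title><price>29.95</price></book>
  <book year="2005"><title>Querying the Web</title><price>49.50</price></book>
</catalog>
"""

root = etree.fromstring(xml)
# Select titles of books published after 2002.
for title in root.xpath("//book[@year > 2002]/title/text()"):
    print(title)
```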


2018 ◽  
Vol 7 (4.19) ◽  
pp. 1041
Author(s):  
Santosh V. Chobe ◽  
Dr. Shirish S. Sane

There is an explosive growth of information on the Internet, which makes extraction of relevant data from various sources a difficult task for its users. Therefore, to transform Web pages into databases, Information Extraction (IE) systems are needed. Relevant information in Web documents can be extracted using information extraction and presented in a structured format. By applying information extraction techniques, information can be extracted from structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools. The advantages and limitations of the tools are discussed from a user's perspective.
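As a small illustration of the task these IE tools automate, the Python sketch below turns a semi-structured HTML listing into database-ready records. The HTML snippet and field names are invented; this is not any of the surveyed tools.

```python
# Turning a semi-structured HTML listing into structured records.
# The HTML snippet and field names are illustrative assumptions.
from bs4 import BeautifulSoup

html = """
<ul id="products">
  <li><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
records = [
    {
        "name": li.find("span", class_="name").get_text(strip=True),
        "price": float(li.find("span", class_="price").get_text(strip=True)),
    }
    for li in soup.select("#products li")
]
print(records)  # [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 24.5}]
```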


2018 ◽  
Vol 52 (2) ◽  
pp. 266-277 ◽  
Author(s):  
Hyo-Jung Oh ◽  
Dong-Hyun Won ◽  
Chonghyuck Kim ◽  
Sung-Hee Park ◽  
Yong Kim

Purpose
The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.

Design/methodology/approach
This study proposes and develops an algorithm that collects web information as if the web crawler were gathering static webpages, by managing script commands as links. The proposed web crawler was tested experimentally by using the algorithm to collect deep webpages.

Findings
Among the findings of this study is that when the crawling process encounters search results served as script pages, a conventional crawler collects only the first page. The proposed algorithm, however, can collect the deep webpages in this case.

Research limitations/implications
To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.

Practical implications
The deep web is estimated to contain 450 to 550 times more information than the surface web, and its documents are difficult to collect. This algorithm helps enable deep web collection through script runs.

Originality/value
This study presents a new method that uses script links instead of the keyword-based approaches adopted previously, handling a script link like an ordinary URL. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
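The paper's crawler launches scripts through Microsoft Visual Studio's web browser object. As a rough functional analogue of the script-as-link idea, here is a Python sketch using Selenium instead; the stand-in library, the entry URL, and the selectors are all assumptions, not the authors' implementation.

```python
# Script-as-link crawling, sketched with Selenium as a stand-in for the
# paper's Visual Studio web browser object. Selectors are illustrative.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/search")  # hypothetical deep-web entry page

# Collect anchors whose href is a script command rather than a static URL.
script_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.TAG_NAME, "a")
    if (a.get_attribute("href") or "").startswith("javascript:")
]

pages = []
for js in script_links:
    # Simplified: assumes each script is runnable from the currently loaded page.
    driver.execute_script(js[len("javascript:"):])  # run the script as if it were a link
    pages.append(driver.page_source)                # capture the dynamically generated page

driver.quit()
print(f"collected {len(pages)} dynamically generated pages")
```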


2012 ◽  
Vol 001 (001) ◽  
pp. 5-7
Author(s):  
L. Rajesh ◽  
V. Shanthi ◽  
E. Manigandan ◽  
...  

The rapid and widespread diffusion of data and information over the web has produced a highly dispersed, enormous volume of natural-language textual resources. Great interest has therefore developed in discovering, sharing, and retrieving this vast source of knowledge. For this purpose, processing enormous data volumes in a reasonable time frame is a significant challenge and a vital necessity in numerous commercial and research fields. Computer clusters, distributed systems, and parallel computing paradigms have been increasingly applied in recent years, since they bring significant improvements in computing performance in data-intensive contexts such as Big Data mining and analysis. NLP is one of the significant techniques for text annotation and feature extraction in application areas with high computational requirements; such tasks can therefore benefit from parallel architectures. This study presents a distributed framework for crawling web documents and running NLP tasks in a parallel fashion. The system is built on the Apache Hadoop environment and its programming paradigm, MapReduce. Validation is performed by extracting keywords and key phrases from the web documents in a multi-node Hadoop cluster. The results of the proposed work show increased storage capacity, increased data-processing speed, reduced user search time, and accurate retrieval of content from the large dataset stored in HBase.
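To make the MapReduce formulation concrete, here is a minimal Hadoop Streaming mapper/reducer pair in Python in the spirit of the keyword-extraction step above. The tokenization, stopword list, and file names are illustrative assumptions, not the paper's pipeline.

```python
# mapper.py - tokenize crawled text and emit candidate keywords with count 1.
# Tokenization and stopwords are illustrative assumptions.
import re
import sys

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

for line in sys.stdin:
    for token in re.findall(r"[a-z]+", line.lower()):
        if token not in STOPWORDS and len(token) > 2:
            print(f"{token}\t1")
```

The reducer then sums the counts per keyword, relying on Hadoop's sort phase to group identical keys:

```python
# reducer.py - sums the counts for each keyword emitted by the mapper.
# Run with, e.g. (paths hypothetical):
#   hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#       -mapper mapper.py -reducer reducer.py \
#       -input crawled_docs -output keyword_counts
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```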


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting user-required information from web pages. The information consists of semi-structured data rather than data in a structured format, and the extraction involves web documents in HTML. Nowadays, most people use web data extractors because the extraction involves large amounts of information, which makes manual information extraction time-consuming and complicated. We present in this paper the WEIDJ approach to extracting images from the web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes as input both the web address and the extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and the whole website. Extensive experiments on several biodiversity web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ on a single web page. The experimental results show that, with our model, WEIDJ image extraction can be performed quickly and effectively.
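The following Python sketch illustrates the output side of this idea: walking the DOM of a template-based HTML page, collecting every image node, and emitting the result as JSON. The plain depth-first traversal here is far simpler than WEIDJ's subtree splitting and visual-block search, so treat it as an illustration only; the HTML snippet is invented.

```python
# Simplified WEIDJ-style harvesting: collect every <img> node in the DOM
# and emit the results as JSON records. The HTML snippet is illustrative.
import json
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="gallery">
    <img src="orchid.jpg" alt="Orchid"/>
    <img src="fern.jpg" alt="Fern"/>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
images = [
    {"src": img.get("src"), "alt": img.get("alt", "")}
    for img in soup.find_all("img")
]
print(json.dumps({"page": "example.html", "images": images}, indent=2))
```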


2014 ◽  
Vol 13 (04) ◽  
pp. A01 ◽  
Author(s):  
Christian Oltra ◽  
Ana Delicado ◽  
Ana Prades ◽  
Sergio Pereira ◽  
Luisa Schmidt

The Internet is increasingly considered a legitimate source of information on scientific and technological topics. Lay individuals are increasingly using Internet sources to find information about new technological developments, but scientific communities might have a limited understanding of the nature of this content. In this paper we examine the nature of the content of information about fusion energy on the Internet. By means of a content and thematic analysis of a sample of English-, Spanish- and Portuguese-language web documents, we analyze the structural characteristics of the websites, characterize the presentation of nuclear fusion, and study the associations with nuclear fission and the main benefits and risks associated with fusion technologies on the Web. Our findings indicate that the information about fusion on the Internet is produced by a variety of actors (including private users via blogs), that almost half of the sample provided relevant technical information about nuclear fusion, that the majority of the web documents provided a positive portrayal of fusion energy (as a clean, safe and powerful energy technology), and that nuclear fusion was generally presented as a potential solution to world energy problems, as a key scientific challenge and as a superior alternative to nuclear fission. We discuss the results in terms of the role of the Internet in science communication.


2021 ◽  
Vol 5 (1) ◽  
pp. 45-56
Author(s):  
Poonam Chahal ◽  
Manjeet Singh

In today's era, with the huge amount of dynamic information available on the World Wide Web (WWW), it is difficult for users to retrieve relevant information. One of the techniques used in information retrieval is clustering, after which the web documents are ranked to provide users with the information matching their query. In this paper, the semantic similarity score of Semantic Web documents is computed using a semantic-based similarity feature that combines latent semantic analysis (LSA) and latent relational analysis (LRA). LSA and LRA help to determine the relevant concepts and the relationships between them, which in turn correspond to words and the relationships between those words. The extracted interrelated concepts are represented by a graph that captures the semantic content of the web document. From this graph representation of each document, the HCS clustering algorithm extracts the most highly connected subgraphs to construct the clusters, following an information-theoretic approach. The web documents in the resulting clusters are then ranked using the TextRank method in combination with the proposed method. The experimental analysis uses the benchmark OpinRank dataset. The performance of the approach on ranking web documents using semantic-based clustering shows promising results.
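As a condensed illustration of the LSA half of this similarity feature, the Python sketch below projects documents into a latent semantic space and compares them by cosine similarity. LRA, graph construction, HCS clustering, and TextRank are omitted, and the three-document corpus (echoing OpinRank-style reviews) is invented.

```python
# LSA similarity sketch: TF-IDF -> truncated SVD -> cosine similarity.
# The corpus is illustrative; LRA, HCS and TextRank are not shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The hotel room was clean and the staff were friendly.",
    "Friendly staff and a very clean room made the stay pleasant.",
    "The car's engine performance exceeded expectations.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(cosine_similarity(lsa))  # high similarity expected for the two hotel reviews
```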

