An Algorithm of Scene Information Collection in General Football Matches Based on Web Documents

2021, Vol. 2021, pp. 1-11
Author(s): Bin Li, Ting Zhang

To collect the scene information of ordinary football matches more comprehensively, an algorithm for collecting football match scene information based on web documents is proposed. The commonly used T-graph web crawler model is employed to collect sample nodes for a specific topic within the football match scene information, and the edge document information of that topic is then collected after the crawling stage. Using a feature-item extraction algorithm based on semantic analysis, the feature items of the football match scene information are extracted according to their similarity to form a web document. By constructing a complex network and introducing the local contribution and overlap coefficient of a community-discovery feature selection algorithm, the features of the web document are selected to realize the collection of football match scene information. Experimental results show that the algorithm has a high topic-collection capability and low computational cost, its average balanced accuracy remains around 98%, and it quantifies web crawlers and communities effectively.
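
As a rough illustration of the pipeline described above, the following Python sketch shows only the feature-item extraction and graph-based feature selection stages, using TF-IDF weights as a stand-in for the paper's semantic similarity and a simple weighted-degree score in place of its exact local-contribution and overlap-coefficient formulas (all thresholds are assumptions):

# Sketch of the feature-item extraction -> graph-based feature selection
# stages; the scoring is a simplified stand-in, not the paper's formulas.
import itertools
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_feature_items(docs, top_k=20):
    """Keep the top_k highest TF-IDF terms of each document as feature items."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
    tfidf = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()
    feature_items = set()
    for row in tfidf:
        dense = row.toarray().ravel()
        top = dense.argsort()[::-1][:top_k]
        feature_items.update(terms[i] for i in top if dense[i] > 0)
    return sorted(feature_items)

def build_cooccurrence_graph(docs, feature_items):
    """Nodes are feature items; edge weights count document co-occurrences."""
    g = nx.Graph()
    g.add_nodes_from(feature_items)
    for doc in docs:
        present = [t for t in feature_items if t in doc.lower()]
        for a, b in itertools.combinations(present, 2):
            weight = g.get_edge_data(a, b, default={"weight": 0})["weight"]
            g.add_edge(a, b, weight=weight + 1)
    return g

def select_features(g, min_share=0.01):
    """Crude stand-in for the community-based selection: keep nodes whose share
    of total edge weight (a rough 'local contribution') exceeds a threshold;
    the paper's overlap coefficient is not reproduced here."""
    total = sum(d["weight"] for _, _, d in g.edges(data=True)) or 1
    return [n for n in g.nodes
            if sum(d["weight"] for _, _, d in g.edges(n, data=True)) / total >= min_share]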

2018, Vol. 52 (2), pp. 266-277
Author(s): Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park, Yong Kim

Purpose: The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.

Design/methodology/approach: This study proposes and develops an algorithm that collects web information as if the web crawler were gathering static webpages, by managing script commands as links. The proposed web crawler is used to test the algorithm by collecting deep webpages.

Findings: Among the findings of this study is that when the crawling process returns search results as script pages, a conventional crawler collects only the first page, whereas the proposed algorithm can collect the deep webpages in this case.

Research limitations/implications: To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script or if the web document contains script errors.

Practical implications: The deep web is estimated to contain 450 to 550 times more information than the surface web, and its documents are difficult to collect. This algorithm helps enable deep web collection through script execution.

Originality/value: This study presents a new method that utilizes script links instead of the keywords adopted in previous work; the proposed algorithm handles them like ordinary URLs. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
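
The core idea of treating script commands as links can be sketched as follows; this illustration uses Selenium's headless Chrome driver as a stand-in for the Visual Studio WebBrowser object mentioned above, and the pagination function goPage() is a hypothetical script exposed by the target site:

# Illustrative only: drives a headless browser so that script-generated result
# pages can be collected like static pages. The start URL and the goPage()
# pagination script are placeholders, not a real site's interface.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def crawl_script_pages(start_url, max_pages=5):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    pages = []
    try:
        driver.get(start_url)
        for page_no in range(1, max_pages + 1):
            pages.append(driver.page_source)          # store the rendered page
            # Treat the site's pagination script as if it were a link.
            driver.execute_script(f"goPage({page_no + 1});")
            time.sleep(2)                              # crude wait for rendering
    finally:
        driver.quit()
    return pages

# Usage with a hypothetical deep-web search form:
# html_pages = crawl_script_pages("https://example.com/search?q=term")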


2012, Vol. 001 (001), pp. 5-7
Author(s): L. Rajesh, V. Shanthi, E. Manigandan, ...

2021, Vol. 5 (1), pp. 45-56
Author(s): Poonam Chahal, Manjeet Singh

In today's era, with the huge amount of dynamic information available on the World Wide Web (WWW), it is difficult for the user to retrieve or search the relevant information. One technique used in information retrieval is clustering, after which the web documents are ranked to provide the user with information matching their query. In this paper, a semantic similarity score for Semantic Web documents is computed using a semantic similarity feature that combines latent semantic analysis (LSA) and latent relational analysis (LRA). LSA and LRA help determine the relevant concepts and the relationships between them, which in turn correspond to the words and the relationships between those words. The extracted interrelated concepts are represented by a graph that captures the semantic content of the web document. From this graph representation of each document, the HCS clustering algorithm is used to extract the most highly connected subgraphs and construct the clusters, with the number of clusters determined according to an information-theoretic approach. The web documents in the resulting clusters are ranked using the TextRank method in combination with the proposed method. The experimental analysis is performed on the benchmark OpinRank dataset. The performance of the approach for ranking web documents using semantic-based clustering shows promising results.
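
A simplified sketch of this pipeline is shown below; it covers only the LSA document vectors, the similarity graph, a crude graph-based clustering (connected components standing in for the HCS procedure) and a PageRank-style ranking, while the LRA step and the information-theoretic choice of cluster count are omitted:

# Simplified sketch: LSA document vectors, a similarity graph, graph-based
# clusters, and a PageRank-style ranking. The LRA step and the exact HCS
# (highly connected subgraphs) procedure are not reproduced here.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_vectors(docs, dims=100):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    k = min(dims, tfidf.shape[1] - 1)
    return TruncatedSVD(n_components=k).fit_transform(tfidf)

def similarity_graph(vectors, threshold=0.4):
    sims = cosine_similarity(vectors)
    g = nx.Graph()
    g.add_nodes_from(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if sims[i, j] >= threshold:
                g.add_edge(i, j, weight=float(sims[i, j]))
    return g

def cluster_and_rank(docs):
    g = similarity_graph(lsa_vectors(docs))
    scores = nx.pagerank(g, weight="weight")      # TextRank-style scoring
    clusters = list(nx.connected_components(g))   # crude stand-in for HCS
    return [sorted(c, key=lambda d: scores[d], reverse=True) for c in clusters]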


Author(s): Sang Thanh Thi Nguyen, Tuan Thanh Nguyen

With the rapid advancement of ICT, the World Wide Web (referred to as the Web) has become the biggest information repository, and its volume keeps growing on a daily basis. The challenge is how to find the most wanted information on the Web with minimum effort. This paper presents a novel ontology-based framework for finding the web pages related to a given term within a few specific websites. With this framework, a web crawler first learns the content of the web pages within the given websites; the topic modeller then finds the relations between web pages and topics via keywords found on the web pages, using the Latent Dirichlet Allocation (LDA) technique. After that, the ontology builder establishes an ontology, a semantic network of web pages based on the topic model. Finally, a reasoner can find the web pages related to a given term by making use of the ontology. The framework and the related modelling techniques have been verified using a few test websites, and the results demonstrate its superiority over existing web search tools.
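
The topic-modelling step of such a framework might look roughly like the following sketch, which uses scikit-learn's LDA implementation on already-crawled page texts; crawling, ontology construction and reasoning are not shown, and the parameters are assumptions:

# Sketch of the topic-modelling step only: relate crawled page texts to latent
# topics with LDA. Crawling, ontology construction and reasoning are omitted.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def page_topic_model(page_texts, n_topics=10):
    vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
    counts = vectorizer.fit_transform(page_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)              # page -> topic distribution
    terms = vectorizer.get_feature_names_out()
    topic_terms = [
        [terms[i] for i in comp.argsort()[::-1][:10]]  # top words per topic
        for comp in lda.components_
    ]
    return doc_topic, topic_terms

# doc_topic[i] could then be used to link page i to its dominant topics when
# building the semantic network of pages and topics.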


Classical Web search engines focus on satisfying the information need of users by retrieving relevant Web documents corresponding to the user query. A Web document contains information on different Web objects such as authors, automobiles, political parties, etc. A user may access a Web document to procure information about one specific Web object, in which case the remaining Web object information [2-6] is redundant for that user. If the Web document is significantly large and the user's information requirement is a small fraction of it, the user has to invest effort in locating the required information inside the document. It would be much more convenient if the user were provided with only the required Web object information located inside the Web documents. Web object search engines provide this facility through vertical search on Web objects. The main goal considered in this paper is to extract the object information present in different documents and integrate it into an object repository over which a Web object search facility is built.
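
A minimal sketch of the extraction-and-repository idea follows, assuming a hypothetical HTML structure for one kind of Web object ("author" records with placeholder CSS classes); it is meant only to illustrate storing object attributes instead of whole documents:

# Illustrative object extraction: pull one kind of Web object (hypothetical
# "author" records) out of full documents and keep only those attributes in a
# small repository that a vertical object search can query.
from bs4 import BeautifulSoup

def extract_author_objects(html):
    """The CSS classes used here are placeholders for a real site's markup."""
    soup = BeautifulSoup(html, "html.parser")
    objects = []
    for block in soup.select("div.author"):            # hypothetical selector
        name = block.select_one(".name")
        affiliation = block.select_one(".affiliation")
        objects.append({
            "name": name.get_text(strip=True) if name else "",
            "affiliation": affiliation.get_text(strip=True) if affiliation else "",
        })
    return objects

def build_object_repository(documents):
    repo = {}
    for doc_id, html in documents.items():
        for obj in extract_author_objects(html):
            repo.setdefault(obj["name"], []).append({"doc": doc_id, **obj})
    return repo   # queried directly, instead of returning whole documents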


With the rapid growth of web documents on the WWW, it is becoming difficult to organize, analyze and present these documents efficiently. For a given query, web search engines return many documents to the user, some of which are relevant to the topic and some irrelevant. Web search is usually performed using only features extracted from the web page text, yet HTML tags with particular meanings have been found to improve the efficiency of information retrieval systems. However, organizing documents in a way that improves search without additional cost or complexity is still a great challenge. Clustering can play an important role in organizing such a large number of documents into several groups, but due to the limitations of existing clustering techniques, researchers have begun using meta-heuristic algorithms for the document clustering problem. In this paper, we present a document clustering method that uses HTML tags and meta-heuristic approaches: a hybrid PSO+ACO+K-means algorithm is used to cluster the documents. The results of the proposed approach are analyzed on the WebKB dataset.
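
A partial sketch of such a method is given below: it weights terms according to the HTML tag they appear in and then clusters the documents with plain K-means; the PSO and ACO stages of the hybrid algorithm are not reproduced, and the tag weights are assumptions:

# Partial sketch: weight terms by the HTML tag they appear in, then cluster
# with K-means. The PSO and ACO stages of the hybrid algorithm are omitted.
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

TAG_WEIGHTS = {"title": 4, "h1": 3, "h2": 2, "p": 1}   # assumed tag weights

def tag_weighted_text(html):
    """Repeat text from important tags so TF-IDF reflects the tag weights."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for tag, weight in TAG_WEIGHTS.items():
        for element in soup.find_all(tag):
            parts.extend([element.get_text(" ", strip=True)] * weight)
    return " ".join(parts)

def cluster_documents(html_docs, n_clusters=5):
    texts = [tag_weighted_text(h) for h in html_docs]
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(vectors)    # cluster label per document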


Author(s): George E. Tsekouras, Damianos Gavalas

This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a focused web crawler to download web documents relevant to culture. The focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area, filtering the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the cultural-related document vectors and for each cultural theme, we use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application which analyzes hundreds of web pages spanning different cultural thematic areas.
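
The vectorisation and clustering stages might be sketched as follows, with an assumed cultural vocabulary and an arbitrary hit threshold standing in for the paper's thematic term lists and filtering rule; the focused crawler itself is not shown:

# Sketch of the vectorisation and clustering stages: count occurrences of an
# assumed cultural vocabulary per document, drop documents with too little
# cultural content, then cluster the remaining document vectors.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

CULTURAL_TERMS = ["museum", "painting", "sculpture", "theatre", "folklore"]  # assumed

def cultural_vector(text):
    words = text.lower().split()
    return np.array([words.count(term) for term in CULTURAL_TERMS])

def cluster_cultural_documents(texts, min_hits=3, n_clusters=3):
    vectors = [cultural_vector(t) for t in texts]
    kept = [(i, v) for i, v in enumerate(vectors) if v.sum() >= min_hits]
    if len(kept) < n_clusters:
        return {}
    ids, mat = zip(*kept)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(np.vstack(mat))
    return dict(zip(ids, labels))     # document index -> cluster label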


Author(s): Masashi Ito, Tomohiro Ohno, Shigeki Matsubara, ...

Accumulating spoken documents on the web is of great significance to the knowledge society. However, because of the high redundancy of spontaneous speech, a faithfully transcribed text is not readable in an Internet browser and is therefore not suitable as a web document. This paper proposes a technique for converting spoken documents into web documents for the purpose of building a speech archiving system. The technique automatically edits transcribed texts and improves their readability in the browser. Readable text can be generated by applying technologies such as paraphrasing, segmentation, and structuring to the transcribed texts. Editing experiments using lecture data demonstrated the feasibility of the technique, and a prototype spoken document archiving system was implemented to confirm its effectiveness.
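
As a toy illustration of the editing step, the sketch below removes common spoken-language fillers, segments the transcript into sentences and wraps the result in simple HTML structure; the filler list and segmentation rule are naive placeholders for the paraphrasing, segmentation and structuring technology described above:

# Toy illustration: strip typical spoken-language fillers, split the transcript
# into sentences, and wrap the result in minimal HTML structure.
import re

FILLERS = {"um", "uh", "er", "you know", "i mean"}   # assumed English fillers

def edit_transcript(transcript):
    text = transcript.lower()
    for filler in FILLERS:
        text = re.sub(rf"\b{re.escape(filler)}\b[,]?\s*", "", text)
    sentences = [s.strip().capitalize()
                 for s in re.split(r"[.?!]\s+", text) if s.strip()]
    paragraphs = "\n".join(f"<p>{s}.</p>" for s in sentences)
    return f"<article>\n{paragraphs}\n</article>"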


2011, Vol. 403-408, pp. 1008-1013
Author(s): Divya Ragatha Venkata, Deepika Kulshreshtha

In this paper, we put forward a technique for keeping web pages up to date in the repository later used by a search engine to serve end-user queries. A major part of the Web is dynamic, and hence a need arises to constantly update the changed web documents in the search engine's repository. We use a client-server architecture for crawling the web and propose a technique for detecting changes in a web page based on the content of any images present in the web document. Once it is identified that an image embedded in the web document has changed, the previous copy of the web document in the search engine's database/repository is replaced with the changed one.
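
A minimal sketch of the image-based change check follows, assuming the stored copy of each page is kept in a simple in-memory dictionary; it hashes the images referenced by a page and replaces the stored copy whenever any hash differs:

# Sketch of the image-based change check: hash the images referenced by a page
# and re-store the page whenever any hash differs from the previous crawl.
import hashlib
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def image_hashes(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    hashes = {}
    for img in soup.find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])
        data = requests.get(img_url, timeout=10).content
        hashes[img_url] = hashlib.md5(data).hexdigest()
    return html, hashes

def refresh_if_images_changed(page_url, repository):
    """repository maps page_url -> {'html': ..., 'hashes': ...}."""
    html, hashes = image_hashes(page_url)
    stored = repository.get(page_url)
    if stored is None or stored["hashes"] != hashes:
        repository[page_url] = {"html": html, "hashes": hashes}  # replace old copy
        return True
    return False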


Corpora, 2015, Vol. 10 (1), pp. 11-45
Author(s): Douglas Biber, Jesse Egbert, Mark Davies

One major challenge for Web-As-Corpus research is that a typical Web search provides little information about the register of the documents that are searched. Previous research has attempted to address this problem (e.g., through the Automatic Genre Identification initiative), but with only limited success. As a result, we currently know surprisingly little about the distribution of registers on the web. In this study, we tackle this problem through a bottom-up, user-based investigation of a large, representative corpus of web documents. We base our investigation on a much larger corpus than those used in previous research (48,571 web documents), obtained through random sampling from across the full range of documents that are publicly available on the searchable web. Instead of relying on individual expert coders, we recruit typical end-users of the Web for register coding, with each document in the corpus coded by four different raters. End-users identify basic situational characteristics of each web document, coded in a hierarchical manner. Those situational characteristics lead to general register categories, which eventually lead to lists of specific sub-registers. By working through a hierarchical decision tree, users are able to identify the register category of most Internet texts with a high degree of reliability. After summarising our methodological approach, this paper documents the register composition of the searchable web. Narrative registers are found to be the most prevalent, while Opinion and Informational Description/Explanation registers are also found to be extremely common. One of the major innovations of the approach adopted here is that it permits an empirical identification of 'hybrid' documents, which integrate characteristics from multiple general register categories (e.g., opinionated-narrative). These patterns are described and illustrated through sample Internet documents.
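
As a small illustration of how the four raters' codes per document might be combined, the sketch below takes the majority general register label when one exists and otherwise flags the document as a hybrid of the tied categories; the labels and the tie rule are assumptions, not the study's exact procedure:

# Toy aggregation of four end-user register codes per document: majority label
# when one exists, otherwise a 'hybrid' tag over the tied categories.
from collections import Counter

def aggregate_register(codes):
    """codes: list of four general register labels from four raters."""
    counts = Counter(codes)
    top_count = max(counts.values())
    top_labels = sorted(label for label, c in counts.items() if c == top_count)
    if len(top_labels) == 1:
        return top_labels[0]
    return "hybrid: " + "-".join(top_labels)

# aggregate_register(["Narrative", "Narrative", "Opinion", "Opinion"])
# -> 'hybrid: Narrative-Opinion'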

