WEBCAP

Author(s):  
Habib Sami ◽  
Safar Maytham

In many web applications, such as distance learning, the frequency of refreshing multimedia web documents places a heavy burden on WWW resources. Moreover, the updated web documents may encounter inordinate delays, which make it difficult to retrieve them in time. Here, we present an Internet tool called WEBCAP that can schedule the retrieval of multimedia web documents in time while accounting for the workloads on WWW resources by applying capacity-planning techniques. We model a multimedia web document as a 4-level hierarchy (object, operation, timing, and precedence). The transformations between levels are performed automatically, followed by the application of the Bellman-Ford algorithm on the precedence graph to schedule all operations (fetch, transmit, process, and render) while satisfying the in-time retrieval and resource-workload constraints. Our results demonstrate how effective WEBCAP is in scheduling the refreshing of multimedia web documents.
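The scheduling step can be illustrated with a small sketch. Below is a minimal, assumed example (not the authors' implementation) of Bellman-Ford-style relaxation over a precedence graph whose edges carry the durations of fetch, transmit, process, and render operations; the node names, durations, and deadline are hypothetical.

# Minimal sketch, not the WEBCAP implementation: longest-path (critical-path) relaxation
# over a precedence graph of retrieval operations, checked against a retrieval deadline.
def bellman_ford_schedule(edges, source, num_passes):
    """edges: list of (u, v, duration) meaning v may start only after u finishes."""
    earliest = {source: 0.0}
    for _ in range(num_passes):          # |V| - 1 passes suffice for an acyclic precedence graph
        for u, v, duration in edges:
            if u in earliest:
                candidate = earliest[u] + duration
                if candidate > earliest.get(v, float("-inf")):
                    earliest[v] = candidate   # longest-path relaxation = earliest finish under precedence
    return earliest

# Hypothetical precedence graph: fetch -> transmit -> process -> render for two objects.
edges = [
    ("start", "fetch_img", 0.0), ("fetch_img", "transmit_img", 0.12),
    ("transmit_img", "process_img", 0.30), ("process_img", "render_img", 0.05),
    ("start", "fetch_txt", 0.0), ("fetch_txt", "transmit_txt", 0.04),
    ("transmit_txt", "process_txt", 0.10), ("process_txt", "render_txt", 0.02),
]
finish_times = bellman_ford_schedule(edges, "start", num_passes=len(edges))
deadline = 0.5  # seconds; assumed in-time retrieval constraint
print({k: v for k, v in finish_times.items() if k.startswith("render")},
      "deadline met:",
      all(t <= deadline for k, t in finish_times.items() if k.startswith("render")))

Because the precedence graph is acyclic, the relaxation converges after at most |V| - 1 passes, and the resulting earliest finish times can be compared against the in-time retrieval deadline.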

2008 ◽  
Vol 8 (3) ◽  
pp. 363-392 ◽  
Author(s):  
JAN WIELEMAKER ◽  
ZHISHENG HUANG ◽  
LOURENS VAN DER MEIJ

Prolog is an excellent tool for representing and manipulating data written in formal languages as well as natural language. Its safe semantics and automatic memory management make it a prime candidate for programming robust Web services. Although Prolog is commonly seen as a component in a Web application that is either embedded or communicates using a proprietary protocol, we propose an architecture where Prolog communicates with other components in a Web application using the standard HTTP protocol. By avoiding embedding in external Web servers, development and deployment become much easier. To support this architecture, in addition to the transfer protocol, we must also support parsing, representing and generating the key Web document types such as HTML, XML and RDF. This article motivates the design decisions in the libraries and extensions to Prolog for handling Web documents and protocols. The design has been guided by the requirement to handle large documents efficiently. The described libraries support a wide range of Web applications ranging from HTML and XML documents to Semantic Web RDF processing. The benefits of using Prolog for Web-related tasks are illustrated using three case studies.
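As a small illustration of the architecture (not part of the described libraries), the sketch below shows a non-Prolog component talking to such a Prolog web service over plain HTTP from Python; the /solve endpoint and the JSON reply shape are assumptions made for the example.

# Minimal client-side sketch: any HTTP-speaking component can talk to a Prolog process
# that exposes its predicates as web handlers. The endpoint URL and reply format below
# are assumptions, not part of the SWI-Prolog HTTP libraries.
import json
import urllib.request

def query_prolog_service(base_url, goal):
    """POST a goal to a hypothetical Prolog HTTP handler and decode its JSON reply."""
    payload = json.dumps({"goal": goal}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/solve",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example use against a locally running Prolog web server (assumed address and API):
# bindings = query_prolog_service("http://localhost:8080", "member(X, [a,b,c])")
# print(bindings)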


2018 ◽  
Vol 52 (2) ◽  
pp. 266-277 ◽  
Author(s):  
Hyo-Jung Oh ◽  
Dong-Hyun Won ◽  
Chonghyuck Kim ◽  
Sung-Hee Park ◽  
Yong Kim

Purpose: The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach: This study proposes and develops an algorithm that collects web information as if the crawler were gathering static webpages, by managing script commands as links. The proposed web crawler is tested experimentally by collecting deep webpages.
Findings: When the crawling process returns search results as script pages, a conventional crawler collects only the first page, whereas the proposed algorithm can collect the deep webpages in this case.
Research limitations/implications: To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script or if the web document contains script errors.
Practical implications: The deep web is estimated to hold 450 to 550 times more information than surface webpages, yet its documents are difficult to collect. The proposed algorithm enables deep web collection through script runs.
Originality/value: This study presents a new method that uses script links instead of the keywords adopted in previous work. The proposed algorithm treats a script link as an ordinary URL. The conducted experiment shows that analysis of the scripts on individual websites is needed to employ them as links.
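A rough sketch of the core idea, managing script commands as links, is given below; the regular expressions and the stubbed browser launcher are assumptions for illustration and do not reproduce the authors' implementation, which relies on the Microsoft Visual Studio web browser object.

# Illustrative sketch only: harvest javascript: pseudo-links and onclick handlers from a
# page and queue them alongside ordinary hrefs, so script commands are crawled "as links".
import re
from collections import deque

HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
ONCLICK_RE = re.compile(r'onclick\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_crawl_targets(html):
    """Return (static_links, script_commands) found in a web document."""
    hrefs = HREF_RE.findall(html)
    static = [h for h in hrefs if not h.lower().startswith("javascript:")]
    scripts = [h for h in hrefs if h.lower().startswith("javascript:")]
    scripts += ONCLICK_RE.findall(html)
    return static, scripts

def crawl(seed_html_by_url, run_script_in_browser=None):
    """Breadth-first crawl that treats script commands as links (illustrative only)."""
    queue = deque(seed_html_by_url.items())
    collected = []
    while queue:
        url, html = queue.popleft()
        collected.append(url)
        static_links, scripts = extract_crawl_targets(html)
        # static_links would be fetched with ordinary HTTP requests (omitted in this sketch)
        for script in scripts:
            if run_script_in_browser is not None:
                # A browser object would execute the script and hand back the generated page.
                generated_html = run_script_in_browser(url, script)
                queue.append((url + "#" + script, generated_html))
    return collected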


2012 ◽  
Vol 8 (4) ◽  
pp. 1-21 ◽  
Author(s):  
C. I. Ezeife ◽  
Titas Mutsuddy

The process of extracting comparative, derived, and historical heterogeneous web content data from related web pages is still in its infancy. Discovering potentially useful and previously unknown information or knowledge from web contents, such as “list all articles on ’Sequential Pattern Mining’ written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication,” would require finding the schema of web documents from different web pages, performing web content data integration, and building a virtual or physical data warehouse before web content extraction and mining from the database. This paper proposes a technique for automatic web content data extraction, the WebOMiner system, which models web sites of a specific domain, such as Business-to-Customer (B2C) web sites, as object-oriented database schemas. Non-deterministic finite automaton (NFA) based wrappers for recognizing content types from this domain are then built and used to extract related contents from data blocks into an integrated database for future second-level mining and deeper knowledge discovery.
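As an illustration of the wrapper idea (not the WebOMiner schema itself), the sketch below runs a toy non-deterministic finite automaton over the tag sequence of a B2C data block to decide whether the block is a product record; the states, alphabet, and tag sequences are assumed for the example.

# Toy NFA wrapper: recognizes a "product" content type from the tag sequence of a data
# block (e.g., <img> <a> price-text). States and transitions are illustrative assumptions.
PRODUCT_NFA = {
    ("start", "img"):       {"has_image", "start"},   # leading images may repeat
    ("start", "a"):         {"has_title"},
    ("has_image", "a"):     {"has_title"},
    ("has_title", "price"): {"accept"},
}
ACCEPTING = {"accept"}

def recognizes_product(tag_sequence):
    """Run the NFA over a data block's tag sequence; True if any path reaches an accept state."""
    states = {"start"}
    for tag in tag_sequence:
        next_states = set()
        for s in states:
            next_states |= PRODUCT_NFA.get((s, tag), set())
        states = next_states
        if not states:
            return False
    return bool(states & ACCEPTING)

# A block tokenized as ["img", "a", "price"] would be extracted as a product record:
print(recognizes_product(["img", "a", "price"]))   # True
print(recognizes_product(["a", "img"]))            # False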


2010 ◽  
Vol 19 (04) ◽  
pp. 465-486 ◽  
Author(s):  
MARIA SOLEDAD PERA ◽  
YIU-KAI NG

Text classification categorizes web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming, and users are still required to spend a considerable amount of time scanning through the classified web documents to identify the ones whose contents satisfy their information needs. To solve this problem, we first introduce CorSum, an extractive single-document summarization approach that is simple and effective, since it relies only on word similarity to generate high-quality summaries. We further enhance CorSum by considering the significance factor of sentences in documents, in addition to word-correlation factors, for document summarization. We denote the enhanced approach CorSum-SF and use the summaries it generates to train a Multinomial Naïve Bayes classifier for categorizing web document summaries into predefined classes. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum-SF outperforms other extractive summarization methods, and that classification time is significantly reduced, with comparable accuracy, when using CorSum-SF generated summaries instead of entire documents. More importantly, browsing summaries, rather than entire documents, assigned to predefined categories facilitates the information search process on the Web.
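The classification stage can be sketched as follows with scikit-learn, assuming the summaries have already been produced; the training texts and labels are made up for illustration, and the CorSum-SF summarizer itself is not reproduced here.

# Minimal sketch of the classification stage only: train a Multinomial Naive Bayes
# classifier on summaries rather than full documents (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: (summary, class) pairs produced by a summarizer.
train_summaries = [
    "team wins championship final match",
    "new graphics card benchmark released",
    "parliament passes budget amendment",
]
train_labels = ["sports", "technology", "politics"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_summaries, train_labels)

# Classifying a new document by its (shorter) summary keeps training and prediction fast.
print(classifier.predict(["quarterly budget vote in parliament"]))   # expected: ['politics']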


2010 ◽  
Vol 171-172 ◽  
pp. 543-546 ◽  
Author(s):  
G. Poonkuzhali ◽  
R. Kishore Kumar ◽  
R. Kripa Keshav ◽  
P. Sudhakar ◽  
K. Sarukesi

The growth of the Internet has flooded the WWW with abundant information, much of it replicated. Because duplicated web pages increase indexing space and time complexity, finding and removing them is important for search engines and similar systems, improving both the accuracy of search results and search speed. Web content mining plays a vital role in addressing these issues. Existing web content mining algorithms focus on applying weights to structured documents, whereas in this work a mathematical approach based on linear correlation is developed to detect and remove duplicates in both structured and unstructured web documents. In the proposed method, the linear correlation between two web documents is computed; if the correlation value is 1, the documents are exactly redundant and one of them is eliminated, otherwise they are considered non-redundant.
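A minimal sketch of the idea, under the assumption that each document is represented as a term-frequency vector over a shared vocabulary, is given below; the sample documents and threshold are illustrative, and the code is not the paper's exact formulation.

# Sketch: Pearson (linear) correlation between term-frequency vectors of two documents;
# a correlation of 1 flags the pair as exactly redundant.
from collections import Counter
import numpy as np

def term_frequency_vectors(doc_a, doc_b):
    tokens_a, tokens_b = doc_a.lower().split(), doc_b.lower().split()
    vocab = sorted(set(tokens_a) | set(tokens_b))
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return (np.array([counts_a[t] for t in vocab], dtype=float),
            np.array([counts_b[t] for t in vocab], dtype=float))

def is_duplicate(doc_a, doc_b, threshold=0.999):
    vec_a, vec_b = term_frequency_vectors(doc_a, doc_b)
    correlation = np.corrcoef(vec_a, vec_b)[0, 1]
    return correlation >= threshold

print(is_duplicate("web mining removes duplicate web pages",
                   "web mining removes duplicate web pages"))          # True
print(is_duplicate("web mining removes duplicate web pages",
                   "football match scores updated after the football match"))  # False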


Author(s):  
Youngseok Lee ◽  
Jungwon Cho

In this paper, we propose a web document ranking method that uses topic modeling for effective information collection and classification. The proposed method is applied to the document ranking technique to avoid duplicated crawling when crawling at high speed. The proposed ranking technique makes it feasible to remove redundant documents, classify documents efficiently, and confirm that the crawler service is running. The method enables rapid collection of many web documents, and users can efficiently search web pages whose data are constantly updated. In addition, the efficiency of data retrieval can be improved because new information can be automatically classified and transmitted. By expanding the scope of the method to big-data-based web pages and improving it for application to various websites, we expect that more effective information retrieval will be possible.
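One way the topic-modeling step could support de-duplication during crawling is sketched below with scikit-learn's LDA implementation; the sample documents, the two-topic model, and the redundancy test are assumptions for illustration rather than the authors' ranking method.

# Sketch: infer a topic distribution per document with LDA and skip newly crawled pages
# whose distribution is nearly identical to one already collected.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "league match results and player transfer rumours",
    "player transfer rumours and league match results",   # near-duplicate page
    "tutorial on training a neural network with python",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(X)          # one topic distribution per document

def is_redundant(new_dist, kept_dists, tol=0.05):
    """Treat a page as redundant if its topic distribution matches a kept page within tol."""
    return any(np.abs(new_dist - kept).max() < tol for kept in kept_dists)

kept = [topic_dist[0]]
for dist in topic_dist[1:]:
    redundant = is_redundant(dist, kept)
    print("redundant" if redundant else "kept")
    if not redundant:
        kept.append(dist)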


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Bin Li ◽  
Ting Zhang

In order to obtain the scene information of ordinary football games more comprehensively, an algorithm for collecting ordinary football game scene information based on web documents is proposed. The commonly used T-graph web crawler model is used to collect sample nodes for a specific topic in the football game scene information and then, after the crawling stage, to collect the edge document information for that topic. Using a semantic-analysis feature item extraction algorithm, the feature items of the football game scene information are extracted, according to their similarity, to form a web document. By constructing a complex network and introducing the local contribution and overlap coefficient of the community discovery feature selection algorithm, the features of the web document are selected to realize the collection of football game scene information. Experimental results show that the algorithm has high topic collection capability and low computational cost, its average balanced accuracy stays around 98%, and it quantifies web crawlers and communities effectively.
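The overlap coefficient mentioned above can be illustrated in isolation: the sketch below computes it between the document sets of pairs of feature terms and keeps strongly overlapping pairs as edges of the feature network. The inverted index and the threshold are assumptions; the paper's local-contribution measure is not reproduced.

# Sketch: overlap coefficient |A ∩ B| / min(|A|, |B|) between the document sets in which
# two feature terms occur, used as an edge weight when building the feature network.
from itertools import combinations

def overlap_coefficient(set_a, set_b):
    """1.0 means one term's document set is a subset of the other's."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))

# Hypothetical inverted index: feature term -> ids of web documents containing it.
term_docs = {
    "goal":    {1, 2, 3, 5},
    "penalty": {2, 3, 5},
    "referee": {3, 5, 8},
    "lyrics":  {9},
}

# Edges of the feature network: pairs of terms with strongly overlapping document sets.
edges = [(a, b, overlap_coefficient(term_docs[a], term_docs[b]))
         for a, b in combinations(term_docs, 2)]
for a, b, w in edges:
    if w >= 0.5:
        print(f"{a} -- {b}: overlap = {w:.2f}")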

