Web document classification using topic modeling based document ranking

Author(s):  
Youngseok Lee ◽  
Jungwon Cho

In this paper, we propose a web document ranking method using topic modeling for effective information collection and classification. The proposed method applies a document ranking technique to avoid duplicate crawling when crawling at high speed. Through the proposed document ranking technique, it is feasible to remove redundant documents, classify documents efficiently, and confirm that the crawler service is running. The proposed method enables rapid collection of many web documents, so users can efficiently search web pages whose data are constantly updated. In addition, the efficiency of data retrieval can be improved because new information can be automatically classified and transmitted. By expanding the scope of the method to big-data-based web pages and adapting it to various websites, more effective information retrieval is expected to be possible.
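The abstract does not give the ranking formula itself; as a minimal sketch, similarity-based filtering of redundant documents during crawling might look like the following, where plain term-frequency vectors stand in for topic distributions. The function names (`term_vector`, `cosine`, `rank_and_filter`) and the 0.9 threshold are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term frequencies, a crude stand-in for a topic distribution."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_and_filter(docs, threshold=0.9):
    """Keep a document only if it is not a near-duplicate of an already-kept one."""
    kept = []
    for doc in docs:
        vec = term_vector(doc)
        if all(cosine(vec, term_vector(k)) < threshold for k in kept):
            kept.append(doc)
    return kept
```

A real implementation would replace `term_vector` with per-document topic distributions from a topic model such as LDA and tune the threshold empirically.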

2012 ◽  
Vol 8 (4) ◽  
pp. 1-21 ◽  
Author(s):  
C. I. Ezeife ◽  
Titas Mutsuddy

The process of extracting comparative, heterogeneous web content data, which are derived and historical from related web pages, is still in its infancy and underdeveloped. Discovering potentially useful and previously unknown information or knowledge from web contents, such as "list all articles on 'Sequential Pattern Mining' written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication," would require finding the schema of web documents from different web pages, performing web content data integration, and building a virtual or physical data warehouse before web content extraction and mining from the database. This paper proposes a technique for automatic web content data extraction, the WebOMiner system, which models web sites of a specific domain, such as Business-to-Customer (B2C) web sites, as object-oriented database schemas. Non-deterministic finite automata (NFA) based wrappers for recognizing content types from this domain are then built and used to extract related contents from data blocks into an integrated database for future second-level mining for deep knowledge discovery.
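WebOMiner's actual automata are not reproduced in the abstract; as a toy illustration of the wrapper idea, the regular expressions below (which compile to NFAs) tag tokens from a B2C data block with hypothetical content types such as `price` and `year`.

```python
import re

# Hypothetical content-type patterns, ordered by specificity; regular
# expressions compile to NFAs, mirroring the wrapper idea at toy scale.
CONTENT_PATTERNS = [
    ("price", re.compile(r"^\$\d+(\.\d{2})?$")),
    ("year",  re.compile(r"^(19|20)\d{2}$")),
    ("text",  re.compile(r"^.+$")),          # fallback type
]

def classify_tokens(tokens):
    """Assign the first matching content type to each token of a data block."""
    tagged = []
    for tok in tokens:
        for name, pattern in CONTENT_PATTERNS:
            if pattern.match(tok):
                tagged.append((tok, name))
                break
    return tagged
```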


2010 ◽  
Vol 171-172 ◽  
pp. 543-546 ◽  
Author(s):  
G. Poonkuzhali ◽  
R. Kishore Kumar ◽  
R. Kripa Keshav ◽  
P. Sudhakar ◽  
K. Sarukesi

The growth of the Internet has flooded the WWW with abundant information, including many replicas. As duplicated web pages increase indexing space and time complexity, finding and removing these pages becomes significant for search engines and similar systems, improving both the accuracy of search results and search speed. Web content mining plays a vital role in addressing these aspects. Existing web content mining algorithms focus on applying weights to structured documents, whereas in this research work a mathematical approach based on linear correlation is developed to detect and remove duplicates present in both structured and unstructured web documents. In the proposed work, the linear correlation between two web documents is computed: if the correlation value is 1, the documents are exactly redundant and one of them is eliminated; otherwise they are not redundant.
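The duplicate-detection rule is concrete enough to sketch: compute the Pearson (linear) correlation between the two documents' term-frequency vectors over their combined vocabulary, and treat a correlation of 1 as redundancy. The vectorization below is an assumption, since the abstract does not state how documents are turned into numeric vectors.

```python
import math
from collections import Counter

def pearson_similarity(doc_a, doc_b):
    """Pearson correlation between term-frequency vectors of two documents,
    aligned over their combined vocabulary. Degenerate constant vectors
    (zero standard deviation) return 0.0."""
    fa, fb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = sorted(set(fa) | set(fb))
    xs = [fa[t] for t in vocab]
    ys = [fb[t] for t in vocab]
    n = len(vocab)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def is_duplicate(doc_a, doc_b, tol=1e-9):
    """The paper's rule: a correlation of 1 means the documents are redundant."""
    return abs(pearson_similarity(doc_a, doc_b) - 1.0) < tol
```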


2021 ◽  
Vol 1 (2) ◽  
pp. 68-75
Author(s):  
Elena А. Zhidkova ◽  
Elena V. Okunkova ◽  
Konstantin V. Zorin ◽  
Konstantin G. Gurevich

Introduction. The development of modern society implies that more and more information is placed in the digital space. Therefore, the attitude of the management of transport companies to the promotion of healthy lifestyles and the prevention of COVID-19 can be indirectly assessed by the content of their sites and web pages. Purpose. The purpose of this article is a content analysis of rail companies' sites and web pages on the promotion of healthy lifestyles and the prevention of COVID-19. Materials and methods. The study was carried out once, in October 2020. We analyzed the sites and web pages of rail companies by forming queries and using the search engines Yandex and Google. Results. The frequency of mentioning healthy lifestyle issues on the websites and web pages of the analyzed rail companies varies from 32 to 54%. COVID-19 prevention issues are covered by railway companies in only half of the cases, but in more than 90% of cases on the websites and web pages of subways, high-speed trams and city trams. In general, the analyzed websites and web pages of the rail companies adequately display information related to the prevention of COVID. Discussion. Rail companies pay attention to the formation of healthy lifestyles and the prevention of COVID-19 on their web pages and sites. For this reason, it can be assumed that the management of the rail companies is concerned about these problems. At the same time, we have no reason to believe that the analyzed web documents fully reflect the ongoing preventive programs; in other words, it can be assumed that the preventive activities of rail companies are broader than shown in this study. Meanwhile, rail companies do not fully cover their social policy regarding the health of their own employees. Conclusion. The sites and web pages of rail companies, as a rule, contain scattered information on promoting healthy lifestyles. At the same time, information on the prevention of COVID-19 is usually presented in a comprehensive manner. The issues of COVID-19 prevention are, in general, mentioned on more sites and web pages than the issues of healthy lifestyle formation.


Author(s):  
GEORGE E. TSEKOURAS ◽  
DAMIANOS GAVALAS

This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a focused web crawler to download web documents relevant to culture. The focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area, filtering the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the cultural-related document vectors and for each cultural theme, we use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application which analyzes hundreds of web pages spanning different cultural thematic areas.
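The abstract names the pipeline stages but not the clustering algorithm; the sketch below uses a greedy distance-threshold partitioning of document vectors as a stand-in for whatever cluster analysis the authors applied. The function names and the distance measure are illustrative assumptions.

```python
def dissimilarity(v1, v2):
    """Euclidean distance between two term-occurrence vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def threshold_cluster(vectors, max_dist):
    """Greedy clustering: attach each vector to the first cluster whose
    representative lies within max_dist, otherwise start a new cluster."""
    clusters = []  # list of (representative vector, member-index list)
    for i, v in enumerate(vectors):
        for rep, members in clusters:
            if dissimilarity(rep, v) <= max_dist:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```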


2011 ◽  
Vol 403-408 ◽  
pp. 1008-1013 ◽  
Author(s):  
Divya Ragatha Venkata ◽  
Deepika Kulshreshtha

In this paper, we put forward a technique for keeping web pages up to date, later used by a search engine to serve end-user queries. A major part of the Web is dynamic; hence, a need arises to constantly update changed web documents in the search engine's repository. We use a client-server architecture for crawling the web and propose a technique for detecting changes in a web page based on the content of any images present in the web document. Once the image embedded in the web document is identified as changed, the previous copy of the web document in the search engine's database/repository is replaced with the changed one.
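One simple way to realize image-based change detection, assuming the embedded images' bytes can be fetched, is to fingerprint each image and compare fingerprints between crawls. The abstract does not specify the comparison method, so the SHA-256 hashing here is an assumption.

```python
import hashlib

def image_fingerprints(images):
    """Map each image identifier to a SHA-256 digest of its raw bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in images.items()}

def page_changed(old_fps, new_images):
    """Flag a page as changed when any embedded image's bytes differ,
    or when images were added or removed, relative to stored fingerprints."""
    return image_fingerprints(new_images) != old_fps
```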


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web is growing significantly. Information on the web appears in several structures: structured, semi-structured and unstructured. The majority of information on the web is presented in web pages, and the information presented in web pages is semi-structured. However, the information required for a given context is scattered across different web documents. It is difficult to analyze large volumes of semi-structured information presented in web pages and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to perform effective analysis on the extracted data. The proposed framework is applicable to any application domain; manufacturing, sales, tourism and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
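The framework's components are described only by their roles; as a hypothetical sketch, the crawling, extraction and mining stages can be wired as a small pipeline with pluggable callables. The class and parameter names are illustrative, not the paper's.

```python
class WebAnalysisPipeline:
    """Toy pipeline mirroring the framework's stages: crawl pages,
    extract records from each page, then mine the combined records.
    The stage callables are hypothetical placeholders."""

    def __init__(self, crawler, extractor, miner):
        self.crawler = crawler      # url -> page content
        self.extractor = extractor  # page -> list of records
        self.miner = miner          # records -> analysis result

    def run(self, seed_urls):
        pages = [self.crawler(url) for url in seed_urls]
        records = [rec for page in pages for rec in self.extractor(page)]
        return self.miner(records)
```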


2018 ◽  
Vol 33 (4) ◽  
Author(s):  
Murari Kumar ◽  
Samir Farooqi ◽  
K. K. Chaturvedi ◽  
Chandan Kumar Deb ◽  
Pankaj Das

Bibliographic data contain the information needed to help users recognize and retrieve a literature resource. These data are used quantitatively by bibliometricians for analysis and dissemination, but with the increasing rate of literature publication in open access journals such as Nucleic Acids Research (NAR), Springer, and Oxford Journals, it has become difficult to retrieve structured bibliographic information in a desired format. A digital bibliographic database contains necessary and structured information about published literature. Bibliographic records of different articles are scattered and reside on different web pages. This work presents a retrieval system that gathers the bibliographic data of NAR in a single place. For this purpose, parser agents have been developed which access the web pages of NAR, parse the scattered bibliographic data, and store it in a local bibliographic database. On top of this database, a three-tier architecture is used to display the bibliographic information in a systematized format. Using this system, it is possible to build networks between different authors and affiliations, and other analytical reports can be generated.
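The parser agents' logic is not given in the abstract; the sketch below assumes a hypothetical citation markup (NAR's real page structure is not reproduced here) and shows the general shape: parse scattered records off a page, then load them into a local SQLite bibliographic database.

```python
import re
import sqlite3

# Hypothetical markup for a citation block; an assumption for illustration.
RECORD_RE = re.compile(
    r'<div class="citation">\s*<span class="title">(?P<title>.*?)</span>\s*'
    r'<span class="authors">(?P<authors>.*?)</span>\s*</div>',
    re.S,
)

def parse_bibliography(html):
    """Extract scattered citation blocks from a page as (title, authors) rows."""
    return [(m["title"].strip(), m["authors"].strip()) for m in RECORD_RE.finditer(html)]

def store_records(records, db_path=":memory:"):
    """Load parsed rows into a local bibliographic database (SQLite here)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS bib (title TEXT, authors TEXT)")
    conn.executemany("INSERT INTO bib VALUES (?, ?)", records)
    conn.commit()
    return conn
```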


2020 ◽  
Author(s):  
Tanweer Alam ◽  
Mohamed Benaida

Building an innovative blockchain-based architecture on the Internet of Things (IoT) platform for the education system could be an enticing mechanism to boost communication efficiency within the 5G network. Wireless networking has been a major research area, allowing people to communicate without wires; it was established at the start of the Internet by retrieving web pages to connect one computer to another. Moreover, high-speed, intelligent, powerful networks with numerous contemporary technologies, such as low power consumption, are now available for devices to connect with each other, and in this situation the extension of fog features to physical things under the IoT is enabled. One of the complex tasks in the area of mobile communications is to design a new virtualization framework based on blockchain across the Internet of Things architecture. The goal of this research is to present a new study for an educational system that combines blockchain with the Internet of Things, keeping things cryptographically secure on the Internet. This research combines an improved blockchain with the IoT to create an efficient interaction system between students, teachers, employers, developers, facilitators and accreditors on the Internet. The specified framework is described and evaluated in detail.


Author(s):  
Gerardo Reyes Ruiz ◽  
Samuel Olmos Peña ◽  
Marisol Hernández Hernández

New technologies have changed the way today's private label products are offered. Today the Internet, and even more so the so-called social networks, play key roles in distributing any particular product in a more efficient and dynamic way. Also, having a smartphone and a wireless high-speed network is no longer a luxury or a temporary fad, but rather a necessity for the new generations. These technological advances and new marketing trends have not gone unnoticed by medium and large stores. Augmented reality applied to interactive catalogs is a new technology that supports adding virtual elements to a real environment, which in turn makes it a tool for discovering new uses, forms and, in this case, spending habits. The challenge for companies with private labels in achieving their business objectives is to provide customers with products and services of the highest quality, thus promoting the efficient and streamlined use of all accounted resources while at the same time promoting the use of new information technologies as a strategic competitive advantage.

