A Tag-Based Improved LDA and Web Page Clustering Analysis

2014 ◽  
Vol 667 ◽  
pp. 277-285 ◽  
Author(s):  
Fang Chen ◽  
Yan Hui Zhou

With the rapid development of Internet, tag technology has been widely used in various sites. The brief text labels of network resources are greatly convenient for people to access the massive data. Social tags allows the user to use any word ----to tag network objects, and to share these tags, because of its simple and flexible operation, and it has become one of the popular applications. However, there exists some problems like noise of tags, lack of using criteria, and sparse distribution etc. Especially sparsity of tags seriously limits its application in the semantic analysis of web pages. This paper, by exploiting the user-related tag expansion method to overcome this problem, at the same time by using the topic model----LDA to model the web tags, mine its potential topic from the large-scale web page, and obtain the topic distribution of the text to the text clustering analysis. The experimental results show that, compared with the traditional clustering algorithm, the method of based LDA clustering on the analysis of the web tags have a larger increase.

Author(s):  
Ben Choi

Web mining aims for searching, organizing, and extracting information on the Web and search engines focus on searching. The next stage of Web mining is the organization of Web contents, which will then facilitate the extraction of useful information from the Web. This chapter will focus on organizing Web contents. Since a majority of Web contents are stored in the form of Web pages, this chapter will focus on techniques for automatically organizing Web pages into categories. Various artificial intelligence techniques have been used; however the most successful ones are classification and clustering. This chapter will focus on clustering. Clustering is well suited for Web mining by automatically organizing Web pages into categories each of which contain Web pages having similar contents. However, one problem in clustering is the lack of general methods to automatically determine the number of categories or clusters. For the Web domain, until now there is no such a method suitable for Web page clustering. To address this problem, this chapter describes a method to discover a constant factor that characterizes the Web domain and proposes a new method for automatically determining the number of clusters in Web page datasets. This chapter also proposes a new bi-directional hierarchical clustering algorithm, which arranges individual Web pages into clusters and then arranges the clusters into larger clusters and so on until the average inter-cluster similarity approaches the constant factor. Having the constant factor together with the algorithm, this chapter provides a new clustering system suitable for mining the Web.


2010 ◽  
Vol 30 (3) ◽  
pp. 818-820
Author(s):  
Rui LI ◽  
Jun-yu ZENG ◽  
Si-wang ZHOU

2021 ◽  
pp. 1-11
Author(s):  
Zhinan Gou ◽  
Yan Li

With the development of the web 2.0 communities, information retrieval has been widely applied based on the collaborative tagging system. However, a user issues a query that is often a brief query with only one or two keywords, which leads to a series of problems like inaccurate query words, information overload and information disorientation. The query expansion addresses this issue by reformulating each search query with additional words. By analyzing the limitation of existing query expansion methods in folksonomy, this paper proposes a novel query expansion method, based on user profile and topic model, for search in folksonomy. In detail, topic model is constructed by variational antoencoder with Word2Vec firstly. Then, query expansion is conducted by user profile and topic model. Finally, the proposed method is evaluated by a real dataset. Evaluation results show that the proposed method outperforms the baseline methods.


Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 228 ◽  
Author(s):  
Zuping Zhang ◽  
Jing Zhao ◽  
Xiping Yan

Web page clustering is an important technology for sorting network resources. By extraction and clustering based on the similarity of the Web page, a large amount of information on a Web page can be organized effectively. In this paper, after describing the extraction of Web feature words, calculation methods for the weighting of feature words are studied deeply. Taking Web pages as objects and Web feature words as attributes, a formal context is constructed for using formal concept analysis. An algorithm for constructing a concept lattice based on cross data links was proposed and was successfully applied. This method can be used to cluster the Web pages using the concept lattice hierarchy. Experimental results indicate that the proposed algorithm is better than previous competitors with regard to time consumption and the clustering effect.


Author(s):  
Jie Zhao ◽  
Jianfei Wang ◽  
Jia Yang ◽  
Peiquan Jin

Company acquisition relation reflects a company's development intent and competitive strategies, which is an important type of enterprise competitive intelligence. In the traditional environment, the acquisition of competitive intelligence mainly relies on newspapers, internal reports, and so on, but the rapid development of the Web introduces a new way to extract company acquisition relation. In this paper, the authors study the problem of extracting company acquisition relation from huge amounts of Web pages, and propose a novel algorithm for company acquisition relation extraction. The authors' algorithm considers the tense feature of Web content and classification technology of semantic strength when extracting company acquisition relation from Web pages. It first determines the tense of each sentence in a Web page, which is then applied in sentences classification so as to evaluate the semantic strength of the candidate sentences in describing company acquisition relation. After that, the authors rank the candidate acquisition relations and return the top-k company acquisition relation. They run experiments on 6144 pages crawled through Google, and measure the performance of their algorithm under different metrics. The experimental results show that the algorithm is effective in determining the tense of sentences as well as the company acquisition relation.


2020 ◽  
Vol 39 (6) ◽  
pp. 8981-8988
Author(s):  
Xiaohua Liu

In the face of the current epidemic situation, news reports are facing the problem of higher accuracy. The speed and accuracy of public emergency news depends on the accuracy of web page links and tags clustering. An improved web page clustering method based on the combination of topic clustering and structure clustering is proposed in this paper. The algorithm takes the result of web page structure clustering as the weight factor. Combined with the web content clustering by K-means algorithm, the basic content that meets the conditions is selected. Through the improved translator of clustering algorithm, it is translated into Chinese and compared with the target content to analyze the similarity. It realized the translation aim of new crown virus epidemic related news report of Japanese Linguistics based on page link mining.


2013 ◽  
Vol 846-847 ◽  
pp. 1868-1872
Author(s):  
Shuai Gang

In recent years, the number and size of Web services on the Internet have a rapid development. Industry and academia start to study the web service. In Internet resources, if the web cannot be found, the web service will become meaningless. So for web services, large-scale managements and problems are the keys of the study of Internet service resources. This paper studies large-scale distributed web services in network resources based on SOA architecture ideas. It also designs the unified management and organization system of ideological and political education which treat the ideological and political education as the content. It proposes SN network resource service model of ideological and political education. With the development and popularization of the Internet today, the study on Internet resources of ideological and political education in this paper provides a theoretical reference for the innovation of the ideological and political education.


2019 ◽  
Vol 12 (1) ◽  
pp. 105-116
Author(s):  
Qiuyu Zhu ◽  
Dongmei Li ◽  
Cong Dai ◽  
Qichen Han ◽  
Yi Lin

With the rapid development of the Internet, the information retrieval model based on the keywords matching algorithm has not met the requirements of users, because people with various query history always have different retrieval intentions. User query history often implies their interests. Therefore, it is of great importance to enhance the recall ratio and the precision ratio by applying query history into the judgment of retrieval intentions. For this sake, this article does research on user query history and proposes a method to construct user interest model utilizing query history. Coordinately, the authors design a model called PLSA-based Personalized Information Retrieval with Network Regularization. Finally, the model is applied into academic information retrieval and the authors compare it with Baidu Scholar and the personalized information retrieval model based on the probabilistic latent semantic analysis topic model. The experiment results prove that this model can effectively extract topics and retrieves back results more satisfied for users' requirements. Also, this model improves the effect of retrieval results apparently. In addition, the retrieval model can be utilized not only in the academic information retrieval, but also in the personalized information retrieval on microblog search, associate recommendation, etc.


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Jiangfan Feng ◽  
Xuejun Fu ◽  
Yao Zhou ◽  
Yuling Zhu ◽  
Xiaobo Luo

The rapid developments in sensor technology and mobile devices bring a flourish of social images, and large-scale social images have attracted increasing attention to researchers. Existing approaches generally rely on recognizing object instances individually with geo-tags, visual patterns, etc. However, the social image represents a web of interconnected relations; these relations between entities carry semantic meaning and help a viewer differentiate between instances of a substance. This article forms the perspective of the spatial relationship to exploring the joint learning of social images. Precisely, the model consists of three parts: (a) a module for deep semantic understanding of images based on residual network (ResNet); (b) a deep semantic analysis module of text beyond traditional word bag methods; (c) a joint reasoning module from which the text weights obtained using image features on self-attention and a novel tree-based clustering algorithm. The experimental results demonstrate the effectiveness of using Flickr30k and Microsoft COCO datasets. Meanwhile, our method considers spatial relations while matching.


2006 ◽  
Author(s):  
Ron Bekkerman ◽  
Shlomo Zilberstein ◽  
James Allan

Sign in / Sign up

Export Citation Format

Share Document