A Filtering Algorithm of Main Word Frequency for Online Commodity Subject Classification in E-Commerce

—Based on the traditional classification of plain text in E-Commerce, this article has put forward a processing method in accordance with semi-structured data and main information in web pages, which enhances the accuracy of the product distribution. On the basis of the traditional textmining, combined with the structure and links of web page, this article has proposed an improved web page text representation model in E-Commerce based on supporting vector machines and web text classification algorithm, but there are still a lot of shortcomings waiting for further improvement. According to the data contrast in precision ratio, recall ratio and F-measure, the effect of the improved experiment with LDF-IDF is comprehensively better than that of tf-idf. The precision rate in certain classification can reach 100%, but there is low precision rate caused by items with fewer samples or samples fuzziness. Therefore, the classification of the correct category will directly affect the effect of classification.

Download Full-text

Web Page Structure Enhanced Feature Selection for Classification of Web Pages

International Journal of Computer Applications ◽

10.5120/11818-7494 ◽

2013 ◽

Vol 69 (2) ◽

pp. 41-47

Author(s):

B. LeelaDevi ◽

A. Sankar

Keyword(s):

Feature Selection ◽

Web Pages ◽

Web Page ◽

Selection For

Download Full-text

Is Domain Highlighting Actually Helpful in Identifying Phishing Web Pages?

Human Factors The Journal of the Human Factors and Ergonomics Society ◽

10.1177/0018720816684064 ◽

2017 ◽

Vol 59 (4) ◽

pp. 640-660 ◽

Cited By ~ 7

Author(s):

Aiping Xiong ◽

Robert W. Proctor ◽

Weining Yang ◽

Ninghui Li

Keyword(s):

Eye Gaze ◽

Phase 1 ◽

Web Pages ◽

Web Page ◽

Domain Name ◽

Heat Map ◽

Potential Applications ◽

Two Phases ◽

Phishing Detection ◽

Better Than

Objective: To evaluate the effectiveness of domain highlighting in helping users identify whether Web pages are legitimate or spurious. Background: As a component of the URL, a domain name can be overlooked. Consequently, browsers highlight the domain name to help users identify which Web site they are visiting. Nevertheless, few studies have assessed the effectiveness of domain highlighting, and the only formal study confounded highlighting with instructions to look at the address bar. Method: We conducted two phishing detection experiments. Experiment 1 was run online: Participants judged the legitimacy of Web pages in two phases. In Phase 1, participants were to judge the legitimacy based on any information on the Web page, whereas in Phase 2, they were to focus on the address bar. Whether the domain was highlighted was also varied. Experiment 2 was conducted similarly but with participants in a laboratory setting, which allowed tracking of fixations. Results: Participants differentiated the legitimate and fraudulent Web pages better than chance. There was some benefit of attending to the address bar, but domain highlighting did not provide effective protection against phishing attacks. Analysis of eye-gaze fixation measures was in agreement with the task performance, but heat-map results revealed that participants’ visual attention was attracted by the highlighted domains. Conclusion: Failure to detect many fraudulent Web pages even when the domain was highlighted implies that users lacked knowledge of Web page security cues or how to use those cues. Application: Potential applications include development of phishing prevention training incorporating domain highlighting with other methods to help users identify phishing Web pages.

Download Full-text

Automatic Web Page Classification System with Improved Accuracy

Webology ◽

10.14704/web/v18i2/web18318 ◽

2021 ◽

Vol 18 (2) ◽

pp. 225-242

Author(s):

Chait hra ◽

Dr.G.M. Lingaraju ◽

Dr.S. Jagannatha

Keyword(s):

Research Work ◽

Web Pages ◽

Automated Classification ◽

Classification Methods ◽

Web Page ◽

Web Page Classification ◽

Chi Squared ◽

The Web ◽

Page Classification

Nowadays, the Internet contain s a wide variety of online documents, making finding useful information about a given subject impossible, as well as retrieving irrelevant pages. Web document and page recognition software is useful in a variety of fields, including news, medicine, and fitness, research, and information technology. To enhance search capability, a large number of web page classification methods have been proposed, especially for news web pages. Furthermore existing classification approaches seek to distinguish news web pages while still reducing the high dimensionality of features derived from these pages. Due to the lack of automated classification methods, this paper focuses on the classification of news web pages based on their scarcity and importance. This work will establish different models for the identification and classification of the web pages. The data sets used in this paper were collected from popular news websites. In the research work we have used BBC dataset that has five predefined categories. Initially the input source can be preprocessed and the errors can be eliminated. Then the features can be extracted depend upon the web page reviews using Term frequency-inverse document frequency vectorization. In the work 2225 documents are represented with the 15286 features, which represents the tf-idf score for different unigrams and bigrams. This type of the representation is not only used for classification task also helpful to analyze the dataset. Feature selection is done by using the chi-squared test which will be in the task of finding the terms that are most correlated with each of the categories. Then the pointed features can be selected using chi-squared test. Finally depend upon the classifier the web page can be classified. The results showed that list has obtained the highest percentage, which reflect its effectiveness on the classification of web pages.

Download Full-text

Effective Genre Classification - Understanding Url And Webpage Attributes For Classification

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1191.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 2011-2016

Keyword(s):

The Internet ◽

Web Pages ◽

Web Page ◽

Exchange Method ◽

Genre Classification ◽

Internet Application ◽

Reliable Classification ◽

Time Eating ◽

The Web

With the boom in the number of internet pages, it is very hard to discover desired records effortlessly and fast out of heaps of web pages retrieved with the aid of a search engine. there may be a increasing requirement for automatic type strategies with more class accuracy. There are a few conditions these days in which it's far vital to have an green and reliable classification of a web-web page from the information contained within the URL (Uniform aid Locator) handiest, with out the want to go to the web page itself. We want to understand if the URL can be used by us while not having to look and visit the page due to numerous motives. Getting the web page content material and sorting them to discover the genre of the net web page is very time ingesting and calls for the consumer to recognize the shape of the web page which needs to be categorised. To avoid this time-eating technique we proposed an exchange method so one can help us get the genre of the entered URL based of the entered URL and the metadata i.e., description, keywords used in the website along side the title of the web site. This approach does not most effective rely upon URL however also content from the internet application. The proposed gadget can be evaluated using numerous available datasets.

Download Full-text

Multi-Label Genre Classification of Web Pages Using an Adaptive Centroid-Based Classifier

Journal of Information & Knowledge Management ◽

10.1142/s0219649216500088 ◽

2016 ◽

Vol 15 (01) ◽

pp. 1650008 ◽

Cited By ~ 2

Author(s):

Chaker Jebari

Keyword(s):

Computational Complexity ◽

Rapid Evolution ◽

Classification Method ◽

Training Dataset ◽

Complex Object ◽

Web Pages ◽

Web Page ◽

Adaptive Classification ◽

Genre Classification

This paper proposes an adaptive centroid-based classifier (ACC) for multi-label classification of web pages. Using a set of multi-genre training dataset, ACC constructs a centroid for each genre. To deal with the rapid evolution of web genres, ACC implements an adaptive classification method where web pages are classified one by one. For each web page, ACC calculated its similarity with all genre centroids. Based on this similarity, ACC either adjusts the genre centroid by including the new web page or discards it. A web page is a complex object that contains different sections belonging to different genres. To handle this complexity, ACC implements a multi-label classification where a web page can be assigned to multiple genres at the same time. To improve the performance of genre classification, we propose to aggregate the classifications produced using character n-grams extracted from URL, title, headings and anchors. Experiments conducted using a known multi-label dataset show that ACC outperforms many other multi-label classifiers and has the lowest computational complexity.

Download Full-text

Towards a phylogenetic classification of the Myxomycetes

Phytotaxa ◽

10.11646/phytotaxa.399.3.5 ◽

2019 ◽

Vol 399 (3) ◽

pp. 209 ◽

Cited By ~ 8

Author(s):

DMITRY V. LEONTYEV ◽

MARTIN SCHNITTLER ◽

STEVEN L. STEPHENSON ◽

YURI K. NOVOZHILOV ◽

OLEG N. SHCHEPIN

Keyword(s):

Hierarchical Classification ◽

Morphological Characters ◽

Molecular Data ◽

Evolutionary Relationships ◽

Entire Group ◽

Code Of Nomenclature ◽

Basal Group ◽

Traditional Classification ◽

Better Than

The traditional classification of the Myxomycetes (Myxogastrea) into five orders (Echinosteliales, Liceales, Trichiales, Stemonitidales and Physarales), used in all monographs published since 1945, does not properly reflect evolutionary relationships within the group. Reviewing all published phylogenies for myxomycete subgroups together with a 18S rDNA phylogeny of the entire group serving as an illustration, we suggest a revised hierarchical classification, in which taxa of higher ranks are formally named according to the International Code of Nomenclature for algae, fungi and plants. In addition, informal zoological names are provided. The exosporous genus Ceratiomyxa, together with some protosteloid amoebae, constitute the class Ceratiomyxomycetes. The class Myxomycetes is divided into a bright- and a dark-spored clade, now formally named as subclasses Lucisporomycetidae and Columellomycetidae, respectively. For bright-spored myxomycetes, four orders are proposed: Cribrariales (considered as a basal group), Reticulariales, a narrowly circumscribed Liceales and Trichiales. The dark-spored myxomycetes include five orders: Echinosteliales (considered as a basal group), Clastodermatales, Meridermatales, a more narrowly circumscribed Stemonitidales and Physarales (including as well most of the traditional Stemonitidales with durable peridia). Molecular data provide evidence that conspicuous morphological characters such as solitary versus compound fructifications or presence versus absence of a stalk are overestimated. Details of the capillitium and peridium, and especially how these structures are connected to each other, seem to reflect evolutionary relationships much better than many characters which have been used in the past.

Download Full-text

Exploring a knowledge-based approach to predicting NACE codes of enterprises based on web page texts

Statistical Journal of the IAOS ◽

10.3233/sji-200675 ◽

2020 ◽

Vol 36 (3) ◽

pp. 807-821

Author(s):

Heidi Kühnemann ◽

Arnout van Delden ◽

Dick Windmeijer

Keyword(s):

Economic Activity ◽

Model Performance ◽

Support Vector ◽

Web Pages ◽

Data Set ◽

Filter Model ◽

Domain Specific ◽

Knowledge Based ◽

Vector Machines

Classification of enterprises by main economic activity according to NACE codes is a challenging but important task for national statistical institutes. Since manual editing is time-consuming, we investigated the automatic prediction from dedicated website texts using a knowledge-based approach. To that end, concept features were derived from a set of domain-specific keywords. Furthermore, we compared flat classification to a specific two-level hierarchy which was based on an approach used by manual editors. We limited ourselves to Naïve Bayes and Support Vector Machines models and only used texts from the main web pages. As a first step, we trained a filter model that classifies whether websites contain information about economic activity. The resulting filtered data set was subsequently used to predict 111 NACE classes. We found that using concept features did not improve the model performance compared to a model with character n-grams, i.e. non-informative features. Neither did the two-level hierarchy improve the performance relative to a flat classification. Nonetheless, prediction of the best three NACE classes clearly improved the overall prediction performance compared to a top-one prediction. We conclude that more effort is needed in order to achieve good results with a knowledge-based approach and discuss ideas for improvement.

Download Full-text

Automatically Appraising the Credibility of Vaccine-Related Web Pages Shared on Social Media: A Twitter Surveillance Study

Journal of Medical Internet Research ◽

10.2196/14007 ◽

2019 ◽

Vol 21 (11) ◽

pp. e14007 ◽

Cited By ~ 5

Author(s):

Zubair Shah ◽

Didi Surian ◽

Amalie Dyda ◽

Enrico Coiera ◽

Kenneth D Mandl ◽

...

Keyword(s):

Support Vector ◽

Web Pages ◽

Web Page ◽

Vector Machines ◽

Large Audience ◽

Risk Of Exposure ◽

Twitter Users ◽

Context Specific ◽

Appraisal Tool ◽

High Credibility

Background Tools used to appraise the credibility of health information are time-consuming to apply and require context-specific expertise, limiting their use for quickly identifying and mitigating the spread of misinformation as it emerges. Objective The aim of this study was to estimate the proportion of vaccine-related Twitter posts linked to Web pages of low credibility and measure the potential reach of those posts. Methods Sampling from 143,003 unique vaccine-related Web pages shared on Twitter between January 2017 and March 2018, we used a 7-point checklist adapted from validated tools and guidelines to manually appraise the credibility of 474 Web pages. These were used to train several classifiers (random forests, support vector machines, and recurrent neural networks) using the text from a Web page to predict whether the information satisfies each of the 7 criteria. Estimating the credibility of all other Web pages, we used the follower network to estimate potential exposures relative to a credibility score defined by the 7-point checklist. Results The best-performing classifiers were able to distinguish between low, medium, and high credibility with an accuracy of 78% and labeled low-credibility Web pages with a precision of over 96%. Across the set of unique Web pages, 11.86% (16,961 of 143,003) were estimated as low credibility and they generated 9.34% (1.64 billion of 17.6 billion) of potential exposures. The 100 most popular links to low credibility Web pages were each potentially seen by an estimated 2 million to 80 million Twitter users globally. Conclusions The results indicate that although a small minority of low-credibility Web pages reach a large audience, low-credibility Web pages tend to reach fewer users than other Web pages overall and are more commonly shared within certain subpopulations. An automatic credibility appraisal tool may be useful for finding communities of users at higher risk of exposure to low-credibility vaccine communications.

Download Full-text

Design of Theme Crawler for Web Forum

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.548-549.1330 ◽

2014 ◽

Vol 548-549 ◽

pp. 1330-1333

Author(s):

Zhao Qiu ◽

Ceng Jun Dai ◽

Tao Liu

Keyword(s):

Implementation Strategy ◽

Operating Efficiency ◽

Web Pages ◽

Web Page ◽

Subsequent Work ◽

Web Information Extraction ◽

Performance Accuracy ◽

Web Information ◽

Web Forum ◽

Better Than

Network crawler as web information extraction tools, it can download web pages from internet for the engine. The implementation strategy and operating efficiency of crawling program have a direct influence on results of subsequent work. The paper aimed at the shortcomings of ordinary crawler, puts forward a practical and efficient precise crawler theme method for the BBS, the method for the BBS characteristics, attempts in the web page parsing, theme correlation analysis and the crawling strategy, using the template configuration, analyze and crawl on the article. The method is better than the general crawler in the performance, accuracy and comprehensive rate.

Download Full-text

Automatically Appraising the Credibility of Vaccine-Related Web Pages Shared on Social Media: A Twitter Surveillance Study (Preprint)

10.2196/preprints.14007 ◽

2019 ◽

Author(s):

Zubair Shah ◽

Didi Surian ◽

Amalie Dyda ◽

Enrico Coiera ◽

Kenneth D Mandl ◽

...

Keyword(s):

Support Vector ◽

Web Pages ◽

Web Page ◽

Vector Machines ◽

Large Audience ◽

Risk Of Exposure ◽

Twitter Users ◽

Context Specific ◽

Appraisal Tool ◽

High Credibility

BACKGROUND Tools used to appraise the credibility of health information are time-consuming to apply and require context-specific expertise, limiting their use for quickly identifying and mitigating the spread of misinformation as it emerges. OBJECTIVE The aim of this study was to estimate the proportion of vaccine-related Twitter posts linked to Web pages of low credibility and measure the potential reach of those posts. METHODS Sampling from 143,003 unique vaccine-related Web pages shared on Twitter between January 2017 and March 2018, we used a 7-point checklist adapted from validated tools and guidelines to manually appraise the credibility of 474 Web pages. These were used to train several classifiers (random forests, support vector machines, and recurrent neural networks) using the text from a Web page to predict whether the information satisfies each of the 7 criteria. Estimating the credibility of all other Web pages, we used the follower network to estimate potential exposures relative to a credibility score defined by the 7-point checklist. RESULTS The best-performing classifiers were able to distinguish between low, medium, and high credibility with an accuracy of 78% and labeled low-credibility Web pages with a precision of over 96%. Across the set of unique Web pages, 11.86% (16,961 of 143,003) were estimated as low credibility and they generated 9.34% (1.64 billion of 17.6 billion) of potential exposures. The 100 most popular links to low credibility Web pages were each potentially seen by an estimated 2 million to 80 million Twitter users globally. CONCLUSIONS The results indicate that although a small minority of low-credibility Web pages reach a large audience, low-credibility Web pages tend to reach fewer users than other Web pages overall and are more commonly shared within certain subpopulations. An automatic credibility appraisal tool may be useful for finding communities of users at higher risk of exposure to low-credibility vaccine communications.

Download Full-text