An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm

2021 ◽  
Vol 21 (2) ◽  
pp. 105-120
Author(s):  
K. S. Sakunthala Prabha ◽  
C. Mahesh ◽  
S. P. Raja

Abstract A topic-specific crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not occur directly in the web page. The semantic-based similarity measure provides a precise relevance score even when only synonyms of the given topic occur in the web page; however, if the topic is absent from the ontology, semantic focused crawlers also produce inaccurate relevance scores. This paper overcomes these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity-based measure with a probabilistic similarity-based measure. The experimental results reveal that this algorithm increases the efficiency of focused web crawlers and achieves a better Harvest Rate (HR), Precision (P) and Irrelevance Ratio (IR) than existing focused web crawlers.
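As a rough illustration of the idea, the sketch below blends a cosine score over term-frequency vectors with a fallback semantic score drawn from a synonym table; the `SYNONYMS` dictionary and the 0.5 weight are hypothetical stand-ins, since the paper's exact probabilistic measure is not reproduced here.

```python
import math
from collections import Counter

def cosine_similarity(doc_terms, topic_terms):
    """Cosine similarity between two bags of words."""
    a, b = Counter(doc_terms), Counter(topic_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical synonym table standing in for the ontology lookup.
SYNONYMS = {"car": {"automobile", "vehicle"}}

def semantic_score(doc_terms, topic):
    """1.0 if the topic or any known synonym occurs in the page."""
    terms = set(doc_terms)
    if topic in terms:
        return 1.0
    return 1.0 if SYNONYMS.get(topic, set()) & terms else 0.0

def hybrid_relevance(doc_terms, topic, alpha=0.5):
    """Blend of semantic and cosine scores; alpha is an assumed weight."""
    return alpha * semantic_score(doc_terms, topic) + \
           (1 - alpha) * cosine_similarity(doc_terms, [topic])

# Synonym hit scores even though "car" never occurs on the page.
print(hybrid_relevance(["the", "automobile", "market"], "car"))  # 0.5
```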

2018 ◽  
Vol 7 (3.6) ◽  
pp. 106
Author(s):  
B J. Santhosh Kumar ◽  
Kankanala Pujitha

The application uses the URL as input for detecting web application vulnerabilities. If the URL is too long, scanning it consumes more time (Ain Zubaidah et al., 2014), and existing systems can inspect individual web pages but not the web application as a whole. This application tests URLs of any length using a string-matching algorithm. To prevent XSS and CSRF and to detect attacks that try to sidestep browser-enforced policies, it applies whitelisting and DOM sandboxing techniques (Elias Athanasopoulos et al., 2012). The web application maintains a whitelist of cryptographic hashes of legitimate (trusted) client-side scripts: if a script's hash is found in the whitelist, the script is considered trusted; otherwise it is not. SHA-1 is used to create the message digest. The web server stores trusted scripts inside div or span HTML elements whose attributes mark them as trusted, while DOM sandboxing helps identify injected script or code. Partitioning program symbols into code and non-code helps identify any hidden code in a trusted tag that bypasses the web server. The website is scanned to detect injection points, malicious XSS attack vectors are injected at those points, and the vulnerable web application is checked for these attacks (Shashank Gupta et al., 2015). The proposed application improves the false-negative rate.
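A minimal sketch of the whitelist check described above, assuming the trusted digests are kept in an in-memory set; the script contents here are illustrative only.

```python
import hashlib

def sha1_digest(script: str) -> str:
    """SHA-1 message digest of a script's source text."""
    return hashlib.sha1(script.encode("utf-8")).hexdigest()

# Hypothetical set of trusted client-side scripts shipped with the app;
# in practice the server would store only their digests.
TRUSTED_SCRIPTS = ["console.log('ok');", "document.title = 'home';"]
TRUSTED_HASHES = {sha1_digest(s) for s in TRUSTED_SCRIPTS}

def is_trusted(script: str) -> bool:
    """A script is trusted only if its digest appears in the whitelist."""
    return sha1_digest(script) in TRUSTED_HASHES

assert is_trusted("console.log('ok');")
assert not is_trusted("alert(document.cookie);")  # injected script rejected
```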


Author(s):  
Dr. R. Rooba et al.

The web page recommendation is generated using the navigational history from web server log files. The Semantic Variable Length Markov Chain Model (SVLMC) is a web page recommendation system that generates recommendations by combining a higher-order Markov model with rich semantic data. The problems of state space complexity and time complexity in SVLMC were resolved by the Semantic Variable Length confidence pruned Markov Chain Model (SVLCPMC) and the Support vector machine based SVLCPMC (SSVLCPMC) methods, respectively. The recommendation accuracy was further improved by quickest change detection using the Kullback-Leibler divergence method. In this paper, socio-semantic information is included with the similarity score, which improves the recommendation accuracy. Social information from social websites such as Twitter is considered for web page recommendation. Initially, a number of web pages are collected and the similarity between web pages is computed by comparing their semantic information. Term frequency-inverse document frequency (tf-idf) is used to produce a composite weight, from which the most important terms in the web pages are extracted. Then the Pointwise Mutual Information (PMI) between the most important terms and the terms in the Twitter dataset is calculated; the PMI metric measures the closeness between the Twitter terms and the most important terms in the web pages. This measure is then added to the similarity score matrix to provide socio-semantic search information for recommendation generation. The experimental results show that the proposed method performs better in terms of prediction accuracy, precision, F1 measure, R measure and coverage.
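A rough sketch of the PMI step under simple assumptions: term probabilities are estimated from document-level occurrence counts, and the toy corpus stands in for the web pages plus the Twitter dataset.

```python
import math

# Toy corpus: each document is a set of terms (web pages plus tweets).
corpus = [
    {"movie", "review", "actor"},
    {"movie", "trailer"},
    {"actor", "award", "movie"},
]

def pmi(term_a: str, term_b: str, docs) -> float:
    """Pointwise mutual information estimated from document co-occurrence."""
    n = len(docs)
    p_a = sum(term_a in d for d in docs) / n
    p_b = sum(term_b in d for d in docs) / n
    p_ab = sum(term_a in d and term_b in d for d in docs) / n
    if p_ab == 0 or p_a == 0 or p_b == 0:
        return float("-inf")  # never co-occur
    return math.log(p_ab / (p_a * p_b))

print(pmi("actor", "award", corpus))  # positive: terms attract each other
```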


Author(s):  
Suwan Tongphu

A similarity measure is a classical problem in Description Logic that aims at identifying the similarity between concepts in an ontology. Finding a hierarchy distance among concepts in an ontology is one popular technique; however, a major drawback of such a technique is that it usually ignores the analysis of concept definitions. This work introduces a new method for similarity measurement. The proposed system semantically analyzes the structures of two concept descriptions and then computes the similarity score based on the number of shared features. The efficiency of the proposed algorithm is measured by means of the satisfaction of desirable properties and intensive experiments on the SNOMED CT ontology.
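The paper's structural measure is not reproduced here; as a stand-in, the sketch below scores two concept descriptions by the proportion of features they share (a Jaccard-style ratio), with the feature sets given as hypothetical examples.

```python
def shared_feature_similarity(features_a: set, features_b: set) -> float:
    """Similarity as the fraction of features the two concepts share."""
    if not features_a and not features_b:
        return 1.0  # two empty descriptions are trivially identical
    return len(features_a & features_b) / len(features_a | features_b)

# Hypothetical concept descriptions as sets of role/filler features.
heart_disease = {"affects:heart", "is_a:disease", "site:chest"}
lung_disease = {"affects:lung", "is_a:disease", "site:chest"}
print(shared_feature_similarity(heart_disease, lung_disease))  # 0.5
```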


This paper addresses supplementary factors to the general PageRank calculation that Google uses to rank sites in its search results. These additional factors comprise a few ideas that explicitly increase the precision of the estimated PageRank value. The similarity between the content of a web page and the text extracted from the top pages returned by searching on a few keywords of that page is judged with a similarity measure, yielding a value or percentage that represents the importance or similarity factor. Further, if sentiment analysis is applied in the same way, the search results for the keywords can be analysed against the keywords of the page under consideration, yielding a sentiment-analysis factor. In this way the page-ranking procedure can be improved and executed with better accuracy. The Hadoop Distributed File System is used to compute the PageRank of the input nodes, and Python is chosen for the parallel PageRank algorithm executed on Hadoop.
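For reference, a minimal single-machine sketch of the iterative PageRank computation that the paper parallelizes on Hadoop; the toy link graph and the damping factor 0.85 are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping page -> list of outlinks."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if not outs:
                continue
            share = damping * rank[page] / len(outs)  # split rank evenly
            for out in outs:
                new_rank[out] += share
        rank = new_rank
    return rank

# Hypothetical three-page link graph.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```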


2019 ◽  
Vol 8 (3) ◽  
pp. 6756-6762

A recommendation algorithm comprises two important steps: 1) rating prediction, and 2) recommendation. Rating prediction is a cumulative function of the similarity score between two movies and the rating history of those movies by other users. There are various methods for rating prediction, such as the weighted sum method, regression, deviation-based methods, etc. All these methods rely on finding items similar to the items previously viewed/rated by the target user, with the assumption that a user tends to give similar ratings to similar items. The similarities can be computed using various similarity measures such as Euclidean distance, Cosine Similarity, Adjusted Cosine Similarity, Pearson Correlation, Jaccard Similarity, etc. All of these well-known approaches calculate the similarity score between two movies using simple rating-based data; hence, such similarity measures cannot accurately model the rating behavior of a user. In this paper, we show that the accuracy of rating prediction can be enhanced by incorporating ontological domain knowledge into the similarity computation. This paper introduces a new ontological semantic similarity measure between two movies. For experimental evaluation, the performance of the proposed approach is compared with two existing approaches, 1) Adjusted Cosine Similarity (ACS) and 2) the Weighted Slope One (WSO) algorithm, in terms of two performance measures: 1) execution time and 2) Mean Absolute Error (MAE). The open-source MovieLens (ml-1m) dataset is used for experimental evaluation. As our results show, the ontological semantic similarity measure enhances the performance of rating prediction as compared to the existing well-known approaches.
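A minimal sketch of the weighted sum prediction step mentioned above, assuming a precomputed item-item similarity function; the similarity values and ratings are illustrative.

```python
def predict_rating(user_ratings, similarities):
    """Weighted sum prediction: similarity-weighted average of the user's
    ratings on items similar to the target item.

    user_ratings: {item: rating given by the target user}
    similarities: {item: similarity between that item and the target item}
    """
    num = sum(similarities[i] * r for i, r in user_ratings.items()
              if i in similarities)
    den = sum(abs(similarities[i]) for i in user_ratings if i in similarities)
    return num / den if den else 0.0

# Hypothetical data: the user rated two movies similar to the target one.
print(predict_rating({"movie_a": 4.0, "movie_b": 2.0},
                     {"movie_a": 0.9, "movie_b": 0.3}))  # 3.5, closer to 4.0
```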


Author(s):  
GAURAV AGARWAL ◽  
SACHI GUPTA ◽  
SAURABH MUKHERJEE

Today, web servers are the key repositories of information, and the Internet is the source for retrieving it. There is a mammoth amount of data on the Internet, and it becomes a difficult job to search out the relevant data; the search engine plays a vital role in this. A search engine follows these steps: web crawling by the crawler, indexing by the indexer and searching by the searcher. The web crawler retrieves information from web pages by following every link on a site, which the search engine stores; the content of each web page is then indexed by the indexer. The main role of the indexer is to ensure that data can be fetched quickly according to user requirements. When a client issues a query, the search engine searches for results corresponding to this query to provide excellent output. The ambition here is to develop an algorithm for a search engine that returns the most desirable results for the user's requirements. A ranking method is used by the search engine to rank the web pages. Various ranking approaches are discussed in the literature, but in this paper a ranking algorithm is proposed that is based on the parent-child relationship. The proposed ranking algorithm is based on the priority assignment phase of the Heterogeneous Earliest Finish Time (HEFT) algorithm, which was designed for multiprocessor task scheduling. The proposed algorithm works on three range variables: the density of keywords, the number of successors of a node and the age of the web page. Density shows the occurrence of the keyword on a particular web page; the number of successors represents the outgoing links from a single web page; and age is the freshness value of the web page, so the page modified most recently is the freshest page and has the smallest age, or largest freshness value. The proposed technique requires that the priority of each page be set to its downward rank value and that pages be arranged in ascending/descending order of their rank values. Experiments show that our algorithm is valuable: after comparison with Google, we find that our algorithm performs better, working better than Google for 70% of the problems.
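The paper's exact downward-rank formula is not given here; as a sketch, the score below combines the three range variables with assumed weights (higher is better), and the pages are hypothetical.

```python
def page_score(keyword_density, num_successors, age,
               w_density=0.5, w_links=0.3, w_fresh=0.2):
    """Toy priority score over the three range variables; the weights and
    the 1/(1+age) freshness transform are assumptions, not the paper's."""
    freshness = 1.0 / (1.0 + age)  # smaller age => fresher => higher value
    return (w_density * keyword_density
            + w_links * num_successors
            + w_fresh * freshness)

pages = {
    "page_a": page_score(keyword_density=0.08, num_successors=12, age=2),
    "page_b": page_score(keyword_density=0.15, num_successors=3, age=30),
}
# Arrange pages in descending order of their rank values.
for page, score in sorted(pages.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```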


2018 ◽  
Vol 173 ◽  
pp. 03020
Author(s):  
Lu Xing-Hua ◽  
Ye Wen-Quan ◽  
Liu Ming-Yuan

In order to improve users' ability to access websites and web pages, a personalized recommendation design is carried out according to the interest preferences of the user, and a personalized recommendation model for web page visits is established to meet the user's personalized interests when browsing the web. A web page personalized recommendation algorithm based on association rule mining is proposed. Based on the semantic features of web pages, user browsing behavior is calculated by similarity computation, and a web crawler algorithm is constructed to extract the semantic features of web pages. The autocorrelation matching method is used to match the features of web pages to user browsing behavior, and the association rule features of users' website browsing behavior are mined. According to the semantic relevance and semantic information of web users' search words, fuzzy registration is applied, and a personalized web recommendation is obtained that meets the user's browsing needs. The simulation results show that the method is accurate and user satisfaction is higher.
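A bare-bones sketch of the association-rule step, assuming browsing sessions are sets of visited pages; the minimum support and confidence thresholds are illustrative.

```python
from itertools import combinations

# Hypothetical browsing sessions: each is the set of pages one user visited.
sessions = [
    {"home", "news", "sports"},
    {"home", "news"},
    {"home", "sports"},
    {"news", "sports"},
]

def mine_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Yield page -> page rules with enough support and confidence."""
    n = len(sessions)
    pages = set().union(*sessions)
    for a, b in combinations(sorted(pages), 2):
        for x, y in ((a, b), (b, a)):
            support = sum(x in s and y in s for s in sessions) / n
            base = sum(x in s for s in sessions) / n
            if support >= min_support and support / base >= min_confidence:
                yield x, y, support, support / base

for x, y, sup, conf in mine_rules(sessions):
    print(f"{x} -> {y}  support={sup:.2f} confidence={conf:.2f}")
```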


Author(s):  
Hammad Majeed ◽  
Firoza Erum

The Internet is growing fast, with millions of web pages containing information on every topic. The data placed on the Internet is not organized, which makes the search process difficult. Classifying web pages into predefined classes can improve the organization of this data. In this chapter a semantic-based technique is presented to classify a text corpus with high accuracy. This technique uses some well-known pre-processing techniques like word stemming, term frequency, and degree of uniqueness. In addition, a new semantic similarity measure is computed between different terms. The authors believe that semantic similarity-based comparison, in addition to syntactic matching, makes the classification process significantly more accurate. The proposed technique is tested on a benchmark dataset and the results are compared with already published results. The obtained results are significantly better, and are achieved using a quite small, highly relevant feature set.
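As one possible rendering of the idea, the sketch below stems terms for syntactic matching and falls back to a WordNet path similarity when surface forms differ; it assumes NLTK with the WordNet corpus downloaded, and is not the chapter's exact measure.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus fetch
stemmer = PorterStemmer()

def term_similarity(t1: str, t2: str) -> float:
    """Exact match after stemming scores 1.0; otherwise use the best
    WordNet path similarity between any senses of the two terms."""
    if stemmer.stem(t1) == stemmer.stem(t2):
        return 1.0
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(t1) for s2 in wn.synsets(t2)]
    return max(scores, default=0.0)

print(term_similarity("cars", "car"))        # 1.0 via stemming
print(term_similarity("car", "automobile"))  # high via WordNet synonymy
```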


2019 ◽  
Vol 16 (3) ◽  
pp. 815-830
Author(s):  
Xingchen Li ◽  
Weizhe Zhang ◽  
Desheng Wang ◽  
Bin Zhang ◽  
Hui He

Phishing often deceives users through the close similarity of a phishing page's layout to the true page, and it leads to considerable losses for society. Consequently, detecting phishing sites has become an urgent task. By studying phishing web pages built from web page screenshots, we discover that this kind of web page uses numerous screenshots to achieve close similarity to the true page and to evade text- and structure-based similarity detection. This study introduces a new similarity matching algorithm based on visual blocks. First, the RenderLayer tree of the web page is obtained to extract the visual blocks. Second, an algorithm is designed to settle the jumbled visual blocks, including the deletion of small visual blocks and the merging of overlapping visual blocks. Finally, the similarity between the two web pages is assessed. The proposed algorithm sets different thresholds to achieve the optimal missed-detection and false alarm rates.
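A simplified sketch of the block cleanup stage, assuming visual blocks are axis-aligned rectangles (x1, y1, x2, y2); the minimum-area threshold and the merge rule are assumptions.

```python
def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def overlaps(a, b):
    """True if two axis-aligned rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge(a, b):
    """Bounding box covering both rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def clean_blocks(blocks, min_area=100):
    """Drop small blocks, then repeatedly merge any overlapping pair."""
    blocks = [b for b in blocks if area(b) >= min_area]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if overlaps(blocks[i], blocks[j]):
                    blocks[i] = merge(blocks[i], blocks[j])
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks

# Hypothetical blocks: one tiny artifact and two overlapping screenshots.
print(clean_blocks([(0, 0, 5, 5), (0, 0, 200, 100), (150, 50, 300, 200)]))
```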


2019 ◽  
Vol 19 (2) ◽  
pp. 146-158 ◽  
Author(s):  
S. R. Mani Sekhar ◽  
G. M. Siddesh ◽  
Sunilkumar S. Manvi ◽  
K. G. Srinivasa

Abstract With the fast growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the Internet. Web crawlers face an indeterminate latency problem due to differences in their response times. The proposed work attempts to optimize the design and implementation of focused web crawlers using a master-slave architecture for bioinformatics web sources. Focused crawlers ideally should crawl only relevant pages, but the relevance of a page can only be estimated after crawling the genomics pages. A solution for predicting page relevance, based on Natural Language Processing, is proposed in this paper. The frequency of the keywords in the top-ranked sentences of the page determines the relevance of the pages within genomics sources. The proposed solution uses the TextRank algorithm to rank the sentences as well as to ensure the correct classification of bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first search web crawler; the comparison shows a significant reduction in run time for the same harvest rate.
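A compact sketch of the relevance scoring described above: sentences are ranked with a TextRank-style PageRank over word-overlap similarity, then keyword frequency is counted in the top sentences. It assumes the networkx package; the similarity function and the top-k cutoff are simplifications.

```python
import networkx as nx

def sentence_overlap(s1, s2):
    """Word-overlap (Jaccard) similarity between two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / (len(w1 | w2) or 1)

def page_relevance(sentences, keywords, top_k=3):
    """Rank sentences with PageRank over the similarity graph, then count
    keyword hits in the top-k sentences as the page's relevance score."""
    graph = nx.Graph()
    for i, s1 in enumerate(sentences):
        for j, s2 in enumerate(sentences[i + 1:], start=i + 1):
            weight = sentence_overlap(s1, s2)
            if weight > 0:
                graph.add_edge(i, j, weight=weight)
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return sum(sentences[i].lower().count(k) for i in top for k in keywords)

sentences = ["Gene expression data is noisy.",
             "Genomics studies gene expression across samples.",
             "The weather is sunny today."]
print(page_relevance(sentences, keywords=["gene", "genomics"]))
```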

