An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm

2021 ◽  
Vol 21 (2) ◽  
pp. 105-120
Author(s):  
K. S. Sakunthala Prabha ◽  
C. Mahesh ◽  
S. P. Raja

Abstract A topic-specific crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not occur directly in the web page. The semantic-based similarity measure provides a precise relevance score even when only synonyms of the given topic occur in the web page; however, if the topic is absent from the ontology, semantic focused crawlers also produce inaccurate relevance scores. This paper overcomes these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity-based measure with a probabilistic similarity-based measure. The experimental results reveal that this algorithm increases the efficiency of focused web crawlers and achieves a better Harvest Rate (HR), Precision (P) and Irrelevance Ratio (IR) than existing focused web crawlers.
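As a rough illustration of the idea, the sketch below blends a cosine score over term-frequency vectors with a fallback semantic score drawn from a synonym table; the `SYNONYMS` dictionary and the 0.5 weight are hypothetical stand-ins, since the paper's exact probabilistic measure is not reproduced here.

```python
import math
from collections import Counter

def cosine_similarity(doc_terms, topic_terms):
    """Cosine similarity between two bags of words."""
    a, b = Counter(doc_terms), Counter(topic_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical synonym table standing in for the ontology lookup.
SYNONYMS = {"car": {"automobile", "vehicle"}}

def semantic_score(doc_terms, topic):
    """1.0 if the topic or any known synonym occurs in the page."""
    terms = set(doc_terms)
    if topic in terms:
        return 1.0
    return 1.0 if SYNONYMS.get(topic, set()) & terms else 0.0

def hybrid_relevance(doc_terms, topic, alpha=0.5):
    """Blend of semantic and cosine scores; alpha is an assumed weight."""
    return alpha * semantic_score(doc_terms, topic) + \
           (1 - alpha) * cosine_similarity(doc_terms, [topic])

# Synonym hit scores even though "car" never occurs on the page.
print(hybrid_relevance(["the", "automobile", "market"], "car"))  # 0.5
```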

2018 ◽  
Vol 7 (3.6) ◽  
pp. 106
Author(s):  
B J. Santhosh Kumar ◽  
Kankanala Pujitha

The application uses the URL as input for detecting web application vulnerabilities. If the URL is too long, scanning it consumes more time (Ain Zubaidah et al., 2014), and existing systems can inspect individual web pages but not the web application as a whole. This application tests URLs of any length using a string-matching algorithm. To prevent XSS and CSRF and to detect attacks that try to sidestep browser-enforced policies, it applies whitelisting and DOM sandboxing techniques (Elias Athanasopoulos et al., 2012). The web application maintains a whitelist of cryptographic hashes of legitimate (trusted) client-side scripts: if a script's hash is found in the whitelist, the script is considered trusted; otherwise it is not. SHA-1 is used to create the message digest. The web server stores trusted scripts inside div or span HTML elements whose attributes mark them as trusted, while DOM sandboxing helps identify injected script or code. Partitioning program symbols into code and non-code helps identify any hidden code in a trusted tag that bypasses the web server. The website is scanned to detect injection points, malicious XSS attack vectors are injected at those points, and the vulnerable web application is checked for these attacks (Shashank Gupta et al., 2015). The proposed application improves the false-negative rate.
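A minimal sketch of the whitelist check described above, assuming the trusted digests are kept in an in-memory set; the script contents here are illustrative only.

```python
import hashlib

def sha1_digest(script: str) -> str:
    """SHA-1 message digest of a script's source text."""
    return hashlib.sha1(script.encode("utf-8")).hexdigest()

# Hypothetical set of trusted client-side scripts shipped with the app;
# in practice the server would store only their digests.
TRUSTED_SCRIPTS = ["console.log('ok');", "document.title = 'home';"]
TRUSTED_HASHES = {sha1_digest(s) for s in TRUSTED_SCRIPTS}

def is_trusted(script: str) -> bool:
    """A script is trusted only if its digest appears in the whitelist."""
    return sha1_digest(script) in TRUSTED_HASHES

assert is_trusted("console.log('ok');")
assert not is_trusted("alert(document.cookie);")  # injected script rejected
```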


Author(s):  
Dr. R. Rooba et al.

The web page recommendation is generated using the navigational history from web server log files. The Semantic Variable Length Markov Chain Model (SVLMC) is a web page recommendation system that generates recommendations by combining a higher-order Markov model with rich semantic data. The problems of state space complexity and time complexity in SVLMC were resolved by the Semantic Variable Length confidence pruned Markov Chain Model (SVLCPMC) and the Support vector machine based SVLCPMC (SSVLCPMC) methods, respectively. The recommendation accuracy was further improved by quickest change detection using the Kullback-Leibler divergence method. In this paper, socio-semantic information is included with the similarity score, which improves the recommendation accuracy. Social information from social websites such as Twitter is considered for web page recommendation. Initially, a number of web pages are collected and the similarity between web pages is computed by comparing their semantic information. Term frequency-inverse document frequency (tf-idf) is used to produce a composite weight, from which the most important terms in the web pages are extracted. Then the Pointwise Mutual Information (PMI) between the most important terms and the terms in the Twitter dataset is calculated; the PMI metric measures the closeness between the Twitter terms and the most important terms in the web pages. This measure is then added to the similarity score matrix to provide socio-semantic search information for recommendation generation. The experimental results show that the proposed method performs better in terms of prediction accuracy, precision, F1 measure, R measure and coverage.
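A rough sketch of the PMI step under simple assumptions: term probabilities are estimated from document-level occurrence counts, and the toy corpus stands in for the web pages plus the Twitter dataset.

```python
import math

# Toy corpus: each document is a set of terms (web pages plus tweets).
corpus = [
    {"movie", "review", "actor"},
    {"movie", "trailer"},
    {"actor", "award", "movie"},
]

def pmi(term_a: str, term_b: str, docs) -> float:
    """Pointwise mutual information estimated from document co-occurrence."""
    n = len(docs)
    p_a = sum(term_a in d for d in docs) / n
    p_b = sum(term_b in d for d in docs) / n
    p_ab = sum(term_a in d and term_b in d for d in docs) / n
    if p_ab == 0 or p_a == 0 or p_b == 0:
        return float("-inf")  # never co-occur
    return math.log(p_ab / (p_a * p_b))

print(pmi("actor", "award", corpus))  # positive: terms attract each other
```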


Author(s):  
Suwan Tongphu

A similarity measure is a classical problem in Description Logic that aims at identifying the similarity between concepts in an ontology. Finding a hierarchy distance among concepts in an ontology is one popular technique; however, a major drawback of such a technique is that it usually ignores the analysis of concept definitions. This work introduces a new method for similarity measurement. The proposed system semantically analyzes the structures of two concept descriptions and then computes the similarity score based on the number of shared features. The efficiency of the proposed algorithm is measured by means of the satisfaction of desirable properties and intensive experiments on the SNOMED CT ontology.
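The paper's structural measure is not reproduced here; as a stand-in, the sketch below scores two concept descriptions by the proportion of features they share (a Jaccard-style ratio), with the feature sets given as hypothetical examples.

```python
def shared_feature_similarity(features_a: set, features_b: set) -> float:
    """Similarity as the fraction of features the two concepts share."""
    if not features_a and not features_b:
        return 1.0  # two empty descriptions are trivially identical
    return len(features_a & features_b) / len(features_a | features_b)

# Hypothetical concept descriptions as sets of role/filler features.
heart_disease = {"affects:heart", "is_a:disease", "site:chest"}
lung_disease = {"affects:lung", "is_a:disease", "site:chest"}
print(shared_feature_similarity(heart_disease, lung_disease))  # 0.5
```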


This paper addresses supplementary factors to the general PageRank calculation that Google uses to rank sites in its search results. These additional factors comprise a few ideas that explicitly increase the precision of the estimated PageRank value. The similarity between the content of a web page and the text extracted from the top pages returned by searching on a few keywords of that page is judged with a similarity measure, yielding a value or percentage that represents the importance or similarity factor. Further, if sentiment analysis is applied in the same way, the search results for the keywords can be analysed against the keywords of the page under consideration, yielding a sentiment-analysis factor. In this way the page-ranking procedure can be improved and executed with better accuracy. The Hadoop Distributed File System is used to compute the PageRank of the input nodes, and Python is chosen for the parallel PageRank algorithm executed on Hadoop.
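For reference, a minimal single-machine sketch of the iterative PageRank computation that the paper parallelizes on Hadoop; the toy link graph and the damping factor 0.85 are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping page -> list of outlinks."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if not outs:
                continue
            share = damping * rank[page] / len(outs)  # split rank evenly
            for out in outs:
                new_rank[out] += share
        rank = new_rank
    return rank

# Hypothetical three-page link graph.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```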


2019 ◽  
Vol 8 (3) ◽  
pp. 6756-6762

A recommendation algorithm comprises two important steps: 1) rating prediction, and 2) recommendation. Rating prediction is a cumulative function of the similarity score between two movies and the rating history of those movies by other users. There are various methods for rating prediction, such as the weighted sum method, regression, deviation-based methods, etc. All these methods rely on finding items similar to the items previously viewed/rated by the target user, with the assumption that a user tends to give similar ratings to similar items. The similarities can be computed using various similarity measures such as Euclidean distance, Cosine Similarity, Adjusted Cosine Similarity, Pearson Correlation, Jaccard Similarity, etc. All of these well-known approaches calculate the similarity score between two movies using simple rating-based data; hence, such similarity measures cannot accurately model the rating behavior of a user. In this paper, we show that the accuracy of rating prediction can be enhanced by incorporating ontological domain knowledge into the similarity computation. This paper introduces a new ontological semantic similarity measure between two movies. For experimental evaluation, the performance of the proposed approach is compared with two existing approaches, 1) Adjusted Cosine Similarity (ACS) and 2) the Weighted Slope One (WSO) algorithm, in terms of two performance measures: 1) execution time and 2) Mean Absolute Error (MAE). The open-source MovieLens (ml-1m) dataset is used for experimental evaluation. As our results show, the ontological semantic similarity measure enhances the performance of rating prediction as compared to the existing well-known approaches.
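A minimal sketch of the weighted sum prediction step mentioned above, assuming a precomputed item-item similarity function; the similarity values and ratings are illustrative.

```python
def predict_rating(user_ratings, similarities):
    """Weighted sum prediction: similarity-weighted average of the user's
    ratings on items similar to the target item.

    user_ratings: {item: rating given by the target user}
    similarities: {item: similarity between that item and the target item}
    """
    num = sum(similarities[i] * r for i, r in user_ratings.items()
              if i in similarities)
    den = sum(abs(similarities[i]) for i in user_ratings if i in similarities)
    return num / den if den else 0.0

# Hypothetical data: the user rated two movies similar to the target one.
print(predict_rating({"movie_a": 4.0, "movie_b": 2.0},
                     {"movie_a": 0.9, "movie_b": 0.3}))  # 3.5, closer to 4.0
```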


Author(s):  
GAURAV AGARWAL ◽  
SACHI GUPTA ◽  
SAURABH MUKHERJEE

Today, web servers are the key repositories of information, and the Internet is the source for retrieving it. There is a mammoth amount of data on the Internet, and it becomes a difficult job to search out the relevant data; the search engine plays a vital role in this. A search engine follows these steps: web crawling by the crawler, indexing by the indexer and searching by the searcher. The web crawler retrieves information from web pages by following every link on a site, which the search engine stores; the content of each web page is then indexed by the indexer. The main role of the indexer is to ensure that data can be fetched quickly according to user requirements. When a client issues a query, the search engine searches for results corresponding to this query to provide excellent output. The ambition here is to develop an algorithm for a search engine that returns the most desirable results for the user's requirements. A ranking method is used by the search engine to rank the web pages. Various ranking approaches are discussed in the literature, but in this paper a ranking algorithm is proposed that is based on the parent-child relationship. The proposed ranking algorithm is based on the priority assignment phase of the Heterogeneous Earliest Finish Time (HEFT) algorithm, which was designed for multiprocessor task scheduling. The proposed algorithm works on three range variables: the density of keywords, the number of successors of a node and the age of the web page. Density shows the occurrence of the keyword on a particular web page; the number of successors represents the outgoing links from a single web page; and age is the freshness value of the web page, so the page modified most recently is the freshest page and has the smallest age, or largest freshness value. The proposed technique requires that the priority of each page be set to its downward rank value and that pages be arranged in ascending/descending order of their rank values. Experiments show that our algorithm is valuable: after comparison with Google, we find that our algorithm performs better, working better than Google for 70% of the problems.
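The paper's exact downward-rank formula is not given here; as a sketch, the score below combines the three range variables with assumed weights (higher is better), and the pages are hypothetical.

```python
def page_score(keyword_density, num_successors, age,
               w_density=0.5, w_links=0.3, w_fresh=0.2):
    """Toy priority score over the three range variables; the weights and
    the 1/(1+age) freshness transform are assumptions, not the paper's."""
    freshness = 1.0 / (1.0 + age)  # smaller age => fresher => higher value
    return (w_density * keyword_density
            + w_links * num_successors
            + w_fresh * freshness)

pages = {
    "page_a": page_score(keyword_density=0.08, num_successors=12, age=2),
    "page_b": page_score(keyword_density=0.15, num_successors=3, age=30),
}
# Arrange pages in descending order of their rank values.
for page, score in sorted(pages.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```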


2018 ◽  
Vol 173 ◽  
pp. 03020
Author(s):  
Lu Xing-Hua ◽  
Ye Wen-Quan ◽  
Liu Ming-Yuan

In order to improve users' ability to access websites and web pages, a personalized recommendation design is carried out according to the interest preferences of the user, and a personalized recommendation model for web page visits is established to meet the user's personalized interests when browsing the web. A web page personalized recommendation algorithm based on association rule mining is proposed. Based on the semantic features of web pages, user browsing behavior is calculated by similarity computation, and a web crawler algorithm is constructed to extract the semantic features of web pages. The autocorrelation matching method is used to match the features of web pages to user browsing behavior, and the association rule features of users' website browsing behavior are mined. According to the semantic relevance and semantic information of web users' search words, fuzzy registration is applied, and a personalized web recommendation is obtained that meets the user's browsing needs. The simulation results show that the method is accurate and user satisfaction is higher.
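A bare-bones sketch of the association-rule step, assuming browsing sessions are sets of visited pages; the minimum support and confidence thresholds are illustrative.

```python
from itertools import combinations

# Hypothetical browsing sessions: each is the set of pages one user visited.
sessions = [
    {"home", "news", "sports"},
    {"home", "news"},
    {"home", "sports"},
    {"news", "sports"},
]

def mine_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Yield page -> page rules with enough support and confidence."""
    n = len(sessions)
    pages = set().union(*sessions)
    for a, b in combinations(sorted(pages), 2):
        for x, y in ((a, b), (b, a)):
            support = sum(x in s and y in s for s in sessions) / n
            base = sum(x in s for s in sessions) / n
            if support >= min_support and support / base >= min_confidence:
                yield x, y, support, support / base

for x, y, sup, conf in mine_rules(sessions):
    print(f"{x} -> {y}  support={sup:.2f} confidence={conf:.2f}")
```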


Author(s):  
Hammad Majeed ◽  
Firoza Erum

The Internet is growing fast, with millions of web pages containing information on every topic. The data placed on the Internet is not organized, which makes the search process difficult. Classifying web pages into predefined classes can improve the organization of this data. In this chapter a semantic-based technique is presented to classify a text corpus with high accuracy. This technique uses some well-known pre-processing techniques like word stemming, term frequency, and degree of uniqueness. In addition, a new semantic similarity measure is computed between different terms. The authors believe that semantic similarity-based comparison, in addition to syntactic matching, makes the classification process significantly more accurate. The proposed technique is tested on a benchmark dataset and the results are compared with already published results. The obtained results are significantly better, and are achieved using a quite small, highly relevant feature set.
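As one possible rendering of the idea, the sketch below stems terms for syntactic matching and falls back to a WordNet path similarity when surface forms differ; it assumes NLTK with the WordNet corpus downloaded, and is not the chapter's exact measure.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus fetch
stemmer = PorterStemmer()

def term_similarity(t1: str, t2: str) -> float:
    """Exact match after stemming scores 1.0; otherwise use the best
    WordNet path similarity between any senses of the two terms."""
    if stemmer.stem(t1) == stemmer.stem(t2):
        return 1.0
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(t1) for s2 in wn.synsets(t2)]
    return max(scores, default=0.0)

print(term_similarity("cars", "car"))        # 1.0 via stemming
print(term_similarity("car", "automobile"))  # high via WordNet synonymy
```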


2019 ◽  
Vol 16 (3) ◽  
pp. 815-830
Author(s):  
Xingchen Li ◽  
Weizhe Zhang ◽  
Desheng Wang ◽  
Bin Zhang ◽  
Hui He

Phishing often deceives users through the close similarity of a phishing page's layout to the true page, and it leads to considerable losses for society. Consequently, detecting phishing sites has become an urgent task. By studying phishing web pages built from web page screenshots, we discover that this kind of web page uses numerous screenshots to achieve close similarity to the true page and to evade text- and structure-based similarity detection. This study introduces a new similarity matching algorithm based on visual blocks. First, the RenderLayer tree of the web page is obtained to extract the visual blocks. Second, an algorithm is designed to settle the jumbled visual blocks, including the deletion of small visual blocks and the merging of overlapping visual blocks. Finally, the similarity between the two web pages is assessed. The proposed algorithm sets different thresholds to achieve the optimal missed-detection and false alarm rates.
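A simplified sketch of the block cleanup stage, assuming visual blocks are axis-aligned rectangles (x1, y1, x2, y2); the minimum-area threshold and the merge rule are assumptions.

```python
def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def overlaps(a, b):
    """True if two axis-aligned rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge(a, b):
    """Bounding box covering both rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def clean_blocks(blocks, min_area=100):
    """Drop small blocks, then repeatedly merge any overlapping pair."""
    blocks = [b for b in blocks if area(b) >= min_area]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if overlaps(blocks[i], blocks[j]):
                    blocks[i] = merge(blocks[i], blocks[j])
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks

# Hypothetical blocks: one tiny artifact and two overlapping screenshots.
print(clean_blocks([(0, 0, 5, 5), (0, 0, 200, 100), (150, 50, 300, 200)]))
```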


2019 ◽  
Vol 19 (2) ◽  
pp. 146-158 ◽  
Author(s):  
S. R. Mani Sekhar ◽  
G. M. Siddesh ◽  
Sunilkumar S. Manvi ◽  
K. G. Srinivasa

Abstract With the fast growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the Internet. Web crawlers face an indeterminate latency problem due to differences in their response times. The proposed work attempts to optimize the design and implementation of focused web crawlers using a master-slave architecture for bioinformatics web sources. Focused crawlers ideally should crawl only relevant pages, but the relevance of a page can only be estimated after crawling the genomics pages. A solution for predicting page relevance, based on Natural Language Processing, is proposed in this paper. The frequency of the keywords in the top-ranked sentences of the page determines the relevance of the pages within genomics sources. The proposed solution uses the TextRank algorithm to rank the sentences as well as to ensure the correct classification of bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first search web crawler; the comparison shows a significant reduction in run time for the same harvest rate.
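A compact sketch of the relevance scoring described above: sentences are ranked with a TextRank-style PageRank over word-overlap similarity, then keyword frequency is counted in the top sentences. It assumes the networkx package; the similarity function and the top-k cutoff are simplifications.

```python
import networkx as nx

def sentence_overlap(s1, s2):
    """Word-overlap (Jaccard) similarity between two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / (len(w1 | w2) or 1)

def page_relevance(sentences, keywords, top_k=3):
    """Rank sentences with PageRank over the similarity graph, then count
    keyword hits in the top-k sentences as the page's relevance score."""
    graph = nx.Graph()
    for i, s1 in enumerate(sentences):
        for j, s2 in enumerate(sentences[i + 1:], start=i + 1):
            weight = sentence_overlap(s1, s2)
            if weight > 0:
                graph.add_edge(i, j, weight=weight)
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return sum(sentences[i].lower().count(k) for i in top for k in keywords)

sentences = ["Gene expression data is noisy.",
             "Genomics studies gene expression across samples.",
             "The weather is sunny today."]
print(page_relevance(sentences, keywords=["gene", "genomics"]))
```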

