Detecting Web Spam Based on Novel Features from Web Page Source Code

Search engines are critical in people’s daily lives because they determine the quality of the information people obtain through searching. Fierce competition for ranking in search engines benefits neither users nor search engines. Existing research mainly studies the content and links of websites. However, none of these techniques focuses on semantic analysis of link and anchor text for detection. In this paper, we propose a web spam detection method that extracts novel feature sets from the homepage source code and uses the random forest (RF) as the classifier. The novel feature sets are extracted from the homepage’s links, hypertext markup language (HTML) structure, and semantic similarity of content. We conduct experiments on the WEBSPAM-UK2007 and UK-2011 datasets using five-fold cross-validation. In addition, we design three sets of experiments to evaluate the performance of the proposed method. Evaluated on several indicators, the proposed method with the novel feature sets performs better than other methods, with a precision of 0.929 and a recall of 0.930. Experimental results show that the proposed model can effectively detect web spam.
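
The abstract above describes features drawn from a homepage's links, HTML structure, and content similarity, without listing them. The following is a minimal sketch of that kind of extraction, not the authors' feature sets: the three features, the names `HomepageFeatureParser`, `extract_features` and `precision_recall`, and the use of word-overlap (Jaccard) similarity between anchor text and title as a crude stand-in for semantic similarity are all assumptions made for illustration. The random forest classifier and the cross-validation loop are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HomepageFeatureParser(HTMLParser):
    """Collects link targets, anchor-text words, and title words from HTML source."""

    def __init__(self):
        super().__init__()
        self.links = []          # href values of <a> tags
        self.anchor_words = []   # lower-cased words found inside <a>...</a>
        self.title_words = []    # lower-cased words found inside <title>...</title>
        self._in_anchor = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        words = data.lower().split()
        if self._in_anchor:
            self.anchor_words.extend(words)
        if self._in_title:
            self.title_words.extend(words)


def extract_features(html, site_host):
    """Return a small, hypothetical feature dictionary for one homepage."""
    parser = HomepageFeatureParser()
    parser.feed(html)
    hosts = [urlparse(href).netloc for href in parser.links]
    # Relative links have an empty host and count as internal.
    external = [host for host in hosts if host and host != site_host]
    anchors, title = set(parser.anchor_words), set(parser.title_words)
    union = anchors | title
    return {
        "num_links": len(parser.links),
        "external_link_ratio": len(external) / len(parser.links) if parser.links else 0.0,
        "anchor_title_jaccard": len(anchors & title) / len(union) if union else 0.0,
    }


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The resulting dictionaries could be turned into rows of a feature matrix and passed to any classifier; precision and recall are the two indicators the abstract reports.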

Latent Semantic Analysis using a Dennis Coefficient for English Sentiment Classification in a Parallel System

International Journal of Computers Communications & Control ◽

10.15837/ijccc.2018.3.3044 ◽

2018 ◽

Vol 13 (3) ◽

pp. 408-428 ◽

Cited By ~ 4

Author(s):

Phu Vo Ngoc

Keyword(s):

Latent Semantic Analysis ◽

Semantic Analysis ◽

Sentiment Classification ◽

Training Data ◽

The Novel ◽

Data Set ◽

Proposed Model ◽

Testing Data ◽

Novel Model ◽

Better Than

We have surveyed many significant approaches over many years, because sentiment classification makes crucial contributions that can be applied in everyday life, such as in political activities, commodity production, and commercial activities. We propose a novel model using Latent Semantic Analysis (LSA) and a Dennis Coefficient (DNC) for big data sentiment classification in English. Many LSA vectors (LSAVs) have been successfully reformed by using the DNC. We use the DNC and the LSAVs to classify the 11,000,000 documents of our testing data set based on the 5,000,000 documents of our training data set in English. This novel model uses many sentiment lexicons of our basis English sentiment dictionary (bESD). We have tested the proposed model in both a sequential environment and a distributed network system. The results of the sequential system are not as good as those of the parallel environment. We have achieved 88.76% accuracy on the testing data set, which is better than the accuracies of many previous semantic analysis models. We have also compared the novel model with previous models, and the experimental results of our proposed model are better than those of the previous models. The results of the novel model can be widely used in many fields, in commercial applications and in surveys of sentiment classification.
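
The abstract does not state the formula for the Dennis coefficient. The sketch below assumes the form usually listed in surveys of binary similarity measures, (ad - bc) / sqrt(n(a + b)(a + c)), where a counts positions set in both vectors, b and c count positions set in only one, d counts positions set in neither, and n = a + b + c + d. How the paper applies the coefficient to LSA vectors is not specified here, so the binary term vectors and the helper `classify_by_dennis` are hypothetical illustrations only.

```python
import math


def dennis_coefficient(x, y):
    """Dennis similarity between two equal-length binary (0/1) vectors.

    Assumed form: (a*d - b*c) / sqrt(n * (a + b) * (a + c)).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)          # set in both
    b = sum(1 for p, q in zip(x, y) if p and not q)      # set only in x
    c = sum(1 for p, q in zip(x, y) if not p and q)      # set only in y
    d = sum(1 for p, q in zip(x, y) if not p and not q)  # set in neither
    n = a + b + c + d
    denominator = math.sqrt(n * (a + b) * (a + c))
    return (a * d - b * c) / denominator if denominator else 0.0


def classify_by_dennis(document, labelled_vectors):
    """Hypothetical helper: return the label whose vector is most similar."""
    return max(labelled_vectors,
               key=lambda label: dennis_coefficient(document, labelled_vectors[label]))
```

For example, `classify_by_dennis([1, 1, 0, 0], {"positive": [1, 1, 0, 0], "negative": [0, 0, 1, 1]})` picks the label whose vector agrees with the document on the most positions.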

Accounting Information Quality, Interbank Competition, and Bank Risk-Taking

The Accounting Review ◽

10.2308/accr-50956 ◽

2014 ◽

Vol 90 (3) ◽

pp. 967-985 ◽

Cited By ~ 13

Author(s):

Carlos Corona ◽

Lin Nan ◽

Gaoqing Zhang

Keyword(s):

False Alarm ◽

Risk Taking ◽

Information Quality ◽

Accounting Information ◽

Capital Requirements ◽

Quality Of Accounting Information ◽

Accounting Information Quality ◽

Fierce Competition ◽

Excessive Risk Taking

We study the interaction between interbank competition and accounting information quality and their effects on banks' risk-taking behavior. We identify an endogenous false-alarm cost that banks incur when forced to sell assets to meet capital requirements. We find that when the interbank competition is less intense, an improvement in the quality of accounting information encourages banks to take more risk. Keeping the banks' investments in loans constant, the provision of high-quality accounting information reduces the false-alarm cost of asset sales and improves the discriminating efficiency of the capital requirement policy. When considering the banks' endogenous investment decisions, however, this improvement in discriminating efficiency causes excessive risk-taking, because banks respond by competing more aggressively in the deposit market, and the increase in deposit costs motivates banks to take more risk. Our paper shows that improving information quality increases risk-taking with mild competition, but has no effect under fierce competition.

Cross-Language Source Code Plagiarism Detection using Explicit Semantic Analysis and Scored Greedy String Tilling

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 ◽

10.1145/3383583.3398594 ◽

2020 ◽

Cited By ~ 1

Author(s):

Tomáš Foltýnek ◽

Richard Všianský ◽

Norman Meuschke ◽

Dita Dlabolová ◽

Bela Gipp

Keyword(s):

Semantic Analysis ◽

Source Code ◽

Plagiarism Detection ◽

Cross Language ◽

Explicit Semantic Analysis

Enhanced Content-Based Image Retrieval Using Information Oriented Angle-Based Local Tri-Directional Weber Patterns

International Journal of Image and Graphics ◽

10.1142/s0219467821500467 ◽

2021 ◽

pp. 2150046

Author(s):

Gangavarapu Venkata Satya Kumar ◽

Pillutla Gopala Krishna Mohan

Keyword(s):

Image Retrieval ◽

Computer Applications ◽

Content Based Image Retrieval ◽

Visual Features ◽

The Novel ◽

Image Content ◽

Query Image ◽

Proposed Model ◽

Maximum Mutual Information ◽

Basic Features

In diverse computer applications, the analysis of image content plays a key role. This image content may be either textual (such as text appearing in the images) or visual (such as shape, color, and texture). These two kinds of content make up an image’s basic features and are therefore a major advantage for any implementation. Many state-of-the-art models for Content-Based Image Retrieval (CBIR) are based on visual search or annotated text. As demand for multitasking grows, a new method that combines both textual and visual features needs to be introduced. This paper develops an intelligent CBIR system for a collection of different benchmark texture datasets. Here, a new descriptor named Information Oriented Angle-based Local Tri-directional Weber Patterns (IOA-LTriWPs) is adopted. The pattern operates not only on tri-direction and eight neighborhood pixels but also on four angles [Formula: see text], [Formula: see text], [Formula: see text], and [Formula: see text]. Once the patterns concerning tri-direction, eight neighborhood pixels, and four angles are taken, the best patterns are selected based on maximum mutual information. Moreover, the histogram computation of the patterns provides the final feature vector, from which the new weighted feature extraction is performed. As a new contribution, the novel weight function is optimized by the Improved MVO on random basis (IMVO-RB), in such a way that the precision and recall of the retrieved images are high. Further, the proposed model uses a logarithmic similarity measure called Mean Square Logarithmic Error (MSLE) between the features of the query image and the trained images to retrieve the relevant images. Analyses on diverse texture image datasets validate the accuracy and efficiency of the developed pattern over existing ones.
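
The retrieval step described above ranks stored images by the MSLE between their feature vectors and the query's. A minimal sketch of that step follows, assuming the standard definition of mean squared logarithmic error (the mean of squared differences of log(1 + value)) and non-negative features such as histogram counts. The names `msle` and `retrieve` and the dictionary-based feature store are illustrative assumptions; the IOA-LTriWP descriptor, the weighted feature extraction, and the IMVO-RB optimisation are not reproduced.

```python
import math


def msle(query_features, stored_features):
    """Mean squared logarithmic error between two non-negative feature vectors."""
    if len(query_features) != len(stored_features):
        raise ValueError("feature vectors must have the same length")
    squared = [(math.log1p(q) - math.log1p(s)) ** 2
               for q, s in zip(query_features, stored_features)]
    return sum(squared) / len(squared)


def retrieve(query_features, database, top_k=3):
    """Return the names of the top_k stored images with the smallest MSLE.

    `database` maps an image name to its feature vector (a hypothetical layout).
    """
    ranked = sorted(database.items(), key=lambda item: msle(query_features, item[1]))
    return [name for name, _ in ranked[:top_k]]
```

A lower MSLE means a closer match, so the ranking is ascending. Because the error is taken on a logarithmic scale, large histogram bins do not dominate the comparison the way they would under a plain squared error.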

A novel product recommendation model consolidating price, trust and online reviews

Kybernetes ◽

10.1108/k-03-2018-0143 ◽

2019 ◽

Vol 48 (6) ◽

pp. 1355-1372 ◽

Cited By ~ 1

Author(s):

Ying Huang ◽

Nu-nu Wang ◽

Hongyu Zhang ◽

Jianqiang Wang

Keyword(s):

Search Engines ◽

Distance Measure ◽

Fuzzy Theory ◽

Online Reviews ◽

Purchasing Decisions ◽

Neutrosophic Sets ◽

Content Type ◽

Product Recommendation ◽

Proposed Model

Purpose: The purpose of this paper is to propose a model for product recommendation to improve the accuracy of recommendation based on the current search engines used in e-commerce platforms like Tmall.com.

Design/methodology/approach: First, the proposed model comprehensively considers price, trust and online reviews, which all represent critical factors in consumers’ purchasing decisions. Second, the model introduces the quantization methods for these criteria incorporating fuzzy theory. Third, the model uses a distance measure between two single valued neutrosophic sets based on the prioritized average operator to consolidate the influences of positive, neutral and negative comments. Finally, the model uses multi-criteria decision-making methods to integrate the influences of price, trust and online reviews on purchasing decisions to generate recommendations.

Findings: To demonstrate the feasibility and efficiency of the proposed model, a case study is conducted based on Tmall.com. The results of the case study indicate that the recommendations of our model perform better than those of the current search engines of Tmall.com. The proposed model can significantly improve the accuracy of product recommendations based on search engines.

Originality/value: The product recommendation method can meet the critical challenge from the search engines on e-commerce platforms. In addition, the proposed method could be used in practice to develop a new application for e-commerce platforms.

Modified Decision Tree Technique for Ransomware Detection at Runtime through API Calls

Scientific Programming ◽

10.1155/2020/8845833 ◽

2020 ◽

Vol 2020 ◽

pp. 1-10

Author(s):

Faizan Ullah ◽

Qaisar Javaid ◽

Abdu Salam ◽

Masood Ahmad ◽

Nadeem Sarwar ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Decision Tree ◽

Feature Vector ◽

Machine Learning Algorithms ◽

The Novel ◽

Proposed Model ◽

Testing Accuracy ◽

Financial Losses

Ransomware (RW) is a distinctive variety of malware that encrypts files or locks the user’s system, holding the files hostage, which leads to huge financial losses for users. In this article, we propose a new model that extracts novel features from the RW dataset and classifies RW and benign files. The proposed model can detect a large number of RW samples from various families at runtime and scans the network, registry activities, and file system throughout execution. API-call series are reused to represent the behavior-based features of RW. The technique extracts a fourteen-feature vector at runtime and analyzes it by applying online machine learning algorithms to predict RW. To validate effectiveness and scalability, we test 78,550 recent malicious and benign samples and compare the results with random forest and AdaBoost; the testing accuracy reaches 99.56%.

Online information on face masks in Italian and English websites: deficiencies and responsibilities of search engines

10.1101/2020.10.23.20218271 ◽

2020 ◽

Author(s):

Shaily Meta ◽

Daria Ghezzi ◽

Alessia Catalani ◽

Tania Vanzolini ◽

Pietro Ghezzi

Keyword(s):

Public Health ◽

Mental Health ◽

Search Engines ◽

Information Quality ◽

Quality Criteria ◽

Face Mask ◽

Online Information ◽

Public Health Agencies ◽

Health Agencies ◽

The Usa

Countries have major differences in the acceptance of face mask use for the prevention of COVID-19. We analyzed 450 webpages returned by searching the string “are face masks dangerous” in Italy, the UK and the USA using three search engines (Bing, DuckDuckGo and Google). The majority (64-79%) were pages from news outlets, with few (2-6%) pages from government and public health agencies. Webpages with a positive stance on masks were more frequent in English (50%) than in Italian (36%), and those with a negative stance were more frequent in Italian (28% vs. 19% in English). Google returned the highest number of mask-positive pages and DuckDuckGo the lowest. Google also returned the lowest number of pages mentioning conspiracy theories and DuckDuckGo the highest. Webpages in Italian scored lower than those in English in transparency (reporting authors, their credentials and backing the information with references). When issues about the use of face masks were analyzed, mask effectiveness was the most discussed followed by hypercapnia (accumulation of carbon dioxide), contraindication in respiratory disease, and hypoxia, with issues related to their contraindications in mental health conditions and disability mentioned by very few pages. This study suggests that: 1) public health agencies should increase their web presence in providing correct information on face masks; 2) search engines should improve the information quality criteria in their ranking; 3) the public should be more informed on issues related to the use of masks and disabilities, mental health and stigma arising for those people who cannot wear masks.

Mathematical modeling approach to predict COVID-19 infected people in Sri Lanka

AIMS Mathematics ◽

10.3934/math.2022260 ◽

2021 ◽

Vol 7 (3) ◽

pp. 4672-4699

Author(s):

I. H. K. Premarathna ◽

H. M. Srivastava ◽

Z. A. M. S. Juman ◽

Ali AlArjani ◽

...

Keyword(s):

Sri Lanka ◽

Secondary Data ◽

Logistic Growth ◽

Solution Technique ◽

The Novel ◽

Infected People ◽

Proposed Model ◽

New Finding ◽

Stochastic Forecasting ◽

The Government

The novel coronavirus (COVID-19) has badly affected many countries (more than 180 countries, including China). More than 90% of global COVID-19 cases are currently outside China. The large, unanticipated number of COVID-19 cases has disrupted the healthcare systems of many countries and created shortages of hospital bed space. Consequently, better estimation of the number of COVID-19 infected people in Sri Lanka is vital for the government to take suitable action. This paper investigates predictions of the number of COVID-19 cases in both the first and the second waves in Sri Lanka. First, to estimate the number of future first-wave COVID-19 cases, we develop a stochastic forecasting model and present a solution technique for the model. Then, another solution method is proposed for the two existing models (the SIR model and the logistic growth model) to predict the second wave of COVID-19 cases. Finally, the proposed model and solution approaches are validated with secondary data obtained from the Epidemiology Unit, Ministry of Health, Sri Lanka. A comparative assessment against actual values of COVID-19 cases shows promising performance of our stochastic model and proposed solution techniques. Our findings should therefore benefit practitioners, academics and decision makers, especially the government of Sri Lanka, which deals with this type of decision making.
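
The two existing models named in the abstract have standard textbook forms, which can be sketched as below. This is an illustration only: the closed-form logistic curve and the forward-Euler SIR integration are the classical versions, not the solution methods proposed in the paper, and every parameter value a caller supplies (`capacity`, `rate`, `beta`, `gamma`) would be a hypothetical choice rather than a value fitted to the Sri Lankan data.

```python
import math


def logistic_cases(t, capacity, initial, rate):
    """Closed-form logistic growth: cumulative cases at time t.

    Solves dN/dt = rate * N * (1 - N / capacity) with N(0) = initial.
    """
    return capacity / (1.0 + ((capacity - initial) / initial) * math.exp(-rate * t))


def simulate_sir(population, infected0, beta, gamma, days, dt=0.1):
    """Forward-Euler integration of the classical SIR model.

    Returns one (susceptible, infected, recovered) tuple per day,
    including day 0.
    """
    s, i, r = float(population - infected0), float(infected0), 0.0
    history = [(s, i, r)]
    steps_per_day = round(1.0 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history
```

In the logistic model the case count saturates at `capacity`. In the SIR model the ratio `beta / gamma` is the basic reproduction number, so an outbreak grows at first only when that ratio exceeds 1 in a fully susceptible population.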

Predictive System of COVID-19 Using Response Based Analytical Model

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit21725 ◽

2021 ◽

pp. 05-10

Author(s):

Pallavi Mirajkar ◽

Rupali Dahake

Keyword(s):

Analytical Model ◽

Medical Services ◽

The Novel ◽

Response Data ◽

Proposed Model ◽

The World

The novel coronavirus disease 2019 (COVID-19) pandemic caused by SARS-CoV-2 continues to pose a serious threat to worldwide health. The pandemic continues to challenge medical systems around the world in many respects, including sharp increases in demand for hospital beds and critical shortages of medical equipment, while many healthcare workers have themselves been infected. We propose an analytical model that predicts a positive SARS-CoV-2 infection by considering both common and severe symptoms in patients. The proposed model works on response data from individuals indicating whether they are suffering from various symptoms of COVID-19. Consequently, the proposed model can be used for effective screening and prioritization of testing for the infection in the general population.

What makes users willing or hesitant to use Fintech?: the moderating effect of user type

Industrial Management & Data Systems ◽

10.1108/imds-07-2017-0325 ◽

2018 ◽

Vol 118 (3) ◽

pp. 541-569 ◽

Cited By ~ 31

Author(s):

Hyun-Sun Ryu

Keyword(s):

Least Squares Method ◽

Original Data ◽

The Novel ◽

Continuance Intention ◽

Content Type ◽

Factors Affecting ◽

Proposed Model ◽

Negative Effect ◽

Partial Least Squares Method ◽

Benefit And Risk

Purpose: The purpose of this paper is to better understand why people are willing or hesitant to use financial technology (Fintech), as well as to determine whether the effect of perceived benefits and risks on continuance intention differs depending on user type.

Design/methodology/approach: Original data were collected via a survey of 243 participants with Fintech usage experience. The partial least squares method was used to test the proposed model.

Findings: The results reveal that legal risk had the most negative effect on the Fintech continuance intention, while convenience had the strongest positive effect. Differences in specific benefit and risk impacts are found between early and late adopters.

Originality/value: This empirical study contributes to a novel understanding of the benefit and risk factors affecting the Fintech continuance intention.
