Ranked Deep Web Page Detection Using Reinforcement Learning and Query Optimization

This paper proposes a novel algorithm based on a reinforcement learning method called asynchronous advantage actor-critic (A3C), which optimizes overflow queries in order to crawl the ranked deep web. A3C assigns rewards and penalties to the various queries. Queries are derived from a domain-based taxonomy that helps to fill the search forms. Overflow queries are queries that match more than k results, of which only the top k are retrieved. Low-ranked documents beyond the top k are not accessible, which leads to low coverage. The proposed technique converts overflow queries into non-overflow queries and thereby increases coverage. To date, no research work has applied A3C with a taxonomy to the ranked deep web. The experimental results show that the proposed technique outperforms three other techniques (document frequency, random query, and high frequency) in terms of the average improvement metric by 26%, 69%, and 92%, respectively.
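
The overflow problem described above can be illustrated with a minimal sketch. It does not implement A3C or the paper's taxonomy; the toy index, the query strings, and the helper names `run_query` and `coverage` are all assumptions made for illustration. It only shows why a broad query capped at the top k results yields lower coverage than several narrower, non-overflow queries.

```python
from typing import Dict, List, Tuple

# Hypothetical toy source: query -> ranked list of matching document ids.
TOY_INDEX: Dict[str, List[int]] = {
    "web": [1, 2, 3, 4, 5],
    "deep web": [1, 2],
    "web crawler": [3, 4],
    "web form": [5],
}


def run_query(index: Dict[str, List[int]], query: str, k: int) -> Tuple[List[int], bool]:
    """Return the top-k matches and whether the query overflowed (matched more than k)."""
    matches = index.get(query, [])
    return matches[:k], len(matches) > k


def coverage(index: Dict[str, List[int]], queries: List[str], k: int, corpus_size: int) -> float:
    """Fraction of the corpus retrieved by issuing all queries with a top-k cap."""
    seen = set()
    for q in queries:
        docs, _ = run_query(index, q, k)
        seen.update(docs)
    return len(seen) / corpus_size


K = 2  # assumed result cap of the ranked source
broad = coverage(TOY_INDEX, ["web"], K, corpus_size=5)
narrow = coverage(TOY_INDEX, ["deep web", "web crawler", "web form"], K, corpus_size=5)
print(broad, narrow)
```

In this toy setting the single overflow query reaches only the top two documents, while the three narrower queries together reach all five.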

Conceptual Sentiment Analysis Model

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i4.pp2358-2366 ◽

2018 ◽

Vol 8 (4) ◽

pp. 2358 ◽

Cited By ~ 2

Author(s):

Kranti Vithal Ghag ◽

Ketan Shah

Keyword(s):

Sentiment Analysis ◽

Relative Frequency ◽

Syntactic Structure ◽

Research Work ◽

Average Frequency ◽

Semantic Structure ◽

Analysis Model ◽

Inverse Document Frequency ◽

Document Frequency ◽

Improve Accuracy

The bag-of-words approach is popularly used for sentiment analysis. It maps the terms in the reviews to term-document vectors and thus disrupts the syntactic structure of sentences in the reviews. Association among the terms, or the semantic structure of sentences, is also not preserved. This research work focuses on classifying sentiments by considering the syntactic and semantic structure of the sentences in the review. To improve accuracy, sentiment classifiers based on relative frequency, average frequency, and term frequency-inverse document frequency were proposed. To handle terms with apostrophes, preprocessing techniques were extended. To focus on opinionated content, subjectivity extraction was performed at the phrase level. Experiments were performed on Pang & Lee’s, Kaggle’s, and UCI’s datasets. Classifiers were also evaluated on UCI’s Product and Restaurant datasets. Sentiment classification accuracy improved from 67.9% for a comparable term weighting technique, DeltaTFIDF, to up to 77.2% for the proposed classifiers. Introducing the proposed concept-based approach, subjectivity extraction, and the extensions to preprocessing techniques improved the accuracy to 93.9%.
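
As a rough illustration of the term weighting mentioned above, the following sketch computes textbook TF-IDF weights over a bag-of-words representation. It is not the authors' classifier or DeltaTFIDF; the function name `tf_idf`, the whitespace tokenizer, and the three toy reviews are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document (word order is discarded)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


reviews = ["the food was good", "the food was bad", "the service was good"]
weights = tf_idf(reviews)
```

Terms that occur in every review, such as "the", receive weight zero, while a term unique to one review, such as "bad", receives the highest weight in that review.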

An Improved Cross-Domain Sentiment Analysis Based on a Semi-Supervised Convolutional Neural Network

10.4018/978-1-7998-8413-2.ch007 ◽

2022 ◽

pp. 155-170

Author(s):

Lap-Kei Lee ◽

Kwok Tai Chui ◽

Jingjing Wang ◽

Yin-Chun Fung ◽

Zhanhui Tan

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Sentiment Analysis ◽

Language Processing ◽

Research Work ◽

Latent Semantic Indexing ◽

Future Research ◽

Training Time ◽

Cross Domain ◽

Document Frequency

Dependence on the Internet in our daily life is ever-growing, which provides an opportunity to discover valuable and subjective information using advanced techniques such as natural language processing and artificial intelligence. In this chapter, the research focus is a convolutional neural network for three-class (positive, neutral, and negative) cross-domain sentiment analysis. The model is enhanced in two ways. First, a similarity label method facilitates the management between the source and target domains to generate more labelled data. Second, term frequency-inverse document frequency (TF-IDF) and latent semantic indexing (LSI) are employed to compute the similarity between source and target domains. Performance evaluation is conducted using three datasets: beauty reviews, toy reviews, and phone reviews. The proposed method improves accuracy by 4.3-7.6% and reduces training time by 50%. The limitations of the research work are discussed and serve as the rationale for future research directions.
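
A minimal sketch of a text-similarity computation between a source-domain and a target-domain sentence is given below. It uses cosine similarity over raw term counts only; the TF-IDF weighting and the LSI projection named in the abstract are deliberately omitted, and the function name `cosine` and the example sentences are assumptions.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)  # Counter returns 0 for missing terms
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


source = "great phone battery"      # hypothetical source-domain review
close_target = "great toy battery"  # shares two of three terms
far_target = "soft lipstick color"  # shares no terms
```

Under these assumptions, `cosine(source, close_target)` is higher than `cosine(source, far_target)`, which is the kind of signal a similarity-based labelling step could use.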

Study on the Query Optimization for the Deep Web Data Integration System

International Conference on Computer Technology and Development, 3rd (ICCTD 2011) ◽

10.1115/1.859919.paper193 ◽

2011 ◽

pp. 1181-1185

Keyword(s):

Data Integration ◽

Query Optimization ◽

Deep Web ◽

Web Data ◽

Integration System ◽

Web Data Integration ◽

Data Integration System

Extracting Top-k Company Acquisition Relations From the Web

International Journal on Semantic Web and Information Systems ◽

10.4018/ijswis.2017100102 ◽

2017 ◽

Vol 13 (4) ◽

pp. 27-41 ◽

Cited By ~ 1

Author(s):

Jie Zhao ◽

Jianfei Wang ◽

Jia Yang ◽

Peiquan Jin

Keyword(s):

Rapid Development ◽

Relation Extraction ◽

Experimental Results ◽

Competitive Intelligence ◽

Web Pages ◽

Web Content ◽

Web Page ◽

Competitive Strategies ◽

The Web ◽

Novel Algorithm

Company acquisition relations reflect a company's development intent and competitive strategies, and are an important type of enterprise competitive intelligence. In the traditional environment, the acquisition of competitive intelligence mainly relies on newspapers, internal reports, and so on, but the rapid development of the Web introduces a new way to extract company acquisition relations. In this paper, the authors study the problem of extracting company acquisition relations from huge amounts of Web pages, and propose a novel algorithm for company acquisition relation extraction. The authors' algorithm considers the tense feature of Web content and classification technology of semantic strength when extracting company acquisition relations from Web pages. It first determines the tense of each sentence in a Web page, which is then applied in sentence classification so as to evaluate the semantic strength of the candidate sentences in describing a company acquisition relation. After that, the authors rank the candidate acquisition relations and return the top-k company acquisition relations. They run experiments on 6144 pages crawled through Google, and measure the performance of their algorithm under different metrics. The experimental results show that the algorithm is effective in determining the tense of sentences as well as the company acquisition relations.

A Novel Algorithm of Web Page Change Detection

International Journal of Computer Applications ◽

10.5120/ijca2016910679 ◽

2016 ◽

Vol 146 (2) ◽

pp. 21-24

Author(s):

Surbhi Chhabra ◽

Rajender Nath

Keyword(s):

Change Detection ◽

Web Page ◽

Novel Algorithm

High Frequency Mutant Genes in Urothelial Carcinoma Based On Genomic Landscape

10.21203/rs.3.rs-1006840/v1 ◽

2021 ◽

Author(s):

Haifeng Hao ◽

Yunmeng Zhang ◽

Yue Zheng ◽

Xiaoting Yan ◽

Liang Wei ◽

...

Keyword(s):

Urothelial Carcinoma ◽

Exome Sequencing ◽

High Frequency ◽

Meta Analysis ◽

Research Work ◽

Systematic Evaluation ◽

Genetic Characteristics ◽

Genomic Landscape ◽

Based Medicine ◽

Mutant Genes

Objective: The new generation of sequencing technology has been applied to the study of genomic genetic characteristics of urothelial carcinoma for 20 years. Researchers at home and abroad have done a lot of research work. Analyzing and summarizing the research results, we can clarify the genes with high-frequency mutations, which is of great significance for the screening of biomarkers and molecular targets of urothelial carcinoma. Method: We will adopt the PICOS analysis method of evidence-based medicine; follow the principles of systematic evaluation and meta-analysis; formulate literature retrieval keywords and retrieval strategies; determine the inclusion criteria; and statistically analyze the name, mutation frequency, quantity, and the total number of times in repeated reports of significant mutant genes in the genomic landscape. Results: A total of 6254 cases of urothelial carcinoma were sequenced in the 27 theses selected. Sequencing methods include whole genome sequencing, whole exome sequencing, and target exome sequencing. The 27 genomic landscapes of urothelial carcinoma showed that the number of significant mutant genes was 5-58, with an average of 26 reported in each paper. There were 273 genes with significant mutations in urothelial carcinoma, 65.57% (179 / 273) of which were reported only once and 34.43% (94 / 273) were reported more than twice. The top 7 genes most frequently reported were TP53, PIK3CA, FGFR3, KDM6A, ARID1A, RB1 and STAG2. Conclusion: There were 273 genes with significant mutations in the genome of urothelial carcinoma, and biomarkers may be selected from the 94 genes with high-frequency mutations.
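
The reported shares can be checked with a short calculation. This is only an arithmetic sketch over the figures quoted in the abstract; the variable names are assumptions.

```python
total_genes = 273      # genes with significant mutations
reported_once = 179    # genes reported in a single paper
reported_repeat = 94   # genes reported repeatedly

# The two groups partition the full gene set.
assert reported_once + reported_repeat == total_genes

share_once = round(100 * reported_once / total_genes, 2)
share_repeat = round(100 * reported_repeat / total_genes, 2)
print(share_once, share_repeat)
```

The two groups add up to the full set of 273 genes, so the counts and percentages quoted in the abstract are mutually consistent.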

A Reinforcement Learning Based Dirt-Exploration for Cleaning-Auditing Robot

Sensors ◽

10.3390/s21248331 ◽

2021 ◽

Vol 21 (24) ◽

pp. 8331

Author(s):

Thejus Pathmakumar ◽

Mohan Rajesh Elara ◽

Braulio Félix Gómez ◽

Balakrishnan Ramalingam

Keyword(s):

Reinforcement Learning ◽

Research Work ◽

Leading Edge ◽

Research Problem ◽

Fundamental Question ◽

Policy Learning ◽

Policy Network ◽

Important Research ◽

Important Research Problem

Cleaning is one of the fundamental tasks of prime importance in our day-to-day life. Moreover, the importance of cleaning drives research efforts towards bringing leading-edge technologies, including robotics, into the cleaning domain. However, an effective method to assess the quality of cleaning is an equally important research problem to be addressed. The first step towards answering the fundamental question of “How clean is clean” is taken using an autonomous cleaning-auditing robot that audits the cleanliness of a given area. This research work focuses on a novel reinforcement learning-based, experience-driven dirt exploration strategy for a cleaning-auditing robot. The proposed approach uses a proximal policy optimization (PPO) based on-policy learning method to generate waypoints and sampling decisions to explore the probable dirt accumulation regions in a given area. The policy network is trained in multiple environments with simulated dirt patterns. Experimental trials have been conducted to validate the trained policy in both simulated and real-world environments using an in-house developed cleaning audit robot called BELUGA.

Context Awareness and Intelligence in Cognitive Radio Networks: Design and Applications

10.26686/wgtn.16973704.v1 ◽

2021 ◽

Author(s):

Kok-Lim Yau

Keyword(s):

Reinforcement Learning ◽

Access Control ◽

Medium Access Control ◽

Context Awareness ◽

Research Work ◽

Radio Spectrum ◽

Operating Environment ◽

Medium Access ◽

Efficient Manner ◽

White Spaces

CR technology, which is the next-generation wireless communication system, improves the utilization of the overall radio spectrum through dynamic adaptation to local spectrum availability. In CR networks, unlicensed or Secondary Users (SUs) may operate in underutilized spectrum (called white spaces) owned by the licensed or Primary Users (PUs) conditional upon PUs encountering acceptably low interference levels. Ideally, the PUs are oblivious to the presence of the SUs. Context awareness enables an SU to sense and observe its operating environment, which is complex and dynamic in nature; while intelligence enables the SU to learn knowledge, which can be acquired through observing the consequences of its prior action, about its operating environment so that it carries out the appropriate action to achieve optimum network performance in an efficient manner without following a strict and static predefined set of policies. Traditionally, without the application of intelligence, each wireless host adheres to a strict and static predefined set of policies, which may not be optimum in many kinds of operating environment. With the application of intelligence, the knowledge changes in line with the dynamic operating environment. This thesis investigates the application of an artificial intelligence approach called reinforcement learning to achieve context awareness and intelligence in order to enable the SUs to sense and utilize the high quality white spaces. To date, the research focus of the CR research community has been primarily on the physical layer of the open system interconnection model. The research into the data link layer is still in its infancy, and our research work focusing on this layer has been pioneering in this field and has attracted considerable international interest. There are four major outcomes in this thesis. Firstly, various types of multi-channel medium access control protocols are reviewed, followed by discussion of their merits and demerits. The purpose is to show the additional functionalities and challenges that each multi-channel medium access control protocol has to offer and address in order to operate in CR networks. Secondly, a novel cross-layer based quality of service architecture called C2net for CR networks is proposed to provide service prioritization and tackle the issues associated with CR networks. Thirdly, reinforcement learning is applied to pursue context awareness and intelligence in both centralized and distributed CR networks. Analysis and simulation results show that reinforcement learning is a promising mechanism to achieve context awareness and intelligence. Fourthly, the versatile reinforcement learning approach is applied in various schemes for performance enhancement in CR networks.
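
The idea of an SU learning from the consequences of its prior actions can be sketched as a simple value-learning loop over channels. This is not the scheme proposed in the thesis; it is a generic epsilon-greedy learner with sample-average updates, and the function name `learn_channel`, the idle probabilities, and all parameter values are assumptions for illustration.

```python
import random
from typing import List


def learn_channel(idle_prob: List[float], steps: int = 5000,
                  epsilon: float = 0.1, seed: int = 0) -> List[float]:
    """Estimate per-channel success rates by trial and error.

    idle_prob[i] is the (hidden) probability that channel i is a usable
    white space when the SU senses it; the learner only observes rewards.
    """
    rng = random.Random(seed)
    n = len(idle_prob)
    q = [0.0] * n      # learned value estimate per channel
    counts = [0] * n   # number of times each channel was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)                  # explore
        else:
            action = max(range(n), key=lambda i: q[i])  # exploit
        reward = 1.0 if rng.random() < idle_prob[action] else 0.0
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # running average
    return q


q_values = learn_channel([0.2, 0.5, 0.9])
best = max(range(len(q_values)), key=lambda i: q_values[i])
```

With these assumed probabilities, the learner ends up preferring the channel that is idle most often, without being told the probabilities in advance.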

Using Psychophysiological Sensors to Assess Mental Workload in Web Browsing

10.20944/preprints201712.0021.v1 ◽

2017 ◽

Author(s):

Angel Jimenez-Molina ◽

Cristian Retamal ◽

Hernan Lira

Keyword(s):

High Frequency ◽

Mental Workload ◽

Electrodermal Activity ◽

Eye Gaze ◽

Physiological Responses ◽

Web Browsing ◽

Web Page ◽

Non Invasive ◽

Four Levels ◽

Electroencephalogram EEG

Assessing the mental workload induced by a Web page is essential for improving the user’s browsing experience. However, continuously assessing the mental workload during a browsing task is challenging. To address this issue, this paper leverages the correlation between stimuli and physiological responses, which are measured with high-frequency, non-invasive psychophysiological sensors during very short time windows. An experiment was conducted to identify levels of mental workload through the analysis of pupil dilation measured by an eye-tracking sensor. In addition, a method was developed to classify real-time mental workload by appropriately combining different signals (electrodermal activity (EDA), electrocardiogram, photoplethysmography (PPG), electroencephalogram (EEG), temperature and eye gaze) obtained with non-invasive psychophysiological sensors. The results show that the Web browsing task involves on average four levels of mental workload. Also, by combining EEG with PPG and EDA, the accuracy of the classification reaches 95.73%.

Influence of GUJarati STEmmeR in Supervised Learning of Web Page Categorization

International Journal of Intelligent Systems and Applications ◽

10.5815/ijisa.2021.03.03 ◽

2021 ◽

Vol 13 (3) ◽

pp. 23-34

Author(s):

Chandrakant D. Patel ◽

Jayesh M. Patel

Keyword(s):

Machine Learning ◽

Information Retrieval ◽

Research Work ◽

Research Problem ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Web Page ◽

User Query ◽

On Line ◽

Gujarati Language

With the large quantity of information available online, it is essential to retrieve correct information for a user query. A large amount of data is available in digital form in multiple languages. Various approaches aim to increase the effectiveness of online information retrieval, but the standard approach to answering a user query is to search the documents in the corpus word by word for the given query. This approach is very time-consuming and may miss several related documents that are equally important. To avoid these issues, stemming has been used extensively in numerous Information Retrieval Systems (IRS) to increase retrieval accuracy across languages. This paper addresses the problem of stemming in Web Page Categorization for the Gujarati language, deriving stem words with the GUJSTER algorithm [1]. The GUJSTER algorithm is based on morphological rules and is used to derive the root or stem word from inflected words of the same class. In particular, we consider the influence of the extracted stem or root word on the integrity of web page classification using supervised machine learning algorithms. This research work focuses on the analysis of Web Page Categorization (WPC) for the Gujarati language and on verifying the influence of a stemming algorithm in a WPC application for the Gujarati language, with accuracy improving from 63% to 98% through supervised machine learning models using a standard split of 80% for training and 20% for testing.
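
Rule-based stemming of the kind described above can be sketched as suffix stripping. The following is not GUJSTER and does not use Gujarati morphology; the suffix list is a hypothetical set of English rules, and the function name `stem` and the `min_stem` threshold are assumptions chosen only to show the mechanism.

```python
from typing import Tuple

# Hypothetical English suffix rules, checked in the order listed.
SUFFIXES: Tuple[str, ...] = ("ing", "ed", "es", "s")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


inflected = ["walking", "walked", "walks", "boxes", "is"]
stems = [stem(w) for w in inflected]
print(stems)
```

Mapping inflected forms of the same class to one stem reduces the vocabulary that a supervised classifier has to learn from, which is the effect the abstract sets out to measure.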
