Ranked Deep Web Page Detection Using Reinforcement Learning and Query Optimization

2021 ◽  
Vol 17 (4) ◽  
pp. 99-121
Author(s):  
Kapil Madan ◽  
Rajesh K. Bhatia

This paper proposes a novel algorithm based on a reinforcement learning method called asynchronous advantage actor-critic (A3C). Overflow queries are optimized to crawl the ranked deep web. A3C assigns rewards and penalties to the various queries. Queries are derived from a domain-based taxonomy that helps to fill the search forms. Overflow queries are queries that match more than k results, of which only the top k are retrieved. Low-ranked documents beyond the top k results are not accessible, which leads to low coverage. The proposed technique converts overflow queries into non-overflow queries and thereby increases coverage. To date, no research work has explored using A3C with a taxonomy in the domain of the ranked deep web. The experimental results show that the proposed technique outperforms three other techniques (i.e., document frequency, random query, and high frequency) in terms of the average improvement metric by 26%, 69%, and 92%, respectively.
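The overflow-query idea described above can be illustrated with a small sketch. It does not implement A3C or the authors' query optimizer; it only shows, on a hypothetical six-document corpus with made-up queries and a result limit `K`, why a query matching more than k documents hides low-ranked results and how narrower queries can raise coverage. All names (`classify_query`, `coverage`, `broad`, `narrow`) are assumptions for illustration.

```python
from typing import Dict, List

def classify_query(match_count: int, k: int) -> str:
    """Label a query 'overflow' when it matches more than k documents."""
    return "overflow" if match_count > k else "non-overflow"

def coverage(queries: Dict[str, List[str]], k: int, corpus_size: int) -> float:
    """Fraction of the corpus reachable when each query returns only its top k matches."""
    seen = set()
    for ranked_matches in queries.values():
        seen.update(ranked_matches[:k])  # results beyond rank k are inaccessible
    return len(seen) / corpus_size

# Hypothetical toy data: document ids listed in ranked order per query.
CORPUS_SIZE = 6
K = 3
broad = {"book": ["d1", "d2", "d3", "d4", "d5", "d6"]}
narrow = {
    "book fiction": ["d1", "d4", "d5"],
    "book science": ["d2", "d3", "d6"],
}

for name, matches in {**broad, **narrow}.items():
    print(name, classify_query(len(matches), K))
print("coverage (broad):", coverage(broad, K, CORPUS_SIZE))
print("coverage (narrow):", coverage(narrow, K, CORPUS_SIZE))
```

In this toy setting the single broad query reaches only half of the corpus, while the two narrower, non-overflow queries together reach all of it.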

Author(s):  
Kranti Vithal Ghag ◽  
Ketan Shah

The bag-of-words approach is widely used for sentiment analysis. It maps the terms in the reviews to term-document vectors and thus disrupts the syntactic structure of the sentences in the reviews. Associations among the terms and the semantic structure of the sentences are also not preserved. This research work focuses on classifying sentiments by considering the syntactic and semantic structure of the sentences in the review. To improve accuracy, sentiment classifiers based on relative frequency, average frequency, and term frequency-inverse document frequency were proposed. To handle terms with apostrophes, preprocessing techniques were extended. To focus on opinionated content, subjectivity extraction was performed at the phrase level. Experiments were performed on Pang & Lee's, Kaggle's, and UCI's datasets. Classifiers were also evaluated on UCI's Product and Restaurant datasets. Sentiment classification accuracy improved from 67.9% for a comparable term weighting technique, DeltaTFIDF, up to 77.2% for the proposed classifiers. Introducing the proposed concept-based approach, subjectivity extraction, and extensions to the preprocessing techniques improved the accuracy to 93.9%.


2022 ◽  
pp. 155-170
Author(s):  
Lap-Kei Lee ◽  
Kwok Tai Chui ◽  
Jingjing Wang ◽  
Yin-Chun Fung ◽  
Zhanhui Tan

Dependence on the Internet in our daily life is ever-growing, which provides an opportunity to discover valuable and subjective information using advanced techniques such as natural language processing and artificial intelligence. In this chapter, the research focus is a convolutional neural network for three-class (positive, neutral, and negative) cross-domain sentiment analysis. The model is enhanced in two ways. First, a similarity label method facilitates the management between the source and target domains to generate more labelled data. Second, term frequency-inverse document frequency (TF-IDF) and latent semantic indexing (LSI) are employed to compute the similarity between the source and target domains. Performance evaluation is conducted using three datasets: beauty reviews, toy reviews, and phone reviews. The proposed method enhances the accuracy by 4.3-7.6% and reduces the training time by 50%. The limitations of the research work are discussed and serve as the rationale for future research directions.
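As a rough illustration of the TF-IDF similarity step mentioned above, the following sketch scores how close a source-domain review is to two target-domain reviews using cosine similarity over TF-IDF weights. It is a minimal pure-Python version under assumed choices (whitespace tokenization, a smoothed IDF, invented example sentences); it omits LSI and is not the chapter's actual pipeline.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) for a list of documents."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) so that common terms keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (cnt / len(tokens)) * idf[t] for t, cnt in tf.items()})
    return vectors

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical reviews: one source-domain (beauty) and two target-domain texts.
source = "great lipstick color lasts long"
target_phone = "great phone battery lasts long"
target_toy = "toy broke after one day"

vec_source, vec_phone, vec_toy = tfidf_vectors([source, target_phone, target_toy])
sim_phone = cosine(vec_source, vec_phone)
sim_toy = cosine(vec_source, vec_toy)
print("source vs phone review:", sim_phone)
print("source vs toy review:", sim_toy)
```

The phone review shares several terms with the source review and so scores higher than the toy review, which shares none.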


Author(s):  
Jie Zhao ◽  
Jianfei Wang ◽  
Jia Yang ◽  
Peiquan Jin

Company acquisition relations reflect a company's development intent and competitive strategies, and are an important type of enterprise competitive intelligence. In the traditional environment, the acquisition of competitive intelligence mainly relies on newspapers, internal reports, and so on, but the rapid development of the Web introduces a new way to extract company acquisition relations. In this paper, the authors study the problem of extracting company acquisition relations from huge amounts of Web pages, and propose a novel algorithm for company acquisition relation extraction. The authors' algorithm considers the tense feature of Web content and the classification technology of semantic strength when extracting company acquisition relations from Web pages. It first determines the tense of each sentence in a Web page, which is then applied in sentence classification so as to evaluate the semantic strength of the candidate sentences in describing a company acquisition relation. After that, the authors rank the candidate acquisition relations and return the top-k company acquisition relations. They run experiments on 6144 pages crawled through Google, and measure the performance of their algorithm under different metrics. The experimental results show that the algorithm is effective in determining the tense of sentences as well as the company acquisition relations.


2016 ◽  
Vol 146 (2) ◽  
pp. 21-24
Author(s):  
Surbhi Chhabra ◽  
Rajender Nath

2021 ◽  
Author(s):  
Haifeng Hao ◽  
Yunmeng Zhang ◽  
Yue Zheng ◽  
Xiaoting Yan ◽  
Liang Wei ◽  
...  

Objective: Next-generation sequencing technology has been applied to the study of the genomic characteristics of urothelial carcinoma for 20 years, and researchers at home and abroad have done a great deal of research work. By analyzing and summarizing the research results, we can clarify the genes with high-frequency mutations, which is of great significance for the screening of biomarkers and molecular targets of urothelial carcinoma. Method: We will adopt the PICOS analysis method of evidence-based medicine; follow the principles of systematic evaluation and meta-analysis; formulate literature retrieval keywords and retrieval strategies; determine the inclusion criteria; and statistically analyze the name, mutation frequency, quantity, and total number of repeated reports of significant mutant genes in the genomic landscape. Results: A total of 6254 cases of urothelial carcinoma were sequenced in the 27 selected papers. Sequencing methods include whole genome sequencing, whole exome sequencing, and target exome sequencing. The 27 genomic landscapes of urothelial carcinoma showed that the number of significant mutant genes was 5-58, with an average of 26 reported in each paper. There were 273 genes with significant mutations in urothelial carcinoma, 65.57% (179 / 273) of which were reported only once and 34.43% (94 / 273) of which were reported two or more times. The top 7 most frequently reported genes were TP53, PIK3CA, FGFR3, KDM6A, ARID1A, RB1, and STAG2. Conclusion: There were 273 genes with significant mutations in the genome of urothelial carcinoma, and biomarkers may be selected from the 94 genes with a high frequency of mutations.
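The percentages quoted in the results can be checked with a few lines of arithmetic. The sketch below only reuses the counts stated in the abstract (273 genes in total, 179 reported once, 94 reported repeatedly); the variable names are illustrative.

```python
# Counts taken from the abstract above.
total_genes = 273
reported_once = 179
reported_repeatedly = 94

# The two groups should partition the full gene set.
partition_ok = reported_once + reported_repeatedly == total_genes

# Shares as percentages, rounded to two decimals as in the abstract.
share_once = round(100 * reported_once / total_genes, 2)
share_repeated = round(100 * reported_repeatedly / total_genes, 2)

print(partition_ok, share_once, share_repeated)
```

Both shares agree with the figures reported in the abstract, and the two groups together account for all 273 genes.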


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8331
Author(s):  
Thejus Pathmakumar ◽  
Mohan Rajesh Elara ◽  
Braulio Félix Gómez ◽  
Balakrishnan Ramalingam

Cleaning is a fundamental task of prime importance in our day-to-day life. Moreover, the importance of cleaning drives research efforts towards bringing leading-edge technologies, including robotics, into the cleaning domain. However, an effective method to assess the quality of cleaning is an equally important research problem to be addressed. A first step towards answering the fundamental question of “How clean is clean” is an autonomous cleaning-auditing robot that audits the cleanliness of a given area. This research work focuses on a novel reinforcement learning-based, experience-driven dirt exploration strategy for a cleaning-auditing robot. The proposed approach uses a proximal policy optimization (PPO) based on-policy learning method to generate waypoints and sampling decisions to explore the probable dirt accumulation regions in a given area. The policy network is trained in multiple environments with simulated dirt patterns. Experimental trials have been conducted to validate the trained policy in both simulated and real-world environments using an in-house developed cleaning audit robot called BELUGA.


2021 ◽  
Author(s):  
Kok-Lim Yau

Cognitive radio (CR) technology, which is the next-generation wireless communication system, improves the utilization of the overall radio spectrum through dynamic adaptation to local spectrum availability. In CR networks, unlicensed or Secondary Users (SUs) may operate in underutilized spectrum (called white spaces) owned by the licensed or Primary Users (PUs), conditional upon PUs encountering acceptably low interference levels. Ideally, the PUs are oblivious to the presence of the SUs. Context awareness enables an SU to sense and observe its operating environment, which is complex and dynamic in nature, while intelligence enables the SU to learn knowledge about its operating environment, which can be acquired through observing the consequences of its prior actions, so that it carries out the appropriate action to achieve optimum network performance in an efficient manner without following a strict and static predefined set of policies. Traditionally, without the application of intelligence, each wireless host adheres to a strict and static predefined set of policies, which may not be optimum in many kinds of operating environments. With the application of intelligence, the knowledge changes in line with the dynamic operating environment. This thesis investigates the application of an artificial intelligence approach called reinforcement learning to achieve context awareness and intelligence in order to enable the SUs to sense and utilize high-quality white spaces. To date, the research focus of the CR research community has been primarily on the physical layer of the open system interconnection model. Research into the data link layer is still in its infancy, and our research work focusing on this layer has been pioneering in this field and has attracted considerable international interest. There are four major outcomes in this thesis. Firstly, various types of multi-channel medium access control protocols are reviewed, followed by a discussion of their merits and demerits. The purpose is to show the additional functionalities and challenges that each multi-channel medium access control protocol has to offer and address in order to operate in CR networks. Secondly, a novel cross-layer based quality of service architecture called C2net for CR networks is proposed to provide service prioritization and tackle the issues associated with CR networks. Thirdly, reinforcement learning is applied to pursue context awareness and intelligence in both centralized and distributed CR networks. Analysis and simulation results show that reinforcement learning is a promising mechanism to achieve context awareness and intelligence. Fourthly, the versatile reinforcement learning approach is applied in various schemes for performance enhancement in CR networks.


Author(s):  
Angel Jimenez-Molina ◽  
Cristian Retamal ◽  
Hernan Lira

The mental workload induced by a Web page is essential for improving the user’s browsing experience. However, continuously assessing the mental workload during a browsing task is challenging. To address this issue, this paper leverages the correlation between stimuli and physiological responses, which are measured with high-frequency, non-invasive psychophysiological sensors during very short time windows. An experiment was conducted to identify levels of mental workload through the analysis of pupil dilation measured by an eye-tracking sensor. In addition, a method was developed to classify real-time mental workload by appropriately combining different signals (electrodermal activity (EDA), electrocardiogram, photoplethysmography (PPG), electroencephalogram (EEG), temperature, and eye gaze) obtained with non-invasive psychophysiological sensors. The results show that the Web browsing task involves on average four levels of mental workload. Also, by combining EEG with PPG and EDA, the accuracy of the classification reaches 95.73%.


2021 ◽  
Vol 13 (3) ◽  
pp. 23-34
Author(s):  
Chandrakant D. Patel ◽  
Jayesh M. Patel

With the large quantity of information available online, it is essential to retrieve accurate information for a user query. A large amount of data is available in digital form in multiple languages. Various approaches aim to increase the effectiveness of online information retrieval, but the standard approach to answering a user query is to search the documents in the corpus word by word for the given query. This approach is very time-consuming and may miss several related documents that are equally important. To avoid these issues, stemming has been used extensively in numerous Information Retrieval Systems (IRS) to increase retrieval accuracy across languages. This paper addresses the problem of stemming for Web Page Categorization (WPC) in the Gujarati language, deriving stem words with the GUJSTER algorithm [1]. The GUJSTER algorithm is based on morphological rules and is used to derive the root or stem word from inflected words of the same class. In particular, we consider the influence of the extracted stem or root word on the integrity of web page classification using supervised machine learning algorithms. This research work focuses on the analysis of WPC for the Gujarati language and on verifying the influence of a stemming algorithm in a WPC application, with improved accuracy ranging from 63% to 98% using supervised machine learning models and a standard split of 80% for training and 20% for testing.

