Learning to rank with click-through features in a reinforcement learning framework

2016 ◽  
Vol 12 (4) ◽  
pp. 448-476 ◽  
Author(s):  
Amir Hosein Keyhanipour ◽  
Behzad Moshiri ◽  
Maryam Piroozmand ◽  
Farhad Oroumchian ◽  
Ali Moeini

Purpose Learning to rank algorithms inherently face many challenges. The most important are the high dimensionality of the training data, the dynamic nature of Web information resources and the lack of click-through data. High dimensionality of the training data affects the effectiveness and efficiency of learning algorithms. Besides, most learning to rank benchmark data sets do not include click-through data, a very rich source of information about the search behavior of users while dealing with ranked lists of search results. To deal with these limitations, this paper aims to introduce a novel learning to rank algorithm that uses a set of complex click-through features in a reinforcement learning (RL) model. These features are calculated from the existing click-through information in the data set or even from data sets without any explicit click-through information. Design/methodology/approach The proposed ranking algorithm (QRC-Rank) applies RL techniques to a set of calculated click-through features. QRC-Rank is a two-step process. In the first step, the Transformation phase, a compact benchmark data set is created which contains a set of click-through features. These features are calculated from the original click-through information available in the data set and constitute a compact representation of click-through information. To find the most effective click-through features, a number of scenarios are investigated. The second phase is Model-Generation, in which an RL model is built to rank the documents. This model is created by applying temporal difference learning methods such as Q-Learning and SARSA. Findings The proposed learning to rank method, QRC-Rank, is evaluated on the WCL2R and LETOR 4.0 data sets. Experimental results demonstrate that QRC-Rank outperforms state-of-the-art learning to rank methods such as SVMRank, RankBoost, ListNet and AdaRank based on the precision and normalized discounted cumulative gain evaluation criteria. The use of the click-through features calculated from the training data set is a major contributor to the performance of the system. Originality/value In this paper, we have demonstrated the viability of the proposed features, which provide a compact representation of the click-through data in a learning to rank application. These compact click-through features are calculated from the original features of the learning to rank benchmark data set. In addition, a Markov Decision Process model is proposed for the learning to rank problem using RL, including the sets of states, actions, the rewarding strategy and the transition function.
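As a rough illustration of the temporal-difference learning step that QRC-Rank builds on, the sketch below runs tabular Q-Learning over a toy ranking MDP. The state, action and reward definitions here (rank position, document choice, a scalar click-through-derived relevance) are simplifying assumptions for illustration, not the authors' actual model.

```python
import random
from collections import defaultdict

# Minimal Q-learning sketch for ranking: the state is the current rank
# position, an action selects one of the remaining documents, and the
# reward is an (assumed) click-through-derived relevance score.
def q_learning_rank(doc_rewards, episodes=500, alpha=0.1, gamma=0.9, eps=0.2):
    q = defaultdict(float)                      # Q[(position, doc)] -> value
    docs = list(doc_rewards)
    for _ in range(episodes):
        remaining, position = set(docs), 0
        while remaining:
            if random.random() < eps:           # epsilon-greedy exploration
                doc = random.choice(list(remaining))
            else:
                doc = max(remaining, key=lambda d: q[(position, d)])
            reward = doc_rewards[doc]
            nxt = position + 1
            best_next = max((q[(nxt, d)] for d in remaining if d != doc), default=0.0)
            # temporal-difference (Q-Learning) update
            q[(position, doc)] += alpha * (reward + gamma * best_next - q[(position, doc)])
            remaining.discard(doc)
            position = nxt
    # greedy policy: order documents by learned Q-value at each position
    ranking, remaining, position = [], set(docs), 0
    while remaining:
        doc = max(remaining, key=lambda d: q[(position, d)])
        ranking.append(doc)
        remaining.discard(doc)
        position += 1
    return ranking

# toy usage: each document carries one aggregated click-through relevance score
print(q_learning_rank({"d1": 0.2, "d2": 0.9, "d3": 0.5}))
```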

2019 ◽  
Vol 47 (3) ◽  
pp. 154-170
Author(s):  
Janani Balakumar ◽  
S. Vijayarani Mohan

Purpose Owing to the huge volume of documents available on the internet, text classification becomes a necessary task to handle these documents. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research work is to classify personal computer documents based on their content. Design/methodology/approach This paper proposes a new algorithm for feature selection based on an artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed algorithm (ABCFS) is scrutinized on real and benchmark data sets and compared against other existing feature selection approaches such as information gain and the χ2 statistic. To justify the efficiency of the proposed algorithm, the support vector machine (SVM) and an improved SVM classifier are used in this paper. Findings The experiments were conducted on real and benchmark data sets. The real data set was collected in the form of documents stored on a personal computer, and the benchmark data set was collected from the Reuters and 20 Newsgroups corpora. The results prove the performance of the proposed feature selection algorithm by enhancing text document classification accuracy. Originality/value This paper proposes a new ABCFS algorithm for feature selection, evaluates the efficiency of the ABCFS algorithm and improves the support vector machine. In this paper, the ABCFS algorithm is used to select features from text (unstructured) documents; in existing work, ABC-based algorithms have been used only to select features from structured data, and no text feature selection algorithm of this kind exists. The proposed algorithm will classify documents automatically based on their content.
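For readers unfamiliar with bee-colony search, the following sketch shows a heavily simplified ABC-style feature selection loop scored by an SVM, in the spirit of ABCFS but not the authors' implementation; the merged employed/onlooker phase, the synthetic data and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # cross-validated accuracy of a linear SVM on the selected feature subset
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=3).mean()

def abc_feature_selection(X, y, n_bees=10, n_iter=15, limit=5):
    n_feat = X.shape[1]
    food = rng.random((n_bees, n_feat)) > 0.5          # candidate feature subsets
    scores = np.array([fitness(f, X, y) for f in food])
    trials = np.zeros(n_bees, dtype=int)
    for _ in range(n_iter):
        for i in range(n_bees):                        # employed/onlooker phase (merged here)
            neighbour = food[i].copy()
            j = rng.integers(n_feat)                   # flip one randomly chosen feature
            neighbour[j] = ~neighbour[j]
            s = fitness(neighbour, X, y)
            if s > scores[i]:
                food[i], scores[i], trials[i] = neighbour, s, 0
            else:
                trials[i] += 1
        for i in np.where(trials >= limit)[0]:         # scout phase: abandon stale sources
            food[i] = rng.random(n_feat) > 0.5
            scores[i] = fitness(food[i], X, y)
            trials[i] = 0
    best = scores.argmax()
    return food[best], scores[best]

# synthetic stand-in for a vectorized document-term matrix
X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
mask, acc = abc_feature_selection(X, y)
print(f"selected {mask.sum()} features, CV accuracy {acc:.3f}")
```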


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Deepthi Godavarthi ◽  
Mary Sowjanya A.

Purpose The purpose of this paper is to build a better question answering (QA) system that can furnish improved retrieval of answers related to COVID-19 queries from the COVID-19 open research data set (CORD-19). As CORD-19 has an up-to-date collection of coronavirus literature, text mining approaches can be successfully used to retrieve answers pertaining to all coronavirus-related questions. The existing A Lite BERT for self-supervised learning of language representations (ALBERT) model is fine-tuned for retrieving all COVID-relevant information for scientific questions posed by the medical community and for highlighting the context related to the COVID-19 query. Design/methodology/approach This study presents a fine-tuned ALBERT-based QA system in association with the Best Match 25 (Okapi BM25) ranking function and its variant BM25L for context retrieval, which has achieved high scores on benchmark data sets such as SQuAD for answers related to COVID-19 questions. In this context, this paper builds a QA system pre-trained on SQuAD and fine-tunes it on CORD-19 data to retrieve answers related to COVID-19 questions by extracting semantically relevant information related to the question. Findings BM25L is found to be more effective for retrieval than Okapi BM25. Hence, the fine-tuned ALBERT model, when extended to the CORD-19 data set, provided accurate results. Originality/value The fine-tuned ALBERT QA system was developed and tested for the first time on the CORD-19 data set to extract context and highlight the span of the answer for more clarity to the user.
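A minimal sketch of the retrieve-then-read pattern described above, assuming the rank-bm25 and transformers packages; the tiny corpus and the model name "albert-squad-checkpoint" are placeholders, not the checkpoints or data used in the paper.

```python
from rank_bm25 import BM25L          # pip install rank-bm25
from transformers import pipeline    # pip install transformers

# Tiny corpus standing in for CORD-19 paragraphs (illustrative only).
paragraphs = [
    "Coronaviruses are enveloped RNA viruses that can cause respiratory illness.",
    "Hand washing and masks reduce transmission of respiratory viruses.",
]
tokenized = [p.lower().split() for p in paragraphs]
retriever = BM25L(tokenized)                      # BM25L variant for context retrieval

question = "How can transmission of respiratory viruses be reduced?"
scores = retriever.get_scores(question.lower().split())
context = paragraphs[int(scores.argmax())]        # best-matching paragraph

# "albert-squad-checkpoint" is a placeholder for an ALBERT model fine-tuned on SQuAD.
qa = pipeline("question-answering", model="albert-squad-checkpoint")
print(qa(question=question, context=context))     # answer span with score
```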


Kybernetes ◽  
2014 ◽  
Vol 43 (5) ◽  
pp. 817-834
Author(s):  
Mohammad Amin Shayegan ◽  
Saeed Aghabozorgi

Purpose – Pattern recognition systems often have to handle the problem of large volumes of training data, including duplicate and similar training samples. This problem leads to large memory requirements for storing and processing the data and high time complexity for the training algorithms. The purpose of the paper is to reduce the volume of the training part of a data set in order to increase system speed, without any significant decrease in system accuracy. Design/methodology/approach – A new technique for data set size reduction, using a version of a modified frequency diagram approach, is presented. In order to reduce processing time, the proposed method compares the samples of a class to other samples in the same class, instead of comparing samples from different classes. It only removes patterns that are similar to the generated class template in each class. To achieve this aim, no feature extraction operation was carried out, in order to produce a more precise assessment of the proposed data size reduction technique. Findings – The results from the experiments, on one of the largest standard handwritten numeral optical character recognition (OCR) data sets, Hoda, show a 14.88 percent decrease in data set volume without a significant decrease in performance. Practical implications – The proposed technique is effective for size reduction of all pictorial databases such as OCR data sets. Originality/value – State-of-the-art algorithms currently used for data set size reduction usually remove samples near class centers, or support vector (SV) samples between different classes. However, the samples near a class center have valuable information about class characteristics, and they are necessary to build a system model. Also, SVs are important samples for evaluating system efficiency. The proposed technique, unlike the other available methods, keeps both outlier samples and the samples close to the class centers.
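A minimal sketch of template-based pruning in the spirit of the modified frequency diagram approach: each class template is the per-pixel "on" frequency, and samples that agree with the thresholded template beyond a cut-off are dropped as near-duplicates. The similarity measure and threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def reduce_per_class(samples, similarity_threshold=0.9):
    """Drop binary-image samples that closely match their class template.

    `samples` maps class label -> array of shape (n_samples, n_pixels) holding
    binarized images. The template is a pixel frequency diagram per class.
    """
    reduced = {}
    for label, X in samples.items():
        template = X.mean(axis=0)                        # pixel "on" frequency per class
        binary_template = (template > 0.5).astype(X.dtype)
        # fraction of pixels on which each sample agrees with the class template
        similarity = (X == binary_template).mean(axis=1)
        keep = similarity < similarity_threshold         # remove near-duplicates of the template
        # keep at least one representative if everything matches the template
        reduced[label] = X[keep] if keep.any() else X[:1]
    return reduced

rng = np.random.default_rng(0)
data = {0: (rng.random((100, 64)) > 0.5).astype(np.uint8)}   # toy binarized "images"
print({k: v.shape[0] for k, v in reduce_per_class(data).items()})
```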


2015 ◽  
Vol 3 (2/3) ◽  
pp. 62-71 ◽  
Author(s):  
Sajad Saeedi ◽  
Carl Thibault ◽  
Michael Trentini ◽  
Howard Li

Purpose – The purpose of this paper is to present a localization and mapping data set acquired by a fixed-wing unmanned aerial vehicle (UAV). The data set was collected for educational and research purposes: to save time in dealing with hardware and to compare the results with a benchmark data set. The data were collected in standard Robot Operating System (ROS) format. The environment, the fixed-wing aircraft and the sensor configuration are explained in detail. GPS coordinates of the fixed-wing aircraft are also available as ground truth. The data set is available for download (www.ece.unb.ca/COBRA/open_source.htm). Design/methodology/approach – The data were collected in standard ROS format. The environment, the fixed-wing aircraft and the sensor configuration are explained in detail. Findings – The data set can be used for target localization and mapping. The data were collected to assist algorithm development and help researchers to compare their results. Robotic data sets are particularly important when they are related to unmanned systems such as fixed-wing aircraft. Originality/value – The Robotics Data Set Repository (RADISH) by A. Howard and N. Roy hosts 41 well-known data sets with different sensors; however, there is no fixed-wing data set in RADISH. This work presents two data sets collected by a fixed-wing aircraft using ROS standards. The data sets can be used for target localization and SLAM.
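Since the data are distributed in standard ROS bag format, they can be inspected with the usual rosbag Python API, as in the sketch below; the bag filename and topic name are assumptions, so list the actual topics with `rosbag info` first.

```python
import rosbag  # available with a ROS 1 installation

# The filename and topic names below are placeholders for this data set.
with rosbag.Bag("fixedwing_dataset.bag") as bag:
    print(bag.get_type_and_topic_info().topics.keys())   # discover what was recorded
    for topic, msg, t in bag.read_messages(topics=["/gps/fix"]):
        # GPS ground truth: sensor_msgs/NavSatFix messages carry lat/lon/alt
        print(t.to_sec(), msg.latitude, msg.longitude, msg.altitude)
```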


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

Purpose The primary aim of this study is to review the studies from different dimensions, including the type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How were the experimentation setup, characteristics of the data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of the imputation methods? Design/methodology/approach The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage were not MVI techniques relevant to this study. The papers were first screened by title for relevancy, and 306 papers were identified as appropriate. Upon reviewing the abstract text, 151 papers that were not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review. From these, 117 papers are used in the assessment of the review questions. Findings This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as the baseline to evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while the missing data type and mechanism pertain to the capability of the imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge. Originality/value It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputation based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, is popular across various domains.
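The common evaluation setup noted above (a complete baseline data set, artificially induced missingness, RMSE on the imputed entries) can be reproduced in a few lines with scikit-learn's KNNImputer; the data set, missingness ratio and neighbor count below are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)
X_true = load_iris().data                       # complete data set used as the baseline
mask = rng.random(X_true.shape) < 0.10          # 10 % missing completely at random
X_missing = X_true.copy()
X_missing[mask] = np.nan

X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE on imputed entries: {rmse:.4f}")
```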


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural networks have the characteristics of heavy computation, high storage cost and low accuracy for tiny defect detection, which conflicts with the high real-time performance and accuracy required by industrial applications under limited computing resources and storage. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improves the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with a finer receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007, and on a steel ingot data set collected in an actual industrial field. The experimental results demonstrate that the proposed YOLOv4-Defect method can greatly improve recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time performance and precision.
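As background for the knowledge distillation step, the sketch below shows a generic PyTorch distillation loss that blends the hard-label task loss with a temperature-softened teacher term; it is a textbook classification formulation, not the loss actually used in YOLOv4-Defect.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Generic knowledge-distillation loss (not the paper's exact formulation).

    Blends the hard-label cross-entropy with a KL term computed at temperature T
    from the (larger, uncompressed) teacher's predictions.
    """
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# toy usage with random logits for a 5-class problem
student = torch.randn(8, 5, requires_grad=True)
teacher = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
print(distillation_loss(student, teacher, labels))
```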


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models which excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
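The model-interpretation step (Kernel SHAP applied to an XGBoost classifier) can be sketched as follows; the synthetic data and hyperparameters are placeholders, and the data Shapley subject selection described in the article is a separate, more involved procedure not shown here.

```python
import numpy as np
import shap                       # pip install shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for the volumetric MRI feature table.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Kernel SHAP: model-agnostic explanation against a background sample.
background = shap.sample(X, 50, random_state=0)
explainer = shap.KernelExplainer(lambda d: model.predict_proba(d)[:, 1], background)
shap_values = explainer.shap_values(X[:5])        # per-feature contributions per subject
print(np.round(shap_values, 3))
```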


2014 ◽  
Vol 25 (4) ◽  
pp. 550-567 ◽  
Author(s):  
Ahmed Mosallam ◽  
Kamal Medjaher ◽  
Noureddine Zerhouni

Purpose – The development of complex systems has increased the demand for condition monitoring techniques so as to maximize operational availability and safety while decreasing costs. Signal analysis is one of the methods used to develop condition monitoring in order to extract important information contained in the sensory signals, which can be used for health assessment. However, extraction of such information from collected data in a practical working environment is always a great challenge, as sensory signals are usually multi-dimensional and obscured by noise. The paper aims to discuss this issue. Design/methodology/approach – This paper presents a method for trend extraction from multi-dimensional sensory data; the extracted trends are then used for machinery health monitoring and maintenance needs. The proposed method is based on extracting successive features from machinery sensory signals. Then, unsupervised feature selection is applied on the feature domain without making any assumptions concerning the source of the signals or the number of extracted features. Finally, the empirical mode decomposition (EMD) algorithm is applied to the projected features with the purpose of following the evolution of the data in a compact representation over time. Findings – The method is demonstrated on an accelerated degradation data set of bearings acquired from the PRONOSTIA experimental platform and a second data set acquired from the NASA repository. Originality/value – The method is able to extract interesting signal trends which can be used for health monitoring and remaining useful life prediction.
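A minimal illustration of using EMD to isolate a slow trend from a noisy one-dimensional feature, assuming the EMD-signal (PyEMD) package; the synthetic signal stands in for the projected health-indicator features used in the paper.

```python
import numpy as np
from PyEMD import EMD          # pip install EMD-signal

# Synthetic degradation-like feature: slow quadratic drift + oscillation + noise.
t = np.linspace(0, 1, 500)
signal = 0.5 * t**2 + 0.05 * np.sin(40 * np.pi * t) + 0.02 * np.random.randn(500)

emd = EMD()
emd.emd(signal, t)                          # decompose into intrinsic mode functions
imfs, residue = emd.get_imfs_and_residue()  # the residue carries the slow trend
print(f"{len(imfs)} IMFs extracted; first residue values: {np.round(residue[:5], 4)}")
```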


2017 ◽  
Vol 24 (4) ◽  
pp. 1052-1064 ◽  
Author(s):  
Yong Joo Lee ◽  
Seong-Jong Joo ◽  
Hong Gyun Park

Purpose The purpose of this paper is to measure the comparative efficiency of 18 Korean commercial banks in the presence of negative observations and to examine performance differences among them by grouping them according to their market conditions. Design/methodology/approach The authors employ two data envelopment analysis (DEA) models that can handle negative data: a Banker, Charnes and Cooper (BCC) model and a modified slacks-based measure of efficiency (MSBM) model. The BCC model is proven to be translation invariant for inputs or outputs depending on output or input orientation, while the MSBM model is unit invariant in addition to being translation invariant. The authors compare results from both models and choose one for interpreting the results. Findings Most Korean banks recovered from their worst performance in 2011 and showed similar performance in recent years. Among the three groups of national banks, regional banks and special banks, most of the special banks demonstrated superb performance across models and years. In particular, the performance difference between the special banks and the regional banks was statistically significant. The authors conclude that the high performance of the special banks was due to their nationwide market access and ownership type. Practical implications This study demonstrates how to analyze and measure the efficiency of entities when variables contain negative observations, using a data set for Korean banks. The authors have tried two major DEA models that are able to handle negative data and proposed a practical direction for future studies. Originality/value Although there are research papers measuring the performance of banks in Korea, all of the papers on the topic have studied efficiency or productivity using positive data sets. However, variables such as net income and growth rates frequently include negative observations in bank data sets. This is the first paper to investigate the efficiency of bank operations in the presence of negative data in Korea.
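For reference, the input-oriented BCC (variable returns to scale) efficiency score can be computed as a small linear program; the sketch below uses the textbook formulation with toy bank data and is not the exact model specification or data from the study. Under variable returns to scale the output constraints are unaffected by adding a constant to every output, which is the translation invariance that lets the model absorb negative observations.

```python
import numpy as np
from scipy.optimize import linprog

def bcc_input_efficiency(X, Y, o):
    """Input-oriented BCC (VRS) efficiency of DMU `o`.

    X: (n_dmu, n_inputs), Y: (n_dmu, n_outputs). Textbook formulation.
    """
    n, m = X.shape
    _, s = Y.shape
    c = np.zeros(1 + n)
    c[0] = 1.0                                   # minimise theta; remaining vars are lambdas
    # inputs:  sum_j lam_j * x_ij <= theta * x_io
    A_in = np.hstack([-X[o].reshape(m, 1), X.T])
    b_in = np.zeros(m)
    # outputs: sum_j lam_j * y_rj >= y_ro
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])
    b_out = -Y[o]
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.concatenate([b_in, b_out])
    A_eq = np.hstack([[0.0], np.ones(n)]).reshape(1, -1)   # VRS: sum lam_j = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (1 + n), method="highs")
    return res.x[0]

# toy data: 4 banks, 2 inputs, 1 output; one output is negative, which the
# input-oriented BCC model tolerates because of its translation invariance in outputs
X = np.array([[20.0, 300], [30, 200], [40, 100], [20, 200]])
Y = np.array([[100.0], [80], [60], [-10]])
print([round(bcc_input_efficiency(X, Y, o), 3) for o in range(4)])
```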


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
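A rough sketch of the landmark idea: fit the manifold only on a landmark subset so the similarity matrix stays small, then extend the embedding to the remaining points. Here landmarks are sampled uniformly at random and scikit-learn's Isomap is used, unlike the local curvature variation sampling and manifold skeleton construction in the article.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Synthetic manifold data standing in for hyperspectral pixels.
X, _ = make_swiss_roll(n_samples=5000, random_state=0)
rng = np.random.default_rng(0)
landmarks = rng.choice(len(X), size=500, replace=False)   # 500x500 matrix instead of 5000x5000

iso = Isomap(n_neighbors=10, n_components=2).fit(X[landmarks])
embedding = iso.transform(X)            # out-of-sample extension to all points
print(embedding.shape)
```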

