From Tf-Idf to Learning-to-Rank

2016 ◽  
pp. 1245-1292 ◽  
Author(s):  
Muhammad Ibrahim ◽  
Manzur Murshed

Ranking a set of documents by their relevance to a given query is a central problem of information retrieval (IR). Traditionally, unsupervised scoring methods such as tf-idf, BM25, and language models have been used, but more recently supervised machine learning has been applied successfully to learn a ranking function, a task known as learning-to-rank (LtR). There are a few surveys on LtR in the literature, but these reviews offer little assistance to someone who, before delving into the technical details of different algorithms, wants a broad understanding of LtR systems and of their evolution from, and relation to, the traditional IR methods. This chapter tries to address this gap in the literature. The following aspects are discussed: the fundamental concepts of IR, the motivation behind LtR, the evolution of LtR from and its relation to the traditional methods, the relationship between LtR and other supervised machine learning tasks, the general issues pertaining to an LtR algorithm, and the theory of LtR.
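
As a rough illustration of the unsupervised scoring that the abstract contrasts with LtR, the sketch below ranks documents by a plain tf-idf sum over the query terms. It is not taken from the chapter: the function name `tfidf_rank`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the toy documents are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Return document indices ordered by summed tf-idf of the query terms."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # Raw term frequency times unsmoothed inverse document frequency.
                score += tf[term] * math.log(n_docs / df[term])
        scores.append((score, idx))

    # Highest score first; ties keep the original document order.
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = [
    "learning to rank with supervised models",
    "cooking pasta at home",
    "rank documents for a query using tf idf",
]
ranking = tfidf_rank("rank documents", docs)
print(ranking)
```

An LtR system would typically treat scores like this one as input features and learn how to combine them from labeled query-document pairs, rather than using a single fixed formula.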


2020 ◽  
Author(s):  
JINGYANG CAO ◽  
Shirong Yin ◽  
Guoxu Zhang

Abstract: This paper presents a novel approach that analyzes the sentiment of product comments from the sentence level to the document level and applies it to customer sentiment analysis of UAV-aided product comments for hotel management. To realize efficient sentiment analysis, a cascaded sentence-to-document sentiment classification method is investigated. Initially, a supervised machine learning method is applied to explore the sentiment polarity of the sentence (SPS). Afterward, the contribution of the sentence to document (CSD) is calculated by using various statistical algorithms. Lastly, the sentiment polarity of the document (SPD) is determined by the SPS as well as its contribution. Comparative experiments have been conducted on hotel online comments, and the outcomes indicate that the proposed method not only raises the efficiency in attaining a more accurate result but also assists immensely with regard to the B5G wireless communication supported by the UAV. The findings provide a new perspective: sentence position and its sentiment similarity with the document (sentiment condition) strongly reveal the relationship between sentence and document.
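
The last step of the cascade, combining SPS and CSD into SPD, can be pictured as a weighted average. The sketch below is only one plausible reading of the abstract: the function name `document_polarity`, the polarity values in [-1, 1], and the example weights are hypothetical, and the paper's actual statistical algorithms for CSD are not reproduced here.

```python
def document_polarity(sentence_polarities, contributions):
    """Combine sentence polarities (SPS) into a document polarity (SPD).

    Assumption: SPD is the contribution-weighted mean of the SPS values.
    """
    if len(sentence_polarities) != len(contributions):
        raise ValueError("need one contribution weight per sentence")
    total = sum(contributions)
    if total == 0:
        # No sentence carries any weight, so report a neutral document.
        return 0.0
    weighted = sum(p * w for p, w in zip(sentence_polarities, contributions))
    return weighted / total


# Hypothetical example: three sentences, the first one weighted most heavily.
sps = [1.0, -1.0, 1.0]
csd = [0.5, 0.2, 0.3]
spd = document_polarity(sps, csd)
print(spd)
```

A positive `spd` would then be read as a positive document; where the decision threshold sits is a further assumption.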


IRBM ◽  
2019 ◽  
Vol 40 (3) ◽  
pp. 157-166 ◽  
Author(s):  
A.S.A. Huque ◽  
K.I. Ahmed ◽  
M.A. Mukit ◽  
R. Mostafa

2017 ◽  
Author(s):  
Daniel R. Schrider ◽  
Julien Ayroles ◽  
Daniel R. Matute ◽  
Andrew D. Kern

ABSTRACT Hybridization and gene flow between species appears to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees) capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia.

AUTHOR SUMMARY Understanding the extent to which species or diverged populations hybridize in nature is crucially important if we are to understand the speciation process. Accordingly numerous research groups have developed methodology for finding the genetic evidence of such introgression. In this report we develop a supervised machine learning approach for uncovering loci which have introgressed across species boundaries. We show that our method, FILET, has greater accuracy and power than competing methods in discovering introgression, and in addition can detect the directionality associated with the gene flow between species. Using whole genome sequences from Drosophila simulans and Drosophila sechellia we show that FILET discovers quite extensive introgression between these species that has occurred mostly from D. simulans to D. sechellia. Our work highlights the complex process of speciation even within a well-studied system and points to the growing importance of supervised machine learning in population genetics.


2021 ◽  
Vol 13 (3) ◽  
pp. 23-34
Author(s):  
Chandrakant D. Patel ◽  
Jayesh M. Patel

With the large quantity of information available online, it is essential to retrieve the correct information for a user query. A large amount of data is available in digital form in multiple languages. Various approaches aim to increase the effectiveness of online information retrieval, but the standard approach to retrieving information for a user query is to search the documents in the corpus word by word for the given query. This approach is very time-consuming and may miss several related documents that are equally important. To avoid these issues, stemming has been used extensively in numerous Information Retrieval Systems (IRS) to increase the retrieval accuracy for all languages. This paper addresses the problem of stemming in Web Page Categorization for the Gujarati language, deriving stem words with the GUJSTER algorithm [1]. The GUJSTER algorithm is based on morphological rules and is used to derive the root or stem word from inflected words of the same class. In particular, we consider the influence of the extracted stem or root word on the integrity of web page classification using supervised machine learning algorithms. This research work focuses on the analysis of Web Page Categorization (WPC) for the Gujarati language and on verifying the influence of a stemming algorithm in a WPC application for Gujarati, with improved accuracy ranging from 63% to 98% using supervised machine learning models with a standard split of 80% for training and 20% for testing.


Author(s):  
Hardeo Kumar Thakur ◽  
Anand Gupta ◽  
Ayushi Bhardwaj ◽  
Devanshi Verma

This article describes how a rumor can be defined as a circulating unverified story or a doubtful truth. Rumor initiators seek social networks vulnerable to unlimited spread, so online social media becomes their stage. This misinformation inflicts colossal damage on individuals, organizations, governments, and others. Existing work that analyzes the temporal and linguistic characteristics of rumors seems to give ample time for rumor propagation. Meanwhile, with the huge outburst of data on social media, studying these characteristics for each tweet becomes spatially complex. Therefore, in this article, a two-fold supervised machine-learning framework is proposed that detects rumors by filtering and then analyzing their linguistic properties. This method attempts to automate filtering by training multiple classification algorithms with accuracy higher than 81.079%. Finally, using textual characteristics on the filtered data, rumors are detected. The effectiveness of the proposed framework is shown through extensive experiments on over 10,000 tweets.

