evaluation metrics
Recently Published Documents


TOTAL DOCUMENTS

578
(FIVE YEARS 256)

H-INDEX

28
(FIVE YEARS 9)

Knowledge ◽  
2022 ◽  
Vol 2 (1) ◽  
pp. 55-87
Author(s):  
Sargam Yadav ◽  
Abhishek Kaushik

Conversational systems are now applicable to almost every business domain. Evaluation is an important step in the creation of dialog systems so that they may be readily tested and prototyped, yet there is no universally agreed-upon metric for evaluating all dialog systems. Human evaluation, which is not computerized, remains the most effective and complete evaluation approach, but data gathering and analysis are evaluation activities that require human intervention. In this work, we address the many types of dialog systems and the assessment methods that may be applied to them. The benefits and drawbacks of each type of evaluation approach are also explored, which helps clarify the expectations associated with developing an automated evaluation system. The objective of this study is to investigate conversational agents, their design approaches, and their evaluation metrics. This approach can help us better understand the overall process of dialog system development and future possibilities for enhancing user experience. Because human assessment is costly and time-consuming, we emphasize the need for a generally recognized, automated evaluation model for conversational systems, which could significantly reduce the time required for analysis.
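The abstract's call for automated evaluation can be made concrete with a simple word-overlap metric. The sketch below is an illustration only, not a metric proposed in the paper: it computes token-level F1 between a system response and a human reference reply.

```python
from collections import Counter

def overlap_f1(response: str, reference: str) -> float:
    """Token-level F1 between a system response and a reference reply."""
    resp = Counter(response.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((resp & ref).values())  # tokens shared by both sides
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Metrics of this kind are cheap to run at scale, but, as the abstract notes, they only approximate the human judgment they stand in for.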


2021 ◽  
Vol 15 ◽  
Author(s):  
Zarina Rakhimberdina ◽  
Quentin Jodelet ◽  
Xin Liu ◽  
Tsuyoshi Murata

With the advent of brain imaging techniques and machine learning tools, much effort has been devoted to building computational models that capture the encoding of visual information in the human brain. One of the most challenging brain decoding tasks is the accurate reconstruction of perceived natural images from brain activity measured by functional magnetic resonance imaging (fMRI). In this work, we survey the most recent deep learning methods for natural image reconstruction from fMRI. We examine these methods in terms of architectural design, benchmark datasets, and evaluation metrics, and present a fair performance comparison across standardized metrics. Finally, we discuss the strengths and limitations of existing studies and present potential future directions.
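Studies in this literature commonly score reconstructions with pixel-wise Pearson correlation against the ground-truth image. A minimal self-contained sketch of that metric follows (an illustration of the general idea, not the survey's exact evaluation protocol); images are assumed to be flattened into equal-length lists of pixel values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two flattened pixel vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A perfect reconstruction scores 1.0; an inverted one scores -1.0.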


2021 ◽  
Author(s):  
◽  
Ankit Patel

<p>This doctoral thesis examines the multivariate nature of sporting performances, expressed as performance on context-specific tasks, to develop a novel framework for constructing sport-based rating systems, also referred to as scoring models. The intent of this framework is to produce reliable, robust, intuitive, and transparent ratings, regarded as meaningful, for performance in the sport player and team evaluation environment. In this thesis, Bracewell's (2003) definition of a rating as an elegant form of dimension reduction is extended: ratings are an elegant and excessive form of dimension reduction whereby a single numerical value provides an objective interpretation of performance. The data, provided by numerous vendors, is a summary of the actions and performances completed by an individual during the evaluation period. A literature review of rating systems for measuring performance revealed a set of common methodologies, which were applied to produce a set of rating systems used as pilot studies to gather learnings and limitations surrounding the current literature. By reviewing rating methodologies and developing rating systems, a set of limitations and commonalities in the current literature was identified and used to develop a novel framework for constructing sport-based rating systems that output measures of both team- and player-level performance. The proposed framework adopts a multi-objective ensembling strategy and implements five key commonalities present within many rating methodologies: 1) dimension reduction and feature selection techniques, 2) feature engineering tasks, 3) a multi-objective framework, 4) time-based variables, and 5) an ensembling procedure to produce an overall rating. An ensemble approach is adopted because it is assumed that sporting performances are a function of the significant traits affecting performance; that is, performance = f(trait_1, …, trait_n). Moreover, the framework is a form of model stacking, where information from multiple models is combined to generate a more informative model. Rating systems built using this approach provide a meaningful quantitative interpretation of performance during a specific time interval, known as the evaluation period. The framework introduces a methodical approach for constructing rating systems within the sporting domain that produce meaningful ratings. Meaningful ratings must 1) yield good performance when data are drawn from a wide range of probability distributions, remaining largely unaffected by outliers, small departures from model assumptions, and small sample sizes (robust), 2) be accurate and produce highly informative predictions that are well-calibrated and sharp (reliable), 3) be interpretable and easy to communicate (transparent), and 4) relate to real-world observable outcomes (intuitive). The approach was tested and validated by constructing both team and individual player-based rating systems within the cricketing context. The resulting ratings were found to be meaningful, in that they were reliable, robust, transparent, and intuitive. This ratings framework is not restricted to cricket and is applicable in any sporting code where a summary of multivariate data is necessary to understand performance. Common model evaluation metrics were found to be limited and lacking in applicability when evaluating the effectiveness of meaningful ratings, so a novel evaluation metric was developed.
The constructed metric applies distance- and magnitude-based metrics derived from the spherical scoring rule methodology. The distance- and magnitude-based spherical (DMS) metric applies an analytic hierarchy process to assess the effectiveness of meaningful sport-based ratings and accounts for forecasting difficulty on a time basis. The DMS performance metric quantifies elements of the decision-making process by 1) evaluating the distance between the ratings reported by the modeller and the actual outcome or the modeller's 'true' beliefs, 2) providing an indication of "good" ratings, 3) accounting for the context and the forecasting difficulty to which the ratings are applied, and 4) capturing the introduction of any subjective human bias within sport-based rating systems. The DMS metric is shown to outperform conventional model evaluation metrics, such as the log-loss, in specific sporting scenarios of varying difficulty.</p>
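The DMS metric itself is specific to this thesis, but the spherical scoring rule it derives from is standard: a probabilistic forecast p over discrete outcomes is scored as p_y / ||p||_2, where y is the realized outcome. A minimal sketch, assuming the forecast is given as a list of probabilities:

```python
import math

def spherical_score(probs, outcome):
    """Spherical scoring rule: probability of the realized outcome,
    normalized by the Euclidean norm of the whole forecast vector."""
    norm = math.sqrt(sum(p * p for p in probs))
    return probs[outcome] / norm
```

Like the log-loss mentioned above, the spherical score is a proper scoring rule, so a confident forecast that turns out correct earns a higher score than a hedged one.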



Automatic text summarization is the task of generating a short and accurate summary of a longer text document. Text summarization can be classified by the number of input documents (single-document and multi-document summarization) and by the characteristics of the generated summary (extractive and abstractive summarization). Multi-document summarization is the automatic process of creating a relevant, informative, and concise summary from a cluster of related documents. This paper presents a detailed survey of the existing literature on the various approaches to text summarization. A few of the most popular approaches, such as graph-based, cluster-based, and deep-learning-based summarization techniques, are discussed here along with the evaluation metrics, providing insight for future researchers.
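As a toy illustration of the extractive family discussed above (a sketch, not a method from the survey), sentences can be scored by the summed corpus frequency of their words, and the top-scoring sentences kept in their original order:

```python
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Score sentences by summed word frequency; keep the top n in order."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    freqs = Counter(w for s in sentences for w in s.lower().split())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freqs[w] for w in sentences[i].lower().split()))
    keep = sorted(ranked[:n_sentences])  # restore document order
    return '. '.join(sentences[i] for i in keep) + '.'
```

Real extractive systems refine each step of this pipeline: better sentence segmentation, TF-IDF or graph-based sentence scoring, and redundancy control across the selected sentences.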


2021 ◽  
Author(s):  
Ram Ayyala ◽  
Junghyun Jung ◽  
Sergey Knyazev ◽  
SERGHEI MANGUL

Although precise identification of human leukocyte antigen (HLA) alleles is crucial for various clinical and research applications, HLA typing remains challenging due to the high polymorphism of the HLA loci. With Next-Generation Sequencing (NGS) data becoming widely accessible, many computational tools have been developed to predict HLA types from RNA sequencing (RNA-seq) data. However, there is a lack of comprehensive and systematic benchmarking of RNA-seq HLA callers against large-scale and realistic gold standards. To address this limitation, we rigorously compared the performance of 12 HLA callers across over 50,000 HLA typing tasks, spanning 30 pairwise combinations of HLA caller and reference in over 1,500 samples. In each case, we computed accuracy, defined as the percentage of correctly predicted alleles (at two- and four-digit resolution), against six gold standard datasets spanning 650 RNA-seq samples. To determine the influence of read length over the HLA region on each tool's prediction quality, we explored read lengths in the range of 37-126 bp, as available in our gold standard datasets. Moreover, using the Genotype-Tissue Expression (GTEx) v8 data, we calculated the concordance of HLA types across different tissues from the same individual to evaluate how consistently the HLA callers perform across tissues. This study offers crucial information for researchers regarding the appropriate choice of methods for HLA analysis.
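The accuracy metric described here, the fraction of alleles that match the gold standard at a given resolution, can be sketched as below. The allele strings and the colon-field truncation rule are illustrative assumptions (two-digit resolution keeps one colon-separated field, e.g. A*02; four-digit keeps two, e.g. A*02:01); the benchmark's exact matching rules may differ.

```python
def hla_accuracy(predicted, truth, fields=2):
    """Fraction of predicted alleles matching gold-standard alleles
    when both are truncated to the given number of colon-separated fields
    (fields=1 -> two-digit resolution, fields=2 -> four-digit)."""
    def trunc(allele):
        return ':'.join(allele.split(':')[:fields])
    hits = sum(trunc(p) == trunc(t) for p, t in zip(predicted, truth))
    return hits / len(truth)
```

The same comparison run at both resolutions shows why callers often look much stronger at two-digit than at four-digit resolution.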


2021 ◽  
Author(s):  
Gaifang Luo ◽  
Lijun Cheng ◽  
Chao Jing ◽  
Can Zhao ◽  
Guozhu Song

2021 ◽  
Vol 11 (21) ◽  
pp. 10337
Author(s):  
Junkai Ren ◽  
Yujun Zeng ◽  
Sihang Zhou ◽  
Yichuan Zhang

Scaling end-to-end learning to control robots from vision inputs is a challenging problem in deep reinforcement learning (DRL). While achieving remarkable success in complex sequential tasks, vision-based DRL remains extremely data-inefficient, especially when dealing with high-dimensional pixel inputs. Many recent studies have tried to leverage state representation learning (SRL) to break through this barrier, and some even help the agent learn from pixels as efficiently as from states. Reproducing existing work, accurately judging the improvements offered by novel methods, and applying these approaches to new tasks are vital for sustaining this progress. However, these three demands are seldom straightforward: without clear criteria and tighter standardization of experimental reporting, it is difficult to determine whether improvements over previous methods are meaningful. For this reason, we conducted ablation studies on hyperparameters, embedding network architecture, embedding dimension, regularization methods, sample quality, and SRL methods to systematically compare and analyze their effects on representation learning and reinforcement learning. Three evaluation metrics are summarized, five baseline algorithms (both value-based and policy-based) are included, and eight tasks are adopted to avoid the particularity of any single experimental setting. We highlight the variability in reported methods and, based on a wide range of experimental analyses, suggest guidelines to make future SRL results more reproducible and stable. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
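One reporting practice implied by this analysis is to aggregate each configuration over multiple random seeds rather than reporting a single run. A minimal sketch of such aggregation (illustrative, not the paper's exact protocol):

```python
import math

def summarize_runs(returns):
    """Mean and sample standard deviation of per-seed episodic returns,
    as would be reported for one (algorithm, task) configuration."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # Bessel-corrected
    return mean, math.sqrt(var)
```

Reporting the spread alongside the mean is precisely what lets readers judge whether an improvement over a baseline exceeds seed-to-seed variability.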


2021 ◽  
Author(s):  
Takashi Itahashi ◽  
Yuta Y. Aoki ◽  
Ayumu Yamashita ◽  
Takafumi Soda ◽  
Junya Fujino ◽  
...  

A downside of upgrading MRI acquisition sequences is the loss of technological homogeneity of the MRI data, which hampers combining new and old datasets, especially in a longitudinal design. Characterizing the effects of an upgrade on multiple brain parameters and examining the efficacy of harmonization methods are therefore essential. This study investigated upgrade effects on three structural parameters (cortical thickness (CT), surface area (SA), and cortical volume (CV)) and on resting-state functional connectivity (rs-FC), collected from 64 healthy volunteers. We used two evaluation metrics, Cohen's d and classification accuracy, to quantify the effects. In the classification analyses, we built classifiers to differentiate the acquisition protocols from the brain parameters. We investigated the efficacy of three harmonization methods (traveling subject (TS), TS-ComBat, and ComBat) and the number of participants sufficient to eliminate the effects on the evaluation metrics. Finally, we performed age prediction as an example to confirm that the harmonization methods retained biological information. Without harmonization, we observed small to large mean Cohen's d values on the brain parameters (CT: 0.85, SA: 0.66, CV: 0.68, rs-FC: 0.24) and high classification accuracy (>92%). With harmonization, Cohen's d values approached zero. Classification performance reached chance level with the TS-based techniques when data from fewer than 26 participants were used for estimating the effects, while the ComBat method required more participants. Furthermore, the harmonization methods improved age prediction performance, except for the ComBat method. These results suggest that acquiring TS data is essential to preserve the continuity of MRI data.
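The first evaluation metric, Cohen's d, is the standardized mean difference of a parameter between the two protocols. A minimal sketch using the common pooled-standard-deviation form (an illustration; the study's exact estimator may differ):

```python
import math

def cohens_d(a, b):
    """Cohen's d: standardized mean difference between two samples,
    using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled
```

By the usual convention, |d| around 0.2 is a small effect and 0.8 a large one, which is why the pre-harmonization values of 0.85 (CT) versus 0.24 (rs-FC) indicate protocol effects of very different magnitudes.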

