Insight into Multiple References in an MT Evaluation Metric

Author(s):  
Ying Qin ◽  
Lucia Specia


Sensors ◽  
2020 ◽  
Vol 20 (2) ◽  
pp. 557 ◽  
Author(s):  
Rui Zhang ◽  
Oliver Amft

We present an eating detection algorithm for wearable sensors that first detects chewing cycles and subsequently estimates eating phases. We term this algorithm class a bottom-up approach. We evaluated the algorithm on electromyographic (EMG) recordings from diet-monitoring eyeglasses in free-living conditions and compared the bottom-up approach against two top-down algorithms. We show that the F1 score is no longer the primary relevant evaluation metric once retrieval rates exceed approximately 90%. Instead, detection timing errors provide more important insight into detection performance. In 122 hours of free-living EMG data from 10 participants, a total of 44 eating occasions were detected, with a maximum F1 score of 99.2%. Average detection timing errors of the bottom-up algorithm were 2.4 ± 0.4 s and 4.3 ± 0.4 s for the start and end of eating occasions, respectively. Our bottom-up algorithm has the potential to work with different wearable sensors that provide chewing cycle data. We suggest that the research community report timing errors (e.g., using the metrics described in this work).
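
As an illustration of the timing-error metrics suggested above, here is a minimal Python sketch, assuming eating occasions are represented as (start, end) intervals in seconds; the interval representation, the overlap-based matching rule, and all names are our assumptions rather than the authors' exact protocol.

```python
# Hypothetical sketch: start/end timing errors for detected eating occasions.
from typing import List, Tuple

def timing_errors(detected: List[Tuple[float, float]],
                  truth: List[Tuple[float, float]]):
    """Return absolute start/end timing errors (in seconds) for each
    detected occasion that overlaps a ground-truth occasion."""
    start_errs, end_errs = [], []
    for d_start, d_end in detected:
        # Match each detection to the first overlapping ground-truth occasion.
        for t_start, t_end in truth:
            if d_start < t_end and d_end > t_start:  # intervals overlap
                start_errs.append(abs(d_start - t_start))
                end_errs.append(abs(d_end - t_end))
                break
    return start_errs, end_errs

detected = [(10.0, 310.0), (900.0, 1205.0)]
truth = [(12.5, 305.0), (898.0, 1200.0)]
s, e = timing_errors(detected, truth)
print(sum(s) / len(s), sum(e) / len(e))  # mean start/end errors in seconds
```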


Author(s):  
Samiksha Tripathi ◽  
Vineet Kansal

Machine Translation (MT) evaluation metrics such as BiLingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) are known to perform poorly for word-order-flexible and morphologically rich languages. Applying linguistic knowledge to evaluate MT into a morphologically rich target language such as Hindi has been shown to be more effective and accurate [S. Tripathi and V. Kansal, Using linguistic knowledge for machine translation evaluation with Hindi as a target language, Comput. Sist. 21(4) (2017) 717–724]. Leveraging recent progress in word and sentence vector embeddings [T. Mikolov and J. Dean, Distributed representations of words and phrases and their compositionality, Adv. Neural Inf. Process. Syst. 2 (2013) 3111–3119], the authors trained word and sentence vector embeddings for Hindi on a large corpus of pre-processed Hindi text ([Formula: see text] million tokens). The training was performed on a high-end system configuration using Google Cloud platform resources. The sentence vector embeddings are further used to corroborate the findings obtained through linguistic knowledge in the evaluation metric, which is considered an optimal solution for morphologically rich target languages. In this paper, the authors demonstrate that a sentence-embedding-based MT evaluation approach closely mirrors the linguistic evaluation technique. The code used to generate the vector embeddings for Hindi has been uploaded to the code-sharing platform GitHub.
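
As an illustration of the sentence-embedding-based scoring described above, here is a minimal Python sketch that averages word vectors into a sentence vector and scores a hypothesis by cosine similarity to the reference; the toy random vectors stand in for embeddings trained on the Hindi corpus, and the averaging scheme is our assumption, not necessarily the authors' trained model.

```python
# Illustrative sketch: embedding-based MT evaluation via cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["यह", "एक", "अच्छा", "अनुवाद", "है"]
vectors = {w: rng.normal(size=50) for w in vocab}  # stand-in for trained vectors

def sentence_vector(tokens):
    """Mean of the word vectors of in-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def embedding_score(hypothesis: str, reference: str) -> float:
    """Cosine similarity between hypothesis and reference sentence vectors."""
    h = sentence_vector(hypothesis.split())
    r = sentence_vector(reference.split())
    return float(np.dot(h, r) / (np.linalg.norm(h) * np.linalg.norm(r)))

print(embedding_score("यह एक अच्छा अनुवाद है", "यह अनुवाद अच्छा है"))
```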


Author(s):  
Petr Homola ◽  
Vladislav Kuboň ◽  
Pavel Pecina

2014 ◽  
Vol 2014 ◽  
pp. 1-12 ◽  
Author(s):  
Aaron L.-F. Han ◽  
Derek F. Wong ◽  
Lidia S. Chao ◽  
Liangye He ◽  
Yi Lu

With the rapid development of machine translation (MT), MT evaluation has become very important for telling us in a timely manner whether an MT system is making progress. Conventional MT evaluation methods calculate the similarity between hypothesis translations produced by automatic translation systems and reference translations produced by professional translators. Existing evaluation metrics have several weaknesses. First, incomprehensive design factors lead to a language-bias problem: the metrics perform well on some language pairs but poorly on others. Second, they tend to use either no linguistic features or too many; using none draws criticism from linguists, while using too many makes the model hard to reproduce. Third, the required reference translations are expensive to produce and are sometimes unavailable in practice. In this paper, the authors propose an unsupervised MT evaluation metric based on a universal part-of-speech tagset that does not rely on reference translations. The authors also explore the performance of the designed metric on traditional supervised evaluation tasks. Both the supervised and unsupervised experiments show that the designed methods yield higher correlation with human judgments.
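
The abstract does not give the exact formulation, so the following is only an illustrative Python sketch of a reference-free score over universal POS tag sequences, comparing the hypothesis against the source sentence; the POS bigram-overlap measure and all names here are our assumptions, not the paper's metric.

```python
# Illustrative sketch: reference-free scoring via universal POS n-gram overlap.
from collections import Counter

def pos_ngrams(tags, n=2):
    """Multiset of POS n-grams in a tag sequence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def pos_overlap_score(src_tags, hyp_tags, n=2):
    """F1-style overlap of POS n-grams between source and hypothesis."""
    src, hyp = pos_ngrams(src_tags, n), pos_ngrams(hyp_tags, n)
    matched = sum((src & hyp).values())
    if not matched:
        return 0.0
    p = matched / sum(hyp.values())  # precision w.r.t. hypothesis n-grams
    r = matched / sum(src.values())  # recall w.r.t. source n-grams
    return 2 * p * r / (p + r)

# Universal POS tags for a source sentence and an MT hypothesis.
src = ["PRON", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
hyp = ["PRON", "VERB", "DET", "NOUN", "PUNCT"]
print(pos_overlap_score(src, hyp))
```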


2021 ◽  
Author(s):  
Seyed Shayan Sajjadinia ◽  
Bruno Carpentieri ◽  
Gerhard A. Holzapfel

Numerical simulation is widely used to study physical systems, although it can be computationally too expensive. To counter this limitation, a surrogate may be used: a high-performance model that replaces the main numerical model, e.g., a machine learning (ML) regressor trained on a previously generated subset of the possible inputs and outputs of the numerical model. In this context, inspired by the definition of the mean squared error (MSE) metric, we introduce the pointwise MSE (PMSE) metric, which gives better insight into the performance of such ML models over the test set by focusing on every point that forms the physical system. To show the merits of the metric, we create a dataset for a physics problem, use it to train an ML surrogate, and then evaluate the surrogate with the metrics. In our experiment, the PMSE contour demonstrates how the model learns the physics in different regions of the model; in particular, the correlation between the characteristics of the numerical model and the learning progress can be observed. We therefore conclude that this simple and efficient metric can provide complementary and potentially interpretable information regarding the performance and functionality of the surrogate.
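
The abstract does not spell out the formula, so here is a minimal sketch of how such a pointwise MSE could be computed, assuming the test outputs are stored as an array of shape (n_samples, n_points); the function name `pmse` and the data layout are our assumptions.

```python
# Minimal sketch of a pointwise MSE (PMSE): the squared error is averaged
# over test samples separately at every point of the physical system,
# yielding one error value per point instead of a single scalar MSE.
import numpy as np

def pmse(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Per-point MSE over the test set; returns shape (n_points,)."""
    return np.mean((y_true - y_pred) ** 2, axis=0)

rng = np.random.default_rng(0)
y_true = rng.normal(size=(200, 64))                    # 200 samples, 64 points
y_pred = y_true + rng.normal(scale=0.1, size=(200, 64))
per_point = pmse(y_true, y_pred)
print(per_point.shape, per_point.mean())  # averaging over points recovers MSE
```

Plotted over the geometry of the physical system, such a per-point array is exactly what produces the PMSE contour discussed in the abstract.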


Author(s):  
Ahrii Kim ◽  
Jinhyun Kim

SacreBLEU, which incorporates a text-normalizing step in its pipeline, has been well received as an automatic evaluation metric in recent years. For agglutinative languages such as Korean, however, the metric cannot produce a meaningful result without customized pre-tokenization. This paper therefore examines the influence of diversified pre-tokenization schemes (word, morpheme, character, and subword) on the metric by performing a meta-evaluation with manually constructed into-Korean human evaluation data. Our empirical study demonstrates that the correlation of SacreBLEU with human judgment varies consistently with the token type. Some tokenizations even deteriorate the reliability of the metric, and MeCab is no exception. To guide proper tokenizer usage for the metric, we stress the significance of the character level and the insignificance of the Jamo level in MT evaluation.
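
As a usage illustration, the following sketch shows how a tokenization option is passed to the sacrebleu Python API, contrasting the default 13a tokenizer with the character level discussed above; the Korean sentences are toy examples and the resulting scores are for illustration only.

```python
# Sketch: how the tokenize option changes SacreBLEU scores for Korean.
from sacrebleu.metrics import BLEU

hyps = ["나는 오늘 아침에 학교에 갔다"]
refs = [["나는 오늘 아침 학교에 갔었다"]]

for tok in ("13a", "char"):  # default word-ish tokenizer vs. character level
    bleu = BLEU(tokenize=tok)
    print(tok, bleu.corpus_score(hyps, refs).score)
```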


1966 ◽  
Vol 24 ◽  
pp. 322-330
Author(s):  
A. Beer

The investigations which I should like to summarize in this paper concern recent photo-electric luminosity determinations of O and B stars. Their final aim has been the derivation of new stellar distances, and some insight into certain patterns of galactic structure.

