The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification

2021
pp. 1-29
Author(s): Fernando Alva-Manchego, Carolina Scarton, Lucia Specia

Abstract: In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgements of the simplicity achieved by executing specific operations (e.g. simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess sentence-level simplifications where multiple operations may have been applied and which, therefore, require more general simplicity judgements. To that end, we first collect a new and more reliable dataset for evaluating the correlation of metrics with human judgements of overall simplicity. Second, we conduct the first meta-evaluation of automatic metrics in Text Simplification, using our new dataset (and other existing data) to analyse how the correlation between metrics' scores and human judgements varies across three dimensions: the perceived simplicity level, the system type, and the set of references used for computation. We show that all three aspects affect the correlations and, in particular, highlight the limitations of commonly used operation-specific metrics. Finally, based on our findings, we propose a set of recommendations for automatic evaluation of multi-operation simplifications, suggesting which metrics to compute and how to interpret their scores.
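The core of a meta-evaluation like the one described above is correlating a metric's scores with human ratings. A minimal sketch of that step follows; the scores and ratings below are invented for illustration and are not data from the article.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between metric scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence metric outputs and simplicity judgements.
metric_scores = [0.62, 0.41, 0.77, 0.55]
human_ratings = [3.0, 2.0, 4.0, 3.5]
print(pearson(metric_scores, human_ratings))
```

A high correlation on one slice of the data (one simplicity level, one system type) does not transfer to others, which is exactly the variation the article analyses.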

2017
Vol 108 (1)
pp. 85-96
Author(s): Eva Martínez Garcia, Carles Creus, Cristina España-Bonet, Lluís Màrquez

Abstract: We integrate new mechanisms into a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, allows changes to be introduced into the translation in order to improve its lexical consistency. We evaluate our system on English–Spanish document translation, conducting both automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at the sentence level, do not reflect significant variations. By contrast, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT system.
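The embedding-based adequacy idea can be sketched roughly as follows: score a candidate word translation by its cosine similarity to the mean embedding of its context. The toy vectors and vocabulary below are invented for illustration; the authors' actual feature uses trained word embeddings inside a document-level decoder.

```python
from math import sqrt

# Toy 3-dimensional embeddings; a real system would load pretrained
# vectors (assumption made purely for this sketch).
EMB = {
    "bank":   [0.9, 0.1, 0.0],
    "banco":  [0.8, 0.2, 0.1],   # Spanish: financial bank
    "orilla": [0.1, 0.9, 0.2],   # Spanish: river bank
    "money":  [0.7, 0.3, 0.0],
    "loan":   [0.8, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def adequacy(translation, context):
    """Similarity of a word translation to the mean vector of its context."""
    ctx = [EMB[w] for w in context]
    mean = [sum(col) / len(ctx) for col in zip(*ctx)]
    return cosine(EMB[translation], mean)

# "banco" fits a financial context better than "orilla".
print(adequacy("banco", ["money", "loan"]) > adequacy("orilla", ["money", "loan"]))
```

A consistency feature built this way can flag documents where the same source word was translated into divergent forms, which is what the decoder's stochastic mechanism then tries to repair.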


2017
Vol 43 (4)
pp. 683-722
Author(s): Shafiq Joty, Francisco Guzmán, Lluís Màrquez, Preslav Nakov

In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics with regard to correlation with human judgments, both at the segment level and at the system level. This suggests that discourse information is complementary to the information used by many existing evaluation metrics, and thus could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTK-party. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation's RST tree to the reference's RST tree is positively correlated with translation quality.
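A minimal illustration of the all-subtree idea: represent discourse trees as nested tuples and measure how many reference subtrees reappear in the translation's tree. This is only a sketch with hand-built trees; the article uses convolution tree kernels over automatically produced RST parses, not this exact-match count.

```python
def subtrees(t):
    """Collect every subtree of a nested-tuple tree (label, child, ...)."""
    if not isinstance(t, tuple):
        return [t]          # a leaf (elementary discourse unit)
    out = [t]
    for child in t[1:]:
        out.extend(subtrees(child))
    return out

def tree_overlap(reference, hypothesis):
    """Fraction of reference subtrees also present in the hypothesis tree."""
    ref, hyp = subtrees(reference), subtrees(hypothesis)
    return sum(1 for s in ref if s in hyp) / len(ref)

# Toy RST-like trees: relation label, nucleus, satellite.
ref = ("Elaboration", ("Nucleus", "e1"), ("Satellite", "e2"))
hyp = ("Elaboration", ("Nucleus", "e1"), ("Satellite", "e3"))
print(tree_overlap(ref, hyp))
```

The finding that the translation–reference tree similarity correlates with quality is the intuition this count mimics: the more reference structure a translation preserves, the higher the score.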


2014
Vol 165 (2)
pp. 194-222
Author(s): Sowmya Vajjala, Detmar Meurers

Readability assessment can play a role both in evaluating a simplification algorithm and in identifying what to simplify. While some previous research used traditional readability formulas to evaluate text simplification, there is little research into the utility of readability assessment for identifying and analyzing sentence-level targets for text simplification. We explore this aspect by first constructing a readability model that is generalizable across corpora and genres, and later adapting this model to make sentence-level readability judgments. First, we report on experiments establishing that a readability model integrating a broad range of linguistic features works well at the document level, performing on par with the best systems on a standard test corpus. Next, the model is confirmed to be transferable to different text genres. Moving from documents to sentences, we investigate the model's ability to correctly identify the difference in reading level between a sentence and its human-simplified version. We conclude that readability models can be useful for identifying simplification targets for human writers and for evaluating machine-generated simplifications.


Information
2020
Vol 11 (2)
pp. 78
Author(s): Tulu Tilahun Hailu, Junqing Yu, Tessfu Geteye Fantaye

Text summarization is the process of producing a concise version of text (a summary) from one or more information sources. If the generated summary preserves the meaning of the original text, it helps users make fast and effective decisions. However, how much of the source text's meaning is preserved is becoming harder to evaluate. The most commonly used automatic evaluation metrics, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), rely strictly on the overlapping n-gram units between reference and candidate summaries, which makes them unsuitable for measuring the quality of abstractive summaries. Another major challenge in evaluating text summarization systems is the lack of consistent, ideal reference summaries. Studies show that human summarizers can produce variable reference summaries of the same source, which can significantly affect the automatic evaluation scores of summarization systems. Humans are biased by circumstance while producing summaries; even the same person may produce substantially different summaries of the same source at different times. This paper proposes a word-embedding-based automatic text summarization and evaluation framework, which can determine the salient top-n sentences of a source text as a reference summary and evaluate the quality of system summaries against it. Extensive experimental results demonstrate that the proposed framework is effective and outperforms several baseline methods, with regard to both text summarization systems and automatic evaluation metrics, when tested on a publicly available dataset.
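The n-gram-overlap scoring the abstract criticizes can be sketched as follows. This is a simplified ROUGE-n recall, not the official ROUGE implementation (which adds stemming, stopword handling, and F-measure variants).

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """ROUGE-n recall: overlapping n-grams / n-grams in the reference."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())   # clipped multiset intersection
    total = sum(ref.values())
    return overlap / total if total else 0.0

# "lay" is a valid paraphrase of "sat", but surface overlap penalizes it.
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", 1))
```

The example shows the failure mode the abstract points at: an abstractive paraphrase loses credit for every surface token it changes, even when meaning is preserved.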


2012
Vol 98 (1)
pp. 99-108
Author(s): Maja Popović

rgbF: An Open Source Tool for n-gram Based Automatic Evaluation of Machine Translation Output

We describe rgbF, a tool for automatic evaluation of machine translation output based on n-gram precision and recall. The tool calculates the F-score averaged over all n-grams of an arbitrary set of distinct units such as words, morphemes, POS tags, etc.; the arithmetic mean is used for n-gram averaging. As input, the tool requires one or more reference translations and a hypothesis, both containing the same combination of units. The default output is the document-level 4-gram F-score of the desired unit combination. Sentence-level scores can be obtained on demand, as can precision and/or recall scores, separate unit scores, and separate n-gram scores. In addition, weights can be introduced both for n-grams and for units, and the desired n-gram order n can be set.
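The default scoring described above, per-order n-gram F-scores combined by arithmetic mean, can be sketched as follows. This is a simplified word-level reading of the description, not the tool's actual code, and it omits the multi-reference, unit-weight, and n-gram-weight options.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rgbf(reference, hypothesis, max_n=4):
    """Arithmetic mean of n-gram F-scores for orders 1..max_n (words only)."""
    ref, hyp = reference.split(), hypothesis.split()
    f_scores = []
    for n in range(1, max_n + 1):
        r, h = ngram_counts(ref, n), ngram_counts(hyp, n)
        overlap = sum((r & h).values())           # clipped matches
        prec = overlap / max(sum(h.values()), 1)  # n-gram precision
        rec = overlap / max(sum(r.values()), 1)   # n-gram recall
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)

print(rgbf("the cat sat on the mat", "the cat sat on a mat"))
```

Swapping `split()` for a morpheme or POS segmenter would give the other unit combinations the abstract mentions; the averaging scheme stays the same.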


2021
Author(s): Devjeet Roy, Sarah Fakhoury, Venera Arnaoudova

2019
Vol 2 (2)
pp. 83-87
Author(s): Sri Devi Wulandari

The purpose of this study is to describe the profile of mathematical representation of students with moderate prior mathematical ability when solving mathematical problems on three-dimensional material using Screencast-O-Matic media. The research approach is qualitative, specifically descriptive qualitative research: analyzing existing data to obtain information about students' mathematical representation profiles when solving problems on three-dimensional material with Screencast-O-Matic media. The research phases are introduction, planning, implementation, and completion. The study uses test questions, interviews, and observation sheets as instruments. The results show that the subject was able to meet the three predetermined indicators of mathematical representation ability, but the final answers were still imprecise because the subject misunderstood the problem.


Author(s): Troy L. Holcolmbe, Carla J. Moore

In the previous chapters, the various techniques for delimiting the continental shelf have been outlined. However, many continental shelf claims will be developed largely on the basis of existing information. Therefore, a coastal state should begin its article 76 implementation by assembling and reviewing all available information that is relevant for determining the outer limit of the continental shelf, and for assessing the resource potential beyond 200 nautical miles (M). Data compilation activities tend to be labor-intensive, and the amount of time needed for their successful execution depends to a large extent upon the quantity and condition of the data sets, the skill and experience of the compilation staff, and the data-handling facilities at their disposal. However, it is reasonably safe to assume that almost any compilation of existing data will be less expensive than mobilizing and executing a field program for collecting new data, so it is usually more cost-effective to begin with a compilation. Even if the data compilation operation serves primarily to demonstrate the inadequacy of existing data, it will serve a useful purpose by identifying specifically where and what kind of new information is needed. To satisfy the requirements of article 76, and to provide a foundation for an understanding of the resources within the continental shelf, we are concerned primarily with data in the fields of hydrography, geodesy, geology, geophysics, and geochemistry and their subdisciplines. Such data are usually characterized by their spatial variations, in two or three dimensions, which are of a far greater magnitude than any temporal changes, as for example in the case of gravity anomaly data. However, the temporal variation of some geoscience parameters is becoming increasingly important as an indicator of environmental change.
Because of the importance of their spatial changes with respect to the delineation of the continental shelf, the traditional form of presentation of geoscience data has been as maps. Whereas maps provide an excellent visualization of the data field, they may not be sufficient to carry out the analysis needed to satisfy article 76, and increasingly, digital data, profiles, and other data forms are becoming necessary.

