The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification

2021
pp. 1-29
Author(s): Fernando Alva-Manchego, Carolina Scarton, Lucia Specia

Abstract: In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgements of the simplicity achieved by executing specific operations (e.g. simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess sentence-level simplifications where multiple operations may have been applied and which, therefore, require more general simplicity judgements. To that end, we first collect a new and more reliable dataset for evaluating the correlation of metrics with human judgements of overall simplicity. Second, we conduct the first meta-evaluation of automatic metrics in Text Simplification, using our new dataset (and other existing data) to analyse how the correlation between metrics' scores and human judgements varies across three dimensions: the perceived simplicity level, the system type, and the set of references used for computation. We show that all three aspects affect the correlations and, in particular, highlight the limitations of commonly used operation-specific metrics. Finally, based on our findings, we propose a set of recommendations for automatic evaluation of multi-operation simplifications, suggesting which metrics to compute and how to interpret their scores.
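The core of a meta-evaluation like the one described above is correlating a metric's scores with human ratings. A minimal sketch of that step follows; the scores and ratings below are invented for illustration and are not data from the article.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between metric scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence metric outputs and simplicity judgements.
metric_scores = [0.62, 0.41, 0.77, 0.55]
human_ratings = [3.0, 2.0, 4.0, 3.5]
print(pearson(metric_scores, human_ratings))
```

A high correlation on one slice of the data (one simplicity level, one system type) does not transfer to others, which is exactly the variation the article analyses.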

2017
Vol 108 (1)
pp. 85-96
Author(s): Eva Martínez Garcia, Carles Creus, Cristina España-Bonet, Lluís Màrquez

Abstract: We integrate new mechanisms into a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, allows changes to be introduced into the translation in order to improve its lexical consistency. We evaluate our system on English–Spanish document translation, conducting both automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at the sentence level, do not reflect significant variations. By contrast, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT system.
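The embedding-based adequacy idea can be sketched roughly as follows: score a candidate word translation by its cosine similarity to the mean embedding of its context. The toy vectors and vocabulary below are invented for illustration; the authors' actual feature uses trained word embeddings inside a document-level decoder.

```python
from math import sqrt

# Toy 3-dimensional embeddings; a real system would load pretrained
# vectors (assumption made purely for this sketch).
EMB = {
    "bank":   [0.9, 0.1, 0.0],
    "banco":  [0.8, 0.2, 0.1],   # Spanish: financial bank
    "orilla": [0.1, 0.9, 0.2],   # Spanish: river bank
    "money":  [0.7, 0.3, 0.0],
    "loan":   [0.8, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def adequacy(translation, context):
    """Similarity of a word translation to the mean vector of its context."""
    ctx = [EMB[w] for w in context]
    mean = [sum(col) / len(ctx) for col in zip(*ctx)]
    return cosine(EMB[translation], mean)

# "banco" fits a financial context better than "orilla".
print(adequacy("banco", ["money", "loan"]) > adequacy("orilla", ["money", "loan"]))
```

A consistency feature built this way can flag documents where the same source word was translated into divergent forms, which is what the decoder's stochastic mechanism then tries to repair.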


2017
Vol 43 (4)
pp. 683-722
Author(s): Shafiq Joty, Francisco Guzmán, Lluís Màrquez, Preslav Nakov

In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics with regard to correlation with human judgments, both at the segment level and at the system level. This suggests that discourse information is complementary to the information used by many existing evaluation metrics, and thus could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTK-party. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation's RST tree to the reference's RST tree is positively correlated with translation quality.
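A minimal illustration of the all-subtree idea: represent discourse trees as nested tuples and measure how many reference subtrees reappear in the translation's tree. This is only a sketch with hand-built trees; the article uses convolution tree kernels over automatically produced RST parses, not this exact-match count.

```python
def subtrees(t):
    """Collect every subtree of a nested-tuple tree (label, child, ...)."""
    if not isinstance(t, tuple):
        return [t]          # a leaf (elementary discourse unit)
    out = [t]
    for child in t[1:]:
        out.extend(subtrees(child))
    return out

def tree_overlap(reference, hypothesis):
    """Fraction of reference subtrees also present in the hypothesis tree."""
    ref, hyp = subtrees(reference), subtrees(hypothesis)
    return sum(1 for s in ref if s in hyp) / len(ref)

# Toy RST-like trees: relation label, nucleus, satellite.
ref = ("Elaboration", ("Nucleus", "e1"), ("Satellite", "e2"))
hyp = ("Elaboration", ("Nucleus", "e1"), ("Satellite", "e3"))
print(tree_overlap(ref, hyp))
```

The finding that the translation–reference tree similarity correlates with quality is the intuition this count mimics: the more reference structure a translation preserves, the higher the score.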


2014
Vol 165 (2)
pp. 194-222
Author(s): Sowmya Vajjala, Detmar Meurers

Readability assessment can play a role both in evaluating a simplification algorithm and in identifying what to simplify. While some previous research used traditional readability formulas to evaluate text simplification, there is little research into the utility of readability assessment for identifying and analyzing sentence-level targets for text simplification. We explore this aspect by first constructing a readability model that is generalizable across corpora and genres, and later adapting this model to make sentence-level readability judgments. First, we report on experiments establishing that a readability model integrating a broad range of linguistic features works well at the document level, performing on par with the best systems on a standard test corpus. Next, the model is confirmed to be transferable to different text genres. Moving from documents to sentences, we investigate the model's ability to correctly identify the difference in reading level between a sentence and its human-simplified version. We conclude that readability models can be useful for identifying simplification targets for human writers and for evaluating machine-generated simplifications.


Information
2020
Vol 11 (2)
pp. 78
Author(s): Tulu Tilahun Hailu, Junqing Yu, Tessfu Geteye Fantaye

Text summarization is the process of producing a concise version of text (a summary) from one or more information sources. If the generated summary preserves the meaning of the original text, it helps users make fast and effective decisions. However, how much of the source text's meaning is preserved is becoming harder to evaluate. The most commonly used automatic evaluation metrics, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), rely strictly on the overlapping n-gram units between reference and candidate summaries, which makes them unsuitable for measuring the quality of abstractive summaries. Another major challenge in evaluating text summarization systems is the lack of consistent, ideal reference summaries. Studies show that human summarizers can produce variable reference summaries of the same source, which can significantly affect the automatic evaluation scores of summarization systems. Humans are biased by circumstance while producing summaries; even the same person may produce substantially different summaries of the same source at different times. This paper proposes a word-embedding-based automatic text summarization and evaluation framework, which can determine the salient top-n sentences of a source text as a reference summary and evaluate the quality of system summaries against it. Extensive experimental results demonstrate that the proposed framework is effective and outperforms several baseline methods, with regard to both text summarization systems and automatic evaluation metrics, when tested on a publicly available dataset.
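The n-gram-overlap scoring the abstract criticizes can be sketched as follows. This is a simplified ROUGE-n recall, not the official ROUGE implementation (which adds stemming, stopword handling, and F-measure variants).

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """ROUGE-n recall: overlapping n-grams / n-grams in the reference."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())   # clipped multiset intersection
    total = sum(ref.values())
    return overlap / total if total else 0.0

# "lay" is a valid paraphrase of "sat", but surface overlap penalizes it.
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", 1))
```

The example shows the failure mode the abstract points at: an abstractive paraphrase loses credit for every surface token it changes, even when meaning is preserved.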


2012
Vol 98 (1)
pp. 99-108
Author(s): Maja Popović

rgbF: An Open Source Tool for n-gram Based Automatic Evaluation of Machine Translation Output

We describe rgbF, a tool for automatic evaluation of machine translation output based on n-gram precision and recall. The tool calculates the F-score averaged over all n-grams of an arbitrary set of distinct units such as words, morphemes, POS tags, etc.; the arithmetic mean is used for n-gram averaging. As input, the tool requires one or more reference translations and a hypothesis, both containing the same combination of units. The default output is the document-level 4-gram F-score of the desired unit combination. Sentence-level scores can be obtained on demand, as can precision and/or recall scores, separate unit scores, and separate n-gram scores. In addition, weights can be introduced both for n-grams and for units, and the desired n-gram order n can be set.
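The default scoring described above, per-order n-gram F-scores combined by arithmetic mean, can be sketched as follows. This is a simplified word-level reading of the description, not the tool's actual code, and it omits the multi-reference, unit-weight, and n-gram-weight options.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rgbf(reference, hypothesis, max_n=4):
    """Arithmetic mean of n-gram F-scores for orders 1..max_n (words only)."""
    ref, hyp = reference.split(), hypothesis.split()
    f_scores = []
    for n in range(1, max_n + 1):
        r, h = ngram_counts(ref, n), ngram_counts(hyp, n)
        overlap = sum((r & h).values())           # clipped matches
        prec = overlap / max(sum(h.values()), 1)  # n-gram precision
        rec = overlap / max(sum(r.values()), 1)   # n-gram recall
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)

print(rgbf("the cat sat on the mat", "the cat sat on a mat"))
```

Swapping `split()` for a morpheme or POS segmenter would give the other unit combinations the abstract mentions; the averaging scheme stays the same.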


2021
Author(s): Devjeet Roy, Sarah Fakhoury, Venera Arnaoudova

2019
Vol 2 (2)
pp. 83-87
Author(s): Sri Devi Wulandari

The purpose of this study is to describe the profile of mathematical representation of students with moderate prior mathematical ability when solving mathematical problems on three-dimensional material using Screencast-O-Matic media. The research approach is qualitative, specifically descriptive qualitative research: analyzing existing data to obtain information about students' mathematical representation profiles when solving problems on three-dimensional material with Screencast-O-Matic media. The research phases are introduction, planning, implementation, and completion. The study uses test questions, interviews, and observation sheets as instruments. The results show that the subject was able to meet the three predetermined indicators of mathematical representation ability, but the final answers were still imprecise because the subject misunderstood the problem.


Author(s): Troy L. Holcolmbe, Carla J. Moore

In the previous chapters, the various techniques for delimiting the continental shelf have been outlined. However, many continental shelf claims will be developed largely on the basis of existing information. Therefore, a coastal state should begin its article 76 implementation by assembling and reviewing all available information that is relevant for determining the outer limit of the continental shelf, and for assessing the resource potential beyond 200 nautical miles (M). Data compilation activities tend to be labor-intensive, and the amount of time needed for their successful execution depends to a large extent upon the quantity and condition of the data sets, the skill and experience of the compilation staff, and the data-handling facilities at their disposal. However, it is reasonably safe to assume that almost any compilation of existing data will be less expensive than mobilizing and executing a field program for collecting new data, so it is usually more cost-effective to begin with a compilation. Even if the data compilation operation serves primarily to demonstrate the inadequacy of existing data, it will serve a useful purpose by identifying specifically where and what kind of new information is needed. To satisfy the requirements of article 76, and to provide a foundation for an understanding of the resources within the continental shelf, we are concerned primarily with data in the fields of hydrography, geodesy, geology, geophysics, and geochemistry and their subdisciplines. Such data are usually characterized by their spatial variations, in two or three dimensions, which are of a far greater magnitude than any temporal changes, as for example in the case of gravity anomaly data. However, the temporal variation of some geoscience parameters is becoming increasingly important as an indicator of environmental change.
Because of the importance of their spatial changes with respect to the delineation of the continental shelf, the traditional form of presentation of geoscience data has been as maps. Whereas maps provide an excellent visualization of the data field, they may not be sufficient to carry out the analysis needed to satisfy article 76, and increasingly, digital data, profiles, and other data forms are becoming necessary.

