Background:
Automatic Machine Translation (AMT) evaluation metrics have become popular
in the Machine Translation community in recent times, owing to the growing popularity of Machine
Translation engines and of Machine Translation as a field. Translation is a vital tool for breaking barriers
between communities, especially in countries such as India, where people speak 22 different languages and their
many variants. With the advent of Machine Translation engines, there is a need for systems that evaluate
how well these engines perform. This is where Machine Translation evaluation comes in.
Objective:
This paper discusses the importance of Automatic Machine Translation Evaluation and compares
various Machine Translation Evaluation metrics by performing a statistical analysis of the metric scores and
human evaluations to determine which metric correlates most strongly with human scores.
Methods:
The correlations between the automatic and human evaluation scores, as well as the correlations
among the five automatic evaluation scores, are examined at the sentence level. Moreover, a hypothesis is set
up and p-values are computed to determine how significant these correlations are.
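The sentence-level analysis described above can be sketched as follows. This is a minimal, illustrative example only: the paper does not specify which correlation coefficient is used, so Pearson's r is assumed here, and the score lists below are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of a sentence-level correlation analysis, assuming
# Pearson's r. The scores below are hypothetical, not the study's data.
from scipy.stats import pearsonr

# Hypothetical sentence-level scores from one automatic metric and the
# corresponding human judgments for the same sentences.
metric_scores = [0.41, 0.55, 0.62, 0.30, 0.78, 0.50, 0.66, 0.25]
human_scores = [2.0, 3.0, 3.5, 1.5, 4.5, 2.5, 4.0, 1.0]

# pearsonr returns the correlation coefficient r and the two-sided
# p-value for the null hypothesis that the true correlation is zero.
r, p_value = pearsonr(metric_scores, human_scores)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```

A small p-value (for example, below 0.05) would lead to rejecting the null hypothesis of no correlation, which is the kind of significance test the Methods describe.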
Results:
The results of the statistical analysis of the metric scores and the human scores are presented as
graphs showing the trend of the correlation between the scores of the Automatic Machine Translation
Evaluation metrics and the human scores.
Conclusion:
Of the five metrics considered in this study, METEOR shows the highest correlation with
human scores.