
Markus Freitag
Authored Publications
Mitigating Metric Bias in Minimum Bayes Risk Decoding
Geza Kovacs
Proceedings of the Ninth Conference on Machine Translation (2024), pp. 1063-1094
Abstract
Minimum Bayes risk (MBR) decoding has been shown to improve translation quality as measured by both automated metrics and human evaluations. In this paper we show that MBR decoding tends to yield larger improvements on the utility metric and closely related metrics than on unrelated metrics. To mitigate this metric bias, we explore MBR decoding with an ensemble of multiple metrics as the utility function, as well as QE filtering followed by MBR decoding. Human evaluations show that using an ensemble of metrics improves quality over MBR or QE decoding with a single metric.
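To make the MBR-with-ensemble idea concrete, here is a minimal sketch of MBR decoding with an ensemble utility. The metric callables, the equal weighting, and the use of the candidate list itself as pseudo-references are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of MBR decoding with an ensemble of utility metrics.
# The metric functions and equal weights are assumptions for illustration.
from typing import Callable, Optional, Sequence

Metric = Callable[[str, str], float]  # metric(hypothesis, pseudo_reference) -> score

def mbr_decode(candidates: Sequence[str],
               metrics: Sequence[Metric],
               weights: Optional[Sequence[float]] = None) -> str:
    """Return the candidate with the highest expected ensemble utility,
    using the candidate list itself as the pseudo-reference set."""
    if weights is None:
        weights = [1.0 / len(metrics)] * len(metrics)
    best, best_score = candidates[0], float("-inf")
    for hyp in candidates:
        utility = 0.0
        for ref in candidates:
            if ref is hyp:            # skip self-comparison
                continue
            utility += sum(w * m(hyp, ref) for w, m in zip(weights, metrics))
        utility /= max(len(candidates) - 1, 1)
        if utility > best_score:
            best, best_score = hyp, utility
    return best
```

QE filtering, the other option explored in the paper, would drop low-scoring candidates under a reference-free metric before running a loop like the one above.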
Abstract
Collecting high-quality translations is crucial for the development and evaluation of machine translation systems. However, traditional human-only approaches are costly and slow. This study presents a comprehensive investigation of 11 approaches for acquiring translation data, spanning human-only, machine-only, and hybrid methods. Our findings demonstrate that human-machine collaboration can match or even exceed the quality of human-only translations, while being more cost-efficient. Error analysis reveals the complementary strengths of human and machine contributions, highlighting the effectiveness of collaborative methods. Cost analysis further demonstrates the economic benefits of human-machine collaboration, with some approaches achieving top-tier quality at around 60% of the cost of traditional methods. We release a publicly available dataset containing nearly 18,000 segments of varying translation quality with corresponding human ratings to facilitate future research.
Abstract
Reliable human evaluation is critical to the development of successful natural language generation models, but achieving it is notoriously difficult. Stability is a crucial requirement when ranking systems by quality: consistent ranking of systems across repeated evaluations is not just desirable, but essential. Without it, there is no reliable foundation for hill-climbing or product launch decisions. In this paper, we use machine translation and its state-of-the-art human evaluation framework, MQM, as a case study to understand how to set up reliable human evaluations that yield stable conclusions. We investigate the optimal configurations for item allocation to raters, number of ratings per item, and score normalization. Our study on two language pairs provides concrete recommendations for designing replicable human evaluation studies. We also collect and release the largest publicly available dataset of multi-segment translations rated by multiple professional translators, consisting of nearly 140,000 segment annotations across two language pairs.
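The abstract above lists score normalization among the configurations studied. Purely as an illustration, one common choice is per-rater z-normalization before averaging by system; this is an assumption for the sketch below and may not be the configuration the paper recommends.

```python
# Hypothetical per-rater z-score normalization before system ranking.
# One common normalization choice, shown for illustration only.
from collections import defaultdict
from statistics import mean, pstdev

def normalize_per_rater(ratings):
    """ratings: iterable of (rater_id, system_id, score) tuples.
    Returns {system_id: mean z-normalized score}."""
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)
    # Per-rater mean and std; guard against zero variance.
    stats = {r: (mean(s), pstdev(s) or 1.0) for r, s in by_rater.items()}
    by_system = defaultdict(list)
    for rater, system, score in ratings:
        mu, sigma = stats[rater]
        by_system[system].append((score - mu) / sigma)
    return {system: mean(scores) for system, scores in by_system.items()}
```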
Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
Nitika Mathur
Chi-kiu Lo
Eleftherios Avramidis
Ricardo Rei
Brian Thompson
Frédéric Blain
Tom Kocmi
Jiayi Wang
David Adelani
Marianna Buchicchio
Chrysoula Zerva
Alon Lavie
2024
Findings of the WMT24 General Machine Translation Shared Task: The LLM Era is Here but MT is Not Solved Yet
Tom Kocmi
Eleftherios Avramidis
Rachel Bawden
Ondrej Bojar
Anton Dvorkovich
Christian Federmann
Mark Fishel
Thamme Gowda
Roman Grundkiewicz
Barry Haddow
Marzena Karpinska
Philipp Koehn
Benjamin Marie
Christof Monz
Kenton Murray
Masaaki Nagata
Martin Popel
Maja Popovic
Mariya Shmatova
Steinþór Steingrímsson
Vilém Zouhar
2024
Abstract
This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets covering three to five different domains. In addition to the participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluated system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).
Abstract
Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT’23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT’23 training dataset. We also find that performing self-distillation by finetuning the LLM which generated this dataset outperforms the LLM’s strong few-shot baseline. These findings corroborate the quality of our dataset, and demonstrate the value of high-quality machine-generated data in improving the performance of NMT models.
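As a rough sketch of the QE-reranking step mentioned above, the snippet below keeps the candidate that a reference-free quality-estimation model scores highest. The `qe_score` callable is a placeholder assumption, not the model used to build the released dataset.

```python
# Sketch of QE reranking: keep the candidate that a reference-free
# quality-estimation model scores highest. `qe_score` is a placeholder.
from typing import Callable, Sequence

def qe_rerank(source: str,
              candidates: Sequence[str],
              qe_score: Callable[[str, str], float]) -> str:
    """Return the candidate with the best QE score for this source segment."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))
```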
Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration
George Foster
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 12914-12929
Abstract
Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose instead to meta-evaluate metrics with a version of pairwise accuracy that gives metrics credit for correctly predicting ties, in combination with a tie calibration procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer ranking-based assessments of metric performance.
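The sketch below illustrates the idea of pairwise accuracy with tie credit, plus a simple tie-calibration search over an epsilon threshold. The grid construction and exact formulation are assumptions for illustration, not the paper's reference implementation.

```python
# Sketch of pairwise accuracy with tie credit and tie calibration.
# The epsilon grid and exact definitions are illustrative assumptions.
from itertools import combinations

def pairwise_accuracy(human, metric, eps=0.0):
    """Fraction of item pairs on which the metric's decision
    (better / worse / tie within eps) matches the human decision."""
    correct = total = 0
    for i, j in combinations(range(len(human)), 2):
        h = (human[i] > human[j]) - (human[i] < human[j])   # -1, 0 or +1
        d = metric[i] - metric[j]
        m = 0 if abs(d) <= eps else (1 if d > 0 else -1)
        correct += (h == m)
        total += 1
    return correct / total if total else 0.0

def calibrate_ties(human, metric):
    """Pick the tie threshold eps that maximizes pairwise accuracy."""
    grid = sorted({abs(metric[i] - metric[j])
                   for i, j in combinations(range(len(metric)), 2)}) or [0.0]
    return max(grid, key=lambda eps: pairwise_accuracy(human, metric, eps))
```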
Prompting PaLM for Translation: Assessing Strategies and Performance
Jiaming Luo
Viresh Ratnakar
George Foster
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 15406-15427
Abstract
Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the Pathways Language Model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM’s MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude with an analysis of PaLM’s MT output which reveals some interesting properties and prospects for future work.
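To illustrate the few-shot prompting setup studied here, a hypothetical prompt builder is sketched below. The template wording, language pair, and example-selection policy are assumptions, not the exact prompts used in the paper.

```python
# Hypothetical few-shot prompt builder for LLM translation.
# Template wording and defaults are assumptions for illustration only.
def build_translation_prompt(examples, source,
                             src_lang="English", tgt_lang="German"):
    """examples: list of (source, translation) pairs, ideally chosen for quality."""
    lines = []
    for src, tgt in examples:
        lines += [f"{src_lang}: {src}", f"{tgt_lang}: {tgt}", ""]
    # Append the segment to translate, leaving the target side open.
    lines += [f"{src_lang}: {source}", f"{tgt_lang}:"]
    return "\n".join(lines)
```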
INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
Wenda Xu
Danqing Wang
Liangming Pan
Zhenqiao Song
William Wang
Lei Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 5967-5994
Abstract
Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics provide no explicit explanation of their verdicts, nor do they associate their scores with defects in the generated text. To address this limitation, we present INSTRUCTSCORE, a fine-grained explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for the generated text and a human-readable diagnostic report. We evaluate INSTRUCTSCORE on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, even without direct supervision from human-rated data, INSTRUCTSCORE achieves performance on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
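As a purely illustrative sketch of what an explainable metric's output might look like, the dataclasses below pair a numeric score with a diagnostic report of defects. The type and field names are hypothetical and do not reflect the released INSTRUCTSCORE interface.

```python
# Hypothetical output structure for an explainable evaluation metric:
# a numeric score plus a human-readable diagnostic report of defects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Diagnostic:
    error_type: str      # e.g. a mistranslation or omission label
    severity: str        # e.g. "major" or "minor"
    explanation: str     # human-readable justification

@dataclass
class ExplainableScore:
    score: float                                   # overall quality score
    report: List[Diagnostic] = field(default_factory=list)
```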
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
Patrick Fernandes
Mara Finkelstein
André Martins
Graham Neubig
Ankush Garg
Conference on Machine Translation (2023)
Abstract
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on direct estimation of quality scores, the resulting metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
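To make the error-annotation idea concrete, the sketch below derives an MQM-style score from error spans that an LLM has identified and categorized. The severity weights and the error-record format are assumptions for illustration and are not taken from the paper.

```python
# Sketch of an AutoMQM-style scoring step: turn LLM-identified error spans
# into an MQM-style penalty score. Weights are assumed, not from the paper;
# MQM weighting schemes vary.
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}

def mqm_style_score(errors):
    """errors: list of dicts like
    {"span": "...", "category": "accuracy/mistranslation", "severity": "major"}.
    Returns a penalty-based score (closer to 0 is better)."""
    return -sum(SEVERITY_WEIGHTS.get(e["severity"], 0.0) for e in errors)
```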