A Critical Evaluation of Evaluations for Long-Form Question Answering

    May 2023 · arXiv (Cornell University)
    Fangyuan Xu, Yixiao Song, Mohit Iyyer, Eunsol Choi
    TLDR Current automatic metrics for long-form question answering don't align with human preferences; a multi-faceted evaluation approach is needed.
    The study critically examines how long-form question answering (LFQA) is evaluated, recruiting domain experts from seven fields to provide pairwise preference judgments over answer pairs along with written justifications. The expert justifications highlight answer comprehensiveness as a key criterion, and the analysis shows that current automatic metrics fail to predict human preferences, although some correlate with specific answer aspects such as coherence. The authors argue for a multi-faceted evaluation that scores aspects such as factuality and completeness separately rather than collapsing quality into a single overall score. Their annotations and code are publicly available to encourage further research on LFQA evaluation.
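    As a rough illustration of the kind of check the study describes, and not the authors' released code, the sketch below measures how often an automatic metric agrees with human pairwise preferences over answer pairs. The choice of metric (ROUGE-L F1 via the rouge_score package) and the data field names are assumptions made for illustration only.

```python
# Minimal sketch: pairwise agreement between an automatic metric and human preferences.
# Assumes each example has a reference answer, two candidate answers, and a human
# preference label ("a" or "b"); these field names are hypothetical.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def metric_score(reference: str, answer: str) -> float:
    """Score a candidate answer against the reference with ROUGE-L F1."""
    return scorer.score(reference, answer)["rougeL"].fmeasure

def pairwise_agreement(examples: list[dict]) -> float:
    """Fraction of answer pairs where the metric prefers the same answer as the human judge."""
    agree = 0
    for ex in examples:
        score_a = metric_score(ex["reference"], ex["answer_a"])
        score_b = metric_score(ex["reference"], ex["answer_b"])
        metric_prefers = "a" if score_a >= score_b else "b"
        agree += metric_prefers == ex["human_prefers"]
    return agree / len(examples)

if __name__ == "__main__":
    demo = [
        {
            "reference": "Photosynthesis converts light energy into chemical energy in plants.",
            "answer_a": "Plants use sunlight to turn carbon dioxide and water into sugars.",
            "answer_b": "Photosynthesis is when animals digest food.",
            "human_prefers": "a",
        },
    ]
    print(f"Metric-human agreement: {pairwise_agreement(demo):.2f}")
```

    Agreement near 50% on a large set of pairs would indicate the metric is no better than chance at predicting which answer the human judge prefers, which is the kind of mismatch the paper reports for existing metrics.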