A Critical Evaluation of Evaluations for Long-Form Question Answering
January 2023
TLDR Existing automatic metrics can't accurately predict which long-form answers people prefer; evaluations should assess different answer qualities separately.
The study "A Critical Evaluation of Evaluations for Long-form Question Answering" conducted a detailed analysis of the evaluation practices for long-form question answering (LFQA). The researchers hired domain experts in seven areas to provide preference judgments over pairs of answers and to justify their choices. The study found that no existing automatic text generation metrics accurately predict human preference judgments, although some metrics do correlate with specific aspects of answers, such as coherence. The researchers suggest future work should move away from a single overall score and instead adopt a multi-faceted evaluation approach, focusing on aspects like factuality and completeness. The study's annotations and code have been made publicly available to encourage further research into LFQA evaluation.