HANNA Benchmark
HANNA is a large dataset of Human-ANnotated NArratives for Automatic Story Generation (ASG) evaluation.
It was introduced in our paper “Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation”, accepted at COLING 2022, and further extended in our paper “Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation”, published in TACL.
HANNA contains annotations for 1,056 stories generated from 96 prompts of the WritingPrompts dataset. Each story was annotated by 3 human raters on 6 criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity), for a total of 19,008 human annotations. The dataset also includes scores for those 1,056 stories from 72 automatic metrics and from 4 Large Language Models (Beluga-13B, Llama-13B, Mistral-7B, ChatGPT).
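For orientation, here is a minimal sketch of how those counts fit together and how the annotations might be loaded; the file name `hanna_stories_annotations.csv` and the per-criterion column layout are assumptions for illustration, not guaranteed by the repository.

```python
# Sketch only: the file name and schema below are hypothetical;
# check the repository for the actual data layout.
import pandas as pd

N_STORIES = 1056   # stories generated from 96 WritingPrompts prompts
N_RATERS = 3       # human raters per story
CRITERIA = ["Relevance", "Coherence", "Empathy",
            "Surprise", "Engagement", "Complexity"]

# 1,056 stories x 3 raters x 6 criteria = 19,008 human annotations
assert N_STORIES * N_RATERS * len(CRITERIA) == 19_008

# Hypothetical loading step: one row per (story, rater) pair,
# with one column per criterion score.
df = pd.read_csv("hanna_stories_annotations.csv")
print(df.shape)
```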
The GitHub repository is accessible here.