Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
OECD AI Principles
Published:
In a world where artificial intelligence is being used for an ever wider variety of tasks, it becomes ever more necessary to steer AI development in a responsible direction. To that end, the OECD AI Principles, adopted in 2019 and updated in 2024, aim to promote innovative and trustworthy AI through a unified set of guidelines.
Portfolio
HANNA Benchmark
HANNA, a large annotated dataset of Human-ANnotated NArratives for Automatic Story Generation (ASG) evaluation.
Publications
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Published in COLING, 2022
Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.
Recommended citation: Cyril Chhun, Pierre Colombo, Fabian M. Suchanek, and Chloé Clavel. 2022. Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5794–5836, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Download Paper
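The benchmark's core analysis correlates automatic metric scores with human ratings. As a rough, self-contained illustration only (not code or data from the paper), the snippet below computes Kendall and Spearman rank correlations between one hypothetical metric's scores and human ratings for a handful of stories, using SciPy; all values are invented.

# Toy sketch of metric-vs-human correlation analysis; scores are made up,
# not taken from HANNA or the paper.
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-story scores from one automatic metric
metric_scores = [0.12, 0.34, 0.27, 0.50, 0.08, 0.41]
# Hypothetical human ratings on one criterion (1-5 scale)
human_ratings = [2, 4, 3, 5, 1, 4]

tau, tau_p = kendalltau(metric_scores, human_ratings)
rho, rho_p = spearmanr(metric_scores, human_ratings)

print(f"Kendall tau:  {tau:.3f} (p={tau_p:.3f})")
print(f"Spearman rho: {rho:.3f} (p={rho_p:.3f})")

In practice this comparison would be repeated for each metric and each human criterion across the whole annotated dataset, which is what allows metrics to be ranked against one another.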
Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation
Published in TACL, 2024
Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning, and deep understanding. Meanwhile, Large Language Models (LLMs) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.
Recommended citation: Cyril Chhun, Fabian M. Suchanek, Chloé Clavel; Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation. Transactions of the Association for Computational Linguistics 2024; 12 1122–1142. doi: https://doi.org/10.1162/tacl_a_00689
Download Paper
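The study above prompts LLMs to act as story raters. As an illustrative sketch only, assuming a generic text-in/text-out model interface (the query_llm argument below is a hypothetical placeholder, and the prompt wording is not the paper's), this shows one simple way to request a 1-5 rating on a criterion such as coherence and parse the numeric reply.

# Hedged sketch of LLM-based story rating; the prompt and the query_llm
# callable are illustrative placeholders, not the paper's setup.
import re

def build_rating_prompt(story: str, criterion: str) -> str:
    """Assemble a simple rating prompt for one evaluation criterion."""
    return (
        f"On a scale of 1 to 5, rate the following story for {criterion}.\n"
        f"Answer with a single number.\n\nStory:\n{story}\n\nRating:"
    )

def parse_rating(response: str) -> int | None:
    """Extract the first digit between 1 and 5 from the reply, if any."""
    match = re.search(r"[1-5]", response)
    return int(match.group()) if match else None

def rate_story(story: str, criterion: str, query_llm) -> int | None:
    """Query the (hypothetical) LLM interface and return a parsed rating."""
    return parse_rating(query_llm(build_rating_prompt(story, criterion)))

# Usage with a stubbed-out model call standing in for a real API:
fake_llm = lambda prompt: "4"
print(rate_story("Once upon a time...", "coherence", fake_llm))

Ratings collected this way can then be correlated with human annotations in the same manner as any other automatic measure.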
Talks
Teaching
IA717
Graduate course, Télécom Paris, AI track, 2022
Taught in the 2021-2022, 2022-2023, and 2023-2024 academic years.