IndoNLI: a Natural Language Inference Dataset for Indonesian

Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, Clara Vania

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experimental results show that XLM-R outperforms other pre-trained models on our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research.
Original language: English
Title of host publication: IndoNLI
Subtitle of host publication: A Natural Language Inference Dataset for Indonesian
Publisher: Association for Computational Linguistics
Pages: 10511–10527
Number of pages: 17
ISBN (Print): 9781955917094
Publication status: Published - 7 Nov 2021
