The WebNLG Challenge

Results and Data

The challenge is over. The results and the data are available for download online.

Generating Text from RDF Data

The WebNLG challenge consists of mapping data to text. The training data consists of data/text pairs where the data is a set of triples extracted from DBPedia and the text is a verbalisation of these triples. For instance, given the three DBPedia triples shown in (a), the aim is to generate a text such as (b).

a. (John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio) (John_E_Blaha occupation Fighter_pilot)
b. John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot

As the example illustrates, the task involves specific NLG subtasks such as sentence segmentation (how to chunk the input data into sentences), lexicalisation (of the DBPedia properties), aggregation (how to avoid repetitions) and surface realisation (how to build a syntactically correct and natural-sounding text).
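
As a concrete (and deliberately naive) illustration of these subtasks, the Python sketch below verbalises the three triples in (a) with hand-written rules. It is not the challenge baseline; the lexicalisation table and helper functions are assumptions made for this example only.

# A deliberately naive verbaliser for the example triples in (a).
# The property templates and helper names are illustrative assumptions,
# not part of the official WebNLG data or baseline.
triples = [
    ("John_E_Blaha", "birthDate", "1942_08_26"),
    ("John_E_Blaha", "birthPlace", "San_Antonio"),
    ("John_E_Blaha", "occupation", "Fighter_pilot"),
]

# Lexicalisation: map each DBPedia property to a phrase template.
LEX = {
    "birthDate": "born on {}",
    "birthPlace": "born in {}",
    "occupation": "worked as a {}",
}

def clean(label):
    """Turn an underscored DBPedia label into a readable string."""
    return label.replace("_", " ")

def verbalise(triples):
    # Sentence segmentation: here, all triples go into a single sentence.
    subject = clean(triples[0][0])
    # Aggregation: mention the shared subject once and list the predicates.
    phrases = [LEX[prop].format(clean(obj)) for _, prop, obj in triples]
    # Surface realisation: glue the pieces together with punctuation.
    return subject + ", " + ", ".join(phrases) + "."

print(verbalise(triples))
# -> John E Blaha, born on 1942 08 26, born in San Antonio, worked as a fighter pilot.
# A real microplanner would go further, e.g. merging the two "born ..." phrases
# into "born in San Antonio on 1942-08-26" as in example (b).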

Motivations

The WebNLG data was created to promote the development (i) of RDF verbalisers and (ii) of microplanners able to handle a wide range of linguistic constructions.

KB Verbalisers. The RDF language in which DBPedia is encoded is widely used within the Linked Data framework. Many large-scale datasets are encoded in this language (e.g., MusicBrainz, FOAF, LinkedGeoData) and official institutions increasingly publish their data in this format. Being able to generate good-quality text from RDF data would, for instance, make this data more accessible to lay users, allow existing text to be enriched with information drawn from knowledge bases such as DBPedia, and support describing, comparing and relating the entities present in these knowledge bases.

Microplanning. While many recent datasets for generation take as input dialogue-act meaning representations which can be viewed as trees of depth one, the WebNLG data was carefully constructed to allow for input trees of various shapes and depths, and thereby for greater syntactic diversity in the corresponding texts (cf. Gardent et al., 2017). We hope that the WebNLG challenge will drive the deep learning community to take up this new task and develop neural generators that can handle the generation of linguistically rich texts.
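
To make the contrast concrete, here is a small sketch (with invented inputs, not taken from the dataset) showing a depth-one "star" of triples about a single subject versus a deeper "chain" where the object of one triple is the subject of the next:

# Illustrative only: two invented inputs showing the input shapes mentioned above.

# Depth-one "star": every triple hangs off the same subject.
star = [
    ("Alan_Bean", "birthPlace", "Wheeler_Texas"),
    ("Alan_Bean", "occupation", "Test_pilot"),
]

# Deeper "chain": the object of one triple is the subject of the next.
chain = [
    ("Alan_Bean", "mission", "Apollo_12"),
    ("Apollo_12", "operator", "NASA"),
]

def depth(triples, root):
    """Length of the longest subject-to-object path starting at root."""
    children = [o for s, _, o in triples if s == root]
    if not children:
        return 0
    return 1 + max(depth(triples, c) for c in children)

print(depth(star, "Alan_Bean"))   # 1
print(depth(chain, "Alan_Bean"))  # 2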

Data

The WebNLG dataset consists of 21,855 data/text pairs with a total of 8,372 distinct data inputs. The inputs describe entities belonging to nine distinct DBpedia categories, namely Astronaut, University, Monument, Building, ComicsCharacter, Food, Airport, SportsTeam and WrittenWork. The WebNLG data is licensed under the CC Attribution-NonCommercial-ShareAlike 4.0 International license. For a more detailed description of the dataset, click here.

To download the data, go to the results and data page.
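
The corpus is distributed as XML files; the snippet below is a minimal reading sketch. The file name and the element and attribute names (entry, category, modifiedtripleset, mtriple, lex) are assumptions about the release format, so check the downloaded files for the exact schema.

# Minimal reading sketch for a downloaded WebNLG XML file.
# File name and element/attribute names are assumptions; check the release.
import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse("webnlg_train.xml")   # hypothetical file name
pairs_per_category = Counter()
pairs = []

for entry in tree.getroot().iter("entry"):
    category = entry.get("category")
    triples = [t.text for t in entry.iter("mtriple")]
    texts = [lex.text for lex in entry.iter("lex")]
    pairs_per_category[category] += len(texts)
    pairs.extend((triples, text) for text in texts)

print(pairs_per_category)      # data/text pairs per DBpedia category
print(len(pairs), "pairs in total")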

Evaluation

Evaluation of the generated texts will be carried out both with automatic metrics (BLEU, TER and/or METEOR) and with human judgements obtained through crowdsourcing. The human evaluation will assess criteria such as fluency, grammaticality and appropriateness (does the text correctly verbalise the input data?).
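
For comparable automatic scores, the official challenge evaluation scripts should be used; the snippet below merely illustrates how a BLEU score for a single generated text can be computed with NLTK (whitespace tokenisation and the smoothing choice are assumptions, not the challenge's official settings).

# Illustrative BLEU computation with NLTK; not the official challenge evaluation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot".split()
candidate = "John E Blaha was born in San Antonio and worked as a fighter pilot".split()

# Whitespace tokenisation and smoothing are assumptions made for this sketch.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")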

References

Creating Training Corpora for Micro-Planners. Claire Gardent, Anastasia Shimorina, Shashi Narayan and Laura Perez-Beltrachini. Proceedings of ACL 2017. PDF
Building RDF Content for Data-to-Text Generation. Laura Perez-Beltrachini, Rania Sayed and Claire Gardent. Proceedings of COLING 2016, Osaka, Japan. PDF
The WebNLG Challenge: Generating Text from DBPedia Data. Emilie Colin, Claire Gardent, Yassine Mrabet, Shashi Narayan and Laura Perez-Beltrachini. Proceedings of INLG 2016. PDF

To cite the dataset and/or challenge, use:

@InProceedings{gardent2017creating,
author = {Gardent, Claire and Shimorina, Anastasia and Narayan, Shashi and Perez-Beltrachini, Laura},
title = {Creating Training Corpora for Micro-Planners},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {August},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics}
}

Registration

Please register using the following form.

Entry Submission

More details are provided here.

Important Dates

  • 14 April 2017: Release of Training and Development Data
  • 30 April 2017: Release of Baseline System
  • 18 August 2017: Release of Test Data
  • 18 - 22 August 2017: Test data submission period
  • 22 August 2017: Entry submission deadline
  • 5 September 2017: Results of automatic evaluation and system presentations (at INLG 2017)
  • October 2017: Results of human evaluation

Organising Committee

  • Claire Gardent, CNRS/LORIA, Nancy, France
  • Anastasia Shimorina, CNRS/LORIA, Nancy, France
  • Shashi Narayan, School of Informatics, University of Edinburgh, UK
  • Laura Perez-Beltrachini, School of Informatics, University of Edinburgh, UK

Contacts

webnlg2017@inria.fr

Acknowledgments

The WebNLG challenge is funded by the WebNLG ANR Project.