WebNLG Deliverables

You can try the online demo here.

Quelo Query Corpus

The Quelo Query Corpus consists of 206 pairs of Description Logics (DL) queries and query verbalisations associated with linguistic annotations (grammar trees and syntactic classes). This dataset was created semi-automatically as follows.

First, we created conjunctive queries for 11 ontologies covering different domains, taking care to capture query patterns that vary in length, measured as the number of KB concepts and relations (min: 2, max: 19, avg: 7.44), in query tree shape (max depth: 4, max fanout: 6) and in relation type (i.e. the lexical shape of the relation identifier). Next, we generated query verbalisations from these queries using semi-automatically defined microplans and the symbolic surface realiser described in [1]. This yielded a total of 6841 outputs, which we disambiguated manually, choosing for each input query the output that best verbalises it. The resulting training corpus consists of 206 pairs with 351 sentences in total.
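The three shape measures mentioned above (length, depth and fanout) can be illustrated with a minimal sketch. It assumes a query is represented as a tree of concepts linked by named relations. The `QueryNode` class, the example concepts and relations, and the convention of counting depth in edges are all assumptions made for illustration and are not taken from the corpus itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """Hypothetical tree node: one KB concept plus its outgoing relations."""
    concept: str
    # Each child is reached through a named relation: (relation, child node).
    children: list[tuple[str, QueryNode]] = field(default_factory=list)


def query_length(node: QueryNode) -> int:
    """Count one unit per concept (node) and one per relation (edge)."""
    return 1 + sum(1 + query_length(child) for _, child in node.children)


def depth(node: QueryNode) -> int:
    """Depth in edges (assumed convention): a single concept has depth 0."""
    return max((1 + depth(child) for _, child in node.children), default=0)


def max_fanout(node: QueryNode) -> int:
    """Largest number of relations leaving any single concept."""
    return max([len(node.children)] + [max_fanout(c) for _, c in node.children])


# Illustrative query with invented concept and relation names.
query = QueryNode("Car", [
    ("soldBy", QueryNode("Dealer", [("locatedIn", QueryNode("City"))])),
    ("equippedWith", QueryNode("Engine")),
])

print(query_length(query), depth(query), max_fanout(query))
```

With this convention the example query has four concepts and three relations, so its length is 7, its depth is 2 and its maximum fanout is 2.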

The dataset can be downloaded from here, and an online demo of the QCorpus annotation tool can be found here.

Contact: Claire Gardent, Laura Perez

KBGen dataset

KBGen was a challenge in Surface Realisation from Knowledge Base data, organised in 2012. The organisers provided training and test data sets extracted from an existing ontology, the BioKB101. For both data sets, the input consists of entity variables, event variables (with their properties) and the relationships between them (event-to-entity, event-to-event, entity-to-event, entity-to-entity and entity-to-property-value), specified in the form of RDF triples. Separate event and entity lexicons are also provided for each dataset. The entity lexicon specifies the noun and its plural form for each entity variable present in the input, and the event lexicon maps each event variable in the input to its corresponding verb, together with the verb's inflections and nominalisations. The KBGen dataset can be downloaded here.
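As a rough picture of how triples and lexicons fit together, the sketch below models the two lexicons as dictionaries keyed by variable. Every name here (`EntityEntry`, `EventEntry`, `surface_forms`, the variables `e1`, `x1`, `x2`, the relation labels and the words) is invented for illustration. The actual file format and vocabulary of the KBGen distribution are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityEntry:
    """Hypothetical entity lexicon entry: a noun and its plural form."""
    noun: str
    plural: str


@dataclass(frozen=True)
class EventEntry:
    """Hypothetical event lexicon entry: a verb, its inflections, its nominalisations."""
    verb: str
    inflections: tuple[str, ...]
    nominalisations: tuple[str, ...]


# Input content as (subject, relation, object) triples; all values are invented.
triples = [
    ("e1", "agent", "x1"),   # event-to-entity
    ("e1", "object", "x2"),  # event-to-entity
]

entity_lexicon = {
    "x1": EntityEntry("protein", "proteins"),
    "x2": EntityEntry("molecule", "molecules"),
}
event_lexicon = {
    "e1": EventEntry(
        "transport",
        ("transports", "transported", "transporting"),
        ("transportation",),
    ),
}


def surface_forms(variable: str) -> list[str]:
    """Return every word form the lexicons list for a variable."""
    if variable in entity_lexicon:
        entry = entity_lexicon[variable]
        return [entry.noun, entry.plural]
    if variable in event_lexicon:
        entry = event_lexicon[variable]
        return [entry.verb, *entry.inflections, *entry.nominalisations]
    raise KeyError(f"no lexicon entry for {variable!r}")


for subject, relation, obj in triples:
    print(relation, surface_forms(subject)[0], surface_forms(obj)[0])
```

Keeping the two lexicons separate mirrors the description above: entity variables only need nominal forms, whereas event variables need both verbal and nominal forms.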

Contact: Claire Gardent, Bikash Gyawali

KBGen+ dataset

In this dataset, each input describes a single event in relation to the one or more entities that participate in it. To obtain such inputs, we manually edited the KBGen dataset (both training and test inputs) so that each newly formed input contains only event-to-entity relations. For each KBGen input containing more than one event variable, we created multiple KBGen+ inputs, each one describing a single event and its relations to the participating entities only. We also carefully edited the reference sentence of the source KBGen input to obtain the reference sentence for each corresponding KBGen+ input. The lexicons for this dataset are the same as for the KBGen dataset. The KBGen+ dataset can be downloaded here.
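The filtering step described above can be approximated in a few lines. The dataset itself was edited by hand, so this is only a sketch of the underlying idea, assuming an input is a list of (subject, relation, object) triples and that the sets of event and entity variables are known. The function name `split_by_event` and all variable and relation names are hypothetical.

```python
def split_by_event(triples, events, entities):
    """Group event-to-entity triples by event, dropping every other relation type.

    Returns a dict mapping each event variable to the triples of its
    single-event input. Events with no event-to-entity triple are omitted.
    """
    per_event = {}
    for subject, relation, obj in triples:
        # Keep a triple only if it links an event (subject) to an entity (object).
        if subject in events and obj in entities:
            per_event.setdefault(subject, []).append((subject, relation, obj))
    return per_event


# Invented input with two events and three entities.
kbgen_input = [
    ("e1", "agent", "x1"),     # event-to-entity: kept under e1
    ("e1", "causes", "e2"),    # event-to-event: dropped
    ("e2", "object", "x2"),    # event-to-entity: kept under e2
    ("x1", "has-part", "x3"),  # entity-to-entity: dropped
    ("e2", "base", "x1"),      # event-to-entity: kept under e2
]

kbgen_plus_inputs = split_by_event(
    kbgen_input, events={"e1", "e2"}, entities={"x1", "x2", "x3"}
)

for event, event_triples in kbgen_plus_inputs.items():
    print(event, event_triples)
```

Here the two-event input yields two single-event inputs, one for `e1` with one triple and one for `e2` with two triples. Rewriting the reference sentences, which the authors also did manually, is outside the scope of this sketch.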

Note: The filenames in the KBGen+ dataset reflect the original KBGen input from which each file is derived. When several inputs are derived from the same KBGen input, the files are prefixed with I, II, III, etc.

Contact: Claire Gardent, Bikash Gyawali