This documentation describes the WebNLG corpus which maps RDF-triples to text. RDF-triples were extracted from DBpedia, and texts were collected using crowdsourcing.
Everything is wrapped in the root tag <benchmark>.
The main unit of the benchmark is
<entry>. All the entries are wrapped in the tag
<entries>. Each entry has three attributes: a DBpedia category, an entry ID, and a triple set size. For example,
<entry category="Food" eid="Id65" size="2">.
Each entry consists of three sections:
Original tripleset represents a set of triples as they were extracted from DBpedia. Each original triple is wrapped in the tag <otriple>.
Modified tripleset represents a set of triples as they were presented to crowdworkers (for more details on modifications, see below). The order of triples in the benchmark is the same as the order in which triples were presented to the crowd. Each modified triple is wrapped in the tag <mtriple>.
Lex (short for lexicalisation) represents a natural language text corresponding to the triples. Each lexicalisation has two attributes: a comment and a lexicalisation ID. By default, comments have the value good, except in rare cases where they were manually marked as toFix. This was done during corpus creation, when a lexicalisation was found not to match its triple set exactly.
The subject-predicate-object structure of a triple is linearised with vertical bars as separators. For instance,
Arròs_negre | country | Spain
where Arròs_negre is the subject, country the predicate, and Spain the object of the RDF-triple.
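A linearised triple can be split back into its three parts at the vertical bars. This is a minimal sketch; the function name parse_triple is our own, not part of any WebNLG tooling.

```python
# Split a linearised WebNLG triple into subject, predicate and object.
# The " | " separator is assumed from the examples in this documentation.
def parse_triple(linearised: str) -> tuple[str, str, str]:
    subj, pred, obj = (part.strip() for part in linearised.split("|", 2))
    return subj, pred, obj

print(parse_triple("Arròs_negre | country | Spain"))
```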
<entry category="Food" eid="Id65" size="2">
  <originaltripleset>
    <otriple>Arròs_negre | country | Spain</otriple>
    <otriple>Arròs_negre | ingredient | White_rice</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Arròs_negre | country | Spain</mtriple>
    <mtriple>Arròs_negre | ingredient | White_rice</mtriple>
  </modifiedtripleset>
  <lex comment="good" lid="1">White rice is an ingredient of Arros negre which is a traditional dish from Spain.</lex>
  <lex comment="good" lid="2">White rice is used in Arros negre which is from Spain.</lex>
  <lex comment="good" lid="3">Arros negre contains white rice as an ingredient and it comes from Spain.</lex>
</entry>
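Entries in this format can be read with a standard XML parser. The sketch below embeds the sample entry above directly for illustration; in a real corpus file, entries sit under the root and entries wrapper tags, and you would load the file with xml.etree.ElementTree.parse instead.

```python
import xml.etree.ElementTree as ET

# Sample entry from the documentation, embedded as a string for illustration.
sample = """\
<entry category="Food" eid="Id65" size="2">
  <originaltripleset>
    <otriple>Arròs_negre | country | Spain</otriple>
    <otriple>Arròs_negre | ingredient | White_rice</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Arròs_negre | country | Spain</mtriple>
    <mtriple>Arròs_negre | ingredient | White_rice</mtriple>
  </modifiedtripleset>
  <lex comment="good" lid="1">White rice is an ingredient of Arros negre which is a traditional dish from Spain.</lex>
</entry>
"""

entry = ET.fromstring(sample)
# The three entry attributes: DBpedia category, entry ID, triple set size.
print(entry.get("category"), entry.get("eid"), entry.get("size"))
# Modified triples and lexicalisations, in document order.
triples = [t.text for t in entry.iter("mtriple")]
lexes = [(lex.get("lid"), lex.get("comment"), lex.text) for lex in entry.iter("lex")]
print(triples)
print(lexes)
```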
Initial triples extracted from DBpedia were modified in several ways. In this section, we describe the most frequent changes that have been made.
Unclear properties were renamed.
<otriple>Karnataka | west | Arabian_Sea</otriple>
<mtriple>Karnataka | has to its west | Arabian_Sea</mtriple>
Properties with identical semantics were merged into a single property to avoid redundancy in the data.
<otriple>Stuart_Parker_(footballer) | club | Chesterfield_F.C.</otriple>
<otriple>Stuart_Parker_(footballer) | team | Chesterfield_F.C.</otriple>
<mtriple>Stuart_Parker_(footballer) | club | Chesterfield_F.C.</mtriple>
Inexact subjects and objects were clarified.
<otriple>1_Decembrie_1918_University,_Alba_Iulia | nickname | Uab</otriple>
<mtriple>1_Decembrie_1918_University | nickname | Uab</mtriple>
This example illustrates the motivation for keeping only the name of the university (1_Decembrie_1918_University), rather than its name together with its location (Alba_Iulia).
Objects were replaced due to the following reasons:
incorrect DBpedia data (quite often stemming from the bad parsing of infoboxes);
<otriple>Ab_Klink | almaMater | Law</otriple>
<mtriple>Ab_Klink | almaMater | Leiden_University</mtriple>
This incorrect original triple arose because Ab Klink studied Law at Leiden University.
same data, but in different measurement units (e.g., feet/metres, Celsius/Fahrenheit, etc);
<otriple>320_South_Boston_Building | height | 400.0 (feet)</otriple>
<otriple>320_South_Boston_Building | height | 121.92 (metres)</otriple>
<mtriple>320_South_Boston_Building | height | 121.92 (metres)</mtriple>
same data, but in different formats (e.g., using double quotes, datatypes);
<otriple>Elliot_See | deathDate | "1966-02-28"^^xsd:date</otriple>
<otriple>Elliot_See | deathDate | 1966-02-28</otriple>
<mtriple>Elliot_See | deathDate | 1966-02-28</mtriple>
The changes were sometimes quite drastic, especially in the case of incorrect DBpedia data, so do not be surprised by how some original triples were converted to modified ones.
An original tripleset and a modified tripleset usually represent a one-to-one mapping. However, there are cases with many-to-one mappings when several original triplesets are mapped to one modified tripleset.
<originaltripleset>
  <otriple>Jens_Härtel | team | 1._FC_Magdeburg</otriple>
</originaltripleset>
<originaltripleset>
  <otriple>Jens_Härtel | managerClub | 1._FC_Magdeburg</otriple>
</originaltripleset>
<modifiedtripleset>
  <mtriple>Jens_Härtel | club | 1._FC_Magdeburg</mtriple>
</modifiedtripleset>
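Code that reads entries should therefore not assume a single <originaltripleset> per entry. The sketch below wraps the example above in an <entry> element (our own wrapper, added only so the fragment parses) and counts the triplesets.

```python
import xml.etree.ElementTree as ET

# The many-to-one example from the documentation, wrapped in an <entry>
# element so the fragment is well-formed XML.
fragment = """\
<entry>
  <originaltripleset>
    <otriple>Jens_Härtel | team | 1._FC_Magdeburg</otriple>
  </originaltripleset>
  <originaltripleset>
    <otriple>Jens_Härtel | managerClub | 1._FC_Magdeburg</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Jens_Härtel | club | 1._FC_Magdeburg</mtriple>
  </modifiedtripleset>
</entry>
"""

entry = ET.fromstring(fragment)
# findall returns all matching direct children, in document order.
print(len(entry.findall("originaltripleset")))  # → 2
print(len(entry.findall("modifiedtripleset")))  # → 1
```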
Original and modified triples serve different purposes: the original triples link the data to a knowledge base (DBpedia), whereas the modified triples ensure consistency and homogeneity throughout the data. To train models, the modified triples should be used.
Note on entries from 1_triple files
We built the corpus in such a way that the 1_triple files include every triple that occurs anywhere in the corpus for a given category. Hence the name allSolutions in the file names.
For example, in the Food category 1_triple file, one can find the triple United_States | leader | Barack_Obama. That means that somewhere in the 2-, 3-, 4- or 5-triple files of the Food category there is a tripleset (about food) that includes a triple about the leader of the United States.
Theoretically, this hierarchical corpus construction makes it possible to produce texts expressing 5 triples using only 1_triple entries.
Note on typos
WebNLG was crowdsourced, so there are spelling mistakes in some lexicalisations. Typos were preserved to model real-world data. This script may help you to detect typos in WebNLG.
However, if you find mistakes in the realisation of the semantic content (e.g., a missing triple realisation), do not hesitate to drop us a line at email@example.com; we will fix them.