Datasets

LRMI Dataset

This section covers the LRMI markup extracted from the 2013, 2014, and 2015 releases of the Web Data Commons (WDC). Each entity description corresponds to a set of quadruples q of the form {s, p, o, u}, where s, p, o form a triple of subject, predicate, and object, and u is the URL of the document d from which the triple was extracted. For a particular real-world entity e, there usually exist n ≥ 0 subjects s, each representing a distinct description of e. More precisely, each dataset contains all embedded markup statements extracted from those documents (in the respective WDC release) that contain at least one triple {s, p, o} where either p is one of the LRMI predicates, or s or o is an instance of one of the LRMI-specific types AlignmentObject or EducationalAudience.

Release  File             Size
2013     lrmi2013.nq.gz   214M
2014     lrmi2014.nq.gz   531M
2015     lrmi2015.nq.gz   762M
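
As a rough illustration of how the quads {s, p, o, u} in these dumps could be read, the following Python sketch streams a gzipped N-Quads file and groups quads into entity descriptions. The regular expression is a deliberate simplification and not a full N-Quads parser, so lines it cannot match are skipped. The names Quad, parse_quad, iter_quads, and group_descriptions are hypothetical helpers introduced here, and keying a description by the pair (s, u) is an assumption made because blank-node subjects are only meaningful within one document.

```python
import gzip
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Quad(NamedTuple):
    s: str  # subject (IRI or blank node), kept in its N-Quads surface form
    p: str  # predicate IRI
    o: str  # object (IRI, blank node, or literal)
    u: str  # URL of the document the triple was extracted from


# Simplified line pattern: subject, predicate, object, graph IRI, closing dot.
# Assumption: the last <...> term before the final dot is the document URL.
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s+(<[^>]*>)\s*\.\s*$'
)


def parse_quad(line: str) -> Optional[Quad]:
    """Return a Quad for a well-formed line, or None if it does not match."""
    match = QUAD_RE.match(line)
    return Quad(*match.groups()) if match else None


def iter_quads(path: str) -> Iterator[Quad]:
    """Stream quads from a gzipped N-Quads file without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            quad = parse_quad(line)
            if quad is not None:
                yield quad


def group_descriptions(quads: Iterable[Quad]) -> Dict[Tuple[str, str], List[Quad]]:
    """Group quads by (subject, document URL) to form entity descriptions."""
    descriptions: Dict[Tuple[str, str], List[Quad]] = defaultdict(list)
    for quad in quads:
        descriptions[(quad.s, quad.u)].append(quad)
    return dict(descriptions)
```

For the larger dumps, grouping everything in memory may not be practical; in that case iter_quads can be consumed one quad at a time, for example to count predicates.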

LRMI’ Dataset (including common errors)

This section covers LRMI’, a variant of the LRMI corpus described above that additionally includes quads containing erroneous LRMI statements. It takes into account the frequent errors described by Meusel et al., for instance quads involving misspellings of LRMI terms.

Release  File                                   Size
2013     lrmi_with_common_error_2013.nq.gz      214M
2014     lrmi_with_common_error_2014.nq.gz      531M
2015     lrmi_with_common_error_2015.nq.gz      1.6G

LRMI” Dataset (fixed common errors)

This section covers LRMI”, a variant of the LRMI’ corpus described above in which the common errors described by Meusel et al. have been fixed using a set of heuristics. Note that fixing misused object properties increases the size of the respective dataset, since for each new object property the corresponding objects and related triples are added.

Release  File                                   Size
2013     lrmi_fixed_common_error_2013.nq.gz     235M
2014     lrmi_fixed_common_error_2014.nq.gz     618M
2015     lrmi_fixed_common_error_2015.nq.gz     1.9G

LRMI Markup RDF/SPARQL Endpoint

In addition, the LRMI datasets (but not the LRMI’ corpora) were made available as an RDF dataset via a SPARQL endpoint at http://asev.l3s.uni-hannover.de/sparql. The original quads <s, p, o, d> were transformed into triples by storing each triple <s, p, o> and adding, for each subject/resource URI, a statement of the form <s, http://data-observatory.org/appearsOn, d>, where d denotes the URI of the document from which the resource was extracted.
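
The quad-to-triple transformation described above can be sketched in Python as follows. This is an illustration under stated assumptions and not the code used to build the endpoint: the function name quads_to_triples is hypothetical, terms are assumed to be plain strings in N-Quads surface form, and emitting the appearsOn statement only once per (s, d) pair is a choice made here to avoid duplicate triples.

```python
from typing import Iterable, Iterator, Set, Tuple

# Predicate named in the text above, in N-Triples surface form.
APPEARS_ON = "<http://data-observatory.org/appearsOn>"

Triple = Tuple[str, str, str]
QuadTuple = Tuple[str, str, str, str]


def quads_to_triples(quads: Iterable[QuadTuple]) -> Iterator[Triple]:
    """Yield <s, p, o> for every quad, plus one <s, appearsOn, d> per (s, d)."""
    seen: Set[Tuple[str, str]] = set()  # (subject, document) pairs already linked
    for s, p, o, d in quads:
        yield (s, p, o)
        if (s, d) not in seen:
            # Assumption: the provenance link is emitted once per pair.
            seen.add((s, d))
            yield (s, APPEARS_ON, d)
```

With this layout, the document of origin for any resource can be recovered by following its appearsOn statements, at the cost of losing the per-triple provenance that the quads carried.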

Please use the following graph URIs:

  • Graph URI of LRMI(CC13): http://data-observatory.org/lrmi-wdc-2013
  • Graph URI of LRMI(CC14): http://data-observatory.org/lrmi-wdc-2014
  • Graph URI of LRMI(CC15): http://data-observatory.org/lrmi-wdc-2015
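
A minimal sketch of querying one of these graphs from Python, using only the standard library, might look as follows. It assumes the endpoint accepts GET requests with a query parameter and can return SPARQL JSON results, as the SPARQL 1.1 Protocol describes; whether this particular endpoint does so has not been verified here. The helper names build_query, build_request, and run_query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://asev.l3s.uni-hannover.de/sparql"
GRAPH_2015 = "http://data-observatory.org/lrmi-wdc-2015"


def build_query(graph_uri: str, limit: int = 10) -> str:
    """Build a query listing resources and the documents they appear on."""
    return (
        "SELECT ?s ?d\n"
        f"FROM <{graph_uri}>\n"
        "WHERE { ?s <http://data-observatory.org/appearsOn> ?d }\n"
        f"LIMIT {int(limit)}"
    )


def build_request(query: str) -> urllib.request.Request:
    """Wrap the query in a GET request that asks for JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query: str, timeout: float = 30.0) -> list:
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


# Example use (performs a network call, so it is left commented out):
# for row in run_query(build_query(GRAPH_2015, limit=5)):
#     print(row["s"]["value"], row["d"]["value"])
```

Passing an explicit timeout keeps a script from hanging if the endpoint does not respond.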

Note: should the SPARQL endpoint be unresponsive, please refer to the data dumps mentioned above.