LRMI Dataset
This section describes the LRMI markup extracted from the 2013, 2014, and 2015 releases of the Web Data Commons (WDC). Each entity description corresponds to a set of quadruples q of the form {s, p, o, u}, where s, p, o form a triple of subject, predicate, and object, and u is the URL of the document d from which the triple was extracted. For a particular real-world entity e, there usually exist n ≥ 0 subjects s that represent distinct descriptions of e. More precisely, this dataset contains all embedded markup statements extracted from those documents (in the respective Web Data Commons dataset) that contain at least one triple {s, p, o} where either p is one of the LRMI predicates, or s or o is an instance of one of the LRMI-specific types AlignmentObject or EducationalAudience.
2013 | lrmi2013.nq.gz | 214M
2014 | lrmi2014.nq.gz | 531M
2015 | lrmi2015.nq.gz | 762M
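The inclusion criterion described above can be sketched in Python as follows. This is a minimal illustration, not the extraction code used to build the dataset. Several things are assumptions: the LRMI terms are taken to live in the http://schema.org/ namespace, the predicate list is an illustrative subset, the line pattern is a simplified N-Quads parser that ignores edge cases, and the helper names parse_quad and select_lrmi_quads are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: LRMI terms are expressed in the schema.org namespace.
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumption: illustrative subset of LRMI predicates, not the full list.
LRMI_PREDICATES = {SCHEMA + name for name in (
    "educationalAlignment", "educationalUse", "interactivityType",
    "learningResourceType", "timeRequired", "typicalAgeRange",
)}
LRMI_TYPES = {SCHEMA + "AlignmentObject", SCHEMA + "EducationalAudience"}

# Simplified N-Quads line: subject, predicate, object, document URL, final dot.
QUAD_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+<([^>]*)>\s+(.+?)\s+<([^>]*)>\s*\.\s*$'
)


def parse_quad(line):
    """Return (s, p, o, u) for one N-Quads line, or None if it does not match."""
    match = QUAD_RE.match(line)
    if match is None:
        return None
    s, p, o, u = match.groups()
    if s.startswith("<"):
        s = s[1:-1]  # drop angle brackets around an IRI subject
    if o.startswith("<"):
        o = o[1:-1]  # drop angle brackets around an IRI object
    return s, p, o, u


def select_lrmi_quads(quads):
    """Keep every quad of each document that has at least one LRMI triple."""
    by_doc = defaultdict(list)
    for quad in quads:
        by_doc[quad[3]].append(quad)

    selected = []
    for doc_quads in by_doc.values():
        # Resources explicitly typed as one of the LRMI-specific types.
        instances = {s for s, p, o, _ in doc_quads
                     if p == RDF_TYPE and o in LRMI_TYPES}
        if any(p in LRMI_PREDICATES or s in instances or o in instances
               for s, p, o, _ in doc_quads):
            selected.extend(doc_quads)
    return selected
```

The dumps are gzip-compressed, so lines could be read with gzip.open("lrmi2013.nq.gz", "rt", encoding="utf-8") and passed through parse_quad. The grouping above keeps all quads in memory, which is only practical for small samples.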
LRMI’ Dataset (including common errors)
This section describes LRMI’, a variant of the LRMI corpus above that additionally includes quads containing erroneous LRMI statements. The errors considered are the frequent ones described by Meusel et al., for instance misspellings of LRMI terms.
2013 | lrmi_with_common_error_2013.nq.gz | 214M
2014 | lrmi_with_common_error_2014.nq.gz | 531M
2015 | lrmi_with_common_error_2015.nq.gz | 1.6G
LRMI” Dataset (fixed common errors)
This section describes LRMI”, a variant of the LRMI’ corpus above in which the common errors described by Meusel et al. are fixed using a set of heuristics. Note that fixing misused object properties increases the size of the respective dataset, since for each new object property the corresponding objects and related triples are added.
2013 | lrmi_fixed_common_error_2013.nq.gz | 235M
2014 | lrmi_fixed_common_error_2014.nq.gz | 618M
2015 | lrmi_fixed_common_error_2015.nq.gz | 1.9G
LRMI Markup RDF/SPARQL Endpoint
In addition, the LRMI datasets (as opposed to the LRMI’ corpora) were made available as an RDF dataset via a SPARQL endpoint at http://asev.l3s.uni-hannover.de/sparql. The original quads <s, p, o, d> were transformed into triples by storing each triple <s, p, o> and adding, for each subject/resource URI, a statement of the form <s, http://data-observatory.org/appearsOn, d>, where d denotes the URI of the document from which the resource was extracted.
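The quad-to-triple transformation described above can be illustrated with a short Python sketch. It assumes quads are already available as (s, p, o, d) tuples of strings. The function name quads_to_triples is hypothetical, and the actual loading procedure used for the endpoint may differ.

```python
APPEARS_ON = "http://data-observatory.org/appearsOn"


def quads_to_triples(quads):
    """Turn (s, p, o, d) quads into triples plus one appearsOn link per (s, d)."""
    triples = set()
    for s, p, o, d in quads:
        triples.add((s, p, o))           # the original statement
        triples.add((s, APPEARS_ON, d))  # provenance: s was found on document d
    return triples
```

Using a set means that a subject with several statements on the same document still gets a single appearsOn triple for that document.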
Please use the following graph URIs:
- Graph URI of LRMI(CC13): http://data-observatory.org/lrmi-wdc-2013
- Graph URI of LRMI(CC14): http://data-observatory.org/lrmi-wdc-2014
- Graph URI of LRMI(CC15): http://data-observatory.org/lrmi-wdc-2015
Note: should the SPARQL endpoint be unresponsive, please refer to the data dumps mentioned above.
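As a rough sketch of how one of the graphs might be queried, the following Python snippet builds a SPARQL protocol request with the standard library only. It assumes the endpoint accepts the standard query and default-graph-uri parameters and can return JSON results; this has not been verified against the endpoint. The helper names build_request and run_query and the example query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://asev.l3s.uni-hannover.de/sparql"
GRAPHS = {
    2013: "http://data-observatory.org/lrmi-wdc-2013",
    2014: "http://data-observatory.org/lrmi-wdc-2014",
    2015: "http://data-observatory.org/lrmi-wdc-2015",
}

# Example query: list a few resources with the document they appear on.
EXAMPLE_QUERY = (
    "SELECT ?s ?doc WHERE { "
    "?s <http://data-observatory.org/appearsOn> ?doc } LIMIT 10"
)


def build_request(query, year):
    """Build a GET request for the given query against one graph (no network I/O)."""
    params = urllib.parse.urlencode({
        "query": query,
        "default-graph-uri": GRAPHS[year],
    })
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        # Assumption: the endpoint supports the SPARQL JSON results format.
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(query, year, timeout=30):
    """Send the request and decode the JSON response; raises on network errors."""
    with urllib.request.urlopen(build_request(query, year), timeout=timeout) as resp:
        return json.load(resp)
```

A call such as run_query(EXAMPLE_QUERY, 2014) would then return the decoded result document if the endpoint responds. If it does not, the data dumps remain the fallback.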