May 02, 2015

RDFLib Stores

The basic task: creating a non-native RDF store

The basic task is to achieve an efficient and proper translation of an RDF graph into one or other of the wide range of currently-available data store models: relational, key-value, document, etc. Triplestore counts head off into the millions very quickly, so considered choices amongst the speed/space/structure tradeoffs in both storage and retrieval will be crucial to the success of any non-trivial attempt. Because data storage and retrieval is a highly technical field, those considerations can be complex (a typical paper in the field: An Efficient SQL-based RDF Querying Scheme) and wide-ranging, as indicated in the W3C deliverable Mapping Semantic Web Data with RDBMSes (well worth a quick dekko and a leisurely revisit later).

answers.semanticweb.com, the semantic web “correlate” of stackoverflow, has some highly informative answers to questions about RDF storage and contemporary non-native RDF stores:

The answers are an excellent tour d’horizon of the principles in play and provide accessible and highly-relevant background support to the RDFLib-specific topics that are covered in this document.

Other preliminary reading that would most likely make this document more useful:

Types of RDF Store

The domain being modelled is that of RDF graphs and (minimally) statements of the form {subject, predicate, object} (aka triples), desirably augmented with the facility to handle statements about statements (quoted statements) and references to groups of statements (contexts). Hence the following broad divisions of RDF store, all of which have an impact on the modelling:

Context-aware: An RDF store capable of storing statements within contexts is considered context-aware. Essentially, such a store is able to partition the RDF model it represents into individual, named, and addressable sub-graphs.

Formula-aware: An RDF store capable of distinguishing between statements that are asserted and statements that are quoted is considered formula-aware.

Conjunctive Graph: This refers to the ‘top-level’ Graph. It is the aggregation of all the contexts within it and is also the appropriate, absolute boundary for closed world assumptions / models.

For the sake of persistence, Conjunctive Graphs must be distinguished by identifiers (which need not themselves be RDF identifiers, or may be an RDF identifier normalized - via SHA1/MD5, perhaps - for database naming purposes).
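By way of illustration, here is a minimal RDFLib sketch of the context-aware / conjunctive distinction, using the default in-memory store (which is context-aware) and made-up example.org identifiers:

from rdflib import ConjunctiveGraph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/")

# The ConjunctiveGraph is the 'top-level' graph: the aggregation of its contexts.
cg = ConjunctiveGraph()

# Each context is an individual, named, addressable sub-graph.
people = cg.get_context(URIRef("http://example.org/contexts/people"))
people.add((EX.bob, EX.age, Literal(23)))

places = cg.get_context(URIRef("http://example.org/contexts/places"))
places.add((EX.bristol, EX.label, Literal("Bristol")))

# The conjunctive graph sees the union of every context ...
assert len(cg) == 2

# ... while each context can still be enumerated and queried on its own.
for context in cg.contexts():
    print(context.identifier, len(context))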

The Notation3 reference has relevant information regarding formulae, quoted statements and such.

“An RDF document parses to a set of statements, or graph. However RDF itself has no datatype allowing a graph as a literal value. N3 extends RDF to allow a graph itself to be referred to within the language, where it is known as a formula.”
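As a rough sketch of what that looks like in RDFLib (assuming the default in-memory store, which reports itself as formula-aware), the N3 parser turns a { ... } group into a QuotedGraph in object position rather than asserting its contents:

from rdflib import ConjunctiveGraph
from rdflib.graph import QuotedGraph

n3_data = """
@prefix : <http://example.org/> .
:alice :believes { :bob :age 23 } .
"""

g = ConjunctiveGraph()          # default store is formula-aware
g.parse(data=n3_data, format="n3")

for s, p, o in g:
    # Only the outer statement is asserted; the quoted {:bob :age 23}
    # shows up as a formula (QuotedGraph) in the object position.
    print(s, p, isinstance(o, QuotedGraph))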

For a more detailed discussion, see Chimezie’s blog post “Patterns and Optimizations for RDF Queries over Named Graph Aggregates”

Resources

Scamped Notes

Rob Vesse

The [design] I settled on is essentially to have a single simple document which represents the existence of the Graph:

{
  name: "some-name" ,
  uri: "http://example.org/graph"
}

And then to have a document for each individual triple:

{
  subject : "<http://example.org/subject>" ,
  predicate : "<http://example.org/predicate>" ,
  object : "<http://example.org/object>" ,
  graphuri : "http://example.org/graph"
}

I took advantage of MongoDB's indexing capabilities to generate indexes on Subject, Predicate, Object and Graph URI, and then used these to apply SPARQL queries over MongoDB, and it worked reasonably well. Though, as I note in my blog post, it isn't going to replace dedicated triple stores, but it does work well for small-scale stores - actual performance will vary depending on your data and how you use it in your application.
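A rough pymongo sketch of that layout follows; the collection names, index choices and lookup are my reading of what is described above, not Rob's actual code:

from pymongo import ASCENDING, MongoClient

client = MongoClient()              # local MongoDB on the default port
db = client["rdfstore"]             # hypothetical database name

# One document recording the graph's existence, one document per triple,
# mirroring the documents sketched above.
db.graphs.insert_one({"name": "some-name", "uri": "http://example.org/graph"})
db.triples.insert_one({
    "subject": "<http://example.org/subject>",
    "predicate": "<http://example.org/predicate>",
    "object": "<http://example.org/object>",
    "graphuri": "http://example.org/graph",
})

# Single-field indexes on subject, predicate, object and graph URI keep
# basic SPO/graph lookups from scanning the whole collection.
for field in ("subject", "predicate", "object", "graphuri"):
    db.triples.create_index([(field, ASCENDING)])

# e.g. all triples with a given subject within a given graph
for doc in db.triples.find({"subject": "<http://example.org/subject>",
                            "graphuri": "http://example.org/graph"}):
    print(doc["predicate"], doc["object"])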

Vasiliy Faronov

Suppose I am building a Linked Data client app based on Python and RDFLib, and I want to do some reasoning. Most likely I have a few vocabularies that are dear to my heart, and want to do RDFS reasoning with them, i.e. materialize superclass membership, superproperty values etc. I also want to handle owl:sameAs in instance data. Support for the rest of OWL is welcome but not essential.

The graphs I will be working with are rather small, let’s say on the order of 10,000 triples (all stored in memory), but I need to reason in real-time (e.g. my client is an end-user app that works with Linked Data) and so delays should be small.

But most importantly, the solution has to be as easy to use as possible. Ideally:

import reasoner
reasoner.infer_all(my_rdflib_graph)

What are my best options?
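One lightweight answer is to materialize the entailments directly over the RDFLib graph with a fixed-point loop. The sketch below is a simplification (it covers only rdfs:subClassOf, rdfs:subPropertyOf and owl:sameAs, and assumes the whole graph fits comfortably in memory); infer_all is a hypothetical name echoing the question, not an existing API:

from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

def infer_all(graph):
    """Naive RDFS-flavoured materialization: loop until no new triples appear."""
    while True:
        new = set()

        # rdfs9: (x rdf:type C), (C rdfs:subClassOf D)  =>  (x rdf:type D)
        for c, _, d in graph.triples((None, RDFS.subClassOf, None)):
            for x in graph.subjects(RDF.type, c):
                new.add((x, RDF.type, d))

        # rdfs7: (x P y), (P rdfs:subPropertyOf Q)  =>  (x Q y)
        for p, _, q in graph.triples((None, RDFS.subPropertyOf, None)):
            for x, _, y in graph.triples((None, p, None)):
                new.add((x, q, y))

        # owl:sameAs: treat it as symmetric and copy statements across aliases.
        for a, _, b in graph.triples((None, OWL.sameAs, None)):
            new.add((b, OWL.sameAs, a))
            for _, p, o in graph.triples((a, None, None)):
                new.add((b, p, o))
            for s, p, _ in graph.triples((None, None, a)):
                new.add((s, p, b))

        added = [t for t in new if t not in graph]
        if not added:
            return
        for t in added:
            graph.add(t)

For anything more complete, a dedicated package such as FuXi or Ivan Herman's RDFClosure (both of which work over RDFLib graphs) handles this kind of closure properly.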

author: Graham Higgins
contact: Graham Higgins, gjh@bel-epa.com
version: 0.1