May 02, 2015

SPARQL in RDFLib (Version 2.1)

author:Ivan Herman ivan@ivan-herman.net
date:2005/10/10 15:40:35

Introduction

This is a short overview of the query facilities added to RDFLib. These are based on the July 2005 version of the SPARQL draft worked on at the W3C. For a lack of a better word, I refer to this implementation as sparql-p.

Thanks to the work of Daniel Krech and mainly Michel Pelletier, sparql-p is now fully integrated with the newer versions of RDFLib (version 2.2.2 or later), whereas earlier versions were distributed as separate packages. This integration has led to some minor adjustments in class naming and structure compared to earlier versions. If you are looking for the documentation of the separate package, please refer to an earlier version of this document. Be warned, though, that the earlier versions are now deprecated in favour of RDFLib 2.2.2 or later.

The SPARQL draft describes its facilities in terms of a query language. A full SPARQL implementation should include a parser of that language mapping on the underlying implementation. sparql-p does not include such parser yet, only the underlying SPARQL engine and its API. The description below shows how the mapping works. This also means that the API is not the full implementation of SPARQL: some of the features should be left to the parser that could use this API. This is the case, for example, of named Graphs facilities that could be mapped using RDFLib Graph instances: all query is performed on such an instance in the first place! In any case, the implementation of sparql-p covers (I believe) the most frequently used cases of SPARQL.

The SPARQL facilities ar based on a wrapper class called SPARQLGraph around the basic Graph object defined by RDFLib. Ie, all programs using sparql-p should be of the form:

from rdfextras.sparql import sparqlGraph
sparqlGr = sparqlGraph.SPARQLGraph()

the sparqlGr object thus created has the same methods as a Graph type object would have, extended with the sparql-p facilities. An alternative way of creating the sparql-p graph is to use an existing Graph instance:

sparqlGr = sparqlGraph.SPARQLGraph(graph=myExistingGraph)

Basic SPARQL

The basic SPARQL construct is as follows (using the query syntax of the SPARQL document):

SELECT ?a ?b ?c
WHERE { ?a P ?x .
        Q ?b ?a .
        ?x R ?c
     }

The meaning of this construction is simple: the ‘?a’, ‘?b’, etc, symbols (referred to as ‘unbound’ symbols) are queried with the constraint that the tuples listed in the WHERE clause are ‘true’, i.e., part of the triple store. This functionality is translated into a Python method as:

from rdfextras.sparql import GraphPattern
select = ("?a","?b","?c")
where  = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c")])
result = sparqlGr.query(select,where)

where result is a list of tuples, each giving possible binding combinations for ”?a”, ”?b”, and ”?c”, respectively. P, Q, R, etc, must be the rdflib incarnations of RDF resources, i.e., URIRef, Literal, etc. The object of each pattern can also be one of the following Python types:

  • integer
  • long
  • float
  • string
  • unicode
  • datetime.date,
  • datetime.time,
  • datetime.datetime

these are transformed into a Literal with the corresponding XML Schema datatype on the fly. This allows coding in the form:

select = ("?a","?b","?c")
where  = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),
                      ("?x",S,"Some Literal Here"),("?x",R,43)])
result = sparqlGr.query(select,where)

Note that the SPARQL draft mandates datetime only, not separate date and time, but it was obvious to add this into the Python implementation (and useful in practice). See also the note above on literals, as well as the additional section on datatypes.

As a further convenience to the user, if select consists of a single entry, it is not necessary to use a tuple and just giving the string value will do. Similarly, if the where consists of one single tuple, the array construction may be skipped, and the single tuple is accepted as an argument. Finally, if select consists of one entry, result is a list of the values rather than tuples of (single) values.

The GraphPattern class instance can be built up gradually via the addPattern() and addPatterns() methods (the former takes one tuple, the latter a list of tuples).

The draft describes nested patterns, too, but it also draws the attention to the fact that nested patterns can be turned into regular patterns by possibly repeating some patterns. In other words, nested patterns can be handled by a parser and is therefore not implemented on this API level.

The SPARQLError exception is raised (beyond the possible exceptions raised by rdflib) if there are inconsistencies in the select or where clauses (e.g., the tuples do not have the correct length or there are incorrect data in the tuples themselves).

Constraining Values

SPARQL makes it possible to constrain values through operators, like:

SELECT ?a,?b,?c
WHERE { ?a P ?x .
        Q ?b ?a .
        ?x R ?c .
        FILTER ?x < 10
       }
 ...

The draft also refers to the fact that application specific functions can also be used in the ‘FILTER’ part. There are two ways to translate this feature into sparql-p (see below for a further discussion).

Global Constraint

This version is based on constraints that refer to the whole binding of the pattern and is therefore executed against the full binding once available. Here is how it looks in sparql-p:

select = ("?a","?b","?c")
where  = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),
                       ("?x",R,"?c"),("?x",S,"Some Literal Here")])
where.addConstraint(func1)
where.addConstraints([func2,func3,...])
result = sparqlGr.query(select,where)

Each function in the constraints is of the form:

def func1(bindingDir) :
    # ....
    return True # or False

where bindingDir is a dictionary of the possible binding, ie, of the form

{"?a" : Resource1, "?b" : Resource2, ...}

Adding several constraints (in a list or via a series of addConstraint() methods) is equivalent to a logical conjunction of the individual constraints.

As an extra help to operator writers, the bindingDir also includes a special entry referring to the SPARQLGraph instance in use via a special key:

from rdfextras.sparql import graphKey
graph = bindingDir[graphKey]

This construction, ie, the global constraint, is the faithful representation of the SPARQL spec. Note that a number of operator methods are available to make the construction of the global constraints easier, see the separate section on that.

Per Pattern constraint

This version is based on a constraint that can be imposed on one specific (bound) pattern only. This is achieved by adding a fourth element to the tuple representing the pattern, e.g.:

select = ("?a","?b","?c")
where  = GraphPattern([("?a",P,"?x",func),(Q,"?b","?a"),
                       ("?x",R,"?c"),("?x",S,"Some Literal Here")])
result = sparqlGr.query(select,where)

where func is a function with three arguments (the bound version of the ?a, P, ?x in the example).

Why Two Constraints?

Functionally, the global constraint is a ‘superset’ of the per pattern constraint; in other words, anything that can be expressed by per pattern constraints can be achieved by global constraints. E.g., a method above can be expressed in two different ways:

select = ("?a","?b","?c")
where  = GraphPattern([("?a",P,"?x"),(Q,"?b","?a"),
                       ("?x",R,"?c"),("?x",S,"Some Literal Here")])
where.addConstraint(lambda binding: int(binding["?x"]) < 10)
result = sparqlGr.query(select,where)

or:

select = ("?a","?b","?c")
where  = GraphPattern([("?a",P,"?x",lambda a,P,x: int(x) < 10),
        (Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")])
result = sparqlGr.query(select,where)

However, the second version may be much more efficient. The search is ‘cut’ in the by the constraint, ie the binding tree is not (unnecessarily) expanded further, whereas a full binding tree must be generated for a global constraint (see the notes on the implementation below).

For large triple stores and/or large patterns this may be a significant difference. A parser may optimize by generating per-pattern constraints in some cases to make use of this optimization, hence this alternative.

‘Or’-d Patterns

A slight variation of the basic scheme could be described as:

SELECT ?a,?b,?c
WHERE { { ?a P ?x . Q ?b ?a } UNION { S ?b ?a. ?x R ?c } }
 ...

(I hope my understanding is correct that...) the meaning a logical ‘or’ on one of the clauses. This is expressed in sparql-p by allowing the query method to accept a list of graph patterns, too, instead of single patterns only:

select = ("?a","?b","?c")
where1 = GraphPattern([("?a",P,"?x"),(Q,"?b","?a")])
where1 = GraphPattern([(S,"?b","?a"),("?x",R,"?c")])
result = sparqlGr.query(select,[where1,where2])

The two queries are evaluated separately, and the concatenation of the results is returned.

Optional Matching

Another variation on the basic query is the usage of ‘optional’ clauses:

SELECT ?a,?b,?c,?d
WHERE { ?a P ?x .
        Q ?b ?a .
        ?x R ?c .
        OPTIONAL { ?x S ?d. ... }
      }

What this means is that if the fourth tuple (with ?x already bound) is not in the triple store, that should not invalidate the possible bindings of ?a, ?b, and ?c; instead, the ?d unbound variable should be set to a null value, but the remaining bindings should be returned. In other words first the following query is performed:

SELECT ?a,?b,?c
WHERE { ?a P ?x .
        Q ?b ?a .
        ?x R ?c
      }

then, for each possible bindings, a second query is done:

SELECT ?d
WHERE { X S ?d }

where X stands for a possible binding of ?x.

The sparql-p expression of this facility is based on the creation of a separate graph pattern for the optional clause:

select = ("?a","?b","?c","?d")
where  = GraphPattern([("?a",P,"?x"),[(Q,"?b","?a"),("?x",R,"?c")])
opt    = GraphPattern([("?x",S,"?d")])
result = sparqlGr.query(select,where,opt)

and the (possible) unbound ?d is set to None in the return value. Just as for the ‘main’ pattern, the third argument of the call can be a list of graph patterns (for several OPTIONAL clauses) evaluated separately. Each of the OPTIONAL clauses can have their global constraints.

Query Forms

The SPARQL draft includes several Query forms, which is a term to control how the query results are returned to the caller. In the case of sparql-p this is implemented via a separate Python class, called Query. All query results yield, in fact, an instance of that class, and various methods on that class are defined corresponding to the SPARQL Query Forms. The queryObject() method can be invoked instead of query() to return an instance of such object. (In fact, the SPARQLGraph.query method, used in all previous examples, is simply a convenient shorthand, see below.)

SELECT Forms

The SELECT SPARQL query forms are used to retrieve the query results. Corresponding to the draft, the Query class has a select() method, with two (keyword) arguments: distinct (with possible values True and False) and limit (which is either a positive integer or None). For example:

select       = ("?a","?b","?c","?d")
where        = GraphPattern([("?a",P,"?x"),[(Q,"?b","?a"),("?x",R,"?c")])
opt          = GraphPattern([("?x",S,"?d")])
resultObject = sparqlGr.queryObject(where,opt)
result       = resultObject.select(select,distinct=True,limit=5)

returns the first 5 query results, all distinct from one another. The default for distinct is set True and the limit is None. Ie, the query() is, in fact, a shorthand for queryObject(where,...).select(select) (it is probably the most widespread use of select hence this shorthand method).

Note that it is possible to use the same class instance returned by queryObject() to run different selections (though the SPARQL draft does not make this distinction); in other words, running the select() method does not change any internal variable of the class.

CONSTRUCT Forms

The construct method can be invoked either with an explicit Graph Pattern or without (the latter corresponds to the CONSTRUCT * of the draft, the former to the case when a separate CONSTRUCT pattern is defined). In both cases, a separate SPARQLGraph instance is returned containing the constructed triples. For example, the construction in the draft:

CONSTRUCT { <http://example.org/person#Alice> FN ?name }
WHERE     { ?x nm ?name }

corresponds to the sparql-p construction:

where            = GraphPattern([("?x",nm,"?name"])
constructPattern = GraphPattern([(URIRef(
                        "http://example.org/person#Alice"),FN,"?name")])
resultObject     = sparqlGr.queryObject(where)
result           = resultObject.construct(constructPattern)

whereas the example:

CONSTRUCT * WHERE (?x N ?name)

corresponds to:

where        = GraphPattern([("?x",N,"?name"])
resultObject = sparqlGr.queryObject(where)
result       = resultObject.construct() # or resultObject.construct(None)

DESCRIPTION Forms

The current draft is pretty vague as to what this facility is (and leaves is to the implementor). What SPARQLGraph implements is a form of clustering. The describe() method has a seed argument (to serve as a seed for clustering) and two keyword arguments, forward and backward, each a boolean. What it means:

  • forward=True and backward=False generates a triple store

    with a transitive closure for each result of the query and the seed: take, recursively, all the properties and objects that start by a specific resource.

  • forward=False and backward=True the same as forward but in the ‘other direction’.

  • forward=True and backward=True combines the two into one triple store.

  • forward=False and backward=False returns and empty triple store.

ASK Forms

The SPARQL draft refers to an ASK query form., which simply says whether the set of patterns represent a non-empty subgraph. This is done by:

resultObject = sparqlGr.queryObject(where)
result       = resultObject.ask()

The ask() method returns False or True (whether the resulting subgraph is empty or not, respectively).

Datatype lexical representations

The current implementation does not (yet) do a full implementation of all the datatypes with the precise lexical representation as defined in the XML Schema Datatype document (and referred to in the SPARQL document). In theory, these should be taken care of by the underlying RDFLib layer when parsing strings into datatypes, but it does not happen yet. sparql-p does a partial conversion to have the vast majority of queries running properly, but there are some restrictions:

  • string: Implemented and coded in UTF-8
  • integer, float, long: Implemented as required
  • double: As Python does not know doubles, it is mapped to floats
  • decimal: As Python does not know general decimals, mapped to integers
  • date: The format is YYYY-MM-DD. The optional timezone character (allowed by the XML Schema document) is not implemented when interpreting Literal-s as date.
  • time: The format is HH:MM:SS. The optional microsecond and timezone characters (allowed by the XML Schema document) are not implemented when interpreting Literal-s as time.
  • dateTime: The format is YYYY-MM-DDTHH:MM:SS (ie, the combination of date and time with a separator ‘T’). No microseconds or timezone characters are implemented when interpreting a Literal as a dateTime.

These mappings are used when a typed literal value is specified in a Graph pattern, and a Literal instance is generated on-the-fly: the Literal instance uses these lexical representations and the corresponding XML Schema datatype are stored. When comparing values coming from an RDF data and parsed by RDFLib, these lexical representations are pre-supposed when comparing Literal instances.

Operators

SPARQL defines a number of possible operators for the AND clause. It is not obvious at this point which of those should be left to a parser and which of those should be implemented by the engine. sparql-p provides a number of methods that can be used to create an elementary operator and that can also be used in the AND clause. More complex constructions can be done using Python’s lambda function, for example.

The available binary operator functions are: lt() (for less than), le() (for less or equal), gt() (for greater than), ge() (for greater or equal), and eq() (for equal). Each of these operator methods take two parameters, which are both either a query string or a typed value, and each of these operators return a function that can be plugged into a global constraint. (All these methods should be imported from the sparqlOperators module.) For example, to add the constraint:

FILTER ?m < 42

one can use:

constraints = [ lt("?m",42) ]

For the more complex case of the form:

FILTER ?m < 42 || ?n > 56

the lambda construction can be used:

constraints = [ lambda binding: lt("?m",42)(binding) or gt("?n",56)(binding) ]

The complicated case of how values of different types compare is left completely to Python for the time being. If a comparison does not make sense, the return value is False. When the Working Group gets to an equilibrium point on this issue, this should be compared to what Python does but this is currently a matter of debate in the group, too.

The module also offers a special operator called isOnCollection() that can be used as a global constraint to check whether a resource is on a collection or not.

The SPARQL document also defines a number of special operators. The following of those operators are implemented: bound(), isURI(), isBlank(), isLiteral(), str(), lang(), datatype(). For example:

pattern.addConstraint(isURI("?mbox"))

adds a constraint that the value bound to ?mbox must be a real URI (as opposed to a literal), or

pattern.addConstraint( lambda binding: datatype("?d")(binding) == \\
    "http://www.myexampledatatype.org" )

checks whether the datatype of a bound resource is of a specific URI.

Whether this set of elementary operators is enough or not for the complete implementation of SPARQL is not yet clear. I presume the final answer will come when somebody writes a parser to the query language...

The sparqlOperators module in the package includes some methods that might be useful in creating more complex constraint methods, such as getLiteralValue() (to return the value of a Literal, possibly making on-the-fly conversion for the known datatypes), or getValue() (to create a ‘retrieval’ method to return either the original Resource or a bound resource in case of a query string parameter). Look at the detailed method description for details.

Implementation

The implementation of SPARQL is based on an expansion tree. Each layer in the tree takes care of a statement in the WHERE clause, starting by the first for the first layer, then the second statement for the second layer, etc. Once the full expansion is done, the results for SELECT are collected by visiting the leaves. In more details:

The root of the tree is created by handing over the full list of statements, and a dictionary with the variable bindings. Initially, this dictionary looks like

{"?x": None, "?a" : None, ...}

The node picks the first tuple in the where, replaces all unbound variables by None and makes a RDFLib query to the triple store. The result are all tuples in the triple store that conform to the pattern expressed by the first where pattern tuple. For each of those a child node is created, by handing over the rest of the triples in the where clause, and a binding where some of the None values are replaced by ‘real’ RDF resources. The children follow the same process recursively. There are several ways for the recursion to stop:

  • though there is still a where pattern to handle, no tuples are found in the triple store in the process. This means that the corresponding branch does not produce valid results. (In the implementation, such a node is marked as ‘clashed’). The same happens if, though a tuple is found, that tuple is rejected by the constraint function assigned to this tuple (the “per-tuple” constraint).
  • though there are no statements to process any more in the where clause, there are still unbound variables
  • all variables are bound and there are no more patterns to process. Unless one of the global constraints reject this binding this yields ‘successful’ leaves.

The third type of leaf contains a valid, possible query result for the unbound variables. Once the expansion is complete, the collection of the results becomes obvious: successful leaves are visited to return their results as the binding for the variables appearing in the select clause; the non-leaf nodes simply collect and combine the results of their children.

The implementation of the ‘optional’ feature follows the semantic description. A pre-processing step separates the ‘regular’ and ‘optional’ select and where clauses. First a regular expansion is done; then, separate optional expansions (for each optional clauses) are attached to each successful leaf node (obviously, by binding all variables that can be bound on that level). The collection of the result follows the same mechanism except that if the optional expansion tree yields no results, the real result tuples are padded by the necessary number of None-s.

author:Ivan Herman ivan@ivan-herman.net
date:2005/10/10 15:40:35

This software is available for use under the W3C Software License.