Why am I Mr. SPARQL?

October 9, 2020

EK Team

Over the past few years, I have gained a lot of experience working with graph databases, RDF, and SPARQL.^¹ SPARQL can be tricky for both new and experienced users as it is not always obvious why a query is returning unexpected data. After a brief intro to SPARQL, I will note some reminders and tips to consider when writing SPARQL queries to reduce the number of head-scratching moments. This blog is intended for users who have a basic knowledge of SPARQL.

What is SPARQL?

RDF is a W3C standard model for describing and relating content on the web through triples. It’s also a standard storage model for graph databases. The W3C-recommended query language for RDF is SPARQL. Similar to other query languages like SQL, SPARQL allows technical business analysts to transform and retrieve data from a graph database. Some graph databases support other query languages, but most support both RDF and SPARQL. You can find more detailed information in section 2 of our best practices for knowledge graphs and in our blog titled “Why a Taxonomist Should Know SPARQL.” Now that we have a basic understanding of SPARQL, let’s jump into some SPARQL recommendations.

SPARQL is Based on Patterns

SPARQL queries match patterns in the RDF data. In the WHERE clause of a query, you specify which triples to look for, i.e., which subjects, predicates, and objects you need to answer a question. When retrieving the identifiers of all people in a database, a new SPARQL user might write the query as follows:

SELECT ?id WHERE {
    ?s a :Person .
}

This is a common mistake for new SPARQL-ers, especially those coming from a SQL background. A SPARQL query only knows the patterns that you give it; it does not know the schema of your graph (at least in this instance). The above query never mentions ?id in its WHERE clause, so it has no way to know where to retrieve it from: each person still produces a result row, but ?id is left unbound (empty) in every one. Extend the query with an additional triple pattern to explicitly define where the ?id variable can be found:

SELECT ?id WHERE {
    ?s a :Person .
    ?s :identifier ?id .
}

The WHERE clause provides the pattern you wish to match, while the SELECT clause explicitly lists which variables from your WHERE clause you’d like to return.
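If you want to try the queries in this post yourself, a tiny dataset is enough. The following SPARQL Update is only a sketch: the http://example.org/ namespace, the two people, and their values are made up for illustration, and the unprefixed : names used throughout this post are assumed to resolve to that namespace. Only the predicate names come from the queries in this post.

```sparql
# Hypothetical namespace, chosen only for this example
PREFIX : <http://example.org/>

INSERT DATA {
    # A person with two cell numbers
    :alice a :Person ;
        :identifier "p1" ;
        :name "Alice" ;
        :cellNumber "555-0100" , "555-0101" .

    # A person with no cell number at all
    :bob a :Person ;
        :identifier "p2" ;
        :name "Bob" .
}
```

Against this data, the two-pattern query above returns both identifiers, while the version without the :identifier pattern returns two rows with ?id left empty. The name and cell number values are there for the queries in the next section.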

SPARQL Matches Patterns Exactly

I often find myself unexpectedly restricting or duplicating the results of a query. This is best explained with an example query: “Find the name and telephone number for all people in the database.”

SELECT ?name ?cellNumber WHERE {
    ?s a :Person .
    ?s :name ?name .
    ?s :cellNumber ?cellNumber .
}

The above SPARQL query only returns results for people who have a cell number. This might be what you want, but what if you were looking for a complete list of people regardless of whether they have a cell number? In SPARQL, you would have to wrap the cell number pattern in an OPTIONAL clause.

SELECT ?s ?name ?cellNumber WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
}

A person will also appear twice in the results if they have two numbers. If this isn’t the behavior you want, you will need to group the results on the person (?s) and combine the numbers.

SELECT ?s ?name (GROUP_CONCAT(?cellNumber) as ?numbers) WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
} GROUP BY ?s ?name

For simplicity, I also assumed that each person only has one name in the database, but you can expand this to meet your data needs.
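One detail worth knowing about GROUP_CONCAT is that it joins values with a single space unless told otherwise. The variation below is a sketch of the same query with an explicit separator and DISTINCT to drop repeated values. The PREFIX line uses a hypothetical namespace, and the choice of ", " as the separator is mine.

```sparql
# Hypothetical namespace, chosen only for this example
PREFIX : <http://example.org/>

SELECT ?s ?name
    # DISTINCT removes repeated numbers; SEPARATOR replaces the default space
    (GROUP_CONCAT(DISTINCT ?cellNumber; SEPARATOR=", ") AS ?numbers)
WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
} GROUP BY ?s ?name
```

If a person could have more than one name, ?name would need the same treatment (an aggregate in the SELECT clause and removal from the GROUP BY) instead of being used as a grouping key.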

When writing SPARQL queries, you have to be aware of your data model and know which predicates are required, optional, or multi-valued. If a predicate is required for every subject, you can match it in a pattern with no issues. If a predicate is optional, make sure you are not removing any results that you want. And, if a predicate is multi-valued, you might need to group results to avoid data duplication. It never hurts to run a query to check that your data model matches what you expect. This could lead you to find problems in your data transformation or loading process.
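A check like that can be sketched as a count of values per subject. The query below is one possible version for the :cellNumber predicate, assuming a hypothetical namespace for the : prefix; ?numberCount is a name I made up. Because COUNT only counts bound values, people without a cell number show up with a count of 0 rather than disappearing.

```sparql
# Hypothetical namespace, chosen only for this example
PREFIX : <http://example.org/>

SELECT ?s (COUNT(?cellNumber) AS ?numberCount) WHERE {
    ?s a :Person .

    # OPTIONAL keeps people who have no cell number in the results
    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
} GROUP BY ?s
ORDER BY DESC(?numberCount)
```

A count of 0 anywhere means the predicate is optional in practice, and a count above 1 means it is multi-valued, whatever the data model says on paper.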

Subqueries and Unions Can Save Complexity

Occasionally a query I am writing needs to cover a number of different conditions. An example query is, “Find all topics and countries that our content is tagged with that have a tagging score greater than 50.” This question is not too complex on its own but it helps emphasize the point.

You could write this query and go down the rabbit hole of IF and BIND as I initially did. A SPARQL IF statement allows you to select between two values based on a boolean (true or false) statement. BIND statements let you set the value of a variable. IF and BIND statements are very useful in certain situations for dynamically setting variables. The above query could be written as follows.

SELECT 
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    ?content :tagged ?tag .

    # Look up the term the tag is about and its type
    ?tag :about ?term .
    ?term a ?type .
    BIND(IF(?type = :Topic, ?term, ?null) as ?topic)
    BIND(IF(?type = :Country, ?term, ?null) as ?country)

    # Check the score
    ?tag :score ?score .
    FILTER(?score > 50)
} GROUP BY ?content

The query matches the type of each term associated with ?content and then sets the value of ?topic and ?country based on the type. Because ?null is never bound, whichever variable does not match the type is left unbound. We use a FILTER to restrict the tags to only those with a score greater than 50. In this case, the query answers the question through a nifty use of BIND and IF, but there are less complex solutions.
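For a smaller illustration of IF and BIND on their own, here is a minimal sketch that labels each tag by its score. The ?level variable, the "high" and "low" labels, and the namespace are all made up for this example.

```sparql
# Hypothetical namespace, chosen only for this example
PREFIX : <http://example.org/>

SELECT ?tag ?score ?level WHERE {
    ?tag :score ?score .

    # IF picks one of two values based on a boolean test;
    # BIND assigns the result to a new variable.
    BIND(IF(?score > 50, "high", "low") AS ?level)
}
```

If the test itself raises an error (for instance, when ?score is not a number), BIND leaves ?level unbound for that row instead of failing the whole query.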

As your queries and data get more complex, the RDF patterns that you need to match may not line up as nicely. In our case, the relationship between content and topics or countries is the same, so we only needed to include two lines of logic. A much simpler approach is to UNION together two subqueries or subpatterns. This allows the query to retrieve topics and countries separately, matching two different sets of RDF patterns.

SELECT 
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    {
        ?content :tagged ?tag .

        # Verify the tag is for a topic
        ?tag :about ?topic .
        ?topic a :Topic .

        # Check the score
        ?tag :score ?score .
        FILTER(?score > 50)
    } UNION {
        ?content :tagged ?tag .

        # Verify the tag is for a country
        ?tag :about ?country .
        ?country a :Country .

        # Check the score
        ?tag :score ?score .
        FILTER(?score > 50)
    }
} GROUP BY ?content

This breaks up the SPARQL query into two smaller queries that are much easier to approach without needing to worry about how to combine multiple sets of patterns in the same query. Additionally, this query could be optimized by using a subquery that retrieves the content and tags with a score above 50 before checking for the valid types.

SELECT
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    {
        SELECT ?content ?tag WHERE {
            ?content :tagged ?tag .

            # Check the score
            ?tag :score ?score .
            FILTER(?score > 50)
        }
    }
    {
        # Verify the tag is for a topic
        ?tag :about ?topic .
        ?topic a :Topic .
    } UNION {
        # Verify the tag is for a country
        ?tag :about ?country .
        ?country a :Country .
    }
} GROUP BY ?content

In this query, the results of the subquery are joined with the results of the UNION on ?tag, enabling us to still apply custom patterns to topics and countries. We use a subquery to avoid matching the ?content and ?tag values more than once, and the join ensures that every tag in the results is about a topic or a country.

Final SPARQL Thoughts

SPARQL is a robust query language for working with RDF data. Try not to overlook less common SPARQL features (such as the VALUES clause, the STRDT function, and the SAMPLE aggregate), and check whether your graph database has any proprietary functions that you can leverage for even more flexibility. As a more general recommendation, always take the time to step back and see if there’s a cleaner, more efficient way to retrieve the data you need.
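As a rough sketch of how VALUES, STRDT, and SAMPLE can fit together, the query below revisits the tagging example. It rests on assumptions that are mine rather than part of the data model in this post: the namespace is hypothetical, scores are imagined to be stored as plain strings such as "75", and ?rawScore and ?exampleTerm are made-up names.

```sparql
# Hypothetical namespace, chosen only for this example
PREFIX : <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?content ?type (SAMPLE(?term) AS ?exampleTerm) WHERE {
    # VALUES supplies the allowed types inline, in place of a UNION
    VALUES ?type { :Topic :Country }

    ?content :tagged ?tag .
    ?tag :about ?term .
    ?term a ?type .

    # STRDT builds a typed literal from a plain string and a datatype IRI.
    # Assumption for this sketch: scores are stored as strings such as "75".
    ?tag :score ?rawScore .
    BIND(STRDT(STR(?rawScore), xsd:integer) AS ?score)
    FILTER(?score > 50)
}
# SAMPLE returns one arbitrary term per group
GROUP BY ?content ?type
```

This returns one row per content item and type rather than one row per content item, so it is a different result shape from the UNION queries earlier in this post. SAMPLE makes no promise about which term it picks, which is fine when any representative value will do.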

Enterprise Knowledge writes more performant queries and designs data models to enable advanced graph solutions. If you can’t find your own Mr. SPARQL unicorn, and whether you are new to the graph space or looking to optimize your existing data, contact us to discuss how EK can help take your solution to the next level.

^¹If the title of this blog is familiar, that’s because it is a reference to an episode of The Simpsons. In one episode, a Japanese cleaning agency used Homer’s face (or one closely resembling it) as the logo of a brand called “Mr. Sparkle.” Homer calls up the brand and asks, “Why am I Mr. Sparkle?” One of my colleagues mentioned that he was reminded of this episode anytime he heard me discussing SPARQL with the rest of the EK Team.

I have been using SPARQL actively for the past 3 years and have come to recognize that it requires a unique mindset. There are some common gotcha moments and optimization techniques for improving queries, but writing the initial query requires an understanding of the RDF format, and piecing the query together is just as critical as making it more efficient. The most effective SPARQL developers in your organization will be the unicorns: the individuals with a knowledge of code who are able to adjust course quickly, hold complex logic in their heads, and enjoy the time it takes to solve puzzles.
