Linked Data Repo Fringe

From ECSWiki

Picture the scene...

Your line manager walks into your bay looking flustered. She looks at you and says, "The Vice Chancellor has been reading the Times Higher and he's got himself hooked on 'linked data'. He's told me he wants a demonstrator by the end of the week." You look at your calendar and reply,

"Sorry, but I've already got loads to do this week. I might have some time this afternoon to start looking at it, but that's all really."

"But we really need this demonstrator! If we make it convincing it could be our chance to lead an open data project in the university. It will make us both look good and you can have a break from Sharepoint for 6 months. Just see what you can come up with, OK?" Your mind drifts into a dream world free from the trappings of large infrastructure.

You start browsing around the web, getting to grips with the basics of linked data. Linked data looks quite complex and intimidating, but during your search you encounter a tool called Graphite. It looks very straightforward and lets you do simple things easily. This might be exactly the tool you need to get something up and running in the limited time available.


...Sadly, scenes like this are all too common when starting out with linked data. It's a big field, there are a lot of subtleties, the formats seem unnecessarily complex and the tutorial material is often completely impenetrable. Graphite is a library designed so that you can ignore a lot of the mess in linked data and just get on and produce something useful quickly. In this workshop we will give you a basic grounding in linked data principles and get you to join up some data from around the web so that you can see the benefits of linked data.

Grasping the basics

If you haven't used linked data before, there are a few key principles that will guide your understanding:

  • All linked data is described as triples. Every triple has a subject, a predicate and an object.

        Patrick         Buys              Ball
        Subject         Predicate         Object

  • A triple can be made of three IDs, or of two IDs and one literal.

        person-11337    relation-223      item-12344
        Subject (ID)    Predicate (ID)    Object (ID)

        person-11337    relation-223      "A Ball"
        Subject (ID)    Predicate (ID)    Object (Literal)


  • When you give something an ID you should make that ID globally unique.
    • person-11337 is unique in the institution, it might be globally unique but it is hard to be certain.
  • The easiest way to make a globally unique ID is to use a URI (like a URL, but it doesn't have to point to a web page).
    • Patrick's institution identifies him as http://id.ecs.soton.ac.uk/person/11337. We know that only the institution owns that domain and they have given Patrick a unique identifier within the institution.
  • Concepts and intangible things can and should still be identified using a globally unique identifier.
    • In the example, the concept of buying is given the ID relation-223. This means other people can describe things as being bought using the same vocabulary as us.
  • When you refer to something, try to use the same identifier that other people are using. That way everyone will know you are talking about the same thing.
  • When someone tries to resolve your URI you can use content negotiation to serve them a page which is readable by a human being explaining the URI. This means when you look at the URI in a web browser you see an explanation but when you look at it with data-aware software, you get the data.
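
The content negotiation idea in the last point can be sketched in a few lines of PHP. This is only an illustration: chooseRepresentation() is a hypothetical helper, not part of Graphite or any particular web server, and it ignores the quality values (q=) that a full implementation would take into account.

```php
<?php
// Illustrative sketch only: chooseRepresentation() is a hypothetical helper,
// not part of Graphite or of any particular web server.

/**
 * Decide which representation of a URI to serve, based on the Accept header.
 * Returns "data" if the client lists an RDF media type, otherwise "html".
 * Simplification: quality values (q=...) are ignored.
 */
function chooseRepresentation(string $acceptHeader): string
{
    // Media types that data-aware clients commonly ask for.
    $rdfTypes = ["text/turtle", "application/rdf+xml", "application/n-triples"];

    foreach (explode(",", $acceptHeader) as $part) {
        // Drop parameters such as ";q=0.9" and normalise the media type.
        $type = strtolower(trim(explode(";", $part)[0]));
        if (in_array($type, $rdfTypes, true)) {
            return "data";
        }
    }
    return "html";
}

// A web browser typically asks for HTML, data-aware software for an RDF format.
print chooseRepresentation("text/html,application/xhtml+xml") . "\n";
print chooseRepresentation("text/turtle;q=0.9, */*;q=0.1") . "\n";
```

The first call corresponds to a person looking at the URI in a browser, the second to software that wants the triples.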

Let's look at some data

Here are two triples about the University of Southampton:

<http://id.southampton.ac.uk/> <http://www.w3.org/2000/01/rdf-schema#label> "University of Southampton"@EN .
<http://id.southampton.ac.uk/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/org#FormalOrganization> .

This format is called N-Triples. It is the simplest way to write down triples and it is commonly used, but it is quite verbose.

The first says http://id.southampton.ac.uk/ (the University of Southampton's globally unique identifier) has a label "University of Southampton" which is in English. Note that the university may have other labels, and it may have that label in different languages.

The second says http://id.southampton.ac.uk/ is of type formal organisation. The globally unique identifiers for the label, type and formal organisation are maintained by other organisations (in this case W3C). Anyone can use these terms to describe their own data.
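
To make the shape of an N-Triples line concrete, here is a minimal sketch that builds the two lines above from their parts. toNTriple() is a hypothetical helper written for this page, not a Graphite function, and it only escapes backslashes and double quotes in literals, which is less than a complete serialiser would do.

```php
<?php
// Hypothetical helper for illustration; not part of Graphite.

/**
 * Format one triple as an N-Triples line.
 * $object is either ["uri" => "..."] or ["literal" => "...", "lang" => "..."].
 * Simplification: only backslashes and double quotes are escaped in literals.
 */
function toNTriple(string $subject, string $predicate, array $object): string
{
    if (isset($object["uri"])) {
        // IDs (URIs) are written inside angle brackets.
        $o = "<" . $object["uri"] . ">";
    } else {
        // Literals are written inside double quotes, with an optional language tag.
        $o = '"' . strtr($object["literal"], ["\\" => "\\\\", '"' => '\\"']) . '"';
        if (isset($object["lang"])) {
            $o .= "@" . $object["lang"];
        }
    }
    return "<$subject> <$predicate> $o .";
}

print toNTriple(
    "http://id.southampton.ac.uk/",
    "http://www.w3.org/2000/01/rdf-schema#label",
    ["literal" => "University of Southampton", "lang" => "EN"]
) . "\n";

print toNTriple(
    "http://id.southampton.ac.uk/",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    ["uri" => "http://www.w3.org/ns/org#FormalOrganization"]
) . "\n";
```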

We can represent the same data in a different format. The format below is called Turtle (ttl):

<http://id.southampton.ac.uk/> <http://www.w3.org/2000/01/rdf-schema#label> "University of Southampton"@EN ;
                               <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/org#FormalOrganization> .

We can then add prefixes for common namespaces:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ns0: <http://www.w3.org/ns/org#> .

<http://id.southampton.ac.uk/> rdfs:label "University of Southampton"@EN ;
                               rdf:type ns0:FormalOrganization .

This is actually more verbose than the previous example, but for large amounts of data prefixes really reduce file sizes and aid readability.
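
A prefix is just shorthand: the prefixed name is expanded back to the full URI by swapping the prefix for its namespace. The sketch below shows that expansion. expandPrefixedName() is a hypothetical helper for illustration only; Graphite and Turtle parsers have their own namespace handling.

```php
<?php
// Hypothetical helper for illustration; real parsers have their own namespace handling.

/**
 * Expand a prefixed name such as "rdfs:label" to a full URI.
 * Names with no colon or an unknown prefix are returned unchanged.
 */
function expandPrefixedName(string $name, array $prefixes): string
{
    $parts = explode(":", $name, 2);
    if (count($parts) === 2 && isset($prefixes[$parts[0]])) {
        return $prefixes[$parts[0]] . $parts[1];
    }
    return $name;
}

// The same prefixes as in the Turtle example above.
$prefixes = [
    "rdf"  => "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs" => "http://www.w3.org/2000/01/rdf-schema#",
    "ns0"  => "http://www.w3.org/ns/org#",
];

print expandPrefixedName("rdfs:label", $prefixes) . "\n";
print expandPrefixedName("ns0:FormalOrganization", $prefixes) . "\n";
```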

By default, EPrints (3.2.1 or later) exports linked data.

You can see a large source of data for one of Patrick's publications as N-Triples here: http://eprints.soton.ac.uk/cgi/export/eprint/272063/RDFNT/eps-eprint-272063.nt (20 kB)

You can see the same data as N3 (a format very like Turtle) with prefixes here: http://eprints.soton.ac.uk/cgi/export/eprint/272063/RDFN3/eps-eprint-272063.n3 (10 kB)

The Graphite Browser

Whilst Turtle with prefixes makes the data slightly easier to read, it is still quite hard going, and data won't always be available in that format. The Graphite website provides a linked data browser so that you can more easily see what you are working with. You can use it by going to http://graphite.ecs.soton.ac.uk/browser/.

Copy and paste a URL into the box and it will format the data in a more readable way. Try using http://eprints.soton.ac.uk/cgi/export/eprint/272063/RDFN3/eps-eprint-272063.n3 , http://eprints.soton.ac.uk/cgi/export/eprint/272063/RDFNT/eps-eprint-272063.nt or another URL from the examples above. You can click links in the data to browse to other data documents.

Installing the tool

Graphite's install process is quick and easy. In a PHP directory on a web server, simply download the tarball and untar it using:

tar xvzf <tarballfilename>

Now all you have to do is include the "Graphite.php" and "arc/ARC2.php" files in your PHP scripts.

For more information and alternative ways to install Graphite see the documentation at http://graphite.ecs.soton.ac.uk/#installation

The Exercise

Hello world

Now that we have some data, let's build a little page using it. In your web directory, create a new PHP file, include the Graphite library and create a new graph.

<?php

include_once("arc/ARC2.php");
include_once("Graphite.php");

$graph = new Graphite();

?>

You can then load your data file using the $graph->load() function; Graphite will parse it into an internal object structure.

$graph->load("http://eprints.soton.ac.uk/cgi/export/eprint/272063/RDFNT/eps-eprint-272063.nt");

Now you want to check that it parsed correctly and have a little look at your data. Graphite's $graph->dump() renders your RDF to a string of HTML to make it a bit easier to read. You can use it to see what you've got.

print $graph->dump();

If you are using PHP on the command line rather than on a website, you can use:

print $graph->dumpText();

get()

OK, now let's get on and actually do something. Get hold of the EPrint using $graph->resource() and use $resource->get() to get the eprint's title.

 $resource = $graph->resource("http://eprints.soton.ac.uk/id/eprint/272063");

 print $resource->get("http://purl.org/dc/terms/title")."<br />\n";

Namespace mapping

That get() was a little more long-winded than we would like. We can use a namespace mapping so that we can refer to a term using a short form such as dct:title. This is particularly beneficial if you are going to be working a lot in one namespace.

$graph->ns( "dct", "http://purl.org/dc/terms/" );

You can now do:

 print $resource->get("dct:title")."<br />\n";

Graphite predefines a number of common namespaces, so you don't need to add any of the following in order to use them, but you might have to add less common ones.

Predefined namespaces in the current version of Graphite:

foaf = http://xmlns.com/foaf/0.1/
dc = http://purl.org/dc/elements/1.1/
dcterms = http://purl.org/dc/terms/
dct = http://purl.org/dc/terms/
rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs = http://www.w3.org/2000/01/rdf-schema#
owl = http://www.w3.org/2002/07/owl#
xsd = http://www.w3.org/2001/XMLSchema#
cc = http://creativecommons.org/ns#
bibo = http://purl.org/ontology/bibo/
skos = http://www.w3.org/2004/02/skos/core#
geo = http://www.w3.org/2003/01/geo/wgs84_pos#
sioc = http://rdfs.org/sioc/ns#
oo = http://purl.org/openorg/

has() data?

To find out if get() would return a value, use has().

if( $resource->has( "bibo:presentedAt" ) )
{
    print "This paper was presented at: ".$resource->get( "bibo:presentedAt" )."<br />\n";
}

You wouldn't want to print "This paper was presented at: " if it had not been presented anywhere, so it's good to test. You can use getLiteral() as an alternative to get(). When no value is available, get() will return "[NULL]" whereas getLiteral() returns an empty string. The "[NULL]" response can be useful to show something has gone wrong. You will often see it if you mistakenly assumed something would always have a value.

Loops over multiple values

To loop over all the values, use all(). This returns a ResourceList object which is iterable and also has some functions you can call on it.

foreach( $resource->all( "dcterms:creator" ) as $creator )
{
      print $creator->link()."<br />\n";
}

You should now have a list of the creators.

prettyLink()

This is a handy shortcut that creates a nice-looking link to a resource, using its label if it has one.

foreach( $resource->all( "dcterms:creator" ) as $creator )
{
      print $creator->prettyLink()."<br />\n";
}

If the resource is a tel: or mailto: URI it will link it nicely with an appropriate icon.

Another use for load()

So far we have only been working in a single document. One of the great things about Linked Data is that because all of our globally unique identifiers are resolvable we can follow the links to get more data.

Before:

print $graph->dump();

After:

$resource->all( "bibo:presentedAt" )->load();

print $graph->dump();

You will notice that after using the load command you have more data than you had before. Graphite has followed the links to get more information about the events at which this item was presented. Now that you have more information, you can do something more useful, such as showing what else was presented at the same event:

$resource->all( "bibo:presentedAt" )->load();
$events = $resource->all( "bibo:presentedAt" );
$events->all("-bibo:presentedAt")->load();


print "Presented along side:<br />\n";

foreach( $events->all("-bibo:presentedAt") as $item )
{
      print $item->prettyLink()."<br />\n";
}

You should now have a list of the items from the repository presented at Open Repositories 2011. There are two lines of loading. The first loads the information about the event this paper was presented at. The second uses the "-" convention to reverse the predicate: it takes the event and loads all items presented at it. If we do not do the second load, we will not have the titles of the other items to use with prettyLink().
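
If the "-" convention feels a little magical, the following sketch shows the idea on a plain PHP array of triples. It is an illustration of the concept only, not how Graphite is implemented: follow() is a hypothetical function, the example.org URIs are made-up data, and the predicate is written in its short form as a plain string.

```php
<?php
// Illustration of the idea only; this is not how Graphite is implemented.
// follow() is hypothetical and the example.org URIs are made-up data.

/**
 * Follow a predicate from a node.
 * Normally this returns the objects of triples whose subject is $node.
 * With a leading "-" the direction is reversed: it returns the subjects
 * of triples whose object is $node.
 */
function follow(array $triples, string $node, string $predicate): array
{
    $inverse = (substr($predicate, 0, 1) === "-");
    if ($inverse) {
        $predicate = substr($predicate, 1);
    }

    $found = [];
    foreach ($triples as [$s, $p, $o]) {
        if ($p !== $predicate) {
            continue;
        }
        if (!$inverse && $s === $node) {
            $found[] = $o;
        }
        if ($inverse && $o === $node) {
            $found[] = $s;
        }
    }
    return $found;
}

$triples = [
    ["http://example.org/paper/1", "bibo:presentedAt", "http://example.org/event/a"],
    ["http://example.org/paper/2", "bibo:presentedAt", "http://example.org/event/a"],
    ["http://example.org/paper/3", "bibo:presentedAt", "http://example.org/event/b"],
];

// Forwards: which events was paper 1 presented at?
$events = follow($triples, "http://example.org/paper/1", "bibo:presentedAt");

// Backwards: which papers were presented at the first of those events?
foreach (follow($triples, $events[0], "-bibo:presentedAt") as $paper) {
    print $paper . "\n";
}
```

Note that the backwards step can only find papers whose triples are already in the array, which is why the Graphite example needs its second load().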

Summing up

You have looked at the Graphite linked data library. Graphite is a simple, powerful tool which lets you get on and code against linked data. There are a number of features which we have not discussed, but you can see them all at http://graphite.ecs.soton.ac.uk/. Graphite is by no means the only linked data library out there. There are a range of others and we recommend you try them out. Bear in mind that Graphite simplifies a lot of the subtleties of linked data, so you may need a more in-depth understanding before you can fully move to other tools. However, Graphite's simplicity comes at a price: for large datasets you will find it can be quite slow and memory intensive. It is simply not designed for very large volumes of data. That said, there is an awful lot you can do with Graphite, so don't be afraid to try. Good luck in the linked data world and happy hacking :-)

If you liked this you might also like...

More and more data around the web is being published as linked data. The government has a large linked data initiative to publish all government data openly. There are loads of data sources out there, but we've shortlisted a few.