quinta-feira, 3 de junho de 2010

Graph Database Tutorial

Operations on a Graph Database (Part 1 - Nodes)

Graph databases are still quite unfamiliar to many developers. This is the first post in a series discussing the operations a graph database makes available to the developer. Just like there are only so many different things you can do on a relational database (like CREATE TABLE or INSERT), there are only so many things you can do on a graph database. It is worth looking at them one at a time, and that’s the goal of this series. This first post is on creating and deleting nodes.

To recap, a graph database contains nodes and edges, or MeshObjects and Relationships (as we call them in InfoGrid), or Instances and Links (as the UML would call them), or Resources and Triples (as the semantic web folks would call them), or boxes and arrows (as we draw them on a white board).

Nodes are those objects in a graph database that can stand on their own, they don’t depend on anything else. Edges are those objects that depend on the existence of (typically two) other objects, their source and their destination; we think of edges as connecting nodes.

To create a node in a graph database is one of its basic operations. For example, in InfoGrid, you can simply say:

MeshObject createMeshObject()


and voila, you have one. Similarly, you can delete a node by saying:



deleteMeshObject( MeshObject toDelete )


There are few conditions around those operations, such as that you have to have a transaction open, and that you have to have access rights to actually perform this operation, but that goes without saying.



When deleting a node, the graph database may require you to first delete all edges connected to the node before you get to delete it. Or, it may “ripple delete” all connected edges as part of the delete operation. There are some differences in the various graph database products on this; neither will make much of a difference to the developer.



If the graph database enforces a model (aka schema), as some graph databases do, you may need to make sure you don’t attempt to delete a node in a way that the schema would be violated. For example, if the schema says “an Order must be placed by exactly one Customer”, and you are attempting to delete the node representing the Customer, the graph database may prevent you from doing that as long as there still are nodes representing Order related to the Customer node. We’ll discuss schemas and graph databases in more detail in a later post.



For now, we learned two basic operations on a graph database:




  • create node


  • delete node.



 



Operations on a Graph Database (Part 2 - Edges)



In the first post of this series, we looked at creating and deleting Nodes. Today we are looking at Edges.



Unlike simpler NoSQL data stores like key-value stores, graph databases not only manage nodes, but also edges. Edges are things that connect two other data elements, and graph datastores have them as a basic element managed by the store. Think of them as the line between two boxes; that’s exactly what they are.



Edges often take developers a while to get used to, because there isn’t much precedent in the world of software. Even the so-called “relational database” doesn’t actually have “relationships” as a first-class concept: we have to infer them from primary/foreign key declarations; and that only works if developers actually declare them, which is not all that common.



Edges don’t exist in normal code either. Pretty much all mainstream programming languages only have pointers, not relationships aka edges. Edges are bidirectional, managed things, while pointers are one-directional and not managed at all. Let’s take an example (using a simplified version of the InfoGrid API, see the FirstStep example for complete code of a basic URL tagging application):



MeshObject customer = createMeshObject(); // create first node, called MeshObject in InfoGrid
MeshObject order = createMeshObject(); // create second node
customer.relate( order );


What did we just do?



We created a customer object, and an order object, and then we said the two are related. (The graph database makes sure the objects get persisted automatically when the Transaction is closed; not shown here as we try to stay on topic.)



If we had to do that in straight Java, we’d do something like this:



Customer customer = new Customer();
Order order = new Order();
customer.addOrder( order );
order.setCustomer( customer );


and we’d have to write the code to manage the edge ourselves, such as:



class Customer {
...
private List<Order> ordersOfThisCustomer = new ArrayList<Order>();
}
class Order {
...
private Customer customerPlacingThisOrder;
}


The question is: why do we have to do all this work for a simple 1-N relationship between customers and orders? The graph database API is much better: for one, it lets the database worry about how and when to save and restore the involved objects. It could, for example, (as InfoGrid does), decide to restore from disk the Customer object but not the Order object for some time because the application code does not currently need to know the Customer’s orders. And referential integrity is always guaranteed. For example:



customer.traverseToNeighbors().getSingleMember(); // returns the single Order object
order.traverseToNeighbors().getSingleMember(); // returns the single Customer object

// now we delete the edge
customer.unrelate( order );

customer.traverseToNeighbors().getSingleMember(); // returns null
order.traverseToNeighbors().getSingleMember(); // returns null


If there is no graph database involved, we need to do it manually, like this:



customer.removeOrder( order );
order.setCustomer( null );


… and hope that we don’t forget one of those calls, because then referential integrity would be out the window, and the next application crash is a certainty.



Imagine if we wanted to restore the Customer and the Order object at different times from disk. Without help from sophisticated run-time infrastructure like a graph database, band-aid-technologies such as object-relational mapping is most likely going to create a separate instance for, say, the restored Order object, and code such as List.remove( … ) is not going to work because we have two Java objects in memory that represent the same order. (Shudder.)



Of course, code could be written to manage all of this manually, but it’s much better if the platform takes care of it.



[The astute reader will notice that the plain Java example has one advantage: it provides type safety. I'll have to say more about this in an upcoming post about types.]



So: after working with graph databases for a while, many people believe that edges are actually the much more interesting and useful concept than nodes. Just like many data modelers think that the value of a data model is often more in the way the entities are connected by relationships than the details of the entities. Automatic management of relationships make things simple, and that’s what any good database should do. Developers have enough to worry about, and graph databases provide real help here.



 



Operations on a Graph Database (Part 3 - Types)



We need to talk about properties, but before we can do that, we need to talk about types. In a previous post we looked at the various alternatives for typing and their pros and cons. Graph databases may take the two following approaches to typing, and anything in between:




  • No types at all: nodes, edges and properties have no types; anything goes.


  • Fully typed: nodes, edges and properties are all strongly typed; the graph database refuses to perform an operation that is not consistent with the type system.



There are some graph databases on the market that require types for some elements (edges, for example) but don’t allow types for others (nodes, or properties), so hybrids are possible.



A totally untyped graph database has it easy: it allows an arbitrary number of properties, distinguished only by name and with arbitrary values. So the operations are:



Defined on Node and/or Edge:



setProperty( String label, Object value );  // returns nothing
getProperty( String label ); // returns the value
getAllProperties(); // returns a set of all current property labels


A fully typed graph database needs to offer a much larger set of operations. They are:



Defined on Node:



bless( NodeType type );   // returns nothing
unbless( NodeType type ); // returns nothing
getTypes(); // returns a set of all current NodeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type ); // returns the value of the Property


Defined on Edge:



bless( EdgeType type );   // returns nothing
unbless( EdgeType type ); // returns nothing
getTypes(); // returns a set of all current EdgeTypes
setProperty( PropertyType type, Object value ); // returns nothing
getProperty( PropertyType type ); // returns the value of the Property


To determine the available properties, one examines the meta-data available in the type system:



Defined on NodeType and on EdgeType



getPropertyTypes();   // returns a set of all possible properties defined by the type


The abstract API I’m listing here is the general case for a fully dynamically-typed hypothetical graph database. No product that I’m aware of implements them all, but InfoGrid comes close. If you read the JavaDoc for MeshObject, for example (MeshObject is InfoGrid’s name for Node), you will find all of the above Node operations. The operations are also there for Relationships (InfoGrid’s name for Edge) except that for efficiency reasons 1) they are defined on MeshObject, and 2) InfoGrid does not support properties on edges. (InfoGrid V1 used to, but we decided to eliminate that support; will talk about that some other time).



If a graph database was fully typed but using a static type system, we obviously could not bless either nodes or edges with new types at run-time as this API defines. (InfoGrid V1 used to be that way; InfoGrid V2 is fully dynamic).



In the next part, we’ll talk about properties then.



 



Operations on a Graph Database (Part 4 - Properties)



Today we’re looking at properties. There are a few different philosophies that a graph database might employ.



1. The purists often argue that properties aren’t needed at all: all properties can be modeled as edges to separate nodes, each of which represents a value. That’s of course true at some level: instead of a node representing a person, for example, that “contains” the person’s FirstName, LastName and DateOfBirth, one could create two String nodes and a TimeStamp node, and connect the with edges representing “first name”, “last name” and “date of birth”.



The non-purists counter that for practical purposes, it is much simpler to think of these data elements as properties instead of as independent things that are related. For example, it makes deletion of the Person much simpler (and we don’t need to implement cascading delete rules). Also, there are performance tradeoffs: if the three properties are stored with their owning node, for example, a single read is required to restore from disk the node and all of its properties. This would require at least 4 (perhaps 7, depending on how edges are stored in the graph database) reads if stored independently.



In InfoGrid, we believe in properties. We don’t prevent anybody from creating as many edges as they like, of course, but think that properties definitely have their uses.



2. Properties have to be named in some fashion, and the simplest approach — used by a number of graph database projects — is to give them a String label as a name. Correspondingly, the essence of the property API using Strings as labels would look like this:



public Object getPropertyValue( String name );


public void setPropertyValue( String name, Object value );


The advantage of this model is obviously that it is very simple. The disadvantage is that for complex schemas or models created by multiple development teams, name conflicts and spelling errors for property names occur more frequently than one would like. At least that is our experience when building InfoGrid applications, which is why we prefer the next alternative:



3. Properties are identified by true meta-data objects. We call them PropertyTypes, and they are part of what developers define when defining an InfoGrid Model. So the InfoGrid property API looks like this:



public Object getPropertyValue( PropertyType name );


public void setPropertyValue( PropertyType name, Object value );


We’ll have more to say on the subject of meta-data and Models in a future post.



Finally, we need to discuss what in a graph database can carry properties. Everybody other than the purists (see above) agree that nodes (called MeshObjects in InfoGrid) can carry properties. Some graph database projects (like the now-obsolete InfoGrid V1) also allow properties on edges (called Relationships in InfoGrid). Others (InfoGrid today) do not allow that.



It may sound peculiar that we had what looks like a more powerful approach in an earlier InfoGrid version but not any more. Here is what we observed in our practice with InfoGrid:




  • Properties on edges are fairly rare compared to Properties on nodes. We’ve been involved in several projects over the years where the Models were substantial and not a single property was found on any edge; nor did anybody ask for one.


  • If a property is needed on an edge, there is an easy workaround known as “associative entity” in data modeling circles: simply create an intermediary node that carries the property.


  • The deciding factor was performance: if properties are rarely needed on edges, it is possible to traverse from one node to a neighbor node in a single step. If properties are needed on edges, the edge needs to be represented as a separate object, and a traversal from one node to its neighbor requires two steps: from the start node to the connecting edge, and from the edge to the destination node. So not having properties on edges can improve performance by a 100%. Which is why we got rid of them for InfoGrid V2.



In the next post, we will look at data types for properties.



 



Operations on a Graph Database (Part 5 - Identifiers)






Well, “identifiers” aren’t much of an “operation”, but there are some operations related to identifiers, thus the title.



All first-class objects in a graph database typically have a unique identifier. This means nodes have unique identifiers, and for those graph databases that represent edges as distinct objects (see previous discussion on the pros and cons), they have unique identifiers, too.



This means we can ask a node for their identifier, remember the identifier, and later find the node again by looking it up in the graph database. In InfoGrid, this looks as follows:



MeshObject someNode = ...; // some MeshObject aka Node


MeshObjectIdentifier id = someNode.getIdentifier();


and later we can do this:



MeshBase mb = ...; // some MeshBase


MeshObject nodeFoundAgain = mb.findMeshObjectByIdentifier( id );


As you can see, InfoGrid uses an abstract data type called MeshObjectIdentifier, which you can think of as String for a second. (see below.) In InfoGrid, all identifiers are long-lasting. This means, your object will still have the same MeshObjectIdentifier after you rebooted your database. This has some advantages, e.g. you can define well-known objects in your graph database to which you can easily return even weeks later.



Other graph databases may use different data types as identifiers (e.g. int or long), but the use of identifiers is common with the above operations. They may or may not be the same after rebooting of the database.



Why does the type of identifier matter? Well, it depends on the application you have in mind. For InfoGrid applications, we primarily care about web applications, specifically REST-ful web applications. And so InfoGrid web applications generally use MeshObjectIdentifiers that identical to the URLs of the application. Let’s make an example:



Assume you have a URL bookmarking application which runs at http://example.com/. Let’s say a user creates tag “books”, which can be found at URL http://example.com/books/. It would be most straightforward to create a MeshObject with MeshObjectIdentifier http://example.com/books/. Which is exactly what InfoGrid does by default. No impedance mismatch between URLs that the user sees, the objects in the application layer, and the database! This leads to dramatic simplification of development and debugging.



 



Operations on a Graph Database (Part 6 - Traversals)



Traversals are the most common operations on a graph database. They are just as important for graph databases as joins are for relational databases. But of course, they are something else as graphs are not tables.



A traversal in a graph database is uniquely described by two data items:




  • a start node


  • a specification for the traversal



A traversal always leads to a set of nodes. Depending on the structure of the graph being traversed, that set may contain many, one or zero nodes. For example, if we traverse from a node representing a Person to the nodes representing their grandchildren, we may or may not find any, depending on whether they have any.



From a given node, we can traverse in many different ways (i.e. same node, different traversal specifications). Or, given the same traversal specification, we can start with different nodes.



By way of analogy, consider street directions:




  • start node: my house


  • traversal specification: first turn left, then go either straight or left.



The result of this particular traversal is a single-element set containing the neighborhood park. If you had started at the same node (my house), but gone right first, you would not have arrived at the park. If you had started at a different node (somebody else’s house), you may or may not have arrived at the park. You may not have arrived anywhere (perhaps there is no left that one can take from certain houses). Or you might have arrived in multiple places (”go either straight or left” might not take you to the same part regardless which you take, but taken you into different directions.



Graph database products seem to differ on how to deliver to the developer the set of nodes that is the result of a traversal. In InfoGrid, all traversals produce a MeshObjectSet, which, as the name says, is a set of nodes. One can then iterate over that set, for example, subset it, unify it with another or ask how many elements it has. In other products, traversals produce an iterator directly which then can be queried for one member of the result set at a time. Regardless of API details, the result of a traversal is always a set (e.g. it can’t contain duplicates.)



Just like there are many ways of giving directions, traversal specifications can be captured in many different ways. In InfoGrid, we have — you guessed it — an abstract data type called TraversalSpecification and several different classes that implement that type, such as:




  • traverse by going to all direct neighbor nodes of the start node


  • go to all neighbor nodes related with a edge of a particular type in a particular direction (e.g. “traverse from employee to their manager(s)”)


  • go N steps in sequence, where each step can be any traversal specification


  • go N steps in parallel, and unify the resulting set


  • select a subset of the found nodes based on some criteria, etc.



The FirstStep example shows some simple traversals.



And just for simplicity, InfoGrid also allows traversals starting from a set of start nodes, not just one. So we can say things like this:



MeshObject me = ...;
MeshObjectSet myParents = me.traverse( childToParents );
MeshObjectSet myGrandParents = myParents.traverse( childToParents );


In our experience, working with sets makes complex traversals very easily understandable.



 



Operations on a Graph Database (Part 7 - Sets)



Sets are a core concept of most databases. For example, any SQL SELECT statement in a relational database produces a set. Sets apply to Graph Databases just as well and are just as useful:



The most frequently encountered set of nodes in a Graph Database is the result of a traversal. For example, in InfoGrid, all traversal operations result in a set like this:



MeshObject    startNode     = ...; // some start node
MeshObjectSet neighborNodes = startNode.traverseToNeighbors();


We might as well have returned an array, or an Iterator over the members of the set, were it not for the fact that there are well-understood set operations that often make our jobs as developers much simpler: like set unification, intersection and so forth.



For example, in a social bookmarking application we might want to find out which sites both you and I have bookmarked. Code might look like this:



MeshObject me  = ...; // node representing me
MeshObject you = ...; // node representing you

TraversalSpecification ME_TO_BOOKMARKS_SPEC = ...;
// how to get from a person to their bookmarks, see post on traversals
MeshObjectSet myBookmarks = me.traverse( ME_TO_BOOKMARKS_SPEC );
MeshObjectSet yourBookmarks = you.traverse( ME_TO_BOOKMARKS_SPEC );

// Bookmarks that you and I share
MeshObjectSet sharedBookmarks = myBookmarks.intersect( yourBookmarks );


Notice how simple this code is to understand? One of the powers of sets. Or, if you know what a “minus” operation is on a set, this is immediately obvious:



// Bookmarks unique to me
MeshObjectSet myUniqueBookmarks = myBookmarks.minus( yourBookmarks );


This is clearly much simpler than writing imperative code which would have lots of loops and if/then/else’s and comparisons and perhaps indexes in it. (And seeing this might put some concerns to rest that NoSQL databases are primitive because they don’t have a SQL-like query language. I’d argue it’s less the language but the power of sets, and if you have sets you have a lot of power at your fingertips.)



To check out sets in InfoGrid, try package org.infogrid.mesh.set. Clearly much more can be done than we have so far in InfoGrid, but it’s a very useful start in our experience.



 



Operations on a Graph Database (Part 8 - Events)



The database industry is not used to databases that can generate events. The closest the relational database has to events are stored procedures, but they never “reach out” back to the application, so their usefulness is limited. But events are quite natural for graph databases. Broadly speaking, they occur in two places:




  • Events on the graph database itself (example: “tell me when a transaction has been committed, regardless on which thread”)


  • Events on individual objects stored in the graph database (example: “tell me when property X on object Y has changed to value Z”, or “tell me when Node A has a new Edge”)



Events on the GraphDB itself are more useful for administrative and management purposes. For example, an event handler listening to GraphDB events can examine the list of changes that a Transaction is performing at commit time, and collect statistics (for example).



From an application developer’s perspective, events on the data are more interesting:



An example may illustrate this. Imagine an application that helps manage an emergency room in a hospital. The application’s object graph contains things such as the doctors on staff, the patients currently in the emergency room and their status (like “arrived”, “has been triaged”, “waiting for doctor”, “waiting for lab results” etc.) Doctors carry pagers. One of the requirements for application is that the doctor be paged when the status of one of their assigned patients changes (e.g. from “waiting for lab results” to “waiting for doctor”).



With a passive database, i.e. one that cannot generate events, like a typical relational database, we usually have to write some kind of background task (e.g. a cron job) that periodically checks whether certain properties have changed, and then sends the message to the pager. That is very messy: e.g. how does your cron job know which properties *changed* from one run to the next? Or we have to add the message sending code to every single screen and web service interface in the app that could possibly change the relevant property, which is just as messy and hard to maintain.



With a GraphDB like InfoGrid, you simply subscribe to events, like this:



MeshObject patientStatus = ...; // the node in the graph representing a patient's status
patientStatus.addPropertyChangeListener( new PropertyChangeListener() {
public void propertyChanged( PropertyChangeEvent e ) {
sendPagerMessage( ... );
});
}


The graph database will trigger the event handler whenever a property changed on that particular object. It’s real simple.