Making Master Data Management Fun with Neo4j - Part 1

Joining multiple disparate data sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a graph database (Neo4j) and an interesting dataset (developer-oriented collaboration sites) to make MDM an enjoyable experience. This approach will allow you to quickly and sensibly merge data from different sources into a consistent picture and query across the data efficiently to answer your most pressing questions.

To start, I’ll import just one data source: StackOverflow questions tagged with neo4j, along with their answers. In future blog posts I will discuss how to integrate other data sources into a single graph database to provide a richer view of Neo4j developers’ online social interactions.

I’ve created a GraphGist to explore questions about the imported data, but in this post I’d like to briefly discuss the process of getting data from StackOverflow into Neo4j.

The model

Modeling the StackOverflow data was mostly straightforward. I decided to stick with questions, answers, tags, and users for now. Had I wanted more detail, I could have included comments and edits, but a lot can be shown without them.

Also, at first I had prefixed all labels with StackOverflow (so that, for example, question nodes had the label StackOverflowQuestion). This was an attempt to avoid conflicts with later imports where I might want to use the same label (User is a prime example). After some feedback, though, I gave all nodes a StackOverflow label in addition to their specific labels. This is a simple but powerful way to model different data sources: a query can match on the source label, the type label, or both.
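To make the labeling idea concrete, here is a minimal sketch of how a list of labels turns into a Cypher node pattern. The helpers label-pattern and match-query are hypothetical names made up for this illustration and are not part of the import script; they only build query strings and never touch the database.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: turns ["StackOverflow" "User"] into
;; ":`StackOverflow`:`User`", the way several labels are written in Cypher.
(defn label-pattern [labels]
  (str/join (map #(str ":`" % "`") labels)))

;; Hypothetical helper: builds a MATCH that only returns nodes
;; carrying every one of the given labels.
(defn match-query [labels]
  (str "MATCH (node" (label-pattern labels) ") RETURN node"))

;; Users that came from StackOverflow only:
(match-query ["StackOverflow" "User"])
;; => "MATCH (node:`StackOverflow`:`User`) RETURN node"

;; Users from every imported source:
(match-query ["User"])
;; => "MATCH (node:`User`) RETURN node"
```

With a shared User label and a per-source StackOverflow label, narrowing a query to one source is just a matter of adding one more label to the pattern.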

The import code

I decided to try out Clojure for the import (GitHub repo), partly to further demonstrate Neo4j as a central place for collecting and analyzing data from different sources together. Using my script I was able to import all 6,571 neo4j questions along with their associated tags, answers, and users.

The code may be easy to read for those used to a programming language that is like lego made out of lego or wearing your underwear on the outside, but for everybody else I’ll explain it.

The actual import revolves around a function called merge-props:

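(defn merge-props
  "MERGEs a node with the given label on merge-prop, then sets all of props
  on it. Returns nil without querying when props has no value for merge-prop."
  [label props merge-prop neo4j-conn]
  (when (some? (props merge-prop))
    (cypher/tquery
      neo4j-conn
      ;; {props} is the legacy Cypher parameter syntax; newer Neo4j versions use $props
      (str "MERGE (node:`" label "` {" merge-prop ": {props}." merge-prop
           "}) SET node:StackOverflow, node = {props} RETURN node")
      {"props" props})))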

The idea behind merge-props is to use Neo4j’s MERGE to import an object if it doesn’t already exist. You specify a Neo4j label, a set of property/value pairs, and the property for MERGE to check. For a User, this might be called like this:

(merge-props
  "User"
  {"user_id" "12345", "display_name" "Brian Underwood", "reputation" "1791"}
  "user_id"
  neo4j-conn)

And it would generate the following Cypher:

MERGE (node:`User` {user_id: {props}.user_id})
  SET node:StackOverflow, node = {props}
  RETURN node
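One way to check what merge-props sends without a running database is to pull the string building into a pure function. The sketch below does that. The names merge-query and merge-statement are assumptions made for this illustration and do not appear in the import script; the query text itself is copied from merge-props.

```clojure
;; Hypothetical helper: builds the same Cypher string as merge-props,
;; but takes no connection and runs nothing.
(defn merge-query [label merge-prop]
  (str "MERGE (node:`" label "` {" merge-prop ": {props}." merge-prop
       "}) SET node:StackOverflow, node = {props} RETURN node"))

;; Hypothetical helper: pairs the query with its parameters, or returns nil
;; when the property to MERGE on is missing (mirroring the nil check in merge-props).
(defn merge-statement [label props merge-prop]
  (when (some? (get props merge-prop))
    {:query  (merge-query label merge-prop)
     :params {"props" props}}))

(merge-query "User" "user_id")
;; => "MERGE (node:`User` {user_id: {props}.user_id}) SET node:StackOverflow, node = {props} RETURN node"

;; No user_id, so there is nothing to MERGE on:
(merge-statement "User" {"display_name" "Brian Underwood"} "user_id")
;; => nil
```

Because the MERGE only matches on the one identifying property, running the same statement twice should update the existing node rather than create a duplicate.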

The main import functions are import-question, import-tag, import-answer, and import-user, which are responsible for defining the properties to import and for calling other import- functions to MERGE dependent relationships. All of that is then wrapped in a loop that imports pages of questions until the API returns false for has_more.
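The paging loop described above can be sketched roughly as follows. This is not the code from the repo: import-all-pages, fetch-page, and import-item! are hypothetical names, and the API is replaced with an in-memory map so the example runs on its own. The only details taken from the real API are the items and has_more fields of the response.

```clojure
;; Hypothetical sketch: fetch-page takes a page number and returns a map
;; with "items" and "has_more"; import-item! is called once per item.
;; Returns the number of items imported.
(defn import-all-pages [fetch-page import-item!]
  (loop [page 1
         imported 0]
    (let [{:strs [items has_more]} (fetch-page page)
          imported (+ imported (count items))]
      (doseq [item items]
        (import-item! item))
      (if has_more
        (recur (inc page) imported)
        imported))))

;; Stand-in for the API: a map from page number to response.
;; A Clojure map can be called as a function of its key.
(def fake-pages
  {1 {"items" [{"question_id" 1} {"question_id" 2}] "has_more" true}
   2 {"items" [{"question_id" 3}] "has_more" false}})

;; Stand-in for import-question: just records the ids it was given.
(def seen (atom []))

(import-all-pages fake-pages #(swap! seen conj (% "question_id")))
;; => 3
@seen
;; => [1 2 3]
```

Passing the fetch and import steps in as functions keeps the loop easy to test; a real run would pass a function that calls the API and one that does the MERGE.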

To speed up processing I considered taking advantage of Clojure’s natural ability to execute in parallel, as well as using Neo4j’s transactional endpoints to batch Cypher statements. However, since all of the questions were imported in about 20 minutes, I’m happy for now.
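For the curious, here is a rough sketch of what the batching half of that idea might look like. I did not implement this for the import, so treat it as an assumption-laden illustration: statement and batch-payloads are hypothetical names, and the sketch only groups statements into request bodies shaped like the ones Neo4j’s transactional endpoint accepts (a "statements" list whose entries each have a "statement" and "parameters"). Actually sending them over HTTP is left out.

```clojure
;; Hypothetical helper: one Cypher statement plus its parameters.
(defn statement [query params]
  {"statement" query "parameters" params})

;; Hypothetical helper: groups statements into request bodies of at most
;; batch-size statements each, so each body could be sent in one request.
(defn batch-payloads [batch-size statements]
  (map (fn [batch] {"statements" (vec batch)})
       (partition-all batch-size statements)))

;; Five made-up user statements, batched two at a time.
(def user-statements
  (for [id ["1" "2" "3" "4" "5"]]
    (statement "MERGE (node:`User` {user_id: {props}.user_id}) SET node:StackOverflow, node = {props}"
               {"props" {"user_id" id}})))

(map #(count (% "statements")) (batch-payloads 2 user-statements))
;; => (2 2 1)
```

Batching like this would trade one HTTP round trip per node for one per batch. Whether that is worth the extra complexity depends on how long the import takes without it.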

Next time

This post was just an introduction to the idea of Master Data Management in Neo4j. In the next post I’ll bring in another data source, show how I linked the data together, and demonstrate the sort of bigger picture that one can get from this approach.

UPDATE: Part 2 now available
