Joining multiple disparate data-sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a graph database (Neo4j) and an interesting dataset (developer-oriented collaboration sites) to make MDM an enjoyable experience. This approach will allow you to quickly and sensibly merge data from different sources into a consistent picture and query across the data efficiently to answer your most pressing questions.
To start I’ll just be importing one data source: StackOverflow questions tagged with `neo4j` and their answers. In future blog posts I will discuss how to integrate other data sources into a single graph database to provide a richer view of the world of Neo4j developers’ online social interactions.
I’ve created a GraphGist to explore questions about the imported data, but in this post I’d like to briefly discuss the process of getting data from StackOverflow into Neo4j.
Modeling the StackOverflow data was mostly straightforward. I decided to stick with questions, answers, tags, and users for now. If I wanted to get more complex I could have included comments and edits, but a lot can be shown without them.
Also, at first I had prefixed all labels with `StackOverflow` (so that, for example, question nodes had the label `StackOverflowQuestion`). This was an attempt to avoid conflicts with later imports where I might want to use the same label (`User` is a prime example). After some feedback, though, I gave all nodes a `StackOverflow` label in addition to their specific labels. This is a simple but powerful way to model different data sources.
I decided to try out Clojure (GitHub repo) to further the demonstration of using Neo4j as a central location for collecting and analyzing different sources of data together. Using my script I was able to import all 6,571 neo4j questions along with associated tags, answers, and users.
The code may be easy to read for those used to a programming language that is like lego made out of lego or wearing your underwear on the outside, but for everybody else I’ll explain it.
The actual import revolves around a function called `merge-props`:

```clojure
;; Assumes `cypher` aliases clojurewerkz.neocons.rest.cypher (Neocons).
(defn merge-props [label props merge-prop neo4j-conn]
  ;; Skip objects that lack the property we MERGE on.
  (if-not (nil? (props merge-prop))
    (cypher/tquery
      neo4j-conn
      (str "MERGE (node:`" label "` {" merge-prop ": {props}." merge-prop
           "}) SET node:StackOverflow, node = {props} RETURN node")
      {"props" props})))
```
The idea behind `merge-props` is to use Neo4j’s `MERGE` to import an object if it doesn’t already exist. You specify a Neo4j label, a set of property/value pairs, and the property for `MERGE` to check. For a `User`, this might be called like this:
```clojure
(merge-props
  "User"
  {"user_id" "12345", "display_name" "Brian Underwood", "reputation" "1791"}
  "user_id"
  neo4j-conn)
```
And it would generate the following Cypher:

```cypher
MERGE (node:`User` {user_id: {props}.user_id})
SET node:StackOverflow, node = {props}
RETURN node
```
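To see the string-building step on its own, here is a minimal sketch that produces the query text without touching the database. The names `merge-query` and `mergeable?` are hypothetical helpers for illustration; they are not part of the original script, which does both steps inside `merge-props`.

```clojure
;; Hypothetical helper: builds the same Cypher text that merge-props sends,
;; so the generated query can be inspected without a Neo4j connection.
(defn merge-query
  [label merge-prop]
  (str "MERGE (node:`" label "` {" merge-prop ": {props}." merge-prop
       "}) SET node:StackOverflow, node = {props} RETURN node"))

;; Hypothetical helper: mirrors the nil check in merge-props. An object
;; without the merge property cannot be matched, so it is skipped.
(defn mergeable?
  [props merge-prop]
  (some? (get props merge-prop)))

(merge-query "User" "user_id")
;; => "MERGE (node:`User` {user_id: {props}.user_id}) SET node:StackOverflow, node = {props} RETURN node"

(mergeable? {"display_name" "Brian Underwood"} "user_id")
;; => false
```

Note that `SET node = {props}` replaces every property on the matched node with the contents of `props`, so re-importing an object overwrites whatever was stored for it before.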
The main import functions are `import-question`, `import-tag`, `import-answer`, and `import-user`, which are responsible for defining the properties to import and for calling other `import-` functions to `MERGE` dependent relationships. All of that is then wrapped in a loop to import pages of questions until the API returns `false` for `has_more`.
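The paging loop described above might look roughly like the following sketch. It is an illustration, not the script's actual code: `import-all-pages`, `fetch-page`, and `import-question!` are hypothetical names, and I am assuming each API response is a map with string keys `"items"` and `"has_more"`.

```clojure
;; Sketch of the paging loop. fetch-page is a hypothetical function that takes
;; a page number and returns a response map; import-question! is a hypothetical
;; function that imports one question (and its tags, answers, and users).
(defn import-all-pages
  [fetch-page import-question!]
  (loop [page 1
         imported 0]
    (let [response (fetch-page page)
          items    (get response "items")
          total    (+ imported (count items))]
      (doseq [question items]
        (import-question! question))
      ;; Keep requesting pages until the API says there are no more.
      (if (get response "has_more")
        (recur (inc page) total)
        total))))

;; Stand-in for the API: a map from page number to response. A Clojure map
;; can be called like a function, so it works as a fake fetch-page.
(def fake-pages
  {1 {"items" [{"question_id" 1} {"question_id" 2}] "has_more" true}
   2 {"items" [{"question_id" 3}] "has_more" false}})

(def seen (atom []))

(import-all-pages fake-pages #(swap! seen conj (get % "question_id")))
;; => 3, and @seen is now [1 2 3]
```

Passing the fetch and import functions in as arguments keeps the loop easy to try out with fake data before pointing it at the real API.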
To speed up processing I considered taking advantage of Clojure’s natural ability to execute in parallel as well as using Neo4j’s transactional endpoints to batch Cypher statements. However, since all of the questions imported within about 20 minutes, I’m happy for now.
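For the curious, here is a rough sketch of what the batching idea could look like. I did not implement this, so treat it as an assumption-laden illustration: `user-statement` and `batch-payloads` are hypothetical names, and the payload shape (a `"statements"` list of maps with `"statement"` and `"parameters"` keys) follows the request format of Neo4j’s transactional HTTP endpoint. Actually sending the payloads is left out.

```clojure
;; Hypothetical helper: one statement map per user, in the shape the
;; transactional endpoint expects inside its "statements" list.
(defn user-statement
  [props]
  {"statement"  (str "MERGE (node:`User` {user_id: {props}.user_id}) "
                     "SET node:StackOverflow, node = {props} RETURN node")
   "parameters" {"props" props}})

;; Hypothetical helper: groups statement maps into request bodies of at most
;; batch-size statements each, so many MERGEs travel in a single request.
(defn batch-payloads
  [batch-size statements]
  (map (fn [batch] {"statements" (vec batch)})
       (partition-all batch-size statements)))

(def users [{"user_id" "1"} {"user_id" "2"} {"user_id" "3"}])

(def payloads (batch-payloads 2 (map user-statement users)))

(count payloads)
;; => 2 (one payload with two statements, one with the remaining statement)
```

Each payload would then be posted as JSON in one request, which saves a round trip per object compared with calling `merge-props` once for every node.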
This post was just an introduction to the idea of Master Data Management in Neo4j. In the next post I’ll bring in another data source, show how I linked the data together, and demonstrate the sort of bigger picture that one can get from this approach.
UPDATE: Part 2 now available