Brian Underwood

Making Master Data Management Fun with Neo4j - Part 3

In my last two posts I covered the process of importing data from StackOverflow and GitHub for the purpose of creating a combined MDM database. Now we get to the really fun stuff: today I show you how I tied the data together!

Tying the data together

The most obvious entities to link between StackOverflow and GitHub is the user profiles. Unfortuately users don’t always have two profiles, don’t fill out both profiles, or just don’t have perfectly matching information between the two. To this end I built record linkage tools which helped me make links between users.

My goal is to create a Person label which ties together nodes from GitHub and StackOverflow (and potentially other sites) as seen below:

Standardization

My first step to linking records is to standardize data as much as possible. To this end I do a number of things:

StackOverflow users

From the website_url and about_me properties:
- Pull the Twitter username to a twitter_username property if there is a URL which points to twitter
- Pull the GitHub username to a github_username property if there is a URL which points to twitter
- Extract all URIs to the uris property as an array (no URI protocol or www.)
Extract any E-Mail addresses from the about_me property to the emails property
Further reduce uris and emails to the domains property (no paths)
Make a usernames property with the GitHub and Twitter usernames as well as the StackOverflow display_name

GitHub users

Extract the Twitter username from the blog property if it points to twitter
Standardize the blog property to the uris property as an array (no URL protocol or www.)
Further reduce uris to the domains property (no paths)
Make a usernames property with the GitHub and Twitter usernames

Post-processing

All of the above generated properties get downcased, stripped of extra whitespace, and stored without duplicates.

After all domains have been looked for, I run the followed to extract any domains which have been used more than once to find and eliminate domains like gmail.com, github.com, neo4j.com, etc…



query.match(u: :User).unwind(domain: 'u.domains').
  with(:domain, count: 'count(domain)').where('count > 1').
  pluck(:domain)

Linkage

With the data standardized we can now easily do record comparison and scoring. Here I go through the millions of combinations of user pairings and give each of them a score. For each pairing I want to create a relationship in the database so that I don’t need to calculate the score again. Since I don’t want to have millions of relationships connecting every single user to every other user, I only create relationships above a score threshold.

I created a class called ObjectIdentifier which allows me to create rules for comparing and scoring records.



identifier = ObjectIdentifier.new do |config|
  config.default_threshold = 0.8
  config.default_weight = 1.0

  config.add_matcher :name, :display_name, &fuzzy_string_matcher
  config.add_matcher :login, :display_name, &fuzzy_string_matcher
  config.add_matcher :blog, :website_url, &fuzzy_string_matcher
  config.add_matcher :location, :location, &fuzzy_string_matcher

  config.add_matcher :login, :github_username, weight: 2.0, &exact_string_matcher
  config.add_matcher :twitter_username, :twitter_username, weight: 2.0, &exact_string_matcher

  config.add_matcher :domains, :domains, threshold: 0.98, &array_fuzzy_string_matcher
  config.add_matcher :usernames, :usernames, threshold: 0.98, &array_exact_string_matcher

end

You can see that there are anonymous functions for both fuzzy and exact string matching.

With the identifier object I can call identifier.classify_hash(ghu, sou) to get the following result:



{
                    [ :name, :display_name ] => 0.9769319492502884,
                   [ :login, :display_name ] => 0.9769319492502884,
                     [ :blog, :website_url ] => 0.0,
                    [ :location, :location ] => 1.0,
                [ :login, :github_username ] => 0.0,
    [ :twitter_username, :twitter_username ] => 0.0,
                      [ :domains, :domains ] => 0.0,
                  [ :usernames, :usernames ] => 0.0,
}

Using the identifier object, I then compare every combination of users between GitHub and StackOverflow. With 2,312 StackOverflow users and 6,255 GitHub users there are 14,461,560 potential pairs, which can take some time.

For each pairing my script gets the result hash and calculates a score as the sum of the Hash values. If that score is above a threshold, it creates a COMPUTER_IDENTIFIED Neo4j relationship to represent a hit. The relationship stores the score as well as a hash property with a JSON representation of the result for debugging.

pHash

While comparing strings to strings I realized there was another aspect of the profile that I was missing: The profile photo.

If I could do fuzzy comparisons against the user images I could narrow down the results even further. Fortunately the phashion gem provides an implementation of the pHash algorithm to do just that. So I wrote a quick script to download the profile photos locally and then another to compare the millions of combinations. With all of that downloading and pHashing, I found that I also needed to use the parallel gem to complete in a reasonable amount of time.

So that I didn’t need to compare photos during my ObjectIdentifier process, I simply created a SIMILAR_IMAGE_TO relationship between users to store the result of the pHash (a value between 1 and 15 where 1 is the closest perceived match).

Using Neo4j I don’t need to worry about table structure or if I should create a collection or embed in existing objects. I can create whatever links between users that I need as I play with algorithms that will work for my purpose.

Man vs. Machine

To help guide my algorithms I used the only thing that I had available to me that could do a reasonable comparison of profiles: me. I turned my project into a Rails app which could show my two profiles side-by-side:

This view would show me a randomly chosen pair of users. 80% of the time those users were chosen because I had algorithmically determined some linkage between them. 20% of the time it would just show me a randomly chosen set of two users which hadn’t already been matched to acts as a control group. I created five links along the top of the page (“Definitely not”, “Probably not”, “Don’t know”, “Probably”, and “Definitely”) and threw in a bit of CoffeeScript to let me choose them via arrow keys.

In making decisions on hundres of pairs of profiles, a number of questions occurred to me:

How should do I choose between the five levels?

I made some rules to describe how I evaluate the evidence. By evidence I mean comparing name, location, photo, etc...

two pieces of matching evidence mean "Probably"
three pieces of matching evidence mean "Definitely"
two pieces of contradictory evidence means "Probably not"
three pieces of contradictory evidence means "Definitely not"

While I placed "Not sure" in the middle, that was a special case which means that I don't have enough evidence.

Should I follow links and otherwise research the users?

No. If I want to determine how well my matching algorithms work, I should use only same data that they have.

What about pictures?

I happen to be a human and humans are pretty good at comparing photos, particularly photos of people. I was torn about this, but decided that if two photos were obviously not the same person, I should classify the profiles as different.

Analyzing the result

To figure out how well my classification algorithm worked I turned, of course, to R. Using RNeo4j I run a query to get the human score (from -2 as ‘Definitely not’ and 2 as ‘Definitely’) and the computer score (for the rules I specified this runs from 0 to about 6):



query <- "MATCH
          (ghu:User:GitHub)-[ci:COMPUTER_IDENTIFIED]-(u:User),
          MATCH ghu-[i:IDENTIFIED]-u
        RETURN ci.score, toInt(i.index)"
result <- cypher(graph, query)

plot(result[[2]], result[[1]], xlab = "Human Score", ylab = "ObjectIdentifier score")

…and plot the results…

This shows me that I should be able to relatively safely classify an ObjectIdentifier score of around 3 and up as the same user. A score of 2.5 and above looks promising, though looking in that range for users with a human classification of Not sure there are some false positives. I suspect I can improve the correlation between the two by adjusting the algorithm scoring/weights.

That’s plenty to digest for now, so next time I’ll show the sorts of queries that can be run on the linked database! You can see all the code on the GitHub repository.

If you want to talk more in person, feel free to drop by one of my upcoming speaking engagements!

2025 1
2024 1
2023 2
2021 2
2020 2
2015 9
2014 5

2025

Reduce, Reuse… Refactor: Clearer Elixir with the Enum Module

April 24, 2025

“When an operation cannot be expressed by any of the functions in the Enum module, developers will most likely resort to reduce/3.” From the docs for E...

2024

Let Your Database Update You with EctoWatch

June 27, 2024

Elixir allows application developers to create very parallel and very complex systems. Tools like Phoenix PubSub and LiveView thrive on this property of the ...

2023

Lifting Your Loads for Maintainable Elixir Applications

June 15, 2023

(This post was originally created for the Erlang Solutions blog. The original can be found here)

Can’t Live `with` It, Can’t Live `with`out It

February 23, 2023

(This post was originally created for the Erlang Solutions blog. The original can be found here)

2021

How Far Can I Push a GenServer?

July 23, 2021

I’ve been using Elixir for a while and I’ve implemented a number of GenServers. But while I think I mostly understand the purpose of them, I’ve not gotten t...

Why I Love Lodash

May 13, 2021

I love Lodash, but I’m not here to tell you to use Lodash. It’s up to you to decide if a tool is useful for you or your project. It will come down to the n...

2020

Structuring an Elixir+Phoenix App

July 11, 2020

I’ve mix phx.new ed many applications and when doing so I often start with wondering how to organize my code. I love how Phoenix pushes you to think about th...

Expecting the Unexpected in Elixir

January 31, 2020

What can a 50 year old cryptic error message teach us about the software we write today?

2015

Using Graph Structure Record Linkage on Irish Census Data with Neo4j

August 20, 2015

For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911...

Analyzing Ruby’s ObjectSpace with Neo4j

June 3, 2015

Recently the continuous builds for the neo4j Ruby gem failed for JRuby because the memory limit had been reached. I wanted to see if I could use my favorite...

Master Data Management Scoring Examples

May 14, 2015

A while ago my colleague Michael suggested to me that I draw out some examples of how my record linkage algorithm did it’s thing. In order to do that, I’ve ...

Running a Neo4j Cypher Introduction Workshop

April 16, 2015

Last night I ran a very successful workshop at the Friends of Neo4j Stockholm meetup group. The format was based on a workshop that I attended in San Franci...

Making Master Data Management Fun with Neo4j - Part 3

March 8, 2015

In my last two posts I covered the process of importing data from StackOverflow and GitHub for the purpose of creating a combined MDM database. Now we get t...

Making Master Data Management Fun with Neo4j - Part 2

February 22, 2015

In my last post I said I would “bring in another data source, show how I linked the data together, and demonstrate the sort of bigger picture that one can ge...

Making Master Data Management Fun with Neo4j - Part 1

February 16, 2015

Joining multiple disparate data-sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a g...

Modeling in Neo4j Sans Programming

February 4, 2015

I have a bit of a problem.

Loading SQL to Neo4j Like Magic

January 8, 2015

When using neo4j for the first time, most people want to import data from another database to start playing around. There are a lot of options including LOA...

Brian Underwood

Making Master Data Management Fun with Neo4j - Part 3

Tying the data together

Standardization

StackOverflow users

GitHub users

Post-processing

Linkage

pHash

Man vs. Machine

Analyzing the result

2025

Reduce, Reuse… Refactor: Clearer Elixir with the Enum Module

2024

Let Your Database Update You with EctoWatch

2023

Lifting Your Loads for Maintainable Elixir Applications

Can’t Live `with` It, Can’t Live `with`out It

2021

How Far Can I Push a GenServer?

Why I Love Lodash

2020

Structuring an Elixir+Phoenix App

Expecting the Unexpected in Elixir

2015

Using Graph Structure Record Linkage on Irish Census Data with Neo4j

Analyzing Ruby’s ObjectSpace with Neo4j

Master Data Management Scoring Examples

Running a Neo4j Cypher Introduction Workshop

Making Master Data Management Fun with Neo4j - Part 3

Making Master Data Management Fun with Neo4j - Part 2

Making Master Data Management Fun with Neo4j - Part 1

Modeling in Neo4j Sans Programming

Loading SQL to Neo4j Like Magic

2014

Analyzing Twitter with Neo4j and Rails

neo4apis

Matchmaking Irish Citizens

Normalizing Religion in Ireland

A Blog Post Toward a Programatically Generated Philisophical Language