Let Your Database Update You with EctoWatch
Elixir allows application developers to create very parallel and very complex systems. Tools like Phoenix PubSub and LiveView thrive on this property of the ...
In my last two posts I covered the process of importing data from StackOverflow and GitHub for the purpose of creating a combined MDM database. Now we get to the really fun stuff: today I show you how I tied the data together!
The most obvious entities to link between StackOverflow and GitHub is the user profiles. Unfortuately users don’t always have two profiles, don’t fill out both profiles, or just don’t have perfectly matching information between the two. To this end I built record linkage tools which helped me make links between users.
My goal is to create a Person
label which ties together nodes from GitHub and StackOverflow (and potentially other sites) as seen below:
My first step to linking records is to standardize data as much as possible. To this end I do a number of things:
website_url
and about_me
properties:
twitter_username
property if there is a URL which points to twittergithub_username
property if there is a URL which points to twitteruris
property as an array (no URI protocol or www.
)about_me
property to the emails
propertyuris
and emails
to the domains
property (no paths)usernames
property with the GitHub and Twitter usernames as well as the StackOverflow display_name
blog
property if it points to twitterblog
property to the uris
property as an array (no URL protocol or www.
)uris
to the domains
property (no paths)usernames
property with the GitHub and Twitter usernamesAll of the above generated properties get downcased, stripped of extra whitespace, and stored without duplicates.
After all domains have been looked for, I run the followed to extract any domains which have been used more than once to find and eliminate domains like gmail.com
, github.com
, neo4j.com
, etc…
query.match(u: :User).unwind(domain: 'u.domains').
with(:domain, count: 'count(domain)').where('count > 1').
pluck(:domain)
With the data standardized we can now easily do record comparison and scoring. Here I go through the millions of combinations of user pairings and give each of them a score. For each pairing I want to create a relationship in the database so that I don’t need to calculate the score again. Since I don’t want to have millions of relationships connecting every single user to every other user, I only create relationships above a score threshold.
I created a class called ObjectIdentifier
which allows me to create rules for comparing and scoring records.
identifier = ObjectIdentifier.new do |config|
config.default_threshold = 0.8
config.default_weight = 1.0
config.add_matcher :name, :display_name, &fuzzy_string_matcher
config.add_matcher :login, :display_name, &fuzzy_string_matcher
config.add_matcher :blog, :website_url, &fuzzy_string_matcher
config.add_matcher :location, :location, &fuzzy_string_matcher
config.add_matcher :login, :github_username, weight: 2.0, &exact_string_matcher
config.add_matcher :twitter_username, :twitter_username, weight: 2.0, &exact_string_matcher
config.add_matcher :domains, :domains, threshold: 0.98, &array_fuzzy_string_matcher
config.add_matcher :usernames, :usernames, threshold: 0.98, &array_exact_string_matcher
end
You can see that there are anonymous functions for both fuzzy and exact string matching.
With the identifier
object I can call identifier.classify_hash(ghu, sou)
to get the following result:
{
[ :name, :display_name ] => 0.9769319492502884,
[ :login, :display_name ] => 0.9769319492502884,
[ :blog, :website_url ] => 0.0,
[ :location, :location ] => 1.0,
[ :login, :github_username ] => 0.0,
[ :twitter_username, :twitter_username ] => 0.0,
[ :domains, :domains ] => 0.0,
[ :usernames, :usernames ] => 0.0,
}
Using the identifier
object, I then compare every combination of users between GitHub and StackOverflow. With 2,312 StackOverflow users and 6,255 GitHub users there are 14,461,560 potential pairs, which can take some time.
For each pairing my script gets the result hash and calculates a score
as the sum of the Hash values. If that score is above a threshold, it creates a COMPUTER_IDENTIFIED
Neo4j relationship to represent a hit. The relationship stores the score
as well as a hash
property with a JSON representation of the result for debugging.
While comparing strings to strings I realized there was another aspect of the profile that I was missing: The profile photo.
If I could do fuzzy comparisons against the user images I could narrow down the results even further. Fortunately the phashion
gem provides an implementation of the pHash
algorithm to do just that. So I wrote a quick script to download the profile photos locally and then another to compare the millions of combinations. With all of that downloading and pHashing, I found that I also needed to use the parallel
gem to complete in a reasonable amount of time.
So that I didn’t need to compare photos during my ObjectIdentifier
process, I simply created a SIMILAR_IMAGE_TO
relationship between users to store the result of the pHash (a value between 1 and 15 where 1 is the closest perceived match).
Using Neo4j I don’t need to worry about table structure or if I should create a collection or embed in existing objects. I can create whatever links between users that I need as I play with algorithms that will work for my purpose.
To help guide my algorithms I used the only thing that I had available to me that could do a reasonable comparison of profiles: me. I turned my project into a Rails app which could show my two profiles side-by-side:
This view would show me a randomly chosen pair of users. 80% of the time those users were chosen because I had algorithmically determined some linkage between them. 20% of the time it would just show me a randomly chosen set of two users which hadn’t already been matched to acts as a control group. I created five links along the top of the page (“Definitely not”, “Probably not”, “Don’t know”, “Probably”, and “Definitely”) and threw in a bit of CoffeeScript to let me choose them via arrow keys.
In making decisions on hundres of pairs of profiles, a number of questions occurred to me:
To figure out how well my classification algorithm worked I turned, of course, to R. Using RNeo4j I run a query to get the human score (from -2 as ‘Definitely not’ and 2 as ‘Definitely’) and the computer score (for the rules I specified this runs from 0 to about 6):
query <- "MATCH
(ghu:User:GitHub)-[ci:COMPUTER_IDENTIFIED]-(u:User),
MATCH ghu-[i:IDENTIFIED]-u
RETURN ci.score, toInt(i.index)"
result <- cypher(graph, query)
plot(result[[2]], result[[1]], xlab = "Human Score", ylab = "ObjectIdentifier score")
…and plot the results…
This shows me that I should be able to relatively safely classify an ObjectIdentifier
score of around 3 and up as the same user. A score of 2.5 and above looks promising, though looking in that range for users with a human classification of Not sure
there are some false positives. I suspect I can improve the correlation between the two by adjusting the algorithm scoring/weights.
That’s plenty to digest for now, so next time I’ll show the sorts of queries that can be run on the linked database! You can see all the code on the GitHub repository.
If you want to talk more in person, feel free to drop by one of my upcoming speaking engagements!
Elixir allows application developers to create very parallel and very complex systems. Tools like Phoenix PubSub and LiveView thrive on this property of the ...
(This post was originally created for the Erlang Solutions blog. The original can be found here)
with
It, Can’t Live with
out It
(This post was originally created for the Erlang Solutions blog. The original can be found here)
I’ve been using Elixir for a while and I’ve implemented a number of GenServers. But while I think I mostly understand the purpose of them, I’ve not gotten t...
I love Lodash, but I’m not here to tell you to use Lodash. It’s up to you to decide if a tool is useful for you or your project. It will come down to the n...
I’ve mix phx.new ed many applications and when doing so I often start with wondering how to organize my code. I love how Phoenix pushes you to think about th...
What can a 50 year old cryptic error message teach us about the software we write today?
For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911...
Recently the continuous builds for the neo4j Ruby gem failed for JRuby because the memory limit had been reached. I wanted to see if I could use my favorite...
A while ago my colleague Michael suggested to me that I draw out some examples of how my record linkage algorithm did it’s thing. In order to do that, I’ve ...
Last night I ran a very successful workshop at the Friends of Neo4j Stockholm meetup group. The format was based on a workshop that I attended in San Franci...
In my last two posts I covered the process of importing data from StackOverflow and GitHub for the purpose of creating a combined MDM database. Now we get t...
In my last post I said I would “bring in another data source, show how I linked the data together, and demonstrate the sort of bigger picture that one can ge...
Joining multiple disparate data-sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a g...
I have a bit of a problem.
When using neo4j for the first time, most people want to import data from another database to start playing around. There are a lot of options including LOA...
Having recently become interested in making it easy to pull data from Twitter with neo4apis-twitter I also decided that I wanted to be able to visualize an...
I’ve been reading a few interesting analyses of Twitter data recently such as this #gamergate analysis by Andy Baio. I thought it would be nice to have a ...
I am he as you are he as you are me and we are all together – The Beatles
When I told the people of Northern Ireland that I was an atheist, a woman in the audience stood up and said, ‘Yes, but is it the God of the Catholics or t...
“Wilkins! Yes! I’ve considered decorating these walls with some graffiti of my own, and writing it in the Universal Character.. but it is too depressing...