Let Your Database Update You with EctoWatch
Elixir allows application developers to create very parallel and very complex systems. Tools like Phoenix PubSub and LiveView thrive on this property of the ...
I am he as you are he as you are me and we are all together
For a while now I have been working in my spare time on a project involving the 1901 and 1911 Irish national censuses. I have been casually exploring the data, but as a larger goal I want to find a way to programmatically link the citizens recorded in the two years to each other. With this I hope to be able to ask larger questions of the data such as:
As to matching residents there are a number of challenges. For example:
So, my first step is a bit of code I call ObjectScorer. I’ve employed it in my Resident class with the following methods:
def similarity_to(other_resident)
  object_scorer.percentage_score(other_resident)
end

# Memoized scorer configured with this resident's scorers, weights, and options.
def object_scorer
  @object_scorer ||= ObjectScoring::ObjectScorer.new(self,
                                                     field_scorers,
                                                     field_weights,
                                                     field_options)
end

# How much each field matters in the final score.
def field_weights
  {
    forename: 10,
    surname: 4,
    religion: 5,
    age: 10,
    sex: 10,
    latitude: 5,
    longitude: 5,
    ded_name: 5,
    townland_street_name: 5
  }
end

# Which comparison method is used for each field.
def field_scorers
  {
    forename: :insensitive_levenshtein_nearness,
    surname: :insensitive_levenshtein_nearness,
    religion: :exact,
    age: :nearness,
    sex: :exact,
    latitude: :nearness,
    longitude: :nearness,
    ded_name: :insensitive_levenshtein_nearness,
    townland_street_name: :insensitive_levenshtein_nearness
  }
end

# Per-field settings passed to the comparison methods.
def field_options
  {
    forename: {max_distance: 5},
    surname: {max_distance: 5},
    age: {max_distance: 5, value: age_in(other_census_year)},
    ded_name: {max_distance: 5},
    townland_street_name: {max_distance: 5}
  }
end
The concept is pretty simple: I define a set of fields on which I want to compare two objects, define how to match them and how much I care about those fields matching, and then ObjectScorer gives me a percentage match.
Taking the age field, I specify a simple integer nearness metric. So for the ages of 21 and 23 we take the difference of 2, turn it into a nearness of 0.6 (one minus the difference divided by the maximum distance of 5), and multiply by the weight (10) to get 6.
Taking the forename field, I specify a case-insensitive Levenshtein distance with a weight of 10 and a maximum distance of 5. If I’m comparing the two forenames John and Jon, we get a Levenshtein distance of 1, subtract that from the maximum distance to get 4, divide by the maximum distance to get .8, and multiply by the weight to get 8.
If these were the only two fields we were comparing we would add up our results and divide by the sum of the weights as in (8 + 6) / (10 + 10) and we’d wind up with a score of 0.7. Taken all together this gives us a nice way to compare many records together and sort them by which is the best match.
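To make the arithmetic above concrete, here is a minimal sketch of the same calculation. It is not the real ObjectScorer: the ScoringSketch module and its method names are made up for illustration, and it only assumes that nearness means "one minus distance over maximum distance", which is what the worked numbers imply.

```ruby
# Illustrative sketch only; ScoringSketch is a hypothetical name,
# not the ObjectScorer used in the project.
module ScoringSketch
  module_function

  # Dynamic-programming Levenshtein edit distance, ignoring case.
  def insensitive_levenshtein(a, b)
    a = a.downcase
    b = b.downcase
    previous = (0..b.length).to_a
    a.each_char.with_index(1) do |char_a, i|
      current = [i]
      b.each_char.with_index(1) do |char_b, j|
        cost = char_a == char_b ? 0 : 1
        current << [previous[j] + 1,          # deletion
                    current[j - 1] + 1,       # insertion
                    previous[j - 1] + cost].min # substitution
      end
      previous = current
    end
    previous.last
  end

  # Map a distance onto 0.0..1.0: zero distance scores 1.0 and
  # anything at or beyond max_distance scores 0.0.
  def nearness(distance, max_distance)
    [1.0 - distance.to_f / max_distance, 0.0].max
  end

  # Weighted sum of per-field results divided by the sum of the weights.
  def percentage_score(fields)
    earned = fields.sum { |field| field[:nearness] * field[:weight] }
    possible = fields.sum { |field| field[:weight] }
    earned / possible
  end
end

age = ScoringSketch.nearness((21 - 23).abs, 5)
forename = ScoringSketch.nearness(
  ScoringSketch.insensitive_levenshtein('John', 'Jon'), 5
)
score = ScoringSketch.percentage_score([
  { nearness: age, weight: 10 },
  { nearness: forename, weight: 10 }
])
puts score.round(2)
```

Clamping the nearness at zero is my own choice here so that very distant values cannot drag the total negative; the original may handle that case differently.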
But of course we can’t just load up all of the records in the database and compare them in memory. We need to find a reasonable subset to run these comparisons on. In my case when looking for candidates for a particular record I look for records which:
sex
These queries are done with Elasticsearch first, and the results are then turned into Neo4j ActiveNode objects in memory. And thus I get a reasonable algorithm which allows me to compare records to each other!
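The overall shape (narrow down cheaply, then score the shortlist) can be sketched in plain Ruby. Everything here is a stand-in: the Person struct, the candidates_for and ranked helpers, and the age window are assumptions for illustration, and an in-memory array takes the place of the Elasticsearch query and the ActiveNode objects.

```ruby
# Hypothetical stand-in for a census record.
Person = Struct.new(:forename, :sex, :age, keyword_init: true)

# Cheap pre-filter: same sex, and an age close to the one we expect
# in the other census year. The age window is an assumed criterion.
def candidates_for(record, pool, expected_age:, age_window: 5)
  pool.select do |other|
    other.sex == record.sex && (other.age - expected_age).abs <= age_window
  end
end

# Sort the shortlist with any scoring block, best match first.
def ranked(record, candidates, &scorer)
  candidates.sort_by { |other| -scorer.call(record, other) }
end

john_1901 = Person.new(forename: 'John', sex: 'M', age: 21)
pool_1911 = [
  Person.new(forename: 'Jon', sex: 'M', age: 33),
  Person.new(forename: 'Mary', sex: 'F', age: 31),
  Person.new(forename: 'John', sex: 'M', age: 60),
  Person.new(forename: 'John', sex: 'M', age: 30)
]

shortlist = candidates_for(john_1901, pool_1911, expected_age: 31)

# Toy scorer: the closer the age is to the expected 31, the better.
best_first = ranked(john_1901, shortlist) do |_record, other|
  1.0 - (other.age - 31).abs / 5.0
end

puts best_first.map { |person| "#{person.forename} (#{person.age})" }.inspect
```

The point of the split is that the filter only has to be cheap and generous, since the more expensive field-by-field scoring runs on whatever survives it.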
(This post was originally created for the Erlang Solutions blog. The original can be found here)
with It, Can’t Live without It
I’ve been using Elixir for a while and I’ve implemented a number of GenServers. But while I think I mostly understand the purpose of them, I’ve not gotten t...
I love Lodash, but I’m not here to tell you to use Lodash. It’s up to you to decide if a tool is useful for you or your project. It will come down to the n...
I’ve mix phx.new ed many applications and when doing so I often start with wondering how to organize my code. I love how Phoenix pushes you to think about th...
What can a 50 year old cryptic error message teach us about the software we write today?
For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911...
Recently the continuous builds for the neo4j Ruby gem failed for JRuby because the memory limit had been reached. I wanted to see if I could use my favorite...
A while ago my colleague Michael suggested to me that I draw out some examples of how my record linkage algorithm did its thing. In order to do that, I’ve ...
Last night I ran a very successful workshop at the Friends of Neo4j Stockholm meetup group. The format was based on a workshop that I attended in San Franci...
In my last two posts I covered the process of importing data from StackOverflow and GitHub for the purpose of creating a combined MDM database. Now we get t...
In my last post I said I would “bring in another data source, show how I linked the data together, and demonstrate the sort of bigger picture that one can ge...
Joining multiple disparate data-sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a g...
I have a bit of a problem.
When using neo4j for the first time, most people want to import data from another database to start playing around. There are a lot of options including LOA...
Having recently become interested in making it easy to pull data from Twitter with neo4apis-twitter I also decided that I wanted to be able to visualize an...
I’ve been reading a few interesting analyses of Twitter data recently such as this #gamergate analysis by Andy Baio. I thought it would be nice to have a ...
I am he as you are he as you are me and we are all together – The Beatles
When I told the people of Northern Ireland that I was an atheist, a woman in the audience stood up and said, ‘Yes, but is it the God of the Catholics or t...
“Wilkins! Yes! I’ve considered decorating these walls with some graffiti of my own, and writing it in the Universal Character.. but it is too depressing...