Brian Underwood

Matchmaking Irish Citizens

I am he as you are he as you are me and we are all together

– The Beatles

For a while now I have been working in my spare time on a project involving the 1901 and 1911 Irish national censuses. I have been casually exploring the data, but as a larger goal I want to find a way to programmatically link the citizens recorded in one census to the same citizens in the other. With this I hope to be able to ask larger questions of the data, such as:

How often did Irish citizens move over this ten year period in Irish history?
How often did Irish citizens change professions?
Can we track children who have moved out to different locations?

As to matching residents there are a number of challenges. For example:

Transcription errors from the records to the computer (fortunately we can at least verify information via PDFs available online)
There are many instances of ages being off by 1-5 years even when it is clear the resident is the same (even though the censuses were conducted on March 31st, 1901 and April 2nd, 1911, almost exactly ten years apart)
Given names are sometimes different, either as given by the resident or written down by the enumerator
As the census was performed by the British government, Irish residents didn’t always want to give full truthful information to those who they might have seen as an occupying force

So, my first step is a bit of code I call ObjectScorer. I’ve employed it in my Resident class with the following methods:

  def similarity_to(other_resident)
    object_scorer.percentage_score(other_resident)
  end

  def object_scorer
    @object_scorer ||= ObjectScoring::ObjectScorer.new(self,
                                                       field_scorers,
                                                       field_weights,
                                                       field_options)
  end

  def field_weights
    {
      forename: 10,
      surname: 4,
      religion: 5,
      age: 10,
      sex: 10,
      latitude: 5,
      longitude: 5,
      ded_name: 5,
      townland_street_name: 5
    }
  end

  def field_scorers
    {
      forename: :insensitive_levenshtein_nearness,
      surname: :insensitive_levenshtein_nearness,
      religion: :exact,
      age: :nearness,
      sex: :exact,
      latitude: :nearness,
      longitude: :nearness,
      ded_name: :insensitive_levenshtein_nearness,
      townland_street_name: :insensitive_levenshtein_nearness
    }
  end

  def field_options
    {
      forename: {max_distance: 5},
      surname: {max_distance: 5},
      age: {max_distance: 5, value: age_in(other_census_year)},
      ded_name: {max_distance: 5},
      townland_street_name: {max_distance: 5},
    }
  end

The concept is pretty simple: I define a set of fields on which I want to compare two objects, define how to match them and how much I care about those fields matching, and then ObjectScorer gives me a percentage match.

Taking the age field, I specify a simple integer nearness metric with a weight of 10 and a maximum distance of 5. So for the ages of 21 and 23 we:

subtract and take the absolute value to get 2
subtract that from the maximum distance to get 3
divide that by the maximum distance to get 0.6
multiply by the weight to get 6.
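
The age calculation can be sketched in a few lines of Ruby. This is only an illustration of the arithmetic: the `nearness` helper here is a hypothetical stand-in rather than the actual ObjectScorer code, and clamping to zero once the distance reaches the maximum is my assumption.

```ruby
# Hypothetical helper, not the actual ObjectScorer implementation.
# Returns 1.0 for identical values, falling linearly to 0.0 at max_distance.
# Assumption: anything at or beyond max_distance is clamped to 0.0.
def nearness(a, b, max_distance)
  distance = (a - b).abs
  return 0.0 if distance >= max_distance

  (max_distance - distance) / max_distance.to_f
end

age_weight = 10
age_score = nearness(21, 23, 5) * age_weight

# (5 - 2) / 5 = 0.6, and 0.6 * 10 = 6
puts age_score.round(2)
```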

Taking the forename field, I specify a case-insensitive Levenshtein distance with a weight of 10 and a maximum distance of 5. If I’m comparing the two forenames John and Jon, we:

get a distance of 1
subtract that from the maximum distance to get 4
divide that by the maximum distance to get 0.8
multiply by the weight to get 8.
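
The same steps for names might look like the following. The `levenshtein` method is a plain textbook implementation that I am assuming for illustration (the real project may well use a gem), and the method name `insensitive_levenshtein_nearness` simply mirrors the scorer symbol above; its body is a guess at the behaviour, not the actual code.

```ruby
# Textbook dynamic-programming Levenshtein distance (illustrative only).
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical scorer: compare case-insensitively, then scale the edit
# distance into the 0.0..1.0 range (assumed to be clamped at zero).
def insensitive_levenshtein_nearness(a, b, max_distance)
  distance = levenshtein(a.downcase, b.downcase)
  [max_distance - distance, 0].max / max_distance.to_f
end

forename_weight = 10
forename_score = insensitive_levenshtein_nearness("John", "Jon", 5) * forename_weight

# distance 1, (5 - 1) / 5 = 0.8, and 0.8 * 10 = 8
puts forename_score.round(2)
```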

If these were the only two fields we were comparing we would add up our results and divide by the sum of the weights as in (8 + 6) / (10 + 10) and we’d wind up with a score of 0.7. Taken all together this gives us a nice way to compare many records together and sort them by which is the best match.
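
As a rough sketch of that final step, assuming each field has already been reduced to a nearness between 0 and 1 paired with its weight (the `percentage_score` below is a hypothetical standalone function, not the method on ObjectScorer):

```ruby
# results maps each field to [nearness between 0.0 and 1.0, weight].
# Hypothetical standalone version of the weighted average described above.
def percentage_score(results)
  total_weight = results.values.sum { |_nearness, weight| weight }
  weighted_sum = results.values.sum { |nearness, weight| nearness * weight }
  weighted_sum / total_weight.to_f
end

score = percentage_score({ forename: [0.8, 10], age: [0.6, 10] })

# (8 + 6) / (10 + 10) = 0.7
puts score.round(2)
```

Because the result is always divided by the total weight, scores stay comparable even when the weights are changed later.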

But of course we can’t just load up all of the records in the database and compare them in memory. We need to find a reasonable subset to run these comparisons on. In my case, when looking for candidates for a particular record, I look for records which:

Are in the other census year
Have the same value or null for sex
Have an age which would have been / will be within five years of being correct in the other census
Match the forename and surname with a maximum “edit distance” of 4 characters
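
To make the first three criteria concrete, here is an in-memory sketch of the same filter. It is not the search query itself: the `Resident` struct, `other_census_year`, and `candidate?` are hypothetical names for illustration, the treatment of a null sex is one reading of the rule above, and the fuzzy name match is left out because that part is delegated to the search engine.

```ruby
# Hypothetical minimal record; the real Resident class has many more fields.
Resident = Struct.new(:census_year, :sex, :age, keyword_init: true)

CENSUS_YEARS = [1901, 1911].freeze

def other_census_year(resident)
  (CENSUS_YEARS - [resident.census_year]).first
end

# True when `other` is worth scoring against `resident`.
def candidate?(resident, other, max_age_gap: 5)
  other_year = other_census_year(resident)
  # Age the resident would have been / will be in the other census.
  expected_age = resident.age + (other_year - resident.census_year)

  other.census_year == other_year &&
    (other.sex.nil? || other.sex == resident.sex) &&
    (other.age - expected_age).abs <= max_age_gap
end

john_1901 = Resident.new(census_year: 1901, sex: "M", age: 21)
pool = [
  Resident.new(census_year: 1911, sex: "M", age: 33), # kept: expected 31, gap 2
  Resident.new(census_year: 1911, sex: "F", age: 31), # dropped: different sex
  Resident.new(census_year: 1901, sex: "M", age: 21), # dropped: same census
  Resident.new(census_year: 1911, sex: nil, age: 37)  # dropped: gap of 6
]

puts pool.count { |other| candidate?(john_1901, other) }
```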

These queries are done with Elasticsearch first, and the results are then turned into Neo4j ActiveNode objects in memory. And thus I get a reasonable algorithm which allows me to compare records to each other!
