Brian Underwood

Using Graph Structure Record Linkage on Irish Census Data with Neo4j

For just over a year, ever since I stayed in the town of Skibbereen, Ireland, I’ve been obsessed on and off with a project. Taking data from the 1901 and 1911 Irish censuses, I hoped to find a way to reliably link resident records between the two and identify the same residents. Since then I’ve learned a bit about master data management and record linkage, so I thought I would give it another stab. Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

The data model of the imported data is very linear:

In this post, however, I’m going to be focusing on Houses and Residents and creating relationships between them based on their properties.

Relations to the Head

To see what census records from 1901 and 1911 Ireland look like, have a look at the McCarthys in both years. Charles is the head of the family with his wife Hannah, mother Ellen, children (two in 1901 and seven in 1911), and a servant (Timothy Walsh in 1901 and William Regan in 1911).

The McCarthy family of Barnagowlane, Cloghdowell, Cork, 1901

The McCarthy family of Barnagowlane, Cloghdonnell, Cork, 1911

Surname	Forename	Age	Sex	Relation to Head	Religion	Surname	Forename	Age	Sex	Relation to Head	Religion
McCarthy	Charles	37	Male	Head of Family	Roman Catholic	McCarthy	Charles	47	Male	Head of Family	Roman Catholic
McCarthy	Hannah	25	Female	Wife	Roman Catholic	McCarthy	Hannah	35	Female	Wife	Roman Catholic
McCarthy	William	1	Male	Son	Roman Catholic	McCarthy	William	11	Male	Son	Roman Catholic
McCarthy	Bridget		Female	Daughter	Roman Catholic	McCarthy	Bridget	10	Female	Daughter	Roman Catholic
						McCarthy	Ellen	8	Female	Daughter	Roman Catholic
						McCarthy	Kate	6	Female	Daughter	Roman Catholic
						McCarthy	Florence	4	Male	Son	Roman Catholic
						McCarthy	Charles Peter	2	Male	Son	Roman Catholic
						McCarthy	Annie		Female	Daughter	Roman Catholic
McCarthy	Ellen	65	Female	Mother	Roman Catholic	McCarthy	? Ellen	75	Female	Mother	Roman Catholic
Walsh	Timothy	25	Male	Servant	Roman Catholic
						Regan	William	24	Male	Servant	Roman Catholic

The McCarthys are an almost exact match between the 1901 and 1911 census records. The names, ages, occupations, and relationships all match perfectly. Unfortunately the story for other records is not so simple. Many times houses which to the human eye seem to be the same house can have wildly varying details. For example, Hannah might be listed as Hana or Anne in a different census. Likewise, ages vary a lot more than you might think. In examining the records I regularly found ages varying by a year or two and have even found a few houses with ages off by as much as 10-15 years.

In both censuses there is a field for residents to fill out called “Relation to Head”. This gives us information about how each resident is related to the head of the house. In the case of the McCarthys, Charles is listed as “Head of Family” in both years. The rest of the family gives a nice representation of the values that we often see in the data: “Wife”, “Son”, “Daughter”, and “Servant”.

We might be tempted to say “This person was the head in 1901, so they must be the same person who was the head in 1911”. Often, however, the head of the family can die or retire, leaving the role of head of the family to their wife or child. Can the “Relation to Head” values still be useful to us to match any given resident from 1901 to another resident in 1911?

First let’s cover the general process of record linkage I have been using. To find a match for a resident I start by using an Elasticsearch server (which contains a duplicate of my Neo4j census data) to quickly find a list of other residents with a match on very rough criteria:

Is the resident in the other census?
Does the sex match (or is it NULL)?
Is the resident’s age within 15 years of what it would be expected to be in the other census?
Does the name match roughly (within an edit distance of 4)?
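
The rough criteria above can be sketched in plain Ruby as a single predicate. This is only an in-memory illustration, not the Elasticsearch query itself: the `Resident` struct and the `edit_distance` and `rough_candidate?` methods are hypothetical names, a NULL sex on either record is treated as a match, and both ages are assumed to be known.

```ruby
# Hypothetical in-memory record; the real data lives in Neo4j and Elasticsearch.
Resident = Struct.new(:census_year, :name, :sex, :age)

# Classic dynamic-programming Levenshtein edit distance.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Mirrors the four rough criteria; assumes both ages are present.
def rough_candidate?(resident, other)
  # 1. The candidate must come from the other census
  return false if resident.census_year == other.census_year
  # 2. The sex must match, or be missing on one side
  return false unless resident.sex.nil? || other.sex.nil? || resident.sex == other.sex

  # 3. The age must be within 15 years of the expected age
  expected_age = resident.age + (other.census_year - resident.census_year)
  return false if (other.age - expected_age).abs > 15

  # 4. The name must be within an edit distance of 4
  edit_distance(resident.name.downcase, other.name.downcase) <= 4
end

hannah_1901 = Resident.new(1901, 'Hannah McCarthy', 'Female', 25)
hana_1911   = Resident.new(1911, 'Hana McCarthy', 'Female', 36)
rough_candidate?(hannah_1901, hana_1911) # => true
```

The thresholds are deliberately loose: the goal at this stage is to avoid missing the real match, not to pick it.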

This comes back with anywhere from zero to hundreds of results. I call these “similarity candidates” and for each I create a relationship between the original record and the candidate.

With this list I can compare the attributes of the two records (using the record_linkage gem I created) to see how closely they match. The closer their name, sex, age, etc. match, the higher the score they get. Ideally the real match should have the highest score, but that isn’t always true and can take some tuning.
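
To make the idea of attribute scoring concrete, here is a minimal sketch in plain Ruby. It does not use the record_linkage gem or reflect its API; the `WEIGHTS` values, the `attribute_scores` method, and the linear age falloff are all assumptions made up for illustration.

```ruby
# Hypothetical weights for illustration; not the values or API of the record_linkage gem.
WEIGHTS = { name: 3.0, sex: 1.0, age: 2.0 }.freeze

# Returns a per-field score hash for two records taken `years_apart` years apart.
def attribute_scores(record, candidate, years_apart)
  age_error = (candidate[:age] - (record[:age] + years_apart)).abs
  {
    name: record[:name].casecmp?(candidate[:name]) ? WEIGHTS[:name] : 0.0,
    sex:  record[:sex] == candidate[:sex] ? WEIGHTS[:sex] : 0.0,
    # Full credit for the expected age, fading linearly to zero at 15 years off
    age:  WEIGHTS[:age] * [1.0 - age_error / 15.0, 0.0].max
  }
end

william_1901 = { name: 'William', sex: 'Male', age: 1 }
william_1911 = { name: 'William', sex: 'Male', age: 11 }
regan_1911   = { name: 'William', sex: 'Male', age: 24 }

attribute_scores(william_1901, william_1911, 10).values.sum # => 6.0
attribute_scores(william_1901, regan_1911, 10).values.sum   # lower: the age is 13 years off
```

Keeping the per-field scores separate from the total makes it easier to see which attribute is pulling a candidate up or down while tuning.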

In addition to this simple comparison of attributes, I have now added a process to take advantage of the similarity candidate relationships to compare family relationships. Let’s start with this example of a sub-graph pattern:

The relationship CHILD_OF is created whenever there is a “Son” or “Daughter” in the “Relation to Head” field. Likewise we can create other gender-neutral relationships like MARRIED_TO, SIBLING_OF, NIECE_NEPHEW_OF, etc…
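
One way to picture this step is as a lookup from “Relation to Head” values to relationship types. The sketch below is an assumption-heavy illustration: only the Son/Daughter to CHILD_OF rule comes from the text, while the other entries, the `RELATION_TO_HEAD_TYPES` table, and the `relationships_to_head` method are hypothetical, and it works on plain hashes rather than Neo4j nodes.

```ruby
# Hypothetical mapping from normalized "Relation to Head" values to relationship
# types; only CHILD_OF for Son/Daughter comes directly from the text.
RELATION_TO_HEAD_TYPES = {
  'son'      => :CHILD_OF,
  'daughter' => :CHILD_OF,
  'wife'     => :MARRIED_TO,
  'husband'  => :MARRIED_TO,
  'brother'  => :SIBLING_OF,
  'sister'   => :SIBLING_OF,
  'niece'    => :NIECE_NEPHEW_OF,
  'nephew'   => :NIECE_NEPHEW_OF
}.freeze

# Returns [from, type, to] triples pointing from each resident to the head.
# Values with no entry in the table (e.g. "Servant") produce no relationship.
def relationships_to_head(residents)
  head = residents.find { |r| r[:relation].to_s.downcase.start_with?('head') }
  return [] unless head

  residents.each_with_object([]) do |resident, triples|
    type = RELATION_TO_HEAD_TYPES[resident[:relation].to_s.strip.downcase]
    triples << [resident[:name], type, head[:name]] if type
  end
end

mccarthys_1901 = [
  { name: 'Charles', relation: 'Head of Family' },
  { name: 'Hannah',  relation: 'Wife' },
  { name: 'William', relation: 'Son' },
  { name: 'Bridget', relation: 'Daughter' },
  { name: 'Ellen',   relation: 'Mother' },
  { name: 'Timothy', relation: 'Servant' }
]
relationships_to_head(mccarthys_1901)
# => [["Hannah", :MARRIED_TO, "Charles"], ["William", :CHILD_OF, "Charles"], ["Bridget", :CHILD_OF, "Charles"]]
```

“Mother” is left out of the table here because it would point the other way (the head is the child), which a real import would need to handle.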

In this case the resident in question is the 1901 record for William. When we are evaluating the 1911 record of William as a potential match we can explore other residents in the same house as evidence of similarity. The diagram above shows that both records have a CHILD_OF relationship to the two “Charles” records which furthermore are linked via a SIMILARITY_CANDIDATE relationship. Because of this we can say that there is a greater chance that the two “William” records represent the same person.

This only gives us the ability to find these relationships between the head of the family and other residents. What about generically matching based on the relationship of any two residents of a house? Let’s say that Charles died sometime between 1901 and 1911. If his wife Hannah takes over as the head of the family we would have a sub-graph which looks like this:

We could say that when we have the paths -CHILD_OF-><-MARRIED_TO- and -CHILD_OF-> on either side, we can build our case for a match a bit more. This kind of matching can be used on all of the other residents of the house with SIMILARITY_CANDIDATE relationships. For example, -CHILD_OF-><-CHILD_OF- could be matched to -CHILD_OF-><-CHILD_OF- even in this case where the wife becomes the head of the house. Or if a child becomes the head then it could be compared to a -SIBLING_OF- relationship.

The Code

So how do we actually do this? First let’s take our sub-graph and turn our nodes into variables:

In this example let’s take resident h1 r1 (house 1, resident 1) as the resident in question and h2 r1 as the candidate that we want to compare it to. This is the sort of query that Neo4j is wonderful at both performing quickly and making easy to formulate. Let’s look at part of the Ruby code:



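def get_similarity_candidate_relationship_paths
  # self is the anchor resident (h1_r1). The query finds each candidate (h2_r1)
  # in another house, every other pair of residents (h1_r2, h2_r2) from the same
  # two houses who are also similarity candidates of each other, and the family
  # paths (one or two relationships long) linking the residents within each house.
  self.query_as(:h1_r1)
    .match('(h1:House), (h2:House)')
    .match('h1<-[:LIVES_IN]-h1_r1-[sc_1:similarity_candidate]-(h2_r1)-[:LIVES_IN]->h2')
    .match('h1<-[:LIVES_IN]-h1_r2-[sc_2:similarity_candidate]-(h2_r2)-[:LIVES_IN]->h2')
    .match('path1=h1_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of|cousin_of|child_in_law_of|step_child_of*1..2]-h1_r2')
    .match('path2=h2_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of|cousin_of|child_in_law_of|step_child_of*1..2]-h2_r2')
    .pluck(
      :h2_r1,
      'collect([path1, rels(path1), path2, rels(path2)])'
    ).each_with_object({}) do |(r2, data), result|

    # Sum one score per pair of paths, keyed by the candidate resident
    result[r2] = data.inject(0) do |total, (path1, rels1, path2, rels2)|
      relations1 = relation_string_from_path_and_rels(path1, rels1)
      relations2 = relation_string_from_path_and_rels(path2, rels2)

      if relations1 == relations2
        1.0 # identical family paths in both houses
      elsif score = (RELATION_EQUIVILENCE_SCORES[relations1] || {})[relations2]
        score # different paths that are listed as equivalent
      else
        -2.0 # no known equivalence: penalize the candidate
      end + total
    end
  end
end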

Here we start with a Cypher query using the Query API from neo4j.rb. The object upon which we’ve called get_similarity_candidate_relationship_paths is our h1_r1 anchor. Note here that we match paths of either one or two relationships between two residents of the same house. Then we return all residents found via the SIMILARITY_CANDIDATE relationship from our anchor, along with the family relationship paths aggregated into an Array.

Once the Cypher query returns data we call relation_string_from_path_and_rels, which transforms the path into a string like -BORN_TO-><-BORN_TO. This gives us a simple way to express and compare the path between two residents.
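
The body of relation_string_from_path_and_rels isn’t shown here, so the following is only a sketch of the idea under my own assumptions. The hypothetical `relation_string` method takes a list of steps instead of Neo4j path objects, and the exact string format (including the closing dash on incoming steps) is assumed from the arrow notation used earlier.

```ruby
# Each step is [relationship_type, direction], where direction is :out when the
# relationship points along the direction of travel and :in when it points back.
def relation_string(steps)
  steps.map do |type, direction|
    direction == :out ? "-#{type}->" : "<-#{type}-"
  end.join
end

# William -> Charles <- Bridget: two siblings reached through a shared parent
relation_string([[:BORN_TO, :out], [:BORN_TO, :in]])    # => "-BORN_TO-><-BORN_TO-"
# William -> Charles <- Hannah: a child's path to the head's spouse
relation_string([[:BORN_TO, :out], [:MARRIED_TO, :in]]) # => "-BORN_TO-><-MARRIED_TO-"
```

Because the result is an ordinary string, two paths can be compared with == or used as Hash keys.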

We can then give a score based on the two paths. If the paths are the same then we say that the score is 1.0. If the pair of paths is something like -BORN_TO-><-BORN_TO and -SIBLING_OF-> then we can give a score based on a lookup. We add these scores up to give us a total score comparing our anchor resident and each of its similarity candidates. All with just one query to the database.
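
The scoring rule can be tried out on its own, without a database. In this sketch the `EQUIVALENCE_SCORES` table, its 0.5 values, and the `path_pair_score` and `total_score` methods are hypothetical stand-ins (the real RELATION_EQUIVILENCE_SCORES contents aren’t shown in this post), and the path string format is assumed; only the 1.0 and -2.0 values come from the method above.

```ruby
# Hypothetical equivalence table: outer key is the path in the first house,
# inner key is the path in the second house. The 0.5 values are made up.
EQUIVALENCE_SCORES = {
  '-BORN_TO-><-BORN_TO-'    => { '-SIBLING_OF->' => 0.5 },
  '-BORN_TO-><-MARRIED_TO-' => { '-BORN_TO->' => 0.5 }
}.freeze

# Same three-way rule as above: identical, listed as equivalent, or penalized.
def path_pair_score(relations1, relations2)
  return 1.0 if relations1 == relations2

  EQUIVALENCE_SCORES.dig(relations1, relations2) || -2.0
end

# Adds up the scores for every pair of paths found for one candidate.
def total_score(path_pairs)
  path_pairs.sum { |relations1, relations2| path_pair_score(relations1, relations2) }
end

pairs = [
  ['-BORN_TO->', '-BORN_TO->'],              # same path: 1.0
  ['-BORN_TO-><-BORN_TO-', '-SIBLING_OF->'], # listed as equivalent: 0.5
  ['-MARRIED_TO->', '-BORN_TO->']            # no known equivalence: -2.0
]
total_score(pairs) # => -0.5
```

The large penalty means that a single contradicting family relationship outweighs one matching relationship, which may or may not be the right balance for a given data set.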

Challenges

There are a couple of things that I needed to do to make this work:

Previously I was simply grabbing one resident at a time, finding all of the similarity candidates, and then creating a set of relationships to link the resident with the candidates and to store the record linkage scores (both the individual scores for fields and the total score). However this approach requires all of the candidates in the house to have SIMILARITY_CANDIDATE relationships in order to compare family relationships. So I first process all residents for a house to create the similarity candidate relationships and store the record linkage scores and then go through them again with the graph-based comparisons and store that score and update the total.

Beyond that there is the conceptual problem of determining the scoring when comparing paths. For example, if somebody was BORN_TO the head one year but their spouse takes over as the head, could we say that they’re BORN_TO the spouse if they are a step-child? Family relationships are complicated and don’t always fit neatly into our properties and algorithms.

Conclusion

Most record linkage focuses on the properties of an object, but we need to remember that relationships are data about our entities too. With Neo4j we have a powerful tool for analyzing those relationships naturally and quickly. Additionally I have found that the ability to create relationships on the fly to aggregate calculations like the ones discussed above is a wonderful way to find the best solution quickly.
