Brian Underwood

Normalizing Religion in Ireland

When I told the people of Northern Ireland that I was an atheist, a woman in the audience stood up and said, ‘Yes, but is it the God of the Catholics or the God of the Protestants in whom you don’t believe?’

– Quentin Crisp

I’ve been staying at an Airbnb in Ireland recently, and my host told me about the Irish National Census website. Many of the census records from around the turn of the 20th century have been transcribed and made available on the site, including all of the records from 1901 and 1911. My host was interested in downloading records from his home county, but I thought it would be fun to download all counties for both years, load them into a Neo4j database, and see what I could do with them.

The site has no API, but its HTML is very well structured, and it was easy to write a Ruby script using Nokogiri to crawl the site and produce CSV files. Since there is a lot of data and I would be running the script many times, I created a local cache for the HTML pages so that each run reuses the pages it has already downloaded.
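As a rough sketch of the caching idea (this is not the actual script): the `cached_page` helper, the `html_cache` directory name, and the injectable `fetcher` are all assumptions made up for illustration. Each page is stored on disk under a hash of its URL, and the network is only touched on a cache miss. The returned string could then be handed to Nokogiri for parsing.

```ruby
require 'digest'
require 'fileutils'
require 'net/http'
require 'uri'

# Hypothetical cache directory name; not taken from the original script.
CACHE_DIR = 'html_cache'

# Returns the HTML for `url`, reading it from disk when a cached copy exists.
# `fetcher` is any callable that takes a URL string and returns the page body;
# by default it performs a plain HTTP GET.
def cached_page(url, cache_dir: CACHE_DIR, fetcher: ->(u) { Net::HTTP.get(URI(u)) })
  FileUtils.mkdir_p(cache_dir)
  # Hash the URL so that any URL maps to a safe, fixed-length file name.
  path = File.join(cache_dir, "#{Digest::SHA256.hexdigest(url)}.html")
  return File.read(path) if File.exist?(path)

  html = fetcher.call(url)
  File.write(path, html)
  html
end
```

Passing the fetcher in as an argument keeps the cache logic separate from the HTTP call, which also makes it easy to try out without hitting the real site.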

It turns out that not only are there a lot of records (about 4 million residents for each year), but the data is quite variable. After informally playing with the data I decided that to do any useful analysis I would need to clean it up first. The first target: religion.

At the start there were 24,258 distinct strings entered by census takers / residents. To facilitate the cleaning I wrote a rake task which allowed me to quickly choose how the data should be normalized. The task makes a query for all distinct strings from the religion property and the number of times that string occurs, sorts the result so that the most common values are first, and then presents a prompt asking for an alternative string. An enumerated list is presented so that I can quickly choose a previously entered value. Simply hitting return causes the value to be mapped to itself. After a series of inputs the prompt looks like this:

Choices:
0> Roman Catholic
1> Church of Ireland
2> Presbyterian
3> Catholic
4> Methodist
5> Church of England
6> -
7> Episcopalian
8> Baptist
9> Unitarian
10> Reformed Presbyterian
11> Congregationalist
Current unmapped value: Presbeterian (2430 occurances)
Suggestions:
 2>  Presbyterian (distance: 1)
 9>  Unitarian (distance: 7)
(7615 more / 'QUIT' to quit) >
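The two core steps behind a prompt like this can be sketched in plain Ruby. This is only an illustration under assumptions: the real task queries Neo4j, whereas here the strings are an in-memory array, and `values_by_frequency` and `resolve_input` are hypothetical names. Handling of 'QUIT' is left out.

```ruby
# Counts each distinct string and orders the pairs most-common first
# (ties are broken alphabetically so the order is stable).
def values_by_frequency(strings)
  strings.tally.sort_by { |value, count| [-count, value] }
end

# Interprets one line typed at the prompt:
#   ""        -> keep the value as it is (map it to itself)
#   "2"       -> reuse choice number 2 from the enumerated list
#   otherwise -> treat the input as a new normalized value
def resolve_input(input, current_value, choices)
  text = input.strip
  return current_value if text.empty?
  return choices.fetch(text.to_i) if text.match?(/\A\d+\z/) && text.to_i < choices.size

  text
end
```

One consequence of this design is that a normalized value made up only of digits can't be typed in directly if it collides with a choice number, which seems an acceptable trade for the speed of picking by index.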

The suggestions come from finding the closest string (by Levenshtein distance) among the strings which have already been mapped and taking its mapped value.
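For anyone curious, here is a minimal sketch of how suggestions like these could be computed. It is a textbook Levenshtein implementation, not necessarily what the rake task uses (a gem would work just as well), and the `suggestions` helper and its `limit` parameter are assumptions for illustration.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      # Cheapest of: deletion, insertion, substitution (or match).
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# `mapping` is a Hash of original string => normalized string.
# Returns up to `limit` [normalized value, distance] pairs, nearest first,
# with each normalized value listed only once.
def suggestions(value, mapping, limit: 2)
  mapping
    .map { |original, normalized| [normalized, levenshtein(value, original)] }
    .sort_by { |normalized, distance| [distance, normalized] }
    .uniq { |normalized, _| normalized }
    .first(limit)
end
```

Comparing against the original strings rather than the normalized ones means that every misspelling I've already mapped helps catch the next similar misspelling.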

By the way, you wouldn’t believe how many ways “Roman Catholic” is represented in this data, from a simple “R C” to the specific “Catholic Commonly Described in Acts of Parliament As Roman Catholic”. Likely because of the wide range of dialects and education levels, there’s also “Roman Catholce”, “Roman Caiholec”, “Roman Catcklick”, “Roman Katchrlick”, and so, so much more. Though I think my favorite so far is “Romeman Catholic”.

Each time I respond to the prompt, a YAML file representing a Hash is saved with the original strings as keys and the normalized strings as values. It looks like this:

---
Roman Catholic: Roman Catholic
Church of Ireland: Church of Ireland
Presbyterian: Presbyterian
R Catholic: Roman Catholic
Catholic: Catholic
R C: Roman Catholic
Methodist: Methodist
Irish Church: Church of Ireland
Church of England: Church of England
"-": "-"
Catholic Church: Catholic
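Reading and writing a file like this takes very little code with Ruby's standard `yaml` library. The sketch below is an assumption about how it might be done: the file name `religion_mapping.yml` and the helper names are hypothetical.

```ruby
require 'yaml'

# Hypothetical file name; the post does not say what the real file is called.
MAPPING_FILE = 'religion_mapping.yml'

# Loads the original => normalized Hash, or an empty Hash if nothing is saved yet.
def load_mapping(path = MAPPING_FILE)
  return {} unless File.exist?(path)

  YAML.load_file(path) || {}
end

# Writes the whole Hash back out; calling this after every answer means
# an interrupted session loses nothing that was already entered.
def save_mapping(mapping, path = MAPPING_FILE)
  File.write(path, mapping.to_yaml)
end
```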

My colleague on the neo4j gem project suggested building a prompt to quickly go through strings and do yes/no passes on one particular religion. I ended up doing this in the low-tech manner of filtering the list of strings for things like “rom” and “pres”, working on the results in a text editor, and then manually putting them into the YAML file.
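The filtering step itself is simple enough to sketch. The `candidates_for` helper below is hypothetical: it returns the not-yet-mapped strings containing a fragment, ignoring case, ready to be reviewed by hand.

```ruby
# Returns the strings that contain `fragment` (case-insensitive) and that
# do not yet appear as keys in `mapping` (original => normalized).
def candidates_for(strings, fragment, mapping = {})
  needle = fragment.downcase
  strings.select { |s| s.downcase.include?(needle) && !mapping.key?(s) }
end
```

A substring filter like this will still miss spellings such as “R C”, which is part of why the review in a text editor remains manual.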

Doing this I was able to make a lot of progress, but with thousands of unique strings to go through, it’s still ongoing. For now I’ve got enough to move on to standardizing other fields, and eventually I will be able to use the standardized data to do some analysis. More to come…
