Web Scraping – FreeBMD

Scraping FreeBMD is trivial. Why? Because there’s a button on the site which does just what we want. It’s labelled “Download”, and it saves the data displayed on the screen to your computer. (It’s just to the left of the Key, under “Save Search”.) One important point: there is a limit on the number of results FreeBMD will display – currently 3000, which is probably rather ambitious anyway!

For those of you in The Surname Society, Colin Spencer has made an excellent video of the process. (It’s clear and concise, even for those, like me, who don’t really go for video tutorials – I prefer the written word, obviously, or I wouldn’t be writing this!) You can find the video in the Members section of the Society’s web site – on the menu, look for Surname School videos, then scroll down to Data Extraction.

For those of you not in the Society (why not? It’s only £5!), downloading the file gives you a tab-separated values (TSV) file (like CSV, but with tabs instead of commas). You can then import this file into Excel. Colin passes it through a text editor (he and I both like Notepad++) to convert it to a CSV, but I just right-clicked on the file in the file explorer and used “Open With” to load it into Excel (I use Excel 2003 – I hope other versions work similarly). Colin also uses Notepad to remove the extra column inserted by the FreeBMD format, but that’s easy to do directly in Excel.
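If you’d rather script the conversion than do it by hand, here is a minimal sketch in Python using only the standard csv module. It turns tab-separated text into comma-separated text and can drop one column on the way. The function name tsv_to_csv, the drop_column parameter and the sample rows are all my own inventions for illustration. I haven’t reproduced the real FreeBMD layout here, so check which column is the extra one in your own download before dropping anything.

```python
import csv
import io


def tsv_to_csv(tsv_text, drop_column=None):
    """Convert tab-separated text to comma-separated text.

    drop_column is the zero-based position of a column to discard,
    or None to keep every column. Which column is the unwanted one
    is an assumption you must check against your own download.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if drop_column is not None and drop_column < len(row):
            # Rebuild the row without the unwanted column
            row = row[:drop_column] + row[drop_column + 1:]
        writer.writerow(row)
    return out.getvalue()


# Made-up sample rows (NOT real FreeBMD output), with an empty third column
sample = "Smith\tJohn\t\tLeeds\nJones\tMary, Ann\t\tYork\n"
print(tsv_to_csv(sample, drop_column=2))
```

The csv writer adds quotes around any field that itself contains a comma, which a plain find-and-replace of tabs with commas in a text editor would not do.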

There’s one major point I’d add to Colin’s tutorial. The format of the results changes part-way through the index, when an extra column is added. Either you’ll need to add a column to the earlier part of the resulting spreadsheet so that the results before and after the change line up (prone to errors), or search in two parts (recommended). For births, mother’s maiden name is added from September quarter 1911. For marriages, spouse’s surname is added from March quarter 1912. For deaths, (alleged!) age at death is added from March quarter 1866.
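To make searching in two parts concrete, here is a small Python sketch that takes the date range you want and splits it at the quarter where the format changes, so that no single search straddles the change. The change dates are the ones given above. The names split_search and FORMAT_CHANGE, and the way quarters are written as (year, "Mar") pairs, are just my own choices for illustration. The code doesn’t talk to FreeBMD at all: it only works out which ranges to enter on the search form.

```python
QUARTERS = ["Mar", "Jun", "Sep", "Dec"]

# First quarter in which the extra column appears, as stated in the text
FORMAT_CHANGE = {
    "births": (1911, "Sep"),     # mother's maiden name added
    "marriages": (1912, "Mar"),  # spouse's surname added
    "deaths": (1866, "Mar"),     # age at death added
}


def _to_index(year, quarter):
    """Number the quarters consecutively so they can be compared."""
    return year * 4 + QUARTERS.index(quarter)


def _from_index(index):
    """Turn a consecutive quarter number back into (year, quarter)."""
    return (index // 4, QUARTERS[index % 4])


def split_search(record_type, start, end):
    """Return a list of (start, end) ranges that avoid the format change.

    start and end are (year, quarter) pairs, both inclusive.
    """
    change = _to_index(*FORMAT_CHANGE[record_type])
    first, last = _to_index(*start), _to_index(*end)
    if first > last:
        raise ValueError("start must not be later than end")
    if first < change <= last:
        # The range straddles the change: stop the first search one
        # quarter before it and begin the second search at the change.
        return [(start, _from_index(change - 1)), (_from_index(change), end)]
    return [(start, end)]


for part in split_search("births", (1900, "Mar"), (1920, "Dec")):
    print(part)
```

For the births example, this gives one search ending at June quarter 1911 and a second starting at September quarter 1911. A range lying wholly on one side of the change comes back as a single search.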

Marriages of course show only the spouse’s surname, and that only from 1912. Getting the possible spouses before that date, and/or the spouse’s given name after it, requires a more advanced technique.

Web Scraping – what and why

I really hate transcribing stuff from one programme to another. Practically, my poor typing skills and a dodgy keyboard don’t help. Philosophically, I don’t see why I should have to type stuff from one window to another. What on earth are computers for?

With most genealogy programmes, GEDCOM can be used to transfer the basics from one programme to another. Multimedia data causes most problems, but careful batch editing and some basic scripting can solve many issues.

The biggest problem is extracting data from the many online resources now available on the web, especially the subscription sites. The conventional way is to look at the web page in the browser and re-type the data into the family history database, in my case introducing all kinds of errors on the way. What is needed is some way of getting the data off the browser and into a computer database without it being touched (or typed) by human hand. This can be done – it’s called web scraping.

There are many different ways of scraping data from the web, and also putting it back on to other web sites. I hope, in a series of posts, to share some of the ways this can be done.