Web Scraping – what and why

I really hate transcribing stuff from one programme to another. Practically, my poor typing skills and a dodgy keyboard don’t help. Philosophically, I don’t see why I should have to type stuff from one window to another. What on earth are computers for?

With most genealogy programmes, GEDCOM can be used to transfer the basics from one programme to another. Multimedia data causes the most problems, but careful batch editing and some basic scripting can solve many of them.

The biggest problem is extracting data from the many online resources now available on the web, especially the subscription sites. The conventional way is to look at the web page in the browser and re-type the data into the family history database, in my case introducing all kinds of errors on the way. What is needed is some way of getting the data out of the browser and into a computer database without it being touched (or typed) by human hand. This can be done – it’s called web scraping.
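To give a flavour of the idea, here is a minimal sketch in Python using only the standard library. Everything specific in it is an assumption made up for illustration: the sample HTML, the three-column table layout, the `people` database table, and the names `TableScraper` and `scrape_to_database`. A real site will have its own page structure, and fetching the page in the first place is a separate step not shown here.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; a real site would have its own layout.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Born</th><th>Place</th></tr>
  <tr><td>Mary Smith</td><td>1851</td><td>Leeds</td></tr>
  <tr><td>John Smith</td><td>1849</td><td>York</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, one tuple per <tr>
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they end up empty
            # and are skipped here.
            if self._row:
                self.rows.append(tuple(cell.strip() for cell in self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def scrape_to_database(html, conn):
    """Parse the table in `html` and insert its rows into `people`."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, born INTEGER, place TEXT)"
    )
    # Assumes each row has exactly three cells: name, birth year, place.
    records = [(name, int(born), place) for name, born, place in parser.rows]
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
count = scrape_to_database(SAMPLE_HTML, conn)
```

The point of the sketch is simply that the names, dates and places go from the page into the database without anyone re-typing them, so no transcription errors can creep in along the way.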

There are many different ways of scraping data from the web, and also putting it back on to other web sites. I hope, in a series of posts, to share some of the ways this can be done.
