Building a Corpus System: wayback machine

archive.org has a wayback machine, which offers snapshots of sites at specific dates. It has an API which usefully can return the snapshot closest to a specified time.

Started backwards crawling of mg.co.za, iol.co.za and grocotts.co.za

I started each backwards crawl from the homepages as they appeared in December 2013. I simply fetched all links from the homepages (first trying to get these also through the wayback machine, and if this failed, I tried to access them directly). I then subtracted one from the date, and kept doing so until a different snapshot was found as the "closest" one. Repeat.

Wayback machine is quite slow, but has almost all the content we need. It solves the problem of trying to find URLs for old articles, as these are not really linked to.

Also did general crawling of SA web (anything with a .co.za domain) over the last few days using Scrapy. This amounts to about 50GB and 230000 pages so far, but Scrapy unfortunately runs into memory issues as the queue of URLs gets too big.

Building a Corpus System

Monday, 22 September 2014

wayback machine

No comments:

Post a Comment

Blog Archive