Monday 29 September 2014

more pre-cleaning

bdlive.co.za exposed several issues with reporter.py's text-extraction algorithm.

* BDLive's HTML puts consecutive <p> tags directly against each other, with no whitespace between them, e.g.:

<p>Lorem ipsum etc text. Some sentence here.</p><p>Another sentence. More text lorem lorem.</p>

Reporter.py joins the last sentence of each paragraph directly onto the first sentence of the next, with no separating whitespace. The text from the example above therefore becomes:

Lorem ipsum etc text. Some sentence here.Another sentence. More text lorem lorem.

(With no space between "here." and "Another").

This caused havoc with the word lists, 'creating' hundreds of spurious words which were really just two separate words joined by a full stop.

After trying several fixes, the easiest and most efficient was simply to run .replace("</p><p>", "</p>\n<p>") on all HTML before passing it to Reporter, inserting the newline characters it expects.
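
A minimal sketch of the fix, assuming Reporter is handed raw HTML through something like a reporter.report() entry point (the real call may look different):

def insert_paragraph_newlines(html):
    # Put a newline between adjacent paragraphs so Reporter sees the
    # paragraph boundary and keeps the two sentences apart.
    return html.replace("</p><p>", "</p>\n<p>")

# text = reporter.report(insert_paragraph_newlines(html))  # hypothetical entry point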

BDLive also uses the questionable practice of sending HTML content to clients with some elements hidden via inline CSS ("display:none"). Assuming that this hidden text is likely to be extraneous, I have added new filters to the generic "phase 1" cleaning to remove it. This was also problematic, as even within BDLive pages there seems to be no fixed style guide: style="display:none", style="display: none;" and other variants all appear.

I decided to make a generic change to the filtering algorithm which could be useful for other filters too. Previously, one could filter tags by specifying strings for the tag type, attribute name and attribute value; e.g., a filter created with "div", "class", "author_byline" would remove the following tag:

<div class="author_byline">

Now the attribute value ("author_byline" in the example above) can also be a regular expression, so a filter created with:

"div", "class", re.compile(r"author.*")

would also remove the tag.
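
A minimal sketch of what the regex-aware filter could look like, assuming BeautifulSoup-style parsing (the project's actual filter code may be structured quite differently):

import re
from bs4 import BeautifulSoup

def remove_matching_tags(html, tag_name, attr_name, attr_value):
    # attr_value may be a plain string or a compiled regex; BeautifulSoup's
    # attrs filter accepts either, so the same code handles both cases.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tag_name, attrs={attr_name: attr_value}):
        tag.decompose()
    return str(soup)

# Exact string, as before:
#   remove_matching_tags(html, "div", "class", "author_byline")
# Regex, the new behaviour:
#   remove_matching_tags(html, "div", "class", re.compile(r"author.*"))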

To remove non-visible text I'm using the regex

r'display\s*:\s*none\s*;?'

which allows optional whitespace after 'display', after the colon and after 'none', plus an optional semi-colon at the end.
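
A quick check that the variants seen in the wild all match (remove_matching_tags is the hypothetical helper sketched above, and filtering any tag on its style attribute is an assumption about how the phase-1 filter is specified):

import re

HIDDEN = re.compile(r'display\s*:\s*none\s*;?')

for style in ('display:none', 'display: none;', 'display : none ;'):
    assert HIDDEN.search(style)

# e.g. remove_matching_tags(html, True, "style", HIDDEN)  # True matches any tag name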

Unfortunately this won't catch text which is hidden via a class or id defined in a separate CSS style sheet, but those elements can still be removed by specifying filters for phase-2 cleaning.

I've rebuilt the wordlist on the development database, and things look much tidier. I'll push the changes to the server in the next couple of days.

Rebuilding the word list took about 50 minutes. I'm beginning to think that it would be worth the extra storage to keep a word-tokenized copy of each article in the database alongside the plain-text one, which would substantially speed up wordlist creation as well as other algorithms such as collocations and KWIC.
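
If I do go that route, the stored copy could be as simple as a JSON-encoded token list in an extra column. A sketch only, with an invented articles table and tokens column standing in for whatever the real schema would be:

import json
from nltk.tokenize import word_tokenize

def store_tokens(db, article_id, plaintext):
    # Tokenize once when the article is stored, so the wordlist, collocation
    # and KWIC code can read the token list back instead of re-tokenizing
    # every article on every run.
    tokens = word_tokenize(plaintext)
    db.execute("UPDATE articles SET tokens = %s WHERE id = %s",
               (json.dumps(tokens), article_id))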

Monday 22 September 2014

wayback machine

archive.org has a Wayback Machine, which offers snapshots of sites at specific dates. Usefully, it has an API which can return the snapshot closest to a specified time.
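
A minimal example of querying it with requests (the availability endpoint returns JSON whose "archived_snapshots" object contains the "closest" capture):

import requests

def closest_snapshot(url, timestamp):
    # timestamp is YYYYMMDD (or YYYYMMDDhhmmss); the API returns the capture
    # closest to that time, or no "closest" entry if nothing is archived.
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url, "timestamp": timestamp},
                        timeout=30)
    return resp.json().get("archived_snapshots", {}).get("closest")

# closest_snapshot("mg.co.za", "20131201")
# -> None, or a dict with the snapshot's 'url' and 'timestamp'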

Started backwards crawling of mg.co.za, iol.co.za and grocotts.co.za

I started each backwards crawl from the homepage as it appeared in December 2013. I fetched all links from that homepage (first trying to get each one through the Wayback Machine too, and accessing it directly if that failed). I then subtracted one day from the date, and kept doing so until a different snapshot was returned as the "closest" one, then repeated the process from there.
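
Roughly, the loop looks like this, reusing the closest_snapshot() helper sketched above (the start and stop dates are illustrative, and the link-extraction step is elided):

from datetime import datetime, timedelta

def walk_snapshots_backwards(url, start="20131201", stop="20100101"):
    # Step one day back at a time; yield a snapshot URL whenever the API's
    # "closest" result changes, i.e. whenever we reach an earlier capture.
    day = datetime.strptime(start, "%Y%m%d")
    stop_day = datetime.strptime(stop, "%Y%m%d")
    last_timestamp = None
    while day >= stop_day:
        snap = closest_snapshot(url, day.strftime("%Y%m%d"))
        if snap and snap["timestamp"] != last_timestamp:
            last_timestamp = snap["timestamp"]
            yield snap["url"]   # fetch and extract links from this capture
        day -= timedelta(days=1)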

The Wayback Machine is quite slow, but has almost all the content we need. It solves the problem of trying to find URLs for old articles, as these are not really linked to from current pages.

Over the last few days I also did general crawling of the SA web (anything with a .co.za domain) using Scrapy. This amounts to about 50GB and 230,000 pages so far, but Scrapy unfortunately runs into memory issues as the queue of URLs grows too big.

Thursday 4 September 2014

More problems with english.pickle

Moved nltk_data to /var/www from /root (not sure why it was there). Works again.