Monday, 29 September 2014

more pre-cleaning

bdlive.co.za made apparent several issues with reporter.py's text extraction algorithm.

* BDLive's HTML style uses inline <p> tags without any whitespace, eg:

<p>Lorem ipsum etc text. Some sentence here.</p><p>Another sentence. More text lorem lorem.</p>

Reporter.py joins the last sentence of each paragraph with the first of the next, eliminating all whitespace. Thus the text from the example above would become:

Lorem ipsum etc text. Some sentence here.Another sentence. More text lorem lorem.

(With no space between "here." and "Another").

This caused havoc with the word lists, 'creating' hundreds of words which were just two different words containing a full-stop as a separator.

After trying several fixes, the easiest and most efficient seemed to be to do .replace("</p><p>","</p>\n<p>") on all HTML before passing it to Reporter, inserting the newline characters it expects.

BDLive also uses the questionable practice of sending HTML content to clients with in-line CSS "display:hidden" on some elements. Assuming that this hidden text is likely to be extraneous, I have added new filters to the generic "phase 1" cleaning to remove this. This was also problematic as even within BDLive pages there seems to be no fixed style-guide, and style="display:hidden", style="display: hidden;" and other variants are seen.

I decided to make a generic change to the filtering algorithm which could be useful for other filters too. Now one can supply a regex instead of a string to match "attribute_value". Before one could filter tags by specifying strings for tag type, attribute name, and attribute value. EG, one could remove the following tag by creating a filter with "div", "class", "author_byline":

<div class="author_byline">

Now the attribute value (author_byline in the above example) can be a regular expression, and creating a filter with:

"div", "class", re.compile(r"author.*")

would also remove the tag.

To remove non-visible text I'm using the regex

r'display\s*:\s*none\s*;?'

which allows optional variable white-space after 'display', after the colon, after the none, and an optional semi-colon at the end.

Unfortunately this won't work for text which is hidden by class or id through separate css style-sheets, but these can still be removed by specifying filters for phase2 cleaning.

I've rebuilt the wordlist on the development database, and things look much tidier. I'll push the changes to the server in the next couple of days.

Rebuilding the word list took about 50 minutes. I'm beginning to think that it would be worth the extra space requirements to store a word-tokenized copy of each article in the database alongside the plaintext one, which would substantially speed-up the wordlist creation, and some other algorithms such as collocations and KWIC.

No comments:

Post a Comment