Building a Corpus System

Monday 29 September 2014

more pre-cleaning

bdlive.co.za made apparent several issues with reporter.py's text extraction algorithm.

* BDLive's HTML style uses inline tags without any whitespace, eg:

Lorem ipsum etc text. Some sentence here.Another sentence. More text lorem lorem.

Reporter.py joins the last sentence of each paragraph with the first of the next, eliminating all whitespace. Thus the text from the example above would become:

Lorem ipsum etc text. Some sentence here.Another sentence. More text lorem lorem.

(With no space between "here." and "Another").

This caused havoc with the word lists, 'creating' hundreds of spurious words which were really two different words joined by a full stop.

After trying several fixes, the easiest and most efficient seemed to be to do .replace("","\n") on all HTML before passing it to Reporter, inserting the newline characters it expects.

BDLive also follows the questionable practice of sending HTML content to clients with inline CSS "display:none" on some elements. Assuming that this hidden text is likely to be extraneous, I have added new filters to the generic "phase 1" cleaning to remove it. This was also problematic because even within BDLive pages there seems to be no fixed style guide: style="display:none", style="display: none;" and other variants are all seen.

I decided to make a generic change to the filtering algorithm which could be useful for other filters too. Previously one could filter tags only by specifying strings for tag type, attribute name, and attribute value; now one can supply a regex instead of a string to match "attribute_value". E.g., previously one could remove the following tag by creating a filter with "div", "class", "author_byline":

<div class="author_byline">

Now the attribute value (author_byline in the above example) can be a regular expression, and creating a filter with:

"div", "class", re.compile(r"author.*")

would also remove the tag.
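
A minimal sketch of how a filter check might accept either a plain string or a compiled regex. The function name attribute_matches and the choice of search() for the regex case are assumptions for illustration, not the project's actual code.

```python
import re


def attribute_matches(pattern, value):
    """Return True if an attribute value satisfies a filter pattern.

    pattern may be a plain string (exact match) or a compiled regex.
    Hypothetical helper, not taken from the real filtering code.
    """
    if value is None:
        return False
    if hasattr(pattern, "search"):      # compiled regular expression
        return pattern.search(value) is not None
    return pattern == value             # plain string: exact comparison


# Filters as (tag type, attribute name, attribute value) tuples.
string_filter = ("div", "class", "author_byline")
regex_filter = ("div", "class", re.compile(r"author.*"))
```

Note that with search() the pattern author.* also matches values that merely contain "author"; anchoring the pattern with ^ would restrict it to values that start with it.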

To remove non-visible text I'm using the regex

r'display\s*:\s*none\s*;?'

which allows optional variable white-space after 'display', after the colon, after the none, and an optional semi-colon at the end.
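
A quick check of that regex against a few spellings of the inline style. The helper name is_hidden and the sample strings are made up for illustration.

```python
import re

HIDDEN_STYLE = re.compile(r'display\s*:\s*none\s*;?')


def is_hidden(style):
    """True if an inline style attribute value contains display:none."""
    return HIDDEN_STYLE.search(style) is not None


variants = [
    "display:none",
    "display: none;",
    "display :none ;",
    "color:red; display:none",
]
hidden_flags = [is_hidden(v) for v in variants]
```

The pattern is case-sensitive as written; passing re.IGNORECASE to re.compile would be one way to also catch upper-case variants, if those turn up.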

Unfortunately this won't work for text which is hidden by class or id through separate CSS stylesheets, but such text can still be removed by specifying filters for phase 2 cleaning.

I've rebuilt the wordlist on the development database, and things look much tidier. I'll push the changes to the server in the next couple of days.

Rebuilding the word list took about 50 minutes. I'm beginning to think that it would be worth the extra space requirements to store a word-tokenized copy of each article in the database alongside the plaintext one, which would substantially speed up the wordlist creation and some other algorithms such as collocations and KWIC.

Monday 22 September 2014

wayback machine

archive.org has a wayback machine, which offers snapshots of sites at specific dates. It has an API which usefully can return the snapshot closest to a specified time.

Started backwards crawling of mg.co.za, iol.co.za and grocotts.co.za

I started each backwards crawl from the homepages as they appeared in December 2013. I simply fetched all links from the homepages (first trying to get these through the wayback machine too and, if that failed, accessing them directly). I then subtracted one from the date, and kept doing so until a different snapshot was found as the "closest" one. Repeat.
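
A rough sketch of that loop, assuming the Wayback availability endpoint (archive.org/wayback/available) and a step of one day. The function names and the pluggable lookup parameter are my own for illustration; the real crawler may differ.

```python
import json
from datetime import datetime, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def closest_snapshot(url, day):
    """Ask the availability API for the snapshot closest to `day`.

    Returns the snapshot timestamp (YYYYMMDDhhmmss) or None.
    """
    query = urlencode({"url": url, "timestamp": day.strftime("%Y%m%d")})
    with urlopen(API + "?" + query) as response:
        data = json.loads(response.read().decode("utf-8"))
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["timestamp"] if closest else None


def walk_backwards(url, start, stop, lookup=closest_snapshot):
    """Yield each distinct snapshot timestamp, stepping back one day at a time."""
    seen = set()
    day = start
    while day >= stop:
        stamp = lookup(url, day)
        if stamp is not None and stamp not in seen:
            seen.add(stamp)
            yield stamp
        day -= timedelta(days=1)   # assumed step size: one day
```

Passing the lookup as a parameter makes it possible to try the loop against canned data without hitting the (slow) archive.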

Wayback machine is quite slow, but has almost all the content we need. It solves the problem of trying to find URLs for old articles, as these are not really linked to.

Also did general crawling of the SA web (anything with a .co.za domain) over the last few days using Scrapy. This amounts to about 50GB and 230,000 pages so far, but Scrapy unfortunately runs into memory issues as the queue of URLs gets too big.

Thursday 4 September 2014

Monday 28 July 2014

deduplication again and newage issues

Finished implementing basic near deduplication. After playing around with TF-IDF, cosine distance, and n-gram similarity I decided to use a more customized similarity function based on sentences. In short:

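def similarity(article1, article2):
    # sentence_tokenize stands for whichever sentence splitter is in use
    s1 = set(sentence_tokenize(article1))
    s2 = set(sentence_tokenize(article2))
    shared_sentences = s1.intersection(s2)
    all_sentences = s1.union(s2)
    if not all_sentences:
        return 0.0
    return float(len(shared_sentences)) / len(all_sentences)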

That is, articles are given a similarity rating between 0 and 1 based on how many sentences they share. Looking at comparative results for actual similar articles from the corpus and from some in which I manually introduced small changes, this seemed a better gauge than looking at shared ngrams of characters or even words.
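
A small worked example of this measure (effectively the Jaccard index over sets of sentences). The naive regex splitter below is only a stand-in, assumed for illustration so the example runs without a tokenizer installed; the two sample texts are invented.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def sentence_similarity(article1, article2):
    """Shared sentences divided by all distinct sentences, in [0, 1]."""
    s1 = set(split_sentences(article1))
    s2 = set(split_sentences(article2))
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / float(len(union))


a = "The minister resigned today. More to follow. Markets fell sharply."
b = "The minister resigned today. Markets fell sharply. Analysts were surprised."
score = sentence_similarity(a, b)   # 2 shared sentences out of 4 distinct
```

Here the two texts share two of four distinct sentences, giving a score of 0.5.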

The sklearn Python library provides a nice TfidfVectorizer, which turns a list of articles into a tf-idf matrix from which a pairwise similarity matrix can be computed. This could be more efficient, but as we cannot hope to create this matrix in a single pass of the corpus (we can't hold all articles in memory at once), this efficiency is non-trivial to take advantage of. Instead, doing pairwise comparison of articles as outlined in the previous post seems to be the best option at this stage.

Some optimization was added to the deduplication process: if a sentence from one article matches too many other articles (for now, more than 10), that sentence is ignored. This means we don't need to pull down hundreds of articles and compare pairwise against all of them for sentences such as "subscribe to our newsletter", which are still dirtying some of the articles. This will remain useful even once the corpus texts are properly cleaned, for sentences such as "more to follow" and other reporter cliches, although for now I'm also ignoring sentences which are fewer than 20 characters long.

A better gauge of similarity could possibly be achieved by taking the following into account:

sentences which appear only in very few other articles are weighted higher for deduplication
sentences containing names are weighted higher
longer sentences are weighted higher
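
The two cut-offs currently in place (more than 10 matching articles, fewer than 20 characters) can be sketched roughly as below. The function name and the matches_for callback are hypothetical; in the real system the lookup would go to the database.

```python
def comparable_sentences(sentences, matches_for, min_chars=20, max_matches=10):
    """Keep only the sentences worth using to look up candidate duplicates.

    matches_for(sentence) -> list of ids of articles already containing it
    (hypothetical callback standing in for a database query).
    """
    kept = []
    for sentence in sentences:
        if len(sentence) < min_chars:
            continue    # too short to be distinctive, e.g. "More to follow."
        if len(matches_for(sentence)) > max_matches:
            continue    # boilerplate shared by too many articles
        kept.append(sentence)
    return kept


# Toy index: sentence -> ids of articles containing it.
index = {
    "Subscribe to our newsletter today.": list(range(50)),
    "The minister resigned on Monday morning.": [3],
}
kept = comparable_sentences(
    [
        "More to follow.",
        "Subscribe to our newsletter today.",
        "The minister resigned on Monday morning.",
    ],
    lambda s: index.get(s, []),
)
```

Only the last sentence survives: the first is too short and the second matches too many articles.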

After running a number of tests on the development database I have now left the deduplicator to run on the main database. It is not removing the duplicates yet, just marking them, as well as marking 'similar' articles (those with a similarity above 30%).

Also discovered some problems with short articles on thenewage.co.za - similar to the problem before with IOL, if the article text is too short then Reporter picks up the CSS styling instead as the 'main text'. Unfortunately unlike with IOL removing the CSS as a pre-processing step does not solve the issue, as Reporter's next guess is the "in other news" section; if this is removed, it picks up the phrase "comment now". At this stage I couldn't find a solution generic enough to be appealing - some customized code may need to be written for some publications.

Installed NLTK on the server with the punkt package. Took a while to find how to do this on a headless machine (NLTK downloader seems GUI-focussed and the cli downloader didn't provide much help in locating the "english.pickle" resource which is part of the punkt tokenizer):

python -m nltk.downloader punkt

Wednesday 23 July 2014

Second semester - deduplication

Working on the project again now that exam revision, exams, internship and field trip are over.

Worked on near deduplication. Using sklearn python library with TfidfVectorizers as described here: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents which seems to be working very well so far.

Because pairwise comparison of all articles will become increasingly impractical as the corpus grows, I'm taking a customized approach of keeping a collection of sentence hashes. This takes up more database space, but it means that we only need to do pairwise comparison on articles which share at least one sentence.

Dedup can be done on an existing corpus by building up the sentence hash collection while doing the deduplication. If the sentence hashes exist already for all articles in the db then we need to pull only a limited subset of articles to compare each new article against.
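
A minimal in-memory sketch of building the hash collection while deduplicating. The class and function names, the MD5 hash, and the lower-casing and whitespace normalisation are all assumptions for illustration; the real collection lives in the database.

```python
import hashlib
from collections import defaultdict


class SentenceIndex(object):
    """In-memory stand-in for the sentence-hash collection."""

    def __init__(self):
        self.articles_by_hash = defaultdict(set)

    @staticmethod
    def hash_sentence(sentence):
        # Assumed normalisation: lower-case and collapse whitespace.
        normalised = " ".join(sentence.lower().split())
        return hashlib.md5(normalised.encode("utf-8")).hexdigest()

    def candidates(self, sentences):
        """Ids of indexed articles sharing at least one sentence."""
        found = set()
        for sentence in sentences:
            key = self.hash_sentence(sentence)
            found |= self.articles_by_hash.get(key, set())
        return found

    def add(self, article_id, sentences):
        for sentence in sentences:
            self.articles_by_hash[self.hash_sentence(sentence)].add(article_id)


def dedup_pass(articles, index):
    """articles: iterable of (id, sentences) pairs.

    Returns {id: ids of earlier articles to compare against},
    filling the index as it goes.
    """
    to_compare = {}
    for article_id, sentences in articles:
        to_compare[article_id] = index.candidates(sentences)
        index.add(article_id, sentences)
    return to_compare
```

Each article is checked against the index before being added to it, so an article is never reported as a candidate duplicate of itself.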

Also discovered ssdeep fuzzy hashing in Python (thanks to Peter Wrench). Will take a comparative look at this at a later stage to see if it can be more efficient than the method described above.

Monday 12 May 2014

deduplication and scrapy

I read several articles on near-deduplication and had an idea based on some of the algorithms previously used. Outline:

For each article, hash each sentence, and maintain table of hashed_sentences:articles_containing_sentence[]

Then, duplicates and near duplicates can efficiently be discovered and avoided with something along the lines of the following:

new_article = crawl_url(url)
duplicate_possibilities = []
sentences = get_hash_sentences(new_article)
for sentence in sentences:
    # hashed_sentences maps a sentence hash to the articles containing it;
    # a sentence not seen before contributes no candidates
    duplicate_possibilities += hashed_sentences.get(sentence, [])

It is then pretty straightforward to fetch the text of all existing articles which have more than some percent overlap of sentences with the new article, and to use text similarity algorithms in pairs on these articles. Alternatively, the sentence-overlap percentage could be enough to identify a new article as a 'duplicate' or not.
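
The overlap step could look something like the sketch below, which counts how many of the new article's sentence hashes each existing article shares. The function name and the 0.3 default threshold are placeholders chosen for illustration.

```python
from collections import Counter


def overlap_candidates(sentence_hashes, hashed_sentences, threshold=0.3):
    """Return {article_id: fraction of the new article's sentences it shares}.

    sentence_hashes:  hashes of the new article's sentences
    hashed_sentences: dict mapping a sentence hash to ids of articles with it
    Only articles whose fraction exceeds `threshold` are returned.
    """
    unique = set(sentence_hashes)
    if not unique:
        return {}
    counts = Counter()
    for sentence_hash in unique:
        for article_id in set(hashed_sentences.get(sentence_hash, [])):
            counts[article_id] += 1
    total = float(len(unique))
    return {a: n / total for a, n in counts.items() if n / total > threshold}


# Toy table: article "a" shares three of four sentences, "b" only one.
table = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a"], "h4": []}
matches = overlap_candidates(["h1", "h2", "h3", "h4"], table)
```

Only the articles returned here would need their full text fetched for a pairwise similarity check.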

The sentence:article table could become undesirably large, but the size could be reduced with some heuristic selection of which sentences are 'important' (containing at least some uncommon words, not too long or too short, etc.).

I also wrote a basic IOL Spider for Scrapy, and started experimenting with using this to fetch old IOL data (i.e., articles published before we started watching the RSS feeds).

Saturday 3 May 2014

multithreading and async-crawling

Due to the growing number of publications, crawl-time has increased dramatically. I spent the day experimenting with using multi-threading on the current implementation and using the Python Twisted library to crawl asynchronously. The latter results in a far greater speed-up, but would require a lot of code refactoring to implement.
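
For the threaded variant, a sketch along these lines is one option. It uses concurrent.futures (standard library in Python 3, available as a backport for Python 2) rather than whatever the experiment actually used, and the fetch callback and worker count are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, workers=8):
    """Fetch every URL using a pool of worker threads.

    fetch(url) -> page content (hypothetical callback).
    Returns {url: content, or the exception raised while fetching it}.
    """
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as error:   # one bad feed should not stop the crawl
            return error

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe_fetch, urls))   # map() preserves order
    return dict(zip(urls, results))
```

Because crawling is dominated by waiting on the network, threads give a speed-up without touching the rest of the code much; an asynchronous rewrite with Twisted changes the control flow throughout, which is where the refactoring cost comes from.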