Monday 12 May 2014

deduplication and scrapy

I read several articles on near-deduplication and had an idea based on some of the algorithms previously used. Outline:

  • For each article, hash each sentence, and maintain a table of hashed_sentence:articles_containing_sentence[] (a sketch of building such a table follows below)

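As a rough illustration of how such a table might be built (the structure and names here are assumptions for the sketch, not the actual implementation):

import hashlib
from collections import defaultdict

# hypothetical table: sentence hash -> ids of articles containing that sentence
hashed_sentences = defaultdict(list)

def get_hash_sentences(article_text):
    # naive sentence split on full stops; real sentence splitting needs more care
    sentences = [s.strip() for s in article_text.split('.')]
    return [hashlib.md5(s.encode('utf-8')).hexdigest() for s in sentences if s]

def index_article(article_id, article_text):
    for sentence_hash in get_hash_sentences(article_text):
        hashed_sentences[sentence_hash].append(article_id)
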
Then, duplicates and near-duplicates can be discovered and avoided efficiently with something along the lines of the following:

new_article = crawl_url(url)
duplicate_possibilities = []
sentences = get_hash_sentences(new_article)
for sentence in sentences:
    # collect every known article sharing this sentence hash
    duplicate_possibilities += hashed_sentences.get(sentence, [])

It is then pretty straightforward to fetch the text of all existing articles which share more than some percentage of sentences with the new article, and to run pairwise text-similarity algorithms on those pairs. Alternatively, the sentence-overlap percentage alone could be enough to classify a new article as a 'duplicate' or not.
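
For the overlap check, something like the following would do (the threshold is a placeholder that would need tuning):

def overlap_percentage(new_hashes, candidate_hashes):
    # fraction of the new article's sentences also present in the candidate
    new_set = set(new_hashes)
    if not new_set:
        return 0.0
    return 100.0 * len(new_set & set(candidate_hashes)) / len(new_set)

DUPLICATE_THRESHOLD = 60  # percent; placeholder value

def is_duplicate(new_hashes, candidate_hashes):
    return overlap_percentage(new_hashes, candidate_hashes) >= DUPLICATE_THRESHOLD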

The sentence:article table could become undesirably large, but its size could be reduced by heuristically selecting which sentences are 'important' (containing at least some uncommon words, not too long or too short, etc.). A rough filter along these lines is sketched below.
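
A sketch of such a filter (the word limits and stopword list are invented for illustration):

COMMON_WORDS = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is'}  # tiny sample set

def is_important(sentence):
    words = sentence.lower().split()
    if not 5 <= len(words) <= 40:  # not too short, not too long
        return False
    # require at least one word outside the common set
    return any(word not in COMMON_WORDS for word in words)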

I also wrote a basic IOL Spider for Scrapy, and started experimenting with using this to fetch old IOL data (i.e., articles published before we started watching the RSS feeds). A stripped-down sketch of such a spider follows.
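
The actual spider isn't reproduced here, but a minimal Scrapy spider for this kind of job looks roughly like the following (the URL pattern and selectors are guesses, not IOL's real markup, and this assumes Python 2):

import urlparse

import scrapy

class ArticleItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

class IolSpider(scrapy.Spider):
    name = 'iol'
    allowed_domains = ['iol.co.za']
    start_urls = ['http://www.iol.co.za/news']

    def parse(self, response):
        # follow links that look like article pages (pattern is a guess)
        for href in response.xpath('//a/@href').extract():
            if '/news/' in href:
                yield scrapy.Request(urlparse.urljoin(response.url, href),
                                     callback=self.parse_article)

    def parse_article(self, response):
        item = ArticleItem()
        item['url'] = response.url
        titles = response.xpath('//h1/text()').extract()
        item['title'] = titles[0].strip() if titles else None
        yield item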

Saturday 3 May 2014

multithreading and async-crawling

Due to the growing number of publications, crawl-time has increased dramatically. I spent the day experimenting with using multi-threading on the current implementation and using the Python Twisted library to crawl asynchronously. The latter results in a far greater speed-up, but would require a lot of code refactoring to implement.
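
For reference, the asynchronous approach looks something like this with Twisted's getPage (the URLs are placeholders; the real crawler does considerably more per page):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

urls = ['http://www.iol.co.za/', 'http://www.dailymaverick.co.za/']  # placeholders

def handle_results(results):
    # results is a list of (success, body-or-failure) pairs
    for success, value in results:
        if success:
            print('fetched %d bytes' % len(value))
    reactor.stop()

# fire off all requests at once; the reactor multiplexes them on one thread
deferreds = [getPage(url) for url in urls]
defer.DeferredList(deferreds, consumeErrors=True).addCallback(handle_results)
reactor.run()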

Friday 2 May 2014

New publications

Added the following publications:

  • The Citizen
  • Sowetan Live
  • Dispatch Live
  • The New Age
  • Business Day Live
  • Times Live
  • Daily Maverick

There are still more to add. Adding feeds still requires some manual labour, although it can now be done far more generically than before. To add new publications and feeds, one needs to manually specify the RSS URL(s), information about how to extract the author, and information about any static text to remove (either because Reporter misidentifies it and includes it in the plaintext, or because, if the static text is too long, such as the IOL copyright notice, Reporter may ignore the main text of short articles completely and pick up the static text instead).

At the moment I am specifying this information programmatically, so a typical new entry may look something like this:


dailymaverick = Publication(
    "Daily Maverick",
    "http://www.dailymaverick.co.za",
    ["http://www.dailymaverick.co.za/rss"],
    {'tag_type': 'li',
     'attribute_name': 'urlid',
     'attribute_value': 'authorid',
     'splitstring': "<div",
     'splitindex': 0},
    {'attribute_name': 'span', 'attribute_value': 'style'})
dailymaverick.create_feeds()


But the UI to allow the same functionality should be ready soon(ish).