- For each article, hash each sentence, and maintain a table mapping hashed sentences to the articles containing them.
Duplicates and near-duplicates can then be discovered and avoided efficiently with something along the lines of the following:
new_article = crawl_url(url)
duplicate_possibilities = []
sentences = get_hash_sentences(new_article)
for sentence in sentences:
    duplicate_possibilities += hashed_sentences[sentence]
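A minimal sketch of the underlying table, assuming the hash is computed over lightly normalized sentence text (the names `hash_sentence` and `index_article` are hypothetical, not from the actual codebase):

```python
import hashlib
from collections import defaultdict

def hash_sentence(sentence):
    # Normalize whitespace and case before hashing so trivial
    # formatting differences do not defeat the lookup.
    normalized = " ".join(sentence.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def index_article(index, article_id, sentences):
    # index maps sentence hash -> set of article ids containing it
    for sentence in sentences:
        index[hash_sentence(sentence)].add(article_id)

index = defaultdict(set)
index_article(index, "article-1", ["The cat sat.", "It rained today."])
index_article(index, "article-2", ["It rained today.", "Markets fell."])

# Articles sharing at least one sentence with an incoming sentence:
candidates = index[hash_sentence("It rained  today.")]
```

Here `candidates` contains both indexed articles, since the normalized lookup sentence matches one sentence in each.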
It is then pretty straightforward to fetch the text of all existing articles which share more than some percentage of sentences with the new article, and to run text-similarity algorithms on these pairs. Alternatively, the sentence-overlap percentage alone could be enough to classify a new article as a duplicate or not.
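The overlap-percentage check could be sketched like this, treating each article as a set of sentence hashes (the 0.7 threshold is an illustrative guess, not a value from the source):

```python
def overlap_fraction(new_hashes, existing_hashes):
    # Fraction of the new article's sentences that also appear
    # in an existing article.
    if not new_hashes:
        return 0.0
    return len(new_hashes & existing_hashes) / len(new_hashes)

new_article = {"h1", "h2", "h3", "h4"}
existing = {"h2", "h3", "h4", "h5"}
is_duplicate = overlap_fraction(new_article, existing) >= 0.7
```

In this example 3 of the 4 new-article hashes overlap (0.75), so the article would be flagged as a duplicate at that threshold.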
The sentence:article table could become undesirably large, but its size could be reduced by heuristically selecting which sentences are 'important' (e.g. containing at least some uncommon words, neither too long nor too short).
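Such an importance filter might look like the following sketch; the word list, length bounds, and function name are all assumptions for illustration:

```python
# A tiny stand-in for a proper stopword list.
COMMON_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}

def is_important(sentence, min_words=5, max_words=30):
    # Keep only mid-length sentences that contain at least one
    # word outside the common-word list.
    words = sentence.lower().split()
    if not (min_words <= len(words) <= max_words):
        return False
    return any(w not in COMMON_WORDS for w in words)

is_important("The mayor announced a new water tariff yesterday")  # kept
is_important("It is")  # rejected: too short
```

Only sentences passing the filter would be hashed and stored, shrinking the table at the cost of missing overlaps among the discarded sentences.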
I also wrote a basic IOL Spider for Scrapy, and started experimenting with using it to fetch old IOL data (i.e. articles published before we started watching the RSS feeds).