Started working on deduplication. I can compare each plaintext article against every other article in the database. This works for the current data set, but the number of comparisons grows quadratically with the number of articles, so it will not scale well. It is also an exact comparison, so it will not pick up duplicates if any change at all is present. Looked at the Python diff library (difflib), which looks promising for fuzzy matching.
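A minimal sketch of what a fuzzy comparison could look like, assuming the diff library in question is the standard-library `difflib`. The function names and the 0.9 threshold are placeholders for illustration, not tuned values.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences, which can skew ratios.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two articles as duplicates if they are similar enough.

    The threshold is an assumed value and would need tuning on real data.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    original = "The council approved the new budget on Monday."
    edited = "The council approved the new budget on Tuesday."
    print(similarity(original, edited))
    print(is_near_duplicate(original, edited))
```

This tolerates small edits, but it is still a pairwise comparison and `SequenceMatcher` is slow on long texts, so it does not solve the scaling problem on its own.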
The most efficient way would probably be to find or write an algorithm to extract keywords from an article. If this were accurate enough, then we could simply look at articles with the same keywords, and perform deduplication only on these.
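As a rough illustration of the idea, the sketch below uses the most frequent non-stopword tokens as "keywords" and only returns pairs of articles that share enough of them. The stopword list, the `k` and `min_shared` parameters, and the function names are all assumptions made for the example; a real keyword extractor would likely need something better than raw frequency.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "of", "to", "in", "on", "for",
    "is", "was", "that", "it", "with", "as", "at", "by",
})


def keywords(text, k=5):
    """Return the k most frequent non-stopword tokens as a frozenset."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return frozenset(word for word, _ in ranked[:k])


def candidate_pairs(articles, k=5, min_shared=3):
    """Return the set of (id, id) pairs sharing at least min_shared keywords.

    `articles` maps an article id to its plaintext.
    """
    kw = {aid: keywords(text, k) for aid, text in articles.items()}
    index = defaultdict(set)  # keyword -> ids of articles that have it
    for aid, words in kw.items():
        for word in words:
            index[word].add(aid)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if len(kw[a] & kw[b]) >= min_shared:
                pairs.add((a, b))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then need the expensive full-text comparison.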
Gallery and Video articles are a problem. Sometimes an advert is picked up as the 'main text' of these articles, or the text is so short that it is probably insignificant. These are pretty easy to identify. They usually show the following traits:
* The URL contains "Gallery", "Video" or "Pics".
* The plaintext often starts with "Gallery".
* The plaintext is normally very short (I experimented with limits; any story under 500 characters seems to be uninteresting).
I can therefore fairly easily filter these out, but we probably still want the comments on these, so I can't just remove them entirely.
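One way to keep the comments while excluding these articles from text analysis is to flag them rather than delete them. The sketch below combines the three traits above with a simple OR, which is an assumption; the function name and the example URLs in the usage note are made up.

```python
import re

# Threshold taken from the experiments noted above.
MIN_STORY_CHARS = 500

# Case-insensitive match for the URL markers listed above.
MEDIA_URL_PATTERN = re.compile(r"gallery|video|pics", re.IGNORECASE)


def is_media_stub(url: str, plaintext: str) -> bool:
    """Return True if the article looks like a gallery/video stub.

    The article itself is kept, so its comments remain available;
    callers can simply skip flagged articles during text analysis.
    """
    text = plaintext.strip()
    return (
        MEDIA_URL_PATTERN.search(url) is not None
        or text.lower().startswith("gallery")
        or len(text) < MIN_STORY_CHARS
    )
```

For example, `is_media_stub("http://example.com/news/gallery-some-event-123", text)` is true regardless of the text, while an ordinary story URL with 500 or more characters of text that does not start with "Gallery" is not flagged.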
All articles from thepost.co.za are duplicates of iol.co.za articles. Their URLs simply redirect to the iol.co.za/thepost homepage, even though each URL contains the same article slug as its iol duplicate.
Sometimes the iol.co.za article ID changes, but the slug remains the same. Both entries then end up in the database as duplicates.
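Both cases could be caught by grouping on the slug alone. The sketch below assumes the last path segment of a URL has the form `<slug>-<id>`, where the ID consists of digits and dots; that format is a guess and should be checked against the real URLs. The function names are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed ID format: a trailing "-" followed by digits and dots.
TRAILING_ID = re.compile(r"-[\d.]+$")


def slug_of(url: str) -> str:
    """Return the last path segment with any trailing ID removed."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return TRAILING_ID.sub("", last_segment).lower()


def group_by_slug(urls):
    """Map each slug to its URLs, keeping only slugs seen more than once."""
    groups = defaultdict(list)
    for url in urls:
        groups[slug_of(url)].append(url)
    return {slug: found for slug, found in groups.items() if len(found) > 1}
```

A slug that genuinely ends in a number would be truncated by this pattern, so grouped URLs should be treated as duplicate candidates rather than confirmed duplicates.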
For tokenizing, it might make sense to keep an ordered set of the lowercase tokens of each article. This would allow word frequency analysis and efficient lookup. Case-sensitive queries and substring matches could then be done with more expensive regex searches on the original text. This might also help with deduplication.
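A small sketch of how this might look. Since a plain set drops counts, the example stores tokens in a `collections.Counter`, which gives both frequencies and fast membership tests; that substitution, the token pattern, and the helper names are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed token definition: runs of lowercase letters, digits, apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> Counter:
    """Return lowercase token counts for one article."""
    return Counter(TOKEN.findall(text.lower()))


def contains_exact(text: str, tokens: Counter, word: str) -> bool:
    """Case-sensitive single-word search.

    The cheap lowercase lookup runs first; the regex only runs on
    articles that can possibly match.
    """
    if word.lower() not in tokens:
        return False
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None


def jaccard(a: Counter, b: Counter) -> float:
    """Overlap of two token sets, as a rough deduplication signal."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Two articles with a high `jaccard` score share most of their vocabulary, which could serve as a cheap first pass before a full text comparison.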