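def similarity(article1, article2):
    # sentence_tokenize is the sentence splitter used elsewhere in this project
    s1 = set(sentence_tokenize(article1))
    s2 = set(sentence_tokenize(article2))
    shared_sentences = s1.intersection(s2)
    all_sentences = s1.union(s2)
    # shared sentences as a fraction of all distinct sentences (between 0 and 1)
    return float(len(shared_sentences)) / len(all_sentences)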
That is, articles are given a similarity rating between 0 and 1 based on the proportion of sentences they share. Comparing results for genuinely similar articles from the corpus, and for some in which I manually introduced small changes, this seemed a better gauge than looking at shared n-grams of characters or even words.
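To make the comparison concrete, here is a minimal sketch of how the sentence-based score and a word n-gram score could be computed side by side. The regex splitter, the helper names and the two sample texts are all assumptions made up for illustration; the real pipeline uses a proper sentence tokenizer.

```python
import re


def split_sentences(text):
    """Naive stand-in for a real sentence tokenizer (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(items1, items2):
    """Shared items divided by all distinct items; 0.0 if both inputs are empty."""
    a, b = set(items1), set(items2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sentence_similarity(article1, article2):
    return jaccard(split_sentences(article1), split_sentences(article2))


def word_ngram_similarity(article1, article2, n=3):
    """Same measure, but over overlapping runs of n words instead of sentences."""
    def ngrams(text):
        words = text.lower().split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return jaccard(ngrams(article1), ngrams(article2))


# Hypothetical sample texts: the second differs from the first by one word.
original = "The council met on Monday. The budget was approved. Protests are expected."
edited = "The council met on Monday. The budget was rejected. Protests are expected."

print(sentence_similarity(original, edited))    # 2 shared of 4 distinct sentences
print(word_ngram_similarity(original, edited))
```

With the sentence-based score, a single changed word costs the whole sentence, which is why the first value drops to 0.5 for these two texts.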
The sklearn Python library provides a nice TfidfVectorizer, which can be used to build a similarity matrix based on the tf-idf similarity of a list of articles. This could be more efficient, but as we cannot hope to create this matrix in a single pass of the corpus (we can't hold all articles in memory at once), this efficiency is non-trivial to take advantage of. Instead, doing pairwise comparison of articles as outlined in the previous post seems to be the best option at this stage.
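For reference, this is roughly what the tf-idf route looks like on a list small enough to fit in memory. It is only a sketch: the three sample articles are made up, and pairing TfidfVectorizer with cosine_similarity is my assumption about how the matrix would be built, not something I have run on the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample articles; the first two are identical.
articles = [
    "The council approved the budget on Monday.",
    "The council approved the budget on Monday.",
    "Heavy rain flooded several roads in Durban.",
]

# One row of tf-idf weights per article (a sparse matrix).
tfidf = TfidfVectorizer().fit_transform(articles)

# Entry [i][j] is the cosine similarity between article i and article j.
similarity_matrix = cosine_similarity(tfidf)

print(similarity_matrix.shape)
print(similarity_matrix[0][1], similarity_matrix[0][2])
```

The whole list has to be passed to fit_transform at once, which is exactly the part that does not fit the corpus as it stands.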
Some optimization was added to the deduplication process, namely:
If a sentence from one article matches too many other articles (for now, more than 10), the sentence is ignored. This means we don't need to pull down hundreds of articles and compare pairwise against all of them for sentences such as "subscribe to our newsletter", which are still dirtying some of the articles. This will remain useful even once the corpus texts are properly cleaned, for sentences such as "more to follow" and other reporter cliches, although for now I'm also ignoring sentences which are fewer than 20 characters long.
A better gauge of similarity could possibly be achieved by taking the following into account:
- sentences which appear only in very few other articles are weighted higher for deduplication
- sentences containing names are weighted higher
- longer sentences are weighted higher
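The sentence filter and two of the weighting ideas above could be sketched as follows. This is not the code running on the server: the in-memory index, the function names and the particular weight formula (log of rarity times log of length) are assumptions for illustration, and weighting by names is left out because it would need some form of name recognition.

```python
import math
from collections import defaultdict

MIN_SENTENCE_LENGTH = 20    # from the post: shorter sentences are ignored
MAX_MATCHING_ARTICLES = 10  # from the post: sentences matching more articles are ignored


def build_sentence_index(articles):
    """Map each sentence to the ids of the articles containing it.

    `articles` is a hypothetical dict of {article_id: list of sentences}.
    """
    index = defaultdict(set)
    for article_id, sentences in articles.items():
        for sentence in sentences:
            index[sentence].add(article_id)
    return index


def useful_sentences(article_id, articles, index):
    """Sentences of one article that are worth comparing on."""
    keep = set()
    for sentence in articles[article_id]:
        if len(sentence) < MIN_SENTENCE_LENGTH:
            continue  # too short to be meaningful
        other_articles = index[sentence] - {article_id}
        if len(other_articles) > MAX_MATCHING_ARTICLES:
            continue  # boilerplate such as "subscribe to our newsletter"
        keep.add(sentence)
    return keep


def sentence_weight(sentence, index, total_articles):
    """Assumed weighting: rarer and longer sentences count for more."""
    rarity = math.log(total_articles / len(index[sentence])) + 1.0
    length = math.log(len(sentence))
    return rarity * length


def weighted_similarity(id1, id2, articles, index):
    """Weighted share of useful sentences the two articles have in common."""
    s1 = useful_sentences(id1, articles, index)
    s2 = useful_sentences(id2, articles, index)
    union = s1 | s2
    if not union:
        return 0.0
    total = len(articles)
    shared = sum(sentence_weight(s, index, total) for s in s1 & s2)
    everything = sum(sentence_weight(s, index, total) for s in union)
    return shared / everything
```

The score stays between 0 and 1, so the existing thresholds could in principle be kept, though they would probably need retuning against the test articles.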
After running a number of tests on the development database, I have now left the deduplicator to run on the main database. It is not removing the duplicates yet, just marking them, as well as marking 'similar' articles (those which score above 30% similarity).
Also discovered some problems with short articles on thenewage.co.za - similar to the earlier problem with IOL, if the article text is too short then Reporter picks up the CSS styling as the 'main text' instead. Unfortunately, unlike with IOL, removing the CSS as a pre-processing step does not solve the issue, as Reporter's next guess is the "in other news" section; if this is removed, it picks up the phrase "comment now". At this stage I couldn't find a solution generic enough to be appealing - some customized code may need to be written for some publications.
Installed NLTK on the server with the punkt package. Took a while to find out how to do this on a headless machine (the NLTK downloader seems GUI-focussed, and the CLI downloader didn't provide much help in locating the "english.pickle" resource, which is part of the punkt tokenizer):
python -m nltk.downloader punkt