Monday 28 July 2014

deduplication again and newage issues

Finished implementing basic near deduplication. After playing around with TF-IDF, cosine distance, and n-gram similarity I decided to use a more customized similarity function based on sentences. In short:

similarity (article1, article2):
    s1 = set(sentence_tokenize(article1))
    s2 = set(sentence_tokenize(article2))
    shared_sentences = s1.intersection(s2)
    all_terms = s1.union(s2)
    return len(shared_terms)/len(all_terms)

That is, articles are given a similarity rating between 0 and 1 based on how many sentences they share. Looking at comparative results for actual similar articles from the corpus and from some in which I manually introduced small changes, this seemed a better gauge than looking at shared ngrams of characters or even words.

the sklearn python library provides a nice TfidfVectorizer which creates a similarity matrix based on the tf-idf similarity of a list of articles. This could be more efficient, but as we cannot hope to create this matrix in a single pass of the corpus (we can't hold all articles in memory at once), this efficiency is non-trivial to take advantage of. Instead doing pairwise comparison of articles as outline in the previous post seems to be the best option at this stage.

Some optimization was added to the deduplication process, namely:
If a sentence from one article matches too many other articles (for now > 10), ignore this sentence. This means we don't need to pull down hundreds of articles and pairwise compare against all of them for sentences such as "subscribe to our newsletter" which is still dirtying some of the articles. This will remain useful even on the corpus texts are properly cleaned for sentences such as "more to follow" and other reporter cliches, although for now I'm ignoring sentences which are fewer than 20 characters long. Better gauge of similarity could possibly be achieved by taking into account:

  • sentences which appear only in very few other articles are weighted higher for deduplication
  • sentences containing names are weighted higher
  • longer sentences are weighted higher

After running a number of tests on the development database I have now left the deduplicator to run on the main database. It is not removing the duplicates yet, but just marking them, as well as marking 'similar' articles (those which rank with above 30% similarity).

Also discovered some problems with short articles on thenewage.co.za - similar to the problem before with IOL, if the article text is too short then Reporter picks up the CSS styling instead as the 'main text'. Unfortunately unlike with IOL removing the CSS as a pre-processing step does not solve the issue, as Reporter's next guess is the "in other news" section; if this is removed, it picks up the phrase "comment now". At this stage I couldn't find a solution generic enough to be appealing - some customized code may need to be written for some publications.

Installed NLTK on the server with the punkt package. Took a while to find how to do this on a headless machine (NLTK downloader seems GUI-focussed and the cli downloader didn't provide much help in locating the "english.pickle" resource which is part of the punkt tokenizer):

python -m nltk.downloader punkt




Wednesday 23 July 2014

Second semester - deduplication

Working on the project again now that exam revision, exams, internship and field trip are over.

Worked on near deduplication. Using sklearn python library with TfidfVectorizers as described here: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents which seems to be working very well so far.

As pairwise comparison of all articles will become increasingly impractical as corpus size grows I'm taking a customized approach of keeping a collection of sentence hashes. This takes up more database space, but it means that we only need to do pairwise comparison on articles which share at least one sentence.

Dedup can be done on an existing corpus by building up the sentence hash collection while doing the deduplication. If the sentence hashes exist already for all articles in the db then we need to pull only a limited subset of articles to compare each new article against.

Also discovered ssdeep fuzzy hashing in Python (thanks to Peter Wrench). Will take a comparative look at this at a later stage to see if can be more efficient than the method described above.