Working on the project again now that exam revision, exams, internship and field trip are over.
Worked on near deduplication, using the sklearn Python library with TfidfVectorizer as described here: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents. This seems to be working very well so far.
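For reference, a minimal sketch of that approach: vectorise the documents with TF-IDF and take the cosine similarity of the resulting vectors. The example texts and the 0.9 threshold are just illustrative placeholders, not the project's actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(doc_a, doc_b):
    """TF-IDF cosine similarity between two documents, in [0, 1]."""
    tfidf = TfidfVectorizer().fit_transform([doc_a, doc_b])
    return cosine_similarity(tfidf[0:1], tfidf[1:2])[0][0]

a = "The quick brown fox jumps over the lazy dog."
b = "A quick brown fox jumped over a lazy dog."
score = similarity(a, b)
print(score)                    # close to 1.0 for near-duplicate text
print(score > 0.9)              # illustrative near-duplicate threshold
```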
As pairwise comparison of all articles will become increasingly impractical as the corpus grows, I'm taking a custom approach of keeping a collection of sentence hashes. This takes up more database space, but it means we only need to do pairwise comparisons on articles which share at least one sentence.
Dedup can be done on an existing corpus by building up the sentence hash collection while doing the deduplication. Once the sentence hashes exist for all articles in the db, we only need to pull a limited subset of articles to compare each new article against.
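A rough illustration of the idea, with an in-memory dict standing in for the database collection; the sentence splitting and MD5 hashing here are my assumptions for the sketch, not necessarily what the project uses.

```python
import hashlib
import re
from collections import defaultdict

# hash of a normalised sentence -> ids of articles containing that sentence
# (an in-memory dict here; in practice this lives in the database)
sentence_index = defaultdict(set)

def sentence_hashes(text):
    """Hash each sentence of an article after simple normalisation."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return {hashlib.md5(s.lower().encode('utf-8')).hexdigest()
            for s in sentences if s}

def candidates(text):
    """Articles sharing at least one sentence hash with the new text.
    Only these need the full pairwise TF-IDF comparison."""
    found = set()
    for h in sentence_hashes(text):
        found |= sentence_index[h]
    return found

def add_article(article_id, text):
    """Store the article's sentence hashes so later articles can find it."""
    for h in sentence_hashes(text):
        sentence_index[h].add(article_id)
```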
Also discovered ssdeep fuzzy hashing in Python (thanks to Peter Wrench). I'll take a comparative look at this at a later stage to see if it can be more efficient than the method described above.
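A quick sketch of what that would look like, assuming the python-ssdeep binding: compute a fuzzy hash per document and compare the hashes, which gives a 0-100 match score. The repeated sample text is only there because very short inputs don't produce meaningful fuzzy hashes.

```python
import ssdeep  # the python-ssdeep binding

h1 = ssdeep.hash("The quick brown fox jumps over the lazy dog. " * 20)
h2 = ssdeep.hash("A quick brown fox jumped over the lazy dog. " * 20)
print(ssdeep.compare(h1, h2))  # 0-100 similarity score between the two hashes
```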