Thursday, 13 March 2014

language identification and deduplication

An efficient way to identify language:

https://groups.google.com/forum/#!topic/nltk-users/pfUq8svEz-s

Create a set() of English vocab (the NLTK word list has about 200,000 words). Then create a set of the tokenized article. The set difference (article tokens minus English vocab) gives the non-English words used in the article. Take the ratio of the number of non-English words to the total number of words.

Tested with several articles: English articles seem to have about 25% "non-English" words. (The English vocab list only contains root words and some derivations: e.g., it has 'walk' and 'walking' but not 'walks', which inflates the count of "non-English" words.) By contrast, a non-English article showed about 95% non-English words.

I haven't tried or read anything about using this same method for deduplication, but I imagine that a very similar approach would work well.
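One plausible adaptation (an assumption, not something tested here) is to compare the token sets of two articles directly, using Jaccard similarity rather than a difference against a fixed vocabulary. The function names and the 0.8 threshold below are hypothetical:

```python
import re

def jaccard(a, b):
    """Jaccard similarity of two token collections: |A & B| / |A | B|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(text1, text2, threshold=0.8):
    """Treat two articles as duplicates if their token sets mostly overlap."""
    tokens1 = re.findall(r"[a-zA-Z]+", text1.lower())
    tokens2 = re.findall(r"[a-zA-Z]+", text2.lower())
    return jaccard(tokens1, tokens2) >= threshold
```

Because this discards word order and counts, it would flag near-duplicates (reordered or lightly edited copies) as well as exact ones, which is usually what deduplication wants.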
