Building a Corpus System: better tokenizing and beginning of database design

Wednesday 5 March 2014

The basic tokenization I used previously was not as good as I thought: it didn't strip out all punctuation (specifically, full stops).

I'm now using the example combination of sent_tokenize and word_tokenize, as explained here: http://www.nltk.org/api/nltk.tokenize.html
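As a rough sketch of that combination (not the exact code from this project): split the text into sentences, split each sentence into words, then drop tokens that consist only of punctuation. The function name tokenize_text and the punctuation filter are my own assumptions for illustration; only sent_tokenize and word_tokenize come from NLTK.

```python
import string


def tokenize_text(text, sent_tok=None, word_tok=None):
    """Split text into sentences, then words, dropping punctuation-only tokens.

    sent_tok and word_tok are optional stand-ins (hypothetical parameters)
    so the function can be tried without NLTK; by default the NLTK
    tokenizers are used.
    """
    if sent_tok is None or word_tok is None:
        # Imported lazily, only when the NLTK tokenizers are actually needed.
        from nltk.tokenize import sent_tokenize, word_tokenize
        sent_tok = sent_tok or sent_tokenize
        word_tok = word_tok or word_tokenize

    tokens = []
    for sentence in sent_tok(text):
        for token in word_tok(sentence):
            # Keep a token only if it has at least one non-punctuation character,
            # so a lone "." or "," is discarded but "don't" survives.
            if any(ch not in string.punctuation for ch in token):
                tokens.append(token)
    return tokens
```

With NLTK installed and its Punkt sentence tokenizer data downloaded through nltk.download, calling tokenize_text(article_text) should be enough. Tokenizing sentence by sentence matters because word_tokenize is designed to work on one sentence at a time.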

I've also started a basic database design, with a linking collection to show associations between words and the articles they appear in.
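One way such a design could look, sketched with plain Python dicts in place of a real database: a words collection, an articles collection, and a linking collection with one document per (word, article) pair. All field names here (_id, word, title, word_id, article_id, count) and both function names are assumptions for illustration, not the actual schema.

```python
from collections import Counter


def build_collections(articles):
    """Build words, articles and linking collections.

    articles: dict mapping an article title to its list of tokens.
    Returns three lists of dict "documents".
    """
    word_ids = {}
    words, article_docs, links = [], [], []
    for article_id, (title, tokens) in enumerate(articles.items()):
        article_docs.append({"_id": article_id, "title": title})
        for token, count in Counter(tokens).items():
            if token not in word_ids:
                # First time this word is seen: give it the next free id.
                word_ids[token] = len(words)
                words.append({"_id": word_ids[token], "word": token})
            # One linking document per (word, article) pair.
            links.append({"word_id": word_ids[token],
                          "article_id": article_id,
                          "count": count})
    return words, article_docs, links


def articles_containing(word, words, article_docs, links):
    """Return the titles of the articles a word appears in."""
    word_id = next((w["_id"] for w in words if w["word"] == word), None)
    if word_id is None:
        return []
    ids = {link["article_id"] for link in links if link["word_id"] == word_id}
    return [a["title"] for a in article_docs if a["_id"] in ids]
```

Keeping the associations in their own collection means each word and each article is stored once, and the question "which articles contain this word?" becomes a lookup over the links.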
