Wednesday 5 March 2014

better tokenizing and beginning of database design

The basic tokenization I used previously was not as good as I thought: it didn't strip out all punctuation (specifically, full stops).

  • Now using the combination of sent_tokenize and word_tokenize shown in the example in the NLTK documentation: http://www.nltk.org/api/nltk.tokenize.html
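For reference, here is a minimal sketch of how that combination might look. It is not the project's actual code: the helper names `drop_punctuation` and `tokenize_article` are made up for illustration, and the filter step is an assumption based on the fact that word_tokenize emits punctuation (including full stops) as separate tokens, which can then be discarded. NLTK's Punkt sentence model data needs to be downloaded before sent_tokenize will run.

```python
import string


def drop_punctuation(tokens):
    """Discard tokens made up entirely of punctuation characters.

    Hypothetical helper: tokens such as "." or "," are removed, while
    tokens like "n't" survive because they contain letters.
    """
    return [
        tok for tok in tokens
        if not all(ch in string.punctuation for ch in tok)
    ]


def tokenize_article(text):
    """Split text into sentences, then into words, dropping punctuation.

    Returns one list of word tokens per sentence.
    """
    # Imported here so drop_punctuation can be used without NLTK installed.
    from nltk.tokenize import sent_tokenize, word_tokenize

    return [
        drop_punctuation(word_tokenize(sentence))
        for sentence in sent_tokenize(text)
    ]
```

Calling `tokenize_article(article_text)` on the raw text of an article should give a list of sentences, each a list of words with the stand-alone punctuation tokens removed.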
Started basic database design, with a linking collection to show associations between words and the articles they appear in.
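As a rough sketch of the linking idea, the snippet below builds the link records in plain Python dictionaries rather than in any particular database. Everything here is an assumption for illustration: the function name `build_link_collection`, the field names `word_id`, `article_id` and `count`, and the choice to store an occurrence count are not taken from the actual design.

```python
from collections import defaultdict


def build_link_collection(articles):
    """Build word ids and word/article link records.

    articles: dict mapping an article id to its list of word tokens.
    Returns (words, links), where words maps each lower-cased word to a
    numeric id and links is a list of dicts, one per (word, article)
    pair, shaped like documents in a linking collection.
    """
    words = {}                  # word text -> word id
    counts = defaultdict(int)   # (word id, article id) -> occurrences

    for article_id, tokens in articles.items():
        for token in tokens:
            # Give each new word the next free id; reuse ids for known words.
            word_id = words.setdefault(token.lower(), len(words))
            counts[(word_id, article_id)] += 1

    links = [
        {"word_id": word_id, "article_id": article_id, "count": count}
        for (word_id, article_id), count in counts.items()
    ]
    return words, links
```

Keeping the associations in their own collection means a word is stored once however many articles it appears in, and finding every article for a word is a lookup on `word_id` in the links.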

