Wednesday, 5 March 2014

Better tokenizing and beginning of database design

The basic tokenization used previously was not as good as I thought, as it didn't strip out all punctuation (specifically full stops).

  • Now using a combination of NLTK's sent_tokenize and word_tokenize, following the example explained here:
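
Separately from the linked example, here is a minimal sketch of how the two NLTK functions might be combined. The helper names (strip_punctuation, tokenize) and the extra step that drops punctuation-only tokens are my own assumptions for illustration, not necessarily what the project code does.

```python
import string


def strip_punctuation(tokens):
    """Drop tokens made up only of punctuation characters, such as '.' or ','."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


def tokenize(text):
    """Split text into sentences, then words, and drop punctuation-only tokens."""
    # Imported here so strip_punctuation can be used without NLTK installed.
    # NLTK's sentence tokenizer also needs its Punkt model data, which has to
    # be fetched once with nltk.download before this function will run.
    from nltk.tokenize import sent_tokenize, word_tokenize

    words = []
    for sentence in sent_tokenize(text):
        # word_tokenize emits a sentence-final full stop as its own token,
        # which the filter then removes.
        words.extend(strip_punctuation(word_tokenize(sentence)))
    return words
```

The filter only removes tokens that consist entirely of punctuation, so words such as "don't" or "U.S." pass through unchanged.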
Started basic database design, with a linking collection to show associations between words and the articles they appear in.
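
To make the linking idea concrete, here is a rough sketch using plain Python lists of dictionaries in place of real database collections. All field names (_id, text, title, word_id, article_id, count) and both function names are hypothetical, and the per-article count is an extra I added for illustration; the actual design may differ.

```python
from collections import Counter
from itertools import count


def build_collections(articles):
    """Build words, articles and linking collections from {title: [tokens]}."""
    word_ids = {}  # word text -> word id, to avoid storing a word twice
    words, article_docs, links = [], [], []
    next_word_id = count(1)

    for article_id, (title, tokens) in enumerate(articles.items(), start=1):
        article_docs.append({"_id": article_id, "title": title})
        for word, occurrences in Counter(tokens).items():
            if word not in word_ids:
                word_ids[word] = next(next_word_id)
                words.append({"_id": word_ids[word], "text": word})
            # One linking document per (word, article) pair.
            links.append({"word_id": word_ids[word],
                          "article_id": article_id,
                          "count": occurrences})
    return words, article_docs, links


def articles_containing(word, words, article_docs, links):
    """Return the titles of the articles that a given word appears in."""
    ids = [w["_id"] for w in words if w["text"] == word]
    if not ids:
        return []
    linked = {link["article_id"] for link in links if link["word_id"] == ids[0]}
    return [a["title"] for a in article_docs if a["_id"] in linked]
```

Keeping the associations in their own collection means each word and each article is stored once, and a lookup in either direction (articles for a word, or words for an article) only has to scan the links.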
