Wednesday, 5 March 2014

better tokenizing and beginning of database design

The basic tokenization I used previously was not as good as I thought: it didn't strip out all punctuation (specifically full stops).

  • Now using the example combination of sent_tokenize and word_tokenize explained here: http://www.nltk.org/api/nltk.tokenize.html
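
A rough sketch of how that combination might look. The helper names strip_punctuation and tokenize_article are made up for illustration, and the punctuation filter is my assumption: word_tokenize splits punctuation into separate tokens rather than removing it, so something along these lines is needed afterwards. NLTK's Punkt tokenizer data has to be downloaded once with nltk.download before the two NLTK functions will run.

```python
import string


def strip_punctuation(tokens):
    """Drop tokens made up entirely of punctuation, such as '.' or ','."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


def tokenize_article(text):
    """Split text into sentences, then into words, and drop punctuation tokens."""
    # Imported here so strip_punctuation can be used without NLTK installed.
    from nltk.tokenize import sent_tokenize, word_tokenize

    tokens = []
    for sentence in sent_tokenize(text):
        # word_tokenize returns punctuation as separate tokens.
        tokens.extend(word_tokenize(sentence))
    return strip_punctuation(tokens)
```

Tokens that mix letters and punctuation, such as abbreviations, are kept as they are; only tokens that are nothing but punctuation get dropped.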

Started a basic database design, with a linking collection that records the associations between words and the articles they appear in.
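
To make the linking idea concrete, here is a minimal in-memory sketch using plain Python dicts and lists in place of a real database. The field names (word_id, article_id, count) and both function names are assumptions for illustration, not the actual schema, and storing an occurrence count on each link is an extra I added.

```python
from collections import Counter


def build_collections(articles):
    """Build a words mapping and a list of link documents.

    articles maps an article id to that article's list of word tokens.
    Each link document ties one word to one article it appears in.
    """
    words = {}   # word -> word id
    links = []   # one document per (word, article) pair
    for article_id, tokens in articles.items():
        for word, count in Counter(tokens).items():
            # Reuse the existing id, or assign the next free one.
            word_id = words.setdefault(word, len(words))
            links.append({"word_id": word_id, "article_id": article_id, "count": count})
    return words, links


def articles_containing(word, words, links):
    """Return the sorted ids of the articles in which the word appears."""
    word_id = words.get(word)
    return sorted(link["article_id"] for link in links if link["word_id"] == word_id)
```

Keeping the links separate means neither the word records nor the article records need to embed a growing list of references to each other.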

