Sunday 23 March 2014

New feeds and tagging

Added the Mail and Guardian RSS feed to the system. This now uses Python's feedparser library instead of parsing the XML directly, which should make the system more generic. Still need to move the IOL feed to the new system.
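
For reference, a rough sketch of what a feedparser-based fetch can look like. The helper names (normalise_entry, fetch_articles) and the choice of fields are made up for illustration and are not the actual crawler code; feedparser.parse and the entries list it returns are the library's own.

```python
def normalise_entry(entry):
    """Reduce one feed entry to the fields the corpus needs.

    Works on feedparser entries and on plain dicts, since both support .get().
    """
    return {
        "title": entry.get("title", "").strip(),
        "url": entry.get("link", ""),
        "published": entry.get("published", ""),
    }


def fetch_articles(feed_url):
    """Fetch a feed and return its entries as plain dicts (hypothetical helper)."""
    # Imported here so normalise_entry can be used without the dependency.
    import feedparser

    parsed = feedparser.parse(feed_url)
    # parsed.bozo is set when the feed was malformed; entries may still be usable.
    return [normalise_entry(e) for e in parsed.entries]
```

Because every feed ends up in the same dict shape, adding another source should only need a new URL rather than a new parser.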

Installed NLTK on the server. Had problems with the downloader for the corpora, models, etc. on which it relies. Not sure whether this was due to using the CLI or to memory issues again, but trying to download "all" on the options page failed repeatedly. Managed to download what the pos_tag function needs with:

import nltk
nltk.download()
Downloader> d
  Identifier> maxent_treebank_pos_tagger
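
The interactive prompt can also be skipped by passing the package id straight to nltk.download, which returns True on success and False on failure. A minimal sketch with a retry loop, since the downloads were flaky; the function name, the retry counts and the injectable downloader argument are my own assumptions, added so the loop can be exercised without a network connection.

```python
import time


def download_package(package_id, attempts=3, wait_seconds=5, downloader=None):
    """Try to fetch one NLTK data package, retrying on failure.

    Returns True as soon as one attempt succeeds, False otherwise.
    """
    if downloader is None:
        # Default to the real NLTK downloader, without its progress output.
        import nltk

        def downloader(pid):
            return nltk.download(pid, quiet=True)

    for attempt in range(1, attempts + 1):
        if downloader(package_id):
            return True
        if attempt < attempts:
            # Pause before the next attempt.
            time.sleep(wait_seconds)
    return False
```

Calling download_package("maxent_treebank_pos_tagger") would then fetch just the tagger model instead of "all".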

It took a couple of hours to tag all 4000+ articles. Tagging will have to be done either at crawl time or in regular batches, as tagging a large dataset in one go could be prohibitively slow.
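
One way to keep the cost down is to tag only the articles that have not been tagged yet, so the same function can run at crawl time or from a regular job. A sketch under assumptions: the article dicts with "text" and "tagged" keys are a hypothetical schema, not the real database layout, and splitting on whitespace stands in for proper tokenisation.

```python
def tag_text(text):
    """Tag one article with NLTK's default part-of-speech tagger."""
    import nltk  # needs the tagger model downloaded beforehand

    # Whitespace splitting is a simplification; a real tokenizer would
    # separate punctuation from words before tagging.
    return nltk.pos_tag(text.split())


def tag_pending(articles, tagger=tag_text):
    """Tag only the articles that have no tags yet.

    `articles` is a list of dicts with a "text" key and an optional
    "tagged" key (hypothetical schema). Returns how many were tagged.
    """
    count = 0
    for article in articles:
        if article.get("tagged") is None:
            article["tagged"] = tagger(article["text"])
            count += 1
    return count
```

Running this again over the same articles does no extra work, so a full re-tag is only needed if the tagger itself changes.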

Added a 'tagged' link to the corpus interface, so the user can see the tagged article as well as the text and HTML. Need to figure out the best way to store the text; storing both the plain text and the tagged plain text is definitely not the most efficient, so one of these should be removed in the near future.
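
One option would be to store only the tagged version and derive the plain text from it on demand. A possible sketch, assuming the tagged text is kept as a "word/TAG" string; the function names are illustrative only, and the recovered text loses the original whitespace and line breaks, so this only works if exact layout does not matter.

```python
def encode_tagged(tagged_tokens):
    """Serialise [(word, tag), ...] as 'word/TAG word/TAG ...'."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged_tokens)


def decode_tagged(encoded):
    """Inverse of encode_tagged; splits on the last '/' in each token."""
    pairs = []
    for token in encoded.split():
        # rpartition keeps any '/' inside the word itself (e.g. 'km/h').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def plain_text(encoded):
    """Recover the untagged words from the stored tagged string."""
    return " ".join(word for word, _ in decode_tagged(encoded))
```

With this, the 'text' view in the interface could be generated from the same stored string as the 'tagged' view.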

Richard was concerned that South African words would be incorrectly tagged. This does seem to be a problem: see for example 'maas' in


