Building a Corpus System: New feeds and tagging

Added the Mail & Guardian RSS feed to the system. It is now parsed with Python's feedparser library instead of parsing the XML directly, which should hopefully make the system more generic. The IOL feed still needs to be moved to the new system.
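As a rough sketch of why feedparser helps here: it exposes RSS and Atom feeds through the same entry structure, so one function can handle every source. The helper names and the example URL below are hypothetical, not taken from the actual system.

```python
def normalise_entries(parsed):
    """Reduce a parsed feed to a list of plain dicts.

    `parsed` is the object returned by feedparser.parse(); only its
    `entries` attribute is used. Missing fields default to "".
    """
    articles = []
    for entry in parsed.entries:
        articles.append({
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "published": entry.get("published", ""),
        })
    return articles


def fetch_articles(feed_url):
    """Download and normalise one feed (requires feedparser to be installed)."""
    import feedparser  # imported lazily so normalise_entries works without it
    return normalise_entries(feedparser.parse(feed_url))

# Usage (hypothetical URL, for illustration only):
#   for article in fetch_articles("http://example.com/rss"):
#       print(article["title"], article["url"])
```

With this shape, adding a new source should only need a new URL rather than a new parser.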

Installed NLTK on the server. Had problems using the nltk.download() function to fetch the models, corpora, etc. on which it relies. Not sure whether this was due to the CLI or to memory issues again, but trying to download "all" from the options menu failed repeatedly. Managed to download just the requirements for the pos_tag function with:

import nltk
nltk.download()
Downloader> d
Downloader> maxent_treebank_pos_tagger
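The same download can probably be scripted by passing the package identifier straight to nltk.download(), which skips the interactive menu. This is a minimal sketch: the wrapper function and its `download` parameter are hypothetical and exist only to make it easy to try out. The identifier is the one used above; newer NLTK releases ship a different default tagger package, so it may need changing.

```python
def download_pos_tagger(download=None):
    """Fetch only the model that pos_tag needs, without the interactive menu."""
    if download is None:
        import nltk  # imported lazily; requires NLTK to be installed
        download = nltk.download
    # Passing a package identifier downloads just that one package.
    return download("maxent_treebank_pos_tagger")
```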

It took a couple of hours to tag all 4000+ articles. Tagging will have to be done either at crawl time or in regular batches, as tagging a large dataset in one go could be prohibitively slow.
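One way to keep tagging cheap is to tag only articles that have not been tagged yet, so each run does a small amount of work. The sketch below assumes articles are dicts with a 'text' field and an optional 'tagged' field; that layout, the function names and the whitespace tokenisation are assumptions, not the system's actual design.

```python
def default_tagger(text):
    """Tag a text with NLTK (requires NLTK and its tagger model)."""
    import nltk
    # Splitting on whitespace is a simplification; a real tokeniser would do better.
    return nltk.pos_tag(text.split())


def tag_untagged(articles, tagger=default_tagger):
    """Tag only articles without a 'tagged' field; return how many were tagged."""
    count = 0
    for article in articles:
        if article.get("tagged") is None:
            article["tagged"] = tagger(article["text"])
            count += 1
    return count
```

Calling this right after each crawl, or from a regular scheduled job, would cover both options mentioned above.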

Added a 'tagged' link to the corpus interface, so the user can see the tagged article as well as the text and HTML. Still need to figure out the best way to store the text: keeping both the plain text and the tagged plain text is definitely not the most efficient, so one of these should be removed in the near future.
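If only the tagged version were kept, an approximation of the plain text could be rebuilt from it on demand. A minimal sketch, assuming the tagged article is stored as a sequence of (word, tag) pairs; the function name is hypothetical, and the original whitespace and line breaks are not recovered.

```python
def untagged_text(tagged_tokens):
    """Rebuild approximate plain text from (word, tag) pairs."""
    return " ".join(word for word, _tag in tagged_tokens)
```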

Richard was concerned that South African words would be incorrectly tagged. This does seem to be a problem: see for example 'maas' in http://sae.dwyer.co.za/tagged/5327e906c3f6083abd891d7f


Sunday, 23 March 2014
