Building a Corpus System: rss, regex, dates, modules and metadata, encoding

Beginnings of rss-crawler.py which will watch rss feeds and crawl new articles is now in place. I'm currently testing this with just the main news link from the iol.co.za/rss page: http://iol.co.za/cmlink/1.640. This allows for very easy extraction of

url
date
headline
guid (permalink=false might be problematic)
description

But unfortunately the author still needs to be extracted from the html. Using a double regex, the first to extract all the article headers, and the second to extract the author from this, seems to accurately identify the authors for the small test set (20 articles) so far. Further testing will be done.

Started using Python's dateutils.parser for flexible date parsing.

Received documents from Richard about metadata and modularization.

I need to have a better look at how some of the python modules I'm using are doing encoding, as I'm having a few issues with smart quotes and other unicode characters. I discovered the unidecode python module which does a brilliant job at converting unicode characters to the nearest possible ascii match, which may be useful for some text analysis.

Started looking at NLTK's capabilities for word stemming.

Building a Corpus System

Thursday, 6 March 2014

rss, regex, dates, modules and metadata, encoding

No comments:

Post a Comment

Blog Archive