Building a Corpus System: gzip and admin panel

After trying various RSS feeds, I ran into a problem: some sites send their HTML gzip-compressed. A browser decompresses this transparently, but the corpus system does not, so it ends up storing compressed bytes instead of readable HTML.

Specifically, SABC.co.za articles are returned in this form. A fix for the problem was found here:

http://stackoverflow.com/questions/18146389/urlopen-trouble-while-trying-to-download-a-gzip-file

I am busy implementing this, and am also looking into similar problems that may occur with other feeds.
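As a rough illustration of the kind of fix involved, here is a minimal sketch, not the code actually used in the corpus system. It assumes Python 3 and the standard library only; the function names `decode_body` and `fetch_html` are made up for this example. The idea is to check the downloaded bytes for the gzip magic number and decompress them before decoding to text.

```python
import gzip
import urllib.request

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, charset="utf-8"):
    """Return the body as text, decompressing it first if it is gzipped.

    Hypothetical helper: it looks at the leading bytes rather than trusting
    the Content-Encoding header, since the header may be missing or wrong.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


def fetch_html(url, timeout=30):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, charset)
```

Keeping the decompression in its own function means it can be checked without a network connection, and the same function can be reused for any other feed that turns out to behave the same way.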

I also worked on an admin panel that lets the user add new feeds, and started reworking the database to fit this model.

The corpus currently contains over 10,000 articles and 38,000 comments. All of the comments come from articles published in February and early March, but I have started the script that collects the remainder of the March comments. It may need to run over the next few days, as the connection to the Netherlands is still problematic.


Monday, 14 April 2014
