Building a Corpus System: April 2014

Friday, 25 April 2014

Moving servers

The Comp Sci dept has provisioned me a VPS on their server at Struben.

The website is now hosted there, so it should be a lot faster, and the disk space and memory problems I was running into before are resolved for now.

The site is only accessible on the Rhodes Intranet for now.

See: http://146.231.133.148/

There were some configuration issues with the database, but it is working for now, and I am in the process of moving the cron jobs over from the Netherlands server.

This server also has limited space, as the department was only able to give me 50GB of disk space, of which 35GB is currently free. I am hoping they will be willing to renegotiate once this space fills up, so I will go ahead and add new publications, as well as comments for the existing ones (currently only comments from IOL are being added).

Saturday, 19 April 2014

Author customization

Finished the base back-end and UI functionality that lets a user add new feeds to the RSS Watcher and specify how the author should be extracted. I still need to move the IOL, MG, and Grocott's feeds to the new system and tidy up a bit, but the basic functionality can be seen at sae.dwyer.co.za/rss.
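To illustrate the idea of a per-feed author rule, here is a minimal Python 3 sketch. It is not the RSS Watcher's actual code: the FeedConfig class, the extract_author function, and the choice of a regular expression as the rule format are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class FeedConfig:
    """Hypothetical per-feed settings; not the real RSS Watcher schema."""
    name: str
    url: str
    # Assumed rule format: a regex whose first capture group is the author.
    author_pattern: str


def extract_author(html, feed):
    """Return the author found by the feed's rule, or None if it does not match."""
    match = re.search(feed.author_pattern, html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()
```

A feed whose articles carry a byline span could then be configured with a pattern such as `<span class="byline">By\s+(.*?)</span>`. Each publication marks up its bylines differently, which is why the rule is stored per feed rather than hard-coded.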

Started negotiations for a server in the Struben building. Unfortunately, in the short term this server will be accessible only on the Rhodes intranet, but I plan to add several new feeds once the move is done (hopefully by the end of next week). The DO server is already battling with the load generated by just three feeds, so I will not add new publications until I have moved over to the new server.

Added user-agent spoofing for HTTP requests as a fallback, because some sites return 403 errors if the User-Agent header is not set to that of a common browser.
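The fallback can be sketched roughly as follows, assuming Python 3 and the standard urllib library. The fetch function, its opener parameter, and the BROWSER_USER_AGENT string are made up for this example and are not taken from the corpus system.

```python
import urllib.error
import urllib.request

# Assumed browser-like value; any common browser string would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0 Safari/537.36"
)


def fetch(url, opener=urllib.request.urlopen, timeout=30):
    """Fetch url; on HTTP 403, retry once with a browser-like User-Agent."""
    try:
        with opener(urllib.request.Request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code != 403:
            raise  # other errors are not related to the user agent
    # Fallback: same request, but pretending to be a common browser.
    retry = urllib.request.Request(url, headers={"User-Agent": BROWSER_USER_AGENT})
    with opener(retry, timeout=timeout) as resp:
        return resp.read()
```

The first attempt uses the library's default user agent, so the spoofed header is only sent to sites that actually reject the default. The opener parameter exists only so the function can be exercised without a network connection.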

Monday, 14 April 2014

gzip and admin panel

After trying various RSS feeds, I ran into the problem that some sites send their HTML gzip-compressed. A browser decompresses this transparently, but the corpus system does not.

Specifically, SABC.co.za articles are returned in this form. A fix for the problem was found here:

http://stackoverflow.com/questions/18146389/urlopen-trouble-while-trying-to-download-a-gzip-file

I am busy implementing this, as well as looking into potential similar problems which may occur.
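One way to handle such responses is sketched below, assuming Python 3. It is not necessarily the approach from the linked answer; the decode_body name is made up for this example. It checks for the two magic bytes that start every gzip stream and decompresses only when they are present.

```python
import gzip

# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw):
    """Return the response body as plain bytes, decompressing gzip if needed."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw  # already uncompressed
```

Checking the body itself, rather than relying only on the Content-Encoding response header, also covers servers that compress without declaring it. Other encodings, such as deflate, would need their own handling.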

Worked on an admin panel that lets the user add new feeds, and started reworking the database to fit this model.

The corpus currently contains over 10,000 articles and 38,000 comments. All of the comments are from articles published in February and early March, but I have started the script to collect the remainder of the March comments. This might need to run over the next few days, as the connection to the Netherlands server is still problematic.