Building a Corpus System: Comments

Monday, 24 March 2014

Comments

Started adding the comments to the articles crawled so far. This takes a while from the server (up to 200 seconds for every 10 articles processed compared to about 20 seconds locally). This also results in the cryptic "id not valid at server" MongoDB error message, which apparently means that the connection timed out. I need to experiment with setting the timeout time a bit longer.

Leaving the comments crawler to run overnight. Hopefully by tomorrow morning all the articles from before 7 March will have comments.

Should the tokens found in comments be added to the same frequency tables as the words found in articles, or do these need to be segregated?

Building a Corpus System

Monday, 24 March 2014

Comments

No comments:

Post a Comment

Blog Archive