Thursday 13 March 2014

duplicate removal and multithreading

Found about 400 duplicate entries in the database. Not sure how these got there, but they were all from 7 March, before the RSS watcher was being run as a cron job. I removed them, and no more seem to be appearing. The database currently holds 2,200 articles.
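For the record, the clean-up amounted to keeping one row per article URL and deleting the rest. A minimal sketch of that step, assuming a SQLite database with a hypothetical `articles` table containing `id` and `url` columns (the actual schema may differ):

```python
import sqlite3

def remove_duplicates(con):
    """Delete duplicate articles, keeping the earliest (lowest-id) row per URL.
    Assumes a table `articles(id INTEGER PRIMARY KEY, url TEXT)` -- hypothetical schema."""
    with con:  # commit on success, roll back on error
        con.execute(
            "DELETE FROM articles "
            "WHERE id NOT IN (SELECT MIN(id) FROM articles GROUP BY url)"
        )
```

Running the RSS watcher only from cron (never by hand at the same time) should prevent the duplicates from recurring; a UNIQUE constraint on `url` would guarantee it.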

Started looking at using a multithreaded crawler to fetch web content faster. This will be especially useful for the comments, which at the moment take a long time to crawl. Even on a very small set of test cases, the speed-up is dramatic. I'm also planning to experiment with multiprocessing to see if that is faster still.
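The multithreaded crawler boils down to handing a list of URLs to a thread pool, since downloading is I/O-bound and threads spend most of their time waiting on the network. A minimal sketch (the `fetch` downloader here is a stand-in, not the real crawler code):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    """Placeholder downloader: return the page body, or None on failure,
    so one bad URL doesn't abort the whole batch."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except (OSError, ValueError):
        return None

def crawl(urls, fetch=fetch, workers=8):
    """Fetch all URLs concurrently and return {url: result}.
    Threads suit this workload because the GIL is released during network I/O."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` is the obvious route for the multiprocessing experiment, though for pure downloading (as opposed to parsing) processes are unlikely to beat threads.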

The comments haven't been 'closed' for any of the articles in the database yet, so no comments have been added to the database so far, but the functionality is in place, waiting for some content to process.

Added to reading list:
* Paper on creating a Portuguese corpus:
* The webcleaner (NCLEANER) which was used to create this corpus:
