Building a Corpus System: duplicate removal and multithreading

Found about 400 duplicate entries in the database. Not sure how these got there, but they were from the 7 March, before the RSS watcher was being run as a cron job. I removed these, and no more seem to be appearing. Currently 2200 articles in the database.

Started looking at using a multithreaded crawler to crawl web content faster. This will be especially useful for the comments, which at the moment take a long time to crawl. Using a very small set of test cases, a dramatic speed-up is apparent. I'm also planning to experiment with multiprocessing to see if this is faster still.

The comments haven't been 'closed' for any of the articles in the database yet, so as yet I have not added any comments to the database, but the functionality is in place, waiting for some content to process.

Added to reading list:
* Paper on creating a Portuguese corpus: http://www.clul.ul.pt/files/michel_genereux/propor2012_final_ack.pdf
* The webcleaner (NCLEANER) which was used to create this corpus: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.180.3700&rep=rep1&type=pdf

Building a Corpus System

Thursday, 13 March 2014

duplicate removal and multithreading

No comments:

Post a Comment

Blog Archive