Building a Corpus System: rss, copyright, beautiful soup

Completed RSS crawler - this is now running as a cron job, collecting all new IOL articles.

reporter.py runs into issues when the article text is too short (about 2 paragraphs of text, normally as descriptions of galleries, or breaking news items with "more to follow"). In these cases, it identifies the copyright blurb at the end of most pages as the "main article". This needs to be removed before handing the article to reporter. Alternatively, fall back to Beautiful Soup to extract text, which is less generic but may be more accurate.

reporter also includes the "Related links" found in most IOL articles. These will need to be filtered out.

Probably best is to customize as many text extractions as possible (i.e., for all main news sites) and use reporter as a generic solution in case other sites need to be added or the formats for the current sites change.

Currently removing copyright message based on string matching - this will have to be updated if the copyright message changes (and it is different on the non-English IOL pages.)

Note: copyright div in IOL articles is <div class="copywrite">[sic] ((IT people these days.))

Ran into first issues due to running off a micro-server:

Couldn't install reporter. Turned out that lxml dependency was crashing gcc due to running out of memory.
Crawling South African pages from the server is significantly slower as the server is located in the Netherlands.

Building a Corpus System

Saturday, 8 March 2014

rss, copyright, beautiful soup

No comments:

Post a Comment

Blog Archive