Tuesday 4 March 2014

iol, regex

  • Started looking at IOL articles
  • Crawled the front page: ~1200 links, about 15 minutes of processing time
    • Identified author, date, and article text in ~100 of these (many links were CLINKS or pointed to ioldating, so this is not as small a fraction as it seems)
  • Used regular expressions to find the author and date
    • These can be customized for the metadata analyser to work on other sites
    • Can keep a dictionary of {site: regexes}, which allows flexibility, though it means someone capable of writing regexes is needed for long-term updates
  • Started looking at the possibility of incorporating XML feeds into the crawler to identify metadata
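
A minimal sketch of the {site: regexes} idea, to show how the metadata analyser could pick patterns by hostname. The names SITE_REGEXES and extract_metadata are hypothetical, and the patterns assume made-up byline/date markup; the real IOL HTML would need to be inspected before writing the actual expressions.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site patterns. The markup matched here is an assumption,
# not the real IOL page structure.
SITE_REGEXES = {
    "www.iol.co.za": {
        "author": re.compile(r'<p class="byline">\s*By\s+([^<]+?)\s*</p>', re.IGNORECASE),
        "date": re.compile(r'<p class="date">\s*([^<]+?)\s*</p>', re.IGNORECASE),
    },
}


def extract_metadata(url, html):
    """Return {field: value or None} using the patterns registered for the URL's host."""
    host = urlparse(url).netloc
    patterns = SITE_REGEXES.get(host, {})  # unknown sites yield an empty result
    result = {}
    for field, pattern in patterns.items():
        match = pattern.search(html)
        result[field] = match.group(1) if match else None
    return result
```

Supporting another site would then only mean adding one more entry to the dictionary, which is where the regex-writing skill comes in.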

  • Flagged software (see Richard's email from today)
    • http://corpus2.byu.edu/glowbe/
    • http://ipsc.jrc.ec.europa.eu/index.php?id=60a
    • https://github.com/aymara/lima

  • Wikipedia has a fairly extensive list of South African slang words, categorized by language of origin. This may be useful: it would be fairly trivial to extract these into a plaintext dictionary
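
A rough sketch of how that extraction might look, assuming the page's wikitext has already been saved locally, that each language of origin is a "== Heading ==" line, and that each slang term is a bullet starting with an italic or bold word. Both the format and the function names (extract_slang, to_plaintext) are assumptions; the actual page layout has not been checked here.

```python
import re

# Assumed wikitext conventions: "== Language ==" headings and
# "* ''word'' - meaning" (or '''word''') entries.
HEADING = re.compile(r"^==+\s*(.*?)\s*==+$")
ENTRY = re.compile(r"^\*\s*'{2,3}(.+?)'{2,3}")


def extract_slang(wikitext):
    """Group slang terms by the heading (language of origin) they appear under."""
    words = {}
    current = None
    for line in wikitext.splitlines():
        line = line.strip()
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            words.setdefault(current, [])
            continue
        entry = ENTRY.match(line)
        if entry and current is not None:
            words[current].append(entry.group(1).lower())
    return words


def to_plaintext(words):
    """Flatten all terms into a sorted, de-duplicated one-word-per-line string."""
    return "\n".join(sorted({w for terms in words.values() for w in terms}))
```

The grouped form keeps the language-of-origin information, while the flattened string is the plaintext dictionary itself.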
