Building a Corpus System: iol, regex

Tuesday, 4 March 2014

iol, regex

Started looking at IOL articles
Crawled the front page: ~1200 links, about 15 minutes of processing time
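A minimal sketch of the link-collection step, using only the Python standard library. The names `LinkCollector` and `extract_links` are made up for illustration and are not the crawler's actual code; fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the unique links in html, in order of first appearance."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return list(dict.fromkeys(collector.links))
```

Removing duplicates before fetching should cut processing time a little, since front pages tend to link to the same article more than once.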

Identified author, date, and article text in ~100 of these (many links were CLINKS or pointed to ioldating, so this is not as small a fraction as it seems)

Used regular expressions to find author and date

These can be customized for the metadata analyser to work on other sites
Can keep a dictionary of {site : regexes}. This is flexible, but long-term updates would then need someone who can write regex
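A rough sketch of what the {site : regexes} dictionary could look like. The patterns and the `byline` markup are assumptions made up for illustration, not the real IOL page structure, and `SITE_PATTERNS` and `extract_metadata` are hypothetical names.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns: the real ones depend on each site's markup.
SITE_PATTERNS = {
    "www.iol.co.za": {
        "author": re.compile(r'class="byline"[^>]*>\s*By\s+([^<]+)<', re.IGNORECASE),
        "date": re.compile(r"(\d{1,2}\s+[A-Z][a-z]+\s+\d{4})"),
    },
}


def extract_metadata(url, html, patterns=SITE_PATTERNS):
    """Look up the regexes for the URL's host and apply each one to the page."""
    host = urlparse(url).netloc
    site_patterns = patterns.get(host)
    if site_patterns is None:
        return {}  # no regexes registered for this site
    result = {}
    for field, pattern in site_patterns.items():
        match = pattern.search(html)
        result[field] = match.group(1).strip() if match else None
    return result
```

Supporting a new site then means adding one entry to the dictionary, with no change to the analyser itself.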

Started looking at the possibility of incorporating XML feeds into the crawler to identify metadata
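A sketch of how feed metadata might be read, assuming the site publishes a standard RSS 2.0 feed (not yet confirmed). It maps each article link to the title, author, and date given in the feed. `feed_metadata` is a hypothetical name, and the optional `dc:creator` element is the usual Dublin Core extension.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, often used for the author in RSS feeds
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def feed_metadata(feed_xml):
    """Map each item's link to the metadata the feed provides for it."""
    # Pass bytes rather than str if the feed has an encoding declaration
    root = ET.fromstring(feed_xml)
    records = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue
        author = item.findtext("author") or item.findtext("dc:creator", namespaces=DC_NS)
        records[link.strip()] = {
            "title": item.findtext("title"),
            "author": author,
            "date": item.findtext("pubDate"),
        }
    return records
```

Where a feed entry exists for a crawled URL, its values could be used directly, leaving the regexes as a fallback.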

Flagged software (see Richard's email from today)

http://corpus2.byu.edu/glowbe/
http://ipsc.jrc.ec.europa.eu/index.php?id=60a
https://github.com/aymara/lima

Wikipedia has a fairly extensive list of South African slang words, categorized by language of origin. This may be useful - it would be fairly trivial to extract these into a plaintext dictionary
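A sketch of the extraction, assuming the list entries appear in the page's wikitext as bullets of the form `* term – definition`. The real page layout has not been checked, so the pattern would likely need adjusting. `extract_terms` is a hypothetical name.

```python
import re

# Assumed entry format: "* term – definition" (en dash or hyphen, spaces around it)
ENTRY = re.compile(r"^\*+[ \t]*(.+?)[ \t]+[–-][ \t]+", re.MULTILINE)


def extract_terms(wikitext):
    """Return the sorted, lowercased terms found in bulleted list entries."""
    terms = set()
    for match in ENTRY.finditer(wikitext):
        term = match.group(1)
        term = re.sub(r"'{2,}", "", term)  # strip ''italic'' / '''bold''' markup
        # [[target|text]] or [[text]] becomes text
        term = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", term)
        term = term.strip().lower()
        if term:
            terms.add(term)
    return sorted(terms)
```

Writing `"\n".join(extract_terms(wikitext))` to a file would give the plaintext dictionary, one term per line.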
