iol, regex
- Started looking at IOL articles
- Crawled the front page: ~1200 links, about 15 minutes of processing time
- Identified author, date, and article text in ~100 of these (many links were CLINKS or pointed to ioldating, so this is not as small a fraction as it seems)
- Used regular expressions to find author and date
- These can be customized for the metadata analyser to work on other sites
- Could keep a dictionary of {site : regexes}, which gives flexibility, though long-term updates would then need someone who can write regex
- Started looking at the possibility of incorporating XML feeds into the crawler to identify metadata
- Flagged software (see Richard's email from today)
- http://corpus2.byu.edu/glowbe/
- http://ipsc.jrc.ec.europa.eu/index.php?id=60a
- https://github.com/aymara/lima
- Wikipedia has a fairly extensive list of South African slang words, categorized by language of origin. This may be useful - it would be fairly trivial to extract these into a plaintext dictionary
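
A rough sketch of the {site : regexes} idea from the notes above, in Python. The site key, the patterns, and the sample HTML are all made up for illustration. They are not the actual expressions used on IOL, and the real markup will almost certainly need different patterns.

```python
import re

# Hypothetical per-site patterns. Each site maps to one compiled regex per
# metadata field, with the value of interest in capture group 1.
SITE_PATTERNS = {
    "iol.co.za": {
        "author": re.compile(r'<p class="byline">\s*By\s+([^<]+?)\s*</p>', re.IGNORECASE),
        "date": re.compile(r'<span class="date">\s*([^<]+?)\s*</span>', re.IGNORECASE),
    },
}


def extract_metadata(site, html):
    """Return {field: value or None} for a site; empty dict if the site is unknown."""
    patterns = SITE_PATTERNS.get(site, {})
    result = {}
    for field, pattern in patterns.items():
        match = pattern.search(html)
        result[field] = match.group(1) if match else None
    return result


# Made-up sample markup, only to show the call.
sample = '<p class="byline">By Jane Doe</p><span class="date">12 March 2014</span>'
print(extract_metadata("iol.co.za", sample))
```

Supporting another site would then only mean adding an entry to `SITE_PATTERNS`, which is where the regex-writing skill mentioned above comes in.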