Tuesday 25 March 2014

comments again

Comments crawled and search functionality added.

On the main page you can search in articles, comments, or both, with both case sensitive and case insensitive searches.
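
Roughly how the search is expressed with pymongo regex queries; the database, collection, and field names here are placeholders rather than the actual schema:

import re
from pymongo import MongoClient

db = MongoClient().sae_corpus  # hypothetical database name

def search(term, where="both", case_sensitive=False):
    """Return matching articles and/or comments for a search term."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(re.escape(term), flags)
    results = {}
    if where in ("articles", "both"):
        results["articles"] = list(db.articles.find({"plaintext": pattern}))
    if where in ("comments", "both"):
        results["comments"] = list(db.comments.find({"text": pattern}))
    return results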

Monday 24 March 2014

Comments

Started adding the comments to the articles crawled so far. This takes a while from the server (up to 200 seconds for every 10 articles processed, compared to about 20 seconds locally). This also results in the cryptic "id not valid at server" MongoDB error message, which apparently means that the cursor timed out on the server. I need to experiment with keeping the cursor alive or setting the timeout a bit longer.
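
Roughly the workaround I want to try: disable the server-side cursor timeout on the long-running find() (the keyword was timeout=False in older pymongo versions, no_cursor_timeout=True in newer ones), or process the articles in smaller batches so the cursor never sits idle long enough to expire. A sketch with placeholder names:

from pymongo import MongoClient

db = MongoClient().sae_corpus  # hypothetical database name

def fetch_and_store_comments(article):
    """Hypothetical worker that crawls the comments for one article."""
    ...

cursor = db.articles.find({"comments": {"$exists": False}},
                          no_cursor_timeout=True)  # timeout=False in older pymongo
try:
    for article in cursor:
        fetch_and_store_comments(article)
finally:
    cursor.close()  # the server will not reap a no-timeout cursor on its own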

Leaving the comments crawler to run overnight. Hopefully by tomorrow morning all the articles from before 7 March will have comments.

Should the tokens found in comments be added to the same frequency tables as the words found in articles, or do these need to be segregated?

Sunday 23 March 2014

New feeds and tagging

Added the Mail & Guardian RSS feed to the system. This now uses Python's feedparser library instead of parsing the XML directly, which should hopefully allow the system to be more generic. Still need to move the IOL feed to the new system.
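
The feedparser-based fetch is pleasantly short; the fields below are the standard ones feedparser exposes, shown against the IOL main news feed (http://iol.co.za/cmlink/1.640):

import feedparser

feed = feedparser.parse("http://iol.co.za/cmlink/1.640")
for entry in feed.entries:
    url = entry.link
    headline = entry.title
    published = entry.get("published", "")
    summary = entry.get("summary", "")
    guid = entry.get("id", url)
    print(url, headline)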

Installed NLTK on the server. Had problems with the nltk.download() function for downloading the libraries, corpora, etc. on which NLTK relies. Not sure if this was due to the CLI or memory issues again, but trying to download "all" from the options page failed repeatedly. Managed to download the requirements for using the pos_tag function with:

import nltk
nltk.download()
Downloader> d
Downloader> maxent_treebank_pos_tagger

It took a couple of hours to tag all 4000+ articles. Tagging will either have to be done at crawl time or run regularly, as tagging a large dataset in one go could be prohibitively slow.
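
For reference, the per-article tagging step is just (assuming the plain text is already stored and the data packages above have been downloaded):

import nltk
from nltk.tokenize import word_tokenize

# Non-interactive equivalent of the downloader session above
# (plus the punkt models used by word_tokenize):
# nltk.download('maxent_treebank_pos_tagger')
# nltk.download('punkt')

def tag_article(plaintext):
    """Return (token, POS-tag) pairs for one article's plain text."""
    return nltk.pos_tag(word_tokenize(plaintext))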

Added a 'tagged' link to the corpus interface, to allow the user to see the tagged article as well as the text and HTML. Still need to figure out the best way to store the text; storing both the plain text and the tagged plain text is definitely not the most efficient, so one of the two should be removed in the near future.

Richard was concerned that South African words would be incorrectly tagged. This does seem to be a problem: see for example 'maas' in http://sae.dwyer.co.za/tagged/5327e906c3f6083abd891d7f



Wednesday 19 March 2014

Generic user-assisted feed parser

Started work on a module which looks at user-supplied RSS URLs, tries to extract the relevant information from them, and creates the correct database mappings, asking the user for confirmation.

Experimenting with the Python feedparser module to help with this (prior approach was to use the standard XML parser to parse RSS feeds).
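
Roughly the interaction I have in mind, assuming feedparser's standard entry fields and a simple per-field confirmation prompt (the mapping structure is hypothetical):

import feedparser

def propose_mapping(feed_url):
    """Guess the article fields from the first feed entry and ask the user to confirm each."""
    entry = feedparser.parse(feed_url).entries[0]
    guesses = {
        "url": ("link", entry.get("link", "")),
        "headline": ("title", entry.get("title", "")),
        "date": ("published", entry.get("published", "")),
        "description": ("summary", entry.get("summary", "")),
    }
    mapping = {}
    for db_field, (feed_key, example) in guesses.items():
        answer = input("Map %s -> %s? example: %r [y/n] " % (feed_key, db_field, example))
        if answer.lower().startswith("y"):
            mapping[db_field] = feed_key
    return mapping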


Monday 17 March 2014

more deduplication

Started looking at n-gram near-deduplication methods. Nice article in SPIRE:
Mariano Consens and Gonzalo Navarro (Eds.), String Processing and Information Retrieval: 12th International Conference, SPIRE 2005, November 2005.

http://f3.tiera.ru/2/Cs_Computer%20science/CsLn_Lecture%20notes/S/String%20Processing%20and%20Information%20Retrieval,%2012%20conf.,%20SPIRE%202005(LNCS3772,%20Springer,%202005)(ISBN%203540297405)(418s).pdf#page=127

Also read about Onion (ONe Instance ONly) for deduplication:
https://code.google.com/p/onion/

Onion was developed as part of Pomikálek's PhD thesis, titled "Removing Boilerplate and Duplicate Content from Web Corpora". Available at: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

Slides titled "Near Duplicate Data in Web Corpora" by Benko are available here: http://hpsg.fu-berlin.de/cow/dgfs2014/pdf/WEBTL_05_17.30_Benko.pdf (also uses Onion)

Another paper on n-gram similarity methods: "Classification of RSS-formatted Documents using Full Text similarity Measures" by Wegrzyn-Wolska and Szczepaniak. Available at: http://www.researchgate.net/publication/220940781_Classification_of_RSS-Formatted_Documents_Using_Full_Text_Similarity_Measures/file/72e7e526177159fa60.pdf
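
As a first pass at the n-gram approach from these papers, roughly what I have in mind is shingle-based comparison (word 5-grams plus Jaccard similarity); the 0.8 threshold below is a guess to be tuned, not a value from the papers:

def shingles(text, n=5):
    """Set of word n-grams ('shingles') for one document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicates(doc_a, doc_b, threshold=0.8):
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold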


Saturday 15 March 2014

Deduplication and tokenizing

Started working on deduplication. I can compare each plaintext article against every other article in the database. This works for the current data set but will not scale well. Also, exact comparison will not pick up duplicates if any change at all is present. Looked at Python's difflib module, which looks promising.
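
The naive all-pairs version, sketched with difflib's SequenceMatcher (quadratic in the number of articles, so only viable for the current data set):

import difflib
from itertools import combinations

def find_duplicates(articles, threshold=0.9):
    """All-pairs comparison of plaintext; articles is a list of (id, text) pairs."""
    duplicates = []
    for (id_a, text_a), (id_b, text_b) in combinations(articles, 2):
        ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            duplicates.append((id_a, id_b, ratio))
    return duplicates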

The most efficient way would probably be to find or write an algorithm to extract keywords from an article. If this were accurate enough, then we could simply look at articles with the same keywords, and perform deduplication only on these.

Gallery and video articles are a problem. Sometimes an advert is picked up as the 'main text' of these articles, or the text is so short that it is probably insignificant. These are pretty easy to filter out; they typically have one or more of the following:
* "Gallery", "Video", or "Pics" in the URL
* Plaintext that starts with "Gallery"
* Very short plaintext (I experimented with limits: any story under 500 characters seems to be uninteresting)

I can therefore fairly easily filter these out, but we probably still want the comments on these, so I can't just remove them entirely.
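
A sketch of the filter based on the heuristics above; matching articles should be flagged rather than deleted so their comments can still be crawled:

GALLERY_MARKERS = ("gallery", "video", "pics")

def is_gallery_or_video(url, plaintext):
    """Heuristics from the list above: a marker in the URL, a 'Gallery' prefix, or very short text."""
    if any(marker in url.lower() for marker in GALLERY_MARKERS):
        return True
    if plaintext.lstrip().lower().startswith("gallery"):
        return True
    return len(plaintext) < 500

# Articles matching this should be marked with a flag field in the database,
# not removed, so that comment crawling still covers them.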

All articles from thepost.co.za are duplicates. These URLs also simply redirect back to the iol.co.za/thepost homepage, even though they contain the same article slug as the IOL duplicate.

Sometimes the iol.co.za article id changes, but the slug remains the same. These entries are also duplicates in the database.

For tokenizing, it might make sense to keep an ordered set of the lowercase tokens of each article. This would allow word-frequency analysis and efficient lookup. Case-sensitive queries and substring matches could then be done with a more expensive regex over the plain text. This might also help with deduplication.
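
Roughly what that storage scheme could look like (field and function names are placeholders):

import re
from collections import Counter
from nltk.tokenize import word_tokenize

def index_tokens(plaintext):
    """Lowercase token frequencies, for fast lookup and word-frequency analysis."""
    tokens = [t.lower() for t in word_tokenize(plaintext) if t.isalpha()]
    return Counter(tokens)

def case_sensitive_match(plaintext, term):
    """Fallback for case-sensitive or substring queries: a more expensive regex over the raw text."""
    return re.search(re.escape(term), plaintext) is not None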


Thursday 13 March 2014

language identification and deduplication

An efficient way to identify language:

https://groups.google.com/forum/#!topic/nltk-users/pfUq8svEz-s

Create a set() of English vocabulary (the NLTK word list has about 200,000 words). Then create a set of the tokenized article. The difference of these two sets shows how many non-English words are used in the article. (Take the ratio of the number of non-English words to the total number of words.)
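
A minimal sketch of this, assuming NLTK's words corpus and the punkt tokenizer data have been downloaded:

from nltk.corpus import words
from nltk.tokenize import word_tokenize

ENGLISH_VOCAB = set(w.lower() for w in words.words())  # roughly 200k entries

def non_english_ratio(plaintext):
    """Fraction of an article's distinct alphabetic tokens not in the English word list."""
    tokens = set(t.lower() for t in word_tokenize(plaintext) if t.isalpha())
    if not tokens:
        return 0.0
    return len(tokens - ENGLISH_VOCAB) / len(tokens)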

Tested with several articles. English articles seem to have about 25% non-English words (the English vocabulary list only contains root words and some derivations: e.g., it has 'walk' and 'walking' but not 'walks', which inflates the count of 'non-English' words), whereas a non-English article showed about 95% non-English words.

I haven't tried or read anything about using this same method for deduplication, but I imagine that a very similar approach would work well.

duplicate removal and multithreading

Found about 400 duplicate entries in the database. Not sure how these got there, but they were from 7 March, before the RSS watcher was run as a cron job. I removed these, and no more seem to be appearing. Currently 2200 articles in the database.

Started looking at using a multithreaded crawler to crawl web content faster. This will be especially useful for the comments, which at the moment take a long time to crawl. Using a very small set of test cases, a dramatic speed-up is apparent. I'm also planning to experiment with multiprocessing to see if this is faster still.
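
A sketch of the thread-pool version, using multiprocessing.dummy (a thread-backed Pool with the same interface as multiprocessing); the worker and database names are placeholders:

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API
from pymongo import MongoClient

db = MongoClient().sae_corpus  # hypothetical database name

def fetch_comments(url):
    """Hypothetical worker: download and parse the comments for one article."""
    ...

urls = [a["url"] for a in db.articles.find({}, {"url": 1})]
pool = Pool(8)  # 8 worker threads; tune against the remote server's latency
pool.map(fetch_comments, urls)
pool.close()
pool.join()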

The comments haven't 'closed' for any of the articles in the database yet, so no comments have been added to the database so far, but the functionality is in place, waiting for some content to process.

Added to reading list:
* Paper on creating a Portuguese corpus: http://www.clul.ul.pt/files/michel_genereux/propor2012_final_ack.pdf
* The webcleaner (NCLEANER) which was used to create this corpus: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.180.3700&rep=rep1&type=pdf


Tuesday 11 March 2014

Comments

Currently 2249 articles in the database. The RSS crawler is finding more posts now that it is no longer the weekend, as expected.

Worked on fetching comments. Will test tomorrow and, if all is as it should be, start crawling and indexing comments too. Comments are only fetched from articles where the thread has closed, so this will also run as a cron job, checking all articles daily to see whether their comments have closed yet.


Sunday 9 March 2014

Disqus Comments

The XML watcher has been running well all day, sending me a report every half an hour. It looks at about 1400 URLs each time (taken from 133 RSS feeds), and anywhere between 0 and 20 of these are new. Usually there are only a couple of new URLs every half hour, but I expect there are fewer new articles because it is the weekend.

Found out how to bypass the AJAX and load the full JSON of Disqus comments via a direct URL. Began working on loading and parsing these, but still need to put everything together into a crawler. Using Beautiful Soup to pull the JSON section out of the HTML page.
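
Roughly the shape of the extraction step, assuming the comment data sits in a JSON blob embedded in the page's HTML; the selector here is a placeholder for whatever the Disqus page actually uses:

import json
import requests
from bs4 import BeautifulSoup

def extract_disqus_json(thread_url):
    """Pull the embedded JSON blob out of a Disqus thread page (selector is a placeholder)."""
    html = requests.get(thread_url).text
    soup = BeautifulSoup(html)
    script = soup.find("script", attrs={"type": "text/json"})  # hypothetical selector
    if script is None:
        return None
    return json.loads(script.string)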

Need to find out how long the comment threads stay open for. Then we can easily go through the DB and crawl the comments for all the closed threads. (Or we can do this on a regular basis and check each to see if it's closed).

What information is important? An example Disqus comment looks as follows. There is a lot of metadata, most of which could be useful, though we can probably throw out at least the avatar fields. Also, why both raw_message and message? From the example below, message appears to be the HTML-wrapped version of raw_message; which should we keep?

Installed the browser-automation tool Selenium, in case we need to do anything else AJAX-related.

{"isFlagged":false,"forum":"iol","parent":null,"author":{"username":"HARRYHAT1950","about":"","name":"HARRYHAT1950","url":"","isAnonymous":false,"rep":1.345875,"profileUrl":"http://disqus.com/HARRYHAT1950/","reputation":1.345875,"location":"","isPrivate":false,"isPrimary":true,"joinedAt":"2012-04-23T10:23:08","id":"25208750","avatar":{"small":{"permalink":"//a.disquscdn.com/uploads/forums/128/5645/avatar32.jpg?1386858689","cache":"//a.disquscdn.com/uploads/forums/128/5645/avatar32.jpg?1386858689"},"large":{"permalink":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689","cache":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689"},"permalink":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689","cache":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689"}},"media":[],"isDeleted":false,"isApproved":true,"dislikes":0,"raw_message":"Point 2: \"then your first call should be to an ambulance service or the traffic department who will, in turn, alert them\". This was obviously not written in SA. The first thing the traffic department does is notify their contact at their chosen towing company and negotiate their commission. After that they contact their favourite paramedic and negotiate their commission from them. Even the cops don't notify the provincial ambulance service '\u00e7os they know there is no commission and the provincial ambulance service is useless and won't turn up anyway.","createdAt":"2014-03-08T07:26:59","id":"1276210870","thread":"2379235034","depth":0,"numReports":0,"likes":8,"isEdited":false,"message":"\u003cp>Point 2: \"then your first call should be to an ambulance service or the traffic department who will, in turn, alert them\". This was obviously not written in SA. The first thing the traffic department does is notify their contact at their chosen towing company and negotiate their commission. After that they contact their favourite paramedic and negotiate their commission from them. Even the cops don't notify the provincial ambulance service '\u00e7os they know there is no commission and the provincial ambulance service is useless and won't turn up anyway.\u003c/p>","isSpam":false,"isHighlighted":false,"points":8}


Saturday 8 March 2014

rss, copyright, beautiful soup

Completed RSS crawler - this is now running as a cron job, collecting all new IOL articles.

reporter.py runs into issues when the article text is too short (about two paragraphs of text, normally descriptions of galleries or breaking-news items with "more to follow"). In these cases it identifies the copyright blurb at the end of most pages as the "main article". This needs to be removed before handing the article to reporter. Alternatively, fall back to Beautiful Soup to extract the text, which is less generic but may be more accurate.

reporter also includes the "Related links" found in most IOL articles. These will need to be filtered out.

Probably the best approach is to customize as many of the text extractions as possible (i.e., for all the main news sites) and use reporter as a generic fallback in case other sites need to be added or the formats of the current sites change.

Currently removing copyright message based on string matching - this will have to be updated if the copyright message changes (and it is different on the non-English IOL pages.)

Note: copyright div in IOL articles is <div class="copywrite">[sic] ((IT people these days.))
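
Stripping that div before handing the HTML to reporter is straightforward with Beautiful Soup (the class name is taken from the note above):

from bs4 import BeautifulSoup

def strip_copyright(html):
    """Remove IOL's copyright block (class 'copywrite' [sic]) before text extraction."""
    soup = BeautifulSoup(html)
    for div in soup.find_all("div", class_="copywrite"):
        div.decompose()
    return str(soup)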

Ran into the first issues caused by running off a micro-server:
  • Couldn't install reporter. It turned out that the lxml dependency was crashing gcc by running it out of memory.
  • Crawling South African pages from the server is significantly slower, as the server is located in the Netherlands.




Thursday 6 March 2014

rss, regex, dates, modules and metadata, encoding

The beginnings of rss-crawler.py, which will watch RSS feeds and crawl new articles, are now in place. I'm currently testing it with just the main news link from the iol.co.za/rss page: http://iol.co.za/cmlink/1.640. This allows for very easy extraction of:

  • url
  • date
  • headline
  • guid (permalink=false might be problematic)
  • description
Unfortunately, the author still needs to be extracted from the HTML. Using two regexes, the first to extract the article headers and the second to extract the author from these, seems to accurately identify the authors for the small test set (20 articles) so far. Further testing will be done.
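
Roughly the shape of the two-stage extraction; the patterns here are illustrative placeholders, not the ones actually being run against the IOL markup:

import re

# Stage 1: pull out the article header block (placeholder pattern).
HEADER_RE = re.compile(r'<div class="article_header">(.*?)</div>', re.DOTALL)
# Stage 2: pull the author out of that block (placeholder pattern).
AUTHOR_RE = re.compile(r'By\s+([A-Z][\w\-]+(?:\s+[A-Z][\w\-]+)*)')

def extract_author(html):
    header = HEADER_RE.search(html)
    if not header:
        return None
    author = AUTHOR_RE.search(header.group(1))
    return author.group(1) if author else None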

Started using Python's dateutil.parser for flexible date parsing.
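
It copes with both RSS-style dates and looser formats, e.g.:

from dateutil import parser

parser.parse("Thu, 06 Mar 2014 14:30:00 +0200")  # RFC 822 style, as used in RSS
parser.parse("6 March 2014")                     # looser formats also work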

Received documents from Richard about metadata and modularization.

I need to take a better look at how some of the Python modules I'm using handle encoding, as I'm having a few issues with smart quotes and other Unicode characters. I discovered the unidecode Python module, which does a brilliant job of converting Unicode characters to the nearest possible ASCII match; this may be useful for some text analysis.
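
For example, smart quotes and accented characters collapse to their nearest ASCII equivalents:

from unidecode import unidecode

unidecode(u"\u2018maas\u2019 caf\u00e9")  # -> "'maas' cafe"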

Started looking at NLTK's capabilities for word stemming.
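
For example, the Porter stemmer collapses inflected forms (which might also help with the 'walks' problem in the language-identification check above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
[stemmer.stem(w) for w in ["walking", "walks", "walked"]]  # -> ['walk', 'walk', 'walk']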




Wednesday 5 March 2014

better tokenizing and beginning of database design

The basic tokenization used previously was not as good as I thought, as it didn't strip out all punctuation (specifically full stops).

  • Now using the combination of sent_tokenize and word_tokenize, as explained here: http://www.nltk.org/api/nltk.tokenize.html (a small sketch follows below)
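
The combination in question, as a small sketch:

from nltk.tokenize import sent_tokenize, word_tokenize

def tokenize(plaintext):
    """Split into sentences first, then words, so sentence-final full stops become separate tokens that can be filtered out."""
    return [word_tokenize(sentence) for sentence in sent_tokenize(plaintext)]
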
Started basic database design, with a linking collection to show associations between words and the articles they appear in.


Tuesday 4 March 2014

iol, regex

  • Started looking at IOL articles
  • Crawled front page - ~1200 links. About 15 minutes processing time
    • Identified author, date, and article text in ~100 of these (Many links were CLINKS or to ioldating so this is not as small a fraction as it seems)
  • Used regular expressions to find author and date 
    • These can be customized for the metadata analyser to work on other sites
    • Can have a dictionary of {site: regexes}, which allows flexibility, though it means that someone capable of writing regexes is required for long-term updates (see the sketch after this list)
  • Started looking at possibility of incorporating XML feeds into crawler to identify metadata
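
A sketch of the {site: regexes} idea mentioned above; the patterns are placeholders, not the ones in use:

import re

# Per-site extraction rules; add further sites as needed.
SITE_REGEXES = {
    "iol.co.za": {
        "author": re.compile(r'<span class="byline">\s*By\s+(.*?)</span>'),
        "date": re.compile(r'<span class="date">(.*?)</span>'),
    },
}

def extract_metadata(site, html):
    """Apply the site's regexes to the HTML and return whatever fields matched."""
    rules = SITE_REGEXES.get(site, {})
    metadata = {}
    for field, regex in rules.items():
        match = regex.search(html)
        metadata[field] = match.group(1) if match else None
    return metadata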

  • Flagged software (see Richard's email from today)
    • http://corpus2.byu.edu/glowbe/
    • http://ipsc.jrc.ec.europa.eu/index.php?id=60a
    • https://github.com/aymara/lima

  • Wikipedia has a fairly extensive list of South African slang words, categorized by language of origin. This may be useful - it would be fairly trivial to extract these into a plaintext dictionary