Monday, 29 September 2014

More pre-cleaning made apparent several issues with Reporter's text extraction algorithm.

* BDLive's HTML style places adjacent <p> tags without any whitespace between them, e.g.:

<p>Lorem ipsum etc text. Some sentence here.</p><p>Another sentence. More text lorem lorem.</p>

Reporter joins the last sentence of each paragraph with the first sentence of the next, eliminating all whitespace. Thus the text from the example above becomes:

Lorem ipsum etc text. Some sentence here.Another sentence. More text lorem lorem.

(with no space between "here." and "Another").

This caused havoc with the word lists, 'creating' hundreds of spurious words which were really two different words joined by a full stop.

After trying several fixes, the easiest and most efficient was to run .replace("</p><p>", "</p>\n<p>") on all HTML before passing it to Reporter, inserting the newline characters it expects.
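As a sketch (the helper name here is illustrative), the pre-processing step is just:

```python
def insert_paragraph_breaks(html):
    # BDLive emits adjacent <p> tags with no whitespace between them;
    # inserting a newline preserves the paragraph boundary for the extractor.
    return html.replace("</p><p>", "</p>\n<p>")

print(insert_paragraph_breaks("<p>Some sentence here.</p><p>Another sentence.</p>"))
```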

BDLive also uses the questionable practice of sending HTML content to clients with inline CSS "display:none" on some elements. Assuming that this hidden text is likely to be extraneous, I have added new filters to the generic "phase 1" cleaning to remove it. This was also problematic, as even within BDLive pages there seems to be no fixed style guide: style="display:none", style="display: none;" and other variants all appear.

I decided to make a generic change to the filtering algorithm which could be useful for other filters too. Previously one could filter tags by specifying strings for tag type, attribute name, and attribute value; e.g., one could remove the following tag by creating a filter with "div", "class", "author_byline":

<div class="author_byline">

Now the attribute value (author_byline in the above example) can be a regular expression, and creating a filter with:

"div", "class", re.compile(r"author.*")

would also remove the tag.

To remove non-visible text I'm using the regex

display\s*:\s*none\s*;?
which allows optional variable white-space after 'display', after the colon, after the none, and an optional semi-colon at the end.

Unfortunately this won't work for text which is hidden by class or id through separate css style-sheets, but these can still be removed by specifying filters for phase2 cleaning.
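Put together, the phase-1 hidden-text filter might look like the following; a minimal sketch using Beautiful Soup (which the system already uses elsewhere), with an illustrative helper name:

```python
import re
from bs4 import BeautifulSoup

# Matches the display:none variants seen across BDLive pages.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none\s*;?")

def strip_hidden_elements(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a compiled regex as an attribute-value filter.
    for tag in soup.find_all(style=HIDDEN_STYLE):
        tag.decompose()
    return str(soup)
```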

I've rebuilt the wordlist on the development database, and things look much tidier. I'll push the changes to the server in the next couple of days.

Rebuilding the word list took about 50 minutes. I'm beginning to think that it would be worth the extra space to store a word-tokenized copy of each article in the database alongside the plaintext one; this would substantially speed up wordlist creation and some other algorithms such as collocations and KWIC.

Monday, 22 September 2014

The Internet Archive has a Wayback Machine, which offers snapshots of sites at specific dates. It has an API which can usefully return the snapshot closest to a specified time.

Started backwards crawling of, and

I started each backwards crawl from the homepages as they appeared in December 2013. I simply fetched all links from the homepages (first trying to get these also through the wayback machine, and if this failed, I tried to access them directly). I then subtracted one from the date, and kept doing so until a different snapshot was found as the "closest" one. Repeat.

Wayback machine is quite slow, but has almost all the content we need. It solves the problem of trying to find URLs for old articles, as these are not really linked to.
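For reference, a sketch of the closest-snapshot lookup against the Wayback Machine's availability API; the sample response below is hand-written in the documented shape, not a real capture:

```python
import json

def availability_query(url, timestamp):
    # timestamp is YYYYMMDD (optionally down to YYYYMMDDhhmmss);
    # the API returns the snapshot closest to it.
    return ("http://archive.org/wayback/available?url=%s&timestamp=%s"
            % (url, timestamp))

def closest_snapshot(response_body):
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

sample = json.dumps({"archived_snapshots": {"closest": {
    "url": "http://web.archive.org/web/20131201000000/http://example.com/",
    "timestamp": "20131201000000", "available": True}}})
print(closest_snapshot(sample))
```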

Also did general crawling of the SA web (anything with a .za domain) over the last few days using Scrapy. This amounts to about 50GB and 230,000 pages so far, but Scrapy unfortunately runs into memory issues as the queue of URLs grows too big.

Thursday, 4 September 2014

More problems with english.pickle

Moved nltk_data to /var/www from /root (not sure why it was there). Works again.

Monday, 28 July 2014

deduplication again and newage issues

Finished implementing basic near-deduplication. After playing around with TF-IDF, cosine distance, and n-gram similarity I decided to use a more customized similarity function based on sentences. In short:

def similarity(article1, article2):
    s1 = set(sentence_tokenize(article1))
    s2 = set(sentence_tokenize(article2))
    shared_sentences = s1.intersection(s2)
    all_sentences = s1.union(s2)
    return len(shared_sentences) / len(all_sentences)

That is, articles are given a similarity rating between 0 and 1 based on how many sentences they share. Looking at comparative results for actual similar articles from the corpus and from some in which I manually introduced small changes, this seemed a better gauge than looking at shared ngrams of characters or even words.

The sklearn Python library provides a nice TfidfVectorizer which creates a similarity matrix based on the tf-idf similarity of a list of articles. This could be more efficient, but as we cannot hope to create this matrix in a single pass of the corpus (we can't hold all articles in memory at once), this efficiency is non-trivial to take advantage of. Instead, doing pairwise comparison of articles as outlined in the previous post seems to be the best option at this stage.

Some optimization was added to the deduplication process, namely:
If a sentence from one article matches too many other articles (for now > 10), ignore this sentence. This means we don't need to pull down hundreds of articles and pairwise compare against all of them for sentences such as "subscribe to our newsletter" which is still dirtying some of the articles. This will remain useful even on the corpus texts are properly cleaned for sentences such as "more to follow" and other reporter cliches, although for now I'm ignoring sentences which are fewer than 20 characters long. Better gauge of similarity could possibly be achieved by taking into account:

  • sentences which appear only in very few other articles are weighted higher for deduplication
  • sentences containing names are weighted higher
  • longer sentences are weighted higher

After running a number of tests on the development database I have now left the deduplicator to run on the main database. It is not removing the duplicates yet, but just marking them, as well as marking 'similar' articles (those which rank with above 30% similarity).

Also discovered some problems with short articles on - similar to the problem before with IOL, if the article text is too short then Reporter picks up the CSS styling instead as the 'main text'. Unfortunately, unlike with IOL, removing the CSS as a pre-processing step does not solve the issue, as Reporter's next guess is the "in other news" section; if this is removed, it picks up the phrase "comment now". At this stage I couldn't find a solution generic enough to be appealing; some customized code may need to be written for some publications.

Installed NLTK on the server with the punkt package. Took a while to find how to do this on a headless machine (NLTK downloader seems GUI-focussed and the cli downloader didn't provide much help in locating the "english.pickle" resource which is part of the punkt tokenizer):

python -m nltk.downloader punkt

Wednesday, 23 July 2014

Second semester - deduplication

Working on the project again now that exam revision, exams, internship and field trip are over.

Worked on near-deduplication. Using the sklearn Python library with TfidfVectorizer as described here: which seems to be working very well so far.

As pairwise comparison of all articles will become increasingly impractical as corpus size grows I'm taking a customized approach of keeping a collection of sentence hashes. This takes up more database space, but it means that we only need to do pairwise comparison on articles which share at least one sentence.

Dedup can be done on an existing corpus by building up the sentence hash collection while doing the deduplication. If the sentence hashes exist already for all articles in the db then we need to pull only a limited subset of articles to compare each new article against.
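An in-memory sketch of that sentence-hash collection (in the real system the hashes live in the database; the names here are illustrative). Hashes shared by too many articles can be skipped as probable boilerplate:

```python
import hashlib
from collections import defaultdict

# hash -> set of article ids; stands in for a database collection.
sentence_index = defaultdict(set)

def sentence_hash(sentence):
    return hashlib.md5(sentence.strip().lower().encode("utf-8")).hexdigest()

def dedup_candidates(article_id, sentences, boilerplate_cap=10):
    # Articles sharing at least one sentence hash become candidates for
    # pairwise comparison; hashes shared by too many articles are skipped
    # as probable boilerplate ("subscribe to our newsletter" etc.).
    candidates = set()
    for sentence in sentences:
        h = sentence_hash(sentence)
        if len(sentence_index[h]) <= boilerplate_cap:
            candidates |= sentence_index[h]
        sentence_index[h].add(article_id)
    return candidates
```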

Also discovered ssdeep fuzzy hashing in Python (thanks to Peter Wrench). Will take a comparative look at this at a later stage to see if can be more efficient than the method described above.

Monday, 12 May 2014

deduplication and scrapy

I read several articles on near-deduplication and had an idea based on some of the algorithms previously used. Outline:

  • For each article, hash each sentence, and maintain table of hashed_sentences:articles_containing_sentence[]

Then, duplicates and near duplicates can efficiently be discovered and avoided with something along the lines of the following

new_article = crawl_url(url)
duplicate_possibilities = []
sentences = get_hash_sentences(new_article)
for sentence in sentences:
    duplicate_possibilities += hashed_sentences[sentence]

It is then pretty straightforward to fetch the text of all existing articles which have more than some percent overlap of sentences with the new article, and to use text similarity algorithms in pairs on these articles. Alternatively, the sentence-overlap percentage could be enough to identify a new article as a 'duplicate' or not. 

The sentence:article table could become undesirably large, but the size could be reduced with some heuristic selection of which sentences are 'important' (containing at least some uncommon words, not too long or too short, etc.).

I also wrote a basic IOL Spider for Scrapy, and started experimenting with using this to fetch old IOL data (ie, articles published before we started watching the RSS feeds.)


Saturday, 3 May 2014

multithreading and async-crawling

Due to the growing number of publications, crawl-time has increased dramatically. I spent the day experimenting with using multi-threading on the current implementation and using the Python Twisted library to crawl asynchronously. The latter results in a far greater speed-up, but would require a lot of code refactoring to implement.

Friday, 2 May 2014

New publications

Added the following publications:

  • The Citizen
  • Sowetan Live
  • Dispatch Live
  • The New Age
  • Business Day Live
  • Times Live
  • Daily Maverick
There are still more to add. Adding feeds still requires some manual labour, although this can be done far more generically than before. To add new publications and feeds, one needs to manually specify the RSS URL(s), information about how to extract the author, and any static text to remove. Static text is removed either because Reporter misidentifies it and includes it in the plaintext, or because, if it is too long (such as the IOL copyright notice), Reporter may ignore the main text completely for short articles and pick up the static text instead.

At the moment I am specifying this information programmatically so a typical new entry may look something like this:

dailymaverick = Publication("Daily Maverick", "", [""],
                            {"splitindex": 0})

But the UI to allow the same functionality should be ready soon(ish).

Friday, 25 April 2014

Moving servers

The Comp Sci dept has provisioned me a VPS on their server at Struben.

The website is now hosted there, so it should be a lot faster, with the space and memory issues I was running into before eliminated for now.

The site is only accessible on the Rhodes Intranet for now.


There were some configuration issues with the database, but it's working for now, and I am in the process of moving over the CRON jobs from the Netherlands server.

This server again has limited space, as the dept were only able to give me 50GB of hard drive space. Currently the server has 35GB free. I am hoping that they will be willing to renegotiate once this space is filled up, so I will be adding new publications and comments for the old ones (currently only comments from IOL are being added).

Saturday, 19 April 2014

Author customization

Finished base functionality for back-end and UI to allow user to add new feeds to the RSS Watcher and to specify how the author should be extracted. Still need to move the IOL, MG, and Grocott's feeds to the new system and tidy up a bit, but the basic functionality can be seen at

Started negotiations for server in the Struben building. Unfortunately, short term plans involve this server being accessible only on the Rhodes intranet, but I plan to add several new feeds once this is done (hopefully by the end of next week). The DO server is already battling with the load generated by just three feeds, so I will not be adding new publications before I can move over to the new server.

Added user-agent spoofing for http requests as a fallback because some sites send 403 errors if the user-agent is not set to one of the common browsers.

Monday, 14 April 2014

gzip and admin panel

After trying various RSS feeds, I ran into the problem that some HTML is sent to the browser in gzip form, which the browser deals with transparently, converting it back to HTML. The corpus system does not yet do this.

Specifically, articles are returned in this form. A fix for the problem was found here:

I am busy implementing this, as well as looking into potential similar problems which may occur.
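A sketch of the decoding step, assuming the raw response bytes are available (the helper name is illustrative):

```python
import gzip
import io

def decode_response(body, content_encoding=""):
    # Browsers decompress gzip responses transparently; a plain fetch
    # has to check the encoding (or the gzip magic bytes) and do it itself.
    if content_encoding == "gzip" or body[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body

print(decode_response(gzip.compress(b"<html>article text</html>")))
```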

Worked on admin panel for the user to add new feeds, and started reworking database to fit this model.

The corpus currently contains over 10000 articles and 38000 comments. All comments are from articles from February and early March, but I have started the script to collect the remainder of the March comments. This might need to run over the next few days, as the connection to Netherlands is still problematic.

Tuesday, 25 March 2014

comments again

Comments crawled and search functionality added.

On the main page you can search in articles, comments, or both, with both case sensitive and case insensitive searches.

Monday, 24 March 2014


Started adding the comments to the articles crawled so far. This takes a while from the server (up to 200 seconds for every 10 articles processed, compared to about 20 seconds locally). It also results in the cryptic "id not valid at server" MongoDB error message, which apparently means that the connection timed out. I need to experiment with setting a longer timeout.

Leaving the comments crawler to run overnight. Hopefully by tomorrow morning all the articles from before 7 March will have comments.

Should the tokens found in comments be added to the same frequency tables as the words found in articles, or do these need to be segregated?

Sunday, 23 March 2014

New feeds and tagging

Added the Mail & Guardian RSS feed to the system. This now uses Python's FeedParser library, instead of parsing the XML directly, which should hopefully allow the system to be more generic. Still need to move the IOL feed to the new system.

Installed NLTK on the server. Had problems with the function to download the libraries, corpora, etc. on which it relies. Not sure if this was due to the CLI or if it was memory issues again, but trying to download "all" on the options page failed repeatedly. Managed to download the requirements for using the pos_tag function with:

import nltk
nltk.download()
Downloader> d maxent_treebank_pos_tagger

It took a couple of hours to tag all 4000+ articles. Tagging will either have to be done at crawl time, or regularly, as tagging a large dataset in one go could be prohibitively slow.

Added a 'tagged' link to the corpus interface, to allow the user to see the tagged article as well as the text and HTML. Still need to figure out the best way to store the text: storing both the plain text and the tagged plain text is definitely not the most efficient, so one of these should be removed in the near future.

Richard was concerned that South African words would be incorrectly tagged. This does seem to be a problem: see for example 'maas' in

Wednesday, 19 March 2014

Generic user-assisted feed parser

Started work on a module which looks at user-input RSS URLs and tries to extract the relevant information from them, and create the correct database mappings, asking the user for confirmation.

Experimenting with the Python feedparser module to help with this (prior approach was to use the standard XML parser to parse RSS feeds).

Monday, 17 March 2014

more deduplication

Started looking at n-gram near-deduplication methods. Nice article in SPIRE:
Mariano Consens, Gonzalo Navarro (Eds.), String Processing and Information Retrieval: 12th International Conference, SPIRE 2005 (November 2005), p. 127.

Also read about Onion (ONe Instance ONly) for deduplication

This was developed as part of Pomikálek's PhD thesis, titled Removing Boilerplate and Duplicate Content from Web Corpora. Available at:

Slides titled "Near Duplicate Data in Web Corpora" by Benko available here: (also uses Onion)

Another paper on n-gram similarity methods: "Classification of RSS-formatted Documents using Full Text similarity Measures" by Wegrzyn-Wolska and Szczepaniak. Available at:

Saturday, 15 March 2014

Deduplication and tokenizing

Started working on deduplication. I can compare each plaintext article against every other article in the database. This works for the current data set, but will not scale well. Also, it will not pick up duplicates if any change at all is present. Looked at Python's difflib, which looks promising.

The most efficient way would probably be to find or write an algorithm to extract keywords from an article. If this were accurate enough, then we could simply look at articles with the same keywords, and perform deduplication only on these.

Gallery and video articles are a problem. Sometimes an advert is picked up as the 'main text' of these articles, or the text is so short that it is probably insignificant. These are pretty easy to filter out:
* The URL contains "Gallery", "Video", or "Pics".
* The plaintext often starts with "Gallery".
* The plaintext is normally very short (I experimented with limits - any story under 500 characters seems to be uninteresting).

I can therefore fairly easily filter these out, but we probably still want the comments on these, so I can't just remove them entirely.

All articles from are duplicates. These URLs also simply redirect back to the homepage, even though they contain the same slug as the IOL duplicate.

Sometimes the article id changes, but the slug remains the same. These entries are also duplicates in the database.

For tokenizing, it might make sense to keep an ordered set of the lowercase tokens of each article. This would allow word frequency analysis and efficient lookup. Case sensitive queries and substring matches could then be done with more expensive regex. This might also help with deduplication.
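As a sketch (using a simple regex in place of proper tokenization), the ordered lowercase token set might be built like this:

```python
import re

def token_set(text):
    # Sorted unique lowercase tokens: allows efficient lookup and cheap
    # set operations; case-sensitive and substring queries fall back to
    # more expensive regex over the stored plaintext.
    return sorted(set(re.findall(r"[a-z']+", text.lower())))

print(token_set("The cat sat on the mat. The CAT!"))
```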

Thursday, 13 March 2014

language identification and deduplication

An efficient way to identify language:!topic/nltk-users/pfUq8svEz-s

Create a set() of English vocab (nltk list has about 200000 words). Then create a set of the tokenized article. The difference of these two sets shows how many non-English words are used in the article. (Take ratio of number of non English words to total number of words).

Tested with several articles - English articles seem to have about 25% non-English words (the English vocab list only contains root words and some derivations: e.g., it has 'walk' and 'walking' but not 'walks', which ups the count of 'non-English' words), whereas a non-English article showed about 95% non-English words.

I haven't tried or read anything about using this same method for deduplication, but I imagine that a very similar approach would work well.
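A sketch of the ratio check; the toy vocabulary here stands in for the ~200,000-word NLTK list:

```python
def non_english_ratio(tokens, english_vocab):
    # Fraction of distinct tokens absent from the English vocabulary;
    # in testing, English articles scored ~25% and non-English ones ~95%.
    distinct = set(t.lower() for t in tokens)
    if not distinct:
        return 0.0
    return len(distinct - english_vocab) / float(len(distinct))

vocab = {"the", "cat", "walk", "walking"}  # toy stand-in for NLTK's word list
print(non_english_ratio(["The", "cat", "walks"], vocab))
```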

duplicate removal and multithreading

Found about 400 duplicate entries in the database. Not sure how these got there, but they were from 7 March, before the RSS watcher was being run as a cron job. I removed them, and no more seem to be appearing. Currently 2200 articles in the database.

Started looking at using a multithreaded crawler to crawl web content faster. This will be especially useful for the comments, which at the moment take a long time to crawl. Using a very small set of test cases, a dramatic speed-up is apparent. I'm also planning to experiment with multiprocessing to see if this is faster still.

The comments haven't been 'closed' for any of the articles in the database yet, so as yet I have not added any comments to the database, but the functionality is in place, waiting for some content to process.

Added to reading list:
* Paper on creating a Portuguese corpus:
* The webcleaner (NCLEANER) which was used to create this corpus:

Tuesday, 11 March 2014


Currently 2249 articles in the database - the RSS crawler is finding more posts now that it's not the weekend, as expected.

Worked on fetching comments. Will test tomorrow and if all is as it should be will start crawling and indexing comments too. Only getting comments from articles where the comments have been closed, so this will also run as a cron job to daily check all articles to see if the comments are closed yet.

Sunday, 9 March 2014

Disqus Comments

XML watcher has been running well all day, sending me a report back every half an hour. It looks at about 1400 URLs each time (taken from 133 RSS feeds) and anything between 0 and 20 of these are new. Usually only a couple of new URLs every half hour, but I expect that there are fewer new articles due to it being the weekend.

Found out how to bypass AJAX and load the full JSON of disqus comments via direct URL. Began working on loading and parsing these, but still need to put everything together into a crawler. Using Beautiful Soup to load the JSON section of the HTML page.

Need to find out how long the comment threads stay open for. Then we can easily go through the DB and crawl the comments for all the closed threads. (Or we can do this on a regular basis and check each to see if it's closed).

What information is important? An example Disqus comment looks as follows (lots of metadata, most of which could be useful, though we can probably throw out at least the avatar fields; also, why both raw_message and message - how do they differ, and which should we keep?)

Installed browser automation tool, Selenium, in case we need to do anything else AJAX related.

{"isFlagged":false,"forum":"iol","parent":null,"author":{"username":"HARRYHAT1950","about":"","name":"HARRYHAT1950","url":"","isAnonymous":false,"rep":1.345875,"profileUrl":"","reputation":1.345875,"location":"","isPrivate":false,"isPrimary":true,"joinedAt":"2012-04-23T10:23:08","id":"25208750","avatar":{"small":{"permalink":"//","cache":"//"},"large":{"permalink":"//","cache":"//"},"permalink":"//","cache":"//"}},"media":[],"isDeleted":false,"isApproved":true,"dislikes":0,"raw_message":"Point 2: \"then your first call should be to an ambulance service or the traffic department who will, in turn, alert them\". This was obviously not written in SA. The first thing the traffic department does is notify their contact at their chosen towing company and negotiate their commission. After that they contact their favourite paramedic and negotiate their commission from them. Even the cops don't notify the provincial ambulance service '\u00e7os they know there is no commission and the provincial ambulance service is useless and won't turn up anyway.","createdAt":"2014-03-08T07:26:59","id":"1276210870","thread":"2379235034","depth":0,"numReports":0,"likes":8,"isEdited":false,"message":"\u003cp>Point 2: \"then your first call should be to an ambulance service or the traffic department who will, in turn, alert them\". This was obviously not written in SA. The first thing the traffic department does is notify their contact at their chosen towing company and negotiate their commission. After that they contact their favourite paramedic and negotiate their commission from them. Even the cops don't notify the provincial ambulance service '\u00e7os they know there is no commission and the provincial ambulance service is useless and won't turn up anyway.\u003c/p>","isSpam":false,"isHighlighted":false,"points":8}

Saturday, 8 March 2014

rss, copyright, beautiful soup

Completed RSS crawler - this is now running as a cron job, collecting all new IOL articles. Reporter runs into issues when the article text is too short (about 2 paragraphs of text, normally descriptions of galleries, or breaking news items with "more to follow"). In these cases, it identifies the copyright blurb at the end of most pages as the "main article". This needs to be removed before handing the article to Reporter. Alternatively, fall back to Beautiful Soup to extract text, which is less generic but may be more accurate.

Reporter also includes the "Related links" found in most IOL articles. These will need to be filtered out.

Probably best is to customize as many text extractions as possible (i.e., for all main news sites) and use Reporter as a generic solution in case other sites need to be added or the formats of the current sites change.

Currently removing copyright message based on string matching - this will have to be updated if the copyright message changes (and it is different on the non-English IOL pages.)

Note: copyright div in IOL articles is <div class="copywrite">[sic] ((IT people these days.))

Ran into first issues due to running off a micro-server:
  • Couldn't install reporter. Turned out that lxml dependency was crashing gcc due to running out of memory.
  • Crawling South African pages from the server is significantly slower as the server is located in the Netherlands.

Thursday, 6 March 2014

rss, regex, dates, modules and metadata, encoding

The beginnings of the module which will watch RSS feeds and crawl new articles are now in place. I'm currently testing this with just the main news link from the page: This allows for very easy extraction of

  • url
  • date
  • headline
  • guid (permalink=false might be problematic)
  • description
But unfortunately the author still needs to be extracted from the HTML. Using a double regex (the first to extract all the article headers, the second to extract the author from each) seems to accurately identify the authors for the small test set (20 articles) so far. Further testing will be done.
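The double regex might be sketched as follows; both patterns are illustrative, not IOL's actual markup:

```python
import re

# First pass: isolate each article header block (pattern is hypothetical).
HEADER_RE = re.compile(r'<div class="article_header">(.*?)</div>', re.S)
# Second pass: pull the author name out of a header block.
AUTHOR_RE = re.compile(r"By\s+([A-Z][\w'.-]+(?:\s+[A-Z][\w'.-]+)*)")

def extract_authors(html):
    authors = []
    for header in HEADER_RE.findall(html):
        match = AUTHOR_RE.search(header)
        if match:
            authors.append(match.group(1))
    return authors
```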

Started using Python's dateutil.parser for flexible date parsing.

Received documents from Richard about metadata and modularization.

I need to have a better look at how some of the python modules I'm using are doing encoding, as I'm having a few issues with smart quotes and other unicode characters. I discovered the unidecode python module which does a brilliant job at converting unicode characters to the nearest possible ascii match, which may be useful for some text analysis.

Started looking at NLTK's capabilities for word stemming.

Wednesday, 5 March 2014

better tokenizing and beginning of database design

The basic tokenization used previously was not as good as I thought, as it didn't strip out all punctuation (specifically full stops).

  • Now using the combination of sent_tokenize and word_tokenize as explained here:

Started basic database design, with a linking collection to show associations between words and the articles they appear in.

Tuesday, 4 March 2014

iol, regex

  • Started looking at IOL articles
  • Crawled front page - ~1200 links. About 15 minutes processing time
    • Identified author, date, and article text in ~100 of these (Many links were CLINKS or to ioldating so this is not as small a fraction as it seems)
  • Used regular expressions to find author and date 
    • These can be customized for the metadata analyser to work on other sites
    • Can have dictionary of {sites : regexes}, which allows flexibility, though it means that someone who is capable of writing regex is required for long-term updates
  • Started looking at possibility of incorporating XML feeds into crawler to identify metadata

  • Flagged software (see Richard's email from today)

  • Wikipedia has a fairly extensive list of South African slang words, categorized by language of origin. This may be useful - it would be fairly trivial to extract these into a plaintext dictionary