Sunday, 9 March 2014

Disqus Comments

XML watcher has been running well all day, sending me a report back every half an hour. It looks at about 1400 URLs each time (taken from 133 RSS) feeds and anything between 0 and 20 of these are new. Usually only a couple of new URLs every half hour, but I expect that there are fewer new articles due to it being weekend.

Found out how to bypass AJAX and load the full JSON of disqus comments via direct URL. Began working on loading and parsing these, but still need to put everything together into a crawler. Using Beautiful Soup to load the JSON section of the HTML page.

Need to find out how long the comment threads stay open for. Then we can easily go through the DB and crawl the comments for all the closed threads. (Or we can do this on a regular basis and check each to see if it's closed).

What information is important? An example disqus comment looks as follows (lots of metadata. Most of it could be useful, though can probably throw out at least the avatar stuff. Also, why raw_message and message - how do they differ; which should we keep?)

Installed browser automation tool, Selenium, in case we need to do anything else AJAX related.

{"isFlagged":false,"forum":"iol","parent":null,"author":{"username":"HARRYHAT1950","about":"","name":"HARRYHAT1950","url":"","isAnonymous":false,"rep":1.345875,"profileUrl":"http://disqus.com/HARRYHAT1950/","reputation":1.345875,"location":"","isPrivate":false,"isPrimary":true,"joinedAt":"2012-04-23T10:23:08","id":"25208750","avatar":{"small":{"permalink":"//a.disquscdn.com/uploads/forums/128/5645/avatar32.jpg?1386858689","cache":"//a.disquscdn.com/uploads/forums/128/5645/avatar32.jpg?1386858689"},"large":{"permalink":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689","cache":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689"},"permalink":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689","cache":"//a.disquscdn.com/uploads/forums/128/5645/avatar92.jpg?1386858689"}},"media":[],"isDeleted":false,"isApproved":true,"dislikes":0,"raw_message":"Point 2: \"then your first call should be to an ambulance service or the traffic department who will, in turn, alert them\". This was obviously not written in SA. The first thing the traffic department does is notify their contact at their chosen towing company and negotiate their commission. After that they contact their favourite paramedic and negotiate their commission from them. Even the cops don't notify the provincial ambulance service '\u00e7os they know there is no commission and the provincial ambulance service is useless and won't turn up anyway.","createdAt":"2014-03-08T07:26:59","id":"1276210870","thread":"2379235034","depth":0,"numReports":0,"likes":8,"isEdited":false,"message":"\u003cp>Point 2: \"then your first call should be to an ambulance service or the traffic department who will, in turn, alert them\". This was obviously not written in SA. The first thing the traffic department does is notify their contact at their chosen towing company and negotiate their commission. After that they contact their favourite paramedic and negotiate their commission from them. Even the cops don't notify the provincial ambulance service '\u00e7os they know there is no commission and the provincial ambulance service is useless and won't turn up anyway.\u003c/p>","isSpam":false,"isHighlighted":false,"points":8}


No comments:

Post a Comment