A weekend project to help extract machine-readable data from websites. I’ve been using it for a couple of months now; the first version was just a Ruby script, and last weekend I decided to wrap it in a very simple website.
http://wscrap.stjhimy.com/
Example
A GitHub blog post:
https://github.com/blog/2004-new-atom-shirt-in-the-shop
Extract it using wscrap:
http://wscrap.stjhimy.com/scrap?url=https%3A%2F%2Fgithub.com%2Fblog%2F2004-new-atom-shirt-in-the-shop
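The request URL is just the `/scrap` endpoint with the target page percent-encoded. A small Ruby sketch of building that URL (the helper name is mine, not part of wscrap):

```ruby
require "uri"

# Build a wscrap request URL by percent-encoding the target page
# and appending it as the `url` query parameter.
def wscrap_url(target)
  "http://wscrap.stjhimy.com/scrap?url=#{URI.encode_www_form_component(target)}"
end

puts wscrap_url("https://github.com/blog/2004-new-atom-shirt-in-the-shop")
# => http://wscrap.stjhimy.com/scrap?url=https%3A%2F%2Fgithub.com%2Fblog%2F2004-new-atom-shirt-in-the-shop
```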
Result:
Building it
The first version was based on the pismo gem. It worked well for a few websites, but then I decided to write my own extractor (the wrong decision, though it worked well in the end). Requests are throttled with Rack, and caching is done with Redis.
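Rack-level throttling can be wired up as middleware. A minimal `config.ru` sketch, assuming the `rack-throttle` and `redis` gems (the actual wscrap setup isn’t published, so the app constant is hypothetical):

```ruby
# config.ru sketch -- assumptions: rack-throttle gem for the rate limit,
# a Redis connection to share counters; WscrapApp is a made-up app name.
require "rack/throttle"
require "redis"

use Rack::Throttle::Minute,
    max: 30,            # matches the post's 30-requests-per-minute limit
    cache: Redis.new,   # store per-client counters in Redis
    key_prefix: :throttle

run WscrapApp
```

Doing the limiting in middleware keeps the extraction code itself free of any rate-limit logic.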
Limitations
It’s a tiny project that runs on a single Heroku dyno, which imposes a few limitations. To keep things fast, every extracted URL is cached in Redis for 60 seconds, and there is a limit of 30 requests per minute.
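The cache-or-extract pattern behind the 60-second cache can be sketched in plain Ruby (my reconstruction, not the actual wscrap code; a Hash with expiry timestamps stands in for Redis `SETEX`):

```ruby
# In-memory stand-in for a Redis cache with a 60-second TTL.
CACHE = {}
TTL = 60 # seconds

def fetch_cached(url)
  entry = CACHE[url]
  # Serve from cache while the entry is still fresh.
  return entry[:value] if entry && Time.now < entry[:expires_at]
  value = yield # run the (slow) extraction only on a miss
  CACHE[url] = { value: value, expires_at: Time.now + TTL }
  value
end

calls = 0
extract = -> { calls += 1; "extracted data" }

fetch_cached("https://example.com") { extract.call }
fetch_cached("https://example.com") { extract.call } # cache hit, no extraction
puts calls # => 1
```

With Redis, the expiry bookkeeping disappears: `redis.setex(url, 60, value)` lets the server evict entries on its own.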