I recently wrote about rss.pl, the RSS news crawler I put together. The
majority of it took a few hours to write - the goal was to reduce my mobile
bandwidth and avoid annoying adverts being injected into pages. I set
my sights very low - this doesn't need to be a technically difficult
piece of software, and it gave me something to play with - something
I have done numerous times over the years. (CRiSP incorporates a
news side-bar, for instance; and CRiSP used to support a web
spider - it still does, but there's less need for it in the
age of superfast broadband, compared to the early modems it
was written against).
I can happily report that rss.pl exceeded my hopes for the bandwidth
diet. Previously I was approaching 500MB/month of bandwidth use
and it was out of control, due to the obnoxious way adverts were
being generated (they were mutilating the HTML of a page - after
I had paid the cost of downloading the page - so I was effectively
paying twice for each page, whilst being denied compression). In addition,
the mechanism added significant delay - I was lucky to read
one or two web pages in several minutes.
With rss.pl, I do a page refresh maybe 2 or 3 times a day. The
single (compressed) page load of ~90kb (~250kb uncompressed) lets
me wade through maybe 100-200 news items over the course of a few
hours - I can read far more, without annoying delays.
In one month, I hit 60MB of use - I had set myself a diet of 100MB/month,
and I came in well under it. I would have hit 50MB or lower, but occasionally
I use my device for email.
Anyway, I started adding stats to the service, which lets me fine-tune
how it works. I noticed other people use it (by all means do -
rss.pl - but
I reserve the right to take it down or restrict usage if I find any abuse).
One of the things I am debating is how much history to carry. I currently
allow for the last 2000 news items in a cache, but of the 10-12 sites
I poll, at varying frequencies, some sites are exposing old items.
(I don't use date aging - yet - simply URL history). The working set
is about 700-800 news items, so within 2 or so days the oldest URLs
fall out of the cache and stale news items start popping through.
It's interesting watching the effect.
For example, www.osnews.com is not a fast-updating site - maybe 1 item
per day - yet my code is allowing multiple old items to pop through.
"The Old New Thing" (http://blogs.msdn.com/b/oldnewthing/rss.aspx) is a
great site, and updates about once per day. Yet very old items are
popping through there too.
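To make the mechanism concrete, here is a minimal sketch of the kind of
fixed-size URL history I am describing - the names are hypothetical, not
the actual rss.pl code:

    use strict;
    use warnings;

    my $HISTORY_MAX = 2000;   # cap on remembered URLs
    my %seen;                 # url => 1, for O(1) membership tests
    my @order;                # FIFO of urls, oldest first

    # Return true the first time a URL is offered, i.e. a "new" item.
    # Once a URL has been evicted from the FIFO, the same item will be
    # reported as new again - the stale-item effect described above.
    sub is_new_item {
        my ($url) = @_;
        return 0 if $seen{$url};
        $seen{$url} = 1;
        push @order, $url;
        if (@order > $HISTORY_MAX) {
            delete $seen{ shift @order };   # forget the oldest URL
        }
        return 1;
    }

With a working set of 700-800 items, a 2000-slot window only covers a
couple of days of traffic, so any feed that re-advertises an item older
than that sails straight past the check.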
Somehow, the interplay between the algorithm for retrieving sites and
the update cycles of those sites is causing a kind of "shearing" effect -
something typically seen with a fragmented heap (hence the reference to
malloc) or locality issues in a page-swapping algorithm.
It's as if the sinusoidal wave of each news feed is leading to some
kind of interference pattern between the feeds and the polling cycle.
I can just increase my history limit - but doing so wastes more CPU
cycles on my server (so I would need a better data structure
than the one I presently use). Maybe I will relent, do
If-Modified-Since requests, and use the timestamps to keep everything
in order.
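If I go that route, the conditional request itself is cheap - something
along these lines with LWP::UserAgent (a sketch, assuming a per-feed
table of last-fetch times; this is not the current rss.pl code):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Date qw(time2str);

    my $ua = LWP::UserAgent->new(timeout => 20);
    my %last_fetched;   # feed URL -> epoch time of last 200 response

    # Fetch a feed, but let the server answer 304 Not Modified
    # when nothing has changed since we last saw it.
    sub poll_feed {
        my ($url) = @_;
        my @hdrs;
        push @hdrs, 'If-Modified-Since' => time2str($last_fetched{$url})
            if $last_fetched{$url};
        my $resp = $ua->get($url, @hdrs);
        return undef if $resp->code == 304;   # nothing new - skip parsing
        if ($resp->is_success) {
            $last_fetched{$url} = time();
            return $resp->decoded_content;
        }
        warn "poll of $url failed: ", $resp->status_line, "\n";
        return undef;
    }

A 304 costs a few hundred bytes instead of the whole feed, and the item
pubDate timestamps (parseable with HTTP::Date::str2time) would give me
the date aging mentioned above.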
At the moment, I discard older news items and the pages I generate, but
I may decide to keep them for long-term archival - for more data mining.
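The archival would not need to be fancy - a sketch of the sort of thing
I have in mind, appending each new item as one JSON line to a per-month
file (the file layout and field names are just placeholders):

    use strict;
    use warnings;
    use JSON::PP;
    use POSIX qw(strftime);

    my $json = JSON::PP->new->canonical;

    # Append one news item to this month's archive file, one JSON
    # object per line, so later data mining is a simple line scan.
    sub archive_item {
        my ($item) = @_;   # e.g. { url => ..., title => ..., seen => time() }
        my $file = strftime("archive-%Y-%m.jsonl", localtime);
        open my $fh, '>>', $file or die "cannot append $file: $!";
        print $fh $json->encode($item), "\n";
        close $fh;
    }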