Monday, 17 March 2014


I recently wrote about the RSS news crawler I built. The majority
of it took a few hours to write - the goal was to reduce my mobile
bandwidth and avoid annoying adverts being injected into pages. I set
my sights very low - this doesn't need to be a technically difficult
piece of software, and it gave me something to play with - something
I have done numerous times over the years. (CRiSP incorporates a
news side-bar, for instance; and CRiSP used to support a web
spider - it still does, but there's less need for it in the
days of superfast broadband, compared to the early modems it
was written against.)

I can happily report that it exceeded my hopes for the bandwidth
diet. Previously I was approaching 500MB/month of bandwidth use,
and it was out of control, due to the obnoxious way adverts were
being generated (they were mutilating the HTML of a page - after
I had paid the cost of downloading the page - so I was effectively
paying twice for a page, whilst being denied compression). In addition,
the mechanism would add significant delay - I was lucky to read
1 or 2 web pages in several minutes.

With the crawler, I do a page refresh maybe 2 or 3 times a day. The
single (compressed) page load of ~90kb (~250kb uncompressed) lets
me wade through maybe 100-200 news items over the course of a few
hours - I can read far more, without annoying delays.

In one month, I hit 60MB of use - I had set myself a diet of 100MB/month,
and I came in well under that. I would have hit 50MB or lower, but
occasionally I use my device for email.

Anyway, I started adding stats to the service, which lets me fine-tune
how it works. I noticed other people use it (by all means do, but
I reserve the right to take it down or restrict usage if I find any abuse).

One of the things I am debating is how much history to carry. I currently
allow for the last 2000 news items in a cache, but of the 10-12 sites
I poll, at varying frequencies, some sites are exposing old items.
(I don't use date aging - yet - simply URL history.) The working set
is about 700-800 news items, so within 2 or so days, I will be exposed
to stale news items popping through. It's interesting watching the effect.
For example, one site is not a fast-updating site - maybe 1 item
per day - yet my code is allowing multiple old items to pop through.
"The Old New Thing" is a great site, and updates at one item per day.
Yet very old items are peeping through.
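
The bounded URL-history idea can be sketched roughly as below (my own
names and a made-up limit of 3 for illustration - the real crawler's
code differs). The usage at the end also shows how a slow feed's item
can "pop through" again once faster feeds cycle it out of the cache:

```python
from collections import OrderedDict

class UrlHistory:
    """Bounded seen-URL cache: remembers the last `limit` URLs and
    evicts the oldest entry when full."""

    def __init__(self, limit=2000):
        self.limit = limit
        self.seen = OrderedDict()

    def is_new(self, url):
        if url in self.seen:
            self.seen.move_to_end(url)      # refresh recency on a repeat
            return False
        self.seen[url] = True
        if len(self.seen) > self.limit:
            self.seen.popitem(last=False)   # evict the oldest URL
        return True

h = UrlHistory(limit=3)
h.is_new("slow-feed/item1")   # True - first sighting
h.is_new("slow-feed/item1")   # False - remembered
h.is_new("fast-feed/a")
h.is_new("fast-feed/b")
h.is_new("fast-feed/c")       # cache now full; slow feed's item evicted
h.is_new("slow-feed/item1")   # True again - the stale item "pops through"
```

With a working set of 700-800 items against a 2000-entry cache, a feed
that only produces one item a day gets its history flushed by the busier
feeds, which is consistent with the shearing I describe below.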

Somehow, the algorithm for retrieving sites, and the update cycle of
these sites, is causing a kind of "shearing" effect - something
typically seen with a fragmented heap (hence the reference to malloc) or
locality issues in a page-swapping algorithm.

It's as if the sinusoidal wave of each news feed is leading to some
interesting effects.

I can just increase my history limit, but doing so wastes more CPU
cycles on my server (so I now need a better data structure
than the one I presently use). Maybe I will relent, do
If-Modified-Since requests, and use the timestamps to keep everything
fresh.
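
If I go that route, the conditional-request part might look something
like this sketch: remember the Last-Modified value a server sends and
echo it back on the next poll, so an unchanged feed answers with a
bodiless 304 Not Modified (the user-agent string here is made up):

```python
import email.utils
import time

def conditional_headers(last_modified):
    """Build headers for a conditional GET. If we have a previously
    stored Last-Modified value, send it as If-Modified-Since so the
    server can reply 304 and skip sending the feed body."""
    headers = {"User-Agent": "rss-crawler/0.1"}  # hypothetical agent
    if last_modified:
        # Echo the server's date string back verbatim rather than
        # reformatting it - safest for comparison on the server side.
        headers["If-Modified-Since"] = last_modified
    return headers

# First poll: no stored timestamp, so it's an unconditional fetch.
first = conditional_headers(None)

# Later polls: replay the Last-Modified the server gave us
# (an RFC 1123 date, e.g. generated here for illustration).
stamp = email.utils.formatdate(time.time(), usegmt=True)
later = conditional_headers(stamp)
```

A 304 response costs a few hundred bytes instead of the full feed, so
the slow feeds would stop costing bandwidth between updates.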

At the moment, I discard older news items and the pages I generate, but
I may decide to keep them for long-term archival - for more data mining
later.
Post created by CRiSP v11.0.27a-b6712
