I’ve been using RSS Generator for a while to generate RSS feeds for web pages that don’t provide one. However, the service is often unreliable, probably due to the enormous load from various RSS readers. Another caveat is that the URL of the generated feed is so long that the input forms of some web-based RSS readers don’t accept it.
So I chose instead to write a simple shell script that sends me an e-mail message when the web pages in my watch list change. Its name is ‘URL Monitor’:
#!/bin/sh
# Path: /usr/local/bin/url-monitor
mkdir -p /var/cache/url-monitor
cat /etc/url-monitor.conf | while read -r NAME; do
    # Each entry in the configuration file is six lines long.
    read -r URL || exit 1
    read -r INTERVAL || exit 2
    read -r STRIP_REGEX || exit 3
    read -r NEEDLE || exit 4
    read -r REPLACEMENT || exit 5
    # Skip the page if its cached copy is younger than the revisit interval.
    if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
        MTIME=`stat --format=%Z "/var/cache/url-monitor/$NAME.html"`
        NOW=`date +%s`
        AGE=$(($NOW - $MTIME))
        if [ $AGE -lt $INTERVAL ]; then
            continue;
        fi
    fi
    # Fetch the page, join it into a single line, strip the unwanted parts,
    # collapse whitespace and rewrite the relative URLs.
    wget -q -T 60 -O - "$URL" |
        perl -pi -e 's/[\r\n]/ /g' |
        perl -pi -e "s/$STRIP_REGEX//gi" |
        perl -pi -e 's/\s+/ /g' |
        perl -pi -e "s/$NEEDLE/$REPLACEMENT/gi" > "/var/cache/url-monitor/$NAME.html.new"
    # An empty or missing result means the fetch failed.
    if [ ! -f "/var/cache/url-monitor/$NAME.html.new" ] || [ `stat --format=%s "/var/cache/url-monitor/$NAME.html.new"` = "0" ]; then
        echo "Failed to fetch - $NAME" >&2
        rm -f "/var/cache/url-monitor/$NAME.html.new"
        touch "/var/cache/url-monitor/$NAME.html"
        exit 6
    fi
    if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
        diff -q "/var/cache/url-monitor/$NAME.html" "/var/cache/url-monitor/$NAME.html.new" > /dev/null 2>&1
        if [ "$?" = "0" ]; then
            # Unchanged: drop the new copy and restart the revisit timer.
            rm -f "/var/cache/url-monitor/$NAME.html.new"
            touch "/var/cache/url-monitor/$NAME.html"
            continue
        else
            mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"
        fi
    else
        mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"
    fi
    # Send notification
    {
        echo 'From: URL Monitor <[email protected]>'
        echo 'To: Trustin Lee <[email protected]>'
        echo "Subject: $NAME - updated"
        echo 'Content-Type: text/html; charset=euc-kr'
        echo
        cat "/var/cache/url-monitor/$NAME.html"
        echo
    } | sendmail trustin
done
This quick and dirty shell script simply strips the unnecessary parts from the fetched web page, caches the result, and notifies me (local user ‘trustin’) by e-mail when the newly fetched content differs from the cached copy. The following is an example configuration file (/etc/url-monitor.conf):
JavaWorld: Featured Tutorials
http://www.javaworld.com/features/index.html
86400
(^.*<div id="toplist">|<p><a class="findmore".*$)
\/javaworld\/
http:\/\/www.javaworld.com\/javaworld\/
DDJ.com: High Performance Computing
http://www.ddj.com/hpc-high-performance-computing/archives.jhtml
86400
(^.*Feature Articles\s*-->|<br clear="left">.*$)
\/hpc-high-performance-computing\/
http:\/\/www.ddj.com\/hpc-high-performance-computing\/
Lono.pe.kr
http://lono.pe.kr/src/
86400
(^.*\[\[Start\]\]-->|<!--\[\[.*$)
\/src\/
http:\/\/www.lono.pe.kr\/src\/
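Each entry consists of six lines with the following meaning:
- 1st line – the subject of the web page (also used as the cache file name)
- 2nd line – the URL of the web page
- 3rd line – revisit interval (in seconds)
- 4th line – what to strip out (in regex)
- 5th line – something to replace (in regex), probably relative URLs
- 6th line – what you want to replace the expression specified in the 5th line with
Because the script uses / as the regex delimiter, any slash in the last three lines must be escaped as \/.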
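To see what the strip regex and the needle/replacement pair actually do, the script's Perl pipeline can be tried on its own. The following is only a sketch: `normalize_page` is a hypothetical helper name, and the sample page and the example.com URL are made up for illustration.

```shell
#!/bin/sh
# Sketch of the script's normalization step, applied to a page on stdin.
# normalize_page is a hypothetical helper; its three arguments play the
# roles of STRIP_REGEX, NEEDLE and REPLACEMENT from the configuration file.
normalize_page() {
    strip_regex=$1
    needle=$2
    replacement=$3
    # 1. join all lines into one, 2. strip the unwanted parts,
    # 3. collapse whitespace, 4. rewrite relative URLs.
    perl -pe 's/[\r\n]/ /g' |
        perl -pe "s/$strip_regex//gi" |
        perl -pe 's/\s+/ /g' |
        perl -pe "s/$needle/$replacement/gi"
}

# Made-up sample page, not taken from any real site.
# Note the escaped slashes (\/) in all three arguments.
printf '<html>\n<div id="list">\n<a href="/src/a.c">a.c</a>\n</div>\n<p>footer</p>\n</html>\n' |
    normalize_page '(^.*<div id="list">|<\/div>.*$)' '\/src\/' 'http:\/\/example.com\/src\/'
echo
```

Only the anchor element should survive, with its relative link rewritten to an absolute one. Joining the page into a single line first is what lets `^.*` and `.*$` in the strip regex reach across the whole document.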
Once configured, url-monitor should be executed periodically. I added the following line to my crontab:
# Path: /etc/cron.d/url-monitor.cron
SHELL=/bin/bash
PATH=/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=trustin
HOME=/root
# Run the URL monitor every three minutes
*/3 * * * * root /usr/local/bin/url-monitor
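Although cron starts the script every three minutes, each page is fetched only when its cached copy is older than its own revisit interval, so most runs do nothing. A minimal sketch of that check follows. `is_due` is a hypothetical helper name, and it reads the modification time (`%Y`) via `stat -c`, the short form of `--format`, whereas the script itself reads `%Z`.

```shell
#!/bin/sh
# Sketch of the revisit check: a page is due when its cache file is missing
# or at least as old as the revisit interval. is_due is a hypothetical helper.
is_due() {
    cache_file=$1
    interval=$2
    # Never fetched before: due right away.
    [ -f "$cache_file" ] || return 0
    # %Y is the modification time in seconds since the epoch (GNU stat).
    mtime=$(stat -c %Y "$cache_file") || return 0
    now=$(date +%s)
    [ $((now - mtime)) -ge "$interval" ]
}

# A freshly created file with a one-day interval is not due yet.
cache=$(mktemp)
if is_due "$cache" 86400; then
    echo "fetch now"
else
    echo "skip this run"
fi
rm -f "$cache"
```

With an interval of 86400, roughly one run per day reaches the network for a given page, and the other runs return immediately.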
As you may have noticed, the script is very primitive and requires you to modify the script itself to configure certain parameters, such as the mail recipient and the character set. However, I think that’s OK as long as the number of web pages I have to monitor (read: those which don’t provide RSS) is small.
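Since the configuration file is just six lines per page with no separators, a missing line shifts every later entry. A small check such as the one below can catch that before the monitor runs. This is a sketch, not part of the original script: `check_conf` is a hypothetical helper, and the sample entry is made up.

```shell
#!/bin/sh
# Sketch: sanity-check a url-monitor style configuration file.
# check_conf is a hypothetical helper, not part of the original script.
check_conf() {
    conf=$1
    entry=0
    while read -r name; do
        entry=$((entry + 1))
        # Each entry needs five more lines after the name.
        read -r url && read -r interval && read -r strip_regex &&
            read -r needle && read -r replacement || {
            echo "entry $entry ($name): incomplete, expected 6 lines" >&2
            return 1
        }
        # The script compares the interval with -lt, so it must be an integer.
        case $interval in
            ''|*[!0-9]*)
                echo "entry $entry ($name): interval '$interval' is not a number" >&2
                return 1 ;;
        esac
        echo "entry $entry: $name -> $url every ${interval}s"
    done < "$conf"
}

# Made-up sample entry written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'Example page' 'http://example.com/' '86400' \
    '(^.*<body>|<\/body>.*$)' '\/img\/' 'http:\/\/example.com\/img\/' > "$tmp"
check_conf "$tmp"
rm -f "$tmp"
```

For the real file, the call would be `check_conf /etc/url-monitor.conf`. The function stops at the first malformed entry and returns a non-zero status.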
Trustin,
Just wanted to say thank you for writing & posting this. It is exactly what I was looking for.
@Chris Harmoney: Hello Chris, it’s my pleasure that this crude script helped you. I’ve just reformatted the post so that it’s easier to copy-and-paste. Thanks for a visit! 🙂