URL Monitor – Getting Notified When a Web Page is Updated

I’ve been using RSS Generator for a while to generate RSS feeds for web pages that don’t provide one. However, the service is often unreliable, probably because of the heavy load from the various RSS readers polling it. Another drawback is that the generated feed URLs are so long that the input forms of some web-based RSS readers reject them.

So I chose instead to write a simple shell script that sends me an e-mail message when a web page on my watch list changes. Its name is ‘URL Monitor’:

#!/bin/sh
# Path: /usr/local/bin/url-monitor

mkdir -p /var/cache/url-monitor
cat /etc/url-monitor.conf | while read -r NAME; do
  read -r URL || exit 1
  read -r INTERVAL || exit 2
  read -r STRIP_REGEX || exit 3
  read -r NEEDLE || exit 4
  read -r REPLACEMENT || exit 5

  if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
    MTIME=`stat --format=%Z "/var/cache/url-monitor/$NAME.html"`
    NOW=`date +%s`
    AGE=$(($NOW - $MTIME))
    if [ $AGE -lt $INTERVAL ]; then
      continue;
    fi
  fi

  wget -q -T 60 -O - "$URL" | perl -pi -e 's/[\r\n]/ /g' | perl -pi -e "s/$STRIP_REGEX//gi" | perl -pi -e 's/\s+/ /g' | perl -pi -e "s/$NEEDLE/$REPLACEMENT/gi" > "/var/cache/url-monitor/$NAME.html.new"

  # -s is true only for an existing, non-empty file (and '==' is not portable under /bin/sh)
  if [ ! -s "/var/cache/url-monitor/$NAME.html.new" ]; then
    echo "Failed to fetch - $NAME" >&2
    rm -f "/var/cache/url-monitor/$NAME.html.new"
    touch "/var/cache/url-monitor/$NAME.html"
    exit 6
  fi

  if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
    diff -q "/var/cache/url-monitor/$NAME.html" "/var/cache/url-monitor/$NAME.html.new" > /dev/null 2>&1
    if [ "$?" = "0" ]; then
      rm -f "/var/cache/url-monitor/$NAME.html.new"
      touch "/var/cache/url-monitor/$NAME.html"
      continue
    else
      mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"
    fi
  else
    mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"
  fi

  # Send notification
  {
    echo 'From: URL Monitor <[email protected]>'
    echo 'To: Trustin Lee <[email protected]>'
    echo "Subject: $NAME - updated"
    echo 'Content-Type: text/html; charset=euc-kr'
    echo
    cat "/var/cache/url-monitor/$NAME.html"
    echo
  } | sendmail trustin
done

This quick-and-dirty shell script strips the unnecessary parts from each fetched web page, caches the result, and notifies me (local user ‘trustin’) by e-mail when the newly fetched content differs from the cached copy. The following is an example configuration file (/etc/url-monitor.conf):

JavaWorld: Featured Tutorials
http://www.javaworld.com/features/index.html
86400
(^.*<div id="toplist">|<p><a class="findmore".*$)
\/javaworld\/
http:\/\/www.javaworld.com\/javaworld\/
DDJ.com: High Performance Computing
http://www.ddj.com/hpc-high-performance-computing/archives.jhtml
86400
(^.*Feature Articles\s*-->|<br clear="left">.*$)
\/hpc-high-performance-computing\/
http:\/\/www.ddj.com\/hpc-high-performance-computing\/
Lono.pe.kr
http://lono.pe.kr/src/
86400
(^.*\[\[Start\]\]-->|<!--\[\[.*$)
\/src\/
http:\/\/www.lono.pe.kr\/src\/
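
The file has no separators between entries: the script simply consumes six consecutive lines per page, so one missing line shifts every entry after it. Below is a minimal sketch of a sanity check, assuming the same six-line layout as the `read` calls in the script. `check_config` is a hypothetical helper, not part of URL Monitor, and the sample record describes a made-up example.com page.

```shell
#!/bin/sh
# check_config (hypothetical helper): reads six-line records from stdin,
# prints one summary line per record, and returns non-zero on a truncated
# record or a non-numeric interval.
check_config() {
  n=0
  while read -r NAME; do
    n=$((n + 1))
    read -r URL && read -r INTERVAL && read -r STRIP_REGEX &&
      read -r NEEDLE && read -r REPLACEMENT || {
      echo "record $n ($NAME): incomplete" >&2
      return 1
    }
    case "$INTERVAL" in
      '' | *[!0-9]*)
        echo "record $n ($NAME): bad interval '$INTERVAL'" >&2
        return 1
        ;;
    esac
    echo "$n: $NAME <$URL> every ${INTERVAL}s"
  done
}

# Made-up sample record, quoted so the backslashes survive as written.
check_config <<'EOF'
Example page
http://example.com/news/
86400
(^.*<body>|<\/body>.*$)
\/news\/
http:\/\/example.com\/news\/
EOF
```

Against the real file it would be run as `check_config < /etc/url-monitor.conf`.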

Each entry consists of six lines with the following meanings:

  • 1st line – the subject of the web page
  • 2nd line – the URL of the web page
  • 3rd line – revisit interval (in seconds)
  • 4th line – what to strip out (in regex)
  • 5th line – something to replace (in regex), probably relative URLs
  • 6th line – what you want to replace the expression specified in the 5th line with

Once configured, url-monitor should be executed periodically. I added the following file to /etc/cron.d:

# Path: /etc/cron.d/url-monitor.cron
SHELL=/bin/bash
PATH=/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=trustin
HOME=/root

# Run the URL monitor every three minutes
*/3 * * * *  root  /usr/local/bin/url-monitor

As you may have noticed, it’s very primitive, and certain parameters (such as the mail headers and the recipient) can only be changed by editing the script itself. However, I think that’s fine as long as the number of web pages I have to monitor (read: those which don’t provide RSS) is small.
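
If editing the script ever becomes a nuisance, one possible approach is to pull the hard-coded values into variables with defaults that can be overridden from the environment or from the cron file. This is a sketch only: the `URL_MONITOR_*` names and `mail_headers` are hypothetical, and the From address is a placeholder rather than the one used in the original script.

```shell
#!/bin/sh
# Hypothetical settings block: each value keeps its default unless the
# variable is already set in the environment.
: "${URL_MONITOR_CONF:=/etc/url-monitor.conf}"
: "${URL_MONITOR_CACHE:=/var/cache/url-monitor}"
: "${URL_MONITOR_FROM:=URL Monitor <url-monitor@localhost>}"  # placeholder address
: "${URL_MONITOR_RCPT:=trustin}"
: "${URL_MONITOR_CHARSET:=euc-kr}"

# mail_headers (hypothetical helper): prints the message headers for the
# page named in $1, followed by the blank line that ends the header block.
mail_headers() {
  echo "From: $URL_MONITOR_FROM"
  echo "Subject: $1 - updated"
  echo "Content-Type: text/html; charset=$URL_MONITOR_CHARSET"
  echo
}

mail_headers 'Example page'
```

The notification step would then become something like `{ mail_headers "$NAME"; cat "$URL_MONITOR_CACHE/$NAME.html"; } | sendmail "$URL_MONITOR_RCPT"`, and a different charset could be chosen with a line such as `URL_MONITOR_CHARSET=utf-8` in the cron file.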

Comments

  1. Trustin Lee

    @Chris Harmoney: Hello Chris, it’s my pleasure that this crude script helped you. I’ve just reformatted the post so that it’s easier to copy-and-paste. Thanks for a visit! 🙂
