{"id":1510,"date":"2008-04-05T17:42:00","date_gmt":"2008-04-05T17:42:00","guid":{"rendered":"http:\/\/t.motd.kr\/articles\/1510\/url-monitor-getting-notified-when-a-web-page-is-updated"},"modified":"2022-12-28T01:45:56","modified_gmt":"2022-12-27T16:45:56","slug":"url-monitor-getting-notified-when-a-web-page-is-updated","status":"publish","type":"post","link":"https:\/\/vault.motd.kr\/wordpress\/posts\/1510\/url-monitor-getting-notified-when-a-web-page-is-updated\/","title":{"rendered":"URL Monitor – Getting Notified When a Web Page is Updated"},"content":{"rendered":"\n

I\u2019ve been using RSS Generator for a while to generate RSS feeds for web pages that don\u2019t provide one. However, the service is often unreliable, probably because of the heavy load from various RSS readers. Another drawback is that the URL of the generated feed is so long that the input forms of some web-based RSS readers reject it.<\/p>\n\n\n\n

So I chose instead to write a simple shell script that sends me an e-mail message whenever a web page on my watch list changes. Its name is \u2018URL Monitor\u2019:<\/p>\n\n\n\n

#!\/bin\/sh\n# Path: \/usr\/local\/bin\/url-monitor\n\nmkdir -p \/var\/cache\/url-monitor\n\n# Each record in the configuration file spans six lines, read in this order.\ncat \/etc\/url-monitor.conf | while read -r NAME; do\n  read -r URL || exit 1\n  read -r INTERVAL || exit 2\n  read -r STRIP_REGEX || exit 3\n  read -r NEEDLE || exit 4\n  read -r REPLACEMENT || exit 5\n\n  # Skip this page if the cached copy is younger than INTERVAL seconds.\n  if [ -f \"\/var\/cache\/url-monitor\/$NAME.html\" ]; then\n    MTIME=$(stat --format=%Y \"\/var\/cache\/url-monitor\/$NAME.html\")\n    NOW=$(date +%s)\n    AGE=$((NOW - MTIME))\n    if [ \"$AGE\" -lt \"$INTERVAL\" ]; then\n      continue\n    fi\n  fi\n\n  # Fetch the page, join it into a single line, strip the unwanted parts,\n  # collapse whitespace, and apply the configured substitution.\n  wget -q -T 60 -O - \"$URL\" | perl -p -e 's\/[\\r\\n]\/ \/g' | perl -p -e \"s\/$STRIP_REGEX\/\/gi\" | perl -p -e 's\/\\s+\/ \/g' | perl -p -e \"s\/$NEEDLE\/$REPLACEMENT\/gi\" > \"\/var\/cache\/url-monitor\/$NAME.html.new\"\n\n  # A missing or empty result means the fetch failed.\n  if [ ! -s \"\/var\/cache\/url-monitor\/$NAME.html.new\" ]; then\n    echo \"Failed to fetch - $NAME\" >&2\n    rm -f \"\/var\/cache\/url-monitor\/$NAME.html.new\"\n    touch \"\/var\/cache\/url-monitor\/$NAME.html\"\n    exit 6\n  fi\n\n  # Nothing changed: refresh the timestamp and move on to the next record.\n  if [ -f \"\/var\/cache\/url-monitor\/$NAME.html\" ] && diff -q \"\/var\/cache\/url-monitor\/$NAME.html\" \"\/var\/cache\/url-monitor\/$NAME.html.new\" > \/dev\/null 2>&1; then\n    rm -f \"\/var\/cache\/url-monitor\/$NAME.html.new\"\n    touch \"\/var\/cache\/url-monitor\/$NAME.html\"\n    continue\n  fi\n  mv -f \"\/var\/cache\/url-monitor\/$NAME.html.new\" \"\/var\/cache\/url-monitor\/$NAME.html\"\n\n  # Send notification\n  {\n    echo 'From: URL Monitor <url-monitor@gleamynode.net>'\n    echo 'To: Trustin Lee <trustin@gmail.com>'\n    echo \"Subject: $NAME - updated\"\n    echo 'Content-Type: text\/html; charset=euc-kr'\n    echo\n    cat \"\/var\/cache\/url-monitor\/$NAME.html\"\n    echo\n  } | sendmail trustin\ndone<\/code><\/pre>\n\n\n\n

This quick and dirty shell script simply strips the unnecessary parts from the fetched web page, caches the result, and notifies me (local user \u2018trustin\u2019) by e-mail when the newly fetched content differs from the cached copy. The following is an example configuration file (\/etc\/url-monitor.conf):<\/p>\n\n\n\n

JavaWorld: Featured Tutorials\nhttp:\/\/www.javaworld.com\/features\/index.html\n86400\n(^.*<div id=\"toplist\">|<p><a class=\"findmore\".*$)\n\\\/javaworld\\\/\nhttp:\\\/\\\/www.javaworld.com\\\/javaworld\\\/\nDDJ.com: High Performance Computing\nhttp:\/\/www.ddj.com\/hpc-high-performance-computing\/archives.jhtml\n86400\n(^.*Feature Articles\\s*-->|<br clear=\"left\">.*$)\n\\\/hpc-high-performance-computing\\\/\nhttp:\\\/\\\/www.ddj.com\\\/hpc-high-performance-computing\\\/\nLono.pe.kr\nhttp:\/\/lono.pe.kr\/src\/\n86400\n(^.*\\[\\[Start\\]\\]-->|<!--\\[\\[.*$)\n\\\/src\\\/\nhttp:\\\/\\\/www.lono.pe.kr\\\/src\\\/<\/code><\/pre>\n\n\n\n
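As a rough sketch of how the script consumes this file, assuming the six-lines-per-record layout implied by its `read` calls: the hypothetical `list_records` function below (not part of URL Monitor) walks a configuration file in the same order and prints one summary line per record, which may help spot a missing line before the real script runs.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script: list the records in a
# url-monitor style configuration file. It reads six lines per record, using
# the same variable names and order as the script's own read calls.
list_records() {
  while read -r NAME; do
    read -r URL || return 1
    read -r INTERVAL || return 2
    read -r STRIP_REGEX || return 3
    read -r NEEDLE || return 4
    read -r REPLACEMENT || return 5
    # One summary line per complete record.
    printf '%s | %s | every %ss\n' "$NAME" "$URL" "$INTERVAL"
  done < "$1"
}
```

Running `list_records /etc/url-monitor.conf` prints a summary for every complete record; a non-zero exit status indicates which field is missing from the last, incomplete record in the configuration file.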

Each line has the following meaning:<\/p>\n\n\n\n