What it costs: WebFetcher is Shareware. THIS IS NOT PUBLIC DOMAIN SOFTWARE. After 30 days, educational and nonprofit institutions must send a postcard; everyone else must pay to continue using WebFetcher. Checks should be in US funds drawn on a US bank. To license one copy, remit $35 to:
OnTV, L.P.
4616 Henry Street
Pittsburgh, PA 15213
What it does: Downloads World Wide Web pages to your local hard disk for offline viewing. Pages are then updated automatically on a schedule you set.
How it works: You supply a list of http URLs (in a file called schedule.txt) and desired download times. At those times, WebFetcher downloads the associated documents. Embedded images and hyperlinked pages (down to a certain depth) can be downloaded as well. You view these pages offline using your favorite Web browser. WebFetcher periodically checks the original site and automatically downloads any new or updated pages.
[ Requirements | Installation | Quick Start | Suggestions | User Reference | Limitations | Feedback ]
Requirements
- A live Internet connection (direct connection, SLIP, or PPP).
- Macintosh (fat binary), System 7.0.1 or later.
- UNIX versions: NeXT, SunOS, Solaris 2.3, OSF/1 (for DEC Alpha), DEC Ultrix 4.3.
- Windows 95 or NT.
Installation
Macintosh
Unhqx and expand the file WebFetcher.sit.hqx. WebFetcher expands into its own folder, containing the application and other files.
UNIX
Gunzip and tar -xvf the file Web<Your_OS>.tar.gz. A directory called WebFetcher is created, containing the WebFetcher application and other files.
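For example, assuming you downloaded the SunOS archive and it is named WebSunOS.tar.gz (the exact name depends on your platform), the steps look roughly like this:
$ gunzip WebSunOS.tar.gz
$ tar -xvf WebSunOS.tar
$ cd WebFetcher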
Windows
By default WebFetcher installs here:
C:\Program Files\Bright Plaza\WebFetcher\
The executable program is in the Programs subdirectory. All other files, including the sample schedule file (schedule.txt) and the WebFetcher Master Index (index.html), are in the Data subdirectory.
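Put together, the default layout looks something like this (a sketch; the exact file list on your machine may differ):
C:\Program Files\Bright Plaza\WebFetcher\
    Programs\WebFetcher.exe
    Data\schedule.txt
    Data\index.html
    Data\(other files)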
Quick Start
In General
Here's a quick overview. Detailed instructions appear below under User Reference.
Lines in a schedule file fit this pattern:
1 3/20/96 7:30 am http://www.ontv.com/ 12 h 2 1
The encoding is what you'd expect: starting date and time, URL, and some "detail codes". The "detail codes" are repeat interval, fetch depth, and graphics flag (1=yes, 0=no). (The '1' at the start of the line is a format code. It needs to be there...)
The above line is interpreted as: "Fetch the page http://www.ontv.com/ every 12 hours, starting at 7:30 am on March 20, 1996. Go two levels deep (beyond the first page), and include graphics."
For a one-time-only fetch, use a repeat interval of zero, e.g. 0 h, like this:
1 3/20/96 7:30 am http://www.ontv.com/ 0 h 2 1
For Macintosh
Connect to the Internet, then launch the WebFetcher application. You will be prompted for a schedule file to load. Try the default schedule, schedule.txt. WebFetcher stores fetched files in the application directory.
(Now read the Windows section below regarding the Master Index and creating your own schedule...)
For UNIX
WebFetcher is best run as a background process. Recommended usage:
$ WebFetcher [-s schedule_file] [-d data_directory] [&]
Schedule_file points to your WebFetcher schedule. The
default is schedule.txt
in the current directory. Data_directory points to the
directory where
you want WebFetcher to store your fetched files. The default is the
current directory.
All messages are written to the file log.txt in the data directory. To see what WebFetcher is doing "right now", cd to that directory and say this:
$ tail log.txt
Under UNIX, there's nothing to prevent you from running multiple copies of WebFetcher simultaneously. Also note that once in the background WebFetcher will not terminate on its own: you'll have to kill it yourself from the shell. (Do the kill -9 thing...)
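As a concrete, purely illustrative example, with a schedule kept in ~/fetch/schedule.txt and fetched files stored in ~/fetch/data, a session might look like this:
$ WebFetcher -s ~/fetch/schedule.txt -d ~/fetch/data &
$ tail ~/fetch/data/log.txt      # see what it's doing right now
$ kill -9 <WebFetcher-pid>       # stop it when you're done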
(Now read the Windows section below regarding the Master Index and
creating your own schedule...)
For Windows
After connecting to the Internet, just launch WebFetcher. It loads the
sample schedule.txt file and
fetches the files listed there. The main window displays a log of
WebFetcher's
activities.
After a minute or two (give WebFetcher a chance
to fetch the files!), launch your favorite web browser and open the
WebFetcher Master
Index file index.html in the WebFetcher folder.
Follow the links: the pages you see have been fetched to your local hard
disk.
Now exit WebFetcher and edit the file schedule.txt. Build your own schedule -- see the notes below under Schedule File for help. List the URLs you'd like to fetch and set their download times. Delete the lines you don't want.
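For instance, a small personal schedule might hold just a couple of lines like the examples used elsewhere in this help file (the URLs and start dates here are only illustrations):
1 3/20/96 7:30 am http://www.ontv.com/ 12 h 2 1
1 1/20/96 7:00 am http://www.yahoo.com/headlines/summary.html 3 h 0 1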
Relaunch WebFetcher and you're on your way!
Suggestions
Configure WebFetcher to fetch your favorite news, sports, and weather pages every few hours. Have it check your favorite sites weekly for updates. Mirror important documentation to your own hard disk. Check monthly for updates. Have WebFetcher make daily checks for important press releases. Keep an eye on your competition...
User Reference
Below are the items found in the WebFetcher subdirectory. Some items are present after installation, others only after WebFetcher is first run.
- WebFetcher Program: The executable, WebFetcher or
WebFetcher.exe.
- Master Index : index.htm
An HTML page with hyperlinks to your successfully-fetched pages. As
WebFetcher
runs, it appends new links to this page that point to new downloads.
This is a "top
level" index: it lists only the pages you've explicitly asked for and
only pages
that have actually been fetched. You're welcome to edit this
file with your favorite text editor if you wish to reorder the listing
to suit
your preferences.
- Daily Index : di<datecode>.htm
Daily HTML pages generated by WebFetcher with hyperlinks to ALL the new
pages
downloaded that day. (These indices may be deleted if they take up too
much disk
space. They're basically here to provide a "What's New" function.)
- WebFetcher Update Page : update.htm
A page reachable from the Master Index and fetched fresh from our site
every time you
run WebFetcher. It contains information on WebFetcher updates and other
short
news items. Later, if demand warrants, the Update Page will announce
services such as
"fetch profiling" for congestion-avoidance.
- Help File : help.htm
This file, also available in WebFetcher's Help menu.
- Log File : log.txt
A disk-based copy of the information written to WebFetcher's main
window. (When
the log reaches 256K in length, it truncates itself to 32K. The log
file
will not fill up your hard disk...)
- Data Files : various subfolders
WebFetcher creates a subfolder for each host visited. Pages for a
particular host are stored in that host's folder.
You may delete these folders whenever you wish, but deleting them
will obviously destroy the pages they contain. If you wish to view the
deleted pages again, you'll have to refetch them.
- Schedule File: schedule.txt
The default schedule WebFetcher automatically loads at startup.
You can modify this schedule or create new schedules using your favorite
text editor.
You designate which schedule to load at startup using the Load New
Schedule choice
in the File menu.
The schedule file is a simple text file that contains records, one per line, in exactly this form:
frmt date-time URL repeat-interval interval-type fetch-depth graphics-flag
as in:
1 11/20/97 12:00 pm http://www.ontv.com/ 2 w 5 1
This translates into English as "Fetch the page at
http://www.ontv.com/ on November
20, 1997 at 12:00 pm. Fetch all the attached pages up to 5 links away,
and fetch
all embedded graphics. Every two weeks, check to see if anything has
changed, and
download only the pages and graphics that have changed."
The fields in each record are defined as follows (note that each
field is separated by a space, and the record ends with a carriage
return):
- frmt : A record format flag, for now always 1.
- date-time : The local date and time to fetch this URL, written in the form mm/dd/yy hh:mm followed by either am or pm. You must use am or pm; military time (24-hour clock) is not recognized.
- URL : An ordinary Uniform Resource Locator, as described in RFC 1738. This web page, along with all its embedded images and hyperlinked pages (if requested), will be fetched to your hard disk. (For now, only URLs of the scheme 'http' are allowed.)
- repeat-interval and interval-type : Together these two codes describe how often to check the original source URL for new updates. If repeat-interval is zero, interval-type is ignored and the page is fetched "one time only". Otherwise, a positive integer counts the number of intervals of interval-type before the next periodic fetch occurs. The codes are as follows:
m = minutes, h = hours, d = days, w =
weeks.
Examples clarify their use:
The code 90 m would mean check every 90 minutes.
The code 10 d would mean check every ten days.
The code 3 w would mean check every three weeks.
The code 0 w would mean check "one time only" (no repeat
interval).
It follows that the following codes are equivalent: 60 m = 1 h, 24 h = 1 d, 1 w = 7 d.
As another example:
1 1/20/96 7:00 am http://www.yahoo.com/headlines/summary.html 3 h 0 1
means "fetch Yahoo's News Summary every three hours, starting January 20, 1996 at 7 am."
- fetch-depth :
The maximum number of links to follow away from the page initially requested. The above http://www.ontv.com/ example will fetch any files within five links (jumps) of the requested page http://www.ontv.com/. Generally speaking, large sites like CERN, Microsoft, Netscape, Yahoo, etc., tend to "fan out" quickly, so start with very small numbers. Good choices are 0 (fetch only the requested page), 1 (fetch the requested page and its attached pages) and 2 (fetch the requested page and all the attached pages, and all their attached pages). Numbers bigger than 3 or 4 should be used with extreme care.
A good technique is to schedule
a fetch at depth 1 and examine the results. Pick out the interesting
branches and focus new, deeper fetches on those branches only.
- graphics-flag : 1 or 0.
1 means fetch any inline graphics (GIFs and JPEGs), 0 means
ignore (don't fetch) graphics.
(Note that this setting is different from hyperlink depth. With this
flag, you can order
your pages "with or without graphics", so to speak.)
Limitations
- WebFetcher isn't yet intended to be a general-purpose automatic downloader robot. Its purpose for now is to facilitate caching of WWW HTML pages for convenient (and fast) offline viewing. It will only download text and image data, not compressed files, PostScript files, executable binaries, etc.
- The schedule accepts http URLs only. Gopher, WAIS, news, ftp, etc. URLs are explicitly disallowed.
- WebFetcher will fetch data from sites other than the one indicated in the original request URL, but only one level deep. That is, if an embedded hyperlink jumps to a "non-local" site, WebFetcher will follow that hyperlink no deeper than its first page.
- If a page hasn't been fetched, a local hyperlink to that page will
not work. WebFetcher never leaves your hard disk. It will not go back
to the
net to find pages you haven't fetched.
- Server-side image maps won't work. They rely on software running on
the remote
server, so it's nearly impossible for WebFetcher to properly mimic image
map
behavior.
- Queries, like http://www.wallstreet.com/prices/stocks?company=APPL, usually won't work.
- WebFetcher will not fetch data from servers running HTTP/0.9.
The server must be running HTTP/1.0 or better.
- WebFetcher endeavors to be a good net citizen. The authors are sensitive to the havoc personal robots can wreak on the net. To make WebFetcher more server-friendly, we enforce a minimum 10-second wait between any two non-graphics fetches to the same site, and a minimum 30-minute refetch interval in user schedules.
- WebFetcher does not currently follow the Robot Exclusion Protocol,
but it will
as soon as we can implement it.
- WebFetcher comes as a software daemon and a text-based API. We realize that our scheduling method (editing a text file) is a bit arcane. A GUI will be forthcoming if demand warrants one.
Feedback
Send email to webfetch@ontv.com. We're especially interested in hearing about which of the above limitations you'd like to see removed. Further development on WebFetcher will be strictly feedback-driven.