What it costs: WebFetcher is Shareware. THIS IS NOT PUBLIC DOMAIN SOFTWARE. After 30 days, educational and nonprofit institutions must send a postcard; everyone else must pay to continue using WebFetcher. Checks should be in US funds drawn on a US bank. To license one copy, remit $35 to:
OnTV, L.P.
4616 Henry Street
Pittsburgh, PA 15213
What it does: Downloads World Wide Web pages to your local hard disk for offline viewing. Pages are then updated automatically on a schedule you set.
How it works: You supply a list of http URLs (in a file called schedule.txt) and desired download times. At those times, WebFetcher downloads the associated documents. Embedded images and hyperlinked pages (down to a certain depth) can be downloaded as well. You view these pages offline using your favorite Web browser. WebFetcher periodically checks the original site and automatically downloads any new or updated pages.
[ Requirements | Installation | Quick Start | Suggestions | User Reference | Limitations | Feedback ]
Requirements
- A live Internet connection (direct connection, SLIP, or PPP).
- Macintosh (fat binary), System 7.0.1 or later.
- UNIX versions: NeXT, SunOS, Solaris 2.3, OSF/1 (for DEC Alpha), DEC Ultrix 4.3.
- Windows 95 or NT.
Installation
Macintosh
Unhqx and expand the file WebFetcher.sit.hqx. WebFetcher expands into its own folder, containing the application and other files.
UNIX
Gunzip and tar -xvf the file Web<Your_OS>.tar.gz. A directory called WebFetcher is created, containing the WebFetcher application and other files.
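For example, assuming you downloaded the SunOS archive and it is named WebSunOS.tar.gz (the exact name depends on your platform), the steps look roughly like this:
$ gunzip WebSunOS.tar.gz
$ tar -xvf WebSunOS.tar
$ cd WebFetcher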
Windows
By default WebFetcher installs here:
C:\Program Files\Bright Plaza\WebFetcher\
The executable program is in the Programs subdirectory. All other files, including the sample schedule file (schedule.txt) and the WebFetcher Master Index (index.html), are in the Data subdirectory.
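Put together, the default layout looks something like this (a sketch; the exact file list on your machine may differ):
C:\Program Files\Bright Plaza\WebFetcher\
    Programs\WebFetcher.exe
    Data\schedule.txt
    Data\index.html
    Data\(other files)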
Quick Start
In General
Here's a quick overview. Detailed instructions appear below under User Reference.
Lines in a schedule file fit this pattern:
1 3/20/96 7:30 am http://www.ontv.com/ 12 h 2 1
The encoding is what you'd expect: starting date and time, URL, and some "detail codes". The "detail codes" are repeat interval, fetch depth, and graphics flag (1=yes, 0=no). (The '1' at the start of the line is a format code. It needs to be there...)
The above line is interpreted as: "Fetch the page http://www.ontv.com/ every 12 hours, starting at 7:30 am on March 20, 1996. Go two levels deep (beyond the first page), and include graphics."
For a one-time-only fetch, use a repeat interval of zero, e.g. 0 h, like this:
1 3/20/96 7:30 am http://www.ontv.com/ 0 h 2 1
For Macintosh
Connect to the Internet, then launch the WebFetcher application. You will be prompted for a schedule file to load. Try the default schedule, schedule.txt. WebFetcher stores fetched files in the application directory.
(Now read the Windows section below regarding the Master Index and creating your own schedule...)
For UNIX
WebFetcher is best run as a background process. Recommended usage:
$ WebFetcher [-s schedule_file] [-d data_directory] [&]
Schedule_file points to your WebFetcher schedule. The
default is schedule.txt
in the current directory. Data_directory points to the
directory where
you want WebFetcher to store your fetched files. The default is the
current directory.
All messages are written to the file log.txt in the data directory. To see what WebFetcher is doing "right now", cd to that directory and say this:
$ tail log.txt
Under UNIX, there's nothing to prevent you from running multiple copies of WebFetcher simultaneously. Also note that once in the background WebFetcher will not terminate on its own: you'll have to kill it yourself from the shell. (Do the kill -9 thing...)
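As a concrete, purely illustrative example, with a schedule kept in ~/fetch/schedule.txt and fetched files stored in ~/fetch/data, a session might look like this:
$ WebFetcher -s ~/fetch/schedule.txt -d ~/fetch/data &
$ tail ~/fetch/data/log.txt      # see what it's doing right now
$ kill -9 <WebFetcher-pid>       # stop it when you're done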
(Now read the Windows section below regarding the Master Index and
creating your own schedule...)
For Windows
After connecting to the Internet, just launch WebFetcher. It loads the
sample schedule.txt file and
fetches the files listed there. The main window displays a log of
WebFetcher's
activities.
After a minute or two (give WebFetcher a chance
to fetch the files!), launch your favorite web browser and open the
WebFetcher Master
Index file index.html in the WebFetcher folder.
Follow the links: the pages you see have been fetched to your local hard
disk.
Now exit WebFetcher and edit the file schedule.txt. Build your own schedule -- see the notes below under Schedule File for help. List the URLs you'd like to fetch and set their download times. Delete the lines you don't want.
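For instance, a small personal schedule might hold just a couple of lines like the examples used elsewhere in this help file (the URLs and start dates here are only illustrations):
1 3/20/96 7:30 am http://www.ontv.com/ 12 h 2 1
1 1/20/96 7:00 am http://www.yahoo.com/headlines/summary.html 3 h 0 1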
Relaunch WebFetcher and you're on your way!
Suggestions
Configure WebFetcher to fetch your favorite news, sports, and weather pages every few hours. Have it check your favorite sites weekly for updates. Mirror important documentation to your own hard disk. Check monthly for updates. Have WebFetcher make daily checks for important press releases. Keep an eye on your competition...
User Reference
Below are the items found in the WebFetcher subdirectory. Some items are present after installation, others only after WebFetcher is first run.
- WebFetcher Program: The executable, WebFetcher or
WebFetcher.exe.
- Master Index : index.htm
An HTML page with hyperlinks to your successfully-fetched pages. As
WebFetcher
runs, it appends new links to this page that point to new downloads.
This is a "top
level" index: it lists only the pages you've explicitly asked for and
only pages
that have actually been fetched. You're welcome to edit this
file with your favorite text editor if you wish to reorder the listing
to suit
your preferences.
- Daily Index : di<datecode>.htm
Daily HTML pages generated by WebFetcher with hyperlinks to ALL the new
pages
downloaded that day. (These indices may be deleted if they take up too
much disk
space. They're basically here to provide a "What's New" function.)
- WebFetcher Update Page : update.htm
A page reachable from the Master Index and fetched fresh from our site
every time you
run WebFetcher. It contains information on WebFetcher updates and other
short
news items. Later, if demand warrants, the Update Page will announce
services such as
"fetch profiling" for congestion-avoidance.
- Help File : help.htm
This file, also available in WebFetcher's Help menu.
- Log File : log.txt
A disk-based copy of the information written to WebFetcher's main
window. (When
the log reaches 256K in length, it truncates itself to 32K. The log
file
will not fill up your hard disk...)
- Data Files : various subfolders
WebFetcher creates a subfolder for each host visited. Pages for a
particular host are stored in that host's folder.
You may delete these folders whenever you wish, but deleting them
will obviously destroy the pages they contain. If you wish to view the
deleted pages again, you'll have to refetch them.
- Schedule File: schedule.txt
The default schedule WebFetcher automatically loads at startup.
You can modify this schedule or create new schedules using your favorite
text editor.
You designate which schedule to load at startup using the Load New
Schedule choice
in the File menu.
The schedule file is a simple text file that contains records, one per line, in exactly this form:
frmt date-time URL repeat-interval interval-type fetch-depth graphics-flag
as in:
1 11/20/97 12:00 pm http://www.ontv.com/ 2 w 5 1
This translates into English as "Fetch the page at
http://www.ontv.com/ on November
20, 1997 at 12:00 pm. Fetch all the attached pages up to 5 links away,
and fetch
all embedded graphics. Every two weeks, check to see if anything has
changed, and
download only the pages and graphics that have changed."
The fields in each record are defined as follows (note that each
field is separated by a space, and the record ends with a carriage
return):
- frmt : A record format flag, for now always 1.
- date-time : The local date and time to fetch this URL, written in the form mm/dd/yy hh:mm followed by either am or pm. You must use am or pm; military time (24-hour clock) is not recognized.
- URL : An ordinary Uniform Resource Locator, as described in RFC 1738. This web page, along with all its embedded images and hyperlinked pages (if requested), will be fetched to your hard disk. (For now, only URLs of the scheme 'http' are allowed.)
- repeat-interval and interval-type : Together these two codes describe how often to check the original source URL for new updates. If repeat-interval is zero, interval-type is ignored and the page is fetched "one time only". Otherwise, a positive integer counts the number of intervals of interval-type before the next periodic fetch occurs. The codes are as follows:
m = minutes, h = hours, d = days, w =
weeks.
Examples clarify their use:
The code 90 m would mean check every 90 minutes.
The code 10 d would mean check every ten days.
The code 3 w would mean check every three weeks.
The code 0 w would mean check "one time only" (no repeat
interval).
It follows that the following codes are equivalent: 60 m = 1 h, 24 h = 1 d, 1 w = 7 d.
As another example:
1 1/20/96 7:00 am http://www.yahoo.com/headlines/summary.html 3 h 0 1
means "fetch Yahoo's News Summary every three hours, starting January 20, 1996 at 7 am."
- fetch-depth :
The maximum number of links to follow away from the page initially requested. The above http://www.ontv.com/ example will fetch any files within five links (jumps) of the requested page http://www.ontv.com/. Generally speaking, large sites like CERN, Microsoft, Netscape, Yahoo, etc., tend to "fan out" quickly, so start with very small numbers. Good choices are 0 (fetch only the requested page), 1 (fetch the requested page and its attached pages) and 2 (fetch the requested page and all the attached pages, and all their attached pages). Numbers bigger than 3 or 4 should be used with extreme care.
A good technique is to schedule
a fetch at depth 1 and examine the results. Pick out the interesting
branches and focus new, deeper fetches on those branches only.
- graphics-flag : 1 or 0.
1 means fetch any inline graphics (GIFs and JPEGs), 0 means
ignore (don't fetch) graphics.
(Note that this setting is different from hyperlink depth. With this
flag, you can order
your pages "with or without graphics", so to speak.)
Limitations
- WebFetcher isn't yet intended to be a general-purpose automatic downloader robot. Its purpose for now is to facilitate caching of WWW HTML pages for convenient (and fast) offline viewing. It will only download text and image data, not compressed files, PostScript files, executable binaries, etc.
- The schedule accepts http URLs only. Gopher, WAIS, news, ftp, etc. URLs are explicitly disallowed.
- WebFetcher will fetch data from sites other than the one indicated in the original request URL, but only one level deep. That is, if an embedded hyperlink jumps to a "non-local" site, WebFetcher will follow that hyperlink no deeper than its first page.
- If a page hasn't been fetched, a local hyperlink to that page will
not work. WebFetcher never leaves your hard disk. It will not go back
to the
net to find pages you haven't fetched.
- Server-side image maps won't work. They rely on software running on
the remote
server, so it's nearly impossible for WebFetcher to properly mimic image
map
behavior.
- Queries, like http://www.wallstreet.com/prices/stocks?company=APPL, usually won't work.
- WebFetcher will not fetch data from servers running HTTP/0.9.
The server must be running HTTP/1.0 or better.
- WebFetcher endeavors to be a good net citizen. The authors are sensitive to the havoc personal robots can wreak on the net. To make WebFetcher more server-friendly, we enforce a minimum 10-second wait between any two non-graphics fetches to the same site, and a minimum 30-minute refetch interval in user schedules.
- WebFetcher does not currently follow the Robot Exclusion Protocol,
but it will
as soon as we can implement it.
- WebFetcher comes as a software daemon and a text-based API. We realize that our scheduling method (editing a text file) is a bit arcane. A GUI will be forthcoming if demand warrants one.
Feedback
Send email to webfetch@ontv.com. We're especially interested in hearing about which of the above limitations you'd like to see removed. Further development on WebFetcher will be strictly feedback-driven.