###
### Pagecast 2.0.0 README
###
=====================================================================

(C) Copyright 2000 Preston Landers <pbl@pbl.cx> 
All Rights Reserved.

Distributed under the GNU General Public License, see the file COPYING 
for details.  NO WARRANTY WRITTEN OR IMPLIED.

See the file NEWS for release notes and version changes.

The Pagecast project home page is at:

http://pagecast.sourceforge.net

There you will find up-to-date project information, anonymous CVS
checkouts, new releases, and information about the Pagecast mailing
list.  You will also find a current list of which search engines are
supported.

0.  TABLE OF CONTENTS
1.  WHAT IS PAGECAST?
2.  SYSTEM REQUIREMENTS
3.  INSTALLATION AND USE
4.  CONFIGURATION
5.  INTERNATIONALIZATION
6.  SEARCH ENGINES
7.  META TAGS
8.  KNOWN BUGS
9.  LINKS OF INTEREST

1.  WHAT IS PAGECAST?

Pagecast is a program for automatically submitting lists of URLs
(Uniform Resource Locators, i.e., web addresses) to Internet
search engines such as AltaVista, Hotbot, Lycos, Excite, etc.  Each
search engine adds the URLs to its queue for "spidering"; it will in
turn look up each URL (usually at a later time) and add the page's
contents to its database.

Specifically, it is a Python language program that takes a
line-separated list of URLs from a mail message (or optionally,
standard input or a file) and submits them, one at a time, to
different web search engines in a multithreaded fashion, and logs the
results (either to a file or standard output, or both.)
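
The flow described above can be pictured with a short sketch.  This is
written in modern Python for illustration only and is not Pagecast's
actual code; the names submit_all and fake_submit and the engine names
are made up, and fake_submit stands in for a real network submission.

```python
import threading


def fake_submit(engine, url):
    # Hypothetical stand-in for a real submission; always "succeeds".
    return "%s ; submitted %s" % (engine, url)


def submit_all(engines, urls, submit=fake_submit):
    """Run one worker thread per engine; each submits the URLs one at a time."""
    log = []
    lock = threading.Lock()

    def worker(engine):
        for url in urls:
            line = submit(engine, url)
            with lock:  # keep log entries from different threads intact
                log.append(line)

    threads = [threading.Thread(target=worker, args=(e,)) for e in engines]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for line in submit_all(["EngineA", "EngineB"], ["http://www.example.com/"]):
        print(line)
```

The order of the log lines depends on thread scheduling, so lines from
different engines may be interleaved.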

The only major search engine that Pagecast will not (yet) work
with is Yahoo!  This is due to the nature of Yahoo, which is not a
traditional spider-style search engine but a directory maintained by
humans, and submitting URLs is a (relatively) complex operation.
Future versions of this program may support Yahoo's "suggest a site"
page.  

Nevertheless, Yahoo does use AltaVista for traditional style searches
and this program does handle submission to AltaVista and other fine
search engines such as Google.com.

For a current list of supported search engines, please see the project
home page mentioned above.

2.  SYSTEM REQUIREMENTS

Pagecast requires a Python interpreter version 1.5.1 or newer with
threading support.  It has been tested mostly on Linux and OpenBSD but
the author has received reports of it running on a variety of systems
including MS Windows. It may take a small amount of tweaking to get it
to run on non-Unix-like systems.  The Python "threading" module is
required, but beware as it is not enabled by default on some systems.

If you get an error similar to: "No module named thread" then your
Python doesn't currently have thread support.  Please see the FAQ
document.  Some users may have problems with threads if a recent
Python was compiled with GCC 2.7.2.3; try compiling Python with EGCS
if you're having problems like kernel oopses.  Again, see the FAQ.

At least one search engine (InfoSeek) takes URL submissions via email.
Pagecast also supports emailing the results logs to anyone you choose.
It uses SMTP-style mail, so you must put your SMTP server (often
something like mail.your-isp.com) into the Pagecast main configuration
file (usually etc/pagecast.conf) on the appropriate line (see the
examples in the file.)

Of course, some kind of connection to the Internet is required.  If
you are behind a firewall, you will need to be able to access port 80,
the HTTP (web) port; this shouldn't be a problem for most people.
Pagecast also supports HTTP proxies.  If you need to turn on proxy
support (most people don't), examine the etc/pagecast.conf file
and/or read the FAQ.

If you want to use the mailto: interface of Pagecast (mailing URLs to
a specific account) you will need to create a special user on the
system that can send and receive mail.  Procmail should also be
available.  Pagecast works fine from the command line, however, so
having a separate pagecast user is not necessary.  See Configuration
below.

3.  INSTALLATION AND USE

Please read the INSTALL document in the pagecast-2.0.0 directory for
installation instructions.

People using the CVS checkout should read INSTALL.CVS.

Pagecast can be invoked with a -v option which will enable extra
status messages that might be helpful in debugging a problem.  You can
use -v additional times to increase verbosity.

Invoking pagecast with a -h option will print a simple help screen
that describes the command line usage.

Pagecast can be configured to run as a mail-robot.  This is entirely
optional and some users prefer to run directly from the command line.
To make Pagecast a mail-triggered 'daemon' (though not a true
daemon in the classic sense), follow these steps:

1/ Create an account for pagecast.  These instructions assume you name 
this account 'pagecast' but you can choose anything you like.

2/ Log into that account.  

3/ Copy the pagecast-2.0.0.tar.gz file to the pagecast home directory
(usually /home/pagecast).

4/ tar xzvf pagecast-2.0.0.tar.gz

5/ either:
   
     ln -s pagecast-2.0.0 pagecast

   or:

     mv pagecast-2.0.0 pagecast

6/ cd pagecast

7/ cp etc/forward ../.forward ; cp etc/procmailrc ../.procmailrc

8/ cp -r etc/mail ..

9/ edit etc/pagecast.conf in your favorite text editor.
   
   Most of the settings you will not need to touch; the defaults
should be fine for most people.  However, you will definitely want to
take a look at the things in the [Mail] section.

10/ If you want to do meta-tag magic (see below), edit tags.conf   
  
   If you don't know or don't care about meta tags, you can safely
ignore this step.

11/ Modify ~/.procmailrc (if necessary.)

   The default procmail recipe that comes with Pagecast is set to
ignore all mail that does not have the word "Pagecast" as a subject
line.  This can function as a password to prevent people from abusing
the program.  If you want a different password, you'll need to edit
.procmailrc.  All other mail will simply be forwarded to the default
mail queue for that user.

    Optionally, you can tell procmail to trash all other mail so as
not to clutter up your mail spool.  If you want this, uncomment the
last two lines of your .procmailrc.  Also, by default, temporary
backups of all mail will be saved in the mail folder.

    You can also enable Pagecast's 'debugging' (verbose) mode by
editing the pagecast line in the ~/.procmailrc file.  Simply add -v
after the -m and Pagecast will print additional status information
that might be useful in debugging a problem.

12/ logout

Pagecast should now be installed and ready to run.  

Simply email pagecast@yourmachine with a list of URLs to submit in
the body of the message.  You will need the word "Pagecast" (or
whatever 'password' you put in ~/.procmailrc) as the subject line of
each message.
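
If you would rather script the submission mail than type it, a sketch
along these lines may help.  It uses the email package from a modern
Python standard library; build_submission and the sample addresses are
made up for this example, and the commented-out sending step assumes
an SMTP server listening on localhost.

```python
from email.message import EmailMessage


def build_submission(urls, to_addr, from_addr, subject="Pagecast"):
    """Build a plain-text message with one URL per line in the body."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = from_addr
    msg["Subject"] = subject  # must match the 'password' in ~/.procmailrc
    msg.set_content("\n".join(urls) + "\n")
    return msg


if __name__ == "__main__":
    message = build_submission(
        ["http://www.example.com/", "http://www.example.com/about.html"],
        to_addr="pagecast@yourmachine",
        from_addr="you@example.com",
    )
    print(message)
    # To actually send it (assumes an SMTP server on localhost):
    #   import smtplib
    #   with smtplib.SMTP("localhost") as smtp:
    #       smtp.send_message(message)
```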

4.  CONFIGURATION

A note about configuration files: if Pagecast came with your operating
system, the config files might not be in the usual place.  (The usual
place is simply in the etc directory of the untar'ed
pagecast-2.0.0.tar.gz.)  In some configurations, Pagecast has both
system-wide configuration and local user config files.  The system
files (often in /etc/pagecast/ or /usr/local/etc/pagecast) are read
first, and then the user files (stored in ~/.pagecast/) are read,
overriding anything in the system files.
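
The "system first, then user" ordering can be illustrated with the
configparser module from a modern Python standard library.  This is
only a sketch of the layering idea, not the code Pagecast uses; the
helper name load_layered and the sample values are assumptions, while
the section and item names are the ones described in this document.

```python
import configparser

# Stand-in for a system-wide file such as /etc/pagecast/pagecast.conf
SYSTEM_CONF = """
[General]
fixlinks = on
strict_links = off

[Log]
log = on
logfile = /var/log/pagecast.log
"""

# Stand-in for a per-user file; it overrides a single item
USER_CONF = """
[General]
strict_links = on
"""


def load_layered(*sources):
    """Read config texts in order; later sources override earlier ones."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


if __name__ == "__main__":
    conf = load_layered(SYSTEM_CONF, USER_CONF)
    print("fixlinks:", conf.getboolean("General", "fixlinks"))
    print("strict_links:", conf.getboolean("General", "strict_links"))
    print("logfile:", conf["Log"]["logfile"])
```

With real files, passing a list of paths to parser.read() applies the
same ordering and silently skips any file that does not exist.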

You will probably want to edit (minimally) the etc/pagecast.conf
configuration file to suit your needs.  (Use a plain text editor.)

At minimum you should put your own email address and SMTP mail server
there in place of the fake defaults.  Not all search engines actually use
this email address but some do require an email address when
submitting links.  I can't guarantee you won't get spam from using
this program, so feel free to change the email address to whatever you
want.  This email address also becomes the "From:" address if you have
Pagecast send email.

You may also want to have carbon copies of Pagecast's replies sent to
an email address (or blind carbon copies, where neither recipient
knows the other received a message.)  In this case, uncomment the cc=
and/or the bcc= lines in the [Mail] section of pagecast.conf.  (A
comment is any line that begins with a # character.  Remove the
leading # character to 'uncomment' a line.)

For instance, if you want blind carbon copies of each of Pagecast's
responses sent to joe@somewhere.com, add this line to pagecast.conf in
the [Mail] section:

bcc=joe@somewhere.com

The logfile is on by default, and Pagecast appends its output to this
file in the directory where pagecast resides:

./pagecast.log

In the [Log] section of pagecast.conf you can turn on and off
logging with the

log=on         [or off]

line.  You can change the filename of the logfile (including full or
relative path to it) by changing this line:

logfile=/var/log/pagecast.log    [for example]

Of course, you must have write permission where you want to write the
log file.

By default, Pagecast 1.1 (or greater) will try to 'fix' any URLs that
are malformed.  This can be turned on or off by changing the
fixlinks= item in the [General] section of pagecast.conf.  There are
two modes for link fixing, strict and lenient.  By default, fixing is 
on and set to lenient.  

Lenient mode will try to make a URL out of just about anything, while
strict mode will reject links that don't 'look' like they can be
turned into valid URLs.  Rejected links will not be submitted to
search engines, and a warning will be printed to the log about them.
Lenient mode will not reject any links.

To turn on or off strict link checking, change the strict_links= item
in the [General] section of pagecast.conf.  Here are a few examples of
the kinds of link 'fixing' Pagecast can do:

   www.some-site.com  -->   http://www.some-site.com
   ftp://site.com     -->   ftp://site.com  (same)
   not_a_valid_link   -->   rejected if strict mode, 
                            otherwise http://not_a_valid_link
   fake://site.com    -->   rejected if strict mode, 
                            otherwise http://site.com
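
The table above can be reproduced with a small function.  This is a
sketch that matches the four examples, not Pagecast's real fixing
code: the name fix_link, the list of schemes that pass through
unchanged, and the "contains a dot" test used by strict mode are all
assumptions made for illustration.

```python
import re

# Assumption: these schemes are accepted as-is.
KNOWN_SCHEMES = ("http", "ftp")


def fix_link(link, strict=False):
    """Return a repaired URL, or None if strict mode rejects the link."""
    link = link.strip()
    match = re.match(r"^([A-Za-z][A-Za-z0-9+.-]*)://(.*)$", link)
    if match:
        scheme, rest = match.group(1).lower(), match.group(2)
        if scheme in KNOWN_SCHEMES:
            return link
        # Unknown scheme: reject when strict, otherwise fall back to http.
        return None if strict else "http://" + rest
    # No scheme at all: strict mode wants something that looks like a host.
    if strict and "." not in link.split("/")[0]:
        return None
    return "http://" + link


if __name__ == "__main__":
    for sample in ("www.some-site.com", "ftp://site.com",
                   "not_a_valid_link", "fake://site.com"):
        print(sample, "lenient:", fix_link(sample),
              "strict:", fix_link(sample, strict=True))
```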

Other variables of interest:

retries= is a number indicating how many times each URL submission
should be "redialed" if it didn't go through for whatever reason, most
especially a temporary domain name lookup or other network problem.
After this retry count is exceeded the server will record an error and
move on to the next URL in the list.  Note that if the URL is
explicitly rejected by the search engine as invalid, it will not be
redialed, but logged as an error.

wait= is a *floating point* number (this means use 1.0 instead of 1,
for example) that indicates how long in seconds to pause in between
submitting each URL.  This can be used as a "brake" on the script so
it doesn't overload your network and/or the search engine.
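
How retries= and wait= interact can be sketched as follows.  This is
an illustration under stated assumptions, not Pagecast's own loop: the
function names are made up, and the submit callable is assumed to
return True when the engine accepts the URL, False when the engine
explicitly rejects it, and to raise OSError on a temporary network
problem.

```python
import time


def submit_with_retries(submit, url, retries):
    """Call submit(url), redialing up to `retries` times on network errors."""
    for _attempt in range(retries + 1):
        try:
            # An explicit rejection is final and is never redialed.
            return "accepted" if submit(url) else "rejected"
        except OSError:
            continue  # transient problem: redial
    return "failed"  # retry count exceeded; move on to the next URL


def submit_list(submit, urls, retries=3, wait=1.0, sleep=time.sleep):
    """Submit each URL in turn, pausing `wait` seconds between URLs."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(wait)  # the "brake" between consecutive URLs
        results[url] = submit_with_retries(submit, url, retries)
    return results


if __name__ == "__main__":
    print(submit_list(lambda url: True, ["http://www.example.com/"], wait=0.0))
```

Passing the sleep function as a parameter is only a convenience so the
pause can be replaced when trying the sketch out.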

output= can be either on or off and it indicates whether or not to
print log information to standard output (i.e., the screen / console.)
This feature will work even if log=off.

You indicate which search engines to actually use in the section
[SearchEngines]. All of the ones listed in pagecast.conf as shipped 
can be activated with an on and disabled with an off setting.  Each of 
these entries has a corresponding section in the servers.conf file that 
defines the behavior of each search engine.

5.  INTERNATIONALIZATION

The output of Pagecast -- the log messages shown to users -- can be
customized relatively easily.  The main use for this is to support
languages other than the author's (English) but it could also be used,
for instance, to customize the output at your site.

To customize the output of Pagecast, change into the languages
directory.  Then copy the us-english.lang file to yourlanguage.lang
and edit that file.

The format of the file is one big Python dictionary.  A dictionary is
a way to associate a "key" with a "definition" or value.  In this
case, the key is a special code used in the source code to identify a
particular output message.  The key is the left side of the : on any
particular line. The value, or definition, is the thing on the right
side of the : (colon) and it is the string that is actually displayed
to the user.

You want to change the values and not the keys.

The values use Python variable substitution syntax to allow numbers
and other bits of data to be put in the appropriate place in the value
string.  At run time, each %s is replaced with a meaningful value.
The comment above each language entry tells you what order the values
will be printed in, and what they mean.  Unfortunately, a current
limitation of Pagecast is that you cannot change the order in which
these values appear (to make it more correct for native syntax).
This may change in the future, but in the meantime, be sure not to
add or remove any %s signs from your translated strings.

Here is an example from us-english.lang:

Lookups = {
    # . . . (stuff cut)

    # date, version
    "begin_run_msg" : "[%s] === %s === Beginning Run",

    # servername, URL, failure reason
    "submit_fail" : "%s ; @ <Error> %s : %s",

    # . . . (stuff cut)

    }

The entry with the key "begin_run_msg" defines what Pagecast prints
when it begins contacting search engines.  As the comment above it
indicates, the two %s signs in the definition will be translated into
the current date (and time) and Pagecast's version.

The next example "submit_fail" is what is printed when a search engine
does not accept a URL.  The %s strings provided are the name of the
search engine (server), the URL that the user submitted, and whatever
error message can be extracted from the server.
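
To see how the %s placeholders are filled in, here is a small sketch
using the two entries shown above.  The render helper and the sample
values are made up for illustration; Pagecast's own lookup code may be
organized differently.

```python
Lookups = {
    # date, version
    "begin_run_msg": "[%s] === %s === Beginning Run",

    # servername, URL, failure reason
    "submit_fail": "%s ; @ <Error> %s : %s",
}


def render(key, *values):
    """Fill the %s placeholders of a message, in the documented order."""
    return Lookups[key] % values


if __name__ == "__main__":
    print(render("begin_run_msg", "2000-01-01 12:00", "Pagecast 2.0.0"))
    print(render("submit_fail", "FooBar.com",
                 "http://www.example.com/", "timed out"))
```

If a translated string has more or fewer %s signs than the values
supplied, the % operator raises a TypeError, which is why the number
of placeholders must stay the same.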

To actually *activate* a language file, read the FAQ question about
changing languages.

I would gladly accept any language translation files to include with
future releases of Pagecast.


6.  SEARCH ENGINES

Pagecast is highly flexible in how it handles interactions with
specific search engines.  Most search engines handle URL submissions
in a very simple way; a single-page form on a web page.  This can be
easily faked from a program; however, each search engine's page is
slightly different.

Pagecast sees search engines in a sort of abstract way, and drivers
can be written that handle alternate methods of submission (such as
email or subspace transmissions.)  Driver files for specific search
engines are written in the servers/ directory with .def extensions.
When Pagecast starts up, all of these files are scanned as possible
search engines.  However, each search engine is only used if it
belongs to a group marked as active.

This is controlled by the etc/groups.conf file, which is actually a
fragment of Python code that is executed by Pagecast.  It is in this
file that you assign search engines to groups, and mark groups as
active.  Since this is Python, you must assign group names to the
Active list according to Python syntax (which, fortunately, is pretty
straightforward.)  If you are not a programmer, it is best to follow
the examples present in the etc/groups.conf file.

Generally, for any specific group you want to mark active, this syntax
will work:

Active.append("My Group Name")

Each servers/searchengine.def file defines a Name that identifies that
server.  To make the search engine with the .name == "FooBar.com" a
member of one or more groups, use this syntax:

Assign["FooBar.com"] = ["group1", "group2", "group3"]

Most search engines use a web/CGI interface to add new URLs to their
database. If you want Pagecast to use a search engine like this that
is not already defined in the servers directory, you will need to
set up a .def file. Copy the template.def.example file.  For instance:

$ cp servers/template.def.example servers/foobar.com.def

Then edit your new file, servers/foobar.com.def, in a plain text
editor.  Usually you only have to fill in values on a few lines, then
save the file and exit.  (Be sure to assign it to an active group in
etc/groups.conf when you want to use it.)  

Everything you need to change is between the two ### START HERE and
### END HERE markers.  Don't change anything below or above those
markers unless you know what you're doing.  The comments in the
templates should help you understand what the different settings need.
To find out what you need to fill in for your desired search engine,
you are going to have to look at the raw HTML source of your desired
search engine's main "Add URL" page.  
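
Reading the raw HTML by eye works, but a short script can list the
form details for you.  The sketch below uses html.parser from a modern
Python standard library; the class name FormFieldLister and the sample
page are made up, and which of the listed values go on which line of
the .def file is still described only by the template's comments.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect each form's action/method and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # the parser lowercases tag and attribute names
        if tag == "form":
            self.forms.append({
                "action": attrs.get("action"),
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            })
        elif tag in ("input", "select", "textarea") and self.forms:
            if attrs.get("name"):
                self.forms[-1]["fields"].append(attrs["name"])


# Hypothetical "Add URL" page, for demonstration only.
SAMPLE_PAGE = """
<html><body>
<FORM ACTION="/cgi-bin/addurl" METHOD="POST">
  <INPUT TYPE="text" NAME="url">
  <INPUT TYPE="text" NAME="email">
  <INPUT TYPE="submit" VALUE="Add">
</FORM>
</body></html>
"""

if __name__ == "__main__":
    lister = FormFieldLister()
    lister.feed(SAMPLE_PAGE)
    for form in lister.forms:
        print(form["method"], form["action"], form["fields"])
```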

You can also append group names to the Prohibited list in the
groups.conf file. Any group names in Prohibited will not be selected
to run.
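
Putting Active, Assign and Prohibited together, the selection might
work roughly like the sketch below.  This is a guess at the logic for
illustration only: select_engines is a made-up name, the group and
server names other than "FooBar.com" are invented, and it assumes that
a prohibited group is simply removed from the set of active groups.

```python
# The three names a groups.conf fragment works with.
Active = []
Prohibited = []
Assign = {}

# Sample fragment in the style of etc/groups.conf (names are made up).
Active.append("major")
Active.append("debug")
Prohibited.append("debug")
Assign["FooBar.com"] = ["major", "english"]
Assign["Test.example"] = ["debug"]
Assign["Other.example"] = ["regional"]


def select_engines(assign, active, prohibited):
    """Return the sorted names of engines in at least one usable group."""
    usable = set(active) - set(prohibited)  # assumption: Prohibited wins
    return sorted(name for name, groups in assign.items()
                  if usable.intersection(groups))


if __name__ == "__main__":
    print(select_engines(Assign, Active, Prohibited))
```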

You can specify specific groups on the command line with the -g
option. For instance,

pagecast -g debug,broken -u http://pagecast.sourceforge.net

will submit the URL http://pagecast.sourceforge.net only to the
servers that belong to the "debug" and/or "broken" groups.

If you need further help configuring a new server, check out the
Pagecast mailing lists.  [See the top of this document.]  Please
contribute any server definitions or extensions that you may come up
with.

Currently supported search engines:

Excite! (same database as Webcrawler) 
InfoSeek (using mail submissions, which are 24/7) 
AltaVista (with invalid URL detection and logging; Yahoo uses)
Lycos (ditto; Tripod uses)
Hotbot (the idiot savant of search engines)
Google (a great search engine, especially for technical topics)
Northern Lights (never use this one much myself)
Canada (obviously oriented to Canadians)
Anzwers (oriented towards Australia and New Zealand)


7.  META TAGS

Pagecast can (optionally) download the URLs you are submitting and
examine the keywords contained in the header meta tags.  Pagecast will
attempt to do this if tags=on in the [General] section of
pagecast.conf.

Pagecast will compare the keywords to the words in the <TITLE> of the
document, and produce a "confidence rating".  Many search engines use
a similar algorithm to determine how relevant the keywords are to the
actual content of the document, to avoid the spam syndrome.  

Pagecast will also warn you if no keywords are present, or if the
document is using the outdated HTTP-EQUIV style of keywords.
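
The exact formula behind the "confidence rating" is not documented
here, so the following is only one plausible way to compute such a
score: the fraction of keywords whose words all appear in the title.
The names KeywordScanner and confidence, and the sample page, are made
up for this sketch.

```python
import re
from html.parser import HTMLParser


class KeywordScanner(HTMLParser):
    """Pull the <TITLE> text and the keywords meta tag out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.uses_http_equiv = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            equiv = (attrs.get("http-equiv") or "").lower()
            if "keywords" in (name, equiv):
                if equiv == "keywords":
                    self.uses_http_equiv = True  # outdated style
                content = attrs.get("content") or ""
                self.keywords += [k.strip().lower()
                                  for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def confidence(title, keywords):
    """Fraction (0.0 to 1.0) of keywords whose words all occur in the title."""
    if not keywords:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    hits = sum(1 for k in keywords if set(k.split()) <= words)
    return hits / len(keywords)


if __name__ == "__main__":
    page = ('<html><head><title>Python URL Submission Tool</title>'
            '<meta name="keywords" content="python, url, cooking, tool">'
            '</head></html>')
    scanner = KeywordScanner()
    scanner.feed(page)
    if not scanner.keywords:
        print("warning: no keywords present")
    if scanner.uses_http_equiv:
        print("warning: outdated HTTP-EQUIV style keywords")
    print("confidence:", confidence(scanner.title, scanner.keywords))
```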

If the documents are accessible to the pagecast user on the
filesystem, Pagecast can actually modify the documents, updating the
date meta tag and fixing HTTP-EQUIV style keywords.  This can give a
search engine the impression that the document is very recent.

To activate this feature, set localdocuments=on in the [General] section
of pagecast.conf, and then edit the tags.conf file to tell Pagecast
how to find the documents on the filesystem.

8.  KNOWN BUGS

There are a few "issues" to be aware of in extreme cases:

I don't know what the hard limit on URL length is.  It probably
depends on each search engine.  Most web browsers limit URLs to around 
250 characters, but this program doesn't enforce that.

The program may be aborted with CTRL-C, but due to its multithreaded
nature, it may not completely stop right away.  All the threads (each
search engine is a thread) will exit after they complete their current 
interaction with the search engine, but with a slow server, this might 
take a while.
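
The delayed shutdown is typical of worker threads that only check for
a stop request between interactions.  The sketch below shows that
pattern with threading.Event in modern Python; it is an illustration
of the behavior described above, not Pagecast's code, and the names
worker and run are made up.

```python
import threading


def worker(engine, urls, stop, done):
    """Submit URLs until finished, or until `stop` is set between URLs."""
    for url in urls:
        if stop.is_set():
            break  # checked only between interactions, never mid-request
        # ... one (possibly slow) interaction with the search engine ...
        done.append((engine, url))


def run(engines, urls):
    """Start one thread per engine; on CTRL-C ask them all to stop."""
    stop = threading.Event()
    done = []
    threads = [threading.Thread(target=worker, args=(e, urls, stop, done))
               for e in engines]
    for t in threads:
        t.start()
    try:
        for t in threads:
            t.join()
    except KeyboardInterrupt:
        stop.set()  # each thread exits after its current interaction
        for t in threads:
            t.join()
    return done


if __name__ == "__main__":
    print(len(run(["EngineA", "EngineB"], ["http://www.example.com/"])))
```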

There have been problems reported with Python 1.5.2 and Linux 2.0.x.
Please see the FAQ file for solutions.

Thanks to Joshua Levitsky <jlevitsk@joshie.com> for pointing out a
brown-paper-bag bug in the 1.0 release.  


9.  LINKS OF INTEREST

-- About search engines, "robots" 
        http://info.webcrawler.com/mak/projects/robots/faq.html

=====================================================================