###
### Pagecast 2.0.0 README
###
=====================================================================

(C) Copyright 2000 Preston Landers
All Rights Reserved.

Distributed under the GNU General Public License; see the file COPYING
for details. NO WARRANTY WRITTEN OR IMPLIED.

See the file NEWS for release notes and version changes.

The Pagecast project home page is at:

    http://pagecast.sourceforge.net

There you will find up-to-date project information, anonymous CVS
checkouts, new releases, and information about the Pagecast mailing
list. You will also find a current list of which search engines are
supported.


0. TABLE OF CONTENTS

    1. WHAT IS PAGECAST?
    2. SYSTEM REQUIREMENTS
    3. INSTALLATION AND USE
    4. CONFIGURATION
    5. INTERNATIONALIZATION
    6. SEARCH ENGINES
    7. META TAGS
    8. KNOWN BUGS
    9. LINKS OF INTEREST


1. WHAT IS PAGECAST?

Pagecast is a program for automatically submitting lists of URLs
(Uniform Resource Locators, i.e., web addresses) to Internet search
engines such as AltaVista, Hotbot, Lycos, Excite, etc. Each search
engine adds the submitted URLs to its queue for "spidering": it will
in turn look up each URL (usually at a later time) and add the page's
contents to its database.

Specifically, it is a Python language program that takes a
line-separated list of URLs from a mail message (or, optionally,
standard input or a file), submits them one at a time to different web
search engines in a multithreaded fashion, and logs the results (to a
file, to standard output, or both).

The only major search engine that Pagecast will not (yet) work with is
Yahoo! This is due to the nature of Yahoo, which is not a traditional
spider-style search engine but a directory maintained by humans, so
submitting URLs is a (relatively) complex operation. Future versions
of this program may support Yahoo's "suggest a site" page.
Nevertheless, Yahoo does use AltaVista for traditional style searches,
and this program does handle submission to AltaVista and other fine
search engines such as Google.com.

For a current list of supported search engines, please see the project
home page mentioned above.


2. SYSTEM REQUIREMENTS

Pagecast requires a Python interpreter version 1.5.1 or newer with
threading support. It has been tested mostly on Linux and OpenBSD, but
the author has received reports of it running on a variety of systems
including MS Windows. It may take a small amount of tweaking to get it
to run on non-Unix-like systems.

The Python "threading" module is required, but beware: it is not
enabled by default on some systems. If you get an error similar to:

    "No module named thread"

then your Python doesn't currently have thread support. Please see the
FAQ document.

Some users may have problems with recent Pythons and threads if Python
was compiled with GCC 2.7.2.3; try compiling Python with EGCS if
you're having problems like kernel oopses. Again, see the FAQ.

At least one search engine (InfoSeek) takes URL submissions via email.
Pagecast also supports emailing the results logs to anyone you choose.
It uses SMTP style mail, and you must put your SMTP server (often
something like mail.your-isp.com) into the Pagecast main configuration
file (usually etc/pagecast.conf) on the appropriate line (see the
examples in the file).

Of course, some kind of connection to the Internet is required. If you
are behind a firewall, you will need to be able to access port 80,
which is the http / web port; this shouldn't be a problem for most
people. Pagecast also supports HTTP proxies. If you need to turn on
proxy support (most people don't), examine the etc/pagecast.conf file
and/or read the FAQ.

If you want to use the mailto: interface of Pagecast (mailing URLs to
a specific account) you will need to create a special user on the
system that can send and receive mail. Procmail should also be
available.
Pagecast works fine from the command line, however -- having a
separate pagecast user is not necessary. See Configuration below.


3. INSTALLATION AND USE

Please read the INSTALL document in the pagecast-2.0.0 directory for
installation instructions. People using the CVS checkout should read
INSTALL.CVS.

Pagecast can be invoked with a -v option, which enables extra status
messages that might be helpful in debugging a problem. You can use -v
additional times to increase verbosity.

Invoking pagecast with a -h option will print a simple help screen
that describes the command line usage.

Pagecast can be configured to run as a mail-robot. This is entirely
optional, and some users prefer to run directly from the command line.
To make Pagecast a mail-triggered 'daemon' (though not a true daemon
in the classic sense), follow these steps:

 1/ Create an account for pagecast. These instructions assume you
    name this account 'pagecast' but you can choose anything you like.

 2/ Log into that account.

 3/ Copy the pagecast-2.0.0.tar.gz file to the pagecast home directory
    (usually /home/pagecast).

 4/ tar xzvf pagecast-2.0.0.tar.gz

 5/ either:  ln -s pagecast-2.0.0 pagecast
    or:      mv pagecast-2.0.0 pagecast

 6/ cd pagecast

 7/ cp etc/forward ../.forward ; cp etc/procmailrc ../.procmailrc

 8/ cp -r etc/mail ..

 9/ Edit etc/pagecast.conf in your favorite text editor. You will not
    need to touch most of the settings; the defaults should be fine
    for most people. However, you will definitely want to take a look
    at the things in the [Mail] section.

10/ If you want to do meta-tag magic (see below), edit tags.conf.
    If you don't know or don't care about meta tags, you can safely
    ignore this step.

11/ Modify ~/.procmailrc (if necessary).

    The default procmail recipe that comes with Pagecast is set to
    ignore all mail that does not have the word "Pagecast" as its
    subject line. This can function as a password to prevent people
    from abusing the program.
    If you want a different password, you'll need to edit .procmailrc.
    All other mail will simply be forwarded to the default mail queue
    for that user.

    Optionally, you can tell procmail to trash all other mail so as
    not to clutter up your mail spool. If you want this, uncomment the
    last two lines of your .procmailrc. Also, by default, temporary
    backups of all mail will be saved in the mail folder.

    You can also enable Pagecast's 'debugging' (verbose) mode by
    editing the pagecast line in the ~/.procmailrc file. Simply add -v
    after the -m and Pagecast will print additional status information
    that might be useful in debugging a problem.

12/ logout

Pagecast should now be installed and ready to run. Simply email
pagecast@yourmachine with a list of URLs to submit in the body of the
message. You will need the word "Pagecast" (or whatever 'password' you
put in ~/.procmailrc) as the subject line of each message.


4. CONFIGURATION

A note about configuration files: if Pagecast came with your operating
system, the config files might not be in the usual place. (The usual
place is simply the etc directory of the untar'ed
pagecast-2.0.0.tar.gz.) In some configurations, Pagecast has both
system-wide configuration files and local user config files. The
system files (often in /etc/pagecast/ or /usr/local/etc/pagecast) are
read first, and then the user files (stored in ~/.pagecast/) are read,
overriding anything in the system files.

You will probably want to edit (minimally) the etc/pagecast.conf
configuration file to suit your needs. (Use a plain text editor.)

At minimum you should put your own email address and SMTP mail server
there in place of the fake defaults. Not all search engines actually
use this email address, but some do require an email address when
submitting links. I can't guarantee you won't get spam from using this
program, so feel free to change the email address to whatever you
want. This email address also becomes the "From:" address if you have
Pagecast send email.
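The read order described earlier in this section (system-wide files first, then per-user files overriding them) can be sketched with the standard configparser module from modern Python 3. This is only an illustration under stated assumptions, not Pagecast's own loading code: the helper name load_layered and the two stand-in file names are hypothetical, and the [Log] keys are borrowed from the logging settings described later in this section.

```python
import configparser
import os
import tempfile


def load_layered(paths):
    """Read INI-style files in order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # files that do not exist are silently skipped
    return parser


# Two throwaway files stand in for a system-wide config file and a
# per-user config file (hypothetical names, for demonstration only).
workdir = tempfile.mkdtemp()
system_conf = os.path.join(workdir, "system.conf")
user_conf = os.path.join(workdir, "user.conf")

with open(system_conf, "w") as handle:
    handle.write("[Log]\nlog=on\nlogfile=/var/log/pagecast.log\n")
with open(user_conf, "w") as handle:
    handle.write("[Log]\nlogfile=./pagecast.log\n")

settings = load_layered([system_conf, user_conf])

# The user file only overrides what it mentions; everything else is
# inherited from the system file.
print(settings.get("Log", "logfile"))
print(settings.get("Log", "log"))
```

Passing the paths as one ordered list keeps the precedence rule in a single place: whichever file is listed last wins for any setting it defines.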
You may also want to have carbon copies of Pagecast's replies sent to
an email address (or blind carbon copies, where neither recipient
knows the other received a message). In this case, uncomment the cc=
and/or the bcc= lines in the [Mail] section of pagecast.conf. (A
comment is any line that begins with a # character. Remove the #
characters to 'uncomment' a line.)

For instance, if you want blind carbon copies of each of Pagecast's
responses sent to joe@somewhere.com, add this line to pagecast.conf in
the [Mail] section:

    bcc=joe@somewhere.com

The logfile is on by default, and Pagecast appends its output to this
file in the directory where pagecast resides:

    ./pagecast.log

In the [Log] section of pagecast.conf you can turn logging on and off
with the log=on [or off] line. You can change the filename of the
logfile (including a full or relative path to it) by changing this
line:

    logfile=/var/log/pagecast.log     [for example]

Of course, you must have write permission where you want to write the
log file.

By default, Pagecast 1.1 (or greater) will try to 'fix' any URLs that
are malformed. This can be turned on or off by changing the fixlinks=
item in the [General] section of pagecast.conf. There are two modes
for link fixing, strict and lenient. By default, fixing is on and set
to lenient. Lenient mode will try to make a URL out of just about
anything, while strict mode will reject links that don't 'look' like
they can be turned into valid URLs. Rejected links will not be
submitted to search engines, and a warning about them will be printed
to the log. Lenient mode will not reject any links. To turn strict
link checking on or off, change the strict_links= item in the
[General] section of pagecast.conf.
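As a rough illustration of the difference between the two modes, here is a minimal sketch in modern Python 3. It is not Pagecast's actual code: the function name fix_link, the set of accepted schemes, and the "looks like a host name" test are all assumptions, chosen only so that the sketch agrees with the examples listed after it.

```python
import re

# Assumption: schemes treated as already valid. The real list used by
# Pagecast is not documented in this README.
KNOWN_SCHEMES = ("http", "https", "ftp")

SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*)://(.*)$")
# Assumption: a bare link "looks" fixable if it is a dotted host name,
# optionally followed by a path.
HOST_RE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(/.*)?$")


def fix_link(link, strict=False):
    """Return a repaired URL, or None if strict mode rejects the link."""
    link = link.strip()
    match = SCHEME_RE.match(link)
    if match:
        scheme, rest = match.group(1).lower(), match.group(2)
        if scheme in KNOWN_SCHEMES:
            return link                 # already usable, leave it alone
        if strict:
            return None                 # unknown scheme: reject
        return "http://" + rest         # lenient: swap in http://
    if strict and not HOST_RE.match(link):
        return None                     # does not look like a host name
    return "http://" + link             # bare host: prepend http://


for candidate in ("www.some-site.com", "ftp://site.com",
                  "not_a_valid_link", "fake://site.com"):
    print(candidate, "->", fix_link(candidate), "/",
          fix_link(candidate, strict=True))
```

Returning None for a rejected link lets the caller decide how to report it, which matches the behaviour described above of logging a warning and skipping the submission.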
Here are a few examples of the kinds of link 'fixing' Pagecast can do:

    www.some-site.com  -->  http://www.some-site.com
    ftp://site.com     -->  ftp://site.com (same)
    not_a_valid_link   -->  rejected if strict mode, otherwise
                            http://not_a_valid_link
    fake://site.com    -->  rejected if strict mode, otherwise
                            http://site.com

Other variables of interest:

retries= is a number indicating how many times each URL submission
should be "redialed" if it didn't go through for whatever reason, most
especially a temporary domain name lookup failure or other network
problem. After this retry count is exceeded the server will record an
error and move on to the next URL in the list. Note that if the URL is
explicitly rejected by the search engine as invalid, it will not be
redialed, but logged as an error.

wait= is a *floating point* number (this means use 1.0 instead of 1,
for example) that indicates how long, in seconds, to pause between
submitting each URL. This can be used as a "brake" on the script so it
doesn't overload your network and/or the search engine.

output= can be either on or off, and it indicates whether or not to
print log information to standard output (i.e., the screen / console).
This feature will work even if log=off.

You indicate which search engines to actually use in the section
[SearchEngines]. All of the ones listed in pagecast.conf as shipped
can be activated with an on setting and disabled with an off setting.
Each of these entries has a corresponding section in the servers.conf
file that defines the behavior of each search engine.


5. INTERNATIONALIZATION

The output of Pagecast -- the log messages shown to users -- can be
customized relatively easily. The main use for this is to support
languages other than the author's (English), but it could also be
used, for instance, to customize the output at your site.

To customize the output of Pagecast, change into the languages
directory. Then copy the us-english.lang file to yourlanguage.lang and
edit that file.
The format of the file is one big Python dictionary. A dictionary is a
way to associate a "key" with a "definition" or value. In this case,
the key is a special code used in the source code to identify a
particular output message. The key is on the left side of the : on any
particular line. The value, or definition, is on the right side of the
: (colon), and it is the string that is actually displayed to the
user. You want to change the values and not the keys.

The values use Python variable substitution syntax to allow numbers
and other bits of data to be put in the appropriate place in the value
string. At run time, each %s sign is replaced with a meaningful value.
The comment above each language entry tells you what order the values
will be printed in, and what they mean. Unfortunately, a current
limitation of Pagecast is that you cannot change the order in which
these values appear (to make it more correct for native syntax). This
may change in the future, but in the meantime, be sure not to add or
remove any %s signs from your translated strings.

Here is an example from us-english.lang:

    Lookups = {
        # . . . (stuff cut)

        # date, version
        "begin_run_msg" : "[%s] === %s === Beginning Run",

        # servername, URL, failure reason
        "submit_fail" : "%s ; @ %s : %s",

        # . . . (stuff cut)
    }

The entry with the key "begin_run_msg" defines what Pagecast prints
when it begins contacting search engines. As the comment above it
indicates, the two %s signs in the definition will be translated into
the current date (and time) and Pagecast's version.

The next example, "submit_fail", is what is printed when a search
engine does not accept a URL. The %s strings provided are the name of
the search engine (server), the URL that the user submitted, and
whatever error message can be extracted from the server.

To actually *activate* a language file, read the FAQ question about
changing languages.
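The substitution described above can be tried out in a few lines of modern Python 3. This is a sketch, not an excerpt from Pagecast: the reworded messages and the helpers render and same_placeholder_count are hypothetical, while the two keys and their original values come from the us-english.lang example.

```python
# Original values, copied from the us-english.lang example.
Original = {
    "begin_run_msg": "[%s] === %s === Beginning Run",
    "submit_fail": "%s ; @ %s : %s",
}

# A customized copy: same keys, same number and order of %s signs,
# only the surrounding wording changes (hypothetical wording).
Lookups = {
    # date, version
    "begin_run_msg": "[%s] === %s === Starting submission run",
    # servername, URL, failure reason
    "submit_fail": "%s ; @ %s : %s (not accepted)",
}


def render(key, *values):
    """Fill the %s placeholders of one message, in the order given."""
    return Lookups[key] % values


def same_placeholder_count(key):
    """Check that a customized value kept every %s of the original."""
    return Original[key].count("%s") == Lookups[key].count("%s")


print(render("begin_run_msg", "2000-01-01 12:00", "2.0.0"))
for key in Lookups:
    print(key, "placeholders preserved:", same_placeholder_count(key))
```

A check like same_placeholder_count is worth running on a new language file before using it, because a value with too few or too many %s signs makes the % operator raise a TypeError when the message is printed.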
I would gladly accept any language translation files to include with
future releases of Pagecast.


6. SEARCH ENGINES

Pagecast is highly flexible in how it handles interactions with
specific search engines. Most search engines handle URL submissions in
a very simple way: a single-page form on a web page. This can be
easily faked from a program; however, each search engine's page is
slightly different. Pagecast sees search engines in a sort of abstract
way, and drivers can be written that handle alternate methods of
submission (such as email or subspace transmissions).

Driver files for specific search engines are kept in the servers/
directory with .def extensions. When Pagecast starts up, all of these
files are scanned as possible search engines. However, each search
engine is only used if it belongs to a group marked as active.

This is controlled by the etc/groups.conf file, which is actually a
fragment of Python code that is executed by Pagecast. It is in this
file that you assign search engines to groups, and mark groups as
active.

Since this is Python, you must assign group names to the Active list
according to Python syntax (which, fortunately, is pretty
straightforward). If you are not a programmer, it is best to follow
the examples present in the etc/groups.conf file. Generally, for any
specific group you want to mark active, this syntax will work:

    Active.append("My Group Name")

Each servers/searchengine.def file defines a Name that identifies that
server. To make the search engine whose Name is "FooBar.com" a member
of one or more groups, use this syntax:

    Assign["FooBar.com"] = ["group1", "group2", "group3"]

Most search engines use a web/CGI interface to add new URLs to their
database. If you want Pagecast to use a search engine like this that
is not already defined in the servers directory, you will need to set
up a .def file. Copy the template.def.example file.
For instance:

    $ cp servers/template.def.example servers/foobar.com.def

Then edit your new file, servers/foobar.com.def, in a plain text
editor. Usually you only have to fill in values on a few lines, then
save the file and exit. (Be sure to assign it to an active group in
etc/groups.conf when you want to use it.)

Everything you need to change is between the ### START HERE and
### END HERE markers. Don't change anything above or below those
markers unless you know what you're doing. The comments in the
template should help you understand what the different settings need.

To find out what you need to fill in for your desired search engine,
you are going to have to look at the raw HTML source of that search
engine's main "Add URL" page.

You can also append group names to the Prohibited list in the
groups.conf file. Any group names in Prohibited will not be selected
to run.

You can specify specific groups on the command line with the -g
option. For instance,

    pagecast -g debug,broken -u http://pagecast.sourceforge.net

will submit the URL http://pagecast.sourceforge.net only to the
servers that belong to the "debug" and/or "broken" groups.

If you need further help configuring a new server, check out the
Pagecast mailing lists. [See the top of this document.]

Please contribute any server definitions or extensions that you may
come up with.

Currently supported search engines:

    Excite!         (same database as Webcrawler)
    InfoSeek        (using mail submissions, which are 24/7)
    AltaVista       (with invalid URL detection and logging; Yahoo uses)
    Lycos           (ditto; Tripod uses)
    Hotbot          (the idiot savant of search engines)
    Google          (a great search engine, especially for technical
                    topics)
    Northern Lights (never use this one much myself)
    Canada          (obviously oriented to Canadians)
    Anzwers         (oriented towards Australia and New Zealand)


7. META TAGS

Pagecast can (optionally) download the URLs you are submitting and
examine the keywords contained in the header meta tags.
Pagecast will attempt to do this if tags=on in the [General] section
of pagecast.conf.

Pagecast will compare the keywords to the words in the body of the
document, and produce a "confidence rating". Many search engines use a
similar algorithm to determine how relevant the keywords are to the
actual content of the document, to avoid the spam syndrome. Pagecast
will also warn you if no keywords are present, or if the document is
using the outdated HTTP-EQUIV style of keywords.

If the documents are accessible to the pagecast user on the
filesystem, Pagecast can actually modify the documents, updating the
date meta tag and fixing HTTP-EQUIV style keywords. This can give a
search engine the impression that the document is very recent. To
activate this feature, set localdocuments=on in the [General] section
of pagecast.conf, and then edit the tags.conf file to tell Pagecast
how to find the documents on the filesystem.


8. KNOWN BUGS

There are a few "issues" to be aware of in extreme cases:

I don't know what the hard limit on the URL size is. It probably
depends on each search engine. Most web browsers limit URLs to around
250 characters, but this program doesn't enforce that.

The program may be aborted with CTRL-C, but due to its multithreaded
nature, it may not stop completely right away. All the threads (each
search engine is a thread) will exit after they complete their current
interaction with the search engine, but with a slow server, this might
take a while.

There have been problems reported with Python 1.5.2 and Linux 2.0.x.
Please see the FAQ file for solutions.

Thanks to Joshua Levitsky <jlevitsk@joshie.com> for pointing out a
brown-paper-bag bug in the 1.0 release.


9. LINKS OF INTEREST

-- About search engines, "robots"

    http://info.webcrawler.com/mak/projects/robots/faq.html

=====================================================================