Remote Search Services
Volume Number: 15
Issue Number: 12
Column Tag: Web Site Design
Remote Search Services for Web Sites
by Avi Rappoport
Adding search to your site - even if you don't own
the server
What is Remote Site Searching?
When you want to add search to your site, you may run into some technical difficulties.
Perhaps your site is hosted on a large server somewhere, or you have an
uncooperative web administrator, or the challenges of adding a CGI are too daunting.
Never fear! You can outsource your search to a remote site search service and let
someone else worry about the gory details.
The indexer and search engine run on the remote server: they will use a web indexing
robot, or spider, to follow links on your site and read the pages, then store every word
in the index file on that server. When it comes time to search, the form on your local
Web page sends a message to the remote search engine. Although the request travels over
the Web, the process doesn't change - it just has a little farther to go. The remote search
engine takes the search terms, matches the words in the index, sorts them according to
relevance, and creates an HTML page with the results. When a searcher clicks on the
result link, they will see the page from your site, just as though the search came from
there. It's easy and painless for practically everyone.
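A remote-search form is just an ordinary HTML form whose ACTION points at the service's server instead of your own. As a sketch (the URL and field names here are hypothetical; each service supplies its own):

```
<!-- Hypothetical form: the action URL and field names vary by service -->
<form method="get" action="http://search.example-service.com/query">
  <!-- Identifies your account and index on the remote server -->
  <input type="hidden" name="account" value="your-site-id">
  <input type="text" name="terms" size="20">
  <input type="submit" value="Search">
</form>
```

You paste this markup into any page on your site; no CGI or server change is needed on your end.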
This review covers the range of remote search services, their features and their
drawbacks. It will teach you to prepare your site, try indexing it, test the search,
customize the results, keep the search up to date, and choose the right program for
your long-term needs.
What you Get With Remote Search Services
• No need for server access: Even if your site is hosted and you have FTP
access only, you can run a search engine.
• No need to learn CGIs or server systems: You never need to install any
software, worry about version compatibility, or learn about permissions and
paths (or paying someone else to do so).
• Easy administration: The remote search service will provide a set of Web
pages for administration, rather than making you learn about command lines
or config files.
• No load on your server: Search engines require significant resources,
such as CPU time during searching and retrieval, as well as disk space.
Outsourcing to a remote server moves the load away from you. In addition,
these servers are usually in data centers with excellent connectivity and 24/7
administration.
• Minimal initial investment: Instead of paying for a search engine up
front, you can pay a small monthly fee. Some services are free, showing
advertising with the search results.
• Easy to switch: If you aren't happy with your search service, it's easy to
switch to another.
The Tradeoffs
• Advertising or continuing costs: You must pay every month or allow your
searchers to see other people's advertising.
• Less control over the indexing: If your data changes frequently (hourly or
daily), most of these services will not index that often.
• Dependent on outside service: If the service's search engine gets busy, it
may delay responses for your site, and there's not much you can do.
• Less capacity: The remote search services have a page limit, usually
somewhere between 200 and 5000 pages. While many can go higher than that,
they can't handle hundreds of thousands of pages.
• Fewer special features: Each search engine has its own special features,
but you have more choices if you plan to run your own engine. For example,
indexing password-protected areas or word-processing file formats, adding a
thesaurus or a spellchecker, and so on.
• Intranet privacy: Intranets (internal networks using standard software)
want to keep control of all their data, rather than granting access to external
systems.
• Multi-site indexing: Most remote services allow you to index just the
sites you control. With a local search engine, you can index other sites and
create a public search portal.
Remote Site Search Services Covered
The following services are covered in this review, and also have pages and examples on
this site.
Atomz <http://www.atomz.com/>
• free for 500 pages and fewer than 5,000 searches per month (no ads, just a
logo)
• paid version: 250 pages & 2.5K searches/month @ $75 per year; 500 pages &
5K searches/month @ $150 per year; 1,000 pages & 10K searches/month @
$300 per year; 2,500 pages & 25K searches/month @ $600 per year;
5,000 pages & 50K searches/month @ $1,200 per year
FreeFind <http://www.freefind.com/>
• free (with advertising); can handle up to 32MB of HTML (flexible), and will
"sample" sites if they get too large.
intraSearch (WhatUSeek) <http://www.whatUseek.com/intraSearch/>
• free (with advertising) to at least 10,000 pages
MondoSearch (remote version) <http://www.mondosearch.com/>
• paid version only: 1 - 1,000 pages: $144; to 5,000 pages: $585; to
10,000 pages: $990; above: contact sales@mondosoft.com
• local server version also available
PicoSearch <http://www.picosearch.com/>
• free (with advertising), to 5,000 pages
• paid version: $6.99 per month (12 month commitment); $9.99 per
month (3 month commitment)
PinPoint <http://pinpoint.netcreations.com/>
• free (with advertising) to 5,000 pages
SearchButton <http://www.searchbutton.com/>
• free (with advertising), for up to 5,000 pages, 30,000 searches per
month
• paid version: up to 1,000 pages: $300 per year; up to 5,000 pages:
$600 per year (limit of 30,000 searches per month); for more pages,
contact company
SiteMiner <http://www.siteminer.com/>
• free (with advertising), to 10,000+ pages
Webinator (remote version) <http://www.thunderstone.com/texis/indexsite>
• free (with Thunderstone logo), to 5,000 pages
• local server version also available; it can scale from thousands to millions of pages
Checking Links and Pages
Before you install any search engine with an indexing spider, you must make sure it can
find the pages on your site. The good news is that cleaning up your links will make your
site more accessible to the large public search engines (such as AltaVista, Google, HotBot
and Infoseek), and make it easier for you to run an automated site mapper.
Robot Spider Compatibility
The indexing spiders follow links from a starting page, so start them at your home page
if it has good text links, or at a site map page.
Whole sites: Robots.txt
The first thing is to check the "robots.txt" file. This is a standard file for web servers
that sits at the root of your site, and excludes robots that are not welcome on the site,
or in certain specific directories (though compliance is voluntary). If you run your own
server, you control this file; otherwise your host's server administrator controls it.
You want to make sure that this file exists, and that it allows at least your indexing
spider to access your directories. You may need to negotiate with your web hosting
provider on this point, as this file must be stored in the root folder of the web host.
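If you do control the file, a minimal robots.txt that welcomes all spiders while keeping them out of one directory looks like this (the directory name is just an example):

```
# robots.txt - must live at the root of the web server
# Apply to all robots; keep them out of /private/ only
User-agent: *
Disallow: /private/
```

A blank Disallow: value would allow everything, so make sure no rule accidentally shuts out the indexing spider you are trying to use.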
For more information on this topic, see Search Indexing Robots and Robots.txt, and the
WebMasters Guide to the Robots Exclusion Protocol at
<http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html>.
Individual Pages: META ROBOTS tag
The other way that page designers can control robots and spiders is by using the META
ROBOTS tags. These are particularly useful if you have a hosted site and don't want to
bother your server administrator.
For example, if you have a directory listing or site map page, you can tell the spiders
to follow the links but not index the text on the page by placing the following
information into the HTML header: <meta name="robots" content="noindex,follow">. If
you have pages with useful data but inappropriate links, such as a web calendar page
with duplicate links to other calendar pages, use <meta name="robots"
content="index,nofollow">.
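Placed in the <head> of a page, the two variants look like this (a sketch using the standard Robots META keywords):

```
<!-- Site map page: follow the links, but don't index this page's own text -->
<meta name="robots" content="noindex,follow">

<!-- Calendar page: index the text, but don't follow the duplicate links -->
<meta name="robots" content="index,nofollow">
```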
For more information, see Search Indexing Robots and the Robots Meta Tags, and the
WebMasters Guide above.
Good Links and Bad Links
Indexing spiders tend to be pretty dumb. They know about the simple HREF links, but
just get lost on anything more complex. Spiders and robots may not follow links in:
• image maps (especially server-side image maps)
• redirect and META Refresh tags
• Framesets
• DHTML layers
• ActiveX controls
• JavaScript menus and pages
• Java pages and site maps
• Flash or Shockwave (unless you use the AfterShock options to generate
HTML text and links!)
Check Your Links
To give yourself a spider-eye view, try a text browser such as Lynx, or a graphical
browser with images and JavaScript turned off, and no Plug-Ins: this will give you a
good view of what the spiders see.
Don't rely on your content-management system to check local links: it knows too much
about the structure of your site and the special formats you use!
To make sure all your local links work, run a link-checking robot such as Big Brother
for Mac & Unix, or use a service such as NetMechanic. If these services can follow the
links, there's a good chance that your search indexing robot can do the same.
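The spider's-eye view is also easy to simulate yourself: most indexing robots only see ordinary HREF links. This short Python sketch (not any particular spider's code) shows how a parser finds plain links and misses everything else:

```python
# Minimal link extractor - a sketch of what an indexing spider sees:
# it only finds plain <a href="..."> links, just as most spiders do.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from ordinary <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = """
<a href="/products.html">Products</a>
<area href="/map.html">              <!-- image map: ignored -->
<a onclick="go('/js.html')">Menu</a> <!-- JavaScript link: ignored -->
"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # only the plain HREF link is found: ['/products.html']
```

Running this over your own pages (or just reading them in a text browser) quickly shows which parts of your navigation are invisible to a robot.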
Solution: Supplement Complex Links
If you find you have problems, there are two ways around bad links: both require
work, but they will make the indexing spiders happy.
• Alternate Navigation: add alternate links in