Remote Search Services
Volume Number: 15
Issue Number: 12
Column Tag: Web Site Design
Remote Search Services for Web Sites
by Avi Rappoport
Adding search to your site - even if you don't own
the server
What is Remote Site Searching?
When you want to add search to your site, you may run into some technical difficulties.
Perhaps your site is hosted on a large server somewhere, or you have an
uncooperative web administrator, or the challenges of adding a CGI are too daunting.
Never fear! You can outsource your search to a remote site search service and let
someone else worry about the gory details.
The indexer and search engine run on the remote server: they will use a web indexing
robot, or spider, to follow links on your site and read the pages, then store every word
in the index file on that server. When it comes time to search, the form on your local
Web page sends a message to the remote search engine. Although the request travels over
the Web, the process doesn't change - it just has a little farther to go. The remote search
engine takes the search terms, matches the words in the index, sorts them according to
relevance, and creates an HTML page with the results. When a searcher clicks on the
result link, they will see the page from your site, just as though the search came from
there. It's easy and painless for practically everyone.
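A remote-search form is just an ordinary HTML form whose ACTION points at the service's server instead of your own. As a sketch (the URL and field names here are hypothetical; each service supplies its own):

```
<!-- Hypothetical form: the action URL and field names vary by service -->
<form method="get" action="http://search.example-service.com/query">
  <!-- Identifies your account and index on the remote server -->
  <input type="hidden" name="account" value="your-site-id">
  <input type="text" name="terms" size="20">
  <input type="submit" value="Search">
</form>
```

You paste this markup into any page on your site; no CGI or server change is needed on your end.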
This review covers the range of remote search services, their features and their
drawbacks. It will teach you to prepare your site, try indexing it, test the search,
customize the results, keep the search up to date, and choose the right program for
your long-term needs.
What you Get With Remote Search Services
• No need for server access: Even if your site is hosted and you have FTP
access only, you can run a search engine.
• No need to learn CGIs or server systems: You never need to install any
software, worry about version compatibility, or learn about permissions and
paths (or paying someone else to do so).
• Easy administration: The remote search service will provide a set of Web
pages for administration, rather than making you learn about command lines
or config files.
• No load on your server: Search engines require significant resources,
such as CPU time during searching and retrieval, as well as disk space.
Outsourcing to a remote server moves the load away from you. In addition,
these servers are usually in data centers with excellent connectivity and 24/7
administration.
• Minimal initial investment: Instead of paying for a search engine up
front, you can pay a small monthly fee. Some services are free, showing
advertising with the search results.
• Easy to switch: If you aren't happy with your search service, it's easy to
switch to another.
The Tradeoffs
• Advertising or continuing costs: You must pay every month or allow your
searchers to see other people's advertising.
• Less control over the indexing: If your data changes frequently (hourly or
daily), most of these services will not index that often.
• Dependent on outside service: If the service's search engine gets busy, it
may delay responses for your site, and there's not much you can do.
• Less capacity: The remote search services have a page limit, usually
somewhere between 200 and 5000 pages. While many can go higher than that,
they can't handle hundreds of thousands of pages.
• Fewer special features: Each search engine has its own special features,
but you have more choices if you plan to run your own engine. For example,
indexing password-protected areas or word-processing file formats, adding a
thesaurus or a spellchecker, and so on.
• Intranet privacy: Intranets (internal networks using standard software)
want to keep control of all their data, rather than granting access to external
systems.
• Multi-site indexing: Most remote services allow you to index just the
sites you control. With a local search engine, you can index other sites and
create a public search portal.
Remote Site Search Services Covered
The following services are covered in this review, and also have pages and examples on
this site.
Atomz <http://www.atomz.com/>
• free for 500 pages and fewer than 5,000 searches per month (no ads, just a
logo)
• paid version: 250 pages & 2.5K searches/month @ $75 per year; 500 pages &
5K searches/month @ $150 per year; 1,000 pages & 10K searches/month @
$300 per year; 2,500 pages & 25K searches/month @ $600 per year;
5,000 pages & 50K searches/month @ $1,200 per year
FreeFind <http://www.freefind.com/>
• free (with advertising); can handle up to 32MB of HTML (flexible), and will
"sample" sites if they get too large.
intraSearch (WhatUSeek) <http://www.whatUseek.com/intraSearch/>
• free (with advertising) to at least 10,000 pages
MondoSearch (remote version) <http://www.mondosearch.com/>
• paid version only: 1 - 1,000 pages: $144; to 5,000 pages: $585; to
10,000 pages: $990; above: contact sales@mondosoft.com
• local server version also available
PicoSearch <http://www.picosearch.com/>
• free (with advertising), to 5,000 pages
• paid version: $6.99 per month (12 month commitment); $9.99 per
month (3 month commitment)
PinPoint <http://pinpoint.netcreations.com/>
• free (with advertising) to 5,000 pages
SearchButton <http://www.searchbutton.com/>
• free (with advertising), for up to 5,000 pages, 30,000 searches per
month
• paid version: up to 1,000 pages: $300 per year; up to 5,000 pages:
$600 per year (limit of 30,000 searches per month); for more pages,
contact company
SiteMiner <http://www.siteminer.com/>
• free (with advertising), to 10,000+ pages
Webinator (remote version) <http://www.thunderstone.com/texis/indexsite>
• free (with Thunderstone logo), to 5,000 pages
• local server version also available; it can scale from thousands to millions of pages
Checking Links and Pages
Before you install any search engine with an indexing spider, you must make sure it can
find the pages on your site. The good news is that cleaning up your links will make your
site more accessible to the large public search engines (such as AltaVista, Google, HotBot
and Infoseek), and make it easier for you to run an automated site mapper.
Robot Spider Compatibility
The indexing spiders follow links from a starting page, so start them at your home page
if it has good text links, or at a site map page.
Whole sites: Robots.txt
The first thing is to check the "robots.txt" file. This is a standard file for web servers
that sits at the root of your site, and excludes robots that are not welcome on the site,
or in certain specific directories (though compliance is voluntary). If you run your own
server, you control this file; otherwise your host's server administrator controls it.
You want to make sure that this file exists, and that it allows at least your indexing
spider to access your directories. You may need to negotiate with your web hosting
provider on this point, as this file must be stored in the root folder of the web host.
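If you do control the file, a minimal robots.txt that welcomes all spiders while keeping them out of one directory looks like this (the directory name is just an example):

```
# robots.txt - must live at the root of the web server
# Apply to all robots; keep them out of /private/ only
User-agent: *
Disallow: /private/
```

A blank Disallow: value would allow everything, so make sure no rule accidentally shuts out the indexing spider you are trying to use.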
For more information on this topic, see Search Indexing Robots and Robots.txt, and the
WebMasters Guide to the Robots Exclusion Protocol at
<http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html>.
Individual Pages: META ROBOTS tag
The other way that page designers can control robots and spiders is by using the META
ROBOTS tags. These are particularly useful if you have a hosted site and don't want to
bother your server administrator.
For example, if you have a directory listing or site map page, you can tell the spiders
to follow the links but not index the text on the page by placing the following
information into the HTML header: <meta name="robots" content="noindex,follow">. If
you have pages with useful data but inappropriate links, such as a web calendar page
with duplicate links to other calendar pages, use <meta name="robots"
content="index,nofollow">.
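Placed in the <head> of a page, the two variants look like this (a sketch using the standard Robots META keywords):

```
<!-- Site map page: follow the links, but don't index this page's own text -->
<meta name="robots" content="noindex,follow">

<!-- Calendar page: index the text, but don't follow the duplicate links -->
<meta name="robots" content="index,nofollow">
```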
For more information, see Search Indexing Robots and the Robots Meta Tags, and the
WebMasters Guide above.
Good Links and Bad Links
Indexing spiders tend to be pretty dumb. They know about the simple HREF links, but
just get lost on anything more complex. Spiders and robots may not follow links in:
• image maps (especially server-side image maps)
• redirect and META Refresh tags
• Framesets
• DHTML layers
• ActiveX controls
• JavaScript menus and pages
• Java pages and site maps
• Flash or Shockwave (unless you use the AfterShock options to generate
HTML text and links!)
Check Your Links
To give yourself a spider-eye view, try a text browser such as Lynx, or a graphical
browser with images and JavaScript turned off, and no Plug-Ins: this will give you a
good view of what the spiders see.
Don't rely on your content-management system to check local links: it knows too much
about the structure of your site and the special formats you use!
To make sure all your local links work, run a link-checking robot such as Big Brother
for Mac & Unix, or use a service such as NetMechanic. If these services can follow the
links, there's a good chance that your search indexing robot can do the same.
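The spider's-eye view is also easy to simulate yourself: most indexing robots only see ordinary HREF links. This short Python sketch (not any particular spider's code) shows how a parser finds plain links and misses everything else:

```python
# Minimal link extractor - a sketch of what an indexing spider sees:
# it only finds plain <a href="..."> links, just as most spiders do.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from ordinary <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = """
<a href="/products.html">Products</a>
<area href="/map.html">              <!-- image map: ignored -->
<a onclick="go('/js.html')">Menu</a> <!-- JavaScript link: ignored -->
"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # only the plain HREF link is found: ['/products.html']
```

Running this over your own pages (or just reading them in a text browser) quickly shows which parts of your navigation are invisible to a robot.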
Solution: Supplement Complex Links
If you find you have problems, there are two ways around bad links: both require
work, but they will make the indexing spiders happy.
• Alternate Navigation: add alternate links in