How to Crank Up a Search Engine

A seminar presentation by Rod Eime


The work of the traditional investigative journalist has been made substantially easier over the years with the progressive introduction of data processing technology.

Databases, as we know, are simply catalogued electronic filing cabinets containing sometimes thousands, often millions of records. Taxation, Motor Vehicles, Police, Health, Welfare and Credit Bureaus are just some of the enormous repositories of information currently stored by electronic means. For most of us, however, this information is privileged and protected by strict privacy laws.

In the last few years we have witnessed the emergence of another, enormously powerful, database - the Internet. Far from an overnight phenomenon, networked computers have been around for many years. Local Area Networks (LANs) and Wide Area Networks (WANs) are the norm in both major multi-national corporations and small businesses alike, enabling fast data sharing throughout the organisation.

During the Cold War, the US Military developed a command network that could continue to operate even if one or more of its linked stations failed or were blown up. When the Cold War ceased around ten years ago, the network was turned over to academic and scientific establishments who continued to use it very effectively.

Now known as the Internet, short for internetwork, it has become a network of networks covering the entire globe. Since it was embraced by commercial organisations in the mid-1990s, the Internet has exploded in popularity and now accommodates around 100 million users worldwide.

To continue with an elaboration of the Internet, and all of its protocols and languages, would take several seminars alone and I apologise in advance for taking some necessary shortcuts. In this presentation I will simply highlight some methods of using the Internet for information location and retrieval.

Mining The Mother Lode

Perhaps the mother of all databases, the Internet offers a mind-boggling scope of information. Hundreds of millions of documents around the world are accessible via a simple PC in the home or office. Each document has its own unique address or Uniform Resource Locator (URL) and can be accessed by simply typing that address into your computer.

This all sounds marvellous, but unless you know the address, you have virtually no chance of finding that document. Enter the Search Engine!

Comprehending the power of a search engine is a bit like calculating the volume of the universe. Search engines - and there are many - attempt to catalogue and index EVERY publicly available document on the Internet.

They analyse every word in each document, and very often analyse each word's relationship with every other word in that document, storing all this information in a continually updated database. Phew!

Many search engines perform this task by a process called "crawling". A Crawler will attempt to access every available page on the Internet, and in turn, index the words within those pages. This is an enormous task in anybody's book, and Crawlers are inevitably caught out of date as pages change or disappear between "crawls".
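As a rough illustration of the indexing step described above, here is a minimal sketch of a word index in Python. It is not how any particular engine works; the page addresses and text are invented for the example, and a real Crawler would fetch the pages over the network rather than read them from a dictionary.

```python
from collections import defaultdict
import re

# Hypothetical "crawled" pages: address -> page text (made-up data).
PAGES = {
    "http://example.com/a": "Fitzgerald Inquiry final report released",
    "http://example.com/b": "Hinchinbrook development inquiry continues",
}

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of addresses whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

index = build_index(PAGES)
# Looking up a word is now a single dictionary access.
print(sorted(index["inquiry"]))
```

Because the index is built ahead of time, a search only has to look up each word rather than re-read every page. It also shows why results go stale: the index reflects the pages as they were at the last crawl.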

Crawlers are perhaps better described as "keyword engines" as they respond to searches for words or phrases. Examples of this type of engine are Alta Vista, Anzwers and Excite.

This type of engine is the most popular, and I believe the most useful, but other types also exist. Of the remainder, the hierarchical, category-based engines like Yahoo are also widely used. These engines, or perhaps more accurately "directories", are best used when researching broad topics like cinema, computers, business and commerce.

Performing a Keyword Search with Anzwers

I have chosen the Anzwers search engine for this example for two reasons: 1) it is kept reasonably up to date, and 2) it has an Australian and New Zealand bias, hence the name.

Before embarking on our electronic odyssey, we must first pinpoint exactly what we want to find out. Nebulous requests like "computer", "construction" or "university" will turn up many thousands of matches without any useful means of "ranking" their importance.

Keyword searches are best performed with a group of words or even a phrase. This is quite easy at Anzwers, as you simply select either "all the words" or "exact phrase" to search against a group of words.
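The difference between the two options can be sketched in a few lines of Python. This is only an illustration of the idea, not Anzwers' own method; the function names and sample sentences are invented for the example.

```python
import re

def words(text):
    """Lower-case the text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_words(query, text):
    """True if every query word appears somewhere in the text."""
    doc_words = set(words(text))
    return all(w in doc_words for w in words(query))

def matches_exact_phrase(query, text):
    """True if the query words appear side by side, in order."""
    q, d = words(query), words(text)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# Made-up sample sentences.
doc_a = "The Fitzgerald Inquiry into police corruption"
doc_b = "Mr Fitzgerald welcomed the inquiry"

for doc in (doc_a, doc_b):
    print(matches_all_words("Fitzgerald Inquiry", doc),
          matches_exact_phrase("Fitzgerald Inquiry", doc))
```

Both sentences pass the "all the words" test, but only the first passes the "exact phrase" test, which is why the phrase option tends to return fewer, more relevant matches.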

For example, to search against "Fitzgerald Inquiry", simply type in the two words and select the "exact phrase" option. (see figure 1) Anzwers also gives you the option to search within Australia, which is a good idea in this case. We will also ask for 100 brief results, based on our search success expectations.

 
Figure 1: Basic Search Options at Anzwers

An initial search turns up almost one hundred matches ranked according to relevance. Now it is relatively easy to investigate the top dozen or so matches for usefulness.

If we choose "detailed" instead of "brief", our job of weeding out extraneous matches may sometimes be easier. This option displays not only the title of the page and its URL, but also the first few lines of text in the document. It may, however, add to the search time.

In order to ferret out some types of information, we need to be somewhat more ingenious in our methods. This is where the "Power Search" options can be useful.

Let's imagine that we want to find recent information on Keith Williams's Hinchinbrook development.

Having selected the "Power Search" option, we enter the term as in Figure 2.
Note that we have specified Location: Australia, and dated after January 1 1998.

 
Figure 2. Advanced Search Options at Anzwers (part 1)

In the second set of options (which appear on the same screen), we then specify that we want "Keith Williams" included in the document. We do this by selecting "must contain" and "the person" from the available options.

 
Figure 3. Advanced Search Options at Anzwers (part 2)

In doing so, we have narrowed our search from the more than 1200 vague matches returned by typing in simply "hinchinbrook" to just 28 specific matches most likely to yield us the information we want.
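The narrowing effect of stacking several conditions can be sketched as follows. This is a simplified illustration only: the Page record, its fields and the sample pages are all invented, and the "must contain the person" option is reduced here to a plain text match.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Page:
    # Hypothetical record for one indexed page.
    url: str
    country: str
    modified: date
    text: str

def power_search(pages, keyword, country, after, must_contain):
    """Keep only the pages that satisfy every condition."""
    keyword = keyword.lower()
    must_contain = must_contain.lower()
    return [
        p for p in pages
        if keyword in p.text.lower()
        and p.country == country
        and p.modified > after
        and must_contain in p.text.lower()
    ]

# Made-up sample pages.
pages = [
    Page("http://example.com/1", "AU", date(1998, 3, 2),
         "Keith Williams defends Hinchinbrook resort"),
    Page("http://example.com/2", "AU", date(1997, 6, 1),
         "Keith Williams and the Hinchinbrook channel"),
    Page("http://example.com/3", "US", date(1998, 4, 1),
         "Hinchinbrook Island travel notes"),
    Page("http://example.com/4", "AU", date(1998, 5, 5),
         "Hinchinbrook Shire council minutes"),
]

results = power_search(pages, "hinchinbrook", "AU",
                       date(1998, 1, 1), "Keith Williams")
print([p.url for p in results])
```

All four sample pages mention "hinchinbrook", yet only one survives the location, date and name conditions together.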

Anzwers makes complex searching reasonably easy with the use of its "power" options. Another, more fundamental, way of performing advanced searches is with the use of Boolean operators.

Searching in Boolean

Devised by the self-taught mathematician George Boole (1815-1864), the Boolean language of logic has proven a mainstay in pure mathematics and computing.

The main Boolean operators are OR, AND, NOT, and NEAR, but others like ADJ, FAR and BEFORE may also be used by some engines.

Boolean operators were in use well before the advent of user-friendly graphical interfaces like those found at Anzwers. The modern, sophisticated and powerful search engine still uses Boolean terms as its basic tools. The simplified graphical interfaces were added for non-technical users who wanted to perform basic searches quickly.

Nevertheless, it is still possible to override the fancy front-ends and go straight to Boolean with most engines.

Drawing on a popular example, let's say we want to find out about boxers but wish to exclude all references to a specific article of clothing. We would then type into our search window "boxers NOT shorts", which would return matches that contain "boxers" but eliminate any that also contain "shorts".
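One way to picture what the engine does with such a query is as set arithmetic on lists of matching documents. The sketch below assumes three made-up documents and is not the internal method of any particular engine.

```python
import re

# Made-up documents, numbered for convenience.
DOCS = {
    1: "Champion boxers train for the title fight",
    2: "Boxers and shorts on sale this week",
    3: "Caring for boxers and other large dogs",
}

def words(text):
    """Return the set of lower-cased words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def docs_with(term):
    """Return the numbers of the documents that contain the term."""
    return {n for n, text in DOCS.items() if term.lower() in words(text)}

boxers_and_shorts = docs_with("boxers") & docs_with("shorts")  # AND
boxers_or_shorts = docs_with("boxers") | docs_with("shorts")   # OR
boxers_not_shorts = docs_with("boxers") - docs_with("shorts")  # NOT

print(sorted(boxers_not_shorts))
```

Here "boxers NOT shorts" keeps documents 1 and 3 and drops document 2, the only one that mentions both words.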

Boolean terms can also be combined, but things can get complicated with this method as some engines deal with combined terms differently.

Susan Feldman, in her seminal article, The Internet Search-Off, (http://www.infotoday.com/searcher/feb/story1.htm) offers six tips on the use of Boolean when searching.
 

1. Use rare or unusual words rather than common words -- if you are doing an assignment on technology, avoid words like "computer", "network" or "internet", unless you want to get flooded with pages and pages of responses. 

Instead, look for less conventional words that characterise your core interests such as, for example, "latency", "packet loss" or "SGML".

2. Search for phrases where possible -- if you are looking for something on nuclear fusion use "nuclear fusion" as a phrase rather than "nuclear AND fusion" or "nuclear NEAR fusion". While the latter Boolean techniques provide you with MORE responses, many may have spurious content. The phrase will provide you with FEWER responses, but more of them will be on your topic of interest.

3. Use many synonyms -- pay attention to cultural diversity. If you want to research car design, remember that a bonnet is referred to as a "hood" in the US. If you are interested in the physics of colour and the psychology of humour, be sure to query for "color" and "humor" as well.

4. Use "more like this" rather than persisting with Boolean -- some search engines offer a "more like this" option. These are usually more efficient at filtering spurious results than trying to use nested Boolean operators.

5. Put most important words first -- some search engines' approach to listing responses is to give the greatest priority to sites that respond to the first word of the query, not the most important word. So if you are researching the history and politics of land claims, try inserting the query <"land claims" and "history" and "politics">.

6. Use "refine" options rather than Boolean operators -- some engines offer a "refine" capacity that cuts the number of spurious hits down more precisely than persisting with a more complex Boolean query.
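The second tip above can be illustrated with a small Python sketch comparing AND, NEAR and phrase matching on the "nuclear fusion" example. The sample sentences are invented, and the NEAR window of five words is an assumption made for the example; engines differ on how close "near" is.

```python
import re

def tokens(text):
    """Lower-case the text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_and(a, b, text):
    """Both words appear anywhere in the text."""
    t = tokens(text)
    return a in t and b in t

def match_near(a, b, text, window=5):
    """Both words appear within `window` words of each other (assumed value)."""
    t = tokens(text)
    pos_a = [i for i, w in enumerate(t) if w == a]
    pos_b = [i for i, w in enumerate(t) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def match_phrase(a, b, text):
    """The first word is immediately followed by the second."""
    t = tokens(text)
    return any(t[i] == a and t[i + 1] == b for i in range(len(t) - 1))

# Made-up sample sentences.
DOCS = [
    "Nuclear fusion research reaches a new milestone",
    "The nuclear family and the fusion of cooking styles",
    "Nuclear power policy was debated at length in parliament today "
    "while elsewhere a jazz fusion festival opened",
]

counts = {
    name: sum(fn("nuclear", "fusion", d) for d in DOCS)
    for name, fn in [("AND", match_and), ("NEAR", match_near),
                     ("PHRASE", match_phrase)]
}
print(counts)
```

On these three sentences AND matches all three, NEAR matches two and the phrase matches only the one that is actually about nuclear fusion.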


Push NOT Shove

Another popular method for acquiring information is to have it sent to you as it is published.

Some web sites will allow you to enter certain topics of interest, people's names and other data and have matching items sent to you. This is known as "push" technology.

Going back to our earlier example, if we wanted to be kept up to date on Hinchinbrook developments, we could request that all news items containing the term "hinchinbrook" be sent to us when they are published.

To do this, we must first find a suitable service that meets our needs.

International contenders include Pointcast, TotalNEWS, and Excite's NewsTracker.

Closer to home, Ozemail's Newswatch site (www.newswatch.com.au), which is served by AAP and Reuters, can provide a more localised service.

The Newswatch service operates in any or all of three ways.

1) As a general news service accessible each time you log in.
2) As a personalised news service that only displays items in your area of interest, and
3) As an e-mail alert service that will notify you by e-mail when a story on a chosen topic breaks.

For personalised news, the process of selection is quite simple. You choose from a list of pre-determined news topics, with an option to further refine your selection with the inclusion of a keyword. It is these keyword matches that will be e-mailed to you.

You may also choose to have a once-per-day headline summary sent to you at a time you specify.
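The topic-plus-keyword selection described above amounts to a simple filter over incoming stories. The sketch below is an illustration only and does not represent how Newswatch is built; the topic names, headlines and function name are all invented.

```python
def matching_items(news_items, topics, keyword):
    """Return stories in a chosen topic whose headline contains the keyword."""
    keyword = keyword.lower()
    return [
        item for item in news_items
        if item["topic"] in topics and keyword in item["headline"].lower()
    ]

# Made-up incoming stories.
incoming = [
    {"topic": "environment", "headline": "Hinchinbrook marina decision due"},
    {"topic": "sport", "headline": "Hinchinbrook yacht race cancelled"},
    {"topic": "environment", "headline": "Reef water quality report released"},
]

# A reader interested in the "environment" topic, refined by a keyword.
alerts = matching_items(incoming, {"environment"}, "hinchinbrook")
for item in alerts:
    print("ALERT:", item["headline"])
```

Only the first story is both in the chosen topic and contains the keyword, so it is the only one that would trigger an alert.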

Conclusion - No Need For Netphobia

Properly tamed and reasonably well understood, the Internet can yield just about anything you can imagine - and often some things you hadn't!

Not always for the easily offended or faint-of-heart, the 'Net can usually be persuaded to offer up its secrets if you have the correct tools and equipment for the job.