8. Search Technique

(This is originally a chapter from the book Efficient information searching on the web.)

Good and bad search words

When you use search engines, distilling good search words is often the key to a successful search. Good search words describe the important parts of the query.

Good search words are unique or uncommon. Common words generate a large number of hits, but they can be combined with more uncommon words in a search query to give those words the right angle.

All words should be correctly spelled. If you have trouble with the spelling, just relax: the major search engines suggest spelling alternatives. If a word has different spellings, try searching on each one in Google or some other search engine to see which version generates the most hits, and then use that spelling. Or use both variants in your search strategy. Misspelt words only generate hits that are also misspelt.

Bad search words are ambiguous words, words with several meanings like break or light. But like common words, they are useful in combination with other words, because the words define the context of each other. For example, light together with sun means something completely different from light together with calories.

Small, common words like “and” aren’t always indexed by the search engines because they are too common. These words are called stop words and are often excluded from the search when entered in a search engine. But stop words can be used in phrases, so they are often indexed in some way.
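The treatment of stop words can be pictured with a small sketch. The stop-word list and the function name are assumptions made for illustration; every search engine keeps its own list and rules. Quoted phrases are left whole, which mirrors the remark that stop words still count inside phrases.

```python
import re

# Hypothetical stop-word list; real search engines use their own.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def strip_stop_words(query):
    """Drop stop words from a query but keep quoted phrases untouched."""
    kept = []
    # A token is either a "quoted phrase" or a run of non-space characters.
    for token in re.findall(r'"[^"]+"|\S+', query):
        if token.startswith('"'):
            kept.append(token)               # phrase: stop words stay
        elif token.lower() not in STOP_WORDS:
            kept.append(token)               # ordinary word that is not a stop word
    return kept

print(strip_stop_words('the history of "the invisible web"'))
# ['history', '"the invisible web"']
```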

Describe the question/query with synonyms and with more concepts. This won’t limit the query; it will just focus it.

The best words first

Place the best words first when you search in a search engine, since certain search services give the first terms more weight. Place related words next to each other as well; sometimes this affects the relevance estimation. At times it may pay off to change the word order in the search query.

Think like a Web page

Don’t select search words that represent your subject – instead select the search words that you think will show up on the pages that are relevant to your search.

Assess and do a new search

Assess the search string quickly. If no. 1-10 in the hit list aren’t relevant, then why should no. 25 be relevant? Change the search words and do a new search instead of looking over all the first 30 hits.

Remove undesired aspects

If a search word has several meanings and the search result is imprecise you can remove undesired aspects. On the search engines’ pages for advanced searches you can often select “without these words” or something similar, which means that the word may not occur on the pages in the hit list. Alternatively you can enter a minus sign before the search word in the usual search box, e.g., -saab.

Search word suggestions

Yahoo! gives suggestions for searches when you start writing words into the search box. This helps you refine your search in real time. The Web browser Firefox’s search box also provides search word suggestions.

Fig. Yahoo! gives search word suggestions.

Phrase searching

A phrase is a group of words that have to stand next to each other in a certain order. Most search engines use quotation marks (" ") to mark a phrase, e.g., "the invisible web". Phrase searching is used to make a search more specific.

Phrase searching is particularly important in languages that write a concept as several separate words rather than as one compound word. Compare information seeking behaviour with the Swedish informationssökningsbeteende (the three words have become one compound word).

Be careful with searching on phrases. Phrase searching may be a good way to define searches but should only be used for words that normally appear next to each other, e.g., “association of information specialists” (name of an association). Even proper names may be problematic. A search on “george bush” would miss references to George W Bush.
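The George W Bush problem can be shown in a few lines. The sketch below is only an illustration, not how any search engine is implemented; the function names and the sample sentence are made up.

```python
import re

def words(text):
    """Lower-case a text and split it into words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())

def contains_all(text, query):
    """AND-style match: every query word occurs somewhere in the text."""
    text_words = set(words(text))
    return all(w in text_words for w in words(query))

def contains_phrase(text, phrase):
    """Phrase match: the query words occur next to each other, in order."""
    t, p = words(text), words(phrase)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

page = "President George W Bush spoke today."
print(contains_all(page, "george bush"))     # True
print(contains_phrase(page, "george bush"))  # False, the W is in the way
```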

Simple and advanced searching

Most search engines and databases have something that they call simple searching on their start page. You fill in your search words in a search box. You can also enter the search service’s special syntax in the search box, i.e., commands which specify and delimit the search. But you reach most of the special functions of the search services more easily on the page for advanced searching. Simple searching has its advantages and is efficient when you know what you’re looking for, e.g., a specific Web site.

Almost all search engines and databases have a form containing more choices than the simple search. It may be called advanced searching, expert searching or enhanced searching. But you don’t need to be an expert to use the advanced search. On the contrary, the advanced search often lays out many of the search service’s possibilities in a simple manner, which gives you clues to how you can improve your search.

On the search engines’ pages for advanced searching there are often forms for searches with Boolean logic (described in the next section). Google’s page is shown below, but the pages of the other search engines are rather similar.

Fig. Page for advanced search in Google.

On Google’s page, under the heading “Find Web pages that have…”, you get the following possibilities:

  • With all these words – an AND-search, AND is placed between the words that are entered here.
  • With the exact phrase – the words that are entered are handled as a phrase, quotation marks are not needed.
  • With one or more of these words – an OR-search, OR is placed between the entered words.
  • Under that there is yet another possibility, “But don’t show pages that have… any of these unwanted words” – a NOT-search, the word which is entered is not found among the hits.

The search possibilities are easy to combine, but be careful. If all the possibilities are used it’s easy to do very specific searches, perhaps so specific that the hit list will only consist of a single hit.

Boolean logic

The Internet can be seen as a big database and searches for content must, therefore, follow the rules for computer-based data mining. Computers work with ones and zeros, yes and no. When searching with Boolean logic you set up conditions that have to be met in the search. All search engines use Boolean logic in the search query formulation, but in somewhat different ways. Make sure that you understand the different operators: AND, OR and NOT.

Search engines generally have a preset Boolean operator. This means that the space between the entered search words either means OR or AND. Nowadays it’s normally AND that is preset, but look in the help texts to be on the safe side. In the infancy of the search engines OR was common as the preselection in order for the searcher to get more hits.

There are three basic operators in Boolean logic: AND, OR and NOT. They are often written in English. By means of the operators the search words are combined to execute more specific searches.

Fig. The three Boolean operators

In the figure above the darker fields represent that which is retrieved through the different searches.
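The three fields in the figure correspond to three set operations. The sketch below uses made-up document numbers, so the two sets are pure assumptions; only the operations themselves matter.

```python
# Made-up document ids for two search words A and B.
hits_a = {1, 2, 3, 4}
hits_b = {3, 4, 5}

print(hits_a & hits_b)  # A AND B: documents containing both words
print(hits_a | hits_b)  # A OR B: documents containing at least one of the words
print(hits_a - hits_b)  # A NOT B: documents containing A but not B
```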

AND

Search query: I’m interested in the relationship between work and stress.

An AND-search requires that both search words (or all of them, if there are several) are found in the document for it to be on the hit list.

A AND B – When using AND the hit must contain both word A and word B.

Example in Google:

work: 1.56 billion hits
stress: 169 million hits
work AND stress: 40.5 million hits

The more words you combine in an AND-search, the fewer documents are retrieved, as each of the documents has to contain all of the search words. By adding yet another word in the AND-search the search is further defined.

work AND stress AND ”sick leave”: 70 500 hits

OR

A OR B – When using OR the hit must contain at least one of the words A and B.

Search query: I want information about universities and university colleges.

In Sweden both university college and university are used as terms for higher education. The search needs to result in hits for at least one of the concepts and therefore OR is used.

Example in Yahoo!

"corvus frugilegus" (rook) 241 000 hits
"corvus corone" (crow) 374 000 hits
"corvus frugilegus" OR "corvus corone" 451 000 hits

A search on corvus frugilegus OR corvus corone gives hits containing corvus frugilegus or corvus corone. The overlap of the two circles represents the documents that contain both terms (and which are retrieved in the AND-search).
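The size of the overlap can be estimated from the three hit counts in the example, since the documents containing both terms are counted twice when the two single searches are added together. Hit counts in search engines are estimates, so treat the result as a rough figure only.

```python
# Hit counts quoted in the Yahoo! example.
rook = 241_000      # "corvus frugilegus"
crow = 374_000      # "corvus corone"
either = 451_000    # "corvus frugilegus" OR "corvus corone"

# |A AND B| = |A| + |B| - |A OR B|
both = rook + crow - either
print(both)  # 164000
```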

OR is primarily used for synonyms or similar concepts. The more words used in an OR-search, the more hits are retrieved. The search becomes broader.

NOT

A NOT B – When using NOT the hit must contain the word A but not the word B.

Search query: I want information about dogs, but nothing about cats.

Only documents where dog occurs, but not cat, are retrieved. If both dog and cat occur, the document won’t be part of the hit list. You should therefore be careful with this type of search; important documents or central resources may easily be excluded.

Example in Google

cat 678 million
dog 300 million
cat NOT dog, i.e. cat -dog (see below) 576 million

The search above gives hits which contain cat but which don’t contain dog, i.e., the mirror image of the search query about dogs.

Combine the operators

The operators can be used together with phrase searching and parentheses. An example:

(university OR ”university college”) AND uk NOT london

The search string gives hits containing university or university college together with uk. No hit contains london. Relevant material may of course be screened out because it contains the word london, but if we’re dealing with a large amount of hits this might not matter. The parentheses determine in which order the expressions will be searched. The operators first work within a parenthesis, and thereafter with the whole expression (or the next parenthesis if there are more).
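The order of evaluation can be followed step by step in a small sketch. The four documents are invented, and the matching is a naive substring test, which is an assumption that keeps the example short; a real search engine works on an index of words.

```python
# Tiny made-up collection: document id -> text.
docs = {
    1: "university of leeds uk",
    2: "university college london uk",
    3: "university college in bergen norway",
    4: "a university college in the north of the uk",
}

def hits(term):
    """Ids of documents containing the word or phrase (naive substring match)."""
    return {doc_id for doc_id, text in docs.items() if term in text}

# (university OR "university college") AND uk NOT london
inside_parentheses = hits("university") | hits("university college")
result = (inside_parentheses & hits("uk")) - hits("london")
print(sorted(result))  # [1, 4]
```

Document 2 is a relevant page about a university college in the UK, but it is screened out because it contains london.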

Boolean logic in practice

There are several different ways to use the Boolean operators when searching the Internet. In a search engine the operators can be used in three ways, but not all search engines support every one of them.

  • Full Boolean logic with the operators.
  • Applied Boolean logic in simple searching.
  • Predetermined search language in a form (advanced searching).

Full Boolean logic with the operators

In many search engines you can search by means of the logical operators, e.g.,

feline AND food AND senior

See Boolean logic earlier in this chapter for more examples.

Applied Boolean logic in simple searching

When you do a simple search with several words in the search engine’s simple search box it’s often an AND-search that is done. In the infancy of the search engines (and the Web) OR was preselected to increase the number of hits. Today AND is preselected in all big search engines to give better precision and to avoid a flood of hits.

You can normally use + (plus) and - (minus) in simple searching. + works like AND and can force the inclusion of so-called stop words, which the search engine would otherwise exclude. Writing + before each word is normally unnecessary, as AND is the preselected operator between the words. - is used as NOT, i.e., to exclude words from the search. E.g.:

lund -university

If you want to use OR you have to use the full Boolean logic or the form for advanced searching.
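How a search box might read the plus and minus signs can be sketched as follows. The function name is hypothetical and the rules are simplified assumptions; each search engine has its own parser.

```python
def parse_simple_query(query):
    """Split a simple query into required and excluded words."""
    required, excluded = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])      # -word works like NOT
        else:
            # A leading + is optional because AND is assumed between words.
            word = token.lstrip("+")
            if word:
                required.append(word)
    return required, excluded

print(parse_simple_query("lund -university"))
# (['lund'], ['university'])
```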

Predetermined search language in forms

Many search engines have a page called “advanced searching” or something similar. The page is often not particularly advanced, but offers the Boolean logic together with other search restrictions in a simple manner. The operators are generally described in common language. Examples from different search engines follow below:

AND
find pages with ALL these words
show results with ALL these words

OR
find pages with ANY of these words
show results with ANY of these words

NOT
find pages WITHOUT these words
show results with NONE of these words

Bing doesn’t have a general page for “advanced searching.” When you have done your search and got the hit list you can select “alternative” or “advanced” for search restrictions.

The search words – constants or variables?

The selected search words can be seen as constants or variables in the search, depending on which Boolean operator you place before the search word. AND and NOT make the search word function as a constant: the word must be present or must not be present, respectively. OR, however, makes the search word into a variable: the word or its alternative has to be present, but not necessarily both of them.

Field searching

Most Web pages consist of more than just text. (See the introduction for basic HTML.) These different parts, which are called fields, are searchable. In a search engine it can look like this:

domain:lu.se

title:”information seeking” (words in the title tag of the page)

link:www.lu.se (the pages that link to www.lu.se)

inurl:guide (the word must be in the URL)

Field searching is an important way of limiting a search in a search engine with millions of documents. Example: title:”middle ages” (in Google intitle:”middle ages”) gives more relevant hits than a search on just the Middle Ages. But many relevant pages will be missed as far from all pages about the Middle Ages have the Middle Ages in their titles.
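The trade-off can be seen in a sketch with two invented pages, where a page is reduced to a title field and a text field. The field names and the function are assumptions for illustration only.

```python
# Two made-up pages, each with a title field and a text field.
pages = [
    {"title": "The Middle Ages", "text": "An overview of medieval Europe."},
    {"title": "Castles of Europe", "text": "Most were built in the Middle Ages."},
]

def search(phrase, field=None):
    """Return titles of pages where the given field (or any field) contains the phrase."""
    phrase = phrase.lower()
    found = []
    for page in pages:
        fields = [page[field]] if field else list(page.values())
        if any(phrase in value.lower() for value in fields):
            found.append(page["title"])
    return found

print(search("middle ages"))                  # both pages
print(search("middle ages", field="title"))   # only the first page
```

The title search is more precise, but it misses the page about castles even though that page mentions the Middle Ages.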

Search in the Web address

Search in the URL to delimit the search. In the search engines you can normally search on the words in the URL (the Web address), as large parts of the URL are often meaningful.

The anatomy of a URL:

http://en.wikipedia.org/wiki/Middle_Ages

protocol: http

name of Web server: en (often www)

domain name: wikipedia

top domain name: org

directory name: wiki

file name: Middle_Ages

The last two in particular, the directory name and the file name, often contain subject words. But it’s also worthwhile to search on the domain name, for example if you’re looking for an organization or a brand.
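The same anatomy can be taken apart with Python’s standard library. The variable names follow the labels used above; note that the three-way split of the host name is an assumption that only fits addresses shaped like this one.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://en.wikipedia.org/wiki/Middle_Ages")

protocol = parts.scheme
# Works for a host name with exactly three dot-separated parts.
server, domain, top_domain = parts.hostname.split(".")
# Works for a path with exactly one directory and one file name.
directory, file_name = parts.path.strip("/").split("/")

print(protocol, server, domain, top_domain, directory, file_name)
# http en wikipedia org wiki Middle_Ages
```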

Example: url:"middle ages" or inurl:"middle ages" gives pages that contain the words Middle Ages somewhere in the Web address.

Web site search

To find a page within a Web site or domain you can search with the delimitation site:. You’ll then be searching on the domain name or top domain name, e.g., bbc.co.uk or volvo.com. In order to find information about Sweden on the BBC’s Web site, search on sweden site:bbc.co.uk, which limits the search for sweden to bbc.co.uk. For smaller top domains it may be useful to limit the search to a certain top domain, e.g., .se or .mil, but the big .com and .edu might be too big for the delimitation to be efficient, unless it’s combined with several other search words.

Example in Google:

A search on the Swedish word for sugar (socker) in Google gives about 2 million hits. But if you limit the search to the Swedish National Food Administration (you add site:slv.se) the search will only give about 400 hits. If you do a search on the same word (socker) in the search engine at the Web site of the National Food Administration you get about 300 hits, i.e., a hundred fewer than in Google.

site: works in Google, Yahoo! and Bing, among others.
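If you build search links in a script, the site: delimitation is simply part of the query text. The sketch below assumes the common address pattern https://www.google.com/search?q=…; the function name is made up, and the pattern is an assumption that may change over time.

```python
from urllib.parse import urlencode

def search_url(words, site=None):
    """Build a search address, optionally limited to one site or domain."""
    # No space is allowed between site: and the domain name.
    query = words if site is None else f"{words} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})

print(search_url("sweden", site="bbc.co.uk"))
# https://www.google.com/search?q=sweden+site%3Abbc.co.uk
```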

More search techniques

Proximity operators

If possible, use proximity operators, e.g., NEAR, instead of an AND between your search words to specify their connection. This guarantees that the words are found near each other in the document. Google automatically considers the proximity between the search words in the relevance calculation that is done to compile the hit list, but there is no possibility of specifying a distance between the words.
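What a proximity operator checks can be sketched as below. The function name and the default distance of five words are assumptions; search services that offer NEAR define the distance in their own way.

```python
import re

def near(text, word_a, word_b, max_distance=5):
    """True if word_a and word_b occur within max_distance words of each other."""
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

close = "Stress at work is common."
far = "Work is described in chapter one while chapter nine covers stress."
print(near(close, "work", "stress"))  # True
print(near(far, "work", "stress"))    # False, although an AND-search would match
```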

Always use at least two search words in your search

A search with two or three search words gives a much better result than a search with one word. A search on one word will almost always be too broad and a lot of noise will form part of the hit list. Each word that you add specifies the earlier ones. Example in Google:

phone gives 1.2 billion hits
phone number gives 128 million hits
phone number catalog gives 4.2 million hits
phone number catalog search gives 314 000 hits

The result is often considerably restricted by each added search word. In the example above roughly 90 per cent or more of the hits disappeared for each added search word.
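The size of each reduction can be worked out from the hit counts in the example. The function name is made up for the sketch.

```python
# Hit counts from the Google example.
hits = [
    ("phone", 1_200_000_000),
    ("phone number", 128_000_000),
    ("phone number catalog", 4_200_000),
    ("phone number catalog search", 314_000),
]

def reductions(counts):
    """Percentage of hits lost at each step, rounded to one decimal."""
    return [round(100 * (1 - after / before), 1)
            for before, after in zip(counts, counts[1:])]

counts = [n for _, n in hits]
print(reductions(counts))  # [89.3, 96.7, 92.5]
```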

Find other file types

If you search for special file types you have two alternatives. You can either use a search engine that is specialized in that particular file type, or you can search in a regular search engine. In a regular search engine you can restrict the search to only one file type. On the advanced search page there are often possibilities to choose the file type in a simple manner. But the search engine must have indexed the file type for the file to be found. When you search on the file extension, links to files of the desired type will also be part of the hit list, as a link to the file from a Web page is enough for inclusion.

An example of the usefulness of ”similar pages” in Google

I was searching for a record that I had owned several years ago, a reggae collection with early ska and rocksteady. The record was released by a classic company, don’t remember which one, and was the first in a series. Thought of a likely record company: Trojan Records. Checked their Web site but no luck. Not at Amazon music under reggae, ska or rocksteady either. Came to think of the Google function of “similar pages”. Entered the Trojan Records URL in Google and searched. When Trojan Records came up in the hit list I chose “similar pages”. And hey, first in the hit list was “Soul Jazz Records”, which sounded familiar. At the Soul Jazz Records Web site I found the record I was looking for: 100 % Dynamite.

Quick guide to Google

Basic examples

Basic examples           Finds pages which contain…
travel train             the words travel and train
cambridge OR oxford      either the word cambridge or the word oxford (or both words)
"we are the world"       the exact phrase we are the world
spider -"search engine"  the word spider but not the phrase search engine
google ~guide            the word google and the word guide with its synonyms, e.g. tips and help
define:firewall          definitions of the word firewall from the Web
george * bush            the words george and bush separated by exactly one word (gives, e.g., George W Bush or George Walker Bush)

Calculator

Calculator              Means           Enter into the search box
+                       addition        13 + 8
-                       subtraction     21 - 8
*                       multiplication  13 * 8
/                       division        8 / 3
% of                    per cent of     75% of 755
^ or **                 raised to       2^6 or 2**6
old units in new units  convert units   30 euros in SEK or 30 feet in m

Restrict the search

Restrict the search  Means                                         Enter into the search box (and result)
site:                Only searches within one Web site or domain.  linux site:www.kth.se (searches for linux at KTH, the Swedish abbreviation for the Royal Institute of Technology); linux site:.se (searches for linux in the se-domain)
filetype: (or ext:)  Only searches among the indicated file type.  population statistics filetype:pdf (PDF files with the words population statistics)
safesearch:          Excludes adult material.                      safesearch: sex (searches for sex without presenting e.g. porn)

Alternative query types

Alternative query types  Means                                                     Enter into the search box (and result)
cache:                   Shows Google’s saved version of the Web page.             cache:www.kb.se (shows Google’s saved version of the National Library of Sweden’s first page)
info:                    Shows info about the Web page.                            info:www.volvo.com (shows the link’s hit list text and several choices, among others cache: and related:)
link:                    Finds linked pages, i.e., pages which link to the URL.    link:www.interpol.int (pages which link to www.interpol.int)
related:                 Lists Web pages which are similar or related to the URL.  related:www.fbi.gov (lists Web pages which are similar or related to the FBI Web page)

Restrictions

Restrictions  Means                                                             Enter into the search box (and result)
allinanchor:  All search words must be in the link text on the page.            allinanchor:invisible web (pages which have links with invisible web in the link text)
inanchor:     The search word after inanchor: must be in a link text.           used cars inanchor:cheap (pages with the words used cars in the text and cheap in the link text)
allintext:    All search words must be in the text on the page.                 allintext:recipe banana curry chicken (pages where the text contains the words recipe, banana, curry and chicken)
intext:       The search word after intext: must be in the text on the page.    brightplanet intext:"deep web" (pages with the word brightplanet and with the phrase deep web in the text)
allintitle:   All search words must be in the title of the page.                allintitle:deep web (pages that have the words deep and web in the title)
intitle:      The search word after intitle: must be in the title of the page.  film cinema intitle:top list (pages with the words film and cinema and with top list in the title)
allinurl:     All search words must be in the URL of the page.                  allinurl:google faq (pages with the words google and faq in the URL)
inurl:        The search word after inurl: must be in the URL of the page.      information seeking inurl:guide (pages with the words information seeking and with guide in the URL)
