(This is originally a chapter from the book Efficient information searching on the web.)
Search engines are the most widely used type of search service. Seeking information via Google has even become a verb: to google. To google has come to mean searching through any search engine, not just Google. The largest search engines are:
- Google (www.google.com)
- Yahoo! (http://search.yahoo.com)
- Bing (www.bing.com), formerly Live Search and MSN
What really is a search engine?
A search engine is a search service which automatically indexes Web pages and, to some extent, other file types. Search engines are computer programs which read Web pages on the Internet and store these in a database. When you make a search, the search engine searches its database for relevant material which it then presents to you.
Entering one or a couple of search words in Google's search box is not that difficult, so what is the point of trying to understand how a search engine works? If you want to get a bit more out of your searches, search more efficiently or get more exact hits, you need an understanding of how a search engine works. A deeper knowledge of search engines also tells you what you can expect from them and how different they can be, both technically and in practice. It is not at all a given that they index the same things, or that they rank results or present sponsored links in the same way. When you realize the weaknesses of the search engines, you will understand the strengths of the other search services. Last but not least, you need some knowledge of how search engines work to understand what the Invisible Web is: all of that which is not included by the search engines.
How does a search engine work?
A large search engine like Google is a complex system consisting of thousands of computers. To get a good idea of how a search engine works, you can divide it into three parts:
- The Spider – finds and collects Web pages
- The Indexer – indexes and stores Web pages
- The Query Processor – delivers search results
The spider follows and collects links. Before you can find information through a search engine, the search engine itself has to find it. The spider is a small program that is constantly “on the net” finding new pages for the search engine to index, that is, to store in the database. The spider works more or less in the same way as you and I do when we surf: it follows the links it finds, and the Web pages are downloaded to the search engine (we do the same thing in our Web browser). The addresses of the newly downloaded Web pages are sent to the indexer, where they are placed in a queue for indexing. The spider not only finds new Web pages; it also regularly revisits the pages that already exist in the search engine's index in order to keep the database updated.
Two things distinguish the spider from you and your Web browser. The spider can follow links backwards, that is, it can follow links that lead to a page, not only links that lead from it. The spider can also download thousands of pages at once, not only one at a time as we have to do.
How do the search engines find the Web pages? A new search engine is “fed” an address list of existing Web sites. The spider then requests the Web pages in the list and sends them on to the indexer, while at the same time looking for links to other Web pages. Any links to other pages that it finds are added to the spider's visit list. On some search engines it is also possible to register the address of one's own Web site so that the search engine will find and index it. For revisits to already indexed pages, the spider's point of departure is the addresses that exist in the index.
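As a rough sketch, the spider's routine of working through a visit list and queueing newly found links can be modelled like this. The toy link data and addresses are invented for illustration; a real spider fetches pages over HTTP and parses HTML for links:

```python
from collections import deque

# A toy web: each address maps to the addresses it links to.
# This stands in for real HTTP fetching and HTML link extraction.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seed_list):
    """Breadth-first crawl starting from the seed address list."""
    frontier = deque(seed_list)   # the spider's visit list
    visited = set()
    fetched = []                  # pages handed on to the indexer, in order
    while frontier:
        address = frontier.popleft()
        if address in visited:
            continue
        visited.add(address)
        fetched.append(address)
        for link in TOY_WEB.get(address, []):
            if link not in visited:
                frontier.append(link)   # newly found links join the visit list
    return fetched

print(crawl(["a.example"]))  # ['a.example', 'b.example', 'c.example', 'd.example']
```

Note how the spider reaches d.example even though it was not in the seed list: it was discovered through a link on c.example, just as the chapter describes.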
The spider is also called a robot, bot, Web crawler or simply crawler.
The indexer is the part that receives the Web pages, breaks them down and stores them in the index (the database). The index records which Web page each word can be found on and where on the page it occurs; this information is stored regardless of whether the word appears in ordinary text, headings, image captions (the so-called ALT tag) or links. A search engine's index is constructed in the same way as the index at the very end of a book: each word refers to the pages where it can be found.
The size of the index is an important competitive factor between the different search engines; bigger is often better. At the time of writing, the major search engines' indexes contain over 20 billion indexed pages, probably over 100 billion. A couple of years ago the search engines declared their index sizes on their home pages; now they only occasionally mention a number in a press release. The figures are also hard to compare, because the search engines count in different ways.
Certain search engines, among them Google, store a copy of each indexed Web page. It is this copy you see if you choose the Cached link in Google's hit list. If the real Web page is inaccessible for the time being, the cached page might be a way out, but it may be far from up to date: several months might have passed since the search engine last indexed the page. Sometimes the cached page carries a date, and then you know when the search engine last visited the page.
The Query Processor
The query processor is the part that you as a searcher meet. It processes the query (your search words), calculates relevance and presents the results, that is, it creates a hit list for exactly your query. It is this part of the search engine that differs most between the various search engines. Each search engine has a unique (and secret) way of determining relevance and creating hit lists. Some people claim that this is why Google has become so popular: users feel that they get good hit lists in response to their searches. At the same time, Google was the first to use a clean, advertisement-free search page (a fact that is also used to explain Google's popularity).
Fig. The structure of search engines
All three parts of the search engine may be programmed in different ways. Examples:
The spider prioritizes certain links above others when it collects links.
The indexer does not store all the information on a Web page, perhaps only the first 100 lines or a certain number of kilobytes. Google earlier indexed 101 kB per page and Yahoo! 500 kB.
The way the query processor calculates the relevance is a determining difference between the search engines.
Since the owner makes all these choices when programming the search engine, each search engine becomes unique, with its own behaviour, strengths and weaknesses. If you use two or three search engines regularly, you will get to know them and be able to exploit their strengths, and you will realize how different they are.
Examples of rules for the collection and indexing of pages:
The search engine revisits pages according to a certain schedule. Popular pages are visited daily, while obscure and odd pages may never be revisited.
Certain types of pages are difficult or impossible to “crawl”, for example pages with frames or complicated coding (today all large search engines index frames).
A Webmaster can prevent the spider from visiting or indexing certain defined pages (via robots.txt or the robots meta tag).
Certain search engines remove pages from the database, among other things due to lack of room.
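The robots.txt mechanism mentioned above can be tried out with Python's standard library; below is a small sketch with an invented robots.txt that blocks spiders from a /private/ directory:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking all spiders from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved spider asks before fetching each address:
print(parser.can_fetch("MyBot", "http://www.example.com/index.html"))      # True
print(parser.can_fetch("MyBot", "http://www.example.com/private/x.html"))  # False
```

Pages the rules disallow are simply never fetched, and therefore never reach the indexer, which is one of the reasons a page can be missing from a search engine's index.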
You can analyse how a Web page works in relation to a search engine spider: the search engine Seekport (www.seekport.co.uk) offers a service, Seekbot, with which you can analyse how a specific page works in relation to the search engine's spider.
The inverted index of the search engines
The search engines store the indexed Web pages in an inverted index. The index contains references to the pages on which each word can be found.
Example of inverted index
The search engine has indexed 32 pages. The word amazon is found on pages 2, 7, 11, 12 and 30; volvo is found on pages 4, 8, 12, 15 and 24.
The index looks more or less like this:
amazon 2, 7, 11, 12, 30
volvo 4, 8, 12, 15, 24
A search for:
amazon gives the pages: 2, 7, 11, 12, 30
volvo gives the pages: 4, 8, 12, 15, 24
amazon OR volvo gives the pages: 2, 4, 7, 8, 11, 12, 15, 24, 30
amazon AND volvo gives the page: 12
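The example can be expressed directly in code: OR corresponds to a set union and AND to a set intersection over the inverted index. This is a minimal sketch, not any real engine's implementation:

```python
# The inverted index from the example: word -> set of page numbers.
index = {
    "amazon": {2, 7, 11, 12, 30},
    "volvo": {4, 8, 12, 15, 24},
}

def search_or(*words):
    """Pages containing at least one of the words (set union)."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return sorted(result)

def search_and(*words):
    """Pages containing all of the words (set intersection)."""
    sets = [index.get(word, set()) for word in words]
    result = set.intersection(*sets) if sets else set()
    return sorted(result)

print(search_or("amazon", "volvo"))   # [2, 4, 7, 8, 11, 12, 15, 24, 30]
print(search_and("amazon", "volvo"))  # [12]
```

This is also why the AND search gives so few hits: only pages present in every word's list survive the intersection.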
The concepts OR and AND are described further in Chapter 8 which deals with search techniques.
How relevance/ranking is determined
The search engines’ ranking is determined in a similar way, but each one has its own secret formula. A large number of factors are balanced against each other in a way that is unique to the search engine and from this balancing a hit list is created. Most of the factors concern the search words, as for example how many times the word can be found on the different pages. The search engines also look at the number of people who link to the respective page and from where the links are coming. Common factors are:
Number of occurrences (word frequency)
If the search word occurs 30 times in a document, this document probably deals more with the subject in demand than a document in which the word is only mentioned once.
Occurrences/document size (word density)
If the search word occurs 20 times in a document of 1,000 words, that document probably deals more with the subject than one where the search word occurs 20 times in a document of 10,000 words.
Overall rareness of the search word (inverse document frequency)
This is a way of measuring the importance of a word. The fewer times a word occurs in the database (the document collection), the more important or unique is the word.
Distance between the search words (proximity)
In searches with several words, a shorter distance between the search words in the document is generally a sign of higher relevance.
Where the search word is found
If a word is found in the document title or file name, the document likely deals with the subject to a greater extent than a document with a different name. The same applies if the word is found in headings, the first paragraph, captions, etc. Sometimes the word is only found in links to the page, not on the page itself, which can be confusing when you do not find your search word on the pages in the hit list.
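The frequency, density and rareness factors described above are often combined into a tf-idf-style score. The sketch below uses an invented three-page collection and is no real engine's formula, only an illustration of how the factors interact:

```python
import math

# Toy document collection: page number -> list of words on the page.
pages = {
    1: ["amazon", "river", "amazon", "rainforest"],
    2: ["volvo", "car", "engine", "volvo", "volvo"],
    3: ["amazon", "volvo", "car"],
}

def score(word, page_id):
    """tf-idf-style score: word density weighted by overall rareness."""
    words = pages[page_id]
    tf = words.count(word) / len(words)          # occurrences / document size
    containing = sum(1 for ws in pages.values() if word in ws)
    idf = math.log(len(pages) / containing)      # rareness across the collection
    return tf * idf

# Pages mentioning "amazon" more often, relative to their size, score higher.
ranked = sorted(
    (p for p in pages if "amazon" in pages[p]),
    key=lambda p: score("amazon", p),
    reverse=True,
)
print(ranked)  # [1, 3]
```

Page 1 wins because half of its words are the search word, while page 3 mentions it only once; a word found on every page would get an idf of zero and contribute nothing, which is exactly the "rareness" idea above.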
How many users follow the different links? Link popularity is not something the search engines talk about today, partly because development is moving towards more “scientific” calculation methods. Link popularity used to be a more important factor in the relevance calculation, but all search engines probably still take it into some consideration.
What are the “characteristics” of the link? Where does it come from? How is the link described in the link text? One example is Google's PageRank (see below), but all large search engines analyse links.
PageRank (PR) is the name of Google’s link analysis. When Google was launched in 1998 they were the first to use a more advanced link analysis. Today all large search engines operate in a similar way.
The basis of PageRank is:
- The quality of a Web page can be assessed by the number of links to the page.
- Incoming links to a page are more important than outgoing links from the page (a sort of citation analysis).
With the two points above as starting points, a calculation system has been created to obtain the weight of a Web page, that is, to measure how important it is in relation to the rest of the Web.
Google describes the method on their Web page:
“PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.
PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page’s importance.”
In practice this means that a link from a big and famous Web site (with a high PR) is more important than many links from unknown pages (with a low PR). Link analysis and PageRank improvement is one of the fields that so-called search engine optimisation consultants work with in order to improve a Web page’s placement on the hit list (see further in Chapter 10).
In the Google Toolbar, PR is shown as a number between 0 and 10 for the Web pages you visit. In Google Directory, PR is shown in the form of a small meter (www.google.com/dirhp). The PR value is actually logarithmic (like the Richter scale).
Ranking in Google
But in a search, PR is merely one of the many variables that are weighed against each other before the placement on the hit list is determined. The weighing is done by the search engine's secret algorithm. The calculation is secret because each search engine wants to offer the most relevant hit list to its users (and gain market share). The algorithm is also secret to prevent outsiders from manipulating their own Web sites into a high placement on the hit list: if you understood exactly how, for example, Google calculates its ranking, you could design a Web site to exploit that knowledge and game the relevance calculation.
Fig. Illustration of Google’s ranking.
The figure of Google’s ranking illustrates the many variables that are balanced against each other in the secret algorithm. PageRank is only one of many variables, but only Google knows how important it is in practice.
The weighing of the different parts by the secret algorithm changes somewhat all the time, partly for the reasons mentioned above, partly as a result of improvements and the effort to keep it secret.
Google gives the following steps to describe a search in its search engine:
- Search in the index for all the pages that contain the search words.
- Relevance analysis of the found pages, by examining where and how often the search words occur.
- Assessment of the reputation of the Web sites where the found pages reside, that is, an analysis of who links to each site and from where, to obtain a measure of applicability/popularity. Google calls this assessment PageRank (PR).
- The Web pages are ranked by combining relevance (step 2) and reputation (step 3), after which Google produces a hit list based on the calculated applicability.
The process is the same in all search engines, so the description above could be said to apply to any of them.
Myths about search engines
There are many myths about search engines. The most common are:
The search engines search on the net
The search engines do not search the net; they search the database (index) they have created from Web pages. The size of a search engine's index is important, because Web pages that are not in the index will never show up in the hit lists.
All search engines are the same
They vary substantially in many respects:
- The size (how large the index is).
- Updating of the index.
- Coverage of the Web.
- They have different “personalities”, strengths/weaknesses and advantages/disadvantages.
- How they process search inquiries.
- The presentation of the results (the ranking).
All in all, this means that the search engines are not alike. When searching for a specific Web site, e.g. the government's site, the differences will not be noticed, since all of them will probably place the link to this Web site at the very top of the hit list. But for most other searches you gain from searching in several different search engines.
The indexes of the search engines are updated
The indexes of the search engines are never entirely up to date, for various reasons:
- Crawling the Web is expensive as computer power is needed.
- More popular pages are re-indexed more frequently just because they get visits more often, while less effort is given to less popular Web sites.
- An older version of a page is in the index, but the link in the search engine leads to new content, since the page has been altered or replaced since the indexing.
If you choose Cached in Google or Cached page in Bing, it will tell you when the stored page was last visited.
The coverage of the different search engines overlap
How big the overlap between the large search engines is remains a matter of discussion. Generally, the following applies:
- The overlap of the large search engines is much greater when it comes to much frequented, popular Web pages than obscure, seldom visited pages.
- The overlap in the index is larger than the overlap in the hit lists as the search engines calculate relevance in different ways.
If you search for popular subjects you don’t have to think so much about the search engine’s coverage, but if you search for unusual subjects you should use several search engines for the search.
The index of the search engines is extensive
- The large search engines cover 5-10 % of the visible Web (no one knows for sure and it depends on how you count).
- The search engines can’t keep up with the explosive growth of the Web.
- The search engines can’t find all pages on the Web (some pages don’t have any links leading to or from the page).
- Each search engine has its own rules that it follows for collecting and indexing pages.
The size of the index is a piece of information kept within the walls of the search-engine companies. Occasionally a search engine will make a statement saying that they have now indexed x billion pages. Google, Bing and Yahoo! have more than 20 billion, maybe more than 100 billion, pages in their indexes at the time of writing (May 2009).
When should you use a search engine?
Some guidelines for when to use a search engine are listed below. In Chapter 5 you’ll find more advice on the choice of search service.
- When you are searching for an exclusive or odd subject.
- When you are searching for a specific Web site.
- When you want to find particular document types, for example PDF files.
- When you want to search the full text of many pages.
- When you want to find many documents in your subject.
Specialized search engines
Specialized search engines have their own indexes but are focused in different ways; that is, they do not index “everything” as the large search engines do. The limitations make these search engines more focused and give them the capacity to index more material within their niche. Specialized search engines are sometimes called vertical search engines.
The subject orientation may be everything from science to recipes. One example is Scirus (www.scirus.com) which indexes scientific material.
The focus may be blogs or news sites. Google blog search (http://blogsearch.google.com) is one example of a blog search engine.
Search engines can be limited to particular file types such as PDF files or MP3. One example is the Adobe PDF search (http://searchpdf.adobe.com).
See further in search service collections, e.g.:
- Search Engine Colossus (www.searchenginecolossus.com)
- Beaucoup (www.beaucoup.com)
- Google Directory: Computers > Internet > Searching > Search Engines > Specialized (http://directory.google.com)
Compare hit lists
Thumbshots.com has a useful service which compares the first hundred hits for different searches in a few search engines – Thumbshots Ranking (http://ranking.thumbshots.com).
The significance of the order of the search words in a Google search is shown below. Reversing the two search words produces different hit lists. Each marked dot in the two lines is a hit found in both hit lists. The lines between the marked dots show where the corresponding hit appears in the other hit list. Unmarked dots are hits unique to one hit list.
Fig. Comparison between ”boeing 747” and ”747 boeing” on Google
In the search above, barely 7 of the 60 hits are unique, i.e. the overlap is 88 per cent. Among the first ten hits the ranking corresponds fairly well, but further down the hit lists the differences become bigger. Hit 1 is unique in each of the two searches, which is remarkable.
Fig. Search for the “Invisible Web” on Google and Live (MSN).
The search engines rank hits in different ways. Above is a search for the “Invisible Web” on Google and Live (MSN). Only 42 per cent of the first sixty hits in the respective hit lists overlap between the two search engines.
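Overlap figures like these are simple set calculations over the two hit lists; a sketch with invented hit lists of equal length:

```python
def overlap_percent(hits_a, hits_b):
    """Share of hits appearing in both lists, relative to list length."""
    shared = set(hits_a) & set(hits_b)
    return round(100 * len(shared) / len(hits_a))

# Hypothetical top-five hit lists from two different engines:
engine_a = ["p1", "p2", "p3", "p4", "p5"]
engine_b = ["p2", "p9", "p3", "p8", "p1"]
print(overlap_percent(engine_a, engine_b))  # 60
```

Note that the calculation ignores ranking order: two lists can overlap completely and still present the shared hits in a very different order, which is what the dot diagrams above visualize.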
Precision and recall
There are two standard measures of how efficient a search or a search service is: recall and precision. In a given collection of documents, a certain number are relevant to a specific query, but those exact documents are far from always the ones retrieved.
Each search returns a number of retrieved documents, some of which are relevant, i.e. the overlap between relevant and retrieved documents in the figure.
Precision refers to the number of overlapping documents divided by the number of retrieved documents. The precision should preferably be high as the retrieved documents in this case contain few irrelevant documents.
Recall refers to the number of overlapping documents divided by the number of relevant documents. When the recall is high, a large number of the relevant documents have been retrieved.
Preferably both precision and recall should be high, since the search engine then delivers practically only relevant hits and, at the same time, all the relevant hits in its index. But in practice, precision and recall trade off against each other to a certain extent: if the number of hits in a search increases (higher recall), the number of irrelevant hits tends to increase as well (lower precision).
Fig. Document quantity with relevant documents, retrieved documents and overlapping.
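Both measures follow directly from the three sets in the figure; a minimal sketch with invented document numbers:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlapping = retrieved & relevant
    precision = len(overlapping) / len(retrieved)  # share of hits that are relevant
    recall = len(overlapping) / len(relevant)      # share of relevant docs found
    return precision, recall

# 8 documents retrieved, 10 relevant in the collection, 6 in the overlap:
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
relevant = {1, 2, 3, 4, 5, 6, 11, 12, 13, 14}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.75 0.6
```

Here three quarters of the hit list is relevant (precision 0.75), but four relevant documents were never retrieved, so recall is only 0.6, which illustrates the trade-off described above.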
As a measure, recall is perhaps not entirely relevant for Web searches, since most users only look at the first hits in the hit list. The fact that a search gives eleven thousand hits may not be interesting; on the contrary, it would probably stress us as searchers to know that every hit in a Google search was relevant.
When searching the Web, precision is much more important, particularly for the first hits on the first pages of the hit list, the pages one actually looks at.