(This is originally a chapter from the book Efficient Information Searching on the Web.)
The Invisible Web is hard to define, and several overlapping terms are used for it. The term Invisible Web was coined by a researcher in 1994 for the parts of the Web that were not visible to the search engines. The Hidden Web is used as a synonym. Some consider Invisible Web a misleading term, since nothing is truly invisible, only hidden. The company BrightPlanet introduced the term Deep Web in 2000 to put the focus on techniques that make the information in databases visible.
The relevance of the Invisible Web
The Invisible Web is becoming increasingly relevant for several reasons. Both the number of Web users and the share of them who rely on search engines are growing – more people than ever are affected by the Invisible Web. Awareness of the Invisible Web has grown among both searchers and information producers, and more information sources have become visible. Most people who search for information use search services, so the information that can't be found in the search services can be thought of as invisible: it's unfindable by the search services.
At the same time, the Web is growing faster than the search engines' indexes – the growth of the Net means that the largest part of the information remains invisible. Information sources other than pure Web pages are being added at a rapid pace: sound, images, video, podcasts, news, e-books, e-journals, discussion lists, blogs, RSS feeds, wikis. The search engines only include a few file types other than HTML, principally those with text-based information.
The search engines' picture of the Web is, on average, about a month old: revisiting Web pages takes time and resources, and the average revisit interval is estimated at one month.
More and better hybrid search services that search the Deep Web are being introduced. So far these are principally technical solutions sold for use within companies, but the traditional search engines, too, are including more and more material from the Deep Web.
A number of reasons why the Invisible Web matters to everyone who searches for information on the Net have been presented here. Really efficient searching requires an awareness of the Invisible Web.
What is the Invisible Web?
A short definition would be: everything which the search engines can't see – that is, all the information which the search engines can't include in their indexes and thereby make searchable. But it's more complex than that. The concept of the Invisible Web takes the search engines as its starting point, not us as information seekers or the Web as a whole.
What the Invisible Web actually is would be something that shifts constantly depending on:
Which search engine it is
Since the Invisible Web is search-engine dependent, it is different for each search engine.
What the search engine indexes
The search engines have somewhat different coverage, i.e., they index somewhat different parts of the Web.
How the search engine does the indexing
A search engine won't always include the whole text or all the images on a Web page. Whatever is excluded is invisible in searches.
How often the search engine re-indexes its pages
Popular Web pages are generally revisited more often by the search engines, but the policy is up to each search engine.
How big an index the search engine has
The number of Web pages that the search engines have indexed differs greatly. A smaller index means that more is invisible.
What file types the search engine indexes
In addition to Web pages, search engines may index other file types such as PDF, DOC or Flash. But the search engines are restrictive about which file types they index and how large a part of each file they include in their index.
The rapid growth of the Web
The Web is constantly growing, and at a rapid pace. The search engines can't keep up while also keeping their indexes fresh, and it can take weeks for a new Web page to be indexed.
What does the Invisible Web contain?
The Invisible Web contains texts, files and other information which is not indexed by the search engines, either for technical reasons or because of selections that the search engines make.
Many databases are accessible via the Web. Their contents can in most cases be regarded as forming part of the Invisible Web.
In databases which are reachable via the Web, but which most often are invisible to the search engines, you find:
- telephone numbers
- definitions of words
- things for sale in Web stores or Web auctions
- product information
- digital exhibitions and galleries
- graphics and sound files
- new and variable information
- job ads
- available airplane tickets, hotel rooms etc.
- stock-exchange rates, prices for bonds, currency rates etc.
- essays and degree theses
Reasons for invisibility
The Invisible Web consists of documents on the Web which the large search engines can't or don't want to index, for different reasons. Some reasons are technical – many things are hard or impossible to index – and some are financial. Each indexed page requires a little of the search engine's computing capacity, and perhaps a certain Web site costs too much to index in proportion to how interesting it is to the search engine's users. Space costs money. There are also strategic reasons: to attract more searchers and not lose the ones you already have.
Different technical reasons
- Unlinked pages – no link for the search engine spider to follow to the page. The page turns into an island on the Web.
- Pages that principally consist of images, sound or video – if there isn't enough text for the search engine to "understand" what the page is about, the page is ignored.
- Real-time information – short-lived data in enormous quantities that changes rapidly, like stock-exchange rates or radio and TV broadcasts.
- Contents of relational databases – the spider can't fill in the fields or otherwise make the choices that are required for the database to return information.
- Dynamically generated contents – adapted contents are irrelevant to most searchers. The search engines also fear "spider traps", places where the spider can get caught without being able to move on.
What can’t or won’t the search engines index?
- Not entire documents, if they are large. The search engines generally have an upper limit where they stop indexing a document. Google's limit was earlier 101 kB, which explains why cached pages can end abruptly.
- Not all so-called stop words, such as to and and, which can be excluded to save space. Since more and more search engines save complete pages, previously excluded stop words are now often included.
- Not all metadata. E.g., not the information in the head of the page, between <head> … </head>, which earlier was used, among other things, to plant misleading keywords in the meta tags with the purpose of manipulating the search engines.
- Not all directories at a Web site. The search engines often limit how deep in the directory structure they will index (e.g., not all the way down to www.omis/filer/dokument/2005/december/lista.html).
Advantages of the Invisible Web
In the Invisible Web there are no personal homepages and no commercial advertising (yet). The contents are, to a great extent, produced by institutions, often with a specific purpose. The broad, general information is published not here but on visible Web sites. This saves time for the searcher (if you know what you're looking for), and the risk of using "bad" information becomes smaller. You're also spared sales pitches and commercials, which can take time and generate a lot of noise (if you're not looking for products or services).
Different types of Invisible Web
The Invisible Web can be divided into different parts:
- The poorly ranked Web
- The un-indexed Web
- The private Web
- The protected Web
- The really invisible Web
- The fresh Web
- The disappearing Web
- The non-existent Web
The poorly ranked Web
Much information in the search engines is practically invisible because it's ranked low in the search engines' hit lists; the information can be said to be buried far down in the hit list.
- Unusual file types – everything that's not a normal Web page often ends up far down in the hit list.
- Limitations in the number of hits shown. All search engines limit the number of hits they display; you can often see only a couple of hundred hits, even if the search has generated 3 million.
- Fresh Web pages that few others link to (few inlinks) are ranked lower than pages linked to by important Web sites or by many (many/important inlinks).
How you search the poorly ranked Web
- Change search words or use synonyms – substitute search words or use OR. In Google the tilde (~) searches for synonyms.
- Change the order of the search words. The search words are weighted differently when the relevance is estimated.
- Double the most important search word to increase its weight in the relevance estimation (works in Google, Yahoo! and Exalead).
- Repeat the search in different search services – they have different coverage and ranking.
- Search for different types of information, sound, image etc. with the special functions of the search services or in search services for specific file types.
- Think about which format the information is most likely to be in. Perhaps it's presented in reports (PDF) or presentations (PPT) – limit your search to a specific file type.
The un-indexed Web
Consists of files which could be included in the search engines' indexes but which aren't. This part of the Invisible Web is extensive and hard to find; a large part of the Web is un-indexed. Different reasons are:
- The indexing depth
- The indexing frequency
- Unlinked pages
- Badly designed Web pages (search engine unfriendly)
To test whether a page is indexed in Google you write info: before the Web address:
Fig. info:www.ucla.edu in Google.
The page is indexed and you can choose among the following:
- Look at the version in Google cache. (The page is saved as it was when Google last visited/indexed the page).
- Let Google do a search for pages that look like the page of interest.
- See which pages link to the page of interest.
- Look at the pages from the site www.ucla.edu.
- Let Google do a search for pages that contain the URL of the page of interest.
If you search with info: for a page which isn't indexed, you only get the following choices:
- Go to the address of interest.
- Let Google do a search for pages that contain the page’s URL.
A page which isn’t indexed:
Fig. info:www.jonasfransson.com/about.html in Google.
How you search the un-indexed Web
- Subject directories
- Industry portals
- Go directly to where the information may be found
- Test different search engines (different coverage on the Web)
The private Web
Web site owners can prevent search engines from indexing Web pages in different ways. A Web editor has three ways to exclude a page from a search engine:
- To password-protect the page so that a search engine spider can't get past the login form.
- To use robots.txt to prevent spiders from getting to the page (the Web Robots Pages, www.robotstxt.org/wc/robots.html).
- To use the meta tag "noindex", which prevents the spider from indexing the page.
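The robots.txt mechanism can be sketched in a few lines of Python with the standard library's robots.txt parser. The rules and URLs below are invented for illustration; a well-behaved spider performs exactly this kind of check before fetching a page.

```python
# Sketch: how a spider honours robots.txt, using Python's standard
# urllib.robotparser. The rules and URLs are invented examples.
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The spider checks each URL before fetching it:
print(parser.can_fetch("*", "http://example.com/private/family.html"))  # False
print(parser.can_fetch("*", "http://example.com/index.html"))           # True
```

The "noindex" alternative is instead a line inside the page itself, `<meta name="robots" content="noindex">`, which asks the spider not to add that page to the index.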
Web pages in the private Web can have different purposes. Perhaps they are family pages intended only for family and friends. They might also be pages for a project that originally had a specific target group, or a Web site under construction.
Increasingly often you're required to enter a code of four or five letters and numbers, shown in a somewhat distorted image next to the box where you enter the code. This is a simple test which protects the service from being used by automated programs such as search engine spiders: it has to be a human being who reads and enters the code, and thereby uses the service. For a search engine, other passwords work in the same way – as a dead end.
How you search the private Web
- Subject directories
- Industry portals
- Indirectly in search engines
The protected Web
In the protected Web, as in the private Web, the search engines are prevented from indexing the contents. The protected Web is more commercial in nature; it's often about tying users up to make them return or pay for the services.
- Pages where the user has to agree to certain terms to get access to the page.
- In many cases the Web pages are freely accessible after registration.
- In other cases a fee is charged, per page or as some kind of subscription.
- Traditional database companies are also included here (e.g. Dialog, www.dialog.com).
How you search the protected Web
- Industry portals
- Register as a user at separate, free Web sites
- Searching in pay services – subscription or pay for separate searches
The really invisible Web
The really invisible Web is the large amount of information stored in databases and unusual file formats. With modern technology much of it could be indexed, but to a great extent today's search engines are built on techniques from the birth of the Web (the first half of the 1990s).
- The Web sites which the search engines can’t index for technical reasons.
- The file formats which the spider isn’t programmed to handle.
- Pages that are difficult for the search engine to categorize owing to a small amount of text. All searching in the search engines takes place by matching search words to words on the Web page; if the Web page contains few words it may be unfindable in the index.
- The search engines have chosen not to index the pages.
- Dynamic pages – poorly constructed pages can turn into spider traps, i.e., the search engine spider gets trapped in the generated links and can’t move on.
- Information in relational databases – a query to the database is required.
How you search the really invisible Web
- Indirectly in search engines (e.g., for databases)
- Industry portals
The fresh Web
There's constant publishing going on on the Web: news, blog posts, press releases, new Web pages, reports etc. Part of this is indexed practically immediately by the large search engines, but a lot of information remains invisible for weeks or months.
How you search the fresh Web
- News search services
- Blog search services
- Via experts’ Web pages
- Monitoring subject or Web site – passive searching
The disappearing Web
The Web is not static; it develops and changes constantly. Information is added and disappears – "here today, gone tomorrow."
How you search the disappearing Web
- Cut the URL to get to a working Web page in order to start searching onwards again.
- Check the URL in a search engine using cache:
- Do a Web site search in a search engine with keywords from the disappeared page (site:).
- If none of the above works, or if the entire Web site has disappeared, try the Internet Archive (www.archive.org).
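The URL-cutting step can be sketched in code: from a dead address, generate the parent addresses to try, from most to least specific. The example address below is hypothetical.

```python
# Sketch: "cut the URL" – from a dead address, generate the parent
# addresses to try, from most to least specific. Hypothetical example URL.
from urllib.parse import urlsplit

def parent_urls(url):
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    while segments:
        segments.pop()  # remove the last path segment
        path = "/".join(segments)
        candidates.append(f"{parts.scheme}://{parts.netloc}/{path}".rstrip("/") + "/")
    return candidates

for candidate in parent_urls("http://www.example.org/reports/2005/annual.html"):
    print(candidate)
# http://www.example.org/reports/2005/
# http://www.example.org/reports/
# http://www.example.org/
```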
The disappeared information can be sought in Web archive services; see the Old Web.
The non-existent Web
EVERYTHING is not on the Web! People may often be the key to the information you're looking for. If your search doesn't lead anywhere in spite of your efforts – look for an expert.
How you search the non-existent Web
- Check with friends or colleagues. Perhaps someone knows about a good starting-point or an important resource. Many organizations have an “information guru” or information function.
- Contact a public library nearby – professional help for free.
- Consult a researcher.
The Deep Web
The company BrightPlanet divides the Web into two parts, the Surface Web and the Deep Web. They consider the Deep Web a more accurate concept than the Invisible Web.
The Surface Web consists of the static Web pages that the search engines can reach, and which thereby become visible and searchable in the traditional search engines. Under the surface lie the dynamically generated Web pages and the databases which the search engines can't reach, which is why the Deep Web remains hidden or invisible.
Most of the information on the Web is buried in dynamically generated Web pages, which don’t exist until they’re created as the answer to a specific search.
The size of the Deep Web
In 2001 the company BrightPlanet estimated the size of the Deep Web at 7,500 terabytes. More numbers from the report (1):
- The Surface Web: 19 terabytes and 1 billion documents (1 terabyte = 1,000 gigabytes).
- The Deep Web: 7,500 terabytes and 550 billion documents.
- The Deep Web is 400-550 times bigger than the Surface Web.
- The Deep Web contains 1,000-2,000 times more quality information than the Surface Web.
Google has estimated the content on the Internet at 5 million TB, of which Google had indexed 170 TB (about a thirty-thousandth) in October 2005 (2).
If the content on the Internet has doubled to 10 million TB since Google's estimate, and if the Deep Web is 500 times bigger than the Surface Web, then the Surface Web is about 20,000 TB and the Deep Web almost 10 million TB. If Google has doubled its index in the meantime, it has indexed about 350 TB – almost 2 per cent of the Surface Web.
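The back-of-the-envelope arithmetic above can be checked in a few lines. All inputs are the assumptions stated in the text, not measurements.

```python
# Check of the figures above. All inputs are the text's assumptions.
total_tb = 10_000_000        # assumed total content: 10 million TB
ratio = 500                  # Deep Web assumed 500 times the Surface Web
surface_tb = total_tb / (ratio + 1)   # total = surface + ratio * surface
deep_tb = total_tb - surface_tb
google_tb = 350              # Google's index, assumed doubled

print(round(surface_tb))                       # 19960 TB, i.e. about 20 000 TB
print(round(100 * google_tb / surface_tb, 1))  # 1.8, "almost 2 per cent"
```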
On the other hand, there are researchers who claim that Google has indexed 76 per cent of the indexable Web (the Surface Web), which is estimated to consist of at least 11.5 billion Web pages (2005) (3).
The content of the Deep Web
- The Deep Web contains content relevant to every information need.
- More than half of the content is found in subject-specific databases.
- 95 % of the Deep Web is freely accessible – neither fees nor subscriptions are required.
According to BrightPlanet, the greatest part of the content consists of subject databases (54 %). Together with documents on Web sites and archived publications, the subject databases hold enormous amounts of subject-specific information and make up close to 80 % of the Deep Web. Trade-related Web sites, like auction sites, represent about 10 % of the content; other parts are portals (3 %) and libraries (2 %). The study was done in 2001, so the proportions may have changed somewhat, but the numbers offer an image of the Deep Web.
The Invisible Web and the Deep Web
The Invisible Web and the Deep Web can be related to each other. The illustration below is an attempt to clarify the relations between the two concepts. The Invisible Web is search-engine dependent, whereas the Deep Web is defined by the nature of the information and how it is stored.
Fig. Visible and invisible in Google.
The proportions in the image show the possible relations between what is visible and what is invisible – in other words, how much is indexed in the search engines, in this case Google, and how much isn't.
The Old Web
The Web grows by over 200 per cent per year. At the same time about 40 per cent disappears yearly, part of which is of historical interest.
Through the Internet Archive and the Wayback Machine (www.archive.org) you can search for old Web pages.
You find an alternative entrance to the Internet Archive and the Wayback Machine at the Web site of the Alexandria Library (it may be needed when the main site is slow): www.bibalex.org/English/initiatives/internetarchive/about.htm
Fig. The Internet Archive and the Wayback Machine
Search engines that keep cached versions of indexed Web pages can also be used to retrieve lost pages. This only works, however, as long as the search engine hasn't yet returned to re-index the page. The sometimes slow re-indexing also explains why the cached page in the search engine isn't always identical to the live page you then reach. At worst, the page's content will have changed completely since the search engine's last visit.
Searching the Invisible Web
When should you search the Deep/Invisible Web?
- When you’re looking for dynamic and changeable information, like news, job ads or flight departures.
- When you want to find information that is normally stored in a database.
- When you want to search beyond the limited index of the search engines.
Different search services for searching the Invisible Web
- Invisible Web gateways
- Directories/subject directories
- Search engines and specialized search engines
- At the source (database searches)
Invisible Web gateways
Pronounced focus on the Invisible Web
- CompletePlanet (www.completeplanet.com) – directory
- IncyWincy (www.incywincy.com) – search engine
- The Librarians' Internet Index (http://lii.org) – try searching on database to find databases; combine with subject words.
- Infomine (http://infomine.ucr.edu)
Search engines and specialized search engines
- Google, Yahoo!, Bing, … – e.g., try migration database
- Google News (http://news.google.com)
- Blinkx (www.blinkx.com) – video search engine
- Picsearch (www.picsearch.com) – image search engine
To find search services similar to a known one, try related: in Google. A search for related:www.picsearch.com returns a bunch of image search services.
Search words which point you in the right direction
If you don’t search in services which are specific for the Invisible Web, you have to select search words that point in the direction of the invisible material that you’re after:
- database + subject word
- archive + subject word
- search ("click here to search" points to a database)
- listen (points to a sound file), in Google: elvis listen
Search for databases
The information in the Invisible Web is generally found in databases that you can reach via the Web. Although the search engines can't search inside the databases, they can locate the homepages or search forms of many of them.
When you search for databases, use your search words/subject words together with words like database, archive and repository. For example:
”plane crash” AND database
Limiting your search to a certain file type may also turn up valuable material, e.g.:
"invisible web" with the file type limited to PPT (PowerPoint)
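As a sketch, the query-building strategy above could be automated. The trigger words and the query syntax follow the examples in the text; the subject word and the helper's name are arbitrary.

```python
# Sketch: combine a subject with the "trigger" words database, archive
# and repository, plus an optional file-type limit. Query syntax as in
# the examples above; the subject word is arbitrary.
def deep_web_queries(subject, filetype=None):
    triggers = ["database", "archive", "repository"]
    queries = [f'"{subject}" AND {trigger}' for trigger in triggers]
    if filetype:
        queries.append(f'"{subject}" filetype:{filetype}')
    return queries

for query in deep_web_queries("plane crash", filetype="ppt"):
    print(query)
# "plane crash" AND database
# "plane crash" AND archive
# "plane crash" AND repository
# "plane crash" filetype:ppt
```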
Search for places where the content of the Invisible Web is likely to be
Figure out how you can reach the page with the information indirectly. Which page may link to the information you’re looking for? Interest groups and authorities are good starting points.
Indirect search – two-step search
Which Web site may contain what you're looking for? The search engines don't index (include in their databases) all pages, nor complete pages, nor all images and links. But a lot can be found indirectly through the search engines.
Inverted link search
When you've found a database that is useful for the Invisible Web, you can make use of the other users who have found the resource valuable. Through inverted link searching you can find directories and link collections which link to the database you're interested in; these often link to other good resources as well.
A search such as link:www.planecrashinfo.com will give the Web pages that link to www.planecrashinfo.com and that are included in the search engine's index.
Make use of the search tools of the Web sites
Databases are often "hidden" far down at Web sites. To find them, use the search tools of big authoritative Web sites, e.g. the UN, the World Bank or the big universities.
Look through your bookmarks
Perhaps you've already found a good Web site for the Invisible Web and saved a bookmark. How do you know whether a bookmarked Web site belongs to the Invisible Web? Use the "URL test": in the browser's address box, place the cursor to the left of the question mark in the URL, erase everything to the right including the question mark, and reload the page. If you get an error message or "the page could not be found", it's probably an Invisible Web resource, since the whole address was needed for the database to create a meaningful page.
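The "URL test" is simple enough to express as code. The addresses below are hypothetical examples.

```python
# Sketch of the "URL test": strip everything from the question mark on.
# If the shortened address no longer works, the page was most likely
# generated from a database. Hypothetical example addresses.
def url_test_address(url):
    base, sep, _query = url.partition("?")
    return base if sep else None  # None: no query string, test not applicable

print(url_test_address("http://www.example.org/search?id=4711&lang=en"))
# http://www.example.org/search
print(url_test_address("http://www.example.org/about.html"))
# None
```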
Monitor subject-specific e-mail lists or forums
Find e-mail lists or forums where subject experts or information specialists such as librarians and journalists carry on discussions. Follow the discussions or search the archive if there is one. Much of the knowledge and the experiences that pass these channels will never be printed (or be up on the Web).
Limit your search to specific file types
By limiting your search in a search engine to one file type, it becomes easier to find more substantial material.
filetype:ppt “invisible web” OR “deep web”
Search in specialized search engines
Specialized search engines often index deeper at relevant Web sites, or index entire PDF files (both Google and Yahoo! have had limits on the number of kB they index).
At the source – example travel planners
Some bus and train companies have travel planners – Web services where you can find information about stops, routes and timetables by searching on point of departure and destination. This information is invisible in an ordinary search engine: you can't enter your destination and means of travel, e.g., bus lund copenhagen, and get the next relevant departure. Some companies publish their timetables in PDF format, so search words like timetable Stockholm pdf can lead you to timetables.
Searchability of images in search engines
Searching for images is harder, as the search engines operate on text and characters. The following normally applies:
- The file name is indexed, e.g., leo12.jpg.
- The ALT tag can be indexed if it exists.
- Text close to the image is often indexed.
This means that you should search on 1-3 search words when you search for images, not more.
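What a text-based engine can actually "see" of an image can be illustrated with Python's standard HTML parser: the file name (src) and the ALT text are the only image-related strings available for indexing. The HTML snippet is invented.

```python
# Sketch: extract the only image "content" a text-based engine can
# index – the file name (src) and the ALT text. Invented HTML snippet.
from html.parser import HTMLParser

class ImageTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.found.append((attributes.get("src"), attributes.get("alt")))

extractor = ImageTextExtractor()
extractor.feed('<p>Our lion <img src="leo12.jpg" alt="lion cub Leo"> grows fast.</p>')
print(extractor.found)  # [('leo12.jpg', 'lion cub Leo')]
```

The surrounding sentence "Our lion … grows fast" is the kind of nearby text an engine may also index.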
Google Image Search: search on boeing 747 and 747 boeing in two different windows and compare the hit lists. You get more or less the same hits but with different ranking, owing to the order of the search words. (See the illustration in Chapter 2.)
Remember copyright! Most images that you find on the Web are owned by someone.
(1) The 'Deep' Web: Surfacing Hidden Value (http://www.brightplanet.com/component/content/article/23-details/119-the-deep-web-surfacing-hidden-value.html)
(2) Softpedia: How Big is the Internet? (http://news.softpedia.com/news/How-Big-Is-the-Internet-10177.shtml)
(3) Gulli, A. & Signorini, A. (2005) The indexable web is more than 11.5 billion pages. (http://doi.acm.org/10.1145/1062745.1062789)