1. The Basics

(This is originally a chapter from the book Efficient information searching on the web.)

Internet and the Web

What really are the Internet, the Net and the Web? Strictly speaking, the Internet is a worldwide network which for the most part consists of computers and fibre-optic cables. These computers communicate with each other through a common language; the Internet is a network and nothing else. The Web (WWW or the World Wide Web) consists of Web pages which are read in a Web browser (e.g. Internet Explorer), and it builds on the physical network constituted by the Internet. Web sites (collections of connected Web pages) are stored on so-called Web servers and made accessible via the Internet. With your connected computer you request the Web pages that you want to look at, and they are delivered to your Web browser. E-mail is another use of the Internet: messages are sent between different mail servers which are connected to the Internet. File transfer is a third use of the Internet.

The foundation of the Internet was a network in the US that connected defence establishments and universities. Up to the 1990s the main users were researchers working at research establishments and universities. In the 1990s the Web caught on and the Internet became increasingly commercialized. In the 21st century the Internet has become a natural part of many people’s everyday life and is no longer seen as something strange.

Web addresses

A Web address is called a URL (Uniform Resource Locator). Each URL is unique and leads to a specific file on a specific computer.

A URL is constructed according to the following:

http://www.omis.se/exempel/webb/webpage.html

http:// means that the Web browser uses the Hypertext Transfer Protocol (HTTP) to fetch the file to your computer. Other protocols are used, for example, for e-mail or file transfer.

www.omis.se is the name of the computer (often called a Web server) where the file (the Web page) is stored. The ending .se shows that it is a page which belongs to the Swedish top-level domain on the Internet.

/exempel/webb/ is the directory and sub-directory on the computer (the Web server) that the file (the Web page) is on.

webpage.html is the name of the file. The file ending .html shows that it is a page written in the Hypertext Markup Language (HTML).

Web addresses are thus structured according to the following:

how-to-get-there://where-to-go/what-to-collect
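As a small illustration, the three parts can be picked out of the example address with Python's standard urllib.parse module. This is only a sketch; the variable names are my own and not part of any standard.

```python
from urllib.parse import urlsplit

url = "http://www.omis.se/exempel/webb/webpage.html"
parts = urlsplit(url)

# how-to-get-there: the protocol
protocol = parts.scheme

# where-to-go: the name of the Web server and its top-level domain
server = parts.hostname
top_level_domain = server.rsplit(".", 1)[-1]

# what-to-collect: the directories and the file name
directory, _, filename = parts.path.rpartition("/")
directory += "/"

print(protocol, server, top_level_domain, directory, filename)
```

The split between directory and file name is done on the last slash, which matches the example address but is a simplification for addresses that end in a directory.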

IP numbers

The IP number (the IP address) is the unique address of a computer which makes communication on the Internet possible. The format of the IP address is defined by the Internet Protocol (IP); in the version of the protocol described here (IPv4) it consists of 32 bits, which are generally written as four decimal numbers separated by dots, e.g. 194.14.94.1. The DNS (Domain Name System) translates the domain in the Web address, e.g. www.omis.se, to an IP number so that communication can take place between the computers on the Internet via TCP/IP. (TCP/IP is a standard for computer communication which builds on the two protocols TCP and IP.)
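To make the relation between the 32 bits and the dotted decimal notation concrete, here is a minimal sketch using Python's standard ipaddress module. It works only on the example address from the text and performs no DNS lookup.

```python
import ipaddress

address = ipaddress.IPv4Address("194.14.94.1")

# The same address seen as one 32-bit integer.
as_integer = int(address)
as_bits = format(as_integer, "032b")

# Each of the four decimal numbers is one byte (8 bits) of that integer,
# so the integer can be rebuilt by shifting in one byte at a time.
octets = [int(part) for part in str(address).split(".")]
rebuilt = 0
for octet in octets:
    rebuilt = rebuilt * 256 + octet

print(as_integer, as_bits, rebuilt == as_integer)
```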

Static Web pages

An ordinary “old-fashioned” Web page written in HTML and saved as a file on a Web server is a static Web page. The Web page is only altered if a new version of the file is uploaded to the Web server, replacing the old one. Static Web pages are often created manually in a text editor or in a program with a graphical user interface such as Frontpage. Because they have to be changed by hand, static pages usually contain general information that seldom needs updating. Most static Web pages are indexable, i.e. the search engines can add them to their databases, although frames and certain scripts may cause problems.

HTML

Knowing basic HTML is important for using the search engines efficiently. The large search engines were originally constructed to handle only HTML pages, and many of their search features build on the different elements of HTML.

A simple HTML page is shown below. The sections that follow explain the HTML code for www.omis.se/exempel/webb/webpage.html.

Image. The Web page www.jonasfransson.com/example/webpage.html

The HTML code for the Web page in the image above:

<html>

<head>

<title>Web page</title>

</head>

<body>

<h1>Universities</h1>

<p><a href="http://www.lu.se/lund-university">Lund University (LU)</a></p>

<p><a href="http://www.su.se/english/">Stockholm University (SU)</a></p>

<br>

<h3>Where do I find more Swedish universities?</h3>

<p>Try to search with “university site:se” in a search engine like Google and Yahoo. Site:se limits the search to the Swedish domain .se, and thereby excludes hits from other domains like .no (Norway) or .edu (education in the US).</p>

<img src="google-site-box.png" alt="Box for limiting search to specific site or domain in Google" border="0">

<p><i>Picture. Box for site-search in Google.</i></p>

<br>

<p>In most search engines there are possibilities for limiting a search to a specific <b>site</b> or <b>domain</b> in “advanced search”. The picture above shows the box in Google where the site can be specified.</p>

</body>

</html>

The words within <> are called tags and can be thought of as commands. The first line declares that the document is an HTML document.

<html>

<head>

<title> Web page </title>

</head>

Then comes the head of the document (head), where the title of the document is given: “Web page”. It is the title that is shown in the title bar at the top of the Web browser window. What is written in the head is not visible on the Web page itself but has other functions. After that the head is closed (a / inside a tag means that the command ends) and the body follows, <body>. It is the text in the body that is shown in the Web browser’s window.

<body>

<h1> Universities </h1>

h1 is a heading of the highest level, which gives the largest headings.

<p><a href="http://www.lu.se/lund-university">Lund University (LU)</a></p>

<p><a href="http://www.su.se/english/">Stockholm University (SU)</a></p>

<p> means the beginning of a paragraph. The section consists of two paragraphs using the same structure, and each paragraph contains a link to a university. The link is structured according to the following:

<a href="http://www.lu.se/lund-university">Lund University (LU)</a>

a stands for anchor, which is the code for a link.

href="http://www.lu.se/lund-university" is a hypertext reference which in this case leads to the English-language start page of Lund University.

Lund University (LU) is the clickable text.

/a means that the link is completed.

<br>

br stands for BREAK, i.e. a new line.

<h3>Where do I find more Swedish universities?</h3>

<p>Try to search with “university site:se” in a search engine like Google and Yahoo. Site:se limits the search to the Swedish domain .se, and thereby excludes hits from other domains like .no (Norway) or .edu (education in the US).</p>

The text above is a text section with a heading. The heading is of level 3, meaning that it’s smaller than h1 above.

<img src="google-site-box.png" alt="Box for limiting search to specific site or domain in Google" border="0">

<p><i>Picture. Box for site-search in Google.</i></p>

<img src> stands for image source and inserts an image with the name “google-site-box.png” on the page. alt="…" is the text shown if you can’t see the image for some reason, for example due to visual impairment. The img tag can also determine the height and width of the image and whether it should have a border; here border="0" means that the image has no border. After that comes a paragraph with a caption in italics (<i>).

<p>In most search engines there are possibilities for limiting a search to a specific <b>site</b> or <b>domain</b> in “advanced search”. The picture above shows the box in Google where the site can be specified.</p>

</body>

</html>

In the last paragraph two words are marked with <b> for bold type. The paragraph and the whole example end with the closing of the document body (</body>), and after that the page ends with </html>.
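A program such as a search engine reads the same tags to pick out the parts of a page. The sketch below uses Python's standard html.parser module on a shortened copy of the example page; the class name PageOutline and the choice of what to collect are my own assumptions, not a description of how any particular search engine works.

```python
from html.parser import HTMLParser

PAGE = """<html>
<head><title>Web page</title></head>
<body>
<h1>Universities</h1>
<p><a href="http://www.lu.se/lund-university">Lund University (LU)</a></p>
<p><a href="http://www.su.se/english/">Stockholm University (SU)</a></p>
<br>
<h3>Where do I find more Swedish universities?</h3>
</body>
</html>"""

# Tags that never have a closing tag and therefore never contain text.
VOID_TAGS = {"br", "img", "meta"}


class PageOutline(HTMLParser):
    """Collects the title, the headings and the links of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []  # (tag, text) pairs
        self.links = []     # (href, clickable text) pairs
        self._open = []     # stack of tags that are currently open
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._open.append(tag)
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the tag together with anything left open inside it.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return
        current = self._open[-1]
        if current == "title":
            self.title = text
        elif current in ("h1", "h2", "h3"):
            self.headings.append((current, text))
        elif current == "a":
            self.links.append((self._href, text))


outline = PageOutline()
outline.feed(PAGE)
print(outline.title)
print(outline.headings)
print(outline.links)
```

The title, the headings and the link texts end up in separate lists, which is the kind of separation that lets a search service offer searches limited to, for example, titles or links.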

Sometimes “&aring;” and similar expressions occur in HTML text. &aring; means “a with a ring”, i.e. the Swedish “å”; the & sign starts the code and the semicolon ends it. Such codes for language-specific characters, called character entities, are used so that the characters are shown correctly in all Web browsers.
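As a brief illustration, Python's standard html module can translate such codes back into ordinary characters. The sample text with the two Swedish city names is my own.

```python
import html

encoded = "Lule&aring; and Malm&ouml;"

# Turn the codes into the characters they stand for.
decoded = html.unescape(encoded)

# Going the other way: replace every non-ASCII character
# with a numeric code that any browser can read.
reencoded = decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(decoded)
print(reencoded)
```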

Meta tags

Meta tags are tags in the head of the HTML page where meta-information can be entered. Meta-information is information about information, e.g. who created the page. In the meta tags of a Web page, the page’s creator can specify keywords and concepts under which the page should be found. The search engines can then consider the words in the meta tags when estimating relevance. Earlier it was often information from the meta tags that was shown in the search engines’ hit lists; Google was one of the first search engines to show excerpts from the Web page text instead. However, meta tags have been abused: pages have been given popular search words as tags in order to attract visitors, without the tags having anything to do with the contents of the page. The importance of meta tags for both page creators and visitors has therefore diminished radically, and today the large search engines pay little or no attention to them. In the cases where a search engine does consider the meta tags, it checks the words in the meta tags against the text on the Web page so that it can ignore meta tags which don’t match the contents.
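The check of meta tags against the page text can be illustrated with a deliberately simple sketch. It assumes a keywords meta tag with comma-separated single words and a made-up test page; real search engines use far more elaborate methods, and the names KeywordChecker and split_keywords are hypothetical.

```python
import re
from html.parser import HTMLParser

# A made-up page: the keyword "lottery" has nothing to do with the contents.
PAGE = """<html>
<head>
<title>Web page</title>
<meta name="keywords" content="universities, Sweden, lottery">
</head>
<body>
<h1>Universities</h1>
<p>Where do I find more universities in Sweden?</p>
</body>
</html>"""


class KeywordChecker(HTMLParser):
    """Collects the meta keywords and the words used in the body text."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.body_words = set()
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "body":
            self._in_body = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords = [k.strip().lower() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.body_words.update(re.findall(r"\w+", data.lower()))


def split_keywords(page):
    """Return (keywords found in the body text, keywords not found)."""
    checker = KeywordChecker()
    checker.feed(page)
    supported = [k for k in checker.keywords if k in checker.body_words]
    unsupported = [k for k in checker.keywords if k not in checker.body_words]
    return supported, unsupported


supported, unsupported = split_keywords(PAGE)
print(supported, unsupported)
```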

Dynamic Web pages

Dynamic pages, unlike static pages, do not lie “ready” on a server. They are created from the choices that you make when you visit a Web site. Sometimes the Web site remembers your choices, e.g. colour settings, and for that purpose it uses so-called cookies (small files that are saved on your computer). Dynamic Web pages make it possible to create e-commerce solutions, discussion forums and on-line journals.

The information (the contents) of dynamic Web pages is saved in a database, and when you visit a dynamic page, the page is created for you from the information in the database and from the templates of the Web site. In certain cases the information is updated often, for example on newspapers’ front pages, and in that case the “age” of the information is important together with its news value.

One way of finding out whether a Web page is dynamic is to study its Web address (URL). If the URL contains a question mark in the middle, this is a sign that the page is dynamic.
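The question-mark test can be sketched with Python's standard urllib.parse module. The second address is hypothetical, and the test is only a rule of thumb: a dynamic page may well have an address without a question mark.

```python
from urllib.parse import parse_qs, urlsplit


def looks_dynamic(url):
    """True if the URL has a query part, i.e. text after a question mark."""
    return bool(urlsplit(url).query)


static_url = "http://www.omis.se/exempel/webb/webpage.html"
dynamic_url = "http://www.example.com/search?q=university&lang=en"  # hypothetical

print(looks_dynamic(static_url), looks_dynamic(dynamic_url))

# The part after the question mark carries the choices sent to the Web site.
choices = parse_qs(urlsplit(dynamic_url).query)
print(choices)
```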

Dynamically generated information can sometimes be hard to find in the search engines. Whether the search engines are able to find the information and enter it into their indexes depends on the system through which the information is published and on how this system works.

The Web browser

A Web browser is a program on your computer which makes it possible for you to view HTML documents and thereby gives you access to all the information and all the files that are accessible via the Web. Today Microsoft’s Internet Explorer is the most common Web browser, but others exist as well. Mozilla Firefox is an example of a newer Web browser. Before Microsoft made Internet Explorer a part of its operating system Windows, Netscape Navigator was the most common Web browser.

So-called plug-ins are small programs that hook into the Web browser and improve its functionality. Many sound and image files require a certain plug-in to become audible, or visible, in the Web browser. If you click on a link to a file that your Web browser doesn’t support, a window will often pop up to ask you if you want to install the program referred to. Most of the programs are free of charge and safe to install on your computer if you follow the instructions.

With the Web browser you can manage the Web pages in different ways. You can save addresses (favourites/bookmarks), print Web pages, send the page or the link via e-mail, and also save the file to your computer.

Different types of search services

Search services are navigation tools on the Web. The search services don’t in themselves contain any information, only links to pages with contents. The search services can be divided into three fundamentally different types:

  • Search engines
  • Directories
  • Meta search services

Search engines

Search engines automatically index Web pages, i.e. computer programs read and save Web pages into a database. When you perform a search in a search engine you search in its database, not out on the Web. Examples are Google (google.com) and Yahoo! search (search.yahoo.com).
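The idea of searching in a database rather than out on the Web can be sketched with a very small index. The three pages and their texts are invented for the illustration, and requiring every search word to be present is my own simplification; a real search engine is far more elaborate.

```python
import re
from collections import defaultdict

# Invented pages standing in for what the indexing program has read.
PAGES = {
    "http://www.example.org/lund": "Lund University is a university in Sweden",
    "http://www.example.org/stockholm": "Stockholm University is located in Stockholm",
    "http://www.example.org/news": "Daily news from Sweden",
}


def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


index = build_index(PAGES)
print(sorted(search(index, "university Sweden")))
```

Note that the search only looks in the index built beforehand; a page that has changed since it was read will be found under its old contents.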

Directories

Directories are created by human beings, in contrast to the search engines. The editors, often librarians, collect links and place them in a hierarchy of subjects. Usually the links are annotated and sometimes labelled with subject words. Examples are the Librarians’ Internet Index (lii.org) and Infoo (http://infoo.se/index-en.html).

Meta search services

Meta search services are search services that perform searches in other search services, in directories and search engines, and which then put together a hit list. In some meta search services advertising is mixed in with the hit list. Examples are Clusty (clusty.com) and Metacrawler (metacrawler.com).
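How a combined hit list might be put together can be sketched as follows. The two hit lists are invented, and taking hits in turn from each list while dropping duplicates is only one possible way of merging; I make no claim that any named service works like this.

```python
from itertools import zip_longest


def merge_hit_lists(*hit_lists):
    """Interleave several ranked hit lists and drop duplicate addresses."""
    merged = []
    seen = set()
    # Take the first hit from every list, then the second, and so on.
    for same_rank in zip_longest(*hit_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Invented hit lists from two hypothetical search services.
hits_a = ["example.org/a", "example.org/b", "example.org/c"]
hits_b = ["example.org/b", "example.org/d"]

combined = merge_hit_lists(hits_a, hits_b)
print(combined)
```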

Operation and financing of search services

The search services differ from each other with regard to operation and financing. They operate on very different scales and have different origins and purposes. They can be divided into the following categories (examples in parentheses):

  • Non-profit work (e.g. the Open Directory Project)
  • Hobby (individual persons’ link collections)
  • Advertising (e.g. Google)
  • PR for a special search technique or company (e.g. Bright Planet)
  • Publicly financed search service (e.g. Librarians’ Internet Index)
  • Research project (many of the search engines started as research projects, e.g. Google)

Terminology

There is no established terminology for search services. I use search services as the comprehensive concept under which I place three types: search engines, directories and meta search services. But the word search engine is often used to refer to all search services. In computer science, however, a search engine is the part of, for example, Google which performs the search in the index, not the whole search service (i.e. an even narrower definition). Other terms in use are search machines and search robots.

When it comes to directory services there are yet more concepts: directories, or catalogues (plus subject- or Web directories), link lists, link collections, portals/subject portals (or gateways) and virtual libraries.

Database

I use the concept database for a limited amount of information which is stored in a structured system and is accessible through a specific interface. The databases often contain information on a narrow subject. Databases in this sense existed long before the Web, and they have been accessible via CD-ROM, modem etc. Now most of them are reachable via the Internet through Web interfaces, but many of them require registration or subscription.

Beta version

A beta version is a version of a computer program or a Web service which is under development. The program or the service might not be entirely stable, and as a user you can’t expect the same reliability from a beta version as from a finished one. Beta versions are often marked by the addition of “beta” after the name. Beta versions are particularly common when it comes to new Web services.
