(This is originally a chapter from the book Efficient information searching on the web.)
What is a database?
From the user’s perspective a database is often a large amount of information collected in a closed system. As a user you can often search in separate databases, but only the owner of a database can publish information in it. On the Web, however, everybody can easily publish his or her own information.
The traditional databases existed long before the Web came into existence, and they were previously accessible in other ways, e.g., through a direct modem connection. The databases that you meet on the Net today have a so-called Web interface, which makes it possible to search the database via the Web.
In contrast to the search engines, the information in the databases is not put there entirely automatically; the input takes place in a controlled manner and is done by editors. The image below illustrates how editors enter information into the index, which together with the interface makes up the database.
Fig. The structure of a traditional database. Compare to the structure of the search engines in Chapter 2.
Databases are important tools when it comes to efficient searches for information.
Common types of databases are:
- Library directories which show what literature there is at one or several libraries
- Reference databases which give references to literature (e.g., journal articles or scientific reports); sometimes they also contain abstracts, i.e., short summaries of the contents
- Fact databases which contain facts, not references, e.g., statistics and encyclopaedias
- Full-text databases which contain complete texts, e.g., scientific journal articles.
Characteristics of the databases
The database is updated by editors
This makes for a selected and controlled content. The form in which the information is stored is also fixed: clearly defined fields are filled in with meta-information (information about information) in a standardized manner. Many databases also use a controlled vocabulary (subject words), which makes it easier to find everything on a specific subject.
Controlled searches can be done
Because the form of the content in the database is well regulated, controlled searches are possible. You can, e.g., search in specific fields (author, year…). You can normally also use the Boolean operators (AND, OR, NOT) to combine search words. Advanced search strings can often also be combined in different ways.
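How the Boolean operators combine field searches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Record class, its field names and the three records are invented for the example and do not come from any real database. Each field search returns a set of matching posts, and AND, OR and NOT then correspond to set intersection, union and difference.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical database post with a few fixed fields."""
    title: str
    author: str
    year: int
    descriptors: frozenset = frozenset()


# Invented records for illustration only.
RECORDS = [
    Record("Searching the Web", "Smith", 2004,
           frozenset({"internet", "search strategies"})),
    Record("Library Catalogues", "Jones", 1999,
           frozenset({"library services"})),
    Record("Online Searching in Libraries", "Smith", 2001,
           frozenset({"internet", "library services", "online searching"})),
]


def has_descriptor(term):
    """Field search: indexes of records whose descriptor field contains term."""
    return {i for i, r in enumerate(RECORDS) if term in r.descriptors}


def by_author(name):
    """Field search: indexes of records whose author field equals name."""
    return {i for i, r in enumerate(RECORDS) if r.author == name}


# Boolean operators map onto set operations.
internet_and_library = has_descriptor("internet") & has_descriptor("library services")  # AND
internet_or_library = has_descriptor("internet") | has_descriptor("library services")   # OR
internet_not_library = has_descriptor("internet") - has_descriptor("library services")  # NOT
smith_and_internet = by_author("Smith") & has_descriptor("internet")                    # two fields combined
```

AND narrows the result, OR widens it, and NOT removes posts from it, which is why combining them gives you control over the hit list.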
Differences between different databases
Databases differ from each other in several ways:
- Subject content: Does the database contain one subject or several subjects closely related?
- Geographical coverage: Is the content delimited geographically? Are there, e.g., only articles from Lund University in the database?
- Linguistic coverage: Which languages are included in the database? Are there, e.g., only research results in English in the database?
- Types of material: Are there only journal articles in the database or also books and other publications?
- Period covered: Which period does the database cover? When did the collecting of information start, and has it been concluded (and, if so, when)? If the database, e.g., contains articles from different journals, when did the indexing of each separate journal begin?
A database can be accessible in different ways: freely on the Web or through different database companies (one example is Thomson at www.thomson.com). The database companies make the database accessible through their own platforms, so the same database can differ greatly in interface and functionality depending on where you reach it.
Choice of database
Many factors determine the choice of database:
- Which information is needed: books, journal articles or articles from the daily papers?
- How much time do you have at your disposal to get the material?
- Are you looking for Swedish or international contexts?
- Language limitations – only in Swedish?
- Should the material be scientific?
- Do you have access to the database?
Example of database: ERIC
ERIC stands for the Education Resources Information Center. It contains bibliographic information, i.e., it is a reference database which doesn’t contain any full texts. The posts (records) in ERIC describe articles and books, and each post includes subject words and an abstract, a short description of the content.
Fig. ERIC Advanced search (www.eric.ed.gov)
The database focuses on education but also includes closely related research areas such as information science. ERIC is accessible both via traditional database suppliers (e.g., Dialog) and freely on the Web (www.eric.ed.gov). ERIC is a US database and is publicly funded.
Comparison traditional database – search engine
Compared with a traditional database, the search engines collect data extensively and automatically, and the collection is therefore relatively uncontrolled. In the databases the data is collected by human beings who follow certain stipulated rules.
The search engines use advanced retrieval techniques and complex relevance estimation (see Chapter 2), all in order to present as good a hit list as possible. But this leads to less control for the user. In the databases things are often the other way around: there is no advanced relevance estimation. The hit list is often presented in alphabetical order or by date, and in many cases you can select a sorting order. Most of the structuring and the selections are left to the user, which is why subject words are particularly important for searches in databases. Searching with subject words is a way to guarantee a certain relevance in the search results.
In a reference database you search for information about the information; the search is done in standardized information about articles and reports. The content of the posts in the database may entirely lack linguistic resemblance to the document of interest, since the content is presented in a uniform way through subject words and an abstract. In the index of the search engines you search Web pages and other file types that have been indexed entirely or partly, i.e., you search “directly” in the information. When searching in search engines you have to start out from what’s on the Web page or in the document instead of using carefully structured posts as your point of departure.
Searching in databases
Subject vocabulary or thesaurus
Databases often make their subject vocabularies accessible to the user. These lists present the controlled subject words which the editors use to describe the content when they create the posts. A thesaurus is a word list where the relations between the words are defined. The thesaurus is in most cases hierarchical, i.e., the subject words are divided into superior (broader) and inferior (narrower) concepts. In English a controlled subject word is called a descriptor.
Once you have selected a database, it’s time to define your search query using words that narrow down the subject. It’s important to think of the following things:
- Are there synonyms or words which are much alike?
- Is the search word too specific or too general? The search word needs to relate to the content in the database; the search word Internet is perhaps too general in a database containing computer-science texts.
- Is the word spelled correctly? Are there any alternative spellings?
- Is singular or plural used?
- Is there any subject vocabulary or thesaurus in the database? Look in the description of the database; there will probably be a link to the subject vocabulary or the thesaurus.
Subject words from the subject vocabulary or the thesaurus are always preferable. In a thesaurus each subject word is described in a post. The subject word is briefly explained in a so-called scope note. Words that are superior and inferior in relation to the looked-up word are included, together with closely related subject words. Earlier subject words which have now been replaced by the new subject word are also included.
Fig. The subject word ”search engines” in ERIC’s thesaurus.
In the post above the subject word search engines is described in ERIC’s thesaurus. Search engines was added as a subject word in 2002 and is now used instead of the earlier subject words internet search engines and web search engines.
You should always look in the subject word index/thesaurus to see which term is preferred and, if possible, use the subject word field for your search.
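The idea of looking up the preferred term can be sketched as a small data structure. This is a simplified illustration, not ERIC’s actual format: the dictionary layout and the function preferred_term are assumptions made for the example. Only the descriptor, the year and the two replaced terms are taken from the text; the remaining fields are left empty because the text does not quote them.

```python
# A hypothetical, heavily simplified thesaurus. The key names are assumptions.
THESAURUS = {
    "search engines": {
        "added": 2002,
        "used_for": ["internet search engines", "web search engines"],
        "scope_note": "",  # not quoted in the text, so left blank
        "broader": [],     # superior terms
        "narrower": [],    # inferior terms
        "related": [],
    },
}


def preferred_term(word):
    """Map a word to the descriptor the thesaurus prefers, or None if unknown."""
    word = word.lower().strip()
    if word in THESAURUS:
        return word
    for descriptor, entry in THESAURUS.items():
        if word in entry["used_for"]:
            return descriptor
    return None


current = preferred_term("Web search engines")
```

A word that is neither a descriptor nor a replaced term gives None, which is the signal to try a synonym or to browse the thesaurus instead.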
Fig. A post in ERIC.
Study the posts
Study the appearance of the posts, their structure and content. In the image above, a post in the database ERIC is shown. The fields in the database are to the left. Of particular interest is the third field, descriptors, where you find the post’s subject words. In this example the title is Who’s Afraid of Google and the article has the subject words Internet, Information Seeking, Search Strategies, Library Services and Online Searching. The abstract written about the article gives you an idea of the contents.
Free-text searching – field searching
In databases you can normally choose between a free-text search, i.e., a search in all fields, and a field search, i.e., a search in a specific field such as the title. If you search in the title field you will only get hits for the documents whose titles contain the search word, not all the documents which deal with the subject. You can also do field searches in search engines on the Web: through the search syntax of the search engine you can restrict a search to the title of the Web page. In Google you write intitle:anatomical if you want pages with anatomical in the title. With a free-text search you will get more hits in a database, but the number of bad hits also increases. Search in specific fields as far as possible; the precision of the hits will increase. Consider, for example, whether you are looking for material about or by Astrid Lindgren: the first calls for a search in the subject word field, the second for a search in the author field.
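The difference between free-text searching and field searching can be illustrated with a minimal Python sketch. The posts, the field names and the two author names marked Example are invented for the illustration; a real database has its own fields and search syntax.

```python
# Invented posts for illustration; the field names are assumptions.
POSTS = [
    {"title": "Pippi Longstocking", "author": "Astrid Lindgren",
     "abstract": "A children's novel."},
    {"title": "Reading Astrid Lindgren", "author": "A. Example",
     "abstract": "An essay on her books."},
    {"title": "Children's literature in Sweden", "author": "B. Example",
     "abstract": "Surveys authors such as Astrid Lindgren."},
]


def free_text_search(posts, word):
    """Match the word in any field of the post."""
    word = word.lower()
    return [p for p in posts if any(word in str(v).lower() for v in p.values())]


def field_search(posts, field_name, word):
    """Match the word in one named field only."""
    word = word.lower()
    return [p for p in posts if word in p[field_name].lower()]


everything = free_text_search(POSTS, "lindgren")     # all posts mentioning her anywhere
by_her = field_search(POSTS, "author", "lindgren")   # material by Astrid Lindgren
in_title = field_search(POSTS, "title", "lindgren")  # her name in the title
```

The free-text search returns all three posts, while each field search returns a single post: fewer hits, but each one answers a more precise question.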
Truncation means searching for different endings of a word stem. Usually you add * or ? to the word stem. Specific information can be found in the help texts of the databases.
Example: search* gives hits on words such as search, searches, searched and searching.
Truncation may give a large number of hits, but in combination with other search words and search limitations truncation is an efficient tool.
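What truncation does can be sketched with a regular expression in Python. The sketch assumes that * stands for zero or more trailing letters; real databases differ in their truncation symbols and rules, so the help texts remain the authority. The function names and the sample sentence are made up for the example.

```python
import re


def truncation_to_regex(pattern):
    """Turn a search word with an optional trailing * into a regular expression.

    Assumption: * matches zero or more trailing word characters.
    """
    stem = pattern.rstrip("*")
    if pattern.endswith("*"):
        return re.compile(r"\b" + re.escape(stem) + r"\w*\b", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(stem) + r"\b", re.IGNORECASE)


def matches(pattern, text):
    """Return every word in text that the search word would hit."""
    return truncation_to_regex(pattern).findall(text)


sample = "Searching with search engines: researchers searched for searches."
truncated_hits = matches("search*", sample)
exact_hits = matches("search", sample)
```

Note that researchers is not a hit, since truncation here only varies the ending of the word, not its beginning.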
Subject words in databases
In databases special words, so-called subject words, are used to describe the contents of the documents. The subject words may be arranged in a system, a thesaurus, where the words are also grouped into levels of superior and inferior terms. The subject words provide one way of finding information on a subject. To find out which words are used in the database you’re working with, you can look in the subject vocabulary. There you may also find references to other words to use instead of the one you’ve looked up. One example is an article about farms in Kronoberg in the 14th century. It can be given the subject words farms, 14th century, Middle Ages, history, Kronoberg, Smaland, Gotaland and Sweden (1).
One way of searching with subject words is to note which subject words a good hit contains and then use the same subject words for further searches; this should result in similar hits.
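This way of working from a good hit can be sketched in Python. The first post below uses the title and subject words of the ERIC example earlier in the chapter; the other three posts and the function similar_posts are invented for the illustration. The sketch simply ranks the other posts by how many subject words they share with the good hit, which is one possible reading of "similar", not a description of how any particular database works.

```python
# Only the first post comes from the text; the rest are hypothetical.
POSTS = [
    {"title": "Who's Afraid of Google",
     "descriptors": {"Internet", "Information Seeking", "Search Strategies",
                     "Library Services", "Online Searching"}},
    {"title": "Hypothetical post A",
     "descriptors": {"Internet", "Search Strategies", "Online Searching"}},
    {"title": "Hypothetical post B",
     "descriptors": {"Library Services", "Library Administration"}},
    {"title": "Hypothetical post C",
     "descriptors": {"Mathematics Instruction"}},
]


def similar_posts(good_hit, posts):
    """Rank the other posts by the number of descriptors shared with good_hit."""
    scored = []
    for post in posts:
        if post is good_hit:
            continue
        shared = good_hit["descriptors"] & post["descriptors"]
        if shared:
            scored.append((len(shared), post["title"]))
    # Most shared descriptors first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


ranking = similar_posts(POSTS[0], POSTS)
```

Posts without any subject word in common with the good hit are left out of the ranking altogether.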
(1) Kronoberg, Smaland and Gotaland are designations of different geographical areas in southern Sweden.