The Field Selection Table

The Field Selection Table (or FST) is the table used by ISIS software to create and maintain inverted files, and also when exchanging (importing/exporting) data and when creating sort keys for sorted output and reports.

When defining the FST, the designer has to take into account the types of searches to make available to the users to maximise the chance of successful retrieval.

CDS/ISIS offers many features that support successful retrieval, namely:

Nine different indexing techniques (field, word, marked terms, ...), allowing the same field to be included in the index in different ways
Extraction of the indexed values through the Formatting Language, which means values can be calculated and manipulated from the field before being sent to the index
Transparency in the use of uppercase, lowercase and diacritics in the search terms
An identification of the search terms, which allows the exact origin (MFN, field tag, field occurrence and word position within the field) of each term in the dictionary to be determined.

The FST is a plain text file with three columns, which identify the following three elements:

ID = Identification of the key. This is the tag of the field from which the keys were taken, either real or 'virtual' (alias). The keys will be considered as taken from the indicated field, whether this is real or not (with a virtual field or an alias, an element of one field is said to originate from another). This last technique allows, for example, all descriptors from several descriptor fields to be searched as just one type of descriptor, or all title variations to be searched as just 'title' data.

TI = Indexing technique. This specifies the indexing technique applied to the lines obtained after executing the extraction format on a record from the database.

Extraction format. This specifies the format, written in the ISIS formatting language, which is applied to the record to generate the index keys.

The Inverted File structure of ISIS contains five elements:

Search term (key)

MFN

ID of the field

Occurrence of the field

Position in the field occurrence

The value given in the 1st column of the FST defines the ID component of the IF, identifying the keys generated by the format. This identification can then be used to qualify the origin of the key (e.g. field-limitation) in the searches on the database. In general the ID uses the tag used in the FDT, but in some cases special results can be obtained by using 'virtual' (non-existing fieldtags) or 'alias' tags (re-defining the origin of one field to another).

Example: assuming our database has 3 fields to index authors:

10: Personal authors at analytic level
16: Personal authors at monographic level
23: Personal authors at series level

So that all authors are searchable regardless of the type of material in which they appear, we can assign the ID 10 in the FST to all keys taken from the fields 10, 16 and 23. This allows simple author searches that ignore the level to which the author is linked.

Hence, if our FST specifies:

10 0 (v10/)(v16/)(v23/)

then the search 'Amaro, Jorge Luis /(10)' on this IF returns all records in which 'Amaro, Jorge Luis' appears, whether in field 10, 16 or 23. The records are retrieved independently of the author type registered in the database.

Assigning a single ID to different field tags simplifies search expressions, both for general searches (locating a term regardless of the field it was taken from) and for qualified searches (with field limitations defined by the ID). The degree of recall and precision (that is, how large or how precise the 'catch' is meant to be) depends on the decisions taken when defining the FST.
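The alias ID can be illustrated with a minimal Python sketch. It is not ISIS code: the records, the build_index and search functions and the in-memory index are all hypothetical, and the sketch only models the idea that keys from tags 10, 16 and 23 are stored under the single ID 10.

```python
from collections import defaultdict

# Hypothetical records: MFN -> {tag: [field occurrences]}.
records = {
    1: {10: ["Amaro, Jorge Luis"]},
    2: {16: ["Amaro, Jorge Luis"], 10: ["Perez, Ana"]},
    3: {23: ["Amaro, Jorge Luis"]},
    4: {16: ["Perez, Ana"]},
}

# Models the FST line "10 0 (v10/)(v16/)(v23/)":
# one alias ID (10) fed by three source tags.
FST = [(10, (10, 16, 23))]


def build_index(records, fst):
    """Map each upper-cased key to a set of (mfn, id) pairs."""
    index = defaultdict(set)
    for mfn, fields in records.items():
        for alias_id, source_tags in fst:
            for tag in source_tags:
                for value in fields.get(tag, []):
                    index[value.upper()].add((mfn, alias_id))
    return index


def search(index, term, field_id=None):
    """Return sorted MFNs for a term, optionally limited to one ID,
    in the spirit of the qualified search 'term/(10)'."""
    hits = index.get(term.upper(), set())
    return sorted({mfn for mfn, fid in hits
                   if field_id is None or fid == field_id})


index = build_index(records, FST)
print(search(index, "Amaro, Jorge Luis", field_id=10))
```

Because every author key carries the ID 10, a search limited to ID 16 or 23 finds nothing in this sketch; that is the trade-off of using an alias.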

Currently there are 9 indexing techniques, numbered 0 to 8:

0 select the full line generated by the format

1 select each subfield from the line generated by the format

2 select from the line generated by the format only the terms enclosed between the characters < and >

3 select from the line generated by the format only the terms enclosed between slashes /.../

4 select from the line generated by the format all individual words (= full text indexing)

5 Same as technique 1, putting a prefix in front of the generated key

6 Same as technique 2, putting a prefix in front of the generated key

7 Same as technique 3, putting a prefix in front of the generated key

8 Same as technique 4, putting a prefix in front of the generated key

Techniques 2 and 3 have a similar effect on key generation; the difference lies in the delimiter used. The slashes /.../ cannot be hidden, since they will always be present in the output to the screen or printer, whereas < and > can be hidden or shown with the mode command of the ISIS Formatting Language.
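As a rough illustration of techniques 0 to 4, the following Python sketch applies each one to lines that a format might have produced. It is a simplified model, not the ISIS implementation: the function name extract_keys is hypothetical, words are assumed to be runs of ASCII letters and digits, and stop words and key-length limits are ignored.

```python
import re


def extract_keys(lines, technique):
    """Return the upper-cased keys a technique would select from each line."""
    keys = []
    for line in lines:
        if technique == 0:      # the full line
            candidates = [line]
        elif technique == 1:    # each subfield (delimiter: ^ plus one character)
            candidates = re.split(r"\^.", line)
        elif technique == 2:    # terms enclosed between < and >
            candidates = re.findall(r"<([^<>]*)>", line)
        elif technique == 3:    # terms enclosed between slashes
            candidates = re.findall(r"/([^/]*)/", line)
        elif technique == 4:    # each individual word
            candidates = re.findall(r"[A-Za-z0-9]+", line)
        else:
            raise ValueError("only techniques 0-4 are modelled here")
        keys.extend(c.strip().upper() for c in candidates if c.strip())
    return keys


print(extract_keys(["Emery, K. O."], 0))
print(extract_keys(["^aDatabase management^xCongresses."], 1))
print(extract_keys(["Study of <sea level> and <tides>"], 2))
print(extract_keys(["Study of /sea level/ and /tides/"], 3))
print(extract_keys(["Sea levels and tide gauges"], 4))
```

Techniques 5 to 8 behave like 1 to 4 with a prefix put in front of each key.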

When an extraction format of the FST is applied to a record, the sequence is the following:

The format is applied to the record to generate the keys
The indexing technique is applied to the keys
Each key resulting from the indexing technique is assigned the ID given and then both are inserted into the Inverted File.

Example. Let's consider the following record (MARC format):

35 :  $9(DLC) 90049743
10 :  ^a 90049743
20 :  ^a0387974490 (alk. paper)
40 :  ^aDLC^cDLC^dDLC
41 : 0^aeng^bfregerhebjapsparus
50 : 0^aGC89^b.E54 1991
82 : 0^a551.4/58$220
100:1 ^aEmery, K. O.^q(Kenneth Orris),^d1914-
245:10^aSea levels and tide gauges /^cK.O Emery, David G. Aubrey.
260: ^aNew York :^bSpringer-Verlag,^cc1991.
300: ^axiv, 237 p. :^bill., maps :^c29 cm.
500: ^aIn English, with summaries in French, German, Hebrew, Japanese, Spanish, and Russian.
504: ^aIncludes bibliographical references (p. 207-226) and indexes.
650: 0^aSea level.
650: 0^aSubsidences (Earth movements)
650: 0^aTide-gages.
650: 0^aDatabase management^xCongresses.
650: 0^aArtificial intelligence^xCongresses.
700:1 ^aAubrey, David G.
520:000113 35151
935:LA

We would like to obtain the following searchable keys:

Title (245) to be retrieved by each individual word

Authors (100 and 700) to be retrieved by full reference (surname + first name) and by surname and first name separately

Materials (650) to be retrieved by complete reference or each word in it

Languages (41) All languages (note: in subfield b of field 41 the languages are given by a 3-character code representing each language)

Publisher (260) as given in the record

Publication date(260) as given in the record

LC Classification (50) to allow an overall search at the level of the first classification code part and at a detailed level of the full code

Date of entry into the database (5) to retrieve all records entered in a particular year, month or day

The FST needed for this is:

Title (245)

245 4 v245^a

Authors (100 and 700)

100 0 v100^a/,(v700^a/)
100 4 v100^a/(v700^a/)

Materials (650)

650 1 (v650*2/)
650 4 (v650*2/)

Languages (41)

41 0 v41^a/v41^b.3/ v41^b*3.3/ v41^b*6.3/ v41^b*9.3/ v41^b*12.3/ v41^b*15.3/ v41^b*18.3

Publisher (260)

260 0 v260^b

Publication date(260)

260 0 v260^c

LC Classification (50)

50 0 v50^a/v50^a,v50^b

Date of entry (5)

5 0 v5.4/v5.6/v5

The Inverted File is a set of 6 files: 5 of them make up the index to the dictionary of terms, and the sixth (with the .ifp extension) holds the postings of all keys extracted from the database by applying the FST to all records.

The terms dictionary is an alphabetical list of all entries extracted from the database (as defined by the FST), and each entry contains the pointers that denote the exact locations of the term. Each such pointer or 'posting' has 4 components:

MFN of the record from which the key was extracted
Id. of the field, as indicated in the first column of the FST
Field occurrence number where the key was extracted
Position within the field of the word extracted (with indexing technique 4 or 8)

For example, suppose the term "Education" occurs in records 1 and 20 in the material fields (v76) and also occurs in record 35 in the title field (v16), "Methods of distance education". Applying the following FST to these records:

76 0 (v76/)
16 4 v16

the terms dictionary will contain the term Education as follows :

EDUCATION     1 76 1 1
             20 76 1 1
             35 16 1 3

Three "postings" were created for this term. The first one, 1 76 1 1 indicates that the key comes from MFN 1, first occurrence of field 76 and is located at the first position. The second posting 20 76 1 1 indicates that it also occurs in MFN 20, field 76, first occurrence as the first word, and finally: 35 16 1 3 indicates that MFN 35 contains the term 'education', extracted from field 16, first occurrence as the 3rd word of that field occurrence.
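The structure of a dictionary entry and its postings can be modelled with a small Python sketch. The Posting type, the dictionary variable and the records_for function are hypothetical names used for illustration only; the posting values are copied from the example above, not computed.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    mfn: int         # record from which the key was extracted
    field_id: int    # ID given in the first column of the FST
    occurrence: int  # field occurrence number
    position: int    # word position within the occurrence


# Dictionary entry with the three postings from the example.
dictionary = {
    "EDUCATION": [
        Posting(1, 76, 1, 1),
        Posting(20, 76, 1, 1),
        Posting(35, 16, 1, 3),
    ],
}


def records_for(term, field_id=None):
    """List the MFNs for a term, optionally limited to one field ID."""
    postings = dictionary.get(term.upper(), [])
    return [p.mfn for p in postings
            if field_id is None or p.field_id == field_id]


print(records_for("education"))       # all records containing the term
print(records_for("education", 16))   # only those where it comes from ID 16
```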

Indexing technique 0 always assigns the value 1 as the word position of the key within the field. The other indexing techniques count the position of the key in the field. This position counting allows proximity searching: the distance between two words is derived from the difference between their position values.

The field occurrence value is used for the search-within-field operator (F) which denotes that all search keys need to come from the same field occurrence.

For example, let's consider a record in which the abstract field (72) has the following two occurrences:

72: The direct education is strengthened by adjusting ...
72: The distance in between the lecture theatre and the library should .......

With the FST line 72 4 (v72/), when we want to retrieve all records referring to DISTANCE EDUCATION, the search statement:

DISTANCE (G) EDUCATION

will retrieve the previously mentioned record even if it does not specifically refer to the concept of distance education, but

EDUCATION (F) DISTANCE

will not retrieve the record. The reason for this difference is the fact that the operator (G) does not verify the occurrence of the key, while the operator (F) does.
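The difference between the two operators can be sketched in Python using the two occurrences of field 72 shown above. This is a simplified model under stated assumptions: index_words, same_field and same_occurrence are hypothetical helpers, the MFN 1 is invented for the example, and words are taken to be runs of ASCII letters.

```python
import re
from collections import defaultdict


def index_words(mfn, tag, occurrences):
    """Model of 'tag 4 (vtag/)': one posting (mfn, id, occ, pos) per word."""
    postings = defaultdict(list)
    for occ, text in enumerate(occurrences, start=1):
        words = re.findall(r"[A-Za-z]+", text)
        for pos, word in enumerate(words, start=1):
            postings[word.upper()].append((mfn, tag, occ, pos))
    return postings


def same_field(postings, a, b):
    """Model of (G): both terms come from the same field ID of a record."""
    return sorted({pa[0] for pa in postings.get(a, [])
                   for pb in postings.get(b, []) if pa[:2] == pb[:2]})


def same_occurrence(postings, a, b):
    """Model of (F): both terms come from the same field occurrence."""
    return sorted({pa[0] for pa in postings.get(a, [])
                   for pb in postings.get(b, []) if pa[:3] == pb[:3]})


postings = index_words(1, 72, [
    "The direct education is strengthened by adjusting",
    "The distance in between the lecture theatre and the library should",
])

print(same_field(postings, "DISTANCE", "EDUCATION"))       # record is found
print(same_occurrence(postings, "EDUCATION", "DISTANCE"))  # record is not found
```

The only difference between the two functions is how many posting components are compared: MFN and ID for (G), and MFN, ID and occurrence for (F).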

Let's analyze the FST at the beginning of this page:

245 4 v245^a

Extracts subfield ^a of field 245 and applies technique 4 to the resulting line. Each word obtained will be included in the dictionary with the ID 245.

100 0 v100^a/,(v700^a/)

Extracts subfield ^a from field 100 and subfield ^a from each occurrence of field 700. The resulting lines are sent whole (technique 0) to the Inverted File with the ID 100. Note that the simpler format v100^a, v700^a will not produce the required results because:

Indexing technique 0 creates one key from each line produced by the format. Since no new-line (/) is added, the format will produce only one line with all authors chained together, and only the first 30 characters will be put into the dictionary.

Field 700 is repeatable; if it is not treated as a repeatable group, the format will extract all occurrences as one phrase, meaning the individual author names will not be sent to the IF as such.
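The effect of the missing new-line can be shown with a short Python sketch. It assumes the 30-character key limit mentioned above; the function technique0 is a hypothetical model, and the second 700 value ("Smith, John") is invented so that the chained line exceeds the limit.

```python
MAX_KEY_LENGTH = 30  # limit stated in the text (the ABCD limit is longer)


def technique0(format_output):
    """Model of technique 0: one key per line, truncated to the key limit."""
    return [line[:MAX_KEY_LENGTH].upper()
            for line in format_output.split("\n") if line]


v100a = "Emery, K. O."
v700a = ["Aubrey, David G.", "Smith, John"]  # second author is hypothetical

# Output of 'v100^a, v700^a': no new-lines, everything chained together.
chained = v100a + "".join(v700a)

# Output of 'v100^a/,(v700^a/)': one line per author.
separated = "\n".join([v100a] + v700a)

print(technique0(chained))    # a single key, cut at 30 characters
print(technique0(separated))  # one key per author
```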

100 4 v100^a/(v700^a/)

Extracts subfield ^a from the fields 100 and 700 of the record. Each occurrence goes to a separate line. From this list of values each word will be extracted (technique 4) and put into the dictionary with ID 100. Each word carries with it the occurrence counter of the field and also the word counter to indicate its relative position within this occurrence.

When indexing full text (per word), why is it still necessary to include end-of-lines to separate the occurrences? Because if there is no separation between the fields v100^a and v700^a, the last word of v100^a will appear joined to the first word of the first occurrence of v700^a, producing an incorrect entry in the dictionary. In the same way, if the occurrences of v700 are not separated by an end-of-line, the first word of each occurrence will appear joined to the last word of the previous occurrence.

650 1 (v650*2/)

In this example we are extracting each subfield of field 650 and generating an entry for each one in the IF with ID 650.
Why v650*2? The record in the example is catalogued according to the MARC format, and because of this the two 'indicators' precede the subfield ^a:

650 0^aDatabase management^xCongresses.

650 0^aArtificial intelligence^xCongresses.

If we use the FST line 650 1 (v650/), ISIS will try to identify all subfields of each occurrence of field 650; the part corresponding to the indicators will also be seen as a subfield and will generate a key as if it came from a subfield of v650. With the expression (v650*2/) extraction starts at the 3rd character (the offset counts from 0), so the indicators at positions 0 and 1 are not taken into account.

When indexing technique 1 is used, it is necessary to verify that the extraction format really preserves the subfield delimiters. For example, the FST line 650 1 mhu,(v650*2/) will generate incorrect keys: by definition, the mode MHU replaces subfield delimiters by punctuation, so the delimiters disappear from the formatted output and the key is created from one single phrase containing all subfields. In addition, the index will only contain the first 30 characters of that phrase (60 in the case of ABCD), resulting in a loss of entries for that record.
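The role of the *2 offset can be checked with a small Python sketch applied to one 650 occurrence from the example record. The functions technique1 and offset are hypothetical models, and the assumption that surrounding blanks are trimmed from each key is made only to keep the output readable.

```python
import re


def technique1(line):
    """Model of technique 1: one key per subfield (delimiter: ^ plus one character)."""
    parts = re.split(r"\^.", line)
    return [p.strip().upper() for p in parts if p.strip()]


def offset(value, start):
    """Model of the *n offset of the formatting language (counts from 0)."""
    return value[start:]


# One occurrence of field 650, with its two indicators (blank and 0).
occurrence = " 0^aDatabase management^xCongresses."

print(technique1(occurrence))             # the indicators produce a spurious key
print(technique1(offset(occurrence, 2)))  # only the two subfields remain
```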

650 4 (v650*2/)

The same reasoning applies here as in the previous case, but here the words from the lines will be extracted.

41 0 v41^a/v41^b.3/ v41^b*3.3/ v41^b*6.3/ v41^b*9.3/ v41^b*12.3/ v41^b*15.3/ v41^b*18.3

Since the subfield ^b has a mandatory pattern of exactly 3 characters per language code, the substring operator can be used to extract each code as a separate key, instead of putting the whole chain of languages into the dictionary as one key.
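The substring extraction on field 41 can be reproduced with plain string slicing. The Python sketch below uses the values of the example record; the function language_keys and its parameters are hypothetical, with seven slots matching the seven substrings (offsets 0 to 18) in the FST line.

```python
def language_keys(v41a, v41b, width=3, slots=7):
    """Model of 'v41^a/v41^b.3/v41^b*3.3/...': one key per 3-character code."""
    keys = [v41a] if v41a else []
    for i in range(slots):
        code = v41b[i * width:(i + 1) * width]  # same as v41^b*(i*3).3
        if code:                                # empty slices produce no key
            keys.append(code)
    return [k.upper() for k in keys]


# Values from the example record: 41 ^aeng^bfregerhebjapsparus
print(language_keys("eng", "fregerhebjapsparus"))
```

The last slot (offset 18) is empty for this record, which is why it yields no key.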

50 0 v50^a/v50^a,v50^b

In this example we create entries for each LC classification. The first key, v50^a, will allow searches per thematic group to be performed. The second key, v50^a followed by v50^b, will allow searches on a particular full classification code. Note again the presence of the new-line character (/), in order to obtain two independent keys.

5 0 v5.4/v5.6/v5

With the date of entry of the record we create 3 keys: the first, v5.4, allows quick identification of all materials entered in one year; the second, v5.6, retrieves all records entered in a given month; the third retrieves all entries of one specific day. Note that creating all three keys (per year, year-month and year-month-day) makes retrieval more efficient than generating only one key at the level of year, month and day and applying the 'right-hand truncation operator' ($) to search per year or month.
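The three date keys, and the alternative of right-hand truncation, can be compared in a short Python sketch. The date value 19910113 is a hypothetical example in year-month-day form, and date_keys and truncated_search are illustrative helpers rather than ISIS functions.

```python
from bisect import bisect_left


def date_keys(v5):
    """Model of 'v5.4/v5.6/v5': year, year-month and full date keys."""
    return [v5[:4], v5[:6], v5]


def truncated_search(sorted_keys, stem):
    """Model of 'stem$': collect every dictionary key starting with stem."""
    matches = []
    for key in sorted_keys[bisect_left(sorted_keys, stem):]:
        if not key.startswith(stem):
            break
        matches.append(key)
    return matches


print(date_keys("19910113"))

# With only full-date keys in the dictionary, a search per year must
# expand the truncated term into every matching key.
full_dates_only = sorted(["19901231", "19910113", "19910114", "19910220"])
print(truncated_search(full_dates_only, "1991"))
```

With the year key present, the same request becomes a lookup of the single term 1991 instead of an expansion over all dates of that year.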

It is therefore important that the extraction format is in line with the indexing technique; if it is not, errors will occur in the generation of keys. Mastering the formatting language makes it possible to create keys that support efficient information retrieval, as does a good understanding of how keys are gathered and of which search operators to use. Together these help to give a precise answer to the requests of each particular user.

The use of prefixes in the generation of index keys

Since the dictionary is a single file with all keys sorted alphabetically, authors will be mixed in with titles, keywords and the keys of any other field identified in the FST that begin with the same characters.

If we want to keep the keys together by field (from which they were extracted) there are 2 options :

Use prefixes at the time of creation of the index keys in order to produce 'subdictionaries' (or 'sections')
Create separate dictionaries for each field.

According to the first option, let's change the FST

Title (245)

245 4 v245^a

Authors (100 and 700)

100 0 v100^a/,(v700^a/)
100 4 v100^a/(v700^a/)

Materials (650)

650 1 (v650*2/)
650 4 (v650*2/)

Languages (41)

41 0 v41^a/v41^b.3/ v41^b*3.3/ v41^b*6.3/ v41^b*9.3/ v41^b*12.3/ v41^b*15.3/ v41^b*18.3

Publisher (260)

260 0 v260^b

Publication date (260)

260 0 v260^c

LC Classification (50)

50 0 v50^a/v50^a,v50^b

Date of entry (5)

5 0 v5.4/v5.6/v5

by :

Title (245)

245 8 `/T:/`,v245^a

Authors (100 and 700)

100 0 "A:"v100^a/,(|A:|v700^a/)
100 8 `/A:/`,v100^a/(|A:|v700^a/)

Materials (650)

650 5 `/M:/`,(v650*2/)
650 8 `/M:/`,(v650*2/)

Languages (41)

41 8 `/I:/`,v41^a," "v41^b.3," "v41^b*3.3, " "v41^b*6.3, " "v41^b*9.3, " "v41^b*12.3, " "v41^b*15.3, " "v41^b*18.3

Publisher (260)

260 0 "E:"v260^b

Publication date (260)

260 0 "F:"v260^c

LC Classification (50)

50 0 "C:"v50^a/"C:"v50^a,v50^b

Date of entry (5)

5 0 "F:"v5.4/"F:"v5.6/"F:"v5

As can be seen we have changed the following :

Technique	changed to
1	5
4	8

For keys indexed with technique 0 it is sufficient to add a literal (prefix) in front of the field to differentiate the data. For the other techniques (5, 6, 7 and 8) the prefix has to be defined at the start of the extraction format with the following syntax:

the prefix should be written as an unconditional literal
the text constituting the prefix should be enclosed by two identical special characters NOT occurring in the prefix.
Examples:
`/A:/`
`#A:#`
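The delimiter convention for prefixes can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the ISIS implementation: split_prefix and technique8 are hypothetical names, the format output is assumed to start with the delimited prefix, and words are taken to be runs of ASCII letters and digits.

```python
import re


def split_prefix(format_output):
    """Split '/T:/data' into ('T:', 'data'); the first character is the delimiter."""
    delimiter = format_output[0]
    end = format_output.index(delimiter, 1)  # closing delimiter
    return format_output[1:end], format_output[end + 1:]


def technique8(format_output):
    """Model of technique 8: technique 4 with the prefix added to every word."""
    prefix, data = split_prefix(format_output)
    return [prefix + word.upper() for word in re.findall(r"[A-Za-z0-9]+", data)]


print(split_prefix("#A:#Emery, K. O."))
print(technique8("/T:/Sea levels and tide gauges"))
```

The sketch also shows why the delimiter must not occur in the prefix itself: the closing delimiter is simply the next occurrence of the first character.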

In addition to enabling us to see the content of a field in sorted order, without mixing in terms obtained from other fields, searching by prefix is faster than a 'field-limited' search. For example:

Searching 'M:Education' is more efficient than 'Education/(650)' because a field-qualified search requires that all postings of the term are checked.

Still, depending on the experience level of the end-users and the capacity of the system's equipment, it may sometimes be suitable to index the data in different ways, with and without a prefix, in order to give users more flexibility in searching. More index keys means more space used by the system but not necessarily lower retrieval speed, because the Inverted File structure (B-Tree) is re-organized constantly so that the 'height' of the tree remains the same for all branches (the height of the tree reflects the number of accesses required to locate a term in the inverted file).

The software of the ISIS family, except CDS/ISIS for DOS and Windows (WinISIS), allows the definition of more than one dictionary for one database. With the IsisDLL, for example, we can create one dictionary (or index) for authors, another one for titles, etc. Still, in order to combine terms of different fields in one single search expression using Boolean operators, it remains necessary to create a general dictionary which combines all fields, since it is not possible to intersect with terms from another dictionary. Separate dictionaries replace the use of prefixes for presenting all terms extracted from one particular field to the users, and allow logical operations within the same dictionary.

Transparency in the use of upper-case, lower-case and diacritics (special characters)

One of the niceties of the CDS/ISIS search mechanism is its transparency with respect to upper/lower case and special characters in the search terms.

For this purpose, all keys are stored in the Inverted File in upper-case and, if so configured, accented characters (diacritics) are replaced by their upper-case equivalents. The search expressions given by the user are also converted to upper-case, which minimizes the risk of user typing errors.

For the conversion of keys and search expressions, the software of the CDS/ISIS family uses the file ISISUC.TAB, which should be in line with the character set used in the database.

When indexing with technique 4 (per word) CDS/ISIS uses the file ISISAC.TAB to define the concept of a 'word'; i.e., the table ISISAC.TAB tells CDS/ISIS which characters should be considered alphabetical and can therefore constitute words. All characters not included in ISISAC.TAB will be considered separators and will terminate the word.

Suppose ISISUC.TAB makes the character ñ equivalent to its uppercase variant Ñ. If we don't include the code for Ñ (165 in ASCII or 209 in ANSI) in ISISAC.TAB

the words:      will appear in the index as:

niño            NI O
cañería         CA ERIA
cañaveral       CA AVERAL
acuñación       ACU ACION

i.e., each word will be split into two parts, creating two entries in the dictionary, since Ñ is not included in ISISAC.TAB and will be considered as punctuation.
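The interplay of the two tables can be imitated in Python. The sketch below does not read the real ISISUC.TAB or ISISAC.TAB files: UC_TABLE is a hypothetical excerpt of an upper-case table, ALPHABETIC stands in for the set of characters listed in ISISAC.TAB, and to_upper and split_words are illustrative helpers.

```python
import string

# Hypothetical excerpt of an upper-case table: ñ keeps its tilde,
# accented vowels lose their accent.
UC_TABLE = {"ñ": "Ñ", "í": "I", "ó": "O"}

# Stand-in for ISISAC.TAB: only the unaccented letters A-Z are alphabetic.
ALPHABETIC = set(string.ascii_uppercase)


def to_upper(text):
    """Convert text with the table, falling back to normal upper-casing."""
    return "".join(UC_TABLE.get(ch, ch.upper()) for ch in text)


def split_words(text, alphabetic):
    """Split converted text into words; non-alphabetic characters end a word."""
    words, current = [], ""
    for ch in to_upper(text):
        if ch in alphabetic:
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


for word in ["niño", "cañería", "cañaveral", "acuñación"]:
    print(word, split_words(word, ALPHABETIC))

# Adding Ñ to the alphabetic set keeps each word in one piece.
print(split_words("niño", ALPHABETIC | {"Ñ"}))
```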