The Field Selection Table
The Field Selection Table (or FST) is the table used by ISIS software to create and maintain inverted files, to exchange (import/export) data, and to create sort keys for sorted output and reports. When defining the FST, the designer has to take into account the types of searches to be made available to users, to maximise the chance of successful retrieval. CDS/ISIS provides a large number of possibilities to guarantee successful retrieval, i.e.:
The FST is a plain text file with three columns, in which the following three elements are identified:
The Inverted File of ISIS structures contains five elements:
The value given in the 1st column of the FST defines the ID component of the IF, identifying the keys generated by the format. This identification can then be used to qualify the origin of the key (e.g. field limitation) in searches on the database. In general the ID is the tag used in the FDT, but in some cases special results can be obtained by using 'virtual' tags (non-existing field tags) or 'alias' tags (re-defining the origin of one field to another).
Example: assuming our database has 3 fields to index authors:
10: Personal authors at analytic level
To make all authors searchable regardless of the type of material in which they appear, we can assign the ID 10 in the FST to all keys taken from fields 10, 16 and 23. Hence, if our FST specifies:
10 0 (v10/)(v16/)(v23/)
performing the search 'Amaro, Jorge Luis /(10)' with this IF returns all records in which 'Amaro, Jorge Luis' appears, whether in field 10, 16 or 23. So we retrieve the records independently of the author type registered in the database.
Assigning one unique ID to different field tags simplifies search expressions for more general searches (locating a term independently of the field it was taken from) while still allowing qualified searches (with field limitations defined by the ID). The degree of recall and precision to be applied (meaning: how large or how precise the 'catch' is meant to be) will depend on the decisions taken in the definition of the FST.
Currently there are 8 indexing techniques:
The techniques 2 and 3 have a similar effect on the key generation; the difference stems from the delimiter used. The characters /.../ cannot be substituted since they will always be present in the output to the screen or printer, whereas the < and > can be hidden or shown by the mode-command of the ISIS Formatting Language. When an extraction format of the FST is applied to a record, the sequence is the following:
we would like to obtain the following searchable keys:
The FST needed for this is:
The Inverted File is a set of 6 files: 5 of them are indexes to the dictionary of terms, and the sixth (with the .ifp extension) contains all keys extracted from the database by applying the FST to all records. The terms dictionary is an alphabetical list of all entries extracted from the database (as defined by the FST), and each entry carries the pointers that denote the exact locations of the term. Each such pointer or 'posting' has 5 components:
For example, suppose the term "Education" occurs in records 1 and 20 in the material field (v76) and also occurs in record 35 in the title field (v16): Methods of distance education. Applying the following FST to these records:
76 0 (v76/)
16 4 v16
the terms dictionary will contain the term Education as follows:
EDUCATION 1 76 1 1
          20 76 1 1
          35 16 1 3
Three "postings" were created for this term. The first one, 1 76 1 1, indicates that the key comes from MFN 1, first occurrence of field 76, and is located at the first position. The second posting, 20 76 1 1, indicates that it also occurs in MFN 20, field 76, first occurrence, as the first word. Finally, 35 16 1 3 indicates that MFN 35 contains the term 'education', extracted from field 16, first occurrence, as the 3rd word of that field occurrence.
Indexing technique 0 always assigns the value 1 as the word position of the key within the field. The other indexing techniques count the position of the key in the field. This position counting allows proximity searching: the 'proximity' or distance between words is derived from the difference between their position values. The field occurrence value is used by the search-within-field operator (F), which requires that all search keys come from the same field occurrence.
For example, let's consider an abstract containing the following phrase:
72: The direct education is strengthened by adjusting ...
With an FST defined as:
72 4 (v72/)
when we want to retrieve all records referring to DISTANCE EDUCATION, the search statement DISTANCE (G) EDUCATION will retrieve the previously mentioned record even if it does not specifically refer to the concept of distance education, but EDUCATION (F) DISTANCE will not retrieve the record. The reason for this difference is that the operator (G) does not verify the occurrence of the key, while the operator (F) does.
Let's analyze the FST at the beginning of this page:
It is therefore important that the format used to extract values from the record is in line with the indexing technique. If this is not the case, errors will occur in the generation of keys. Moreover, mastering the formatting language makes it possible to create keys that assure efficient information retrieval, as does a good understanding of how keys are gathered and of which search operators are appropriate. Together, these make it possible to give a precise answer to the requests of each particular user.
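The posting structure and the difference between the (G) and (F) operators described in this section can be illustrated with a minimal Python sketch. It is not the CDS/ISIS implementation: the records, the simplified FST rows (one source tag per row, plain tags instead of extraction formats) and the function names `build_index` and `search` are all assumptions made for illustration, and stopword handling is not modelled.

```python
import re
from collections import defaultdict

# Hypothetical sample data: MFN -> {field tag: [occurrence 1, occurrence 2, ...]}
RECORDS = {
    1: {72: ["The direct education is strengthened",
             "Distance between schools matters"]},
    2: {72: ["Methods of distance education"]},
}

# Simplified FST rows (assumption): (ID, indexing technique, source field tag)
FST = [(72, 4, 72)]


def build_index(records, fst):
    """Return {key: [(mfn, id, occurrence, position), ...]}."""
    index = defaultdict(list)
    for mfn in sorted(records):
        for ident, technique, tag in fst:
            for occ, text in enumerate(records[mfn].get(tag, []), start=1):
                if technique == 0:
                    # Whole field as one key; the position is always 1.
                    index[text.upper()].append((mfn, ident, occ, 1))
                elif technique == 4:
                    # One key per word; the position counts words in the occurrence.
                    words = re.findall(r"[A-Za-z]+", text)
                    for pos, word in enumerate(words, start=1):
                        index[word.upper()].append((mfn, ident, occ, pos))
    return index


def search(index, a, b, same_occurrence):
    """(G)-like search when same_occurrence is False, (F)-like when True."""
    hits = set()
    for mfn_a, id_a, occ_a, _ in index.get(a.upper(), []):
        for mfn_b, id_b, occ_b, _ in index.get(b.upper(), []):
            if mfn_a != mfn_b or id_a != id_b:
                continue  # both keys must come from the same record and field ID
            if same_occurrence and occ_a != occ_b:
                continue  # (F) additionally requires the same field occurrence
            hits.add(mfn_a)
    return sorted(hits)


index = build_index(RECORDS, FST)
print(index["EDUCATION"])
print(search(index, "distance", "education", same_occurrence=False))
print(search(index, "distance", "education", same_occurrence=True))
```

In this sample, record 1 holds the two words in different occurrences of field 72, so the (G)-like search returns both records while the (F)-like search returns only record 2.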
The use of prefixes in the generation of index keys
Since the dictionary is a single file with all keys sorted alphabetically, authors will be mixed in with titles, keywords and keys from any other field identified in the FST that begin with the same characters. If we want to keep keys together by the field from which they were extracted, there are 2 options:
According to the first option, let's change the FST
to:
As can be seen we have changed the following :
For keys indexed with technique 0, it is sufficient to add a pre-literal (prefix) to differentiate the data. For the other techniques (5, 6, 7 and 8), the prefix has to be defined before the extraction format, with the following syntax:
In addition to enabling us to see the content of a field in sorted order, without mixing in terms obtained from other fields, searching by prefix is faster than a 'field-limited' search. For example, searching 'M:Education' is more efficient than 'Education/(650)', because a field-qualified search requires that all postings of the term are checked. Still, depending on the experience level of the end users and the capacity of the equipment, it might sometimes be suitable to index the data in different ways, with and without a prefix, to give users more flexibility in searching. More index keys mean more space used by the system, but not necessarily lower retrieval speed, because the Inverted File structure (B-Tree) is re-organized constantly so that the 'height' of the tree remains the same for all branches (the height of the tree reflects the number of accesses required to locate a term in the inverted file).
The ISIS family software, except CDS/ISIS for DOS and Windows (WinISIS), allows the definition of more than one dictionary for one database. With the IsisDLL, for example, we can create one dictionary (or index) for authors, another one for titles, etc. Still, to combine terms from different fields in one single search expression using Boolean operators, it remains necessary to create a general dictionary which combines all fields, since it is not possible to intersect with terms from another dictionary. Separate dictionaries replace the use of prefixes for presenting all terms extracted from one particular field to the users, and allow logical operations within the same dictionary.
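The grouping effect of prefixes discussed in this section can be sketched in a few lines of Python. This is only an illustration of how alphabetical sorting behaves with and without prefixes: the sample entries, the prefix "A:" for authors and the helper name `make_keys` are assumptions, while "M:" for field 650 follows the search example above.

```python
def make_keys(entries, prefixes=None):
    """Build a sorted list of upper-cased dictionary keys.

    entries:  list of (field ID, extracted value) pairs
    prefixes: optional {field ID: prefix} mapping (hypothetical values)
    """
    prefixes = prefixes or {}
    return sorted(prefixes.get(field_id, "") + value.upper()
                  for field_id, value in entries)


# Hypothetical extracted values: field 10 = authors, field 650 = subjects
ENTRIES = [
    (10, "Amaro, Jorge Luis"),
    (650, "Education"),
    (10, "Eco, Umberto"),
    (650, "Agriculture"),
]

plain = make_keys(ENTRIES)
prefixed = make_keys(ENTRIES, {10: "A:", 650: "M:"})

print(plain)     # authors and subjects interleaved alphabetically
print(prefixed)  # all "A:" keys first, then all "M:" keys

# Browsing one field only becomes a simple prefix scan of the sorted list.
subjects = [key for key in prefixed if key.startswith("M:")]
print(subjects)
```

Without prefixes the subject AGRICULTURE sorts before the author AMARO, JORGE LUIS; with prefixes each field's keys form one contiguous block of the dictionary.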
Transparency in the use of upper-case, lower-case and diacritics (special characters)
One of the niceties of the CDS/ISIS search mechanism is its transparency with respect to upper/lower case and special characters in search terms. For this purpose, all keys are stored in the Inverted File in upper case and, if so prepared, accented characters (diacritics) are replaced by their upper-case equivalents. The search expressions given by the user are also transformed to upper case, which reduces the risk of user typing errors. For the conversion of keys and search expressions, the members of the CDS/ISIS software family use the file ISISUC.TAB, which should be in line with the character set used in the database. When indexing with technique 4 (per word), CDS/ISIS uses the file ISISAC.TAB to define the concept of a 'word'; i.e., the table ISISAC.TAB tells CDS/ISIS which characters should be considered alphabetical and can therefore form words. All characters not included in ISISAC.TAB are considered separators and terminate the word. Suppose ISISUC.TAB maps the character ñ to its upper-case variant Ñ. If we don't include the code for Ñ (165 in ASCII or 209 in ANSI) in ISISAC.TAB, each word containing it will be split into two parts, creating two entries in the dictionary, because a character not included in ISISAC.TAB is treated as punctuation.
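The word-splitting effect can be reproduced with a small Python sketch. It only imitates the role of the two tables: the set of alphabetic characters stands in for ISISAC.TAB, Python's `str.upper()` stands in for ISISUC.TAB, and the function name `split_words` and the sample text are assumptions made for illustration.

```python
import string


def split_words(text, alphabetic):
    """Split text into upper-cased words.

    Any character not in `alphabetic` acts as a separator, imitating the
    role of ISISAC.TAB; str.upper() stands in for ISISUC.TAB.
    """
    words, current = [], []
    for ch in text:
        if ch in alphabetic:
            current.append(ch)
        elif current:
            # A non-alphabetic character terminates the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return [word.upper() for word in words]


ASCII_ONLY = set(string.ascii_letters)   # ñ/Ñ missing from the table
WITH_ENYE = ASCII_ONLY | {"ñ", "Ñ"}      # ñ/Ñ declared as alphabetic

print(split_words("España niño", ASCII_ONLY))  # ['ESPA', 'A', 'NI', 'O']
print(split_words("España niño", WITH_ENYE))   # ['ESPAÑA', 'NIÑO']
```

With the incomplete table each word yields two dictionary entries, neither of which a user would think to search for; once the character is declared alphabetic the words are indexed whole.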