Thursday 9 June 2011

Reprint: Formalism or Flexibility [Relational Data Base] 1978

 Proceedings, International Conference on Data Base Management Systems, Milam, 29-30 June, 1978, pp127-138

FORMALISM OR FLEXIBILITY?
C. F. Reynolds & D. Omrani
Department of Computer Science, Brunel University, Uxbridge, Middlesex.

The relational data base approach represents the essence of formalism in data base technology. In contrast CDOIL provides an extremely flexible, but potentially ambiguous, approach to open-ended and poorly defined information processing systems. This paper discusses the implementation of a relational system using CDOIL and discusses the various features of the language that make this possible.

INTRODUCTION
The proposals for a relational data base made by E.F. Codd [1 - 4] have caused widespread interest because they provide a mathematically rigorous approach to the problems of storing and retrieving data in a totally unambiguous manner. To apply the techniques it is necessary to know the detailed semantics of the information that will be held in the data base. The retrieval of information from such a data base will normally be carried out using either a relational algebra or calculus oriented sublanguage and many such systems have been proposed or implemented.

CODIL [5 - 7] is a user oriented computer language based on an approach to set theory which is easy for non-mathematicians to understand. It is designed for amorphous applications whose structure is not known (or perhaps knowable) in advance, or where the structure is so convoluted that it is easier to assume that it is not known. It admits ambiguity simply because many real world situations are ambiguous and it is felt that it is more important to reflect reality as accurately as possible, rather than to substitute a mathematically exact, but factually incorrect representation. However the most significant distinction between CODIL and the relational data base approach is that of control. CODIL is designed to handle all information about the application within a single, very simple, framework that encompasses the concepts of "program" and "data". The relational data base approach is based on the assumption that the "data" and the "programs" for manipulating the data are completely distinct. In practical terms this means that the rules for processing a file written in CODIL are themselves written in CODIL while the rules for manipulating a relation in a relational data base are not themselves part of that data base.

This paper takes a pragmatic look at the differences between the two approaches. It is clear that the theoretical assumptions underlying the relational data base make it impractical to extend the approach to handle ambiguous and poorly defined problems. On the other hand there is nothing to prevent the user of CODIL from selecting an application in which the degree of open-endedness, poor definition and ambiguity happens to be zero. The paper demonstrates with worked examples the ease with which CODIL files may be set up as relations and processed with relational operators. It then shows how the concept of third normal form can be used to explain the workings of the CODIL interpreter. Finally it describes an implementation of a relational data base system, written in CODIL in such a way that the potential ambiguities of CODIL are no longer accessible to the user.

RELATIONS AND CODIL FILES
If a user wishes to set up a CODIL file as a relation he must constrain the structure of the file so that it conforms to the basic rules for normalised relations. this can best be explained by examining suitable examples, such as the CODIL file "CUSTOMER REBATES" of figure 1 and the relation "CUST REBATES" of figure 2.
Figure 1 - An extract from CUSTOMER REBATES file written in CODIL
Figure 2 - An extract from CUST REBATES relation
The first stage of the comparison is to look at the basic units of information storage. In relational terms "1 KG DRUMS" is an element in the domain "PACK", while in CODIL "1 KG DRUMS" is the value of the item "PACK = 1 KG DRUMS" where "PACK" is the item name. As long as the logical operator between the item name and value is "=" an exact correspondence is possible between a CODIL item and a relational domain element, and any CODIL file that is to have the properties of a relation must contain no other kind of items. CODIL items that represent the ends of ranges, such as "QUANTITY < 1000", or negations, such as "PACK NOT LOOSE", must be excluded. In the case of a negation from a domain containing a comparatively small number of element values the best approach may well be to list all the positive elements. This approach will not work for terminators of large numeric ranges. These can best be normalised by introducing new domains such as MAX QUANT and MIN QUANT.

At the next stage up the structure the relational tup1e can be equated to the CODIL statement. The relational approach to normalisation requires tuples to be independent of each other and this means that, if a CODIL file is to represent a relation, items cannot be shared between statement. This means that leading items, such as "CUST NO = 12345", must be duplicated in all statements in which they occur and that statements containing "OR" and "ELSE" must be unscrambled. Difficulties can also occur if an attribute name occurs twice in a tuple and further renaming may be needed. At the highest level a comparison is possible between the relation and the CODIL file. The requirements here are that all tuples in a relation must have the same form and this can lead to the introduction of extra dummy items such as "MAX QUANT = 999999" and MIN QUANT = 0.

To summarise, a CODIL file can be treated as being identical to a relation of the same name if:
(a) all items are elementary,
(b) all items in statement are distinguishable by name,
(c) all statements are independent,
(d) all statements contain equivalently named items,
(e) there are no duplicate statements.
It should be noted that in CODIL there is no automatic way of controlling the structure of files and it is therefore the user's own responsibility to ensure that the above rules are adhered to. Usually he will do this by writing a CODIL file which reads the information in in the format desired. Figure 3 is a CODIL file, RCR, designed to read customer rebates into the relation CUST REBATES. It reads in each domain in turn, checks that the statement/tup1e is not a duplicate by comparing it with the relation, and then stores the statement in the relation. Its use in "interactive" mode is illustrated in figure 4 and in "batch" mode in figure 5.
Figure 3  - The CODIL file, RCR, designed to input new tuples and add them to the relation, CUST REBATES, if they are not duplicates.
Figure 4 - Inputting a new tuple to the relation CUST REBATES in interactive mode.
Figure 5 - Inputting new tuples to the relation CUST REBATES in batch mode.
  RELATIONAL OPERATIONS IN CODIL

While relational operators are not present in CODIL as primitives it is very easy to write CODIL files to carry out the appropriate operations, such as union, join, projection, etc. The example given in figure 6 involves the creation of a file, PROJECT PACKS, which generates the relation PACKS from the relation CUST REBATES. CREATE is an inbuilt CODIL function for inputting new files, and is terminated by END. The file is entered as a function by typing its name, PROJECT PACKS, and the resulting relation is subsequently printed using PRINT ITEM. STORE FILE = PACKS has the effect of setting up a new file, PACKS, in storage mode. READ = CUST REBATES accesses the file CUST REBATES statement by statement, applying the subfile, delineated by ":", to each. PACKS NONE is used to check that the current PACK is not already on the file PACKS, and hence prevents duplicates. STORE = PACK stores the current value of PACK and STORE by itself terminates the current output statement.
Figure 6 - Generating the relation PACKS from CUST REBATES by projection
Figure 7 illustrates a more general CODIL routine that will generate the cartesian product of any two CODIL files and put the result in a third file. It reads in the names of the three files in the form "FILE 1 = filename". STORE FILE IS = FILE 3 opens a file for information storage whose name corresponds to the value of the item, FILE 3. (The "IS" indicates that the value of the item is to be evaluated at run time, rather than being a constant.) READ is then used recursively to read the two source files, with the resulting statement being stored. It is important to realise that the resultant file will only be a relation if both the source files are relations.
Figure 7 - A CODIL file for generating the cartesian product of a pair of files.
Figure 8 - The CODIL decision making unit
JOINING OF RELATIONS AND THE CODIL DECISION MAKING UNIT


At the heart of the CODIL interpreter is a decision making unit which is represented diagrammatically in figure 8. At anyone time the unit holds a single statement, referred to as the facts. This statement is automatically compared with other statements, referred to as criteria. This comparison process involves each item in the criteria statement in turn until (a) the last item in the criteria statement is reached or (b) a false criteria item is detected. In the first case the terminal item is, by definition, true and it is either moved to the facts statement or obeyed if it represents a function. In either case the interpreter then proceeds to the next criteria statement. For this to happen each statement read as criteria must be in a form equivalent to a subset of Codd's third normal form. Each statement must consist of one or more key items followed by a single dependent non-key item. This allows the statement to be interpreted as:
"IF the key items are TRUE THEN the non-key item is TRUE."
This is best illustrated by an example. Figure 9 contains a relation, CUST TOWN, which is in third normal form, with the TOWN being dependent on the CUST NO. This is followed by a single statement describing an order for 600 1 KG DRUMS of OZOX. By typing in the name of the file/relation each statement is compared in turn with the facts until a true CUST NO is encountered, when the appropriate TOWN is moved to the facts. This is shown twice, the second time with a trace, and it can be seen that the comparison process does not stop when the matching CUST NO is found. This is because the CODIL interpreter is not restricted to processing normalised relations and there may be more than one true statement. Seen in this light the CODIL interpreter can be considered a system for automatic­ ally making natural joins between statements, with the join being made on the key items of the criteria statements.
The mechanism is similar to that employed in a number of artificial intelligence packages, with the criteria statement being interpreted as production rules while the facts are equivalent to the contents of "short term memory" [10].
Figure 9 - Joining the relation CUST TOWN with an order for OZOX.

Figure 10 shows that it is possible to make a natural join between the order from Figure 9 and the CODIL file, CUSTOMER REBATES. However there is no join between the order and the relation, CUST REBATES, until QUANTITY is reinterpreted in terms of MAX QUANT and MIN QUANT.

Figure 10 - Joining CUST REBATES with the order from figure 9.
It should be noted that relations in third normal form need not contain any non-key items. For instance the relation BAD DEBT, in figure 11, contains a list of customers with bad debts outstanding. CODIL allows files such as this to be used as conditions and it can be seen that the customer whose order is being processed is not a BAD DEBT. However by comparing BAD DEBT with CUST REBATES it is possible to determine that there is at least one customer entitled to a rebate who is a BAD DEBT. The use of files in this way is common in CODIL, as can be seen in figures 3 and 6, and can be considered as a dynamic extension of the condition name concept that is widely used in COBOL.
Figure 11 - The use of files as conditions.
IMPLEMENTING A RELATIONAL ALGEBRA IN CODIL


The above examples have shown that it is simple enough to set up CODIL files in the form of relations and a companion set of relational operators. Problems arise if it is necessary to ensure the integrity of the relations being processed. For instance there is nothing to stop someone trying to find the union of two files which are not union compatible, and which may not even be relations: In providing a relational system in CODIL it is therefore essential that the user can only input and process valid relations. The approach described below uses CODIL to write the software needed to constrain the user, and to protect him from those aspects of the language that could lead to ambiguity. It also provides him with a relational algebra that allows him to define, update and interrogate the data base.


There are two main operators connected with the definition of the data base. DECLARE DOMAIN is used to define new domain names, together with a brief description of their format, and checks are made that the name is valid. DECLARE RELATION is similarly used to define new relations in terms of pre­ defined domain names. Both of these are used to update a series of directories maintained by the system and used to check the validity of other operations on relations. The use of these operations is illustrated in figure 12. There are two complementary operations, DROP DOMAIN and DROP RELATION and in addition new relation names may be .generated as a result of-using the algebra .
Figure 12 - Use of the DECLARE DOMAIN and DECLARE RELATION operators.
Relations may be populated directly by the use of the operator INSERT TUPLE or as the result of the algebraic operators described below. Figure 13 illustrates the population of a previously defined relation, MEMBERS, with the names and majorities of Members of Parliament. Tuples may be deleted from relations by means of the DROP TUPLE operator.
Figure 13 - Populating the relation MEMBERS by means of INSERT TUPLE.
The first group of purely algebraic operators are the traditional set operations of union, intersection, difference and cartesian product. These require, as parameters, the names of two source relations and either generate a third relation (whose description is entered in the appropriate directories) or list out the results for the user. Codd's relation operators of PROJECTION, JOIN, PROJECTION and DIVISION are also provided. Examples of their use are given in figures 14 and 15. It should be noted that the full flexibility of the CODIL language can be used in specifying the conditions for the JOIN and RESTRICTION operators.
Figure 14 - The use of JOIN and PROJECTION to list MP's PARTY (from CANDIDATES) and MAJORITY (from MEMBERS)
Figure 15 - The use of RESTRICTION to select MP's from the sample collection with majorities in excess of 5000 votes
A series of additional operators have been introduced for examining the distribution of domain values within a given relation. These provide information such as the maximum, minimum and mean values of a domain, or a frequency distribution such as the one illustrated in figure 16. Full details of these operators will be published elsewhere.
Finally there are a series of housekeeping operators. These include LIST, which lists out the source relations in tabular form, and ORDER, which orders tuples within a relation. The terminal SAVE ensures that all the relations, etc., generated or modified during the run are saved until next time.
Figure 16 - The use of SIMDIST to produce a simple distribution of the political parties in CANDIDATES.
CONCLUSIONS


CODIL was originally designed for people who had little or no knowledge of computing or mathematics, and who had open-ended, poorly defined or ambiguous problems. The earlier parts of this paper have shown that such users have access to relational data base type facilities, when needed, simply by applying self-restraint and restricting themselves to an appropriate subset of CODIL. The implementation of the relational algebra is perhaps even more significant. CODIL was not designed for "programming" and does not look or behave like a conventional procedural language. The current research has shown that CODIL's considerable power in poorly defined applications is combined with at least a working capability of handling conventionally well structured problems. As a result of the research described here a number of extensions have been made to the CODIL interpreter to improve the ease of handling formal situations. These include an optional "PICTURE" facility for defining the format of item values, and a "KEY" facility for selecting information from files.


REFERENCES
  1. Codd, E.F. 'A relational model of data for large shared data banks', Communications of the ACM, 13, 377-387, 1970.
  2. Codd, E.F. 'Normalised data base structure: A brief tutorial. Proceedings 1971 ACM SIGFIDET Workshop on Data Description, Access and Control.
  3. Codd, E.F. 'Further normalisation of the data base relational model', Courant Computer Science Symposium 6, Data Base Systems, 1971, Prentice-Hall
  4. Codd, E.F., 'Relational completeness of data base sublanguages', ibid.
  5. Reynolds C.F., 'CODIL, Part 1, The Importance of Flexibility, Computer Journal, 14, 217-220, 1971.
  6. Reynolds, C.F., 'CODIL, Part 2, The CODIL language and its interpreter', Computer Journal, 14, 327-332, 1971.
  7. Reynolds, C.F., 'Designing.an interactive language for the pragmatic user', Proceedings European Computing Conference, 991-1006, 1974.
Further information on CODIL too extensive to list here, is given in CODIL Newsletter and a range of technical reports and notes produced within the Computer Science Department of Brunel University. The interpreter can be used on ICL 1900 computers and has recently been transferred to a CDC 6600. Applications range from teaching, via the processing of medical records, to the writing of a heuristic problem solver. Recent papers with a data base slant include:
  • Reynolds, C.F., 'A data base system for the individual research worker', Proceedings Sym. Tech. Selective Dissemination of Information, IEEE 1976, pp 1-8.
  • Reynolds, C.F., Shackell M. and Sutton G., 'Using CODIL to handle poorly structured clinical information', Proceedings Medical Informatics, Europe 1978 (September).
  • Reynolds, C.F., 'The design and use of a Computer language based on production system principles' CSTR/15, Brunel University, 1978.

No comments:

Post a Comment