Friday 27 May 2011

Reprint: Historical Knowledge Bases, 1988

 From Computer and Quantitative Methods in Archaeology 1988, edited by S. P. Q. Rahtz, BAR International Series 446(ii) 1988
 CODIL as a knowledge base system for handling historical information
Chris F. Reynolds
30.1 Introduction

    Carrying out any historical research is like trying to solve a jigsaw puzzle in which the vast majority of the pieces are missing. Those that remain may be misunderstood, distorted with time, or highly biased. The possibilities for uncertainty, ambiguity, and error are legion. If only computers could help ...
    Of course computers can help historians in many ways, but the one area where they are particularly weak is in providing a data base facility for handling the untidy mass of information that is the best we can assemble from the surviving documentary and archaeological evidence. This should not be surprising. Conventional data base techniques are specifically designed to handle large quantities of homogeneous data, because that is what industry, commerce and research scientists want. As a result each record contains a fixed number of predefined fields, most if not all of which contain data. These can be used for certain highly organized and comparatively modern historical records, such as the Victorian census returns, but become less and less suitable as the variety of the information increases.
    At the other extreme there are text-based systems, which can look very impressive to the naive, but have little real information processing capability. This is because there is no record structure and to do anything useful the computer would need a virtually human level of ability to understand (rather than reproduce) the language of the original document. (The examples given in artificial intelligence studies using natural language data bases are typically confined to very small, and hence highly selective, collections of data.)
    To overcome these problems many libraries and museums have designed tagged field record systems. These allow variable information to be held about different kinds of artifacts. Each record can consist of a large number of fields, most of which are absent. While ideal for cataloguing and retrieval, they are not capable of providing the historian with a sufficiently flexible tool to solve his jigsaw puzzle.
    To a historian user, the CODIL and MicroCODIL software, described here, can be considered as a significant extension of the tagged approach. However they incorporate ideas from conventional data bases, artificial intelligence research, and human factor studies, which makes them far more flexible and far more friendly. (CODIL is the original mainframe package; MicroCODIL is a more user-friendly micro-based subset. For discussion purposes they will be assumed to be equivalent.) Their novel features are best introduced by means of Fig. 30.1, based on a 'file dump' of the biography of a George Washington Gibbs.

    The first thing to notice about this computerized biography is that you can read it without knowing anything about CODIL or computers. The only difficulty some people have is with the level numbers on the left. Basically an item's influence extends from the point at which it occurs until another item is reached with the same, or a lower, level number. This means that the whole of the listing concerns George, because his name is the only item at level 1. In the same way events between 1832 and 1865 all involve the 'Liverpool' item at level 2.
    The second thing to notice is that the information comes from a wide range of historical records in a wide variety of forms. For instance an engraving was made in 1826 which shows the front of George's shop. In the original engraving the words on the boards can be clearly seen, giving the shop keeper's name and occupations. Other information comes from an Aylesbury newspaper (published by George's nephew), including the reference to a letter which first appeared in the Liverpool Mercury.
    The listing shows a number of 'records' or 'statements' describing events in George's life. Each statement consists of a number of items, starting at level 1 and terminated by a full stop. For example one of these statements contains the following items:
NAME = Gibbs, George Washington,
PLACE = St Albans, Herts,
YEAR = 1826,
ADDRESS = Clock House,
OCCUPATION = Auctioneer,
PARTNER = Gibbs, John,
SOURCE = Trade Directory.
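The level-number rule and the full-stop terminator described above can be illustrated with a short Python sketch. This is not how CODIL itself is implemented; the function name, the (level, item) representation, the level numbers and the 'PLACE = Liverpool' wording are all assumptions made for illustration, since Fig. 30.1 is not reproduced here.

```python
def expand_statements(listing):
    """Expand (level, item) pairs into complete statements.

    An item stays in force until a later item arrives with the same
    or a lower level number; an item ending in a full stop closes a
    statement. This mirrors the rule in the text, nothing more.
    """
    context = []      # (level, item) pairs currently in force
    statements = []
    for level, item in listing:
        # A new item cancels every earlier item at the same or a deeper level.
        while context and context[-1][0] >= level:
            context.pop()
        terminated = item.endswith(".")
        context.append((level, item.rstrip(".,")))
        if terminated:
            statements.append([text for _, text in context])
    return statements


# Illustrative listing: the item texts come from the example statement,
# but the level numbers and the second statement's wording are assumed.
listing = [
    (1, "NAME = Gibbs, George Washington,"),
    (2, "PLACE = St Albans, Herts,"),
    (3, "YEAR = 1826,"),
    (4, "ADDRESS = Clock House,"),
    (5, "OCCUPATION = Auctioneer,"),
    (6, "PARTNER = Gibbs, John,"),
    (7, "SOURCE = Trade Directory."),
    (2, "PLACE = Liverpool,"),
    (3, "YEAR = 1853."),
]

statements = expand_statements(listing)
for statement in statements:
    print(statement)
```

The second statement inherits the level 1 NAME item but not the St Albans details, because the new level 2 item displaces everything at level 2 and below.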
    Conventional data bases are restricted to records which contain predefined fields in a predefined order. It is only necessary to look at the full example to realise that such a strait-jacket is totally unacceptable for anyone interested in holding any realistic representation of a person's biography. CODIL allows statements to contain any combination of items in any order, including multiple entries, as George's occupations in 1826 illustrate. There are also no constraints on the order of statements in a file. New item names can be created if new kinds of information are discovered. For example, I still do not know his salary when he worked for the customs service. Should I find out, it would only take a moment to create a SALARY item. Where something happened at a somewhat uncertain date (see his daughters' birth dates) it is a routine operation to record a range, as opposed to a single value or a list of values.
    The statement relating to 1853 shows another interesting feature. At about this time the Bucks Advertiser carried a regular 'Liverpool' column. It is not known who the reporter was. However George was a relative of the editor and he was probably the reporter concerned. Considerable care is needed in handling such uncertain data in a conventional data base, as otherwise you may look at the data several years later and forget that it was only an informed guess. As can be seen, CODIL allows a probability to be associated with an item to indicate the reliability of the information.
    It is one thing to store historical data in a flexible manner. However it is only useful if it can be processed. I am sure that most historians would recoil in horror at the idea of writing a program which reads records from a file in which every record has a different structure, and where each field may contain a single value, a range or a list. CODIL gets round this problem by using some advanced artificial intelligence style techniques, and a considerable degree of computer human factors research has gone into the design of MicroCODIL. It is not possible to describe these techniques in detail in this paper, but in most cases CODIL does what you might expect a human to do in the same circumstances. For instance a search for 'OCCUPATION = Broker' will find the statement relating to the picture of the shop without the questioner having to know that a person may have more than one occupation simultaneously. A search for events in 1829 will indicate that he had a daughter called Ann Gibbs, although there is some uncertainty about whether she was born in 1828 or 1829.
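A rough idea of what such a search involves can be given in Python. The sketch below is only an assumption about one way to match a query against items that may hold a single value, a range or a list; it is not CODIL's own method. The dictionary layout, the function names, the DAUGHTER item name and the exact occupations listed are invented for illustration.

```python
def value_matches(stored, wanted):
    """Return True if a stored value is compatible with the value sought."""
    if isinstance(stored, tuple):      # (low, high) inclusive range
        low, high = stored
        return low <= wanted <= high
    if isinstance(stored, list):       # several values held at once
        return wanted in stored
    return stored == wanted            # plain single value


def search(statements, name, wanted):
    """Return the statements holding an item `name` compatible with `wanted`."""
    return [s for s in statements
            if name in s and value_matches(s[name], wanted)]


# Illustrative statements only: each has a different set of items.
statements = [
    {"NAME": "Gibbs, George Washington", "YEAR": 1826,
     "OCCUPATION": ["Auctioneer", "Broker"],   # a list of values
     "SOURCE": "Engraving"},
    {"NAME": "Gibbs, George Washington", "DAUGHTER": "Gibbs, Ann",
     "YEAR": (1828, 1829)},                    # an uncertain date as a range
]

brokers = search(statements, "OCCUPATION", "Broker")
events_1829 = search(statements, "YEAR", 1829)
```

The questioner asks the same kind of question in both cases; the matching routine, not the user, copes with the fact that one item is a list and the other a range.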
    There are other features which are particularly relevant to the historian. The biography contains references to other people, with roles such as FATHER, PARTNER and EXECUTOR. By including definitions such as:
FATHER (ISA) PERSON
MOTHER (ISA) PERSON
EXECUTOR (ISA) PERSON
it is possible to carry out a search for 'PERSON = Gibbs, John' without knowing the nature of George's link to John when the question is asked. It is also possible to use 'fuzzy' definitions using probability. To take a simple example you might want to define the word PUBLISHING to be any of the following items:
OCCUPATION = Publisher
OCCUPATION = Journalist
OCCUPATION (.75) = Printer
AUTHOROF
    Note that the 'Printer' item has a probability associated with it because some printers do work which is not normally considered as publishing. A simple search for things that are definitely known about PUBLISHING will only find the letter George wrote to the Liverpool Mercury. A search using a 0.6 probability threshold will reveal his printing activities in St Albans. Dropping the threshold to 0.4 allows his possible journalistic activities of 1853 to be reported as well.
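Both ideas, the (ISA) definitions and the 'fuzzy' PUBLISHING definition, can be sketched in Python. The code is a guess at the behaviour described, not a description of CODIL's internals. In particular, the tuple layout, the function names, the PARTNER (ISA) PERSON entry, the 0.5 probability given to the 1853 item and the rule of multiplying the item probability by the definition probability are all assumptions.

```python
# Role definitions; PARTNER is an assumed addition in the same pattern.
ISA = {"FATHER": "PERSON", "MOTHER": "PERSON",
       "EXECUTOR": "PERSON", "PARTNER": "PERSON"}

# (item name, value or None for "any value", probability)
PUBLISHING = [
    ("OCCUPATION", "Publisher", 1.0),
    ("OCCUPATION", "Journalist", 1.0),
    ("OCCUPATION", "Printer", 0.75),
    ("AUTHOROF", None, 1.0),
]


def is_a(name, target):
    """Follow ISA links upward from `name` until `target` is reached or links run out."""
    while name != target and name in ISA:
        name = ISA[name]
    return name == target


def search_role(statements, target, wanted):
    """Find statements in which any item of kind `target` has the value `wanted`."""
    return [s for s in statements
            if any(is_a(name, target) and value == wanted
                   for name, value, _ in s)]


def publishing_score(statement):
    """Best combined probability that a statement concerns PUBLISHING."""
    best = 0.0
    for name, value, p_item in statement:
        for d_name, d_value, p_def in PUBLISHING:
            if name == d_name and d_value in (None, value):
                best = max(best, p_item * p_def)   # assumed combination rule
    return best


def search_publishing(statements, threshold=1.0):
    """Find statements whose PUBLISHING score reaches the threshold."""
    return [s for s in statements if publishing_score(s) >= threshold]


# Illustrative statements as (name, value, probability) triples.
partner = [("YEAR", 1826, 1.0), ("PARTNER", "Gibbs, John", 1.0)]
letter = [("AUTHOROF", "Letter to the Liverpool Mercury", 1.0)]
printing = [("PLACE", "St Albans, Herts", 1.0), ("OCCUPATION", "Printer", 1.0)]
reporter = [("YEAR", 1853, 1.0), ("OCCUPATION", "Journalist", 0.5)]
biography = [partner, letter, printing, reporter]

links_to_john = search_role(biography, "PERSON", "Gibbs, John")
definite = search_publishing(biography)            # only the letter
likely = search_publishing(biography, 0.6)         # adds the printing
possible = search_publishing(biography, 0.4)       # adds the 1853 reporting
```

Under these assumptions the three searches reproduce the behaviour described above: each lower threshold admits one more statement.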

30.2 Example studies

    Of course, it is one thing to demonstrate flexible handling of historical information, and another to do it on a scale useful to a working historian. For this reason a number of historical studies have been carried out using CODIL on a mainframe computer. The volume of data involved is about four million bytes of compressed data, or over 1000 pages of biographies. The main studies involved are:
  1. The Gibbs Family of Aylesbury: This involves an in-depth study of this nonconformist family of shop keepers, mainly in the early 19th century. The book, 'The Gibbs Family of Aylesbury', was published using the information from the data base. The biography of George Washington Gibbs is taken from an updated version of the file, modified for use with MicroCODIL. The longest biography, involving over 200 separate statements, describes the career of the politically active John Gibbs.
  2. The Farming Families of Sandridge: This is an incomplete study which combines census material with many other sources in an attempt to investigate cousin marriages and migration among well-to-do tenant farmers. Some of the information gathered for this study has been used to set up and demonstrate a local history data base for schools, using MicroCODIL.
  3. The Phipson Family: An in-depth study has been carried out on this West Midland family. The senior branch moved to Birmingham in the 18th century and became actively involved in manufacturing. They had contacts with Joseph Priestley and Matthew Boulton and married into the Ryland family. Side branches became involved in medicine, the law and architecture. Another branch were based in the Stourbridge area, and saw the other side of the industrial revolution as nail makers on Lye Waste.
    It is appropriate to look at the Phipson files in more detail, to demonstrate that CODIL can do much more than just store and retrieve information.

    Fig. 30.3 shows the biography of William Howell Phipson listed out in a partially formatted manner which takes up less space than a file print. It should be noted that the main participants are all given reference numbers and that these are replaced by the name / birth year before printing. This information can be reorganized in a variety of ways. The simplest is to produce alphabetic indexes, for example by place or occupation. Such indexes not only provide a quick means of accessing the information, but also allow aberrant spellings and errors to be spotted and corrected. Summary tables can also be produced and an example is given in Fig. 30.4.
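An alphabetic index of the kind just mentioned is simple to sketch in Python. The function name and the dictionary layout are assumptions, and because Fig. 30.3 is not reproduced here the sample statements are borrowed from the George Washington Gibbs example rather than from the Phipson files.

```python
from collections import defaultdict


def build_index(statements, item_name):
    """Map each value of `item_name` to the statements that mention it,
    with the values in alphabetical order."""
    index = defaultdict(list)
    for statement in statements:
        if item_name in statement:
            index[statement[item_name]].append(statement)
    return dict(sorted(index.items()))


# Illustrative statements only.
statements = [
    {"NAME": "Gibbs, George Washington", "PLACE": "St Albans, Herts",
     "YEAR": 1826, "SOURCE": "Trade Directory"},
    {"NAME": "Gibbs, George Washington", "PLACE": "Liverpool", "YEAR": 1853},
]

place_index = build_index(statements, "PLACE")
for place, entries in place_index.items():
    print(place, [entry["YEAR"] for entry in entries])
```

Because the values are listed in sorted order, a variant spelling of a place name would normally appear next to the correct form, which is what makes such an index useful for spotting errors.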

    Check lists of this nature are extremely useful aids, and it is often possible to produce search aids with the information in an order that corresponds to the manuscripts, etc., to be examined in a records office. Other aids can be used to help with data input, while family trees such as Fig. 30.2 can be 'sideways printed'.

30.3 Technical information

    The above discussion, plus the demonstration and sample listings on display at the Conference, show some of the ways that the software can be used in historical research. However the techniques can be used by anyone who has poorly structured information from an amorphous collection of sources. For example, they have been used to handle medical research information.
    CODIL started with some ideas put forward about 20 years ago. These have been developed over the intervening period at both the software and application level. The existing mainframe software is written in COBOL 76 and is available to other universities, etc. It supports the large studies described above, and if installed with slight modification on a suitable mainframe it should be able to support files of 10 million bytes or more without difficulty. (Unfortunately the computer on which it runs at Brunel University was withdrawn in June 1988, so that further work there has had to be stopped.)
    MicroCODIL is a version of CODIL designed to run on a BBC microcomputer (any model) in a school environment. It is aimed at showing how a wide range of 'information technology' ideas can be used in 'across the curriculum' studies, especially history. For instance there is a special 'History Project Pack' which includes a copy of the book, 'MicroCODIL and History'. MicroCODIL has a much better user interface than CODIL, and has some powerful extra facilities, such as the use of probability, described above. However the small size of the BBC computer means that its file handling capabilities are much more limited. Future plans are to move this system to larger computers, rather than to upgrade the older CODIL package.
    Reynolds 1987 and Reynolds 1989 describe the basic concepts underlying the CODIL / MicroCODIL approach, and replace earlier papers on how the software works. Reynolds 1984 and Reynolds 1985 provide descriptions of the prototype version of MicroCODIL as a teaching package. Reynolds 1988 is based on the package currently being marketed. Reynolds 1979 consists of computer printed biographies and family trees, with some connecting text. The MicroCODIL software and documentation, including the 'MicroCODIL Manual' and 'MicroCODIL and History', are available from the author at the CODIL Language Systems address.

References

REYNOLDS, C. F. 1979. The Gibbs Family of Aylesbury.
REYNOLDS, C. F. 1984. "MicroCODIL as an Information Technology Teaching Tool", University Computing, 6: 71-75.
REYNOLDS, C. F. 1985. "A Microcomputer Package for Demonstrating Information Processing Concepts", J. Microcomputer Applications, 8: 1-14.
REYNOLDS, C. F. 1987. "Human Factors in System Design: A Case Study", in Diaper & Winder (eds.), People and Computers III, pp. 93-102. Cambridge University Press.
REYNOLDS, C. F. 1988. "Introducing Expert Systems to Pupils", J. Computer Aided Learning. in press.
REYNOLDS, C. F. 1989. "CODIL: The Architecture of an Information Language", Computer Journal.
