Modelos informáticos para análisis microhistórico:
métodos a pequeña escala para gran volumen de información.

Computer models for analysing households :
small-scale methods for large-scale data.

XIII CONGRESO INTERNACIONAL DE LA ASOCIACION HISTORY & COMPUTING

"La Historia en una nueva frontera" / "History in a new frontier"

Toledo (Spain), Convento de San Pedro Mártir, 20-23 July 1998







Arne Solli

Department of History,

University of Bergen,

Norway

E-mail: arne.solli@hi.uib.no


Draft

Introduction

The quantity of machine-readable census material has increased dramatically over the last decade. In Britain, the 1881 census alone contains over 30 million data records; in Norway, both the 1801 and the 1900 censuses are now fully transcribed, totalling about 3 million data records.

Two approaches are usual when working with census material for historical and demographic research. The first uses small subsets of a census, with all the original data present, and is based on the census lists. The second usually has greater geographical coverage but uses data at aggregate levels and is based on existing published reports. An example of the first approach might be a close examination of working-class families in part of a town; the second might be typified by a comparative economic history of several towns.

The large-scale computerisation of the original census listings changes the relationship between these two approaches, the micro-study and the macro-study. In theory, once the whole census is in machine-readable format, one could consider applying the small-scale approach on a larger scale, or to the whole census. It is also quite possible to redo previously conducted aggregations and to make many new types of aggregation. This allows much greater freedom in sampling, since samples need no longer be rigidly defined by the administrative units of the census.

However, in order to use small-scale approaches on larger parts of a census, one also needs to computerise the preparation of the source material, i.e. the examination of each record in order to code, classify and/or aggregate it for historical analysis. Many of the techniques developed are based on close examination of each individual in the census lists: What is each person's gender and marital status? What is the relationship to the head of the family ('son', 'mother-in-law')? What is the occupation of this individual? And so on.

This paper deals with the problems encountered in formalising some types of family and household studies (the small-scale approach), studies in which kin relationship, household composition and household structure are among the key objects of study. More specifically, it deals with the approach to studying family and household developed by the Cambridge Group for the History of Population and Social Structure. The group was established in 1964, and its methods and techniques have become quite widespread. This paper seeks to answer two questions: To what degree can these techniques for household studies be carried out automatically by computer programmes? And which processes can be formalised, and what are the problems of using computers to make "low-level" interpretations of a historical source? Semantics that are "low-level" for the historian can be rather "high-level" for a computer.

The source material

A census list has a simple structure: it is basically a list of persons with relatively few attributes, namely name, sex, marital status, relationship to head of family, age and occupation, and sometimes also birthplace, disease and disabilities. The census lists of the second half of the 19th century were also often written on some kind of pre-printed or tabulated form with clearly defined rows and columns. In a way they look very much like a spreadsheet or a table in a database. A final typical feature of the censuses is the way persons are listed: they are very often grouped together in some type of co-resident group, often denoted as a family or household.

Figure 1 shows an example of this type of source, the 1881 census of Great Britain. It reproduces one page of the Census Enumerator's Book, which is really a transcription of the household schedules; one schedule was delivered to each household and collected within a week. The reader should take a little time to study the fourth to the seventh columns (from Houses to Condition). The '\\' (or '\') in the fourth column indicates the end of a house or household, the fifth column is the name of a person, the sixth the relation to the head of the family, and the seventh the marital status. Together with age and sex (eighth and ninth columns) these are important when one studies household structure.

Figure 1 Example of a page in the Census Enumerator's Book of the 1881 Census, Great Britain.

The undermentioned Houses are situated within the Boundaries of the
22  Page 2

Civil Parish [or Township] of: Shenley
City or Municipal Borough of:
Municipal Ward of:
Parliamentary Borough of:
Town or Village or Hamlet of: Shenley
Urban Sanitary District of:
Rural Sanitary District of:
Ecclesiastical Parish or District of:

| No. of Schedule | Road, Street, and No. or Name of House | Houses: Inhabited | Houses: Uninhabited (U.) or Building (B.) | Person | Relation to Head of Family | Condition | Age last Birthday: Male | Age last Birthday: Female | Rank, Profession, or Occupation | Where born | If (1) Deaf-and-Dumb, (2) Blind, (3) Imbecile or Idiot, (4) Lunatic |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | The Prince William Beer House | / | | Charles Smart | Head | Mar | 44 | | Publican | Great Gadesden Herts | |
| | | | | Patience Smart | Wife | Mar | | 56 | Publican Wife | South Perrott Dorset | |
| | | | | Louisa Smart | Daur | Unm | | 16 | Dressmaker | Shenley Herts | |
| | | | | Albert Barton | Stepson | Unm | 25 | | Shoemaker | Shenley Herts | |
| | | | | Henry Field | Lodger | Widr | 65 | | General Labr | Shenley Herts | |
| | | | | James Webb | Lodger | Unm | 66 | | General Labr | Codenham Suffolk | |
| | | | \\ | Joseph Auntler | Lodger | Unm | 66 | | General Labr | Shenley Herts | |
| 2 | Shoe Shop | / | | William Hackett | Head | Mar | 72 | | Boot Maker | Shenley Herts | |
| | | | \\ | Sophia Hackett | Wife | Mar | | 73 | Boot Maker Wife | Aldenham Herts | |
| 3 | Private House | / | | Henry Clarke | Head | Mar | 48 | | Blacksmith | Beighton Norfolk | |
| | | | | Phoebe Clarke | Wife | Mar | | 41 | --- | Green Of Shenley Herts | |
| | | | | Isabel Clarke | Daur | Unm | | 18 | Scholar | Shenley Herts | |
| | | | \\ | John Bartram | Nursechild | - | 7 | | Scholar | Tottridge --- | |
| 4 | | | | Mary Smith | Head | W | | 66 | --- | Chesham Bucks | |
| | | | | Issak Smith | Son | Unm | 25 | | Gardeners Labr | Shenley Herts | |
| | | | \\ | Robert Entickross (?) | Lodger | Unm | 24 | | Gardeners Labr | Sussex --- | |
| 5 | Chapel Yard | / | | William Greenham | Head | Mar | 60 | | Farm Labr | Hatfield --- | |
| | | | \\ | Charles Greenham | Son | Unm | 18 | | Farm Labr | St Petters Herts | |
| 6 | | | | George Lawford | Head | Mar | 54 | | Gardener | Smallford Fat... Herts | |
| | | | | Sarah Lawford | Wife | Mar | | 50 | Gardeners Wife | Enfield Middlesex | |
| | | | | James Perry | Boarder | Unm | 54 | | Labr | Shenley Herts | |
| | | | \\ | Abraham Auntler | Boarder | Unm | 37 | | Labr | Shenley Herts | |
| 7 | | | | Thomas Perry | Head | Mar | 30 | | Gardener | Shenley Herts | |
| | | | | Martha Perry | Wife | Mar | | 32 | Gardeners Wife | North Mimms --- | |

Total of Houses: 4. Total of Males and Females: 16 (males), 8 (females).

Table 1 Household structure, Bergen, Norway 1865-1875, Ipswich and Hull, England 1881.

Hammel-Laslett classification scheme. N = number of households.

| Category | Class | Bergen 1865 N | Bergen 1865 % | Bergen 1875 N | Bergen 1875 % | Ipswich 1881 N | Ipswich 1881 % | Hull 1881 N | Hull 1881 % |
|---|---|---|---|---|---|---|---|---|---|
| 1 Solitaries | 1a Widowed | 490 | 7.9 | 708 | 8.1 | 626 | 6.1 | 1198 | 7.9 |
| | 1b Single | 444 | 7.1 | 1167 | 13.4 | 317 | 3.1 | 454 | 3.0 |
| 2 No family | 2a Coresident siblings | 34 | 0.5 | 56 | 0.6 | 134 | 1.3 | 175 | 1.1 |
| | 2b Coresident relatives, other | 33 | 0.6 | 85 | 1.0 | 267 | 2.6 | 323 | 2.1 |
| 3 Simple family households | 3a Married couple alone | 766 | 12.3 | 950 | 10.9 | 1590 | 15.4 | 2262 | 14.8 |
| | 3b Married couple with child(ren) | 3270 | 52.6 | 4051 | 46.5 | 4836 | 47.0 | 6449 | 42.3 |
| | 3c Widowers with child(ren) | 252 | 4.1 | 247 | 2.8 | 249 | 2.4 | 420 | 2.8 |
| | 3d Widows with child(ren) | 575 | 9.3 | 920 | 10.6 | 780 | 7.6 | 1598 | 10.5 |
| | 3e | 53 | 0.9 | 110 | 1.3 | 60 | 0.6 | 94 | 0.6 |
| 4 Extended family households | 4a Extended upwards | 88 | 1.4 | 113 | 1.3 | 311 | 3.0 | 438 | 2.9 |
| | 4b Extended downwards | 79 | 1.3 | 80 | 0.9 | 613 | 6.0 | 850 | 5.6 |
| | 4c Extended laterally | 91 | 1.5 | 187 | 2.2 | 333 | 3.2 | 605 | 4.0 |
| | 4d Combinations | 7 | 0.1 | 12 | 0.1 | 61 | 0.6 | 127 | 0.8 |
| 5 Multiple family households | 5a Secondary unit(s) UP | 2 | 0.0 | 9 | 0.1 | 11 | 0.1 | 23 | 0.2 |
| | 5b Secondary unit(s) DOWN | 29 | 0.5 | 8 | 0.1 | 97 | 0.9 | 192 | 1.3 |
| | 5c Units all on one level | | | | | | | 6 | 0.0 |
| | 5d Frérèches | | | 1 | | 11 | 0.1 | 36 | 0.2 |
| | 5e Other | | | | | | | 1 | 0.0 |
| No. of households | | 6213 | 100 | 8704 | 100 | 10296 | 100 | 15251 | 100 |
| No. of person records | | 30403 | | 39717 | | 48143 | | 70206 | |

Sources: Census of Bergen, Norway 1865 and 1875 transcribed by Statsarkivet i Bergen. Great Britain census 1881, History Data Service, University of Essex. Public Record Office equivalents RG 11/1848-1878 (Ipswich registration district, Suffolk, England), RG 11/4765-4780 (Hull registration district, Yorkshire, England).

One 'product' of a household analysis is a table like Table 1, which shows the number and percentage distribution of households according to the Hammel-Laslett classification scheme.

The sample is extracted from a (hypothetical) study of North Sea ports between 1865 and 1881, Bergen in Norway and Ipswich and Hull in England, using three different censuses. A table like this could be part of a comparative study of household structure, but, in contrast to many studies of this kind, the coding of the census data is partly done by a programme and the classification of households is done solely by a computer programme. A brief look at the last row of the table shows that the populations examined by the software are larger than in a household study where the coding and classification are done by hand. Both the size of the data and the time-consuming process of coding and classifying can restrict a researcher's study and limit its scope and perspective. By using large populations, however, the researcher can quite easily select sub-populations, for example comparing only male-headed households where the head is 25-30 years old, or only the households of an occupational group such as mariners (or both). In a computer environment the selection of such sub-populations can be changed and refined at any point.

Elements of household analysis

The process of going from the census list in Figure 1 to the figures in Table 1 is long and seldom straightforward, and will not be described in detail here. For a detailed description of the method see Peter Laslett and Richard Wall (eds.), Household and family in past time, Cambridge 1972, pp. 1-86. Briefly, in a computer-assisted analysis of this kind, the steps would be:

  1. Creating a database to hold the source in computer-readable format.
  2. Transcribing the source in a source-oriented manner, i.e. the textual information is typed as written in the source, with no coding or abbreviation of the source data.
  3. Adding new fields for analytical purposes, either to the existing table(s) or in a new table. Examples of such fields are marital status, sex and age, which are either coded or put in a standard format. In the source, marital status and age can be denoted in several ways. In the coded version, marital status may have just four values: unmarried, married, widowed, unknown. Ages are sometimes given as a date of birth, a year of birth, an age in years, an age in days or months, or a combination of these. For our purpose an age in whole years is sufficient.
  4. Defining household heads. In most cases this is fairly simple, since the head is explicitly flagged (slashes, numbers) or stated in the source; in other cases it can be more difficult because of ambiguity in the data.
  5. Coding or standardising each person's relationship to the head into values like "Head", "Wife", "Children" (with sub-codes like "Son", "Daughter", "Step-son"), "Servants", and "Kin" (usually with several sub-types, like "Mother", "Mother-in-law" and so on).
  6. Finally, one goes through each household sequentially and examines each person's sex, marital status and relationship to the head of household in order to classify the household. These attributes are crucial because the key to classification is the conjugal family unit (CFU). These marital units can have one of three forms:

     a) a married couple without offspring,
     b) a married couple with never-married offspring,
     c) a widowed person with never-married offspring.

The term never-married is used here because it is more specific than unmarried (which can also include the widowed), and offspring can include both own children and step-children. Many households contain only one CFU, and these end up in group 3 (cf. Table 1). If there are two CFUs within a household, e.g. a husband and wife together with a married son, his wife and child, the household is classified as 5b, secondary unit disposed downwards from the head.

In order to classify households this way, the researcher must sometimes also consider a person's age and surname if the stated relationship to the head is ambiguous or unclear. Typically one wants to create a new table or file in which each household is a record. Both the person file and the household file can be kept in one database or exported into a statistical package like SPSS or SAS.

  7. The researcher is now able to create tables like Table 1 and can continue the research on the households.

Done manually, this process is rather time-consuming, and most household analyses are therefore quite limited in terms of population size. A typical household study covers one or a few parishes using two or three censuses, i.e. a total population ranging from 300 to 4000 persons and 60 to 800 households in each census.

Applying this method to larger samples clearly takes a lot of time, and mistakes are easily made. Computer programs have therefore been written to carry out parts of steps 3-6, i.e. the coding and classification of relationship to head and the classification of households. Common to all of these attempts is that they are highly specialised: they have been applied to a limited number of censuses and to highly proprietary databases, because in each case one was looking for one solution for one census or one census type. I therefore presume that little effort has been put into making a general tool or a general model for this type of historical computing. The purpose of this paper is partly to highlight some of the problems of making a more general software tool for this type of analysis, one that can be used for a wider range of censuses, census languages and databases.

First, what features do we need in order to do a computer-aided analysis?

  1. A source-faithful transcription. We assume that the source has been transcribed letter by letter and word by word, with no pre-coding unless it is done by appending data, and with no changes or alterations other than those that conform to the national or regional rules for transcription.
  2. Coding. Some type of automatic or semi-automatic coding is necessary. With large datasets (10,000 to 30,000,000 individual records), manual coding is too time-consuming.
  3. Sequence. Some way to model and implement the meaning of sequence, i.e. the fact that the order of a list is itself meaningful. In list-type sources, backward references such as "His wife" or "Their daughter" are quite common, and the person referred to must be identifiable.
  4. Fuzzy comparisons. Surnames must be compared for 'equalness', but surnames are not always spelled the same way. Soundex coding is one way of suppressing spelling differences. Age and year of birth are also examples of attributes that may need some type of fuzzy comparison, i.e. equal within a deviation.
  5. Expandability or change. There is seldom any need to update or delete text or records in a transcribed source, but the analytical extensions, e.g. dictionaries, code tables and code values, may be altered, updated or deleted at any stage of the research as the researcher gains knowledge or changes the questions asked.
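The fuzzy comparisons in point 4 can be illustrated with a small sketch. It assumes the common American Soundex variant (the paper only names Soundex in general), and the function names and the two-year age tolerance are illustrative choices, not part of the software described here.

```python
# Letter-to-digit table of the common American Soundex variant.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the four-character Soundex code of a surname ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit:
            if digit != prev:      # skip repeats of the same code
                code += digit
            prev = digit
        elif c not in "HW":        # vowels separate repeats; H and W do not
            prev = ""
    return (code + "000")[:4]


def same_surname(a, b):
    """Treat two surnames as 'equal' if their Soundex codes agree."""
    return soundex(a) != "" and soundex(a) == soundex(b)


def ages_match(age_a, age_b, tolerance=2):
    """Equal with deviation: ages count as equal within +/- tolerance years."""
    return abs(age_a - age_b) <= tolerance
```

With this sketch `same_surname("Hackett", "Hacket")` is true, while `ages_match(30, 40)` is false. Soundex was designed for English names, so its usefulness for Norwegian patronyms would have to be checked against the data.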

The preparation of source data for classification of kin-relationships and household classification

Not all the information on each individual found in the census is relevant for the classification of households, so I will focus on those fields that are used first to code relationship to head of household and secondly to classify households according to the Hammel-Laslett scheme. For a general discussion of the information relating to individuals in the British censuses, see Edward Higgs, A clearer sense of the census, London 1996. For the Norwegian censuses, cf. Ståle Dyrvik, Historisk Demografi, Bergen 1983, and Gunnar Thorvaldsen, Håndbok i registrering og bruk av historiske data, Oslo 1996.

Which attributes are needed in this type of household analysis, and what codes or semantic categories must they be given? As far as possible, the codes given to each attribute will follow Manfred Thaller's proposal for the coding of machine-readable sources.

SEQUENCE. Data must appear in the same sequence as in the source. If this information is not part of the data, a sequential number must be added to preserve the source order.

SEX. In the Hammel-Laslett scheme (Table 1) the sex of the head is necessary to classify households into types 3c, widowers with never-married child(ren), or 3d, widows with never-married child(ren). The values in the source must be coded into the semantic categories MALE, FEMALE or UNKNOWN. Persons of unknown sex should be coded manually using the first name or other variables that indicate sex, e.g. occupation.

MARITAL STATUS. Marital status is a crucial variable because it identifies the CFUs (conjugal family units). It is also used to distinguish between types 1a and 1b (see Table 1). The values for marital status in the source must be coded into the values UNMARRIED, MARRIED, MARRIED SPOUSE ABSENT, WIDOWED, DIVORCED or UNKNOWN MARITAL STATUS.
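A dictionary lookup is one simple way to code such values. The sketch below is only an illustration: the English entries are the abbreviations seen in Figure 1, the Norwegian entries are ordinary words for the same states, and the table and function names are hypothetical. A real dictionary would be built from the strings actually found in each census.

```python
# Hypothetical lookup from source strings (lower-cased) to semantic categories.
MARITAL_CODES = {
    # English abbreviations as they appear in Figure 1
    "unm": "UNMARRIED",
    "mar": "MARRIED",
    "w": "WIDOWED",
    "wid": "WIDOWED",
    "widr": "WIDOWED",
    # Norwegian words (illustrative entries)
    "ugift": "UNMARRIED",
    "gift": "MARRIED",
    "enke": "WIDOWED",
    "enkemand": "WIDOWED",
    "fraskilt": "DIVORCED",
}


def code_marital_status(raw):
    """Map a source string to a category; unmatched strings stay UNKNOWN."""
    key = raw.strip().lower().rstrip(".")
    return MARITAL_CODES.get(key, "UNKNOWN MARITAL STATUS")
```

Anything not found in the dictionary is coded as UNKNOWN MARITAL STATUS rather than guessed, so that the unmatched strings can be listed and the dictionary extended by the researcher.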

AGE. Age will be used to verify relationships specified by the relationship-to-head attribute. Age in censuses can appear as a date of birth, as an age in years, months or days, or as a year of birth. In the 1865 census of Bergen all three forms can be found. The age must be represented as a number of whole years; neither decimals nor a particular date format are necessary.
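The conversion to whole years could be sketched as follows. The accepted input formats (a plain number, a four-digit year of birth, a count of days, weeks or months, and an ISO-style date) are assumptions made for the example, as is the function name; the formats in a given transcription will differ.

```python
import re
from datetime import date


def age_in_whole_years(raw, census_date):
    """Return the age in whole years at census_date, or None if unparseable."""
    text = raw.strip().lower()

    # Date of birth, assumed here to be written as YYYY-MM-DD.
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m:
        born = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        before_birthday = (census_date.month, census_date.day) < (born.month, born.day)
        return census_date.year - born.year - (1 if before_birthday else 0)

    # Age of infants given in days, weeks or months.
    m = re.fullmatch(r"(\d+)\s*(days?|weeks?|months?)", text)
    if m:
        days_per_unit = {"day": 1, "week": 7, "month": 30}
        days = int(m.group(1)) * days_per_unit[m.group(2).rstrip("s")]
        return days // 365

    # A bare number: a four-digit value is read as a year of birth.
    if text.isdigit():
        number = int(text)
        return census_date.year - number if number >= 1000 else number

    return None
```

For example, with a census date of 31 December 1865 passed in by the caller, the strings "7", "1840" and "3 months" give 7, 25 and 0.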

SURNAME. Surname can also be used to verify kin relationships. However, surname can be a problematic attribute for identifying or confirming relationships between household members, because of spelling differences and anomalies of different origins. In the 1881 census of Great Britain the surnames seem to be quite consistent, i.e. there are few spelling variations within a family and all members of a family have the same surname (with exceptions in Wales). Used with care, surnames can identify relationships that are ambiguous or need confirmation. Surnames in 19th-century Norway are a totally different story. In rural areas patronyms were used as last names or surnames, but in towns there can be three 'types' of surnames.

  1. Surnames like Felle, Meyer or Bruun. Families with these names are often descendants of foreign immigrants, mainly of Dutch, Danish and German origin, or they are rural immigrants using the name of their farm or hamlet of origin as a surname. All family members share the same surname.
  2. Patronyms. Townspeople (of the lower classes), perhaps rural immigrants, also used patronyms as surnames or last names. Example:

| First name | Surname/Last name | Relationship |
|---|---|---|
| Arne | Larsen | Man |
| Anna | Olsdatter | Wife |
| Anders | Arnesen | Son |
| Emma | Arnesdtr. | Daughter |

Even though they are a family of four, they have four different 'surnames'.

  3. Patronyms as surnames. From the 1860s onwards, patronyms were being turned into surnames, both among rural immigrants and among emigrants across the Atlantic. This means that the whole family above would have 'Larsen' as its surname, which is really the father's patronym. As this was a gradual process, different types of surnames can be found within one family. One can also trace the process in the censuses: there are wives who in 1865 have their own patronym as surname but who in the 1875 census have their husband's patronym, i.e. the husband's patronym has been 'surnamed'.

In an English context, surnames can be used to identify a family if the relationships are missing, ambiguous or need higher precision; e.g. whether a child is an adopted child, a step-child or a biological child of the head can be spotted this way. Another use of surnames is to distinguish between the head's family and a lodger's family (especially the children). In the Norwegian context, however, surnames are more problematic, because the 1865-1875 censuses fall in the midst of this process of change.

RELATIONSHIP TO HEAD OF HOUSEHOLD is the main source of information for classifying households, but to be used the text string found in the census must be coded and/or classified. There are two main problems. First, the text in this column is not as easy to code automatically as SEX, AGE and MARITAL STATUS, because there are more data (the relationship is described with several words), more values (several types of relationship are described) and ambiguous data (the relationship itself can have several meanings, and the 'correct' meaning can depend on several factors). Secondly, several coding schemes exist. The code values proposed by Thaller in his draft proposal are too few, and a semantically richer set of code values for kin relations is needed in order to classify households.

In Norway there are at least two official coding schemes: a) the scheme used for the 1801 census, with about 24 categories, and b) a simpler scheme, based on the 1801 scheme, proposed by the Norwegian Historical Data Centre. The relationship to head of household in the 1881 census of Great Britain is currently coded with a much more detailed scheme, derived from the scheme Michael Anderson used on the 1851 census. This new 1881 scheme has about 150 codes for kin relationship to the head of household, and the total number of unique codes will probably pass 500. A software tool, however, needs to rely internally on one coding scheme in order to perform these semantic operations automatically, i.e. to classify persons into households. This can be achieved by using conversion tables, i.e. code X in scheme A equals code Y in the internal scheme. At some stage Manfred Thaller's proposals for the Coding of Machine Readable Sources must also be extended to be semantically rich enough to handle the aforementioned operations.
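A conversion table of this kind can be as simple as a dictionary. In the sketch below all code values are invented for the example and belong to none of the schemes mentioned above; the point is only the mechanism, including the collection of unmapped codes for later review.

```python
# Hypothetical conversion table: code in scheme A -> code in the internal scheme.
SCHEME_A_TO_INTERNAL = {
    "A01": "HEAD",
    "A02": "SPOUSE",
    "A10": "CHILD.SON",
    "A11": "CHILD.DAUGHTER",
    "A50": "LODGER",
}


def convert_codes(codes, table):
    """Translate a list of codes; return (converted list, set of unmapped codes)."""
    converted = []
    unmapped = set()
    for code in codes:
        if code in table:
            converted.append(table[code])
        else:
            converted.append("UNKNOWN")
            unmapped.add(code)     # reported so the table can be extended
    return converted, unmapped
```

Because the table is data rather than program logic, it can be altered or extended at any stage of the research without changing the classification code.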

However, the coding and classification of relation to head of household is not based solely on the values in one field; when it is done by hand, at least three types of 'external' information are used.

  1. Data in other fields. If the relationship column contains the string value "Child", we can check the sex column to find out whether it is a daughter or a son.
  2. Sequence, i.e. data found on other persons in the household. The relationship "Wife" will normally be interpreted as "Wife of head" if the person comes after the head, but if the person is listed after a married son of the head one assumes that she is the son's wife, i.e. a "daughter-in-law".
  3. Data found in other sources about the family or person, like other censuses or church records.
  4. Historical knowledge in general, knowledge about the source, and common sense. If a person aged 6 is listed as widowed, we 'know' that something is wrong; and if a person is (correctly) aged 6 and marital status is not specified, we will presume and code the person's marital status as UNMARRIED.
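Points 1 and 4 lend themselves to simple rules. The sketch below is an illustration under assumptions: the function names are hypothetical, and the threshold `min_marriage_age` has to be supplied by the researcher for the census in question.

```python
def refine_child(relation, sex):
    """Point 1: use the sex column to turn 'Child' into 'Son' or 'Daughter'."""
    if relation.strip().lower() == "child":
        return {"MALE": "Son", "FEMALE": "Daughter"}.get(sex, "Child")
    return relation


def check_marital_status(age, status, min_marriage_age):
    """Point 4: return (status, warning) after a plausibility check on age.

    A missing status below the marriage threshold is coded UNMARRIED.
    An implausible status is kept but flagged, not silently changed.
    """
    if age is not None and age < min_marriage_age:
        if status == "UNKNOWN MARITAL STATUS":
            return "UNMARRIED", None
        if status != "UNMARRIED":
            return status, f"age {age} is below {min_marriage_age} but status is {status}"
    return status, None
```

The second function deliberately only flags the six-year-old 'widow': whether the age or the marital status is wrong is a decision left to the researcher.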

The problem, of course, is to formalise some of this contextual information, e.g. information that can be inferred from sequence and external information like common sense. We will have a brief look at some of this external knowledge, knowledge that is not necessarily in the data itself.

Cultural 'constants' and common sense

Categorising households means that we have to make decisions about relationships: who is married to whom, whose children are these, are these two really married? Often the relationship specified in the source is ambiguous, and when classifying we sometimes use 'common sense' or 'common historical knowledge'. Both to code relationships and to categorise households we also tend to use more or less explicit assumptions about the historical past. These come into use when information in the source is missing, is ambiguous, or needs confirmation or higher precision. Some are quite trivial and obvious: if a child aged 5 has no marital status, we assume that the child is unmarried (assuming the age data can be trusted). Likewise, if someone is said to be the mother of a child and the age difference is 8 years, we 'know' it is a step-child, perhaps after checking the surname or, in a Norwegian case, the patronyms of other family members. The computer software also needs somehow to have this 'understanding' and knowledge of the 'real world'. Examples of such 'facts' are the age of menopause, the age of menarche, the lowest age at marriage, a 'reasonable' age gap between spouses and the legal age of marriage. I will describe this knowledge as biological and cultural variables.

Table 2 Biological and cultural variables for England and Norway in the late 19th century

| Constant | Suggested values for Hull and Ipswich 1881 | Suggested values for Bergen 1865-1875 |
|---|---|---|
| Mean age of menarche | 16 | 17 |
| Age of menopause | 45 | 45 |
| "Maximum" age gap between spouses | 25 | 27 |
| Age when less than 2% is married, female | 15 (Hull), 16 (Ipswich) | 19 |
| Age when less than 2% is married, male | 17 (Hull), 18 (Ipswich) | 21 |
| Lowest age for male headship | 20 | 21 |

For England and Norway in the second half of the 19th century we could initially set these variables to values like those shown in Table 2. One problem with these figures is that they change over time. In Norway the age of menarche fell from 17.5 in the 1830s to 16.5 in the 1860s. In Great Britain it fell from about 15.5 in the 1890s to 14.5 in the 1920s. The age of the youngest brides also varies over time, across space and between social groups, so we must be able to change these variables easily depending on the date and place of the census.

How can the software programme use these variables? The main purpose is to resolve ambiguous relations. The ages of menarche and menopause can be used to decide the type of relationship between a mother and a child. The age at first marriage can be used to assume that persons below this age are unmarried if marital status is missing. The age gap between spouses can be used to decide whether it is likely that two persons in a household are married, e.g. if there is a second 'couple' in the household and the relationship to head of household gives no good indication of whether they are man and wife. A typical decision to make them into a couple is if:
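The exact conditions are not spelled out at this point, so the following is only a sketch under assumed criteria: the two persons are of opposite sex, both are coded MARRIED, both are at or above the marriage threshold for their sex, and their age gap does not exceed the "maximum" gap. The variable values are taken from Table 2; the class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variables:
    """Biological and cultural variables for one census (cf. Table 2)."""
    max_spouse_age_gap: int
    min_marriage_age_female: int
    min_marriage_age_male: int


# Values from Table 2.
HULL_1881 = Variables(max_spouse_age_gap=25, min_marriage_age_female=15, min_marriage_age_male=17)
BERGEN_1865_1875 = Variables(max_spouse_age_gap=27, min_marriage_age_female=19, min_marriage_age_male=21)


def plausible_couple(a, b, v):
    """Assumed rule: could persons a and b be man and wife?"""
    if {a["sex"], b["sex"]} != {"MALE", "FEMALE"}:
        return False
    if a["marital"] != "MARRIED" or b["marital"] != "MARRIED":
        return False
    man, woman = (a, b) if a["sex"] == "MALE" else (b, a)
    if man["age"] < v.min_marriage_age_male:
        return False
    if woman["age"] < v.min_marriage_age_female:
        return False
    return abs(man["age"] - woman["age"]) <= v.max_spouse_age_gap
```

Keeping the variables in one object per census makes it easy to change them with the date and place of the census, as required above.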

But there are more problems to solve, and the next issue is the importance and semantics of sequence.

Contextual inference

In a census the string in the relationship column normally refers to the head of household, as intended. Quite often, however, the factual relationship and the stated relationship are not the same, because the stated relationship is indirect or goes through another person; see e.g. Table 3.

Table 3 Example of contextual inference, using biological and cultural variables

| Person no. | Relationship | Sex | Age | Marital status |
|---|---|---|---|---|
| 1 | Head | Female | 70 | Widowed |
| 2 | Her daughter | Female | 35 | Widowed |
| 3 | Her daughter | Female | 5 | Unmarried |

In the example in Table 3 we will assume that the second 'daughter' is really the granddaughter of the head; this relationship, however, is inferred. By using our biological variables and checking the ages we can either reject person number three in Table 3 as a daughter of the head or even assign her as granddaughter of the head.
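The inference for Table 3 can be sketched as below. It is a simplification made for illustration: the mean age of menarche is used as a lower bound and the age of menopause as an upper bound on the mother's age at the birth, the default limits are the Bergen values of Table 2, and the function names are hypothetical.

```python
def plausible_mother(mother_age, child_age, menarche, menopause):
    """True if the woman's age at the child's birth lies within the limits."""
    age_at_birth = mother_age - child_age
    return menarche <= age_at_birth <= menopause


def find_mother(household, index, menarche=17, menopause=45):
    """Index of the nearest preceding woman who could be the mother, or None."""
    child = household[index]
    for i in range(index - 1, -1, -1):
        candidate = household[i]
        if candidate["sex"] == "FEMALE" and plausible_mother(
            candidate["age"], child["age"], menarche, menopause
        ):
            return i
    return None


def resolve_her_daughter(household, index):
    """Relation to head of a person listed as 'Her daughter' (head is index 0)."""
    mother = find_mother(household, index)
    if mother is None:
        return None
    if mother == 0:
        return "Daughter"
    if find_mother(household, mother) == 0:
        return "Granddaughter"
    return None


# The household of Table 3.
TABLE_3 = [
    {"relation": "Head", "sex": "FEMALE", "age": 70},
    {"relation": "Her daughter", "sex": "FEMALE", "age": 35},
    {"relation": "Her daughter", "sex": "FEMALE", "age": 5},
]
```

For person 3 the head would have been 65 at the birth, which is rejected, while person 2 would have been 30, so `resolve_her_daughter(TABLE_3, 2)` yields "Granddaughter".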
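
Table 4 Example of contextual inference, the semantics of sequence

| Person no. | Surname | Relationship | Sex | Age | Marital status |
|---|---|---|---|---|---|
| 1 | Owen | Head | Male | 40 | Married |
| 2 | Owen | Wife | Female | 38 | Married |
| 3 | Owen | Daughter | Female | 3 | Unmarried |
| 4 | Sheringham | Servant | Female | 19 | Unmarried |
| 5 | Shearer | Lodger | Male | 27 | Widowed |
| 6 | Shearer | Daughter | Female | 1 | Unmarried |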

Table 4 exemplifies the semantics of sequence. In this case there are two individuals described as "Daughter", but are both daughters of the head? Doing the classification by hand, one will assume that the last person in the household is the daughter of the lodger (person 5) and not of the head, even though the relationship column holds the same text for person no. 3 and person no. 6. This is also a type of case where the surname (see the discussion of the usage of surnames above) will tell us who the parent of person no. 6 is. This type of inference is contextual: it is not based purely on the person in question and this person's relation to the head.

One way of making the program cope with this type of problem is to check the consistency or 'logic' of a household, e.g. if there is a person called lodger and the next person is a wife, then the wife becomes the wife of the lodger. In this process one can also check for other types of 'errors' or inconsistencies. A typical one concerns sex. In the 19th-century census lists there is not always a separate field for gender; instead the age is written in one of two separate columns, one for males and one for females. A quite common error, made both by enumerators and by transcribers, is to change the sex of one or several household members, e.g. the head of the household is female and the next person, described as a wife, is also female. Errors of this type can also be corrected by a contextual analysis of the attributes of the members of the household.
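The reading of Table 4 can be sketched as a backward search through the household. The rule used here is an assumption made for the example: a "Daughter" is attached to the nearest preceding person who starts a unit (head, lodger, boarder or visitor) and has the same surname, and falls back to the head otherwise. Exact string equality stands in for a fuzzy surname comparison, and the names are hypothetical.

```python
# Relationship values assumed to start a new unit within the household.
UNIT_STARTERS = {"Head", "Lodger", "Boarder", "Visitor"}


def resolve_parent(household, index):
    """Return the person number of the presumed parent of household[index]."""
    child = household[index]
    for i in range(index - 1, -1, -1):
        candidate = household[i]
        if candidate["relation"] in UNIT_STARTERS and candidate["surname"] == child["surname"]:
            return candidate["no"]
    return household[0]["no"]      # default: the stated relation refers to the head


# The household of Table 4.
TABLE_4 = [
    {"no": 1, "surname": "Owen", "relation": "Head"},
    {"no": 2, "surname": "Owen", "relation": "Wife"},
    {"no": 3, "surname": "Owen", "relation": "Daughter"},
    {"no": 4, "surname": "Sheringham", "relation": "Servant"},
    {"no": 5, "surname": "Shearer", "relation": "Lodger"},
    {"no": 6, "surname": "Shearer", "relation": "Daughter"},
]
```

Here `resolve_parent(TABLE_4, 5)` points to the lodger (person 5), while `resolve_parent(TABLE_4, 2)` points to the head (person 1). In a Norwegian census with patronyms the surname test would have to be replaced, as discussed above.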

So far we have discussed ways of making a software program able to process a household. We need to feed the program with:

  1. Several attributes of each person
  2. Contextual information, the semantics of sequence
  3. Real-world assumptions (biological and cultural variables and common sense)

in order to:

  1. Make corrections if needed
  2. Make decisions when information in one field (like relationship to head) is missing, ambiguous or needs higher precision in order to classify the whole household.

The last, but not least, element in this process is to classify the household.

The household classification algorithm

It is beyond the scope of this paper to explain in detail the algorithm used to classify households, so I will restrict myself to mentioning a couple of its key features. The classification algorithm was originally written in the SAS script language by Kevin Schürer, and later refined and rewritten in C/C++ by Arne Solli. This makes it possible to run the software tool on different machine platforms and file formats.

As input the programme reads a tab-delimited file. The user must specify which columns or fields hold marital status, sex, age, surname and relationship to head of household, and also the column(s) that uniquely identify each person, possibly just a sequential number. Appendices A-1 and A-2 give samples of an input data file and an input description file.
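Reading such input can be sketched with a standard tab-delimited reader. The column names, the mapping format and the function name below are assumptions made for the example and do not reproduce the input description file of the appendices.

```python
import csv
import io


def read_persons(handle, columns):
    """Read a tab-delimited person file with a header row.

    columns maps internal field names to the column names of the file.
    A sequence number is added to preserve the source order.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    persons = []
    for seq, row in enumerate(reader, start=1):
        person = {field: row[column].strip() for field, column in columns.items()}
        person["seq"] = seq
        persons.append(person)
    return persons


# Hypothetical column specification and a two-line sample from Figure 1.
COLUMNS = {"surname": "Surname", "relation": "Relation", "marital": "Condition",
           "sex": "Sex", "age": "Age"}
SAMPLE = (
    "Surname\tRelation\tCondition\tSex\tAge\n"
    "Hackett\tHead\tMar\tM\t72\n"
    "Hackett\tWife\tMar\tF\t73\n"
)

persons = read_persons(io.StringIO(SAMPLE), COLUMNS)
```

The values are kept as text at this stage; coding into semantic categories is a separate step, so that the transcription itself remains untouched.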

The classification itself is based on a sequential examination of each household. First there is a pass to check the consistency of the fields sex, marital status and relationship to head of household. Then follow several passes to identify the members of conjugal family units (see the explanation of CFUs above), to mark the 'head' of each CFU, and to identify the relationship between the first CFU (that of the head of household) and any other CFUs within the household, e.g. whether the second CFU is related upwards or downwards from the CFU of which the household head is a member. The last stage of the examination uses this aggregated information to classify the household into the Hammel-Laslett categories, like 3b, simple family household, married couple with children, or 5a, multiple family household, secondary unit up (cf. Table 1).
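To give an impression of the final stage only, the sketch below assigns the simplest classes (1a, 1b, 3a-3d) to a household whose members are already coded. It is not the algorithm by Schürer and Solli: the relation codes are hypothetical, there is a single CFU at most, and every household with other kin or married children is left unclassified, since classes 2, 4 and 5 need the full CFU analysis described above.

```python
# Hypothetical internal relation codes for persons outside the head's kin group.
NON_KIN = {"SERVANT", "LODGER", "BOARDER", "VISITOR"}


def classify_simple(household):
    """Return a Hammel-Laslett class for simple cases, or None otherwise.

    household[0] is the head; each person has 'relation', 'sex', 'marital'.
    """
    head, others = household[0], household[1:]
    kin = [p for p in others if p["relation"] not in NON_KIN]
    spouse = [p for p in kin if p["relation"] == "SPOUSE"]
    children = [p for p in kin
                if p["relation"] == "CHILD" and p["marital"] == "UNMARRIED"]

    # Other kin, married children or several spouses: not handled here.
    if len(spouse) + len(children) != len(kin) or len(spouse) > 1:
        return None

    if spouse:
        return "3b" if children else "3a"
    if children:
        if head["marital"] == "WIDOWED":
            return {"MALE": "3c", "FEMALE": "3d"}.get(head["sex"])
        return None
    if head["marital"] == "WIDOWED":
        return "1a"
    if head["marital"] == "UNMARRIED":
        return "1b"
    return None
```

Applied to coded versions of schedules 2 and 4 in Figure 1, the sketch gives 3a for the Hackett couple and 3d for the widow Mary Smith with her unmarried son, the lodger being ignored.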

As output the user gets two files or tables: a) a modified person file/table and b) a household file/table, which include the re-coded fields and the household classification. All the data from the input are also replicated in the output files/tables. Appendix A-3 gives a sample of the corresponding output files.

Problems and Challenges

There are several problems not yet discussed that must be solved to reach a level of generality that makes it worthwhile to invest time in this type of historical computing.

Firstly, a general language component must be attached to the system so that attributes like SEX and MARITAL STATUS can be 'understood' and handled automatically by the software. Currently English and Norwegian are partly 'understood' by the software, but not in a general and fully automatic way.
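One simple way to picture such a language component is a per-language dictionary that maps source terms to internal codes. The sketch below is in Python and is only an assumption about how it could look: the dictionaries, the internal codes and the function `normalise` are invented for illustration and are neither the software's own tables nor the 'Essex-scheme'.

```python
# Hypothetical term lists and internal codes, for illustration only.
SEX_TERMS = {
    "en": {"male": "M", "m": "M", "female": "F", "f": "F"},
    "no": {"mann": "M", "m": "M", "kvinne": "F", "k": "F"},
}
MARSTAT_TERMS = {
    "en": {"married": "MARRIED", "unmarried": "SINGLE", "single": "SINGLE",
           "widow": "WIDOWED", "widower": "WIDOWED"},
    "no": {"gift": "MARRIED", "ugift": "SINGLE",
           "enke": "WIDOWED", "enkemann": "WIDOWED"},
}


def normalise(value, terms, language):
    """Return the internal code for a source term, or None if unknown."""
    key = value.strip().lower().rstrip(".")
    return terms[language].get(key)


print(normalise("Kvinne", SEX_TERMS, "no"))      # prints "F"
print(normalise(" Ugift ", MARSTAT_TERMS, "no"))  # prints "SINGLE"
```

Terms that are not in the dictionary return None rather than a guess, so that they can be listed and reviewed by hand.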

Secondly, a way to attach different coding schemes, and conversion tables between coding schemes, must be implemented. Currently the software works partly under the assumption that some of the variables are already coded into the 'Essex-scheme', see above.

Thirdly, more effort must be put into making the software infer from the sequence of entries, and in this way understand relationship to head of household correctly. The program now depends on the coding of relationship to head of household being done in advance, but a fairly reasonable extension is to supply the program with information on language, a larger dictionary, a coding scheme and more advanced parsing techniques, so that the program can code relationship to head of household automatically. A 100% automatic coding of relationship to head of household is perhaps neither advisable nor possible.

One also needs ways of handling the data even if some of the fields or types are missing, such as what to do if the gender field is missing, or how to cope with coding schemes for relationship to head of household that are not as semantically rich as would be optimal.

Support for reading census data as tables in commercial DBMS such as MS Access, dBase, Paradox or FoxPro must be implemented. Easy reporting is also lacking and must be implemented. At present only tab-delimited exports from these products are handled.

An important element in this type of computing is also to develop ways of finding and measuring the differences between manually coded and classified data and the same coding and classification done by software: man against machine! This can also help us make the software better.
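As an illustration of what such a measurement could look like, the Python sketch below compares two lists of household classifications. The choice of measures (raw agreement, a table of disagreeing pairs and Cohen's kappa) is my own suggestion and not something the software currently does; the function name and the sample labels are invented.

```python
from collections import Counter


def compare_classifications(manual, automatic):
    """Return (agreement, kappa, disagreements) for two label lists.

    agreement     -- share of households given the same category
    kappa         -- Cohen's kappa (agreement corrected for chance)
    disagreements -- Counter of (manual, automatic) pairs that differ
    """
    if not manual or len(manual) != len(automatic):
        raise ValueError("need two non-empty lists of equal length")
    n = len(manual)
    pairs = Counter(zip(manual, automatic))
    agreement = sum(c for (m, a), c in pairs.items() if m == a) / n
    # Agreement expected by chance, from the marginal distributions.
    manual_counts, auto_counts = Counter(manual), Counter(automatic)
    expected = sum(manual_counts[k] * auto_counts[k] for k in manual_counts) / n**2
    kappa = 1.0 if expected == 1 else (agreement - expected) / (1 - expected)
    disagreements = Counter({p: c for p, c in pairs.items() if p[0] != p[1]})
    return agreement, kappa, disagreements


# Invented example: four households classified by hand and by software.
manual = ["3b", "3b", "5a", "3a"]
automatic = ["3b", "3b", "5a", "3b"]
agreement, kappa, disagreements = compare_classifications(manual, automatic)
print(agreement)            # prints 0.75
print(dict(disagreements))  # prints {('3a', '3b'): 1}
```

The table of disagreeing pairs is probably the most useful part in practice, since it shows which categories the software confuses and therefore where the built-in assumptions need attention.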

Lastly, but very importantly, the presumptions currently built into the software are rather few and simple compared to what is quite often needed. Presumptions about marital status, offspring and kinship will vary in both time and space (England differs from Norway, see the discussion on surnames), and they are necessary when data are ambiguous or partly missing.

Concluding Remarks

This paper must be treated as a highly preliminary report from ongoing research at the history departments of both the University of Essex and the University of Bergen. A lot of work is still to come. But so far some more general conclusions on historical computing can be drawn.

  1. Data modelling. The focus of modelling must be drawn away from purely structural aspects. A data model consists of two parts, structures and operations. The structure of structured sources is often quite simple, and a debate on whether or not to use a relational model can easily lead nowhere. Modelling historical methods (partly what I call semantic operations) is far more complicated than modelling the purely structural aspects of the sources and the historical past. Sources often tend to have a hierarchical structure anyway. Therefore the choice of DBMS software should rest on its ability to implement the operations, not the structures.
  2. The need for standards and a focus on semantic types, like AGE, NAME and MARITAL STATUS, and their semantic categories (values), rather than on data types, like string and numeric, and file formats. Semantic types are equal or similar to what Daniel Greenstein calls primitive and compound datatypes and to what Manfred Thaller calls Prototypes. The type of historical computing discussed here depends deeply on agreement on historical datatypes and their semantic categories. However, some semantic categories, e.g. for marital status, are a lot easier to standardise than, say, occupation.
  3. Focus on semantic operations. The semantic operation 'compare' is not the same as the data operation 'compare' for a string or a number. If a surname is spelled with some variations it can still be the 'same' name and family. Automatic coding and classification are other examples of 'high-level' semantic operations. Just as some operations (e.g. adding) are feasible only for some datatypes, there also exist links between semantic types and semantic operations.
  4. Focus on modelling and formalising historical methods in general. As this paper illustrates, new technology combined with large databases now lets the historian use methods developed for small-scale datasets on large-scale data. However, this transfer of methods must be done with care, and not all methods or parts of methods are transferable. It is important to examine both our assumptions about the historical past and the assumptions built into our methods before we let the computer 'inherit' them.
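The difference between a data operation and a semantic operation (point 3) can be made concrete with surnames. The Python sketch below is only one possible illustration: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for a proper name-matching method, and the function name and the threshold of 0.8 are arbitrary assumptions that would need testing against real census data.

```python
from difflib import SequenceMatcher


def same_surname(a, b, threshold=0.8):
    """Semantic 'compare' for surnames: tolerate spelling variation.

    The threshold is a hypothetical value chosen for illustration.
    """
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


print("Olsen" == "Olssen")               # data operation: prints False
print(same_surname("Olsen", "Olssen"))   # semantic operation: prints True
print(same_surname("Olsen", "Hansen"))   # prints False
```

A string comparison answers whether two values are identical; the semantic comparison answers whether two values are likely to denote the same name, and that answer depends on assumptions (here the threshold) that vary with language and period.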

The focus on these four aspects of historical computing is needed in order to use small-scale techniques on large-scale data sets. The work of formalising these techniques is perhaps a never-ending task, and differences between computer-aided and manual coding and classification will occur, similar to the differences between two manually conducted classifications. The question of what can be formalised is perhaps not that interesting because, as part of this paper illustrates, formalising the low-level interpretation of a source can be rather time-consuming, and in the end it is perhaps the short-term cost in time that decides whether a method or technique is worth trying to computerise or whether it is better to do the work by hand. In some ways that is a pity, because it often means reinventing the wheel.

Appendix A-1 Sample input data file, 1881 census of Great Britain.

Appendix A-2 Sample corresponding description file for the 1881 census extract

[SYSTEM]

MIDS=SKIP

UNINHAB=SKIP

INST=SKIP

DLM=TAB

OUTQ={none}

MISSING=-

[EXTRACT]

0=PIECE

1=FOLINUM

2=PAGE

3=RFNSEQ

4=REGCITY

5=CIVPAR

7=ADDR

8=SNAME

9=PNAME

10=RELENU

11=RELSTD

12=MARENU

13=MARSTD

14=SEXENU

15=AGEENU

16=OCCUP

17=BIRPAR

18=BIRCNAM

[CLASSES]

SEX=SEXENU

MARSTAT=MARENU

REL2HEAD=RELENU

AGE=AGEENU

[KEYFIELDS]

0=PIECE

1=FOLINUM

2=PAGE

3=RFNSEQ

Section "SYSTEM" describes general file-format attributes; section "EXTRACT" describes the fields as extracted from the source; section "CLASSES" describes which fields correspond to the semantic types SEX, MARITAL STATUS, RELATIONSHIP TO HEAD OF HOUSEHOLD and AGE; and section "KEYFIELDS" defines which fields determine the order (or sequence) of the input. There are also two sections named "HHOLD" and "PERSON" that describe the output; they are not reproduced here.
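Because the description file follows the familiar section and key=value layout, it can be read with standard tools. The sketch below uses Python's `configparser` on a shortened excerpt of the file above to find the column position of each semantic type. It is an illustration only: the function `class_columns` is hypothetical, and the actual tool is written in C/C++ and may parse the file differently.

```python
import configparser

# Shortened excerpt of the description file shown above.
DESCRIPTION = """\
[SYSTEM]
DLM=TAB
MISSING=-

[EXTRACT]
8=SNAME
10=RELENU
12=MARENU
14=SEXENU
15=AGEENU

[CLASSES]
SEX=SEXENU
MARSTAT=MARENU
REL2HEAD=RELENU
AGE=AGEENU
"""


def class_columns(text):
    """Map each semantic type in [CLASSES] to its column in [EXTRACT]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep names such as SEX in upper case
    parser.read_string(text)
    # [EXTRACT] is written as position=field; invert it to field -> position.
    position = {field: int(index) for index, field in parser["EXTRACT"].items()}
    return {semantic: position[field]
            for semantic, field in parser["CLASSES"].items()}


print(class_columns(DESCRIPTION))
# prints {'SEX': 14, 'MARSTAT': 12, 'REL2HEAD': 10, 'AGE': 15}
```

The two-step lookup (semantic type to field name, field name to column) is what lets the same classification code run on census extracts with different column layouts.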

Appendix A-3 Sample output files