Computer and Household Analysis in Norway

XI INTERNATIONAL AHC
CONFERENCE
MOSCOW, AUGUST, 20-24

SESSION A8: HOUSEHOLD HISTORY



Arne Solli

Department of History

University of Bergen

Introduction

Population census lists are the main source in traditional household analysis, and the methodology is normally based on the work of Peter Laslett and his colleagues in the Cambridge Group for the History of Population and Social Structure. In Norway there have been several household studies following this methodology, and the criticism applied to the methodology also applies to the Norwegian studies. Traditional household analysis in Norway has tended to focus on the households of a single parish at a few points in time. The number and quality of the censuses have limited studies of family dynamics, and the focus on one parish has often restricted the possibility of a broader and deeper regional comparison. One goal of my hovedfag (Master's) degree in History at the University of Bergen was therefore to analyze the diversity and variation of the Norwegian pre-industrial household. Another was to focus on some dynamic aspects of household structure, because some keys to understanding the regional variation could be hidden in family dynamics.

I wanted to test a model claiming that the proportion of nuclear family households in a region depended on the possibilities for earning extra income from fishing, shipping, mining or the lumber industry. Conversely, where agriculture (livestock farming or grain production) was the main or sole basis for a household, the household structure tended to become more complex, i.e. larger households, often extended with kin. In 1801, 90% of the Norwegian population lived in rural areas, and these four industries gave an important extra income to farmers and cottagers in different regions, sometimes so important that farming could be said to be the secondary occupation. These industries also represented the main fields of Norwegian export.

A study with this focus requires large samples of data on different levels. In Norway there is only one nationwide population census from pre-industrial times, the 1801 census. The whole census is recorded in computer files, and there is also a coded version of the census, in which all the different terms for household position, occupation, sex and marital status have been coded into a finite number of categories.

Using a coded version of the census produced automatically by a computer program raises one serious question: can we trust classification done by a computer program? It was also necessary to go beyond the individual level of the census, and this raised the problem of analytical levels: what new aggregates would be needed? I also wanted to make the source material and the new aggregates available to other historians, so that they could examine the material, ask different questions or test my results. This creates a new problem: how could these new aggregates be made public for historians worldwide, and what kind of analysis could be done on the World Wide Web (WWW)? Finally, I will give a short sketch of our technical solution to these questions.

The Computer as Historian

Household composition

Can we trust the computer? This question can be answered by comparing the work of the computer with the work of historians. With the 1801 census we are fortunate enough to be able to compare the work of human historians with that of our computer historian. The 1801 census is coded according to the principles briefly discussed above.

Table 1 Classification (by man and by machine) of household composition. The parishes of Ullensaker, Rendalen and Etne, 1801.

Categories - household composition

Parish      Classified by  Heads  Spouse  Children   Kin  Servant  Other  Inmates   Sum
Ullensaker  Higley*          759   2187 (combined)   316      555    116      147  4080
Ullensaker  Computer         764   2205 (combined)   226      562    149      174  4080
Rendalen    Sogner           301     261       665   158      193     29       78  1685
Rendalen    Computer         301     265       644   171      191     55       58  1685
Etne        Dyrvik           272     250       547    54      242   46 (combined)  1412
Etne        Computer         272     251       549    58      249     19       14  1412

Where a row gives one figure for two adjacent categories, it is shown as a combined value.
Sources: Ullensaker, L. Higley 1976: 147; Rendalen, S. Sogner 1979:291,297; Etne, S. Dyrvik 1981:186-187. Computer: Algorithm written by Jan Oldervoll in the program-language PL/1. Central Bureau of Statistics NOS B 134 (1980)

Table source: Arne Solli, Individ-Hushald-Samfunn (1995), p. 19.

* Higley did not distinguish between spouse and children. She also categorized grandchildren as kin and not as children.
Figure 1 Classification (by man and by machine) of household composition. The parishes of Ullensaker, Rendalen and Etne, 1801.


Table 1 (and Figure 1) shows the classification of household position done by computer and by historians. The differences between man and machine are not large when it comes to categorizing heads, spouses, children and servants. For the groups other and inmates there are some differences, but these are also the smallest categories of household position. For the parish of Ullensaker there are also differences for heads, spouses and children. Lisbeth Higley, who did the classification, was critical of the census-takers' division into households, and therefore the number of heads is different. Higley also chose a different way of categorizing grandchildren. If we subtract the grandchildren from the kin and add them to the child category, there is in fact only a difference of 6 persons, not 20. Higley also used other sources to confirm and find kin relations, which shows that the census-taker did not always specify the kin relation in the census.

Why does the classification of other and inmates differ? There are probably several reasons. One is that when whole families are lodgers, this information is clearly stated only for the husband and not for his wife and children. A historian will infer that the wife and children are also lodgers from the context, not from the terms for household position. The computer model therefore has to be extended to infer the correct household position from the context, not only from the stated household position itself. This was implemented in the 1801 classification by Jan Oldervoll, often with good results.

This shows that different principles for categorizing are probably a greater source of divergence than whether the classification is done by a "computer" or a "human".

The main conclusion is that the computer classifies these kinds of restricted text sources almost like a human. After all, there is always a human behind every computer and computer program: the result of the computerized classification is totally dependent on the model and the program made by the historian.

At this point we have:

  1. A complete census in machine-readable format.
  2. All terms for marital status, sex, household position and occupation reduced to a comprehensible number of categories and subcategories. For example, 5000 terms for marital status have become only four categories: unmarried, married, divorced and widowed (with 10 subcategories for married and widowed).

At this point it is crucial to decide which new aggregates to make. The aim is to reduce the amount of data, obtain new kinds of information and allow different kinds of analysis. The next step is to aggregate the data to a) household level and b) parish level. Instead of 880,000 individuals, there are only 163,000 households, and at the parish level just about 330 parishes. Of course some information is lost, but some is gained; that is one of the aims of aggregating.

Aggregating - less Data but new Information

The Household Level

At the household level the following variables (extracted or aggregated) are present: geographical variables (location); the age, sex and occupation of the head and the spouse, but not of the other members of the household; and the number of members in each category (e.g. 3 children, 1 servant, 0 lodgers, 1 kin). Furthermore, there are the total number of members, the number of married persons in the household and some other variables relating to the household's workforce. The household level also includes an important variable for traditional household analysis: the type of the household according to the Laslett-Hammel classification. This gives a total of 80 variables.
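The step from individual records to household rows can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names and the sample data are hypothetical, and the sketch is not the program used for the 1801 census.

```python
from collections import Counter, defaultdict

# Hypothetical coded individual records: (parish, household_id, position, age).
individuals = [
    ("Ejd", 1, "head", 34),
    ("Ejd", 1, "spouse", 31),
    ("Ejd", 1, "child", 4),
    ("Ejd", 1, "kin", 66),
    ("Ejd", 2, "head", 52),
    ("Ejd", 2, "servant", 19),
]


def aggregate_households(records):
    """Collapse individual records into one summary row per household."""
    households = defaultdict(
        lambda: {"size": 0, "head_age": None, "counts": Counter()}
    )
    for parish, household_id, position, age in records:
        row = households[(parish, household_id)]
        row["size"] += 1
        row["counts"][position] += 1
        if position == "head":
            # Only the head's age is kept; the ages of other members are lost.
            row["head_age"] = age
    return dict(households)


def mean_household_size(households, parish):
    """A parish-level variable derived from the household rows."""
    sizes = [h["size"] for (p, _), h in households.items() if p == parish]
    return sum(sizes) / len(sizes)


households = aggregate_households(individuals)
print(households[("Ejd", 1)]["counts"]["kin"])  # number of resident kin
print(mean_household_size(households, "Ejd"))
```

The sketch also shows the trade-off described above: the household row gains counts per category, while individual detail such as the ages of children and servants disappears.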

The Parish Level

The variables at the individual and household levels are aggregated to the parish level, which has about 50 variables. These are different demographic variables such as the total population of the parish, the number of household heads, the ratio of cottagers to farmers, the Singulate Mean Age at Marriage for men and women, mean household size, the number of relatives, the number of households with relatives and so on. To manage a regional analysis on this level (330 parishes), one solution would be to make digital maps so that different variables could be plotted on an 1801 map of Norway.
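For readers unfamiliar with the Singulate Mean Age at Marriage (SMAM), the following Python sketch shows Hajnal's standard calculation from the proportions never married in five-year age groups. It assumes the conventional age groups 15-19 to 50-54 and one sex at a time; the function name and input layout are hypothetical, and the text does not state how the parish variable was computed from the census age groups.

```python
def singulate_mean_age_at_marriage(single):
    """Hajnal's SMAM.

    `single` maps the lower bound of each five-year age group
    (15, 20, ..., 50) to the proportion never married in that group.
    """
    # Person-years lived single up to age 50 (everyone is single before 15).
    years_single = 15 + sum(5 * single[age] for age in range(15, 50, 5))
    # Proportion still single at age 50, estimated from the groups around 50.
    never_marrying = (single[45] + single[50]) / 2
    # Remove those who never marry, then average over those who do.
    return (years_single - 50 * never_marrying) / (1 - never_marrying)


# Illustrative input only: everyone single until 25, everyone married after.
toy = {15: 1.0, 20: 1.0, 25: 0.0, 30: 0.0, 35: 0.0, 40: 0.0, 45: 0.0, 50: 0.0}
print(singulate_mean_age_at_marriage(toy))  # 25.0
```

The toy input makes the logic easy to check by hand: ten years lived single after age 15 gives a mean age at marriage of 25.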

Household Structure

Going from individuals to households also requires a new classification: the classification of household types according to Laslett-Hammel. Normally households are classified into the Laslett-Hammel scheme manually, but with approximately 160,000 households this would be a considerable task. The classification has five main categories (1. Solitaries, 2. No-family, 3. Simple family, 4. Extended and 5. Multiple), each with a variable number of subcategories, 20 subcategories in total. Examples of subcategories are 3c, widow with children, and 4a, extended upwards, when a nuclear family is extended by the mother of the spouse or the head. In order to classify, one needs to know the household position, sex and marital status of each member.

To make the computer do this kind of classification I used a combination of two models: first a Finite State Transducer (FST) to decide whether the household was in category 1, 2 or 3, with subcategories, then a decision-tree model to decide whether a household preliminarily classified into category 3 should go into category 4 or 5. Decision-tree models are used in various computer systems, for example expert systems, management information systems and industrial robots. So far, then, we have the recording of a census and its subsets: a computerized and coded version at the individual level, and coded aggregates at the household level and the parish level.
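The two-step idea can be illustrated with a much simplified Python sketch. It is not the transducer or the decision tree used in the thesis: the state names, the `Member` type and the rules are assumptions made for illustration, only the five main categories are handled, and a married relative is taken as the sign of a second family unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    position: str  # "head", "spouse", "child", "kin", "servant" or "lodger"
    marital: str   # "unmarried", "married" or "widowed"


def first_pass(members):
    """Finite-state pass: read the members one by one and update the state."""
    state = "1 Solitary"
    for member in members:
        if member.position in ("spouse", "child"):
            state = "3 Simple family"
        elif member.position == "kin" and state == "1 Solitary":
            state = "2 No family"
        # Heads, servants and lodgers leave the state unchanged.
    return state


def classify(members):
    """Second step: decide whether a simple family is extended or multiple."""
    category = first_pass(members)
    if category != "3 Simple family":
        return category
    kin = [m for m in members if m.position == "kin"]
    if not kin:
        return category
    if any(m.marital == "married" for m in kin):
        return "5 Multiple"
    return "4 Extended"


nuclear = [Member("head", "married"), Member("spouse", "married"),
           Member("child", "unmarried")]
with_mother = nuclear + [Member("kin", "widowed")]
print(classify(nuclear))      # 3 Simple family
print(classify(with_mother))  # 4 Extended
```

Separating the two steps keeps the common cases (categories 1 to 3) cheap, and confines the harder question of how kin relate to the core family to the households that actually contain kin.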

Publication - making sources available on the WEB

It is of course nice to have a whole census on a hard disk, together with a coded version of the census, and aggregates like the household level and the parish level ready to import into any database or statistical program. This, however, is also a rather selfish idea.

Since the late 1970s the Department of History at the University of Bergen has been selling diskettes, paper printouts and microfiches of the 1801 census. Together with the data on diskettes the user could also buy an MS-DOS program called CENSSYS, developed by Professor Jan Oldervoll. The census became really popular when Oldervoll put it on the World Wide Web under the name WEBSYS. This is the census in its full version, without additions or coded fields. The URL is: http://www.uib.no/hi/1801page.html. This WWW resource is in fact one of the most visited sites at the University of Bergen and is extremely popular among genealogists, especially in Norway and in the US.

The coded version of the Norwegian census was published in 1980, ten years after the work had been initiated. 880,000 individuals had been recorded into the computer (on cards), and the aggregated tables were made and published by the Central Bureau of Statistics. In 1992 Jan Oldervoll published the coded version as a sub-module of the CENSSYS system called CENS1801 (an MS-DOS program), which gives some basic bivariate descriptive statistics on a PC, with the possibility of printing and exporting data to word processors or databases. In 1995 the WWW version of CENS1801 was ready for publication; it is called WEB1801. The URL for WEB1801 is: http://www.uib.no/hi/1801page.html. This allows the user to do the bivariate descriptive analysis on the web with just a few simple mouse clicks.

In the spring of 1996 we put the household-aggregate on the World Wide Web. The household aggregate can also be found at URL: http://www.uib.no/hi/1801page.html. The name of this aggregate is WEBHUSH.

The parish aggregate of the 1801 census will be available on the WWW by spring 1997. This will include a Geographical Information System (GIS) using digital maps. Other Norwegian censuses will also be included in the WEBSYS resource by the end of 1996.

Demographic and Household Analysis on the WEB

I will give two examples of what kind of analysis a student of history or a historian can do on the Internet with the data from the Norwegian census of 1801.

One of the results of the Cambridge Group's work was the hypothesis of the Western family pattern (Western European family system): late or deferred marriage, nuclear or simple households, little or no age difference between spouses, and the presence of life-cycle servants as members in a significant proportion of households. Laslett claimed that this system had been prevalent in Western Europe since 1500, perhaps even longer. But there are some problems beneath these broad generalizations. One that concerned me was how to explain the variations that existed within regions of Western Europe (Norway, for example). I first give an example from the coded census, studying age at marriage, and then an example from the household database, studying household composition.

Uncovering the Marriage Pattern

One of the bold generalizations of the 1960s is John Hajnal's European Marriage Pattern. Hajnal divided Europe into two parts, the "western" and the "eastern". In Western Europe the mean age at marriage was much higher than in other parts of the world. Celibacy, in terms of the percentage of the population that remained unmarried throughout their lives, was also much higher in Europe. According to Hajnal the mean age at marriage for women could sometimes be as low as 24.5 years in areas where the "European marriage pattern" existed, but generally it was much higher, about 30 years. Can we find these traits in an arbitrary part of Norway in 1801? With WEB1801 we can get some idea. After the user selects an arbitrary parish out of the 330 in Norway, WEB1801 produces a table like Table 2.
Table 2 Example of output from WEB1801, coded version of the 1801 census of Norway
Horizontal: Marital Status, Vertical: Age

Selected parish: Ejd. 1801.

Age-group  Married  Widowed  Unmarried  Total number of persons
1-5              0        0        281                      281
6-10             0        0        292                      292
11-15            0        0        243                      243
16-20            3        0        228                      231
21-25           25        0        165                      190
26-30           74        0        109                      183
31-35           93        0         45                      138
36-40          120        3         40                      163
41-45          124        0         17                      141
46-50          150        5         21                      176
51-55           88        6         13                      107
56-60           93       12          8                      113
61-65           61       13          6                       80
66-70           47       27          2                       76
71-75           22       16          1                       39
76-80            8       27          2                       37
81-85            6       19          0                       25
86-90            2        6          0                        8
96-100           0        1          0                        1
Total          916      135       1473                     2524
Source: WEB1801, Norwegian Census 1801, Department of History, University of Bergen

Table 2 shows some features of the "European Marriage Pattern". In the selected parish, Ejd, 109 out of 183 persons (60%) in the age group 26 to 30 years are still unmarried, and in the next age group 45 out of 138 (33%) are still unmarried. The table includes both males and females. The table also shows the celibacy feature of the pattern: the proportion unmarried in the age group 46-50 can be taken as an estimate of the proportion never married. In this parish in 1801 the proportion was 12%. All these figures are well inside the "European Marriage Pattern". This type of crude demographic analysis can be done on the Internet without downloading any data or importing them into a statistical program or a spreadsheet.
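The percentages quoted from Table 2 can be checked with a short calculation. The Python sketch below copies three rows from the table; the function name and the dictionary layout are only illustrative.

```python
# Age group -> (married, widowed, unmarried), copied from Table 2.
table2 = {
    "26-30": (74, 0, 109),
    "31-35": (93, 0, 45),
    "46-50": (150, 5, 21),
}


def share_unmarried(married, widowed, unmarried):
    """Percentage of the age group that is unmarried."""
    total = married + widowed + unmarried
    return 100 * unmarried / total


for age_group, counts in table2.items():
    print(f"{age_group}: {share_unmarried(*counts):.0f}% unmarried")
# 26-30: 60% unmarried
# 31-35: 33% unmarried
# 46-50: 12% unmarried
```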

But there is also a household aggregate of these data, so let us turn to some simple household analysis.

Absence of resident-kin

Let us say we wanted to study household formation and we could formulate a hypothesis, for example: The resident kin is nearly absent in the pre-industrial household in Northwest Europe due to the neo-local practice of household formation.

According to Norwegian material there is some evidence that this hypothesis does not hold, and in Norway there is strong socio-economic variation in the proportion of resident kin. With WEBHUSH we can choose some parishes in Western Norway to see whether the hypothesis withstands an attempt at falsification with data from this region. In Table 3 the selected region consists of eight parishes: Askevold, Yttre Holmedahl, Indre Holmedahl, Førde, Julster, Gloppen, Indvig and Ejd. The selected region was relatively homogeneous economically and socially in the late 18th century, and livestock farming was of great importance. Since life expectancy is one factor that could limit the number of resident kin, we make a cross-table with the number of kin placed horizontally and the age of the household head vertically. The eight-year age group is user-defined and is set when one selects variables.
Table 3 Example of output from WEBHUSH. Absolute numbers.
Table 17.5 Horizontal: no. of resident kin, vertical: age of household head

Selected parishes: Askevold, Yttre Holmedahl, Indre Holmedahl, Førde, Julster, Gloppen, Indvig, Ejd

Absolute numbers (number of households)

age of household head  no. of resident kin
                           0     1     2     3     4     5     6    SUM
16-23                      8     8     4     1     1     3     0     25
24-31                    105    80    59    17     8     2     1    272
32-39                    284   147   107    16    15     3     2    574
40-47                    512   205    67     7     6     1     0    798
48-55                    592   146    28     5     2     1     0    774
56-63                    418    82     5     0     0     0     0    505
64-71                    222    58     4     0     0     0     0    284
72-79                     71    23     2     0     0     1     0     97
80-87                     25     3     2     0     0     0     0     30
88-95                      1     0     0     0     0     0     0      1
SUM                     2238   752   278    46    32    11     3   3360
Source: WEBHUSH, Norwegian Census 1801, Department of History, University of Bergen

After the user selects a county, parishes and variables, WEBHUSH produces a table like Table 3. The table shows that when the household heads are between 24 and 31, resident kin are present in 167 of the 272 households (61%), and even in the age group 32 to 39, 290 households (51%) have resident kin. Most commonly there are 1 or 2 kin, and it is no careless guess to say that these are the mother and the father, or the mother and a sister, of the household head or his spouse.

The tables produced by WEBHUSH have three different output formats: a) nicely formatted with the HTML tag TABLE, b) plain text with space-delimited columns for import into word processors, or c) semicolon-delimited columns for import into word processors, spreadsheets or databases.
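Semicolon-delimited output of this kind is easy to read back into a program. The Python sketch below uses the standard `csv` module; the exact layout of the WEBHUSH export (header row, column labels) is an assumption made for the example, with figures taken from Table 3.

```python
import csv
import io

# Hypothetical export text; the real column layout of WEBHUSH may differ.
exported = """age of household head;0;1;2;3;4;5;6;SUM
16-23;8;8;4;1;1;3;0;25
24-31;105;80;59;17;8;2;1;272
"""


def read_cross_table(text):
    """Parse a semicolon-delimited cross-table into nested dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    return {
        row[0]: dict(zip(header[1:], map(int, row[1:])))
        for row in reader
        if row
    }


table = read_cross_table(exported)
print(table["24-31"]["0"])  # households with no resident kin, head aged 24-31
```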

When the user has made a cross-tabulation like the one in Table 3, WEBHUSH offers some simple bivariate statistics or functions: the table can be transposed, with age across the top, and a total percentage can be calculated for each cell, as well as vertical percentages and horizontal percentages. Table 4 is an example of the calculation of horizontal percentages. These functions take merely one or two clicks with the mouse.
Table 4 Example of output from WEBHUSH. Horizontal percentage.
Table 17.5 Horizontal: no. of resident kin, vertical: age of household head

Selected parishes: Askevold, Yttre Holmedahl, Indre Holmedahl, Førde, Julster, Gloppen, Indvig, Ejd

Horizontal percentage (SUM: number of households)

age of household head  no. of resident kin
                             0       1       2       3       4       5       6    SUM
16-23                    32.0%   32.0%   16.0%    4.0%    4.0%   12.0%       -     25
24-31                    38.6%   29.4%   21.7%    6.3%    2.9%    0.7%    0.4%    272
32-39                    49.5%   25.6%   18.6%    2.8%    2.6%    0.5%    0.3%    574
40-47                    64.2%   25.7%    8.4%    0.9%    0.8%    0.1%       -    798
48-55                    76.5%   18.9%    3.6%    0.6%    0.3%    0.1%       -    774
56-63                    82.8%   16.2%    1.0%       -       -       -       -    505
64-71                    78.2%   20.4%    1.4%       -       -       -       -    284
72-79                    73.2%   23.7%    2.1%       -       -    1.0%       -     97
80-87                    83.3%   10.0%    6.7%       -       -       -       -     30
88-95                   100.0%       -       -       -       -       -       -      1
SUM                      66.6%   22.4%    8.3%    1.4%    1.0%    0.3%    0.1%   3360
Source: WEBHUSH, Norwegian Census 1801, Department of History, University of Bergen
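The difference between horizontal and vertical percentages can be made concrete with a small Python sketch. It uses two rows of absolute numbers from Table 3; the function names are illustrative, and the rounding rule of WEBHUSH itself is not known, so a cell that falls exactly on a half (such as 17 of 272, or 6.25%) may round differently here.

```python
def horizontal_percentages(table):
    """Each cell as a percentage of its row total."""
    return [[round(100 * n / sum(row), 1) for n in row] for row in table]


def vertical_percentages(table):
    """Each cell as a percentage of its column total (0.0 for empty columns)."""
    column_sums = [sum(column) for column in zip(*table)]
    return [
        [round(100 * n / s, 1) if s else 0.0 for n, s in zip(row, column_sums)]
        for row in table
    ]


# Rows for household heads aged 16-23 and 56-63, 0 to 6 resident kin.
counts = [
    [8, 8, 4, 1, 1, 3, 0],
    [418, 82, 5, 0, 0, 0, 0],
]
print(horizontal_percentages(counts)[0])
# [32.0, 32.0, 16.0, 4.0, 4.0, 12.0, 0.0]
```

Horizontal percentages answer the question "how are households with heads of this age distributed by number of kin", while vertical percentages answer "how old are the heads of households with this number of kin".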

The last function is the mean of the horizontal variable, if the variable allows this kind of calculation. Using the same data the user will get an output as in Table 5. Table 5 shows that the mean number of resident kin ranges from 1.5 down to 0.9 when the household head is under 40 years old.

The output in Table 3, Table 4 or Table 5 should be a good starting point for taking a closer look at this or other regions, and of course for trying to falsify or modify the hypothesis stated above. This shows that WEBHUSH is a suitable tool for what could be called "source-browsing": getting a quick view of the data and finding regions that could be interesting for a detailed and critical study of kin relations in pre-industrial Norway. It can also be used conversely: if one has already done a detailed study of one parish, the WWW tools make it possible to compare the findings of that study with other regions.
Table 5 Example of output from WEBHUSH. Finding mean number of resident kin.
Table 17.5 Horizontal: no. of resident kin, vertical: age of household head

Selected parishes: Askevold, Yttre Holmedahl, Indre Holmedahl, Førde, Julster, Gloppen, Indvig, Ejd

Horizontal mean

age of household head  Number of households  Mean no. of resident kin
16-23                                    25                      1.52
24-31                                   272                      1.09
32-39                                   574                      0.86
40-47                                   798                      0.49
48-55                                   774                      0.30
56-63                                   505                      0.18
64-71                                   284                      0.23
72-79                                    97                      0.33
80-87                                    30                      0.23
88-95                                     1                      0.00
SUM                                    3360                      0.49
Source: WEBHUSH, Norwegian Census 1801, Department of History, University of Bergen
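The means in Table 5 follow directly from the absolute numbers in Table 3: each count is weighted by the number of kin it stands for. A minimal Python sketch of that calculation, with an illustrative function name, is:

```python
def mean_resident_kin(counts):
    """counts[k] is the number of households with k resident kin."""
    households = sum(counts)
    kin = sum(k * n for k, n in enumerate(counts))
    return kin / households


# Rows copied from Table 3 (0 to 6 resident kin).
heads_16_23 = [8, 8, 4, 1, 1, 3, 0]
heads_24_31 = [105, 80, 59, 17, 8, 2, 1]
all_heads = [2238, 752, 278, 46, 32, 11, 3]

print(f"{mean_resident_kin(heads_16_23):.2f}")  # 1.52
print(f"{mean_resident_kin(heads_24_31):.2f}")  # 1.09
print(f"{mean_resident_kin(all_heads):.2f}")    # 0.49
```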

Our Technical Solution

The software model for publication and statistical analysis on the WWW is quite simple. The requirements for making systems like WEBSYS, WEB1801 or WEBHUSH work on a PC are (see Figure 2):

This hardware and software allows for installation of a WWW-server on a PC, but in order to make programs that access local databases the following is also needed:

This program and the WWW server communicate through something called a Common Gateway Interface (CGI). CGI passes input from the user to the local program, and the program returns data as output (e.g. HTML documents). An alternative to Pascal, C or Basic is to write the scripts in a language like Perl, possibly in combination with JavaScript. The communication between the Internet, the WWW server, a program and the historical sources is outlined in Figure 2.
Figure 2 Software model for publishing and interactive statistical analyses on the World Wide Web
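The CGI round trip can be illustrated with a very small script. The sketch below is written in Python rather than the languages named above, and it is not the WEBSYS code: the parameter name `parish` and the function `build_response` are hypothetical. It relies only on the general CGI convention that the server passes the query part of the URL in the `QUERY_STRING` environment variable and sends whatever the program writes to standard output back to the browser.

```python
import html
import os
from urllib.parse import parse_qs


def build_response(query_string):
    """Turn a CGI query string into an HTTP header plus a small HTML page."""
    params = parse_qs(query_string)
    parish = params.get("parish", ["(none selected)"])[0]
    # Escape user input before placing it in the HTML document.
    body = f"<html><body><h1>Parish: {html.escape(parish)}</h1></body></html>"
    return "Content-Type: text/html\n\n" + body


if __name__ == "__main__":
    # A CGI-capable server places the part of the URL after "?" here.
    print(build_response(os.environ.get("QUERY_STRING", "")), end="")
```

A real system would look up the selected parish in the local database and format a table at the point where this sketch only echoes the parish name.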



Conclusion

First, back to my thesis: did this effort give answers to any of the historical questions? Was the household system tightly knit to the ecotypes? Was the proportion of nuclear households in a region dependent on the possibilities for making extra income from fishing, shipping, mining or lumbering? The model was tested in a typical fishing region (fisher-farmers) and in a region with a strong bias toward livestock farming. In part I got a positive answer. In this particular fishing region I found earlier marriage and a lower age at household formation (age 26-27), simple households, and a dominant neo-local rule for household formation. In the agricultural region I found later marriage, a higher age at household formation (age 30-31), and predominantly patri-local household formation, with larger and more complex household types.

This result was obtained by borrowing models from both computer science and computational linguistics. I have argued that it is possible to let the computer do some of the classification work on historical sources. There are of course limitations. Census lists are fairly restricted when it comes to semantics: the meaning is often clear-cut, and the problems arise from the heavy use of abbreviations. Comparison between human and "computerized" categorization shows that it is possible to leave some of the hard work to the computer.

Computerized historical sources can be put on the Internet, not just for browsing and/or downloading, but also for simple statistical analysis. So far this also has limitations, especially when it comes to making good user interfaces.