Black swans, black squirrels and charter databases

Posted: Aug. 6, 2015, 2:48 p.m. by Rachel Stone

What do black people who appear in Elizabethan archives have in common with early medieval charters which mention saltpans and the interrogation of a medieval transvestite prostitute? My answer would be that they’re all rare phenomena that historians might be interested in. This post considers (at a fairly abstract level) how historians can find such rare events in documentary records and the role of digital humanities in assisting this. It’s worth starting by estimating just how rare such records are. A little while ago I read Nassim Nicholas Taleb, who talked a lot about the concept of black swan events: ones that cannot be expected because they come so far outside one’s previous experience. (The name comes from the assumptions of Westerners before Australia was discovered: if all anyone has ever seen for centuries are white swans, how can you imagine that a species of black swan exists until you actually see it?)

In that sense, I think John Rykener, the transvestite prostitute, probably counts as a black swan. It’s not impossible to imagine a variety of unusual sexual practices in medieval London, but the chances that one of those participating had been caught and that the records survive are very low. (In fact Jeremy Goldberg has recently argued that this isn’t actually a genuine court record at all, but that’s a separate matter.)

In contrast, the other phenomena that I mentioned at the start aren’t as rare. It’s long been known that there were some black people in sixteenth-century England and given the extensive coverage of parish registers, it’s likely that a few would be mentioned. We also have thousands of Carolingian charters from across a wide range of Europe: since the use of saltpans to extract salt was a standard technique, it’s likely that they turn up in some charters. To adapt Taleb’s metaphor, these are less black swans than black squirrels: a known, if unusual, variant of the grey squirrel.

As an ex-mathematician, one of my responses to unusual phenomena is to start thinking about probabilities and scale. At a very abstract level, a lot of historical research involves going through a series of source documents looking for examples of a particular uncommon event, which the historian will then collate, analyse and discuss. But how many do you need to go through and how feasible is the research project? Let’s start with some very crude numbers, to get a feeling of scale. How many hours working on source texts are you likely to have?

200 hours studying in the archives or reading through printed volumes, for example, equates to around 5-6 weeks of full-time study: the sort of effort you’d probably need to find new material for a journal article. But for a research project like that you’d also need to read up on the secondary literature and then spend time actually drafting the article. So 200 hours of document study probably equates to a 3-6 month research project, depending on how much time you have available for research. For a typical three-year British history PhD, you’re probably looking at around 2000-2500 hours of document study at the maximum: that equates to a year of full-time work (50 weeks of 40-50 hours), plus a year of background reading/planning and a year of writing up. For a single three-year postdoctoral research post, you might be able to get up to 3000 hours of document discovery and analysis in total (around half of a maximum of 6000 hours on the project), but you’re unlikely to get much more. Call that number of hours, whatever it is, T, the time available.

There’s then the question of how long it takes to analyse the average document/unit. In many ways, that’s like asking how long a piece of string is, given that it depends on the language of the document, whether it’s printed or handwritten, how formulaic it is and what kind of things you’re trying to find. The point is, however, that after a while you’ll gradually get to know this. And that gives you your second variable: your rate of analysis (R). If you know that on average it takes you an hour to analyse a charter and you’ve got 200 hours documentary time for your project, that’s 200 documents you can study. If, however, you can look through 10 documents an hour and you’ve got 2000 hours for your project, you’re looking at a sample of 20,000 documents. These figures are always very rough, of course, but they do give an idea of scale, which is key. It may be possible to speed up and cut corners a bit, but if you have 20,000 documents to analyse and on average it’s taking you 30 minutes to analyse each one, that’s 10,000 hours of work and not a feasible PhD project.

The final variable here is how frequently the event you’re looking for occurs (its probability, p). You’ll obviously only know for certain after you’ve done the detailed analysis, but again it’s possible to make a very rough estimate of the likely frequency early on. Suppose you’re trying to get a sense of whether it’s more likely that your event turns up 1 time in 10 (p = 0.1), 1 time in 100 (p = 0.01) or 1 time in 1000 (p = 0.001). These very widely spaced intervals can often be spotted with a bit of intuition, sampling and some reading round. Firstly, how often do you think these events would happen in the society you’re studying? You’ll probably already have some sense of whether they’re really strange (1 in 1000) as opposed to rare (1 in 100). Secondly, try looking at a random selection of about 20 of your sources and see if they include any examples of the event. If so, it’s probably nearer a 1 in 10 phenomenon. Thirdly, has someone done a statistical analysis of event frequency for a subset of your data? What kind of figures for frequency are they coming up with? If no-one’s done a statistical analysis and you haven’t found any examples in your sample, that suggests a relatively uncommon phenomenon (less than 1 in 10).
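The idea of sampling about 20 sources can be checked with a little arithmetic. What follows is only a rough sketch: it assumes every document independently has the same chance p of containing the event (an idealisation), and the function name is purely illustrative.

```python
def chance_sample_has_example(p: float, sample_size: int = 20) -> float:
    """Chance that at least one of `sample_size` documents contains the event.

    Assumes each document independently contains the event with probability p.
    """
    return 1 - (1 - p) ** sample_size


for p in (0.1, 0.01, 0.001):
    chance = chance_sample_has_example(p)
    print(f"p = {p}: {chance:.0%} chance of at least one example in 20 documents")
```

Under those assumptions, a 1 in 10 event turns up in such a sample nearly nine times out of ten, a 1 in 100 event about one time in five and a 1 in 1000 event hardly ever. So a hit in the sample points towards a fairly common event, but an empty sample can’t by itself separate 1 in 100 from 1 in 1000.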

What about references to the event in footnotes or articles? If several other researchers include different examples of the event occurring, it’s likely either to be in the order of a 1 in 100 phenomenon or a type of source where researchers can routinely go through thousands of individual source documents (such as with marriage and death registers). If, however, when you look for references to the event, they’re all the same couple of examples, or the examples are being found by someone who’s spent a lifetime studying a particular type of source/archive, then you probably are in the realm of 1 in 1000 events or even rarer ones. Once you’ve got an estimated probability (p) at the level of 1 in 10, 1 in 100 or 1 in 1000, you can set up an extremely basic equation for the expected number of documents containing the event (E) you’ll find:

E = T x R x p

Here you’re measuring T (time) in hours, R (rate of analysis) in documents per hour, and p as a decimal fraction. Or to put some figures into that, if you’ve got 200 hours, you can look at 2 documents an hour and you’re looking for an event you expect 1 in 100 times, you’d expect to find 200 x 2 x 0.01 = 4 documents containing the event in that time.

So far, so good. But expected numbers are just averages. So now to ask a slightly more mathematically complex question: is there a possibility that I won’t find any example of the event if I look through all these documents? Assume for a minute that the events are evenly distributed in the documents. (That’s a big assumption, but I’ll get back to it in a minute.) If you pick one document, then the chance of finding the event is p and the chance of not finding the event is 1-p. So, for example, the chance of not finding an event that occurs 1 in 100 times is 99 in 100 = 0.99. The chance of the event not appearing in a second document is again 0.99, and the chance of it appearing in neither document is 0.99 x 0.99 = (1-p)^2. Similarly, the chance of the event not occurring when you look at N independent documents is:

(1-p)^N, that is, (1-p) x (1-p) x ... x (1-p), with N factors.

But you already know the number of documents (N) you’re going to have time to look at: that’s T x R. So the probability that you get no examples of the event is:

(1-p)^(T x R)

You’ll probably need a scientific calculator to work that out, but if you plug in the figures you get some interesting results. Suppose, as before, you’ve got 200 hours, you can look at 2 documents an hour and you’re looking for an event you expect 1 in 100 times. Your chances of not finding anything are 0.99^400 = 0.018 (= 1.8%). In other words, you’re almost sure to find at least some references to the event. On the other hand, suppose you can only look at 1 document an hour, so that T x R = 200. The probability you’ll find nothing is 0.99^200 = 0.134. Or to put it another way, you’ve got a 13% chance of carrying out a research project for several months and having nothing to show at the end of it.
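If you’d rather not reach for a calculator, the two calculations so far fit in a few lines of Python. This is a minimal sketch under the same assumption that documents are independent and equally likely to contain the event; the function names are my own labels and not part of any library.

```python
def expected_finds(hours: float, docs_per_hour: float, p: float) -> float:
    """E = T x R x p: the average number of documents containing the event."""
    return hours * docs_per_hour * p


def chance_of_no_finds(hours: float, docs_per_hour: float, p: float) -> float:
    """(1 - p) ^ (T x R): the chance that none of the documents contains the event."""
    n_docs = hours * docs_per_hour
    return (1 - p) ** n_docs


# The two scenarios from the text: 200 hours, a 1 in 100 event,
# at 2 documents an hour and then at 1 document an hour.
for rate in (2, 1):
    e = expected_finds(200, rate, 0.01)
    nothing = chance_of_no_finds(200, rate, 0.01)
    print(f"{rate} docs/hour: expect {e:.1f} finds, {nothing:.1%} chance of none")
```

Changing the three inputs makes it easy to see how quickly the risk of an empty-handed project grows as the event gets rarer or the reading gets slower.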

But suppose the event you’re looking for doesn’t actually turn up 1 in 100 times, but only 1 in 1000 (p = 0.001). Then looking at 400 documents, your chances of finding nothing are 0.999^400 = 0.67 (= 67%). Even if you look at 1500 documents, your chances of finding nothing are 0.999^1500 = 0.22 (= 22%). And unless your rate of analysis in documents/hour is very high, 1500 documents are probably more than you’d be easily able to analyse just for a journal article, let alone for a conference paper.
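The same formula can be turned round to ask how many documents you’d need to read to keep the risk of finding nothing below some level you can live with. The sketch below solves (1-p)^N <= risk for N using logarithms. The 5% threshold is my own assumption for illustration, as is the function name, and the independence assumption still applies.

```python
from math import ceil, log


def documents_needed(p: float, acceptable_risk: float = 0.05) -> int:
    """Smallest N for which (1 - p) ** N <= acceptable_risk.

    Requires 0 < p < 1 and 0 < acceptable_risk < 1.
    """
    return ceil(log(acceptable_risk) / log(1 - p))


for p in (0.1, 0.01, 0.001):
    n = documents_needed(p)
    print(f"p = {p}: read about {n} documents for a 95% chance of at least one find")
```

For a 1 in 100 event that comes out at about 300 documents, and for a 1 in 1000 event at about 3,000: each step down in frequency multiplies the reading required by roughly ten.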

For larger projects, with more hours available, it’s unlikely that you’ll find no examples at all of the event you’re looking for. The problem here is that you may not find enough examples to justify all your work. For a conference paper or even a journal article, you may well need only 2 or 3 examples of an event if you then discuss them in detail. For a PhD, however, you’re probably looking for more like 40-50 documents, so you can do at least some comparisons and very basic tabulation. And for a research project, you ideally need a few hundred documents as a minimum, in order to provide a corpus. A project which spends 3000 hours to discover 3 documents is not likely to please the funders. Even a find rate of 1 in 100 still means a relatively small output for a lot of effort. For example, the Medici in early medieval Italy, AD 800-1100 project had to look through 17,000 documents to create a database of 178 entries. As the principal investigator on the project, Luca Larpi has discussed how information from this database can then usefully be combined with other online material to aid further research, but this isn’t a straightforward process.
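The question of whether you’ll find enough examples, and not merely one, can be sketched in the same way. Under the independence assumption used earlier, the number of finds follows a binomial distribution, so the chance of reaching a target is one minus the chance of every smaller count. The function names and the PhD scenario in the example (2000 hours at 2 documents an hour, a 1 in 100 event, a target of 40 documents) are illustrative assumptions of mine.

```python
from math import exp, lgamma, log


def binomial_pmf(n_docs: int, k: int, p: float) -> float:
    """Chance of exactly k finds in n_docs independent documents (0 < p < 1).

    Works with logarithms so that large document counts don't overflow.
    """
    log_coeff = lgamma(n_docs + 1) - lgamma(k + 1) - lgamma(n_docs - k + 1)
    return exp(log_coeff + k * log(p) + (n_docs - k) * log(1 - p))


def chance_of_at_least(k_wanted: int, n_docs: int, p: float) -> float:
    """Chance of finding k_wanted or more documents containing the event."""
    return 1 - sum(binomial_pmf(n_docs, k, p) for k in range(k_wanted))


# Hypothetical PhD: 2000 hours x 2 documents/hour = 4000 documents, p = 0.01.
print(f"{chance_of_at_least(40, 4000, 0.01):.0%} chance of at least 40 finds")
```

Note that the expected number of finds in this scenario is exactly 40, so hitting the target is far from guaranteed even when the averages look comfortable.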

What all this means is that searching for rare phenomena (those occurring less often than about 1 in 100 times) is not a good research strategy, unless there’s a way that you can look through your source documents very quickly. Phenomena this rare are normally only going to be found by chance, when you’re looking for something else. But what if these rare phenomena are the ones you’re desperately interested in? Are there any possibilities if you want to research such topics?

There are some strategies that may cut down the odds enough to make active searching worthwhile. One is to use the fact that events aren’t equally scattered in the records. Sometimes this can be deduced logically: if you’re looking for saltpans in charters, it makes sense to concentrate on charters produced in areas where salt is naturally found. If you’re looking for black people in Elizabethan England, they’re most likely to be where immigration and trade are most commonplace, not in more remote areas. If you can cut down the sources to be checked while (hopefully) not excluding relevant events, you’ve improved your odds considerably. It’s also useful looking for clusters, because historical events themselves aren’t random, nor is the recording of them. If you know that an event happened in a particular place/time, it’s therefore worth checking “nearby” dates and places.

There’s also the possibility of “crowd-sourcing” for rare events. With many sources, there are already people analysing them for their own (different) research. It’s always worth asking them whether they’ve come across event X in their sources. And here you come to the advantage of rare events: they’re more memorable. If you’re a local historian looking at early modern parish registers in Sussex, you may well remember whether or not you came across any black people, even if you can’t remember the details. Using people’s memories in this way may get you specific examples or at least give you hints of where it’s best to look or where not.

Studying early medieval charters is less of a mass-participation sport, but there are often a number of other people who will know a particular collection very well. One of the things I reckon that early medieval charter research could really do with is a discussion board or mailing list where you could ask a question like: “I’ve got this phenomenon here, have other people seen it and where?” and tap into the collective wisdom of diplomatists who specialise in particular archives.

The final option is using online resources to look through sources more quickly. The classic example for charter specialists is where you have access to full-text sources and you have a phenomenon that’s normally described by only a few terms in Latin, Old English etc. There are the normal problems of full-text searching when you’ve got weird spellings, but they can potentially be dealt with through enough ingenuity. Full-text searching, however, isn’t much use when you’ve got a phenomenon that isn’t consistently described in the documents; rare events, just because they’re rare, are particularly likely to be described inconsistently. If you were a cleric or churchwarden in Elizabethan England who wanted to record that the man you’d just buried wasn’t white, what terms would you use? Black, Negro, Moor, Blackamoor, African, native of Guinea? I’ve probably missed about a dozen other possible terms and even an expert in the field isn’t likely to be able to guess all the descriptions that may have occurred. Full-text searching only works effectively when you have things that people in the past found it easy to describe, and that sometimes seems a frustratingly small subset of historical reality.

This is when detailed but general-purpose analysis of sources, like that by the Making of Charlemagne’s Europe project, comes into its own. At our peak rate, we were adding about 60-70 charters a month, which equates to roughly 4 hours a charter on average. (This average conceals a lot of variation: I had charters which I could input in an hour or less, but some could take several days to input, if they combined complex provisions with large numbers of personal and place names.)

This relatively slow rate of progress, however, was combined with a great variety of information extracted and input. And once such data has been recorded, it makes searching for a number of rare phenomena very rapid for users. Alongside all the common things we have in our database, we have 3 charters which mention saltpans, 2 that mention stepsons and 1 that mentions people explicitly described as Irish. We have 6 female witnesses, 3 people described as “waldator” (forester) and 3 places called “alps”. We also have Maginfrid, an unfree man belonging to Charlemagne who was donating property without his master’s consent.

One of the things that the Charlemagne database therefore offers is a way to locate all kinds of uncommon phenomena rapidly. In that sense, it’s a kind of mega-index to a large number of charters. If we were able to input more data in the future, it would increasingly provide an efficient way for researchers of the early Middle Ages to find, if not the black swans of Carolingian charters, at least the black squirrels.