17. October 2023
Statistics Zero: Is Data Plural?
If you talk to a linguist for any amount of time, they will probably inform you that linguists are not grammarians. Linguists do not enforce grammatical correctness, but rather study how language is used, and sometimes what that might imply about how we think.
At some point you might have written something like “the data suggests….” If so, you may also have run into a scientist of a certain age, or perhaps from a certain field, who has corrected you. Perhaps they put a nice little comment on your manuscript explaining that data is plural, so it should be “these data suggest.” Just now I read a passage in David Quammen’s Breathless: the Scientific Race to Defeat a Deadly Virus (recommended) where Quammen describes the first posting of the genetic sequence for the SARS-CoV-2 virus. Eddie Holmes and Andrew Rambaut attached a note to their post: “Please feel free to download, share, use and analyze this data.” Quammen follows this quote with the explanation “both men know that ‘data’ is plural but they were in a hurry.”
I would not argue with Quammen, nor with that definitely hypothetical manuscript editor. Data is plural.
But.
“These data” has always sounded wrong to me, but data is plural, at least if you speak Latin: datum¹ is the singular, and when you have more than one of those you have data. Therefore, we should say “these data” like we say “these cats,” right? It still sounds weird.
Some hair.
photo: Robert A. Brown
I present exhibit A: “hair.” A detective might find a hair or even two or three hairs in a suspicious location. But what about this hair that you carry around on your head? There is clearly more than one of them. Likewise, anyone who has ever been in the presence of a husky in the spring will attest that there is definitely more than one hair. The hair emerges in tufts, wafts through the air, and spontaneously regenerates itself whenever the vacuum cleaner is put away. There is so much hair that you might say it is uncountable.
The English language loves to set little traps, and in that spirit “hair” is actually two words: a countable noun, with “hair” as the singular and “hairs” as the plural, and an uncountable noun. Uncountable nouns do not have singular forms because one is a number that many of us can indeed count to. Instead, rather than having one hair or many hairs, you simply have some hair.
Hair has emergent properties that hairs do not: you can style it, donate it, make wigs of it, and lose it; it flows, becomes unruly and is a crowning glory. Perhaps a better illustration of this particular point is water. A litre of water is composed of a great number of water molecules, yet the behaviour and properties of water in any amount we are familiar with are so different from those of a meaningfully countable number of water molecules, much less a single one, that we almost never think or speak of it as a composite.
Similarly, data has emergent properties that a mere datum, or even a bunch of them, does not. Apparently the aphorism “the plural of anecdote is not data” started life as the opposite: “the plural of anecdote is data,” as uttered by the political scientist Raymond Wolfinger during a class discussion. Jokes about what this says about political “science” aside, the adage quickly morphed into its negative. It is quite easy to collect an impressive set of anecdotes about how the world is flat simply by asking around at a flat-Earther convention. Data, on the other hand, is composed of observations plus information about how those observations were obtained: whether they are opinions or measurements, where and how they were acquired, etc.
Another interesting way to think about it is that data is itself an emergent property of a collection of related observations and information about them.
In statistical terms, data is a sample of a population, and integral to that sample is knowledge of exactly what population it is drawn from. Our collection of opinions from the Flat Earth convention might be valuable data for a sociologist studying anti-science movements, but is not useful for an educator assessing the knowledge of primary school students, much less a geographer interested in improving the geoid. Further knowledge may also be critical. If we gathered our data by setting up a booth with a large banner asking convention goers to “tell us about the TRUE shape of Earth!” we might attract true believers at a greater rate than non-believers attending for a laugh. Similarly, if we take careful gravimetric measurements from a satellite in a flawed orbit we might obtain biased data.
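For the curious, the booth scenario can be played out in a few lines of Python. This is only a rough sketch under invented assumptions: the function name, the 60% share of true believers among attendees, and the response rates are all made up for illustration, not measured anywhere.

```python
import random


def estimate_believer_share(true_share, p_believer_responds, p_skeptic_responds,
                            attendees=100_000, seed=0):
    """Simulate a booth survey and return the share of believers among respondents.

    All parameters are hypothetical: `true_share` is the fraction of attendees
    who are true believers, and the two `p_*` values are the chances that a
    believer or a skeptic stops at the booth.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    believers = 0
    skeptics = 0
    for _ in range(attendees):
        is_believer = rng.random() < true_share
        p_responds = p_believer_responds if is_believer else p_skeptic_responds
        if rng.random() < p_responds:
            if is_believer:
                believers += 1
            else:
                skeptics += 1
    return believers / (believers + skeptics)


# A neutral booth: everyone is equally likely to stop by.
unbiased = estimate_believer_share(0.6, 0.2, 0.2)

# A booth with the "TRUE shape of Earth" banner: believers stop far more often.
biased = estimate_believer_share(0.6, 0.5, 0.1)

print(f"neutral booth: {unbiased:.2f}, banner booth: {biased:.2f}")
```

With these made-up rates the neutral booth lands near the real 60%, while the banner booth reports something close to 0.3 / 0.34, or roughly 88%. The observations are not wrong in either case; what differs is the knowledge of how they were obtained.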
So why does “these data” sound strange? Why did Holmes and Rambaut, eminent virologists who had no doubt written “these data” many, many times, revert to “this data” when in a hurry? I propose that the intuitive use of data as an uncountable noun reflects our modern recognition that data is not a mere collection of some number of atomic units. Rather, data necessarily comprises organized observations as well as information about how those observations were made; it is the properties that emerge from this agglomeration that are a critical foundation for both statistics and science.
¹ The word “datum” is not terribly common, and is even less commonly used to refer to a single element of data (because such a thing is meaningless?). Usually datum is used in the spirit of its Latin root “something given,” to refer to a reference. Nautical charts have a “chart datum,” which is the level of water (e.g. the water level at the lowest tide on record) from which other levels are measured. A carpenter, machinist or needleworker might use a particular feature as a datum from which other dimensions are measured (e.g. a factory edge on a piece of plywood). GPS uses a system called WGS 84, in which the datum is a particular oblate spheroid that approximates the planet’s actual shape, has its centre within 2 cm of the actual centre of the Earth, and has the line of zero longitude 102 metres east of the Greenwich meridian at the latitude of the Royal Observatory.