# Thoughts on statistical methods and distant reading

Hi everyone –

Last week’s discussion and readings on stylometrics had me thinking about what it means to use counting methods such as log-likelihood ratios to make comparative claims about texts.

The main point I’d like to make is that I don’t believe that looking at log-likelihood ratios (or other such purely frequency-based measures) alone is sufficient, because they tell us nothing about the distribution of a particular word or lemma throughout a text. I think the distribution of important words matters quite a bit as well – not only for plot-related reasons but also because we can then, for example, associate words with particular characters or particular moments. This might in turn better inform the kind of close reading that we can do, by drawing our attention to particular areas of the text.
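To make the distribution point concrete, here is a minimal sketch in Python. The helper `segment_counts` and the two toy token lists are my own illustrations, not anything from WordHoard or the readings. Both "texts" contain the word `blood` four times in sixteen tokens, so any purely frequency-based measure treats them identically, yet the word clusters at the start of one and is spread evenly through the other.

```python
import math

def segment_counts(tokens, word, n_segments):
    """Count occurrences of `word` in each of n_segments consecutive chunks."""
    size = math.ceil(len(tokens) / n_segments)
    return [tokens[i:i + size].count(word) for i in range(0, len(tokens), size)]

# Toy data (hypothetical): same length, same frequency of "blood".
clustered = ["blood"] * 4 + ["x"] * 12
spread = (["blood"] + ["x"] * 3) * 4

print(clustered.count("blood"), spread.count("blood"))  # identical frequencies
print(segment_counts(clustered, "blood", 4))  # all occurrences in one segment
print(segment_counts(spread, "blood", 4))     # one occurrence per segment
```

With a real play, the segments could be acts, scenes, or speeches by a given character, which is exactly the information a frequency table discards.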

The reason I think distribution is important is that a play is not a random sequence of words drawn from a bucket of Shakespeare’s corpus. Really, all that a log-likelihood table tells us is that a given play is not like Shakespeare’s corpus – but we knew that without doing any counting, precisely because the arrangement of words is what makes the play. For example, an alphabetically sorted list of all the words in a given play would produce exactly the same log-likelihood ratio results as the play itself, but would hardly give rise to the same kind of literary analysis. Yet essentially all that counting methods do by themselves is just that – sorting and comparing. I’m reminded here, perhaps, of Jakobson’s axes of combination and selection: in my mind, a sorted text is very like the original text on the axis of selection, since at some level all the same selections have been made; obviously this does not carry over to the axis of combination. For this reason tools that care about syntactic structures might come into play. Even then, the notion of stock phrases or syntactic structures raised in the Shore article troubles me, because those seem like distinct semantic units that somehow operate at the same level of abstraction as words, much as noun phrases might, for example.
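The sorted-text claim is easy to check with a small sketch. I am assuming here the common two-cell form of the log-likelihood statistic, G² = 2 · Σ O · ln(O / E), computed from a word's count in each text and the two text lengths; WordHoard's exact formula may differ in detail. The function names and the toy "play" and "corpus" are hypothetical.

```python
import math
from collections import Counter

def g2(a, b, c, d):
    """Log-likelihood (G^2) for a word occurring a times in an analysis text
    of c tokens and b times in a reference text of d tokens (two-cell form)."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference text
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total

def g2_table(analysis_tokens, reference_tokens):
    """G^2 score for every word that occurs in the analysis text."""
    a_counts = Counter(analysis_tokens)
    r_counts = Counter(reference_tokens)
    c, d = len(analysis_tokens), len(reference_tokens)
    return {w: g2(a_counts[w], r_counts[w], c, d) for w in a_counts}

# Toy data (hypothetical stand-ins for a play and a reference corpus).
play = "to be or not to be that is the question".split()
corpus = "the question is not what to be but how to love and be loved".split()

original = g2_table(play, corpus)
alphabetized = g2_table(sorted(play), corpus)
print(original == alphabetized)  # the scores depend only on counts, not order
```

Because the statistic is a function of counts and text lengths alone, any reordering of the play (sorted, reversed, shuffled) yields the same table.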

Thinking about these issues slightly more mathematically, I’m also concerned with the proper interpretation of any sort of statistics-based reasoning. Consider Hope and Witmore’s explanation of the significance stars: “Stars are used to indicate degrees of statistical significance: four indicate a result very unlikely to be due to chance, with the degree of confidence decreasing as the number of stars decreases.” First, and I think this perhaps invites some discussion on issues such as authorial intent, in what way can any of Shakespeare’s sentences be said to have arisen “due to chance”? I would argue that “not at all” is pretty close (but I have certain assumptions about how the creative faculty works that may differ from those of others – a discussion for another time).

Thus, a word on p-values (the numbers that get converted into significance stars, where lower p-values are considered “lower probability” and thus “higher significance”). Consider the points raised in this great summary from Wikipedia of where p-values can go wrong, especially points 1 and 4. The null hypothesis under WordHoard’s methodology is (very generally) that the relative frequency of a given word or lemma is the same in the analysis text and in the reference text. Is this an especially valuable null hypothesis to be testing, given that Shakespeare did not compose at random? I’m not sure.
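For anyone curious how a score becomes stars, here is a rough sketch. It assumes the log-likelihood value is referred to a chi-square distribution with one degree of freedom, for which the tail probability can be written with the complementary error function. The star cutoffs (0.05, 0.01, 0.001, 0.0001) are my assumption for illustration, not values confirmed from WordHoard or from Hope and Witmore.

```python
import math

def p_value(g2_score):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom: P(X > g2_score) = erfc(sqrt(g2_score / 2))."""
    return math.erfc(math.sqrt(g2_score / 2))

def stars(p, cutoffs=(0.05, 0.01, 0.001, 0.0001)):
    """One star per cutoff that p falls below (cutoffs are assumed values)."""
    return "*" * sum(p < cutoff for cutoff in cutoffs)

for score in (1.0, 3.9, 7.0, 11.0, 20.0):
    p = p_value(score)
    print(f"G2 = {score:5.1f}   p = {p:.6f}   {stars(p)}")
```

Note what the p-value is a probability of: seeing a frequency difference at least this large if the word were equally frequent in both texts and the tokens were drawn at random. It says nothing about the probability that Shakespeare chose the word "by chance."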

Yan