Can a Machine Determine Plot?
At last year’s American Political Science Association annual convention, Nick Beauchamp (Assistant Professor at Northeastern University) gave a fascinating talk about new computational methods for visualizing narrative arcs in texts. His online tool Plot Mapper maps the progression of a text (i.e. its “plot”) in a two-dimensional space, labeling key words along the way. You can try it out for yourself here, but here’s an example from The Great Gatsby:
How do we evaluate a digital method such as Plot Mapper? What is this visualization telling us? When and where should we use a tool like this?
To answer these questions, we first have to understand how Plot Mapper works.
Automated text analysis methods such as Plot Mapper usually treat documents under the “bag of words” assumption in order to reduce the complexity of natural language text. Specifically, the text (e.g. a novel or speech) is split up into an arbitrary number of chunks, and each chunk is treated as its own document. The documents are then preprocessed to remove capitalization, punctuation, numbers, and “stop words” (such as “the”, “and”, “or”, etc.). Sometimes a “stemming” algorithm is used on the corpus, which reduces words to their “stem” or root. For example, words such as “vote”, “voting”, “voter”, and “votes” are collapsed into one token instead of being treated as unrelated.
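To make these steps concrete, here is a minimal Python sketch of the chunk-then-preprocess pipeline. It is not Plot Mapper's actual code: the function names, the tiny stop-word list, and the crude suffix-stripping "stemmer" are all assumptions made for illustration, and a real project would use a full stop-word list and an established stemming algorithm.

```python
import re

# A tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "are", "for"}

def split_into_chunks(text, words_per_chunk):
    """Split raw text into consecutive chunks of about words_per_chunk words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "er", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(chunk):
    """Lowercase, drop punctuation and numbers, remove stop words, then stem."""
    tokens = re.findall(r"[a-z]+", chunk.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

# "vote", "voting", "voter", and "votes" all collapse to the same token, "vot".
stems = [naive_stem(w) for w in ["vote", "voting", "voter", "votes"]]
```

Each chunk returned by `split_into_chunks` is then passed through `preprocess` and treated as its own small document.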
Once preprocessing is completed, each document is transformed into a vector containing the count of every unique word within the document, disregarding information such as the order in which the words appear. Hence the “bag of words” assumption: each document is treated as if it were simply a bag of particular words in particular quantities. As many scholars have noted, this model of language is necessarily wrong, but it can be useful in the statistical analysis of textual data because it reduces complexity and allows for more efficient modeling. The vectors are then combined into a special kind of table called a Document-Term Matrix (DTM), which is the primary input used for many automated text analysis applications.
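As a rough illustration, a Document-Term Matrix can be built in a few lines of Python. The function name `build_dtm` and the two toy documents are made up for this sketch; the point is only that each row is a document, each column is a word, and word order plays no role.

```python
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*documents))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Two toy "documents", already preprocessed into lists of tokens.
docs = [
    ["ruler", "hold", "territory", "ruler"],
    ["army", "fight", "ruler"],
]
vocabulary, dtm = build_dtm(docs)
# vocabulary is the sorted list of unique words; dtm has one row of counts per document.
```

Shuffling the words inside a document leaves its row unchanged, which is exactly what the “bag of words” assumption means.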
Plot Mapper uses a particular kind of automated text analysis technique called Principal Component Analysis (PCA). Without getting into the statistical weeds, we can think of principal components as the most important dimensions of variation in a text, according to the words contained in it. PCA reduces texts to their principal components, thus allowing the analyst to “embed” chunks in a two-dimensional space.
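For readers who do want a peek into the weeds, here is one way the first two principal components could be computed from a DTM. This is a bare-bones sketch using power iteration in plain Python, chosen so that it runs without extra libraries; it is an assumption about how one might do it, not a description of Plot Mapper's implementation, and in practice a numerical library would be far more efficient. All function names are hypothetical.

```python
import math
import random

def center_columns(matrix):
    """Subtract each column's mean so every term has mean zero across chunks."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]

def project(rows, direction):
    """Score of each row along a direction (a dot product)."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]

def leading_direction(rows, iterations=200, seed=0):
    """Power iteration: approximate the direction of greatest variance."""
    rng = random.Random(seed)
    direction = [rng.uniform(-1, 1) for _ in rows[0]]
    for _ in range(iterations):
        scores = project(rows, direction)
        # Multiply by the transpose: weight each row by its score and sum.
        updated = [sum(s * row[j] for s, row in zip(scores, rows))
                   for j in range(len(direction))]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0:  # no variance left to explain
            break
        direction = [u / norm for u in updated]
    return direction

def pca_2d(dtm):
    """Return one (pc1, pc2) coordinate pair per row of the document-term matrix."""
    rows = center_columns(dtm)
    first = leading_direction(rows)
    pc1 = project(rows, first)
    # Deflate: remove the variation already captured by the first component.
    residual = [[x - s * d for x, d in zip(row, first)]
                for row, s in zip(rows, pc1)]
    second = leading_direction(residual, seed=1)
    pc2 = project(residual, second)
    return list(zip(pc1, pc2))

# Toy DTM: chunks 0 and 1 share a vocabulary, as do chunks 2 and 3.
toy_dtm = [[3, 0, 1], [3, 0, 0], [0, 3, 1], [0, 3, 0]]
coords = pca_2d(toy_dtm)
```

The sign of each component is arbitrary, so only the relative positions of the chunks are meaningful: here the first two chunks land near each other and far from the last two.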
Generally speaking, PCA is used to visualize and compare chunks of text. It allows the researcher to plot documents on a graph and see which documents are closest to each other, which suggests some similarity between them. Here’s an example of a PCA embedding I did with chunks (in this case, small paragraphs) of Machiavelli’s The Prince.** The numbers represent the order of the chunks, allowing us to see which paragraphs are more or less similar to one another.
So we see, for instance, that paragraphs 10 and 31 are similar to each other according to these two dimensions. Here's paragraph 10:
But considerable problems arise if territories are annexed in a country that differs in language, customs and institutions, and great good luck and great ability are needed to hold them. One of the best and most effective solutions is for the conqueror to go and live there. This makes the possession more secure and more permanent. This is what the Turks did in Greece: all the other measures taken by them to hold that country would not have sufficed, if they had not instituted direct rule. For if one does do that, troubles can be detected when they are just beginning and effective measures can be taken quickly. But if one does not, the troubles are encountered when they have grown, and nothing can be done about them. Moreover, under direct rule, the country will not be exploited by your officials; the subjects will be content if they have direct access to the ruler. Consequently, they will
And paragraph 31:
authority he acquired within it. And because their old hereditary ruling families no longer existed, these provinces recognised only the authority of various Roman leaders. Bearing all these things in mind, then, nobody should be surprised how easy it was for Alexander to maintain his position in Asia, and how difficult it was for others to hold conquered territories, as Pyrrhus and many others discovered. This contrast does not depend upon how much ability the conquerors displayed but upon the different characteristics of the conquered states. V: How one should govern cities or principalities that, before being conquered, used to live under their own laws When states that are annexed have been accustomed to living under their own laws and in freedom, as has been said, there are three ways of holding them: the first, to destroy their political institutions; the second, to go to live there yourself; the third, to let them continue to
And both of them are far from paragraph 187, implying that 187 is different. Here's that paragraph:
to conquer Italy with a piece of chalk; and he who said that our sins were responsible spoke the truth. However, they were not the sins that he meant, but those that I have specified; and because they were the sins of rulers, they too have been punished for them. I want to show more effectively the defects of these troops. Mercenary generals are either very capable men, or they are not. If they are, you cannot trust them, because they will always be aspiring to achieve a great position for themselves, either by attacking you, their employer, or by attacking others contrary to your wishes. If they are mediocre, you will be ruined as a matter of course. And if it is objected that anyone who has forces at his disposal (whether mercenaries or not) will act in this way, I would reply by first drawing a distinction: arms are used either by a ruler or by a republic. If the former, the ruler should personally lead his armies, acting as the general. If the latter, the republic must send its own citizens as generals; and if someone is sent who turns out not to be very capable, he must be replaced; and if the general sent is capable, there should be legal controls that ensure that he does not exceed his authority. Experience has shown that only rulers and republics that possess their own armies are very successful, whereas mercenary armies never achieve anything, and cause only harm. And it is more difficult for a citizen to seize power in a republic that possesses its own troops than in one that relies upon foreign troops.
In what way are paragraphs 10 and 31 similar, and both different from paragraph 187? In other words, what do those dimensions mean? What do the x and y axes represent? This will depend greatly on the text. Sometimes the dimensions are semantically meaningful; sometimes they are not.
Here is the same PCA with words embedded according to the degree to which they represent each of the principal components:
Labeling these dimensions (i.e. the axes) requires human beings to read representative documents and words and infer a theme. After reading the documents on the outer edges of the graph, I would argue PC1 captures an abstract v. specific dimension. That is, paragraphs with low values on PC1 discuss general and abstract qualities of a good leader, without reference to any ruler in particular. Words associated with these values include “ruler”, “men”, and “qualities”. Those with high values on PC1 discuss specific historical examples, such as Kings Louis and Charles and their respective occupations of Italian territories. Words here include “France”, “Italy”, “King” (as in specific kings), etc.
PC2 captures the dimension of taking power v. maintaining power (or warcraft v. statecraft). Those paragraphs with high values discuss mainly how to rule a territory once it has been captured. They give advice to rulers on how to govern a new land, prevent rebellion, etc. The words associated with this end of the dimension include “hold”, “govern”, and “rule”. Those with low values, on the other hand, mainly discuss a ruler’s rise to power: how he (and it’s mostly “he” here) builds armies, wins battles, conquers territory, etc. The words associated with this end of the dimension include “armies”, “fight”, “generals”, and “troops”.
This suggests that this PCA representation of The Prince captures two primary dimensions: 1) advice on how to conquer land v. how to govern it, and 2) abstract advice (a good ruler is mean) v. specific historical examples (Prince Louis did this).
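The word lists used above to label each axis come from ranking words by how strongly they load on a component. A small sketch of that ranking step follows. The helper name `extreme_words` is hypothetical, and the loading values are invented purely for illustration; they are not the numbers from my analysis of The Prince.

```python
def extreme_words(vocabulary, loadings, k=3):
    """Return the k words with the lowest and the k with the highest loadings."""
    ranked = sorted(zip(loadings, vocabulary))
    lowest = [word for _, word in ranked[:k]]
    highest = [word for _, word in ranked[-k:][::-1]]
    return lowest, highest

# Hypothetical PC1 loadings, made up for this example only.
vocabulary = ["ruler", "men", "qualities", "france", "italy", "king"]
pc1_loadings = [-0.52, -0.31, -0.27, 0.44, 0.40, 0.35]

low, high = extreme_words(vocabulary, pc1_loadings)
# low holds the words at the negative end of the axis, high those at the positive end.
```

The ranking only surfaces candidate words. Deciding that one end means “abstract” and the other “specific” is still an act of human interpretation.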
Plot Mapper, essentially, does the same thing: it plots chunks of text in a two-dimensional space according to their principal components, and then draws a line between them in reading order. So we see that chunks 2 and 5 are similar according to this analysis.
What those axes represent is entirely dependent on the statistical correlation between words, which may or may not be semantically meaningful to humans.
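The path-drawing step described above is simple once the coordinates exist. The sketch below links chunks in reading order and finds the two chunks that sit closest together. The function names are hypothetical and the coordinates are invented for illustration; they are not taken from the Gatsby plot.

```python
import math
from itertools import combinations

def path_segments(coords):
    """Line segments linking each chunk to the next one in reading order."""
    return list(zip(coords, coords[1:]))

def closest_chunks(coords):
    """1-based indices of the two chunks nearest each other in the embedding."""
    pairs = combinations(range(len(coords)), 2)
    i, j = min(pairs, key=lambda p: math.dist(coords[p[0]], coords[p[1]]))
    return i + 1, j + 1

# Made-up (pc1, pc2) coordinates for five chunks, in reading order.
coords = [(0.0, 0.0), (2.0, 1.0), (4.0, 3.0), (5.0, 0.5), (2.2, 1.3)]

segments = path_segments(coords)   # four segments for five chunks
nearest = closest_chunks(coords)   # the pair of chunks with the smallest distance
```

With these invented coordinates, the path wanders away from chunk 2 and returns near it at chunk 5, which is the kind of pattern one reads off a Plot Mapper graph.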
Why get into the methodological weeds? Because it can help us understand what Plot Mapper (and PCA) can and cannot do as a digital humanities method. For instance, Plot Mapper may help us compare sections of a text. Specifically, PCA can invite us to think about the ways in which various chunks are similar or dissimilar according to dimensions we may not have inferred by reading alone. It may also help us visualize and compare broad narrative arcs by tracing the movement of texts through the two-dimensional space.
But PCA tells us very little about tone, perspective, character development, and many other aspects of a text that we might care about. The “bag of words” assumption prohibits an analysis of style or rhythm. The two-dimensional embedding is, by design, a way to obscure a text’s complexity, reducing it to its bare components. Furthermore, those components are inferred statistically in the case of Plot Mapper, thus preventing a manual specification of thematic dimensions, although there are other automated methods that are designed for this.
In sum, tools based on principal component analysis, like Plot Mapper, are best used for exploration, as an invitation to think about texts in new ways, and potentially as a comparative device. But if you want to know how Nick’s interiority developed over the course of The Great Gatsby, you’re better off reading the old-fashioned way.
** This analysis is based on a homework assignment I did for Justin Grimmer's class on Text as Data at Stanford University, Fall 2014.