Changing Registers and the Demise of a Pronoun

Lars Hinrichs; Axel Bohmann

Introduction

Previous work (Hinrichs, Szmrecsanyi, and Bohmann 2015; Hinrichs and Szmrecsanyi 2007) has shown that hierarchical modeling can be successfully employed in an otherwise traditional corpus-linguistic study design pursuing one dependent variable. The current project builds on this work by combining it with the multidimensional analysis framework introduced to corpus linguistics by Biber (1991), a highly multivariate register classification procedure which Bohmann (2019) applied to large corpora.

The Bell Tolls for WHOM: differing accounts

The progressive loss of WHOM has been observed for decades, both popularly and by linguists. It is clearly detectable in COHA as well (Brozovsky et al. 2018). There are several accounts that we need to consider in trying to explain the overall change in written English that produces the loss of WHOM. Ultimately, we will see whether these accounts can be reconciled, or must be refined or contradicted.

Loss of the dative case

Predicts replacement by who + preposition.

Loss of complexity

Predicts loss without replacement and concurrent shifts toward shorter sentences, fewer subordinators, fewer other wh-pronouns, generally less relativization.

Stylistic informalization & colloquialization

A general trend that goes against wh-pronouns, but offers replacements with alternatives, most prominently that.

The corpus

Corpus of Historical American English (COHA) (Davies 2012)

Covers the time from 1810 to 2007
Contains circa 116,000 text samples
Strives for even representation of the genres fiction, magazines, newspapers, nonfiction.

Analysis

Factor analysis

We pass our matrix of 200+ features for 116,000+ texts into the psych::fa() function, specifying parameters as in Bohmann (2019). The one thing we change, relative to the design of that study, is the number of factors, because our data cover a narrower range: we only have written English here, speech and Twitter are not represented, and only AmE is included.

A first attempt to get the optimal number of factors could be Horn’s parallel analysis. However, this takes 10+ hours to run and ends up suggesting an implausibly large number of factors (54).

Instead, we can do a PCA and inspect how much variance is explained by the first X dimensions, looking for a characteristic “bend” or “elbow” where the tail flattens, suggesting further components do not add much further information:

Figure 1: Scree plot: proportion of variance explained by each of the first 100 principal components in descending order.

The scree plot method suggests between 4 and 6 factors, and we choose 5. We write all the coefficients to file for further inspection and qualitative interpretation of the factors. (This is done outside R.)
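The scree-plot step can be sketched in base R as follows. This is only an illustration on simulated data: the matrix `features`, its size, and the three latent dimensions are assumptions made for the example, not our COHA feature matrix, and `prcomp()` stands in for whatever PCA routine is actually used.

```r
# Toy stand-in for the feature matrix: 500 "texts" x 20 "features",
# driven by 3 latent dimensions plus noise (simulated, not COHA data).
set.seed(42)
n_texts <- 500
latent <- matrix(rnorm(n_texts * 3), ncol = 3)
weights <- matrix(rnorm(3 * 20), nrow = 3)
noise <- matrix(rnorm(n_texts * 20, sd = 0.5), ncol = 20)
features <- latent %*% weights + noise

# PCA on centered and standardized features
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component (already in descending order)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Scree plot: look for the bend where the curve flattens out
plot(prop_var, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```

In a toy setup like this one, the bend would be expected near the number of simulated latent dimensions; with real data the elbow is usually less clear-cut, which is why we report a range.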

Figure 2: Structure coefficients for all measured features along the first and second factor. PA1: Involved vs. informational production. PA2: Argument and persuasion.

Based on our interpretation of the factors, we propose the following labels:

Dimension	Positive features	Negative features
Involved versus informational	meanWL (0.79), Suf_ation (0.60), Suf_ion (0.56), Pre_coMN (0.56), Pre_re (0.53), Pre_aCC (0.51), Suf_ment (0.50), Pre_pro (0.45), Suf_al (0.45), Pre_pre (0.42), Pre_de (0.42), prepositions (0.40), Suf_ial (0.37), Suf_ive (0.35), Suf_ity (0.34), Suf_dent (0.34), agentless_be_passive (0.32), Pre_uni (0.32), Suf_ary (0.30)	gh (-0.32), sec_pers_pn (-0.34), double_prep (-0.35), first_pers_pn (-0.35), contractions (-0.36), third_pers_pn (-0.40)
Argument and persuasion	standardness (0.55), not (0.49), demonst_PN (0.45), pn_it (0.43), prepositions (0.40), agentless_be_passive (0.40), if_unless (0.37), standard_def_article (0.35), conjuncts (0.34), should (0.33), epistemic_certainty (0.32), amplifiers (0.31)	hyphenation (-0.44)
Colloquial writing	contractions (0.70), sec_pers_pn (0.56), first_pers_pn (0.55), private_verbs (0.45), indef_pn (0.43), reporting_verbs (0.35), if_unless (0.35), want_to (0.33), maybe (0.32), pn_it (0.32)	agentless_be_passive (-0.35), prepositions (-0.45), standard_def_article (-0.45), meanWL (-0.459)
Narrative	third_pers_pn (0.65), past_perfect (0.46)	Suf_ation (-0.30), Pre_coMN (-0.31), will_uncontracted (-0.31), meanWL (-0.33), prepositions (-0.35), Suf_ion (-0.36), Pre_pro (-0.38)
Emphasis and evaluation	emphatics (0.31)	agentless_be_passive (-0.41)

Note that emphasis and evaluation is a very flimsy dimension. We kept it because several other features load just below the 0.3 threshold, which is enough to allow a meaningful interpretation here.
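The 0.3 cutoff used for the table can be illustrated with a small base-R sketch. The loading matrix, the feature names (`feat_a` to `feat_d`), and the factor names (`F1`, `F2`) are invented for the example and do not correspond to our actual coefficients.

```r
# Hypothetical loading matrix (values invented for illustration only)
loadings <- matrix(
  c( 0.62, -0.10,
    -0.35,  0.28,
     0.05,  0.31,
     0.29, -0.41),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("feat_a", "feat_b", "feat_c", "feat_d"), c("F1", "F2"))
)

threshold <- 0.3

# For each factor, keep features whose absolute loading reaches the threshold,
# sorted from most positive to most negative
salient <- lapply(colnames(loadings), function(f) {
  x <- loadings[, f]
  sort(x[abs(x) >= threshold], decreasing = TRUE)
})
names(salient) <- colnames(loadings)

salient
```

Here `feat_d` (0.29 on `F1`) and `feat_b` (0.28 on `F2`) are the kind of near-threshold features that do not make it into the table but can still inform the interpretation of a dimension.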

Visualizing dimension scores

For all dimensions, we will look into how they relate to the four genres, how they change over time (per genre), and how they relate to the frequency of WHOM.

Involved versus informational production

Figure 3: Distribution of dimension scores for each genre: Involved vs. informational production.

This is the expected genre differentiation: fiction is most involved and newspapers/nonfiction most informational.

Figure 4: Diachronic development of dimension scores for each genre in COHA: Involved vs. informational production.

This is interesting, and worth a dedicated analysis. Particularly notable are the rise-fall patterns in informational focus for nonfiction and newspaper writing. There is a story here about the competing pressures of textual economy and colloquial/interpersonal writing style throughout the 20th century.

Figure 5: Frequency of WHOM by dimension score: Involved vs. informational production.

It looks like texts at either extreme, whether very informational or very involved, shy away from WHOM compared to the middle ground. We would have expected a more linear pattern where more informational focus means more WHOM. Perhaps information density prevents this?

Argument and persuasion

Figure 6: Distribution of dimension scores for each genre: Argument and persuasion.

Genres are much closer together than for Involved vs. informational production. This indicates that the dimension scores capture variation beyond what the genre labels can account for.

Figure 7: Diachronic development of dimension scores for each genre in COHA: Argument and persuasion.

That is a rapid descent! The overall downward trend is interesting, as is the apparently rapid shift in cultural expectations for magazine writing in the first half of the 20th century.

Figure 8: Frequency of WHOM by dimension score: Argument and persuasion.

More argument ≙ more WHOM, it seems. But this also correlates very strongly with the year a text was produced in. Teasing apart a register effect from an individual-feature change may be a bit of a challenge.

Colloquial writing

Figure 9: Distribution of dimension scores for each genre: Colloquial writing.

These are generally low scores, with fiction slightly higher overall, and a skewed distribution.

Figure 10: Diachronic development of dimension scores for each genre in COHA: Colloquial writing.

The development is as expected. Colloquialization seems to be picking up around the turn of the century.

Figure 11: Frequency of WHOM by dimension score: Colloquial writing.

It appears: more colloquial ≙ less WHOM. Another relationship where collinearity with year may be problematic.

Narrative

Figure 12: Distribution of dimension scores for each genre: Narrative.

Expected pattern, with fiction leading the other genres. What’s perhaps surprising is that the difference is not even clearer.

Figure 13: Diachronic development of dimension scores for each genre in COHA: Narrative.

Around the same time that magazine writing is losing its argumentative character (see above), its narrative scores increase. Nonfiction shows a U-shaped development throughout the 20th century that will be worth exploring. It is also interesting that fiction is not stable along this dimension.

Figure 14: Frequency of WHOM by dimension score: Narrative.

For roughly the score interval [-1:1] (almost 80% of the data), there is a linear relationship such that more narrative ≙ more WHOM. We would not necessarily have expected this, but the prominence of third-person talk about people on the narrative dimension may account for it.

Emphasis and Evaluation

Figure 15: Distribution of dimension scores for each genre: Emphasis and evaluation.

We did not have any clear expectations, but seeing newspaper writing at the bottom does surprise us a little.

Figure 16: Diachronic development of dimension scores for each genre in COHA: Emphasis and evaluation.

With this plot, newspaper writing makes a bit more sense. There is some diachronic dynamism here. Magazine writing once again shows an identity crisis throughout the first half of the 20th century.

Figure 17: Frequency of WHOM by dimension score: Emphasis and evaluation.

As above, for the score interval most clearly supported by the data (about [-1:1]), a roughly linear relation holds, such that more evaluation ≙ more WHOM.

Regression model

First, let’s see how bad multicollinearity is when year is added to our dimension scores.

library(languageR)  # provides collin.fnc()
collin.fnc(FactorScores[,3:9])$cnumber

[1] 141.4007

141 is definitely problematic!

Without the inclusion of year, the issue disappears entirely (\(\kappa\) of ~2). We are not entirely sure about this, but perhaps the large amount of data we have somewhat mitigates the problem. We’ll try fitting the full model and checking the variance inflation factor for each predictor, with the default assumption that values below 5 are unproblematic. Also, since we saw that the relationship between WHOM and Involved versus Informational is parabolically shaped, let’s add a quadratic term for this dimension.

                                   GVIF Df GVIF^(1/(2*Df))
year                           1.765951  1        1.328891
genre                          2.781590  3        1.185902
Involved_vs_informational      1.830961  1        1.353130
I(Involved_vs_informational^2) 1.232548  1        1.110202
Argument_and_persuasion        2.144350  1        1.464360
Colloquial_writing             1.746542  1        1.321568
Narrative                      1.607575  1        1.267902
Emphasis_and_evaluation        1.446427  1        1.202675

Okay, definitely nothing close to 5. We’ll just treat this as confirmation that we are okay to include all predictors in the model. Also, looking further into this we found the advice that variables should be standardized before calculating \(\kappa\). If this is done, \(\kappa\) is reduced to the entirely harmless value of 2.85.
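To illustrate why standardization matters so much here, the following base-R sketch computes a condition number on simulated data, once with a raw year variable and once after standardizing. The data frame `toy`, its columns, and the helper `cond_number()` are assumptions made for this example. The helper is one common way to define the condition number and is not guaranteed to match the internals of collin.fnc().

```r
set.seed(1)
n <- 1000

# Simulated predictors: a raw year variable plus two uncorrelated scores
toy <- data.frame(
  year  = sample(1810:2007, n, replace = TRUE),
  dim_a = rnorm(n),
  dim_b = rnorm(n)
)

# Condition number: ratio of the largest to the smallest singular value of the
# design matrix (including an intercept) after scaling columns to unit length
cond_number <- function(df) {
  X <- cbind(intercept = 1, as.matrix(df))
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
  d <- svd(X, nu = 0, nv = 0)$d
  max(d) / min(d)
}

kappa_raw <- cond_number(toy)                        # year on its original scale
kappa_std <- cond_number(as.data.frame(scale(toy)))  # all predictors standardized

c(raw = kappa_raw, standardized = kappa_std)
```

In this sketch the raw year column is almost proportional to the intercept column, because its mean is large relative to its spread. That alone inflates the condition number even though the simulated predictors are uncorrelated, and centering removes the effect.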

Let’s look at the model summary to confirm all our predictors matter.

	whom
Predictors	Estimates	CI	p
(Intercept)	3.70	3.45 – 3.95	<0.001
year	-0.00	-0.00 – -0.00	<0.001
genre [mag]	0.04	0.03 – 0.06	<0.001
genre [news]	0.05	0.03 – 0.06	<0.001
genre [nf]	0.08	0.05 – 0.11	<0.001
Involved_vs_informational	0.01	0.01 – 0.02	<0.001
Involved_vs_informational^2	-0.02	-0.02 – -0.02	<0.001
Argument_and_persuasion	0.02	0.02 – 0.03	<0.001
Colloquial_writing	-0.02	-0.02 – -0.01	<0.001
Narrative	0.08	0.07 – 0.08	<0.001
Emphasis_and_evaluation	0.03	0.02 – 0.03	<0.001
Observations	116613
R² / R² adjusted	0.024 / 0.024

Indeed they all matter. The R² is not great, though. The data are tricky: more than 80% of all texts have exactly 0 cases of WHOM. With this skew, it will be very difficult to get great R² values.

Finally, we plot the model coefficients and draw some initial inferences.

Figure 18: Coefficient estimates for the minimal adequate model.

Most notable: year trumps all the other predictors, even when register influence is held constant. This means that the change away from WHOM goes above and beyond anything that could be explained in terms of general register developments (as operationalized in the ramshackle way we try here).

The fact that fiction writing is the least conducive genre for WHOM, but that WHOM is favored in narrative texts when controlling for genre, will require further attention.

With more informational writing, it looks as though WHOM is becoming slightly more favored, but more important is the quadratic term indicating that values further from 0 (i.e. either very involved or very informational texts) both actually disfavor WHOM.
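As a rough worked example, the combined effect of the linear and quadratic terms can be traced using the rounded estimates printed in the model summary (0.01 and -0.02). Because these values are rounded to two decimals, the resulting peak location should be read as approximate at best. The variable names below are ours.

```r
# Rounded estimates as printed in the model summary
b_linear    <- 0.01   # Involved_vs_informational
b_quadratic <- -0.02  # Involved_vs_informational^2

# Vertex of b_linear * x + b_quadratic * x^2: the score at which the
# combined contribution of the two terms is largest
peak <- -b_linear / (2 * b_quadratic)

# Combined contribution of the two terms at a few dimension scores
scores <- c(-2, -1, 0, peak, 1, 2)
partial <- b_linear * scores + b_quadratic * scores^2

data.frame(score = scores, partial_effect = round(partial, 4))
```

With these rounded inputs the peak falls at a score of 0.25, slightly on the informational side of zero, and the contribution drops off in both directions from there.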

When controlling for register influences, the role of genre is reduced to fiction versus all other categories in COHA, whose confidence intervals overlap.

Next steps

Change corpus mining to work with the full range of word/lemma/POS information available in COHA.
Re-cast a (potentially more robust) factor solution.
Explore WHOM in n-gram analysis.
Explore analysis of the corpus when split into texts containing WHOM (20%) and texts not containing WHOM (80%).

Biber, Douglas. 1991. Variation Across Speech and Writing. Cambridge: Cambridge University Press.

Bohmann, Axel. 2019. Variation in English Worldwide: Registers and Global Varieties. Cambridge: Cambridge University Press.

Brozovsky, Erica, Lars Hinrichs, James Law, and Jenny Wolfgang. 2018. “How WHOM Retreated Against the Advice of Prescriptive Grammarians: A Multivariate Analysis of Written English Since 1810.” New York, NY. https://bit.ly/whom-lsa-poster.

Davies, Mark. 2012. “Expanding Horizons in Historical Linguistics with the 400-Million Word Corpus of Historical American English.” Corpora 7 (2): 121–57.

Hinrichs, Lars, and Benedikt Szmrecsanyi. 2007. “Recent Changes in the Function and Frequency of Standard English Genitive Constructions: A Multivariate Analysis of Tagged Corpora.” English Language and Linguistics 11 (3): 437–74. http://www.journals.cambridge.org/abstract_S1360674307002341.

Hinrichs, Lars, Benedikt Szmrecsanyi, and Axel Bohmann. 2015. “Which-Hunting and the Standard English Relative Clause.” Language 91 (4): 806–36.