Changing Registers and the Demise of a Pronoun

Progress report for feedback and comments.

Lars Hinrichs (https://www.larshinrichs.site), The University of Texas at Austin (https://liberalarts.utexas.edu/english)
Axel Bohmann (https://www.freinem.uni-freiburg.de/mitglieder/dr-axel-bohmann-alu-englisches-seminar), University of Freiburg (https://www.anglistik.uni-freiburg.de)
October 30, 2020

Introduction

Previous work (Hinrichs, Szmrecsanyi, and Bohmann 2015; Hinrichs and Szmrecsanyi 2007) has shown that hierarchical modeling can be successfully employed in an otherwise traditional corpus-linguistic study design with a single dependent variable. The current project builds on this work by combining it with the multidimensional analysis framework introduced to corpus linguistics by Biber (1991), a highly multivariate register classification procedure which Bohmann (2019) applied to large corpora.

The Bell Tolls for WHOM: differing accounts

The gradual loss of WHOM has been noted for decades, both in popular commentary and by linguists. It is also clearly detectable in COHA (Brozovsky et al. 2018). Several accounts need to be considered in explaining the overall change in written English that produces the loss of WHOM. Ultimately, we will see whether these accounts can be reconciled or must be refined or contradicted.

Loss of the dative case

Predicts replacement by who + preposition.

Loss of complexity

Predicts loss without replacement and concurrent shifts toward shorter sentences, fewer subordinators, fewer other wh-pronouns, generally less relativization.

Stylistic informalization & colloquialization

A general trend that disfavors wh-pronouns but offers alternatives as replacements, most prominently that.

The corpus

Corpus of Historical American English (COHA) (Davies 2012)

Analysis

Factor analysis

We pass our matrix of 200+ features for 116,000+ texts into the psych::fa() function, specifying parameters as in Bohmann (2019). The one thing we change, relative to the design of that study, is the number of factors, because our data cover a narrower range of varieties: only written English is included (speech and Twitter are not represented), and only American English (AmE).
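A minimal sketch of such a call on simulated data, not the project's actual code. The toy matrix and the promax rotation are assumptions made for illustration; the principal-axis method (fm = "pa") is inferred from the factor names PA1 and PA2 used in this report. The actual settings follow Bohmann (2019) and are not restated here.

```r
# Hypothetical toy data: 300 "texts" x 6 "features" driven by two latent factors.
set.seed(1)
n <- 300
latent <- matrix(rnorm(n * 2), ncol = 2)
toy <- latent[, c(1, 1, 1, 2, 2, 2)] + matrix(rnorm(n * 6, sd = 0.5), ncol = 6)
colnames(toy) <- paste0("feat", 1:6)

if (requireNamespace("psych", quietly = TRUE)) {
  # Assumed settings: principal-axis factoring with promax rotation.
  fit <- psych::fa(toy, nfactors = 2, fm = "pa", rotate = "promax")

  # Structure coefficients (features x factors) and per-text factor scores.
  structure_coefs <- unclass(fit$Structure)
  scores <- fit$scores

  print(round(structure_coefs, 2))
}
```

With the real feature matrix, only the data argument and nfactors would change; the structure coefficients are what the later interpretation of dimensions relies on.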

A first attempt to determine the optimal number of factors could be Horn’s parallel analysis. However, this takes 10+ hours to run on our data and ends up suggesting an implausibly large number of factors (54).
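For reference, the logic of parallel analysis can be sketched in base R as follows. This is the PCA variant on simulated data, under the assumption of a 95th-percentile retention rule; the function name parallel_n and its arguments are hypothetical, and in practice a packaged implementation such as psych::fa.parallel() would normally be used. Each iteration computes a correlation matrix from random data of the same shape as the real matrix, which helps explain the long run time on 116,000+ texts and 200+ features.

```r
# Hypothetical helper: number of components whose observed eigenvalue exceeds
# the chosen quantile of eigenvalues obtained from random normal data.
parallel_n <- function(x, n_iter = 50, quantile_p = 0.95) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

  # p x n_iter matrix: one column of eigenvalues per random data set.
  simulated <- replicate(n_iter, {
    random <- matrix(rnorm(n * p), nrow = n, ncol = p)
    eigen(cor(random), symmetric = TRUE, only.values = TRUE)$values
  })
  threshold <- apply(simulated, 1, quantile, probs = quantile_p)

  # Retain leading components up to the first one that fails the comparison.
  exceeds <- observed > threshold
  if (all(exceeds)) p else unname(which(!exceeds)[1] - 1)
}

# Toy data with two latent factors (simulated, not COHA).
set.seed(1)
n <- 300
latent <- matrix(rnorm(n * 2), ncol = 2)
toy <- latent[, c(1, 1, 1, 2, 2, 2)] + matrix(rnorm(n * 6, sd = 0.5), ncol = 6)

k <- parallel_n(toy)
print(k)
```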

Instead, we can run a PCA and inspect how much variance is explained by each of the leading components, looking for a characteristic “bend” or “elbow” where the tail flattens, which suggests that further components add little information:

Figure 1: Scree plot: proportion of variance explained by each of the first 100 principal components in descending order.
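The quantities behind such a scree plot can be computed with base R's prcomp(). The sketch below uses simulated data with two latent factors, and standardizing the features (scale. = TRUE) is an assumption rather than a documented setting of this project.

```r
# Toy data with two latent factors (simulated, not COHA).
set.seed(2)
n <- 300
latent <- matrix(rnorm(n * 2), ncol = 2)
toy <- latent[, c(1, 1, 1, 2, 2, 2)] + matrix(rnorm(n * 6, sd = 0.5), ncol = 6)
colnames(toy) <- paste0("feat", 1:6)

# PCA on standardized features.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, in descending order.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Draw the scree plot only in an interactive session.
if (interactive()) {
  plot(var_explained, type = "b",
       xlab = "Principal component",
       ylab = "Proportion of variance explained")
}
```

In this toy case the curve flattens after the second component, mirroring the two simulated factors; on real data the elbow is usually less clear-cut and is read off by eye.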

The scree plot method suggests between 4 and 6 factors, and we choose 5. We write all the coefficients to file for further inspection and qualitative interpretation of the factors. (This is done outside R.)

Figure 2: Structure coefficients for all measured features along the first and second factor. PA1: Involved vs. informational production. PA2: Argument and persuasion.

Based on our interpretation of the factors, we propose the following labels:

Dimension: Involved versus informational
Positive features: meanWL (0.79), Suf_ation (0.60), Suf_ion (0.56), Pre_coMN (0.56), Pre_re (0.53), Pre_aCC (0.51), Suf_ment (0.50), Pre_pro (0.45), Suf_al (0.45), Pre_pre (0.42), Pre_de (0.42), prepositions (0.40), Suf_ial (0.37), Suf_ive (0.35), Suf_ity (0.34), Suf_dent (0.34), agentless_be_passive (0.32), Pre_uni (0.32), Suf_ary (0.30)
Negative features: gh (-0.32), sec_pers_pn (-0.34), double_prep (-0.35), first_pers_pn (-0.35), contractions (-0.36), third_pers_pn (-0.40)

Dimension: Argument and persuasion
Positive features: standardness (0.55), not (0.49), demonst_PN (0.45), pn_it (0.43), prepositions (0.40), agentless_be_passive (0.40), if_unless (0.37), standard_def_article (0.35), conjuncts (0.34), should (0.33), epistemic_certainty (0.32), amplifiers (0.31)
Negative features: hyphenation (-0.44)

Dimension: Colloquial writing
Positive features: contractions (0.70), sec_pers_pn (0.56), first_pers_pn (0.55), private_verbs (0.45), indef_pn (0.43), reporting_verbs (0.35), if_unless (0.35), want_to (0.33), maybe (0.32), pn_it (0.32)
Negative features: agentless_be_passive (-0.35), prepositions (-0.45), standard_def_article (-0.45), meanWL (-0.459)

Dimension: Narrative
Positive features: third_pers_pn (0.65), past_perfect (0.46)
Negative features: Suf_ation (-0.30), Pre_coMN (-0.31), will_uncontracted (-0.31), meanWL (-0.33), prepositions (-0.35), Suf_ion (-0.36), Pre_pro (-0.38)

Dimension: Emphasis and evaluation
Positive features: emphatics (0.31)
Negative features: agentless_be_passive (-0.41)

Note that Emphasis and evaluation is a very flimsy dimension, with only two features reaching the 0.3 threshold. We kept it because enough other features sit just below that threshold to allow a meaningful interpretation.
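The 0.3 cutoff can be applied mechanically to a coefficient matrix, as in the sketch below. The helper salient_features and the column name emphasis are hypothetical, and the two sub-threshold rows are made-up values included only to show the filter at work; the first two rows repeat the values reported above for Emphasis and evaluation.

```r
# Hypothetical helper: split features into positive and negative poles,
# keeping only those whose absolute coefficient reaches the cutoff.
salient_features <- function(coefs, cutoff = 0.3) {
  lapply(as.data.frame(coefs), function(col) {
    names(col) <- rownames(coefs)
    keep <- col[abs(col) >= cutoff]
    list(
      positive = sort(keep[keep > 0], decreasing = TRUE),
      negative = sort(keep[keep < 0], decreasing = TRUE)
    )
  })
}

coefs <- matrix(
  c(0.31, -0.41, 0.28, -0.12),
  ncol = 1,
  dimnames = list(
    c("emphatics", "agentless_be_passive",
      "made_up_feature_a", "made_up_feature_b"),  # last two are invented
    "emphasis"
  )
)

result <- salient_features(coefs)
print(result$emphasis)
```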

Visualizing dimension scores

For all dimensions, we will look into how they relate to the four genres, how they change over time (per genre), and how they relate to the frequency of WHOM.
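The summaries behind these three views can be sketched in base R as follows. All values are simulated, and the column names (genre, year, dim1, whom_pmw) are hypothetical stand-ins for the per-text metadata, one dimension score, and a normalized WHOM frequency; the year range is likewise arbitrary.

```r
# Simulated per-text table (not COHA data).
set.seed(3)
n_texts <- 400
texts <- data.frame(
  genre    = sample(c("fiction", "magazine", "newspaper", "nonfiction"),
                    n_texts, replace = TRUE),
  year     = sample(1900:1999, n_texts, replace = TRUE),
  dim1     = rnorm(n_texts),   # hypothetical dimension score
  whom_pmw = rexp(n_texts)     # hypothetical WHOM frequency
)
texts$decade <- floor(texts$year / 10) * 10

# 1. Dimension score by genre.
by_genre <- aggregate(dim1 ~ genre, data = texts, FUN = mean)

# 2. Dimension score and WHOM frequency by genre and decade.
by_genre_decade <- aggregate(cbind(dim1, whom_pmw) ~ genre + decade,
                             data = texts, FUN = mean)

# 3. Association between dimension score and WHOM frequency.
#    The simulated columns are independent, so this value is meaningless here.
r_dim_whom <- cor(by_genre_decade$dim1, by_genre_decade$whom_pmw)

print(by_genre)
print(head(by_genre_decade))
```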

Involved versus informational production

Figure 3: Distribution of dimension scores for each genre: Involved vs. informational production.

This shows the expected genre differentiation: fiction is the most involved, while newspapers and nonfiction are the most informational.

Figure 4: Diachronic development of dimension scores for each genre in COHA: Involved vs. informational production.

This is interesting, and worth a dedicated analysis. Particularly notable are the rise-fall patterns in informational focus for nonfiction and newspaper writing. There is a story here about the competing pressures of textual economy and colloquial/interpersonal writing style throughout the 20th century.