This is a progress report, shared for feedback and comments.
Previous work (Hinrichs, Szmrecsanyi, and Bohmann 2015; Hinrichs and Szmrecsanyi 2007) has shown that hierarchical modeling can be successfully employed in an otherwise traditional corpus-linguistic study design pursuing one dependent variable. The current project builds on this work by combining it with the multidimensional analysis framework introduced to corpus linguistics by Biber (1991), a highly multivariate register classification procedure which Bohmann (2019) applied to large corpora.
The progressive loss of WHOM has been observed by laypeople and linguists alike for decades, and it is clearly detectable in COHA as well (Brozovsky et al. 2018). There are several accounts that we need to consider in trying to explain the overall change in written English that produces the loss of WHOM. Ultimately, we will see whether these accounts can be reconciled, or whether they must be refined or rejected.
- One account predicts replacement of WHOM by who + preposition.
- Another predicts loss without replacement, together with concurrent shifts toward shorter sentences, fewer subordinators, fewer other wh-pronouns, and generally less relativization.
- A third is a general trend against wh-pronouns that offers replacements with alternatives, most prominently that.
Corpus of Historical American English (COHA; Davies 2012):

- covers the time from 1810 to 2007
- contains circa 116,000 text samples
- strives for even representation of the genres fiction, magazines, newspapers, and nonfiction
We pass our matrix of 200+ features for 116,000+ texts into the psych::fa() function, specifying parameters as in Bohmann (2019). The one thing we change, relative to the design of that study, is the number of factors, since we only have written English here (speech and Twitter data are not represented) and only AmE is included.
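As a rough sketch, the call could look like the following; the rotation and factoring method shown here are illustrative assumptions rather than the confirmed settings of Bohmann (2019), and feature_matrix stands in for our texts-by-features matrix:

```r
library(psych)

# feature_matrix: one row per text, one column per linguistic feature.
# nfactors = 5 reflects the decision motivated below; rotate and fm are
# placeholder choices, not confirmed settings.
fa_solution <- fa(feature_matrix, nfactors = 5, rotate = "promax", fm = "ml")

# per-text factor scores for downstream modeling
FactorScores <- as.data.frame(fa_solution$scores)
```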
A first attempt to determine the optimal number of factors could be Horn’s parallel analysis. However, this takes 10+ hours to run on data of this size and ends up suggesting an implausibly large N of 54.
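For completeness, a minimal sketch of that step, assuming the same feature_matrix as above:

```r
library(psych)

# Horn's parallel analysis; very slow on a matrix of this size
fa.parallel(feature_matrix, fa = "fa")
```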
Instead, we can run a PCA and inspect how much variance is explained by the first n components, looking for a characteristic “bend” or “elbow” where the curve flattens, suggesting that further components do not add much information:
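A minimal sketch, again assuming the feature matrix described above:

```r
# PCA on standardized features, then a scree plot of the leading components
pca <- prcomp(feature_matrix, scale. = TRUE)
screeplot(pca, npcs = 20, type = "lines", main = "Variance explained by component")
```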
The scree plot method suggests between 4 and 6 factors, and we choose 5. We write all the factor loadings to file for further inspection and qualitative interpretation of the factors. (This is done outside R.)
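The export step could look roughly like this (the file name is a placeholder):

```r
# write the loading matrix to a CSV for manual inspection
write.csv(unclass(fa_solution$loadings), "factor_loadings.csv")
```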
Based on our interpretation of the factors, we propose the following labels:
Dimension | Positive features | Negative features |
---|---|---|
Involved versus informational | meanWL (0.79), Suf_ation (0.60), Suf_ion (0.56), Pre_coMN (0.56), Pre_re (0.53), Pre_aCC (0.51), Suf_ment (0.50), Pre_pro (0.45), Suf_al (0.45), Pre_pre (0.42), Pre_de (0.42), prepositions (0.40), Suf_ial (0.37), Suf_ive (0.35), Suf_ity (0.34), Suf_dent (0.34), agentless_be_passive (0.32), Pre_uni (0.32), Suf_ary (0.30) | gh (-0.32), sec_pers_pn (-0.34), double_prep (-0.35), first_pers_pn (-0.35), contractions (-0.36), third_pers_pn (-0.40) |
Argument and persuasion | standardness (0.55), not (0.49), demonst_PN (0.45), pn_it (0.43), prepositions (0.40), agentless_be_passive (0.40), if_unless (0.37), standard_def_article (0.35), conjuncts (0.34), should (0.33), epistemic_certainty (0.32), amplifiers (0.31) | hyphenation (-0.44) |
Colloquial writing | contractions (0.70), sec_pers_pn (0.56), first_pers_pn (0.55), private_verbs (0.45), indef_pn (0.43), reporting_verbs (0.35), if_unless (0.35), want_to (0.33), maybe (0.32), pn_it (0.32) | agentless_be_passive (-0.35), prepositions (-0.45), standard_def_article (-0.45), meanWL (-0.459) |
Narrative | third_pers_pn (0.65), past_perfect (0.46) | Suf_ation (-0.30), Pre_coMN (-0.31), will_uncontracted (-0.31), meanWL (-0.33), prepositions (-0.35), Suf_ion (-0.36), Pre_pro (-0.38) |
Emphasis and evaluation | emphatics (0.31) | agentless_be_passive (-0.41) |
Note that Emphasis and evaluation is a rather flimsy dimension. We kept it because there are other features just below the 0.3 loading threshold that allow a meaningful interpretation here.
For all dimensions, we will look into how they relate to the four genres, how they change over time (per genre), and how they relate to the frequency of WHOM.
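The per-genre trajectories discussed in the following sections can be visualized along these lines; this is only a sketch, and it assumes that FactorScores also contains year and genre columns alongside the dimension scores (names mirror the predictors used in the model further below):

```r
library(ggplot2)

# smoothed development of one dimension score over time, per genre
ggplot(FactorScores, aes(x = year, y = Involved_vs_informational, colour = genre)) +
  geom_smooth() +
  labs(y = "Involved vs. informational (factor score)")
```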
The expected genre differentiation, where fiction is most involved and newspapers/nonfiction most informational.
This is interesting, and worth a dedicated analysis. Particularly notable are the rise-fall patterns in informational focus for nonfiction and newspaper writing. There is a story here about the competing pressures of textual economy and colloquial/interpersonal writing style throughout the 20th century.
It looks like texts at either extreme (very informational or very involved) shy away from WHOM compared to the middle ground. We would have expected a more linear pattern where more informational focus means more WHOM. Perhaps information density prevents this?
Genres are much closer together than for the Involved versus informational dimension. This indicates that the dimension scores capture variation beyond what the genre labels can account for.
That is a rapid descent! The overall downward trend is interesting, as is the apparently rapid shift in cultural expectations for magazine writing in the first half of the 20th century.
More argument ≙ more WHOM, it seems. But this also correlates very strongly with the year a text was produced in. Teasing apart a register effect from an individual-feature change may be a bit of a challenge.
These are generally low scores, with fiction slightly higher overall, and a skewed distribution.
The development is as expected. Colloquialization seems to be picking up around the turn of the century.
It appears: more colloquial ≙ less WHOM. Another relationship where collinearity with year may be problematic.
Expected pattern, with fiction leading the other genres. What’s perhaps surprising is that the difference is not even clearer.
Around the same time that magazine writing is losing its argumentative character (see above), its narrative scores increase. Nonfiction shows a U-shaped development throughout the 20th century that will be worth exploring. It is also interesting that fiction is not stable along this dimension.
For roughly the score interval [-1, 1] (which covers almost 80% of the data), there is a linear relationship such that more narrative ≙ more WHOM. This is not something we would necessarily have expected, but perhaps the fact that talk about people in the third person is prominent for the narrative dimension accounts for it.
We did not have any clear expectations, but seeing newspaper writing at the bottom does surprise us a little.
With this plot, newspaper writing makes a bit more sense. There is some diachronic dynamism here. Magazine writing once again shows an identity crisis throughout the first half of the 20th century.
As above, for the score interval most clearly supported by the data (about [-1, 1]), a roughly linear relation holds, such that more evaluation ≙ more WHOM.
First, let’s see how bad multicollinearity is when year is added to our dimension scores.
```r
library(languageR)  # provides collin.fnc()

# condition number (kappa) of the numeric predictors, including year
collin.fnc(FactorScores[, 3:9])$cnumber
## [1] 141.4007
```

A condition number of 141 is definitely problematic!
Without the inclusion of year, the issue disappears entirely (\(\kappa\) of ~2). We are not entirely sure about this, but perhaps the large amount of data we have somewhat mitigates the problem. We’ll try fitting the full model and checking the variance inflation factor for each predictor, with the default assumption that values below 5 are unproblematic. Also, since we saw that the relationship between WHOM and Involved versus Informational is parabolically shaped, let’s add a quadratic term for this dimension.
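A sketch of this step; the exact model call is an assumption (a simple linear model over per-text WHOM frequencies), with predictor names mirroring the output below and whom_data standing in as a placeholder for the combined per-text data frame:

```r
library(car)  # for generalized variance inflation factors

# whom_data: placeholder data frame with per-text WHOM frequency,
# year, genre, and the five dimension scores
whom_model <- lm(
  whom ~ year + genre +
    Involved_vs_informational + I(Involved_vs_informational^2) +
    Argument_and_persuasion + Colloquial_writing +
    Narrative + Emphasis_and_evaluation,
  data = whom_data
)

vif(whom_model)  # values below 5 treated as unproblematic
```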
```
                                   GVIF Df GVIF^(1/(2*Df))
year                           1.765951  1        1.328891
genre                          2.781590  3        1.185902
Involved_vs_informational      1.830961  1        1.353130
I(Involved_vs_informational^2) 1.232548  1        1.110202
Argument_and_persuasion        2.144350  1        1.464360
Colloquial_writing             1.746542  1        1.321568
Narrative                      1.607575  1        1.267902
Emphasis_and_evaluation        1.446427  1        1.202675
```
Okay, definitely nothing close to 5. We’ll treat this as confirmation that we are okay to include all predictors in the model. Also, looking further into this, we found the advice that variables should be standardized before calculating \(\kappa\). If this is done, \(\kappa\) is reduced to the entirely harmless value of 2.85.
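A sketch of that check, mirroring the call above (same assumed column selection):

```r
# condition number after standardizing the predictor columns
collin.fnc(as.data.frame(scale(FactorScores[, 3:9])))$cnumber
```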
Let’s look at the model summary to confirm all our predictors matter.
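A table in the format below can be produced with, for instance, sjPlot::tab_model(); this is an illustrative tooling choice, not a documented part of our pipeline:

```r
library(sjPlot)

# formatted regression table with estimates, confidence intervals, and p-values
tab_model(whom_model)
```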
Dependent variable: whom

Predictors | Estimates | CI | p |
---|---|---|---|
(Intercept) | 3.70 | 3.45 – 3.95 | <0.001 |
year | -0.00 | -0.00 – -0.00 | <0.001 |
genre [mag] | 0.04 | 0.03 – 0.06 | <0.001 |
genre [news] | 0.05 | 0.03 – 0.06 | <0.001 |
genre [nf] | 0.08 | 0.05 – 0.11 | <0.001 |
Involved_vs_informational | 0.01 | 0.01 – 0.02 | <0.001 |
Involved_vs_informational^2 | -0.02 | -0.02 – -0.02 | <0.001 |
Argument_and_persuasion | 0.02 | 0.02 – 0.03 | <0.001 |
Colloquial_writing | -0.02 | -0.02 – -0.01 | <0.001 |
Narrative | 0.08 | 0.07 – 0.08 | <0.001 |
Emphasis_and_evaluation | 0.03 | 0.02 – 0.03 | <0.001 |
Observations | 116613 | | |
R2 / R2 adjusted | 0.024 / 0.024 | | |
Indeed they all matter. The R2 is not great, but the data are tricky: more than 80% of all texts have exactly 0 cases of WHOM. With this skew, it will be very difficult to achieve high R2 values.
Finally, we plot the model coefficients and draw some initial inferences.
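Such a coefficient plot can be generated, for instance, with sjPlot::plot_model(); again an illustrative choice rather than a confirmed part of our pipeline:

```r
library(sjPlot)

# forest-style plot of the model coefficients with confidence intervals
plot_model(whom_model, show.values = TRUE)
```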
Most notable: year trumps all the other predictors, even when keeping register influence equal. This means that the change away from WHOM goes above and beyond anything that could be explained in terms of general register developments (as operationalized in the ramshackle way we try here).
The fact that fiction writing is the least conducive genre for WHOM, but that WHOM is favored in narrative texts when controlling for genre, will require further attention.
With more informational writing, it looks as though WHOM becomes slightly more favored, but more important is the quadratic term, which indicates that values further from 0 (i.e. both very involved and very informational texts) actually disfavor WHOM.
When controlling for register influences, the role of genre is reduced to fiction versus all other categories in COHA, whose confidence intervals overlap.