Progress report for feedback and comments.
Previous work (Hinrichs, Szmrecsanyi, and Bohmann 2015; Hinrichs and Szmrecsanyi 2007) has shown that hierarchical modeling can be successfully employed in an otherwise traditional corpus-linguistic study design pursuing one dependent variable. The current project builds on this work by combining it with the multidimensional analysis framework introduced to corpus linguistics by Biber (1991), a highly multivariate register classification procedure which Bohmann (2019) applied to large corpora.
The ongoing, progressive loss of WHOM has been noted for decades, both in popular commentary and by linguists. It is also clearly detectable in COHA (Brozovsky et al. 2018). Several accounts need to be considered in explaining the overall change in written English that produces the loss of WHOM. Ultimately, we will see whether these accounts can be reconciled or whether they must be refined or contradicted.
Predicts replacement by who + preposition.
Predicts loss without replacement and concurrent shifts toward shorter sentences, fewer subordinators, fewer other wh-pronouns, generally less relativization.
A general trend that goes against wh-pronouns, but offers replacements with alternatives, most prominently that.
Corpus of Historical American English (COHA) (Davies 2012)
Covers the time from 1810 to 2007
Contains circa 116,000 text samples
Strives for even representation of the genres fiction, magazines, newspapers, nonfiction
We pass our matrix of 200+ features for 116,000+ texts into the psych::fa() function, specifying parameters as in Bohmann (2019). The one thing we change relative to the design of that study is the number of factors, since our data are narrower: we only have written English (speech and Twitter are not represented), and only AmE is included.
A first attempt to find the optimal number of factors could be Horn’s parallel analysis. However, this takes 10+ hours to run on our data and ends up suggesting an implausibly large N of 54.
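For readers unfamiliar with the procedure, the logic of parallel analysis can be sketched in base R. This is a toy-sized illustration on simulated data, not the run reported above: the matrix `X`, the function name `parallel_n`, and the settings `n_iter` and `q` are all assumptions made for the example.

```r
# Minimal sketch of Horn's parallel analysis on PCA eigenvalues (base R only).
# `X` is a hypothetical stand-in for the texts-by-features matrix.
set.seed(1)
n <- 500
p <- 12
latent        <- matrix(rnorm(n * 2), n, 2)           # two simulated dimensions
loadings_true <- matrix(runif(2 * p, -1, 1), 2, p)    # arbitrary toy loadings
X <- latent %*% loadings_true + matrix(rnorm(n * p), n, p)  # signal plus noise

parallel_n <- function(X, n_iter = 50, q = 0.95) {
  # Eigenvalues of the observed correlation matrix
  obs <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  # Eigenvalues of same-sized matrices of uncorrelated random normals
  sim <- replicate(n_iter, {
    R <- matrix(rnorm(nrow(X) * ncol(X)), nrow(X), ncol(X))
    eigen(cor(R), symmetric = TRUE, only.values = TRUE)$values
  })
  # Reference value per component: the q-th quantile across simulations
  ref <- apply(sim, 1, quantile, probs = q)
  # Keep leading components until the first one that fails to beat the reference
  exceeds <- obs > ref
  n_keep <- if (all(exceeds)) length(obs) else which(!exceeds)[1] - 1
  list(observed = obs, reference = ref, n_keep = n_keep)
}

pa <- parallel_n(X)
pa$n_keep
```

The cost scales with the number of simulated matrices and with the size of each one, which is one plausible reason the procedure becomes slow on a matrix of our dimensions.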
Instead, we can do a PCA and inspect how much variance is explained by the first k components, looking for a characteristic “bend” or “elbow” where the tail flattens, suggesting that further components do not add much information.
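The elbow inspection can be sketched as follows. The data here are simulated, and `feats` is a hypothetical stand-in for our feature matrix; scaling the features before the PCA is likewise an assumption of the sketch.

```r
# Toy data: two latent dimensions with five indicators each, plus five noise features.
set.seed(42)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
feats <- cbind(
  sapply(1:5, function(i) f1 + rnorm(n, sd = 0.5)),
  sapply(1:5, function(i) f2 + rnorm(n, sd = 0.5)),
  matrix(rnorm(n * 5), n, 5)
)

# PCA on centered and scaled features
pca <- prcomp(feats, center = TRUE, scale. = TRUE)

# Proportion of variance explained per component, and cumulatively
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained)[1:6], 2)

# Scree plot: look for the point where the curve flattens
plot(var_explained, type = "b",
     xlab = "Component", ylab = "Proportion of variance explained")
```

In this toy case the curve should drop sharply after the second component, since only two latent dimensions were simulated.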
The scree plot method suggests between 4 and 6 factors, and we choose 5. We write all the coefficients to file for further inspection and qualitative interpretation of the factors. (This is done outside R.)
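Exporting the coefficients might look roughly like the sketch below. It uses `stats::factanal()` on simulated data purely as a self-contained stand-in for our `psych::fa()` fit; the promax rotation, the 0.3 cut-off applied here, and all object names (`toy`, `fit`, `long`, `salient`, `out`) are assumptions of the example.

```r
# Toy data with two latent dimensions and three indicators each
set.seed(7)
n  <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6), a2 = f1 + rnorm(n, sd = 0.6), a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6), b2 = f2 + rnorm(n, sd = 0.6), b3 = f2 + rnorm(n, sd = 0.6)
)

# Stand-in factor analysis (maximum likelihood, promax rotation)
fit <- factanal(toy, factors = 2, rotation = "promax")

# Reshape the loadings matrix to long format: one row per feature-factor pair
load_mat <- unclass(fit$loadings)
long <- data.frame(
  feature = rownames(load_mat)[as.vector(row(load_mat))],
  factor  = colnames(load_mat)[as.vector(col(load_mat))],
  loading = as.vector(load_mat)
)

# Write all coefficients to file for inspection outside R
out <- tempfile(fileext = ".csv")
write.csv(long, out, row.names = FALSE)

# Salient loadings only, sorted within factor from most positive to most negative
salient <- long[abs(long$loading) >= 0.3, ]
salient <- salient[order(salient$factor, -salient$loading), ]
salient
```

Keeping the full file alongside the thresholded view makes it possible to check features that fall just short of the cut-off.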
Based on our interpretation of the factors, we propose the following labels:
Dimension | Positive features | Negative features |
---|---|---|
Involved versus informational | meanWL (0.79), Suf_ation (0.60), Suf_ion (0.56), Pre_coMN (0.56), Pre_re (0.53), Pre_aCC (0.51), Suf_ment (0.50), Pre_pro (0.45), Suf_al (0.45), Pre_pre (0.42), Pre_de (0.42), prepositions (0.40), Suf_ial (0.37), Suf_ive (0.35), Suf_ity (0.34), Suf_dent (0.34), agentless_be_passive (0.32), Pre_uni (0.32), Suf_ary (0.30) | gh (-0.32), sec_pers_pn (-0.34), double_prep (-0.35), first_pers_pn (-0.35), contractions (-0.36), third_pers_pn (-0.40) |
Argument and persuasion | standardness (0.55), not (0.49), demonst_PN (0.45), pn_it (0.43), prepositions (0.40), agentless_be_passive (0.40), if_unless (0.37), standard_def_article (0.35), conjuncts (0.34), should (0.33), epistemic_certainty (0.32), amplifiers (0.31) | hyphenation (-0.44) |
Colloquial writing | contractions (0.70), sec_pers_pn (0.56), first_pers_pn (0.55), private_verbs (0.45), indef_pn (0.43), reporting_verbs (0.35), if_unless (0.35), want_to (0.33), maybe (0.32), pn_it (0.32) | agentless_be_passive (-0.35), prepositions (-0.45), standard_def_article (-0.45), meanWL (-0.46) |
Narrative | third_pers_pn (0.65), past_perfect (0.46) | Suf_ation (-0.30), Pre_coMN (-0.31), will_uncontracted (-0.31), meanWL (-0.33), prepositions (-0.35), Suf_ion (-0.36), Pre_pro (-0.38) |
Emphasis and evaluation | emphatics (0.31) | agentless_be_passive (-0.41) |
Note that emphasis and evaluation is a very flimsy dimension, with only two features reaching the 0.3 threshold. We kept it because several further features load just below that threshold, which is enough to allow a meaningful interpretation here.
For all dimensions, we will look into how they relate to the four genres, how they change over time (per genre), and how they relate to the frequency of WHOM.
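The three comparisons can be sketched on a table of per-text factor scores. Everything in this example is simulated: the data frame `scores`, the column names `dim1` and `whom_per_1000`, the genre labels, and the choice of decade bins and Spearman correlation are assumptions, not our actual setup.

```r
# Hypothetical per-text table: genre, year, one dimension score, WHOM frequency
set.seed(3)
scores <- data.frame(
  genre = sample(c("fiction", "magazine", "newspaper", "nonfiction"), 200, replace = TRUE),
  year  = sample(1810:2007, 200, replace = TRUE),
  dim1  = rnorm(200),
  whom_per_1000 = rexp(200)
)
scores$decade <- floor(scores$year / 10) * 10

# 1. Dimension score by genre
by_genre <- aggregate(dim1 ~ genre, data = scores, FUN = mean)

# 2. Dimension score over time, per genre
by_genre_decade <- aggregate(dim1 ~ genre + decade, data = scores, FUN = mean)

# 3. Association between the dimension score and WHOM frequency
whom_cor <- cor(scores$dim1, scores$whom_per_1000, method = "spearman")

by_genre
head(by_genre_decade)
whom_cor
```

Because the toy scores are random, no pattern is expected here; the sketch only shows the shape of the summaries.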
The expected genre differentiation, where fiction is most involved and newspapers/nonfiction most informational.
This is interesting, and worth a dedicated analysis. Particularly notable are the rise-fall patterns in informational focus for nonfiction and newspaper writing. There is a story here about the competing pressures of textual economy and colloquial/interpersonal writing style throughout the 20th century.