Language formulas - for your convenience. — The Palace of Green Porcelain

Let's say you know about 30,000 words in your language, which, according to some studies, makes you slightly above average. Some of these are everyday words, such as "you", "door", or "go". Others are rarer, like "aghast" and "triptych". Some words are easy to access, usually those that we use often and that have a concrete ("dog") rather than an abstract ("democracy") meaning. There is also an effect of "age of acquisition": words that we learned earlier in life seem to be more firmly anchored cognitively. It is difficult to tell this effect apart from frequency of use, since children tend to learn everyday words first.

If your language system works the way most models of language suggest, it is based on "words and rules". All words, or at least their roots, are stored in a mental lexicon, and when you utter a sentence your system retrieves each needed word and applies combinatorial ("grammatical") principles to generate the utterance. This needs to happen within fractions of a second. Our system must work like this at least to some degree, since we can use our knowledge of words and of how to combine them to generate a virtually infinite number of sentences.

But how often do we need this procedure? In some situations, it is woefully inelegant. Let's say you are about to say "I don't know", and you probably are, since it is likely the most common sentence in English. In fact, it is more frequent than the vast majority of words you know. Why go through the effort of retrieving each word and then assembling the utterance from scratch, every time? Your system would be stupid not to store this particular combination as a single "word", to be retrieved and fired away. This phenomenon lies at the core of what we call "formulaic language". Formulaic language provides more than a processing advantage. Formulas are also a piece of identity: communities are strengthened by sharing the same formulas. And just like any other word, formulas appear to be greater than the sum of their parts and serve a function that goes beyond the combination of the single words they contain. "I don't know", for example, can signal a turn in a conversation or make a statement less forceful.

So formulas are combinations of words either stored as one form, or perhaps as a schema with one or several slots to be filled (e.g., "I don't know X"). The case for "I don't know" is easily made. We even have a special phonological form for it, which has made it into (informal) orthography: "I dunno". But this sentence is just the beginning. Once we allow for even one formula, we must ask how far the phenomenon goes. "Thank you", "good morning"? Likely formulas. But what about "Can you open X please?" (X being a noun phrase)? How many formulas can we store? When does whole-form storage become clunky, and when does the size of our inventory of formulas start to slow us down while we fumble for the next sentence to retrieve? How schematic do formulas get?
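To make the two storage options concrete, here is a minimal sketch in Python. It is a toy illustration, not a model of the mental lexicon: the inventory, the names FIXED_FORMULAS, SLOT_SCHEMAS and match_formula, and the use of a regular expression to stand in for a slot are all assumptions made for this example.

```python
import re

# Hypothetical toy inventory (an assumption for illustration only).
# Fixed formulas are stored whole; schemas have one open slot, X.
FIXED_FORMULAS = {"i don't know", "thank you", "good morning"}
SLOT_SCHEMAS = {
    "can you open X please": re.compile(r"^can you open (?P<X>.+) please$"),
}

def normalize(utterance):
    """Lowercase the utterance and drop trailing punctuation and spaces."""
    return utterance.lower().strip().rstrip("?.!").strip()

def match_formula(utterance):
    """Return (kind, formula, slot_filler) for an utterance.

    kind is "fixed" for a whole-form match, "schema" for a slot match,
    and "novel" when nothing in the inventory applies.
    """
    text = normalize(utterance)
    if text in FIXED_FORMULAS:
        return ("fixed", text, None)
    for name, pattern in SLOT_SCHEMAS.items():
        found = pattern.match(text)
        if found:
            return ("schema", name, found.group("X"))
    return ("novel", None, None)
```

In this sketch, "I don't know." is retrieved whole, "Can you open the window please?" matches the schema with "the window" in the slot, and "I don't agree" falls through as novel, meaning it would have to be built from words and rules.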

Embarrassingly, these questions have barely been investigated, though our lab has joined the fray. Work by Tomasello, Lieven and others suggests that children start out with a very formulaic system, and only with time learn which parts of their production are even single words. There is an argument that as adults, we don't delete these formulas, and that we learn new ones with time. Researchers are certainly aware of the few dozen formulas that keep appearing in most papers in their field ("Stimuli were presented..."). Formulaic language plays a large role in second language learning, too.

After brain damage or in dementia, what appear to be formulas remain resilient. A man with severe aphasia may be able to say "I don't know", but not a sentence that is similarly complex ("I don't agree", "I can't reach") or even a simpler form ("I know"). Strikingly, while an overuse of formulas indicates a lack of linguistic flexibility and an inability to adjust to new communicative situations, some people with aphasia start repurposing formulas, giving them meanings they did not have before or combining them in ways that are grammatically wrong but communicatively meaningful.

Our lab investigates formulas as a way to understand, and potentially also to identify, neurological disorders. If there is a normative band of formula use (some say that 40% of what we say is formulaic, but to me these numbers are just educated guesses), deviation from the norm may indicate that there is something unusual about the individual's brain. It could be an undetected lesion, early dementia or an imbalance in some neurotransmitter (such as dopamine), which may eventually lead to mental health issues.
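The idea of a normative band can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our lab's method: the function names are made up, the greedy longest-match counting is one of many possible ways to measure formula use, and the norm values in the usage note are placeholders (the 0.40 echoes the educated guess above, the spread of 0.05 is invented).

```python
def formulaic_proportion(words, formulas):
    """Share of words in a transcript that are covered by a known formula.

    Scans left to right and, at each position, takes the longest formula
    that matches (a greedy simplification assumed for this sketch).
    """
    # Split formulas into word lists, longest first; skip empty entries.
    by_length = sorted(
        (f.split() for f in formulas if f.split()), key=len, reverse=True
    )
    covered = 0
    i = 0
    while i < len(words):
        for formula in by_length:
            if words[i:i + len(formula)] == formula:
                covered += len(formula)
                i += len(formula)
                break
        else:
            # No formula starts here; move on to the next word.
            i += 1
    return covered / len(words) if words else 0.0

def z_score(value, norm_mean, norm_sd):
    """Distance of a value from the norm, in standard deviations."""
    return (value - norm_mean) / norm_sd
```

For a transcript such as "i don't know thank you for the coffee" and the formulas "i don't know" and "thank you", five of the eight words are covered. Passing that proportion to z_score together with a placeholder norm (say, a mean of 0.40 and a standard deviation of 0.05) would flag the speaker as far above the band, but any real threshold would have to come from normative data that, as noted above, we do not yet have.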

When I started thinking about formulas, I was surprised at how little research there was, especially in the literature on language disorders. You can cover a lot of ground within weeks and very quickly hit some obvious but unaddressed questions. I am proud of my work on the frequency properties of speech in Alzheimer's disease - it is truly novel. At the same time, it's crazy that no one has done it before. This is an exciting frontier.