Learning Abstractions Over Multiword Chunks: A Computational Model

Abstract Summary
We report on a computational model of chunking in development that uses stored multiword sequences in comprehension- and production-related processing, and abstracts over stored units to form partially abstract, lexically based frames. The model captures ~70% of all multiword child utterances across a typologically diverse set of 29 languages.
Submission ID: AILA286
Submission Type: Abstract

Abstract:
Recent years have seen mounting evidence that both children and adults use stored multiword sequences in language comprehension and production. A recent computational model of early grammatical development, the Chunk-Based Learner (CBL), succeeds in using such multiword chunks to capture key aspects of language learning. Here, we report on an extension of CBL that allows it to learn abstractions over multiword chunks. When the model discovers that stored chunks overlap in all but one position, it creates a lexical frame: a chunk with an open slot in the position where the items vary. For example, if the model has come across “I like tea,” “I like coffee,” and “I like cake,” it can generalize this pattern to “I like milk” by creating the lexical frame “I like ___”.
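To make the frame-induction step concrete, here is a minimal Python sketch, not the authors' implementation: the function name induce_frames and the min_variants threshold are hypothetical. It groups stored chunks that match in all but one position and records the variant items as fillers of the open slot.

```python
from collections import defaultdict

def induce_frames(chunks, min_variants=2):
    """Sketch of frame induction: chunks that overlap in all but one
    position yield a lexical frame with a slot ("___") at the variant
    position. Thresholds and bookkeeping in the actual CBL extension
    may differ."""
    candidates = defaultdict(set)
    for chunk in chunks:
        for i, word in enumerate(chunk):
            # Blank out one position to form a candidate frame.
            frame = chunk[:i] + ("___",) + chunk[i + 1:]
            candidates[frame].add(word)
    # Keep frames attested with enough distinct slot-fillers.
    return {frame: fillers for frame, fillers in candidates.items()
            if len(fillers) >= min_variants}

chunks = [("I", "like", "tea"), ("I", "like", "coffee"), ("I", "like", "cake")]
print(induce_frames(chunks))
# -> {('I', 'like', '___'): {'tea', 'coffee', 'cake'}}
```

A frame such as “I like ___” then licenses unseen combinations like “I like milk”.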

The model's task is to recreate each utterance produced by the target child in a corpus, given only the words it contains. The words are presented as an unordered “bag of words,” and the model must sequence them correctly using only previously learned chunks and statistics. Performance is scored according to whether the model's utterance matches the child's word for word. In contrast to the original CBL model, our extended model can also draw on learned lexical frames during the bag-of-words task.
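The bag-of-words task can be illustrated with a deliberately simplified word-level sequencer; the real model operates over chunks (and, in our extension, lexical frames) with its own learned statistics, so the greedy strategy and bigram counts below are stand-ins for illustration only.

```python
import random

def sequence_bag(words, bigram_counts):
    """Greedy illustration of the bag-of-words task: shuffle the
    target utterance, then repeatedly emit the remaining word with
    the highest transition count from the previous word."""
    bag = list(words)
    random.shuffle(bag)  # the model only ever sees an unordered bag
    prev, produced = "<s>", []
    while bag:
        best = max(bag, key=lambda w: bigram_counts.get((prev, w), 0))
        produced.append(best)
        bag.remove(best)
        prev = best
    return produced

# Scoring: credit only for a perfect, word-for-word reconstruction.
target = ["I", "like", "tea"]
counts = {("<s>", "I"): 5, ("I", "like"): 4, ("like", "tea"): 3}
print(sequence_bag(target, counts) == target)  # -> True
```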

When exposed to the English section of the CHILDES database, the model acquires developmentally plausible frames and uses them to capture 70% of the utterances produced by the target children, compared to just 58% for the original CBL and 46% for a standard trigram model. Both our model and the trigram model differed significantly from CBL, in opposite directions. Across a typologically diverse range of 28 additional languages from CHILDES, the model captures over 68% of child utterances, compared to just 55% and 46% for the CBL and trigram models, respectively. Again, our model and the trigram model differed significantly from CBL.
