By systematically asking these questions about all the entities and events in a story, NLP researchers can score systems’ comprehension in a principled way, probing for the world models that systems actually need.
Much of today’s reading comprehension research entails carefully tweaking models to eke out a few more percentage points on the latest data sets. “State of the art” has practically become a proper noun: “We beat SOTA on SQuAD by 2.4 points!”
We’re proposing a more fundamental shift: to construct more meaningful evaluations, NLP researchers should start by thoroughly specifying what a system’s world model should contain to be useful for downstream applications.
This sort of modeling and reasoning is precisely what automated research assistants or game characters must do—and it’s conspicuously missing from today’s systems.
Research groups like the Allen Institute for AI have proposed other ways to harden the evaluations, such as targeting diverse linguistic structures, asking questions that rely on multiple reasoning steps, or even just aggregating many benchmarks.
But our argument is more basic: however systems are implemented, if they need to have faithful world models, then evaluations should systematically test whether they have faithful world models.
Jesse Dunietz is a researcher at Elemental Cognition, where he works on developing rigorous evaluations for reading comprehension systems.
But when people imagine computers that comprehend language, they envision far more sophisticated behaviors: legal tools that help people analyze their predicaments; research assistants that synthesize information from across the web; robots or game characters that carry out detailed instructions.
Other researchers, such as Yejin Choi’s group at the University of Washington, have focused on testing common sense, which pulls in aspects of a world model.
“Opinion: The field of natural language processing is chasing the wrong goal.” MIT Technology Review (@techreview), August 18, 2020. https://t.co/l1bqelvZRz