Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
- From a team at the University of Washington / Allen Institute for Artificial Intelligence (AI2).
- Courtesy of Yannic Kilcher's YouTube channel.
-
- General idea: use GPT-3 as a completion source given a set of prompts, like:
- X starts running
- X and Y engage in an argument
- There are only 7 relation types ("linkage atoms", i.e. the edges of the graph) in these queries, but of course many actions / direct objects.
- These prompts are generated from the human-authored ATOMIC 2020 dataset.
- The prompts are fed into the 175B-parameter DaVinci model, resulting in 165k examples across the 7 relations after cleaning.
- In turn, the 165k examples are fed into a smaller GPT-3 variant, Curie, which generates 6.5M text examples, a.k.a. ATOMIC10x (see the generation sketch at the end of these notes).
- The results are then filtered by a second, critic model: a fine-tuned RoBERTa trained with human supervision to judge whether a generated sentence is 'good' or not (see the filtering sketch at the end of these notes).
- By throwing away 62% of ATOMIC10x, they get 96.4% accuracy, much better than the human-authored knowledge graph.
- They suggest that one way this works is by removing degenerate outputs from GPT-3.
Human-designed knowledge graphs are described here:
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge
And employed for profit here: https://www.luminoso.com/
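
Below is a minimal sketch of the generation step, not the authors' released code: it assumes the legacy OpenAI Completion API (`openai.Completion.create`) with an `OPENAI_API_KEY` in the environment, and the relation names and prompt wording are illustrative stand-ins for the paper's actual few-shot templates.

```python
# Sketch of teacher generation: few-shot prompts built from ATOMIC-style events,
# completed by GPT-3. Relation templates and examples here are illustrative only.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Three of the 7 ATOMIC-style relations ("linkage atoms"); templates are assumptions.
RELATION_TEMPLATES = {
    "xEffect": "{event}. As a result, PersonX",
    "xWant":   "{event}. After this, PersonX wants",
    "xNeed":   "{event}. Before this, PersonX needed",
}

# Few-shot prefix seeded with human-authored ATOMIC 2020-style examples (illustrative).
FEW_SHOT_PREFIX = (
    "X starts running. As a result, PersonX gets tired\n"
    "X and Y engage in an argument. As a result, PersonX feels upset\n"
)

def generate_inferences(event: str, relation: str, n: int = 5):
    """Ask the teacher model for n commonsense completions of one event/relation."""
    prompt = FEW_SHOT_PREFIX + RELATION_TEMPLATES[relation].format(event=event)
    resp = openai.Completion.create(
        engine="davinci",   # 175B teacher; "curie" is the cheaper, smaller variant
        prompt=prompt,
        max_tokens=16,
        temperature=0.9,
        n=n,
        stop="\n",
    )
    return [choice["text"].strip() for choice in resp["choices"]]

print(generate_inferences("X starts running", "xEffect"))
```

Looping something like this over the 165k events and the 7 relations is what yields the 6.5M-example ATOMIC10x corpus described above.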
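
And a minimal sketch of the critic filtering step, again under assumptions: `path/to/roberta-critic` is a placeholder for a RoBERTa checkpoint fine-tuned on human accept/reject labels, and the way the (event, relation, inference) triple is serialized into text is a guess rather than the paper's format.

```python
# Sketch of the critic filter: a fine-tuned RoBERTa classifier scores each generated
# triple, and anything below a confidence threshold is thrown away.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CRITIC_PATH = "path/to/roberta-critic"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CRITIC_PATH)
critic = AutoModelForSequenceClassification.from_pretrained(CRITIC_PATH)
critic.eval()

def keep(event: str, relation: str, inference: str, threshold: float = 0.9) -> bool:
    """Return True if the critic judges the generated sentence 'good'."""
    text = f"{event} {relation} {inference}"   # serialization format is an assumption
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = critic(**inputs).logits
    p_good = torch.softmax(logits, dim=-1)[0, 1].item()  # class 1 = 'good'
    return p_good >= threshold

# Example: drop degenerate teacher outputs before training the student.
corpus = [("X starts running", "xEffect", "PersonX gets tired")]
filtered = [t for t in corpus if keep(*t)]
```

Raising the threshold trades corpus size for precision; per the notes above, discarding roughly 62% of ATOMIC10x is what pushes accuracy to 96.4%.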