A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge
- A whole lot of verbiage here for an old, important, but relatively straightforward result:
- Take ~30k encyclopedia articles.
- From them, build a vocabulary of ~60k words.
- Form a sparse matrix with one row per vocabulary word and one column per encyclopedia article.
- Perform large, sparse SVD on this matrix.
- Take the top 300 singular values and the associated left singular vectors (the columns of U, since words index the rows), and use these as an embedding space for the vocabulary.
- The 300-dim embedding can then be used to answer TOEFL synonym questions.
- Map the cue word and each multiple-choice candidate into the 300-dim space, and select the candidate with the highest cosine similarity to the cue.
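The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the six-sentence toy corpus, the helper names (`build_term_document_matrix`, `word_embeddings`, `pick_synonym`), the use of raw word counts as matrix entries, and `k=2` instead of 300 are all assumptions made for the example. It also uses a dense SVD; at the ~60k x ~30k scale described above, a sparse truncated solver would be needed instead.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return (counts, index): one row per word, one column per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc.lower().split():
            counts[index[w], j] += 1.0  # raw counts (an assumption of this sketch)
    return counts, index


def word_embeddings(counts, k):
    """Embed each word using the top-k left singular vectors, scaled by the singular values."""
    # Words index the rows, so word vectors come from U rather than V.
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def pick_synonym(cue, choices, emb, index):
    """Return the candidate whose embedding has the highest cosine similarity to the cue."""
    scores = [cosine(emb[index[cue]], emb[index[c]]) for c in choices]
    return choices[int(np.argmax(scores))]


# Toy corpus standing in for the encyclopedia articles (hypothetical data).
docs = [
    "car engine wheel road",
    "automobile engine wheel road",
    "car automobile driver road",
    "apple banana fruit juice",
    "banana fruit orange juice",
    "apple orange fruit tree juice",
]

counts, index = build_term_document_matrix(docs)
emb = word_embeddings(counts, k=2)
answer = pick_synonym("car", ["automobile", "banana", "tree"], emb, index)
print(answer)
```

On this toy corpus the two retained dimensions separate the vehicle words from the fruit words, so "car" lands next to "automobile" and far from the fruit candidates.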
The fact that SVD works at all and pulls out some structure is interesting, though it is not nearly as good as word2vec.