The Unreasonable Effectiveness of Data.
- counterpoint to Eugene Wigner's "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"
- that is, elegant math works for physics but is far less effective for sciences that involve people (e.g. language, economics).
- we should not look for elegant theories; rather, embrace complexity and make use of extensive data. (Google's mantra!!)
- in 2006 Google released a trillion-word corpus with frequency counts for all word sequences (n-grams) up to 5 words long.
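- document translation and speech transcription are successful mostly because people already do these tasks routinely to meet real demand, so large sets of input-output examples exist in the wild.
- traditional natural language processing tasks (e.g. part-of-speech tagging, parsing) do not have such demand as of yet, so no comparable corpus exists. Furthermore, they have required human-annotated data, which is expensive to produce.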
- simple models and a lot of data trump more elaborate models based on less data.
- for translation and other applications of ML to web data, simple n-gram models or linear classifiers built on millions of specific features tend to work better than elaborate models that try to discover general rules.
- much web data consists of individually rare but collectively frequent events.
- because of a huge shared cognitive and cultural context, linguistic expression can be highly ambiguous and still often be understood correctly.
- mentions Project Halo - hand-encoding a chemistry textbook cost $10,000 per page. (funded by Vulcan Inc.)
- ultimately suggests that there is so much to explore now - follow the data: use plentiful unlabeled data with an unsupervised learning algorithm rather than waiting for labeled data.
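
A rough sketch of the n-gram idea from the notes above: count every word sequence up to 5 words long, then estimate the next word by relative frequency. This is only an illustration under stated assumptions, not how the Google corpus was built. The function names (`ngram_counts`, `next_word_probability`) and the toy sentence are made up for this example, and there is no smoothing or backoff, which any real system would need.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every contiguous word sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def next_word_probability(counts, context, word):
    """Relative-frequency estimate of P(word | context) from raw counts."""
    context = tuple(context)
    context_count = counts[context]
    if context_count == 0:
        # Unseen context: a real system would back off to a shorter context.
        return 0.0
    return counts[context + (word,)] / context_count


# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[("the", "cat")])
print(next_word_probability(counts, ["the"], "cat"))
```

The model here is nothing more than a lookup table of counts. The point in the notes is that this kind of table gets useful only because the corpus is huge: rare sequences show up often enough to be counted.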