PMID-29074582 A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs
- Vicarious supplementary materials on their RCN (Recursive Cortical Network).
- Factors a scene into shape and appearance, which CNNs/DCNNs do not do -- they conflate the two (ish? what about the style-transfer networks?).
- They call this the coloring book approach -- extract shape then attach appearance.
- Hierarchy of feature layers (binary) and pooling layers (multinomial), indexed by f (feature), r (row), c (column), e.g. over image space.
- Each layer is exclusively conditional on the layer above it, and all features in a layer are conditionally independent given the layer above.
- Pool variables are multinomial; each value is associated with a feature, plus one OFF state.
- These features form a ‘pool’, which can/does have translation invariance.
- If any of the pool variables selecting a feature are set to ON, then that feature is set (an OR operation). Many pools can contain a given feature.
- One can think of members of a pool as different alternatives of similar features.
- Pools can be connected laterally, so each is dependent on the activity of its neighbors. This can be used to enforce edge continuity.
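A minimal sketch of the top-down generative semantics of features and pools described above. The names and toy hierarchy are my own; lateral constraints between pools and the OFF state are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(active_features, pools):
    """One top-down step: each active feature activates its child pools;
    each pool picks a single member (a multinomial choice among similar
    alternatives, e.g. translations), and a child feature turns ON if any
    pool selects it (OR semantics)."""
    child_on = set()
    for f in active_features:
        for pool in pools[f]:              # pool = list of alternative child features
            choice = rng.choice(len(pool)) # multinomial pool variable
            child_on.add(pool[choice])
    return child_on

# toy hierarchy: feature 0 owns two pools of translated variants
pools = {0: [[10, 11, 12], [20, 21]]}
out = sample_layer({0}, pools)
```

Each call picks one variant per pool, so repeated sampling generates deformed versions of the same parent feature.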
- Each bottom-level feature corresponds to an edge, which defines ‘in’ and ‘out’ sides that delineate shape.
- These edge variables are also interconnected, and form a conditional random field, a ‘Potts model’; the shape layer Y is generated by Gibbs sampling given the F-H hierarchy above it.
- Below Y, the per-pixel model X specifies texture with some conditional radial dependence.
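A toy Gibbs sweep on a two-state Potts grid, just to make concrete the smoothness/edge-continuity role the lateral CRF plays here. The grid setup and beta are illustrative, not the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_potts(state, beta, n_sweeps=5):
    """Gibbs sampling on a 2-state Potts grid: each site is resampled
    conditioned on its 4 neighbours; beta > 0 rewards agreement with
    neighbours, smoothing the in/out labelling."""
    H, W = state.shape
    for _ in range(n_sweeps):
        for r in range(H):
            for c in range(W):
                nb = []
                if r > 0:     nb.append(state[r - 1, c])
                if r < H - 1: nb.append(state[r + 1, c])
                if c > 0:     nb.append(state[r, c - 1])
                if c < W - 1: nb.append(state[r, c + 1])
                # log-probability of each candidate label ~ beta * (# agreeing neighbours)
                logits = np.array([beta * sum(n == k for n in nb) for k in (0, 1)])
                p = np.exp(logits - logits.max())
                p /= p.sum()
                state[r, c] = rng.choice(2, p=p)
    return state

state = gibbs_potts(rng.integers(0, 2, (8, 8)), beta=1.5)
```

In the model proper, the hierarchy above conditions this field rather than it running free as here.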
- The model amounts to a probabilistic model for which exact inference is intractable -- hence inference must be approximate: a bottom-up pass estimates the category (with lateral connections turned off), and a top-down pass estimates the object mask. Multiple passes can be done for multiple objects.
- Model has a hard time moving from RGB pixels to edge ‘in’ and ‘out’; they use an edge-detection pre-processing stage, e.g. Gabor filters.
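A hand-rolled Gabor kernel as an example of the kind of edge-detection front end mentioned above; the parameter values and the single-orientation demo are arbitrary (the paper would use a bank of orientations):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, sigma=2.0, wavelength=4.0, size=9):
    """Oriented Gabor kernel: a cosine carrier at angle theta under a
    Gaussian envelope. Filtering with a bank of these at several thetas
    converts raw pixels into oriented-edge responses."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# toy image: a vertical step edge; the theta=0 kernel responds strongly to it
img = np.zeros((16, 16))
img[:, 8:] = 1.0
resp = convolve2d(img, gabor_kernel(theta=0.0), mode='same')
```

The response is large near the edge column and near zero in the flat regions, which is exactly the binarized evidence the bottom feature layer wants.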
- Training follows a very intuitive, hierarchical feature-building heuristic: if some object or collection of lower-level features is not already represented, it is added to the feature-pool tree.
- This includes some winner-take-all heuristic for sparsification.
- Also greedily learn some sort of feature ‘dictionary’ from individual unlabeled images.
- Lateral connections are learned similarly, with a quasi-hebbian heuristic.
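A sketch of the greedy, WTA-flavored dictionary building the notes above describe -- the cosine-similarity match and the threshold are my assumptions, not the paper's actual algorithm:

```python
import numpy as np

def greedy_dictionary(patches, match_thresh=0.9):
    """For each patch, find the best-matching stored feature
    (winner-take-all over similarities); if even the winner matches
    poorly, add the patch as a new dictionary feature."""
    dictionary = []
    for p in patches:
        p = p / (np.linalg.norm(p) + 1e-9)          # unit-normalize
        sims = [float(d @ p) for d in dictionary]   # cosine similarities
        if not sims or max(sims) < match_thresh:
            dictionary.append(p)                     # novel structure -> new feature
    return dictionary

rng = np.random.default_rng(0)
base = rng.standard_normal((3, 16))                  # three underlying patterns
patches = [b + 0.01 * rng.standard_normal(16) for b in base for _ in range(5)]
D = greedy_dictionary(patches)
```

With near-duplicate patches the dictionary collapses to roughly one entry per underlying pattern, which is the sparsification effect the WTA heuristic is after.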
- Neuroscience inspiration: see refs 9, 98 for message-passing based Bayesian inference.
- Overall, a very heuristic, detail-centric, iteratively generated model and set of algorithms. You get the sense that this was really the work of Dileep George and only a few other people; that it was generated by successively patching and improving the model/algorithm to make up for observed failures and problems.
- As such, it offers little long-term vision for what is possible, or how perception and cognition occurs.
- Instead, proof is shown that, well, engineering works, and the space of possible solutions -- including relatively simple elements like dictionaries and WTA -- is large and fecund.
- Unclear how this will scale to even more complex real-world problems, where one would desire a solution that does not have to have each level carefully engineered.
- Modern DCNNs, at least, do not seem to have this property -- the structure is learned from the (alas, labeled) data.
- That said, their purpose-built system does achieve state-of-the-art performance on the designated CAPTCHA tasks.
- Check: B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction.” Science 350, 1332–1338 (2015). doi:10.1126/science.aab3050