{1516} revision 2 modified: 07-16-2020 15:49 gmt

Inductive representation learning on large graphs -
**William L. Hamilton, Rex Ying, Jure Leskovec**
- Problem: given a graph where each node has a set of (possibly varied) attributes, create an 'embedding' vector at each node that describes both the node and the network that surrounds it.
- To this point (2017) there were two ways of doing this -- through matrix factorization methods, and through graph convolutional networks.
- The matrix factorization or spectral methods (similar to multi-dimensional scaling, where points are projected onto a plane to preserve a distance metric) are *transductive*: they work entirely within the given dataset, and don't directly generalize to new data.
- This is parsimonious in some sense, but doesn't work well in the real world, where datasets are constantly changing and frequently growing.
- Their approach is similar to graph convolutional networks, where (I think) the convolution is indexed by node distances.
- General idea: each node starts out with an embedding vector = its attribute or feature vector.
- Then, all neighboring nodes are *aggregated* by sampling a fixed number of the nearest neighbors (fixed for computational reasons).
- Aggregation can be mean aggregation, LSTM aggregation (on random permutations of the neighbor nodes), or MLP -> nonlinearity -> max-pooling. Pooling has the most wins, though all seem to work...
- The aggregated vector is concatenated with the current node feature vector, and this is fed through a learned weighting matrix and nonlinearity to output the feature vector for the current pass.
- Passes proceed from out-in... I think.
- Algorithm is inspired by the Weisfeiler-Lehman Isomorphism Test, which updates neighbor counts per node to estimate if graphs are isomorphic. They do a similar thing here, only with vectors not scalars, and similarly take into account the local graph structure.
- All the aggregator functions, and of course the nonlinearities and weighting matrices, are differentiable -- so the structure is trained in a supervised way with SGD.
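The aggregate-concatenate-project step described above can be sketched roughly as follows (a minimal mean-aggregator version; the function names, shapes, and toy graph are my own illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(adj, node, k):
    # Fixed-size neighbor sample; sample with replacement if degree < k.
    nbrs = adj[node]
    return rng.choice(nbrs, size=k, replace=len(nbrs) < k)

def graphsage_layer(adj, H, W, k=3):
    # One pass: mean-aggregate sampled neighbor vectors, concatenate with
    # the node's own vector, apply a learned weight matrix + nonlinearity.
    out = []
    for v in range(len(H)):
        nbr_mean = H[sample_neighbors(adj, v, k)].mean(axis=0)
        h = np.maximum(np.concatenate([H[v], nbr_mean]) @ W, 0.0)  # ReLU
        out.append(h / (np.linalg.norm(h) + 1e-8))  # normalize embeddings
    return np.stack(out)

# Toy graph: adjacency lists for 4 nodes, each with a 4-dim feature vector.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
H0 = rng.normal(size=(4, 4))
W1 = rng.normal(size=(8, 4))  # concatenation doubles the input dimension
H1 = graphsage_layer(adj, H0, W1)
```

Stacking K such layers gives each node a view of its K-hop neighborhood; in the real algorithm W would be learned by SGD rather than drawn at random.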
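The Weisfeiler-Lehman refinement that inspired the algorithm can be sketched as iterated label hashing (a generic sketch of the WL test, not anything from the paper):

```python
from collections import Counter

def wl_labels(adj, iters=3):
    # Start every node with the same label, then repeatedly re-hash each
    # node's label together with the sorted multiset of neighbor labels.
    labels = {v: 0 for v in adj}
    for _ in range(iters):
        labels = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in adj}
    return Counter(labels.values())  # histogram of refined labels

# Graphs with different label histograms cannot be isomorphic.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path3 = {0: [1], 1: [0, 2], 2: [1]}
```

GraphSAGE replaces the scalar label hash with a learned vector-valued aggregation, which is the analogy the authors draw.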
This is a well-put-together paper, with some proofs of convergence etc. -- but it still feels only lightly tested. As with many of these papers, it could benefit from a positive control, where the generating function is known and you can see how well the algorithm discovers it. Otherwise, the structure / algorithm feels rather intuitive; surprising to me that it was not developed before the matrix factorization methods. Worth comparing this to word2vec embeddings, where local words are used to predict the current word & the resulting vector in the neck-down of the NN is the representation.