Inductive Representation Learning on Large Graphs
 William L. Hamilton, Rex Ying, Jure Leskovec
 Problem: given a graph where each node has a set of (possibly varied) attributes, create an 'embedding' vector at each node that describes both the node and the network that surrounds it.
 To this point (2017) there were two main ways of doing this: through matrix factorization methods, and through graph convolutional networks.
 The matrix factorization or spectral methods (similar to multidimensional scaling, where points are projected onto a plane to preserve a distance metric) are transductive: they work entirely within the given data, and don't directly generalize to new data.
 This is parsimonious in some sense, but doesn't work well in the real world, where datasets are constantly changing and frequently growing.
 Their approach is similar to graph convolutional networks, where (I think) the convolution is indexed by node distances.


 General idea: each node starts out with an embedding vector = its attribute or feature vector.
 Then, neighboring nodes are aggregated by sampling a fixed-size set of each node's neighbors (fixed for computational reasons).
 Aggregation can be mean aggregation, LSTM aggregation (on random permutations of the neighbor nodes), or MLP > nonlinearity > max-pooling. Pooling has the most wins, though all seem to work...
 The aggregated vector is concatenated with the current node feature vector, and this is fed through a learned weighting matrix and nonlinearity to output the feature vector for the current pass.
 Passes proceed from the outside in... I think.
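
 The steps above (sample a fixed number of neighbors, aggregate, concatenate with the node's own vector, apply a learned weight matrix and nonlinearity) can be sketched roughly as follows. This is a minimal illustration with mean aggregation on a made-up toy graph; the weight matrix is random here, not trained, and all shapes and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph as an adjacency list (hypothetical example data).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2], 4: [2]}
features = rng.normal(size=(5, 4))   # initial embedding = each node's feature vector
W = rng.normal(size=(8, 8))          # learned weight matrix (random stand-in here)

def sample_neighbors(node, k):
    """Sample a fixed number k of neighbors (with replacement if too few)."""
    nbrs = neighbors[node]
    return rng.choice(nbrs, size=k, replace=len(nbrs) < k)

def sage_layer(h, num_samples=2):
    """One aggregation pass: mean-aggregate sampled neighbors, concat with
    the node's own vector, then weight matrix + nonlinearity."""
    out = np.zeros((h.shape[0], W.shape[0]))
    for v in range(h.shape[0]):
        agg = h[sample_neighbors(v, num_samples)].mean(axis=0)
        concat = np.concatenate([h[v], agg])      # self vector ++ aggregate
        out[v] = np.maximum(0, W @ concat)        # ReLU nonlinearity
    # Normalize embeddings to unit length (epsilon guards all-zero rows).
    norm = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norm, 1e-12)

h1 = sage_layer(features)   # one pass; stacking K passes reaches K-hop neighborhoods
```

 Stacking a second `sage_layer` on `h1` (with an appropriately shaped weight matrix) would pull in information from two hops out, which is how the passes build up a description of the surrounding network.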
 Algorithm is inspired by the Weisfeiler-Lehman isomorphism test, which iteratively relabels each node from its neighbors' labels to estimate whether two graphs are isomorphic. They do a similar thing here, only with vectors rather than discrete labels, and similarly take into account the local graph structure.
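
 For reference, one round of Weisfeiler-Lehman refinement is just: replace each node's label with a hash of its own label plus the multiset of its neighbors' labels, and compare the resulting label multisets across graphs. A small sketch (toy graphs, not from the paper):

```python
# One round of Weisfeiler-Lehman label refinement.
def wl_refine(neighbors, labels):
    return {v: hash((labels[v], tuple(sorted(labels[u] for u in neighbors[v]))))
            for v in neighbors}

# Two isomorphic triangles keep identical label multisets at every round.
tri_a = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
tri_b = {'x': ['y', 'z'], 'y': ['x', 'z'], 'z': ['x', 'y']}
la = {v: 0 for v in tri_a}
lb = {v: 0 for v in tri_b}
for _ in range(3):
    la, lb = wl_refine(tri_a, la), wl_refine(tri_b, lb)
print(sorted(la.values()) == sorted(lb.values()))  # True: consistent with isomorphism
```

 GraphSAGE's aggregation step has the same shape, with the hash replaced by a trainable aggregate-and-transform over continuous vectors.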
 All the aggregator functions, and of course the nonlinearities and weighting matrices, are differentiable, so the structure is trained end-to-end with SGD (the paper does this both supervised and with an unsupervised, graph-based loss).
This is a well-put-together paper, with some proofs of convergence etc., but it still feels only lightly tested. As with many of these papers, it could benefit from a positive control, where the generating function is known and you can see how well the algorithm discovers it.
Otherwise, the structure / algorithm feels rather intuitive; surprising to me that it was not developed before the matrix factorization methods.
Worth comparing this to word2vec embeddings, where local words are used to predict the current word and the resulting vector at the bottleneck of the NN is the representation.