[GNN] Graph Convolutional Networks (GCNs) - Part 2 : Towards Spatial Graph Convolution (ChebNet, GCN)

2023, Jul 18    

1. Spectral Convolution


  • A standard CNN captures local features and composes them into a series of hierarchical patterns with greater semantic context by using shift-invariant, compactly supported filters.

  • Spectral convolution generalizes this ability of CNNs to graphs by transforming the spatial structure of the graph into the spectral domain, where one can filter the desired frequency components of the graph signal.

    • Graph Fourier Transform and Spectral Convolution

  • However, spectral convolution has several limitations

    1. High computational cost

      • Requires an eigendecomposition of the graph Laplacian to obtain the graph Fourier basis, which is computationally expensive.

      • Filtering costs $\large O(n^{2})$ due to the two multiplications with the Fourier basis matrix (forward and inverse transform) applied to the input features.

    2. Lack of spatial locality

      • Lacks an explicit notion of spatial locality, which is a fundamental aspect of traditional CNNs.

      • Filters the entire graph simultaneously, making it challenging to capture local neighborhood information directly.

    3. Over-smoothing problem

      • Involves low-pass filtering, and as layers are stacked in deep graph networks, it can lead to over-smoothing of node representations.

      • Loss of discriminative power to distinguish between different nodes.



2. ChebNet : Re-Formulation of Spectral Filtering with K-Hops Localized Filters


  • ChebNet addresses the major bottlenecks of generalizing CNNs to graphs by defining localized graph filters and reducing computational complexity.

  • Further, it introduces graph-coarsening and pooling steps to simplify and downsize the graph structure while preserving the significant information about the graph.

    • This enables a multi-scale approach where a single architecture, ChebNet, can be applied to multiple graph resolutions.


2.1. K-Localized Spectral Filters


  • Graph Fourier transform with the normalized Laplacian


  • By using $K$-th order polynomials of the Laplacian as graph filters, ChebNet limits the receptive field of a filter to the nodes that are at most $K$ steps away from the central node.

  • Then, how can $K$-th order polynomials of the Laplacian capture exactly the nodes that are within $K$ hops of the target node? The key fact is that $(L^{K})_{ij} = 0$ whenever the shortest-path distance between nodes $i$ and $j$ exceeds $K$, so a $K$-th order polynomial of $L$ only mixes information within the $K$-hop neighborhood (a quick numerical check is sketched below).
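
The toy check below (my own example on a small path graph with NumPy, not code from the ChebNet paper) verifies this numerically: the $K$-th power of the Laplacian vanishes outside the $K$-hop band.

```python
import numpy as np

# Toy check: on a path graph 0-1-2-3-4, the K-th power of the Laplacian is
# zero outside the K-hop band, so a K-th order polynomial of L only mixes
# nodes that are at most K hops apart.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0        # edges of the path graph
L = np.diag(A.sum(axis=1)) - A             # combinatorial Laplacian

for K in range(1, 4):
    LK = np.linalg.matrix_power(L, K)
    print(f"K={K}, nonzero columns in row 0:", np.flatnonzero(np.abs(LK[0]) > 1e-12))
# Row 0 has nonzeros only in columns 0..K, i.e. within K hops of node 0.
```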


2.2. Fast and Stable Filtering : Recursive Formulation of Chebyshev Polynomials


  • A monomial (plain higher-order polynomial) basis is not orthogonal and becomes increasingly susceptible to large oscillations as the polynomial degree grows.

  • To overcome this, one needs to replace the monomials with orthogonal polynomials computed via a stable recurrence relation.


2.2.1. Chebyshev Polynomials


  • Chebyshev polynomials of the first kind ($\large T_{n}$)

    • $\large T_{n+1}(x) \, = \, 2x\,T_{n}(x) \, - \, T_{n-1}(x)$

    • $\large T_{n}(\cos\theta) \, = \, \cos(n\theta)$

  • Chebyshev polynomials of the second kind ($\large U_{n}$)

    • $\large U_{n+1}(x) \, = \, 2x\,U_{n}(x) \, - \, U_{n-1}(x)$

    • $\large U_{n}(\cos\theta) \, = \, \frac{\sin((n+1)\theta)}{\sin\theta}$


  • The Chebyshev expansion has traditionally been used in graph signal processing to approximate filter kernels due to its stability.

  • Key properties of Chebyshev polynomials :

    • Orthogonality : $\large T_{n}(x)$ and $\large T_{m}(x)$ ($n \neq m$) are orthogonal to each other with respect to the weight $\large \frac{dx}{\sqrt{1\,-\,x^{2}}}$ on $\large [-1, 1]$

    • Minimax property : On $\large [-1, 1]$, $\large |T_{n}(x)| \leq 1$, so the polynomial values stay bounded instead of oscillating wildly as the degree grows

    • Recurrence relation : Allows efficient computation of higher-order polynomials from the lower-order ones (a short numerical check is sketched below)
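
As a concrete illustration of the recurrence and of the closed form $T_{n}(\cos\theta) = \cos(n\theta)$, here is a minimal NumPy sketch; the function name `chebyshev_T` is my own, not from any library.

```python
import numpy as np

# Evaluate T_n on [-1, 1] with the three-term recurrence
# T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x), and compare with the closed form
# T_n(cos(theta)) = cos(n * theta).
def chebyshev_T(n, x):
    x = np.asarray(x, dtype=float)
    T_prev, T_curr = np.ones_like(x), x        # T_0 = 1, T_1 = x
    if n == 0:
        return T_prev
    for _ in range(n - 1):
        T_prev, T_curr = T_curr, 2.0 * x * T_curr - T_prev
    return T_curr

theta = np.linspace(0, np.pi, 7)
x = np.cos(theta)
n = 5
print(np.allclose(chebyshev_T(n, x), np.cos(n * theta)))   # True
```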


2.2.2. Parametrization of Filters with Chebyshev Expansion



  • These polynomials satisfy the recurrence relation $\large T_{k}(x) \, = \, 2x\,T_{k-1}(x) \, - \, T_{k-2}(x)$, with $\large T_{0} = 1$ and $\large T_{1} = x$

  • Note that these polynomials form an orthogonal basis w.r.t $\large \frac{dx}{\sqrt{1\,-\,x^{2}}}$

  • Here, $\large \tilde{\Lambda} \, = \, 2\Lambda / \lambda_{max} \, - \, I_{n}$, a diagonal matrix scaled to lie within $\large [-1, 1]$

  • Re-formulate the spectral filter as a truncated Chebyshev expansion of order $K$

    • $\large g_{\theta}(\Lambda) \, \approx \, \sum_{k=0}^{K-1}\, \theta_{k} T_{k}(\tilde{\Lambda})$, where $\large \theta \in \mathbb{R}^{K}$ is the vector of Chebyshev coefficients

    • The convolution then becomes $\large g_{\theta} * x \, \approx \, \sum_{k=0}^{K-1}\, \theta_{k} T_{k}(\tilde{L})\,x$ with $\large \tilde{L} \, = \, 2L / \lambda_{max} \, - \, I_{n}$, so the filtering depends only on the (scaled) Laplacian itself

  • There is no need to perform an eigendecomposition of the graph Laplacian, nor the $\large O(n^{2})$ multiplications with the eigenvector matrix.

  • The entire filtering operation costs $\large O(K|\xi|)$, since it only requires $K$ multiplications with the highly sparse matrix $\tilde{L}$ whose number of nonzero entries is proportional to the number of edges $\large |\xi|$, which is far more efficient than $\large O(n^{2})$ (a minimal sketch is given below).
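
Below is a minimal sketch of this $O(K|\xi|)$ filtering using only sparse matrix-vector products and the Chebyshev recurrence. The function `chebnet_filter`, the random toy graph, and the use of `scipy.sparse.linalg.eigsh` to obtain $\lambda_{max}$ are my own illustrative choices, not the original ChebNet implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Sketch of y = sum_k theta_k T_k(L_tilde) x computed with the Chebyshev
# recurrence and sparse mat-vec products only (no eigendecomposition).
def chebnet_filter(L, x, theta, lmax):
    """L: sparse Laplacian (n x n), x: signal (n,), theta: (K>=2,) coefficients."""
    n = L.shape[0]
    L_tilde = (2.0 / lmax) * L - sp.identity(n, format="csr")  # rescale spectrum to [-1, 1]
    Tx_prev, Tx_curr = x, L_tilde @ x          # T_0(L~) x = x, T_1(L~) x = L~ x
    y = theta[0] * Tx_prev + theta[1] * Tx_curr
    for k in range(2, len(theta)):
        Tx_prev, Tx_curr = Tx_curr, 2.0 * (L_tilde @ Tx_curr) - Tx_prev
        y = y + theta[k] * Tx_curr
    return y

# Toy usage on a random sparse undirected graph
rng = np.random.default_rng(0)
n = 200
A = sp.random(n, n, density=0.02, random_state=rng, format="csr")
A = ((A + A.T) > 0).astype(float)              # symmetric 0/1 adjacency
L = (sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A).tocsr()
lmax = float(eigsh(L, k=1, which="LM", return_eigenvectors=False)[0])
y = chebnet_filter(L, x=rng.normal(size=n), theta=rng.normal(size=4), lmax=lmax)
```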


2.3. Learning Filters


  • Forward Pass

    • $j^{th}$ output feature map of the sample $s$ of batch size $S$

      $\large y_{s, j} \, = \, \sum_{i=1}^{F_{in}} \, g_{\theta_{i, j}}(L)\,x_{s, i}$  where  $\large y_{s, j}, x_{s, i} \in \mathbb{R}^{n}$

    • $\large \theta$ : $\large F_{in} \times F_{out}$ vectors with each $\large \theta_{i, j} \, \in \, \mathbb{R}^{K}$

    • $\large y_{s} \,\in \, \mathbb{R}^{n \times F_{out}}$

  • Backward Pass

    • During backpropagation, two gradients of the mini-batch loss $\large E$ are needed: one with respect to $\large \theta$ and one with respect to $\large x$. A sketch of the forward pass above is given below.
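
A hypothetical sketch of the forward pass with multiple input and output feature maps: the parameter tensor of shape $(F_{in}, F_{out}, K)$ follows the description above, while the function name and toy data are my own assumptions.

```python
import numpy as np
import scipy.sparse as sp

# Sketch of y_{s,j} = sum_i g_{theta_{i,j}}(L) x_{s,i}: stack T_k(L_tilde) X
# for k = 0..K-1, then contract with a parameter tensor of shape (F_in, F_out, K).
def chebnet_layer_forward(L_tilde, X, theta):
    """L_tilde: rescaled Laplacian (n, n), X: (n, F_in), theta: (F_in, F_out, K)."""
    K = theta.shape[-1]
    Tx = [X]                                     # T_0(L~) X = X
    if K > 1:
        Tx.append(L_tilde @ X)                   # T_1(L~) X
    for _ in range(2, K):
        Tx.append(2.0 * (L_tilde @ Tx[-1]) - Tx[-2])
    Tx = np.stack(Tx, axis=-1)                   # (n, F_in, K)
    return np.einsum("nik,iok->no", Tx, theta)   # (n, F_out)

# Toy shapes only; L_tilde here is just a random symmetric sparse matrix.
rng = np.random.default_rng(0)
n, F_in, F_out, K = 50, 8, 16, 3
M = sp.random(n, n, density=0.1, random_state=rng, format="csr")
L_tilde = ((M + M.T) / 2).tocsr()
Y = chebnet_layer_forward(L_tilde, rng.normal(size=(n, F_in)),
                          0.1 * rng.normal(size=(F_in, F_out, K)))
print(Y.shape)                                   # (50, 16)
```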


2.4. Graph Coarsening and Fast Pooling


  • Graph coarsening

    • Uses the Graclus multilevel clustering algorithm.

    • Done for multiple layers to perform multi-scale clustering where similar vertices are clustered together.

    • Reduces the size of the graph while preserving the meaningful information of the graph.

  • Efficient Pooling

    • Rearrange the vertices so that the pooling operation becomes as efficient as 1D pooling (a toy sketch follows the two steps below)

      1. Create a balanced binary tree

      2. Rearrange the vertices
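
The following toy sketch (my own simplification, not the Graclus/ChebNet code) illustrates the idea: once fake (disconnected) nodes are added to balance the binary tree and the vertices are rearranged so that every pair merged at the next coarsening level is adjacent in the ordering, graph pooling reduces to ordinary 1D max-pooling with stride 2.

```python
import numpy as np

# Illustrative only: the ordering and the -inf value used for fake nodes
# below are toy assumptions.
def pool_pairs(x):
    """x: rearranged node signal (n,), n even; max-pool consecutive pairs."""
    return x.reshape(-1, 2).max(axis=1)

x = np.array([3.0, 1.0, 4.0, -np.inf, 2.0, 5.0])   # -inf marks a fake node
print(pool_pairs(x))                                # [3. 4. 5.]
```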

  • To sum up, ChebNet enables the application of $K$-localized filters on graphs to capture spatial information, and improves the computational efficiency of the convolution operation by removing the eigendecomposition step and parameterizing the localized graph filters with a Chebyshev expansion.



3. GCN : Layer-Wise Linear Formulation of Spatial Graph Convolution


  • GCN facilitates the transformation of spectral graph convolution into spatial graph convolution, broadening the application of CNNs to graph-structured data.

  • Key changes in GCN from ChebNet

    1. Limit K = 1

    2. Constrain the number of parameters to address overfitting and reduce computations.

      • $\large \theta \,=\, \theta_{0} \,=\, - \theta_{1}$

    3. Renormalization trick

      • The operator $\large I_{N} + D^{\frac{-1}{2}}AD^{\frac{-1}{2}}$ has eigenvalues in the range $[0, 2]$, so repeatedly applying it in a deep model can cause vanishing / exploding gradient problems.

      • add self-connections to the adjacency matrix to maintain node identity.


3.1. Layer-Wise First-Order Linear Model


  • GCN approximates the spectral graph convolution process as a layer-wise linear operation, limiting the convolution operation to first-order neighborhoods (K = 1).

  • Adds a non-linearity after the linear filtering step and stacks multiple such linear graph convolutional layers to build deeper models.

  • ChebNet

    • Learns and computes $K$-localized filters to mimic CNN filters with kernel size $K$, capturing a localized receptive field.

    • $\large g_{\theta^{\prime}} * x \, = \, \sum_{k=0}^{K}\, \theta_{k}^{\prime} T_{k}(\tilde{L})x$

  • GCN

    • Normalized $\large L = I_{N} - D^{\frac{-1}{2}}AD^{\frac{-1}{2}}$

    • $\large g_{\theta^{\prime}} * x \, = \, \theta_{0}^{\prime}x \, + \, \theta_{1}^{\prime}(L - I_{N})\,x \, = \, \theta_{0}^{\prime}x \, - \, \theta_{1}^{\prime} D^{\frac{-1}{2}}AD^{\frac{-1}{2}}x$

    • Instead of explicitly parameterizing $K^{th}$-order localized filters using the Chebyshev expansion, GCN adopts a linear convolutional layer that applies a first-order localized filter to the graph.

    • GCN can recover the ability of spectral filters to capture local representations by stacking multiple such layers, while reducing the overfitting on local neighborhood structure that explicit $K$-th order filters can cause.


3.2. Single Parameter


  • Constrain the two parameters $\large \theta_{0}^{\prime}$, $\large \theta_{1}^{\prime}$ to a single parameter $\large \theta = \theta_{0}^{\prime} = -\,\theta_{1}^{\prime}$

  • $\large g_{\theta} * x \, = \, \theta(I_{N} \, + \, D^{\frac{-1}{2}}AD^{\frac{-1}{2}})x$

  • This limits the number of learnable parameters, which helps address overfitting and reduces the required computation.

  • Note that $\large I_{N} \, + \, D^{\frac{-1}{2}}AD^{\frac{-1}{2}}$ has eigenvalues within the range $[0, 2]$ (a quick numerical check is sketched below)
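
A quick numerical check of this eigenvalue range on a random undirected graph; the toy graph and variable names are my own, not from the paper.

```python
import numpy as np

# Check that the eigenvalues of I_N + D^{-1/2} A D^{-1/2} fall inside [0, 2].
rng = np.random.default_rng(0)
n = 30
A = np.triu((rng.random((n, n)) < 0.2).astype(float), k=1)
A = A + A.T                                        # symmetric adjacency, no self-loops
deg = A.sum(axis=1)
deg[deg == 0] = 1.0                                # guard against isolated nodes
D_inv_sqrt = np.diag(deg ** -0.5)
S = np.eye(n) + D_inv_sqrt @ A @ D_inv_sqrt
eigvals = np.linalg.eigvalsh(S)
print(round(eigvals.min(), 4), round(eigvals.max(), 4))   # both within [0, 2]
```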


3.3. Renormalization Trick


  • The problem with the conventional adjacency matrix is that aggregation with it only reflects information from the neighboring nodes, not from the source node itself.

  • Aggregating node representations with this adjacency matrix may therefore lose node identity and fail to construct feature representations that fully reflect the graph structure.

  • This problem can be easily fixed by adding self-connections to the adjacency matrix.

    • $\large \tilde{A} = A + I_{N}$

  • Renormalization

    • In order to make the filter represent the relative significance of the neighboring nodes and the source node, rather than being dominated by the absolute node degrees, re-normalize $\large \tilde{A}$ (a small sketch follows).

    • $\large I_{N} \, + \, D^{\frac{-1}{2}}AD^{\frac{-1}{2}} \;\rightarrow\; \tilde{D}^{\frac{-1}{2}} \tilde{A} \tilde{D}^{\frac{-1}{2}}$, where $\large \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$
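
A minimal sketch of the renormalization trick; the function name `renormalized_adjacency` and the toy adjacency matrix are illustrative assumptions, not the reference implementation.

```python
import numpy as np

# Add self-connections, then symmetrically normalize with the new degrees.
def renormalized_adjacency(A):
    """A: dense (N, N) adjacency. Returns A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])               # tilde(A) = A + I_N
    deg = A_tilde.sum(axis=1)                      # tilde(D)_ii >= 1, so safe to invert
    D_inv_sqrt = np.diag(deg ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(renormalized_adjacency(A))                   # A_hat used in every GCN layer
```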


Comparison of Propagation Models and Final GCN Layer


  • Final formulation of GCN layer : $\large Z = \hat{A} \, X \, \Theta$

    • $\large \hat{A} = \tilde{D}^{\frac{-1}{2}} \tilde{A} \tilde{D}^{\frac{-1}{2}}$

    • $\large \hat{A} \, \in \, \mathbb{R}^{N \times N}$

    • $\large X \, \in \, \mathbb{R}^{N \times C}$, C : input channels

    • $\large \Theta \, \in \, \mathbb{R}^{C \times F}$

    • $\large Z \, \in \, \mathbb{R}^{N \times F}$, F : output feature maps

  • Add a non-linear activation to obtain the layer-wise propagation rule (a minimal sketch follows the formula below)

    • $\large H^{(l+1)} = \sigma(\hat{A}\,H^{(l)}\,W^{(l)})$
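
A minimal sketch of one such propagation step; `gcn_layer`, the identity placeholder for $\hat{A}$, and the toy shapes are my own assumptions, not the reference implementation.

```python
import numpy as np

# One propagation step H^{(l+1)} = sigma(A_hat H^{(l)} W^{(l)}).
def gcn_layer(A_hat, H, W, activation=lambda z: np.maximum(z, 0.0)):
    """A_hat: (N, N) renormalized adjacency, H: (N, d_in), W: (d_in, d_out)."""
    return activation(A_hat @ H @ W)               # aggregate neighbors, transform, activate

rng = np.random.default_rng(0)
N, C, F = 4, 3, 2
A_hat = np.eye(N)                                  # stand-in for D~^{-1/2} A~ D~^{-1/2}
H1 = gcn_layer(A_hat, rng.normal(size=(N, C)), rng.normal(size=(C, F)))
print(H1.shape)                                    # (4, 2)
```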


Interpretation of spatial convolution using K=1 localized filter


  • Re-constructs the feature of the target node as a weighted sum of the features of its first-order neighbors (and, via the self-connection, of the node itself).

  • Analogous to standard convolution in CNN in that it computes the weighted sum of adjacent data points to generate more semantically strong higher-order feature maps.


3.4. Multi-Layer GCN for Semi-Supervised Learning


  • Forward model of GCN

    • Two-layer GCN

    • $\large f(X, A) = Z = \text{Softmax}(\hat{A} \,\, \text{ReLU} \, (\hat{A} X W^{(0)}) \,\, W^{(1)})$

    • Softmax applied row-wise over the $F$ classes : $\large \text{softmax}(x_{i})_{f} = \frac{\exp(x_{if})}{\sum_{f^{\prime}=1}^{F} \exp(x_{if^{\prime}})}$ where $\large x_{i} \in \mathbb{R}^{F}$ is the row of logits for node $i$

    • Objective function : cross-entropy loss evaluated over the labeled nodes (a minimal sketch of this forward model follows)
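
Here is a minimal NumPy sketch of this two-layer forward model with a cross-entropy loss masked to the labeled nodes. All names, the identity stand-in for $\hat{A}$, and the toy data are illustrative assumptions rather than the original implementation.

```python
import numpy as np

# Z = softmax(A_hat ReLU(A_hat X W0) W1), loss on labeled nodes only.
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # row-wise, numerically stable
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(A_hat, X, W0, W1):
    H = np.maximum(A_hat @ X @ W0, 0.0)            # first layer + ReLU
    return softmax(A_hat @ H @ W1)                 # class probabilities Z, shape (N, F)

def masked_cross_entropy(Z, labels, labeled_idx):
    """Average of -log Z[i, y_i] over the labeled nodes only."""
    return -np.mean(np.log(Z[labeled_idx, labels[labeled_idx]] + 1e-12))

rng = np.random.default_rng(0)
N, C, H_dim, F = 6, 4, 8, 3
A_hat = np.eye(N)                                  # stand-in for the renormalized adjacency
X = rng.normal(size=(N, C))
W0, W1 = rng.normal(size=(C, H_dim)), rng.normal(size=(H_dim, F))
labels = rng.integers(0, F, size=N)
Z = two_layer_gcn(A_hat, X, W0, W1)
print(masked_cross_entropy(Z, labels, labeled_idx=np.array([0, 2])))
```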

  • Semi-Supervised Learning

    • “Semi-supervised” means that the model infers the labels of unlabeled nodes from a small set of labeled (ground-truth) nodes, based on the graph structure.

    • As the model $\large f(X, A)$ is conditioned on both $\large A$ and $\large X$, it carries information about the global connectivity underlying the graph structure.

    • Hence, the model is expected to extract information that is not present in $\large X$ alone, inferring it from the links and relations between nodes.

    Figure 1.

    • (b) : t-SNE visualization of hidden layer activations of a two-layer GCN trained on the Cora dataset.

      • The model appears to successfully learn spatially meaningful, localized patterns present in the graph.


  • Implementation Cost of Two-Layer GCN

    • $\large O(|\xi|CHF)$

    • $\large |\xi|$ : number of edges (nonzero entries of the sparse Laplacian).

    • $\large C$ : input channel dimensionality.

    • $\large H$ : dimension of 1st hidden layer.

    • $\large F$ : dimension of the output layer (number of output feature maps).




  • This post gave an overview of how graph convolution evolved from spectral convolution to spatial convolution, focusing on two influential models, ChebNet and GCN.

  • With the layer-wise linear formulation of the spatial graph convolution process, generalizing CNNs to graphs becomes much more practically feasible.