[Paper Review] Streaming Graph Neural Networks (DGNN, 2018)

2023, Jul 09    


1. Dynamic Graph Neural Networks (DGNN)


  • As graphs in real-world applications are inherently dynamic, this paper aims to design a graph neural network that models dynamic graphs along with their temporal information.

  • DGNN utilizes a modified LSTM to update the representations of the interacting nodes when a new interaction occurs.

  • Further, it propagates the influence of the new interaction information to the neighboring nodes.

  • Both the update and propagation steps take the time interval between interactions into account to determine how strongly the current interaction is reflected in the update or propagation.


2. Frameworks of DGNN


Figure 1. An overview of DGNN when a new interaction occurs at time t7 from v2 to v5.

  • Interacting nodes : the two nodes directly involved in the interaction, here $\large v_{2}$ and $\large v_{5}$.

  • Influenced nodes :

    • nodes that are influenced by the interaction, limited to the nodes close to the interacting nodes.

    • $\large v_{1}, v_{3}, v_{6}, v_{7}$

  • $\large v_{s}$ and $\large v_{g}$ : source node and target node.

  • Interaction from $\large v_{s}$ to $\large v_{g}$ at time $\large t$ : $\large (v_{s}, v_{g}, t)$

    • at time $\large t_{7}$ from $\large v_{2}$ to $\large v_{5}$ : $\large (v_{2}, v_{5}, t_{7})$


2.1. Update Component


Figure 2. An overview of the operations of the update component, with the focus on node $\large v_{2}$ and its connections

  • Overview of the operations of the update component with the focus on $\large v_{2}$ in the dynamic graph illustrated in Figure 1.

  • There are three interactions involving node $\large v_{2}$ : $\large (v_{2}, v_{1}, t_{0})$, $\large (v_{7}, v_{2}, t_{3})$, and $\large (v_{2}, v_{5}, t_{7})$.

  • The sequence of interactions is applied recurrently to update the information of the affected nodes.

    • Each component in the sequence takes the output of the previous component as its input.

    • Only the latest information about each node is stored.

  • As shown in Figure 2, a single update component consists of three units: the interact unit, the S- or G-Update unit, and the merge unit.

    Figure 3. Overview of the operations performed within each unit in the update component when an interaction $\large (v_{2}, v_{5}, t_{7})$ happens.


  • Before getting into the details of each unit, let’s specify the information stored for each node.

    Figure 3.(a)

    • As a node can act as both a source and a target, there are two sets of cell memories and hidden states: one for the role as a source and the other for the role as a target.

    • $\large C_{v}^{s}(t-),\,\, h_{v}^{s}(t-),\,\, C_{v}^{g}(t-),\,\, h_{v}^{g}(t-)$ : cell memories and hidden states of source and target nodes at time $t-$.

    • $\large t_{7}-$ denotes the most recent time before the interaction at time $\large t_{7}$.

      • In case of $\large v_{2}$, $\large t_{7}-$ is $\large t_{3}$ and for $\large v_{5}$, $\large t_{6}$.
    • $\large u_{v_{x}}$ contains the general features computed by the interact unit, which hold the interaction information between the source and target nodes.

  • Carrying this information, updated from the previous interaction, the interacting nodes enter the subsequent units.
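
To make this bookkeeping concrete, below is a minimal sketch (in PyTorch; all names are illustrative, not from the paper's code) of the state DGNN keeps for each node:

```python
from dataclasses import dataclass

import torch


@dataclass
class NodeState:
    """Per-node information tracked by DGNN (illustrative sketch)."""
    dim: int                                    # dimensionality of memories / features

    def __post_init__(self):
        zeros = lambda: torch.zeros(self.dim)
        self.C_s, self.h_s = zeros(), zeros()   # C_v^s(t-), h_v^s(t-): role as source
        self.C_g, self.h_g = zeros(), zeros()   # C_v^g(t-), h_v^g(t-): role as target
        self.u = zeros()                        # general features u_v(t-)
        self.t_last_s = 0.0                     # last time v acted as a source
        self.t_last_g = 0.0                     # last time v acted as a target
```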


2.1.1. Interact Unit


Figure 3.(b)

  • This unit is a learned feed-forward network that computes the interaction information between the source and target nodes.

  • Formulation :

    $$\large e(t) = act(W_{1}\, u_{v_{s}}(t-) + W_{2}\, u_{v_{g}}(t-) + b_{e})$$

  • Each of $\large u_{v_{2}}(t_{7}-)$ (source) and $\large u_{v_{5}}(t_{7}-)$ (target) enters the model to generate the interaction information $\large e(t)$.

  • $\large W_{1}, W_{2}, b_{e}$ are learned parameters and $\large act$ is the activation function (e.g., sigmoid or tanh).

  • The unit also computes $\Delta_{t_{s}}$ and $\Delta_{t_{g}}$, the time intervals between each node's latest previous interaction and the current one.
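
A minimal sketch of this unit, assuming the formulation above (parameter names mirror $W_{1}, W_{2}, b_{e}$; tanh is one of the activations the paper mentions). The time intervals are plain bookkeeping: $\Delta_{t_{s}} = t - t_{last}^{s}$, and likewise for the target.

```python
import torch
import torch.nn as nn


class InteractUnit(nn.Module):
    """Sketch of the interact unit: e(t) = act(W1 u_s(t-) + W2 u_g(t-) + b_e)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.b_e = nn.Parameter(torch.zeros(dim))

    def forward(self, u_s: torch.Tensor, u_g: torch.Tensor) -> torch.Tensor:
        # u_s, u_g: general features of the source and target at t-
        return torch.tanh(self.W1(u_s) + self.W2(u_g) + self.b_e)
```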


2.1.2. Update Unit


Figure 3.(c)

  • Modified LSTM is used in this unit.

  • There are two types of update units, S-Update and G-Update, to which the corresponding information of each node (source or target) is passed.

    • Input of the unit consists of four components: $\large C_{v_{x}}^{x}(t_{7}-),\,\, h_{v_{x}}^{x}(t_{7}-),\,\, e(t_{7}),\,\, \Delta_{t_{x}}$, where $x$ is $s$ for the source and $g$ for the target.

    • The S- and G-Update units share the same structure but have separately learned weights.

  • In each unit, the new $\large C_{v}(t_{7})$ and $\large h_{v}(t_{7})$ are computed through the modified LSTM with the corresponding formulations.

    Figure 4. Illustration of the update unit

    • Parts (2) and (4) are the short-term memory and the long-term memory, respectively.

    • While the long-term memory remains unchanged, the short-term memory is discounted by a discount function $\large g$ that considers the time interval between interactions.

      • As $\large g$ is a decreasing function, a larger $\Delta_{t}$ results in a smaller $g(\Delta_{t})$, which is multiplied into the short-term memory.

      • Intuitively, this operation reflects the natural tendency for older memories to be forgotten more, while relatively recent memories are retained to a greater extent.

    • Part (5) in the formulations is the final adjusted cell memory $\large C_{v}^{*}(t-)$, which is passed out of the blue dashed box to the standard LSTM unit.

    • The formulations for the remaining parts of the update unit follow the standard LSTM gate equations, taking $\large e(t)$ as the input and the adjusted $\large C_{v}^{*}(t-)$ in place of the previous cell memory.

    • All the procedures of the update unit (from eq. (2) to eq. (11)) are summarized in the sketch below.
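
A minimal sketch of the update unit as described above; the comments map code lines to the part numbers referenced in Figure 4. The concrete discount function $g(\Delta_{t}) = 1/\log(e + \Delta_{t})$ is an assumption (the review only requires $g$ to be decreasing):

```python
import math

import torch
import torch.nn as nn


class UpdateUnit(nn.Module):
    """Sketch of the modified (time-aware) LSTM cell used by S-Update / G-Update."""

    def __init__(self, dim: int):
        super().__init__()
        self.decompose = nn.Linear(dim, dim)       # extracts the short-term memory
        self.gates = nn.Linear(2 * dim, 4 * dim)   # i, f, o gates + candidate memory

    @staticmethod
    def g(dt: torch.Tensor) -> torch.Tensor:
        # Decreasing discount function; this concrete form is an assumption.
        return 1.0 / torch.log(math.e + dt)

    def forward(self, C_prev, h_prev, e_t, dt):
        # C_prev, h_prev, e_t: (dim,) tensors; dt: scalar tensor (time interval)
        C_short = torch.tanh(self.decompose(C_prev))   # (2) short-term memory
        C_short_dis = C_short * self.g(dt)             # (3) discounted short-term memory
        C_long = C_prev - C_short                      # (4) long-term memory, unchanged
        C_star = C_long + C_short_dis                  # (5) adjusted cell memory C*
        # Standard LSTM step on the adjusted memory, with e(t) as the input.
        i, f, o, c = self.gates(torch.cat([e_t, h_prev], dim=-1)).chunk(4, dim=-1)
        C_new = f.sigmoid() * C_star + i.sigmoid() * c.tanh()
        h_new = o.sigmoid() * torch.tanh(C_new)
        return C_new, h_new
```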


2.1.3. Merge Unit


  • Depending on whether the given node is a source or a target, it passes through only one of the two units (S-Update or G-Update) of the update component.

  • As each unit only updates the information of the corresponding node (source or target), the information of the other node (for S-Update, the target information) remains in its previous state.

  • Hence, node $\large v_{s}$ has $\large h_{v_{s}}^{s}(t)$ and $\large h_{v_{g}}^{g}(t-)$ as the output of S-Update and node $\large v_{g}$ has $\large h_{v_{s}}^{s}(t-)$ and $\large h_{v_{g}}^{g}(t)$ as the output of G-Update unit.

  • Combining these two hidden-state features of the source and target, the merge unit generates the general features $\large u_{v_{s}}(t)$ or $\large u_{v_{g}}(t)$ (a sketch is given at the end of this subsection).

  • Finally, the output of the update component is the updated information of the interacting nodes after the interaction $\large (v_{2}, v_{5}, t_{7})$

    • For the source node, updated information includes $\large C_{v_{s}}^{s}(t),\,\, h_{v_{s}}^{s}(t),\,\, C_{v_{g}}^{g}(t-),\,\, h_{v_{g}}^{g}(t-), \,\,u_{v_{s}}(t)$

    • For the target node, $\large C_{v_{g}}^{g}(t), \,\,h_{v_{g}}^{g}(t),\,\, C_{v_{s}}^{s}(t-),\,\, h_{v_{s}}^{s}(t-), \,\,u_{v_{g}}(t)$
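
The exact merge equations are omitted above; as a placeholder, here is a minimal sketch that assumes a learned linear combination of the two role-specific hidden states (an assumption, not necessarily the paper's exact formulation):

```python
import torch
import torch.nn as nn


class MergeUnit(nn.Module):
    """Combine h_v^s(t) and h_v^g(t) into the general features u_v(t).
    The linear-combination form used here is an assumption."""

    def __init__(self, dim: int):
        super().__init__()
        self.W_s = nn.Linear(dim, dim, bias=False)
        self.W_g = nn.Linear(dim, dim, bias=False)
        self.b_u = nn.Parameter(torch.zeros(dim))

    def forward(self, h_s: torch.Tensor, h_g: torch.Tensor) -> torch.Tensor:
        return self.W_s(h_s) + self.W_g(h_g) + self.b_u
```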


2.2. Propagation Component


  • After updating the information of the interacting nodes, DGNN also propagates the influence of the interaction to adjacent nodes, referred to as influenced nodes.

  • The authors limit the influenced nodes to the current neighbors of the two interacting nodes for the following three reasons.

    1. Impact of a new edge on the whole graph is often local.

    2. The propagated information can propagate further once a new interaction happens for one of the influenced nodes.

    3. Empirically, extending the propagation range does not significantly increase the performance of the model and, in some cases, even decreases it.

  • To update the influenced nodes, the interaction information $\large e(t)$ computed in the update component is propagated to their cell memories.

  • Based on the assumption that propagation carries an indirect, secondary influence from the interaction, the authors simply add the new interaction information to the cell memories of the influenced nodes instead of directly decaying them, as is done for the interacting nodes.

  • Similar to the update component, the propagation component also considers the time interval to determine the extent to which the interaction information is reflected.

  • Additionally, the connection strength between nodes is taken into consideration, as strongly tied neighbors are naturally more likely to be influenced, and vice versa.

  • There are three units in the propagation component: the interact unit (b), the prop unit (c), and the merge unit (d).

    Figure 5. Propagation from the source node ($\large v_{2}$) to source neighbors ($\large v_{7}$)

    • The interact unit and merge unit are identical to the ones used in the update component.

    • For the prop unit, the paper divides the current neighbors of the interacting nodes into four groups.


2.2.1. Defining Neighbors to be Influenced


  • Neighbors of source node and target node : $\large N(v_{s})$, $\large N(v_{g})$

    • $\large N(v_{s})\,=\, N^{s}(v_{s}) \, \cup \, N^{g}(v_{s})$ : the source neighbors and the target neighbors of the source node.

    • $\large N(v_{g})\,=\, N^{s}(v_{g}) \, \cup \, N^{g}(v_{g})$ : the source neighbors and the target neighbors of the target node.

  • The four neighbor groups pass through different prop units that share the same structure but have different parameters.


2.2.2. Operations in Prop Units


  • Focusing on source-to-source propagation, for all $\large v_{x} \in N^{s}(v_{s})$, the interaction information propagates as follows

    $$\large C_{v_{x}}^{s}(t) = C_{v_{x}}^{s}(t-) + f_{a}(u_{v_{x}}(t-),\, u_{v_{s}}(t-))\, g(\Delta_{t}^{s})\, h(\Delta_{t}^{s})\, W_{s}^{s}\, e(t)$$

  • The interaction information $\large e(t)$, scaled by four factors, is added to the cell memory of $\large v_{x}$.

    • $\large f_{a}(u_{v_{x}}(t-), \, u_{v_{s}}(t-))$ : an attention score capturing the relative connection strength between $\large v_{x}$ and $\large v_{s}$ among the other source neighbors.

      $$\large f_{a}(u_{v_{x}}(t-),\, u_{v_{s}}(t-)) = \frac{\exp \left( u_{v_{x}}(t-)^{T} u_{v_{s}}(t-) \right)}{\sum_{v \in N^{s}(v_{s})} \exp \left( u_{v}(t-)^{T} u_{v_{s}}(t-) \right)}$$

    • $\large g(\Delta_{t}^{s})$

      • The same discount function as in the update component, decaying the interaction information according to the magnitude of the time interval.

      • Older connections are less likely to be affected.

    • $\large h(\Delta_{t}^{s})$

      $$\large h(\Delta_{t}^{s}) = \begin{cases} 1, & \Delta_{t}^{s} \le \tau \\ 0, & \text{otherwise} \end{cases}$$

      • Not a gradually decaying function, but a hard filter that blocks the propagation to neighbors whose time interval exceeds a pre-defined threshold ($\large \tau$).

    • $\large W_{s}^{s}$

      • A learned linear transformation that projects the interaction information onto the neighbors.

      • Each of the four neighbor groups uses a different transformation.

  • After obtaining the newly updated $\large C_{v_{x}}^{s}(t)$ and $\large h_{v_{x}}^{s}(t)$ with the prop operations above, they are passed to the merge unit, where the updated general features $\large u_{v_{x}}(t)$ are computed from the outputs of the prop unit; see the sketch below.
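
Putting the pieces together, here is a minimal sketch of one prop unit (source-to-source). The softmax form of $f_{a}$ and the $\tanh$ read-out of the hidden state are assumptions consistent with the description above; $g$ reuses the assumed discount function from the update unit:

```python
import math

import torch
import torch.nn as nn


class PropUnit(nn.Module):
    """Propagate e(t) to the cell memories of the neighbors in N^s(v_s) -- sketch."""

    def __init__(self, dim: int, tau: float):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # W_s^s
        self.tau = tau                             # propagation time threshold

    def forward(self, C_nbrs, u_nbrs, u_s, e_t, dt):
        # C_nbrs, u_nbrs: (num_neighbors, dim); u_s, e_t: (dim,); dt: (num_neighbors,)
        f_a = torch.softmax(u_nbrs @ u_s, dim=0)   # attention (softmax form is assumed)
        g = 1.0 / torch.log(math.e + dt)           # time discount (assumed form)
        h = (dt <= self.tau).float()               # hard filter h(dt)
        scale = (f_a * g * h).unsqueeze(-1)        # (num_neighbors, 1)
        C_new = C_nbrs + scale * self.W(e_t)       # add scaled interaction information
        h_new = torch.tanh(C_new)                  # hidden-state read-out (assumption)
        return C_new, h_new
```

The three remaining groups ($N^{g}(v_{s})$, $N^{s}(v_{g})$, $N^{g}(v_{g})$) would each get their own `PropUnit` instance with the same structure but separate weights.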


3. Parameter Learning


  • This section describes how DGNN learns parameters to perform various tasks such as link prediction and node classification.



3.1. Link Prediction


  • First, project the source and target general features ($\large u_{v_{s}}(t-)$ and $\large u_{v_{g}}(t-)$) at time $\large t-$ with the corresponding projection matrices $\large P^{s}$ and $\large P^{g}$ to obtain $\large u_{v_{s}}^{s}(t-)$ and $\large u_{v_{g}}^{g}(t-)$.

  • Then, compute the probability of an interaction between the source and target at time $\large t$

    • Take the dot product of the two projected features and apply the sigmoid function.

    • $\large \sigma \left( u_{v_{s}}^{s}(t-)^{T}\, u_{v_{g}}^{g}(t-) \right)$

  • Eventually, the loss is represented as

    $$\large L = -\log \sigma \left( u_{v_{s}}^{s}(t-)^{T} u_{v_{g}}^{g}(t-) \right) - \sum_{q=1}^{Q} \mathbb{E}_{v^{q} \sim P_{n}(v)} \log \sigma \left( -\, u_{v^{q}}^{g}(t-)^{T} u_{v_{s}}^{s}(t-) \right)$$

    • $\large Q$ is the number of negative examples.

    • $\large P_{n}(v)$ is a negative sampling distribution.

  • The losses of all interactions until time $\large T$ are added up to obtain the total loss.

    $$\large L_{total} = \sum_{(v_{s},\, v_{g},\, t):\ t \le T} L(v_{s}, v_{g}, t)$$

  • Then, mini-batch gradient descent is adopted to optimize the loss function.

  • Note that mini-batches are not randomly sampled but are formed following the temporal order of the interaction sequences. A sketch of the per-interaction loss follows.
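
A minimal sketch of the per-interaction loss, assuming the standard negative-sampling objective implied by $Q$ and $P_{n}(v)$ (the expectation is approximated by $Q$ sampled negatives; all names are illustrative):

```python
import torch


def link_prediction_loss(u_s, u_g, u_negs, P_s, P_g, eps=1e-8):
    """Negative-sampling loss for one interaction (v_s, v_g, t) -- sketch.

    u_s, u_g: general features at t-, shape (dim,);
    u_negs: (Q, dim) features of Q negatives drawn from P_n(v);
    P_s, P_g: (dim, dim) projection matrices."""
    z_s = P_s @ u_s                          # projected source features
    z_g = P_g @ u_g                          # projected target features
    z_negs = u_negs @ P_g.T                  # negatives projected as targets
    pos = torch.sigmoid(z_s @ z_g)           # interaction probability
    negs = torch.sigmoid(-(z_negs @ z_s))    # (Q,)
    return -(torch.log(pos + eps) + torch.log(negs + eps).sum())
```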


3.2. Node Classification


  • Adopts the cross-entropy (CE) loss.

  • First, project the general features $\large u_{v}(t)$ to $\large u_{v}^{c}(t)\,\in\,\mathbb{R}^{N_{c} \times 1}$, where $\large N_{c}$ is the number of classes.

  • Then compute the CE loss across all classes

    $$\large L = -\sum_{c=1}^{N_{c}} y_{c} \log \hat{y}_{c}, \quad \hat{y} = \mathrm{softmax}(u_{v}^{c}(t))$$
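
A minimal sketch, assuming a simple linear projection followed by the standard cross-entropy (the projection name `P_c` is illustrative):

```python
import torch
import torch.nn.functional as F


def node_classification_loss(u_v, P_c, label):
    """CE loss for one node -- sketch. u_v: (dim,); P_c: (N_c, dim); label: int."""
    logits = P_c @ u_v                       # u_v^c(t): one logit per class
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([label]))
```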


4. Evaluation of DGNN and Performance Comparison with Other Baselines


4.1. Evaluation Metrics


  • Link Prediction

    • Mean Reciprocal Rank (MRR) :

      $$\large MRR = \frac{1}{|H|} \sum_{i=1}^{|H|} \frac{1}{rank_{i}}$$

      • $\large rank_{i}$ : the rank of the ground-truth node among all nodes, ordered by the computed probability.

      • $|H|$ : the number of test pairs (a single edge yields two pairs, one for the source node and one for the target node).

    • Recall@k :

      $$\large Recall@k = \frac{1}{|H|} \sum_{i=1}^{|H|} \mathbb{1}(rank_{i} \le k)$$

      • The indicator function returns 1 only if $\large rank_{i}$ is at most $\large k$, so the metric measures the fraction of examples in which the ground-truth node is ranked in the top $\large k$ among all nodes (a sketch of both metrics is given at the end of this subsection).


  • Node Classification

    • F1-macro

      • sum(F1 scores) / number of classes

    • F1-micro

      • TP / (TP + 0.5*(FP + FN))
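
Both ranking metrics are straightforward to compute once the rank of each ground-truth node is known; a minimal sketch:

```python
def mrr(ranks):
    """Mean Reciprocal Rank over the |H| test pairs."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of test pairs whose ground-truth node is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```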


4.2. Comparison with Other Models


  • Baseline Models

    • GCN, GraphSAGE, node2vec, DynGEM, CPTM, DANE, DynamicTriad



  • Link Prediction Task : For all metrics and datasets, except Recall@20 on the DNC dataset, DGNN outperforms the baseline models.

  • Node Classification : DGNN shows better performance than all the other baselines.