[Paper Review & Partial Implementation] Random Walks Based Graph Embeddings : DeepWalk, Node2Vec


2023, Jul 23    

Graph Embedding


  • The goal of graph embedding is to transform the nodes and edges of a graph into a lower-dimensional vector space while preserving important structural properties and relationships between nodes.

  • In other words, it is a technique used to represent graph data in a continuous, numerical form suitable for machine learning algorithms.

  • There are various techniques for graph embedding, including random-walk-based methods, spectral methods, and deep-learning-based methods like Graph Convolutional Networks (GCNs) and GraphSAGE.

  • In this post, I will review the random-walk-based low-dimensional embedding strategies: DeepWalk and node2vec.


1. DeepWalk


  • DeepWalk, proposed in 2014, generalizes word embedding strategies from natural language to learn latent low-dimensional representations of graph networks.

    • This analogy is supported by the similar power-law distributions of vertex frequency in random walks on a graph and word occurrence in natural language.

    • If the degree distribution of a connected graph follows a power law (i.e. scale-free), the frequency which vertices appear in the short random walks will also follow a power-law distribution.

    • Word frequency in natural language follows a similar distribution, and hence, techniques from language modeling account for this type of distributional behavior.

  • The DeepWalk method is divided into two parts: generating node sequences with random walks, and learning node embeddings with a hierarchical skip-gram model.

  • It treats the paths created from random walks as “sentences” and applies the Word2Vec algorithm (Skip-Gram) to learn embedded node representations in a continuous vector space.

  • Additionally, it adopts hierarchical softmax as an alternative to standard softmax-based approaches for training the embeddings, which significantly improves computational efficiency.


1.1. Algorithms for DeepWalk


  • Overview of the DeepWalk approach

    • Consists largely of two parts : 1. RandomWalk generator, 2. hierarchical Skip-Gram update

    • Hyper-parameters are denoted as $d$ for the embedding dimension, $\gamma$ for the number of random walks per vertex, $t$ for the walk length, and $w$ for the skip-gram window size.

    • Empirically, through a parameter sensitivity study, it works well with $d = 64$, $\gamma = 30$, $w = 10$.


1.2. PyTorch Implementation for DeepWalk


  • Graph and hyperparameters
import random
import torch
import torch.nn as nn

# Toy graph given as an adjacency list
adj_list = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2], [5, 6], [4, 6], [4, 5], [1, 3]]
size_vertex = len(adj_list)  # number of vertices

w  = 10      # window size
d  = 64      # embedding size
r  = 30      # walks per vertex
t  = 6       # walk length
lr = 0.025   # learning rate

v = [0, 1, 2, 3, 4, 5, 6, 7]   # labels of available vertices


  • Two parts : 1. RandomWalk → 2. Learn node embeddings through hierarchical skip-gram.
for i in range(r):                    # r walks per vertex
    random.shuffle(v)                 # visit vertices in random order
    for vi in v:
        wvi = RandomWalk(vi, t)       # sample a walk of length t starting from vi
        HierarchicalSkipGram(wvi, w)  # update embeddings using window size w


1.2.1. RandomWalk


def RandomWalk(node,t):
    walk = [node]        # Walk starts from the source node
    
    for i in range(t-1):
        node = adj_list[node][random.randint(0,len(adj_list[node])-1)]
        walk.append(node)

    return walk


  • Randomly selects the next node among the neighbors adjacent to the current node and appends it to the random walk path.


1.2.2. Hierarchical Skip-Gram


1.2.2.1 Standard Softmax Based Embedding Learning


  • CBOW (Continuous Bag of Words)

    • Predict embedding information of the target word from the context words within the window size (-w to +w from the center).

    • Needs to compute the dot product with the V x M lookup matrix (W) 2w times, once for each context word.

    • Average out all the resultant vectors to generate embeddings (lookup, $v$) for the central word.

    • Project the embedded vector with M x V projection matrix (W’) and take the softmax.

    • The objective function is the cross-entropy loss.

  • Skip-Gram

    • In contrast to CBOW, the goal is to predict context words given a target word.

    • Uses the embedding vector of the target word to predict its context words within the window size.

    • No need to separately compute embedded vectors for all neighboring words and average them to get embedding vector for the target word.

    • Generates only one embedded vector for the central word using the lookup matrix, and trains the projection matrix W’ 2w times, once for each neighboring word.

  • Both CBOW and skip-gram use the softmax function to calculate a probability distribution over the entire vocabulary for each context word, which can be computationally expensive for large vocabularies.

  • The complexity of the softmax grows linearly with the vocabulary size, making it slow and memory-intensive to compute probabilities for every word.
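
  • To make this cost concrete, below is a minimal sketch (my own illustration, not from the papers) of a single skip-gram training step with the standard softmax; the sizes V and M, the matrices W_in / W_out, and the word indices are hypothetical. The logits span the entire vocabulary, which is exactly the $O(V)$ bottleneck.

import torch
import torch.nn as nn
import torch.nn.functional as F

V, M = 10000, 64                                 # hypothetical vocabulary and embedding sizes
W_in  = nn.Parameter(torch.randn(V, M) * 0.01)   # V x M lookup matrix (W)
W_out = nn.Parameter(torch.randn(M, V) * 0.01)   # M x V projection matrix (W')

target, context = 42, 137                        # arbitrary word indices for illustration
h      = W_in[target]                            # embedding of the target word, shape (M,)
logits = h @ W_out                               # scores over ALL V words -> O(V) per step
loss   = F.cross_entropy(logits.unsqueeze(0), torch.tensor([context]))
loss.backward()                                  # gradients flow into both matrices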


1.2.2.2 Hierarchical Softmax


  • Hierarchical softmax offers a more efficient alternative. Instead of computing the softmax over the entire vocabulary, it creates a binary tree (Huffman Tree) to hierarchically represent the words in the vocabulary.

  • Each word is associated with a unique path from the root of the tree (context word) to its corresponding leaf node (target word).

  • The probability of a word in the hierarchical softmax is calculated as the product of the conditional probabilities along the path to the target word’s leaf node.

  • Conditional Probability : $\large p(w_{T}|w_{C})$

    • Take the sigmoid of the dot product between the vector of each internal node on the path to the target word (leaf) and the embedding of the context word.

    • Multiplier : -1 if right child, +1 if left child, so that the probabilities of the two children sum to 1.

    • Example (worked out below)
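
    As a toy illustration of my own (not a figure from the paper), suppose the leaf for $\large w_{T}$ is reached from the root by going left, left, then right, passing internal nodes $\large n_{1}, n_{2}, n_{3}$ with vectors $\large v'_{n_{1}}, v'_{n_{2}}, v'_{n_{3}}$, and let $\large h$ be the embedding of $\large w_{C}$. Then

    $\large p(w_{T}|w_{C}) = \sigma({v'_{n_{1}}}^{\top}h) \cdot \sigma({v'_{n_{2}}}^{\top}h) \cdot \sigma(-{v'_{n_{3}}}^{\top}h)$

    Only about $\large \log_{2}V$ sigmoids are evaluated instead of a softmax over all $\large V$ words, and the $\pm1$ multiplier ensures that the probabilities over all leaves sum to 1.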

1.2.2.3 Hierarchical Skip-Gram for DeepWalk


# counts the length of the path from the root of the binary tree to the given vertex
def func_L(w):
    count=1
    while(w!=1):
        count+=1
        w//=2

    return count


# returns the j-th node in the path from the root node to the given vertex
def func_n(w, j):
    li=[w]
    while(w!=1):
        w = w//2
        li.append(w)

    li.reverse()
    
    return li[j]


class HierarchicalModel(torch.nn.Module):
    
    def __init__(self):
        super(HierarchicalModel, self).__init__()
        # phi : node embedding matrix, prob_tensor : vectors of the binary-tree nodes
        self.phi         = nn.Parameter(torch.rand((size_vertex, d)))
        self.prob_tensor = nn.Parameter(torch.rand((2*size_vertex, d)))
    
    def forward(self, wi, wk):
        one_hot     = torch.zeros(size_vertex)
        one_hot[wi] = 1
        wk = size_vertex + wk                     # leaves occupy tree indices [size_vertex, 2*size_vertex)
        h = torch.matmul(one_hot, self.phi)       # embedding of the source node wi
        p = torch.tensor([1.0])
        for i in range(1, func_L(wk)-1):          # binary decisions along the path toward the leaf wk
            mult = -1
            if(func_n(wk, i+1)==2*func_n(wk, i)): # left child
                mult = 1
        
            p = p*torch.sigmoid(mult*torch.matmul(self.prob_tensor[func_n(wk,i)], h))
        
        return p


hierarchicalModel = HierarchicalModel()



  • Assigning the vertices to the leaves of a binary tree, the model computes the probability of the specific path in the tree that connects the root node (given $\large \Phi(v_{j})$) to the target leaf node ($\large v_{3}, \, v_{5}$).

  • $\large P(u_{k}\, |\, \Phi(v_{j}))$ : Given the representation of $\large v_{j}$ (source node in the randomwalk), the probability of a path from the root node to the target leaf node (one of its neighbors in the randomwalk).

    • The path is identified by a sequence of tree nodes $\large (b_{0}, b_{1}, \dots , b_{\lceil \log|V| \rceil})$, where $\large b_{0}$ is the root node and $\large b_{\lceil \log|V| \rceil}$ is $\large u_{k}$.

    • Hierarchical softmax reduces the computational complexity of calculating $\large P(u_{k}\, |\, \Phi(v_{j}))$ from $\large O(|V|)$ to $\large O(\log|V|)$.

  • wk : position of the target leaf $\large u_{k}$ in the binary tree (offset by size_vertex in the code).

  • $\large h$ : embedding vector of the source node $\large v_{j}$, which is the input at the root of the tree.

  • func_L(wk) : path length from the root node to the target leaf node.

  • func_n(wk, i) : ith node in the path.
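
  • As a quick sanity check (my own example, not in the original code), a leaf stored at tree index 12 is reached through the path 1 → 3 → 6 → 12:

print(func_L(12))                          # 4  (number of nodes on the root-to-leaf path)
print([func_n(12, j) for j in range(4)])   # [1, 3, 6, 12]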


def HierarchicalSkipGram(wvi,  w):
   
    for j in range(len(wvi)):                              # each node in the walk acts as the source
        for k in range(max(0,j-w) , min(j+w, len(wvi))):   # its neighbors within the window
            # maximize the probability of reaching the neighbor's leaf given the source embedding
            prob = hierarchicalModel(wvi[j], wvi[k])
            loss = - torch.log(prob)
            loss.backward()
            for param in hierarchicalModel.parameters():   # manual SGD update
                param.data.sub_(lr*param.grad)
                param.grad.data.zero_()


  • Visiting every node in a given random-walk path from start to end, it learns embeddings that maximize the probability of each node's neighbors (within the window size) at every walk step.
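
  • After the training loop shown earlier has finished, the learned node embeddings can simply be read off the phi matrix. A minimal sketch (my addition, not part of the original code):

embeddings = hierarchicalModel.phi.detach()   # row i is the d-dimensional embedding of vertex i
print(embeddings.shape)                       # torch.Size([8, 64]) for the toy graph above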



2. Node2Vec


  • node2vec is an extension of DeepWalk, also fundamentally based on the random walk algorithm.

  • The key idea is to introduce a search bias term $\alpha$ applied to the edge weights, which numerically balances between breadth-first search (BFS) and depth-first search (DFS) during random walks.

    • BFS

      • Constrain the neighborhoods to nodes that are immediate neighbors of the source.

      • Captures a micro-view of local connectivity centered around the source node.

    • DFS

      • Neighborhood consists of nodes sequentially sampled at increasing distances from the source node.

      • Explores broader parts of the network as it moves further away from the source node, and hence, reflects the macro-view of the neighborhoods.

  • By flexibly interpolating between BFS and DFS, node2vec can capture both local and global graph structures, making it highly effective for a wide range of graph-related tasks.


2.1. Biased Random Walks


2.1.1. Random Walk Probability


  • Starting from a source node $\large u$, a random walk of fixed length $\large l$ is generated.

  • $\large c_{i}$ : ith node in the walk. ($\large c_{0} \, = \, u$)

  • $\large \pi_{vx}$ : unnormalized transition probability from the $\large (i-1)^{th}$ node ($\large v$) to the $\large i^{th}$ node ($\large x$)

  • $\large Z$ : normalizing constant
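
  • Putting these together, the transition probability defined in the node2vec paper is

    $\large P(c_{i} = x \,|\, c_{i-1} = v) = \dfrac{\pi_{vx}}{Z}$ if $\large (v, x) \in E$, and $\large 0$ otherwise.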


2.1.2. Determining $\large \pi_{vx}$ with Bias $\alpha$


  • Consider a random walk that has just traversed the edge $\large (t, v)$ and currently resides on node $\large v$.

  • Transition probability of edge $\large (v, x)$ is represented as follows

    • $\large \pi_{vx} = \alpha_{pq}(t, x) \, \times \, w_{vx}$

  • $\large d_{tx}$ : shortest path distance from node $\large t$ to $\large x$, one of {$0, 1, 2$}

    • 0 : returns back to $\large t$

    • 1 : $\large x$, a common neighbor of both $\large t$ and $\large v$

    • 2 : one step further away from node $\large t$

    • Based on the distance $\large d_{tx}$, node2vec determines the bias to be applied to the target edge $\large (v, x)$.
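
    • Combined with the edge weight as above, the node2vec paper defines the bias (with parameters $\large p$ and $\large q$ described in the next subsection) as

      $\large \alpha_{pq}(t, x) = \begin{cases} 1/p & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ 1/q & \text{if } d_{tx} = 2 \end{cases}$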


2.1.3. Parameters $\large p$ and $\large q$


  • Intuitively, parameters p and q control how fast the walk explores and leaves the neighborhood of source node u.

  • In other words, these are the factors for balancing between the two extreme search paradigms, BFS and DFS.

  • $\large p$ : return parameter

    • Controls the likelihood of immediately revisiting the previous node in the walk.

    • $\large d_{tx} \, = \, 0$

    • A smaller $\large p$ makes the walk more likely to backtrack.

    • BFS-like exploration.

  • $\large q$ : In-Out parameter

    • allows the search to distinguish inward and outward propagation.

    • $\large d_{tx} \, = \, 2$

    • A higher $\large q$ biases the walk toward nodes close to node $\large t$, making it reluctant to move outwards.

    • A lower $\large q$, on the other hand, encourages outward exploration, making the walk more likely to obtain an extended view of the neighborhood.

    • DFS-like exploration.

  • Setting both $\large p$ and $\large q$ to 1 (uniform bias for all $\large d_{tx}$) recovers the previously introduced DeepWalk.

  • By biasing the random walk probability with these parameters p and q, node2vec flexibly explores various types of network structures, incorporating both BFS and DFS methods.

  • Hence, it can learn node embeddings that capture a more comprehensive representation of the graph’s neighborhood structure.
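
  • To make the biased step concrete, below is a minimal sketch (my own illustration, not the authors' implementation), assuming an unweighted graph stored as an adjacency list like the adj_list used earlier; the function name biased_step and the defaults for p and q are hypothetical.

import random

def biased_step(prev, cur, adj_list, p=1.0, q=1.0):
    # Sample the next node after traversing edge (prev, cur), biased by p and q.
    weights = []
    for x in adj_list[cur]:
        if x == prev:                    # d_tx = 0 : return to the previous node
            weights.append(1.0 / p)
        elif x in adj_list[prev]:        # d_tx = 1 : common neighbor of prev and cur
            weights.append(1.0)
        else:                            # d_tx = 2 : move further away from prev
            weights.append(1.0 / q)
    return random.choices(adj_list[cur], weights=weights, k=1)[0]

  • With $\large p = q = 1$ every neighbor gets weight 1, which recovers the uniform DeepWalk step shown earlier.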


2.2. Comparison to Other Node Embedding Strategies


  • The paper demonstrates the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains.

Table 2 : Macro-F1 scores for multilabel classification tasks with 50% of the nodes labeled for training.

  • BlogCatalog: a network of social relationships of the bloggers listed on the BlogCatalog website.

  • Protein-Protein Interactions (PPI) : a subgraph of the PPI network for Homo Sapiens.

  • Wikipedia : cooccurrence network of words appearing in the first million bytes of the Wikipedia dump.




  • Up until now, I’ve examined two graph embedding methods based on random walks, DeepWalk and node2vec, which are considered relatively early works in the field.

  • Future posts will address different types of graph embedding approaches, such as SDNE, GraphSAGE, and GAT, which do not rely on random walk algorithms.