[Paper Review] Visualizing Data using t-Distributed Stochastic Neighbor Embedding (t-SNE, 2008)


2023, Aug 24    


1. t-SNE : Dimensionality Reduction for Visualization


  • t-SNE, introduced by van der Maaten and Hinton in 2008, is an improvement of SNE (Hinton and Roweis, 2002) and is widely used for embedding high-dimensional data into a lower-dimensional representation while preserving both the local and global structure of the original data.

  • Representing non-linear manifolds in lower dimensions requires capturing the data's structure at various scales, which can be roughly divided into two types: local and global.

  • While global structure can be preserved by keeping dissimilar data points far apart, creating distinctly separated clusters, maintaining local structure is about keeping similar data points close together without causing crowding.

  • Unlike linear dimensionality reduction methods such as PCA and classical MDS, which mainly focus on preserving global structure, t-SNE is able to capture both the local and global structure of the data with a few modifications added to the original SNE (which is more oriented toward capturing local structure).


2. Stochastic Neighbor Embedding (SNE)


  • The basic idea of SNE is to model pairwise similarities between all data points in both the high-dimensional space (p) and the low-dimensional map (q).

    • The similarity between two data points ($\large i, j$) is given by the conditional probabilities $\large p_{j|i}$ and $\large q_{j|i}$ (written out after this list).

    • It is assumed that the similarity distribution of every data point follows a Gaussian distribution.

    • The Gaussian used for all lower-dimensional map points shares an identical bandwidth, $\sigma \, = \, \cfrac{1}{\sqrt{2}}$ (so that $2\sigma^{2} = 1$).

  • As the goal of manifold embedding is to map the data into a lower dimension while maintaining the original data structure, the cost function is given by the KL divergence between the similarity distributions of the high-dimensional and low-dimensional data.
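
  • Concretely, the conditional similarities from the paper are given below; the high-dimensional $\large p_{j|i}$ uses a per-point bandwidth $\large \sigma_{i}$, while the low-dimensional $\large q_{j|i}$ uses the fixed bandwidth above.

    • $\large p_{j|i} \, = \, \cfrac{\exp(-|x_{i} - x_{j}|^{2} \, / \, 2\sigma_{i}^{2})}{\sum_{k \ne i} \, \exp(-|x_{i} - x_{k}|^{2} \, / \, 2\sigma_{i}^{2})}$

    • $\large q_{j|i} \, = \, \cfrac{\exp(-|y_{i} - y_{j}|^{2})}{\sum_{k \ne i} \, \exp(-|y_{i} - y_{k}|^{2})}$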


2.1. Cost Function and Gradient


  • Then, SNE tries to minimize the sum of KL divergences over all data points using a gradient descent update (the cost function, gradient, and final update rule are written out at the end of this section).

    • As the conditional probabilities are not symmetric ($\large p_{j|i} \, \ne \, p_{i|j}$), errors in pairwise similarity are not equally weighted.

    • From the given cost function, one can notice that the cost of using widely separated map points to represent nearby data points (small $\large q_{j|i}$ and large $\large p_{j|i}$) is greater than in the opposite case, which means the embedding is more sensitive to capturing local structure (similar high-dimensional data points) than global structure.

  • In order to update the embeddings through gradient descent, compute the derivative of the cost function with respect to the embedded data points ($\large y_{i}$).

    • The gradient with respect to a target map point ($\large y_{i}$) is the sum of its differences from all other map points ($\large y_{j}$), each weighted by the mismatch in similarity between the map points and the corresponding data points.

    • Intuitively, the difference between map points ($\large y_{i} \, - \, y_{j}$) acts as a spring force (either repelling or attracting), and the mismatch sets the proportional stiffness that controls how strongly that spring force is exerted.

  • In order to speed up the optimization and prevent it from getting stuck in poor local optima, SNE adds a momentum term $\large \alpha(t)$ to the current gradient $\large \cfrac{\partial{C}}{\partial{\mathcal{Y}}}$.

    • Final Update Rule :
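
      • $\large \mathcal{Y}^{(t)} \, = \, \mathcal{Y}^{(t-1)} \, + \, \eta\,\cfrac{\partial{C}}{\partial{\mathcal{Y}}} \, + \, \alpha(t)\left(\mathcal{Y}^{(t-1)} \, - \, \mathcal{Y}^{(t-2)}\right)$

  • For reference, the cost function and gradient that this update minimizes are, from the paper,

    • $\large C \, = \, \sum_{i} KL(P_{i}\,\|\,Q_{i}) \, = \, \sum_{i}\sum_{j}\, p_{j|i}\log\cfrac{p_{j|i}}{q_{j|i}}$

    • $\large \cfrac{\partial{C}}{\partial{y_{i}}} \, = \, 2\sum_{j}\,(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_{i} - y_{j})$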


3. Algorithms of t-SNE



3.1. Symmetric SNE


  • As an alternative to the asymmetric pairwise similarities of SNE given by conditional probabilities, t-SNE uses a symmetric pairwise similarity with newly defined joint probabilities $\large p_{ij}$ and $\large q_{ij}$.


Higher Dimension $\large p_{ij}$


  • $\large p_{ij} \, = \, \cfrac{p_{j|i} + p_{i|j}}{2n}$

  • This ensures that for all $\large i$, $\sum_{j} p_{ij} \, > \, \cfrac{1}{2n}$, guaranteeing that every data point makes a significant enough contribution to the cost.

  • Similarly, t-SNE assigns a distinct variance $\large \sigma_{i}$ to each data point.

    • $\large p_{ij} \, = \, \cfrac{\exp(-|x_{i} - x_{j}|^{2} \, / \, 2\sigma_{i}^{2})}{\sum_{k \ne l} \, \exp(-|x_{k} - x_{l}|^{2} \, / \, 2\sigma_{k}^{2})}$

    • Then, $\large \sum_{i, j}\,p_{ij}$ adds up to 1 in t-SNE.

    • $\large \sigma_{i}$ is determined by a binary search (see section 4.1).


Lower Dimension $\large q_{ij}$
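
  • In the low-dimensional map, symmetric SNE defines the joint probability with a Gaussian kernel of fixed bandwidth (as in the paper) :

    • $\large q_{ij} \, = \, \cfrac{\exp(-|y_{i} - y_{j}|^{2})}{\sum_{k \ne l} \, \exp(-|y_{k} - y_{l}|^{2})}$

  • t-SNE replaces this Gaussian kernel with a Student t-distribution, as described in the next section.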



3.2. t-Distributed Distance : Thicker Tail for Mismatch to Alleviate Crowding Problem


  • t-SNE uses a Student t-distribution with 1 degree of freedom to model the distribution of distances between map points ($\large y_{i} - y_{j}$).

    • A Student t-distribution with 1 degree of freedom is the standard Cauchy distribution, whose density is proportional to $\large (1 + t^{2})^{-1}$.

    • Accordingly, the unnormalized similarity between two map points in the lower dimension is $\large (1 \, + \, |y_{i} - y_{j}|^{2})^{-1}$.

    • Then, the joint probability for pairwise similarities is adjusted to the expression below.
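
    • $\large q_{ij} \, = \, \cfrac{(1 \, + \, |y_{i} - y_{j}|^{2})^{-1}}{\sum_{k \ne l} \, (1 \, + \, |y_{k} - y_{l}|^{2})^{-1}}$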


Crowding Problem and Student t-Distribution


  • The crowding problem refers to the phenomenon where points that are distant in high-dimensional space become crowded or clustered together in the lower-dimensional map, which can lead to the distortion and overlap of clusters.

  • This occurs mainly because the map does not have enough room (dimensional space) to accommodate data points that are moderately distant in the higher dimension, due to its limited dimensionality.

  • In order to represent moderate and small distances in the map accurately, t-SNE uses a t-distribution with 1 degree of freedom instead of a Gaussian distribution to model the distribution of pairwise distances between map points.

  • The Student t-distribution has a Gaussian-like bell shape but with thicker tails.

    • The curve is sharpest for the standard normal (Gaussian) distribution, and the tails become thicker as the degrees of freedom decrease.

    • The tail of the Gaussian distribution is so thin that it cannot properly model moderate and large distances; using a Gaussian for the map distances therefore collapses the pattern at larger scales, and points that are widely separated in the higher dimension end up brought together and densely packed into a small region.

    • The heavy-tailed t-distribution, on the other hand, provides enough space to place these distant pairs and thus distinguishes the distances more accurately.

  • Hence, t-SNE can alleviate the crowding problem by spreading out the map points that are moderately or widely separated in higher dimensional space.


3.3. Cost Function and Gradient of t-SNE


  • Using the symmetric joint probabilities, the cost function of t-SNE can be written as a single KL divergence between the joint distributions $\large P$ and $\large Q$.

  • The gradient with respect to the map points follows by differentiating this cost; both are written out below (the full derivation is given in the appendix of the paper).
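
  • Cost function and gradient of t-SNE, from the paper :

    • $\large C \, = \, KL(P\,\|\,Q) \, = \, \sum_{i}\sum_{j}\, p_{ij}\log\cfrac{p_{ij}}{q_{ij}}$

    • $\large \cfrac{\partial{C}}{\partial{y_{i}}} \, = \, 4\sum_{j}\,(p_{ij} - q_{ij})(y_{i} - y_{j})(1 \, + \, |y_{i} - y_{j}|^{2})^{-1}$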


Advantages of the t-SNE Gradient over SNE


  • The first two terms are identical to SNE, while the third term, $\large (1 + |y_{i} - y_{j}|^{2})^{-1}$, is new in t-SNE.

  • This term rescales the magnitude of the spring force exerted according to the distance between the map points.


   Figure 1: Gradients of three types of SNE

  • Both repel close dissimilar points and attract distant similar points, but t-SNE shows a more balanced and stable gradient in both cases (a quick numerical check follows this list).
  1. Stronger repulsion of dissimilar data points that are mapped close together in the lower dimension.

    • As closely mapped points have a smaller $\large |y_{i} - y_{j}|^{2}$ value, the third term $\large (1 + |y_{i} - y_{j}|^{2})^{-1}$ relatively upscales the gradient of closely mapped dissimilar points compared to the one computed from SNE (a relative rescaling, not an absolute one).

    • Figure 1-(a) shows that SNE has a relatively weak repulsion gradient for closely mapped dissimilar points compared to the stronger attraction force exerted on similar data points that are mapped far apart.

    • t-SNE, presented in Figure 1-(c), however, introduces a strong repulsion that is comparable to the strong attraction.

  2. Moderates the attraction force, preventing crowding

    • As presented in Figure 1-(a), SNE tends to produce an extreme attraction gradient for widely separated similar data points, which can cause the crowding problem.

    • However, figure 1-(c) shows that the magnitude of attraction gradient in t-SNE becomes more moderate and stable.
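
  • A quick numerical check of this rescaling (my own illustration, not from the paper): for a close pair with $\large |y_{i} - y_{j}|^{2} = 0.01$, the third term is $\large \approx 0.99$, so the repulsion between nearby dissimilar points is left almost untouched, whereas for a distant pair with $\large |y_{i} - y_{j}|^{2} = 100$, it is $\large \approx 0.0099$, so the attraction pulling far-apart similar points together is damped by roughly two orders of magnitude.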


4. $\large \sigma$ and Perplexity


Perplexity : Determining the Size of the Effective Neighborhood


  • Perplexity is defined as $\large 2^{H(P_{i})}$

    • $\large H(P_{i})$ is the information entropy of $\large P_{i}$ measured in bits.

    • $\large H(P_{i}) \, = \, -\sum_{j}\, p_{j|i}\log_{2}(p_{j|i})$

  • Intuitively, perplexity is a hyperparameter that determines the size of the neighborhood, i.e., how many neighboring points are effectively considered when the embedding is constructed.

  • In more detail, perplexity controls the effective number of nearby points that each data point takes into account when constructing the high-dimensional similarity distribution $\large P_{i}$ (through the bandwidth $\large \sigma_{i}$).

  • As the performance of t-SNE is fairly robust to changes in perplexity, typical values lie between 5 and 50.

  • After choosing the perplexity, t-SNE performs a binary search to find the $\large \sigma_{i}$ that is consistent with the specified perplexity.

    • Note that $\large \sigma_{i}$ affects the probability distribution and thus the perplexity.
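
  • As a quick sanity check of this interpretation: if $\large P_{i}$ were uniform over exactly $\large k$ neighbors, then $\large H(P_{i}) \, = \, \log_{2}k$ and the perplexity would be $\large 2^{\log_{2}k} \, = \, k$, so in this idealized case the perplexity literally counts the neighbors.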


4.1. Binary Search to Find the $\large \sigma$ Matched with the Perplexity


  • Basic rule of the binary search

    • Set an initial $\large \sigma$ value and calculate the perplexity it produces.

    • If the computed perplexity is greater than the desired one, scale $\large \sigma$ down by a factor (initially 2), and scale it up otherwise; the factor shrinks toward 1 each time the comparison flips sign.

    • Repeat the above process until $\large \sigma$ converges (the factor reaches 1).

  • Implementation

    • Computing Entropy (Perplexity)

        import numpy as np

        def get_entropy(dist_, var):
            # entropy (in bits) of the similarity distribution induced by `var`;
            # log2 keeps it consistent with perplexity = 2 ** entropy (to_prob is defined below)
            prob = to_prob(dist_, var)
            entropy = -(prob * np.log2(prob + 1e-12)).sum()  # 1e-12 avoids log(0)
            return entropy


    • Compare the entropy computed with the current $\large \sigma$ against the desired one (take the sign of the difference), and update $\large \sigma$ based on the comparison. Repeat until convergence (the scaling factor decays to 1 once the sign stops changing).

        def binary_search_variance(dist, perplexity=30.0, verbose=False):
            # search for the variance whose induced distribution matches the desired
            # perplexity: start from var = 1 and repeatedly scale it by `factor`,
            # shrinking the factor toward 1 whenever the entropy comparison flips sign

            desired_entropy = np.log2(perplexity)

            var = 1
            decay = 0.9
            factor = 2

            previous_diff_sign = True

            for n_try in range(30):

                entropy = get_entropy(dist, var)
                entropy_diff = entropy - desired_entropy
                diff_sign = entropy_diff > 0

                # shrink the step size once the target has been overshot
                if previous_diff_sign != diff_sign:
                    factor = max(1, factor * decay)

                # entropy too high -> distribution too flat -> decrease the variance
                if entropy_diff > 0:
                    var /= factor
                else:
                    var *= factor

                if verbose:
                    print('var = {:f}, perplexity = {:f}'.format(var, 2 ** entropy))

                previous_diff_sign = diff_sign
                # factor == 1 means no further change is possible
                if factor == 1:
                    break

            perplexity = 2 ** entropy
            return var, perplexity
      


    • Using this search, t-SNE finds a distinct $\large \sigma$ for each data point and thus a probability distribution that satisfies the perplexity specified by the user.

        def to_prob(dist_, var):
            # Gaussian-kernel similarities normalized into a probability distribution
            # (`var` absorbs the 2 * sigma^2 term of the Gaussian)
            prob = np.exp(-(dist_.copy() ** 2) / var)
            prob = prob / prob.sum()
            return prob

        # dist_samples (not shown here) is assumed to hold, for each data point,
        # its distances to all other points
        for i, dist in enumerate(dist_samples):

            var, perplexity = binary_search_variance(dist, perplexity=40)
            prob = to_prob(dist, var)
      




  • To sum up, t-SNE is a dimensionality reduction technique that embeds high-dimensional data into a low-dimensional map for visualization.

  • It positions the map points so as to minimize the KL divergence between the pairwise similarity distributions of the lower and higher dimensions, using gradient descent with momentum.

  • t-SNE successfully captures both the local and global structure of the data by adopting symmetric SNE (local) and resolving crowding (global) with the Student t-distribution.
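
  • Putting the pieces together, the sketch below outlines the full t-SNE loop described in this post. It is a minimal, unoptimized illustration: it assumes the `binary_search_variance` and `to_prob` helpers from section 4.1 are in scope, uses the paper's momentum schedule with a fixed learning rate, and omits tricks such as early exaggeration; the function name `tsne_sketch` and the default values are illustrative only.

        import numpy as np

        def tsne_sketch(X, n_dims=2, perplexity=30.0, n_iter=500, lr=100.0, seed=0):
            # pairwise Euclidean distances in the original space
            n = X.shape[0]
            diff = X[:, None, :] - X[None, :, :]
            dist = np.sqrt((diff ** 2).sum(-1))

            # high-dimensional joint probabilities p_ij = (p_{j|i} + p_{i|j}) / 2n,
            # with a per-point variance found by the search from section 4.1
            P = np.zeros((n, n))
            for i in range(n):
                d_i = np.delete(dist[i], i)                  # distances to the other points
                var, _ = binary_search_variance(d_i, perplexity=perplexity)
                P[i, np.arange(n) != i] = to_prob(d_i, var)  # conditional p_{j|i}
            P = np.maximum((P + P.T) / (2 * n), 1e-12)

            rng = np.random.default_rng(seed)
            Y = rng.normal(scale=1e-4, size=(n, n_dims))     # small random initialization
            Y_prev = Y.copy()

            for t in range(n_iter):
                # low-dimensional affinities with the Student-t (Cauchy) kernel
                num = 1.0 / (1.0 + ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
                np.fill_diagonal(num, 0.0)
                Q = np.maximum(num / num.sum(), 1e-12)

                # gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + |y_i - y_j|^2)^{-1}
                W = (P - Q) * num
                grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y

                # descend the cost with the momentum schedule alpha(t) from the paper
                alpha = 0.5 if t < 250 else 0.8
                Y, Y_prev = Y - lr * grad + alpha * (Y - Y_prev), Y

            return Y

  • For example, calling `Y = tsne_sketch(X, perplexity=30)` on a feature matrix `X` returns 2-D coordinates that can be scatter-plotted to inspect the cluster structure.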