# Data-Free Knowledge Distillation

14 May 2023 | 34 min read

## Introduction

In this post, I'm going to review the current state of the art in data-free knowledge distillation. I'll start with the basics of knowledge distillation, go through the motivation for the data-free setting, and then introduce several works, including the current state of the art, that I consider important for understanding the subject.

## Knowledge Distillation

Knowledge distillation is the technique of transferring knowledge from a model or ensemble of models (teacher) to another model (student). As the field moves towards larger models, it is becoming increasingly important to find ways to transfer that knowledge to smaller models that can be deployed on hardware with limited resources or in applications where energy consumption is a concern. Some of the most common applications of knowledge distillation are:

• Transfer to a smaller version of the teacher model with fewer layers and fewer neurons per layer
• Transfer to a smaller architecture
• Transfer to a quantized version of the teacher model

### Response-based KD

The most common form of knowledge distillation is response-based knowledge distillation, in which the student model is trained to mimic the output of the teacher model. The following equation shows a possible loss function for a multi-class classification problem:

$\mathcal{L}_{KD} = D_{KL}(\text{Softmax}(T(\mathbf{x})) \,\|\, \text{Softmax}(S(\mathbf{x})))$
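As a sketch, this loss can be written in PyTorch; the temperature, which softens both distributions as in Hinton et al. (2015), is a common addition not shown in the equation above:

```python
import torch
import torch.nn.functional as F

def response_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions.

    The temperature follows Hinton et al. (2015); temperature = 1
    recovers the plain softmax matching in the equation above.
    """
    t = F.softmax(teacher_logits / temperature, dim=-1)
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as its first argument; the T^2
    # factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_s, t, reduction="batchmean") * temperature ** 2
```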

### Feature-based KD

In feature-based knowledge distillation, the student model is trained to mimic the intermediate-layer features of the teacher model. The following equation shows a possible loss function that minimizes the mean squared error between the features of the teacher and the student, where $f_T$ and $f_S$ are the features of the teacher and student respectively, and $\Phi_T$ and $\Phi_S$ are projections that map the features of the teacher and student to the same space:

$\mathcal{L}_{KD} = \text{MSE}(\Phi_T(f_T(x)), \Phi_S(f_S(x)))$
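A minimal sketch of this setup; the learnable linear projections and the dimensions are illustrative, not taken from any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureKDLoss(nn.Module):
    """MSE between teacher and student features after mapping both
    into a shared space with learnable linear projections."""

    def __init__(self, d_teacher, d_student, d_shared):
        super().__init__()
        self.phi_t = nn.Linear(d_teacher, d_shared)  # Phi_T
        self.phi_s = nn.Linear(d_student, d_shared)  # Phi_S

    def forward(self, f_t, f_s):
        # f_t: (batch, d_teacher), f_s: (batch, d_student)
        return F.mse_loss(self.phi_t(f_t), self.phi_s(f_s))
```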

### Relation-based KD

Relation-based knowledge distillation explores relations between different layers and data samples and optimizes the student model to mimic those relations. For example, Yim et al. (2017) use the FSP (flow of solution procedure) matrix, a Gram matrix computed between the features of two layers, to measure the relation between them. The student model is then trained to minimize the difference between its FSP matrices and those of the teacher model.

## Data-Free Knowledge Distillation

In many cases, the teacher model is trained on datasets that are very large or not publicly available for copyright or privacy reasons, which makes it difficult to distil knowledge from those models using the same data they were trained on. Introduced in Lopes et al. (2017), data-free knowledge distillation is the process of distilling knowledge from a model without using any data. That original work used metadata collected from the original dataset to reconstruct a similar dataset. However, in many cases, such metadata is not available or is difficult to obtain. For that reason, subsequent works have proposed new methods that try to create a dataset without any metadata.

In this section, I'll introduce several works that I consider important to understand the current state-of-the-art in data-free knowledge distillation.

### Zero-Shot Knowledge Distillation in Deep Networks

ZSKD introduces a method to synthesize data impressions by optimising the output of the teacher model to match a sample from a Dirichlet distribution. Since the teacher is frozen, the only free parameter in this optimisation is the data impression itself.

Given an output $s$ after the softmax layer of the teacher model, which is a vector of probabilities of a certain input belonging to each class, the authors propose to model the probability of the output $y$ belonging to a class $c$ as a Dirichlet distribution, $p(y^c) = \text{Dir}(C, \alpha^c)$, where $C$ is the number of classes and $\alpha^c$ is the concentration parameter of the distribution.

To find the concentration parameter $\alpha^c$, they compute the cosine similarity matrix between all $C$ classes, using the weight vectors of the teacher's final layer, and take the $c$-th row of that matrix as $\alpha^c$.

Then, for each class $c$, $N$ outputs $y$ are sampled from the Dirichlet distribution, and one data impression is optimized for each $y$ using a cross-entropy loss.
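The per-impression optimization loop can be sketched in PyTorch; the tiny linear teacher, step count, and learning rate below are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def synthesize_impression(teacher, alpha_c, shape, steps=200, lr=0.05):
    """One ZSKD-style data impression: sample a soft target from
    Dir(alpha_c), then optimize the input so that the frozen teacher's
    softmax output matches it via cross-entropy with the soft target."""
    y = torch.distributions.Dirichlet(alpha_c).sample()  # soft target for class c
    x = torch.randn(shape, requires_grad=True)           # the data impression
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = F.log_softmax(teacher(x), dim=-1)
        loss = -(y * log_p).sum()                        # cross-entropy with soft y
        loss.backward()
        opt.step()
    return x.detach(), y
```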

### Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion

DeepInversion synthesizes data impressions $\hat{x}$ that maximize the confidence of the teacher for a target class $y_i$ while regularizing the inputs with an image prior (their total variation and $\ell_2$ norm) and a feature term based on the statistics stored in the batch normalization layers. By matching the batch-norm statistics, DeepInversion is able to generate data impressions that recover some of the statistics of the original dataset.

The loss function is the following:

$\mathcal{L} = \mathcal{L}_{CE}(T(\hat{x}), y_i) + \lambda_{TV} \mathcal{L}_{TV} + \lambda_{\ell_2} \mathcal{L}_{\ell_2} + \lambda_{feature} \mathcal{L}_{feature}$
$\mathcal{L}_{TV} = TV(\hat{x})$
$\mathcal{L}_{\ell_2} = \left\| \hat{x} \right\|_2^2$
$\mathcal{L}_{feature} = \sum_l \left\| \mu_l(\hat{x})-\mathbb{E}[\mu_l(x)] \right\|_2 + \sum_l \left\| \sigma^2_l(\hat{x})-\mathbb{E}[\sigma^2_l(x)] \right\|_2$
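The feature term can be implemented with forward hooks on the batch-norm layers. This is a sketch of that idea (assuming 2D batch norm), not the authors' implementation:

```python
import torch
import torch.nn as nn

class BNStatLoss:
    """Records, via a forward hook, the distance between the batch
    statistics of the current (synthetic) inputs to a BatchNorm2d layer
    and that layer's running statistics from the original training."""

    def __init__(self, bn):
        self.value = torch.tensor(0.0)
        bn.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        x = inputs[0]
        mu = x.mean(dim=(0, 2, 3))                  # per-channel batch mean
        var = x.var(dim=(0, 2, 3), unbiased=False)  # per-channel batch variance
        self.value = (mu - module.running_mean).norm(2) + \
                     (var - module.running_var).norm(2)
```

Summing `value` over all hooked layers gives the feature term; the synthetic batch is then updated by backpropagating through that sum.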

To further improve the quality of the generated data impressions, the authors propose Adaptive DeepInversion. Given that the student's knowledge evolves during training, it makes sense to generate data impressions where the teacher and student disagree. The Adaptive DeepInversion loss extends the DeepInversion loss with a term that maximizes the Jensen-Shannon divergence between the teacher and student outputs.

$\mathcal{L}_{AdaptiveDeepInversion} = \mathcal{L}_{DeepInversion} + \lambda_{Adversarial} (1 - JS(T(\hat{x}), S(\hat{x})))$
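The Jensen-Shannon term can be computed as follows (a sketch; the epsilon is only for numerical stability):

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits, eps=1e-8):
    """Jensen-Shannon divergence between two softmax distributions,
    averaged over the batch; bounded in [0, log 2]."""
    p = F.softmax(p_logits, dim=1)
    q = F.softmax(q_logits, dim=1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(dim=1)
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)).mean()
```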

### Data-Free Learning of Student Networks

DAFL is another method that uses a generative model to generate data impressions. The generator is trained with three terms: a cross-entropy loss that minimises the entropy of the teacher's predictions (making them confident), an activation loss that maximises the $\ell_1$ norm of the teacher's features, and an information-entropy loss that maximises the entropy of the teacher's average output, encouraging a balanced representation of the different classes.
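These three objectives can be sketched as follows; the names, and the use of pre-logit features for the activation term, are my assumptions:

```python
import torch
import torch.nn.functional as F

def dafl_losses(logits, features):
    """The three DAFL generator objectives for a batch of synthetic
    samples: one-hot loss (CE against the teacher's own argmax),
    activation loss (negative L1 norm of teacher features), and
    information-entropy loss (negative entropy of the mean prediction)."""
    pseudo = logits.argmax(dim=1)
    l_onehot = F.cross_entropy(logits, pseudo)       # confident predictions
    l_act = -features.abs().mean()                   # high feature activations
    p_bar = F.softmax(logits, dim=1).mean(dim=0)     # average class distribution
    l_ie = (p_bar * torch.log(p_bar + 1e-8)).sum()   # balanced class usage
    return l_onehot, l_act, l_ie
```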

### Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis

Wang (2021) introduces a method that models the output of the $l$-th layer of the teacher as a multivariate normal distribution to generate data impressions. Considering a teacher model $T$ with $L$ layers, the output of the $l$-th layer is $s^l$, a vector of size $K$. $s^l$ is modeled as a multivariate normal with mean $\mu^l$ and covariance matrix $\Sigma^l$, where the mean is taken to be $\mu^l=0$. To obtain the covariance matrix $\Sigma^l$, they argue that the correlation between the outputs of the $l$-th layer is implicitly encoded in that layer's weights, and therefore use them to compute the correlation matrix, $R^l_{ij}=\frac{w_i^\top w_j}{\left\| w_i \right\| \cdot \left\| w_j \right\|}$. The covariance matrix is then given by $\Sigma^l = D R^l D$, where $D$ is a learnable diagonal scaling matrix.

To generate data impressions, they sample in total $N$ outputs $y^l_{soft}$ from the distributions of the different layers, together with $N$ noise inputs, and optimize each data impression to minimise the KL divergence between the sampled target of the $l$-th layer, $y_{soft}^l$, and the current output of the $l$-th layer, $\hat{y}^l$, as shown in the following equation:

$\mathcal{L}_{mv} = \text{KL}(y_{soft}^l || \hat{y}^l)$
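The sampling step can be sketched as follows; the jitter added for numerical stability is my own assumption:

```python
import torch

def sample_soft_targets(weight, d_scale, n):
    """Draw N soft targets for one layer: the correlation matrix comes
    from the cosine similarity of the layer's weight rows, is scaled
    into a covariance Sigma = D R D, and a zero-mean multivariate
    normal is sampled from it."""
    w = weight / weight.norm(dim=1, keepdim=True)  # unit-norm rows
    corr = w @ w.t()                               # R^l: cosine similarities
    d = torch.diag(d_scale)
    cov = d @ corr @ d                             # Sigma^l = D R^l D
    cov = cov + 1e-5 * torch.eye(cov.shape[0])     # jitter for stability
    mean = torch.zeros(weight.shape[0])
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    return dist.sample((n,))
```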

Additionally, as is common in synthetic data generation, they add a term to the loss function that encourages higher activations in the last convolutional layer. This term is given by:

$\mathcal{L}_{act} = - \frac{1}{n} \sum_{i=1}^n \left\| s^{lastConv}_i \right\|_1$

The final loss function to optimize the data impressions is given by:

$\mathcal{L} = \mathcal{L}_{mv} + \lambda_{act} \mathcal{L}_{act}$

### Zero-shot Knowledge Transfer via Adversarial Belief Matching

ZSKT trains a generator to produce the data impressions in an adversarial way, i.e., to maximize the disagreement between the teacher and the student models:

$\mathcal{L}_{adv} = -D_\text{KL}(T(x) || S(x))$

Additionally, when transferring knowledge from the teacher to the student, they use an extra loss term that exploits the fact that the two networks often share a similar block structure: it minimizes the $\ell_2$ distance between the normalized spatial attention maps of corresponding hidden layers. The spatial attention map of a layer $l$ is denoted $\mathcal{A}_l$ and is computed as $\mathcal{A}_l = \frac{1}{C_l} \sum_{c=1}^{C_l} a_{lc}^2$, where $C_l$ is the number of channels of layer $l$ and $a_{lc}$ is the activation of channel $c$ of layer $l$. The final distillation loss is given by:

$\mathcal{L}_{dist} = \mathcal{L}_{CE}(S(x), T(x)) + \beta \sum_{l=1}^L \left\| \frac{\mathcal{A}^S_l}{\left\| \mathcal{A}^S_l \right\|_2} - \frac{\mathcal{A}^T_l}{\left\| \mathcal{A}^T_l \right\|_2} \right\|_2$
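A sketch of the attention-matching term; the default weight `beta` is illustrative, not the paper's value:

```python
import torch
import torch.nn.functional as F

def attention_map(a):
    """Spatial attention map A_l: channel-wise mean of squared activations."""
    return a.pow(2).mean(dim=1)  # (B, C, H, W) -> (B, H, W)

def attention_loss(acts_s, acts_t, beta=250.0):
    """l2 distance between normalized, flattened attention maps of
    paired student/teacher layer activations, summed over the pairs."""
    loss = torch.tensor(0.0)
    for a_s, a_t in zip(acts_s, acts_t):
        m_s = F.normalize(attention_map(a_s).flatten(1), dim=1)
        m_t = F.normalize(attention_map(a_t).flatten(1), dim=1)
        loss = loss + (m_s - m_t).norm(2, dim=1).mean()
    return beta * loss
```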

### Data-free network quantization with adversarial knowledge distillation

DFQ is a method to distil knowledge from a teacher model into a quantized student model. To avoid the mode collapse that tends to affect generator networks, which leads to low diversity in the generated data impressions, they use ensembles of generators and of students, with the students acting as discriminators. To optimize the generators, they use a batch-normalization loss, minimise the entropy of each teacher output, and maximise the entropy of the average teacher output over the batch. Moreover, to ensure that the generated data impressions are diverse, they employ the KL divergence between the teacher and student outputs as the discriminator loss. The resulting minimax objective is:

$\min_{\{S_i\}_{i=1}^{S}} \max_{\{g_j\}_{j=1}^{G}} \sum_{j=1}^G\left(\frac{1}{S}\sum_{i=1}^{S}\mathcal{L}_{discr} - \alpha\, \mathcal{L}_{dfq} \right)$
$\mathcal{L}_{discr}=\mathbb{E}_{\rho(z)}[D_{KL}(T \circ g_j(z) \,\|\, S_i \circ g_j(z))]$
$\mathcal{L}_{dfq} = \sum_{l,c} D_{KL}^\mathcal{N}\big((\mu_{l,c}(\hat{x}), \sigma_{l,c}^2(\hat{x})) \,\|\, (\mu_{l,c},\sigma^2_{l,c})\big) + \mathbb{E}_{\rho(z)}[H(T \circ g_j(z))] - H(\mathbb{E}_{\rho(z)}[T \circ g_j(z)])$

### Contrastive Model Inversion for Data-Free Knowledge Distillation

CMI starts from the observation that previous methods for data-free knowledge distillation produce sets of data impressions with a low level of diversity. To address this issue, they propose using a contrastive loss as a diversity indicator and thereby create a method that generates a more diverse set of data impressions. To measure the similarity between data impressions, they add a new model $h=f \circ T$, which projects the output of the teacher for an input $x$ into a new embedding space, and then compute the cosine similarity between the embeddings of the data impressions:

$\text{sim}(x_i, x_j) = \frac{h(x_i)^\top h(x_j)}{\left\| h(x_i) \right\| \cdot \left\| h(x_j) \right\|}$

Similarly to Adaptive DeepInversion, they employ a unified inversion framework to generate data impressions, consisting of a class-conditional loss (to cover all possible classes), a batch-normalization loss, and an adversarial generation loss (to cover the space where the teacher and student disagree). However, they optimize a generative model to produce data impressions instead of optimizing the data impressions directly, and the adversarial loss is only applied when the teacher and the student predict the same class.

The final loss is given by:

$\mathcal{L}_{CMI} = \mathcal{L}_{CE}(T(x), y_i) + \lambda_{bn} \mathcal{L}_{bn} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{ctr} \mathcal{L}_{ctr}$
$\mathcal{L}_{bn} = \sum_{l} D_{KL}^\mathcal{N}\big((\mu_{l}(x), \sigma_{l}^2(x)) \,\|\, (\mu_{l},\sigma^2_{l})\big)$
$\mathcal{L}_{adv} = -D_\text{KL}(T(x) \,\|\, S(x)) \cdot \mathbb{1} \{\arg\max T(x) = \arg\max S(x)\}$
$\mathcal{L}_{ctr} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\text{sim}(x_i, \tilde{x}_i))}{\sum_{j=1}^N \exp(\text{sim}(x_i, x_j))}$
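Ignoring the details of how the set of compared impressions is maintained, an InfoNCE-style version of the contrastive term over embeddings from $h$ can be sketched as follows; `z_tilde` (embeddings of the augmented views $\tilde{x}_i$) and the temperature `tau` are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_tilde, tau=0.1):
    """InfoNCE-style loss over cosine similarities: each sample should
    be close to its augmented view and far from the other samples."""
    z = F.normalize(z, dim=1)
    z_tilde = F.normalize(z_tilde, dim=1)
    pos = (z * z_tilde).sum(dim=1) / tau  # sim(x_i, x~_i)
    sim = (z @ z.t()) / tau               # sim(x_i, x_j) for all j
    return -(pos - torch.logsumexp(sim, dim=1)).mean()
```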

### Up to 100× Faster Data-Free Knowledge Distillation

FastDFKD is a method that reuses the features of the generator through meta updates, allowing data impressions to be generated much faster than in previous methods, which must learn to generate data impressions from scratch for every batch. According to the authors, this results in speedups of up to 100x in the generation of data impressions.

## Conclusion

In this post, I've reviewed the current state of the art in data-free knowledge distillation. While many methods have been proposed, I'm still not convinced that data-free knowledge distillation is a viable solution for distilling knowledge from large models in real-world applications, mainly because the datasets used to benchmark these methods are very small, use low-resolution images, and do not represent the complexity of real-world datasets. The comparison between the different methods is also unclear, making it difficult to tell which method is better than the others, and the reproducibility of the results and the availability of code are lacking in some cases.

Another problem with this field is the weak motivation for the proposed methods. While all of them argue that they might be useful in scenarios where the data is not available, I'm not convinced this is a real problem most of the time. I would argue that even when the original data is unavailable, it is often possible to find a similar dataset that can serve as a starting point, with methods like these used to enrich it. Finally, I expect the coming months to bring new methods that leverage the recent advances in diffusion models to generate data impressions.

## References

Chen, H., Wang, Y., Xu, C., Yang, Z., Liu, C., Shi, B., Xu, C., Xu, C., & Tian, Q. (2019). Data-Free Learning of Student Networks. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 3513–3521.
Choi, Y., Choi, J. P., El-Khamy, M., & Lee, J. (2020). Data-Free Network Quantization With Adversarial Knowledge Distillation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3047–3057.
Fang, G., Mo, K., Wang, X., Song, J., Bei, S., Zhang, H., & Song, M. (2022). Up to 100x Faster Data-free Knowledge Distillation. AAAI Conference on Artificial Intelligence.
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network.
Lopes, R. G., Fenu, S., & Starner, T. (2017). Data-Free Knowledge Distillation for Deep Neural Networks.
Micaelli, P., & Storkey, A. (2019). Zero-shot Knowledge Transfer via Adversarial Belief Matching. arXiv. https://doi.org/10.48550/ARXIV.1905.09768
Nayak, G. K., Mopuri, K. R., Shaj, V., Babu, R. V., & Chakraborty, A. (2019). Zero-Shot Knowledge Distillation in Deep Networks.
Wang, Z. (2021). Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis.
Yim, J., Joo, D., Bae, J., & Kim, J. (2017). A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7130–7138. https://doi.org/10.1109/CVPR.2017.754
Yin, H., Molchanov, P., Li, Z., Alvarez, J. M., Mallya, A., Hoiem, D., Jha, N. K., & Kautz, J. (2019). Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion. arXiv. https://doi.org/10.48550/ARXIV.1912.08795