Table of Contents

Data-Free Knowledge Distillation

Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion

Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis

Zero-Shot Knowledge Transfer via Adversarial Belief Matching

Data-Free Network Quantization with Adversarial Knowledge Distillation

Contrastive Model Inversion for Data-Free Knowledge Distillation
Introduction
In this post, I'm going to review the current state of the art in data-free knowledge distillation. I'll start with the basics of knowledge distillation, go through the motivation for data-free knowledge distillation, and then introduce several works, including the current state of the art, that I consider important for understanding the subject.
Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) is the technique of transferring knowledge from a model or ensemble of models (the teacher) to another model (the student). As the field moves towards ever larger models, it is becoming increasingly important to find ways to transfer their knowledge to smaller models that can be deployed on hardware with limited resources or in applications where energy consumption is a concern. Some of the most common applications of knowledge distillation are:
- Transfer to a smaller version of the teacher model with fewer layers and fewer neurons per layer
- Transfer to a smaller architecture
- Transfer to a quantized version of the teacher model
Response-based KD
The most common form of knowledge distillation is response-based knowledge distillation, in which the student model is trained to mimic the output of the teacher model. A possible loss function for a multiclass classification problem is the cross-entropy between the softened teacher and student outputs:

$\mathcal{L}_{KD} = -\sum_{c=1}^{C} p_T^c(x; \tau) \log p_S^c(x; \tau)$

where $p_T(x;\tau)$ and $p_S(x;\tau)$ are the teacher and student softmax outputs with temperature $\tau$.
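As a minimal numpy sketch (not the exact formulation from any specific paper), the softened-target loss can be written as follows; the temperature value and the $\tau^2$ scaling follow the common convention from Hinton et al. (2015):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Cross-entropy between softened teacher and student distributions,
    scaled by tau^2 so gradients keep a similar magnitude across temperatures."""
    p_t = softmax(teacher_logits, tau)
    log_p_s = np.log(softmax(student_logits, tau))
    return -(tau ** 2) * np.mean((p_t * log_p_s).sum(axis=-1))
```

The loss is minimized (down to the teacher's own entropy) when the student logits reproduce the teacher's softened distribution.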
Feature-based KD
In feature-based knowledge distillation, the student model is trained to mimic the intermediate-layer features of the teacher model. A possible loss function minimizes the mean squared error between the features of the teacher and the student:

$\mathcal{L}_{feat} = \left\| \Phi_T\left(f_T(x)\right) - \Phi_S\left(f_S(x)\right) \right\|_2^2$

where $f_T$ and $f_S$ are the features of the teacher and student respectively, and $\Phi_T$ and $\Phi_S$ are projections that map the features of the teacher and student to the same space.
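A small numpy sketch of this idea, with made-up dimensions and simple linear maps standing in for the projections $\Phi_T$ and $\Phi_S$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: teacher features 64-d, student features 32-d,
# both projected into a shared 16-d space.
f_t = rng.normal(size=(8, 64))   # batch of teacher features
f_s = rng.normal(size=(8, 32))   # batch of student features
W_t = rng.normal(size=(64, 16))  # Phi_T as a linear projection
W_s = rng.normal(size=(32, 16))  # Phi_S as a linear projection

def feature_kd_loss(f_t, f_s, W_t, W_s):
    """MSE between teacher and student features after projecting
    both into the same embedding space."""
    return np.mean((f_t @ W_t - f_s @ W_s) ** 2)
```

The loss is zero exactly when the projected features coincide.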
Relation-based KD
Relation-based knowledge distillation explores relations between different layers and data samples and optimizes the student model to mimic those relations. For example, Yim et al. (2017) use the FSP (flow of solution procedure) matrix, the inner product between the features of two layers, to measure the relation between them. The student model is then trained to minimize the difference between its FSP matrices and the FSP matrices of the teacher model.
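A numpy sketch of the FSP matrix between two feature maps of the same spatial size (the layout and shapes here are my own assumption, channels-last):

```python
import numpy as np

def fsp_matrix(f1, f2):
    """FSP matrix between two feature maps of shapes (H, W, C1) and (H, W, C2):
    channel-wise inner products averaged over spatial positions -> (C1, C2)."""
    h, w, _ = f1.shape
    return np.einsum('hwi,hwj->ij', f1, f2) / (h * w)
```

Note that `fsp_matrix(f, f)` reduces to a (scaled) Gram matrix of the channels, which is symmetric.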
Data-Free Knowledge Distillation
In many cases, the teacher model is trained on datasets that are very large or that are not publicly available for copyright or privacy reasons, which makes it difficult to distil knowledge from those models using the same data they were trained on. Introduced in Lopes et al. (2017), data-free knowledge distillation is the process of distilling knowledge from a model without using any of the original data. That original work used metadata from the original dataset to reconstruct a similar dataset. However, in many cases the metadata is not available or is difficult to obtain. For that reason, subsequent works have proposed new methods that try to create a dataset without any metadata.
In this section, I'll introduce several works that I consider important to understand the current state of the art in data-free knowledge distillation.
Zero-Shot Knowledge Distillation in Deep Networks
ZSKD (Nayak et al., 2019) introduces a method to synthesize data impressions by optimizing the input so that the output of the teacher model matches a sample from a Dirichlet distribution. Since the teacher is frozen, the only free parameter in the loss function is the data impression itself.
Given an output $s$ after the softmax layer of the teacher model, which is a vector of probabilities of a certain input belonging to each class, the authors propose to model the probability of the output $y$ belonging to a class $c$ as a Dirichlet distribution, $p(y^c) = Dir(C, \alpha^c)$, where $C$ is the number of classes and $\alpha^c$ is the concentration parameter of the Dirichlet distribution.
To find the concentration parameter $\alpha^c$, they compute the cosine similarity matrix between the final-layer weight vectors of all $C$ classes and use the $c$th row of the matrix as the concentration parameter $\alpha^c$.
Then, for each class $c$, $N$ outputs $y$ are sampled from the Dirichlet distribution, and one data impression is optimized for each $y$ using a cross-entropy loss.
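The sampling step can be sketched in numpy as follows. One caveat: Dirichlet concentration parameters must be positive, so this sketch clips negative cosine similarities to a small positive value, which is my own assumption rather than the paper's exact treatment:

```python
import numpy as np

rng = np.random.default_rng(0)

def concentration_params(W, eps=1e-3):
    """Cosine-similarity matrix of the final-layer class weight vectors
    (rows of W); row c serves as the Dirichlet concentration for class c.
    Negative similarities are clipped to eps (assumption: Dirichlet
    parameters must be strictly positive)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.maximum(Wn @ Wn.T, eps)

# Hypothetical 5-class, 20-d final-layer weight matrix.
alpha = concentration_params(rng.normal(size=(5, 20)))
# N=10 soft targets for class 0; each one would drive one data impression.
y_samples = rng.dirichlet(alpha[0], size=10)
```

Each sampled row is a valid probability vector, i.e., nonnegative and summing to one.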
Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion
DeepInversion (Yin et al., 2019) synthesizes data impressions that maximize the confidence of the model for a class $y_i$ while regularizing with the statistics stored in the batch norm layers and with the total variation and $\ell_2$ norm of the synthesized inputs. By using the statistics of the batch norm layers, DeepInversion is able to generate data impressions that recover some of the statistics of the original dataset.
The loss function is the following:

$\min_{\hat{x}} \; \mathcal{L}\left(T(\hat{x}), y\right) + \alpha_{tv}\mathcal{R}_{TV}(\hat{x}) + \alpha_{\ell_2}\left\|\hat{x}\right\|_2^2 + \alpha_{f}\sum_{l}\left(\left\|\mu_l(\hat{x}) - \mu_l^{BN}\right\|_2 + \left\|\sigma_l^2(\hat{x}) - \sigma_l^{2,BN}\right\|_2\right)$

where $\mu_l(\hat{x})$ and $\sigma_l^2(\hat{x})$ are the batch mean and variance of the features at layer $l$, and $\mu_l^{BN}$ and $\sigma_l^{2,BN}$ are the running statistics stored in the corresponding batch norm layer.
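The batch-norm regularizer is the distinctive term, so here is a minimal numpy sketch of it, assuming we are handed per-layer feature batches and the running statistics extracted from the batch norm layers:

```python
import numpy as np

def bn_statistics_loss(feats, bn_means, bn_vars):
    """Feature-distribution regularizer: for each batch norm layer, match the
    per-channel mean/variance of the current batch of synthesized inputs to
    the running statistics stored in that layer."""
    loss = 0.0
    for f, mu_bn, var_bn in zip(feats, bn_means, bn_vars):
        loss += np.linalg.norm(f.mean(axis=0) - mu_bn)  # batch mean per channel
        loss += np.linalg.norm(f.var(axis=0) - var_bn)  # batch variance per channel
    return loss
```

The term vanishes exactly when the synthesized batch reproduces the stored statistics, which is what lets DeepInversion recover some dataset-level statistics.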
Adaptive DeepInversion
To further improve the quality of the generated data impressions, Yin et al. (2019) proposed Adaptive DeepInversion. Given that the student's knowledge evolves during training, it makes sense to generate data impressions on which the teacher and student disagree. The Adaptive DeepInversion loss extends the DeepInversion loss with a term that maximizes the Jensen-Shannon divergence between the teacher and student outputs.
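For reference, the Jensen-Shannon divergence between the two output distributions can be sketched as follows (the epsilon smoothing is my own numerical-stability choice):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two categorical distributions:
    0 when they agree, bounded above by log 2 when they are disjoint."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Adaptive DeepInversion adds the negative of this quantity to the inversion loss, so minimizing the total loss pushes the synthesized inputs toward regions where teacher and student disagree.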
Data-Free Learning of Student Networks
DAFL (Chen et al., 2019) is another method that uses a generative model to generate data impressions. They employ a cross-entropy loss against the teacher's own predicted labels to minimize the entropy of the teacher outputs, maximize the $\ell_1$ norm of the features extracted by the teacher, and maximize the information entropy of the average of the teacher outputs to encourage a balanced representation of the different classes.
Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis
Wang (2021) introduces a method that models the output of the $l$th layer of the teacher model as a multivariate normal distribution in order to generate data impressions. Considering a teacher model $T$ with $L$ layers, the output of the $l$th layer is $s^l$, a vector of size $K$. $s^l$ is modeled as a multivariate normal distribution with mean $\mu^l$ and covariance matrix $\Sigma^l$, with the mean fixed at $\mu^l=0$. To obtain the covariance matrix $\Sigma^l$, they argue that the correlation between the outputs of the $l$th layer is implicitly encoded in the weights of that layer. Therefore, they use the weights of the $l$th layer to compute the correlation matrix, $R^l_{ij}=\frac{w_i^\top w_j}{\left\|w_i\right\| \cdot \left\|w_j\right\|}$. Then, the covariance matrix is given by $\Sigma^l = D \times R^l \times D$, where $D$ is a learnable diagonal scale matrix.
To generate data impressions, they sample in total $N$ outputs $y^l_{soft}$ from the different layers and $N$ noise inputs, and optimize each data impression to minimize the KL divergence between the sample of the $l$th layer, $y_{soft}^l$, and the current output of the $l$th layer, $\hat{y}^l$:

$\mathcal{L}_{KL} = \sum_{l} D_{KL}\left(y_{soft}^l \,\middle\|\, \hat{y}^l\right)$
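The weight-derived correlation matrix and the sampling step can be sketched in numpy; here $D$ is a fixed diagonal scale rather than a learned one, which is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_correlation(W):
    """R_ij: cosine similarity between the weight vectors w_i and w_j
    (rows of the layer's weight matrix W)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

def sample_soft_outputs(W, d, n, rng):
    """Draw n soft outputs for the layer from N(0, Sigma), Sigma = D R D,
    with D = diag(d) (learnable in the paper, fixed here). R = Wn Wn^T is
    positive semidefinite, so Sigma is a valid covariance matrix."""
    R = weight_correlation(W)
    D = np.diag(d)
    return rng.multivariate_normal(np.zeros(len(d)), D @ R @ D, size=n)
```

Each sampled vector plays the role of a soft target $y_{soft}^l$ that a data impression is then optimized to reproduce.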
Additionally, as is common in synthetic data generation, they add a term to the loss function that encourages higher activations in the last convolutional layer.
The final loss used to optimize the data impressions is a weighted combination of these two terms.
Zero-Shot Knowledge Transfer via Adversarial Belief Matching
ZSKT (Micaelli & Storkey, 2019) trains a generator to produce the data impressions in an adversarial way, i.e., to maximize the disagreement between the teacher and the student models.
Additionally, when transferring knowledge from the teacher to the student, they exploit the fact that the two models often have a similar block structure: an extra loss term minimizes the $\ell_2$ distance between the normalized spatial attention maps of the hidden layers of the teacher and student models. The spatial attention map of a layer $l$ is denoted $\mathcal{A}_l$ and computed as $\mathcal{A}_l = \frac{1}{C_l} \sum_{c=1}^{C_l} a_{lc}^2$, where $C_l$ is the number of channels of layer $l$ and $a_{lc}$ is the activation of channel $c$ of layer $l$. The final distillation loss combines the KL divergence between the teacher and student outputs with the attention term:

$\mathcal{L}_{S} = D_{KL}\left(T(x) \,\middle\|\, S(x)\right) + \beta \sum_{l} \left\| \frac{\mathcal{A}_l^{T}}{\left\|\mathcal{A}_l^{T}\right\|_2} - \frac{\mathcal{A}_l^{S}}{\left\|\mathcal{A}_l^{S}\right\|_2} \right\|_2$
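The attention term can be sketched in numpy as follows, assuming channels-first `(C, H, W)` activations:

```python
import numpy as np

def attention_map(a):
    """Spatial attention map of a (C, H, W) activation: channel-wise mean
    of squared activations, flattened and l2-normalized."""
    amap = (a ** 2).mean(axis=0).ravel()
    return amap / (np.linalg.norm(amap) + 1e-12)

def attention_loss(acts_t, acts_s):
    """Sum of l2 distances between the normalized attention maps of
    corresponding teacher and student layers."""
    return sum(np.linalg.norm(attention_map(t) - attention_map(s))
               for t, s in zip(acts_t, acts_s))
```

Note that the attention map only requires the two layers to share spatial size, not channel count, which is what makes it usable across architectures with a similar block structure.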
Data-Free Network Quantization with Adversarial Knowledge Distillation
DFQ (Choi et al., 2020) is a method to distil knowledge from a teacher model to a quantized student model. To avoid the mode collapse that tends to affect generator networks, which leads to low diversity in the generated data impressions, they use ensembles of generators and of students, with the students acting as discriminators. To optimize the generators, they use the batch normalization statistics loss, minimize the entropy of each individual teacher output, and maximize the entropy of the average teacher output over the batch. Moreover, to ensure that the generated data impressions are diverse, they employ the KL divergence between the teacher and student outputs as a discriminator loss.
Contrastive Model Inversion for Data-Free Knowledge Distillation
CMI (Fang et al., 2021) starts from the observation that previous methods for data-free knowledge distillation produce sets of data impressions with a low level of diversity. To address this issue, they propose using a contrastive loss as a diversity indicator and thereby create a method that generates a more diverse set of data impressions. To measure the similarity between data impressions, they add a new model $h = f \circ T$ that projects the output of the teacher for an input $x$ into a new embedding space and then compute the cosine similarity between the embeddings of the data impressions.
Similarly to Adaptive DeepInversion, they employ a unified inversion framework to generate data impressions, which consists of a class-conditional loss (to cover all possible classes), a batch normalization loss, and an adversarial generation loss (to cover the space where the teacher and student disagree). However, they optimize a generative model to produce the data impressions instead of optimizing the data impressions directly. They only apply the adversarial generation loss when the teacher and the student predict the same class.
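As a simplified stand-in for CMI's contrastive objective (the full method contrasts augmented views against a memory bank of previous impressions, which is omitted here), the diversity of a set of embeddings can be scored by their mean pairwise cosine similarity:

```python
import numpy as np

def diversity_penalty(z):
    """Mean pairwise cosine similarity among embeddings z of shape (N, d);
    lower values indicate a more diverse set of data impressions.
    A simplified sketch, not CMI's exact contrastive loss."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = zn @ zn.T
    n = len(z)
    return (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal (self-similarity)
```

A batch of identical embeddings scores 1 (worst), while mutually orthogonal embeddings score 0.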
The final loss combines these inversion terms with the contrastive diversity objective.
Up to 100× Faster Data-Free Knowledge Distillation
FastDFKD (Fang et al., 2022) reuses features of the generator through meta updates, which allows it to generate data impressions much faster than previous methods, which must learn to generate data impressions from scratch for every batch. This results in a large speedup in the generation of data impressions, which, according to the authors, can reach 100× in some cases.
Conclusion
In this post, I've reviewed the current state of the art in data-free knowledge distillation. While many methods have been proposed, I'm still not convinced that data-free knowledge distillation is a viable solution for distilling knowledge from large models in real-world applications. The main reason is that the datasets used to benchmark these methods are very small, with very low-resolution images, and do not represent the complexity of real-world datasets. The comparison between the different methods is also not clear, making it difficult to understand which method is better than the others, and the reproducibility of the results and the availability of code are lacking in some cases.
Another problem with this field is the weak motivation behind the proposed methods. While all of them argue that they might be useful in scenarios where the data is not available, I'm not convinced that this is a real problem in most cases. I would argue that even when the original data is not available, it is often possible to find a similar dataset that can serve as a starting point, with methods like these used to enrich it. Finally, I expect we will see in the coming months new methods that leverage the recent advances in diffusion models to generate data impressions.