Table of Contents

Data-Free Knowledge Distillation

Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion

Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis

Zero-Shot Knowledge Transfer via Adversarial Belief Matching

Data-Free Network Quantization with Adversarial Knowledge Distillation

Contrastive Model Inversion for Data-Free Knowledge Distillation
Introduction
In this post, I'm going to review the current state of the art in data-free knowledge distillation. I'll start with the basics of knowledge distillation, go through the motivation for data-free knowledge distillation, and then introduce several works, including the current state of the art, that I consider important for understanding the subject.
Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) is the technique of transferring knowledge from a model or ensemble of models (the teacher) to another model (the student). As the field moves towards ever larger models, it is becoming increasingly important to find ways to transfer their knowledge to smaller models that can be deployed on hardware with limited resources or in applications where energy consumption is a concern. Some of the most common applications of knowledge distillation are:
- Transfer to a smaller version of the teacher model with fewer layers and fewer neurons per layer
- Transfer to a smaller architecture
- Transfer to a quantized version of the teacher model
Response-based KD
The most common form of knowledge distillation is response-based knowledge distillation, in which the student model is trained to mimic the output of the teacher model. A possible loss function for a multiclass classification problem is the cross-entropy between the softened teacher and student outputs:

$\mathcal{L}_{KD} = -\sum_{c=1}^{C} p_T^c(x; \tau) \log p_S^c(x; \tau)$

where $p_T(x;\tau)$ and $p_S(x;\tau)$ are the teacher and student softmax outputs with temperature $\tau$.
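As a minimal numpy sketch (not the exact formulation from any specific paper), the softened-target loss can be written as follows; the temperature value and the $\tau^2$ scaling follow the common convention from Hinton et al. (2015):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Cross-entropy between softened teacher and student distributions,
    scaled by tau^2 so gradients keep a similar magnitude across temperatures."""
    p_t = softmax(teacher_logits, tau)
    log_p_s = np.log(softmax(student_logits, tau))
    return -(tau ** 2) * np.mean((p_t * log_p_s).sum(axis=-1))
```

The loss is minimized (down to the teacher's own entropy) when the student logits reproduce the teacher's softened distribution.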
Feature-based KD
In feature-based knowledge distillation, the student model is trained to mimic the intermediate-layer features of the teacher model. A possible loss function minimizes the mean squared error between the features of the teacher and the student:

$\mathcal{L}_{feat} = \left\| \Phi_T\left(f_T(x)\right) - \Phi_S\left(f_S(x)\right) \right\|_2^2$

where $f_T$ and $f_S$ are the features of the teacher and student respectively, and $\Phi_T$ and $\Phi_S$ are projections that map the features of the teacher and student to the same space.
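A small numpy sketch of this idea, with made-up dimensions and simple linear maps standing in for the projections $\Phi_T$ and $\Phi_S$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: teacher features 64-d, student features 32-d,
# both projected into a shared 16-d space.
f_t = rng.normal(size=(8, 64))   # batch of teacher features
f_s = rng.normal(size=(8, 32))   # batch of student features
W_t = rng.normal(size=(64, 16))  # Phi_T as a linear projection
W_s = rng.normal(size=(32, 16))  # Phi_S as a linear projection

def feature_kd_loss(f_t, f_s, W_t, W_s):
    """MSE between teacher and student features after projecting
    both into the same embedding space."""
    return np.mean((f_t @ W_t - f_s @ W_s) ** 2)
```

The loss is zero exactly when the projected features coincide.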
Relation-based KD
Relation-based knowledge distillation explores relations between different layers and data samples and optimizes the student model to mimic those relations. For example, Yim et al. (2017) use the FSP (flow of solution procedure) matrix, the inner product between the features of two layers, to measure the relation between them. The student model is then trained to minimize the difference between its FSP matrices and the FSP matrices of the teacher model.
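A numpy sketch of the FSP matrix between two feature maps of the same spatial size (the layout and shapes here are my own assumption, channels-last):

```python
import numpy as np

def fsp_matrix(f1, f2):
    """FSP matrix between two feature maps of shapes (H, W, C1) and (H, W, C2):
    channel-wise inner products averaged over spatial positions -> (C1, C2)."""
    h, w, _ = f1.shape
    return np.einsum('hwi,hwj->ij', f1, f2) / (h * w)
```

Note that `fsp_matrix(f, f)` reduces to a (scaled) Gram matrix of the channels, which is symmetric.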
Data-Free Knowledge Distillation
In many cases, the teacher model is trained on datasets that are very large or that are not publicly available for copyright or privacy reasons, which makes it difficult to distil knowledge from those models using the same data they were trained on. Introduced in Lopes et al. (2017), data-free knowledge distillation is the process of distilling knowledge from a model without using any of the original data. That original work used metadata from the original dataset to reconstruct a similar dataset. However, in many cases the metadata is not available or is difficult to obtain. For that reason, subsequent works have proposed new methods that try to create a dataset without any metadata.
In this section, I'll introduce several works that I consider important to understand the current state of the art in data-free knowledge distillation.
Zero-Shot Knowledge Distillation in Deep Networks
ZSKD (Nayak et al., 2019) introduces a method to synthesize data impressions by optimizing the input so that the output of the teacher model matches a sample from a Dirichlet distribution. Since the teacher is frozen, the only free parameter in the loss function is the data impression itself.
Given an output $s$ after the softmax layer of the teacher model, which is a vector of probabilities of a certain input belonging to each class, the authors propose to model the probability of the output $y$ belonging to a class $c$ as a Dirichlet distribution, $p(y^c) = Dir(C, \alpha^c)$, where $C$ is the number of classes and $\alpha^c$ is the concentration parameter of the Dirichlet distribution.
To find the concentration parameter $\alpha^c$, they compute the cosine similarity matrix between the final-layer weight vectors of all $C$ classes and use the $c$th row of the matrix as the concentration parameter $\alpha^c$.
Then, for each class $c$, $N$ outputs $y$ are sampled from the Dirichlet distribution, and one data impression is optimized for each $y$ using a cross-entropy loss.
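The sampling step can be sketched in numpy as follows. One caveat: Dirichlet concentration parameters must be positive, so this sketch clips negative cosine similarities to a small positive value, which is my own assumption rather than the paper's exact treatment:

```python
import numpy as np

rng = np.random.default_rng(0)

def concentration_params(W, eps=1e-3):
    """Cosine-similarity matrix of the final-layer class weight vectors
    (rows of W); row c serves as the Dirichlet concentration for class c.
    Negative similarities are clipped to eps (assumption: Dirichlet
    parameters must be strictly positive)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.maximum(Wn @ Wn.T, eps)

# Hypothetical 5-class, 20-d final-layer weight matrix.
alpha = concentration_params(rng.normal(size=(5, 20)))
# N=10 soft targets for class 0; each one would drive one data impression.
y_samples = rng.dirichlet(alpha[0], size=10)
```

Each sampled row is a valid probability vector, i.e., nonnegative and summing to one.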
Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion
DeepInversion (Yin et al., 2019) synthesizes data impressions that maximize the confidence of the model for a class $y_i$ while regularizing with the statistics stored in the batch norm layers and with the total variation and $\ell_2$ norm of the synthesized inputs. By using the statistics of the batch norm layers, DeepInversion is able to generate data impressions that recover some of the statistics of the original dataset.
The loss function is the following:

$\min_{\hat{x}} \; \mathcal{L}\left(T(\hat{x}), y\right) + \alpha_{tv}\mathcal{R}_{TV}(\hat{x}) + \alpha_{\ell_2}\left\|\hat{x}\right\|_2^2 + \alpha_{f}\sum_{l}\left(\left\|\mu_l(\hat{x}) - \mu_l^{BN}\right\|_2 + \left\|\sigma_l^2(\hat{x}) - \sigma_l^{2,BN}\right\|_2\right)$

where $\mu_l(\hat{x})$ and $\sigma_l^2(\hat{x})$ are the batch mean and variance of the features at layer $l$, and $\mu_l^{BN}$ and $\sigma_l^{2,BN}$ are the running statistics stored in the corresponding batch norm layer.
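The batch-norm regularizer is the distinctive term, so here is a minimal numpy sketch of it, assuming we are handed per-layer feature batches and the running statistics extracted from the batch norm layers:

```python
import numpy as np

def bn_statistics_loss(feats, bn_means, bn_vars):
    """Feature-distribution regularizer: for each batch norm layer, match the
    per-channel mean/variance of the current batch of synthesized inputs to
    the running statistics stored in that layer."""
    loss = 0.0
    for f, mu_bn, var_bn in zip(feats, bn_means, bn_vars):
        loss += np.linalg.norm(f.mean(axis=0) - mu_bn)  # batch mean per channel
        loss += np.linalg.norm(f.var(axis=0) - var_bn)  # batch variance per channel
    return loss
```

The term vanishes exactly when the synthesized batch reproduces the stored statistics, which is what lets DeepInversion recover some dataset-level statistics.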
Adaptive DeepInversion
To further improve the quality of the generated data impressions, Yin et al. (2019) proposed Adaptive DeepInversion. Given that the student's knowledge evolves during training, it makes sense to generate data impressions on which the teacher and student disagree. The Adaptive DeepInversion loss extends the DeepInversion loss with a term that maximizes the Jensen-Shannon divergence between the teacher and student outputs.
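For reference, the Jensen-Shannon divergence between the two output distributions can be sketched as follows (the epsilon smoothing is my own numerical-stability choice):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two categorical distributions:
    0 when they agree, bounded above by log 2 when they are disjoint."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Adaptive DeepInversion adds the negative of this quantity to the inversion loss, so minimizing the total loss pushes the synthesized inputs toward regions where teacher and student disagree.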
Data-Free Learning of Student Networks
DAFL (Chen et al., 2019) is another method that uses a generative model to generate data impressions. They employ a cross-entropy loss against the teacher's own predicted labels to minimize the entropy of the teacher outputs, maximize the $\ell_1$ norm of the features extracted by the teacher, and maximize the information entropy of the average of the teacher outputs to encourage a balanced representation of the different classes.
Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis
Wang (2021) introduces a method that models the output of the $l$th layer of the teacher model as a multivariate normal distribution in order to generate data impressions. Considering a teacher model $T$ with $L$ layers, the output of the $l$th layer is $s^l$, a vector of size $K$. $s^l$ is modeled as a multivariate normal distribution with mean $\mu^l$ and covariance matrix $\Sigma^l$, with the mean fixed at $\mu^l=0$. To obtain the covariance matrix $\Sigma^l$, they argue that the correlation between the outputs of the $l$th layer is implicitly encoded in the weights of that layer. Therefore, they use the weights of the $l$th layer to compute the correlation matrix, $R^l_{ij}=\frac{w_i^\top w_j}{\left\|w_i\right\| \cdot \left\|w_j\right\|}$. Then, the covariance matrix is given by $\Sigma^l = D \times R^l \times D$, where $D$ is a learnable diagonal scale matrix.
To generate data impressions, they sample in total $N$ outputs $y^l_{soft}$ from the different layers and $N$ noise inputs, and optimize each data impression to minimize the KL divergence between the sample of the $l$th layer, $y_{soft}^l$, and the current output of the $l$th layer, $\hat{y}^l$:

$\mathcal{L}_{KL} = \sum_{l} D_{KL}\left(y_{soft}^l \,\middle\|\, \hat{y}^l\right)$
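The weight-derived correlation matrix and the sampling step can be sketched in numpy; here $D$ is a fixed diagonal scale rather than a learned one, which is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_correlation(W):
    """R_ij: cosine similarity between the weight vectors w_i and w_j
    (rows of the layer's weight matrix W)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

def sample_soft_outputs(W, d, n, rng):
    """Draw n soft outputs for the layer from N(0, Sigma), Sigma = D R D,
    with D = diag(d) (learnable in the paper, fixed here). R = Wn Wn^T is
    positive semidefinite, so Sigma is a valid covariance matrix."""
    R = weight_correlation(W)
    D = np.diag(d)
    return rng.multivariate_normal(np.zeros(len(d)), D @ R @ D, size=n)
```

Each sampled vector plays the role of a soft target $y_{soft}^l$ that a data impression is then optimized to reproduce.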
Additionally, as is common in synthetic data generation, they add a term to the loss function that encourages higher activations in the last convolutional layer.
The final loss used to optimize the data impressions is a weighted combination of these two terms.
Zero-Shot Knowledge Transfer via Adversarial Belief Matching
ZSKT (Micaelli & Storkey, 2019) trains a generator to produce the data impressions in an adversarial way, i.e., to maximize the disagreement between the teacher and the student models.
Additionally, when transferring knowledge from the teacher to the student, they exploit the fact that the two models often have a similar block structure: an extra loss term minimizes the $\ell_2$ distance between the normalized spatial attention maps of the hidden layers of the teacher and student models. The spatial attention map of a layer $l$ is denoted $\mathcal{A}_l$ and computed as $\mathcal{A}_l = \frac{1}{C_l} \sum_{c=1}^{C_l} a_{lc}^2$, where $C_l$ is the number of channels of layer $l$ and $a_{lc}$ is the activation of channel $c$ of layer $l$. The final distillation loss combines the KL divergence between the teacher and student outputs with the attention term:

$\mathcal{L}_{S} = D_{KL}\left(T(x) \,\middle\|\, S(x)\right) + \beta \sum_{l} \left\| \frac{\mathcal{A}_l^{T}}{\left\|\mathcal{A}_l^{T}\right\|_2} - \frac{\mathcal{A}_l^{S}}{\left\|\mathcal{A}_l^{S}\right\|_2} \right\|_2$
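The attention term can be sketched in numpy as follows, assuming channels-first `(C, H, W)` activations:

```python
import numpy as np

def attention_map(a):
    """Spatial attention map of a (C, H, W) activation: channel-wise mean
    of squared activations, flattened and l2-normalized."""
    amap = (a ** 2).mean(axis=0).ravel()
    return amap / (np.linalg.norm(amap) + 1e-12)

def attention_loss(acts_t, acts_s):
    """Sum of l2 distances between the normalized attention maps of
    corresponding teacher and student layers."""
    return sum(np.linalg.norm(attention_map(t) - attention_map(s))
               for t, s in zip(acts_t, acts_s))
```

Note that the attention map only requires the two layers to share spatial size, not channel count, which is what makes it usable across architectures with a similar block structure.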
Data-Free Network Quantization with Adversarial Knowledge Distillation
DFQ (Choi et al., 2020) is a method to distil knowledge from a teacher model to a quantized student model. To avoid the mode collapse that tends to affect generator networks, which leads to low diversity in the generated data impressions, they use ensembles of generators and of students, with the students acting as discriminators. To optimize the generators, they use the batch normalization statistics loss, minimize the entropy of each individual teacher output, and maximize the entropy of the average teacher output over the batch. Moreover, to ensure that the generated data impressions are diverse, they employ the KL divergence between the teacher and student outputs as a discriminator loss.
Contrastive Model Inversion for Data-Free Knowledge Distillation
CMI (Fang et al., 2021) starts from the observation that previous methods for data-free knowledge distillation produce sets of data impressions with a low level of diversity. To address this issue, they propose using a contrastive loss as a diversity indicator and thereby create a method that generates a more diverse set of data impressions. To measure the similarity between data impressions, they add a new model $h = f \circ T$ that projects the output of the teacher for an input $x$ into a new embedding space and then compute the cosine similarity between the embeddings of the data impressions.
Similarly to Adaptive DeepInversion, they employ a unified inversion framework to generate data impressions, which consists of a class-conditional loss (to cover all possible classes), a batch normalization loss, and an adversarial generation loss (to cover the space where the teacher and student disagree). However, they optimize a generative model to produce the data impressions instead of optimizing the data impressions directly. They only apply the adversarial generation loss when the teacher and the student predict the same class.
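As a simplified stand-in for CMI's contrastive objective (the full method contrasts augmented views against a memory bank of previous impressions, which is omitted here), the diversity of a set of embeddings can be scored by their mean pairwise cosine similarity:

```python
import numpy as np

def diversity_penalty(z):
    """Mean pairwise cosine similarity among embeddings z of shape (N, d);
    lower values indicate a more diverse set of data impressions.
    A simplified sketch, not CMI's exact contrastive loss."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = zn @ zn.T
    n = len(z)
    return (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal (self-similarity)
```

A batch of identical embeddings scores 1 (worst), while mutually orthogonal embeddings score 0.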
The final loss combines these inversion terms with the contrastive diversity objective.
Up to 100× Faster Data-Free Knowledge Distillation
FastDFKD (Fang et al., 2022) reuses features of the generator through meta updates, which allows it to generate data impressions much faster than previous methods, which must learn to generate data impressions from scratch for every batch. This results in a large speedup in the generation of data impressions, which, according to the authors, can reach 100× in some cases.
Conclusion
In this post, I've reviewed the current state of the art in data-free knowledge distillation. While many methods have been proposed, I'm still not convinced that data-free knowledge distillation is a viable solution for distilling knowledge from large models in real-world applications. The main reason is that the datasets used to benchmark these methods are very small, with very low-resolution images, and do not represent the complexity of real-world datasets. The comparison between the different methods is also not clear, making it difficult to understand which method is better than the others, and the reproducibility of the results and the availability of code are lacking in some cases.
Another problem with this field is the weak motivation behind the proposed methods. While all of them argue that they might be useful in scenarios where the data is not available, I'm not convinced that this is a real problem in most cases. I would argue that even when the original data is not available, it is often possible to find a similar dataset that can serve as a starting point, with methods like these used to enrich it. Finally, I expect we will see in the coming months new methods that leverage the recent advances in diffusion models to generate data impressions.