Self Sparse Generative Adversarial Networks

Generative Adversarial Networks (GANs) are unsupervised generative models that learn a data distribution through adversarial training. However, recent experiments indicate that GANs are difficult to train due to the requirement of optimization in a high-dimensional parameter space and the zero gradient problem. In this work, we propose a Self-Sparse Generative Adversarial Network (Self-Sparse GAN) that reduces the parameter space and alleviates the zero gradient problem. In the Self-Sparse GAN, we design a Self-Adaptive Sparse Transform Module (SASTM) comprising a sparsity decomposition and a feature-map recombination, which can be applied to multi-channel feature maps to obtain sparse feature maps. The key idea of Self-Sparse GAN is to add the SASTM after every deconvolution layer in the generator, which adaptively reduces the parameter space by exploiting the sparsity in multi-channel feature maps. We theoretically prove that the SASTM not only reduces the search space of the convolution kernel weights of the generator but also alleviates the zero gradient problem by maintaining meaningful features in the Batch Normalization layer and driving the weights of the deconvolution layers away from being negative. Experimental results show that our method achieves the best FID scores for image generation compared with WGAN-GP on MNIST, Fashion-MNIST, CIFAR-10, STL-10, mini-ImageNet, CELEBA-HQ, and LSUN bedrooms, with a relative decrease of FID of 4.76% ~ 21.84%.


INTRODUCTION
Generative adversarial networks (GANs) [1] are unsupervised generative models based on game theory, widely used to learn complex real-world distributions with deep convolutional layers [2] (e.g., image generation). However, despite their success, GAN training is very unstable and may suffer from problems such as gradient disappearance, divergence, and mode collapse [3,4]. The main reason is that training GANs requires finding a Nash equilibrium of a non-convex problem in a high-dimensional continuous space [5]. In addition, it has been pointed out that the loss function used in the original GANs [1] causes the zero gradient problem when there is no overlap between the generated data distribution and the real data distribution [6].
The stabilization of GAN training has been investigated by either modifying the network architecture [2,6-8] or adopting an alternative objective function [9-12]. However, these methods do not reduce the high-dimensional parameter space of the generator. When the task is complex (more texture details and higher resolution), we often increase the number of convolution kernels to enhance the capability of the generator. Nevertheless, we do not know exactly how many convolution kernels are appropriate, which further enlarges the parameter space of the generator. It is therefore reasonable to speculate that parameter redundancy exists in the generator. If the parameter space of the generator can be reduced, both the performance and the training stability of GANs can be further improved.
Motivated by the aforementioned challenges and by the sparsity observed in deep convolutional networks [13,14], we propose a Self-Sparse Generative Adversarial Network (Self-Sparse GAN), which places a Self-Adaptive Sparse Transform Module (SASTM) after each deconvolution layer. The SASTM, consisting of a sparsity decomposition and a feature-map recombination, is applied to the multi-channel feature maps of the deconvolution layer to obtain sparse feature maps. In the sparsity decomposition, the channel sparsity coefficients and position sparsity coefficients are obtained by transforming the latent vector with a two-headed neural network. The sparse multi-channel feature maps are then acquired as a superposition of the channel sparsity and the position sparsity, obtained by multiplying the feature maps by the corresponding sparsity coefficients. These sparsity coefficients alleviate the zero gradient problem by maintaining meaningful features in the Batch Normalization (BN) layer and by driving the weights of the deconvolution layers away from being negative. Meanwhile, the sparse feature maps free some of the convolution kernels, i.e., their weights no longer affect the model, thus reducing the parameter space.
Our contributions. We propose a novel Self-Sparse GAN, in which the training of the generator exploits adaptive sparsity in multi-channel feature maps. We use the SASTM to implement feature-map sparsity adaptively, and theoretically prove that our method not only reduces the search space of the convolution kernel weights but also alleviates the zero gradient problem. We evaluate the performance of the proposed Self-Sparse GAN on the MNIST [15], Fashion-MNIST [16], CIFAR-10, STL-10, mini-ImageNet, CELEBA-HQ [7], and LSUN bedrooms [20] datasets. The experimental results show that our method achieves the best FID scores for image generation compared with WGAN-GP, with a relative decrease of FID of 4.76% ~ 21.84%.

Related Work
Generative adversarial network. GANs [1] learn the data distribution through a game between the generator and the discriminator, and have been widely used in image generation [21], video generation [22], image translation [23], and image inpainting [24].
Optimization and training frameworks. With the development of GANs, more and more researchers are committed to settling the training barriers of gradient disappearance, divergence, and mode collapse. In the work [5], noise is added to both the generated and real data to increase the support of the two distributions and alleviate gradient disappearance. In the work [10], a least-squares loss function is adopted to stabilize the training of the discriminator. WGAN [9] uses the Earth Mover's Distance (EMD) instead of the Jensen-Shannon divergence of the original GAN, which requires the discriminator to satisfy a Lipschitz constraint, enforced by weight clipping. Because weight clipping pushes weights towards the extremes of the clipping range, WGAN-GP [11] instead uses a gradient penalty to make the discriminator satisfy the Lipschitz constraint. Another way to enforce the Lipschitz constraint, spectral normalization, is proposed in [12]. A series of adaptive methods that transform the latent vector to obtain additional information are also widely used in GANs. The work [25] uses an adaptive affine transformation to exploit spatial feedback from the discriminator and improve the performance of GANs. The work [26] uses a nonlinear network to transform the latent space into an intermediate latent space, which controls the generator through adaptive instance normalization (AdaIN) in each convolutional layer. In the work [27], a nonlinear network transforms the latent vector to obtain the affine-transformation parameters of the BN layer to stabilize GAN training. SPGAN [28] creates a sparse representation vector for each image patch, and then synthesizes the entire image by multiplying the generated sparse representations with a pre-trained dictionary and assembling the resulting patches.
Sparsity in Convolutional Neural Networks. Deep convolutional networks have made great progress in a wide range of fields, especially image classification [29]. However, there is a strong correlation between the performance of a network and its size [30], which also leads to parameter redundancy in deep convolutional networks. Sparse Convolutional Neural Networks [13,14] use a sparse decomposition of the convolution kernels to reduce more than 90% of the parameters, while the drop of accuracy is less than 1% on the ILSVRC2012 dataset. The work [31] proposes ℓ0-norm regularization for neural networks, which encourages weights to become exactly zero to speed up training and improve generalization.
Self-Sparse GAN

Motivated by the aforementioned challenges, we aim to equip the generator with a mechanism that can use fewer feature maps to learn useful representations. Inspired by DANet [32], we first design a two-headed neural network that transforms the latent vector to obtain the channel sparsity coefficient and the position sparsity coefficient of the multi-channel feature maps. Second, we multiply the multi-channel feature maps by the channel sparsity coefficient and the position sparsity coefficient, respectively. Then, we add the results to obtain the output of the SASTM.
The proposed Self-Sparse GAN adds a SASTM behind each deconvolution layer of the generator. Self-Sparse GAN only modifies the architecture of the generator, and its conceptual diagram is shown in Figure 1. We define the process transforming the planar size of a feature map from s × s to 2s × 2s as a generative stage. For example, when the resolution of the generated image is 128 × 128, the hierarchical feature-map generation process z → 4 × 4 → 8 × 8 → 16 × 16 → 32 × 32 → 64 × 64 → 128 × 128 is divided into different stages of the generator. Stage t = 3 refers to 8 × 8 → 16 × 16, where t ∈ {1, 2, 3, 4, 5, 6} and T = 6 denotes the total number of stages.
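The staging scheme above can be sketched as a small helper; this is an illustrative snippet (not the paper's code), where `generative_stages` is a hypothetical name:

```python
def generative_stages(target_resolution, base=4):
    """Stages of the generator: stage 1 projects the latent vector z to a
    base x base map; each later stage doubles the spatial size."""
    stages = [(1, "z", base)]            # stage 1: z -> 4x4
    size, t = base, 2
    while size < target_resolution:
        stages.append((t, size, size * 2))
        size *= 2
        t += 1
    return stages

stages = generative_stages(128)
print(stages)
# Stage t = 3 maps 8x8 -> 16x16, and the total number of stages is T = 6.
```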

SASTM: Self-Adaptive Sparse Transform Module
SASTM includes the sparsity decomposition and the feature-map recombination. The sparsity decomposition consists of a Channel Sparsity Module (CSM) and a Position Sparsity Module (PSM), which produce the channel sparsity coefficient and the position sparsity coefficient, respectively. As illustrated in Figure 2, a two-headed neural network is employed to obtain the corresponding sparsity coefficients. In the two-headed neural network, the underlying shared layers are an MLP denoted f, and the exclusive head networks are f₁ and f₂, respectively. The coefficients α_i ≥ 0 and β_{j,k} ≥ 0 are obtained as follows:

α = f₁(f(z)),   (1)
β = f₂(f(z)),   (2)

where Eq. (1) and Eq. (2) represent the CSM and the PSM, respectively. α ∈ R^{c_t} and β ∈ R^{h_t × w_t} are the coefficients of the channel sparsity and the position sparsity, respectively. When α_i = 0 and β_{j,k} = 0, the corresponding channel and spatial location become useless, respectively.
Suppose that the output of the deconvolution layer is h ∈ R^{c_t × h_t × w_t}, where c_t, h_t and w_t represent the number of channels, the height, and the width of the feature maps, respectively. The feature-map recombination is calculated as follows:

x_{i,j,k} = α_i h_{i,j,k} + β_{j,k} h_{i,j,k},   (3)

where x ∈ R^{c_t × h_t × w_t} is the sparse feature maps. Therefore, SASTM is the superposition of the channel sparsity and the position sparsity.
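A minimal NumPy sketch of the recombination in Eq. (3), assuming a toy two-headed network (one shared linear layer plus two ReLU heads, with random placeholder weights rather than the paper's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, c, h, w = 16, 8, 4, 4

def relu(a):
    return np.maximum(a, 0.0)

# toy two-headed network: shared layer f, heads f1 (CSM) and f2 (PSM)
W_shared = rng.standard_normal((32, z_dim))
W_head1 = rng.standard_normal((c, 32))        # -> alpha in R^c
W_head2 = rng.standard_normal((h * w, 32))    # -> beta in R^{h x w}

z = rng.standard_normal(z_dim)
shared = relu(W_shared @ z)
alpha = relu(W_head1 @ shared)                 # channel sparsity coefficients
beta = relu(W_head2 @ shared).reshape(h, w)    # position sparsity coefficients

# feature maps standing in for a deconvolution layer's output
h_maps = rng.standard_normal((c, h, w))

# Eq. (3): x[i,j,k] = alpha[i] * h[i,j,k] + beta[j,k] * h[i,j,k]
x = alpha[:, None, None] * h_maps + beta[None, :, :] * h_maps

# where alpha_i = 0 AND beta_{j,k} = 0, the output is exactly zero
zero_channels = np.where(alpha == 0)[0]
assert np.allclose(x[zero_channels][:, beta == 0], 0.0)
```

The ReLU heads are one simple way to enforce the non-negativity of the coefficients; the paper does not spell out the exact head architecture here.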
The sparse rate of the position sparsity in the i-th channel is defined as follows:

γ_i = card({(j, k) | x_{i,j,k} = 0}) / (h_t w_t),   (4)

where "card" signifies the number of elements in a set. If γ_i ≥ 2/3, the i-th channel is regarded as sparse.
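The per-channel position sparse rate and the 2/3 threshold can be computed directly; a short sketch (the function name is our own):

```python
import numpy as np

def position_sparse_rate(x):
    """x: feature maps of shape (c, h, w); returns gamma of shape (c,),
    the fraction of zeroed spatial locations per channel (Eq. (4))."""
    c, h, w = x.shape
    return (x == 0).reshape(c, -1).sum(axis=1) / (h * w)

x = np.zeros((2, 3, 3))
x[0, 0, 0] = 1.0          # channel 0: 8/9 of positions are zero -> sparse
x[1] = 1.0                # channel 1: no zeros -> dense
gamma = position_sparse_rate(x)
sparse_channels = gamma >= 2 / 3
print(gamma, sparse_channels)
```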
The sparse rate of the channel sparsity is defined as the proportion of channels whose channel sparsity coefficient is zero:

γ_c = card({i | α_i = 0}) / c_t.   (5)

In the back-propagation (BP) process, the derivative of the loss function L with respect to h_{i,j,k} follows from Eq. (3): ∂L/∂h_{i,j,k} = (α_i + β_{j,k}) ∂L/∂x_{i,j,k}.

Mechanism Analysis of the SASTM
To analyze the role of the SASTM, we select the t-th generation stage, as shown in Figure 3. For the convenience of discussion, we assume that the dimensions of the input and output feature maps of the deconvolution layer remain the same (c_t × h_t × w_t), and that the size of the deconvolution kernel is 1 × 1. The feedforward process is then expressed as follows:

x_{i,j,k} = (α_i + β_{j,k}) Σ_{m=1}^{c_t} w_{i,m} h_{m,j,k},   (6)

where w is the deconvolution kernel weight. Therefore, we make the following assumption. Hypothesis 1: once α_i and β_{j,k} have represented the significance of the channel and position sparsity, their signs remain unchanged.
In this section, we prove that the proposed SASTM plays the following three roles: 1) reducing the search space of convolution parameters in the generator; 2) maintaining meaningful features in the BN layer. From Eq. (12), when position sparsity exists in x_{i,j,k}, the probability of x̃_{i,j,k} < 0 decreases, which increases the probability that the gradient in Eq. (7) is nonzero. In addition, a larger γ_i leads to a lower probability of x̃_{i,j,k} < 0. Therefore, the gradient in the backpropagation will not disappear. In other words, when α_i and β_{j,k} have already determined the sparse channels and spatial locations, SASTM reduces the likelihood that useful feature information is dropped after passing through the BN layer, thus maintaining meaningful features to alleviate the zero gradient problem.
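The mean-shift intuition behind this argument can be illustrated with a small simulation (our own illustration, not the paper's formal proof): zeros in a position-sparse channel pull the channel mean down, so fewer positive activations fall below zero after BN-style mean subtraction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

dense = rng.uniform(0.0, 1.0, n)        # no position sparsity
sparse = dense.copy()
sparse[: n * 2 // 3] = 0.0              # position sparse rate gamma = 2/3

def frac_positive_flipped(x):
    """Fraction of positive entries that become negative after
    subtracting the channel mean (BN without scale/affine)."""
    pos = x > 0
    return ((x - x.mean()) < 0)[pos].mean()

print(frac_positive_flipped(dense))     # roughly half flip sign
print(frac_positive_flipped(sparse))    # noticeably fewer flip sign
assert frac_positive_flipped(sparse) < frac_positive_flipped(dense)
```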
3) driving the convolutional weights away from being negative. From the above derivation, the probability that x̃_{i,j,k} is less than zero after passing through the BN layer is reduced when the proposed SASTM is implemented in the network, and a larger position sparse rate γ_i leads to a smaller probability. Therefore, for convenience, we do not consider the BN layer in the discussion below.
However, from Hypothesis 1, α_i and β_{j,k} will not be less than zero. According to Eq. (8), ∂L/∂w_{i,m} < 0, and then w_{i,m} will increase. Similarly, according to Eqs. (9) and (10), ∂L/∂α_i < 0 and ∂L/∂β_{j,k} < 0 can be inferred, and thus α_i and β_{j,k} will increase. Therefore, the increase of w_{i,m} is promoted, and the proposed SASTM drives the convolutional weights away from being negative. A similar phenomenon has been reported for the Channel Scaling layer [33]. WGAN has good theoretical and stability properties in practice, and a zero-centered gradient penalty further enhances the convergence [3]. Therefore, WGAN with a zero-centered gradient penalty is adopted as the baseline for comparison with our method, and the objective function is as follows:

Baseline Model: WGAN-GP
L_D = E_{z∼p(z)}[D(G(z))] − E_{x∼p_r}[D(x)] + λ E_{x̂}[‖∇_{x̂} D(x̂)‖²],   (13)

where x, z, λ and x̂ represent the real data, the latent vector, the gradient penalty coefficient, and random samples drawn by sampling uniformly along straight lines between pairs of real and fake data, respectively [11].
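A toy NumPy sketch of the zero-centered gradient penalty term, assuming a linear critic D(x) = w·x so the input gradient is exactly w; real training would use automatic differentiation through a neural critic instead:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam = 4, 10.0

real = rng.standard_normal((8, dim))          # batch of real data x
fake = rng.standard_normal((8, dim))          # batch of generated data G(z)
w = rng.standard_normal(dim)                  # linear critic weights

# sample uniformly along straight lines between real/fake pairs
eps = rng.uniform(0.0, 1.0, (8, 1))
x_hat = eps * real + (1.0 - eps) * fake

def critic(x):
    return x @ w

# for a linear critic, grad_x D(x_hat) = w for every sample
grad_norm_sq = np.full(len(x_hat), np.sum(w ** 2))

# zero-centered penalty: lambda * E[ ||grad D(x_hat)||^2 ] (no "-1" term)
loss = critic(fake).mean() - critic(real).mean() + lam * grad_norm_sq.mean()
print(loss)
```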

Datasets
Table 1: The investigated resolutions of generated images.
We use the Adam [34] optimizer and set the learning rates of the generator and the discriminator to 0.0001 and 0.0003, respectively, on all datasets, as suggested in [35]. Because multiple discriminator steps per generator step can help GAN training in WGAN-GP, we use two discriminator steps per generator step for 100K generator steps. We set β₁ = 0.5, β₂ = 0.999 for MNIST, Fashion-MNIST, CIFAR-10, STL-10 and mini-ImageNet, and β₁ = 0.0, β₂ = 0.9 for CELEBA-HQ and LSUN bedrooms.
For the evaluation of model sampling quality, we use the Fréchet Inception Distance (FID) [35] as the evaluation metric, which measures the distance between the real and generated data distributions. A smaller FID indicates better quality of the generated images. The FID is calculated as

FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),   (14)

where μ and Σ denote the mean and covariance, respectively, and the subscripts r and g denote the real and generated data, respectively. To obtain training curves quickly, the FID is evaluated every 500 generator steps using 5K samples.
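A self-contained NumPy sketch of Eq. (14). The matrix square root is taken via eigendecomposition of the symmetric PSD matrix Σ_r^{1/2} Σ_g Σ_r^{1/2}, using the identity Tr((Σ_r Σ_g)^{1/2}) = Tr((Σ_r^{1/2} Σ_g Σ_r^{1/2})^{1/2}); production code typically uses the Inception statistics of the two sample sets as (μ, Σ).

```python
import numpy as np

def sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)        # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu_r, sigma_r, mu_g, sigma_g):
    sr_half = sqrtm_psd(sigma_r)
    covmean = sqrtm_psd(sr_half @ sigma_g @ sr_half)
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)

# identical Gaussians give a FID of (numerically) zero
mu, sigma = np.zeros(3), np.eye(3)
print(fid(mu, sigma, mu, sigma))
```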

Results
Comparison with WGAN-GP: Figure 4 shows the FID curves on MNIST and CIFAR-10; those of the other datasets are plotted in Appendix 2 of the supplementary materials. Figure 4 indicates that our method converges faster to the same FID level. Table 2 reports the mean and standard deviation of the best FIDs on all datasets. The experimental results show that our method reduces the FID on all datasets, with a relative decrease of 4.76% ~ 21.84%. Although Self-Sparse GAN does not significantly exceed the baseline on CELEBA-HQ at a resolution of 64×64×3, the relative decrease of FID is still close to 5%. Meanwhile, these results demonstrate that our method improves the generation quality of both grayscale and RGB images.
In addition, the relative improvement of model performance increases with the resolution of the generated images from 64 × 64 × 3 to 128 × 128 × 3 on CELEBA-HQ and LSUN bedrooms. On several datasets, our reported FIDs for WGAN-GP may be larger than those in other literature. For example, on the STL-10 dataset, our reported FID for WGAN-GP is 63.88, which is larger than the 55.1 reported by [36]. The main reason is that we used 5K samples at a resolution of 64 × 64 × 3 to calculate the FID, rather than 50K samples at a resolution of 48 × 48 × 3 as in [36].

Ablations
To investigate the effects of the CSM and the PSM in the proposed SASTM, we perform ablation studies on Fashion-MNIST and STL-10. "Without PSM" means using the CSM only, i.e., β_{j,k} = 0. Similarly, "without CSM" means using the PSM only, i.e., α_i = 0.
As shown in Table 4, the model performance improves significantly on Fashion-MNIST and STL-10 when both the CSM and the PSM are applied. Since the position sparsity coefficient β is shared by all channels, it is difficult to represent pixel-wise sparsity among different channels without α. Therefore, using only the PSM may not function well. On the other hand, when only the CSM is used, the model may lack generation power. Figure 5 also shows that using only the CSM on the Fashion-MNIST dataset makes the multi-channel feature maps too sparse, which suppresses the model performance. Robustness to Hyperparameters of Adam Training. GANs are very sensitive to the hyperparameters of the optimizer. Therefore, we evaluate different hyperparameter settings to validate the robustness of our method. We test two popular settings of (β₁, β₂) in Adam: (0, 0.9) and (0.5, 0.999). Table 5 compares the mean and standard deviation of FID scores on STL-10. It suggests that the proposed Self-Sparse GAN consistently improves model performance.

Datasets
Robustness to Network Architectures. To further test the robustness of the proposed Self-Sparse GAN to different network architectures, we use two common network architectures, from DCGAN and ResNet, on STL-10. Details are given in Appendix 3 of the supplementary materials. Table 6 compares the FID scores of the different network architectures on STL-10, showing that our method is robust to both the DCGAN and the ResNet architecture.
Figure 6: Visualization of the feature maps output by the SASTM. It can be observed that Self-Sparse GAN learns to pick useful convolutional kernels instead of using all convolutional kernels for image generation. In a), we can observe that some sparse feature maps have regular feature points, which means that the PSM is working.

Visualization of SASTM Features
To illustrate the function of the SASTM, we visualize the multi-channel feature maps of each deconvolution layer in the generator on MNIST at a resolution of 128 × 128. Figure 6 shows the multi-channel feature maps at two representative levels, 64 × 64 and 32 × 32; results at other spatial sizes are shown in Appendix 4 of the supplementary materials. The results show that the proposed Self-Sparse GAN learns to pick useful sparse convolutional kernels instead of greedily using all kernels. This also confirms that our method obtains sparse multi-channel feature maps, thus reducing the network parameters.
Validation of Hypothesis 1: We verify this hypothesis by visualizing the feature maps at different training steps on MNIST, as shown in Figure 7. The results illustrate that the signs of α_i and β_{j,k} remain unchanged after 5000 generator steps.
Figure 7: Validation of Hypothesis 1 on MNIST with a resolution of 128 × 128.

Investigation of Relationship between Sparsity and FID
In Section 3.2, we proved that the proposed SASTM alleviates the zero gradient problem and thus improves the model performance. To analyze the relationship between the model sparsity and the FID quantitatively, we define the average position sparse rate γ̄ of the generator as

γ̄ = (1/N) Σ_{i=1}^{N} γ_i,   (15)

where N is the total number of channels over all deconvolution layers of the generator. We select the same network architecture to calculate the corresponding average position sparse rate according to Eq. (15), as shown in Table 7. The data indicate that a larger average position sparse rate tends to yield a greater improvement in FID, except on mini-ImageNet at 64 × 64 resolution and CIFAR-10 at 128 × 128 resolution, which will be further investigated in a future study.
Pearson's coefficient is used to measure the correlation between the average position sparse rate and the relative decrease of FID:

ρ = cov(γ̄, d) / (σ_{γ̄} σ_d),   (16)

where γ̄ and d denote the average position sparse rate and the relative decrease of FID, respectively. A positive correlation between the average position sparse rate and the FID improvement is found. When the resolution of the generated image is 64 × 64, Pearson's correlation coefficient is 0.62; when the resolution is 128 × 128, it is 0.79. Thus, the Pearson correlation coefficient increases with the resolution.
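The correlation check in Eq. (16) amounts to a standard Pearson coefficient; a sketch using `np.corrcoef`, where the numbers are made-up placeholders rather than the paper's measured sparsity/FID values:

```python
import numpy as np

gamma_bar = np.array([0.10, 0.25, 0.40, 0.55])   # average position sparse rates
fid_drop = np.array([5.0, 9.0, 15.0, 20.0])      # relative FID decreases (%)

# Pearson's coefficient: cov(gamma_bar, d) / (std(gamma_bar) * std(d))
rho = np.corrcoef(gamma_bar, fid_drop)[0, 1]
print(round(rho, 3))   # close to 1 for this monotone toy data
assert rho > 0         # positive correlation, as reported in the text
```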

Conclusions
In this study, a Self-Sparse Generative Adversarial Network (Self-Sparse GAN) is proposed for the unsupervised image generation task. By exploiting channel sparsity and position sparsity in multi-channel feature maps, Self-Sparse GAN stabilizes the training process and improves the model performance by: (1) reducing the search space of the convolution parameters in the generator; (2) maintaining meaningful features in the BN layer to alleviate the zero gradient problem; and (3) driving the convolutional weights away from being negative. We demonstrate the proposed method on seven image datasets. Experimental results show that our approach obtains better FIDs than WGAN-GP on all seven datasets and is robust to both training hyperparameters and network architectures. Moreover, a positive correlation between sparsity and FID further validates that the proposed sparsity module enhances the image generation power of the model.

Figure 1 :
Figure 1: The concept diagram of the modified generator in the Self-Sparse GAN. Deconvolution, BN, ReLU, and Tanh represent the deconvolution layer, the Batch Normalization layer, the rectified linear activation function, and the tanh activation function, respectively.

Figure 2 :
Figure 2: The concept diagram of the Self-Adaptive Sparse Transform Module.
According to Eq. (6), the dimension of the deconvolution kernel weight w_t is c_t × c_t. Find all i ∈ I ⊂ {1, …, c_t} with α_i = 0, and all j ∈ {1, …, h_t}, k ∈ {1, …, w_t} with β_{j,k} = 0; then x_{i,j,k} = 0 and the corresponding gradients vanish at all times from Hypothesis 1, which indicates that the i-th channel no longer works in either the feedforward or the backward process during training. Consequently, the dimensions of the valid convolutional kernels are (c_t − |I|) × c_t, thus reducing the search space of the convolutional parameters in the generator.

2) maintaining meaningful features in the BN layer to alleviate the zero gradient problem. For the convenience of discussion, we do not consider the affine transformation in the BN layer. At the same time, because dividing by the standard deviation in the BN layer does not change the sign, we also ignore the standard deviation in the BN layer for the remainder of the discussion. x_{i,j,k} can be divided into two parts, {x_{i,j,k} | x_{i,j,k} < 0} and {x_{i,j,k} | x_{i,j,k} ≥ 0}. When {x_{i,j,k} | x_{i,j,k} < 0} passes through the BN layer, a part of its values becomes greater than zero in the feedforward pass and thus mitigates the zero gradient problem in the backward pass; here, we ignore the part that is still less than zero. Therefore, in the following discussion, we assume x_{i,j,k} ≥ 0 for j ∈ {1, 2, …, h_t}, k ∈ {1, 2, …, w_t}. According to the definition of the position sparse rate, the number of nonzero positions in the i-th channel is n = (1 − γ_i) h_t w_t. The computation of the BN layer can then be expressed as x̃_{i,j,k} = x_{i,j,k} − μ_i, where μ_i = Σ_{j,k} x_{i,j,k} / (h_t w_t) > 0 and x̃_{i,j,k} is the value of x_{i,j,k} after passing through BN. The conditional probability of {x̃_{i,j,k} < 0 | x_{i,j,k} > 0} is P(x_{i,j,k} − μ_i < 0). In addition, considering the position sparsity in x_{i,j,k}, the aforementioned probability is approximated as P(x_{i,j,k} − (1 − γ_i) μ_i⁺ < 0), where μ_i⁺ is the mean of the nonzero entries; a larger γ_i therefore leads to a smaller probability.

Figure 4: FID training curves on MNIST and CIFAR-10, depicting the mean performance of three random trainings with a 95% confidence interval.

Figure 5 :
Figure 5: Ablation study. Refer to Section 4.4 for details. Using only the CSM makes the multi-channel feature maps too sparse.

Table 2 :
Comparison of FIDs between our proposed Self-Sparse GAN and the baseline WGAN-GP. The mean and standard deviation of the FID are calculated over three individual trainings with different random seeds.

Table 3 :
The FIDs on STL-10 with a resolution of 64 × 64 × 3 computed using 50K samples. Table 3 shows that the FID is better than 55.1 even at this resolution; therefore, this paper does not lower the baseline.

Table 4 :
Comparisons of FIDs in ablation studies on Fashion-MNIST and STL-10.

Table 5 :
Comparisons of FID in the robustness experiments on STL-10 with different Adam hyperparameter settings.

Table 6 :
Comparisons of FID in the robustness experiments on STL-10 with different network architectures.

Table 7 :
Relationship between sparsity and FID with the same network architecture.