# Fixing a Broken ELBO

ICML, pp.159-168, (2018)

Abstract

Recent work in unsupervised representation learning has focused on learning deep directed latent-variable models. Fitting these models by maximizing the marginal likelihood or evidence is typically intractable, thus a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether...

Introduction

- Learning a “useful” representation of data in an unsupervised way is one of the “holy grails” of current machine learning research.
- A common approach to this problem is to fit a latent variable model of the form p(x, z|θ) = p(z|θ)p(x|z, θ) to the data, where x are the observed variables, z are the hidden variables, and θ are the parameters.
- Such models are usually fit by minimizing L(θ) = KL[p̂(x) || p(x|θ)], where p̂(x) is the empirical data distribution; this is equivalent to maximum likelihood training.
- Obtaining a good ELBO is not enough for good representation learning
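The two claims above (that minimizing the KL divergence is equivalent to maximum likelihood, and that the ELBO lower-bounds the marginal likelihood) can be written out as a short derivation. This is a sketch under assumed notation: p̂(x) denotes the empirical data distribution and q(z|x) a variational posterior, neither of which is named in this summary.

```latex
\begin{align}
\mathrm{KL}\big[\hat{p}(x)\,\|\,p(x|\theta)\big]
  &= \mathbb{E}_{\hat{p}(x)}\big[\log \hat{p}(x)\big]
   - \mathbb{E}_{\hat{p}(x)}\big[\log p(x|\theta)\big], \\
\log p(x|\theta)
  &= \log \int p(z|\theta)\, p(x|z,\theta)\, dz \\
  &\ge \mathbb{E}_{q(z|x)}\big[\log p(x|z,\theta)\big]
   - \mathrm{KL}\big[q(z|x)\,\|\,p(z|\theta)\big].
\end{align}
```

The first term on the right of the first line does not depend on θ, so minimizing the KL divergence amounts to maximizing the expected log-likelihood. The final line is the per-datapoint ELBO. Note that neither expression says anything about how much information z carries about x, which is consistent with the point that a good ELBO alone does not guarantee a good representation.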

Highlights

- Learning a “useful” representation of data in an unsupervised way is one of the “holy grails” of current machine learning research
- We may instead maximize a lower bound on the marginal likelihood, such as the evidence lower bound (ELBO), as is done when fitting variational autoencoder (VAE) models (Kingma & Welling, 2014)
- We show that VAEs with powerful autoregressive decoders can be trained to not ignore their latent code by targeting certain points on the rate-distortion (RD) curve
- In Section 4, we show how to use this framework to study the properties of various recently-proposed VAE model variants
- We examine several VAE model architectures that have been proposed in the literature
- We have presented a theoretical framework for understanding representation learning using latent variable models in terms of the rate-distortion tradeoff
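To make the rate-distortion view concrete, the following minimal Python sketch computes the two quantities for a single data point. It assumes a diagonal-Gaussian encoder, a standard-normal marginal over z, and a Bernoulli decoder; these choices and all function names are illustrative assumptions, not details taken from the paper. The rate is the KL term of the ELBO, the distortion is the reconstruction negative log-likelihood, and the negative ELBO is their sum.

```python
import math

def gaussian_rate(mean, log_var):
    """KL[N(mean, diag(exp(log_var))) || N(0, I)] in nats (assumed encoder/marginal)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))

def bernoulli_distortion(x, probs, eps=1e-12):
    """Single-sample distortion: -log d(x|z) for binary x, in nats.

    `probs` stands for the decoder output for one sampled z (hypothetical input).
    """
    return -sum(xi * math.log(p + eps) + (1 - xi) * math.log(1.0 - p + eps)
                for xi, p in zip(x, probs))

def weighted_objective(rate, distortion, beta=1.0):
    """D + beta * R; with beta = 1 this equals the negative ELBO."""
    return distortion + beta * rate

# Illustrative values only.
rate = gaussian_rate(mean=[1.0], log_var=[0.0])
distortion = bernoulli_distortion(x=[1, 0], probs=[0.5, 0.5])
print(rate, distortion, weighted_objective(rate, distortion))
```

Two models can reach the same value of `weighted_objective` at beta = 1 with very different splits between rate and distortion, which is why reporting both numbers is more informative than reporting the ELBO alone.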

Methods

- Toy model: the authors empirically show a case where the usual ELBO objective can learn a model which perfectly captures the true data distribution, p*(x), but which fails to learn a useful latent representation.
- By training the same model such that the authors minimize the distortion, subject to achieving a desired target rate R∗, the authors can recover a latent representation that closely matches the true generative process, while perfectly capturing the true data distribution.
- See Appendix E for more detail on the data generation and model
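The idea of minimizing distortion subject to a desired target rate R* can be sketched as a penalized loss. The absolute-value penalty and the `penalty_weight` parameter below are assumptions made for illustration; the exact formulation used by the authors may differ.

```python
def target_rate_loss(rate, distortion, target_rate, penalty_weight=1.0):
    """Distortion plus a penalty for missing the target rate (all in nats).

    A penalty-based stand-in for: minimize D subject to R = R*.
    """
    return distortion + penalty_weight * abs(rate - target_rate)

# Hypothetical numbers: two solutions with equal distortion but different rates.
collapsed = target_rate_loss(rate=0.0, distortion=1.0, target_rate=0.5)
on_target = target_rate_loss(rate=0.5, distortion=1.0, target_rate=0.5)
print(collapsed, on_target)
```

Under this loss, a solution that ignores the latent code (rate near zero) is penalized relative to one that matches the target rate, even when both reconstruct the data equally well.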

Results

- The Target Rate model nearly perfectly reproduces the true generative process, as can be seen by comparing the yellow and purple regions in the z-space plots (2aii, 2cii): both the optimal model and the Target Rate model have two clusters, one with about 70% of the probability mass, corresponding to class 0, and the other with about 30% of the mass, corresponding to class 1.

Conclusion

- The authors have presented a theoretical framework for understanding representation learning using latent variable models in terms of the rate-distortion tradeoff
- Minimizing distortion subject to a target rate is a constrained optimization problem that allows models to be fit by targeting a specific point on the RD curve, which cannot be done using the β-VAE framework.
- Perhaps the most surprising finding is that all the current approaches seem to have a hard time achieving high rates at low distortion.
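The contrast between the β-VAE objective and the target-rate formulation can be written compactly. The notation here is an assumption for illustration: e(z|x) is the encoder, m(z) the marginal over latents, d(x|z) the decoder, and p*(x) the true data distribution.

```latex
\begin{align}
R &= \mathbb{E}_{p^*(x)}\,\mathrm{KL}\big[e(z|x)\,\|\,m(z)\big], \qquad
D = -\,\mathbb{E}_{p^*(x)}\,\mathbb{E}_{e(z|x)}\big[\log d(x|z)\big], \\
\text{$\beta$-VAE:}\quad & \min_{e,\,m,\,d}\; D + \beta R, \qquad
  \text{ELBO} = -(D + R) \text{ at } \beta = 1, \\
\text{target rate:}\quad & \min_{e,\,m,\,d}\; D \quad \text{subject to} \quad R = R^*.
\end{align}
```

In the β-VAE form, β only sets the relative weight of the two terms. Wherever the RD curve is close to a straight line, many (R, D) pairs give nearly the same value of D + βR, so the optimizer is free to land on any of them, including one with R near zero. The constrained form fixes R directly and leaves only D to be minimized.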

Related work

- Improving VAE representations. Many recent papers have introduced mechanisms for alleviating the problem of unused latent variables in VAEs. Bowman et al. (2016) proposed annealing the weight of the KL term of the ELBO from 0 to 1 over the course of training, but did not consider ending weights that differed from 1. Higgins et al. (2017) proposed the β-VAE for unsupervised learning, a generalization of the original VAE in which the KL term is scaled by β, similar to this paper. However, their focus was on disentangling, and they did not discuss rate-distortion tradeoffs across model families. Recent work has used the β-VAE objective to trade off reconstruction quality for sampling accuracy (Ha & Eck, 2018). Chen et al. (2017) present a bits-back interpretation (Hinton & Van Camp, 1993). Modifications to the variational families (Kingma et al., 2016), priors (Papamakarios et al., 2017; Tomczak & Welling, 2017), and decoder structure (Chen et al., 2017) have also been proposed as mechanisms for learning better representations.
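The KL-annealing idea mentioned above can be sketched as a weight schedule. The linear ramp, the `warmup_steps` parameter, and the `final_weight` parameter are illustrative assumptions; the schedule shape used in the cited work is not specified in this summary.

```python
def kl_weight(step, warmup_steps, final_weight=1.0):
    """Weight on the KL term, ramping linearly from 0 to final_weight.

    final_weight = 1.0 mimics annealing toward the standard ELBO;
    other values correspond to ending at a beta-VAE-style weight.
    """
    if warmup_steps <= 0:
        return final_weight
    return final_weight * min(1.0, step / warmup_steps)

# Weight at a few training steps with a hypothetical 100-step warm-up.
for step in (0, 50, 100, 200):
    print(step, kl_weight(step, warmup_steps=100))
```

The per-step training loss would then be the distortion plus `kl_weight(step, ...)` times the rate. Setting `final_weight` to a value other than 1 gives the case of an ending weight that differs from 1.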

Funding

- In the toy experiment, the VAE fails to learn a useful representation, yielding a rate of only R = 0.0002 nats, while the Target Rate model achieves R = 0.4999 nats. The Target Rate model nearly perfectly reproduces the true generative process, as can be seen by comparing the yellow and purple regions in the z-space plots (2aii, 2cii): both the optimal model and the Target Rate model have two clusters, one with about 70% of the probability mass, corresponding to class 0 (purple shaded region), and the other with about 30% of the mass (yellow shaded region), corresponding to class 1.

References

- Achille, A. and Soatto, S. Information Dropout: Learning Optimal Representations Through Noisy Computation. In Information Control and Learning, September 2016. URL http://arxiv.org/abs/1611.01353.
- Achille, A. and Soatto, S. Emergence of Invariance and Disentangling in Deep Representations. Proceedings of the ICML Workshop on Principled Approaches to Deep Learning, 2017.
- Agakov, F. V. Variational Information Maximization in Stochastic Environments. PhD thesis, University of Edinburgh, 2006.
- Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep Variational Information Bottleneck. In ICLR, 2017.
- Balle, J., Laparra, V., and Simoncelli, E. P. End-to-end Optimized Image Compression. In ICLR, 2017.
- Barber, D. and Agakov, F. V. Information maximization in noisy channels: A variational approach. In NIPS. 2003.
- Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
- Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. CoNLL, 2016.
- Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint 1606.03657, 2016.
- Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational Lossy Autoencoder. In ICLR, 2017.
- Germain, M., Gregor, K., Murray, I., and Larochelle, H. Made: Masked autoencoder for distribution estimation. In ICML, 2015.
- Gregor, K., Besse, F., Rezende, D. J., Danihelka, I., and Wierstra, D. Towards conceptual compression. In Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.
- Ha, D. and Eck, D. A neural representation of sketch drawings. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hy6GHpkCW.
- Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, 2017.
- Hinton, G. E. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proc. of the Workshop on Computational Learning Theory, 1993.
- Hoffman, M. D. and Johnson, M. J. Elbo surgery: yet another way to carve up the variational evidence lower bound. In NIPS Workshop in Advances in Approximate Bayesian Inference, 2016.
- Huszar, F. Is maximum likelihood useful for representation learning?, 2017. URL http://www.inference.vc/maximum-likelihood-for-representation-learning-2/.
- Johnston, N., Vincent, D., Minnen, D., Covell, M., Singh, S., Chinen, T., Hwang, S. J., Shor, J., and Toderici, G. Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks. ArXiv e-prints, 2017.
- Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
- Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In NIPS. 2016.
- Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In AI/Statistics, 2011.
- Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In ICLR, 2016.
- Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In NIPS. 2017.
- Phuong, M., Welling, M., Kushman, N., Tomioka, R., and Nowozin, S. The mutual autoencoder: Controlling information in latent code representations, 2018. URL https://openreview.net/forum?id=HkbmWqxCZ.
- Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
- Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
- Shamir, O., Sabato, S., and Tishby, N. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696–2711, 2010.
- Slonim, N., Atwal, G. S., Tkacik, G., and Bialek, W. Information-based clustering. PNAS, 102(51):18297–18302, 2005.
- Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), 2015.
- Tishby, N., Pereira, F., and Bialek, W. The information bottleneck method. In The 37th annual Allerton Conf. on Communication, Control, and Computing, pp. 368–377, 1999. URL https://arxiv.org/abs/physics/0004057.
- Tomczak, J. M. and Welling, M. VAE with a VampPrior. ArXiv e-prints, 2017.
- van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In NIPS. 2017.
- Zhao, S., Song, J., and Ermon, S. Infovae: Information maximizing variational autoencoders. arXiv preprint 1706.02262, 2017.
- Zhao, S., Song, J., and Ermon, S. The information-autoencoding family: A Lagrangian perspective on latent variable generative modeling, 2018. URL https://openreview.net/forum?id=ryZERzWCZ.
