Awesome Papers: 2016-12-1

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus

Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain hardly perceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.
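The second property can be illustrated on a toy linear model rather than a deep network. This is a minimal sketch, not the paper's method (the paper finds perturbations with box-constrained L-BFGS); here a single signed-gradient step is enough, and all weights and inputs are made up for illustration.

```python
import numpy as np

def predict(w, x):
    """Binary decision of a linear classifier with score w @ x."""
    return 1 if w @ x > 0 else 0

def adversarial_perturbation(w, x, eps):
    # For a linear score, the gradient of the score w.r.t. x is just w,
    # so stepping eps in the signed direction that lowers the current
    # class's score is the most damaging L-inf-bounded change.
    direction = -np.sign(w) if predict(w, x) == 1 else np.sign(w)
    return x + eps * direction

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 1.0, 1.0])                 # classified as 1 (score 1.0)
x_adv = adversarial_perturbation(w, x, eps=0.7)
print(predict(w, x), "->", predict(w, x_adv))  # the label flips
```

The perturbation is bounded by eps in every coordinate, yet it flips the decision, which is the "hardly perceptible perturbation" phenomenon in miniature.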



Matching Networks for One Shot Learning

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra

Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.
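The classifier at the core of this framework can be sketched as attention over the support set: the query's predicted label distribution is a softmax-over-cosine-similarity weighted sum of the support labels. Embeddings here are plain vectors; the paper learns them with deep networks and full-context embeddings, which this toy version omits.

```python
import numpy as np

def matching_predict(support_feats, support_labels, query_feat, n_classes):
    # Cosine similarity between the query and each support embedding.
    sims = np.array([
        s @ query_feat / (np.linalg.norm(s) * np.linalg.norm(query_feat))
        for s in support_feats
    ])
    attn = np.exp(sims) / np.exp(sims).sum()     # softmax attention kernel
    one_hot = np.eye(n_classes)[support_labels]  # support labels as one-hot
    return attn @ one_hot                        # predicted distribution

support = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels = [0, 1]
dist = matching_predict(support, labels, np.array([0.9, 0.1]), n_classes=2)
print(dist.argmax())   # -> 0: the closest support item's class wins
```

Because prediction is a function of the support set itself, adapting to new classes means swapping in a new support set, with no fine-tuning, which is the point the abstract makes.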



One-shot Learning with Memory-Augmented Neural Networks

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, Timothy Lillicrap

Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location based focusing mechanisms.
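The content-based memory access can be sketched as follows: read weights come only from the similarity between a key emitted by the controller and each stored memory row, with no location-based addressing (unlike the original NTM). The memory contents and key below are arbitrary illustrations.

```python
import numpy as np

def content_read(memory, key):
    # Cosine similarity between the key and every memory row,
    # sharpened into read weights by a softmax.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))
    w = np.exp(sims) / np.exp(sims).sum()
    return w @ memory, w    # retrieved vector, read weights

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
retrieved, weights = content_read(memory, key=np.array([1.0, 0.0]))
print(weights.argmax())   # -> 0: the row matching the key gets the most weight
```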



EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed ‘Deep Compression’ makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
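EIE's core kernel can be sketched as a sparse matrix-vector multiplication in which each stored weight is a small index into a shared codebook, and zero activations (from ReLU) are skipped entirely. The storage format below is a simplified column-major list, not EIE's actual CSC variant with 4-bit relative row indices.

```python
import numpy as np

def eie_spmv(n_rows, columns, codebook, x):
    # columns[j] holds (row, codebook_id) pairs for column j's nonzeros;
    # the real weight value is looked up in the shared codebook.
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:                       # skip zero activations entirely
            continue
        for row, cid in columns[j]:
            y[row] += codebook[cid] * xj    # decode the shared weight on the fly
    return y

codebook = np.array([-0.5, 0.25, 1.0])       # shared weight values
columns = [[(0, 2), (2, 0)], [], [(1, 1)]]   # a sparse 3x3 matrix
y = eie_spmv(3, columns, codebook, np.array([2.0, 5.0, 0.0]))
print(y)
```

Skipping the j-loop body whenever the activation is zero is where the "3× from ReLU sparsity" in the abstract comes from: work is proportional to nonzero activations, not to the input dimension.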




Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, William J. Dally

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce “deep compression”, a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9× to 13×; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49×, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup and 3× to 7× better energy efficiency.
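The first two pipeline stages can be sketched on a single weight vector: magnitude pruning, then weight sharing against a small codebook. The paper learns centroids with k-means and retrains both the surviving connections and the centroids; the fixed centroids here are a hypothetical stand-in, and Huffman coding of the resulting indices is omitted.

```python
import numpy as np

def prune(w, threshold):
    # Stage 1: drop connections whose magnitude is below the threshold.
    return np.where(np.abs(w) < threshold, 0.0, w)

def quantize(w, centroids):
    # Stage 2: snap each surviving weight to its nearest shared centroid,
    # so only a small index per connection needs to be stored.
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    q = centroids[idx]
    q[w == 0.0] = 0.0       # pruned weights stay exactly zero
    return q, idx

w = np.array([0.02, -0.71, 0.48, -0.03, 1.02])
centroids = np.array([-0.75, 0.5, 1.0])
q, idx = quantize(prune(w, threshold=0.1), centroids)
print(q)   # survivors snapped to shared centroids, pruned entries at zero
```

With, say, 32 shared centroids, each stored connection shrinks from a 32-bit float to a 5-bit index, which is the 32-to-5-bit reduction the abstract quotes.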



Deep learning with COTS HPC systems

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Y. Ng, Bryan Catanzaro

Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloud-like computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Our system is able to train 1 billion parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines. As this infrastructure is much more easily marshaled by others, the approach enables much wider-spread research with extremely large neural networks.


The Development of Embodied Cognition: Six Lessons from Babies

Linda Smith, Michael Gasser

The embodiment hypothesis is the idea that intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity. In this paper we offer six lessons for developing embodied intelligent agents suggested by research in developmental psychology. We argue that starting as a baby grounded in a physical, social and linguistic world is crucial to the development of the flexible and inventive intelligence that characterizes humankind.



Neural Autoregressive Distribution Estimation

Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, Hugo Larochelle

We present Neural Autoregressive Distribution Estimation (NADE) models, which are neural network architectures applied to the problem of unsupervised distribution and density estimation. They leverage the probability product rule and a weight sharing scheme inspired from restricted Boltzmann machines, to yield an estimator that is both tractable and has good generalization performance. We discuss how they achieve competitive performance in modeling both binary and real-valued observations. We also present how deep NADE models can be trained to be agnostic to the ordering of input dimensions used by the autoregressive product rule decomposition. Finally, we also show how to exploit the topological structure of pixels in images using a deep convolutional architecture for NADE.



Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Andrew M. Saxe, James L. McClelland, Surya Ganguli

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
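The random orthogonal initialization the paper proposes can be sketched as follows: orthogonalize a Gaussian matrix via QR (with a sign correction so the result is uniform over the orthogonal group). A product of such layers preserves the norm of its input exactly, which is why signals and gradients survive arbitrary depth in the linear case.

```python
import numpy as np

def random_orthogonal(n, rng):
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))   # fix QR's column-sign ambiguity

rng = np.random.default_rng(0)
layers = [random_orthogonal(5, rng) for _ in range(50)]   # a deep linear net
x = rng.normal(size=5)
y = x.copy()
for W in layers:
    y = W @ y
print(np.linalg.norm(x), np.linalg.norm(y))   # norms agree to machine precision
```

A scaled Gaussian initialization, by contrast, has singular values spread around 1, so the product of 50 such layers would shrink or blow up the signal exponentially with depth.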



Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear

Zachary C. Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, Li Deng

To use deep reinforcement learning in the wild, we might hope for an agent that can avoid catastrophic mistakes. Unfortunately, even in simple environments, the popular deep Q-network (DQN) algorithm is doomed by a Sisyphean curse. Owing to the use of function approximation, these agents eventually forget experiences as they become exceedingly unlikely under a new policy. Consequently, for as long as they continue to train, DQNs may periodically relive catastrophic mistakes. In this paper, we demonstrate unacceptable performance of DQNs on two toy problems. We then introduce intrinsic fear, a new method that mitigates these problems by avoiding states deemed dangerous. Our approach incorporates a second model trained via supervised learning to predict the probability of catastrophe within a short number of steps. This score then acts to penalize the Q-learning objective, shaping the reward function away from catastrophic states.
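The way the fear penalty reshapes the Q-learning target can be sketched in one line: the separately trained danger model supplies a probability of catastrophe within a few steps, and that probability, scaled by a fear coefficient, is subtracted from the usual bootstrap target. The danger model itself (a supervised classifier) is omitted here, and the coefficient values are illustrative.

```python
def fear_penalized_target(reward, next_q_max, fear_prob, gamma=0.99, lam=2.0):
    # Standard Q-learning bootstrap target, minus a penalty proportional
    # to the danger model's predicted probability of imminent catastrophe.
    return reward + gamma * next_q_max - lam * fear_prob

safe  = fear_penalized_target(1.0, 2.0, fear_prob=0.0)
risky = fear_penalized_target(1.0, 2.0, fear_prob=0.5)
print(safe, risky)   # identical rewards, but the risky state's target is lower
```

Because the penalty enters the target rather than the environment reward, the learned Q-function is steered away from dangerous states even when the environment itself never signals danger until the catastrophe occurs.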







《Our most detailed view of Earth across space and time》 by Chris Herwig

Link: Our most detailed view of Earth across space and time


《Amazon launches new artificial intelligence services for developers: Image recognition, text-to-speech, Alexa NLP (GeekWire)》 by Taylor Soper

Link: Amazon launches new artificial intelligence services for developers


《Learning without Backpropagation: Intuition and Ideas (Part 1)》 by Tom Breloff

Link: Learning without Backpropagation


《Feature Selection methods with example (Variable selection methods)》 by Saurav Kaushik

Link: Feature Selection methods with example








National University of Defense Technology
Changsha, Hunan 410073