Awesome Papers: 2016-12-1

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus

Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain hardly perceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.
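To make the second property concrete, the sketch below nudges an input to a tiny logistic-regression classifier in the direction that increases its loss. It is a deliberately simplified stand-in: the abstract only says the perturbation is found by maximizing the prediction error on a deep network, whereas this takes a single signed-gradient step on a linear model. The weights, the input and the step size `eps` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def perturb(w, b, x, y, eps):
    """Take one signed-gradient step that increases the cross-entropy loss.

    For logistic regression, d(loss)/d(x_i) = (p - y) * w_i.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical 20-dimensional model and input (not taken from the paper).
w = [1.0, -1.0] * 10
b = 0.0
x = [0.55, 0.45] * 10          # classified as class 1
x_adv = perturb(w, b, x, y=1, eps=0.1)

print(predict(w, b, x) > 0.5)      # original prediction is class 1
print(predict(w, b, x_adv) > 0.5)  # the perturbed input flips the prediction
```

Each coordinate moves by only `eps`, yet the changes add up across all dimensions, which is why a small per-coordinate perturbation can still flip the decision.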


Matching Networks for One Shot Learning

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra

Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.
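The mapping from a labelled support set and an unlabelled example to a label can be sketched as attention over the support set. The version below assumes cosine similarity followed by a softmax, applied to embeddings that are already computed; the embedding networks are omitted, and the two-dimensional vectors and class names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(query, support):
    """support: list of (embedding, label) pairs. Returns label -> probability."""
    attention = softmax([cosine(query, emb) for emb, _ in support])
    probs = {}
    for a, (_, label) in zip(attention, support):
        probs[label] = probs.get(label, 0.0) + a
    return probs

# One example per class: a one-shot, two-way task with made-up embeddings.
support = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog")]
probs = classify([0.9, 0.1], support)
print(max(probs, key=probs.get))
```

Because the prediction is a weighted vote over the support labels, a new class can be added by adding one labelled example to `support`, with no fine-tuning step.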


One-shot Learning with Memory-Augmented Neural Networks

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, Timothy Lillicrap

Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location based focusing mechanisms.
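A purely content-based memory read, as opposed to one that also uses memory locations, might look like the following sketch. It assumes cosine similarity between a query key and each memory row, normalized by a softmax; the write rule and the controller network are left out, and the memory contents are hypothetical.

```python
import math

def content_read(memory, key):
    """Read from memory by content: softmax over cosine similarity to `key`."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    sims = [cosine(row, key) for row in memory]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    # The read vector is the attention-weighted sum of the memory rows.
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return weights, read

memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, read = content_read(memory, [0.0, 2.0, 0.1])
print(weights.index(max(weights)))   # index of the best-matching slot
```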


EIE: Efficient Inference Engine on Compressed Deep Neural Network

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed ‘Deep Compression’ makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
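The computation EIE accelerates, a sparse matrix-vector product over shared weights that skips zero activations, can be sketched in software as follows. The column-wise list of `(row, code)` pairs is an assumed storage layout chosen for readability, not the hardware's actual encoding, and the codebook values are hypothetical.

```python
def sparse_matvec(columns, codebook, activations, n_rows):
    """Compute y = W @ a for a pruned, weight-shared matrix W.

    columns[j] lists (row, code) pairs for the nonzeros of column j;
    the actual weight value is codebook[code].
    """
    y = [0.0] * n_rows
    for j, a in enumerate(activations):
        if a == 0.0:              # skip zero activations (e.g. after ReLU)
            continue
        for row, code in columns[j]:
            y[row] += codebook[code] * a
    return y

codebook = [-0.5, 0.25, 1.0]      # shared weight values
# Dense equivalent of the matrix stored below:
# [[ 1.0, 0.0,   0.0],
#  [ 0.0, 0.25, -0.5],
#  [-0.5, 0.0,   1.0]]
columns = [
    [(0, 2), (2, 0)],
    [(1, 1)],
    [(1, 0), (2, 2)],
]
y = sparse_matvec(columns, codebook, [2.0, 0.0, 4.0], n_rows=3)
print(y)  # [2.0, -2.0, 3.0]
```

Only the nonzero weights in columns with a nonzero activation are touched, and each stored weight is a small index into the codebook rather than a full-precision value.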


Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, William J. Dally

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce “deep compression”, a three-stage pipeline (pruning, trained quantization and Huffman coding) whose stages work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing. Finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9× to 13×; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35×, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49×, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3× to 4× layerwise speedup and 3× to 7× better energy efficiency.
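The first two stages can be illustrated with a toy example. The sketch below assumes pruning by a fixed magnitude threshold and quantization to the nearest entry of a fixed list of centroids; in the paper both the surviving connections and the centroids are tuned by retraining, and the Huffman coding stage is omitted here. All numbers are hypothetical.

```python
def prune(weights, threshold):
    """Zero out weights whose magnitude is below `threshold`."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize(weights, centroids):
    """Replace each nonzero weight by the index of its nearest centroid.

    Pruned weights become None so that they need not be stored at all.
    """
    codes = []
    for w in weights:
        if w == 0.0:
            codes.append(None)
        else:
            nearest = min(range(len(centroids)), key=lambda k: abs(centroids[k] - w))
            codes.append(nearest)
    return codes

weights = [0.02, -0.9, 0.45, -0.01, 1.1, 0.5]
pruned = prune(weights, threshold=0.1)
centroids = [-1.0, 0.5, 1.0]      # hypothetical shared weights
codes = quantize(pruned, centroids)
restored = [0.0 if c is None else centroids[c] for c in codes]
print(codes)     # [None, 0, 1, None, 2, 1]
print(restored)  # [0.0, -1.0, 0.5, 0.0, 1.0, 0.5]
```

With three centroids each surviving weight needs only a 2-bit index plus the small shared codebook, which is the same idea behind the reduction from 32 to 5 bits per connection reported in the abstract.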


Deep learning with COTS HPC systems

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Y. Ng, Bryan Catanzaro

Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloudlike computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Our system is able to train 1 billion parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines. As this infrastructure is much more easily marshaled by others, the approach enables much wider-spread research with extremely large neural networks.


Neural Autoregressive Distribution Estimation

Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, Hugo Larochelle

We present Neural Autoregressive Distribution Estimation (NADE) models, which are neural network architectures applied to the problem of unsupervised distribution and density estimation. They leverage the probability product rule and a weight sharing scheme inspired from restricted Boltzmann machines, to yield an estimator that is both tractable and has good generalization performance. We discuss how they achieve competitive performance in modeling both binary and real-valued observations. We also present how deep NADE models can be trained to be agnostic to the ordering of input dimensions used by the autoregressive product rule decomposition. Finally, we also show how to exploit the topological structure of pixels in images using a deep convolutional architecture for NADE.
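The product rule decomposition p(x) = ∏_d p(x_d | x_<d) and the weight sharing across conditionals can be sketched for binary inputs as below. This is a minimal single-hidden-layer version with sigmoid units and a fixed input ordering; the parameter values are hypothetical and untrained. Summing the model's probabilities over all binary vectors gives 1, which is what makes the estimator tractable.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nade_log_prob(x, W, c, V, b):
    """log p(x) = sum_d log p(x_d | x_<d) for a binary vector x.

    W[h][d], c[h]: input-to-hidden weights and hidden biases (shared by all conditionals).
    V[d][h], b[d]: hidden-to-output weights and output biases.
    """
    pre = list(c)                       # hidden pre-activations given x_<d
    log_p = 0.0
    for d, x_d in enumerate(x):
        hidden = [sigmoid(a) for a in pre]
        p_one = sigmoid(b[d] + sum(v * h for v, h in zip(V[d], hidden)))
        log_p += math.log(p_one if x_d == 1 else 1.0 - p_one)
        # Reuse the running sum: adding x_d costs one pass over the hidden units.
        pre = [a + W[h][d] * x_d for h, a in enumerate(pre)]
    return log_p

# Hypothetical parameters for 3 inputs and 2 hidden units.
W = [[0.5, -0.3, 0.8], [-0.2, 0.7, 0.1]]
c = [0.0, 0.1]
V = [[0.3, -0.6], [0.9, 0.2], [-0.4, 0.5]]
b = [0.1, -0.2, 0.0]

total = sum(math.exp(nade_log_prob(x, W, c, V, b))
            for x in itertools.product([0, 1], repeat=3))
print(round(total, 9))  # probabilities over all 8 inputs sum to 1
```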


Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Andrew M. Saxe, James L. McClelland, Surya Ganguli

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
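One intuition for why random orthogonal initial conditions help is that an orthogonal matrix preserves vector norms, so a signal passed through many such layers neither explodes nor vanishes. The sketch below builds random orthogonal matrices by Gram-Schmidt orthonormalization of Gaussian rows and pushes a vector through a 50-layer deep linear network. It illustrates only this norm-preservation property, not the paper's analysis of learning dynamics; the layer width, depth and seed are arbitrary choices.

```python
import math
import random

def random_orthogonal(n, rng):
    """Orthonormalize the rows of a random Gaussian matrix (Gram-Schmidt)."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:                 # retry in the unlikely degenerate case
            rows.append([a / norm for a in v])
    return rows

def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

rng = random.Random(0)
x = [1.0, -2.0, 0.5, 3.0]
h = x
for _ in range(50):                     # a 50-layer deep linear network
    h = matvec(random_orthogonal(4, rng), h)

norm_in = math.sqrt(sum(a * a for a in x))
norm_out = math.sqrt(sum(a * a for a in h))
print(abs(norm_in - norm_out) < 1e-9)   # the norm is preserved through all layers
```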


Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear

Zachary C. Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, Li Deng

To use deep reinforcement learning in the wild, we might hope for an agent that can avoid catastrophic mistakes. Unfortunately, even in simple environments, the popular deep Q-network (DQN) algorithm is doomed by a Sisyphean curse. Owing to the use of function approximation, these agents eventually forget experiences as they become exceedingly unlikely under a new policy. Consequently, for as long as they continue to train, DQNs may periodically relive catastrophic mistakes. In this paper, we demonstrate unacceptable performance of DQNs on two toy problems. We then introduce intrinsic fear, a new method that mitigates these problems by avoiding states deemed dangerous. Our approach incorporates a second model trained via supervised learning to predict the probability of catastrophe within a short number of steps. This score then acts to penalize the Q-learning objective, shaping the reward function away from catastrophic states.
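The penalty described in the last two sentences can be sketched as a modified Q-learning target. The form below, which subtracts a weighted fear probability from the usual bootstrapped target, is an assumption made for illustration; the abstract does not give the exact formula. `fear_prob` stands for the output of the second, supervised model, and `fear_weight` is a hypothetical scaling parameter.

```python
def fear_penalized_target(reward, next_q_values, fear_prob,
                          gamma=0.99, fear_weight=1.0, terminal=False):
    """Q-learning target with a penalty for states the fear model flags as dangerous."""
    bootstrap = 0.0 if terminal else gamma * max(next_q_values)
    return reward + bootstrap - fear_weight * fear_prob

# Same transition, scored as safe and as likely to lead to catastrophe.
safe = fear_penalized_target(1.0, [0.5, 2.0], fear_prob=0.0, gamma=0.9)
risky = fear_penalized_target(1.0, [0.5, 2.0], fear_prob=0.8,
                              gamma=0.9, fear_weight=5.0)
print(safe > risky)
```

With these numbers the safe target is about 2.8 and the risky one about -1.2, so the agent is steered away from states the fear model considers close to catastrophe.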


Tools

Five capability levels of deep learning

Link: Five capability levels of deep learning

Visualization: Earth's transformation over time in a single view

"Our most detailed view of Earth across space and time" by Chris Herwig

Link: Our most detailed view of Earth across space and time

Amazon launches AI services for developers: image recognition, speech synthesis and NLP

"Amazon launches new artificial intelligence services for developers: Image recognition, text-to-speech, Alexa NLP" (GeekWire) by Taylor Soper

Link: Amazon launches new artificial intelligence services for developers

Network learning (optimization) without backpropagation: intuition and ideas

"Learning without Backpropagation: Intuition and Ideas (Part 1)" by Tom Breloff

Link: Learning without Backpropagation

Feature selection methods explained with examples

"Feature Selection methods with example (Variable selection methods)" by Saurav Kaushik

Link: Feature Selection methods with example


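Other

Academic: A study of airborne ad hoc networks based on a multi-frequency hierarchical architecture

Based on an analysis of the characteristics of aircraft formations, this paper studies in depth the issues involved in networking an aircraft formation and proposes an IP-based, multi-frequency hierarchical networking architecture. Building on that architecture, it then designs a hierarchical routing protocol.

Link: A study of airborne ad hoc networks based on a multi-frequency hierarchical architecture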

Phone

07318457661

Address

National University of Defense Technology
Changsha, Hunan 410073
China