# Awesome Papers: 2017-01-3

### Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

Ludovic Hofer, Hugo Gimbert

This paper presents a new method to learn online policies in continuous state, continuous action, model-free Markov decision processes, with two properties that are crucial for practical applications.First, the policies are implementable with a very low computational cost: once the policy is computed,the action corresponding to a given state is obtained in logarithmic time with respect to the number of samples used. Second, our method is versatile: it does not rely on any a priori knowledge of the structure of optimal policies. We build upon the Fitted Q-iteration algorithm which represents the $Q$-value as the average of several regression trees. Our algorithm, the Fitted Policy Forest algorithm (FPF), computes a regression forest representing the Q-value and transforms it into a single tree representing the policy, while keeping control on the size of the policy using resampling and leaf merging. We introduce an adaptation of Multi-Resolution Exploration (MRE) which is particularly suited to FPF. We assess the performance of FPF on three classical benchmarks for reinforcement learning: the “Inverted Pendulum”, the “Double Integrator” and “Car on the Hill” and show that FPF equals or outperforms other algorithms, although these algorithms rely on the use of particular representations of the policies, especially chosen in order to fit each of the three problems. Finally, we exhibit that the combination of FPF and MRE allows to find nearly optimal solutions in problems where $\epsilon$-greedy approaches would fail.

### DeepMind Lab

Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich K"uttler, Andrew Lefrancq, Simon Green, V'ictor Vald'es, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis,Shane Legg and Stig Petersen

DeepMind Lab is a first-person 3D game platform designed for research and development of general artificial intelligence and machine learning systems.DeepMind Lab can be used to study how autonomous artificial agents may learn complex tasks in large, partially observed, and visually diverse worlds. DeepMind Lab has a simple and flexible API enabling creative task-designs and novel AI-designs to be explored and quickly iterated upon. It is powered by a fast and widely recognised game engine, and tailored for effective use by the research community.

### Multiple Instance Learning: A Survey of Problem Characteristics and Applications

Marc-Andr'e Carbonneau, Veronika Cheplygina, Eric Granger and Ghyslain Gagnon

Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories:the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally,experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research.

### Design and Development of Bayes’ Minimax Linear Classification Systems

Denise M. Reeves

This paper considers the design and development of Bayes’ minimax, linear classification systems using linear discriminant functions that are Bayes’ equalizer rules. Bayes’ equalizer rules divide two-class feature spaces into decision regions that have equal classification errors. I will formulate the problem of learning unknown linear discriminant functions from data as a locus problem, thereby formulating geometric locus methods within a statistical framework. Solving locus problems involves finding the equation of a curve or surface defined by a given property, and finding the graph or locus of a given equation. I will devise a system of locus equations that determines Bayes’equalizer rules which is based on a variant of the inequality constrained optimization problem for linear kernel support vector machines. Thereby, I will define a class of learning machines which are fundamental building blocks for Bayes’ minimax pattern recognition systems.

### The Predictron: End-To-End Learning and Planning

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model,represented by a Markov reward process, that can be rolled forward multiple “imagined” planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.

Janarthanan Rajendran, Aravind S. Lakshminarayanan, Mitesh M. Khapra, P Prasanna, Balaraman Ravindran

### Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning

Wendelin B"ohmer and Rong Guo and Klaus Obermayer

This paper investigates a type of instability that is linked to the greedy policy improvement in approximated reinforcement learning. We show empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements’ stochasticity. Additionally we show that a suitable representation of the value function also stabilizes the solution to some degree. The presented approach is simple and should also be easily transferable to more sophisticated algorithms like deep reinforcement learning.

### How to Train Your Deep Neural Network with Dictionary Learning

Vanika Singhal, Shikha Singh and Angshul Majumdar

Currently there are two predominant ways to train deep neural networks. The first one uses restricted Boltzmann machine (RBM) and the second one autoencoders. RBMs are stacked in layers to form deep belief network (DBN); the final representation layer is attached to the target to complete the deep neural network. Autoencoders are nested one inside the other to form stacked autoencoders; once the stcaked autoencoder is learnt the decoder portion is detached and the target attached to the deepest layer of the encoder to form the deep neural network. This work proposes a new approach to train deep neural networks using dictionary learning as the basic building block; the idea is to use the features from the shallower layer as inputs for training the next deeper layer. One can use any type of dictionary learning (unsupervised,supervised, discriminative etc.) as basic units till the pre-final layer. In the final layer one needs to use the label consistent dictionary learning formulation for classification. We compare our proposed framework with existing state-of-the-art deep learning techniques on benchmark problems; we are always within the top 10 results. In actual problems of age and gender classification,we are better than the best known techniques.

### A note on the function approximation error bound for risk-sensitive reinforcement learning

Prasenjit Karmakar, Shalabh Bhatnagar

In this paper we improve the existing function approximation error bound for the policy evaluation algorithm when the aim is to find the risk-sensitive cost represented using exponential utility.

### Deep Learning and Its Applications to Machine Health Monitoring: A Survey

Rui Zhao, Ruqiang Yan, Zhenghua Chen, Kezhi Mao, Peng Wang and Robert X. Gao Categories: cs.LG stat.ML

Since 2006, deep learning (DL) has become a rapidly growing research direction, redefining state-of-the-art performances in a wide range of areas such as object recognition, image segmentation, speech recognition and machine translation. In modern manufacturing systems, data-driven machine health monitoring is gaining in popularity due to the widespread deployment of low-cost sensors and their connection to the Internet. Meanwhile, deep learning provides useful tools for processing and analyzing these big machinery data.The main purpose of this paper is to review and summarize the emerging research work of deep learning on machine health monitoring. After the brief introduction of deep learning techniques, the applications of deep learning in machine health monitoring systems are reviewed mainly from the following aspects: Auto-encoder (AE) and its variants, Restricted Boltzmann Machines and its variants including Deep Belief Network (DBN) and Deep Boltzmann Machines (DBM), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).Finally, some new trends of DL-based machine health monitoring methods are discussed.

### Adaptive Lambda Least-Squares Temporal Difference Learning

Timothy A. Mann and Hugo Penedones and Shie Mannor and Todd Hester

Temporal Difference learning or TD(λ) is a fundamental algorithm in the field of reinforcement learning. However, setting TD’s λ parameter, which controls the timescale of TD updates, is generally left up to the practitioner. We formalize the λ selection problem as a bias-variance trade-off where the solution is the value of λ that leads to the smallest Mean Squared Value Error (MSVE). To solve this trade-off we suggest applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of λ values. Unfortunately, this approach is too computationally expensive for most practical applications. For Least Squares TD (LSTD) we show that LOTO-CV can be implemented efficiently to automatically tune λ and apply function optimization methods to efficiently search the space of λ values. The resulting algorithm, ALLSTD, is parameter free and our experiments demonstrate that ALLSTD is significantly computationally faster than the na"{i}ve LOTO-CV implementation while achieving similar performance.

### Email

fengyanghe@gmail.com

07318457661