Loss Landscapes

Visualizing the Loss Landscape

The loss landscape graphically represents the model’s loss function, a measure of how well the model can make predictions on a given dataset. Previous work has shown that the structure of the loss landscape foretells the generalizability and robustness of a model solution (Keskar et al.). Furthermore, recent optimization methods leverage local loss information to traverse the loss landscape and achieve drastic training improvements (Foret et al.). Most papers use loss visualization to validate model performance and to compare solutions. In contrast, here we describe methods and heuristics for analyzing the loss landscape with the aim of improving model architecture, adjusting training hyperparameters, and gaining insight into the training process of large models.
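
To make this concrete, below is a minimal sketch of one common way such a visualization is produced: evaluating the loss on a small grid spanned by two random directions around the trained weights. The PyTorch model, data, and loss_fn are assumed placeholders, and the filter normalization used in practice (e.g., in Li et al.'s visualizations) is omitted for brevity.

```python
import torch

def loss_surface(model, loss_fn, inputs, targets, steps=21, radius=1.0):
    params = list(model.parameters())
    theta = [p.detach().clone() for p in params]      # trained weights theta*
    d1 = [torch.randn_like(p) for p in params]        # two random directions
    d2 = [torch.randn_like(p) for p in params]
    alphas = torch.linspace(-radius, radius, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, t, u, v in zip(params, theta, d1, d2):
                    p.copy_(t + a * u + b * v)         # perturb theta* along a*d1 + b*d2
                surface[i, j] = loss_fn(model(inputs), targets).item()
        for p, t in zip(params, theta):                # restore the trained weights
            p.copy_(t)
    return surface                                     # e.g. plot with a contour plot
```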


Deep Neural Nets: 33 years ago and 33 years from now (Invited Post)

The Yann LeCun et al. (1989) paper Backpropagation Applied to Handwritten Zip Code Recognition is, I believe, of some historical significance because it is, to my knowledge, the earliest real-world application of a neural net trained end-to-end with backpropagation. Except for the tiny dataset (7291 16x16 grayscale images of digits) and the tiny neural network used (only 1,000 neurons), this paper reads remarkably modern today, 33 years later - it lays out a dataset, describes the neural net architecture, loss function, and optimization, and reports the experimental classification error rates over training and test sets. It’s all very recognizable and type checks as a modern deep learning paper, except it is from 33 years ago. So I set out to reproduce the paper 1) for fun, but 2) to use the exercise as a case study on the nature of progress in deep learning.
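
For a sense of scale, here is a rough, modern PyTorch approximation of a network of that size; the layer widths below are illustrative placeholders, not the exact 1989 architecture.

```python
import torch
import torch.nn as nn

# A toy "1989-scale" digit classifier: 16x16 grayscale inputs, tanh activations,
# and only a handful of small layers (about 10k parameters for this sketch).
tiny_net = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2),   # 16x16 -> 8x8
    nn.Tanh(),
    nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2),  # 8x8 -> 4x4
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(12 * 4 * 4, 30),
    nn.Tanh(),
    nn.Linear(30, 10),
)

x = torch.randn(1, 1, 16, 16)                           # one fake 16x16 digit
print(tiny_net(x).shape)                                # torch.Size([1, 10])
print(sum(p.numel() for p in tiny_net.parameters()))    # parameter count of this sketch
```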


A Deeper Look at Zero-Cost Proxies for Lightweight NAS

Imagine you have a brand new dataset, and you are trying to find a neural network that achieves high validation accuracy on this dataset. You choose a neural network, but after 3 hours of training, you find that the validation accuracy is only 85%. After more choices of neural networks — and many GPU-hours — you finally find one that has an accuracy of 93%. Is there an even better neural network? And can this whole process become faster?
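
One family of answers the post examines is zero-cost proxies: scores computed on an untrained network from a single minibatch. As a hedged illustration (the function and the candidate architecture below are made up for this example, not taken from the paper), a simple gradient-norm-style proxy looks like this:

```python
import torch
import torch.nn as nn

def grad_norm_score(model, inputs, targets, loss_fn=nn.CrossEntropyLoss()):
    # Score an *untrained* network by the norm of its gradients on one minibatch.
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters()
               if p.grad is not None)

# Usage: rank many candidate architectures by this score in seconds,
# then spend the GPU-hours only on the top-ranked ones.
candidate = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256),
                          nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(grad_norm_score(candidate, x, y))
```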


Normalization is dead, long live normalization!

Since the advent of Batch Normalization (BN), almost every state-of-the-art (SOTA) method uses some form of normalization. After all, normalization generally speeds up learning and leads to models that generalize better than their unnormalized counterparts. This turns out to be especially useful when using some form of skip connections, which are prominent in Residual Networks (ResNets), for example. However, Brock et al. (2021a) suggest that SOTA performance can also be achieved using ResNets without normalization!
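
As a rough illustration of what "no normalization" can look like (a minimal sketch only, not Brock et al.'s actual recipe, which involves additional ingredients), consider a residual block whose branch output is simply downscaled so activations do not blow up with depth:

```python
import torch
import torch.nn as nn

class NormFreeBlock(nn.Module):
    def __init__(self, channels, alpha=0.2):
        super().__init__()
        self.alpha = alpha  # small residual scale instead of a normalization layer
        self.branch = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.alpha * self.branch(x)  # no BatchNorm anywhere

x = torch.randn(2, 16, 8, 8)
print(NormFreeBlock(16)(x).shape)  # torch.Size([2, 16, 8, 8])
```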


Understanding Few-Shot Multi-Task Representation Learning Theory

Multi-Task Representation Learning (MTR) is a popular paradigm for learning shared representations from multiple related tasks. It has proven effective for problems ranging from machine translation in natural language processing to object detection in computer vision. Few-Shot Learning, on the other hand, is a more recent problem that seeks to mimic the human ability to quickly learn how to solve a target task with little supervision. Here, researchers have turned to meta-learning, which learns to learn a new task by training a model on many small tasks. Since meta-learning still lacks a theoretical understanding of its success on few-shot tasks, an intuitively appealing approach is to bridge the gap between it and multi-task learning, using the results established for the latter to better understand the former. In this post, we dive into a recent ICLR 2021 paper by S. Du, W. Hu, S. Kakade, J. Lee and Q. Lei that establishes novel learning bounds for multi-task learning in the few-shot setting, and we go beyond it by drawing connections that also shed light on the inner workings of meta-learning algorithms.
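
To give a flavor of the kind of result involved (the expression below is a schematic paraphrase in our own notation, not a verbatim statement of the paper's theorems), bounds of this type trade off the cost of learning the shared representation from the source tasks against the cost of fitting the small task-specific head:

```latex
% Schematic only (our notation): T source tasks with n_1 samples each, a shared
% k-dimensional representation of d-dimensional inputs, and n_2 target samples.
\text{excess risk on the target task}
\;\lesssim\;
\underbrace{\sqrt{\frac{kd}{n_1 T}}}_{\text{learning the shared representation}}
\;+\;
\underbrace{\sqrt{\frac{k}{n_2}}}_{\text{learning the task-specific head}}
```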

"Auction Learning as a Two Player Game": GANs (?) for Mechanism Design

We discuss a new contribution to the nascent area of deep learning for revenue-maximizing auction design, which uses a GAN-style approach in which two neural networks, one modeling strategic behavior by bidders and the other modeling an auctioneer, compete with each other.


An Understanding of Learning from Demonstrations for Neural Text Generation

In this blog post, we will go over the ICLR 2021 paper titled Text Generation by Learning from Demonstration. The paper introduces a learning method based on offline, off-policy reinforcement learning (RL) that addresses two key limitations of the standard training objective used in neural text generation models: maximum likelihood estimation (MLE).
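
As a schematic contrast only (the weighting scheme below is a placeholder, not the paper's exact objective), standard MLE weights every demonstrated token equally, whereas an offline, off-policy formulation reweights tokens, e.g. by importance weights or value estimates:

```python
import torch
import torch.nn.functional as F

def mle_loss(logits, targets):
    # logits: (seq_len, vocab), targets: (seq_len,) -- every token counts equally
    return F.cross_entropy(logits, targets)

def weighted_nll_loss(logits, targets, weights):
    # weights: (seq_len,) per-token weights, e.g. from importance sampling or
    # a reward/value estimate in an offline RL formulation (placeholder here).
    nll = F.cross_entropy(logits, targets, reduction="none")
    return (weights * nll).mean()

logits = torch.randn(5, 100)
targets = torch.randint(0, 100, (5,))
weights = torch.rand(5)
print(mle_loss(logits, targets), weighted_nll_loss(logits, targets, weights))
```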


Rethinking ValueDice - Does It Really Improve Performance?

This post rethinks the ValueDice algorithm introduced in the following ICLR publication. We present several new conclusions, some of which may provide new insights.


Representation Change in Model-Agnostic Meta-Learning

Last year, an exciting adaptation of one of the most popular optimization-based meta-learning approaches, model-agnostic meta-learning (MAML) [Finn et al., 2017], was proposed in

   ▶  Jaehoon Oh, Hyungjun Yoo, ChangHwan Kim, Se-Young Yun (ICLR, 2021) BOIL: Towards Representation Change for Few-shot Learning

The authors adapt MAML by freezing the last layer during the inner loop to force body-only inner learning (BOIL). Interestingly, this is complementary to ANIL (almost no inner loop), proposed in

   ▶  Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals (ICLR, 2020) Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML

Both papers attempt to understand the success of MAML and to improve it. Oh et al. [2021] compare BOIL, ANIL, and MAML and show that both variants improve on MAML's performance, with BOIL outperforming ANIL, especially when the task distribution differs between training and testing.
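
A minimal sketch of the difference between the three variants, assuming a generic body/head split (the toy modules below are placeholders): MAML adapts everything in the inner loop, ANIL adapts only the head, and BOIL adapts only the body; the outer meta-update still touches all parameters in every case.

```python
import torch.nn as nn

body = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))  # feature extractor
head = nn.Linear(64, 5)                                                 # classifier

def inner_loop_params(variant):
    if variant == "MAML":   # adapt body and head in the inner loop
        return list(body.parameters()) + list(head.parameters())
    if variant == "ANIL":   # almost no inner loop: adapt only the head
        return list(head.parameters())
    if variant == "BOIL":   # body-only inner learning: the head stays frozen
        return list(body.parameters())
    raise ValueError(variant)

print({v: len(inner_loop_params(v)) for v in ("MAML", "ANIL", "BOIL")})
```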


The 37 Implementation Details of Proximal Policy Optimization

Jon is a first-year master’s student who is interested in reinforcement learning (RL). In his eyes, RL seemed fascinating because he could use RL libraries such as Stable-Baselines3 (SB3) to train agents to play all kinds of games. He quickly recognized Proximal Policy Optimization (PPO) as a fast and versatile algorithm and wanted to implement PPO himself as a learning experience. Upon reading the paper, Jon thought to himself, “huh, this is pretty straightforward.” He then opened a code editor and started writing PPO. CartPole-v1 from Gym was his chosen simulation environment, and before long, Jon made PPO work with CartPole-v1. He had a great time and felt motivated to make his PPO work with more interesting environments, such as the Atari games and MuJoCo robotics tasks. “How cool would that be?” he thought.


Discovering Non-Monotonic Autoregressive Ordering for Text Generation Models using Sinkhorn Distributions

Natural Language Generation (NLG) is the process of producing a sequence of natural language tokens. While the input to the NLG pipeline includes, but is not limited to, audio, video, images, structured documents, and natural language itself, the output is restricted to human-readable text.


Knowledge Graph Papers @ ICLR 2021

Hi! 👋 Today we are going to have a look at ICLR 2021 papers focusing on knowledge graphs (KGs), particularly in the areas of graph representation learning and NLP. Among the 860 accepted papers, we highlight 10 particularly interesting and promising works that might influence the field in the near future. The post is structured as follows:


Euclidean geometry meets graph, a geometric deep learning perspective

Graph neural networks (GNNs) have been an active area of machine learning research for tackling various problems in graph data. A graph is a powerful way of representing relationships among entities as nodes connected by edges. Sometimes nodes and edges have spatial features, such as 3D coordinates of nodes and directions along edges. How do we reason over the topology of a graph while taking these geometric features into account? In this post, we discuss Learning from Protein Structure with Geometric Vector Perceptrons (Jing et al. 2021), published at ICLR 2021.
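
To illustrate the general idea of mixing topology with geometry (this is a toy message-passing layer, not the Geometric Vector Perceptron from the paper), a message can be conditioned on the relative 3D positions of neighboring nodes:

```python
import torch
import torch.nn as nn

class GeometricMessagePassing(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # message takes [h_i, h_j, distance, unit direction] -> new message
        self.msg = nn.Sequential(nn.Linear(2 * dim + 1 + 3, dim), nn.ReLU())
        self.upd = nn.Linear(2 * dim, dim)

    def forward(self, h, pos, edge_index):
        src, dst = edge_index                       # edges j -> i
        rel = pos[src] - pos[dst]                   # relative coordinates
        dist = rel.norm(dim=-1, keepdim=True)
        direction = rel / (dist + 1e-8)
        m = self.msg(torch.cat([h[dst], h[src], dist, direction], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum messages per node
        return self.upd(torch.cat([h, agg], dim=-1))

h = torch.randn(4, 16)                              # node features
pos = torch.randn(4, 3)                             # 3D coordinates
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
print(GeometricMessagePassing(16)(h, pos, edge_index).shape)  # torch.Size([4, 16])
```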


Symbolic Binding in Neural Networks through Factorized Memory Systems

In this blog post, we describe the paper 'Emergent Symbols through Binding in External Memory', which introduces the Emergent Symbol Binding Network (ESBN), a recurrent neural network augmented with a factorized external memory, to solve several logical reasoning tasks. We describe the design choices of the ESBN in detail, compare it with other sequence models, and perform a qualitative validation of the generalization claims from the original paper.
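
As a generic illustration of what a factorized key-value memory read looks like (a toy sketch, not the ESBN's exact equations), addressing is based on similarity to stored entries, and retrieval returns whatever was bound to the match:

```python
import torch

def memory_read(query, keys, values):
    # query: (d,), keys: (n, d), values: (n, k)
    attn = torch.softmax(keys @ query, dim=0)   # similarity-based addressing
    return attn @ values                        # retrieve what was bound to the match

keys = torch.randn(3, 8)      # e.g. embeddings of entries seen so far
values = torch.randn(3, 4)    # e.g. abstract symbols bound to those entries
print(memory_read(torch.randn(8), keys, values).shape)  # torch.Size([4])
```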

On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections

This blog post discusses the ICLR 2021 paper “On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections” by Li et al., highlighting the importance of its theoretical results while critically examining the notions and applications of dyadic fairness presented. This blog post assumes basic familiarity with graph representation learning using message-passing GNNs and with fairness based on observed characteristics. The images in this blog post are equipped with alternative text.
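
For orientation, one common way to state a dyadic (link-level) parity criterion, written in our own notation, is that the link predictor should score intra-group and inter-group pairs equally on average:

```latex
% Our notation: g(u, v) is the predicted link score, S(.) a node's sensitive attribute.
\mathbb{E}\big[\, g(u, v) \mid S(u) = S(v) \,\big]
\;=\;
\mathbb{E}\big[\, g(u, v) \mid S(u) \neq S(v) \,\big]
```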


Does Adam Converge and When?

In this blog post, we revisit the (non-)convergence behavior of Adam. In particular, we briefly review the non-convergence results of Reddi et al. [14] and the convergence results of Shi et al. [17]. Do these two results contradict each other? If not, does the convergence analysis in Shi et al. [17] match the practical setting of Adam? How large is the gap between theory and practice? In this blog post, we discuss these questions from multiple perspectives. We show that the gap is actually non-negligible and that the discussion on the convergence of Adam is far from concluded.
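
For reference, recall the Adam update (bias-correction terms omitted for brevity); the convergence discussion revolves around how the hyperparameters and stepsizes are chosen:

```latex
% Adam with gradient g_t, hyperparameters (\beta_1, \beta_2, \epsilon), stepsize \eta_t.
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
x_{t+1} = x_t - \eta_t \,\frac{m_t}{\sqrt{v_t} + \epsilon}
```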


Recent Advances in Deep Learning for Routing Problems

TL;DR: Neural network-driven solvers for combinatorial optimization problems such as the Travelling Salesperson Problem have recently seen a surge of academic interest. This blog post presents a Neural Combinatorial Optimization pipeline that unifies several recently proposed model architectures and learning paradigms into a single framework. Through the lens of this pipeline, we analyze recent advances in deep learning for routing problems and provide new directions to stimulate future research towards practical impact.
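
To make the encode-then-decode pattern concrete (the scoring rule below is a toy placeholder, not any particular paper's decoder), an encoder embeds the cities and a decoder builds a tour one city at a time:

```python
import torch

def greedy_tour(node_embeddings):
    n = node_embeddings.size(0)
    visited = [0]                                    # start at city 0
    while len(visited) < n:
        current = node_embeddings[visited[-1]]
        scores = node_embeddings @ current           # compatibility with the current city
        scores[visited] = float("-inf")              # mask cities already in the tour
        visited.append(int(scores.argmax()))
    return visited

emb = torch.randn(6, 16)                             # pretend encoder output for 6 cities
print(greedy_tour(emb))                              # a permutation of 0..5
```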


Generating Molecular Conformations via Normalizing Flows and Neural ODEs

In this post, we provide an in-depth overview of the methods outlined in the paper “Learning Neural Generative Dynamics for Molecular Conformation Generation,” place the work in the context of other conformation generation approaches, and discuss potential future directions for improving the diversity and stability of generated conformations.
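
As background for readers new to these tools (standard identities, not contributions of the paper), a normalizing flow f is trained through the change-of-variables formula, and its continuous-time neural-ODE counterpart tracks the log-density along the dynamics f_theta:

```latex
% Change of variables for a flow z = f(x) with base density p_Z, and the
% instantaneous version for a neural ODE with dynamics f_\theta.
\log p_X(x) = \log p_Z\!\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|,
\qquad
\frac{\partial \log p\big(z(t)\big)}{\partial t} = -\operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z(t)}\right)
```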


Learning to Coarsen Graphs with Graph Neural Networks

With the rise of large-scale graphs for relational learning, graph coarsening emerges as a computationally viable alternative to operating on the full graph. We revisit the principles that aim to improve data-driven graph coarsening with adjustable coarsened structures.


The Annotated S4

A guide to the S4 model from 'Efficiently Modeling Long Sequences with Structured State Spaces' annotated in a literate coding style as a working system.
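
At the core of S4 is the classical continuous-time state space model below; S4's contribution lies in how the state matrix A is structured and how the resulting sequence map is computed efficiently:

```latex
% The continuous-time state space model underlying S4 (standard SSM form):
x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)
```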

PPLM Revisited: Steering and Beaming a Lumbering Mammoth to Control Text Generation

In this blog post, we examine to what extent PPLM can control language models by investigating reproducibility, the impact of the prompt versus the bag-of-words (BoW), the effect of using a weighted BoW, and style control. Want a summary only? Check our TL;DR.
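
For context, the bag-of-words attribute score that the BoW variants build on can be written schematically as follows (our notation; p_{t+1} is the language model's next-token distribution and W the bag of topic words):

```latex
% PPLM nudges the hidden states in the direction that increases this score.
\log p(a \mid x_{1:t}) \;=\; \log \sum_{w \in W} p_{t+1}[w]
```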


Looking at the Performer from a Hopfield Point of View

The recent paper Rethinking Attention with Performers constructs a new efficient attention mechanism in an elegant way. It strongly reduces the computational cost for long sequences while keeping the intriguing properties of the original attention mechanism. In doing so, Performers have a complexity that is only linear in the input length, in contrast to the quadratic complexity of standard Transformers. This is a major breakthrough in the effort to improve Transformer models.
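
Schematically, the trick is to approximate the softmax kernel with feature maps phi so that the L x L attention matrix is never formed (our paraphrase of the construction; the specific random feature maps are the paper's FAVOR+ mechanism):

```latex
% Kernel-feature view of attention for sequence length L and head dimension d.
\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V
\;\approx\; \hat{D}^{-1}\,\phi(Q)\big(\phi(K)^\top V\big),
\qquad
\hat{D} = \mathrm{diag}\!\Big(\phi(Q)\big(\phi(K)^\top \mathbf{1}_L\big)\Big)
```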