Proposed
Discrete latent space
The model is based on the VAE [1], where an image \(x\) is generated from a random latent variable \(z\) by a decoder \(p(x\ \vert\ z)\). The posterior (encoder) captures the latent variable distribution \(q_{\phi}(z\ \vert\ x)\) and is generally trained to match a certain distribution \(p(z)\), from which \(z\) is sampled at inference time.
Contrary to the standard framework, in this work the latent space is discrete: it is defined by a set of latent codes \(z \in \mathbb{R}^{K \times D}\), where \(K\) is the number of codes and \(D\) their dimensionality. More precisely, the input image is first fed to an encoder \(z_e\), which outputs a continuous vector; this vector is then mapped to one of the latent codes of the discrete space via nearest-neighbor search.
Adapting the \(\mathcal{L}_{\text{ELBO}}\) to this formalism, the KL divergence term greatly simplifies and we obtain:
\[\begin{align} \mathcal{L}_{\text{ELBO}}(x) &= \text{KL}(q(z | x) \| p(z)) - \mathbb{E}_{z \sim q(\cdot | x)}(\log p(x | z))\\ &= - \log(p(z_k)) - \log p(x | z_k)\\ \mbox{where }& z_k = z_q(x) = \arg\min_z \| z_e(x) - z \|^2 \tag{1} \end{align}\]In practice, the authors use a categorical uniform prior for the latent codes, meaning the KL divergence is constant and the objective reduces to the reconstruction loss.
Figure: A diagram of the VQ-VAE (left) and a visualization of the embedding space (right). The output of the encoder \(z_e(x)\) is mapped to the nearest code. The gradient (in red) pushes the encoder to change its output, which can alter the configuration, hence the code assignment, in the next forward pass.
Training Objective
As mentioned previously, the \(\mathcal{L}_{\text{ELBO}}\) objective reduces to the reconstruction loss and is used to learn the encoder and decoder parameters. However, the mapping from \(z_e\) to \(z_q\) is not differentiable in a straightforward way (Equation (1)). To remedy this, the authors use a straight-through estimator, meaning the gradients at the decoder input \(z_q(x)\) (quantized) are copied directly to the encoder output \(z_e(x)\) (continuous). However, this means the latent codes involved in the mapping from \(z_e\) to \(z_q\) do not receive gradient updates this way.
Hence, in order to train the discrete embedding space, the authors use Vector Quantization (VQ), a dictionary learning technique, which adds a mean squared error term to pull each latent code towards the continuous vectors it was matched to:
\[\begin{align} \mathcal{L}(x) = - \log p(x\ \vert\ z_q(x)) + \| \overline{z_e(x)} - z_q(x) \|^2 + \beta \| z_e(x) - \overline{z_q(x)} \|^2 \end{align}\]where \(x \mapsto \overline{x}\) denotes the stop-gradient operator. The first term is the reconstruction loss stemming from the ELBO, the second term is the vector quantization contribution. Finally, the last term is a commitment loss that controls the volume of the latent space by forcing the encoder to “commit” to the latent code it was matched with, rather than growing its output space without bound.
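As a concrete illustration, here is a minimal PyTorch-style sketch of the quantization step, the straight-through estimator and the two auxiliary losses. This is my own sketch, not the authors' code; names such as `vector_quantize` and the value of `beta` are assumptions.

```python
import torch

def vector_quantize(z_e, codebook, beta=0.25):
    """Nearest-neighbor quantization with straight-through gradients.

    z_e:      (batch, D) continuous encoder outputs
    codebook: (K, D) learnable latent codes
    """
    # Nearest code for each encoder output (Equation (1)).
    distances = torch.cdist(z_e, codebook)        # (batch, K)
    z_q = codebook[distances.argmin(dim=1)]       # (batch, D)

    # Vector quantization loss: move the selected codes towards the (frozen) encoder outputs.
    vq_loss = ((z_e.detach() - z_q) ** 2).mean()
    # Commitment loss: keep the encoder outputs close to the (frozen) codes.
    commit_loss = beta * ((z_e - z_q.detach()) ** 2).mean()

    # Straight-through estimator: the forward pass uses z_q, but gradients flow
    # from the decoder input directly to z_e, bypassing the argmin.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, vq_loss + commit_loss
```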
Learned Prior
A second contribution of this work is to learn the prior distribution. As mentioned, during training the prior \(p(z)\) is a uniform categorical distribution. Once training is done, the authors fit an autoregressive distribution over the space of latent codes. This is enabled in particular by the fact that the latent space is discrete.
Note: It is not clear to me if the autoregressive model is trained on latent codes sampled from the prior \(z \sim p(z)\) or from the encoder distribution \(x \sim \mathcal{D};\ z \sim q(z\ \vert\ x)\)
Experiments
The proposed model is mostly compared to the standard continuous VAE framework. It seems to achieve similar log-likelihood and sample quality, while taking advantage of the discrete latent space.
For ImageNet for instance, they consider \(K = 512\) latent codes of dimension \(1\). The output of the fully-convolutional encoder \(z_e\) is a feature map of size \(32 \times 32 \times 1\), which is then quantized pixel-wise. Interestingly, the model still performs well when using a powerful decoder (here, PixelCNN [2]), which seems to indicate it does not suffer from posterior collapse as strongly as the standard continuous VAE.
A second set of experiments tackles the problem of audio modeling. The performance of the model is once again satisfactory. Furthermore, the discrete latent space does seem to capture relevant characteristics of the input data structure, although this is a purely qualitative observation.
References
- [1] Auto-Encoding Variational Bayes, Kingma and Welling, ICLR 2014
- [2] Pixel Recurrent Neural Networks, van den Oord et al, arXiv 2016
- Pros (+): Theoretical justification, simple model, easy to implement.
- Cons (-): Some training instability in practice.
Generalized Bound on the Expected Risk
Several theoretical studies of the domain adaptation problem have proposed upper bounds of the risk on the target domain, involving the risk on the source domain and a notion of distance between the source and target distribution, \(\mathcal D_S\) and \(\mathcal D_T\). Here, the authors specifically consider the work of [1]. First, they define the \(\mathcal H\)-divergence:
\[\begin{align} d_{\mathcal H}(\mathcal D_S, \mathcal D_T) = 2 \sup_{h \in \mathcal H} \left| \Pr_{x\sim\mathcal{D}_S} (h(x) = 1) - \Pr_{x\sim\mathcal{D}_T} (h(x) = 1) \right| \tag{1} \end{align}\]where \(\mathcal H\) is a space of (here, binary) hypothesis functions. In the case where \(\mathcal H\) is a symmetric hypothesis class (i.e., \(h \in \mathcal H \implies -h \in \mathcal H\)), one can reduce (1) to the empirical form:
\[\begin{align} d_{\mathcal H}(\mathcal D_S, \mathcal D_T) &\simeq 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 1 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\ &= 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} 1 - [\!|h(x) = 0 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\ &= 2 - 2 \min_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 0 |\!] + \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right| \tag{2} \end{align}\]It is difficult to estimate the minimum over the hypothesis class \(\mathcal H\). Instead, [1] propose to approximate Equation (2) by training a classifier \(\hat{h}\) on samples \(\mathbf{x_S} \in \mathcal{D}_S\) with label 0 and \(\mathbf{x_T} \in \mathcal D_T\) with label 1, and replacing the minimum term by the empirical risk of \(\hat h\).
Given this definition of the \(\mathcal H\)-divergence, [1] further derives an upper bound on the empirical risk on the target domain, which in particular involves a trade-off between the empirical risk on the source domain, \(\mathcal{R}_{D_S}(h)\), and the divergence between the source and target distributions, \(d_{\mathcal H}(D_S, D_T)\).
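Schematically, omitting multiplicative constants and the confidence level \(\delta\), and writing \(\beta\) for the risk of the ideal joint hypothesis on both domains, the bound has the form:

\[\begin{align} \mathcal{R}_{D_T}(h) \ \lesssim\ \mathcal{R}_{D_S}(h) + d_{\mathcal H}(D_S, D_T) + \beta + \sqrt{\frac{\mbox{VC}(\mathcal H)\, \log n}{n}} \end{align}\]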
where \(\mbox{VC}\) designates the Vapnik–Chervonenkis dimension and \(n\) the number of samples.
The rest of the paper directly stems from this intuition: in order to minimize the target risk the proposed Domain Adversarial Neural Network (DANN) aims to build an “internal representation that contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples”.
Proposed
The goal of the model is to learn a classifier \(\phi\), which can be decomposed as \(\phi = G_y \circ G_f\), where \(G_f\) is a feature extractor and \(G_y\) a small classifier on top that predicts the class label. This architecture is trained with a standard classification objective to minimize:
\[\begin{align} \mathcal{L}_y(\theta_f, \theta_y) = \frac{1}{N_s} \sum_{(x, y) \in D_s} \ell(G_y(G_f(x)), y) \end{align}\]Additionally, DANN introduces a domain prediction branch: another classifier \(G_d\) on top of the feature representation \(G_f\), whose goal is to approximate the domain discrepancy (2). This yields a domain classification loss \(\mathcal{L}_d\), which the domain classifier minimizes and the feature extractor seeks to maximize:
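Schematically, labeling source samples with domain 0 and target samples with domain 1, and writing \(\ell_d\) for the binary cross-entropy, the domain loss can be written as:

\[\begin{align} \mathcal{L}_d(\theta_f, \theta_d) = \frac{1}{N_s} \sum_{x \in D_s} \ell_d\left(G_d(G_f(x)), 0\right) + \frac{1}{N_t} \sum_{x \in D_t} \ell_d\left(G_d(G_f(x)), 1\right) \end{align}\]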
The final objective can thus be written as:
\[\begin{align} E(\theta_f, \theta_y, \theta_d) &= \mathcal{L}_y(\theta_f, \theta_y) - \lambda \mathcal{L}_d(\theta_f, \theta_d) \tag{1}\\ \theta_f^\ast, \theta_y^\ast &= \arg\min E(\theta_f, \theta_y, \theta_d) \tag{2}\\ \theta_d^\ast &= \arg\max E(\theta_f, \theta_y, \theta_d) \tag{3} \end{align}\]
Gradient Reversal Layer
Applying standard gradient descent, the DANN objective leads to the following gradient update rules:
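Written with learning rate \(\mu\), and following directly from the saddle-point objective (1-3), the updates read:

\[\begin{align} \theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right)\\ \theta_y &\leftarrow \theta_y - \mu \frac{\partial \mathcal{L}_y}{\partial \theta_y}\\ \theta_d &\leftarrow \theta_d - \mu \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_d} \end{align}\]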
In the case of neural networks, the gradients of the loss with respect to the parameters are obtained with the backpropagation algorithm. These update equations are very similar to the standard backpropagation scheme, except that the derivative of \(\mathcal{L}_d\) enters with opposite signs in the \(\theta_f\) and \(\theta_d\) updates. The authors introduce the gradient reversal layer (GRL) to evaluate both gradients in a single standard backpropagation step.
The idea is that the output of the feature extractor \(G_f\) is propagated unchanged to the domain classifier \(G_d\) in the forward pass; during backpropagation, however, its gradient is multiplied by a negative constant:
\[\begin{align} \frac{\partial \mathcal L_d}{\partial \theta_f} = \frac{\bf{\color{red}{-}} \partial \mathcal L_d}{\partial G_f(x)} \frac{\partial G_f(x)}{\partial \theta_f} \end{align}\]In other words, for the update of \(\theta_d\), the gradients of \(\mathcal L_d\) with respect to the activations are computed normally (minimization), but they are then propagated with a minus sign into the feature extraction part of the network (maximization). Augmented with the gradient reversal layer, the final model is trained by minimizing the sum of losses \(\mathcal L_d + \mathcal L_y\), which corresponds to the optimization problem in (1-3).
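For illustration, a gradient reversal layer is only a few lines in a modern autograd framework. Below is a minimal PyTorch sketch (my own, not the authors' code); `lamb` plays the role of the constant \(\lambda\):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, gradient multiplied by -lamb in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage: domain_logits = G_d(grad_reverse(G_f(x), lamb))
```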
Figure: The proposed architecture includes a deep feature extractor and a deep label predictor. Unsupervised domain adaptation is achieved by adding a domain classifier connected to the feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during backpropagation.
Experiments
Datasets
The paper presents extensive results on the following settings:
- Toy dataset: A toy example based on the two half-moons dataset, where the source domain consists of the standard binary classification task with the two half-moons, and the target is the same but rotated by 30 degrees. They compare DANN to a NN model with the same architecture but without the GRL: in other words, the baseline directly minimizes both the task and domain classification losses.
- Sentiment Analysis: These experiments are performed on the Amazon reviews dataset, which contains product reviews from four different domains (hence 12 different source-to-target scenarios) that have to be classified as either positive or negative.
- Image Classification: Here the model is evaluated on various image classification tasks, including MNIST \(\rightarrow\) SVHN, or different domain pairs from the OFFICE dataset [2].
- Person Re-identification: The task of person identification across various visual domains.
Validation
Setting hyperparameters is a difficult problem, as we cannot directly evaluate the model on the target domain (no labeled data available). Instead of standard cross-validation, the authors use reverse validation based on a technique introduced in [3]: First, the (labeled) source set \(S\) and (unlabeled) target set \(T\) are each split into a training and validation set, \(S'\) and \(S_V\) (resp. \(T'\) and \(T_V\)). Using these splits, a model \(\eta\) is trained on \(S'\rightarrow T'\). Then a second model \(\eta_r\) is trained for the reverse direction on the set \(\{ (x, \eta(x)),\ x \in T'\} \rightarrow S'\). This reverse classifier \(\eta_r\) is then finally evaluated on the labeled validation set \(S_V\), and this accuracy is used as a validation score.
Conclusions
In general, the proposed method seems to perform very well for aligning the source and target domains in an unsupervised domain adaptation framework. Its main advantage is its simplicity, both in terms of theoretical motivation and implementation. In fact, the GRL is easily implemented in standard deep learning frameworks and can be added to any architecture.
The main shortcomings of the method are that (i) all experiments deal with only two domains, and extensions to multiple domains might require some tweaks (e.g., considering the sum of pairwise discrepancies as an upper bound), and (ii) in practice, training can become unstable due to the adversarial training scheme. In particular, the experiments section shows that some stability tricks have to be used during training, such as using momentum or slowly increasing the contribution of the domain classification branch.
Figure: t-SNE projections of the embeddings for the source (MNIST) and target (SVHN) datasets without (left) and with (right) DANN adaptation.
Closely related
Conditional Adversarial Domain Adaptation.
Long et al, NeurIPS 2018 [link]
In this work, the authors propose a class-conditional extension of Domain Adversarial Networks: the domain classifier is conditioned on the input’s class. However, since part of the samples are unlabeled, the conditioning uses the classifier branch’s output as a proxy for the class information. Instead of simply concatenating the feature input with the condition, the authors consider a multilinear conditioning technique which relies on the cross-covariance operator. Another related paper is [4]; it also uses the multi-class information of the input domain, although in a simpler way.
References
- [1] Analysis of representations for Domain Adaptation, Ben-David et al, NeurIPS 2006
- [2] Adapting visual category models to new domains, Saenko et al, ECCV 2010
- [3] Person re-identification via structured prediction, Zhang and Saligrama, arXiv 2014
- [4] Multi-Adversarial Domain Adaptation, Pei et al, AAAI 2018
- Pros (+): Interesting results, with connections to Style Transfer and Network inversion.
- Cons (-): Seems like the results might depend a lot on parameter initialization, learning rate etc.
Background
Given a random noise vector \(z\) and conditioned on an image \(x_0\), the goal of conditional image generation is to generate image \(x = f_{\theta}(z; x_0)\) (where the random nature of \(z\) provides a sampling strategy for \(x\)); for instance, the task of generating a high quality image \(x\) from its lower resolution counterpart \(x_0\).
In particular, this encompasses inverse tasks such as denoising, super-resolution and inpainting, which act at the local pixel level. Such tasks can often be phrased with an objective of the following form:
\[\begin{align} x^{\ast} = \arg\min_x E(x, x_0) + R(x) \end{align}\]where \(E\) is a cost function and \(R\) is a prior on the output space acting as a regularizer. \(R\) is often a hand-crafted prior, for instance a smoothness constraint like Total Variation [1], or, for more recent techniques, it can be implemented with adversarial training (e.g., GANs).
Deep Image Prior
In this paper, the goal is to replace \(R\) by an implicit prior captured by the neural network itself, relative to the input noise \(z\). In other words:
\[\begin{align} R(x) &= 0\ \mbox{if}\ \exists \theta\ \mbox{s.t.}\ x = f_{\theta}(z)\\ R(x) &= + \infty,\ \mbox{otherwise} \end{align}\]This results in the following workflow:
\[\begin{align} \theta^{\ast} = \arg\min_{\theta} E(f_{\theta}(z; x_0), x_0) \mbox{ and } x^{\ast} = f_{\theta^{\ast}}(z; x_0) \end{align}\]One could wonder if this is a good choice for a prior at all. In fact, \(f\), being instantiated as a neural network, should be powerful enough that any image \(x\) can be generated from \(z\) for a certain choice of parameters \(\theta\), which means the prior should not be constraining.
However, the structure of the network itself effectively affects how optimization algorithms such as gradient descent will explore the output space.
To quantify this effect, the authors perform a reconstruction experiment (i.e., \(E(x) = \| x - x_0 \|\)) for different choices of the input image \(x_0\): (i) a natural image, (ii) the same image with small perturbations, (iii) the same image with large perturbations, and (iv) white noise, using a U-Net [2] inspired architecture. Experimental results show that the network descends faster to natural-looking images (cases (i) and (ii)) than to random noise (cases (iii) and (iv)).
Figure: Learning curves for the reconstruction task using: a natural image, the same plus i.i.d. noise, the same but randomly scrambled, and white noise.
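As a concrete illustration of the workflow above, the whole method boils down to a short optimization loop. Below is a minimal PyTorch sketch (my own, with assumed names: `f` an untrained image-generating network, `z` a fixed random input, `E` a differentiable task cost); it is not the authors' implementation:

```python
import torch

def deep_image_prior(f, z, x0, E, steps=3000, lr=1e-2):
    """Fit the weights of an untrained network f so that f(z) explains x0."""
    optimizer = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x = f(z)             # candidate image generated from the fixed noise input
        loss = E(x, x0)      # data term only: the network structure acts as the prior R
        loss.backward()
        optimizer.step()
    return f(z).detach()     # x* = f_{theta*}(z)
```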
Experiments
The experiments focus on three image analysis tasks:
- Image denoising (\(E(x, x_0) = \|x - x_0\|\)), based on the previous observation that the model converges more easily to natural-looking images than noisy ones.
- Super-Resolution (\(E(x, x_0) = \| \mbox{downscale}(x) - x_0 \|\)), to upscale the resolution of the input image \(x_0\).
- Image inpainting (\(E(x, x_0) = \|(x - x_0) \odot m\|\)) where the input image \(x_0\) is masked by a mask \(m\) and the goal is to recover the missing pixels.
The method seems to outperform most non-trained methods, when available (e.g., bicubic upsampling for Super-Resolution), but is still often outperformed by learning-based ones. The inpainting results are particularly interesting, and I do not know of any other non-trained baseline for this task. It obviously performs poorly when the obscured region requires highly semantic knowledge, but it seems to perform well on more reasonable benchmarks.
Additionally, the authors test the proposed prior for diagnosing neural networks by generating natural pre-images for neural activations of deep layers. Qualitative images look better than other handcrafted priors (total variation) and are not biased to specific datasets as are trained methods.
Figure: Example comparison between the proposed Deep Image Prior and various baselines for the task of Super-Resolution.
Closely related (follow-up work)
Deep Decoder: Concise Image Representations from Untrained Non-Convolutional Networks
Heckel and Hand, [link]
This paper builds on Deep Image Prior (DIP) but proposes a much simpler architecture, which is under-parametrized and non-convolutional. In particular, there are fewer weight parameters than the dimensionality of the output image (in comparison, DIP was using a U-Net based architecture). This property implies that the weights of the network can additionally be used as a compressed representation of the image. To test for compression, the authors use their architecture to reconstruct an image \(x\) for different compression ratios \(k\) (i.e., the number of network parameters \(N\) is \(k\) times smaller than the output dimension of the image).
The deep decoder architecture combines standard blocks, including linear combinations of channels (convolutions), ReLU, batch normalization and upscaling. Note that since we have the special case of batch size 1 here, the Batch Norm operator essentially normalizes the activations channel-wise. The paper also contains a nice theoretical justification for the denoising case, in which the authors show that the model can only fit a limited amount of noise, which explains why it converges to more natural-looking images; however, the result only applies to small networks (one layer, possibly generalizable to multi-layer networks without batch-norm).
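From this description, a deep decoder block can be sketched in a few lines of PyTorch (my reading of the description above, not the authors' code; the channel sizes are arbitrary placeholders):

```python
import torch.nn as nn

def deep_decoder(channels=(64, 64, 64, 64), out_channels=3):
    """Stack of (1x1 conv -> upsample -> ReLU -> channel-wise normalization) blocks."""
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),  # linear combination of channels
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.ReLU(),
            nn.BatchNorm2d(c_out),  # with batch size 1 this normalizes each channel
        ]
    layers += [nn.Conv2d(channels[-1], out_channels, kernel_size=1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```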
References
- [1] An introduction to Total Variation for Image Analysis, Chambolle et al., Technical Report, 2009
- [2] U-Net: Convolutional Networks for Biomedical Image Segmentation, Ronneberger et al., MICCAI 2015
CNN architectures with a notion of relational reasoning, particularly useful for tasks such as visual question answering, dynamics understanding, etc.
- Pros (+): Simple architecture, relies on small and flexible modules.
- Cons (-): Still a black-box module, hard to quantify how much "reasoning" happens.
Proposed Model
The main idea of Relation Networks (RN) is to constrain the functional form of convolutional neural networks so as to explicitly learn relations between entities, rather than hoping for this property to emerge in the representation during training. Formally, let \(O = \{o_1 \dots o_n\}\) be a set of objects of interest; the Relation Network is trained to learn a representation that considers all pairwise relations across the objects:
\[\begin{align} \mbox{RN}(O) = f_{\phi}\left( \sum_{i, j} g_{\theta}(o_i, o_j) \right) \end{align}\]
\(f_{\phi}\) and \(g_{\theta}\) are defined as Multi-Layer Perceptrons. By definition, the Relation Network (i) has to consider all pairs of objects, (ii) operates directly on the set of objects, hence is not constrained to a specific organization of the data, and (iii) is data-efficient in the sense that only one function, \(g_{\theta}\), is learned to capture all the possible relations: \(g\) and \(f\) are typically light modules, and most of the overhead comes from the sum over the \(n^2\) pairwise terms.
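A minimal PyTorch sketch of this forward pass (my own illustration, not the authors' code) makes the \(n^2\) pairwise structure explicit:

```python
import torch

def relation_network(objects, g_theta, f_phi):
    """objects: tensor of shape (n, k); g_theta, f_phi: small MLPs (torch.nn.Module)."""
    n = objects.shape[0]
    # Build all n^2 ordered pairs (o_i, o_j).
    o_i = objects.unsqueeze(1).expand(n, n, -1)          # (n, n, k)
    o_j = objects.unsqueeze(0).expand(n, n, -1)          # (n, n, k)
    pairs = torch.cat([o_i, o_j], dim=-1).reshape(n * n, -1)
    # One shared relation function g over every pair, aggregated by a sum,
    # then a single readout function f on the aggregate.
    return f_phi(g_theta(pairs).sum(dim=0))
```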
The objects are the basic elements of the relational process we want to model. They are defined with regard to the task at hand, for instance:
- Attending to relations between objects in an image: The image is first processed through a fully-convolutional network. Each cell of the resulting feature map is taken as an object: a feature vector of dimension \(k\), additionally tagged with its position in the feature map.
- Sequence of images: Each image is first fed through a feature extractor and the resulting embedding is used as an object. The goal is to model relations between images across the sequence.
Figure: Example of applying the Relation Network to Visual Question Answering. Questions are processed with an LSTM to produce a question embedding, and images are processed with a CNN to produce a set of objects for the RN.
Experiments
The main evaluation is done on the CLEVR dataset [2]. The main message seems to be that the proposed module is very simple and yet often improves the model accuracy when added to various architectures (CNN, CNN + LSTM etc.) introduced in [1]. The main baseline they compare to (and outperform) is Spatial Attention (SA) which is another simple method to integrate some form of relational reasoning in a neural architecture.
Closely related
Recurrent Relational Neural Networks [3]
Palm et al, [link]
This paper builds on the Relation Network architecture and proposes to explore more complex relational structures, defined as a graph, using a message passing approach. Formally, we are given a graph with vertices \(\mathcal V = \{v_i\}\) and edges \(\mathcal E = \{e_{i, j}\}\). By abuse of notation, \(v_i\) also denotes the embedding of vertex \(i\) (e.g., obtained via a CNN), and \(e_{i, j}\) is 1 when \(i\) and \(j\) are linked, 0 otherwise. To each node we associate a hidden state \(h_i^t\) at iteration \(t\), which is updated via message passing. After a few iterations, the resulting state is passed through an MLP \(r\) to output the result (either for each node or for the whole graph):
\[\begin{align} h_i^0 &= v_i\\ h_i^{t + 1} &= f_{\phi} \left( h_i^t, v_i, \sum_{j} e_{i, j} g_{\theta}(h^t_i, h^t_j) \right)\\ o_i &= r(h_i^T) \mbox{ or } o = r(\sum_i h_i^T) \end{align}\]
Comparing to the original Relation Network:
- Each update rule is a Relation Network that only looks at pairwise relations between linked vertices. The message passing scheme additionally introduces the notion of recurrence, and the dependency on the previous hidden state.
- The dependence on \(h_i^t\) could in theory be avoided by adding self-edges from \(v_i\) to \(v_i\), to make it closer to the Relation Network formulation.
- Adding \(v_i\) as input of \(f_\phi\) looks like a simple trick to avoid long-term memory problems.
The experiments essentially compare the proposed RRNN model to the Relation Network and to classical recurrent architectures such as LSTMs. They consider three datasets:
- bAbI. An NLP question-answering task with some reasoning involved. The model solves 19.7 (out of 20) tasks on average, while the simple RN solved around 18 of them reliably.
- Pretty CLEVR. A CLEVR-like dataset (with only simple 2D shapes) with questions involving a varying number of reasoning steps, e.g., “which is the shape \(n\) steps away from the red circle?”
- Sudoku. The graph contains 81 nodes (one for each cell of the Sudoku grid), with edges between cells belonging to the same row, column or block.
Multi-Layer Relation Neural Networks [4]
Jahrens and Martinetz, [link]
This paper presents a very simple trick to make the Relation Network consider higher-order relations than pairwise ones, while retaining some efficiency. Essentially, the model can be written as follows:
\[\begin{align} h_{i, j}^0 &= g^0_{\theta}(x_i, x_j) \\ h_{i, j}^t &= g^{t + 1}_{\theta}\left(\sum_k h_{i, k}^{t - 1}, \sum_k h_{j, k}^{t - 1}\right) \\ MLRN(O) &= f_{\phi}(\sum_{i, j} h^T_{i, j}) \end{align}\]
It is not clear why this model would be equivalent to explicitly considering higher-order relations (it rather combines pairwise terms for a finite number of steps). According to the experiments, this architecture does seem better suited to the studied tasks (e.g., compared to the Relation Network or the Recurrent Relational Network), but it also makes the model even harder to interpret.
References
- [1] Inferring and executing programs for visual reasoning, Johnson et al, ICCV 2017
- [2] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, Johnson et al, CVPR 2017
- [3] Recurrent Relational Neural Networks, Palm et al, NeurIPS 2018
- [4] Multi-Layer Relation Neural Networks, Jahrens and Martinetz, arXiv 2018
This paper proposes the Compositional Recursive Learner (CRL) to learn at the same time both the structure of the task and its components.
- Pros (+): This problem is well motivated, and seems a very promising direction, for learning domain-agnostic components.
- Cons (-): The actual implementation description lacks crucial details and I am not sure how easy it would be to reimplement.
Proposed model
Problem definition
A problem \(P_i\) is defined as a transformation \(x_i : t_x \mapsto y_i : t_y\), where \(t_x\) and \(t_y\) are the respective types of \(x_i\) and \(y_i\). However, since we only consider recursive problems here, \(t_x = t_y\).
We define a family of problems \(\mathcal P\) as a set of composite recursive problems that share regularities. The goal of CRL is to extrapolate to solve new compositions of these tasks, using knowledge from the limited subset of tasks it has seen during training.
Implementation
In essence, the problem can be formulated as a sequential decision-making task via a meta-level MDP (\(\mathcal X\), \(\mathcal F\), \(\mathcal P_{\mbox{meta}}\), \(r\), \(\gamma\)), where \(\mathcal X\) is the set of states, i.e., representations; \(\mathcal F\) is a set of computations, i.e., instances of the transformations we consider (implemented for instance as neural networks), plus an additional special function HALT that stops the execution; \(\mathcal P_{\mbox{meta}}: (x_t, f_t, x_{t + 1}) \mapsto c \in [0, 1]\) is the policy which assigns a probability to each possible transition. Finally, \(r\) is the reward function and \(\gamma\) a decay factor.
More specifically, the CRL is implemented as a set of neural networks, \(f_k \in \mathcal F\), and a controller \(\pi(f\ |\ \mathbf{x}, t_y)\) which selects the best course of action given the current history of representations \(\mathbf{x}\) and target type \(t_y\).
The loss is back-propagated through the functions \(f\), and the controller is trained as a Reinforcement Learning (RL) agent with a sparse reward (it only knows the final target result).
An additional important element of the training scheme is the use of curriculum learning, i.e., starting by learning small transformations and then considering more complex compositions, increasing the state space little by little.
Figure: (top-left) CRL is a symbiotic relationship between a controller and an evaluator: the controller selects a module `m` given an intermediate representation `x`, and the evaluator applies `m` on `x` to create a new representation. (bottom-left) CRL dynamically learns the structure of a program customized for its problem, and this program can be viewed as a finite state machine. (right) A series of computations in the program is equivalent to a traversal through a meta-MDP, where modules can be reused across different stages of computation, allowing for recursive computation.
Experiments
Multilingual Arithmetic
The learner aims to solve recursive arithmetic expressions across five languages: English, numerals, Pig Latin, reversed English, and Spanish. The input is a tuple \((x^s, t_y)\), where \(x^s\) is the arithmetic expression expressed in source language \(s\), and \(t_y\) is the output language.
- Training: The learner trains on a curriculum of a limited set of 2, 3, 4, 5-length expressions. During training, each source language is seen with four target languages (and one held out for testing) and each target language is seen with four source languages (and one held out for testing).
- Testing: The learner is asked to generalize to 5-length expressions (test set) and to extrapolate to 10-length expressions (extrapolation set) with unseen language pairs.
The authors consider two main types of functional units for this task: a reducer, which takes as input a window of three terms of the input expression and outputs a softmax distribution over the vocabulary, and a translator, which applies a function to every element of the input sequence and outputs a sequence of the same size.
The CRL is compared to a baseline RNN architecture that directly tries to map a variable-length input sequence to the target output. On the test set, the RNN and CRL yield similar accuracies, although CRL usually requires fewer training samples and/or fewer training iterations. On the extrapolation set, however, CRL more clearly outperforms the RNN.
The CRL results usually have a much larger variance, which would be interesting to analyze qualitatively. Moreover, the use of curriculum learning significantly improves the model performance. Finally, qualitative results show that the reducers and translators are interpretable to some degree: e.g., it is possible to map some of the reducers to specific operations; however, due to the unsupervised nature of the task, the mapping is not always straightforward.
Image Transformations
This time the functional units are composed of three specialized Spatial Transformer Networks [1], which handle rotation, scale and translation respectively, and an identity function. Overall this setting does not yield very good quantitative results. More precisely, one of the main challenges, since we are acting on a visual domain, is to deduce the structure of the task from information which lacks clear structure (pixel matrices). Additionally, the fact that all inputs and outputs live in the same domain (images) and that only a sparse reward is available makes it more difficult for the controller to distinguish between functionalities, i.e., it could collapse to using only one transformer.
References
- [1] Spatial Transformer Networks, Jaderberg et al., NeurIPS 2015
This paper tackles SAT problems with weak supervision: a model is trained only to predict the satisfiability of a formula in conjunctive normal form. As a byproduct, if the formula is satisfiable, an actual satisfying assignment can be worked out from the network's activations in most cases.
- Pros (+): No need for extensive annotation, seems to extrapolate nicely to harder problems by increasing the number message passing iterations.
- Cons (-): Limited practical applicability, since it is outperformed by classical SAT solvers.
Model: NeuroSAT
Input
We consider boolean logic formulas in their conjunctive normal form (CNF), i.e. each input formula is represented as a conjunction (\(\land\)) of clauses, which are themselves disjunctions (\(\lor\)) of literals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable.
A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a literal in all clauses, invariance to permutations in \(\lor\) and \(\land\) etc.). The authors use a standard undirected graph representation where:
- \(\mathcal V\): vertices are the literals (positive and negative form of variables, denoted as \(x\) and \(\bar x\)) and the clauses occurring in the input formula
- \(\mathcal E\): Edges are added to connect (i) the literals with clauses they appear in and (ii) each literal to its negative counterpart.
The graph relations are encoded as an adjacency matrix, \(A\), with as many rows as there are literals and as many columns as there are clauses. Note that this structure does not constrain the vertices ordering, and does not make any preferential treatment between positive or negative literals. However it still has some caveats, which can be avoided by pre-processing the formula. For instance when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives.
Message-passing model
In a high-level view, the model keeps track of an embedding for each literal and each clause (\(L^t\) and \(C^t\)), updated via message passing on the graph, and combined via a Multi-Layer Perceptron (MLP) to output the model's prediction of the formula's satisfiability. The model updates are as follows:
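Schematically, writing \(A\) for the adjacency matrix defined above and using one LSTM for the clauses and one for the literals (this is a reconstruction of the update equations, so the exact parametrization may differ from the paper), the updates read:

\[\begin{align} C^{t+1}, h_C^{t+1} &= \texttt{LSTM}_C\left(h_C^t,\ A^\top \texttt{MLP}_L(L^t)\right) \tag{1}\\ L^{t+1}, h_L^{t+1} &= \texttt{LSTM}_L\left(h_L^t,\ \left[A\, \texttt{MLP}_C(C^{t+1}),\ \overline{L^t}\right]\right) \tag{2} \end{align}\]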
where \(h\) designates a hidden context vector for the LSTMs. The operator \(L \mapsto \overline{L}\) returns the embedding matrix \(L\) where the row of each literal is swapped with the one corresponding to the literal's negation. In other words, in (1) each clause embedding is updated based on the literals that compose it, while in (2) each literal embedding is updated based on the clauses it appears in and on its negated counterpart.
After \(T\) iterations of this message-passing scheme, the model computes a logit for the satisfiability classification problem, which is trained via sigmoid cross-entropy:
\[\begin{align} L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\ y^t &= \mbox{mean}(L^t_{\mbox{vote}}) \end{align}\]
Building the training set
The training set is built such that for any satisfiable training formula \(S\), it also includes an unsatisfiable counterpart \(S'\) which differs from \(S\) only by the negation of one literal in one clause. These carefully curated samples should force the model to pick up substantial characteristics of the formulas. In practice, the model is trained on formulas containing up to 40 variables and on average 200 clauses. At this size, the SAT problem can still be solved by state-of-the-art solvers (yielding the supervision required to train the model), but the formulas are large enough to prove challenging for machine learning models.
Inferring the SAT assignment
When a formula is satisfiable, one often also wants to know a valuation (variable assignment) that satisfies it. Recall that \(L^t_{\mbox{vote}}\) encodes a “vote” for every literal and its negative counterpart. Qualitative experiments show that those scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows:
- (1) Reshape \(L^T_{\mbox{vote}}\) to size \((n, 2)\), where \(n\) is the number of variables, so that each row holds the votes of a literal and its negation.
- (2) Cluster the literals into two clusters with centers \(\Delta_1\) and \(\Delta_2\) using the following criterion: \[\begin{align} \|x_i - \Delta_1\|^2 + \|\overline{x_i} - \Delta_2\|^2 \leq \|x_i - \Delta_2\|^2 + \|\overline{x_i} - \Delta_1\|^2 \end{align}\]
- (3) Try the two resulting assignments (set \(\Delta_1\) to true and \(\Delta_2\) to false, or vice-versa) and choose the one that yields satisfiability if any.
In practice, this method retrieves a satisfying assignment for over 70% of the satisfiable test formulas.
Experiments
In practice, the NeuroSAT model is trained with embeddings of dimension 128 and 26 message passing iterations. The MLP architectures are very standard: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula’s satisfiability on the test set.
It can also generalize to larger problems, although this requires increasing the number of message passing iterations. However, the classification performance decreases significantly (e.g., 25% for 200 variables) and the number of iterations required scales linearly with the number of variables (at least in the paper's experiments).
Figure: (left) Success rate of a NeuroSAT model trained on 40 variables for test set involving formulas with up to 200 variables, as a function of the number of message-passing iterations. (right) The sequence of literal votes across message-passing iterations on a satisfiable formula. The vote matrix is reshaped such that each row contains the votes for a literal and its negated counterpart. For several iterations, most literals vote unsat with low confidence (light blue). After a few iterations, there is a phase transition and all literals vote sat with very high confidence (dark red), until convergence.
Interestingly, the model generalizes well to other classes of problems that were reduced to SAT (using SAT's NP-completeness), even though they have a different structure than the random formulas generated for training; this seems to show that the model does learn some general characteristics of boolean formulas.
To summarize, the model takes advantage of the structure of Boolean formulas, and is able to predict whether an input formula is satisfiable or not with high accuracy. Moreover, even though trained only with this weak supervisory signal, it can work out a valid assignment most of the time. However it is still subpar compared to standard SAT solvers, which makes its applicability limited.
This paper deals with flow-based generative models, which allow for exact log-likelihood computation (unlike VAEs or GANs) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). It proposes a new, more flexible form of invertible flow for generative models, which builds on [3].
- Pros (+): Very clear presentation, promising results both quantitative and qualitative.
- Cons (-): One of the disadvantages of the model seems to be a large number of parameters; it would be interesting to have a more detailed report on training time. Also, a comparison to [5] (a variant of PixelCNN that allows for faster, parallelized sample generation) would be nice.