Proposed
Discrete latent space
The model is based on the VAE [1], where an image \(x\) is generated from a random latent variable \(z\) by a decoder \(p(x\ \vert\ z)\). The posterior (encoder) captures the latent variable distribution \(q_{\phi}(z\ \vert\ x)\) and is generally trained to match a certain distribution \(p(z)\), from which \(z\) is sampled at inference time.
Contrary to the standard framework, in this work the latent space is discrete: it is defined by a set of latent codes \(z \in \mathbb{R}^{K \times D}\), where \(K\) is the number of codes and \(D\) their dimensionality. More precisely, the input image is first fed to an encoder \(z_e\), which outputs a continuous vector; this vector is then mapped to one of the latent codes of the discrete space via nearest-neighbor search.
Adapting the \(\mathcal{L}_{\text{ELBO}}\) to this formalism, the KL divergence term greatly simplifies and we obtain:
\[\begin{align} \mathcal{L}_{\text{ELBO}}(x) &= \text{KL}(q(z | x) \| p(z)) - \mathbb{E}_{z \sim q(\cdot | x)}(\log p(x | z))\\ &= - \log(p(z_k)) - \log p(x | z_k)\\ \mbox{where }& z_k = z_q(x) = \arg\min_z \| z_e(x) - z \|^2 \tag{1} \end{align}\]In practice, the authors use a categorical uniform prior for the latent codes, meaning the KL divergence is constant and the objective reduces to the reconstruction loss.
Figure: A diagram of the VQ-VAE (left) and a visualization of the embedding space (right). The output of the encoder \(z_e(x)\) is mapped to the nearest code. The gradient (in red) pushes the encoder to change its output, which can alter the configuration, hence the code assignment, in the next forward pass.
Training Objective
As mentioned previously, the \(\mathcal{L}_{\text{ELBO}}\) objective reduces to the reconstruction loss and is used to learn the encoder and decoder parameters. However, the mapping from \(z_e\) to \(z_q\) is not differentiable in a straightforward way (Equation (1)). To remedy this, the authors use a straight-through estimator, meaning the gradients at the decoder input \(z_q(x)\) (quantized) are copied directly to the encoder output \(z_e(x)\) (continuous). However, this means the latent codes involved in the mapping from \(z_e\) to \(z_q\) do not receive gradient updates this way.
Hence, in order to train the discrete embedding space, the authors use Vector Quantization (VQ), a dictionary learning technique, which adds a mean squared error term to pull each latent code towards the continuous vectors it was matched to:
\[\begin{align} \mathcal{L}(x) = - \log p(x\ \vert\ z_q(x)) + \| \overline{z_e(x)} - z_q(x) \|^2 + \beta \| z_e(x) - \overline{z_q(x)} \|^2 \end{align}\]where \(x \mapsto \overline{x}\) denotes the stop-gradient operator. The first term is the reconstruction loss stemming from the ELBO, the second term is the vector quantization contribution. Finally, the last term is a commitment loss that controls the volume of the latent space by forcing the encoder to “commit” to the latent code it was matched with, rather than growing its output space without bound.
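As a concrete illustration, here is a minimal PyTorch-style sketch of the quantization step, the straight-through estimator and the two auxiliary losses. This is my own sketch, not the authors' code; names such as `vector_quantize` and the value of `beta` are assumptions.

```python
import torch

def vector_quantize(z_e, codebook, beta=0.25):
    """Nearest-neighbor quantization with straight-through gradients.

    z_e:      (batch, D) continuous encoder outputs
    codebook: (K, D) learnable latent codes
    """
    # Nearest code for each encoder output (Equation (1)).
    distances = torch.cdist(z_e, codebook)        # (batch, K)
    z_q = codebook[distances.argmin(dim=1)]       # (batch, D)

    # Vector quantization loss: move the selected codes towards the (frozen) encoder outputs.
    vq_loss = ((z_e.detach() - z_q) ** 2).mean()
    # Commitment loss: keep the encoder outputs close to the (frozen) codes.
    commit_loss = beta * ((z_e - z_q.detach()) ** 2).mean()

    # Straight-through estimator: the forward pass uses z_q, but gradients flow
    # from the decoder input directly to z_e, bypassing the argmin.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, vq_loss + commit_loss
```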
Learned Prior
A second contribution of this work is to learn the prior distribution. As mentioned, during training the prior \(p(z)\) is a uniform categorical distribution. Once training is done, the authors fit an autoregressive distribution over the space of latent codes. This is enabled in particular by the fact that the latent space is discrete.
Note: It is not clear to me if the autoregressive model is trained on latent codes sampled from the prior \(z \sim p(z)\) or from the encoder distribution \(x \sim \mathcal{D};\ z \sim q(z\ \vert\ x)\)
Experiments
The proposed model is mostly compared to the standard continuous VAE framework. It seems to achieve similar log-likelihood and sample quality, while taking advantage of the discrete latent space.
For ImageNet for instance, they consider \(K = 512\) latent codes of dimension \(1\). The output of the fully-convolutional encoder \(z_e\) is a feature map of size \(32 \times 32 \times 1\), which is then quantized pixel-wise. Interestingly, the model still performs well when using a powerful decoder (here, PixelCNN [2]), which seems to indicate it does not suffer from posterior collapse as strongly as the standard continuous VAE.
A second set of experiments tackles the problem of audio modeling. The performance of the model is once again satisfactory. Furthermore, the discrete latent space does seem to capture relevant characteristics of the input data structure, although this is a purely qualitative observation.
References
- [1] Auto-Encoding Variational Bayes, Kingma and Welling, ICLR 2014
- [2] Pixel Recurrent Neural Networks, van den Oord et al, arXiv 2016
- Pros (+): Theoretical justification, simple model, easy to implement.
- Cons (-): Some training instability in practice.
Generalized Bound on the Expected Risk
Several theoretical studies of the domain adaptation problem have proposed upper bounds of the risk on the target domain, involving the risk on the source domain and a notion of distance between the source and target distribution, \(\mathcal D_S\) and \(\mathcal D_T\). Here, the authors specifically consider the work of [1]. First, they define the \(\mathcal H\)-divergence:
\[\begin{align} d_{\mathcal H}(\mathcal D_S, \mathcal D_T) = 2 \sup_{h \in \mathcal H} \left| \Pr_{x\sim\mathcal{D}_S} (h(x) = 1) - \Pr_{x\sim\mathcal{D}_T} (h(x) = 1) \right| \tag{1} \end{align}\]where \(\mathcal H\) is a space of (here, binary) hypothesis functions. In the case where \(\mathcal H\) is a symmetric hypothesis class (i.e., \(h \in \mathcal H \implies -h \in \mathcal H\)), one can reduce (1) to the empirical form:
\[\begin{align} d_{\mathcal H}(\mathcal D_S, \mathcal D_T) &\simeq 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 1 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\ &= 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} 1 - [\!|h(x) = 0 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\ &= 2 - 2 \min_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 0 |\!] + \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right| \tag{2} \end{align}\]It is difficult to estimate the minimum over the hypothesis class \(\mathcal H\). Instead, [1] propose to approximate Equation (2) by training a classifier \(\hat{h}\) on samples \(\mathbf{x_S} \in \mathcal{D}_S\) with label 0 and \(\mathbf{x_T} \in \mathcal D_T\) with label 1, and replacing the minimum term by the empirical risk of \(\hat h\).
Given this definition of the \(\mathcal H\)-divergence, [1] further derives an upper bound on the empirical risk on the target domain, which in particular involves a trade-off between the empirical risk on the source domain, \(\mathcal{R}_{D_S}(h)\), and the divergence between the source and target distributions, \(d_{\mathcal H}(D_S, D_T)\).
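Schematically, omitting multiplicative constants and the confidence level \(\delta\), and writing \(\beta\) for the risk of the ideal joint hypothesis on both domains, the bound has the form:

\[\begin{align} \mathcal{R}_{D_T}(h) \ \lesssim\ \mathcal{R}_{D_S}(h) + d_{\mathcal H}(D_S, D_T) + \beta + \sqrt{\frac{\mbox{VC}(\mathcal H)\, \log n}{n}} \end{align}\]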
where \(\mbox{VC}\) designates the Vapnik–Chervonenkis dimension and \(n\) the number of samples.
The rest of the paper directly stems from this intuition: in order to minimize the target risk the proposed Domain Adversarial Neural Network (DANN) aims to build an “internal representation that contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples”.
Proposed
The goal of the model is to learn a classifier \(\phi\), which can be decomposed as \(\phi = G_y \circ G_f\), where \(G_f\) is a feature extractor and \(G_y\) a small classifier on top that predicts the class label. This architecture is trained with a standard classification objective to minimize:
\[\begin{align} \mathcal{L}_y(\theta_f, \theta_y) = \frac{1}{N_s} \sum_{(x, y) \in D_s} \ell(G_y(G_f(x)), y) \end{align}\]Additionally, DANN introduces a domain prediction branch: another classifier \(G_d\) on top of the feature representation \(G_f\), whose goal is to approximate the domain discrepancy (2). This yields a domain classification loss \(\mathcal{L}_d\), which the domain classifier minimizes and the feature extractor seeks to maximize:
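Schematically, labeling source samples with domain 0 and target samples with domain 1, and writing \(\ell_d\) for the binary cross-entropy, the domain loss can be written as:

\[\begin{align} \mathcal{L}_d(\theta_f, \theta_d) = \frac{1}{N_s} \sum_{x \in D_s} \ell_d\left(G_d(G_f(x)), 0\right) + \frac{1}{N_t} \sum_{x \in D_t} \ell_d\left(G_d(G_f(x)), 1\right) \end{align}\]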
The final objective can thus be written as:
\[\begin{align} E(\theta_f, \theta_y, \theta_d) &= \mathcal{L}_y(\theta_f, \theta_y) - \lambda \mathcal{L}_d(\theta_f, \theta_d) \tag{1}\\ \theta_f^\ast, \theta_y^\ast &= \arg\min E(\theta_f, \theta_y, \theta_d) \tag{2}\\ \theta_d^\ast &= \arg\max E(\theta_f, \theta_y, \theta_d) \tag{3} \end{align}\]
Gradient Reversal Layer
Applying standard gradient descent, the DANN objective leads to the following gradient update rules:
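Written with learning rate \(\mu\), and following directly from the saddle-point objective (1-3), the updates read:

\[\begin{align} \theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right)\\ \theta_y &\leftarrow \theta_y - \mu \frac{\partial \mathcal{L}_y}{\partial \theta_y}\\ \theta_d &\leftarrow \theta_d - \mu \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_d} \end{align}\]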
In the case of neural networks, the gradients of the loss with respect to the parameters are obtained with the backpropagation algorithm. These update equations are very similar to the standard backpropagation scheme, except that the derivative of \(\mathcal{L}_d\) enters with opposite signs in the \(\theta_f\) and \(\theta_d\) updates. The authors introduce the gradient reversal layer (GRL) to evaluate both gradients in a single standard backpropagation step.
The idea is that the output of the feature extractor \(G_f\) is propagated unchanged to the domain classifier \(G_d\) in the forward pass; during backpropagation, however, its gradient is multiplied by a negative constant:
\[\begin{align} \frac{\partial \mathcal L_d}{\partial \theta_f} = \frac{\bf{\color{red}{-}} \partial \mathcal L_d}{\partial G_f(x)} \frac{\partial G_f(x)}{\partial \theta_f} \end{align}\]In other words, for the update of \(\theta_d\), the gradients of \(\mathcal L_d\) with respect to the activations are computed normally (minimization), but they are then propagated with a minus sign into the feature extraction part of the network (maximization). Augmented with the gradient reversal layer, the final model is trained by minimizing the sum of losses \(\mathcal L_d + \mathcal L_y\), which corresponds to the optimization problem in (1-3).
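For illustration, a gradient reversal layer is only a few lines in a modern autograd framework. Below is a minimal PyTorch sketch (my own, not the authors' code); `lamb` plays the role of the constant \(\lambda\):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, gradient multiplied by -lamb in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage: domain_logits = G_d(grad_reverse(G_f(x), lamb))
```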
Figure: The proposed architecture includes a deep feature extractor and a deep label predictor. Unsupervised domain adaptation is achieved by adding a domain classifier connected to the feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during backpropagation.
Experiments
Datasets
The paper presents extensive results on the following settings:
- Toy dataset: A toy example based on the two half-moons dataset, where the source domain consists of the standard binary classification task with the two half-moons, and the target is the same but rotated by 30 degrees. They compare DANN to a NN model with the same architecture but without the GRL: in other words, the baseline directly minimizes both the task and domain classification losses.
- Sentiment Analysis: These experiments are performed on the Amazon reviews dataset, which contains product reviews from four different domains (hence 12 different source-to-target scenarios) that have to be classified as either positive or negative.
- Image Classification: Here the model is evaluated on various image classification tasks, including MNIST \(\rightarrow\) SVHN, or different domain pairs from the OFFICE dataset [2].
- Person Re-identification: The task of person identification across various visual domains.
Validation
Setting hyperparameters is a difficult problem, as we cannot directly evaluate the model on the target domain (no labeled data available). Instead of standard cross-validation, the authors use reverse validation based on a technique introduced in [3]: First, the (labeled) source set \(S\) and (unlabeled) target set \(T\) are each split into a training and validation set, \(S'\) and \(S_V\) (resp. \(T'\) and \(T_V\)). Using these splits, a model \(\eta\) is trained on \(S'\rightarrow T'\). Then a second model \(\eta_r\) is trained for the reverse direction on the set \(\{ (x, \eta(x)),\ x \in T'\} \rightarrow S'\). This reverse classifier \(\eta_r\) is then finally evaluated on the labeled validation set \(S_V\), and this accuracy is used as a validation score.
Conclusions
In general, the proposed method seems to perform very well for aligning the source and target domains in an unsupervised domain adaptation framework. Its main advantage is its simplicity, both in terms of theoretical motivation and implementation. In fact, the GRL is easily implemented in standard deep learning frameworks and can be added to any architecture.
The main shortcomings of the method are that (i) all experiments deal with only two domains, and extensions to multiple domains might require some tweaks (e.g., considering the sum of pairwise discrepancies as an upper bound), and (ii) in practice, training can become unstable due to the adversarial training scheme. In particular, the experiments section shows that some stability tricks have to be used during training, such as using momentum or slowly increasing the contribution of the domain classification branch.
Figure: t-SNE projections of the embeddings for the source (MNIST) and target (SVHN) datasets without (left) and with (right) DANN adaptation.
Closely related
Conditional Adversarial Domain Adaptation.
Long et al, NeurIPS 2018 [link]
In this work, the authors propose a class-conditional extension of Domain Adversarial Networks: the domain classifier is conditioned on the input’s class. However, since part of the samples are unlabeled, the conditioning uses the classifier branch’s output as a proxy for the class information. Instead of simply concatenating the feature input with the condition, the authors consider a multilinear conditioning technique which relies on the cross-covariance operator. Another related paper is [4]; it also uses the multi-class information of the input domain, although in a simpler way.
References
- [1] Analysis of representations for Domain Adaptation, Ben-David et al, NeurIPS 2006
- [2] Adapting visual category models to new domains, Saenko et al, ECCV 2010
- [3] Person re-identification via structured prediction, Zhang and Saligrama, arXiv 2014
- [4] Multi-Adversarial Domain Adaptation, Pei et al, AAAI 2018
- Pros (+): Interesting results, with connections to Style Transfer and Network inversion.
- Cons (-): Seems like the results might depend a lot on parameter initialization, learning rate etc.
Background
Given a random noise vector \(z\) and conditioned on an image \(x_0\), the goal of conditional image generation is to generate image \(x = f_{\theta}(z; x_0)\) (where the random nature of \(z\) provides a sampling strategy for \(x\)); for instance, the task of generating a high quality image \(x\) from its lower resolution counterpart \(x_0\).
In particular, this encompasses inverse tasks such as denoising, super-resolution and inpainting, which act at the local pixel level. Such tasks can often be phrased with an objective of the following form:
\[\begin{align} x^{\ast} = \arg\min_x E(x, x_0) + R(x) \end{align}\]where \(E\) is a cost function and \(R\) is a prior on the output space acting as a regularizer. \(R\) is often a hand-crafted prior, for instance a smoothness constraint like Total Variation [1], or, for more recent techniques, it can be implemented with adversarial training (e.g., GANs).
Deep Image Prior
In this paper, the goal is to replace \(R\) by an implicit prior captured by the neural network itself, relative to the input noise \(z\). In other words:
\[\begin{align} R(x) &= 0\ \mbox{if}\ \exists \theta\ \mbox{s.t.}\ x = f_{\theta}(z)\\ R(x) &= + \infty,\ \mbox{otherwise} \end{align}\]This results in the following workflow:
\[\begin{align} \theta^{\ast} = \arg\min_{\theta} E(f_{\theta}(z; x_0), x_0) \mbox{ and } x^{\ast} = f_{\theta^{\ast}}(z; x_0) \end{align}\]One could wonder if this is a good choice for a prior at all. In fact, \(f\), being instantiated as a neural network, should be powerful enough that any image \(x\) can be generated from \(z\) for a certain choice of parameters \(\theta\), which means the prior should not be constraining.
However, the structure of the network itself effectively affects how optimization algorithms such as gradient descent will explore the output space.
To quantify this effect, the authors perform a reconstruction experiment (i.e., \(E(x) = \| x - x_0 \|\)) for different choices of the input image \(x_0\): (i) a natural image, (ii) the same image with small perturbations, (iii) the same image with large perturbations, and (iv) white noise, using a U-Net [2] inspired architecture. Experimental results show that the network descends faster to natural-looking images (cases (i) and (ii)) than to random noise (cases (iii) and (iv)).
Figure: Learning curves for the reconstruction task using: a natural image, the same plus i.i.d. noise, the same but randomly scrambled, and white noise.
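As a concrete illustration of the workflow above, the whole method boils down to a short optimization loop. Below is a minimal PyTorch sketch (my own, with assumed names: `f` an untrained image-generating network, `z` a fixed random input, `E` a differentiable task cost); it is not the authors' implementation:

```python
import torch

def deep_image_prior(f, z, x0, E, steps=3000, lr=1e-2):
    """Fit the weights of an untrained network f so that f(z) explains x0."""
    optimizer = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x = f(z)             # candidate image generated from the fixed noise input
        loss = E(x, x0)      # data term only: the network structure acts as the prior R
        loss.backward()
        optimizer.step()
    return f(z).detach()     # x* = f_{theta*}(z)
```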
Experiments
The experiments focus on three image analysis tasks:
- Image denoising (\(E(x, x_0) = \|x - x_0\|\)), based on the previous observation that the model converges more easily to natural-looking images than noisy ones.
- Super-Resolution (\(E(x, x_0) = \| \mbox{downscale}(x) - x_0 \|\)), to upscale the resolution of the input image \(x_0\).
- Image inpainting (\(E(x, x_0) = \|(x - x_0) \odot m\|\)) where the input image \(x_0\) is masked by a mask \(m\) and the goal is to recover the missing pixels.
The method seems to outperform most non-trained methods, when available (e.g., bicubic upsampling for Super-Resolution), but is still often outperformed by learning-based ones. The inpainting results are particularly interesting, and I do not know of any other non-trained baseline for this task. It obviously performs poorly when the obscured region requires highly semantic knowledge, but it seems to perform well on more reasonable benchmarks.
Additionally, the authors test the proposed prior for diagnosing neural networks by generating natural pre-images for neural activations of deep layers. Qualitative images look better than other handcrafted priors (total variation) and are not biased to specific datasets as are trained methods.
Figure: Example comparison between the proposed Deep Image Prior and various baselines for the task of Super-Resolution.
Closely related (follow-up work)
Deep Decoder: Concise Image Representations from Untrained Non-Convolutional Networks
Heckel and Hand, [link]
This paper builds on Deep Image Prior (DIP) but proposes a much simpler architecture, which is under-parametrized and non-convolutional. In particular, there are fewer weight parameters than the dimensionality of the output image (in comparison, DIP was using a U-Net based architecture). This property implies that the weights of the network can additionally be used as a compressed representation of the image. To test for compression, the authors use their architecture to reconstruct an image \(x\) for different compression ratios \(k\) (i.e., the number of network parameters \(N\) is \(k\) times smaller than the output dimension of the image).
The deep decoder architecture combines standard blocks, including linear combinations of channels (convolutions), ReLU, batch normalization and upscaling. Note that since we have the special case of batch size 1 here, the Batch Norm operator essentially normalizes the activations channel-wise. The paper also contains a nice theoretical justification for the denoising case, in which the authors show that the model can only fit a limited amount of noise, which explains why it converges to more natural-looking images; however, the result only applies to small networks (one layer, possibly generalizable to multi-layer networks without batch-norm).
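From this description, a deep decoder block can be sketched in a few lines of PyTorch (my reading of the description above, not the authors' code; the channel sizes are arbitrary placeholders):

```python
import torch.nn as nn

def deep_decoder(channels=(64, 64, 64, 64), out_channels=3):
    """Stack of (1x1 conv -> upsample -> ReLU -> channel-wise normalization) blocks."""
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),  # linear combination of channels
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.ReLU(),
            nn.BatchNorm2d(c_out),  # with batch size 1 this normalizes each channel
        ]
    layers += [nn.Conv2d(channels[-1], out_channels, kernel_size=1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```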
References
- [1] An introduction to Total Variation for Image Analysis, Chambolle et al., Technical Report, 2009
- [2] U-Net: Convolutional Networks for Biomedical Image Segmentation, Ronneberger et al., MICCAI 2015
CNN architectures with a notion of relational reasoning, particularly useful for tasks such as visual question answering, dynamics understanding, etc.
- Pros (+): Simple architecture, relies on small and flexible modules.
- Cons (-): Still a black-box module, hard to quantify how much "reasoning" happens.
Proposed Model
The main idea of Relation Networks (RN) is to constrain the functional form of convolutional neural networks so as to explicitly learn relations between entities, rather than hoping for this property to emerge in the representation during training. Formally, let \(O = \{o_1 \dots o_n\}\) be a set of objects of interest; the Relation Network is trained to learn a representation that considers all pairwise relations across the objects:
\[\begin{align} \mbox{RN}(O) = f_{\phi}\left( \sum_{i, j} g_{\theta}(o_i, o_j) \right) \end{align}\]
\(f_{\phi}\) and \(g_{\theta}\) are defined as Multi-Layer Perceptrons. By definition, the Relation Network (i) has to consider all pairs of objects, (ii) operates directly on the set of objects, hence is not constrained to a specific organization of the data, and (iii) is data-efficient in the sense that only one function, \(g_{\theta}\), is learned to capture all the possible relations: \(g\) and \(f\) are typically light modules, and most of the overhead comes from the sum over the \(n^2\) pairwise terms.
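A minimal PyTorch sketch of this forward pass (my own illustration, not the authors' code) makes the \(n^2\) pairwise structure explicit:

```python
import torch

def relation_network(objects, g_theta, f_phi):
    """objects: tensor of shape (n, k); g_theta, f_phi: small MLPs (torch.nn.Module)."""
    n = objects.shape[0]
    # Build all n^2 ordered pairs (o_i, o_j).
    o_i = objects.unsqueeze(1).expand(n, n, -1)          # (n, n, k)
    o_j = objects.unsqueeze(0).expand(n, n, -1)          # (n, n, k)
    pairs = torch.cat([o_i, o_j], dim=-1).reshape(n * n, -1)
    # One shared relation function g over every pair, aggregated by a sum,
    # then a single readout function f on the aggregate.
    return f_phi(g_theta(pairs).sum(dim=0))
```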
The objects are the basic elements of the relational process we want to model. They are defined with regard to the task at hand, for instance:
- Attending to relations between objects in an image: The image is first processed through a fully-convolutional network. Each cell of the resulting feature map is taken as an object: a feature vector of dimension \(k\), additionally tagged with its position in the feature map.
- Sequence of images: Each image is first fed through a feature extractor and the resulting embedding is used as an object. The goal is to model relations between images across the sequence.
Figure: Example of applying the Relation Network to Visual Question Answering. Questions are processed with an LSTM to produce a question embedding, and images are processed with a CNN to produce a set of objects for the RN.
Experiments
The main evaluation is done on the CLEVR dataset [2]. The main message seems to be that the proposed module is very simple and yet often improves the model accuracy when added to various architectures (CNN, CNN + LSTM etc.) introduced in [1]. The main baseline they compare to (and outperform) is Spatial Attention (SA) which is another simple method to integrate some form of relational reasoning in a neural architecture.
Closely related
Recurrent Relational Neural Networks [3]
Palm et al, [link]
This paper builds on the Relation Network architecture and proposes to explore more complex relational structures, defined as a graph, using a message passing approach. Formally, we are given a graph with vertices \(\mathcal V = \{v_i\}\) and edges \(\mathcal E = \{e_{i, j}\}\). By abuse of notation, \(v_i\) also denotes the embedding of vertex \(i\) (e.g., obtained via a CNN), and \(e_{i, j}\) is 1 when \(i\) and \(j\) are linked, 0 otherwise. To each node we associate a hidden state \(h_i^t\) at iteration \(t\), which is updated via message passing. After a few iterations, the resulting state is passed through an MLP \(r\) to output the result (either for each node or for the whole graph):
\[\begin{align} h_i^0 &= v_i\\ h_i^{t + 1} &= f_{\phi} \left( h_i^t, v_i, \sum_{j} e_{i, j} g_{\theta}(h^t_i, h^t_j) \right)\\ o_i &= r(h_i^T) \mbox{ or } o = r(\sum_i h_i^T) \end{align}\]
Comparing to the original Relation Network:
- Each update rule is a Relation Network that only looks at pairwise relations between linked vertices. The message passing scheme additionally introduces the notion of recurrence, and the dependency on the previous hidden state.
- The dependence on \(h_i^t\) could in theory be avoided by adding self-edges from \(v_i\) to \(v_i\), to make it closer to the Relation Network formulation.
- Adding \(v_i\) as input of \(f_\phi\) looks like a simple trick to avoid long-term memory problems.
The experiments essentially compare the proposed RRNN model to the Relation Network and to classical recurrent architectures such as LSTMs. They consider three datasets:
- bAbI. An NLP question-answering task with some reasoning involved. The model solves 19.7 (out of 20) tasks on average, while the simple RN solved around 18 of them reliably.
- Pretty CLEVR. A CLEVR-like dataset (with only simple 2D shapes) with questions involving a varying number of reasoning steps, e.g., “which is the shape \(n\) steps away from the red circle?”
- Sudoku. The graph contains 81 nodes (one for each cell of the Sudoku grid), with edges between cells belonging to the same row, column or block.
Multi-Layer Relation Neural Networks [4]
Jahrens and Martinetz, [link]
This paper presents a very simple trick to make the Relation Network consider higher-order relations than pairwise ones, while retaining some efficiency. Essentially, the model can be written as follows:
\[\begin{align} h_{i, j}^0 &= g^0_{\theta}(x_i, x_j) \\ h_{i, j}^t &= g^{t + 1}_{\theta}\left(\sum_k h_{i, k}^{t - 1}, \sum_k h_{j, k}^{t - 1}\right) \\ MLRN(O) &= f_{\phi}(\sum_{i, j} h^T_{i, j}) \end{align}\]
It is not clear why this model would be equivalent to explicitly considering higher-order relations (it rather combines pairwise terms for a finite number of steps). According to the experiments, this architecture does seem better suited to the studied tasks (e.g., compared to the Relation Network or the Recurrent Relational Network), but it also makes the model even harder to interpret.
References
- [1] Inferring and executing programs for visual reasoning, Johnson et al, ICCV 2017
- [2] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, Johnson et al, CVPR 2017
- [3] Recurrent Relational Neural Networks, Palm et al, NeurIPS 2018
- [4] Multi-Layer Relation Neural Networks, Jahrens and Martinetz, arXiv 2018
This paper proposes the Compositional Recursive Learner (CRL) to learn at the same time both the structure of the task and its components.
- Pros (+): This problem is well motivated, and seems a very promising direction, for learning domain-agnostic components.
- Cons (-): The actual implementation description lacks crucial details and I am not sure how easy it would be to reimplement.
Proposed model
Problem definition
A problem \(P_i\) is defined as a transformation \(x_i : t_x \mapsto y_i : t_y\), where \(t_x\) and \(t_y\) are the respective types of \(x_i\) and \(y_i\). However, since we only consider recursive problems here, \(t_x = t_y\).
We define a family of problems \(\mathcal P\) as a set of composite recursive problems that share regularities. The goal of CRL is to extrapolate to solve new compositions of these tasks, using knowledge from the limited subset of tasks it has seen during training.
Implementation
In essence, the problem can be formulated as a sequential decision-making task via a meta-level MDP (\(\mathcal X\), \(\mathcal F\), \(\mathcal P_{\mbox{meta}}\), \(r\), \(\gamma\)), where \(\mathcal X\) is the set of states, i.e., representations; \(\mathcal F\) is a set of computations, i.e., instances of the transformations we consider (implemented for instance as neural networks), plus an additional special function HALT that stops the execution; \(\mathcal P_{\mbox{meta}}: (x_t, f_t, x_{t + 1}) \mapsto c \in [0, 1]\) is the policy which assigns a probability to each possible transition. Finally, \(r\) is the reward function and \(\gamma\) a decay factor.
More specifically, the CRL is implemented as a set of neural networks, \(f_k \in \mathcal F\), and a controller \(\pi(f\ |\ \mathbf{x}, t_y)\) which selects the best course of action given the current history of representations \(\mathbf{x}\) and target type \(t_y\).
The loss is back-propagated through the functions \(f\), and the controller is trained as a Reinforcement Learning (RL) agent with a sparse reward (it only knows the final target result).
An additional important element of the training scheme is the use of curriculum learning, i.e., starting by learning small transformations and then considering more complex compositions, increasing the state space little by little.
Figure: (top-left) CRL is a symbiotic relationship between a controller and an evaluator: the controller selects a module `m` given an intermediate representation `x`, and the evaluator applies `m` on `x` to create a new representation. (bottom-left) CRL dynamically learns the structure of a program customized for its problem, and this program can be viewed as a finite state machine. (right) A series of computations in the program is equivalent to a traversal through a meta-MDP, where modules can be reused across different stages of computation, allowing for recursive computation.
Experiments
Multilingual Arithmetic
The learner aims to solve recursive arithmetic expressions across five languages: English, numerals, Pig Latin, reversed English, and Spanish. The input is a tuple \((x^s, t_y)\), where \(x^s\) is the arithmetic expression expressed in source language \(s\), and \(t_y\) is the output language.
- Training: The learner trains on a curriculum of a limited set of 2, 3, 4, 5-length expressions. During training, each source language is seen with four target languages (and one held out for testing) and each target language is seen with four source languages (and one held out for testing).
- Testing: The learner is asked to generalize to 5-length expressions (test set) and to extrapolate to 10-length expressions (extrapolation set) with unseen language pairs.
The authors consider two main types of functional units for this task: a reducer, which takes as input a window of three terms of the input expression and outputs a softmax distribution over the vocabulary, and a translator, which applies a function to every element of the input sequence and outputs a sequence of the same size.
The CRL is compared to a baseline RNN architecture that directly tries to map a variable-length input sequence to the target output. On the test set, the RNN and CRL yield similar accuracies, although CRL usually requires fewer training samples and/or fewer training iterations. On the extrapolation set, however, CRL more clearly outperforms the RNN.
The CRL results usually have a much larger variance, which would be interesting to analyze qualitatively. Moreover, the use of curriculum learning significantly improves the model performance. Finally, qualitative results show that the reducers and translators are interpretable to some degree: e.g., it is possible to map some of the reducers to specific operations; however, due to the unsupervised nature of the task, the mapping is not always straightforward.
Image Transformations
This time the functional units are composed of three specialized Spatial Transformer Networks [1], which handle rotation, scale and translation respectively, and an identity function. Overall this setting does not yield very good quantitative results. More precisely, one of the main challenges, since we are acting on a visual domain, is to deduce the structure of the task from information which lacks clear structure (pixel matrices). Additionally, the fact that all inputs and outputs live in the same domain (images) and that only a sparse reward is available makes it more difficult for the controller to distinguish between functionalities, i.e., it could collapse to using only one transformer.
References
- [1] Spatial Transformer Networks, Jaderberg et al., NeurIPS 2015
This paper tackles SAT problems with weak supervision: a model is trained only to predict the satisfiability of a formula in conjunctive normal form. As a byproduct, if the formula is satisfiable, an actual satisfying assignment can be worked out from the network's activations in most cases.
- Pros (+): No need for extensive annotation, seems to extrapolate nicely to harder problems by increasing the number message passing iterations.
- Cons (-): Limited practical applicability, since it is outperformed by classical SAT solvers.
Model: NeuroSAT
Input
We consider boolean logic formulas in their conjunctive normal form (CNF), i.e. each input formula is represented as a conjunction (\(\land\)) of clauses, which are themselves disjunctions (\(\lor\)) of literals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable.
A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a literal in all clauses, invariance to permutations in \(\lor\) and \(\land\) etc.). The authors use a standard undirected graph representation where:
- \(\mathcal V\): vertices are the literals (positive and negative form of variables, denoted as \(x\) and \(\bar x\)) and the clauses occurring in the input formula
- \(\mathcal E\): Edges are added to connect (i) the literals with clauses they appear in and (ii) each literal to its negative counterpart.
The graph relations are encoded as an adjacency matrix, \(A\), with as many rows as there are literals and as many columns as there are clauses. Note that this structure does not constrain the vertices ordering, and does not make any preferential treatment between positive or negative literals. However it still has some caveats, which can be avoided by pre-processing the formula. For instance when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives.
Message-passing model
In a high-level view, the model keeps track of an embedding for each literal and each clause (\(L^t\) and \(C^t\)), updated via message passing on the graph, and combined via a Multi-Layer Perceptron (MLP) to output the model's prediction of the formula's satisfiability. The model updates are as follows:
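Schematically, writing \(A\) for the adjacency matrix defined above and using one LSTM for the clauses and one for the literals (this is a reconstruction of the update equations, so the exact parametrization may differ from the paper), the updates read:

\[\begin{align} C^{t+1}, h_C^{t+1} &= \texttt{LSTM}_C\left(h_C^t,\ A^\top \texttt{MLP}_L(L^t)\right) \tag{1}\\ L^{t+1}, h_L^{t+1} &= \texttt{LSTM}_L\left(h_L^t,\ \left[A\, \texttt{MLP}_C(C^{t+1}),\ \overline{L^t}\right]\right) \tag{2} \end{align}\]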
where \(h\) designates a hidden context vector for the LSTMs. The operator \(L \mapsto \overline{L}\) returns the embedding matrix \(L\) where the row of each literal is swapped with the one corresponding to the literal's negation. In other words, in (1) each clause embedding is updated based on the literals that compose it, while in (2) each literal embedding is updated based on the clauses it appears in and on its negated counterpart.
After \(T\) iterations of this message-passing scheme, the model computes a logit for the satisfiability classification problem, which is trained via sigmoid cross-entropy:
\[\begin{align} L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\ y^t &= \mbox{mean}(L^t_{\mbox{vote}}) \end{align}\]
Building the training set
The training set is built such that for any satisfiable training formula \(S\), it also includes an unsatisfiable counterpart \(S'\) which differs from \(S\) only by the negation of one literal in one clause. These carefully curated samples should force the model to pick up substantial characteristics of the formulas. In practice, the model is trained on formulas containing up to 40 variables and on average 200 clauses. At this size, the SAT problem can still be solved by state-of-the-art solvers (yielding the supervision required to train the model), but the formulas are large enough to prove challenging for machine learning models.
Inferring the SAT assignment
When a formula is satisfiable, one often also wants to know a valuation (variable assignment) that satisfies it. Recall that \(L^t_{\mbox{vote}}\) encodes a “vote” for every literal and its negative counterpart. Qualitative experiments show that those scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows:
- (1) Reshape \(L^T_{\mbox{vote}}\) to size \((n, 2)\), where \(n\) is the number of variables, so that each row holds the votes of a literal and its negation.
- (2) Cluster the literals into two clusters with centers \(\Delta_1\) and \(\Delta_2\) using the following criterion: \[\begin{align} \|x_i - \Delta_1\|^2 + \|\overline{x_i} - \Delta_2\|^2 \leq \|x_i - \Delta_2\|^2 + \|\overline{x_i} - \Delta_1\|^2 \end{align}\]
- (3) Try the two resulting assignments (set \(\Delta_1\) to true and \(\Delta_2\) to false, or vice-versa) and choose the one that yields satisfiability if any.
In practice, this method retrieves a satisfying assignment for over 70% of the satisfiable test formulas.
Experiments
In practice, the NeuroSAT model is trained with embeddings of dimension 128 and 26 message passing iterations. The MLP architectures are very standard: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula’s satisfiability on the test set.
It can also generalize to larger problems, although this requires increasing the number of message passing iterations. However, the classification performance decreases significantly (e.g., 25% for 200 variables) and the number of iterations required scales linearly with the number of variables (at least in the paper's experiments).
Figure: (left) Success rate of a NeuroSAT model trained on 40 variables for test set involving formulas with up to 200 variables, as a function of the number of message-passing iterations. (right) The sequence of literal votes across message-passing iterations on a satisfiable formula. The vote matrix is reshaped such that each row contains the votes for a literal and its negated counterpart. For several iterations, most literals vote unsat with low confidence (light blue). After a few iterations, there is a phase transition and all literals vote sat with very high confidence (dark red), until convergence.
Interestingly, the model generalizes well to other classes of problems that were reduced to SAT (using SAT's NP-completeness), even though they have a different structure than the random formulas generated for training; this seems to show that the model does learn some general characteristics of boolean formulas.
To summarize, the model takes advantage of the structure of Boolean formulas, and is able to predict whether an input formula is satisfiable or not with high accuracy. Moreover, even though trained only with this weak supervisory signal, it can work out a valid assignment most of the time. However it is still subpar compared to standard SAT solvers, which makes its applicability limited.
This paper deals with flow-based generative models, which allow for exact log-likelihood computation (unlike VAEs or GANs) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). It proposes a new, more flexible form of invertible flow for generative models, which builds on [3].
- Pros (+): Very clear presentation, promising results both quantitative and qualitative.
- Cons (-): One of the disadvantages of the model seems to be a large number of parameters; it would be interesting to have a more detailed report on training time. Also, a comparison to [5] (a variant of PixelCNN that allows for faster, parallelized sample generation) would be nice.