Currently, I am working on high-resolution, controllable text-to-image generation at Ideogram AI. My work spans the full stack of model research and development, including pre-training, post-training, large-scale data processing pipelines, and serving models to millions of users.
Previously at Google, I helped develop some of the first large-scale text-to-image and text-to-video diffusion models [Imagen, Imagen Video, Imagen Editor],
some of the first image-to-image diffusion models [SR3, Palette], and state-of-the-art non-autoregressive machine translation methods [Imputer MT].
During my undergrad, I interned at MILA and worked on analysing the sample complexity of learning algorithms for grounded language learning [BabyAI].
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models.
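A toy sketch of the cascade idea in Python (the stage functions below are hypothetical stand-ins; the real stages are learned diffusion samplers):

```python
import numpy as np

def base_model(prompt, shape=(16, 24, 40, 3)):
    # Stand-in for the base text-to-video diffusion sampler: returns a
    # low-frame-rate, low-resolution video as a (T, H, W, C) array.
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.random(shape)

def spatial_sr(video):
    # Stand-in for a spatial super-resolution stage: 2x upsample H and W.
    return video.repeat(2, axis=1).repeat(2, axis=2)

def temporal_sr(video):
    # Stand-in for a temporal super-resolution stage: 2x upsample frames.
    return video.repeat(2, axis=0)

def generate(prompt):
    # Base model followed by interleaved spatial/temporal SR stages.
    video = base_model(prompt)
    for stage in (spatial_sr, temporal_sr, spatial_sr, temporal_sr):
        video = stage(video)
    return video

print(generate("a teddy bear washing dishes").shape)  # (64, 96, 160, 3)
```

Interleaving spatial and temporal stages lets each model operate at a manageable resolution and frame rate, while the cascade compounds to high-definition output.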
We introduce Imagen Editor, a state-of-the-art solution for the task of masked inpainting — i.e., when a user provides text instructions alongside an overlay or “mask” (usually generated within a drawing-type interface) indicating the area of the image they would like to modify.
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation.
We present a framework for blind deblurring based on conditional diffusion models. Unlike existing techniques, we train a stochastic sampler that refines the output of a deterministic predictor and is capable of producing a diverse set of plausible reconstructions for a given input.
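A minimal sketch of this predict-then-refine structure, with hypothetical `deterministic_predictor` and `refine_step` functions standing in for the learned models:

```python
import numpy as np

def deterministic_predictor(blurry):
    # Stand-in for the learned regression model that produces a single
    # initial "best guess" deblurred estimate.
    return blurry

def refine_step(x, blurry, noise_level, rng):
    # Stand-in for one stochastic refinement step conditioned on the input:
    # pull the estimate toward the observation and inject noise.
    return 0.9 * x + 0.1 * blurry + rng.normal(scale=noise_level, size=x.shape)

def sample(blurry, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = deterministic_predictor(blurry)
    for noise_level in np.linspace(0.5, 0.0, steps):
        x = refine_step(x, blurry, noise_level, rng)
    return x

blurry = np.zeros((8, 8))
# Different seeds yield different plausible reconstructions:
print(np.allclose(sample(blurry, seed=0), sample(blurry, seed=1)))  # False
```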
We introduce Palette, an image-to-image translation model that leverages diffusion generative models. We show strong performance of Palette on colorization, inpainting, uncropping, and JPEG artifact removal. We apply Palette to these tasks without any task-specific tuning of the loss function or architecture, and show that Palette outperforms several strong task-specific GAN and autoregressive model baselines.
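One common way such image-to-image diffusion models are conditioned is by stacking the source image with the noisy target along the channel axis, so a single network handles every task. A toy sketch (the `denoiser` below is a hypothetical stand-in for the learned U-Net):

```python
import numpy as np

def denoiser(x_and_source, step):
    # Hypothetical stand-in for the learned network: consumes the noisy
    # target and the source image stacked along channels, and returns a
    # slightly denoised target (here just a toy damping of the noise).
    noisy_target = x_and_source[..., :3]
    return 0.5 * noisy_target

def sample(source, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=source.shape)  # start the target from pure noise
    for step in reversed(range(steps)):
        inp = np.concatenate([x, source], axis=-1)  # (H, W, 6) conditioning
        x = denoiser(inp, step)
    return x

grayscale = np.zeros((64, 64, 3))  # e.g. a grayscale input for colorization
print(sample(grayscale).shape)     # (64, 64, 3)
```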
We show that cascaded diffusion models are capable of generating high-fidelity images on the class-conditional ImageNet generation challenge, without any assistance from auxiliary image classifiers to boost sample quality. We outperform BigGAN-deep and VQ-VAE-2 on both FID and Classification Accuracy Score (CAS).
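For reference, FID is the Fréchet distance between Gaussians fit to feature embeddings of real and generated images. A small sketch using random placeholder features (in practice the features come from a pretrained Inception network):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    # Frechet distance between Gaussians fit to the two feature sets:
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^{1/2})
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(s1 @ s2).real  # matrix square root (drop tiny imaginary parts)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))

# Placeholder features; in practice these are Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 16))
fake = rng.normal(loc=0.1, size=(1000, 16))
print(fid(real, fake))  # a small positive distance
```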
We adapt score-matching-based diffusion models to image super-resolution. We achieve a fool rate of 50% on face super-resolution and 40% on ImageNet super-resolution, where the fool rate measures how often human raters mistake model outputs for reference images. We cascade multiple super-resolution models to efficiently generate 1024x1024 unconditional faces and 256x256 class-conditional natural images.
We apply latent-alignment models to non-autoregressive machine translation. We achieve state-of-the-art results on WMT14 En-De for single-step generation using CTC, and for iterative generation using the Imputer.
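For intuition, CTC emits a latent alignment in a single parallel step and then deterministically collapses it into the output sequence. A minimal sketch of the collapse rule (the token alphabet here is made up):

```python
BLANK = "_"

def ctc_collapse(alignment):
    # Collapse a latent alignment: merge adjacent repeats, drop blanks.
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

# Greedy single-step decoding picks one token per source position in
# parallel, then collapses the alignment into the final translation.
alignment = ["Das", "Das", "_", "ist", "_", "_", "gut", "gut"]
print(ctc_collapse(alignment))  # ['Das', 'ist', 'gut']
```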
We introduce a semi-autoregressive model for speech recognition that uses a tractable dynamic programming algorithm to approximately marginalize over all latent alignments and generation orders.
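The flavor of dynamic program involved is closely related to the classic CTC forward recursion, which sums over all alignments of a label sequence to T frames. A self-contained sketch (illustrative only; the actual model also marginalizes over generation orders):

```python
import numpy as np

def ctc_forward_logprob(log_probs, labels, blank=0):
    # Sum, in log space, over every alignment of `labels` to T frames.
    # log_probs: (T, V) per-frame log distributions over the vocabulary.
    T, _ = log_probs.shape
    ext = [blank]                     # interleave blanks: [a,b] -> [-,a,-,b,-]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)  # alpha[t, s]: log prob of valid prefixes
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                 # stay on same symbol
            if s > 0:
                terms.append(alpha[t - 1, s - 1])     # advance one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])     # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    # Valid alignments end on the last label or the trailing blank.
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(ctc_forward_logprob(log_probs, [2, 3, 3]))  # a finite log probability
```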
We study the impact of false negatives in the GAIL algorithm and present a method to diagnose them. We further present a solution, Fake Conditioning, which improves the sample complexity of learning from human demonstrations by an order of magnitude compared to Behavioral Cloning.
We present a platform for studying the sample efficiency of grounded language learning. It includes a number of tasks of varying complexity, and we present a rigorous sample-complexity benchmark for each task.