"I Know It When I See It":
Mood Spaces for Connecting and Expressing Visual Concepts
GPT-4o fails to blend a violin and a guitar.
Because GPT-4o is heavily guided by natural language, it cannot capture that a lute is a blend of a violin and a guitar, nor how the lute should be held.
GPT-4o fails to complete the visual analogy from riding a bike to riding a horse.
GPT-4o focuses on object details but misses the overall concept: riding a bike or a horse requires reasoning about multiple attributes. GPT-4o gets the riding direction backwards and misses the intended change in the horse's pose.
Visual concepts go beyond natural language descriptions.
Can you prompt ChatGPT to generate creative images?
What's the missing creativity in ChatGPT?
ChatGPT can copy-paste existing image parts, e.g. a dog head on a fish body.
But it struggles to blend those parts into a new one that looks like a true hybrid of the two.
What makes creativity difficult?
Visual concepts (e.g. duck and toilet paper) are disconnected in the embedding space.
Creative blending requires finding a path that connects the disconnected concepts (a small probe of this gap is sketched below).
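As a minimal, illustrative probe (not the paper's method), the sketch below linearly interpolates the CLIP image embeddings of two disconnected concepts and reports how the blend relates to either endpoint; the file names are hypothetical placeholders.

```python
# Illustrative probe (not the paper's method): linearly interpolate two CLIP
# image embeddings and measure how the blend relates to either endpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: str) -> torch.Tensor:
    """CLIP image embedding, L2-normalized."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        z = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(z, dim=-1)

z_duck = embed("duck.jpg")            # hypothetical file names
z_paper = embed("toilet_paper.jpg")

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    z_t = torch.nn.functional.normalize((1 - t) * z_duck + t * z_paper, dim=-1)
    print(f"t={t:.2f}  sim(duck)={torch.cosine_similarity(z_t, z_duck).item():.3f}  "
          f"sim(paper)={torch.cosine_similarity(z_t, z_paper).item():.3f}")
```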
How to improve creativity? Mood Space
- Mood Space connects disconnected visual concepts, e.g. duck -> toilet paper
- Mood Space picks up only the most relevant concepts from the Mood Board
How to control creativity? Mood Board
A Mood Board is a set of 2-20 images used to train the Mood Space (a rough stand-in for this training step is sketched after this list).
The Mood Space picks up only the most relevant concepts from the Mood Board:
- e.g. a Mood Board of 20 face images -> the face is the most relevant concept
- e.g. a Mood Board of 20 first-person images -> the hand is the most relevant concept
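The paper's actual fibration computation is more involved; as a rough stand-in, the sketch below pools CLIP patch tokens from a few Mood Board exemplars and compresses them into a compact basis with PCA. The model choice, target dimension, and file names are assumptions for illustration only.

```python
# Rough stand-in for learning a compact space from a Mood Board (the paper's
# fibration computation is more sophisticated): pool CLIP patch tokens from
# a few exemplars and compress them with PCA.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def patch_tokens(path: str) -> torch.Tensor:
    """Per-patch CLIP tokens for one image, shape (n_patches, d)."""
    inputs = proc(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0, 1:]   # drop the CLS token

# Hypothetical Mood Board of a few exemplar images (the paper uses 2-20).
board = ["face_01.jpg", "face_02.jpg", "face_03.jpg"]
tokens = torch.cat([patch_tokens(p) for p in board], dim=0)

# q is the compact dimension; the paper reports a 50-100x reduction.
mean = tokens.mean(dim=0)
U, S, V = torch.pca_lowrank(tokens - mean, q=16)

def compress(t: torch.Tensor) -> torch.Tensor:
    """Project tokens into the compact space spanned by the Mood Board."""
    return (t - mean) @ V
```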
How to generate the new visual concept?
Image-conditioned diffusion model: an IP-Adapter decodes the interpolated CLIP embeddings into images (sketch below).
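A hedged sketch of this decoding step: encode two exemplars with the pipeline's IP-Adapter image encoder, linearly blend the embeddings, and decode each blend with Stable Diffusion. It assumes a recent diffusers release where the pipeline accepts precomputed ip_adapter_image_embeds; argument names vary across versions, and the file names are placeholders.

```python
# Hedged sketch: decode interpolated IP-Adapter image embeddings with
# Stable Diffusion. Assumes a recent diffusers release (~0.26+); the
# ip_adapter_image_embeds plumbing differs in older versions.
import torch
from PIL import Image
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(1.0)

def ip_embeds(img: Image.Image):
    # Returns CFG-formatted [negative; positive] embeddings for one image.
    return pipe.prepare_ip_adapter_image_embeds(
        ip_adapter_image=img, ip_adapter_image_embeds=None,
        device="cuda", num_images_per_prompt=1,
        do_classifier_free_guidance=True)

e_a = ip_embeds(Image.open("violin.jpg"))   # hypothetical endpoints
e_b = ip_embeds(Image.open("guitar.jpg"))

frames = []
for t in torch.linspace(0, 1, steps=5):
    blend = [(1 - t) * a + t * b for a, b in zip(e_a, e_b)]
    frames.append(pipe(prompt="", ip_adapter_image_embeds=blend,
                       num_inference_steps=30).images[0])
```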
Results
Interpolation
Given Image A1 and Image B1, interpolate between them.
Visual Analogy by Path Lifting
Given Image A1 → Image B1, what is Image A2 → Image B2?
A1 → B1 defines a path in the Mood Space.
A2 → B2 follows the same path defined by A1 → B1 (see the sketch below).
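In the locally linear Mood Space, path lifting reduces to plain vector arithmetic: z_B2 = z_A2 + (z_B1 - z_A1). A minimal sketch, assuming the embeddings already live in the learned Mood Space and that decoding back to pixels goes through the image-conditioned diffusion model above:

```python
import numpy as np

def lift_path(z_a1: np.ndarray, z_b1: np.ndarray, z_a2: np.ndarray) -> np.ndarray:
    """Apply the A1 -> B1 edit direction to A2 in Mood Space."""
    return z_a2 + (z_b1 - z_a1)

# Usage: z_b2 = lift_path(z_a1, z_b1, z_a2); decode z_b2 to get Image B2.
```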
Analysis
When Baseline CLIP Space Fails
Baseline CLIP space interpolation fails for semantically disconnected concepts:
- e.g. Duck → Toilet paper
- e.g. Duck → Pixel art duck
When Baseline CLIP Space Works
CLIP space interpolation works well for semantically connected concepts:
- e.g. Duck → Dinosaur (same pixel art style)
Takeaways
- Mood Space compresses CLIP space into a dense, low-dimensional space
- Mood Space connects disconnected concepts (e.g. duck -> toilet paper)
- Mood Board is an interpolation interface for controlling the Mood Space
- Mood Space picks up only the most relevant concepts from the Mood Board
Abstract
Expressing complex concepts is easy when they can be labeled or quantified, but many ideas are hard to define yet instantly recognizable. We propose a Mood Board, where users convey abstract concepts with examples that hint at the intended direction of attribute changes.
We compute an underlying Mood Space that 1) factors out irrelevant features and 2) finds the connections between images, thus bringing relevant concepts closer. We invent a fibration computation to compress/decompress pre-trained features into/from a compact space, 50-100x smaller. The main innovation is learning to mimic the pairwise affinity relationship of the image tokens across exemplars. To focus on the coarse-to-fine hierarchical structures in the Mood Space, we compute the top eigenvector structure from the affinity matrix and define a loss in the eigenvector space.
The resulting Mood Space is locally linear and compact, allowing image-level operations, such as object averaging, visual analogy, and pose transfer, to be performed as a simple vector operation in Mood Space. Our learning is efficient in computation without any fine-tuning, needs only a few (2-20) exemplars, and takes less than a minute to learn.
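A minimal sketch of the eigenvector-space loss the abstract describes: match the top eigenvector structure of the token affinity matrix before and after compression. The affinity kernel, the number of eigenvectors k, and the projection-based subspace comparison are assumptions, not the paper's exact formulation.

```python
# Sketch of an eigenvector-space loss over token affinity matrices.
# Kernel choice, k, and the subspace comparison are illustrative assumptions.
import torch

def affinity(tokens: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity over (n, d) image tokens."""
    z = torch.nn.functional.normalize(tokens, dim=-1)
    return z @ z.T

def top_eigvecs(A: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Leading k eigenvectors of a symmetric affinity matrix."""
    _, vecs = torch.linalg.eigh(A)   # eigenvalues in ascending order
    return vecs[:, -k:]

def eig_loss(orig_tokens: torch.Tensor, comp_tokens: torch.Tensor,
             k: int = 8) -> torch.Tensor:
    """Penalize mismatch between the top eigenvector subspaces of the original
    and compressed token affinities. Comparing projection matrices avoids the
    sign/rotation ambiguity of individual eigenvectors."""
    U = top_eigvecs(affinity(orig_tokens), k)
    V = top_eigvecs(affinity(comp_tokens), k)
    return ((U @ U.T - V @ V.T) ** 2).mean()
```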
BibTeX
@misc{yang2025iknowiit,
title={"I Know It When I See It": Mood Spaces for Connecting and Expressing Visual Concepts},
author={Huzheng Yang and Katherine Xu and Michael D. Grossberg and Yutong Bai and Jianbo Shi},
year={2025},
eprint={2504.15145},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.15145},
}