Thread
You've probably seen that "Apoploe vesrreaitais" means "birds" to #dalle 2. This is not a hoax, but the connection to the cryptic textual outputs of DALL·E is spurious. Let me explain briefly.

DALL·E is very bad at spelling: even a prompt like "DALL·E wins the spelling contest for its name" comes back with garbled lettering.
DALL·E's resolution for conceptual composition is simply not fine enough to produce consistent spelling. It generates letters and letter-like features and tries to piece them together, a strategy that works better for generating coherent images than for generating coherent text.
The strings generated by DALL·E are somewhat similar to what the input asks for, but they are very inconsistent. You get different "secret words" every time.
These "secret words" are simply random strings. These random strings can point to consistent regions in the embedding space.
To understand what that means, consider that language models like GPT-3 and image models like DALL·E project all samples into a high-dimensional space, where neighboring concepts are adjacent along directions of difference between them.
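To make that concrete, here's a minimal sketch of projecting strings into such a space. It uses the CLIP text encoder via Hugging Face transformers; the model name and the specific library calls are my assumptions, not something from this thread. The point is only the mechanics: every string, gibberish included, lands at a definite point you can compare to other points.

```python
# Minimal sketch: embed a few strings with a CLIP-style text encoder and
# compare them by cosine similarity. Model choice is an assumption for illustration.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of birds", "Apoploe vesrreaitais", "a photo of a teapot"]
inputs = processor(text=texts, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)      # one embedding per string
emb = emb / emb.norm(dim=-1, keepdim=True)       # unit-normalize

# Cosine similarity of each string to "a photo of birds": gibberish gets
# a definite location in the space just like any other string.
sims = emb[1:] @ emb[0]
for text, sim in zip(texts[1:], sims):
    print(f"{text!r}: cosine similarity {sim.item():.3f}")
```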
DALL·E 2 correlates the space of language prompts with the space of visual scenes, but imperfectly: it can associate two or more concepts, yet it cannot reliably map binary or higher-order predicates. It can, however, compute a visual equivalent for any string.
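One way to see why associating concepts is easier than mapping predicates: a scorer that only checks which concepts are present cannot represent who does what to whom. This toy illustrates that failure mode; it is my own illustration, not a claim about DALL·E's internals.

```python
# Toy illustration: a "which concepts appear" matcher associates concepts
# fine, but collapses relations, so binary predicates like "X chasing Y"
# become indistinguishable from "Y chasing X".
def concept_set(prompt: str) -> frozenset[str]:
    stop = {"a", "the", "is"}
    return frozenset(w for w in prompt.lower().split() if w not in stop)

def bag_of_concepts_match(p1: str, p2: str) -> bool:
    return concept_set(p1) == concept_set(p2)

print(bag_of_concepts_match("a dog chasing a cat", "a cat chasing a dog"))  # True: relation lost
print(bag_of_concepts_match("a dog chasing a cat", "a dog drinking tea"))   # False: concepts differ
```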
This works by starting from noise and following a gradient through the high-dimensional space of scenes until the image reaches a local maximum of measured similarity between image and text. That often results in consistent mappings between random strings and scene elements.
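Here's a toy version of that loop: start from noise and take gradient steps that increase image-text similarity. The "encoders" are stand-in random linear projections, purely my assumption to keep the sketch runnable; DALL·E 2 itself uses a learned prior and diffusion decoder, not pixel-space gradient ascent.

```python
# Toy gradient ascent on image pixels toward higher image-text similarity.
# Stand-in encoders (random projections) replace real text/image encoders.
import torch

torch.manual_seed(0)
DIM, SIDE = 64, 32                           # embedding size, toy image side length

W_img = torch.randn(DIM, 3 * SIDE * SIDE)    # stand-in image encoder weights
text_emb = torch.randn(DIM)                  # stand-in text embedding for some prompt
text_emb = text_emb / text_emb.norm()

def encode_image(pixels: torch.Tensor) -> torch.Tensor:
    z = W_img @ pixels.flatten()
    return z / z.norm()                      # unit-norm image embedding

pixels = torch.randn(3, SIDE, SIDE, requires_grad=True)   # start from noise
opt = torch.optim.Adam([pixels], lr=0.05)

for step in range(200):
    sim = encode_image(pixels) @ text_emb    # cosine similarity (both unit-norm)
    (-sim).backward()                        # ascend on similarity by descending on -sim
    opt.step()
    opt.zero_grad()

print(f"final image-text similarity: {sim.item():.3f}")  # climbs toward a local maximum
```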
Btw, if you want to experience what it’s like to explore a semantic embedding space for yourself, you can play a couple of rounds of semantle.com — it lets you guess a word by telling you a similarity score, and you discover the gradient (search direction) by yourself.
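The core of that game is just a similarity score over word embeddings. Below is a toy Semantle-style scorer of my own (not semantle.com's code), with a made-up three-dimensional word-vector table standing in for real embeddings.

```python
# Toy Semantle-style scorer: guess words, get a cosine-similarity score to a
# hidden target, and use the scores to feel out a direction through the space.
import numpy as np

# Hypothetical tiny word-vector table; a real game would load e.g. word2vec vectors.
vectors = {
    "bird":    np.array([0.9, 0.1, 0.0]),
    "sparrow": np.array([0.8, 0.2, 0.1]),
    "teapot":  np.array([0.0, 0.9, 0.3]),
}

def score(guess: str, target: str = "bird") -> float:
    a, b = vectors[guess], vectors[target]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(score("sparrow"))  # closer guesses score nearer to 1.0
print(score("teapot"))   # unrelated guesses score lower
```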