We can generate AI images, we can generate AI text, but text in an image is a no go?
Generating meaningful text in an image is very complex. Most of these models, like DALL-E and Stable Diffusion, are essentially guided denoising algorithms. They are given an image of pure noise and told that it’s actually just a very noisy image of whatever the description says. So all they do is remove a bit of noise, many steps in a row, until a clear image emerges. You can kinda imagine it as the “AI” staring into the noise to see the image that you described.
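If it helps, here is a toy sketch of that "remove a little noise, many times" loop. It is not how any real model is implemented: `toy_denoiser`, `generate`, and the 5-pixel "image" are made up for illustration, and the denoiser cheats by already knowing the target, where a real network has to learn that guess from training data.

```python
import random

def toy_denoiser(noisy, target):
    """Hypothetical stand-in for a trained network: guesses the noise in `noisy`.

    Assumption: we cheat and compare against a known target so the loop runs.
    A real model has no target to look at, only the text description.
    """
    return [n - t for n, t in zip(noisy, target)]

def generate(target, steps=50, step_size=0.1, seed=0):
    rng = random.Random(seed)
    # Start from pure noise, one random value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        predicted_noise = toy_denoiser(x, target)
        # Remove only a small part of the predicted noise each step.
        x = [xi - step_size * ni for xi, ni in zip(x, predicted_noise)]
    return x

target = [0.0, 0.5, 1.0, 0.5, 0.0]  # a made-up 5-pixel "image"
result = generate(target)
# Largest remaining difference between the generated and intended pixels.
error = max(abs(r - t) for r, t in zip(result, target))
```

Each pass shrinks the leftover noise by 10%, so after 50 steps the result sits very close to the target. The hard part in a real model is entirely inside the denoiser's guess.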
Most real-world objects are of course quite complex. If it sees a tree branch in the noise, it also needs to make sure that the rest of the tree fits. And a car headlight only makes sense if the rest of the car is also there. But for text these kinds of correlations are way, way harder. In order to generate meaningful text it not only needs to understand how text is usually spaced, and that letters are usually written in a consistent font, it also needs to learn the entire English language. All that just to generate something that probably has less overall influence on its “score” on images from the dataset than learning how to draw a realistic car.
So in order to generate meaningful text, the model requires a lot of capacity. Otherwise, since it’s not specifically motivated to learn to write meaningful text, it’ll do whatever it’s doing now. Honestly I’m sometimes quite impressed with how well these models do generate text, given all these considerations.
EDIT: Another few things came to mind:
- Relating images and text (and thus guiding the image generator) was in the past done using a separate (AI) model. Not sure if that’s still the case. So two models need to understand the English language to generate meaningful text: the generator and the model that relates text to images.
- So why can AI like ChatGPT generate meaningful text? Well, in short, they are fully dedicated to outputting language. They output the text as text and thus can be easily scored on it. The neural network architecture is also way more suited to it, and they see way more text.
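To make "easily scored" a bit more concrete, here is a minimal sketch of how a single next-word guess can be scored with cross-entropy, a standard loss for language models. The probabilities and the function name `next_token_loss` are invented for the example; they do not come from any real model.

```python
import math

def next_token_loss(predicted_probs, true_token):
    """Cross-entropy for one prediction: small when the true token got high probability."""
    return -math.log(predicted_probs[true_token])

# Made-up probabilities a model might assign to the word after "The cat sat on the".
probs = {"mat": 0.7, "dog": 0.2, "xqz": 0.1}

good = next_token_loss(probs, "mat")  # the sensible word was rated likely: low loss
bad = next_token_loss(probs, "xqz")   # if gibberish were the true word: high loss
```

Every wrong letter or word shows up directly in this number. In an image, a misspelled sign only changes a handful of pixels, so the training signal for spelling is much weaker.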
-
The best answer will require a very technical understanding, but I’ll give it a try and stay abstract.
The AI is trained using images. If you type in things like “a tree” it has a vague idea of what it looks like.
The thing is, writing letters is a hard concept. How should the AI know that text is made up of letters? Connected lines make a letter and unconnected ones don’t. Sentences are separated by periods.
That’s easy enough for us, but you have to imagine that an AI is best with what it can directly observe, and knowing when to literally write out letters is hard. So it has a stroke. It has a vague notion of “this is where text is supposed to go”, but making the letters look right in a consistent font, remembering where letters end and how words are spaced: all of this is far too complex.
Now, I haven’t looked into how the AIs that CAN generate text do it better, but I assume the only way they do this is by deciding “there’s gonna be text” and then using another process to insert the text basically after the fact. Or maybe there’s some special change in the training or inference process going on? Idk, for this one I need an expert.