‘A blue jay standing on a large basket of rainbow macarons.’ Credit: Google
About a month after OpenAI announced DALL-E 2, Google has continued the AI ‘space race’ with Imagen, its own text-to-image diffusion model and its latest AI system for creating images from text. Google’s results are extremely, perhaps even scarily, impressive.
Using a standard measure, FID (Fréchet Inception Distance), Google Imagen outperforms OpenAI’s DALL-E 2 with a score of 7.27 on the COCO dataset. Despite never being trained on COCO, Imagen performs well there too. Human raters also preferred Imagen over DALL-E 2 and other competing text-to-image methods. You can read the full test results in Google’s research paper.
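For context, FID measures how closely the statistics of a model’s generated images match those of real images when both are passed through an Inception feature extractor; lower is better. The sketch below implements just the FID formula on stand-in feature statistics. It is an illustration of the metric itself, not Imagen’s actual evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians fitted to image-feature activations.

    mu*: mean feature vectors; sigma*: feature covariance matrices.
    FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))
    """
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary numerical noise
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Example with random stand-in statistics; a real evaluation would fit
# mu/sigma to Inception features of real COCO images vs. generated samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(100, 8))
fake = rng.normal(size=(100, 8))
fid = frechet_inception_distance(real.mean(0), np.cov(real, rowvar=False),
                                 fake.mean(0), np.cov(fake, rowvar=False))
print(f"FID: {fid:.2f}")  # lower is better; Imagen reports 7.27 on COCO
```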
![]()
‘The Toronto skyline with Google Brain logo written in fireworks.’
Imagen takes a natural-language text input, such as ‘A golden retriever dog wearing a blue checkered beret and red dotted turtleneck,’ and uses a frozen T5-XXL encoder to convert that text into embeddings. A ‘conditional diffusion model’ then maps the text embedding to a small 64×64 image. Finally, Imagen uses text-conditional super-resolution diffusion models to upsample the 64×64 image to 256×256 and then to 1024×1024.
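To make that pipeline concrete, here is a minimal schematic of the three-stage cascade in Python. Every function is a stub that only traces the array shapes; the names and the 4096-dimensional embedding size are illustrative assumptions, not Google’s actual implementation or API.

```python
from typing import Optional
import numpy as np

# Schematic of Imagen's cascade as described above: frozen text encoder,
# 64x64 base diffusion model, then two super-resolution diffusion models.
# All functions are placeholders, not Imagen's real code.

def t5_xxl_encode(prompt: str) -> np.ndarray:
    """Frozen T5-XXL encoder: prompt -> per-token text embeddings."""
    return np.zeros((len(prompt.split()), 4096))  # 4096 is a placeholder dim

def diffusion_sample(text_emb: np.ndarray, size: int,
                     low_res: Optional[np.ndarray] = None) -> np.ndarray:
    """Stand-in for a text-conditional diffusion sampler.

    With low_res=None it plays the role of the 64x64 base model; otherwise
    it acts as a super-resolution model conditioned on the smaller image.
    """
    return np.zeros((size, size, 3))

def imagen_pipeline(prompt: str) -> np.ndarray:
    emb = t5_xxl_encode(prompt)                         # text -> embeddings
    img64 = diffusion_sample(emb, 64)                   # base 64x64 image
    img256 = diffusion_sample(emb, 256, low_res=img64)  # upsample to 256x256
    return diffusion_sample(emb, 1024, low_res=img256)  # then to 1024x1024

print(imagen_pipeline("A golden retriever dog wearing a blue checkered "
                      "beret and red dotted turtleneck").shape)
# -> (1024, 1024, 3)
```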
Compared to NVIDIA’s GauGAN2 approach from just last fall, Imagen shows significantly improved flexibility and results. AI is moving fast. Consider the figure below, ‘A cute corgi lives in a house made out of sushi.’ It looks believable, as if someone really did build a dog house out of sushi that the corgi, perhaps surprisingly, enjoys.
![]() |
‘A cute corgi lives in a house made out of sushi.’
It’s a clever creation. Nearly everything we’ve seen so far from Imagen is delightful: furry animals in funny costumes, a cactus wearing sunglasses, a swimming teddy bear, a royal raccoon and so on. But where are the people?
Whether their intent is innocent or malicious, we know that users will start typing all sorts of phrases about people as soon as they get access to Imagen. There will surely be plenty of prompts about adorable animals in ridiculous situations, but there will also be prompts about chefs, athletes, doctors, men, women, children and more. What would these people look like? Would the doctors be mostly men, would the flight attendants be mostly women, and would most people have lighter skin?
![]() |
‘A robot couple fine dining with the Eiffel Tower in the background.’ What would the couple look like if the prompt did not include the word ‘robot’?
We don’t know how Imagen handles these prompts because Google has chosen not to show them to anyone. There are ethical challenges with text-to-image research. If a model can draw almost any image from text, how good is that model at presenting neutral results? AI models like Imagen are typically trained on web-scraped datasets, and content on the internet is skewed and biased in ways we are still trying to fully understand. These biases have negative social implications that need to be considered and ideally corrected. Not only that, Google used the LAION-400M dataset for Imagen, which is known to contain a wide range of inappropriate content including ‘pornographic imagery, racist slurs, and harmful social stereotypes.’ A subset of the training data was filtered to remove noise and ‘undesirable’ content, but there remains a risk that ‘Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.’
![]() |
Text prompts can be quite complex. ‘A marble statue of a Koala DJ in front of a marble statue of a turntable. The Koala is wearing large marble headphones.’
No, you can’t try Imagen for yourself. On its website, Google lets you click on specific words from a curated set to see results, such as ‘a picture of a fuzzy panda wearing a cowboy hat and a black leather jacket playing the guitar on a hill,’ but you can’t search for people or potentially problematic actions or objects. That’s by design. If you could, you would find that the model generates images of people with lighter skin tones and reinforces traditional gender roles. Preliminary research also indicates that Imagen reflects cultural biases through its depictions of certain items and events.
![]() |
‘A Pomeranian is sitting on the King’s throne wearing a crown. Two tiger soldiers are standing next to the throne.’
We know that Google is aware of representation issues across its wide range of products and is working to improve the depiction of realistic skin tones and reduce underlying bias. However, AI remains something of a ‘Wild West.’ Although there are many talented, thoughtful people behind the scenes creating AI models, once a model is released it is essentially on its own. The model depends on the dataset used for training, so it is difficult to predict what will happen when users can type anything they want.
![]() |
‘A dragon fruit wearing a karate belt in the snow.’
This is not a fault unique to Imagen; every other AI model struggles with the same problem. Models are trained on huge datasets that contain visible and hidden biases, and those issues scale along with the model. Even beyond misrepresenting specific groups of people, AI models can produce genuinely harmful content. If you asked a painter to paint something horrible, many would be offended and turn you away. A text-to-image AI model has no ethical compass and will create anything. That is a problem, and it is unclear how it can be solved.
![]() |
‘A teddy bear swimming in the Olympic 400m butterfly event.’
In the meantime, as AI research teams grapple with the social and moral implications of their highly impressive work, you can admire terrifically realistic photos of skateboarding pandas, but you can’t input your own text. Imagen is not available to the public, nor is its code. However, you can learn a lot about the project in its research paper.
All images courtesy of Google