Recently, Jeff Lai, a friend of mine, let me play with his DALL-E account as part of his university assignment. I’m a huge skeptic of AI information generators, while he is an ardent proponent of the benefits of AI, so each of us thought this was a great opportunity to understand each other. The following are my first thoughts on using this tool.
Background: DALL-E is a tool where you can type in a text prompt to generate novel images. You can prompt it with styles, time periods and adjectives.
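For context, the prompt bar is essentially a thin layer over a single programmatic image-generation call. Below is a minimal sketch of what that call looks like through OpenAI’s Python SDK, with an illustrative model, prompt and size rather than anything taken from the session described here:

```python
# Minimal sketch: one text prompt in, a batch of generated images out.
# Requires an OPENAI_API_KEY in the environment; model, prompt and size are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-2",  # illustrative model choice
    prompt="low-poly render of a mountain village at dusk",
    n=1,               # number of images to generate
    size="1024x1024",
)

print(response.data[0].url)  # URL of the generated image
```

Whether through the API or the prompt bar, the interaction is the same: a single string goes in, images come out, and nothing else is exchanged.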
The experience of using it is “uncanny”, because I’m applying information retrieval techniques, which are very limiting, to information generation, which has far more dimensions. Historically, we are used to search engines and search bars, where we type in search terms: keywords that have adjacency to an actual piece of information that exists. The user is expecting to explore an existing repository or file system by interfacing with a powerful yet ultimately static retrieval engine. It’s only when a user is close to what they are searching for that they can explore that neighborhood, using the current search results, which contain dozens to hundreds of words that effectively act as related search terms. The user can also contact the creator of those related works to probe the thought process and trajectory in the lifetime of that field of information or body of work. Hence, it is the human creators themselves, who have tagged their works, evolved their thinking over time, or explained their work in depth, that give the retrieved information its richness. The retrieval technique itself, while accessible, is quite frankly an extremely rate-limited and primitive way of gaining information.
However, with DALL-E, the goals are fundamentally different, and information retrieval techniques fall laughably short. If DALL-E facilitates information generation, then we should have cut out the information retrieval step and leapfrogged directly to the rich conversation discussing themes, subtexts and appreciation of existing bodies of work. Instead, DALL-E presents the user with the familiar search bar, which is really a prompt bar. Of course, I tried the information retrieval techniques above, but they didn’t work, for a few reasons.
DALL-E -> User
Firstly, DALL-E returns results in an opaque manner. It’s impossible to understand how DALL-E reached its results, aside from the words I prompted it with. There are no textual tags, no thought process that can be probed, and no insight into adjacent bodies of work. It’s difficult to put myself in DALL-E’s shoes and comprehend its state of mind without any context. This is a lesser problem than the next one, because text tags are probably already part of the generation process.
User -> DALL-E
Secondly, my own prompts are opaque to DALL-E. There is no way to explain my thought process, or to address its confusion with terms that are specific to my culture. For example, DALL-E might not understand the “vibes” of “low-poly PS2 style graphics”, perhaps because “PS2” doesn’t exist in its vocabulary. Unlike a human, I cannot just show DALL-E a handful of examples for it to understand the gist of the idea. I cannot even tell where it is lacking. There is no way to augment its understanding by extending its domain of knowledge, or to guide it toward something similar that exists in its current knowledge domain.
Prompt >-< Art
Thirdly, translating text prompts to visual art and back is already a difficult process among humans. Extra ideas will be incorporated into the image based on a human artist’s lived experience, attitudes and individual understanding of the same prompt. Many ideas are compressed into a single image, and when the image is seen, different valid interpretations of the message and subtext can be decompressed and extracted. Between humans, this can be negotiated through back-and-forth discussion, but between the user and DALL-E, there is some information loss that cannot be recovered.
Furthermore, in text, it’s difficult to explain what is supposed to be subtext, translated into the “vibes” of the art piece, and what is intended to be the object. It’s difficult to describe this without the prompt becoming extremely long and stilted with tons of adjectives. For example, it’s difficult to explain in text what makes a meme surreal, but when you see it, you understand that it’s surreal. In fact, in some cases it might be easier for a user to directly translate their ideas into art themselves, rather than generating their art through DALL-E, because the English language simply isn’t equipped to express the ideas in the same way visual art can.
These problems are less relevant in information retrieval because the expectations on the engine are much lower; it was only ever a guide to a human destination. But in information generation, if DALL-E cannot communicate with me and vice versa, then there is no way for us to inspect each other, bounce ideas off one another and be creative in a collaborative manner. It will be like talking to a wall.
As it currently stands, DALL-E in this implementation is only useful for idea generation. It cannot be intentional or specific. For its results to be classified as successful, DALL-E is forced to guess across the widest range of outputs that even remotely fall under the definition imposed by the prompt.
In the near future, a reverse DALL-E that interprets the contents of images and their subtext would be extremely useful for improving the accessibility of images by automatically generating alt text. I really hope this will exist soon.
In the long term, I hope DALL-E will become less of a black box at the higher levels, allowing more meaningful collaboration with it.