De-Diffusion Makes Text a Strong Cross-Modal Interface

¹Google DeepMind, ²Johns Hopkins University

De-Diffusion is an autoencoder whose decoder is a text-to-image diffusion model.

It encodes an input image into information-rich text, which acts as a flexible interface between modalities.
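
The structure described above can be sketched as a small program. This is a minimal illustration, not the authors' implementation: `TextAutoencoder`, `toy_encoder`, `toy_decoder`, and the squared-error measure are all hypothetical stand-ins, and the toy "image" is just a list of intensities. The real decoder is a text-to-image diffusion model, and the real training objective is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[float]  # toy stand-in for an image: a flat list of intensities in [0, 1]


@dataclass
class TextAutoencoder:
    """Image -> text -> image, with plain text as the bottleneck."""

    encoder: Callable[[Image], List[str]]  # image to tokens
    decoder: Callable[[str], Image]        # text-to-image model used as the decoder

    def encode(self, image: Image) -> str:
        # The latent code is an ordinary string that a person or an LLM can read.
        return " ".join(self.encoder(image))

    def reconstruct(self, image: Image) -> Tuple[str, Image]:
        text = self.encode(image)
        return text, self.decoder(text)


def mean_squared_error(a: Image, b: Image) -> float:
    # Placeholder reconstruction measure (an assumption, not the paper's loss).
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical toy components, present only to make the sketch executable.
def toy_encoder(image: Image) -> List[str]:
    return ["bright" if value >= 0.5 else "dark" for value in image]


def toy_decoder(text: str) -> Image:
    return [1.0 if token == "bright" else 0.0 for token in text.split()]


model = TextAutoencoder(encoder=toy_encoder, decoder=toy_decoder)
text, reconstruction = model.reconstruct([0.1, 0.9, 0.8])
error = mean_squared_error([0.1, 0.9, 0.8], reconstruction)
```

The point of the design is that the bottleneck is ordinary text, so anything that accepts a prompt can consume it.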

Abstract

We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input, a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.
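
To make the few-shot prompting idea concrete, the sketch below assembles a text-only prompt in which each image is replaced by its De-Diffusion text. The template, the function name `build_few_shot_prompt`, and the question/answer pairs are assumptions made for illustration; the page does not specify the prompt format the authors used. The image descriptions are 5-token samples taken from this page.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str, str]],  # (image text, question, answer)
    query_text: str,
    question: str,
) -> str:
    """Format few-shot examples plus one open query as a single LLM prompt."""
    blocks = [
        f"Image description: {image_text}\nQuestion: {q}\nAnswer: {a}"
        for image_text, q, a in examples
    ]
    # The final block leaves the answer empty for the language model to complete.
    blocks.append(f"Image description: {query_text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical question/answer pairs paired with De-Diffusion text from this page.
examples = [
    ("kitten closeup kitten sink bathroom", "What animal is shown?", "a kitten"),
    ("closeup yellow bus mirrors reflection", "What color is the bus?", "yellow"),
]
prompt = build_few_shot_prompt(
    examples,
    query_text="grand canyon canyon mountains sundown",
    question="What landmark is shown?",
)
```

The resulting string can be sent to any text-only language model, since no image embedding is involved.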

De-Diffusion Text

We showcase De-Diffusion text at lengths of 5, 15, and 75 tokens.

De-Diffusion text can describe real-world photographs.

cat wallpapers dog cat kissing

cat close wallpapers dogs eachother kissing eachother shepherd with dog beside cat cat outdoors wet

a col special awesome generic sphoto dog an shepherd dog sitting smelling smelling an cat showing a whites eachother it looking eachother there ently relating a a cat polcat a cat there blurry towards a cat cat cat among subtle subtle sunny shadows snow road sunny background parenparencat cat redness browns brown and ently feline parenvintage regard regard navarshepherd optimized macro sensual canine dog close vintage closeup closeup beak kiss

kitten closeup kitten sink bathroom

kitten cute analogue cute closeup sitting sink cat grey kitten orange brushes white bathroom sink

a colvindic vscocam analogue photograph of an kittens cat sitting in in in sink in a white basin it in basin sink there with a a orangepolpom a combs beside blur there a utentaps shelf above there white wall blur and wall and taps refer relating blur face bright lime green green white blur prettiest sportinggray gray brunette feline photograph macro cute cute kittens photograph vintage blur blur eyebrow sink

closeup yellow bus mirrors reflection

blurred windshield closeup mirror closeup mirror mirror bus yellow windshield beside behind roads billboards pole

a colcordvindicanalogue closeup of yellow bus mirror through a mirromirromirror depicting a yellow vehicle it on black bicycle foreground behind a a black polhandle a boards behind blur towards a white boards boards above background silhousky sky white siding and poles pole besides wooden pole lborange green green handle reflection contentreflection black black round mirror rotterdam wisconsin congestion vehicle mirror mirror vintage hdr reflection glass mirror

morecambe roadside kites pedestrians stormy

morecambe whitby foreground people walking flying kite kite wearing people stormy clouds roads retaining bridge

a colvc vanishing lantic street photography peop person right women walking towards scattered flying kites and a red shirt a on roadway road along alongside a a retaining polretaining a fencing right slope numerous a people sitting flying behind gloom dark cloud cloud earthy mound grassy grass etc with dark sky muted muted grey black person cloud foreground people muted beige slope mound whitby whitby burrows hilltop sky cloud skyline people walk grassy road

De-Diffusion text can describe creative synthetic images.

futuristic colourful rhinogeometric render

neon futuristic render statue closeup frontal geometric rhinocolorful origami colorful glowing colorful background triangle

an arts pic digitally pixels closeup glowing rhino wearing colorful colorful glaspentagon angular cone closeup forehead through an through colorful geomeangular futuristic futuristic osfuturistic futuristic animal resembsimilar called an chrome rhino shown front frontal incorporating wearing an colorful geometriangle angular with wearing face and horn yellow horn orange orange teal turquoise consist consist towards maroon angular on on angular maroon darkness on gray grey background closeup eyebrow futuristic futuristic wallpaper

beyonce colourful watercolor portrait illustration

beyonce beyonce drawing painting woman head abstract hair blackandwhite sleeveless colourful particles sketch sketch splash

a arts drawing beyonce beyonce portrait woman portrait with swept swept curls in a silver dress it with colourful watercolor there with a a colorful pollouda cloud above atop with a colorful watercolor graffiti atop beige beige background shadows grey background with spots wth with woman ear bold magenta purple black sleek off sleeveless blackandwhite blackandwhite yellow swoocurls rihanna jimi supermodel abstract inktober drawing illustration stration face eyebrow portrait

beatles beagle beagle walking illustration

beagle cartoon illustration cartoon triplets walking pedestrian beagle teal sweaters surrounded text intersection text crossing

an cartoon albu books vscocam drawing cartoon beagle four wearing blue blusweaters each rear pink pants walking an on zebra crossing crossing crossing cartoon os midcentury beagle beagle called these four an dog dog shown walking lineup unison wearing an white beagle beagle bears between spelled font white shoes beagle beagle mco browns brown monochrome placed front with yellow sweater on on font gray background on gray green background background text minimalist cartoon poster

steampunk robot cat robot illustration

cat steampunk wallpapers caricature eyes standing robot cat blackandwhite robot eyes fork scratched sketch mechanism

an hdr painting robotic cat aus blackandwhite steampunk cyborg cat with eye with crook standing an on black leash robot steampunk steampunk ossteampunk punk robot resemb exhibiting called an filly cat shown standing neck showcasing relating an derelsteampunk mechanism robot with optomeeye optomeeyed optomelenses silver monochrome brown monochrome consist overlooking with derelmechanism overlooking scratched cities faded codes on beige yellow background portrait eyebrow steampunk steampunk artwork

De-Diffusion text contains factual knowledge.

grand canyon canyon mountains sundown

canyon rim wallpapers canyon arial rim canyon cliffs reddish mountains sundown horizrim rim rim

an typical grand canyon many brown vast peaks mountains and canyon and overview surrounded an overview mostly overview canyon grand grand osgrand canyon canyon shown foreground note an overview canyon shown overview lineup whilst towards an morning morning dusk overview also morning morning blue sky and sky redness orange brown purple wth numerous towards vast cliffs numerous grand canyon grand cliffs mountains mostly blue background dusk overview grand grand canyon

yosemite mountains waterfalls mountains mountains

yosemite canyon unsplash mountains surrounded valley yosemite cliffs vegetation mountains summer clouds forests mountains gorge

an photographer typical known valleys two peimonumental cliffs mountains and dome and valleys surrounded an through grassy vegetation valleys yosemite yosemite osyosemite yosemite canyon together foreground looking an and valleys shown valleys open surrounded among an greenery flanforests forests besides white sky blue sky and cloud grey grey grey grey wth foreground towards white cliffs among valleys valleys monumental cliffs mountains mostly pine forrest haze overview monumental yosemite gorge

cinderella blue cinderella gown cartoon

cinderella cartoon illustration cartoon woman gestures net princess blue gown with gloves snowing nighttime wires

a illustration cartoon cinderella cinderella female fing examining offering sparkling webs wearing a pale dress it distributed sparkly strings overwhelmed distributed a a silver pol strings a string surrounding seen against a gray surface background right darkness purple background shadows snow ground darkness tree featuring wearing blue gloves pale pale blue white white dressed mostly silhouette yellow yellow hair ponytail disney rodgers imation animation cinderella gown illustration white face hime gown

mario grunge mario portrait illustration

nintendo cartoon drawing cartoon boy portrait cartoon cartoon red cartoon with cap splash graffiti splash

a drawing cartoon mario mario cartoon head portra animated mario mario in a red cap it looking watercolor paint staring there a a black pol spots a paint there staring wth a black spots dots on monochrome grey background shadows white backdrop splatsplatrefer wearing mario moustache red red teal blue blue drawing drawing vintage red deterbello cap capcom capcom alliance games inktober drawing vintage hdr cheek eyebrow smiling

hogwarts movie glasses group poster

potter movie poster people collage characters potter beside characters glasses candles text poster candle light

an poster boy closeup wth black and hogwarts dressed brown tie brown tie aboard an amongst lit pillars lamps hogwarts movies osrupert stark aph containing foreground collage an teenage persons shown portrait portrait beside with an white and lamps pillars beside smiling text and font regard eyewear darkness orange and poster accompanied beside with icy birds among inhabitants faces darkness font cityscape black blue background lights poster hermione movie poster

zelda hilltop clouds cliff poster

zelda movie poster boy standing sword shield atop text sunlight clouds sunlight valleys rock

a poster cartoon adventure protag boy standing holding standing and sword atop a atop atop it on atop rock a landsc pol land sca foreground amid foreground amid a some cities city amid sunlight sunlight clouds among valleys among clouds foreground showcases sunlight sunlight also navy blue written font also font vintage action adventure shield disney reboot successor games game poster vintage and skyline slogan cloud

De-Diffusion can recognize abstract symbols.

starbucks green starbucks emblem illustration

starbucks starbucks illustration symbol closeup round starbucks cup green face with star symbol background symbol

an illustration coffee logo situated black and white green stripped logo huge logo white lines aboard an on the white logo circle pontiindian oslandmark coffee logo resemb relating called an coffee female shown face frontal situated relating an white religious star crown with smiling star long hair long curls deepgreen emerald green consist front between white logo between geomestar emerald strips background white white background minimalist minimalist minimalist logo logo

illustration calligraphy lion emblem illustration

lions minimalist illustration symbol typical walking symbol lions silhouette symbol black symbol grey background tied

a lions animal walking side winding winding lion on a beige pared it minimalist profile silhouette featuring featuring a a black pol curled a silhouette atop lineup in a atmospheric silhouette silhouette on pale beige pared backgrounds grey background minimalist minimalist gladly featuring minimalist silhouette black and blk navy black white symbol modern i modernist minimalist white lions lions atx wsj fintech minimalist line symbol render o metric silhouette profile symbol

Transferable Prompts

De-Diffusion text transfers to different text-to-image tools.

Images are obtained from Stable Diffusion XL v1.0, Midjourney, and Imagen.
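
As a rough illustration of the transfer, the sketch below passes a De-Diffusion string unchanged as the prompt to Stable Diffusion XL through the Hugging Face `diffusers` library. The helper names `strip_label` and `generate_with_sdxl`, the output path, and the choice of `diffusers` with a CUDA GPU are assumptions; the page does not say how the authors ran each tool. Midjourney and Imagen would receive the same string through their own interfaces.

```python
def strip_label(line: str, label: str = "[De-Diffusion Text]") -> str:
    """Drop the page's bracketed label and normalize whitespace."""
    text = line.strip()
    if text.startswith(label):
        text = text[len(label):]
    return " ".join(text.split())


def generate_with_sdxl(prompt: str, out_path: str = "reconstruction.png") -> None:
    """Render one image from the prompt (assumes diffusers, torch, and a CUDA GPU)."""
    # Imported lazily so the helper above stays usable without these packages.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")
    image = pipe(prompt=prompt).images[0]
    image.save(out_path)


# The 5-token sample from this page, written as it might be copied from a labeled line.
prompt = strip_label("[De-Diffusion Text] closeup yellow bus mirrors reflection")
# generate_with_sdxl(prompt)  # uncomment on a machine with a GPU and model access
```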

[Images: Stable Diffusion XL, Midjourney, Imagen]

[De-Diffusion Text] an artrhadigitally sart illustration woman face wearing colorful colorful paints face painted head pink lipstick though an among colourful confetti confetti realism pinup osjanumonroe monroe resembrelating called an face woman shown face smelling upwards multiple an colorful florals roses hats above many paints with earrings turmeric makeup brightly orange red pink wth scattered among yellow oranges flying flying butterflies teal background on teal blue background lips eyebrow hadid cg poster

[Images: Stable Diffusion XL, Midjourney, Imagen]

[De-Diffusion Text] an davilishlishblog closeup berries jar through largerefrerefrejar glass jar eachother glass on an on peach hardwood closeup glass homemade osmixed glass jar called relating called an oranges fruit shown slices eachother containing relating an orange orange slices slices between black grapes open chunks orange oranges and berry blackblueberry consist though towards pink closeup facing that background pink wall background pink pink wall wall closeup chia grapes recipe

[Images: Stable Diffusion XL, Midjourney, Imagen]

[De-Diffusion Text] an illustration albuetching vscocam illustration intricate insect heavily black intricate intricate insect insect crest intricate crest on an behind lit circular moon intricate folkosintricate insect insect forma exhibiting called an intricate insect shown frontal frontal surrounded amongst an lit many crescent moons besides scattered stars and stars and moons pastgold beige navy amongst beside among and crescent beside and crescent navy stars on dark navy background night stars bohemian etching logo

[Images: Stable Diffusion XL, Midjourney, Imagen]

[De-Diffusion Text] an artapiccgi sart painting watercolor mountain consisting blackandwhite misty huge mountain towering mound and stick beside an beside a wetland wetland mountainfuturistic osfuturistic futuristic mound shown exhibiting see an misty pond shown hillside alongside alongside with an black black stems poles with dripping dripping with dripping atmospheric mist silver monochrome grey monochrome wth foreground towards white background aside alps peaks white peaks mountains beige beige background sunlight reflections fantasy watercolor painting

[Images: original image, Stable Diffusion XL, Midjourney, Imagen]

[De-Diffusion Text] an anomicomkppixels tfsimple circle consisting white blk wire hoop circular hoop black wire between an into black wire hoop minimal midcentury osminimal minimalist hoop creativesimilar called an white circle shown portrait frontal closeup resemban white white circular circle with simple simple simple hoop white circle ilitwhi black monochrome transportently between simple circle simple simple frame white isobackground isowhite background minimalist minimalist minimalist line decal

Compare De-Diffusion text and COCO caption for text-to-image reconstruction.

[Images: Stable Diffusion XL, Midjourney, Imagen]

[De-Diffusion Text] an davilishlishblog closeup berries jar through largerefrerefrejar glass jar eachother glass on an on peach hardwood closeup glass homemade osmixed glass jar called relating called an oranges fruit shown slices eachother containing relating an orange orange slices slices between black grapes open chunks orange oranges and berry blackblueberry consist though towards pink closeup facing that background pink wall background pink pink wall wall closeup chia grapes recipe

[Images: Stable Diffusion XL, Midjourney, Imagen]

[COCO Caption] A jar filled with different types of fruit on a table.

Text-Based Image Blending

Step 1: Obtain De-Diffusion text for image A (the transformer robot) and image B (the deer).

a colrejolossoils painting of transformer robot robot standing wearing dusk red robot in a blue armor it across blue waves amidst towards a a yellow pollens a swirl behind viewed between a colorful swirl swirl beside colorful yellow sunset cloudy colourful hills colourful valleys smh wearing gogh gogh bered red blue blue blue painting presented red red red psorirobot robson capcom modernist gicpainting painting painting blue painting abstract mural

a colstavgmbmagewallpapers of deer deer animal standing standing wearing deer deer on a water water it on a river accompanied asting a a orangepolreflection a water among blur despite a green trees trees also but yellow autumn misty among forests autumnal fir trunk besides snow winter shutterorange teal teal darkness forest featuring bered auburn auburn majeantlers bavholistic fantasy forest deer wallpapers wallpapers wide reflections forest forest

Step 2: Ask ChatGPT to mix the De-Diffusion text of image A and image B to obtain text prompts for A+B and B+A.

[A+B] Show a dusk red transformer robot standing tall beside a river in a holistic fantasy forest. Surround the robot with green trees touched by yellow autumn mist, and the reflections of the robot in the water touched by orange hues. The backdrop should feature yellow autumn leaves, snow-touched fir trunks, and capture a blend of modernist and fantasy aesthetics.

[B+A] Depict a deer amidst swirling blue waves with a colorful sunset behind it. Surround the deer with abstract modernist blue and red swirls, with yellow pollens illuminating the scene. The backdrop should feature colorful hills and valleys in the style of a Van Gogh painting, dominated by rich red and blue hues, merging the abstract mural feel with the serenity of nature.
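
The request in Step 2 could be assembled along the following lines. The instruction wording and the function name `build_blend_prompt` are hypothetical, since the exact message sent to ChatGPT is not reproduced on this page, and the two strings are short excerpts of the Step 1 texts rather than the full 75-token outputs.

```python
def build_blend_prompt(subject_text: str, scene_text: str) -> str:
    """Ask an LLM to place the subject of one description in the scene of another."""
    return (
        "Below are two keyword-style image descriptions.\n"
        f"Description 1: {subject_text}\n"
        f"Description 2: {scene_text}\n"
        "Write one fluent text-to-image prompt that shows the main subject of "
        "Description 1 in the setting and style of Description 2."
    )


# Excerpts of the De-Diffusion text for image A (robot) and image B (deer).
text_a = "painting of transformer robot robot standing wearing dusk red robot in a blue armor"
text_b = "deer deer animal standing standing wearing deer deer on a water water it on a river"

request_ab = build_blend_prompt(text_a, text_b)  # robot placed in the forest scene
request_ba = build_blend_prompt(text_b, text_a)  # deer placed in the painted scene
```

Swapping the argument order is what distinguishes A+B from B+A: the first description supplies the subject and the second supplies the surroundings.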

Step 3: Image generation with text-to-image tools.

A+B (Stable Diffusion XL)

A+B (Midjourney)

B+A (Stable Diffusion XL)

B+A (Midjourney)

Failure Cases

De-Diffusion can recognize that an image contains text, but it cannot read that text.

Reading text is beyond the reach of the current De-Diffusion model because the pre-trained text-to-image decoder cannot accurately generate text.

surreal neon flower heart text

colorful kawaii render letters surrounded lettering colorful heart pink letters purple flowers flowers text text

a edit render cgi frontal of a heart font elaborate pink pink font elaborate plants placed on a shape in colorful flowers oversized heart having floral with pink gerber roses heads with colourful and seen alongside some purple pom pom right with colourful toys flowers on greypink ground on spotlight darkness brown background mshcarnival toy colourful blue pink pink colorful font font flowers birthday font font dop photography flickr johan afm cgi toys

dubstep colorful butterfly letters illustration

colorful kawaii psd cartoon typical letters colorful butterflies colorful letters colourful butterflies colorful background splash

a edit vector cartoon font of colorful letters font specially white blue sculpted sculpted letters placed embraced eachother font on colorful paint paint paint with two with pink butterfly butterflies above with font and flying with two colorful butterfly butterflies atop on colorful buttons letters on grey green background on shadows light green background wacky wacky cute cute blue magenta yellow yellow fontpretzel cute ceramic font font client client flickr rizio alivedeviantart font

BibTeX

@article{wei2023de,
  author  = {Wei, Chen and Liu, Chenxi and Qiao, Siyuan and Zhang, Zhishuai and Yuille, Alan and Yu, Jiahui},
  title   = {De-Diffusion Makes Text a Strong Cross-Modal Interface},
  journal = {arXiv preprint arXiv:2311.00618},
  year    = {2023},
}