An open-source Imagen is here, and it completely outclasses Stable Diffusion!


2023-05-04 10:12


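A true open-source Imagen has finally arrived: DeepFloyd Lab, together with StabilityAI (which provided the compute), has open-sourced DeepFloyd IF, a new large text-to-image model. It is a reproduction of Imagen, Google's earlier text-to-image model, and its results approach or even surpass those of the original Imagen.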


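At its core, Imagen uses a pretrained language model (the T5-XXL encoder, 4.6B parameters) to encode the text, and trains three separate diffusion models to generate images. The first diffusion model (2B parameters) generates 64x64 images, and the other two (600M and 400M parameters) perform super-resolution from 64x64 to 256x256 and from 256x256 to 1024x1024, respectively.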
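The cascade described above can be written down as a small data structure. This is only a bookkeeping sketch: the stage names and the `Stage` class are illustrative assumptions, and the parameter counts are simply the ones quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str                 # illustrative label, not an official model name
    params_billion: float     # parameter count quoted in the text
    in_res: Optional[int]     # input resolution (None for the text-only base stage)
    out_res: int              # output resolution


IMAGEN_CASCADE = [
    Stage("base text-to-image", 2.0, None, 64),
    Stage("super-resolution 1", 0.6, 64, 256),
    Stage("super-resolution 2", 0.4, 256, 1024),
]
T5_XXL_ENCODER_BILLION = 4.6  # frozen text encoder, shared by all stages


def check_cascade(stages):
    """Check that each stage consumes the resolution the previous one emits."""
    for prev, nxt in zip(stages, stages[1:]):
        if nxt.in_res != prev.out_res:
            raise ValueError(f"{nxt.name} expects {nxt.in_res}, got {prev.out_res}")
    return stages[-1].out_res


def upscale_factors(stages):
    """Per-stage upscaling factor for the super-resolution stages."""
    return [s.out_res // s.in_res for s in stages if s.in_res is not None]


final_res = check_cascade(IMAGEN_CASCADE)
diffusion_params = sum(s.params_billion for s in IMAGEN_CASCADE)
total_params = diffusion_params + T5_XXL_ENCODER_BILLION

print(final_res)                   # 1024
print(upscale_factors(IMAGEN_CASCADE))  # [4, 4]
print(round(diffusion_params, 1))  # 3.0 (billion, diffusion models only)
print(round(total_params, 1))      # 7.6 (billion, including the text encoder)
```

Both super-resolution stages upscale by 4x, and the frozen text encoder accounts for more parameters than the three diffusion models combined.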


DeepFloyd IF largely follows Imagen's design. Its model architecture is shown below:

[Figure: DeepFloyd IF model architecture]

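DeepFloyd IF also uses a frozen T5-XXL to extract the text embeddings, but each diffusion stage comes in several model sizes. For example, the largest first-stage (64x64) model, IF-I-XL, has 4.3B parameters, more than twice the size of Imagen's 2B base model. The UNet architecture of DeepFloyd IF is essentially the same as Imagen's: text embeddings are injected into the UNet through cross-attention, and the super-resolution models are text-conditioned as well. In addition, DeepFloyd IF introduces an attention pooling layer that extracts a single embedding from the text embeddings and adds it to the UNet's time embedding, which effectively acts as a global text embedding.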
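To make the attention-pooling idea concrete, here is a deliberately simplified sketch in plain Python. It is not DeepFloyd IF's actual layer: a real implementation would use learned query/key/value projections, multiple heads, and a projection to the time-embedding width. The function names, the single `query` vector, and the toy numbers are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_embeds, query):
    """Collapse a (seq_len x dim) list of token embeddings into one dim-sized vector.

    Each token is scored against `query` with scaled dot-product attention,
    and the pooled vector is the attention-weighted sum of the tokens.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in token_embeds]
    weights = softmax(scores)
    pooled = [sum(w * tok[i] for w, tok in zip(weights, token_embeds)) for i in range(dim)]
    return pooled, weights


def condition_time_embedding(time_embed, token_embeds, query):
    """Add the pooled text vector to the time embedding (the global text signal)."""
    pooled, _ = attention_pool(token_embeds, query)
    return [t + p for t, p in zip(time_embed, pooled)]


# Toy example: 3 tokens with 2-dimensional embeddings.
token_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]          # a zero query gives uniform weights, i.e. mean pooling
time_embed = [0.5, -0.5]

pooled, weights = attention_pool(token_embeds, query)
cond = condition_time_embedding(time_embed, token_embeds, query)
print(weights)  # three equal weights of about 0.333
print(pooled)   # about [0.667, 0.667]
print(cond)     # about [1.167, 0.167]
```

The per-token embeddings still reach the UNet through cross-attention; the pooled vector is an extra, sequence-independent summary that rides along with the timestep signal.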

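Imagen was trained on 460M image-text pairs collected from the internet plus 400M from LAION-400M, while DeepFloyd IF's training set consists of 1B samples filtered from LAION-5B, so the two were trained on a similar amount of data.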


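Below are the evaluation results of DeepFloyd IF models of different sizes on the COCO dataset. The larger the model, the lower the FID; the largest model, IF-4.3B, reaches an FID of 6.6.

[Figure: FID on COCO for DeepFloyd IF models of different sizes]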
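For reference, FID (Fréchet Inception Distance) compares the statistics of Inception-network features extracted from real and generated images, and lower is better. The standard definition is reproduced below; the symbols are the conventional ones and are not taken from the DeepFloyd IF evaluation itself.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ are the mean and covariance of the features of real images, and $\mu_g, \Sigma_g$ are those of generated images. Because the metric only matches feature statistics, it says nothing about whether an individual image follows its prompt.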


This FID is better than that of essentially every other current text-to-image model:

[Figure: FID comparison with other text-to-image models]

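FID, however, often fails to reflect how good a model's generations actually are, so it is better to compare generated images directly. The comparison below uses a set of fairly difficult text prompts against mainstream text-to-image models (SD 2.1, Muse, Imagen, eDiff-I, Parti, DALLE-2). Overall, DeepFloyd IF holds its own.

[Figures: side-by-side comparison with other models on difficult prompts]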

Below are some samples generated by DeepFloyd IF (from the Stability AI post "Stability AI releases DeepFloyd IF, a powerful text-to-image model that can smartly integrate text into images"):

[Figures: DeepFloyd IF sample generations]

The results look very impressive (though they are probably cherry-picked). The text prompts for the images above are, in order:

  • a photo of a full size old rusty sign that says "Deep Floyd Street", photo realism, bokeh, 50mm cine lens, super sharp focus.

  • capybara holding a neon sign with text that reads "capybara podcast", a professional photo of a capybara podcasting, capybara chimera animorph, transformer animal, anamorphic, 8k, 4k, 85 mm, f2.2, photography awards and hyperrealistic, highly detailed, f1.4 lens, 50mm photo, soft light, masterpiece, sharp focus, pretty, hasselblad

  • #practicingfonts trying the word "Floyd" in calligraphy

  • delicious burger painted in the style of starry night

  • film still photograph of redhead bearded Abraham Lincoln look alike starring in a live action documentary about the life of Vincent an Gogh produced by Netflix, 4k

  • renaissance painting of Justin Bieber in the 15th century

  • casual photo of a leaf maple syrup glass container sitting on a wooden table in a log cabin, high depth of field during golden hour as the sunlight shines through the windows, dusty air

  • one piece of fruit that's 'blackberry on the outside', 'orange texture on the inside', cut in half.

  • a strawberry mug filled with white sesame seeds. The mug is floating in a dark chocolate sea.

  • 4 bottles of wine next to each other labeled (1,2,3,4)

  • the yellow metal ball on the left is smooth to the touch and cool to the skin, with a faint, metallic scent. The red plastic cylinder in the middle has a rough texture, reminiscent of sandpaper. its bright hue stands out among the other objects. The blue fuzzy triangle on the right is not only soft, but also slightly damp. there are small specks of dirt stuck to it that can be felt when touched

  • three colored wood blocks with the letters (a, b, c)

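DeepFloyd IF is now open-sourced on Hugging Face under the DeepFloyd organization, but the models are very large. For the largest model, IF-4.3B, the 64x64 stage alone comes to nearly 10B parameters once the 4.6B T5-XXL text encoder is counted alongside the 4.3B UNet, and downloading the FP16 weights takes about 20 GB of disk space, so an ordinary GPU will probably struggle to run it. Fortunately, a demo Space is available on Hugging Face ("IF - a Hugging Face Space by DeepFloyd"). I ran a quick test there with a handful of prompts, keeping the best of the 4 images generated for each, and ran the same prompts through SD 2.1 ("Stable Diffusion 2-1 - a Hugging Face Space by stabilityai"). In each pair below, the first image is from IF (1024x1024) and the second is from SD 2.1 (768x768):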

three colored wood blocks with the letters (a, b, c)

[IF image] The last letter is wrong.

[SD 2.1 image] There is one extra letter.

1girl, white hair, golden eyes, beautiful eyes, detail, flower meadow, cumulonimbus clouds, lighting, detailed sky, garden

[IF image] The hair color is wrong.

[SD 2.1 image] Somewhat worse.

Aerial photo of a beach, the words "what if?" written in the sand.

[IF image]

[SD 2.1 image] The text is a bit off.

A photo of girl standing in front of stargate for another dimension made of stone that form a circle.

[IF image]

[SD 2.1 image] Far worse.

A face of a woman made completely out of the foliage, twigs, leaves and flowers, side view.

[IF image] Nice.

[SD 2.1 image] The view direction is wrong.

A photo of a corgi dog wearing a wizard hat playing guitar on the top of a mountain.

[IF image] Good.

[SD 2.1 image] A bit distorted as well.

New York Skyline with Hello World written with fireworks on the sky.

[IF image] The text is mostly right.

[SD 2.1 image] The text still falls short.

A cloud in the shape of two bunnies playing with a ball. The ball is made of clouds too.

[IF image] It just does not look much like a cloud.

[SD 2.1 image] Where are the bunnies?

A raccoon wearing formal clothes, wearing a tophat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of abstract cubism

[IF image] The cane and the bag are missing.

[SD 2.1 image] Looks quite nice, actually.

A high-contrast photo of a panda riding a horse. The panda is wearing a wizard hat and is reading a book. The horse is standing on a street against a gray concrete wall. Colorful flowers and the word "PEACE" are painted on the wall. Green grass grows from cracks in the street. DSLR photograph. daytime lighting.

[IF image] Decent.

[SD 2.1 image] What happened to riding the horse, and where is the text?

A map of the United States made out of sushi. It is on a table next to a glass of red wine.

[IF image] Looks correct.

[SD 2.1 image] Also looks correct.

A movie scene where A beautiful blonde girl have blue eyes in A ultra detailed Metallic Steampunk Gothic Fantasy Skeleton Latex Filigree mechanical rib cage mounted on the middle of her chest,Her chest cavity is empty except for a glowing mechanical heart,She doesn't wear any visible clothing other than small area in the middle of her chest, some sci - fi cables connecting her heart from above

[IF image] A bit ugly.

[SD 2.1 image] Scary.

a beautiful anime girl with red eyes and blue hair on the beach

[IF image] The color attributes are swapped.

[SD 2.1 image] Well...

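Overall, these tests show that DeepFloyd IF is genuinely strong at understanding text and clearly beats SD on that front. Its image quality, however, still falls short, just like SD's, and is nowhere near that of Midjourney or the community models on Civitai. DeepFloyd IF is especially strong at rendering text inside images, which is probably thanks to its use of T5-XXL.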


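Besides text-to-image generation, DeepFloyd IF can also do image-to-image generation and image inpainting, just like SD; the underlying principle is the same.

[Figure: image-to-image and inpainting with DeepFloyd IF]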
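The shared principle can be sketched in a few lines. The code below is a generic illustration of how diffusion-based image-to-image and inpainting are commonly implemented, not code taken from DeepFloyd IF: the function names and the `strength` convention are assumptions. Image-to-image noises the input image part of the way and then runs only the remaining denoising steps; inpainting additionally keeps the known pixels and regenerates only the masked ones (in practice this blend is applied to noised versions of the original at each denoising step).

```python
def img2img_start(num_inference_steps, strength):
    """Return (start_index, steps_to_run) for an image-to-image run.

    strength=1.0 ignores the input image (denoise from pure noise);
    strength=0.0 keeps the input image unchanged (no denoising steps).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return num_inference_steps - steps_to_run, steps_to_run


def inpaint_blend(original, generated, mask):
    """Keep original pixels where mask == 0 and generated pixels where mask == 1."""
    return [m * g + (1 - m) * o for o, g, m in zip(original, generated, mask)]


print(img2img_start(50, 0.5))                         # (25, 25)
print(inpaint_blend([1, 2, 3], [9, 9, 9], [0, 1, 0])) # [1, 9, 3]
```

A lower `strength` stays closer to the input image, while a higher one gives the model more freedom to follow the prompt.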

DeepFloyd IF has also been integrated into the diffusers library (see the IF entry in its documentation). With model CPU offloading, it runs with as little as 14 GB of VRAM:

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

# stage 1: text-conditioned 64x64 base model (also loads the T5 text encoder)
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2: 64x64 -> 256x256 super-resolution; text_encoder=None because the
# prompt embeddings computed with stage 1 are reused
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3: 256x256 -> 1024x1024 with the Stable Diffusion x4 upscaler,
# reusing the safety modules loaded with stage 1
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)  # fixed seed for reproducible sampling

# text embeds: encode the prompt once and share the result across stages 1 and 2
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1: generate the 64x64 image, kept as a tensor ("pt") for the next stage
image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2: upscale to 256x256
image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3: upscale to 1024x1024; the upscaler takes the raw prompt and returns PIL images
image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
image[0].save("./if_stage_III.png")
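The memory figures quoted in this article (roughly 20 GB of FP16 weights to download, about 14 GB of VRAM with CPU offloading) can be sanity-checked with back-of-envelope arithmetic. The sketch below counts weights only and ignores activations, the super-resolution stages, and framework overhead. It also assumes that the "close to 10B" figure is the 4.3B first-stage UNet plus the 4.6B T5-XXL encoder, which is an interpretation rather than something the article states outright; the helper name is hypothetical.

```python
def weight_size_gb(num_params, bytes_per_param=2):
    """Size of the raw weights in decimal GB (FP16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


UNET_STAGE_1 = 4.3e9   # IF-I-XL UNet, as quoted in the article
T5_XXL_ENCODER = 4.6e9 # frozen text encoder, as quoted in the article

fp16_total = weight_size_gb(UNET_STAGE_1 + T5_XXL_ENCODER)
fp32_total = weight_size_gb(UNET_STAGE_1 + T5_XXL_ENCODER, bytes_per_param=4)
largest_single = max(weight_size_gb(UNET_STAGE_1), weight_size_gb(T5_XXL_ENCODER))

print(f"stage 1 + text encoder, FP16: {fp16_total:.1f} GB")   # 17.8 GB
print(f"stage 1 + text encoder, FP32: {fp32_total:.1f} GB")   # 35.6 GB
print(f"largest single component, FP16: {largest_single:.1f} GB")  # 9.2 GB
```

The 17.8 GB total is in line with the roughly 20 GB download. With model CPU offloading only one component needs to sit on the GPU at a time, so the peak is driven by the largest single component (about 9.2 GB of weights here) plus activations, which is plausibly how the requirement drops to around 14 GB.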

References

  • DeepFloyd IF — DeepFloyd

  • GitHub - deep-floyd/IF

  • https://stability.ai/blog/deepfloyd-if-text-to-image-model

  • Text-to-Image Diffusion Models



