Meta ImageBind 多模态模型开源,我们离AGI又进一步。
Meta ImageBind is a multimodal AI model that embeds six types of data (text, audio, visual, motion, temperature, and depth) into a unified vector space, enabling perception and interaction across all six modalities. Potential applications range from generating accurate descriptions to creating animations and enhancing real-world perception with additional sensory inputs.
当人类看到一辆行驶中的火车,不仅会使用视觉,还会听到声音,感知距离,感知速度。
ImageBind 也是类似,它将文本、音频、视觉、运动、温度、深度六种数据嵌入到同一个向量空间,让模型像「千脑智能」理论那样,调动不同的感知区域进行「交谈」,并做出全面的解释和判断。
(这与文心一言等模型每个模态有自己嵌入空间的所谓多模态截然不同。)
一些应用(见图):
- 通过火车的声音、图像、深度信息,生成准确的文字描述
- 通过鸽子的图片和摩托的声音,检索出摩托车和鸽子的图像
- 通过企鹅的声音,生成企鹅的图像
另一些可能性:
- 拍摄一段海洋日落的视频,自动生成完美的音频剪辑。
- 通过静态图像和音频组合,创建动画。
- 通过Make-A-Video生成视频时,自动加上背景音。(飞狗图)
未来不止于此,模型还可以引入更多的模态,如触觉、语音、嗅觉和大脑 fMRI 信号,以增强模型对实体世界的感知。
https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/
Translation:
Meta ImageBind, a multimodal model, is now open source, bringing us closer to AGI.
When humans see a moving train, they not only use their vision, but also hear the sound, perceive distance and speed.
ImageBind works similarly, embedding six types of data (text, audio, visual, motion, temperature, and depth) into a single vector space, allowing the model, much as in the Thousand Brains theory of intelligence, to have different perceptual regions "talk" to each other and produce comprehensive interpretations and judgments.
(This is different from the so-called multimodal models like ERNIE Bot (Wenxin Yiyan), where each modality has its own embedding space.)
Applications include:
- Generating accurate text descriptions of a train using sound, images, and depth information
- Retrieving images of a motorcycle and a pigeon from a picture of a pigeon combined with the sound of a motorcycle
- Generating an image of a penguin using its sound
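The mechanism behind all of these examples is the same: every modality is mapped into one shared embedding space, so retrieval and combination reduce to vector similarity. A minimal sketch of that idea, using made-up 4-dimensional toy vectors (the real model produces high-dimensional embeddings from per-modality encoders; none of these numbers come from ImageBind):

```python
import numpy as np

def normalize(v):
    # Unit-normalize so dot products are cosine similarities.
    return v / np.linalg.norm(v)

# Hypothetical toy embeddings standing in for ImageBind's shared space.
emb = {
    "image:pigeon":     normalize(np.array([1.0, 0.1, 0.0, 0.0])),
    "audio:motorcycle": normalize(np.array([0.0, 0.1, 1.0, 0.0])),
    "image:motorcycle": normalize(np.array([0.0, 0.1, 0.9, 0.1])),
    "image:penguin":    normalize(np.array([0.0, 1.0, 0.0, 0.2])),
}

def retrieve(query, candidates):
    # Rank candidate embeddings by cosine similarity to the query.
    scores = {k: float(query @ emb[k]) for k in candidates}
    return max(scores, key=scores.get)

# Cross-modal retrieval: the sound of a motorcycle retrieves the
# motorcycle image, because both live in the same vector space.
best = retrieve(emb["audio:motorcycle"],
                ["image:pigeon", "image:motorcycle", "image:penguin"])
print(best)  # image:motorcycle

# Embedding arithmetic: pigeon image + motorcycle sound yields a
# query vector pointing toward content containing both concepts.
combined = normalize(emb["image:pigeon"] + emb["audio:motorcycle"])
```

This is only an illustration of the shared-space principle, not the model's API; in the released ImageBind code, each modality has its own preprocessing pipeline but all encoders emit vectors that can be compared directly like this.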
Other possibilities include:
- Automatically creating a perfect audio clip when filming a sunset over the ocean
- Creating animations by combining static images and audio
- Automatically adding background audio when generating video with Make-A-Video (the "flying dog" demo image)
In the future, the model could incorporate even more modalities, such as touch, speech, smell, and brain fMRI signals, to enhance its perception of the physical world.
Source: https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/