The AI arms race between technology companies has heated up significantly in the last two weeks. OpenAI revealed GPT-4o immediately before Google announced a number of Gemini improvements. That was followed by Microsoft announcing a bunch of Copilot+ PCs along with AI improvements of its own.
But while all that was going on, Meta was taking care of AI business of its own. The company quietly released a research paper on its efforts in the multi-modal AI space. Spotted by VentureBeat, the paper shows that Meta is working on a multi-modal large language model called Chameleon.
That's not to be confused with the generative AI model CM3leon (pronounced Chameleon) that Meta AI revealed last summer. Meta AI did say in its blog post that the CM3leon model would lead to improvements on future LLMs. The research paper claims that Chameleon is state-of-the-art and beats or competes evenly with other models like Gemini, GPT-4 and Meta’s own Llama-2.
Similar to Google’s Gemini, Chameleon is built on an “early-fusion token-based mixed-modal” architecture. This means the model was built to learn from the beginning from a combination of images, code, text and other inputs, and it uses that content to create sequences. The other way to build a multi-modal architecture is to stitch together a number of models that have each been trained on a single modality.
This is called “late fusion.” In essence, the AI system takes individually trained models and fuses their outputs to make inferences.
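For readers who want a feel for the difference, here is a minimal, purely illustrative sketch, not code from Meta's paper: in early fusion, image and text tokens are folded into a single sequence that one model trains on from the start, while in late fusion, separately trained single-modality models produce outputs that are only combined afterward. The token values and combiner below are hypothetical placeholders.

```python
def early_fusion_sequence(text_tokens, image_tokens):
    """Early fusion: interleave image and text tokens into one sequence,
    so a single model learns over the mixed-modal stream from the start."""
    return ["<img>"] + image_tokens + ["</img>"] + text_tokens


def late_fusion_combine(text_output, image_output):
    """Late fusion: separately trained single-modality models run on their
    own inputs, and their outputs are only merged at the end."""
    return {"image": image_output, "text": text_output}


if __name__ == "__main__":
    # Toy discrete tokens standing in for a quantized image and a text prompt
    image_tokens = [4021, 887, 1593]
    text_tokens = [101, 2054, 2003]

    print(early_fusion_sequence(text_tokens, image_tokens))
    print(late_fusion_combine("a dog on a beach", "caption: dog, beach"))
```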