Encord releases EBIND multimodal embedding model for AI agents


The EBIND model enables AI teams to use multimodal data. Source: StockBuddies, AI, via Adobe Stock

As robots tackle increasingly complex environments and tasks, their artificial intelligence needs to be able to process and use data from many sources. Encord today launched EBIND, an embedding model that it said allows AI teams to enhance the capabilities of agents, robots, and other AI systems that use multimodal data.

“The EBIND model we’ve launched today further demonstrates the power of Encord’s data-centric approach to driving progress in multimodal AI,” stated Ulrik Stig Hansen, co-founder and president of Encord. “The speed, performance and functionality of the model are all made possible by the high-quality E-MM1 dataset it was built on – demonstrating again that AI teams do not need to be constrained by compute power to push the boundaries of what is possible in this field.”

Founded in 2020, Encord provides data infrastructure for physical and multimodal AI. The London-based company said its platform enables AI labs, human data companies, and enterprise AI teams to curate, label, and manage data for AI models and systems at scale. It uses agentic and human-in-the-loop workflows so these teams can work with multiple types of data.

EBIND built on E-MM1 dataset, covers five modalities

Encord built EBIND on its recently released E-MM1 dataset, which it claimed is “the largest open-source multimodal dataset in the world.” The model allows users to retrieve audio, video, text, or image data using data of any other modality.

EBIND can also incorporate 3D point clouds from lidar sensors as a modality. This allows downstream multimodal models to, for example, understand an object’s position, shape, and relationships to other objects in its physical environment.
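The retrieval pattern Encord describes rests on a shared embedding space: every modality is mapped to vectors of the same dimensionality, so a query in one modality can retrieve items of any other by vector similarity. The sketch below illustrates that general pattern only; the array names, dimensionality, and random stand-in embeddings are assumptions for illustration, not EBIND's actual API or published interfaces.

```python
# Illustrative sketch of cross-modal retrieval in a shared embedding space.
# The embeddings here are random stand-ins; in practice they would come from
# a multimodal embedding model such as EBIND (whose API is not shown here).
import numpy as np

def cosine_similarity(query: np.ndarray, items: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of item vectors."""
    query = query / np.linalg.norm(query)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    return items @ query

dim = 512  # assumed shared embedding dimensionality
text_query_embedding = np.random.randn(dim)        # e.g. an embedded text prompt
candidate_embeddings = np.random.randn(1000, dim)  # audio, video, or point-cloud items

scores = cosine_similarity(text_query_embedding, candidate_embeddings)
top_k = np.argsort(scores)[::-1][:5]  # indices of the five closest items
print("Best cross-modal matches:", top_k)
```

Because every modality lands in the same vector space, the same nearest-neighbor search works whether the query is text, audio, an image, video, or a 3D point cloud.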

“It was quite difficult to bring together all the data,” acknowledged Eric Landau, co-founder and CEO of Encord. “Data coming in through the internet is often paired, like text and data, or maybe with some sensor data.”

“It’s difficult to find these quintuples in the wild, so we had to go through a very painstaking exercise of constructing the data set that powered EBIND,” he told The Robot Report. “We’re quite excited by the power we saw of having all the different modalities interact in a simultaneous manner. This data set is 100 times larger than the next largest one.”

AI and robotics developers can use EBIND to build multimodal models, explained Encord. With it, they can extrapolate the 3D shape of a car based on a 2D image, locate video based on simple voice prompts, or accurately render the sound of an airplane based on its position relative to the listener, for instance.

“That’s how you compare the sound of a truck in a snowy environment to the image of it, to the actual audio file, to the 3D representation,” Landau said. “And we were actually surprised that data as diverse and specific as that actually existed and could be related in a multimodal sense.”

Thanks to the higher quality of its training data, Encord said, EBIND is smaller and faster than competing models while maintaining a lower cost per data item and supporting a broader range of modalities. In addition, the model’s smaller size means it can be deployed and run on local infrastructure, significantly reducing latency and enabling real-time inference.

Encord makes model open-source

Encord said its release of EBIND as an open-source model demonstrates its commitment to making multimodal AI more accessible.

“We are very proud of the highly competitive embedding model our team has created, and even more pleased to further democratize innovation in multimodal AI by making it open source,” said Stig Hansen.

Encord asserted that this will empower AI teams, from university labs and startups to publicly traded companies, to quickly expand and enhance the capabilities of their multimodal models in a cost-effective way.

“Encord has seen tremendous success with our open-source E-MM1 dataset and EBIND training methodology, which are allowing AI teams around the world to develop, train, and deploy multimodal models with unprecedented speed and efficiency,” said Landau. “Now we’re taking the next step, providing the AI community with a model that will form a critical piece of their broader multimodal systems by enabling them to seamlessly and quickly retrieve any modality of data, regardless of whether the initial query comes in the form of text, audio, image, video or 3D point cloud.”



Use cases range from LLMs and quality control to safety

Encord said it expects key use cases for EBIND to include:

Enabling large language models (LLMs) to understand all data modalities from a single unified space
Teaching LLMs to describe or answer questions about images, audio, video and/or 3D content
Cross-modal learning, or using examples from one data type such as images to help models recognize patterns in others like audio
Quality-control applications such as detecting instances in which audio doesn’t match the generated video or finding biases in datasets (a simple version of this check is sketched after this list)
Using embeddings from the EBIND model to condition video generation using text, objects, or audio embeddings, such as transferring an audio “style” to 3D models
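For the quality-control case, one way to use cross-modal embeddings is to flag items whose paired modalities disagree. The snippet below is a minimal sketch under stated assumptions: the embeddings are random stand-ins, and the threshold is a hypothetical value that would be tuned per dataset, not a figure from Encord.

```python
# Illustrative sketch, not Encord's tooling: flag clips whose audio and video
# embeddings disagree. Real embeddings would come from a multimodal model
# such as EBIND; these are random placeholders.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dim, n_clips = 512, 200
audio_embeddings = np.random.randn(n_clips, dim)
video_embeddings = np.random.randn(n_clips, dim)

THRESHOLD = 0.2  # assumed cutoff; in practice tuned per dataset
mismatched = [
    i for i in range(n_clips)
    if cosine(audio_embeddings[i], video_embeddings[i]) < THRESHOLD
]
print(f"{len(mismatched)} clips flagged for possible audio/video mismatch")
```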

Encord works with customers including Synthesia, Toyota, Zipline, AXA Financial, and Northwell Health.

“We work across the spectrum of physical AI, including autonomous vehicles, traditional robots for manufacturing and logistics, humanoids, and drones,” said Landau. “Our focus is on applications where AI is embodied in the real world, and we’re agnostic to the form that it takes.”

Users could also swap in different sensor modalities such as tactile or even olfactory sensing, or synthetic data, he said. “One of our initiatives is that we’re now looking at multilingual sources, because a lot of the textual data is heavily weighted to English,” added Landau. “We’re looking at expanding the data set itself.”

“Humans take in multiple sets of sensory data to navigate and make inferences and decisions,” he noted. “It’s not just visual data, but also audio data and sensory data. If you have an AI that’s existing in the physical world, you would want it to have a similar set of abilities to operate as effectively as humans do in 3D space.

“So you want your autonomous vehicle to not just see and not just sense through lidar, but also to hear if there’s a siren in the background. You want your car to know that a police car, which might not be in sight, is coming,” Landau concluded. “Our view is that all physicalized systems will be multimodal in some sense in the future.”


