Locate 3D transforms how AI understands physical space, delivering state-of-the-art object localization in complex 3D environments.
Operating directly on standard sensor data, Locate 3D brings spatial intelligence to robotics and augmented reality applications in real-world settings.
Locate 3D excels at understanding nuanced spatial relationships, identifying objects through natural language queries like "guitar leaning on the wall."
Locate 3D functions with standard RGB-D sensor data without requiring ground-truth 3D information, enabling integration across diverse real-world environments.
Locate 3D uses natural language to create a foundational understanding of the physical world, transforming localization tasks in robotics and augmented reality.
Test Locate 3D's breakthrough capabilities by prompting it to find objects in complex 3D environments.
Enter your own prompt or choose from our examples to see Locate 3D's spatial reasoning in action.
Watch Locate 3D precisely identify objects using your natural language descriptions. Your accuracy ratings help us continuously improve the model's performance.
Challenge the model with varied spatial reasoning prompts or explore how it performs across our diverse collection of scenes!
Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities.
Locate 3D employs an innovative three-phase approach. In Preprocessing, point clouds are enhanced with "lifted" features from 2D foundation models. Next, we build a Contextual Representation of the scene by passing these feature-enriched point clouds through an encoder pretrained with 3D-JEPA. Finally, a specialized 3D decoder interprets the user’s natural language queries alongside the encoded 3D-JEPA features to accurately localize referenced objects.
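To make the three stages concrete, here is a minimal PyTorch sketch of how feature-enriched points might flow from the 3D-JEPA encoder to the language-conditioned decoder. The module names, layer sizes, and box parameterization below are illustrative assumptions, not the released Locate 3D code.

```python
# A minimal sketch of the three-stage pipeline described above.
# JEPA3DEncoder and GroundingDecoder are hypothetical stand-ins, not the released API.
import torch
import torch.nn as nn

class JEPA3DEncoder(nn.Module):
    """Stand-in for the 3D-JEPA pretrained point-cloud encoder (stage 2)."""
    def __init__(self, feat_dim=768, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(3 + feat_dim, hidden_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) point coordinates; feats: (B, N, feat_dim) lifted 2D features
        tokens = self.proj(torch.cat([xyz, feats], dim=-1))
        return self.backbone(tokens)  # contextualized per-point embeddings

class GroundingDecoder(nn.Module):
    """Stand-in for the language-conditioned 3D decoder (stage 3)."""
    def __init__(self, hidden_dim=512, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(hidden_dim, 6)  # axis-aligned 3D box: center + size (an assumption)

    def forward(self, point_embeds, text_embeds):
        query = self.text_proj(text_embeds)                      # (B, T, hidden)
        fused, _ = self.cross_attn(query, point_embeds, point_embeds)
        return self.box_head(fused.mean(dim=1))                  # (B, 6)

# Stage 1 (preprocessing) is assumed to have already lifted 2D foundation-model
# features onto the RGB-D point cloud, yielding `xyz` and `feats`.
xyz = torch.randn(1, 2048, 3)
feats = torch.randn(1, 2048, 768)
text_embeds = torch.randn(1, 16, 512)   # embedded query, e.g. "guitar leaning on the wall"

encoder, decoder = JEPA3DEncoder(), GroundingDecoder()
box = decoder(encoder(xyz, feats), text_embeds)
print(box.shape)  # torch.Size([1, 6])
```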
As part of the Locate 3D release, we are open-sourcing 3D-JEPA—a novel self-supervised learning algorithm that revolutionizes point cloud understanding.
Leveraging point clouds with "lifted" features from 2D foundation models, 3D-JEPA learns contextualized representations by predicting the latent embeddings of randomly masked regions. This approach combines masked prediction with prediction in latent space, allowing the model to focus on learning features that capture a point's context within the broader scene while filtering out unpredictable information.
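As an illustration of this objective, the sketch below shows one masked latent-prediction training step in PyTorch: a context encoder sees only unmasked points, an exponential-moving-average (EMA) target encoder embeds the full scene, and a predictor regresses the embeddings of the masked regions. The encoder, predictor, masking scheme, and hyperparameters are assumptions made for exposition, not the open-sourced 3D-JEPA implementation.

```python
# A minimal sketch of masked latent prediction in the spirit of 3D-JEPA.
# All module names and hyperparameters here are illustrative assumptions.
import copy
import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, xyz, feats, mask_ratio=0.3):
    B, N, _ = xyz.shape
    # Randomly mask a subset of points; the context encoder never sees their features.
    mask = torch.rand(B, N, device=xyz.device) < mask_ratio           # (B, N) bool

    # Target embeddings come from the EMA encoder applied to the full scene.
    with torch.no_grad():
        targets = target_encoder(xyz, feats)                          # (B, N, D)

    # Context embeddings are computed with masked features zeroed out.
    visible_feats = feats * (~mask).unsqueeze(-1)
    context = context_encoder(xyz, visible_feats)                     # (B, N, D)

    # The predictor regresses the latent embeddings of the masked regions.
    preds = predictor(context)                                        # (B, N, D)
    return F.smooth_l1_loss(preds[mask], targets[mask])

def ema_update(target_encoder, context_encoder, momentum=0.996):
    # Keep the target encoder as an exponential moving average of the context encoder.
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.data.mul_(momentum).add_(p_c.data, alpha=1 - momentum)

class TinyPointEncoder(torch.nn.Module):
    """Toy point encoder used only to make the sketch runnable."""
    def __init__(self, feat_dim=768, dim=256):
        super().__init__()
        self.net = torch.nn.Linear(3 + feat_dim, dim)
    def forward(self, xyz, feats):
        return self.net(torch.cat([xyz, feats], dim=-1))

context_enc = TinyPointEncoder()
target_enc = copy.deepcopy(context_enc)            # EMA copy, refreshed via ema_update()
predictor = torch.nn.Linear(256, 256)
loss = jepa_step(context_enc, target_enc, predictor,
                 torch.randn(2, 1024, 3), torch.randn(2, 1024, 768))
loss.backward()
ema_update(target_enc, context_enc)
```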
Accelerate your research with our comprehensive new dataset for 3D Referential Expressions that spans 1,346 diverse scenes enriched with over 130,000 detailed referring expression annotations.
This extensive collection serves as a valuable training resource for advancing referential expression models and spatial understanding research. Beyond training, it enables rigorous evaluation across diverse capture configurations and indoor environments.
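For illustration only, the snippet below shows one way scene-level referring-expression annotations like these could be loaded and grouped for evaluation. The file name and field names are hypothetical and do not reflect the dataset's actual schema.

```python
# Hypothetical example of consuming referring-expression annotations;
# "locate3d_annotations.json", "scene_id", and "referring_expression" are assumed names.
import json
from collections import defaultdict

with open("locate3d_annotations.json") as f:
    annotations = json.load(f)

# Group expressions by scene so each 3D scene can be evaluated in a single pass.
by_scene = defaultdict(list)
for ann in annotations:
    by_scene[ann["scene_id"]].append(ann["referring_expression"])

print(f"{len(by_scene)} scenes, {sum(len(v) for v in by_scene.values())} expressions")
```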