Locate 3D
Research by Meta FAIR

Locate 3D transforms how AI understands physical space, delivering state-of-the-art object localization in complex 3D environments.

Operating directly on standard sensor data, Locate 3D brings spatial intelligence to robotics and augmented reality applications in real-world settings.

A room with a guitar leaning against the wall. The guitar and wall are highlighted and labeled.

Advanced Spatial Reasoning

Locate 3D excels at understanding nuanced spatial relationships, identifying objects through natural language queries like "guitar leaning on the wall".

Built for Real-World Deployment

Locate 3D functions with standard RGB-D sensor data without requiring ground-truth 3D information, enabling integration across diverse real-world environments.
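
To make "standard RGB-D sensor data" concrete, the sketch below shows the generic pinhole-camera unprojection that turns a depth frame plus camera intrinsics into a point cloud. This is illustrative math only, not code from the Locate 3D release.

```python
import numpy as np

# Generic unprojection of a depth frame into a point cloud -- the kind of
# standard RGB-D input Locate 3D operates on. Purely illustrative.
def depth_to_points(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth                                        # depth in meters
    x = (u - cx) * z / fx                            # back-project columns
    y = (v - cy) * z / fy                            # back-project rows
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop zero-depth pixels
```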

A person wears a virtual reality headset while facing a quadruped robot in the kitchen of an apartment.

World Understanding for Robotics

Locate 3D uses natural language to create a foundational understanding of the physical world, transforming localization tasks in robotics and augmented reality.

Try it yourself

Test Locate 3D's breakthrough capabilities by prompting it to find objects in complex 3D environments.

1. Prompt the model

Enter your own prompt or choose from our examples to see Locate 3D's spatial reasoning in action.

2. See Locate 3D in action

Watch Locate 3D precisely identify objects using your natural language descriptions. Your accuracy ratings help us continuously improve the model's performance.

3. Try diverse prompts and scenes

Challenge the model with varied spatial reasoning prompts or explore how it performs across our diverse collection of scenes!

How the tech works

Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities.

Locate 3D employs an innovative three-phase approach. In Preprocessing, point clouds are enhanced with "lifted" features from 2D foundation models. Next, we build a Contextual Representation of the scene by passing these feature-enriched point clouds through an encoder pretrained with 3D-JEPA. Finally, a specialized 3D decoder interprets the user’s natural language queries alongside the encoded 3D-JEPA features to accurately localize referenced objects.
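
To make the three-phase flow concrete, here is a minimal, hypothetical PyTorch sketch; the module names (feature lifter, JEPA encoder, grounding decoder) are illustrative placeholders, not the actual classes from the Locate 3D codebase.

```python
import torch.nn as nn

# Hypothetical sketch of the three-phase Locate 3D pipeline described above.
class Locate3DPipeline(nn.Module):
    def __init__(self, lifter, encoder, decoder):
        super().__init__()
        self.lifter = lifter    # lifts 2D foundation-model features onto the point cloud
        self.encoder = encoder  # contextual encoder pretrained with 3D-JEPA
        self.decoder = decoder  # language-conditioned decoder that localizes objects

    def forward(self, rgbd_frames, query_tokens):
        # 1) Preprocessing: build a point cloud enriched with "lifted" 2D features.
        points, point_feats = self.lifter(rgbd_frames)      # (N, 3), (N, C)
        # 2) Contextual representation: encode the featurized point cloud.
        scene_tokens = self.encoder(points, point_feats)    # (M, D)
        # 3) Decoding: fuse the language query with the encoded scene
        #    and predict a 3D box/mask for the referenced object.
        return self.decoder(scene_tokens, query_tokens)
```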

3D-JEPA Encoder

As part of the Locate 3D release, we are open-sourcing 3D-JEPA—a novel self-supervised learning algorithm that revolutionizes point cloud understanding.

Leveraging point clouds with "lifted" features from 2D foundation models, 3D-JEPA learns contextualized representations by predicting the latent embeddings of randomly masked regions. This dual method combines masked prediction with latent space prediction, allowing the model to focus on learning features that predict the context of a point within the broader scene, while filtering out unpredictable information.
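
The sketch below illustrates one JEPA-style masked latent prediction step under assumed module names (context_encoder, target_encoder, predictor). It captures the idea of regressing predicted latents onto target latents of masked regions, not the exact 3D-JEPA training code.

```python
import torch
import torch.nn.functional as F

# Illustrative JEPA-style training step: predict latent embeddings of masked
# regions from the visible context. Module names are assumptions.
def jepa_step(points, feats, context_encoder, target_encoder, predictor, mask_ratio=0.5):
    mask = torch.rand(feats.shape[0], device=feats.device) < mask_ratio

    # Target latents come from the full scene; no gradients reach the target encoder.
    with torch.no_grad():
        target_latents = target_encoder(points, feats)            # (N, D)

    # The context encoder only sees the unmasked points.
    context_latents = context_encoder(points[~mask], feats[~mask])

    # The predictor infers latents at the masked positions from the visible context.
    predicted = predictor(context_latents, points[mask])          # (N_masked, D)

    # Regress predictions onto the (stop-gradient) target latents of masked regions.
    return F.smooth_l1_loss(predicted, target_latents[mask])
```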

Potential applications
  • 3D object localization
  • 3D segmentation
  • 3D embodied question answering

Locate 3D Dataset

Accelerate your research with our comprehensive new dataset for 3D Referential Expressions that spans 1,346 diverse scenes enriched with over 130,000 detailed referring expression annotations.

This extensive collection serves as a valuable training resource for advancing referential expression models and spatial understanding research. Beyond training, it enables rigorous evaluation across diverse capture configurations and indoor environments.
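
For a sense of how such annotations are typically structured, here is a hypothetical example record; the field names and values are assumptions for illustration, not the released dataset schema.

```python
# Hypothetical referring-expression annotation (schema assumed, not official).
annotation = {
    "scene_id": "scene_0001",
    "referring_expression": "the guitar leaning on the wall",
    "target_object_id": 17,
    "bounding_box": {                      # axis-aligned box in scene coordinates (meters)
        "center": [1.20, 0.45, 0.80],
        "size": [0.40, 1.05, 0.15],
    },
}
```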

A bar graph comparing the number of referring expressions in the Locate 3D dataset to other datasets. Locate 3D contains 131,641, while all others contain under 52,000.