
An Open Source Exploration of visionOS 2's Object Tracking

Just here for the code? Here's my open source exploration of visionOS 2's object tracking.

A month ago I hoped against hope that visionOS 2 would bring us object tracking – and it did! Kinda. Sorta. Just with lots of required prepwork and caveats.

I never delved deep into iOS's object detection capabilities that have been around for years. I checked out the demo in 2018, I scanned in & tested an object of my own, I made a little impressed noise, and then I moved on. I found the process for creating ARReferenceObjects to be cumbersome. There were plenty of other early-ARKit goodies to play around with that were more accessible to me as a dev and to any prospective users of the apps I built. Also, I wasn't quite convinced of the practicality of searching for objects through a phone's camera and then viewing any subsequent information on that same little screen.

But now we've got something to put on our heads that drops screens right next to our eyeballs while simultaneously tracking our hands – all of which makes me very excited for object tracking. That combination enables interactions with our world in brand new ways. Here's a quick clip, or you can check out the full demo with sound over here:

Tackle even the hardest recipes with no know-how by using Vision Pro's object tracking

Before I expound on all the potential these abilities open up, let me level-set expectations and highlight some hurdles and downsides related to object tracking as of visionOS 2 Beta 2.

Caveats

  1. Objects can only be so big – generally they need to fit in a roughly 4-foot cube
  2. But objects can't be TOO small – I've had little success with objects smaller than a paperback book
  3. Objects can't be shiny
  4. Objects can't be transparent
  5. Objects need some texture variation – scanning in a rather featureless bowl resulted in a poor object detection success rate
  6. There is considerable lag – moving an object with a standard object tracking configuration will yield nowhere close to real-time tracking
  7. There are additional parameters you can configure to increase detection and tracking frequencies...but they're locked behind Apple's opaque Enterprise APIs program, so best of luck if you're not connected to a significant, potential/actual corporate partner
  8. Model training (which yields the .referenceObject file a visionOS app needs to recognize an object) must be done on a Mac using Create ML...and it takes HOURS – I've seen anywhere from 4 to 16 hours to create a single .referenceObject
  9. I think training against areas rather than objects should be possible – but I've run up against issues with Object Capture's area mode in iOS 18, so I will be revisiting that at some later point
  10. Object Tracking requires an ARKitSession with .worldSensing authorization – this means an app that tracks objects cannot run in the Shared Space alongside other apps, a severe limitation in my mind (see the sketch below)
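
To make that last caveat concrete, here's a minimal sketch of what starting object tracking looks like, as I understand the current API. It assumes you've already trained a reference object in Create ML and bundled it as a hypothetical "MyObject.referenceObject", and the view driving it has to live in an ImmersiveSpace rather than the Shared Space:

```swift
import ARKit

/// Minimal sketch, not production code. Assumes a bundled
/// "MyObject.referenceObject" (hypothetical name) trained in Create ML.
final class ObjectTrackingModel {
    private let session = ARKitSession()
    private(set) var objectTracking: ObjectTrackingProvider?

    func start() async throws {
        // Object tracking needs .worldSensing, which is why the app has to run
        // in a Full Space (ImmersiveSpace) instead of the Shared Space.
        let authorization = await session.requestAuthorization(for: [.worldSensing])
        guard authorization[.worldSensing] == .allowed else { return }

        // Load the reference object produced by Create ML's object tracking template.
        guard let url = Bundle.main.url(forResource: "MyObject",
                                        withExtension: "referenceObject") else { return }
        let referenceObject = try await ReferenceObject(from: url)

        // Start tracking; detections arrive on the provider's anchorUpdates sequence.
        let provider = ObjectTrackingProvider(referenceObjects: [referenceObject])
        try await session.run([provider])
        objectTracking = provider
    }
}
```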

I suppose it's only going to get better from here – but I cannot wait for training APIs/the ability to train on platforms other than macOS, faster tracking, sensors that can distinguish smaller objects, and...well, not having to deal with any of the caveats above.

Caveat to the Caveats

I believe there are tremendous opportunities for Apple Vision products to be used by medical professionals to help patients visualize post-procedure changes to their bodies (I'm not interested in working on products related to plastic surgery, nor am I referring to plastic surgery here, although that will assuredly be a future use case of visionOS). To that end, within hours of the WWDC24 announcement of object tracking in visionOS 2, I was training a .referenceObject in Create ML for a 3D object that's roughly the size of a small apple.

Unfortunately, as of today in visionOS 2 Beta 2, I'm unable to get my Vision Pro to reliably detect this part of the body using object tracking. However, I did notice a marked improvement in object detection between visionOS 2 Beta 1 and visionOS 2 Beta 2 using the exact same .referenceObject.

To put it another, overly optimistic way: A lot of the use cases for object tracking that don't seem possible today with visionOS 2 Beta 2 may be feasible next week, or next month, or next year.

The Possibilities

What can't you do with object tracking?

I am admittedly very enamored with the idea of remixing the real world. My app Twin enables people to scan in real-world objects and remix them in their virtual environment – but it lacks object tracking (since that didn't exist when I built Twin, and there isn't a clear, convenient path for Twin's users to create .referenceObjects from their scanned objects). You can manually place a Twin model to overlap its real-world twin...but doing so is clunky and brittle.

With object tracking (plus some lighter hardware, better software, better sensors, better screens, etc.), people could personalize their homes at the snap of a finger. Scan in your couch once and change its color hourly if you want. Try out a different finish/color of your watch for a day. Customize how you see your home and your stuff, unbounded by the constraints of money, time, or physics.

And there are a bunch more use cases to be excited about, including:

In my demo project, whenever an object is detected, a visual highlight is placed on or near the object. But that's just the first thing that came to mind. If you can make a behavior happen in code, you can make that behavior occur when an object is tracked.
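
As a rough sketch of what that looks like (simplified from my project, with a made-up placeholder sphere standing in for the real highlight), you watch the provider's anchor updates and attach whatever entity or behavior you like to the anchor's transform:

```swift
import ARKit
import RealityKit

/// Sketch: react to object anchor updates by placing a highlight entity.
/// `objectTracking` is a running ObjectTrackingProvider; `rootEntity` is an
/// entity already added to your RealityView's content.
func processAnchorUpdates(_ objectTracking: ObjectTrackingProvider,
                          rootEntity: Entity) async {
    // Keep one highlight per tracked object.
    var highlights: [UUID: Entity] = [:]

    for await update in objectTracking.anchorUpdates {
        let anchor = update.anchor

        switch update.event {
        case .added:
            // Placeholder visual – swap in any entity, sound, or other behavior.
            let highlight = ModelEntity(
                mesh: .generateSphere(radius: 0.03),
                materials: [SimpleMaterial(color: .yellow, isMetallic: false)]
            )
            highlight.transform = Transform(matrix: anchor.originFromAnchorTransform)
            rootEntity.addChild(highlight)
            highlights[anchor.id] = highlight

        case .updated:
            // Follow the object as new poses arrive (with the lag noted above).
            highlights[anchor.id]?.transform = Transform(matrix: anchor.originFromAnchorTransform)
            highlights[anchor.id]?.isEnabled = anchor.isTracked

        case .removed:
            highlights[anchor.id]?.removeFromParent()
            highlights[anchor.id] = nil
        }
    }
}
```

Swap that sphere out for a sound, a video, or an attachment view and you've got the rest of the ideas below.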

Play a sound. Play a video. Launch an exclusive experience based on encountering a rare item. Play a spatial 3D animation. Bring up the service manual for the specific engine I'm looking at. Recognize what I'm looking at and tell me how to fix it or its most common failure modes.

I expect object tracking to be a vital component of the visionOS ecosystem and I can't wait to see what people build.