
Prototyping a Real-time Spatial Map

I recently wrapped up a milestone in my Minimap project. Here's the latest demo:

Inspiration courtesy of Total Recall and Perfect Dark

I especially enjoy these projects because they scratch an itch, and I'm generally overly optimistic about how easy things will be to build. When I'm wrong, that just means I have to learn a bunch of SDKs and APIs that are unfamiliar to me. What typically unfolds is that, soon thereafter, a client wants something built using those newly explored technologies. After shipping the client's product, I chase another "I wonder if I can build that" concept and the cycle continues.

I learnt about the technologies required to build things like Minimap, but I learnt a lot more along the way. Here are the highlights of what I gathered while pursuing every dead end possible and duct taping RealityKit, ARKit, SwiftUI, Combine, Metal, RoomPlan, SwiftData, Multipeer Connectivity, RealityKit Postprocessing, VideoToolbox, and DrawableQueue together into the Minimap prototype.

Spatial Computing Multiplayer Woes

Minimap presently works by using a network of cameras. For the demo above, an iPhone 14 Pro was on the far side of the wall sending person-segmented camera frames in real time to the iPhone 15 Pro recording on the near side of the wall. This example network of devices has one streaming camera – but ideally there would be several streaming cameras mapping out a place more thoroughly than a single camera's field of view can cover.
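
For anyone curious how the capture side can produce person-segmented frames, here's a minimal ARKit sketch. The function name and delegate wiring are illustrative, not Minimap's actual code:

```swift
import ARKit

// Minimal sketch: enable person segmentation with depth so each ARFrame
// carries a person matte (and per-person depth) that can be applied to the
// camera image before streaming it to the recording device.
func startCaptureSession(delegate: any ARSessionDelegate) -> ARSession {
    let configuration = ARWorldTrackingConfiguration()
    if ARWorldTrackingConfiguration.supportsFrameSemantics(.personSegmentationWithDepth) {
        configuration.frameSemantics.insert(.personSegmentationWithDepth)
    }
    let session = ARSession()
    session.delegate = delegate   // session(_:didUpdate:) delivers ARFrames
    session.run(configuration)
    return session
}

// On each delivered ARFrame:
//   frame.segmentationBuffer  – the person matte
//   frame.estimatedDepthData  – per-pixel depth for the segmented people
```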

After much finagling, I settled on using Multipeer Connectivity as RealityKit's SynchronizationService. It was not easy. It was not predictable. I've spent many hours over the past few weeks waving two iPhones, one in each hand, desperately willing them to wirelessly connect to one another as fast as possible so I could test whatever I was actually trying to test.
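
To be fair, the RealityKit side of that wiring is only a few lines once an MCSession exists – the pain lives in peer discovery and invitation handling, which I've omitted below. A minimal sketch; the class name and service-type string are placeholders:

```swift
import RealityKit
import MultipeerConnectivity
import UIKit

final class MinimapSync: NSObject {
    let peerID = MCPeerID(displayName: UIDevice.current.name)
    lazy var session = MCSession(peer: peerID,
                                 securityIdentity: nil,
                                 encryptionPreference: .required)
    // "minimap-sync" is a made-up service type; both devices must advertise/browse the same one.
    lazy var advertiser = MCNearbyServiceAdvertiser(peer: peerID,
                                                    discoveryInfo: nil,
                                                    serviceType: "minimap-sync")
    lazy var browser = MCNearbyServiceBrowser(peer: peerID,
                                              serviceType: "minimap-sync")

    // Hand the MCSession to RealityKit so entities with a SynchronizationComponent
    // replicate across connected peers.
    func attach(to arView: ARView) throws {
        arView.scene.synchronizationService =
            try MultipeerConnectivityService(session: session)
        advertiser.startAdvertisingPeer()   // invitation/discovery delegates omitted
        browser.startBrowsingForPeers()
    }
}
```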

Now, the first time local devices "automatically" connect to one another: it does feel like magic. Things are just happening! The issue is that sometimes the devices connect in a couple of seconds. Other times they never connect. Other times they randomly disconnect. I convinced myself there was an ideal, arcane process that would ensure a reliable connection within ~25 seconds – and later found that process was based on faulty assumptions. My favorite part was when I wasted several days trying to buffer the streamed frames and generally troubleshoot bursty video playback, when what actually fixed the issue was turning off Wi-Fi on my devices and letting Bluetooth handle the Multipeer Connectivity session. Of course! That makes sense!

Here's the thing: My naive understanding is that, under the hood, AirPods (probably my favorite product of the past decade) use Multipeer Connectivity for knowing when to connect, what to connect to, when to transfer to another device, etc. I think the AirPods experience is magnificent – the wireless experience was buggy the first few years, but it's steadily improved and I've loved it every step of the way.

So there's a 95% chance the problem is me. I'm unfamiliar with the esoteric incantations necessary to build an excellent multiplayer/multi-device experience using Multipeer Connectivity because I haven't used it that much.

BUT. The Vision Pro, and visionOS in general, needs to be more social than it is right now. A lot more social. With the intent of providing devs with the tools necessary to make visionOS products that are more social, and having tinkered around with Multipeer Connectivity a bit, I would love to see some or all of the following:

I believe one of the crucial elements that will determine the success or failure of the visionOS platform is whether such a nebulous "multiplayer spatial computing SDK" materializes. I think Apple, Google, some new Firebase-esque platform focused on spatial computing, or someone else needs to make it dead simple for developers to plug in a multiplayer SDK and get spatial functionality out of the box. In the interest of not building on tech that has a high chance of losing support in the near future: Preferably this product would not be built by a startup...or Google for that matter.

Ultimately, what I just requested may simply be Unity and Unreal – but I prefer to work in code, so a native version of the above would help all the dozens of augmented reality developers like me.

A Video Standard With Depth

I ended up using VideoToolbox, RealityKit Postprocessing, DrawableQueue, and a bunch of other stuff for segmenting and streaming the frames between devices. As can be seen in the demo, there are segmentation artifacts around the streamed frames – that's more a result of me wanting to be done with this phase of the project than anything else. I'm sure some shader tweaks, some blending, or something else would resolve that (and the washed-out colors, and many other issues on my list) relatively quickly.
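
As one concrete example of the duct tape: the display path pushes each decoded frame into a RealityKit material via DrawableQueue. Here's a minimal sketch of that idea – the dimensions, pixel format, and the "placeholder" texture asset are stand-ins, and error handling is omitted:

```swift
import RealityKit
import Metal

// Minimal sketch of the display path: a DrawableQueue bound to the texture
// that a billboard material samples, so decoded frames can be pushed into
// the RealityKit scene every frame.
func makeStreamingMaterial() throws -> (UnlitMaterial, TextureResource.DrawableQueue) {
    let descriptor = TextureResource.DrawableQueue.Descriptor(
        pixelFormat: .bgra8Unorm,
        width: 1280,
        height: 720,
        usage: [.shaderRead, .renderTarget],
        mipmapsMode: .none)
    let queue = try TextureResource.DrawableQueue(descriptor)

    // Any texture resource works as the initial binding; its contents get
    // replaced by whatever is presented through the drawable queue.
    let texture = try TextureResource.load(named: "placeholder") // stand-in asset name
    texture.replace(withDrawables: queue)

    var material = UnlitMaterial()
    material.color = .init(texture: .init(texture))
    return (material, queue)
}

// Per decoded frame: copy the frame into the drawable's MTLTexture
// (sizes and pixel formats must match), then present it.
func push(frame: MTLTexture, to queue: TextureResource.DrawableQueue,
          commandQueue: MTLCommandQueue) throws {
    let drawable = try queue.nextDrawable()
    if let commandBuffer = commandQueue.makeCommandBuffer(),
       let blit = commandBuffer.makeBlitCommandEncoder() {
        blit.copy(from: frame, to: drawable.texture)
        blit.endEncoding()
        commandBuffer.commit()
    }
    drawable.present()
}
```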

The codec I used is HEVC with Alpha (HEVC being High Efficiency Video Coding). What I really want is a natively supported video codec that encodes depth information into each frame. No more grabbing things off ARFrame. No more RealityKit Postprocessing. Just give me the depth data right in the video frames.
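
For anyone headed down the same road: the encoder side of HEVC with Alpha is refreshingly boring once you know VideoToolbox supports the codec type. A minimal sketch – dimensions are placeholders, and the frame-feeding/output-handler plumbing is omitted:

```swift
import VideoToolbox

// Minimal sketch: a VideoToolbox compression session configured for HEVC with
// Alpha. Frames would then be fed in with the output-handler variant of
// VTCompressionSessionEncodeFrame.
func makeHEVCWithAlphaSession(width: Int32, height: Int32) -> VTCompressionSession? {
    var session: VTCompressionSession?
    let status = VTCompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        width: width,
        height: height,
        codecType: kCMVideoCodecType_HEVCWithAlpha,
        encoderSpecification: nil,
        imageBufferAttributes: nil,
        compressedDataAllocator: nil,
        outputCallback: nil,   // nil: encoded frames come back via the encode call's output handler
        refcon: nil,
        compressionSessionOut: &session)
    guard status == noErr, let session = session else { return nil }

    // Real-time encoding, plus a quality target for the alpha channel (0.0–1.0);
    // higher values preserve the matte edges better at a bitrate cost.
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_RealTime,
                         value: kCFBooleanTrue)
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_TargetQualityForAlpha,
                         value: 0.75 as CFNumber)
    return session
}
```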

A video codec that readily streamed depth data would have made this project significantly easier. There are a handful of such codecs out there, but I don't think they have wide support in general nor native visionOS/iOS support specifically.

As many people who've tried the Vision Pro will tell you: The immersive videos are insane. It's fun to put on my Vision Pro and look at pics I've taken. Ordered by increasing levels of getting my mind blown, they go:

  1. Pictures
  2. Panoramas (very surprising to me)
  3. Spatial Videos recorded on my iPhone
  4. Spatial Videos recorded on my Vision Pro (I actually haven't tried this yet, but I've heard they're slightly better than those recorded on iPhone)
  5. Apple Immersive Videos (like the highlining and rhino videos featured on Apple TV)

Numbers 3 and 4 above are recorded in MV-HEVC (Multiview High Efficiency Video Coding). But what's really interesting is the enormous jump in extraordinariness between numbers 4 and 5. Apple Immersive Videos are phenomenal...and I'm pretty sure they're not using MV-HEVC. My uninformed guess is they have several multimillion dollar camera rigs, custom software, and custom processing to achieve that level of immersion.

Hopefully we'll get those sensors, tech, and software in our devices in the coming years – in the meantime, there is a legion of developers who've focused on video, editing, and creative apps for years. Those developers could build some amazing stuff with a video codec that also encodes depth information and therefore makes the depth info directly available. I think if a single stream contains both video and depth, the depth will get used more often and in more interesting ways.

I wouldn't be surprised to see native support for such a codec soon (perhaps in the form of an abstraction layer that processes MV-HEVC frames and provides the related depths).

Testing Spatial Software Is Rough

I complained about testing AR/spatial computing code in 2020. I also complained in 2017, 2018, and 2019. I'm guessing I'll be complaining in 2027. I'm not talking about writing tests. I'm talking about how, when I work on an iOS project, I sometimes build the app to my phone fifty times in a day. Build; tap around; test; get back to writing code.

What happens when you're working with multiple devices? What happens when you need to scan a space? What happens when you need a person in-frame to test functionality? What happens when testing spatial computing software entails actually getting up and doing something?

What happens is I end up getting an ab workout. To test functionality that requires a person in-frame to stream frames wirelessly between devices, I end up waving two iPhones (one in each hand) pointed at my outstretched legs while I do a seated leg lift and kick the air to gauge frame latency.

It's annoying, tedious, and not ideal. Perhaps I am the outlier here. I work alone from home. I've worked with clients around the world, but rarely with a client in Chicago. Perhaps other developers focused on spatial computing, and especially multiplayer spatial computing, are pair-programming next to their team and don't have issues getting multiple people and multiple devices together for a quick test.

What I want is simple. Apple: Buy Snap. Give them a 50% premium. As of today, that'd be a measly $28B. Buy Snap, do whatever the hell you want with everything else, and give free rein to the team of probably 3-20 people who've focused on the preview and lens testing experiences in Lens Studio. Selfie-focused lenses are easier to test than the wide-open possibilities enabled by general spatial computing – but as a visionOS/iOS developer focused on ARKit, I still think shaking things up could only improve things.

But in all seriousness: Testing capabilities need to be improved. Give us some multiplayer testing options. Some virtual/artificial person settings to test out segmentation in the simulator. Some ARKit hand tracking and scene reconstruction options. Something.

Where Does Minimap Go From Here?

As I slowly, painstakingly, painfully prototyped my way to Minimap's current state, I couldn't help but be reminded of building CyberWave about four years ago.

CyberWave is an iOS app that is a spatial version of a music visualizer. I released it in 2020 between client contracts. RealityKit had been announced almost a year prior and I still hadn't really dipped my toes into the spanking-new SDK. I sat down to build. I looked at the RealityKit docs. And I immediately went back to SceneKit. To build CyberWave I needed things like shaders, comprehensive material flexibility, and straightforward access to scene geometry – features that wouldn't come to RealityKit until 12+ months after I sat down to build CyberWave.

I was perplexed and frustrated that features or methods or SDK functionalities that I expected to be available and easy to implement...weren't.

I felt that feeling over and over and over these past few weeks whilst building my Minimap prototype:

"This should be easier."
"I can't do that? Why?!"
"Burn in Hell ChatGPT, you know nothing."

I didn't even accomplish the full prototype functionality I envisioned (i.e. a point cloud of the segmented human to give the result more depth than the segmented billboard currently in place) – and even so, I felt like I was consistently butting up against temporary obstacles and unfinished SDK functionality at every other step.

In 2020 I was determined to build CyberWave as quickly as possible, ship it, and move on. Now, in 2024, having built a spatial concept that is nowhere near ready to ship and having had a Vision Pro sitting on my desk for 1.5 weeks without building anything strictly for it yet: I think I'm going to wait before taking any more steps towards releasing a real-time, spatial mapping version of Minimap.

I think there will be features, extensions, and many things I haven't even dreamed of coming later this year, next year, and beyond that will make the Minimap I envision easier to build, more performant, more complete, less buggy, and more powerful.

Now, on to the actual Vision Pro fun.