Every year, a new iPhone claims to be faster and better in every way. And yes, the new computer vision models and new image sensors can push the phone as hard as it will go. However, you could already take good pictures on an iPhone 10 years ago. These are incremental improvements.
Incremental asks only deserve incremental improvements. Once every few years, though, a program comes along that is barely usable even on the best of our computing devices. But the scenarios these new programs enable are so great that people are willing to suffer through.
The last time this happened was with deep neural networks, and the time before that was with 3D graphics. I believe this is the third time. In fact, I am so convinced that I built an app to prove the point.
In the past 3 weeks, I built an app that can summon images by casting a few spells, and then lets you edit them the way you like. It takes a minute to summon a picture on the latest and greatest iPhone 14 Pro, uses about 2GiB of in-app memory, and requires you to download about 2GiB of data to get started. Even though the app itself is rock solid, given these requirements, I would probably call it barely usable.
Even though it takes a minute to paint one image, my Camera Roll is now filled with drawings from this app. It is an addictive endeavor. More than that, I am getting better at it. If the face is cropped, I now know how to use the inpainting model to fill it in. If the inpainting model doesn’t do its job, you can always use a paint brush to paint over it and run an image-to-image generation again, focused on that area.
Now the cat is out of the box, let’s talk about how.
It turns out that running Stable Diffusion on an iPhone is easier than I thought, and I probably still left 50% of the performance on the table. It is just a ton of details. The main challenge is to run the app on iPhones with 6GiB of RAM. 6GiB sounds like a lot, but iOS will start to kill your app if you use more than 2.8GiB on a 6GiB device, and more than 2GiB on a 4GiB device.
The original open-source Stable Diffusion cannot run on an 8GiB card, and that is 8GiB of usable space. But before that, let’s get to some basics. How much memory exactly does the Stable Diffusion model need for inference?
The model has 4 parts: a text encoder that generates text feature vectors to guide the image generation; an optional image encoder that encodes an image into latent space (for image-to-image generation); a denoiser model that slowly denoises a latent representation of an image out of noise; and an image decoder that decodes the image from that latent representation. The 1st, 2nd, and 4th models only need to run once during inference. They are relatively cheap (around 1GiB max). The denoiser model’s weights occupy 3.2GiB (in full floating-point) of the original 4.2GiB of model weights. It also needs to run multiple times per execution, so we want to keep it in RAM longer.
Then why does the original Stable Diffusion model require close to 10GiB to run a single image inference? Besides the other weights we didn’t unload (about 1GiB in full floating-point), there are tons of intermediate allocations required. Between the single input (2x4x64x64) and single output (2x4x64x64), there are many layer outputs. Not every layer output’s memory can be reused right away. Some of these outputs, due to the network structure, have to be kept around to be used later (residual networks). Besides that, PyTorch uses the NVIDIA CUDNN and CUBLAS libraries, which keep their own scratch space as well. Since the publication, many optimizations have been made to the PyTorch Stable Diffusion model to bring the memory usage down, so it can run on cards with as little as 4GiB.
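To make the point about outputs that must be kept around more concrete, here is a rough Python sketch that computes peak live memory from tensor lifetimes. The function name, the 5MiB sizes, and the step numbers are all made up for illustration; this is not how PyTorch or any particular framework accounts for memory.

```python
MIB = 1024 * 1024

def peak_live_bytes(tensors):
    """Peak bytes alive at any step.

    tensors: list of (size_bytes, produced_at_step, last_used_at_step).
    A tensor occupies memory from the step that produces it through its last use.
    """
    last_step = max(last for _, _, last in tensors)
    peak = 0
    for step in range(last_step + 1):
        live = sum(size for size, produced, last in tensors
                   if produced <= step <= last)
        peak = max(peak, live)
    return peak

# Hypothetical plain chain: each output is consumed by the next layer only.
chain = [(5 * MIB, 0, 1), (5 * MIB, 1, 2), (5 * MIB, 2, 3), (5 * MIB, 3, 4)]

# Same chain, but the first output is also needed at step 4 (a skip connection).
with_skip = [(5 * MIB, 0, 4), (5 * MIB, 1, 2), (5 * MIB, 2, 3), (5 * MIB, 3, 4)]

chain_peak = peak_live_bytes(chain)
skip_peak = peak_live_bytes(with_skip)
print(chain_peak // MIB, "MiB without the skip connection")
print(skip_peak // MIB, "MiB with the skip connection")
```

A single long-lived output raises the peak even though the total number of tensors is unchanged, which is why skip connections make intermediate memory harder to reuse.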
That’s still a little bit more than what we can afford. But I will focus on Apple hardware and optimization now.
The 3.2GiB, or 1.6GiB in half-precision floating-point, is the starting point we are working with. We have around 500MiB of space to work with if we don’t want to get near where Apple’s OOM killer might kill us.
The first question: what exactly is the size of each intermediate output?
It turns out that most of them are relatively small, under 6MiB each (2x320x64x64). The framework I use (s4nnc) does a reasonable job of bin-packing them into somewhere less than 50MiB total, accounting for reuses etc. Then there is a particularly interesting one. The denoiser has a self-attention mechanism with its own image latent representation as input. During self-attention computation, there is a batched matrix of size (16x4096x4096). That, if not obvious, is about 500MiB in half floating-point (FP16). Later, we apply softmax to this matrix. That’s another 500MiB in FP16. A careful softmax implementation can be done “inplace”, meaning it can rewrite its input safely without corruption. Luckily, both Apple and NVIDIA low-level libraries provide inplace softmax implementations. Unluckily, higher-level libraries such as PyTorch don’t expose them.
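The sizes quoted above are easy to double-check. A small Python sketch, assuming dense tensors at 2 bytes per element (FP16); `tensor_bytes` is just a helper name made up for this example:

```python
from math import prod

MIB = 1024 * 1024

def tensor_bytes(shape, bytes_per_element=2):
    """Size in bytes of a dense tensor; 2 bytes per element corresponds to FP16."""
    return prod(shape) * bytes_per_element

# One of the larger ordinary layer outputs mentioned in the text.
layer_output = tensor_bytes((2, 320, 64, 64))

# The batched self-attention matrix; 4096 matches 64 * 64 latent positions.
attention_scores = tensor_bytes((16, 4096, 4096))

print(f"layer output:     {layer_output / MIB:.0f} MiB")
print(f"attention scores: {attention_scores / MIB:.0f} MiB")
```

The attention matrix comes out at exactly 512MiB, so a softmax that writes to a second buffer would need 1GiB for this one step alone.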
So, it is tight, but it sounds like we can get it done somewhere around 550MiB + 1.6GiB?
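That estimate counts the big attention matrix only once, which assumes softmax overwrites its input. Here is a minimal Python sketch of that idea on a single row. It only illustrates the technique and is not the Apple or NVIDIA kernel; `softmax_inplace` is a made-up name.

```python
import math

def softmax_inplace(row):
    """Overwrite `row` (a list of floats) with its softmax.

    Only a couple of scalars are kept on the side, so no second buffer
    of the same size is needed.
    """
    peak = max(row)                       # subtract the max for numerical stability
    total = 0.0
    for i, value in enumerate(row):
        row[i] = math.exp(value - peak)   # each slot is read once, then overwritten
        total += row[i]
    for i in range(len(row)):
        row[i] /= total                   # normalize in place
    return row

scores = [1.0, 2.0, 3.0]
buffer_before = id(scores)
softmax_inplace(scores)
print(scores, "sum =", sum(scores))
```

Because each element is read before it is overwritten, and the normalization pass only needs the running total, the input is never corrupted.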
On Apple hardware, a popular choice for implementing a neural network backend is the MPSGraph framework. It is a pretty fancy framework that sports a static computation graph construction mechanism (otherwise known as “TensorFlow”). People like it because it is reasonably ergonomic and performant (it has all the conveniences, such as broadcast semantics). PyTorch’s new M1 support has a large chunk of code implemented with MPSGraph.
For the first pass, I implemented all neural network operations with MPSGraph. It uses about 6GiB (!!!) at peak with FP16 precision. What’s going on?
First, let me be honest, I don’t exactly use MPSGraph as it is expected (a.k.a., the TensorFlow way). MPSGraph probably expects you to encode the whole computation graph and then feed it input / output tensors. It then handles internal allocations for you, and lets you submit the whole graph for execution. The way I use MPSGraph is much like how PyTorch does it: as an op execution engine. Thus, for inference, many compiled MPSGraphExecutable objects get executed on a Metal command queue, and each of these may hold some intermediate allocations. If you submit all of them at once, they will all hold their allocations from submission time until they finish execution.
A simple way to solve this is to pace the submission. There is no reason to submit them all at once, and in fact, Metal has a limit of 64 concurrent submissions per queue. I paced the submission to 8 ops at a time, and that drives the peak memory down to 4GiB.
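To see why pacing helps, here is a toy Python model of the effect. It assumes each op holds its scratch allocation from submission until completion and that ops complete in submission order; the op count and the 64MiB scratch size are invented, so the absolute numbers mean nothing.

```python
from collections import deque

MIB = 1024 * 1024

def peak_scratch_bytes(scratch_sizes, max_in_flight):
    """Peak scratch memory when at most `max_in_flight` ops are submitted but unfinished."""
    in_flight = deque()
    held = peak = 0
    for size in scratch_sizes:
        if len(in_flight) == max_in_flight:
            held -= in_flight.popleft()   # wait for the oldest op to finish first
        in_flight.append(size)
        held += size
        peak = max(peak, held)
    return peak

ops = [64 * MIB] * 100                    # hypothetical: 100 ops, 64 MiB scratch each

all_at_once = peak_scratch_bytes(ops, max_in_flight=len(ops))
paced = peak_scratch_bytes(ops, max_in_flight=8)
print(all_at_once // MIB, "MiB when everything is submitted at once")
print(paced // MIB, "MiB when paced to 8 ops in flight")
```

In a real Metal backend, one plausible way to get the same behavior is a counting semaphore that is signaled from each command buffer's completion handler; that part is not shown here.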
That is still 2GiB more than what we can afford on an iPhone. What gives? Before that, more background stories: when computing self-attention with CUDA, a common trick, as implemented in the original Stable Diffusion code, is to use permutation rather than transposition (for more about what I mean, please read Transformers from the Scratch). This helps because CUBLAS can deal with permuted strided tensors directly, avoiding a dedicated pass over memory just to transpose a tensor.
But for MPSGraph, there is no strided tensor support. Thus, a permuted tensor will be transposed anyway internally, and that requires one more intermediate allocation. By transposing explicitly, the allocation will be handled by the higher-level layer, avoiding the inefficiency inside MPSGraph. This trick drives the memory usage close to 3GiB now.
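The difference between a permutation and a real transpose comes down to strides. A small Python sketch, with a made-up (batch, tokens, heads, head_dim) shape and helper names that are only for illustration:

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return tuple(strides)

def permute_view(shape, strides, order):
    """Permute axes by reordering shape and strides only; the data is untouched."""
    return tuple(shape[a] for a in order), tuple(strides[a] for a in order)

shape = (2, 4096, 8, 40)                  # hypothetical (batch, tokens, heads, head_dim)
strides = contiguous_strides(shape)

# Swap the tokens and heads axes without moving any data.
view_shape, view_strides = permute_view(shape, strides, (0, 2, 1, 3))

# A backend that only accepts dense row-major tensors has to materialize a copy.
needs_copy = view_strides != contiguous_strides(view_shape)
print(view_shape, view_strides, "needs copy:", needs_copy)
```

A library with strided tensor support can consume `view_strides` as they are. One without it needs a dense copy, and the only question left is which layer owns that allocation.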
1GiB to go, and more back stories! Before MPSGraph, there was Metal Performance Shaders. It is a collection of fixed Metal primitives for some neural network operations. You can think of MPSGraph as the shinier, just-in-time compiled shaders, while Metal Performance Shaders is the older but more reliable alternative.
It turns out that MPSGraph as of iOS 16.0 doesn’t make the optimal allocation decision for softmax. Even if both the input and output tensors point to the same data, MPSGraph will allocate an extra output tensor and then copy the result over to the place we pointed it to. This is not exactly memory-friendly, given the 500MiB tensor we mentioned earlier. Using the Metal Performance Shaders alternative does exactly what we want, and this brings memory usage down to 2.5GiB without any performance regression.
The same story happened with the GEMM kernel of MPSGraph: some GEMMs require transposes internally, and these require internal allocations, rather than using the strided tensor for multiplication directly as the GEMM from Metal Performance Shaders or CUBLAS does. (Curiously, at the MLIR layer, GEMM inside MPSGraph does seem to support transpose parameters without additional allocation, like most other GEMM kernels.) Moving these transposes out explicitly doesn’t help either, because transposes are not “inplace” ops for the higher-level layer, so this extra allocation is unavoidable for that particular 500MiB tensor. By switching to Metal Performance Shaders, we reclaimed another 500MiB with about 1% performance loss. There, we finally arrived at the 2GiB size we strived for earlier.
There is still a lot of performance I left on the table. I never switched to the ANE, although I finally have some sense of how to (it requires a specific convolution input shape / stride, and for these, you can enable the mysterious OptimizationLevel1 flag). Using Int8 for convolution seems to be a safe bet (I looked at the magnitude of these weights, none exceeding the magic 6) and can reduce both the model size and the memory usage by about 200MiB more. I should move the attention module to a custom-made one, much like FlashAttention or xFormers on the CUDA side. Combined, these can probably reduce runtime by 30% and memory usage by about 15%. Well, for another day.
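For the Int8 idea, here is a minimal Python sketch of symmetric per-tensor quantization, which is one common scheme and not necessarily the one the app would end up using. The weights and function names are invented for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate floating-point weights from the int8 codes."""
    return [q * scale for q in quantized]

weights = [-0.75, -0.1, 0.0, 0.3, 1.5]    # hypothetical convolution weights
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, "scale =", scale, "max error =", max_error)
```

The quantization step is proportional to the largest magnitude in the tensor, so checking that no weight is unusually large is a sensible sanity check before committing to 8 bits.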
You can download Draw Things today at https://draw.nnc.ai/
Here are some related links on this topic:
How to Draw Anything: this was the most influential piece for me on this topic early on, and I point everyone who likes this topic to it. It describes one workflow where text-to-image models can be more than a party trick. It is a real productivity tool (since then, more alternative workflows have popped up; people are still figuring this out).
Maple Diffusion: while I was working on swift-diffusion, there was a concurrent effort by @madebyollin to implement Stable Diffusion in MPSGraph directly. I learned from this experiment that the NHWC layout might be more fruitful on M1 hardware, and switched accordingly.