
“User-faced” computing: any computing that needs to happen in the tight loop of users interacting with each other or with themselves. By this definition, most business insight generation, log processing, server health stats, data reprocessing for cold storage, automated trading, and automated surveillance analysis are not user-faced computing. Conversely, when you send a message to another user, every step of that process counts as user-faced computing.

Many people have the misconception that most user-faced computing has shifted to the cloud (remote servers) in recent years.

Messaging, arguably one of the more “even” interactions (typing and encrypting a message involves very little client computing), still spends very little computing on the server side. Signal, an open messaging platform, estimates its server computing costs at approximately 2.9 million dollars per year, equating to about $0.058 per user per year, or roughly 10 vCPU hours per user per year. Time spent by users, while not published (and perhaps not even collected) by Signal, should far exceed that: both Meta and Snap report time spent of around 0.5 hours per day per user.
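
A rough back-of-envelope makes the imbalance concrete. The active-user count and the per-vCPU-hour price below are assumptions (roughly what it takes for $2.9 million per year to work out to about $0.058 and ~10 vCPU hours per user), not figures Signal has published:

```python
# Back-of-envelope: Signal's server compute per user vs. time users spend in-app.
# Assumptions (not published by Signal): ~50M active users, ~$0.006 per vCPU-hour
# of committed/reserved cloud capacity.
server_cost_per_year = 2_900_000      # USD, Signal's published estimate
active_users = 50_000_000             # assumed
usd_per_vcpu_hour = 0.006             # assumed

cost_per_user = server_cost_per_year / active_users        # ~$0.058 per user per year
vcpu_hours_per_user = cost_per_user / usd_per_vcpu_hour    # ~10 vCPU-hours per user per year
user_hours_per_year = 0.5 * 365                            # ~0.5 h/day, per Meta/Snap-style figures

print(f"server: ~{vcpu_hours_per_user:.0f} vCPU-hours/user/year, "
      f"user time: ~{user_hours_per_year:.0f} hours/user/year")
```

Even on these generous assumptions, a user spends well over an order of magnitude more time in the app than the servers spend computing on their behalf.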

There have been many indications that most user-faced computing occurs primarily on the client side. Most personal devices in the past 15 years have shipped with hardware-level support for video decoding/encoding, whereas only in the past 5 years have major video providers like YouTube started deploying their own video acceleration chips on the server side.

Efforts to shift this balance have been made. Video games, which demand significant client computing, have been attempting to move to the cloud since OnLive’s debut in 2009. More substantial efforts followed, including NVIDIA GeForce NOW, Xbox Cloud Gaming, Google Stadia, and Unity’s Parsec, to name a few. However, few have succeeded.

Conversely, the transition of storage to the cloud has been phenomenally successful. Many companies and services have thrived by moving what we used to store locally onto remote servers: Dropbox, Spotify, YouTube, iCloud services, Google Photos. Today, operating your own local storage box is a niche hobby (e.g., r/DataHoarder) rather than a mainstream activity.

The change is already under way.

ChatGPT, the most successful generative AI product to date, demonstrates a significant imbalance in user-faced computing: completing each query requires approximately 1 to 3 seconds of dedicated GPU time on an 8-H100 machine. Google doesn’t publish numbers for search, but its per-query compute is likely 1 to 2 orders of magnitude smaller.
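
To put that in perspective, here is a back-of-envelope on accelerator time and cost per query; the per-GPU-hour price is an assumption, not a published number:

```python
# Back-of-envelope: accelerator time (and rough cost) per generative query,
# using the 1-3 seconds of dedicated time on an 8x H100 machine cited above.
gpus_per_machine = 8
seconds_per_query = (1, 3)      # low/high estimate from the text
usd_per_gpu_hour = 2.50         # assumed H100 rental price

for s in seconds_per_query:
    gpu_seconds = gpus_per_machine * s
    cost = gpu_seconds / 3600 * usd_per_gpu_hour
    print(f"{s}s on 8 GPUs -> {gpu_seconds} GPU-seconds, ~${cost:.3f} per query")
```

At that assumed price, a single query lands in the $0.006–$0.017 range, meaning one generative query can cost a sizable fraction of what Signal spends on a messaging user for an entire year.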

This puts immense pressure on providers to secure compute-heavy servers for generative AI inference, whether text, image, or video generation. So far, companies are often limited by how much server computing they can secure to support their user growth.

Is this a phase change? Could it be a temporary crunch? Why has the storage transition been so successful while the computing shift lags behind?

A closer look at the storage transition shows why it was fairly straightforward. While local storage capacity has kept growing at a steady rate (1.5 TB in 2008 versus 20 TB in 2023, roughly 20% per year), the form factor and price haven’t improved nearly as fast (about $100 per terabyte in 2008 versus $25 per terabyte in 2023, roughly a 10% decline per year). The shift of personal devices to SSDs made local storage pricing even less favorable: a $100-per-terabyte HDD in 2008 versus an $80-per-terabyte SSD in 2023.
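
The ~20% and ~10% figures are just the compound annual rates implied by those data points; a quick check:

```python
# Compound annual rates implied by the data points above.
def cagr(start, end, years):
    return (end / start) ** (1 / years) - 1

capacity_growth = cagr(1.5, 20, 15)   # 1.5 TB (2008) -> 20 TB (2023): ~19%/year
price_decline = -cagr(100, 25, 15)    # $100/TB (2008) -> $25/TB (2023): ~9%/year
print(f"capacity: ~{capacity_growth:.0%}/year, price: ~{price_decline:.0%}/year")
```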

Cost-wise, moving storage to the cloud offers at most modest savings per raw byte. However, the inherent redundancy in data provides 3 to 4 orders of magnitude in savings. In the early 2000s, devices like the iPod were frequently advertised for their large storage capacity, capable of holding thousands of songs; that is exactly the kind of redundant data that can easily be shared among users. The fact that many users consume the same content makes cloud storage economically efficient. On top of that, a roughly 10x increase in household bandwidth and a 100x increase in mobile bandwidth render the bandwidth advantage of local storage negligible.
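
The redundancy argument is essentially deduplication: if millions of users keep the same songs and videos, the cloud only has to store each item once. A minimal, hypothetical sketch of content-addressed deduplication (not any particular provider’s implementation):

```python
# Minimal sketch of content-addressed deduplication: many users "store" the same
# song, but the backend keeps a single copy keyed by its content hash.
import hashlib

blob_store: dict[str, bytes] = {}           # content hash -> bytes, stored once
user_files: dict[str, dict[str, str]] = {}  # user -> {filename: content hash}

def put(user: str, filename: str, data: bytes) -> None:
    digest = hashlib.sha256(data).hexdigest()
    blob_store.setdefault(digest, data)     # only the first upload pays for storage
    user_files.setdefault(user, {})[filename] = digest

song = b"...the same popular track, byte-for-byte identical for every listener..."
for user in ("alice", "bob", "carol"):
    put(user, "hit_single.mp3", song)

print(f"{len(blob_store)} blob(s) stored for {len(user_files)} users")  # 1 blob, 3 users
```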

If user-faced computing is to make a similar move to the cloud, the advantages are clear. The computing requirements for generative AI, especially for high-quality results, are not easily met in portable form factors. There are also redundancies to exploit. The KV cache, a straightforward optimization for large language model inference, eliminates redundant computation between token generations; vLLM takes this further by sharing KV caches across sequences and requests. While not widely deployed yet, it is conceivable that KV caches could be shared between different users and sessions, potentially yielding a 10x efficiency improvement (though this raises its own security concerns, such as timing attacks on a shared KV cache recovering previous conversations or conversations between different users).
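
What cross-request sharing could look like, conceptually: key the prefilled KV cache on the token prefix so that any request starting with the same prefix (say, a shared system prompt) reuses it. This is a toy illustration of the prefix-caching idea, not vLLM’s actual paged-attention implementation, and a real deployment would need the isolation safeguards mentioned above:

```python
# Toy sketch of cross-request KV-cache sharing: requests that begin with the same
# token prefix (e.g., a shared system prompt) reuse one prefill result.
prefix_cache: dict[tuple[int, ...], dict] = {}

def prefill(prefix_tokens: tuple[int, ...]) -> dict:
    # Stand-in for the expensive attention prefill that materializes K/V tensors.
    print(f"computing prefill over {len(prefix_tokens)} tokens")
    return {"kv_len": len(prefix_tokens)}

def get_or_compute_kv(prefix_tokens: tuple[int, ...]) -> dict:
    if prefix_tokens not in prefix_cache:
        prefix_cache[prefix_tokens] = prefill(prefix_tokens)
    return prefix_cache[prefix_tokens]

system_prompt = tuple(range(512))   # stand-in for a system prompt shared by many users
get_or_compute_kv(system_prompt)    # request from user A: prefill runs
get_or_compute_kv(system_prompt)    # request from user B: cache hit, no prefill
```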

Another advantage of moving user-faced computing to the cloud is data efficiency. Like video games, generative AI models require substantial data payloads to function. When the computing happens in the cloud, those payloads never need to be transferred to the client, enabling the kind of instant gratification that most cloud gaming platforms promise. However, it remains unclear whether new experiences could instead be delivered on the client by delta-patching an existing model, or whether transmitting an entirely new model would be necessary.
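
For the client-side alternative, delta-patching would mean shipping only the difference from a model the device already has, rather than a whole new model. A purely illustrative sketch (real systems would quantize and compress the delta, or ship adapter weights such as LoRA instead):

```python
# Illustrative sketch of delta-patching model weights: transmit only the per-tensor
# difference from the version the client already holds, then reapply it locally.
import numpy as np

def make_delta(old: dict[str, np.ndarray], new: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    return {name: new[name] - old[name] for name in new}

def apply_delta(old: dict[str, np.ndarray], delta: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    return {name: old[name] + delta[name] for name in delta}

base = {"w": np.ones((4, 4), dtype=np.float32)}                                  # model on device
finetuned = {"w": base["w"] + 0.01 * np.random.randn(4, 4).astype(np.float32)}   # new version

delta = make_delta(base, finetuned)   # this (compressed) is what gets transmitted
patched = apply_delta(base, delta)    # client reconstructs the new model
assert np.allclose(patched["w"], finetuned["w"])
```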

Will this phase change in user-faced computing finally occur? What should we watch for?

First, if this phase change materializes, we could see a 1 to 2 orders of magnitude increase in server-side computing capacity. This would make what NVIDIA/AMD ships today seem insignificant in 5 years.

Second, the success of this transition is not guaranteed. It is not as straightforward as the storage transition, and there have been previous failures (e.g., cloud gaming). It depends on several factors: 1. Will client computing power/efficiency continue to grow at roughly 50% per year, compounding to nearly a 10x increase in local computing power in 5 years (a quick compounding check follows below)? 2. Will locally available RAM/storage accelerate enough to host large models? 3. Will a winning model emerge as the true foundation that other model adapters build on?
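
For the first factor, the compounding is easy to check; 50% per year gets close to, but not quite, an order of magnitude in 5 years:

```python
# Quick compounding check for factor 1: what does ~50%/year growth in client
# computing power amount to over 5 years, and what rate would a full 10x require?
growth_rate = 0.5
five_year_gain = (1 + growth_rate) ** 5   # ~7.6x
rate_for_10x = 10 ** (1 / 5) - 1          # ~58%/year
print(f"~{five_year_gain:.1f}x over 5 years; a full 10x would need ~{rate_for_10x:.0%}/year")
```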

Third, resistance to this phase change also comes from within the economics of serving it: the cost of acquiring computing hardware drives both pricing and availability, and it is unclear whether the business will be viable without closing these price and availability gaps.
