December 1st, 2019

Last year, I discussed a stackful coroutine implementation to coordinate CUDA streams.

That was an implementation based on the swapcontext / makecontext APIs. Increasingly, as I thought about porting nnc over to WASM, it became problematic because these APIs are more or less deprecated. Popular libc implementations such as musl don't implement them.

After the article, it became obvious that I cannot swapcontext into the internal CUDA thread (that thread cannot launch any kernels). Thus, the real benefit of such a stackful coroutine is really about convenience. Writing a coroutine that way is no different from writing a normal C function.

This is the moment where C++ makes sense. The coroutine proposal in C++20 is a much better fit. The extra bit of compiler support just makes it much easier to write.

If we don't use swapcontext / makecontext, the natural choice is either longjmp / setjmp or the good old Duff's device. It is a no-brainer to me to come back to Duff's device. It is simple enough and the most platform-agnostic way.

There are many existing stackless coroutines implemented in C. The most interesting one built on Duff's device is Protothreads. To me, the problem with Protothreads is its inability to maintain local variables. Yes, you can allocate additional state by passing in additional parameters. But that can quickly become an exercise in drifting away from a simple stackless coroutine toward one with all the bells and whistles of structs for parameters and variables. You can declare everything as static, but that is certainly not going to work beyond the most trivial examples.

I spent this weekend sharpening my C-macro skills on how to write the most natural stackless coroutine in C. The implementation preserves local variables, and you can declare parameters and return values almost as naturally as in normal functions.

Here is an example of how you can write a function-like stackless coroutine in C:

co_decl_task will declare the interface and the implementation. You can also separate the interface into a header file with co_decl and the implementation with co_task. In that case, the static keyword continues to work to scope the coroutine to file-level visibility. Take a look at this:

The first parameter is the return type, then the function name, then the parameters; it all feels very natural for C functions. The only catch is that local variables have to be declared within the private block.

To access parameters and local variables, you have to wrap the access with the CO_P / CO_V macros; otherwise, it is the same.

Of course, there are a few more catches:

  1. No variadic parameters;
  2. No variable-length local arrays;
  3. No void; () means no parameters, and you can simply omit the return type if you don't need one.

There is no magic really, just some ugly macros hiding away the complexity of allocating parameters / local variables on the heap and such.

There are examples in the repo that show the usage of co_resume, co_await, co_apply, co_yield, co_decl, co_task, co_decl_task and co_return in various forms. You can check out more there:

Currently, I have a single-threaded scheduler. However, it is not hard to switch to a multi-threaded scheduler, with the catch that you can no longer maintain the dependencies as a linked list, but rather as a tree.

It is a weekend exercise, and I don't expect to maintain this repo going forward. Some form of this will be ported into nnc.

Closing Thoughts

In theory, swapcontext / makecontext enables much more complex interactions between functions, such that an extra scheduler object is not needed. For what it's worth, Protothreads doesn't have a central scheduler either. But in practice, I found it still miles easier to have a scheduler like libtask does. Tracking and debugging are much easier with a central scheduler, especially if you want to make it multi-thread safe as well.

August 2nd, 2019

To train a large deep neural network, you need a lot of GPUs and a lot of memory. That is why a Titan RTX card costs more than 3 times as much as an RTX 2080 Ti with only slightly more tensor cores. It has 24GiB of memory, and that makes a lot of models much easier to train. More memory also means bigger batch sizes, and many GPU kernels run faster with larger batch sizes. If somehow we can reduce the memory footprint at training time, we can train bigger models, and we can train faster with larger batch sizes.

There are methods to reduce memory footprints. It is a no-brainer nowadays to use fp16 for training. Other than that, many of today's memory reduction techniques are derivatives of binomial checkpointing, a well-known technique in the automatic differentiation community. Some specifics need to be considered for DNNs: the results of cheap operations such as batch normalization or ReLU can be dropped and recomputed later. The paper suggested 30% more time for a DNN-tuned binomial checkpointing scheme, in exchange for roughly an 80% reduction in memory usage. In practice, people often see 10% more time with a 50% reduction in memory usage, thanks to optimizations of the forward pass over the years.

In the past few days, I've been experimenting with another type of memory usage reduction technique.

It is common today for operating systems to do something called virtual memory compression. It uses data compression techniques to compress under-utilized pages and, on page fault, decompresses these pages back. These are lossless compressions. It doesn't make sense to revisit some memory and suddenly find an 'a' has become a 'z'. In another world, however, lossy compression is indeed used to reduce memory usage.

In computer graphics, a full-blown 32-bit texture can take a lot of memory. People have exploited more effective texture representations for ages. Formats such as PVRTC or ETC rely on heavy compression schemes (many involving searching a space for better representations) to find perceptually similar but much smaller texture representations. For example, PVRTC2 can spend less than 15% of the memory for visually the same result as a full-blown 32-bit texture. These compression schemes are also very light and predictable to decompress.

There are certain similarities between textures and tensors for convolutional neural networks. They both have spatial dimensions. Convolutional neural networks traditionally require more precision, but nowadays we are exploring 4-bit or 8-bit tensors for them too. For a tensor compression algorithm to work in practice, it needs to be fast at both compression and decompression on the GPU and, hopefully, have high fidelity to the original.

I've devised a very simple, very easy-to-implement adaptive quantization algorithm for this purpose. Over the past few days, I've been experimenting on ResNet-50 models to confirm its effectiveness.

At batch size 128x4 (4 GPUs, 128 per GPU), the baseline ResNet-50 model trained on ImageNet reached single crop top-1 accuracy 77.6% with 20.97GiB memory allocated across 4 GPUs. The ResNet-50 model with tensor compression trained on ImageNet reached accuracy 75.8% with 6.75GiB memory allocated.

On each feature map, within a 4x4 patch, we find the max value and the min value. With these, we have 4 values {min, (max - min) / 3 + min, (max - min) * 2 / 3 + min, max}. Each scalar within that 4x4 patch can be represented by one of these 4 values. Thus, we use 2 bits per scalar. That totals 64 bits per patch, 25% of the original (assuming fp16). This is super easy to implement on the GPU; in fact, I am surprised my simple-minded GPU implementation is this fast. It incurs less than 10% runtime cost during training (throughput reduced from 1420 images per second to 1290 images per second).

It is also simple to update the computation graph for tensor compression. For each convolution layer's output tensor, if it is used during backpropagation, we compress it immediately after its creation in the forward pass and decompress it before its use in backpropagation. If the backpropagation of the convolution layer uses an input tensor, we do the same for that input tensor. This simple scheme covers all tensors that potentially have spatial redundancy.

Is this algorithm useful? Probably not. As long as there is accuracy loss, I am pretty certain no one will use it. At this moment, it is unclear whether 2 bits is too little or the whole scheme inherently doesn't work. Some more experiments are required to determine whether adaptive quantization is good enough, or whether spatial redundancy plays a role (by adaptively quantizing across feature maps rather than within a feature map). Nevertheless, I'd like to share these early results to help the community determine whether this is a worthy path to explore.

You can find the CUDA implementation of the above adaptive quantization algorithm in:

May 22nd, 2019

Wars start when people fail to communicate with each other. The current U.S. and China dispute is so complex and far-reaching that any rational discussion online can devolve into a flame war. There are so many topics that the multi-variable optimization becomes difficult. Overlay all this with the gloomy long-term implications of technology, and it is far easier to just pick a side and root for the red / blue team.

The Gloomy Long-Term Implications of Technology

It is far easier for Bay Area people to think of themselves as a force for good. But the technology we developed over the past few years has greatly expanded central governments' abilities. It is too easy to track down a person and collect all their communication records for profiling and categorization. Alternative technologies to combat these implications, such as end-to-end encryption, can be easily outlawed at a government's will. It is pleasantly surprising that the United States resisted for so long. As Republicans give up their ideology completely for the totalitarian fantasy, the expansion of executive branch power will finally result in, not necessarily a president for life (although likely), but at the very least a one-party state. Whether it is Republican or Democrat is beside the point. Populists, on either the far right or the far left, come dangerously close in ideological terms. After all, the United States right now has a Republican president running an unprecedented fiscal deficit and issuing orders to anyone in the name of national security.

The Chinese have been playing the one-party-state game for a long time. The art of ruling lies in appeasing the many, allowing a few to vent, and exterminating anomalies. Digital technologies allow them to scale this up. With such surveillance power, the crime rate will fall, and so will freedom.

The Gear Up to a New War

When a new war begins to break out, both sides first stop talking with each other. The media on both sides seem to have agendas. In China, the media appeals to nationalistic honor and tries to remind the average Chinese of the past under western imperialism, from the Opium War to the Korean War. In the United States, the media paints China as an evil axis and tries to gain the moral high ground for the U.S. position. The sheer number of fanatics on both sides makes civil discussion impossible. It seems that the media are well-positioned to set up the war between the two powers.

What the United States Wants

The current trade war is difficult partly because the United States' demands are fairly opaque. It is a bag of things, ranging from the purely economical to the purely political. That is understandable, because the Trump administration is not known for making crisp, clear demands. There are feelings, numbers, and ideologies, all bagged together in the trade deal.

The Feelings: the United States feels that it has been in a one-sided relationship. In the past two decades, it benefitted the Chinese more. This can be seen in the stagnation of U.S. growth and the stellar growth of China. More specifically, the feeling can be seen in the broad ban on U.S. internet companies in China and the joint-venture requirements for any U.S. ventures into the Chinese domestic market. The great many made-in-China products mean fewer made-in-America ones. That, again, is attributed back to the stagnation of U.S. common people's fortunes for the past decade.

The Numbers: the United States sees the cold, hard trade imbalance as proof that the relationship is truly one-sided. If the Americans make less than the Chinese from this relationship, isn't that enough to prove the United States lost?

The Ideologies: to many Americans, Communist China is evil by its very prefix. Its behavior in Tibet, Xinjiang and the South China Sea is proof that the communists will go far to suppress opposition. Many years of propaganda in the United States attributed the end of the Cold War to the superiority of Capitalism over Communism (rather than, for example, of open government over authoritarian government).

This makes the U.S. demands unlikely to be simply economical. If the U.S. wanted a balanced trade, the problem should already have been solved last year. The Chinese want to buy from the U.S. to the extent of anything the U.S. wants to sell. Agricultural products rose from 0% to almost 20% of total U.S. exports to China in a little over a decade. There is a long list of things that the Chinese want to import but are banned for national security reasons.

Beyond the economic demands, the U.S. wants to fix the open-market problem. The Chinese were quick to extend an olive branch on that front with the 100% Tesla-owned factory in China, even with some Chinese investment.

The sticky points lie in the alleged IP theft, cyber warfare, and the humanitarian concerns. The Chinese were quick to promise. But the United States wants more than a promise.

What's China's Red Line

One misunderstanding in the U.S. media and discussions is how serious the Chinese are about sovereignty. There are many disputes in China about how the slow progress in implementing the open market hurts mutual trust within the WTO. During the interview with Ren Zhengfei on May 21st, he mentioned this as well. The humanitarian aspect of the current regime is another topic that resonates with many within China. However, imposing a U.S.-based overseeing body on the Chinese governing system is difficult for the Chinese to swallow. The sovereignty issue has been a big part of Chinese education over the past half century. The extraterritorial rights granted to westerners since the Opium War are something the Chinese will not forget.

The Endgame

With the United States being the only world superpower, it has the full range of options to play out the endgame. Given the unpredictability of the Trump administration, the trade war could end tomorrow with only lip service to appease the electoral base. It always comes back to how the United States views China in the long term. If the United States sees its role as containing China and sees China as the evil axis that endangers the U.S.-dominated world order, it should escalate fearlessly to a war with China while it can, and do what it is most familiar with (toppling the regime). The consequence of that is a far weaker, poorer China, with 1.5 billion people who cannot feed themselves. I wish to appeal to many of my American friends: this is an undesirable humanitarian dilemma.

Alternatively, the United States could fool itself into the sanction game. Even without coordinated efforts with Europe and Japan, sanctions from the United States would greatly damage the Chinese with limited negative impact on U.S. corporations. However, it is unlikely the United States will see a friendlier China that way. With the us-versus-them mentality, it is hard to imagine a pro-American regime being born. An inward-looking China will ultimately pose a greater threat than an outward-looking one.

The United States has to recognize that, short of a hot war, it needs to work with China. The shared-sovereignty request is not acceptable to either the regime or the people. On the other hand, if the United States wants a friendlier China, the demand should be a rule-based mechanism that enforces IP protection and the participation of foreign capital. The right to participate in made-in-China 2025 would also be a far more interesting play for the United States than forcing China to abandon it.

August 15th, 2018

When programming with CUDA, there are several ways to exploit concurrency for CUDA kernel launches. As explained in some of these slides, you can either:

  1. Create a thread corresponding to each execution flow, execute serially on one stream per thread, and coordinate with either cudaEventSynchronize or cudaStreamSynchronize;
  2. Carefully set up CUDA events and streams such that the correct execution flow will follow.

Option 2 seems more appealing to untrained eyes (you don't have to deal with threads!), but in practice, it is often error-prone. One of the major issues is that the cudaEventRecord / cudaStreamWaitEvent pair doesn't capture all synchronization needs. Compare this to the primitives Grand Central Dispatch provides: dispatch_group_enter / dispatch_group_leave / dispatch_group_notify. The under-specified part is where cudaEventEnter would happen. This often leads to the surprising fact that when you cudaStreamWaitEvent on an event not yet recorded on another stream (with cudaEventRecord), the current stream treats the event as if it has already happened and won't wait at all.

This is OK if your execution flow is static: which stream each kernel executes on is fully specified upfront. Does it require some careful arrangement? Yes, but it is doable. However, it all breaks down if some coordination needs to happen after some kernel computations are done. For example, based on the newly computed losses, to determine whether or not to decrease the learning rate. Generally speaking, for any computation graph that supports control structures, these coordinations are necessary.

The obvious way to solve this is to go route 1. However, that imposes other problems, especially given that pthread's handling of spawn / join leaves much to be desired.

For the few brave souls wanting to go route 2 to solve this: how?

Since CUDA 5.x, a new method, cudaStreamAddCallback, has been provided. This method itself carries some major flaws (before Kepler, cudaStreamAddCallback could cause unintended kernel launch serializations; the callback itself happens on the driver thread; and you cannot call any CUDA API inside that callback). But if we gloss over some of these fundamental flaws and imagine, here is how I could make use of it with the imaginary cudaEventEnter / cudaEventLeave pair.

At the point where I need to branch to determine whether to decrease the learning rate, before cudaStreamAddCallback, I call cudaEventEnter to say that an event needs to happen before a certain stream can continue. Inside the callback, I get the loss from the GPU, make the decision, and call cudaEventLeave on the right event to continue the stream I want to branch into.

In the real world, the above just cannot happen. We miss the cudaEventEnter / cudaEventLeave primitives, and you cannot make any CUDA API call inside such a callback. Moreover, the code will be complicated by these callbacks anyway (these are old-fashioned callbacks, not even lambda functions or dispatch blocks!).

What if I could write code as if it were all synchronous, but under the hood, it all happens on one thread, so I don't have to worry about thread spawn / join when just scheduling work from the CPU?

In the past few days, I've been experimenting with making coroutines work alongside cudaStreamAddCallback, and it seems to all work! Making this actually useful in NNC will probably take more time, but I just cannot wait to share this first :P

First, we need a functional coroutine implementation. There are a lot of stackful C coroutine implementations online, and mine borrows heavily from these sources. This particular coroutine implementation just uses makecontext / swapcontext / getcontext.

Setup basic data structures:

Setup a main run loop that can schedule coroutines:

Now, create a new task:

Usual utilities for coroutine (ability to yield, launch a new coroutine, and wait for existing coroutine to finish):

With above utilities, you can already experiment with coroutines:

Unsurprisingly, you should see the print-outs in this order:

Coroutine f is executed first; it launches coroutine g. When g gives up control (taskyield), coroutine f continues to execute until it finishes. After that, the scheduler resumes coroutine g, and it finishes as well.

You can also try taskwait(task, gtask) in coroutine f, to see that f will finish only after coroutine g has been scheduled again and run to completion.

So far, we have a functional coroutine implementation in C. Some of this code doesn't seem to make sense; for example, why do we need a mutex and a condition variable? Because a secret function that enables us to wait on a stream is not included above:

taskcudawait puts the current coroutine on hold until the said stream finishes. Afterwards, you can branch, knowing comfortably that the kernels in the stream above are all done. The condition variable and the mutex are necessary because the callback happens on the driver thread.

You can see the full code that demonstrated the usage here:

It seems the above utilities cover all my usages (taskwait and taskresume are important to me because I don't want too much hard-to-control asynchrony when launching sub-coroutines). I will report back if some of these don't hold and I fail to implement a fully-asynchronous computation graph with control structure support on top of these cute little coroutines.

May 3rd, 2018

NNC is a tiny deep learning framework I have been working on for the past three years. Before you close the page on yet another deep learning framework, let me quickly summarize why: starting from scratch enabled me to toy with some new ideas in the implementation, and some of these ideas, once implemented, have some interesting properties.

After three years, and given the fresh new takes on both APIs and the implementation, I am increasingly convinced this will also be a good foundation to implement high-level deep learning APIs in any host languages (Ruby, Python, Java, Kotlin, Swift etc.).

What are these fresh new takes? Well, before we jump into that, let's start with some not-so-new ideas inside NNC. Like every other deep learning framework, NNC operates on dataflow graphs. Data dependencies on the graph are explicitly specified. NNC also keeps the separation of symbolic dataflow graphs vs. concrete dataflow graphs. Again, like every other deep learning framework, NNC supports dynamic execution, which is called a dynamic graph in NNC.

With all that out of the way, the interesting bits:

  • NNC supports control flows, with a very specific while loop construct and multi-way branch construct;

  • NNC implements a sophisticated tensor allocation algorithm that treats tensors as a region of memory, which enables tensor partial reuse;

  • The above allocation algorithm handles control flows, eliminates data transfers for while loop, and minimizes data transfers for branching;

  • Dynamic execution in NNC is implemented on top of its static graph counterpart, thus, all optimization passes available for static graph can be applied when doing automatic differentiation in the dynamic execution mode;

  • Tensors used during dynamic execution can be reclaimed; there is no explicit tape session or requires_grad flag.

You can read more about it. Over the next few months, I will write more about this. There is still a tremendous amount of work ahead for me to get to the point of release. But getting ahead of myself and putting some pressure on is not a bad thing either :P The code lives in the unstable branch of libccv: ccv_nnc.h.

March 21st, 2018

Ten years ago, I began to post predictions about 4 years into the future. The principle of these predictions is simple: they are a combination of things we chatted about, things I read, and a stash of reasonable imagination. Later, to make this a bit more fun and educational, I would also map out the potential market and political environments before the predictions. With this setup, everything looks more systematic and professional. But to be honest, everything that is going to happen in the next few years has already been set in motion today. It won't be that entertaining to predict that Apple will design the 14th generation of the iPhone in 2022.

That being said, what would it look like in 2022, now that 2018 starts to unfold?

First, the elimination of poverty in China. In 2021, the 100th anniversary of the Communist Party of China, the leadership will announce that it has finished building the moderately prosperous society in all respects. For China, the moderately prosperous society in all respects is a measurable goal, and the end result is the elimination of poverty. In 2022, China's GDP per capita will reach 10,000 USD. If China cannot reach that goal, everything else is not very meaningful to predict, due to the resulting global instability.

The main theme of the 21st century is the decline of American power. But in these 4 years, we will only see occasional hints of it; this nation is, and will continue to be, the main player in the global economy and the major powerhouse for technology development.

There will be at least 5x improvement in raw computation power from the heterogeneous computing paradigm. A single chip can reach 0.5 Petaflops (full 32-bit floating point) by the end of 2022. On-device memory per GPU card can reach up to 48GiB. The outlook for mobile is not as rosy, however. In the next 4 years, the price of mobile systems-on-chip (SoC) will continue to go down, but the speed on traditional workloads will not improve much, at most 2x. More work will go into function-specific optimizations and feature integration.

Now that the grand scene has been set, what will happen next?

Cars. In 2022, most production electric cars and luxury vehicles will have level-3 autonomous driving capability. The middle class will drive more electric cars, while families with an annual income of less than $30,000 will continue to drive cars with internal-combustion engines. Although most electric cars on sale will have level-3 autonomous driving capability, there will be no viable after-market component for level-3 autonomous driving.

Level-3: No human attention needed in most highway and local environments. The system will alert the driver under certain conditions.

To many people's surprise, traditional car manufacturers who started early in the electric vehicle market won't have much first-mover advantage. Specifically, BMW's i-series sales numbers will plunge. Ford and Toyota, once considered latecomers to the market, will both have successful battery electric vehicles (total unit sales exceeding 100,000). Even so, globally, there will be two or three new but established all-electric car manufacturers. Tesla, which had some false starts in autonomous driving technology, finally gets to level-3 in 2020. Its Model 3 is either a huge success (200,000 unit sales per year) or a moderate one (80,000 unit sales per year). The most popular battery electric car? We probably haven't seen it yet, and it is likely to be a crossover between a minivan and a compact SUV.

High-speed railway. We've been talking about the high-speed railway from Mumbai to Hyderabad since 2013. At the end of 2022, it will be finished. The high-speed railway between San Francisco and Los Angeles? It probably won't even have broken ground.

With the stability of oil prices and the mass adoption of the new generation of airplanes, there will be more ultra-long-distance non-stop flights (more than 16 hours). There will be no regular supersonic commercial flights by 2022, though.

The HIV vaccine will hit the market in the next 4 years. This will probably be the single most widely known medical breakthrough of those 4 years. A lot of important breakthroughs often have minuscule starts: cheap brushless motors, SoCs, and cheap sensors, thanks to the ubiquity of smartphones. These gadgets become the important ingredients of why information agriculture now works. Especially in Asia, you will find some modern lean-production farms with high yields and highly consistent produce. Equipped with information technology, these farms will yield more than 2x their industrial-farm counterparts, closer to the yield of labor-intensive farming.

2017 was called the origin year of AR. However, even 4 years later, there will be no mass-market successful AR hardware product (more than 5 million total unit sales).

And it finally happens: Amazon, just before the end of 2022, starts deliveries with drones in certain areas of North America. These delivery drones, along with autonomous trucks on the interstate highways (one driver, many trucks), symbolize the beginning of the elimination of low-paying jobs.

Lucky for all of us, after an economic downturn, Bitcoin will stop being an investment vehicle.

June 1st, 2016

Decades have passed, and we have yet to have high-quality consumer software. It is now taken for granted that software is supposed to be crashy, laggy, and barely functional. Why and how did we get here? When the question is asked, many people feel nostalgia for a time when software was simpler and people crafted their cathedrals. They often overlook the fact that the software we build today is many orders of magnitude more complex than the software we had in the 1980s. Even today's software with the simplest table operations, with its graphical user interface combining complex animations and multi-touch interactions, would require many months of developer time if built from scratch.

From what I can remember, the concept of quality was popularized in the 1970s, starting from Japan. In the 1970s, through the quest for quality, the Japanese auto industry reached a level of low cost that its American competitors could only dream of.

March 23rd, 2016


September 27th, 2015






1). Under monetary and economic policy control, the economy successfully lands, and China's GDP growth drops to 4.5% to 5.5% per year. Overall, China's fiscal position will be more balanced, and as the global manufacturing hub, the integrated production efficiency of its system will be hard to match even for economies with low-cost labor. In this scenario, China deservedly becomes an emerging developed country with a GDP per capita of around $9,000 to $10,000.

2). In the next two years, China's GDP growth drops to a fatal 4.5% or lower. Economic and monetary policies prove ineffective amid massive capital outflow and the runaway capital growth since 2008. Social unrest turns out to be easier than imagined. Local governments will be overwhelmed dealing with various riots, while the central government may open a dialogue with opposition leaders. What happens next becomes unpredictable.









  • For smart hardware, its original function should become extremely foolproof: effortless, one-shot, no-thinking operation;

  • Beyond the basic function, smart hardware should solve, with limited means, a few "pain points" of prior use (a good example is a router that automatically downloads content from the cloud; a fridge that automatically orders food is a bad one);

  • In the next four years, smart hardware at home is unlikely to be something entirely new.


Despite regional conflicts and instability, transportation overall becomes more cost-efficient. For ground transportation, autonomous driving or driver-assistance technology becomes standard on new cars, but is still far from a mandatory requirement. Abu Dhabi's PRT will not succeed in the Middle East, but similar vehicles will be commercialized in some cities. Innovation in long-distance transportation in the United States still has not left the experimental stage. Moreover, some long-distance non-stop flights will be canceled due to rising costs. Commercial transportation will be slower and more expensive (although more cost-efficient).







It has been 4 years since the last prediction, which was for the year 2016. My original plan was to draft a prediction every 2 years, scoped to the next 4 years. Gates once said, we always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten. A decade ago, having a computing device as small as a palm with Pentium 4 computational power was unimaginable. Even 8 years ago, it was difficult for us to build an all-in-one TV with high-end PC capability.

Reviewing the Predictions of the Past 4 Years

The predictions of the past 4 years have been largely accurate. The biggest premise, economic stability, has held thanks to all the unusual fiscal policies; otherwise such predictions could hardly have been believable at all. Reviewing the predictions I made 4 years ago: Internet connection speed, the unfortunate market share of 3D TV, television on demand, computational power, driving assistance (self-driving), and photography technology have matched reality pretty well. However, wireless power, the merging of pads and ultrabooks, commercial supersonic flight, the unemployment rate, and artificial intelligence have been off quite a bit. I made no predictions about unmanned aerial vehicles. Overall, some of these predictions were too optimistic, and some were simply ignorant.

The Economic / Social Outlook for the Next 4 Years

However, it is harder to make predictions for the next 4 years on the same premise of social / economic stability. Globally, a slowdown in economic growth will be a given. In contrast, the United States will be the least affected, due to the dominance of the dollar in the global economy. In Europe, it is unlikely that the economic situation in Spain, Greece, and other Mediterranean countries will get any better. As slowly as politics move, the possibility of one or several countries exiting the euro zone becomes ever more real. Under this gloomy environment, however, Japan's outlook has improved marginally after several scheduled tax hikes. The tricky bit is China. China will likely take one of two paths:

1). Its GDP will land at around 4.5% to 5.5% year-over-year growth in the next 4 years, after a controlled turbulent landing engineered with a fine mix of fiscal / monetary stimulus. Overall, the fiscal sheet will be more balanced, and as the world's manufacturer, China will integrate more efficiency into its system, making it hard for others to compete on the efficiency front even with much lower labor costs. This is China as a newly-minted developed nation, sitting comfortably among the rest of the developed nations with a GDP per capita between $9,000 and $10,000.

2). Its GDP growth will fall to 4.5% or below within the next 2 years, which will be considered fatal. Fiscal and monetary tools will prove ineffective due to large amounts of capital outflow, as well as the general loosening of capital controls after 2008. Social uprisings will turn out to be much easier to spark than expected. Regional governments will struggle to contain the unrest, and the central government will likely hold several rounds of negotiations with opposition leaders; it becomes impossible to predict what happens afterwards.

For the sake of making any progress on this prediction, I will pick option 1 for China as the background for the next 4 years. If option 2 turns out to be closer to reality, it nullifies all the predictions I am going to make below.

As for India, for lack of systematic knowledge of that area, it is hard for me to predict its impact on the global technology and economic outlook. For Russia and the Middle Eastern oil-producing countries, the assumption is that oil will float around $40 to $100 per barrel, and that Russia's economy will struggle regardless, due to the greater volatility in the oil price.

The Basis of Any Predictions

The success of any prediction, if at all, rests on past patterns. For the past 100 years or so, the pattern has been the capture and interpretation of exponential growth. Countless books and talks have emphasized the fascination of exponential growth. However, by applying exponential growth without an underlying understanding of technological principles, we risk hitting some fundamental laws of physics and making no progress at all (and on the other hand, a premature prejudice of "understanding" the fundamental limits of physics can be fatal too).

Exponential growth is made possible only by two key terms: standardization and economies of scale. The modern marvel of this kind is the iPhone. Without the iPhone's scale, a modern high-resolution screen with capacitive touch would cost thousands of dollars per square inch to manufacture. But now, everyone gets a modern high-resolution touchscreen for a few bucks.

These two key words will manifest themselves in many forms, and will continue to work wonders in the next 4 years.

The Prediction

Smart hardware has been around for more than 10 years. But what makes sense as "smart hardware"?

  • It makes the basic functionalities we assume of that hardware a no-brainer: smooth, one-touch, perfect and care-free integration;

  • It extends beyond the basic functionalities, but operates under well-defined principles (good example: a router that caches cloud content and makes access instantaneous; bad example: a refrigerator that orders food for you);

  • It is unlikely to be something completely new.

Then, there is the un-PC era. In the next 4 years, homes will rarely own desktop computers, even though the aggregate processing power in a single-family house can easily exceed 10 TFLOPS. There is a change of interface too: people now interact with these devices by touching or talking, and graphical interfaces get a meaningful conversational retouch.

Despite potential conflicts and regional instability, transportation will be more cost-effective. In land transportation, self-driving or smarter driving assistance will be a standard add-on in newly shipped vehicles, though far from becoming a mandatory standard. The Abu Dhabi PRT was a failure in the Middle East, but similar transportation services will run commercially in some cities. The next generation of long-distance land transportation is still in the experimental phase in the United States. Moreover, some of the longest commercial flights will be cancelled due to cost. Commercial transportation is going to be more expensive, and slower.

The entertainment industry gets a big boost in times of recession. People still spend disproportionate time in front of big televisions, but the "cutting-the-cord" movement will happen much faster than expected. In the United States, cable viewership among 15-to-35-year-olds will drop at a rate of 10% to 20% year over year, and the decline will accelerate. Today's top TV show numbers (5 million viewers at the premiere) will hold steady, but shows with 2 to 3 million premiere viewership will see a drop to 1 million or less. In the United States, online streaming players will ink deals with major sports leagues for exclusive online streaming rights. People will spend more than 3 hours a day on streaming services, either on television or on their mobile devices.

The sharing economy is not going the way you would expect. At its core, the sharing economy moves assets off the balance sheet of a company such as Airbnb or Uber and bumps up its profitability. In boom times, asset-light companies can move fast and painlessly shed less profitable businesses. In down times, these companies will try to own more assets while asset prices are cheap. However, the most popular way for them to do so will not be outright purchase. Instead, they will launch financing programs to help their sharing-economy workers own these assets, leaving the risk of asset depreciation to the workers.

Mobile messaging services will consolidate. Respectable players in messaging will reach 300 million daily active users and have at least 2 billion messages sent per day. Any player that cannot reach that benchmark will be dead. There will be only 3 to 4 major players in that space, if not fewer. All the messaging services will be able to make audio and video calls, which will continue to marginalize the phone call business of traditional phone service providers. In the United States at least, more than one online-based business will enter the ISP business. The speed of the Internet will continue to improve: home Internet speed globally will average 100Mbps, and global mobile Internet speed will average 10Mbps. Specifically, mobile Internet service in Middle and Southern Africa will reach an average of 500Kbps. In other words, as long as you can pay, with your cellphone you can have a semi-stable Internet connection and be able to make video calls anywhere in the world except Antarctica.

Cost-effectiveness is penetrating medical equipment. With the lower cost of processing power and the general application of machine learning techniques in signal processing, popular and essential medical equipment will become cheap and versatile enough to be delivered even to the most remote areas on Earth. The profound impact will be a global lift in life expectancy.

Virtual reality gear will gain traction in many more homes. It is still struggling to find its killer application, but on average, units shipped per year will be around 30 million globally by the end of 2019. Industrial robots will replace more human labor, which is a good thing for China. The privatization of space technology continues: one or more private companies will accomplish at least one low-orbit manned mission.

Thus, the year 2020 will not yet be the worst of times for humanity.

August 16th, 2015







August of '09, in a small rented room in the Bay Area. Early mornings to another small room in Sunnyvale, lunch in Castro, and back to that small room in Mountain View around nine at night. Then, the first trip to Charlottesville.





