- Create a thread for each execution flow, execute serially on one stream per thread, coordinate with either
- Carefully set up CUDA events and streams such that the correct execution flow will follow.
The second option seems more appealing to untrained eyes (you don't have to deal with threads!) but in practice it is often error-prone. One of the major issues is that the cudaStreamWaitEvent pair doesn't capture all synchronization needs. Compared with the primitives Grand Central Dispatch provides: dispatch_group_notify, the under-specified part is where the cudaEventEnter happens. This leads to a surprising fact: when you cudaStreamWaitEvent on an event not yet recorded on another stream (with cudaEventRecord), the current stream will treat the event as if it has already happened and won't wait at all.
This is OK if your execution flow is static, that is, which kernels need to be executed on which stream is fully specified upfront. Does it require some careful arrangement? Yes, but it is doable. However, it all breaks down if some coordination needs to happen after some kernel computations are done. For example, deciding whether to decrease the learning rate based on the newly computed losses. Generally speaking, for any computation graph that supports control structures, this kind of coordination is necessary.
The obvious way to solve this is to go route 1. However, that imposes other problems, especially given that pthread's handling of spawn / join leaves much to be desired.
For the few brave souls wanting to go route 2 to solve this, how?
Since CUDA 5.x, a new method, cudaStreamAddCallback, is provided. This method itself carries some major flaws (before Kepler, cudaStreamAddCallback could cause unintended kernel launch serializations; the callback itself happens on the driver thread; and you cannot call any CUDA API inside that callback). But if we gloss over some of these fundamental flaws and imagine, here is how I could make use of it with the imaginary
At the point where I need to branch to determine whether to decrease the learning rate, before cudaStreamAddCallback, I call cudaEventEnter to say that an event needs to happen before a certain stream can continue. Inside the callback, I get the loss from the GPU, make the decision, and call cudaEventLeave on the right event to continue the stream I want to branch into.
In the real world, the above just cannot happen. We miss cudaEventLeave primitives, and you cannot make any CUDA API call inside such a callback. Moreover, the code will be complicated with these callbacks anyway (these are old-fashioned callbacks, not even lambda functions or dispatch blocks!).
What if I can write code as if it is all synchronous, but under the hood, it all happens on one thread, so I don't have to worry about thread spawn / join when just scheduling work from the CPU?
In the past few days, I've been experimenting with how to make coroutines work along with cudaStreamAddCallback, and it all seems to be working! Making this actually useful in NNC will probably take more time, but I just cannot wait to share this first :P
First, we need to have a functional coroutine implementation. There are a lot of stackful C coroutine implementations online, and my implementation borrows heavily from these sources. This particular coroutine implementation just uses
Set up the basic data structures:
Set up a main run loop that can schedule coroutines:
Now, create a new task:
Usual utilities for coroutine (ability to yield, launch a new coroutine, and wait for existing coroutine to finish):
With above utilities, you can already experiment with coroutines:
Unsurprisingly, you should be able to see print outs in order of:
Coroutine f executes first and launches coroutine g. When g gives up control (taskyield), coroutine f continues to execute until it finishes. After that, the scheduler resumes coroutine g, and it finishes as well.
You can also try taskwait(task, gtask) in coroutine f, to see that f will finish only after coroutine g is scheduled again and runs to completion.
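To make the f / g ordering described above concrete, here is a minimal sketch of a stackful coroutine scheduler. It is not the implementation from the gist: it assumes Linux with glibc's ucontext functions, and every name in it (co_new, co_resume, co_yield, schedule, run_demo) is made up for illustration. It only reproduces the "launch g from f, g yields, f finishes, scheduler resumes g" behaviour.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define MAX_TASKS 16

typedef struct task task_t;
typedef void (*task_fn)(task_t *self);

struct task {
    ucontext_t ctx;     /* saved execution state of this coroutine */
    ucontext_t *caller; /* where to switch back to on yield or finish */
    task_fn fn;
    char *stack;
    int done;
};

/* FIFO of coroutines that yielded and wait for the scheduler. */
static task_t *ready[MAX_TASKS];
static int ready_head, ready_tail;
static task_t *current; /* coroutine running right now, NULL in the scheduler */
static char trace[64];  /* execution order, large enough for this demo only */

static void mark(const char *s) { strcat(trace, s); }

/* First function every coroutine runs on its own stack. */
static void entry(void) {
    task_t *self = current;
    self->fn(self);
    self->done = 1;
    swapcontext(&self->ctx, self->caller); /* never returns here */
}

static task_t *co_new(task_fn fn) {
    task_t *t = calloc(1, sizeof(*t));
    assert(t != NULL);
    t->stack = malloc(STACK_SIZE);
    assert(t->stack != NULL);
    t->fn = fn;
    getcontext(&t->ctx);
    t->ctx.uc_stack.ss_sp = t->stack;
    t->ctx.uc_stack.ss_size = STACK_SIZE;
    t->ctx.uc_link = NULL;
    makecontext(&t->ctx, entry, 0);
    return t;
}

/* Switch into t right away. Returns 1 if t finished (and was freed), 0 if it yielded. */
static int co_resume(task_t *t) {
    ucontext_t here;
    task_t *prev = current;
    t->caller = &here;
    current = t;
    swapcontext(&here, &t->ctx);
    current = prev;
    if (!t->done) return 0;
    free(t->stack);
    free(t);
    return 1;
}

/* Give control back to whoever resumed us and queue ourselves for later. */
static void co_yield(task_t *self) {
    ready[ready_tail++ % MAX_TASKS] = self;
    swapcontext(&self->ctx, self->caller);
}

/* Main run loop: keep resuming queued coroutines until none are left. */
static void schedule(task_t *first) {
    ready[ready_tail++ % MAX_TASKS] = first;
    while (ready_head != ready_tail)
        co_resume(ready[ready_head++ % MAX_TASKS]);
}

static void g(task_t *self) {
    mark("g1 ");
    co_yield(self);
    mark("g2 ");
}

static void f(task_t *self) {
    (void)self;
    mark("f1 ");
    co_resume(co_new(g)); /* g runs immediately until it yields */
    mark("f2 ");
}

/* Runs the demo and returns the recorded order, "f1 g1 f2 g2 ". */
const char *run_demo(void) {
    trace[0] = '\0';
    ready_head = ready_tail = 0;
    schedule(co_new(f));
    return trace;
}
```

Calling run_demo() from a main function and printing the result shows f starting, g running up to its yield, f finishing, and only then g being resumed by the run loop. Resuming the child directly (instead of only queueing it) is what keeps the launch order predictable in this sketch.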
So far, we have a functional coroutine implementation in C. Some of this code doesn't seem to make sense; for example, why do we need a mutex and a condition variable? Because a secret function that enables us to wait on a stream is not included above:
taskcudawait will put the current coroutine on hold until the said stream finishes. Afterwards, you can branch, knowing comfortably that the kernels in the stream above are all done. The condition variable and the mutex are necessary because the callback happens on the driver thread.
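The reason for the mutex and condition variable can be sketched without a GPU. In the sketch below, a plain pthread stands in for the CUDA driver thread, and all names (run_loop_t, fake_driver_thread, wait_for_completion, run_wait_demo) are hypothetical. With real CUDA, the body of fake_driver_thread would presumably live inside the callback registered through cudaStreamAddCallback; that part is an assumption and is not shown.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int completed; /* finished "streams" the scheduler has not consumed yet */
} run_loop_t;

/* Stand-in for the stream callback. It runs on a different thread than the
 * scheduler, so it only touches shared state while holding the mutex. */
static void *fake_driver_thread(void *arg) {
    run_loop_t *loop = arg;
    pthread_mutex_lock(&loop->mutex);
    loop->completed++;
    pthread_cond_signal(&loop->cond);
    pthread_mutex_unlock(&loop->mutex);
    return NULL;
}

/* Called on the scheduler thread when no coroutine is runnable:
 * sleep until some callback reports a completion. */
static void wait_for_completion(run_loop_t *loop) {
    pthread_mutex_lock(&loop->mutex);
    while (loop->completed == 0) /* loop guards against spurious wakeups */
        pthread_cond_wait(&loop->cond, &loop->mutex);
    loop->completed--;
    pthread_mutex_unlock(&loop->mutex);
}

/* Returns 1 once the scheduler side has observed the completion, -1 on error. */
int run_wait_demo(void) {
    run_loop_t loop;
    pthread_t driver;
    int resumed = 0;
    int rc = pthread_mutex_init(&loop.mutex, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&loop.cond, NULL);
    assert(rc == 0);
    (void)rc;
    loop.completed = 0;
    if (pthread_create(&driver, NULL, fake_driver_thread, &loop) != 0)
        return -1;
    wait_for_completion(&loop);
    resumed = 1; /* a real scheduler would resume the parked coroutine here */
    pthread_join(driver, NULL);
    pthread_cond_destroy(&loop.cond);
    pthread_mutex_destroy(&loop.mutex);
    return resumed;
}
```

The callback side does nothing but flip shared state and signal, which fits the restriction that no CUDA API can be called inside the callback; all real work stays on the scheduler thread.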
You can see the full code that demonstrates the usage here: https://gist.github.com/liuliu/7366373d0824a915a26ff295c468b6e4
It seems the above utilities would cover all my usages (taskresume is important to me because I don't want too much hard-to-control asynchrony when launching sub-coroutines). I will report back if some of this doesn't hold and I fail to implement a fully asynchronous computation graph with control structure support using these cute little coroutines.
NNC is a tiny deep learning framework I have been working on for the past three years. Before you close the page on yet another deep learning framework, let me quickly summarize why: starting from scratch enables me to toy with some new ideas on the implementation, and some of these ideas, once implemented, have some interesting properties.
After three years, and given the fresh new takes on both the APIs and the implementation, I am increasingly convinced this will also be a good foundation for implementing high-level deep learning APIs in any host language (Ruby, Python, Java, Kotlin, Swift etc.).
What are these fresh new takes? Well, before we jump into that, let's start with some not-so-new ideas inside NNC: Like every other deep learning framework, NNC operates on dataflow graphs. Data dependencies on the graph are explicitly specified. NNC also keeps the separation of symbolic dataflow graphs vs. concrete dataflow graphs. Again, like every other deep learning framework, NNC supports dynamic execution, which is called dynamic graph in NNC.
With all that out of the way, the interesting bits:
- NNC supports control flows, with a very specific while loop construct and multi-way branch construct;
- NNC implements a sophisticated tensor allocation algorithm that treats tensors as a region of memory, which enables tensor partial reuse;
- The above allocation algorithm handles control flows, eliminates data transfers for while loop, and minimizes data transfers for branching;
- Dynamic execution in NNC is implemented on top of its static graph counterpart, thus, all optimization passes available for static graph can be applied when doing automatic differentiation in the dynamic execution mode;
- Tensors used during the dynamic execution can be reclaimed, there is no explicit tape session or
You can read more about it on http://libnnc.org/. Over the next few months, I will write more about this. There is still a tremendous amount of work ahead for me to get to the point of a release. But getting ahead of myself and putting some pressure on is not a bad thing either :P The code lives in the unstable branch of libccv: ccv_nnc.h.
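Ten years ago, I started writing a blog post every two or three years with predictions about four years into the future. The principle behind the predictions is simple: reasonable conjecture based on conversations with people and the news I read every day. Later, I also began to properly describe the broad cultural, market and political environment first, and then make my guesses. This setup makes a lot of it look better reasoned. But at the root of it, the reason is simply that we already know most of what everyone is going to do in the next few years. Saying that Apple will produce the 14th-generation iPhone four years from now: what is there to guess?
First, the elimination of poverty in China. In 2021, the 100th anniversary of the founding of the Party, China's leaders will announce that a moderately prosperous society has been built in all respects. In the Chinese context, this goal is clearly measurable: it means the elimination of poverty. In 2022, China's GDP per capita will exceed 10,000 US dollars. Any other scenario is not very meaningful to consider, because if China cannot build a moderately prosperous society in all respects in 2021, the whole international environment will be very unstable, and the predictions that follow will naturally be full of holes.
Computing performance will still see large gains in heterogeneous computing: the computing power of a single machine will approach 0.5 Petaflops (full 32-bit floating point), and GPU memory will reach 48GiB per card. Mobile CPU / GPU development, meanwhile, finally hits a performance bottleneck. Over the next 4 years, their prices will continue to fall, but performance will improve only by about 2x. Most of the work on mobile chips will go into specialized optimization and feature integration.
Somewhat surprisingly, the traditional car manufacturers that started early in the electric vehicle industry have not gained much of a head start. Specifically, sales of BMW's i-series will become more and more disappointing, while Ford and Toyota, widely believed to have missed the chance to launch battery electric cars first, both launch successful battery electric models (total sales above 100,000 units). Even so, two or three new electric car companies will have grown up globally. Tesla, despite stumbling on autonomous driving, reaches level-3 autonomous driving in 2020. And the Model 3 is either a huge success (annual sales above 200,000 units) or a moderate success (annual sales above 80,000 units). The real hit may be a hybrid of minivan and compact SUV.
Although 2017 was billed as the first year of AR, even four years later there will be no mass-market AR hardware product (total sales of a single product above five million units).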
Ten years ago, I began to post predictions about 4 years into the future. The principle of these predictions is simple: a combination of things we chatted about, things I read, and a stash of reasonable imagination. Later, to make this a bit more fun and educational, I would also map out the potential market and political environment before the prediction. With this setup, everything now looks more systematic and professional. But to be honest, everything that is going to happen in the next few years has already been set in motion today. It won't be that entertaining to predict that Apple will design the 14th generation of iPhone in 2022.
That being said, what will 2022 look like, now that 2018 is starting to unfold?
First, the elimination of poverty in China. In 2021, the 100th anniversary of the Communist Party of China, the leadership in China will announce that they have finished building the moderately prosperous society in all respects. For China, the moderately prosperous society in all respects is a measurable goal, and the end result is the elimination of poverty. In 2022, China's GDP per capita will reach 10,000 USD. If China cannot reach that goal, everything else is not very meaningful to predict due to the global instability.
The main theme of the 21st century is the decline of American power. But in these 4 years, we will only see occasional hints of it; this nation is and will continue to be the main player in the global economy, and the major powerhouse for technology development.
There will be at least a 5x improvement in raw computation power from the heterogeneous computing paradigm. A single chip can reach 0.5 Petaflops (full 32-bit floating point) by the end of 2022. On-device memory per GPU card can reach up to 48GiB. The outlook for mobile is not as rosy, however. In the next 4 years, the price of mobile system-on-chip (SoC) will continue to go down, but the speed on traditional workloads will not improve much: 2x at most. More work will go into function-specific optimizations and feature integration.
Now that the grand scene has been set, what will happen next?
Cars. In 2022, most production electric cars and luxury vehicles will have level-3 autonomous driving capability. The middle class will now drive more electric cars, while families with annual income less than $30,000 will continue to drive cars with internal-combustion engines. Although most electric cars on sale have level-3 autonomous driving capability, there is no viable after-market component for level-3 autonomous driving.
Level-3: No human attention needed in most highway and local environments. The system will alert the driver under certain conditions.
To many people's surprise, traditional car manufacturers who started early in the electric vehicle market don't have much first-mover advantage. Specifically, BMW's i-series sales numbers will plunge. Ford and Toyota, who were once considered late-comers to the market, now both have successful battery electric vehicles (total unit sales exceeding 100,000). Even so, globally, there will be two or three new but established all-electric car manufacturers. Tesla, which had some false starts in autonomous driving technology, finally gets to level-3 in 2020. Its Model 3 is either a huge success (200,000 unit sales per year) or a moderate one (80,000 unit sales per year). The most popular battery electric car? We probably haven't seen it yet, and it is likely to be a cross-over between a minivan and a compact SUV.
High-speed railway. We've been talking about the high-speed railway from Mumbai to Hyderabad since 2013. At the end of 2022, it will be finished. The high-speed railway between San Francisco and Los Angeles? It probably won't even have broken ground.
With the stability of oil prices and the mass use of new-generation airplanes, there will be more ultra-long-distance non-stop flights (more than 16 hours). There will be no regular supersonic commercial flights by 2022 though.
The HIV vaccine will hit the market in the next 4 years. This will probably be the single best-known medical breakthrough in those 4 years. A lot of important breakthroughs often have minuscule starts. Cheap brushless motors, SoCs, and cheap sensors, all available thanks to the ubiquity of smartphones, become the important ingredients of why information agriculture now works. Especially in Asia, you will find some modern lean production farms with high yield and high-quality produce. Equipped with information technology, these farms have more than 2x the yield of their industrial-farm counterparts, closer to the yield of small labor-intensive farming.
2017 was called the first year of AR. However, even after 4 years, there will be no successful mass-market AR hardware (more than 5 million total unit sales).
And it finally happens: Amazon, just before the end of 2022, starts deliveries with drones in certain areas of North America. These delivery drones, along with autonomous trucks on the interstate highways (one driver, many trucks), symbolize the beginning of the elimination of low-paying jobs.
Lucky for all of us, after an economic downturn, Bitcoin will stop being an investment vehicle.
Decades have passed and we have yet to see high-quality consumer software. It is now taken for granted that software is supposed to be crashy, laggy and barely functional. Why and how did we get here? When the question is asked, many people feel nostalgia for a time when software was simpler and people crafted their cathedrals. They often overlook the fact that the software we build today is many orders of magnitude more complex than the software we had in the 1980s. Even for today's software with the simplest tableau operations, the graphical user interface, combined with complex animations and multi-touch interactions, requires many months of developer time if built from scratch.
As far as I can remember, the concept of quality was popularized in the 1970s from Japan. In the 1970s, through the quest for quality, the Japanese auto industry reached a level of low cost that its American competitors could only dream of.
It has been 4 years since the last prediction for the year 2016. My original plan was to draft a prediction every 2 years, scoped for the next 4 years. Gates once said, we always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten. A decade ago, having computing devices as small as a palm with Pentium 4 computational power was unimaginable. Even 8 years ago, it was a difficult feat for us to build an all-in-one TV with high-end PC capability.
Review the Prediction of the Past 4 Years
The prediction of the past 4 years has been accurate. The biggest promise, economic stability, has been kept with all the unusual fiscal policies; otherwise such predictions could hardly be believable at all. Reviewing the prediction I made 4 years ago, Internet connection speed, the unfortunate market share of 3D TV, television on demand, computational power, driving assistance (self-driving), and photography technology have matched the reality pretty well. However, the predictions for wireless power sources, Pads and ultrabooks merging, commercial supersonic flight, the unemployment rate, and artificial intelligence have been off quite a bit. There were no predictions on unmanned aerial vehicles. Overall, some of these predictions were too optimistic, and some were simply ignorant.
The Economic / Social Outlook for the Next 4 Years
However, it is harder to predict the next 4 years on the same promise of social / economic stability. Globally, the economic growth slowdown will be a given. In contrast, the United States will be the least affected due to the dominance of the dollar in the global economy. In Europe, it is unlikely the economic situation in Spain, Greece and other Mediterranean countries will get any better. As slow as politics go, the possibility of one or several countries exiting the euro-zone becomes ever more real. However, in this gloomy environment, Japan's outlook improves marginally after several scheduled tax hikes. The tricky bit is China. China will likely take one of two paths:
1). Its GDP will land at around 4.5% to 5.5% yoy growth in the next 4 years. This is after a controlled, turbulent landing, with a finely tuned mix of fiscal / monetary stimulus. Overall, the fiscal sheet is more balanced, and as the world's manufacturer, China integrates more efficiency into its system, so it is harder to compete with it on the efficiency front even with much lower labor costs. This is China as a newly minted developed nation, sitting comfortably among the rest of the developed nations with GDP per capita between $9,000 and $10,000.
2). Its GDP growth will land at 4.5% or even below in the next 2 years, which will be considered fatal. Fiscal and monetary tools seem ineffective due to large amounts of capital outflow, as well as loosened control over capital in general after 2008. Social uprising turns out to be much easier than expected. Regional governments would find it hard to contain the unrest, and the central government would likely have several rounds of negotiations with opposition leaders. It becomes impossible to predict what would happen afterwards.
For the sake of making any progress on this prediction, I will pick the China option 1 as the background for the next 4 years. If option 2 turns out to be closer to the reality, it nullifies all the predictions I am going to make below.
India: for lack of systematic knowledge of that area, it is hard for me to predict the impact of India on the global technology and economic outlook. For Russia and the Middle-East oil-producing countries, the assumption will be that oil will float between $40 and $100 per barrel, and Russia's economy will struggle nevertheless due to the higher volatility in the oil price.
The Basis of Any Predictions
The success of any prediction, if there is any, comes from looking at past patterns. For the past 100 years or so, that has meant capturing and interpreting exponential growth. Numerous books and talks have emphasized the fascination of exponential growth. However, by applying exponential growth without an underlying understanding of the technological principles, we risk hitting some fundamental laws of physics and making no progress at all (and on the other hand, a premature prejudice of "understanding" the fundamental limits of physics can be fatal too).
Exponential growth is made possible only by two key terms: standardization and economies of scale. The modern marvel of this kind is the iPhone. Without the scale of the iPhone, a modern high-resolution screen with capacitive touch would cost thousands of dollars per square inch to manufacture. But now, everyone gets a modern high-resolution touchscreen for a few bucks.
These two key terms will manifest themselves in many forms, and will continue to work wonders in the next 4 years.
Smart hardware has been around for more than 10 years. But what makes sense as "smart hardware"?
- It makes the basic functionalities we assume about that hardware a no-brainer. Smooth, one touch, perfect and care-free integration;
- It extends beyond the basic functionalities, but operates under well-defined principles (good example: a router that caches cloud content and makes access instantaneous; bad example: a refrigerator that orders food for you);
- It is unlikely to be something completely new.
Then, there is the un-PC era. In the next 4 years, homes will rarely own any desktop computers, even though the aggregated processing power in a single-family house can easily reach more than 10Tflops. There is a change of interface too. People now interact with these devices by either touching or talking. The graphical interfaces now have a meaningful conversational re-touch.
Despite the potential conflicts and regional instability, transportation will be more cost-effective. In terms of land transportation, self-driving or smarter driving assistance will be a standard add-on in newly shipped vehicles. However, it is far from becoming a mandatory standard. The Abu Dhabi PRT was a failure in the Middle-East, but similar transportation services will run commercially in some cities. The next generation of long-distance land transportation is still in the experimental phase in the United States. Not only that, some of the longest commercial flights are cancelled due to the cost. Commercial transportation is going to be more expensive, and slower.
The entertainment industry gets a big boost in times of recession. People still spend a disproportionate amount of time on the big television. The movement of "cutting-the-cord" will happen much faster than expected. Cable viewership among 15 to 35 year olds in the United States will drop at an accelerating rate of 10% to 20% year over year. Today's top TV show numbers (5m viewers at the premiere) will keep steady. But shows with 2m to 3m premiere viewership will see a drop to 1m or less. In the United States, online streaming players will ink deals with major sports and have exclusive rights to stream online. People will spend more than 3 hours a day on streaming services, either on television or on their mobile devices.
The shared economy is not going the way you would expect. At its core, the shared economy moves the assets off the balance sheet of companies such as AirBnb or Uber and bumps up their profitability. In boom times, asset-light companies can move fast and quickly get rid of less profitable businesses painlessly. In down times, these companies will try to own more assets as asset prices are all cheap. However, the most popular way for them to do so will not be outright purchase. Instead, they will launch finance programs to help their shared economy workers own these assets, and leave the risk of asset depreciation to them.
Mobile messaging services will consolidate. Respectable players in messaging will reach 300m daily active users, and have at least 2b messages sent per day. Any player that cannot reach that mark will be dead. There will be only 3 to 4 major players in that space, if not fewer. All the messaging services will have the ability to make audio and video calls, which will continue to marginalize the phone call business of traditional phone service providers. In the United States at least, more than one online-based business will enter the ISP business. The speed of the Internet will continue to improve. Home Internet speed globally will average 100Mbps. Global mobile Internet speed will average 10Mbps. Specifically, the mobile Internet service in Middle / South Africa will reach an average of 500Kbps. In other words, as long as you can pay, with your cellphone, you can have a semi-stable Internet connection and will be able to do video calls anywhere in the world except Antarctica.
Cost-effectiveness is penetrating medical equipment. With the lower cost of processing power and the general application of machine learning techniques in signal processing, popular and essential medical equipment will reach a point where it is cheap and versatile enough to be delivered even to the most remote areas on Earth. The profound impact will be a global lift in life expectancy.
Virtual reality gear will gain traction in many more homes. It is still struggling to find its killer applications. But on average, shipped units per year will be around 30m globally at the end of 2019. Industrial robots will replace more human labor, which is a good thing for China. Privatization of space technology continues. One or more private companies will accomplish at least one low-orbit manned mission.
Thus, it is not the worst time of humanity yet for the year of 2020.
Birthdays are often joyless for me. I've yet to find in them an excuse to celebrate. But the clock is ticking, and every time, it kicks me hard on the back this day of the year. It is always a thing for me to accomplish something, to set a goal, and work towards it. But looking back, I've done nothing tangible, not to mention any worthy cause. When I die, I die.
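Silicon Valley always talks about making the world a better place. When I was a bit younger, that got my blood pumping too. But now, I only want to touch lives. So be it one, or two, or many. Yet having grown up this much, apart from my parents, nobody would care whether I am alive or dead. And over the past three or four years, I have not accomplished anything. Life is not about a house, a car or a pack of children. What I want, is to make beautiful objects and put these into people's hands.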
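Last Thursday night, I went to dinner with colleagues in a particularly sketchy neighborhood. The restroom was upstairs; going up after a full meal, I found a row of "THIS PLACE IS HAUNTED" warnings above the urinals. I hurried back down, and received the sad news that my grandmother had passed away. Less than 30 hours later, I was back in my hometown, downstairs at the old house.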