May 31st, 2020

At times, a year can be a day and a day can be a year. After the past several months since January 2020, it feels far-fetched nowadays to preach how awesome the human world is. But progress cannot be stalled, and the problems cannot solve themselves through inaction.

So, good people continue to publish great works. The recent GPT-3 model shows some interesting abstract reasoning capability, though I still have some trouble grasping how in-context conditioning works with such a big model. It is interesting enough that a language model, trained on trillions of words, can demonstrate some basic arithmetic and logical understanding.

That’s probably because we have too many words, to name things, to show emotions, to communicate abstract concepts, and to describe this beautiful world. That is also why I am sad when people erase meanings from words. And when that happens, a small part of our shared consciousness dies.

When communism becomes simply satanic, who needs to read The Road to Serfdom to understand how central planning by so-called leaders can become the tyranny of dictatorship?

But here we are. Anti-fascism is the foundation of the post-WWII order. From Leningrad to Paris, from Saigon to Washington D.C., on both sides of the Berlin Wall, if there is one thought that connects people together, that is never-again for fascism.

Today, one man tries to erase one word. They chanted: war is peace, freedom is slavery, anti-fascism is fascism.

I am not a native English speaker. Sometimes I have trouble differentiating between the hundreds of ways to insert please. But I love to learn new words. English is so powerful precisely because we shamelessly borrow words. We say chutzpah to describe how gutsy someone is. We call a person a kaizen master to praise the continuous improvement efforts they put in place. The meaning of a word can change, but all these meanings become historical footnotes. Words establish a common ground for us to start understanding each other.

Maybe there is a future where we speak in sequences of 128d vectors, arguing over the L1 or L2 distances between them. It could be high-bandwidth. Or there may be a future where we learn from videos rather than words (and short videos can certainly be educational: https://youtu.be/sMRRz1J9RkI). But today, a compact word is still our best way to communicate big ideas, to grasp a complex reality that can sometimes be overwhelming.

And that is why literacy in words has been, for the past 70 years, our best defense against colonialism, extremism and fascism.

Use Wikipedia, go read the footnotes, don’t trivialize a word. Our world depends on it.

April 15th, 2020

I have some of my fondest memories of Visual Basic in 1998. In January, I enlisted myself in a project to revive the fun part of programming. There are certain regrets in today's software engineering culture, where we put up heavy facades to enforce discipline. Fun was lost as a result.

With Visual Basic, you could create a GUI and start to hack on a program in no time. You wrote directives and the computer obeyed. There were certain weirdnesses in the syntax and some magic in how everything worked together. But it worked: you could write and distribute a decent app that ran on Windows with it.

When planning my little series the fun part of programming, there was a need to write cross-platform UI outside of Apple's ecosystem in Swift. I picked Swift because of its progressive disclosure nature (it is the same as Python, but there are other reasons why not Python, discussed earlier in that post). However, the progressive disclosure ends when you want to do any UI work. If you are in Apple's ecosystem, you have to learn that a program starts when you have an AppDelegate, a main Storyboard and a main.swift file. On other platforms, the setup is completely different, if it exists at all.

That’s why I spent the last two days experimenting with whether we can have a consistent and boilerplate-free cross-platform UI in Swift. Ideally, it should:

  • Have a consistent way to build a GUI app from Swift source, no matter what platform you are on;
  • Support progressive disclosure. You can start with a very simple app and the GUI will show up as expected;
  • Be retained-mode, so it matches the majority of UI paradigms (on Windows, macOS, iOS and Android), making it easier for someone to progress to real-world programming;
  • Still allow you to code up an event loop, which is essential for building games.

After some hacking and experimenting, here is what a functional GUI app that mirrors whatever you type looks like:

import Gui

// Build the main window: a button, a label and a multiline text input
// nested inside a child panel.
let panel = Panel(title: "First Window")
let button = Button(title: "Click me")
let text = Text(text: "Some Text")
panel.add(subview: button)
panel.add(subview: text)
let childText = TextInput(title: "Text")
childText.multiline = true
let childPanel = Panel(title: "Child Panel")
childPanel.add(subview: childText)
panel.add(subview: childPanel)

// Clicking the button opens a second window that mirrors the input.
button.onClick = {
  let panel = Panel(title: "Second Window")
  let text = Text(text: "Some Text")
  panel.add(subview: text)
  text.text = childText.text
  childText.onTextChange = {
    text.text = childText.text
  }
}

You can use the provided build.sh to build the above source on either Ubuntu (requires sudo apt install libglfw3-dev and Swift 5.2.1) or macOS (requires Xcode 11):

./build.sh main.swift

and you will see this:

(Screenshots: Ubuntu Swift GUI; macOS Swift GUI.)

Alternatively, you can build an event loop all by yourself (rather than use callbacks):

import Gui

let panel = Panel(title: "First Window")
let button = Button(title: "Click me")
let text = Text(text: "Some Text")
panel.add(subview: button)
panel.add(subview: text)

var onSwitch = false
var counter = 0
while true {
  if button.didClick {
    onSwitch = !onSwitch
  }
  if onSwitch {
    text.text = "You clicked me! \(counter)"
  } else {
    text.text = "You unclicked me! \(counter)"
  }
  counter += 1
  Gui.Do() // yield to the GUI subsystem once per iteration
}

In fact, the Gui.Do() method is analogous to Visual Basic's DoEvents, which yields control back to the GUI subsystem.

The cross-platform bits leverage the wonderful Dear imgui library. Oddly enough, starting with an immediate-mode GUI library makes it easier to implement a retained-mode GUI subsystem that supports custom run loops well.
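
To illustrate how that can work, here is a minimal sketch of my own (not the actual library code; ImGuiText is a stand-in for whatever Dear imgui binding sits underneath, and Widget/Label are hypothetical names):

// A retained-mode tree over an immediate-mode API: widgets are plain
// objects that hold state, and every frame the whole tree is replayed
// into immediate-mode draw calls.
class Widget {
  var subviews = [Widget]()
  func add(subview: Widget) { subviews.append(subview) }
  func draw() { subviews.forEach { $0.draw() } }
}

final class Label: Widget {
  var text: String // retained state; mutate it at any time
  init(text: String) { self.text = text }
  override func draw() {
    ImGuiText(text) // re-emitted from retained state on every frame
    super.draw()
  }
}

Because the whole tree is re-emitted every frame anyway, a user-owned event loop only needs a single call per iteration (Gui.Do() above) to pump events and replay the tree.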

You can see the proof-of-concept at https://github.com/liuliu/imgui/tree/swift/swift. Enjoy!

March 17th, 2020

There are some articles referring to the past 20 years as the modern Gilded Age. While the Gilded Age is an American-specific reference, it nevertheless carries a certain mark of early-20th-century western culture. However, the characterization of the Gilded Age, particularly railroads vs. the internet, comes across to me as lazy. The fanatic obsession with a literal bits-over-atoms comparison can delight certain readers, but by and large it fails to account for the scientific breakthroughs we accomplished in the physical world during the past 30 years. The modern Gilded Age narrative happily talks about income inequality and wealth gaps between the people and the super rich. While the class struggle is real, this rather American-centric view doesn't help frame the issues in many other places during Globalization. The last Gilded Age ended with a Great War. With the rising tide of nationalism today, non-cooperation between nations is a greater threat than another war.

The ruthless Globalization of the past century gave us a more interconnected world. This world brings new hopes, poses brand new challenges and unveils new ugliness in ourselves. The obsession with analogies to the old can only blind us to the challenges at hand and impede the new progress we have to make.

The Bits over Atoms Fallacy

The typical Silicon Valley talk often revolves around bits versus atoms. It immediately draws a cheap analogy between the railroad giants and today's big tech corps. While the analogy is apparently appealing (railroads to fibers!), the modern-day big tech corps don't build the digital railroads. They assimilate, curate and disseminate information to you and of you. Today, the seemingly boundless "bits" are a result of the all-powerful service-based industries in the United States. These service-based industries only matter because of the ever-growing greed for profit margin. Globally, wealth is more diversely distributed than just the finance and technology sectors. The top 10 richest people in the United Kingdom own very different things compared with their counterparts in the United States.

Technologies continue to shape our lives in important ways. But sweeping all the advances under the rug of digitalization wishfully ignores innovations in new materials, new manufacturing processes and even the dearest new chip production. The fallacy of bits over atoms emphasizes the form (bits) over the underlying function enablers (atoms), while it is the rapid progress of those underlying function enablers that drives us to new heights. Hopefully, over the next few years, people will rediscover atoms as the moat for their competition again.

The Income Inequality Fallacy

The wealth gap analysis from Capital in the Twenty-First Century is well-known. In the United States, the wealth gap was trivialized as income inequality. The politicians are either too lazy or too dumb to discuss the difference between the wealth gap and income inequality. While it is true that in Europe, income inequality was derived from wealth gaps (income generated from wealth far outpaced income generated from labor), professional managers are the new class in the United States that drives income inequality.

The inequality of income as a signature symptom of the Gilded Age draws another analogy to modern times. The dynamics of wealth destruction and creation in the United States, however, make it another lazy attempt to paint over a different problem. Besides the income generated from wealth inherited through generations, both the career path of a professional manager and the wealth creation through destructive innovation can generate outsized income for people in the game. When the means of production were outsourced to other nations, inherited wealth can only be a limited argument to explain income inequality.

While the wealth gap in the United States becomes a pure social-structure play, the income inequality framing is a fallacy because it willingly ignores that the trouble is the have-nots, not the haves. What prevents us from lifting the living standard for all as a society? Why can the haves deprive the have-nots of resources? We as a society have only recently left the world of scarcity (with the Green Revolution?), and it is doubtful that the artificial scarcity we created through privileges (luxury goods) will be the resource we deprive the have-nots of. As a society, we ought to have smarter solutions than going straight to a fallacy derived from the old world of scarcity.

The more interesting question is about the developing world. If the United States has entered the society of abundance, why has the rest not? Does the United States live in abundance because of technology breakthroughs, or on the back of its world dominance? These questions can find their references in historical context, but we need a new perspective to lift the living standard for everyone on planet Earth.

The Great War Fallacy

The Gilded Age ended with WWI, and soon after came the Great Depression and WWII. It has never been clear to me why people, especially men, have this enthusiasm for global warfare. Regional conflicts and ongoing tit-for-tat military operations will continue to be what they are. The Great War fallacy is deeply rooted in the belief that military conflicts can be an effective means to advance one superpower's objective against another. With the recent uprising of nationalism and xenophobia, this outdated view finally finds some consolation. It is all too familiar to draw an analogy of a new rising world power (Germany) against the old (Britain). That being said, mutually-assured destruction, Globalization and the belief in a rule-based system are still very much alive in our modern world. However, the biggest fallacy of the new Great War is that there is nothing to bet on. There will be no world after another Great War. Thus, believing in this fallacy has no relevance to our daily lives.

Although a war is meaningless to speculate on, non-cooperation between superpowers could erode the progress we have made so far. The soil for xenophobia and nationalism is richer than ever. While the elites' dream of Globalization is very much alive, the answer to why, and an appeal to our better nature, is desperately needed.

With a worldwide plague, an all-out trade war, and the great fires of Climate Change, in these trying times, seeking a better tomorrow for our modern world could not be more critical. Analogies to old times are slides for an old man to swipe through on a sunny afternoon; they are intellectual laziness. We ought to navigate our time more thoughtfully, with hope and widened eyes. Doomsayers can always be right, but the future belongs to the dreamers.

January 29th, 2020

Yesterday, I was listening to an interview of Jonathan Blow by the Oxide Computer people on my way back to San Francisco. There were the usual complaints about how today's programmers bury themselves in piles of abstractions. As a game programmer himself, Jonathan Blow also shared some fond memories of programming basic games in his childhood. All that kept me thinking: how uninteresting today's programming books are! They start with some grand concepts. Compilers! Deep learning! GPGPU programming! SQL! Databases! Distributed systems! Or with some kind of framework. React! iOS! TensorFlow! Elasticsearch! Kubernetes! Is it really that fun to learn some abstractions people put up? Seriously, what is the fun in that?

Over the years, I learned that there are two kinds of people. The first loves to create programs that people can use. They take joy from people using the programs they create. The magic satisfaction comes from observing people using this thing. The second loves to program. They take joy from solving the interactive puzzle through programming. The fact that the program can do non-obvious tasks by itself is enjoyable, no matter whether those tasks have practical use or not. As for myself, I love to solve puzzles, and to understand every last detail about how these puzzles are solved. At the same time, I take pride in people using the software I built.

My earliest memories of programming came from Visual Basic and Delphi. These were RAD (Rapid Application Development) tools back in the late 1990s. They were integrated environments for shoot-and-forget programming. They were not the best at helping you understand the ins and outs of computer architecture. To some extent, they were not even that good at developing efficient programs. But there are two things they did really well: 1. it was really easy to jump in and write some code to do something; 2. things you made could be shared with others and ran on most Windows computers like the "real" applications would. At that time, there were a lot of magazines and books that taught you how to make useful things. From a simple chat room to a remake of the Breakout game, you could type in the code and it would run! Then there were spiritual successors. Flash later evolved into Flex Builder, which used Java-like syntax but preserved the spirit of the RAD environment. As of the late 2000s, you could build a SWF file and it would run almost everywhere. There were millions of amazing games built with Flash / Flex Builder by amateurs that now live in our collective online memory.

Writing iOS apps in the 2010s gave me somewhat similar feelings. But the wheel moved on. Nowadays, we have MVVM / VIPER / RIB patterns. We have one-way data flow and React. We invented concepts to make programming more robust and productive in industrial settings with these abstractions. But the fun part was lost.

That is why this year, I plan to write a series to remind people how fun it is to program. It won't be a series about frameworks and patterns. We will pick the simplest tools available. We will write code in different languages if that is what's required. We will maintain state in globals when that makes sense. We will write the simplest code to do fun things. It will work, and you can show it to your friends and distribute it as if it was made by professionals.

I would like to cover a broad range of topics, but mostly, just practical things you can build. There certainly will be a lot of games. Some of the arrangements I have in mind, in this particular order:

  • A Street Fighter-like game. Introduces you to game loops, animation playback, keyboard events and coordinate systems.
  • A remake of Super Mario 1-1 with a level editor. With physics simulation, mouse events and data persistence.
  • A chat room with peer-to-peer connections over the internet. Introduces the idea of in-order message delivery and the need for protocols.
  • Remake Super Mario into a multiplayer side-scrolling game like Contra (NES). (This may be too much plumbing; I need to feel it out after the first 3 chapters.)
  • Chess, and the idea of searching.
  • Online Chess with people or more powerful computers.
  • Secure communication through RSA and AES.
  • Why you shouldn't implement your own secure protocols (show a few hacks and defenses around the protocols above).
  • Geometry and exploring the 3D of DOOM. Introduces the graphics pipeline. I am not certain whether to introduce the GPU or not at this point.
  • Face recognition with a home security camera. Introduces convolutional networks and back-propagation. Likely using a simple network trained on CIFAR-10, so everything will be on the CPU.
  • Convolutional networks and Chess, a simple RL.

There are many more topics I'd like to cover. I would like to cover some aspects of natural language processing through machine translation, with either RNN or Transformer models. It is however challenging if I don't want to introduce GPGPU programming. I also would like to cover parsers, and a little bit of persistent data structures. But there are really no cool applications for these at the moment. A raytracer would be interesting, but beyond producing images that look kind of real, it is hard to fit into the schedule. Implementing a virtual machine, likely something that can run NES games, would be fun, but that is something I haven't yet done and don't know how much plumbing it requires.

All the projects will be built with no external dependencies. We are going to build everything from scratch. It should run on most platforms with a very simple dependency I built, likely some kind of Canvas / Communication API. This is unfortunate but due to several factors: 1. We don't have a good cross-platform render API except HTML / JavaScript / TypeScript. 2. Most of our devices are now behind NAT and cannot talk to peers through IP addresses. The Canvas API will provide simple controls as well, such as text input boxes and scroll views. That also means the API will pretty much be in retained mode.

For the tool of choice, it has to be a language that professionals use. There are quite a few candidates nowadays. Python, Ruby, Julia, Swift and TypeScript are all reasonable choices. TypeScript has excellent cross-platform capability and I wouldn't really need to do much for the Canvas API. Python and Ruby both have libraries you can leverage for the Canvas API and Communication. However, I want to do a bit more raw numeric programming. For speed, Python, Ruby and TypeScript are just not that great. Yes, there is numpy / numba, but what is the fun if I start to call numpy, PyTorch and millions of other Python packages to do anything and everything for me? For Julia, I would simply need to build too much myself to even get started.

There are many downsides to Swift too. For one, I still need to build a ton to support Windows and Linux. The language itself is too complicated, especially with weak references and automatic reference counting. Luckily, early on, Swift subscribed to the progressive disclosure philosophy. I will try to avoid most of the harder concepts in Swift such as generics, protocols and nullability. I will try to delay the introduction of weak references as late as possible. Whenever there is a choice between struct and class, I will go with class until there is a compelling reason to introduce struct in some chapters. I also don't think I need to introduce threads or GCD. This probably depends on whether I can come up with an intuitive Communication API.

For the platforms to run on, I will prioritize macOS, Windows 10 and Ubuntu Linux on the Jetson Nano. Keyboard and mouse will still be assumed as the main input devices. The Jetson Nano is a particularly interesting device because it is the cheapest to run with some GPGPU programming capability. I am not certain whether I want to introduce that concept. But having the flexibility is great.

Interested?

December 27th, 2019

The past 3 months ought to be the best 3 months for television shows. As a prolific TV viewer, I found the shows from the past 3 months all impressive and well-executed. Some of this is due to better technologies. Some of it we can probably thank the investment coming from the streaming war for. The pouring in of money and talent certainly worked. On top of all this, one of the most interesting turns in the past few months is the prolific international shows on Netflix. From Better than Us, Kingdom, The Untamed to Ad Vitam, Netflix is uniquely positioned to provide some international flavors to U.S. viewers.

For All Mankind

The new Apple TV+ show gives off a good 90s vibe, from when we still had The West Wing and Star Trek TNG. The future is hopeful, and the leaders are altruistic. It sets itself apart from the more recent twisted and darker themes, from The Walking Dead to Breaking Bad. We had too many of those this decade.

The Expanse Season 4

I'll be honest: this season is a bit dense and I am still working through it. But hey, it is back! No matter how few space operas we made this decade, or how dead the genre is to many, somehow we made the best space opera yet with The Expanse (or did we?).

The Witcher

This one is a surprise to me. Some of Netflix's recent dump of the fantasy / sci-fi genre, such as The Umbrella Academy and Lost in Space, were not exactly runaway successes. I liked Altered Carbon, but not many people share that view. But The Witcher has to be one of the best game adaptations Netflix, or anyone, has made so far. To someone with no background in the novels or the games, it is easy to consume. The 3 split timelines in the beginning are not that hard to follow, and they merge relatively quickly. It has the right mix of independent stories and the main storyline. Compared with other fantasy series such as Game of Thrones, the background is not as grandiose, but it is comparably interesting. Compared to Grimm, with its similar folklore origins, the character development is simply miles better.

The Mandalorian

Who says space opera as a genre is dead again? The Mandalorian, if it keeps its current momentum, will certainly cement a good position for Disney+ in the streaming war. It has some weaker episodes, but the storyline is kept on the right track. Baby Yoda is a quick attention-grabber, but the Mandalorian himself starts to develop some very interesting backstory and character. Besides the story, which always gives me an easy and enjoyable Friday night, the special effects are also superb. It is generally bright (Tatooine does have 2 suns), which contrasts with other sci-fi series often shot in darker setups (in general, fewer light sources are easier to render), probably thanks to Disney+'s budget. The new movie-like special effects quality is less about awing you and more about keeping the storytelling unimpeded. I believe many viewers don't really notice the special effects at all.

To put the cherry on top, all the shows above now support 4K HDR (with the exception of The Expanse, which I think only does 4K). TV streaming nowadays is such a great experience if you can afford it and have a Gbps internet connection :) Hope you all enjoy these great TV shows as much as I do this holiday season, and keep warm!

December 26th, 2019

Grand Central Dispatch is the de-facto task-based parallelism / scheduling system on macOS / iOS. It has been open-sourced as libdispatch and ported to many platforms including Linux and FreeBSD.

libdispatch has been designed to work closely with the Clang extension: Blocks. Blocks is a simple, yet powerful function closure implementation that can implicitly capture variables to facilitate the design of task-based parallelism systems.

That choice imposed some constraints when designing the QoS classification system for libdispatch. Blocks' metadata is internal to Clang. It would leave a bad taste if we were required to modify Clang in order to add Block-based QoS information. It is interesting to discover how libdispatch engineers overcame these design dilemmas.

There are also some limitations to the Blocks QoS API. We cannot inspect the QoS assignment of a given block. That makes certain wrappers around libdispatch APIs challenging. For example, we cannot simply put up a wrapper to account for how many blocks we executed, like this:

#include <dispatch/dispatch.h>
#include <stdatomic.h>

static atomic_int executed_count;

void my_dispatch_async(dispatch_queue_t queue, dispatch_block_t block) {
    // The wrapping closure is a brand-new block; it carries none of
    // the QoS metadata of the block passed in.
    dispatch_async(queue, ^{
        ++executed_count;
        block();
    });
}

The above could have unexpected behavior because the new block doesn't carry over the QoS assignment of the block passed in. For all we know, that block could be wrapped with dispatch_block_create_with_qos_class. Specifically:

dispatch_block_t block = dispatch_block_create_with_qos_class(DISPATCH_BLOCK_ENFORCE_QOS_CLASS, QOS_CLASS_USER_INITIATED, 0, old_block);

If dispatched, this block would lift the underlying queue's QoS to user-initiated. However, with our wrapper my_dispatch_async, the QoS assignment will be stripped.

We would like at least a way to copy the QoS assignment over to the new block. This requires inspecting libdispatch internals.

What is a Block?

Blocks is the function closure implementation from Clang that works across Objective-C, C and C++. Under the hood, a block is really just a function pointer to a piece of code, with some variables from the calling context copied over. Apple conveniently provides a header that specifies the exact layout of the Block metadata in memory:

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/BlocksRuntime/Block_private.h#L59

// ...
struct Block_descriptor_1 {
    unsigned long int reserved;
    unsigned long int size;
};
// ...
struct Block_layout {
    void *isa;
    volatile int32_t flags; // contains ref count
    int32_t reserved; 
    void (*invoke)(void *, ...);
    struct Block_descriptor_1 *descriptor;
    // imported variables
};
// ...

The first two fields just so happen to match the Objective-C object memory layout. This facilitates the requirement of Objective-C compatibility, especially with ARC. The whole Block is moved to the heap along with the imported variables in one allocation. Thus, if you have the pointer to the block metadata, you can already inspect captured variables, provided you know the exact order of their capture.

At runtime, when a block is called, the generated code restores the captured variables, and then casts and invokes block->invoke as if it were a normal function.
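
To make that concrete, here is a minimal standalone sketch of my own (not from libdispatch) that invokes a block by hand through this layout; it assumes the definitions above are accurate and compiles with clang -fblocks:

#include <stdint.h>
#include <stdio.h>

// Mirror of the layout from Block_private.h above.
struct Block_descriptor_1 {
    unsigned long int reserved;
    unsigned long int size;
};
struct Block_layout {
    void *isa;
    volatile int32_t flags;
    int32_t reserved;
    void (*invoke)(void *, ...);
    struct Block_descriptor_1 *descriptor;
};

int main(void)
{
    int captured = 42;
    void (^block)(void) = ^{ printf("captured %d\n", captured); };
    struct Block_layout *layout = (struct Block_layout *)block;
    // The block itself is the first argument to invoke; the captured
    // variables live right after the metadata in the same allocation.
    layout->invoke(layout);
    return 0;
}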

The Additional Block Metadata

As we can see, the Block_layout is relatively tight, with not much space for additional block metadata. How did libdispatch engineers find the extra space for the QoS information?

The answer lies in another indirection:

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/block.cpp#L113

dispatch_block_t
_dispatch_block_create(dispatch_block_flags_t flags, voucher_t voucher,
		pthread_priority_t pri, dispatch_block_t block)
{
	struct dispatch_block_private_data_s dbpds(flags, voucher, pri, block);
	return reinterpret_cast<dispatch_block_t>(_dispatch_Block_copy(^{
		// Capture stack object: invokes copy constructor (17094902)
		(void)dbpds;
		_dispatch_block_invoke_direct(&dbpds);
	}));
}

dispatch_block_create and dispatch_block_create_with_qos_class ultimately call into this private _dispatch_block_create function.

It captures a particular variable dbpds, which contains numerous fields, onto the block, and then invokes the actual block directly.

As we saw in the previous section, it is relatively easy to inspect the captured variables if you know the actual layout. It just so happens that we know the layout of struct dispatch_block_private_data_s exactly.
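
For reference, this is an abridged sketch of that struct, paraphrased from libdispatch's private headers (illustrative only; field names and order may differ between libdispatch versions, so check the source before relying on it):

// Abridged, illustrative view of dispatch_block_private_data_s.
struct dispatch_block_private_data_s {
	unsigned long dbpd_magic;             // 0xD159B10C, checked later
	dispatch_block_flags_t dbpd_flags;
	unsigned int volatile dbpd_atomic_flags;
	int volatile dbpd_performed;
	pthread_priority_t dbpd_priority;     // the QoS we want to copy
	voucher_t dbpd_voucher;
	dispatch_block_t dbpd_block;
	// ... group / queue / thread bookkeeping fields follow.
};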

Copying QoS Metadata

Back to the previously mentioned my_dispatch_async implementation. If we want to maintain the QoS metadata, we need to copy it over to the new block. Now that we have a clear skeleton, there are only a few implementation details left.

First, we cannot directly inspect the captured variables.

It is straightforward to cast (struct dispatch_block_private_data_s *)((uint8_t *)block + sizeof(Block_layout)) and then check the fields. However, there is no guarantee that a passed-in block is always wrapped with the dispatch_block_create method. If a passed-in block happens to contain no captured variables, you may access an out-of-bounds memory address.

The way libdispatch implements this is to first check the invoke function pointer. If the block is wrapped with dispatch_block_create, it will always point to the same function inside the block.cpp implementation. We can find this function pointer at link time like libdispatch does, or we can find it at runtime.

typedef void (*dispatch_f)(void*, ...);
dispatch_f dispatch_block_special_invoke()
{
    static dispatch_once_t onceToken;
    static dispatch_f f;
    dispatch_once(&onceToken, ^{
        // Every dispatch_block_create-wrapped block shares the same
        // invoke pointer, so a throwaway block reveals it.
        f = ((__bridge struct Block_layout *)dispatch_block_create(DISPATCH_BLOCK_INHERIT_QOS_CLASS, ^{}))->invoke;
    });
    return f;
}

Second, we need to deal with runtime changes. We don't expect libdispatch to change its internals dramatically; however, it is better to be safe than sorry. Luckily, struct dispatch_block_private_data_s has a magic number to compare notes with. We can simply check dbpds->dbpd_magic to guard against library updates and corruption.

Finally, we can assemble our my_dispatch_async method properly.

static atomic_int executed_count;

void my_dispatch_async(dispatch_queue_t queue, dispatch_block_t block) {
    dispatch_block_t wrapped_block = ^{
        ++executed_count;
        block();
    };
    struct Block_layout *old_block_layout = (__bridge struct Block_layout *)block;
    // Only blocks wrapped by dispatch_block_create share this invoke pointer.
    if (old_block_layout->invoke == dispatch_block_special_invoke()) {
        wrapped_block = dispatch_block_create(DISPATCH_BLOCK_INHERIT_QOS_CLASS, wrapped_block);
        struct Block_layout *wrapped_block_layout = (__bridge struct Block_layout *)wrapped_block;
        struct dispatch_block_private_data_s *old_dbpds = (struct dispatch_block_private_data_s *)(old_block_layout + 1);
        struct dispatch_block_private_data_s *wrapped_dbpds = (struct dispatch_block_private_data_s *)(wrapped_block_layout + 1);
        // Verify the magic number before touching fields, in case the
        // internal layout has changed underneath us.
        if (old_dbpds->dbpd_magic == 0xD159B10C) {
            wrapped_dbpds->dbpd_flags = old_dbpds->dbpd_flags;
            wrapped_dbpds->dbpd_priority = old_dbpds->dbpd_priority;
        }
    }
    dispatch_async(queue, wrapped_block);
}

This new my_dispatch_async wrapper will now respect the QoS assignment of the block passed in. You can check this by dispatching a block created with dispatch_block_create and observing the executed QoS with qos_class_self().

Closing Thoughts

The implementation of QoS in dispatch blocks is quite ingenious. However, it does present challenges outside of the libdispatch scope. This implementation is specialized to dispatch_block_t-typed blocks; you cannot simply extend it to other types of blocks. I am particularly unhappy that dispatch_block_create is not a generic function such that any given block, parameterized or not, can have its QoS wrapped and somehow respected (for example, taking its QoS out and assigning it to a plain dispatch_block_t when you do the dispatch_async dance).

Implementing your own QoS-carrying block this way would be quite painful. Each parameterized block would require a specialized function that carries the QoS information. You probably could do that with C macro hackery, but that would get ugly quickly. You'd be better off having an object that takes both the block and the QoS information plainly than trying to be clever and embedding the QoS information into the block.

December 1st, 2019

I discussed a stackful coroutine implementation to coordinate CUDA streams last year.

That was an implementation based on the swapcontext / makecontext APIs. Increasingly, as I thought about porting nnc over to WASM, this became problematic because these APIs are more or less deprecated. Popular libc implementations such as musl don't implement these methods.

After that article, it became obvious that I cannot swapcontext into the internal CUDA thread (that thread cannot launch any kernels). Thus, the real benefit of such a stackful coroutine is really convenience. Writing a coroutine that way is no different from writing a normal C function.

This is the moment where C++ makes sense. The coroutine proposal in C++20 is a much better fit. The extra bits of compiler support just make it much easier to write.

If we don't use swapcontext / makecontext, the natural choice is either longjmp / setjmp or the good old Duff's device. It is a no-brainer to me to come back to Duff's device. It is simple enough and the most platform-agnostic way.

There are many existing stackless coroutines implemented in C. The most interesting one using Duff's device is Protothreads. To me, the problem with Protothreads is its inability to maintain local variables. Yes, you can allocate additional state by passing in additional parameters. But it can quickly become an exercise in drifting away from a simple stackless coroutine to one with all the bells and whistles of structs for parameters and variables. You can declare everything as static, but that is certainly not going to work beyond the most trivial examples.
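
For context, here is a minimal sketch of my own of the Duff's-device trick these libraries build on (illustrative; not code from Protothreads or from my repo below):

#include <stdio.h>

// The switch jumps back to the line recorded at the last yield.
typedef struct { int line; } coro_state_t;

#define CORO_BEGIN(st) switch ((st)->line) { case 0:
#define CORO_YIELD(st, v) do { (st)->line = __LINE__; return (v); \
	case __LINE__:; } while (0)
#define CORO_END(st) } (st)->line = -1; return -1

// Yields 0, 1, 2 on successive calls, then -1.
static int counter(coro_state_t *st)
{
	// Locals would not survive across yields here; that is exactly
	// the Protothreads limitation discussed above.
	CORO_BEGIN(st);
	CORO_YIELD(st, 0);
	CORO_YIELD(st, 1);
	CORO_YIELD(st, 2);
	CORO_END(st);
}

int main(void)
{
	coro_state_t st = { 0 };
	for (int i = 0; i < 4; i++)
		printf("%d\n", counter(&st));
	return 0;
}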

I spent this weekend sharpening my C macro skills on how to write the most natural stackless coroutine in C. The implementation preserves local variables. You can declare the parameters and return values almost as naturally as you would write normal functions.

Here is an example of how you can write a function-like stackless coroutine in C:

static co_decl_task(ab_t, _coroutine_a, (const int a, const int b), private(
	int i;
)) {
	printf("param a %d\n", CO_P(a));
	printf("param b %d\n", CO_P(b));
	CO_V(i) = 2;
	printf("%d\n", CO_V(i));
	co_yield((ab_t){
		.a = CO_V(i)
	});
	CO_V(i) += 1;
	printf("param b %d\n", CO_P(b));
	printf("%d\n", CO_V(i));
	co_yield((ab_t){
		.a = CO_V(i)
	});
	co_return((ab_t){
		.a = 10
	});
} co_end()

static co_decl_task(int, _coroutine_b, (), private(
	co_routine_t* task_a;
	ab_t a0;
	ab_t a1;
	ab_t a2;
)) {
	CO_V(task_a) = co_new(_coroutine_a, (12, 10));
	co_resume(CO_V(task_a), CO_V(a0));
	co_resume(CO_V(task_a), CO_V(a1));
	co_resume(CO_V(task_a), CO_V(a2));
	printf("returned value %d %d %d\n", CO_V(a0).a, CO_V(a1).a, CO_V(a2).a);
	co_free(CO_V(task_a));
} co_end()

int main(void)
{
	co_scheduler_t* scheduler = co_scheduler_new();
	co_routine_t* const task = co_new(_coroutine_b, ());
	co_schedule(scheduler, task);
	co_free(task);
	co_scheduler_free(scheduler);
	return 0;
}

co_decl_task declares both the interface and the implementation. You can also separate the interface into a header file with co_decl and the implementation with co_task. In this case, the static keyword continues to work to scope the coroutine to file-level visibility. Take a look at this:

static co_decl_task(ab_t, _coroutine_a, (const int a, const int b), 

The first parameter is the return type, then the function name, then the parameters; it all feels very natural compared to C functions. Local variables have to be declared within the private block; that's the only catch.

To access parameters and local variables, you have to wrap the access with the CO_P / CO_V macros; otherwise it is the same.

Of course, there are a few more catches:

  1. No variadic parameters;
  2. No variable length local arrays;
  3. No void; () means exactly that in the parameter list, and you can simply omit the return type if you don't need one.

There is no magic really, just some ugly macros that hide away the complexity of allocating parameters / local variables on the heap and such.
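
To give a flavor of what they hide (a rough sketch of my own, not the repo's actual expansion; all names here are made up):

// Each coroutine gets a heap-allocated frame holding its resume point,
// parameters and private locals; CO_P / CO_V become field accesses.
typedef struct {
	int _co_line;                  // resume point for the Duff's device switch
	struct { int a, b; } params;   // what CO_P(a), CO_P(b) would reach
	struct { int i; } privates;    // what CO_V(i) would reach
} _coroutine_a_frame_t;

// e.g. #define CO_P(x) (_co_frame->params.x)
//      #define CO_V(x) (_co_frame->privates.x)

Because the frame lives on the heap rather than the C stack, the values survive across yields, which is exactly what plain Protothreads cannot offer.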

There are examples in the repo that show the usage of co_resume, co_await, co_apply, co_yield, co_decl, co_task, co_decl_task and co_return in various forms. You can check out more there: https://github.com/liuliu/co

Currently, I have a single-threaded scheduler. However, it is not hard to switch that to a multi-threaded scheduler, with the catch that you cannot maintain the dependencies as a linked list, but rather as a tree.

It was a weekend exercise, and I don't expect to maintain this repo going forward. Some form of this will be ported into nnc.

Closing Thoughts

In theory, swapcontext / makecontext can support much more complex interactions between functions, such that an extra scheduler object is not needed. For what it's worth, Protothreads also doesn't have a central scheduler. But in practice, I found it still miles easier to have a scheduler like libtask does. Tracking and debugging are much easier with a central scheduler, especially if you want to make it multi-thread safe as well.

August 2nd, 2019

To train large deep neural networks, you need a lot of GPUs and a lot of memory. That is why a Titan RTX card costs more than 3 times as much as an RTX 2080 Ti with just a bit more tensor cores. It has 24GiB of memory, and that makes a lot of models much easier to train. More memory also means bigger batch sizes. Many GPU kernels run faster with larger batch sizes. If we can somehow reduce the memory footprint at training time, we can train bigger models, and we can train faster with larger batch sizes.

There are methods to reduce memory footprints. It is a no-brainer nowadays to use fp16 for training. Other than that, many of today's memory reduction techniques are derivatives of binomial checkpointing, a well-known technique in the automatic differentiation community. DNN-specific details need to be considered: results of cheap operations such as batch normalization or ReLU can be dropped and then recomputed later. The paper suggests roughly 30% more time for DNN-tuned binomial checkpointing in exchange for roughly an 80% reduction in memory usage. In practice, people often see 10% more time with a 50% reduction in memory usage, thanks to optimizations in the forward pass over the years.
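
As a toy illustration of the checkpointing idea (segment-style rather than the full binomial schedule; all names here are made up):

#include <stdio.h>

// Keep every K-th activation in the forward pass; recompute a segment
// on demand in the backward pass, trading compute for memory.
#define N 16 // number of layers
#define K 4  // checkpoint interval

static float forward_one(int layer, float x) { return x * 0.9f + layer; } // stand-in layer

static float checkpoints[N / K];

static float forward(float x)
{
	for (int i = 0; i < N; i++) {
		if (i % K == 0)
			checkpoints[i / K] = x; // store only segment inputs
		x = forward_one(i, x);
	}
	return x;
}

// The activation feeding layer i is replayed from its checkpoint
// instead of being stored: N/K values kept instead of N.
static float input_of_layer(int i)
{
	float x = checkpoints[i / K];
	for (int j = (i / K) * K; j < i; j++)
		x = forward_one(j, x);
	return x;
}

int main(void)
{
	float y = forward(1.0f);
	printf("output %f, layer 10 input %f\n", y, input_of_layer(10));
	return 0;
}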

In the past few days, I've been experimenting with another type of memory usage reduction technique.

It is common today for operating systems to do something called virtual memory compression. It uses data compression techniques to compress under-utilized pages and, on page fault, decompresses those pages back. These are lossless compressions. It doesn't make sense to revisit some memory and find that an 'a' suddenly became a 'z'. However, in another world, lossy compression is indeed used to reduce memory usage.

In computer graphics, a full-blown 32-bit texture can take a lot of memory. People have exploited more effective texture representations for ages. Formats such as PVRTC or ETC rely on heavy compression schemes (many involve searching a space for better representations) to find perceptually similar but much smaller texture representations. For example, PVRTC2 can spend less than 15% of the memory for a visually identical result to a full-blown 32-bit texture. These compression schemes are also very light and predictable to decompress.

There are certain similarities between textures and tensors in convolutional neural networks. Both have spatial dimensions. Convolutional neural networks traditionally require more precision, but nowadays we are exploring 4-bit or 8-bit tensors for convolutional neural networks too. For a tensor compression algorithm to work in practice, it needs to be fast at both compression and decompression on the GPU and, hopefully, have high fidelity to the original.

I've devised a very simple, very easy-to-implement adaptive quantization algorithm for this purpose. Over the past few days, I've been experimenting on ResNet-50 models to confirm its effectiveness.

At batch size 128x4 (4 GPUs, 128 per GPU), the baseline ResNet-50 model trained on ImageNet reached single crop top-1 accuracy 77.6% with 20.97GiB memory allocated across 4 GPUs. The ResNet-50 model with tensor compression trained on ImageNet reached accuracy 75.8% with 6.75GiB memory allocated.

On each feature map, within a 4x4 patch, we find the max value and the min value. With these, we form 4 values: {min, (max - min) / 3 + min, (max - min) * 2 / 3 + min, max}. Each scalar within that 4x4 patch is then represented by one of these 4 values, so we use 2 bits per scalar. That totals 64 bits per patch, 25% of the original (assuming fp16). This is super easy to implement on the GPU; in fact, I am surprised my simple-minded GPU implementation is this fast. It incurs less than 10% runtime cost during training (throughput reduced from 1420 images per second to 1290 images per second).
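
For concreteness, here is a plain-C sketch of that per-patch scheme (the real implementation linked below is a CUDA kernel over fp16; this is just my illustration, and the function names are made up):

#include <stdio.h>
#include <math.h>

// Per patch we keep {min, max} plus 16 two-bit indices into the
// palette {min, min + d, min + 2d, max}, where d = (max - min) / 3.
static void compress_patch(const float patch[16], float *lo_out, float *hi_out, unsigned char idx[16])
{
	float lo = patch[0], hi = patch[0];
	for (int i = 1; i < 16; i++) {
		if (patch[i] < lo) lo = patch[i];
		if (patch[i] > hi) hi = patch[i];
	}
	const float d = (hi - lo) / 3;
	for (int i = 0; i < 16; i++) {
		// Round each scalar to the nearest of the 4 palette values.
		int q = d > 0 ? (int)roundf((patch[i] - lo) / d) : 0;
		idx[i] = (unsigned char)(q < 0 ? 0 : q > 3 ? 3 : q);
	}
	*lo_out = lo;
	*hi_out = hi;
}

static float decompress_one(float lo, float hi, unsigned char q)
{
	return lo + (hi - lo) * q / 3;
}

int main(void)
{
	float patch[16], lo, hi;
	unsigned char idx[16];
	for (int i = 0; i < 16; i++)
		patch[i] = i * 0.25f;
	compress_patch(patch, &lo, &hi, idx);
	printf("%f ~ %f\n", decompress_one(lo, hi, idx[5]), patch[5]);
	return 0;
}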

It is also simple to update the computation graph for tensor compression. For each convolution layer's output tensor, if it is used during backpropagation, we compress it immediately after its creation in the forward pass and decompress it before its use in backpropagation. If the backpropagation of the convolution layer uses an input tensor, we compress that immediately after its creation in the forward pass and decompress it before its use in backpropagation. This simple scheme covers all tensors that potentially have spatial redundancy.

Is this algorithm useful? Probably not. As long as there is accuracy loss, I am pretty certain no one will use it. At this moment, it is unclear whether 2 bits is too little or whether this whole scheme inherently doesn't work. Some more experiments are required to determine whether adaptive quantization is good enough or whether spatial redundancy plays a role (by adaptively quantizing across feature maps rather than within a feature map). Nevertheless, I'd like to share these early results to help the community determine whether this is a worthy path to explore.

You can find the CUDA implementation of the above adaptive quantization algorithm in: https://github.com/liuliu/ccv/blob/unstable/lib/nnc/cmd/compression/gpu/ccv_nnc_lssc_gpu_ref.cu

May 22nd, 2019

War starts when people fail to communicate with each other. The current U.S. and China dispute is so complex and far-reaching that any rational discussion online can devolve into flame wars. There are so many topics, making the multi-variable optimization difficult. Overlay all this with the gloomy long-term implications of technology, and it is far easier to just pick a side and root for the red / blue team.

The Gloomy Long-Term Implications of Technology

It is far easier for Bay Area people to think of themselves as a force for good. But the technology we developed over the past few years greatly expanded central governments' abilities. It is too easy to track down a person and collect all their communication records for profiling and categorization. Alternative technologies to combat these implications, such as end-to-end encryption, can be easily outlawed at governments' will. It is pleasantly surprising to see that the United States has resisted so long. As Republicans give up their ideology completely for the totalitarian fantasy, finally, the expansion of executive branch power will result in, not necessarily a president for life (although likely), but at the very least a one-party state. Whether it is Republican or Democrat is beside the point. Populists on the far right and the far left come dangerously close in ideological terms. After all, the United States has a Republican president running an unprecedented fiscal deficit and issuing orders to anyone in the name of national security right now.

The Chinese have been playing the one-party state game for far longer. The art of ruling lies in appeasing many, allowing a few to vent, and exterminating anomalies. Digital technologies allow them to scale up. With such surveillance power, the crime rate will fall, and so will freedom.

The Gear Up to a New War

When a new war begins to break out, both sides first stop talking to each other. The media on both sides seem to have agendas. In China, the media appeal to nationalistic honor, trying to remind the average Chinese of the past under western imperialism, with the Opium War and the Korean War. In the United States, the media paint China as an evil axis, trying to gain the moral high ground for the U.S. position. The sheer number of fanatics on both sides makes civil discussion impossible. It seems that the media are well-positioned to set up a war between the two powers.

What the United States Wants

The current trade war is difficult partly because the United States' demands are fairly opaque. It is a bag of things, ranging from the purely economic to the purely political. That is understandable, because the Trump administration is not known for making crisp, clear demands. There are feelings, numbers and ideologies, all bagged together in the trade deal.

The Feelings: the United States feels that it was in a one-sided relationship. In the past two decades, the relationship benefited the Chinese more. This can be seen from the stagnation of U.S. growth and the stellar growth of China. More specifically, the feeling can be seen in the broad ban on U.S. internet companies in China, and the joint-venture requirements for any U.S. ventures into the Chinese domestic market. The great many made-in-China products mean fewer made-in-America ones. That, again, is attributed back to the stagnation for common people in the U.S. over the past decade.

The Numbers: the United States sees the cold, hard trade imbalance as proof that the relationship is truly one-sided. If the Americans make less than the Chinese from this relationship, isn't that enough to prove the United States lost?

The Ideologies: to many Americans, Communist China is evil by its very prefix. The behavior in Tibet, Xinjiang and the South China Sea is proof that the communists will go far to suppress opposition. Many years of propaganda in the United States attributed the end of the Cold War to the superiority of Capitalism over Communism (rather than, for example, of the open government over the authoritarian one).

This makes the U.S. demands unlikely to be simply economic. If the U.S. wanted balanced trade, the problem would already have been solved last year. The Chinese want to buy from the U.S. to the extent of anything the U.S. wants to sell. Over little more than the past decade, agricultural products rose from 0% to almost 20% of total U.S. exports to China. There is a long list of things that the Chinese want to import but that are banned for national security reasons.

Beyond the economic demands, the U.S. wants to fix the open-market problem. The Chinese were quick to extend an olive branch on that front, with the 100% Tesla-owned factory in China, even with some Chinese investment.

The sticking points lie in the alleged IP theft, cyber warfare and the humanitarian concerns. The Chinese were quick to promise. But the United States wants more than a promise.

What’s China’s Red Line

One misunderstanding in the U.S. media and discussions is how serious the Chinese are about sovereignty. There are many disputes in China about how the slow progress in implementing the open market hurts mutual trust within the WTO. During the interview with Ren Zhengfei on May 21st, he mentioned this as well. The humanitarian aspect of the current regime is another topic that resonates with many within China. However, imposing a U.S.-based overseeing body on the Chinese governing system is difficult for the Chinese to swallow. The sovereignty issue has been a big part of Chinese education for the past half century. The extraterritorial rights granted to westerners since the Opium War are something the Chinese will not forget.

The Endgame

With the United States being the only world superpower, it has the full range of options to play out the endgame. Given the unpredictability of the Trump administration, the trade war could end tomorrow with only lip service to appeal to the electoral base. It always comes back to how the United States views China in the long term. If the United States sees its role as containing China, and sees China as an evil axis that endangers the U.S.-dominated world order, then the United States should escalate fearlessly to a war with China while it can, and do what it is most familiar with (toppling the regime). The consequence of that is a far weaker, poorer China, with 1.5 billion people who cannot feed themselves. I wish to appeal to many of my American friends: this is an undesirable humanitarian dilemma.

Alternatively, the United States could fool itself into the sanction game. Even without coordinated efforts with Europe and Japan, sanctions from the United States would greatly damage the Chinese with limited negative impact on U.S. corporations. However, it is unlikely the United States would see a friendlier China that way. With the us vs. them mentality, it is hard to imagine a pro-American regime being born from it. An inward-looking China will ultimately pose a greater threat than an outward-looking one.

The United States has to recognize that, short of a hot war, it needs to work with China. The shared sovereignty request is not acceptable, to either the regime or the people. On the other hand, if the United States wants a friendlier China, the demands should be a rule-based mechanism that enforces IP protection and the participation of foreign capital. The right to participate in Made-in-China 2025 would also be a far more interesting play for the United States than forcing China to abandon it.

August 15th, 2018

When programming with CUDA, there are several ways to exploit concurrency for CUDA kernel launches. As explained in some of these slides, you can either:

  1. Create a thread corresponding to each execution flow, execute serially on a stream per thread, and coordinate with either cudaEventSynchronize or cudaStreamSynchronize;
  2. Carefully set up CUDA events and streams such that the correct execution flow will follow.

Option 2 seems more appealing to untrained eyes (you don't have to deal with threads!) but in practice is often error-prone. One of the major issues is that the cudaEventRecord / cudaStreamWaitEvent pair doesn't capture all synchronization needs. Compared to the primitives Grand Central Dispatch provides, dispatch_group_enter / dispatch_group_leave / dispatch_group_notify, the under-specified part is where the hypothetical cudaEventEnter would happen. This often leads to the surprising fact that when you cudaStreamWaitEvent on an event not yet recorded on another stream (with cudaEventRecord), the current stream will treat the event as if it has already happened and won't wait at all.
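
A minimal sketch of that pitfall (my own illustration; error checking omitted, needs a CUDA-capable machine to run):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
	cudaStream_t stream_a, stream_b;
	cudaEvent_t event;
	cudaStreamCreate(&stream_a);
	cudaStreamCreate(&stream_b);
	cudaEventCreate(&event);
	// Intended: stream_b waits for work on stream_a. But nothing has
	// called cudaEventRecord(event, stream_a) yet, so this wait is a
	// no-op and anything enqueued on stream_b next runs immediately.
	cudaStreamWaitEvent(stream_b, event, 0);
	cudaEventRecord(event, stream_a); // too late for the wait above
	cudaStreamSynchronize(stream_b);
	printf("stream_b did not wait\n");
	return 0;
}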

This behavior is OK if your execution flow is static, i.e., which kernels execute on which streams is fully specified upfront. Does it require some careful arrangement? Yes, but it is doable. However, it all breaks down if some coordination needs to happen after some kernel computations are done. For example, based on the newly computed losses, to determine whether to decrease the learning rate or not. Generally speaking, for any computation graph that supports control structures, these coordinations are necessary.

The obvious way to solve this is to go with route 1. However, that imposes other problems, especially given that pthread's handling of spawn / join leaves much to be desired.

For a few brave souls wanting to go with route 2, how do we solve this?

After CUDA 5.x, a new method, cudaStreamAddCallback, is provided. This method itself carries some major flaws (before Kepler, cudaStreamAddCallback could cause unintended kernel launch serializations; the callback itself happens on the driver thread; and you cannot call any CUDA API inside that callback). But if we gloss over some of these fundamental flaws and imagine, here is how I could make use of it with an imaginary cudaEventEnter / cudaEventLeave pair.

At the point where I need to branch to determine whether to decrease the learning rate, before cudaStreamAddCallback, I call cudaEventEnter to say that an event needs to happen before a certain stream can continue. Inside the callback, I get the loss from the GPU, make the decision, and call cudaEventLeave on the right event to continue the stream I want to branch into.

In the real world, the above just cannot happen. We lack the cudaEventEnter / cudaEventLeave primitives, and you cannot make any CUDA API call inside such a callback. Moreover, the code would be complicated by these callbacks anyway (these are old-fashioned callbacks, not even lambda functions or dispatch blocks!).

What if I could write code as if it were all synchronous, but under the hood, it all happens on one thread, so I don't have to worry about thread spawn / join when just scheduling work from the CPU?

In the past few days, I've been experimenting with how to make coroutines work along with cudaStreamAddCallback, and it all seems to work! Making this actually useful in NNC will probably take more time, but I just cannot wait to share this first :P

First, we need a functional coroutine implementation. There are a lot of stackful C coroutine implementations online and my implementation borrowed heavily from these sources. This particular coroutine implementation just uses makecontext / swapcontext / getcontext.

Set up the basic data structures:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

union ptr_splitter {
	void *ptr;
	uint32_t part[2];
};

static const int default_stack_size = 65536;

typedef struct schd_s schd_t;
typedef struct task_s task_t;
typedef void (*task_fn_t)(task_t *task);

struct task_s {
	struct task_s* prev;
	struct task_s* next;
	schd_t* schd;
	int done;
	struct task_s* waitfor;
	// For swapcontext / makecontext / getcontext.
	ucontext_t context;
	char *stack;
	task_fn_t fn;
};

struct schd_s {
	task_t* head;
	task_t* tail;
	struct {
		int suspend;
	} count;
	pthread_cond_t cv;
	pthread_mutex_t mutex;
	ucontext_t caller, callee;
};

Set up a main run loop that can schedule coroutines:

static void deltask(schd_t* const schd, task_t* const t)
{
	if (t->prev)
		t->prev->next = t->next;
	else
		schd->head = t->next;
	if (t->next)
		t->next->prev = t->prev;
	else
		schd->tail = t->prev;
}

static void* schdmain(void* userdata)
{
	schd_t* const schd = (schd_t*)userdata;
	for (;;) {
		pthread_mutex_lock(&schd->mutex);
		// No one is waiting, and no more tasks. exit.
		if (schd->head == 0 && schd->count.suspend == 0)
		{
			pthread_mutex_unlock(&schd->mutex);
			break;
		}
		if (schd->head == 0)
		{
			pthread_cond_wait(&schd->cv, &schd->mutex);
			pthread_mutex_unlock(&schd->mutex);
			continue;
		}
		task_t* const t = schd->head;
		deltask(schd, t);
		pthread_mutex_unlock(&schd->mutex);
		swapcontext(&schd->caller, &t->context);
		t->context = schd->callee;
		if (t->done)
			taskfree(t);
	}
	return 0;
}

Now, create a new task:

static void _task_entry_point(uint32_t part0, uint32_t part1)
{
	union ptr_splitter p;
	p.part[0] = part0;
	p.part[1] = part1;
	task_t *task = (task_t*)p.ptr;
	task->fn(task);
	task->done = 1;
	swapcontext(&task->schd->callee, &task->schd->caller);
}

static task_t* taskcreate(schd_t* const schd, task_fn_t fn)
{
	task_t *task = (task_t*)calloc(1, sizeof(task_t));

	task->schd = schd;
	task->stack = (char*)calloc(1, default_stack_size);
	task->fn = fn;

	getcontext(&task->context);
	task->context.uc_stack.ss_sp = task->stack;
	task->context.uc_stack.ss_size = default_stack_size;
	task->context.uc_link = 0;

	union ptr_splitter p;
	p.ptr = task;
	makecontext(&task->context, (void (*)(void))_task_entry_point, 2, p.part[0], p.part[1]);
	return task;
}

static void addtask(schd_t* const schd, task_t* const t)
{
	if (schd->tail)
	{
		schd->tail->next = t;
		t->prev = schd->tail;
	} else {
		schd->head = t;
		t->prev = 0;
	}
	schd->tail = t;
	t->next = 0;
}

static void taskfree(task_t* const task)
{
	task_t* waitfor = task->waitfor;
	while (waitfor)
	{
		task_t* const next = waitfor->next;
		addtask(task->schd, waitfor);
		waitfor = next;
	}
	free(task->stack);
	free(task);
}

The usual utilities for coroutines (the ability to yield, launch a new coroutine, and wait for an existing coroutine to finish):

static void taskyield(task_t* const task)
{
	addtask(task->schd, task);
	swapcontext(&task->schd->callee, &task->schd->caller);
}

static void taskresume(task_t* const task)
{
	ucontext_t old_context = task->schd->caller;
	swapcontext(&task->schd->caller, &task->context);
	task->context = task->schd->callee;
	task->schd->caller = old_context;
	if (task->done) // If the task is done here, we should just remove it.
		taskfree(task);
}

static void taskwait(task_t* const task, task_t* const waiton)
{
	task->prev = 0;
	task->next = waiton->waitfor;
	waiton->waitfor = task;
	swapcontext(&task->schd->callee, &task->schd->caller);
}

With the above utilities, you can already experiment with coroutines:

static void g(task_t* const task)
{
	printf("start task %p\n", task);
	taskyield(task);
	printf("back to task %p to finish\n", task);
}

static void f(task_t* const task)
{
	printf("create a new task to resume %p\n", task);
	task_t* gtask = taskcreate(task->schd, g);
	taskresume(gtask); // Run the gtask directly.
	printf("done task %p\n", task);
}

int main(void)
{
	schd_t schd = {};
	pthread_cond_init(&schd.cv, 0);
	pthread_mutex_init(&schd.mutex, 0);
	task_t* task = taskcreate(&schd, f);
	addtask(&schd, task);
	schdmain(&schd);
	pthread_cond_destroy(&schd.cv);
	pthread_mutex_destroy(&schd.mutex);
	return 0;
}

Unsurprisingly, you should see printouts in this order:

create a new task to resume 0x288d010
start task 0x289d410
done task 0x288d010
back to task 0x289d410 to finish

Coroutine f executes first and launches coroutine g. When g gives up control (taskyield), coroutine f continues to execute until it finishes. After that, the scheduler resumes coroutine g, and it finishes as well.

You can also try taskwait(task, gtask) in coroutine f, to see that f will finish only after coroutine g is scheduled again and runs to completion.

So far, we have a functional coroutine implementation in C. Some of this code doesn't seem to make sense yet; for example, why do we need a mutex and a condition variable? Because a secret function that enables us to wait on a stream is not included above:

static void taskcudaresume(cudaStream_t stream, cudaError_t status, void* userdata)
{
	task_t* const task = (task_t*)userdata;
	pthread_mutex_lock(&task->schd->mutex);
	addtask(task->schd, task);
	--task->schd->count.suspend;
	pthread_cond_signal(&task->schd->cv);
	pthread_mutex_unlock(&task->schd->mutex);
}

static void taskcudawait(task_t* const task, cudaStream_t stream)
{
	pthread_mutex_lock(&task->schd->mutex);
	++task->schd->count.suspend;
	cudaStreamAddCallback(stream, taskcudaresume, task, 0);
	pthread_mutex_unlock(&task->schd->mutex);
	// Compare to taskyield, this function doesn't do addtask(task->schd, task);
	swapcontext(&task->schd->callee, &task->schd->caller);
}

taskcudawait puts the current coroutine on hold until the said stream finishes. Afterwards, you can branch, knowing comfortably that the kernels in the stream above are all done. The condition variable and the mutex are necessary because the callback happens on the driver thread.
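
To sketch the payoff (a hypothetical example of my own; launch_work and host_loss are placeholders, not real APIs — only taskcudawait and task_t above are real):

static cudaStream_t stream;  // created elsewhere
static float host_loss;      // filled by an async D2H copy on stream
static void launch_work(cudaStream_t stream); // placeholder: kernels + loss copy

// A training-step coroutine: enqueue kernels, suspend until the
// stream drains, then branch on the result with straight-line code.
static void train_step(task_t* const task)
{
	launch_work(stream);
	taskcudawait(task, stream); // coroutine parks; scheduler runs on
	// Everything enqueued above is now complete; safe to branch.
	if (host_loss < 1.0f)
		printf("decrease learn rate\n");
}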

You can see the full code that demonstrates the usage here: https://gist.github.com/liuliu/7366373d0824a915a26ff295c468b6e4

It seems the above utilities cover all my usages (taskwait and taskresume are important to me because I don't want too much hard-to-control async-ness when launching sub-coroutines). I will report back if some of these don't hold and I fail to implement a fully-asynchronous, control-structure-supporting computation graph with these cute little coroutines.
