Posts from December, 2019
December 27th, 2019

The past 3 months may have been the best 3 months ever for television. As a prolific TV viewer, I found the shows from this stretch uniformly impressive and well-executed. Some of that is due to better technology; some of it we can probably thank the investment pouring in from the streaming war for. The flood of money and talent certainly worked. On top of all this, one of the most interesting turns in the past few months is the wealth of international shows on Netflix. From Better than Us, Kingdom, and The Untamed to Ad Vitam, Netflix is uniquely positioned to bring some international flavors to U.S. viewers.

For All Mankind

The new Apple TV+ show gives off a good 90s vibe, from the era when we still had The West Wing and Star Trek: TNG. The future is hopeful, and the leaders are altruistic. It sets itself apart from the more recent twisted, darker themes of The Walking Dead and Breaking Bad. We have had too many of those this decade.

The Expanse Season 4

I'll be honest: this season is a bit dense and I am still working through it. But hey, it is back! We made few space operas this decade, and to many the genre is dead. Somehow we still made the best space opera yet with The Expanse (or did we?).

The Witcher

This one surprised me. Some of Netflix's recent dump of fantasy / sci-fi shows, such as The Umbrella Academy and Lost in Space, were not exactly runaway successes. I liked Altered Carbon, but not many people share that view. The Witcher, however, has to be one of the best game adaptations Netflix, or anyone, has made so far. For someone with no background in the novels or the games, it is easy to consume. The 3 split timelines at the beginning are not that hard to follow, and they merge relatively quickly. It has the right mix of standalone stories and main storyline. Compared with other fantasy series such as Game of Thrones, the setting is not as grandiose, but it is comparably interesting. Compared with Grimm, which draws on similar folklore origins, the character development is simply miles better.

The Mandalorian

Who says space opera as a genre is dead again? The Mandalorian, if it keeps its current momentum, will certainly cement a good position for Disney+ in the streaming war. It has some weaker episodes, but the storyline stays on the right track. Baby Yoda is a quick attention-grabber, but the Mandalorian himself is starting to develop a very interesting backstory and character. Besides the story, which reliably gave me an easy and enjoyable Friday night, the special effects are also superb. The show is generally bright (Tatooine does have 2 suns), which sets it apart from other sci-fi series that often opt for darker setups (in general, fewer light sources are easier to render), probably thanks to Disney+'s budget. The new movie-like effects quality is less about awing you than about keeping the storytelling unimpeded. I believe many viewers never consciously notice the special effects at all.

To put the cherry on top, all the shows above now support 4K HDR (with the exception of The Expanse, which I think only does 4K). TV streaming nowadays is such a great experience if you can afford it and have a Gbps internet connection :) I hope you all enjoy these great TV shows as much as I do this holiday season, and keep warm!

December 26th, 2019

Grand Central Dispatch is the de-facto task-based parallelism / scheduling system on macOS / iOS. It has been open-sourced as libdispatch and ported to many platforms including Linux and FreeBSD.

libdispatch has been designed to work closely with the Clang extension: Blocks. Blocks is a simple, yet powerful function closure implementation that can implicitly capture variables to facilitate the design of task-based parallelism systems.

That choice imposed some constraints when designing the QoS classification system for libdispatch. Blocks' metadata layout is internal to Clang. It would leave a bad taste if we were required to modify Clang in order to attach QoS information to Blocks. It is interesting to discover how libdispatch engineers overcame this design dilemma.

There are also some API limitations around Blocks' QoS. We cannot inspect the QoS assignment of a given block. That makes certain wrappers around libdispatch APIs challenging. For example, we cannot simply add a wrapper that counts how many blocks we have executed, like this:

static atomic_int executed_count;

void my_dispatch_async(dispatch_queue_t queue, dispatch_block_t block) {
    dispatch_async(queue, ^{
        ++executed_count;
        block();
    });
}

The above has unexpected behavior because the new block doesn't carry over the QoS assignment of the block passed in. For all we know, that block could have been wrapped with dispatch_block_create_with_qos_class. Specifically:

dispatch_block_t block = dispatch_block_create_with_qos_class(DISPATCH_BLOCK_ENFORCE_QOS_CLASS, QOS_CLASS_USER_INITIATED, 0, old_block);

If dispatched, this block would lift the underlying queue's QoS to QOS_CLASS_USER_INITIATED. However, with our my_dispatch_async wrapper, the QoS assignment is stripped.

We would like a way to at least copy the QoS assignment over to the new block. This requires inspecting libdispatch internals.

What is a Block?

Blocks is the function closure implementation from Clang that works across Objective-C, C, and C++. Under the hood, a block is really just a function pointer to a piece of code, plus some variables copied over from the calling context. Apple conveniently provides a header that specifies the exact layout of the Block metadata in memory:

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/BlocksRuntime/Block_private.h#L59

// ...
struct Block_descriptor_1 {
    unsigned long int reserved;
    unsigned long int size;
};
// ...
struct Block_layout {
    void *isa;
    volatile int32_t flags; // contains ref count
    int32_t reserved; 
    void (*invoke)(void *, ...);
    struct Block_descriptor_1 *descriptor;
    // imported variables
};
// ...

The first two fields just so happen to match an Objective-C object's memory layout. This satisfies the requirement for Objective-C compatibility, especially with ARC. When copied, the whole block moves to the heap along with the imported variables in a single allocation. Thus, if you have a pointer to the block metadata, you can already inspect captured variables, provided you know the exact order in which they were captured.

At runtime, when a block is called, the compiler-generated code restores the captured variables, then casts and invokes block->invoke as if it were a normal function.

The Additional Block Metadata

As we can see, Block_layout is relatively tight, with not much space for additional block metadata. How did libdispatch engineers find the extra space for the QoS information?

The answer lies in another indirection:

https://github.com/apple/swift-corelibs-libdispatch/blob/master/src/block.cpp#L113

dispatch_block_t
_dispatch_block_create(dispatch_block_flags_t flags, voucher_t voucher,
		pthread_priority_t pri, dispatch_block_t block)
{
	struct dispatch_block_private_data_s dbpds(flags, voucher, pri, block);
	return reinterpret_cast<dispatch_block_t>(_dispatch_Block_copy(^{
		// Capture stack object: invokes copy constructor (17094902)
		(void)dbpds;
		_dispatch_block_invoke_direct(&dbpds);
	}));
}

dispatch_block_create and dispatch_block_create_with_qos_class ultimately call into this private _dispatch_block_create function.

It captures a single variable, dbpds, that carries numerous fields onto the block, and then invokes the actual block directly.

As we saw in the previous section, it is relatively easy to inspect the captured variables if you know their layout. It just so happens that we know the layout of struct dispatch_block_private_data_s exactly.

Copying QoS Metadata

Back to the my_dispatch_async implementation mentioned earlier. If we want to preserve the QoS metadata, we need to copy it over to the new block. Now that the skeleton is clear, only a few implementation details remain.

First, we cannot directly inspect the captured variables.

It is tempting to simply cast (struct dispatch_block_private_data_s *)((uint8_t *)block + sizeof(struct Block_layout)) and then check the fields. However, there is no guarantee that a passed-in block was wrapped with dispatch_block_create. If a passed-in block happens to capture no variables at all, we may read out-of-bounds memory.

The way libdispatch implements this is to first check the invoke function pointer. If a block was wrapped with dispatch_block_create, invoke always points to the same function inside the block.cpp implementation. We can find this function pointer at link time, as libdispatch does, or we can find it at runtime:

typedef void (*dispatch_f)(void *, ...);

dispatch_f dispatch_block_special_invoke(void)
{
    static dispatch_once_t onceToken;
    static dispatch_f f;
    dispatch_once(&onceToken, ^{
        // The temporary block only serves to read out the shared invoke pointer.
        dispatch_block_t block = dispatch_block_create(DISPATCH_BLOCK_INHERIT_QOS_CLASS, ^{});
        f = ((__bridge struct Block_layout *)block)->invoke;
    });
    return f;
}

Second, we need to deal with runtime changes. We don't expect libdispatch to dramatically change its internals, but it is better to be safe than sorry. Luckily, struct dispatch_block_private_data_s carries a magic number we can compare notes against. We can simply check dbpds->dbpd_magic to guard against library updates and memory corruption.

Finally, we can assemble our my_dispatch_async method properly.

static atomic_int executed_count;

void my_dispatch_async(dispatch_queue_t queue, dispatch_block_t block) {
    dispatch_block_t wrapped_block = ^{
        ++executed_count;
        block();
    };
    struct Block_layout *old_block_layout = (__bridge struct Block_layout *)block;
    if (old_block_layout->invoke == dispatch_block_special_invoke()) {
        wrapped_block = dispatch_block_create(DISPATCH_BLOCK_INHERIT_QOS_CLASS, wrapped_block);
        struct Block_layout *wrapped_block_layout = (__bridge struct Block_layout *)wrapped_block;
        struct dispatch_block_private_data_s *old_dbpds = (struct dispatch_block_private_data_s *)(old_block_layout + 1);
        struct dispatch_block_private_data_s *wrapped_dbpds = (struct dispatch_block_private_data_s *)(wrapped_block_layout + 1);
        if (old_dbpds->dbpd_magic == 0xD159B10C) {
            wrapped_dbpds->dbpd_flags = old_dbpds->dbpd_flags;
            wrapped_dbpds->dbpd_priority = old_dbpds->dbpd_priority;
        }
    }
    dispatch_async(queue, wrapped_block);
}

This new my_dispatch_async wrapper now respects the QoS assignment of the block passed in. You can verify this by dispatching a block created with dispatch_block_create_with_qos_class and observing the executing QoS with qos_class_self().

Closing Thoughts

The implementation of QoS in dispatch blocks is quite ingenious. However, it does present challenges outside of libdispatch's scope. The implementation is specialized for dispatch_block_t blocks; you cannot simply extend it to other block types. I am particularly unhappy that dispatch_block_create is not a generic facility, such that any given block, parameterized or not, could have QoS attached and respected (for example, by taking its QoS out and assigning it to a plain dispatch_block_t when you do the dispatch_async dance).

Implementing your own QoS-carrying block this way would be quite painful. Each parameterized block would require a specialized function to carry the QoS information. You could probably pull that off with C macro hackery, but it would get ugly quickly. You'd be better off with an object that holds both the block and the QoS information plainly, rather than trying to be clever and embedding the QoS information into the block.

December 1st, 2019

I discussed a stackful coroutine implementation to coordinate CUDA streams last year.

That implementation was based on the swapcontext / makecontext APIs. As I thought about porting nnc over to WASM, this became increasingly problematic: these APIs are more or less deprecated, and popular libc implementations such as musl don't implement them.

After that article, it also became obvious that I cannot swapcontext into the internal CUDA thread (that thread cannot launch any kernels). Thus, the real benefit of such a stackful coroutine is really convenience: writing a coroutine that way is no different from writing a normal C function.

This is where C++ makes sense. The coroutine proposal in C++20 is a much better fit; the extra bit of compiler support makes coroutines much easier to write.

If we don't use swapcontext / makecontext, the natural choice is either setjmp / longjmp or the good old Duff's device. It was a no-brainer for me to come back to Duff's device. It is simple enough and the most platform-agnostic approach.

There are many existing stackless coroutines implemented in C. The most interesting Duff's-device-based one is Protothreads. To me, the problem with Protothreads is its inability to maintain local variables. Yes, you can allocate additional state by passing in extra parameters, but that quickly drifts away from a simple stackless coroutine toward one with all the bells and whistles of structs for parameters and variables. You can declare everything static, but that certainly won't work beyond the most trivial examples.

I spent this weekend sharpening my C macro skills on writing the most natural stackless coroutine in C. The implementation preserves local variables, and you can declare parameters and return values almost as naturally as in a normal function.

Here is an example of how you can write a function-like stackless coroutine in C:

typedef struct {
	int a;
} ab_t; /* the return-value type used by the coroutines below */

static co_decl_task(ab_t, _coroutine_a, (const int a, const int b), private(
	int i;
)) {
	printf("param a %d\n", CO_P(a));
	printf("param b %d\n", CO_P(b));
	CO_V(i) = 2;
	printf("%d\n", CO_V(i));
	co_yield((ab_t){
		.a = CO_V(i)
	});
	CO_V(i) += 1;
	printf("param b %d\n", CO_P(b));
	printf("%d\n", CO_V(i));
	co_yield((ab_t){
		.a = CO_V(i)
	});
	co_return((ab_t){
		.a = 10
	});
} co_end()

static co_decl_task(int, _coroutine_b, (), private(
	co_routine_t* task_a;
	ab_t a0;
	ab_t a1;
	ab_t a2;
)) {
	CO_V(task_a) = co_new(_coroutine_a, (12, 10));
	co_resume(CO_V(task_a), CO_V(a0));
	co_resume(CO_V(task_a), CO_V(a1));
	co_resume(CO_V(task_a), CO_V(a2));
	printf("returned value %d %d %d\n", CO_V(a0).a, CO_V(a1).a, CO_V(a2).a);
	co_free(CO_V(task_a));
} co_end()

int main(void)
{
	co_scheduler_t* scheduler = co_scheduler_new();
	co_routine_t* const task = co_new(_coroutine_b, ());
	co_schedule(scheduler, task);
	co_free(task);
	co_scheduler_free(scheduler);
	return 0;
}

co_decl_task declares both the interface and the implementation. You can also separate the interface into a header file with co_decl and put the implementation in co_task. The static keyword continues to work as expected, scoping the coroutine to file-level visibility. Take a look at this:

static co_decl_task(ab_t, _coroutine_a, (const int a, const int b), 

The first parameter is the return type, then the function name, then the parameters; it all feels very natural next to C functions. Local variables have to be declared within the private block; that's the only catch.

To access parameters and local variables, you have to wrap the access with the CO_P / CO_V macros; otherwise it is the same.

Of course, there are a few more catches:

  1. No variadic parameters;
  2. No variable length local arrays;
  3. No void; an empty () stands in for it in the parameter list, and you can simply omit the return type if you don't need one.

There is no real magic, just some ugly macros that hide away the complexity of allocating parameters and local variables on the heap.

There are examples in the repo that show the usage of co_resume, co_await, co_apply, co_yield, co_decl, co_task, co_decl_task, and co_return in various forms. You can check out more there: https://github.com/liuliu/co

Currently, I have a single-threaded scheduler. It would not be hard to switch to a multi-threaded one, with the catch that you can no longer maintain the dependencies as a linked list; you need a tree instead.

This was a weekend exercise, and I don't expect to maintain the repo going forward. Some form of it will be ported into nnc.

Closing Thoughts

In theory, swapcontext / makecontext allows much more complex interactions between functions, such that an extra scheduler object is not needed. For what it's worth, Protothreads doesn't have a central scheduler either. In practice, though, I found it miles easier to have a scheduler, as libtask does. Tracking and debugging are much easier with a central scheduler, especially if you want to make it multi-thread safe as well.