March 19th, 2021

I am a mediocre marathon runner who hasn't done any long distances since Covid-19. But serendipitously, today I was thinking about marathons and the interesting choices you make during one.

A flat road marathon with support stations every half mile is … bland. I have probably done that only twice, once in Maui and once in Denver. The only interesting bits of those two were the heat and the mile-high altitude. More interesting are trail runs with support stations every few miles. You have a lot of decisions to make along the way.

Elite runners can probably bulldoze through the trails. But for everyday runners like me, the first decision to be made on the trail is: what's going to be my pace for this segment? This is a function of your desired finish time, the elevation gain / loss of the segment, and the weather conditions. On a good day, at the beginning of the race, with massive elevation loss, you probably want to go as fast as you can. This is economical from an energy-expenditure perspective. On the other hand, if there are a lot of ups and downs in the beginning, you probably want to keep a steady pace so that your overall finish time stays on track.

The decision is a bit different towards the end. If there are a lot of ups and downs towards the end, you probably want to take it slow on the uphills and speed up on the downhills. This is more efficient energy-wise, and can still keep you more or less on track.

Besides questions of pace, if there is a fairly reasonable number of support stations, you need to decide where to stop for refills. If there are only 3 support stations, 6 to 7 miles apart, you probably want to stop at every one, since your 1L water bag will likely run dry around the 10-mile mark. It is more interesting when there are not enough support stations to go without carrying your own water bag, but enough of them that you can choose to skip some.

This decision can be more interesting than it first appears. If you are breezing along at a 7 min/mi pace, you probably don't want to break it to stop for a refill; it is simply not economical. However, if you are running low on water, the decision gets harder. You never, ever want to go without water for a mile or so; it is soul-crushingly terrible. In that case, you may want to ration your water until the next support station, or stop for a big refill now and make up the time by skipping the next two.

For an everyday runner, at mile 12 or 13, you probably want to consider taking some food. Energy bars, jellies, gels, watermelon chunks, bananas, or M&M's, all kinds. It is a great way to prepare for the wall. To be honest, I am not great at fishing things out of my bag while running, so getting food out requires a brief stop. When the most economical time to stop is, is not obvious. Uphill? Probably; it is definitely easier to stop and restart on an uphill. But you also want to get food into your body before glycogen depletion. Gels are probably the fastest, taking about 5 minutes to kick in; other solids take longer. Eating a mile or so before depletion is probably required. Depending on these factors, there is a window during the race in which taking food is closer to optimal.

These decisions become even more interesting in a self-supported multi-day event. Self-supported means you carry your own food for all the days (water excluded); multi-day means a marathon a day. The decisions get much more interesting because you are not just carrying a pack of energy gels. There are proper solids: pastas, ramen, instant rice, beef jerky for your after-race celebrations. Simply put, six days of nothing but energy gels is not a psychologically viable option.

In this situation, deciding when to take what has to account not only for the current day's race but also for the next day's. If the next day will be harder, I probably need to ration my energy gels today and balance it out with a bigger breakfast before today's race. It also cannot be completely rational. You want to favor food with a high energy-to-weight ratio so that on the tougher day the backpack is lighter. At the same time, you want to keep some celebratory food (beef jerky, anyone?) so that, psychologically, there is a strong motivation to finish sooner. It quickly becomes an interesting game between pace, stop-and-restart points, energy expenditure / refills, and psychological reward functions. Even after several trials and errors, I don't yet know a universally optimal strategy for this game.

People say that life is a marathon, not a sprint. Well, I guess. As an everyday person, I would love to play the game with many more interleaving decisions and consequences. That is certainly much more interesting than the genetic lottery, isn't it?

February 14th, 2021

A while ago, the Renren Yingshi (YYeTs) fansub group was investigated for copyright infringement, setting off a broad debate in China about intellectual property protection. In certain periods and particular fields, intellectual property serves a positive purpose. But the intellectual property system is not a truth in itself; it is merely a human invention.

At its core, intellectual property takes knowledge, something that belongs to the public commons, and privatizes it: through the machinery of the state, it is granted full private property rights for a limited time. The four properties that define private ownership, the right to use, the right to transfer, the right to exclude, and the right to destroy, all apply to intellectual property.

Note that privatizing knowledge in this way removes it from the public commons. Left unchecked, the commons shared by humanity would shrink and eventually vanish. This is why intellectual property rights all come with time limits: patents are valid for 20 years, and copyright expires 50 years after the author's death (70 years in the United States).

When intellectual property took shape during the Industrial Revolution, it was a progressive institution. At the same time, though, its impact on derivative creation is devastating. Before intellectual property existed, much literary creation, from household classics like Water Margin, The Plum in the Golden Vase, Records of the Three Kingdoms, and Romance of the Three Kingdoms, was re-creation built on folklore and earlier works. Such re-creation (call it fan fiction) is comparatively rare today because of copyright. This is, in essence, a trade-off: we bet that by encouraging intellectual property and letting creators profit, more new works will be produced, enough to offset the loss from having less re-creation.

On the other hand, intellectual property, or more narrowly the patent system, had another purpose at its inception: encouraging disclosure. The patent system requires that applications be filed and published. That alone guarantees that after 20 years the knowledge is not lost, but returns to the public commons.

This is why, even today, many high-tech innovations, such as engine manufacturing and materials processing, are protected as trade secrets instead. It is also why you will never find a patent application for the hydrogen bomb.

Consequently, the fields that lean most heavily on intellectual property protection are those where reverse engineering is easy, marginal cost approaches zero, and R&D cost is high: cultural products, software, and pharmaceuticals, for example.

Having said all that, does intellectual property actually let inventors profit, and thereby encourage invention?

The generally acknowledged inventor of the cathode-ray-tube television is Philo Farnsworth.

Farnsworth was born to a farming family in 1906. At 15, he was already dreaming of a device that could transmit sound and images together; by 21, he had built the first prototype of a cathode-ray-tube television. Before that, televisions were mostly based on mechanical scanning (Nipkow disks). The cathode-ray tube was more reliable and produced a sharper picture. In fact, our displays remained based on cathode-ray tubes until the early 2000s, a testament to the invention's effectiveness.

Farnsworth moved to California in 1926 and filed a patent for the cathode-ray-tube television. The patent was granted in 1930, and Farnsworth planned to manufacture televisions and make a fortune. Just as he was getting started, the Radio Corporation of America (RCA) claimed that its employee Vladimir Zworykin had invented television before Farnsworth, and asked the patent office to rule his patent invalid.

In 1931, RCA extended an olive branch: $100,000 for a license, plus a job at the company. Farnsworth refused.

To hold the complete pool of television patents, Farnsworth went into debt litigating against RCA. By the end of 1939, he had finally won every case and signed a one-million-dollar licensing agreement with RCA. But by then the key television patents were close to expiring, and he was buried in debt. To make matters worse, World War II had begun, and the United States suspended nascent entertainment services like television for the war effort.

Farnsworth's story makes it hard to argue that patent disputes favor the inventor. With deep enough pockets, one can simply exhaust an inventor's time and energy through legal procedure. That partly explains why, today, the holders of large patent and IP portfolios tend to be well-capitalized corporations.

In the fields that rely most on intellectual property protection, cultural products, software, and pharmaceuticals, does the protection actually work?

In culture, music, film, and television are the best-known fields that monetize intellectual property; they are also the fields the fansub group in our opening infringed upon. Music in particular, with its low barrier to creation and trivially easy copying, was the first field hit by the internet. The impact looks like a blow to the industry's growth: in the United States alone, recorded-music revenue reached $14 billion in 2000, but was $11 billion in 2019, and even that figure owes much to the growth of streaming over the past few years.

Creators' incomes, however, did not fall. Comparing top young performers: Taylor Swift's net worth was $65 million in 2020, while Britney Spears's was $38 million in 2000. As the recording business declined, more creators turned to uncopyable experiences, such as live concerts, to make money. Hence Radiohead's 2007 act of performance art: releasing the new album In Rainbows online for direct download.

Collaborative internet creation projects like SCP are likewise experimenting with a more permissive stance on copyright for sharing and collaboration.

In software, nearly everyone in the industry agrees that software patents do the field more harm than good. Because patents are transferable, companies have emerged, Intellectual Ventures being the famous example, whose names sound idealistic but whose actual business is predatory patent trolling. Such companies invent nothing themselves; they buy up vaguely and broadly worded patents and make money by litigating over them, say, suing Facebook for infringing "a method for connecting different people through a network," or Google for "a method of requesting resources from a server." This abuse of patent litigation can only hurt the people who actually turn inventions into products that benefit humanity.

Precisely because software patents have so many downsides, the patent pool was invented: companies put their patents on a given technology into a shared pool, and paying the pool buys its protection. If someone sues over a technology in the pool, everyone chips in for the lawsuit. H.265 encoding and the so-called 4G / 5G technologies are, at their core, patent pools of the members' combined intellectual property.

In pharmaceuticals, because drugs have positive externalities, intellectual property protection has always been controversial. The film Dying to Survive (《我不是药神》) centers on the Indian government's refusal to recognize drug patents so that its people could afford medicine. Ironically, it is precisely by mass-producing patented drugs that India accumulated manufacturing expertise and became a major global producer of pharmaceuticals. Even in the United States, the price of the EpiPen (a portable auto-injector of emergency allergy medication) quadrupling over the past decade sparked a broad debate on whether drug patents should come with price limits.

From another angle, the modern model of drug development is no longer one big company carrying a drug from initial research through Phase III. More often, the state funds universities and labs for early-stage research. When a drug looks promising, the professor typically forms a small company to commercialize the research. After the small company completes Phase I and II trials, it is packaged and sold to a large pharmaceutical firm like Lilly, which completes the slow, expensive Phase III trials and takes the drug to market. This very model has triggered its own debate: why is research funded with taxpayer money, while the profits go to big pharma?

A direct example of patents shaping a technology is E-Ink. E-Ink was once a next-generation display technology on par with OLED, but the E-Ink corporation monopolized its core patents. For over a decade after its invention, E-Ink licensed no other manufacturers, so refresh rate, size, price, and new applications all stagnated. Now everyone is waiting for the burst of new products once E-Ink's core patents expire in the 2020s.

That is not wishful thinking. Most LED lighting patents expired around 2010, and, coincidentally or not, the LED industry boomed right after 2010. Strip LEDs, indirect LED lighting, and color-changing LED bulbs all started appearing in homes after 2010.

It is because knowledge can be copied that our lives keep getting better. In the internet era, acquiring and copying knowledge has become even cheaper. Open-source software is a vivid practice of knowledge sharing. As a kid, I didn't understand the relationship between intellectual property and open source either; at 14, I wrote an article in Computer Newspaper (电脑报) denouncing people who repackaged the open-source FileZilla and sold it in China for disrespecting intellectual property. Today I am more convinced that anything that helps spread knowledge is good. If a repackaged, translated, paid version spreads it better than the free one does, why not?

SQLite, a widely used embedded database, takes the spirit of knowledge sharing to its logical end. Unlike most open-source software, which requires redistributions to include the license text (Permission is hereby granted … without restriction, including without limitation …), SQLite, like every invention that predates intellectual property, simply returns all rights to the public domain.

In the cultural sphere, Project Gutenberg works to make every work whose copyright has expired freely downloadable and readable on the internet. At the start of each year, Project Gutenberg happily publishes a post listing which great works have just entered the public domain for everyone to download and read.

These projects are themselves a critique of the intellectual property system. Ask yourself: Mickey Mouse, a single character, has given Disney intellectual property for nearly a century, and as long as the United States endures, it will keep doing so; has that done the public commons any good at all? Had Wu Cheng'en held the rights to Sun Wukong into the present day, would we ever have gotten A Chinese Odyssey (《大话西游》)?

Now we can return to the question: has intellectual property achieved its original goals of encouraging market competition and more invention?

In 1774, Britain was the world's textile manufacturing power. To protect British textile firms, it passed laws banning the export of textile technology. Samuel Slater memorized the construction and operation of the spinning machinery, arrived in New York in 1789, and rebuilt the machines from memory. Today, Slater is widely regarded as the father of American industry.

Of course, once the United States gained its own technological lead, it used international treaties and institutions to write intellectual property protections into member states' laws, protecting its advantage in turn. A chief criticism of the TPP (Trans-Pacific Partnership) was precisely that it wrote America's excessively strict and lengthy IP protections into its partners' laws.

Intellectual property grants every attribute of ownership to knowledge, which is copyable and transmissible, walling it off from the human commons for long stretches. In today's era of tight collaboration and fast-iterating knowledge, it is highly questionable whether such complete intellectual property promotes invention or merely protects capital. We should not swallow the concept whole without thinking: encourage what is good in it, and criticize what is bad. Intellectual property should be a dynamic concept. As Liu Cixin put it, think more, and think ahead of time; that is always right.

February 7th, 2021

For many machine learning practitioners, the training loop is such a universally agreed-upon concept that numerous pieces of documentation and conference papers use the term without any reference. It is a helpful concept for beginners to get familiar with before diving into the rabbit holes of the many deep learning tools.

The field of machine learning has become vast, diverse, and ever-more open in the past 10 years. We have all kinds of open-source software, from XGBoost, LightGBM, and Theano to TensorFlow, Keras, and PyTorch, to simplify the various tasks of machine learning. We have supervised learning, unsupervised learning, generative networks, and reinforcement learning; choose your own pill. It can be dazzling for beginners. It doesn't help that much of the popular software we use hides details from beginners behind abstractions like classes, functions, and callbacks.

But fear not: for machine learning beginners, there is one universal template. Once you understand it, it is straightforward to fit all existing training programs into this template and start digging into how they implement the details. I call this the universal training loop.

An Extremely Simplified Interface

Many high-level frameworks provide an extremely simplified interface that looks like this:

func train(training_dataset: Dataset, validation_dataset: Dataset) -> Classifier

When you do:

let trainedClassifier = Classifier.train(training_dataset: training_dataset, validation_dataset: validation_dataset)

You somehow get the trained classifier from that interface. This is what you would find in FastAI’s Quick Start page, or Apple’s Create ML framework.

However, this doesn't tell you much about what it does. Nor does it help that some of these frameworks provide callbacks or hooks into the training loop at various stages. The natural question is: what are the stages?

A Supervised Learning Training Loop

It is actually not hard to imagine what lies underneath the extremely simplified interface. A supervised learning training loop may look like this:

var classifier = Classifier()
for example in training_dataset {
  classifier.fit(input: example.data, target: example.target)
}

It goes through all examples in the training dataset and tries to fit them one by one.

For stochastic gradient descent methods, we make a few modifications to keep the training process more stable and less sensitive to input order:

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  classifier.fit(inputs: minibatch.data, targets: minibatch.targets)
}

We randomize the input order (shuffled()), group the examples into mini-batches, and pass them to the classifier, assuming the classifier operates on a group of examples directly.

For many different types of neural networks, shuffled mini-batches are an essential part of your training loop, for both efficiency and stability reasons.

4 Steps to Fit

The magical fit function doesn't inspire any deeper understanding of what's going on. It looks like we simply lifted the train function into a for-loop.

However, if we can assume that our machine learning model is differentiable and based on gradient methods (e.g. neural networks), we can break down the fit function into 4 steps.

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  // Given inputs, a machine learning model can guess what the
  // outputs would be. (Labels of the images, positions of the faces
  // or translated texts from the original.)
  let guesses = classifier.apply(inputs: minibatch.data)
  // Loss measures how far away our guesses are from the targets we
  // knew from the training data. This is supervised learning; we
  // know the answers already.
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Based on the loss, gradients give us the direction and magnitude
  // to update our model parameters.
  let gradients = loss.gradients()
  // Update the parameters with gradients from this mini-batch.
  // Optimizer specifies a particular algorithm we use to update
  // parameters, such as stochastic gradient descent or ADAM.
  optimizer.apply_gradients(gradients, classifier.parameters)
}

For any supervised learning, you will be able to find these 4 steps. There are variations: some models may accumulate gradients over a few mini-batches and only then call apply_gradients; some may apply additional clipping to the gradients before applying them.
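
Those two variations can be sketched in the same Swift pseudo-code (Gradients.zero and clipped(by:) are hypothetical helpers, not from any particular framework):

```swift
var classifier = Classifier()
var accumulated = Gradients.zero
for (i, minibatch) in training_dataset.shuffled().grouped(by: 32).enumerated() {
  let guesses = classifier.apply(inputs: minibatch.data)
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Sum gradients over several mini-batches before updating; this
  // simulates a larger effective batch size.
  accumulated = accumulated + loss.gradients()
  if i % 4 == 3 {
    // Optionally clip the accumulated gradients to a maximum norm, so
    // a single noisy mini-batch cannot derail the update.
    optimizer.apply_gradients(accumulated.clipped(by: 1.0), classifier.parameters)
    accumulated = Gradients.zero
  }
}
```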

You could find the exact 4 steps in frameworks like Keras or PyTorch.

Validation Dataset and Epoch

We haven’t talked about the validation_dataset parameter you saw earlier for the train method!

For first-order gradient-based methods (e.g. neural networks), we need to go over the whole training dataset multiple times to reach a local minimum (a reasonable model). Each pass over the whole training dataset is called one epoch. Our models can also suffer from overfitting. The validation dataset is data the model never uses when updating its parameters; it helps us understand how our model behaves on data it has never seen.

To incorporate the above two insights, our training loop can be further modified to:

var classifier = Classifier()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, classifier.parameters)
  }
  var stats = Stats()
  for example in validation_dataset {
    // Only gather guesses, never update the parameters.
    let guess = classifier.apply(input: example.data)
    // Stats will compare guess to the target, and return some
    // helpful statistics.
    stats.accumulate(guess: guess, target: example.target)
  }
  print("Epoch \(epoch), validation dataset stats: \(stats)")
}

Now I can claim: for any supervised learning task, you will find the above training loop when you dig deep enough through the abstractions. We can call this the universal supervised training loop.

Unsupervised Learning and Generative Networks

The main difference between unsupervised learning and supervised learning, as far as our training loop is concerned, is that the target is not provided by the training dataset; we derive it from somewhere else. In unsupervised learning, we derive the target from some transformation of the input. In generative networks, we derive the target from random noise (hence generating something from nothing).

var model = Model()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = model.apply(inputs: minibatch.data)
    // Unsupervised learning.
    let targets = model.targets(from: minibatch.data)
    // Generative networks
    // let targets = model.targets(from: noise)
    let loss = model.loss(guesses: guesses, targets: targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, model.parameters)
  }
}

Oftentimes, for these types of tasks, the targets are derived from another set of neural networks and updated jointly. Because of that, many frameworks add more bells and whistles when they implement the above training loop. You can find an example from Keras of deriving targets from the input data alone (get_masked_input_and_labels) for BERT (a popular unsupervised natural language processing model), or an example from PyTorch of generating adversarial examples from noise for DCGAN (a deep convolutional generative adversarial network).

Deep Reinforcement Learning

Deep reinforcement learning generates the training data by having an agent interact with the environment. It has its own loop, which looks like this:

while true {
  let action = agent.respond(to: lastObservation)
  let (observation, reward, done) = environment.interact(action: action)
  lastObservation = observation
  if done {
    break
  }
}

The agent takes an action based on the last observation; the environment consumes the action and emits a new observation and a reward.

This loop is independent of our training loop. In contrast to supervised learning, in deep reinforcement learning we use the interaction loop to generate training data.

Our training loop can be modified to look like this:

var policy = Policy()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  var data_in_episode = Dataset()
  while true {
    let action = policy(inputs: lastObservation)
    let (observation, reward, done) = environment.interact(action: action)
    data_in_episode.append((action: action, reward: reward, observation: lastObservation))
    lastObservation = observation
    if done {
      for (i, data) in data_in_episode.enumerated() {
        // Use all future rewards to compute our target.
        let target = target_from_future_rewards(data_in_episode[i..<])
        // Our input will be the last observation (Q-learning), and
        // potentially also include the action (Actor-Critic model),
        // or the next observation (model-based methods).
        training_dataset.append((input: (data.action, data.observation), target: target))
      }
      break
    }
  }
  // Rather than shuffling the whole training dataset, we just random
  // sample a subset.
  for minibatch in training_dataset.randomSampled().grouped(by: 32) {
    let guesses = policy.apply(inputs: minibatch.data)
    let loss = policy.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, policy.parameters)
  }
}

The training_dataset in the above training loop is referred to as replay memory in the literature. If we retain all the old data across episodes, this is often called off-policy training; if instead we discard all training data after each episode, it is called on-policy training. OpenAI Spinning Up explains the differences between on-policy and off-policy in a bit more detail.
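
In the pseudo-code above, the difference boils down to whether training_dataset persists across epochs. An on-policy variant can be sketched by clearing it after the updates (names as in the loop above):

```swift
  // ... inside the epoch loop, after collecting the episode ...
  for minibatch in training_dataset.randomSampled().grouped(by: 32) {
    let guesses = policy.apply(inputs: minibatch.data)
    let loss = policy.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, policy.parameters)
  }
  // On-policy: discard data generated by the now-outdated policy, so
  // the next epoch trains only on fresh interactions. Off-policy
  // training would instead keep training_dataset as replay memory.
  training_dataset = Dataset()
```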

You can find examples of the above training loops in PyTorch or in OpenAI Baselines.

Distributed Learning

Following the same training loop, we can extend it to train on multiple machines:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    // Compute gradients from the specific data on this machine only.
    let machineGradients = loss.gradients()
    // Use allreduce primitive to compute gradients summed from all
    // machines.
    let allGradients = allreduce(op: +, value: machineGradients, on: machineID)
    // Applying updates with the same gradients, it should yield the
    // same parameters on all machines.
    optimizer.apply_gradients(allGradients, classifier.parameters)
  }
}

The allreduce primitive goes over all machines and sums the gradients from them. In reality, it is often implemented with a ring-based communication pattern to optimize throughput. Naively, it can look like this:

func allreduce(op: Op, value: Tensor, on: Int) -> Tensor {
  if on == 0 {
    // Machine 0 gathers tensors from everyone else, sums them, and
    // sends the result back out.
    var tensors = [value]
    for i in 1..<MPI_Comm_size() {
      tensors.append(MPI_Recv(from: i))
    }
    let sum = tensors.sum()
    for i in 1..<MPI_Comm_size() {
      MPI_Send(sum, to: i)
    }
    return sum
  } else {
    MPI_Send(value, to: 0)
    return MPI_Recv(from: 0)
  }
}
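
The ring-based variant can be sketched in the same MPI-like pseudo-code (split(into:) and the concatenating initializer are assumed helpers, and the blocking send-then-receive convention follows the naive version above; a real implementation overlaps the two). Each machine splits its tensor into P chunks; P - 1 reduce-scatter steps leave each machine with one fully summed chunk, and P - 1 allgather steps circulate the summed chunks to everyone:

```swift
func ringAllreduce(value: Tensor, on: Int) -> Tensor {
  let P = MPI_Comm_size()
  let next = (on + 1) % P
  let prev = (on - 1 + P) % P
  var chunks = value.split(into: P)
  // Reduce-scatter: at each step, send one chunk to the next machine
  // and add the chunk received from the previous machine. After P - 1
  // steps, this machine holds the fully summed chunk (on + 1) % P.
  for step in 0..<(P - 1) {
    MPI_Send(chunks[(on - step + P) % P], to: next)
    let recv = (on - step - 1 + P) % P
    chunks[recv] = chunks[recv] + MPI_Recv(from: prev)
  }
  // Allgather: circulate the summed chunks around the ring until every
  // machine has all of them.
  for step in 0..<(P - 1) {
    MPI_Send(chunks[(on + 1 - step + P) % P], to: next)
    chunks[(on - step + P) % P] = MPI_Recv(from: prev)
  }
  return Tensor(concatenating: chunks)
}
```

Because each machine only ever sends 1/P of the tensor per step, the bandwidth per link stays constant as P grows, which is why the ring pattern scales better than the naive gather-and-broadcast.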

This naive data-distributed training loop can be extended to more sophisticated distributed training regimes. For example, ZeRO-Offload's distributed strategy can be represented as:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    for (i, gradient) in gradients.enumerated() {
      // Each machine only sums the gradients it is responsible for.
      // This method returns nil when asked to reduce a gradient it
      // is not responsible for.
      if let reducedGradient = reduce(op: +, id: i, value: gradient, on: machineID) {
        // Copy the summed gradient to CPU.
        cpuGradients[machineID, i] = reducedGradient.copy(to: .CPU)
      }
    }
    // Apply gradients to the model from CPU.
    optimizer.apply_gradients(cpuGradients[machineID], classifier.parameters[machineID])
    // Broadcast the latest parameters to all machines.
    broadcast(classifier.parameters[machineID])
  }
}

The Universal Training Loop

Finally, we arrived at our universal training loop template:

var model = Model()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  // Collect training dataset either from agent-environment
  // interaction, or from the disk.
  training_dataset = collect(...)
  // Go over mini-batch either on the whole training dataset, or from
  // a subset of it.
  for minibatch in training_dataset.extract(...) {
    // Apply the model to generate some guesses.
    let guesses = model.apply(inputs: minibatch.data)
    // Generate targets either from inputs, from noise, or it already
    // exists in the training dataset. Or a combination of above.
    let targets = targets_from(...)
    // Compute the loss from the model's guesses w.r.t. the targets.
    let loss = model.loss(guesses: guesses, targets: targets)
    // First-order gradients from the loss.
    let gradients = loss.gradients()
    // Sum gradients from everywhere (other machines, other GPUs) for
    // this particular node to process.
    if let (i, collected) = gradients.collect(...) {
      optimizer.apply_gradients(collected, model.parameters[i])
      // Broadcast the updated parameters to everywhere.
      broadcast(model.parameters[i])
    }
  }
}

*: The pseudo-code in this article uses Swift. One particular syntax concern is a[x..<y]. It is semantically the same as a[x:y] in Python, including cases where part of the subscript is missing: a[x..<] is the same as a[x:].

February 4th, 2021

I started working on a new data-science-related project last year. Our initial setup was fairly traditional: Python-centric, with Anaconda for dependency management and environment setup.

Anaconda is interesting. It is probably useful if you have a ton of dependencies, and the system tries really hard to figure out package compatibility based on their claims (the version numbers of their dependencies). For us, though, we only use a handful of packages with clean dependencies on each other (the usual suspects: Pandas, SciPy, numpy), and the version check just means each package upgrade is half an hour of SMT solving.

On the other hand, I made my fortune in the past decade doing app development (Facebook, Snapchat). I've been eyeing Swift since version 1.0, and the language has matured a lot since then. After gaining some Swift experience at my last employer, it seemed a good compromise between expressivity and performance. The noise from the language itself is minimal, and the performance can be tuned if you push hard, and is otherwise still better than raw Python.

After a few months of probing, investing, and bug fixing, we migrated our setup to Swift last December. It has been serving us well so far.

Problems

We have a small project, and our problems were mostly around package management and upgrades. Since Anaconda's environments are not per-project, we had to switch back and forth when entering / leaving the project.

Our project, although small, is a bit exotic. Our core algorithm was implemented in C, and we didn't want to ship a Python plugin in-tree. Hence, we opted to talk to the C lib through standard IO (subprocess). It turned out to be hairy, and the update process for the core algorithm was worse than terrible.

Pandas has reasonable performance with builtin functions, but once we drop down to apply / lambda, going through a few million rows for a particular column can take 30 to 40 seconds. In those cases, we also cannot use the remaining idle cores efficiently.

However, switching to a new language setup is not an easy task. Besides solving the problems above, we still wanted to keep a few things we liked about our old setup:

  • Jupyter notebook: we really like doing data exploration with Jupyter notebooks. Anything that requires us to compile / run / print would be a no-go. The interactive data exploration experience is essential to our workflow.

  • Pandas: we like Pandas; it is the Swiss Army knife of data science. It also has a big API surface that would be very hard to reimplement from scratch.

  • PyCharm or a similar IDE: we like using PyCharm for Python development a lot. Data inspection, re-runs, and debugging in general within an IDE bear no comparison with tools like vim (although I still use vim for a lot of development personally).

Other Potential Choices

Before embarking on this quest, we briefly looked at other potential choices:

  • TypeScript / JavaScript: it has some interesting data exploration patterns, such as observablehq.com. However, it doesn't have good C library integration points, leaving the aforementioned core algorithm integration problem unsolved.

  • Rust: when I did my investigation, I didn't notice the evcxr project. Beyond that, the syntax for modeling would be a bit noisier than I'd like, and calling Python through PyO3 is a bit clumsy too.

  • Julia: if I had more experience with the language, I might have a different opinion. But as it stands, the language has its own ecosystem, and I didn't see a good way to call Python libraries from it*. On the C integration side, it seems to require dynamic linking, which would be a bit more of a hurdle in my toolchain setup.

New Setup

Monorepo

Our new setup is a monorepo, with Bazel as the build system for our Swift code, our C libraries, and our Python dependencies. In the meantime, we still have some legacy Python libraries, now managed by Bazel too.

Bazel's new rules_python has a pretty reasonable pip_install rule for third-party dependencies. As mentioned, because we use a relatively small number of Python packages, cross-package compatibility is not a concern for us.

All our open-source dependencies are managed through WORKSPACE rules. This works for our monorepo because we don't have a large number of open-source dependencies in the Swift ecosystem; the things we import are mostly Swift numerics, algorithms, and argument-parser.

Jupyterlab

We no longer use a separate Jupyter installation. Jupyterlab is installed as a pip_install requirement of our monorepo, so opening Jupyterlab is as simple as bazel run :lab. This enables us to keep our Swift Jupyter kernel in-tree. We adopted swift-jupyter and added Bazel dependency support. We also have a pending PR to upstream our sourcekit-lsp integration with the Swift Jupyter kernel.

This complicates plugin management for Jupyterlab a bit, but we haven't yet found a well-maintained out-of-tree plugin we'd want to use.

Python

To support calling Python from Swift, we opted for the PythonKit library. We've upstreamed a patch to make Pandas work better within Swift, and made a few more enhancements around UnboundedRange syntax and passing Swift closures as Python lambdas, which haven't been upstreamed yet.
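
To give a flavor, a minimal sketch (not our production code; it assumes PythonKit and Pandas are available) of driving Pandas from Swift through PythonKit:

```swift
import PythonKit

let pd = Python.import("pandas")

// Swift literals bridge to Python objects; annotating the dictionary as
// [String: PythonObject] lets columns of different element types mix.
let columns: [String: PythonObject] = [
  "name": ["a", "b", "c"],
  "score": [1.0, 2.5, 4.0],
]
let df = pd.DataFrame(columns)

// Member access and subscripts forward dynamically to Python.
let total = Double(df["score"].sum())!
print(total)  // 7.5
```

Because both languages are reference counted, objects crossing the boundary are kept alive automatically, which is part of what makes this interop feel natural.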

One thing that makes calling Python from Swift easy is that both languages use reference counting. This makes memory management across the language boundary much more natural.

We also wrote a simple pyswift_binary macro in Bazel so that a Swift binary can declare its Python dependencies and have them set up properly before the Swift binary is invoked.

We haven't vendored our own Python runtime yet; we just painstakingly make sure all machines are on Python 3.8. However, we do intend to use py_runtime to solve this discrepancy in the future.

C Libraries

Swift's interoperability with C is top-notch. Calling and compiling C dependencies (with Bazel) and integrating them with Swift is as easy as it should be. So far, we've fully migrated our core algorithms to Swift, and iterating on them is much easier.

IDE Integration

For obvious reasons, we cannot use PyCharm with Swift. However, we successfully migrated most of our workflow to VS Code. LLDB support makes debugging easy, though we did compile our own Swift LLDB with Python support for this purpose (we're still working with the core team to figure out why the Swift LLDB shipped on Linux has no Python support).

Sourcekit-lsp doesn't recognize Bazel targets, and the work on Build Server Protocol support, on both the Bazel side and the sourcekit-lsp side, seems stalled at the moment. We ended up writing a simple script that queries compilation parameters from bazel aquery for all our Swift source code and puts them into compile_commands.json. Sourcekit-lsp has just enough support for Clang's compilation database format to make code highlighting, auto-complete, go-to-definition, and inline documentation work again.

We committed .vscode/settings.json.tpl and .vscode/launch.json.tpl to the codebase. Upon initial checkout, a small script converts these templates into actual settings.json and launch.json files. We did this to work around some VS Code plugins requiring absolute paths. The two files have been kept in sync with the templates as part of the build process ever since.

Bonus

Swift's argument-parser library makes creating command line tools really easy. Furthermore, it supports auto-completion in your favorite shell. We implemented one all-encompassing CLI tool for our monorepo that does all kinds of stuff: data downloading, transformation, model evaluation, launching Jupyterlab, etc. With auto-completion and help messages, it is much easier to navigate than our previous ./bin directory with twenty-something separate scripts.

Is this the End?

We’ve been happy with this new setup since the beginning of this year, but it is far from the end-all be-all solution.

Swift is great for local development. However, the Bazel support on Linux generates binaries that dynamically link to the Swift runtime. This makes deployment a little more involved than we’d like it to be.

Pandas uses an internal data structure for DataFrame (column-based numpy arrays). While we do have efficient numpy-to-tensor conversions, these cannot be leveraged in the context of Pandas (we want to lift the whole data frame, not just numpy-ready columns). Our current solution calls itertuples, which can be quite slow. We’d like to sponsor an open-source project to implement Apache Arrow support for Swift. That would enable us to pass data from Pandas to Swift through the Arrow in-memory format, which may be faster than what we currently do.


*: notagoodidea pointed out there is a PyCall.jl library that implements Python interoperability.

December 29th, 2020

每两年,我都会做一次四年为期的预测。对2020年的预测,怎么也不会预料到这样的突发事件。这样的突变,反而会使得两年前和四年前过于悲观的预测显得乐观不少。

回顾2016年的预测,辅助驾驶技术已经广泛应用于汽车之中。Cruise和Waymo也分别在旧金山和凤凰城开始了无驾驶员的自动驾驶汽车的运营。存活的移动消息服务,基本也都达到了三亿日活左右的标准。

许多长线航班的取消,虽然确实实现了,却是因为不同的原因。而石油价格虽然大幅变化,且维持在40美元左右,变化的方向却完全错了。

VR设备并没有在19年底售出三千万,仅仅售出了三百万台左右。

写了这么几篇的四年预测之后,我的目的也从单纯的追求准确变成了在追求准确的同时,尽量写一些以前没写过的新趋势。这也就是为什么在絮絮叨叨了将近十年的自动驾驶之后,这一篇并没有提了。因为他终于开始走上了产业化的正轨,接下去就是按部就班了。


2024年,却应该是乐观的。在经历了一场大瘟疫之后,短暂的萧条却由于多国政府的措施得当,变成了经济增长期。尤其是传统的发达国家,因为美国的国家正常化(美元占全球交易总额在30%以下)举措,迎来了一波复苏。

科技的发展也并没有因为瘟疫而停下脚步。更多的数据中心、更快的网络,是人们对疫情后世界的基本要求。更多的更持续的工作虚拟化需求加速了往云端迁徙的脚步。到2024年,全世界而言,线上支付会占到消费者总支付的一半以上。

无线互联网的延迟在90%的情况下会在50毫秒以下,25%以上的人口会有机会使用1G以上带宽的互联网。在算力方面,最好的单机工作站可以达到1 Petaflops的全32位浮点矩阵运算速度,近32位浮点矩阵运算速度将会达到10 Petaflops。2011年的超级计算机京也不过是10 Petaflops左右。

但是计算不止发生在云端。在2024年,我们会看到Hyperscalers运营耗电量接近1 Gigawatt的单个数据中心,但是也会看到耗电量不到15 Kilowatt的最后一公里边缘计算中心。成功的边缘计算服务提供商会管理将近25,000个边缘计算中心。这些边缘计算中心是提供高速互动视频和云游戏的关键。

即使有了云游戏,玩3A游戏的人却并没有增加多少,只是现有玩家的游戏时长增加了而已。另一方面,游戏引擎的使用却越来越平易近人。到2024年,20%的独立视频制作者将会使用游戏引擎实现实时的视频特效。一半以上的在线教学课件也会使用游戏引擎进行展示。

mRNA疫苗的成功应用和对疫苗监管的放松使得在2024年,一小部分人最先享受到了个性化疫苗的服务。这些个性化疫苗会从最先预防普通感冒开始,我们甚至可能会听到阿尔茨海默症疫苗的好消息。

虽然在2024年,我们不大可能完成一整套AGI系统的设计,但是一种通用的增强学习算法架构成功的应用到了从短期趋势预测到长期战略计划的方方面面。除了少数在数据融合输入方面尚有问题的领域(比如多边贸易谈判,民事刑事侦查诉讼等),大部分领域的执行将会是这套增强学习算法,而我们只是负责简单的数据整理和维护。从这个角度来看,我们离能自我进化的AGI系统又近了一步。

因为疫情而导致的电影网络同步上映并不会因为疫情的消失而消失。事实上,到了2024年,全年50%的电影会在院线上映的30天内在网络上映。每十几分钟分段的直播和录播混合的互动视频占据了我们剩余的娱乐注意力。我们希望视频的创作者更快的做出我们想要看的东西。90%头部创作者的视频上传周期从每周两更变成了一天一更。

虽然我们迎来了经济的复苏,全球化和全球化相关的全球变暖议题并没有实质的进步。除了山火和热带风暴,我们将在2024年以前见证一次大规模的近大陆架海洋动物死亡事件。因为海水酸化,太平洋沿大陆架将会迎来大量的海带养殖投资。因为海带养殖的盈利性,我们甚至会往深海发展,养殖更多的海带。海带会在美国成为一种时尚的健康食品。

到了2024年,因为植树造林和海带养殖,地球的肺得以有所喘息。但是对于控制温室气体的排放而言,这远远不够。除了少数的例外,大部分国家,包括中国、印度和美国,都将放弃或者达不到2024年的减排目标。我们仍然没有一种可以大规模快速捕获温室气体的技术投入使用。唯一的好消息可能是俄罗斯人多了几个天然不冻港。

但是,站在2024年的门槛上,虽然我们看到了全球化的受阻和意识形态斗争的暗流涌动,却充满希望:只要我们进步得够快,或许不会那么糟呢。

Every two years, I make a 4-year scenario prediction / planning exercise. The previous prediction for 2020, however, couldn’t even remotely anticipate how dramatic the year turned out to be. Such a drastic change, on the other hand, makes the seemingly gloomy predictions made two and four years ago look quite a bit more optimistic.

Right on the mark for the 4-year prediction made in 2016 is the prevalence of driver-assistance technologies in all types of cars. Cruise and Waymo finally started their human-driver-free autonomous vehicle commercialization in San Francisco and Phoenix, Arizona. The surviving mobile messaging services all have around 300 million DAUs by now.

A lot of long-haul flights have been cancelled, but for reasons other than what was predicted. Crude oil prices swung wildly and ended the year around $40, but the direction of the swing was the opposite of what was predicted.

VR devices didn’t sell 30 million units in 2019. Only about 3 million were sold that year.

After the posts of the past few years, my goal for these changed from the singular pursuit of accuracy to unmasking new trends while staying accurate. That’s why, after almost ten years of religiously repetitive discussion of autonomous driving, there is none this time. It is now on track for real-world productionization; the rest will happen over time.


2024 should be an optimistic year. After the global pandemic, the predicted depression will not happen, thanks to combined monetary and fiscal policies from governments around the world. Developed countries, especially the United States after its dollar normalization effort (the dollar will account for 30% or less of global transactions), will be on a strong recovery path.

Technology development won’t slow down a bit, even during the pandemic. More data centers and faster internet are what people expect post-pandemic. More, and sustained, work digitalization will accelerate the migration to the cloud. By 2024, online transactions will account for more than 50% of total consumer transactions globally.

The latency of wireless networks will improve: 90% of connections will be below 50ms, and 25% of the world population will have access to 1Gbps internet connections. The best workstation will do 1 petaflops of full 32-bit floating-point matrix computation, and around 10 petaflops of near-32-bit floating-point matrix computation. To put this in perspective, the Fujitsu K supercomputer in 2011 could only do about 10 petaflops.

Computation won’t be monopolized by the cloud. In 2024, we will see at least one hyperscaler running a single 1-gigawatt data center, but at the same time there will be last-mile edge data centers drawing less than 15 kilowatts. Successful edge computing providers will manage close to 25,000 edge data centers. These data centers will be the key to high-speed interactive video and cloud gaming.

Even with the moderate success of cloud gaming, there won’t be many more gamers for AAA titles; existing gamers will simply play them more. On the other hand, game engines will penetrate more markets. By 2024, 20% of independent content creators will use game engines for real-time visual effects. More than half of online education kits will use game engines for presentation.

mRNA vaccines and subsequently streamlined regulation will make it possible that, in 2024, a privileged few can enjoy personalized vaccines. These vaccines will start with the common cold; we may even hear some good news on Alzheimer’s vaccine development.

We won’t finish a full-blown AGI system by 2024. However, a generalized reinforcement learning system will be successfully applied to everything from short-term trend analysis to long-term strategic planning. Besides a few areas with problems on the data-input side (such as multilateral trade negotiations and civil / criminal litigation), many more places will use this generalized reinforcement learning system, and we will simply be responsible for data preparation and maintenance. We are an inch closer to a self-evolving artificial general intelligence system.

The online / in-theater lock-step movie releases won’t be gone post-pandemic. By 2024, 50% of movies will have a digital release within 30 days of their theatrical release. The rest of our entertainment attention will go to interactive videos mixing live streams and clips a few minutes apiece. We demand faster feedback loops from content creators: 90% of top content creators will publish daily, compared to twice a week today.

During our economic recovery, globalization and its related topics such as climate change won’t make much progress. In between mega wildfires and mega storms, we will witness a large-scale synchronized die-off of marine animals along the continental shelf. Due to ocean acidification, we will see growing investment in kelp farming along the Pacific shelf. It will be so profitable that we will even build kelp farms in the deep sea. Kelp will become a fashionable health food in the United States.

With the reversal of deforestation and the rise of kelp farming, by 2024 the lungs of the Earth will slowly start to heal. But for controlling greenhouse gas emissions, this is far from enough. With few exceptions, most countries, in particular China, India and the United States, will either give up on or fail to reach their 2024 carbon-reduction goals. There is still no viable large-scale carbon capture technology in use. The only good news from climate change is probably a few more year-round ice-free ports for the Russians.

Standing at the doorstep of 2024, with the stagnation of globalization and slow-motion ideological conflicts, we will still be hopeful: if we can progress fast enough, like we did in the past 4 years, maybe, just maybe, the worst won’t come.

December 6th, 2020

From the onset of implementing libnnc, it was meant to be a common ground for higher-level language bindings beyond Python. The underlying architecture has been stable for a year or so, and I have been using it for some personal projects for a while. But the raw C interface is not the easiest to use, so it is time to implement some high-level language bindings for the library.

The default high-level language for deep learning is likely to be Python. However, I am not happy with how it performs on a typical many-core system, even with things that are supposed to help. Swift, on the other hand, has no issue saturating my many-core system, and it has a reasonable Python binding to tap into the rich Python tool-kits. Not to mention that calling C functions from Swift is about as easy as it can possibly get.

This conviction resulted in s4nnc, a Swift language binding for libnnc. Because s4nnc is a pure interface to interact with the underlying deep learning library, I paid close attention to its API design. Below are some design notes on why it is the way it is, and how Swift as a language fares on such a task. If you want an introduction to s4nnc, feel free to visit the GitHub homepage.

What is a Deep Learning Library API

A good deep learning API, to me, can be largely modeled after Keras and PyTorch. It concerns itself, above all, with two questions:

  1. How to specify a deep learning model?
  2. How to construct a training loop?

Everything else is nice-to-have and largely orthogonal to these two questions (but those bells-and-whistles are a lot of hard work!).

A training loop consists of a repeated three-step sequence: evaluate the model, compute gradients against the loss, and apply the gradients to update model parameters.

The details can be flexible: you could evaluate one part of the model in one round and another part in another round; you could have different losses for different model outputs each round; you could modify the gradients, scaling or truncating them however you like; and you could apply gradients to update different model parameters with different optimizers. But at the core, I didn’t see much change to this 3-step sequence across a lot of model-training code.

What constitutes a model is more interesting, but it seems we have converged on a concept where a model consists of some inputs, some outputs, and some parameters. Particularly, parameters are stateful and internal to the model itself. Beyond that, a model could have different inputs / outputs and different input / output shapes during the training loop. However, the shapes and the number of parameters are likely to stay constant during training.

Basic Data Types

A deep learning library operates on multi-dimensional arrays (or tensors). In Swift, a concrete tensor can be represented as a value type, like Array itself in Swift. That means the Tensor type needs things such as copy-on-write to implement said value-type semantics. Extra attention needs to be paid to make sure that, throughout the implementation of the API, no unnecessary copies are made. This value-type choice is a bigger deal than it sounds (it sounds like a no-brainer given S4TF made exactly the same choice) because in Python, everything is a reference type.

This becomes more interesting when deciding whether tensor variables should be value types or not. A tensor variable is an abstract tensor type on which you can compute gradients (it has a grad property). It is bound to a computation graph and can be a parameter to update during the training loop. While it is possible to make many functions associated with tensor variables take inout parameters and to mark some of tensor variables’ methods as mutating, the hairier part is the updates during the training loop.

In PyTorch, an optimizer takes a list of parameters and then applies new gradients to update these parameters when the step() method is called. This is possible because parameters are reference types in Python: any update to the parameters is reflected in the model that holds them. This is not possible if tensor variables are value types, because in Swift you cannot hold a reference to a value type.

Thus, practically, tensor variables have to be implemented as reference types in Swift.

Despite my best intentions, it turns out most of the objects, including Graph, Model and StreamContext, are still implemented as reference types. It is possible for some of them (for example, Model) to be value types, but the lack of deinit in struct requires wrapping a reference type inside a value type to create such an API. At the end of the day, I don’t see much value, either in API aesthetics or performance-wise, in making these value types.

Automatic Differentiation

While Swift has a proposal for automatic differentiation, automatic differentiation right now is implemented at the library level and is only applicable to models and tensor variables.

On the API side, it is popular to have a backward() method on the final loss variable. Said method computes gradients against all variables associated with the final computation of the loss.

This also means we need to keep track of all variables in the computation graph until some point is reached where we can free them. In PyTorch, that point is when the step() method is called.

libnnc made the decision early on to avoid holding vast amounts of memory by default. That resulted in the interface backward(to tensors: Sequence&lt;Tensor&gt;), where you have to specify which tensors to compute the gradients against. Because we are doing reverse-mode AD, we still compute gradients on all variables up until these tensors, but we don’t compute gradients against variables past that point. In effect, we can rely on reference counting to free the memory associated with tensors beyond that point.

In return for this slightly more complex interface, you don’t have to worry about scoping with no_grad to avoid unbounded memory allocations.

Optimizers in the Training Loop

An optimizer in a deep learning library represents a particular gradient descent method associated with some parameters to update each round.

One particular challenge is how to use multiple optimizers with different parameters in one round. For simpler cases, you could call step() many times in one round, but it may be more efficient to call step() once.

Swift makes this particular choice easier by supporting extensions on built-in types.

public extension Collection where Element: Optimizer {
  func step() {
    ...
  }
}

This is a good starting point for supporting more concise updates such as [adam1, adam2].step().

The same pattern can be applied if you want to support gradients with multiple losses: [loss1, loss2, loss3].backward(to: x).

These extension methods are conditional, and type-safe in Swift.

Operators

While Swift allows operator overloading and the ability to introduce custom operators, the ambiguity of * is somehow not addressed. We cannot have consistency with Python, because @ cannot be overloaded in the Swift language. I had to resort to .* and .+ for element-wise multiplication and addition.

Type-safe Functions

While Swift can enforce some type consistency, without more language-level changes we cannot deduce shapes, and we would still encounter runtime errors if shapes don’t match. Even with language-level support, we may still need an escape hatch, because some tensors could be loaded from IO. Not to mention it would be nice to support dynamic input shapes to a model while still statically computing its parameter shapes.

Transparent Multi-GPU Data Parallel Training

One annoying thing in the raw C interface is the transition from one GPU to multi-GPU, even with simple data-parallel models. Being unable to abstract tensors to a higher level in C makes the code unnecessarily complex.

Fortunately, with basic generics, this is not a problem in Swift. However, using such generics turned out to be more complicated than I expected on the library-author side. If I want to avoid runtime type checks (the is / as keywords), there is quite a bit of protocol / extension type dance I need to do. It doesn’t help that Swift is borderline hostile to protocol associated types. Luckily, there are new proposals to lift some of the restrictions (and no, some is not all it needs). You can see this monstrosity here. In the end, I had to introduce some runtime type checks to keep my sanity.

Closing Words

I am pretty happy with the end result of s4nnc. It is small, versatile, and does exactly what I set out to do: an ergonomics win over the raw C interface. Generics, type safety and reference counting in the Swift language really made a difference in API ergonomics. The language itself is not the fastest, but it has a good balance of ergonomics, expressivity and performance. In the next blog post, I will detail the Swift data science workflow I have, and why we moved to it from the initial Python one. Stay tuned!

November 1st, 2020

For the past few months, I’ve worked on Dflat, among many other things. The theme is to explore a workflow using Bazel as the build and deployment system, and Swift as the main language for data exploration, modeling, and the production system.

Dflat appears to be an outlier for this theme. I simply had the urge to implement what I considered to be the one true way of doing structured data persistence on mobile, and to share it with the wider community. It was designed squarely for mobile data persistence needs, much like Core Data before it. Case in point: the backing data engine has always been SQLite.

Up until recently, that is. As I dug deeper into this Swift-for-data-exploration, modeling and production-system theme, it became increasingly likely that I would need some data persistence mechanism. It may not need to be as fancy as on mobile, where I can observe changes and rely on one-way data flow to update the UI. But it is darned nice to specify the schema explicitly, persist to a proper database, and not worry about schema upgrades at all.

For the past couple of days, I’ve worked on porting Dflat to Linux. With some minimal work (mostly around figuring out the right SQLite compilation flags and moving code from one place to another to make the Swift Linux runtime happy), it is done. I’ve also added some Bazel rules for code generation, so if you use Bazel, Dflat works wonders out of the box.

What does this mean? If you use Swift on Linux, you don’t need to interact with SQLite directly any more. Dflat handles SQLite concurrency control, proper SQLite configuration, strongly-typed queries, data schema management, and transparent (and read-only) backward-compatible upgrades. Beyond that, it offers query subscriptions, so a long-running program can simply subscribe to a query and update based on the changed results.

An Example

If you happen to do Bazel + Swift on Linux, using Dflat is simple. In your WORKSPACE file, add the following:

git_repository(
    name = "dflat",
    remote = "https://github.com/liuliu/dflat.git",
    commit = "3dc11274e8c466dd28ee35cdd04e84ddf7d420bc",
    shallow_since = "1604185591 -0400"
)
load("@dflat//:deps.bzl", "dflat_deps")
dflat_deps()

For a binary that needs to persist some user settings, you can start by editing the schema file, user.fbs:

table User {
    username: string (primary);
    accessTime: double;
}
root_type User;

In your BUILD.bazel:

load("@dflat//:dflat.bzl", "dflatc")
dflatc(
    name = "user_schema",
    src = "user.fbs"
)
swift_binary(
    name = "main",
    srcs = ["main.swift", ":user_schema"],
    deps = [
        "@dflat//:SQLiteDflat"
    ]
)

Use them in the main.swift file:

import Dflat
import SQLiteDflat
let workspace = SQLiteWorkspace(filePath: "main.db", fileProtectionLevel: .noProtection)
workspace.performChanges([User.self], changesHandler: { txnContext in
  let creationRequest = UserChangeRequest.creationRequest()
  creationRequest.username = "lliu"
  creationRequest.accessTime = Date().timeIntervalSince1970
  txnContext.try(submit: creationRequest)
})
workspace.shutdown()

Later, read the data back and update the accessTime:

import Dflat
import SQLiteDflat
let workspace = SQLiteWorkspace(filePath: "main.db", fileProtectionLevel: .noProtection)
let result = workspace.fetch(for: User.self).where(User.username == "lliu")
let user = result[0]
print(user.accessTime)
workspace.performChanges([User.self], changesHandler: { txnContext in
  let changeRequest = UserChangeRequest.changeRequest(user)
  changeRequest.accessTime = Date().timeIntervalSince1970
  txnContext.try(submit: changeRequest)
})
workspace.shutdown()

iPhone 12 Pro vs. Threadripper 3970x

The Dflat benchmark is now available on both the iPhone 12 Pro and the TR 3970x. It enables some quick apples-to-oranges comparisons.

The iPhone 12 Pro benchmark followed exactly the previous benchmark.

The Linux benchmark runs with bazel run --compilation_mode=opt app:BenchmarksBin.

Work                                   | iPhone 12 Pro | TR 3970x
Insert 10,000                          | 0.072s        | 0.054s
Fetch 3,334                            | 0.004s        | 0.006s
Update 10,000                          | 0.085s        | 0.066s
4-Thread Insert 40,000                 | 0.452s        | 0.219s
4-Thread Delete 40,000                 | 0.289s        | 0.193s
10,000 Updates to 1,000 Subscriptions* | 3.249s        | 3.356s

*: Each subscribed query contains roughly 1,000 objects.

Beyond Server / Client Relationship

The Bazel and Swift Package Manager version of Dflat for Linux packages SQLite 3.33.0 in source form. This enables some interesting ideas. With a little more work, we may as well compile Dflat to run in WebAssembly. Wasm can be the ideal form to deploy at edge nodes. When I worked on the Snapchat app in the past, I long theorized that we could deploy peering nodes for customers, and the server / device communication could be replaced with server / peering node / device, with the peering node and the device syncing materialized views incrementally through some kind of binlog mechanism. It looks like this kind of architecture can be done, if we have some uniformity between peering nodes and mobile devices.

Beyond SQLite

Even though Dflat now supports Linux, that doesn’t make it server software. Advanced features such as live query subscription only work if you go through a single Workspace instance. To make Dflat work on Linux, we actually deepened the SQLite–Dflat relationship by promoting some code from SQLiteDflat to Dflat.

This doesn’t mean we cannot change the backing data engine. Dflat could potentially help server-grade databases such as PostgreSQL in other ways. The smooth schema evolution would be an interesting way to formalize what companies like Uber have already done.

A server-grade database such as PostgreSQL could also potentially support live query subscription across many Workspace instances with some substantial redesign of Dflat internals.

Exploration Ahead

Supporting Dflat on Linux enables some interesting design-space exploration (for one, I need to change the Workspace.shutdown() semantics to work better with short-lived command-line programs). Whether a mobile-focused database works for a variety of Linux-based workflows is still a question without a definitive answer. I will continue my exploration in the meantime and report back any substantial findings.

October 9th, 2020

I plan to write a series of articles discussing some simple but not embarrassingly parallel algorithms. These have practical uses and will most likely target many-core CPUs or CUDA GPUs. Today’s, the first one, discusses a parallel implementation of a CSV file parser.

In the old days, when our spinning-disk speed maxed out at 100MiB/s, we only had two choices: either we didn’t care about file loading time at all, treating it as a cost of life, or we had a file format entangled with the underlying memory representation to squeeze the last bits of performance out of data loading.

That world is long gone. My current workstation uses software RAID0 (mdadm) over two 1TB Samsung 970 EVO NVMe drives for data storage. This setup usually gives me around 2GiB/s read / write speed (you can read more about my workstation here).

The CSV file format is firmly in the former category of the two. The thing people who exchange CSV files care about most, above anything else, is interoperability. Serious people who actually care about speed and efficiency have moved to other formats such as Apache Parquet or Apache Arrow. But CSV files live on. It is still by far the most common format in Kaggle contests.

There exist many CSV parser implementations; csv2 and Vince’s CSV Parser are two common ones, not counting standard implementations such as the one shipped with Python.

Most of these implementations shy away from utilizing many cores. That is a reasonable choice: in many likely scenarios, you load many small-ish CSV files, and these loads can be parallelized at the task level. That was OK until recently, when I had to deal with CSV files of many GiBs. These files can take many seconds to load, even from tmpfs, which indicates a performance bottleneck in CPU parsing time.

The most obvious way to overcome the CPU parsing bottleneck is to fully utilize the 32 cores of the Threadripper 3970x. This would be embarrassingly simple if we could reliably break the parsing down by rows. Unfortunately, RFC 4180 prevents us from simply using line breaks as row delimiters: when quoted, a cell can contain line breaks, and those must not be recognized as row delimiters.

Paratext first implemented a two-pass approach for parallel CSV parsing. It was later documented in Speculative Distributed CSV Data Parsing for Big Data Analytics. Besides the two-pass approach, the paper discusses a more sophisticated speculative approach suitable for higher-latency distributed environments.

In the past few days, I implemented a variant of the two-pass approach that can max out the NVMe storage bandwidth. It was an interesting journey, as I hadn’t written any serious parser in C for a very long time.

The CSV File Parsing Problem

A CSV file represents simple tabular data with rows and columns. Thus, parsing a CSV file means dividing a text file into cells that can be uniquely identified by row and column index.

In C++, this can be done in zero-copy fashion with string_view. In C, every string has to be null-terminated, so you need to either manipulate the original buffer or copy it over. I elected the latter.

Memory-Mapped File

To simplify the parser implementation, we assume we are given a block of memory that is the content of the CSV file. This can be done in C with:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>

FILE* file = fopen("file path", "r");
const int fd = fileno(file);
// Find the file size by seeking to the end.
fseek(file, 0, SEEK_END);
const size_t file_size = ftell(file);
fseek(file, 0, SEEK_SET);
// Map the whole file read-only into our address space.
void* const data = mmap((caddr_t)0, file_size, PROT_READ, MAP_SHARED, fd, 0);

OpenMP

We are going to use OpenMP’s parallel for-loop to implement the core algorithm. Nowadays, Clang has pretty comprehensive support for OpenMP. Nevertheless, we will only use a very trivial part of what OpenMP provides.

Find the Right Line Breaks

To parse a CSV file in parallel, we first need to break it down into chunks. We divide the file into 1MiB sequences of bytes as our chunks. Within each chunk, we can then find the right line breaks.

The double-quote in RFC 4180 can quote a line break, which makes finding the right line breaks harder. But at the same time, the RFC defines how to escape a double-quote: two double-quotes back-to-back. With this, if we count double-quotes from the beginning of the file, we know that a line break is within a quoted cell if we have encountered an odd number of double-quotes so far. If we have encountered an even number of double-quotes before a line break, we know it is the beginning of a new row.

We can count double-quotes from the beginning of each chunk. However, because we don’t know whether an odd or even number of double-quotes precede the chunk, we cannot tell whether a given line break starts a new row or sits within a quoted cell. What we do know is that all line breaks after an odd number of double-quotes within a chunk belong to the same class; we simply don’t know yet which class that is. So we count the two classes separately.

A code excerpt would look like this:

#define CSV_QUOTE_BR(c, n) \
    do { \
        if (c##n == quote) \
            ++quotes; \
        else if (c##n == '\n') { \
            ++count[quotes & 1]; \
            if (starter[quotes & 1] == -1) \
                starter[quotes & 1] = (int)(p - p_start) + n; \
        } \
    } while (0)
    parallel_for(i, aligned_chunks) {
        const uint64_t* pd = (const uint64_t*)(data + i * chunk_size);
        const char* const p_start = (const char*)pd;
        const uint64_t* const pd_end = pd + chunk_size / sizeof(uint64_t);
        int quotes = 0;
        int starter[2] = {-1, -1};
        int count[2] = {0, 0};
        for (; pd < pd_end; pd++)
        {
            // Load 8-bytes at batch.
            const char* const p = (const char*)pd;
            char c0, c1, c2, c3, c4, c5, c6, c7;
            c0 = p[0], c1 = p[1], c2 = p[2], c3 = p[3], c4 = p[4], c5 = p[5], c6 = p[6], c7 = p[7];
            CSV_QUOTE_BR(c, 0);
            CSV_QUOTE_BR(c, 1);
            CSV_QUOTE_BR(c, 2);
            CSV_QUOTE_BR(c, 3);
            CSV_QUOTE_BR(c, 4);
            CSV_QUOTE_BR(c, 5);
            CSV_QUOTE_BR(c, 6);
            CSV_QUOTE_BR(c, 7);
        }
        crlf[i].even = count[0];
        crlf[i].odd = count[1];
        crlf[i].even_starter = starter[0];
        crlf[i].odd_starter = starter[1];
        crlf[i].quotes = quotes;
    } parallel_endfor

This is our first pass.

Columns and Rows

After the first pass, we can sequentially go through each chunk’s statistics to calculate how many rows and columns are in the given CSV file.

The line breaks in the first chunk after an even number of double-quotes give the number of rows in the first chunk. Because we know the number of double-quotes in the first chunk, we then know which class of line breaks in the second chunk marks the start of a row. The sum of these line breaks is the total number of rows.

For the number of columns, we can go through the first row and count the number of column delimiters outside of double-quotes.

Wiring the Cell Strings

The second pass copies the chunks over, null-terminates each cell, and unescapes double-quotes where necessary. We can piggyback on the chunks allocated for the first pass. However, unlike the first pass, the parsing logic doesn’t start at the very beginning of each chunk: it starts at the first row start in that chunk and ends at the first row start in the next chunk.

The second pass turns out to take most of the parsing time, simply because most of the string manipulation and copying happens in this pass.

More Optimizations

Both the first and second passes are unrolled into 8-byte batch parsing rather than per-byte parsing. For the second pass, we do some bit-twiddling to quickly check whether there are delimiters, double-quotes, or line breaks that need to be processed, or whether we can simply copy the whole word over.

const uint64_t delim_mask = (uint64_t)0x0101010101010101 * (uint64_t)delim;
const uint64_t delim_v = v ^ delim_mask;
if ((delim_v - (uint64_t)0x0101010101010101) & ((~delim_v) & (uint64_t)0x8080808080808080)) {
    // Has delimiters.
}

You can find more discussion about this kind of bit-twiddling logic here.

Is it Fast?

The complete implementation is available at ccv_cnnp_dataframe_csv.c.

The implementation was compared against csv2, Vince’s CSV Parser and Paratext.

The workstation uses AMD Threadripper 3970x, with 128GiB memory running at 2666MHz. It has 2 Samsung 1TB 970 EVO with mdadm-based RAID0.

For csv2, I compiled csv2/benchmark/main.cpp with:

g++ -I../include -O3 -std=c++11 -o main main.cpp

For Vince’s CSV Parser, I compiled csv-parser/programs/csv_bench.cpp with:

g++ -I../single_include -O3 -std=c++17 -o csv_bench csv_bench.cpp -lpthread

Paratext hasn’t been actively developed for the past 2 years. I built it after patching paratext/python/paratext/core.py to remove the splitunc method. The simple benchmark Python script looks like this:

import paratext
import sys

dict_frame = paratext.load_raw_csv(sys.argv[1], allow_quoted_newlines=True)

I chose the DOHUI NOH dataset, which contains a 16GiB CSV file with 496,782 rows and 3,213 columns.

First, to test raw performance, I moved the downloaded file to /tmp, which is mounted as an in-memory tmpfs.

Software               Time
Paratext               12.437s
Vince’s CSV Parser     37.829s
csv2                   19.221s
NNC’s Dataframe CSV    4.093s

The numbers above represent the best you can get if file IO is not a concern. With the aforementioned 970 EVO RAID0, we can run another round of benchmarks against real disk IO. Note that for this round, we need to drop the system file cache with sudo bash -c "echo 3 > /proc/sys/vm/drop_caches" before each run.

Software               Time
Paratext               16.747s
Vince’s CSV Parser     39.6075s
csv2                   21.035s
NNC’s Dataframe CSV    7.895s

The performance of our parser approaches 2000MiB/s, not bad!

Is our Implementation Reasonable with Lower Core Count?

csv2 is single-threaded. With only one thread, would our implementation still be reasonable? I changed parallel_for back to a serial for-loop and ran the experiment on tmpfs again.

Software                             Time
csv2                                 19.221s
NNC’s Dataframe CSV (Many-core)      4.093s
NNC’s Dataframe CSV (Single-core)    45.391s

It is about 2x slower than csv2. This is expected because we need to null-terminate strings and copy them to a new buffer.

Finish It Up with Fuzzing

You cannot really ship a parser in C without fuzzing it. Luckily, in the past few years it has become incredibly easy to write a fuzz program in C. This time, I chose LLVM’s libFuzzer and turned on AddressSanitizer along the way.

The fuzz program is very concise:

#include <ccv.h>
#include <nnc/ccv_nnc.h>
#include <nnc/ccv_nnc_easy.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int LLVMFuzzerInitialize(int* argc, char*** argv)
{
    ccv_nnc_init();
    return 0;
}

int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
    if (size == 0)
        return 0;
    int column_size = 0;
    ccv_cnnp_dataframe_t* dataframe = ccv_cnnp_dataframe_from_csv_new((void*)data, CCV_CNNP_DATAFRAME_CSV_MEMORY, size, ',', '"', 0, &column_size);
    if (dataframe)
    {
        if (column_size > 0) // Iterate through the first column.
        {
            ccv_cnnp_dataframe_iter_t* const iter = ccv_cnnp_dataframe_iter_new(dataframe, COLUMN_ID_LIST(0));
            const int row_count = ccv_cnnp_dataframe_row_count(dataframe);
            int i;
            size_t total = 0;
            for (i = 0; i < row_count; i++)
            {
                void* data = 0;
                ccv_cnnp_dataframe_iter_next(iter, &data, 1, 0);
                total += strlen(data);
            }
            ccv_cnnp_dataframe_iter_free(iter);
        }
        ccv_cnnp_dataframe_free(dataframe);
    }
    return 0;
}

./csv_fuzz -runs=10000000 -max_len=2097152 took about 2 hours to finish. I fixed a few issues before the final successful run.

Closing Words

With many-core systems becoming increasingly common, we should expect more programs to use these cores at levels traditionally considered single-core territory, such as parsing. It doesn’t necessarily need to be hard, either! With good OpenMP support and some simple tuning on the algorithm side, we can easily take advantage of improved hardware to get more things done.

I am excited to share more of my journey into parallel algorithms on modern GPUs next. Stay tuned!


2020-10-18 Update

There are some more discussions on lobste.rs and news.ycombinator. Following these discussions, I benchmarked xsv on this particular CSV file and implemented zero-copy parsing (effectively pushing more processing to iteration time) on my side. The zero-copy implementation turned out trickier than I initially liked, because expanding from a plain pointer to a pointer + length has memory usage implications and can be slower if implemented naively.

Here are some results for parsing from NVMe storage:

Software                           Time
xsv index                          26.613s
NNC’s Dataframe CSV                7.895s
NNC’s Dataframe CSV (Zero-copy)    7.142s

When run on a single core, the parallel parser is penalized by the two-pass approach. In particular:

Software                                                     Time
NNC’s Dataframe CSV (Single-core)                            45.391s
NNC’s Dataframe CSV (Zero-copy, Single-core)                 39.181s
NNC’s Dataframe CSV (Zero-copy, Single-core, First-pass)     9.274s
NNC’s Dataframe CSV (Zero-copy, Single-core, Second-pass)    32.963s
September 15th, 2020

In the past few days, I’ve been experimenting with Bazel as a library distribution mechanism for ccv.

I am pretty familiar with hermetic build systems at this point. My main knowledge comes from Buck, dating back 8 years. At that time, it never occurred to me that such a build system could eventually become a library distribution mechanism. During those same 8 years, NPM has taken over the world. Newer language-dependent package managers such as Go modules, Cargo and Swift Package Manager popularized the concept of using public repositories (GitHub) as dependency references. Languages from before this period, mainly C / C++, are moving in this direction, slowly.

ccv has a simple autoconf-based feature detection / configuration system. You would expect the package to work with ./configure && make. However, it never made any serious attempt to be too smart. My initial experience with monorepos at companies strongly influenced the decision to have a simple build system. I fully expect that serious consumers will vendor the library into their monorepo using their own build systems.

This has been true for the past few years. But as I am finishing up nnc and increasingly using it for other closed-source personal projects, maintaining a closed-source monorepo setup for my personal projects while upstreaming fixes is quite an unpleasant experience. On the other hand, nnc was meant from the beginning to be a low-level implementation; I expect to have high-level language bindings at some point. Given that I am doing more application-related development with nnc in closed-source form now, it feels like the right time.

Although there is no one true library distribution mechanism for C / C++, there are contenders, from good old apt / rpm to Conan, which has gained some mind-share in the open-source world in recent years.

The choice of Bazel is not accidental. I’ve been doing some Swift development with Bazel and the experience has been positive. Moreover, the high-level binding language for nnc, I figured, would be Swift.

Configure

ccv’s build process, much as I would prefer otherwise, is host-dependent. I use autoconf to detect system-wide libraries such as libjpeg and libpng and to configure proper compiler options. Although ccv can be used with zero dependencies, in that configuration it can sometimes be slow.

Coming from the monorepo background, Bazel doesn’t have many utilities that are as readily available as in autoconf. You can write automatic configurations in Starlark as repository rules, but there is no good documentation on how to write robust ones.

I ended up letting whoever uses ccv decide how they are going to enable certain features. For things like CUDA, such manual configuration is not tenable, so I ended up copying over TensorFlow’s CUDA rules.

Dependencies

Good old C / C++ libraries are notoriously indifferent to the distinction between library dependencies and toolchains. Autoconf detects both toolchain configurations and available libraries. These kinds of host dependencies make cross-compilation a skill in itself.

Bazel is excellent for in-tree dependencies. For out-of-tree dependencies, however, there is no established mechanism. The popular way is to write a repository rule to load relevant dependencies.

This actually works well for me. It is versatile enough to handle both dependencies that have Bazel integrations and those that don’t.

Consume Bazel Dependencies

Consuming the packaged Bazel dependencies then becomes as simple as adding a git_repository to the WORKSPACE and calling the proper <your_library_name>_deps() repository rule.
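Concretely, the consuming WORKSPACE ends up with something like the following sketch. The repository name, the pinned commit, and the location of the deps .bzl file are illustrative, not the actual ccv labels:

```
load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

git_repository(
    name = "ccv",  # illustrative name
    remote = "https://github.com/liuliu/ccv.git",
    commit = "...",  # pin to a specific commit
)

# Hypothetical label; the actual .bzl file location may differ.
load("@ccv//:deps.bzl", "ccv_deps")

ccv_deps()
```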

After packaging ccv with Bazel, now Swift for nnc can consume the packaged dependency.

Semantic Versioning Challenges

While the Bazel-provided library distribution mechanism works well for my case, it is simplistic. For one, there is really no good way to do semantic versioning. That is understandable: coming from a monorepo culture, it is challenging for anyone to dive into the dependency hell of library versioning. A slightly different version of this story happened to Go a while back as well.

It is messy if you want to pin a specific version of a library while your dependencies disagree with you. This is going to be messy in the C / C++ world regardless, unless you prelink everything extremely carefully. Bazel’s philosophy, from what I can see, leans heavily toward keeping the trunk working. It is working so far, but one has to wonder whether this can scale if more libraries adopt Bazel as their distribution mechanism.

Closing Words

The past few months’ experience with Bazel has been delightful. While I will continue to use language-specific tools (pip, Go modules, Cargo, NPM) when developing in a particular language, Bazel is a capable choice for me when doing cross-language development. Concepts such as workspace, git_repository and http_archive fit well within the larger open-source ecosystem. And most surprisingly, it works for a many-repo setup if you ever need it to.

August 7th, 2020

During the past few years, I’ve been involved in migrations from Xcode workspace-based build systems to Buck twice (Facebook and Snapchat). Both of these efforts took talented engineers multiple months of work, with many fixes upstreamed, to get working. Recently, I’ve been helping another company migrate their iOS project from an Xcode workspace-based build system to Bazel. This experience may be relevant to other people, considering the Bazel guide is quite light on details.

Why

Why do people at these companies choose alternative build tools such as Buck or Bazel over Xcode? There are a few reasons:

Better Code Review and Dependency Management

Xcode stores its build configurations in both xcscheme files and pbxproj files. Neither of them is text-editor friendly, and both rely heavily on the Xcode GUI for configuration. There are ways to use only xcconfig, a text-based configuration file for pbxproj. But for many real-world projects, that is just the third place for build configurations rather than the single source of truth. As a result, configuration changes are impossible to review effectively in any popular web tool (GitHub / GitLab / Gerrit).

Xcode does dependency management in implicit ways. Thus, you have to create an xcworkspace to include all dependent sub-projects for the build. This is problematic if you want to split your projects and tests into smaller, manageable units. You often end up with many xcworkspace projects, each needing manual dependency management and inclusion in the CI tool.

Better Caching and Faster Build Speed

Xcode’s build cache management is quite incompetent. A change in the scheme can result in rebuilding from scratch. Switching between branches can often result in full rebuilding as well.

Bazel is a hermetic build system. If a file or its dependency doesn’t change, a rebuild won’t be triggered. Switching between branches won’t slow down the build because artifacts are cached by its content.

Bazel provides an upgrade path from its primitive disk-based cache to a distributed remote cache. You can make the switch when the codebase or the team grows.

Better Modularization

Swift likes to have smaller modules. Having each module as its own project poses challenges for code review and dependency management. There are alternative solutions such as Swift Package Manager or CocoaPods, but both have their unique sets of problems (SPM is often too opinionated, while CocoaPods is slow and invasive).

When

Engineering teams often have competing priorities. For many teams, it is unlikely that their starting point will be a Bazel-based project (hopefully this will change in the future). When should you prioritize the migration to Bazel? I read a good summary on this somewhere and will just put it here: a good time to migrate to Bazel is when you are about to need it. By the time the team feels the pain of the Xcode-based build system (multi-minute build times, a ball-of-mud mono-project, or many small projects with CI breakages many times every week), the migration can often take months. On the other hand, when you have only 2 people developing an app, it is simply hard to judge the value proposition.

A semi mono-project with 3 to 4 developers and some external dependencies (~10 CocoaPods libraries) is a good place to allocate 1 engineer-week for the transition. That’s what I did here.

Setup Tooling

The first step of the migration is to set up the tools. We will use Bazelisk to manage Bazel versions, symlinking it to /usr/local/bin/bazel for ease of access. For Xcode integration, we will use Tulsi, installed from source. Both tools are checked out under the $HOME/${Company_Name}/devtools directory. The installation is automated through scripts inside the repository.

Setup Repository Directory Structure

While it is possible to manage external dependencies through the WORKSPACE, Bazel loves monorepos. Thus, we are going to vendor almost all our dependencies into the repository. The repository will be reorganized from a one-project-centric layout to a monorepo layout.

$HOME/${Company_Name}/${Repository}/
 - common
 - common/bazel_tools
 - common/scripts
 - common/vendors
 - ios/bazel_tools
 - ios/Apps
 - ios/Features
 - ios/Libraries
 - ios/Scripts
 - ios/Vendors

The Migration

Bazel Basics

If you haven’t, now is a good time to read the Bazel guide for iOS. We first set up an ordinary WORKSPACE file with rules_apple, rules_swift and xctestrunner imported.

This will allow us to start to use swift_library, swift_binary and ios_application to quickly build iOS app using Bazel.

For Xcode integration, we use the generate_xcodeproj.sh script to create an Xcode project with Tulsi. The tulsiproj, however, was never checked into our repository. This keeps our Bazel BUILD files the only source of truth for our build configurations.

At this point, we have exhausted what the Bazel guide for iOS teaches and need to face some real-world challenges.

CocoaPods Dependencies

It is hard to avoid CocoaPods if you do any kind of iOS development. It is certainly not the best, but it has accumulated enough popularity that everyone has a .podspec file somewhere in their open-source repository. Luckily, there is the PodToBUILD project from Pinterest to help alleviate the pain of manually converting a few CocoaPods dependencies to Bazel.

First, I used pod install --verbose to collect some information about the existing project’s CocoaPods dependencies. This script is used to parse the output and generate the Pods.WORKSPACE file that PodToBUILD wants to use. We then use bazel run @rules_pods//:update_pods to vendor the CocoaPods dependencies into the repository.

Some dependencies, such as Google’s Protobuf, already have Bazel support. After vendoring, we can switch from the PodToBUILD-generated rules to the official ones. Some dependencies are just static / dynamic binary frameworks; for these we can use apple_dynamic_framework_import / apple_static_framework_import. Pure Swift project support in PodToBUILD leaves something to be desired, but luckily we can simply use swift_library for these dependencies. They usually don’t have a complicated setup.

Some do. Realm is a mix of C++17, static libraries (librealm.a) and Swift / Objective-C sources. We can still use PodToBUILD utilities to help, and it is a good time to introduce swift_c_module, which can use a modulemap file to create proper Swift imports.

The C++17 portion is interesting, because until this point we used Bazel’s automatically created toolchain. This toolchain, unfortunately, forces C++11 (at least for Bazel 3.4.1 on macOS 10.15). The solution is simple: copy bazel-$(YourProject)/external/local_config_cc out into your own common/bazel_tools directory, so that the automatically generated toolchain configuration is no longer used. You can then change C++11 to C++17 in the forked local_config_cc toolchain.

Here is what my .bazelrc looks like after the CocoaPods dependency migration:

build --apple_crosstool_top=//common/bazel_tools/local_config_cc:toolchain
build --strategy=ObjcLink=standalone
build --symlink_prefix=/
build --features=debug_prefix_map_pwd_is_dot
build --experimental_strict_action_env=true
build --ios_minimum_os=11.0
build --macos_minimum_os=10.14
build --disk_cache=~/${Company_Name}/devtools/.cache
try-import .bazelrc.variant
try-import .bazelrc.local

The Application

If everything goes well, depending on how many CocoaPods dependencies you have, you may be on day 3 or day 5 by now. At this point, you can build each of your CocoaPods dependencies individually. It is time to build the iOS app with Bazel.

There actually aren’t many gotchas for building for the simulator. Following the Bazel guide for iOS and setting up your dependencies properly, you should be able to run the app inside the simulator.

If you have any bridging header (which you should avoid as much as possible!), you can add ["-import-objc-header", "$(location YourBridgingHeader.h)"] to your swift_library’s copts.

Running the app on a device may need some extra work. First, Bazel needs you to tell it the exact location of the provisioning files. I elected to store development provisioning files directly in the ios/ProvisioningFiles directory. With more people, this may be problematic to update, since each addition of a device or developer certificate requires a regeneration of provisioning files. Alternatively, you can manage them through Fastlane tools.

iOS devices are often picky about the entitlements. Make sure you have the proper application-identifier key-value pair in your entitlements file.

If you use Xcode, now is a good time to introduce the focus.py script. This script takes a Bazel target and generates / opens the Xcode project for you. It is a good idea to have such a script wrap generate_xcodeproj.sh: you will inevitably need some light modifications to the generated Xcode project or scheme files beyond what Tulsi is capable of. Here is mine.

You can use such a script like this:

./focus.py ios/Apps/Project:Project

Dynamic Frameworks

rules_apple introduced a bug in 04/2020 that prevents dynamic frameworks from being codesigned properly. It is not a big deal for people with no dynamic framework dependencies (you should strive to be that person!). For many mortals, this is problematic. Simply pinning rules_apple to a commit prior to that one fixes the issue.

Tests

Bazel, as it turns out, has fantastic support for simulator-based tests. I still remember the days of debugging Buck issues around hosted unit tests and the like.

Many projects may start with something called hosted tests in the Xcode world. It is quick and dirty: with hosted tests, you have full access to UIKit, and you can even write fancy SnapshotTesting tests. However, now is a good time to separate your tests into two camps: library tests and hosted tests.

A library test in Bazel is an ios_unit_test without a test_host assigned. It can run without the simulator, and is often faster. It is restrictive, too: your normal way of accessing a UIImage from the application bundle won’t work, and some UIKit components will not initialize properly without a UIApplicationDelegate. These restrictions are not bad! It is great to isolate your test to what you really care about: your own component.

You should move most of your existing tests to library tests.

SnapshotTesting has to be a hosted test. There are also implicit assumptions within that library about where to look for snapshot images. Luckily, we can pass --test_env from Bazel to our test runner and write a wrapper around assertSnapshot method.

The ios_ui_test rule will just work for your UI tests. The only bug we encountered concerns bundle_name: just don’t modify bundle_name in your ios_application, because ios_ui_test is not happy to run a bundle that is not named after the Bazel target name.

Code Coverage

Bazel, in theory, has good support for code coverage. You should be able to simply bazel test --collect_code_coverage and be done. However, at least for this particular rules_apple and Bazel 3.4.1, I had trouble doing that.

Code coverage is not hard, though. Under the hood, Xcode simply uses the source-based code coverage available from Clang / Swift. We can pass the right compilation parameters through Bazel and it will happily build tests with coverage instrumentation.

build --copt="-fprofile-instr-generate"
build --cxxopt="-fprofile-instr-generate"
build --linkopt="-fprofile-instr-generate"
build --swiftcopt="-profile-generate"
build --copt="-fcoverage-mapping"
build --cxxopt="-fcoverage-mapping"
build --linkopt="-fcoverage-mapping"
build --swiftcopt="-profile-coverage-mapping"

To run tests with a coverage report, we need to turn off the Bazel sandbox to stop Bazel from deleting files generated during test runs. The LLVM_PROFILE_FILE environment variable needs to be passed through --test_env as well. Here are the four lines I used to generate a coverage report that Codecov.io is happy to process:

bazel test --test_env="LLVM_PROFILE_FILE=\"$GIT_ROOT/ProjectTests.profraw\"" --spawn_strategy=standalone --cache_test_results=no ios/Apps/Project:ProjectTests
xcrun llvm-profdata merge -sparse ProjectTests.profraw -o ProjectTests.profdata
BAZEL_BIN=$(bazel info bazel-bin)
xcrun llvm-cov show $BAZEL_BIN/ios/Apps/Project/ProjectTests.__internal__.__test_bundle_archive-root/ProjectTests.xctest/ProjectTests --instr-profile=ProjectTests.profdata > Project.xctest.coverage.txt 2>/dev/null

CI Integration

It is surprisingly simple to integrate Bazel with the CI tools we use. For context, we use Bitrise for unit tests and ipa package submission / deployment. Since we scripted our Bazel installation, running unit tests is as simple as invoking bazel test from the CI. Both hosted tests and UI tests work uniformly that way.

There are some assumptions from Bitrise about how provisioning files are retrieved. If you use Fastlane tools throughout, you may not have the same problem.

We ended up forking its Auto Provision step to turn everything it retrieved from the Xcode project into pass-in parameters. At a later stage, we simply copy the provisioning file out to replace the development provisioning file in the repo.

Benefits after the Bazel Migration

Apollo GraphQL Integration

Prior to the Bazel migration, the Apollo GraphQL integration relied on Xcode build steps to generate source code. That meant tens of thousands of lines of code needed to be recompiled every time we built. People also needed to install the apollo toolchain separately on their systems, with node.js and npm dependencies.

We were able to integrate the packaged apollo cli further into Bazel.

http_archive(
  name = "apollo_cli",
  sha256 = "c2b1215eb8e82ec9d777f4b1590ed0f60960a23badadd889e4d129eb08866f14",
  urls = ["https://install.apollographql.com/legacy-cli/darwin/2.30.1"],
  type = "tar.gz",
  build_file = "apollo_cli.BUILD"
)

The toolchain itself is managed as a sh_binary from Bazel’s perspective.

sh_binary(
  name = "cli",
  srcs = ["run.sh"],
  data = ["@apollo_cli//:srcs"],
  visibility = ["//visibility:public"]
)
#!/usr/bin/env bash
RUNFILES=${BASH_SOURCE[0]}.runfiles
"$RUNFILES/__main__/external/apollo_cli/apollo/bin/node" "$RUNFILES/__main__/external/apollo_cli/apollo/bin/run" "$@"

With genrule and the apollo cli, we were able to generate the source code from Bazel as a separate module. This way, unless the query or the schema changes, we don’t need to recompile the GraphQL module any more.

filegroup(
  name = "GraphQL_Files",
  srcs = glob(["*.graphql"])
)

filegroup(
  name = "GraphQL_Schema",
  srcs = ["schema.json"]
)

genrule(
  name = "GraphQLAPI_Sources",
  srcs = [":GraphQL_Files", ":GraphQL_Schema"],
  outs = ["GraphQLAPI.swift"],
  # Apollo CLI is not happy to see bunch of symlinked files. So we copied the GraphQL files out
  # such that we can use --includes properly.
  cmd = """
mkdir -p $$(dirname $(location GraphQLAPI.swift))/SearchPaths && \
cp $(locations :GraphQL_Files) $$(dirname $(location GraphQLAPI.swift))/SearchPaths && \
$(location //common/vendors/apollo:cli) codegen:generate --target=swift --includes=$$(dirname $(location GraphQLAPI.swift))/SearchPaths/*.graphql --localSchemaFile=$(location :GraphQL_Schema) $(location GraphQLAPI.swift)
""",
  tools = ["//common/vendors/apollo:cli"]
)

CI Build Time

Even though Bazel’s disk cache is primitive, we were able to reap benefits on the CI side. Bitrise CI allows you to push and pull build artifacts. We were able to leverage that to cut our Bitrise build time in half.

Build for Different Flavors

A select_a_variant Bazel function is introduced. Under the hood, it is based on select and config_setting primitives from Bazel. A simple variant.py script can be added to switch between different flavors.

For different flavors, a different set of Swift macros is passed in (we wrapped swift_library in a new Bazel macro). Different sets of dependencies can be selected as well. These build options, particularly for dependencies, are difficult to manage with the old Xcode build system.
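Under the hood, a variant-aware macro boils down to Bazel’s select / config_setting primitives. A minimal sketch (the target names and the variant define are made up for illustration, not our actual BUILD files):

```
config_setting(
    name = "beta_variant",
    define_values = {"variant": "beta"},
)

swift_library(
    name = "AppConfig",
    srcs = ["AppConfig.swift"],
    defines = select({
        ":beta_variant": ["BETA"],
        "//conditions:default": [],
    }),
    deps = select({
        ":beta_variant": ["//ios/Libraries:BetaOnly"],
        "//conditions:default": [],
    }),
)
```

Building with bazel build --define variant=beta then flips both the Swift compile-time defines and the dependency set in one place.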

Code Generation and More Code Generations

We’ve changed the app’s runtime static configuration from reading a bundled JSON file at startup to generating Swift code from the same JSON file. I am looking forward to doing more code generation, and even trying a bit of Sourcery, now that the Bazel migration is done. The vanilla genrule is versatile enough and supports multiple outputs (compared to Buck). Once I figured out that swift_binary should go in the tools parameter of genrule, it was a breeze to write code generation tools in Swift.

Conclusion

Even though some workarounds are still needed for Bazel, it is night and day compared to my experience with Buck a few years ago. It is relatively simple to migrate a somewhat complicated setup in a few days.

Looking forward, I think a few things can be improved.

PodToBUILD was built with Swift. It cannot parse much Ruby syntax, and that can cause failures. In retrospect, such a tool should probably have been built with Ruby. At the end of the day, the CocoaPods build syntax is not complicated; once you run Ruby through that DSL, everything should be neatly laid out.

Although Bazel is language-agnostic, I hope that in the future we can have a Bazel package manager that is as easy to use as CocoaPods. That could be the final nudge for many people to use Bazel to start new iOS projects. The WORKSPACE alternative with git_repository is not a real solution: for one, it doesn’t traverse dependencies by default. This is for a good reason if you understand the philosophy behind it, but it still makes Bazel a harder hill to climb.

Let me know if you have more questions.
