Posts from February, 2021
February 14th, 2021

A while ago, the YYeTs (人人影视) fansub group was investigated for copyright infringement, sparking a nationwide debate in China about intellectual property protection. In certain periods and in particular fields, intellectual property has had a positive effect. But the intellectual property regime is not itself a truth; it is merely a human invention.

The essence of intellectual property is to take knowledge, something that belongs to the public commons of humanity, and, through the machinery of the state, grant it full private ownership for a limited time, privatizing it. The four attributes that define private property: the right to use, the right to transfer, the right to exclude and the right to destroy, all apply to intellectual property.

One thing worth noting is that turning knowledge into private property removes it from the public commons. Left unchecked, the commons shared by humanity would shrink and eventually vanish. This is why all intellectual property rights are time-limited: a patent is valid for 20 years, and copyright expires 50 years after the author's death (70 years in the United States).

In the era of the Industrial Revolution, when intellectual property took shape, it was progressive. At the same time, though, its effect on re-creation is devastating. Before intellectual property existed, much of literature, from household names such as Water Margin, The Plum in the Golden Vase, Records of the Three Kingdoms and Romance of the Three Kingdoms, was re-creation built on folk tales and earlier works. Such re-creation (call it fan fiction) is comparatively rare in the modern era because of copyright. This is, at bottom, a choice: we believe that by enforcing intellectual property and letting creators profit, we get more new works, enough to make up for the loss from having relatively less re-creation.

On the other hand, intellectual property, or more narrowly the patent system, had another purpose at its inception: encouraging disclosure. The patent system requires that an application be filed and published. That requirement alone guarantees that after 20 years the knowledge is not lost, but returns to the public commons.

This is also why, even today, many high-tech innovations, in engine manufacturing, materials and the like, are protected as trade secrets instead. And it is why you will never see a patent application for the hydrogen bomb.

Consequently, the fields that lean most heavily on intellectual property are those where reverse engineering is easy, marginal cost approaches zero, and research and development is expensive: cultural products, software and pharmaceuticals.

Having said all that, does intellectual property actually let inventors profit, and thereby encourage invention?

The generally acknowledged inventor of the cathode-ray-tube (CRT) television is Philo Farnsworth.

Farnsworth was born into a farming family in 1906. At 15, he was already dreaming of a device that could transmit sound and images together. At 21, he built the first prototype of a CRT television. Until then, televisions had mostly been based on mechanical scanning (Nipkow disks). The CRT television was more reliable and produced a clearer picture. In fact, our displays were still CRT-based into the early 2000s, which says something about the effectiveness of the invention.

Farnsworth moved to California in 1926 and filed a patent for the CRT television. The patent was granted in 1930, and Farnsworth planned to manufacture televisions and make a fortune. Just as he was about to start, RCA (Radio Corporation of America) claimed that its employee Vladimir Zworykin had invented television before Farnsworth, and asked the patent office to invalidate Farnsworth's patent.

In 1931, RCA extended an olive branch: a license for a hundred thousand dollars, plus a position for Farnsworth at the company. Farnsworth refused.

To keep the complete pool of television patents to himself, Farnsworth went into debt litigating against RCA. By the end of 1939 he had finally won every case and signed a million-dollar licensing agreement with RCA. But by then the key television patents were close to expiring, and he was buried in debt. To make matters worse, World War II had just begun, and the United States suspended emerging entertainment services such as television to throw everything into the war effort.

Farnsworth's story makes it hard to argue that patent fights favor the inventor. With deep enough pockets, a company can simply use legal procedure to drain an inventor's time and energy. This partly explains why, today, those who hold large numbers of patents and other intellectual property tend to be big, well-capitalized companies.

In the fields that rely most on intellectual property: cultural products, software and pharmaceuticals, does the protection actually work?

In culture, music, film and television are the best-known businesses that monetize intellectual property; they are also where YYeTs infringed. Music in particular, with its low barrier to creation and trivial copying, was the first field disrupted by the Internet. On the surface, the disruption looks like a blow to the industry: in the United States alone, recording industry revenue was $14 billion in 2000 but only $11 billion in 2019, and even that owes a lot to the growth of streaming over the past few years.

Creators' incomes, however, have not fallen. Comparing top young singers, Taylor Swift was worth $65 million in 2020, while Britney Spears was worth $38 million in 2000. As the record business declined, more creators turned to monetizing experiences that cannot be copied, such as live concerts. Hence Radiohead's piece of performance art in 2007: putting the new album In Rainbows online as a direct download.

Meanwhile, collaborative Internet writing projects such as SCP are experimenting with sharing and co-creation under a more permissive stance on copyright.

In software, almost everyone in the industry agrees that software patents do the field more harm than good. Because patents are transferable, the software world has produced companies like Intellectual Ventures, idealistic in name only, whose actual business is predatory patent trolling. Such companies invent nothing themselves. They buy up vaguely and broadly worded patents, then sue for profit: suing Facebook for infringing a patent on "a method for connecting different people through a network", or Google for infringing one on "a method for requesting resources from a server". This abuse of patent litigation can only hurt the people who actually turn inventions into products that benefit humanity.

Precisely because software patents have so many problems, the industry invented patent pools: the various companies' patents on a given technology are put together into a single pool. Pay the pool, and you get its protection; if someone sues over technology in the pool, everyone chips in for the legal fight. H.265 encoding and the so-called 4G / 5G technologies are, at their core, patent pools of combined intellectual property.

In pharmaceuticals, drug patents have always been controversial because of medicine's positive externalities. The film Dying to Survive (《我不是药神》) revolves around the Indian government's refusal to recognize drug patents so that its people could afford medicine. Ironically, it is precisely by mass-producing patented drugs that India accumulated vast manufacturing experience and became a major producer of the world's drugs. Even in the United States, the price of the EpiPen (a portable injector of emergency allergy medication) quadrupling over the past decade set off a wide debate about whether drug patents should come with price limits.

From another angle, modern drug development no longer follows the model of one big company carrying a drug all the way from initial research through Phase III. More often, the state funds universities and laboratories to do the early research. When a drug shows promise, the professor typically founds a small company to commercialize it. After the small company completes Phase I and II trials, it is packaged and sold to a large pharmaceutical company such as Lilly, which then runs the long, expensive Phase III trials and takes the drug to market. It is exactly this model that has sparked the debate over why research funded with taxpayers' money ends up as profit for big pharma.

A direct example of a patent's effect on technology is E Ink. E Ink was once a next-generation display technology neck and neck with OLED, but the E Ink company monopolized its core patents. For more than a decade after the invention, E Ink would not license other companies to manufacture it, so refresh rate, size, price and new applications all went nowhere. Now the whole industry is waiting for the core E Ink patents to expire in the 2020s, and for the burst of new products that will follow.

This is not wishful thinking. Most LED lighting patents expired around 2010, and, not coincidentally, the LED industry boomed right afterwards. LED strips, indirect LED lighting and color-changing LED lamps all started showing up in homes after 2010.

It is precisely because knowledge can be copied that our lives keep getting better, and in the Internet age, acquiring and copying knowledge has become cheaper than ever. Open-source software is a living example of knowledge sharing in practice. As a kid, I did not understand the relationship between intellectual property and open source either: at 14, I wrote an article in China Computer Newspaper (电脑报) denouncing people who repackaged the open-source FileZilla and sold it in China, accusing them of disrespecting intellectual property. Today I am more convinced that anything that helps spread knowledge is good. If repackaging, translating and charging a fee spreads it better than giving it away for free, why not?

SQLite, a widely used embedded database, takes the spirit of knowledge sharing to its logical conclusion. Unlike most open-source software, which requires redistributions to include the license text (Permission is hereby granted … without restriction, including without limitation …), SQLite, like every invention that predates intellectual property, simply returns all rights to the public domain.

In culture, Project Gutenberg works to make every work whose copyright has expired freely available on the Internet for download and reading. At the beginning of every year, Project Gutenberg happily publishes a post introducing the good works that have just entered the public domain for everyone to enjoy.

These projects are themselves a reflection on the intellectual property system. Ask yourself: Mickey Mouse, a single character, has given Disney more than a century of intellectual property, and as long as America stands, that will continue; has it done the public commons any good at all? If Wu Cheng'en had held the rights to Sun Wukong into the present day, would we ever have seen A Chinese Odyssey (《大话西游》)?

With all that, we can return to the question: has intellectual property achieved the goals it was created for, encouraging market competition and more invention?

In 1774, Britain was the world's dominant textile producer, and to protect its textile industry it passed laws forbidding the export of textile technology. Samuel Slater memorized the construction and operation of the spinning machinery, arrived in New York in 1789, and rebuilt the machines from memory. Today, Slater is widely regarded as the father of American industry.

Later, of course, once the United States had gained the technological lead, it used international treaties and institutions to write intellectual property protection into the laws of participating countries, protecting its own advantage. A major criticism of the TPP (Trans-Pacific Partnership) was precisely that it wrote America's excessively harsh and lengthy intellectual property rules into its partners' laws.

Intellectual property grants all the attributes of ownership to knowledge, something copyable and transmissible, and walls it off from the commons of humanity for long stretches of time. In today's world of close collaboration and fast-iterating knowledge, whether such complete intellectual property promotes invention, or merely protects the interests of capital, is very much in question. We should not swallow the concept of intellectual property whole without thinking: encourage what is good in it, and criticize what is bad. Intellectual property should be a dynamic concept. As Liu Cixin said: think more; to think more before then is always right.

February 7th, 2021

For many machine learning practitioners, the training loop is such a universally agreed-upon concept that numerous documentation pages and conference papers use the term without any reference. It is a helpful concept for beginners to get familiar with before diving into the rabbit holes of the many deep learning tools.

The field of machine learning has become vast, diverse and ever more open in the past 10 years. We have all kinds of open-source software, from XGBoost, LightGBM and Theano to TensorFlow, Keras and PyTorch, to simplify various machine learning tasks. We have supervised learning, unsupervised learning, generative networks and reinforcement learning: choose your own pill. It can be dazzling for beginners. Nor does it help that much of the popular software we use hides details from beginners behind abstractions such as classes, functions and callbacks.

But fear not: for machine learning beginners, there is one universal template. Once you understand it, it is straightforward to fit all existing training programs into it and start digging into how they implement the details. I call it the universal training loop.

An Extremely Simplified Interface

Many high-level frameworks provide an extremely simplified interface that looks like this:

func train(training_dataset: Dataset, validation_dataset: Dataset) -> Classifier

When you do:

let trainedClassifier = Classifier.train(training_dataset: training_dataset, validation_dataset: validation_dataset)

You somehow get a trained classifier from that interface. This is what you would find on FastAI's Quick Start page, or in Apple's Create ML framework.

However, this doesn't tell you much about what the function does, and it doesn't help that some of these frameworks provide callbacks or hooks into the training loop at various stages. The natural question is: what are the stages?

A Supervised Learning Training Loop

It is actually not hard to imagine what a supervised learning training loop would look like underneath the extremely simplified interface. It may look like this:

var classifier = Classifier()
for example in training_dataset {
  classifier.fit(input: example.data, target: example.target)
}

It goes through all the examples in the training dataset and tries to fit them one by one.

For stochastic gradient descent methods, we make a few modifications to keep the training process more stable and less sensitive to input order:

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  classifier.fit(inputs: minibatch.data, targets: minibatch.targets)
}

We randomize the order of the input (shuffled()), group the examples into mini-batches, and pass them to the classifier, assuming the classifier can operate on a group of examples directly.

For many types of neural networks, shuffled mini-batches are an essential part of your training loop, for both efficiency and stability reasons.

4 Steps to Fit

The magical fit function doesn't inspire any deeper understanding of what's going on. It looks like we simply lifted the train function into a for-loop.

However, if we can assume that our machine learning model is differentiable and gradient-based (e.g. a neural network), we can break the fit function down into 4 steps.

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  // Given inputs, a machine learning model can guess what the
  // outputs would be. (Labels of the images, positions of the faces
  // or translated texts from the original.)
  let guesses = classifier.apply(inputs: minibatch.data)
  // Loss measures how far our guesses are from the targets we knew
  // from the training data. This is supervised learning, so we
  // know the answers already.
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Based on the loss, gradients give us the direction and magnitude
  // to update our model parameters.
  let gradients = loss.gradients()
  // Update the parameters with gradients from this mini-batch.
  // Optimizer specifies a particular algorithm we use to update
  // parameters, such as stochastic gradient descent or ADAM.
  optimizer.apply_gradients(gradients, classifier.parameters)
}

You will find these 4 steps in any supervised learning. There can be variations: some models accumulate gradients over a few mini-batches and only then call apply_gradients; some apply additional clipping to the gradients before applying them.
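
For instance, here is a sketch of those two variations layered on the same 4 steps; the accumulation span of 4 mini-batches, the clipping threshold of 1.0, and the Gradients helper (with zero, +, and clipped(by:)) are all illustrative, not from any particular framework:

var classifier = Classifier()
var accumulated = Gradients.zero
for (i, minibatch) in training_dataset.shuffled().grouped(by: 32).enumerated() {
  let guesses = classifier.apply(inputs: minibatch.data)
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Clip this mini-batch's gradients before accumulating them.
  accumulated = accumulated + loss.gradients().clipped(by: 1.0)
  // Only update parameters every 4 mini-batches, which behaves like
  // a 4x larger batch size.
  if (i + 1) % 4 == 0 {
    optimizer.apply_gradients(accumulated, classifier.parameters)
    accumulated = Gradients.zero
  }
}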

You can find the exact same 4 steps in frameworks such as Keras or PyTorch.

Validation Dataset and Epoch

We haven't talked about the validation_dataset parameter you saw earlier in the train method!

For first-order gradient-based methods (e.g. neural networks), we need to go over the whole training dataset multiple times to reach a local minimum (a reasonable model). Each complete pass over the training dataset is called an epoch. Our models can also suffer from overfitting, so we keep a validation dataset: data the model never uses to update its parameters. It tells us how the model behaves on data it has never seen.

Incorporating these two insights, our training loop can be further modified to:

var classifier = Classifier()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, classifier.parameters)
  }
  var stats = Stats()
  for example in validation_dataset {
    // Only gather guesses, never update the parameters.
    let guess = classifier.apply(input: example.data)
    // Stats will compare guess to the target, and return some
    // helpful statistics.
    stats.accumulate(guess: guess, target: example.target)
  }
  print("Epoch \(epoch), validation dataset stats: \(stats)")
}

Now I can claim: for any supervised learning task, you will find the above training loop when you dig deep enough through the abstractions. We can call this the universal supervised training loop.

Unsupervised Learning and Generative Networks

The main difference between unsupervised learning and supervised learning, as far as our training loop is concerned, is that the targets are not provided by the training dataset; we derive them from somewhere else. In unsupervised learning, we derive the targets from some transformation of the input. In generative networks, we derive the targets from random noise (hence generating something from nothing).

var model = Model()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = model.apply(inputs: minibatch.data)
    // Unsupervised learning.
    let targets = model.targets(from: minibatch.data)
    // Generative networks
    // let targets = model.targets(from: noise)
    let loss = model.loss(guesses: guesses, targets: targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, model.parameters)
  }
}

Oftentimes, for these types of tasks, the targets are derived from another set of neural networks and updated jointly, so there are more bells and whistles in many frameworks' implementations of the above training loop. You can find an example from Keras of deriving targets from the input data alone (get_masked_input_and_labels) for BERT (a popular unsupervised natural language processing model), or an example from PyTorch of generating adversarial examples from noise for DCGAN (a deep convolutional generative adversarial network).
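
As a flavor of the first case, here is a minimal sketch of deriving targets from the inputs alone, in the spirit of masked language modeling; the 15% masking rate and the maskToken constant are illustrative, not taken from any framework:

let maskToken = 0  // Stand-in id for the special [MASK] token.

func targets(from tokens: [Int]) -> (inputs: [Int], targets: [Int]) {
  var inputs = tokens
  // Hide a random 15% of the tokens from the model ...
  for i in inputs.indices where Double.random(in: 0..<1) < 0.15 {
    inputs[i] = maskToken
  }
  // ... and ask the model to reconstruct the original sequence.
  return (inputs, tokens)
}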

Deep Reinforcement Learning

Deep reinforcement learning generates the training data by having an agent interact with the environment. It has its own loop, which looks like this:

// Hypothetical reset() to obtain the initial observation.
var lastObservation = environment.reset()
while true {
  let action = agent.respond(to: lastObservation)
  let (observation, reward, done) = environment.interact(action: action)
  lastObservation = observation
  if done {
    break
  }
}

The agent takes an action based on our last observation. The environment accepts the action and gives back a new observation (along with a reward).

This loop is independent of our training loop. In contrast to supervised learning, in deep reinforcement learning we use the interaction loop to generate the training data.

Our training loop can be modified to look like this:

var policy = Policy()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  // Hypothetical reset() to start a new episode.
  var lastObservation = environment.reset()
  var data_in_episode = Dataset()
  while true {
    let action = policy.apply(inputs: lastObservation)
    let (observation, reward, done) = environment.interact(action: action)
    data_in_episode.append((action: action, reward: reward, observation: lastObservation))
    lastObservation = observation
    if done {
      for (i, data) in data_in_episode.enumerated() {
        // Use all future rewards to compute our target.
        let target = target_from_future_rewards(data_in_episode[i..<])
        // Our input will be the last observation (Q-learning), and
        // potentially also include the action (Actor-Critic model),
        // or the next observation (model-based methods).
        training_dataset.append((input: (data.action, data.observation), target: target))
      }
      break
    }
  }
  // Rather than shuffling the whole training dataset, we just
  // randomly sample a subset.
  for minibatch in training_dataset.randomSampled().grouped(by: 32) {
    let guesses = policy.apply(inputs: minibatch.data)
    let loss = policy.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, policy.parameters)
  }
}

The training_dataset in the above training loop is often referred to as the replay memory in the literature. If we retain all the old data while training, this is often called off-policy training. If instead we discard all the training data after each episode, it is called on-policy training. OpenAI Spinning Up has a better explanation of the differences between on-policy and off-policy, with a bit more detail.
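
In pseudo-code, the difference between the two is just what we do with the replay memory at the end of each epoch; a minimal sketch, assuming Dataset has a removeAll() method:

for epoch in 0..<max_epoch {
  // ... collect the episode and update the policy as above ...

  // Off-policy: keep the replay memory, so future updates can reuse
  // data collected by older versions of the policy.
  // On-policy: discard it, so every update only ever sees data
  // collected by the current policy.
  training_dataset.removeAll()
}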

You can find examples of the above training loops in PyTorch or in OpenAI Baselines.

Distributed Learning

Following the same training loop, we can extend it to train on multiple machines:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    // Compute gradients from the specific data on this machine only.
    let machineGradients = loss.gradients()
    // Use allreduce primitive to compute gradients summed from all
    // machines.
    let allGradients = allreduce(op: +, value: machineGradients, on: machineID)
    // Applying the same summed gradients should yield the same
    // parameters on all machines.
    optimizer.apply_gradients(allGradients, classifier.parameters)
  }
}

The allreduce primitive goes over all machines and sums the gradients from each of them. In reality, it is often implemented with a ring-based communication pattern to optimize throughput. Naively, it can look like this:

func allreduce(op: Op, value: Tensor, on: Int) -> Tensor {
  if on == 0 {
    // Machine 0 gathers the values from every other machine ...
    var tensors = [value]
    for i in 1..<MPI_Comm_size() {
      tensors.append(MPI_Recv(from: i))
    }
    // ... reduces them, and sends the result back out.
    let sum = tensors.sum()
    for i in 1..<MPI_Comm_size() {
      MPI_Send(sum, to: i)
    }
    return sum
  } else {
    MPI_Send(value, to: 0)
    return MPI_Recv(from: 0)
  }
}
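
The ring-based version avoids making machine 0 the bottleneck by splitting each tensor into chunks and passing them around a ring, so every link is busy at once. Here is a rough sketch of the reduce-scatter / all-gather idea using the same pseudo primitives; split(into:) and Tensor(concatenating:) are assumed helpers, and a real implementation would overlap the sends and receives instead of running them back-to-back:

func ringAllreduce(value: Tensor, on: Int) -> Tensor {
  let n = MPI_Comm_size()
  let next = (on + 1) % n
  let prev = (on - 1 + n) % n
  var chunks = value.split(into: n)
  // Reduce-scatter: after n - 1 steps, machine r holds the fully
  // summed chunk (r + 1) % n.
  for step in 0..<(n - 1) {
    let sendIndex = (on - step + n) % n
    let recvIndex = (on - step - 1 + n) % n
    MPI_Send(chunks[sendIndex], to: next)
    chunks[recvIndex] = chunks[recvIndex] + MPI_Recv(from: prev)
  }
  // All-gather: circulate the summed chunks for another n - 1 steps
  // so every machine ends up with every chunk.
  for step in 0..<(n - 1) {
    let sendIndex = (on - step + 1 + n) % n
    let recvIndex = (on - step + n) % n
    MPI_Send(chunks[sendIndex], to: next)
    chunks[recvIndex] = MPI_Recv(from: prev)
  }
  return Tensor(concatenating: chunks)
}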

This naive data-distributed training loop can be extended to more sophisticated distributed training regimes. For example, in ZeRO-Offload, the distributed strategy can be represented as:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    for (i, gradient) in gradients.enumerated() {
      // Each machine only sums the gradients it is responsible for.
      // This method returns nil if asked to reduce a gradient this
      // machine is not responsible for.
      if let reducedGradient = reduce(op: +, id: i, value: gradient, on: machineID) {
        // Copy the summed gradient to CPU.
        cpuGradients[machineID, i] = reducedGradient.copy(to: .CPU)
      }
    }
    // Apply gradients to the model from CPU.
    optimizer.apply_gradients(cpuGradients[machineID], classifier.parameters[machineID])
    // Broadcast the latest parameters to all machines.
    broadcast(classifier.parameters[machineID])
  }
}

The Universal Training Loop

Finally, we arrive at our universal training loop template:

var model = Model()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  // Collect training dataset either from agent-environment
  // interaction, or from the disk.
  training_dataset = collect(...)
  // Go over mini-batch either on the whole training dataset, or from
  // a subset of it.
  for minibatch in training_dataset.extract(...) {
    // Apply the model to generate some guesses.
    let guesses = model.apply(inputs: minibatch.data)
    // Generate targets either from the inputs, from noise, or take
    // them straight from the training dataset. Or a combination of
    // the above.
    let targets = targets_from(...)
    // Compute the loss from the model's guesses w.r.t. the targets.
    let loss = model.loss(guesses: guesses, targets: targets)
    // First-order gradients from the loss.
    let gradients = loss.gradients()
    // Sum gradients from everywhere (other machines, other GPUs) for
    // this particular node to process.
    if let (i, collected) = gradients.collect(...) {
      optimizer.apply_gradients(collected, model.parameters[i])
      // Broadcast the updated parameters to everywhere.
      broadcast(model.parameters[i])
    }
  }
}

*: The pseudo-code in this article uses Swift. One particular piece of syntax deserves a note: a[x..<y]. It is semantically the same as a[x:y] in Python, including the cases where part of the subscript is missing: a[x..<] is the same as a[x:].

February 4th, 2021

I started to work on a new data-science-related project last year. Our initial setup was fairly traditional: Python-centric, with Anaconda for dependency management and environment setup.

Anaconda is interesting. It is probably useful if you have a ton of dependencies, since the system tries really hard to figure out whether packages are compatible with each other based on what they claim (the version numbers of their dependencies). We, though, only use a handful of packages with clean dependencies on one another (the usual suspects: Pandas, SciPy, numpy), so the version checking just means every package upgrade involves half an hour of SMT solving.

On the other hand, I made my fortune in the past decade doing app development (Facebook, Snapchat). I've been eyeing Swift since version 1.0, and the language has matured a lot since then. After gaining some Swift experience at my last employer, it struck me as a good compromise between expressivity and performance: the noise from the language itself is minimal, and the performance can be tuned if you push hard, while still beating raw Python even when you don't.

After a few months of probing, investing and bug-fixing, we migrated our setup to Swift last December. It has served us well so far.

Problems

We have a small project, and our problems were mostly around package management and upgrades. Since Anaconda environments are not per-project, we had to switch back and forth whenever entering or leaving the project.

Our project, although small, is a bit exotic: our core algorithm was implemented in C, and we didn't want to ship a Python plugin in-tree. Hence, we opted to talk to the C lib through standard IO (subprocess). That turned out to be hairy, and the update process for the core algorithm was worse than terrible.

Pandas performs reasonably as long as you stay within builtin functions; once we dropped down to apply / lambda, going through a few million rows for a particular column could take 30 to 40 seconds. In these cases, we also couldn't put the remaining idle cores to use efficiently.

However, switching to a new language setup is not an easy task. Besides solving the problems above, we still wanted to keep a few things we liked about the old setup:

  • Jupyter notebook: we really liked doing data exploration in Jupyter notebooks. Anything that required a compile / run / print cycle would be a no-go. The interactive data exploration experience is essential to our workflow.

  • Pandas: we liked Pandas; it is the Swiss Army knife of data science. It also has a big API surface that would be very hard to reimplement from scratch.

  • PyCharm or a similar IDE: we liked using PyCharm for Python development a lot. Data inspection, re-runs and, in general, the debugging experience inside an IDE are beyond comparison with tools like vim (although I still personally use vim for a lot of development).

Other Potential Choices

Before embarking on this quest, we briefly looked at other potential choices:

  • TypeScript / JavaScript: it has some interesting data exploration patterns, such as observablehq.com. However, it doesn't have good C-library integration points, which leaves the aforementioned core algorithm integration problem unsolved.

  • Rust: when I did my investigation, I hadn't noticed the evcxr project. Beyond that, the syntax for modeling would be a bit noisier than I'd like, and calling Python through PyO3 is a bit clumsy too.

  • Julia: if I had more experience with the language, I might have a different opinion. But as it stands, the language has its own ecosystem, and I didn't see a good way to call Python libraries from it*. As for C-library integration, it seems to require dynamic linking, which would be a bit more of a hurdle in my toolchain setup.

New Setup

Monorepo

Our new setup is a monorepo, with Bazel as the build system for our Swift code, our C libraries and our Python dependencies. In the meantime, we still have some legacy Python libraries, which are now managed by Bazel too.

Bazel's new rules_python has a pretty reasonable pip_install rule for third-party dependencies. As I mentioned, because we use a relatively small number of Python packages, cross-package compatibility is not a concern for us.

All our open-source dependencies are managed through WORKSPACE rules. This works for our monorepo because we don't really have a large number of open-source dependencies in the Swift ecosystem. The things we import are mostly Swift numerics, algorithms and argument-parser.

Jupyterlab

We no longer use a separate Jupyter installation. Jupyterlab is installed as a pip_install requirement of the monorepo, so opening Jupyterlab is as simple as bazel run :lab. This lets us keep our Swift Jupyter kernel in-tree: we adopted swift-jupyter and added Bazel dependency support. We also have a pending PR to upstream our sourcekit-lsp integration with the Swift Jupyter kernel.

This complicates plugin management for Jupyterlab a bit, but we haven't yet found a well-maintained out-of-tree plugin that we'd like to use.

Python

To support calling Python from Swift, we opted for the PythonKit library. We've upstreamed a patch to make Pandas work better within Swift, and made a few more enhancements around UnboundedRange syntax and passing Swift closures as Python lambdas, which haven't been upstreamed yet.
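
To give a flavor of what this looks like day to day, here is a minimal PythonKit sketch (the CSV file and column name are made up):

import PythonKit

// Python modules are imported at runtime; everything that comes back
// is a dynamic PythonObject.
let pd = Python.import("pandas")
let df = pd.read_csv("data.csv")
// Member access and method calls are forwarded to the Python side.
print(df["price"].mean())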

One thing that makes calling Python from Swift easy is that both languages use reference counting, which makes memory management across the language boundary much more natural.

We also wrote a simple pyswift_binary macro in Bazel so that a Swift binary can declare its Python dependencies and have them set up properly before the Swift binary is invoked.

We haven't vendored our Python runtime yet; for now we just painstakingly make sure all machines are on Python 3.8. However, we do intend to use py_runtime to solve this discrepancy in the future.

C Libraries

Swift's interoperability with C is top-notch. Calling C dependencies, compiling them (with Bazel) and integrating them with Swift is as easy as it should be. So far, we've fully migrated our core algorithms to Swift, and iterating on them is much easier.
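
To illustrate how little ceremony is involved, suppose the C library exposes a scoring function; the module and function names below are hypothetical:

// C declaration, made visible to Swift through a module map:
//   double core_score(const double *values, size_t count);
import CCoreAlgorithm

let values: [Double] = [0.3, 1.7, 2.9]
// Swift arrays bridge to C pointers through an explicit, scoped call.
let score = values.withUnsafeBufferPointer { buffer in
  core_score(buffer.baseAddress, buffer.count)
}
print(score)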

IDE Integration

For obvious reasons, we cannot use PyCharm with Swift. However, we successfully migrated most of our workflow to VS Code, where LLDB support makes debugging easy. We did have to compile our own Swift LLDB with Python support for this purpose (we are still figuring out with the core team why the shipped Swift LLDB on Linux has no Python support).

Sourcekit-lsp doesn't recognize Bazel targets, and the work on Build Server Protocol support, both on the Bazel side and on the sourcekit-lsp side, seems stalled at the moment. We ended up writing a simple script that queries compilation parameters from bazel aquery for all our Swift source code and puts them into compile_commands.json. Sourcekit-lsp has just enough support for Clang's compilation database format to make code highlighting, auto-complete, go-to-definition and inline documentation work again.

We committed .vscode/settings.json.tpl and .vscode/launch.json.tpl to the codebase. Upon initial checkout, a small script converts these template files into actual settings.json and launch.json files. We did this to work around some VS Code plugins requiring absolute paths. The two files have been kept in sync with the templates as part of the build process ever since.

Bonus

Swift's argument-parser library makes creating command-line tools really easy, and it even supports auto-completion in your favorite shell. We implemented one all-encompassing CLI tool for our monorepo that does all kinds of things: data downloading, transformation, model evaluation, launching Jupyterlab, etc. With auto-completion and help messages, it is much easier to navigate than our previous ./bin directory with twenty-something separate scripts.
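
A subcommand-based layout with argument-parser looks roughly like this; the tool and subcommand names are made up for illustration:

import ArgumentParser

struct Tool: ParsableCommand {
  static var configuration = CommandConfiguration(
    abstract: "One CLI for everything in the monorepo.",
    subcommands: [Download.self, Evaluate.self])
}

struct Download: ParsableCommand {
  @Argument(help: "Name of the dataset to download.")
  var dataset: String

  func run() throws {
    print("Downloading \(dataset) ...")
  }
}

struct Evaluate: ParsableCommand {
  @Option(help: "Path to the model checkpoint.")
  var checkpoint: String

  func run() throws {
    print("Evaluating \(checkpoint) ...")
  }
}

Tool.main()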

Is this the End?

We've been happy with this new setup since the beginning of the year, but it is far from an end-all be-all solution.

Swift is great for local development; however, the Bazel support on Linux generates binaries that dynamically link against the Swift runtime. This makes deployment a little more involved than we'd like it to be.

Pandas uses an internal data structure for DataFrames (column-based numpy). While we do have efficient numpy-to-tensor conversions, they cannot be leveraged in the context of Pandas (we want to lift the whole data frame, not just the numpy-ready columns). Our current solution calls itertuples, which can be quite slow. We'd like to sponsor an open-source project to implement Apache Arrow support for Swift; that should enable us to pass data from Pandas to Swift through the Arrow in-memory format, which may well be faster than what we do now.
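
For reference, the itertuples path we use today looks roughly like this (a sketch; the two-column layout is made up). Every value crosses the Python / Swift boundary individually, which is why it gets slow over millions of rows:

import PythonKit

let pd = Python.import("pandas")
let df = pd.read_csv("data.csv")

var rows = [(Double, Double)]()
// itertuples yields one Python tuple per row; each element is then
// converted to a Swift value one at a time.
for row in df.itertuples(index: false) {
  rows.append((Double(row[0])!, Double(row[1])!))
}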


*: notagoodidea pointed out that there is a PyCall.jl library that implements Python interoperability.