August 25th, 2021

This series of essays explores how cryptocurrencies, despite the frauds and scams in the system, could possibly move beyond the speculative asset category. Rather than investigating the underlying utility value of a cryptocurrency system, this series focuses on the power interplays between people, and how those can sketch a narrow scenario in which cryptocurrencies become mainstream.

When reading materials on monetary policy and financial history, the inflation-enabling property of currency is universally praised by academics. It gives the government the ability to act properly as the lender of last resort. It promotes consumption by suppressing hoarding behavior. It is generally considered pro-innovation because money needs to seek investments with higher returns (and higher risk) than holding money itself, which is easier when the baseline is low.

Thus, academics project their preference for inflation-enabling currency onto the cryptocurrency market and predict that cryptocurrencies won’t be a worthy replacement for money because of their limited supply. Even for proof-of-work cryptocurrencies that don’t have a 21-million cap, it is pretty obvious that issuance is front-loaded. Many of their mining rates are not matched to economic growth. If you factor in how many wallets have lost their private keys, these are strictly supply-limited.

What many academics didn’t factor in is how higher-level consensus is achieved. In the cryptocurrency world, per-transaction consensus can be achieved through algorithms implemented in C++, Rust or Go. Higher-level consensus, decisions such as which version of the software to use, what new features need to be included in a certain release, and what new features would be interesting to experiment with, is achieved, to this day, through human negotiation such as the EIP process. There can be added incentives to move miners over to newer software, such as the difficulty bomb, but nothing prevents larger pools from colluding and forking the software.

The supposedly supply-limited nature of cryptocurrencies is not an immovable ancient law set in stone. It is, at its core, a brilliant if not deliberate marketing ploy to attract the unsophisticated. Market participants know well that it cannot be the gold standard of our time if there is no circulation. If most participants are HODLers, circulation will be limited. But no matter, this is only the first stage.

For how ridiculous the marketing can get, look no further than . Brought to you by people with a vested interest in cryptocurrencies, it tries to tie every problem we have in the modern world to the decoupling from the gold standard.

As far as the first stage goes, the cryptocurrency market needs to solve a paradoxical challenge: while remaining low in circulation, it also needs to get as many people as possible to be HODLers. This helps to gain wide political support in democracies. More importantly, if cryptocurrencies truly want to be the second coming of commodity money, they need to have the necessary breadth once the circulation knob is turned on.

Inflation

Before we discuss the “circulation knob”: the current conventional consensus still treats cryptocurrencies as speculative assets. For the plot to become “the better commodity money”, cryptocurrencies have yet to prove themselves in a high-inflation world. It is actually not obvious how, as speculative assets, cryptocurrencies can compete with harder things such as commodities. It is even more unclear how they would fare against safe havens once inflation triggers monetary tightening.

With a brilliant marketing ploy, low circulation and capital controls, it seems possible to maintain the herd psychology long enough that the hardness of cryptocurrencies becomes self-reinforcing (at a certain point, you can point to a CPI / cryptocurrency price graph and claim that it is “inflation-proof”).

“Circulation Knob”

To be “the money” for everyday use, cryptocurrencies need to turn on their circulation knob at a certain point. The circulation knob would be an implementation that allows cryptocurrencies to have better-adjusted supply mechanisms. In a healthy economy, this means no longer being forced into deflation. It is hard to imagine, as it stands, that Central Banks would allow broader circulation of unregulated currencies in their respective economies.

However, this is not impossible. In modern democracies, no matter how many ivory-towered academics sit in the Central Banks and how much they collectively dislike cryptocurrencies, they need to appease the political establishment. At the end of the day, Central Banks only care about impacts to their monetary policy tools. This can be managed by nationalizing mining operations, issuing additional cryptocurrency to Central Banks, or adding new money-supply mechanisms that Central Banks can use to play the lender of last resort if needed. These new tools in the cryptocurrency world would be enacted through the higher-level consensus mentioned earlier. It is possible because it would be yet another validation of the viability of cryptocurrencies at large.

The pursuit of hardness, along with the competition between different cryptocurrencies, would be the more challenging part of turning on the circulation knob. Each will hold out on the circulation knob as long as possible to prove its hardness to the others. This will be a complex play between Central Banks, different cryptocurrencies and their respective philosophies.

Notes

Path Dependence

Monetary history is full of accidental, path-dependent coincidences. On one hand, the prevalence of cryptocurrencies in their current form would be a deterrent to any Central Bank issuing its own alternative. It is hard to balance the convertibility of an alternative programmable money against the others. Getting it wrong, especially on the conversion rate and programmability, could be disastrous to the existing monetary system.

On the other hand, the cryptocurrency world got its first notice on the back of Austrian doctrine. Departing from the nature of limited supply could be dead on arrival for the community. Higher-level consensus that satisfies the Central Banks may never be reached. It is very difficult to imagine how eliminating all existing monetary policy tools, effectively abolishing or fully privatizing Central Banks (cryptocurrency players with large sums would be the new unregulated “Central Banks”), would be a good idea. It is not impossible, given that the United States has done that several times and has a semi-private Federal Reserve already. The implied transfer of power, if that happened, would be worth another essay to explore.

What can go wrong?

A lot. This essay simply plotted a narrow, staged path along which it is imaginable that cryptocurrencies could become a form of money. It requires careful changes in higher-level consensus several times over. One early test will be to see how the current slew of cryptocurrencies fares against post-Covid inflation to prove their hardness. So far, the results are mixed.

August 20th, 2021

One thing the academics failed to appreciate in the past decade is the power of belief. In democracies like the United States, if the power of belief can be maintained long enough, a supporting structure from both the political and the economic side will emerge. In every aspect, cryptocurrencies are moving in the right direction to establish such a supporting structure, despite the scams and frauds associated with them.

The news of frauds (Polynetwork, IRON) and crackdowns at this point only strengthens the belief in the resilience of the system. People will point to these events and claim the robustness of cryptocurrency systems. The centralized exchanges, the centralized protocol maintainers / miners and the intentionally light regulation (or no regulation at all) seem to form a stable triangle that maintains the cryptocurrency systems.

All of this makes it very interesting to contemplate what the exact endgame for cryptocurrency is.

Many critics claim that the biggest risk for cryptocurrencies comes down to the slew of stablecoins that effectively set the price for other cryptocurrency assets. They suggest that stablecoins such as USDT or USDC are backed by fractional reserves at best. These fractional reserves are so small (less than 4% for USDT) that the critics suspect they would face a liquidation crisis in a bank-run scenario.

Unfortunately, these critics failed to read the fine print of these stablecoins. While you can exchange either USDT or USDC against hundreds of alternative coins on decentralized exchanges, you cannot do so for the USDT / USD or USDC / USD pair. It is the centralized exchanges that control how you can redeem these stablecoins into USD. Like countries with capital controls, the fixed exchange rate can be maintained indefinitely.

The endgame for cryptocurrency won’t be the crash of the stablecoins.

While the stablecoins cannot crash in the absence of regulation, the current slew of stablecoins ties their power to the power of the USD. It is possible to have a stablecoin pegged to a basket of currencies, but that also means the stablecoin would face not only U.S. regulations, but regulations from the EU, China or Japan, depending on what is in the basket.

Paradoxically, stablecoins won’t crash, but they can land safely even as the USD slides from its global reserve currency position. However, that won’t be good news for cryptocurrencies at large.

The academics largely got cryptocurrencies wrong in the past decade because they failed to factor in the waning power of the United States. The U.S. has lost both the willingness and the political power to maintain absolute control globally. Its regulators moved too slowly to defend the USD’s global interests. Meanwhile, the internet moved from fearing to create any digital USD alternative (e-gold) to a world in which any individual can do an ICO. Now, there is no willingness to regulate cryptocurrencies as long as the market is high and people make money.

The waning power of the United States is both a blessing and a curse to the cryptocurrencies.

At the moment, cryptocurrencies need a powerful democracy that can maintain global hegemony. Without global hegemony, cryptocurrencies will face real issues: intermittent internet connectivity, difficulty converging on consensus, and widely fluctuating exchange rates against commodities. And among other forms of government, powerful authoritarian or dictatorial governments simply cannot tolerate the loss of capital controls.

There are alternative scenarios in which cryptocurrencies succeed without the United States. Central operators would need to grow into powerful transnational entities that hold controlling shares in commodities and other life essentials globally, potentially with their own enforcement militias to maintain such hegemony. This scenario has been explored extensively in anarchist literature since the 1980s (or earlier). The problem, of course, is that such a transition of power takes time.

To believe in cryptocurrencies, you have to simultaneously believe that U.S. power is waning and that it can maintain global hegemony for an extended time. However paradoxical this proposition is, it increasingly seems to be a likely scenario for the next 15 years. Will this be the Goldilocks situation for cryptocurrencies? I don’t really know, and you, the readers, have to make up your own minds.

Notes

What about KYC (“Know Your Customer”) rules?

If anything, this limits the number of centralized exchanges that can offer stablecoin / USD pairs. A limited number of centralized exchanges also means it is easier to collude and fix pricing.

What about increasing regulations / self-regulations on USD?

The heavy regulations on USD (such as the KYC rules mentioned above) and recent news about self-regulation around USD (the MasterCard / OnlyFans speculation) suggest the political willingness to regulate. However, USD is an easier target for politicians because no one else profits from USD other than the United States itself. Piling regulations on USD to “save our children” is an easy win in democracies. Piling regulations on cryptocurrencies while regular people make money on them (while the market is high) is political suicide.

The imbalance between regulations on USD and the lack of regulation of stablecoins only reinforces the centralized exchanges’ ability to impose capital controls. As long as these regulations exist in one form or another, it is much easier to accept the capital controls imposed by exchanges, because you cannot do X with USD but can with stablecoins.

What about utility values of cryptocurrencies?

While there are utility values in cryptocurrencies, I am trying to explore an alternative framework that assesses them based on power structures rather than the underlying utility value provided by a technology. This essay treats the transition to cryptocurrencies like any other transition: yes, there are utility values that motivate the transition in the first place. But ultimately, it is about people, especially people in power, making decisions on actions and inactions.

What can go wrong?

A lot. People can do stupid things. No amount of rosy scenario planning can prevent that.

March 19th, 2021

I am a mediocre marathon runner who hasn’t done any long distances since Covid-19. But out of serendipity, today I was thinking about marathons and the interesting choices you make during them.

A flat road marathon with support stations every half a mile is … bland. I have probably done that only twice, once in Maui, once in Denver. The only interesting bits of those two were the hot temperature and the mile-high altitude. More interesting are trail runs with support stations every few miles. You have a lot of decisions to make along the way.

Elite runners can probably bulldoze through the trails. But for everyday runners like me, the first decision to make on the trail is: what’s going to be my pace for this segment? This is a function of your desired finish time, the elevation gain / loss of the segment, and the weather conditions. On a good day, at the beginning of the race, with a massive elevation loss, you probably want to go as fast as you can. This is economical from an energy-expenditure perspective. On the other hand, if there are a lot of ups and downs at the beginning, you probably want to keep a steady pace so that your overall finish time stays on track.

The decision is a bit different towards the end. If there are a lot of ups and downs towards the end, you probably want to take it slow on the uphills and speed up on the downhills. This is more efficient energy-wise, and can still keep you somewhat on track.

Besides questions of pace, if there is a fairly reasonable number of support stations, you need to decide where to stop for refills. If there are only 3 support stations, 6 to 7 miles apart, you probably want to stop at every one, since your 1L water bag may run out at around the 10-mile range. It is more interesting if there are not enough support stations to go without carrying your own water bag, but enough of them that you can decide to skip some.

This decision can be more interesting than it first appears. If you are breezing through at a 7 min/mi pace, you probably don’t want to break it and stop for a refill. It is simply not economical. However, if you are running low on water, it is a harder call. You never, ever want to go without water for a mile or so; it is soul-crushingly terrible. In this case, you may want to ration your water intake until the next support station, or stop for a big refill now and try to make up the time by skipping the next two.

For an everyday runner, at mile 12 or 13, you probably want to consider taking some food. Energy bars, jellies, gels, watermelon chunks, bananas, or M&M’s, all kinds. It is a great way to prepare for the wall. To be honest, I am not great at picking stuff out of my bag while running, so getting food out requires a brief stop. When the most economical time to do so is, is not obvious. Uphills? Probably; it is definitely easier to stop and restart on an uphill. But you also want to get food into the body before glycogen depletion. Gels are probably the fastest; they take about 5 minutes to kick in. Other solids can take longer. A mile or so before depletion is probably required. Depending on these factors, there is a range in which we can take food more optimally during the race.

These decisions get even more interesting in a self-supported multi-day event. Self-supported here means you carry your own food for all the days (excluding water). Multi-day means a marathon a day. The decisions are much more interesting because you are not just carrying a pack of energy gels. There are proper solids: pastas, ramen, instant rice, beef jerky for your after-race celebrations. Simply put, energy gels for 6 days is not an option psychologically.

In this situation, the decisions on when to take what need to consider not only the optimality of the current race, but also the race the next day. If the next day will be harder, I probably need to ration my energy gels today and have a bigger breakfast before today’s race to balance it out. It also cannot be completely rational. You want food with a high energy-to-weight ratio so that on the tougher day the backpack is lighter. However, you also want to keep some celebratory food (beef jerky, anyone?) so that, psychologically, there is a strong motivation to finish sooner. It quickly becomes an interesting game between pace, stop-and-restart points, energy expenditure / refills and psychological reward functions. Even after several rounds of trial and error, I don’t yet know a universal optimal strategy for this game.

People say that life is a marathon, not a sprint. Well, I guess. As an everyday person, I would love to participate in a game with many more interleaving decisions and consequences. That is certainly much more interesting than the genetic lottery, isn’t it?

February 14th, 2021

A while ago, the YYeTs fansub group (人人影视字幕组) was investigated for copyright infringement, which set off a big debate in China about intellectual property protection. In certain periods and in specific fields, intellectual property has played a positive role. But the intellectual property system itself is not a law of nature; it is merely a human invention.

The essence of intellectual property is to take knowledge, something that belongs to the public commons of humanity, and, through the state apparatus, grant it full private property rights for a limited time, privatizing it. The four defining attributes of private property, the right to use, the right to transfer, the right to exclude and the right to destroy, all apply to intellectual property.

Note that privatizing knowledge in this way removes it from the public commons. Left unchecked, the commons shared by humanity would shrink and eventually disappear. This is why intellectual property rights are time-limited: a patent is valid for 20 years, and copyright expires 50 years after the author’s death (70 years in the United States).

Intellectual property was progressive when it took shape during the Industrial Revolution. At the same time, its impact on re-creation was devastating. Before intellectual property existed, much literary creation, from household classics such as Water Margin, Jin Ping Mei, Records of the Three Kingdoms and Romance of the Three Kingdoms, was re-creation on top of folklore and earlier works. Such re-creation (call it fan fiction) is comparatively much rarer in modern times because of copyright. This is, at its core, a trade-off: we believe that by protecting intellectual property and letting creators profit, more works will be produced, compensating for the loss caused by having less re-creation.

On the other hand, intellectual property, or more narrowly the patent system, had another goal at its inception: encouraging disclosure. The patent system requires an application to be filed and published. This guarantees that after 20 years the knowledge is not lost, but returns to the public commons.

That is also why, even today, many high-tech innovations, such as engine manufacturing and materials, are protected as trade secrets instead. It is why you will never see a patent application for the hydrogen bomb.

So the fields that lean most heavily on intellectual property protection are those where reverse engineering is easy, the marginal cost of reproduction trends toward zero, but research and development costs are high: cultural products, software and pharmaceuticals.

With all that said, does intellectual property really let inventors benefit, and thereby encourage invention?

The widely recognized inventor of the cathode-ray-tube television is Philo Farnsworth (hereafter Farnsworth).

Farnsworth was born into a farming family in 1906. At 15, he dreamed of inventing a device that could transmit sound and images together. At 21, he built the first prototype of a cathode-ray-tube television. Before that, televisions were mostly based on mechanical scanning (Nipkow disks). The cathode-ray-tube television was more reliable and its picture was clearer. In fact, our displays were still based on cathode-ray tubes until the early 2000s, which speaks to the effectiveness of this invention.

Farnsworth moved to California in 1926 and filed a patent for the cathode-ray-tube television. The patent was granted in 1930, and Farnsworth planned to manufacture televisions and make a fortune. Just as he was about to start, the Radio Corporation of America (RCA) claimed that its employee Vladimir Zworykin had invented television before Farnsworth and asked the patent office to invalidate Farnsworth’s patent.

In 1931, RCA extended an olive branch: it offered $100,000 for a license and a job at the company. Farnsworth refused.

To own a complete patent pool for television, Farnsworth went into debt litigating against RCA. By the end of 1939 he had finally won all the lawsuits and signed a one-million-dollar licensing agreement with RCA. But by then the key television patents were close to expiring, and he was saddled with debt. To make things worse, World War II had begun and the United States suspended emerging entertainment services such as television to throw everything into the war effort.

Farnsworth’s story makes it hard to argue that patent fights tilt entirely in the inventor’s favor. With deep enough pockets, one can simply use legal procedures to drain an inventor’s time and energy. This partially explains why, today, it is usually large, well-capitalized companies that hold massive numbers of patents and other intellectual property.

In the fields that rely most on intellectual property protection, cultural products, software and pharmaceuticals, does the protection actually work?

In culture, music, film and television are the best-known fields that monetize intellectual property. This is also the field in which YYeTs infringed. Music in particular, with its low barrier to creation and trivially easy copying, was the first to be hit by the internet. On the surface, the blow looks severe: in the United States alone, recording industry revenue reached $14 billion in 2000 but was $11 billion in 2019, and even that owes to the growth of streaming over the past few years.

Yet creators’ incomes have not fallen. Comparing top young singers, Taylor Swift was worth $65 million in 2020, while Britney Spears was worth $38 million in 2000. With the decline of the record industry, more creators monetize experiences that cannot be copied, such as live concerts. Hence Radiohead’s 2007 piece of performance art: putting the new album In Rainbows online for direct download.

Collaborative internet creation projects like SCP are likewise experimenting with sharing and collaboration under a more open attitude toward copyright.

In software, nearly every practitioner agrees that software patents do the industry more harm than good. Because patents are transferable, especially software patents, we got companies like Intellectual Ventures, idealistic in name only, whose actual business is predatory patent trolling. These companies invent nothing themselves; they buy up vaguely and broadly worded patents and make money by litigating them, for example suing Facebook for infringing a patent on “a method for connecting different people through a network”, or suing Google for infringing “a method for requesting resources from a server”. Such abuse of patent litigation only hurts the people who actually turn inventions into products that benefit humanity.

Precisely because software patents have so many drawbacks, people invented patent pools: companies put their patents around a given technology together, and paying the pool buys you its protection. If someone sues over a technology in the pool, everyone chips in to fight the lawsuit. The H.265 codec and the so-called 4G / 5G technologies are, at their core, patent pools of their members’ intellectual property.

In pharmaceuticals, because of the positive externalities of drugs, intellectual property protection has always been controversial. The movie Dying to Survive (《我不是药神》) centers on the Indian government’s refusal to recognize drug patents in order to guarantee its people access to cheap medicine. Ironically, it is precisely because India manufactures patented drugs at scale that it has accumulated vast drug manufacturing experience and become a major producer of drugs for the world. Even in the United States, the EpiPen (a portable auto-injector of emergency allergy medication) quadrupling in price over the past decade sparked a broad debate about whether drug patents should come with price limits.

From another angle, modern drug development is no longer a model where one big company carries a drug from initial research all the way through Phase III. More often, early-stage research is funded by the state through universities and labs. When a drug looks promising, the professor typically founds a small company to commercialize the research. After the small company completes Phase I and II trials, it is packaged up and sold to a big pharmaceutical company like Lilly, which then runs the time-consuming and expensive Phase III trials and brings the drug to market. This very model has triggered the big debate over why taxpayer money funds the research while big pharma pockets the profits.

A direct example of patents’ effect on technology is E Ink. E Ink was once a next-generation display technology on par with OLED. But the E Ink corporation monopolized the core patents. For more than a decade after its invention, E Ink would not license other companies to manufacture it, so refresh rate, size, price and new applications saw little progress. Now, everyone is waiting for the burst of new products once the core E Ink patents expire in the 2020s.

This is not wishful thinking. Most LED lighting patents expired around 2010. Coincidentally or not, the LED industry exploded right after 2010. It is only after 2010 that many households started installing LED strips, indirect LED lighting and color-changing LED bulbs.

It is the copyability of knowledge that keeps making our lives better. In the internet era, acquiring and copying knowledge has become even cheaper. Open-source software is a living example of knowledge sharing in practice. As a kid I did not understand the relationship between intellectual property and open source either; at 14 I wrote an article in the Computer Newspaper (电脑报) denouncing people who repackaged the open-source FileZilla and sold it in China for disrespecting intellectual property. Today I believe more firmly that anything that helps spread knowledge is good. If a repackaged, translated version that charges money spreads it better than the free one does, why not?

SQLite, a widely used embedded database, takes the spirit of knowledge sharing to its fullest. Unlike most open-source software, which requires the license to be included when redistributing (“Permission is hereby granted … without restriction, including without limitation …”), SQLite, like every invention made before intellectual property existed, simply returns all rights to the public domain.

In the cultural sphere, Project Gutenberg is devoted to making every work whose copyright has expired available on the internet for free download and reading. At the beginning of every year, Project Gutenberg happily publishes a post introducing which good works have entered the public domain and can now be downloaded for free.

These projects are themselves a reflection on the intellectual property system. Ask yourself: Mickey Mouse, a single character, has given Disney intellectual property rights for close to a century, and as long as the United States exists that will likely continue. Has this done anything positive for the public commons? If Wu Cheng’en had held the intellectual property rights to Sun Wukong into the present day, would we ever have gotten A Chinese Odyssey (《大话西游》)?

Now let us ask again: has intellectual property achieved its original goals of encouraging market competition and more invention?

In 1774, Britain was the world’s largest textile producer. To protect British textile companies, it passed laws preventing the export of textile technology. Samuel Slater memorized how textile machines were built and operated, arrived in New York in 1789, and rebuilt the machines from memory. Today, Slater is widely regarded as the father of the American industrial revolution.

Of course, later, once the United States had taken the technological lead, it used international treaties and institutions to write intellectual property protection into the laws of participating countries in order to protect its own advantage. What the TPP (Trans-Pacific Partnership) was widely criticized for is precisely writing America’s overly strict and lengthy domestic intellectual property rules into the laws of its partners.

Intellectual property grants all the attributes of property to knowledge, which is inherently copyable and transmissible, and isolates it from the commons of humanity for a long time. In today’s era of tight collaboration and rapid iteration of knowledge, whether such complete intellectual property rights promote invention, or merely protect the interests of capital, is highly questionable. We should not accept the concept of intellectual property wholesale without thinking. We should encourage the good parts and criticize the bad. Intellectual property should be a dynamic concept. As Liu Cixin put it, think more, and think more before it happens; that is always right.

February 7th, 2021

For many machine learning practitioners, the training loop is a universally agreed-upon concept, as demonstrated by the numerous documentation pages and conference papers that use the term without any reference. It is a helpful concept for beginners to get familiar with before diving into the rabbit holes of the many deep learning tools.

The field of machine learning has become vast, diverse and ever more open in the past 10 years. We have all kinds of open-source software, from XGBoost, LightGBM and Theano to TensorFlow, Keras and PyTorch, to simplify various machine learning tasks. We have supervised learning, unsupervised learning, generative networks and reinforcement learning; pick your own pill. It can be dazzling for beginners. It doesn’t help that much of the popular software we use hides details from beginners behind abstractions like classes, functions and callbacks.

But fear not: for machine learning beginners, there is one universal template. Once you understand it, it is straightforward to fit all existing training programs into this template and start to dig into how they implement the details. I call this the universal training loop.

An Extremely Simplified Interface

Many high-level frameworks provide an extremely simplified interface that looks like this:

func train(training_dataset: Dataset, validation_dataset: Dataset) -> Classifier

When you do:

let trainedClassifier = Classifier.train(training_dataset: training_dataset, validation_dataset: validation_dataset)

You somehow get the trained classifier from that interface. This is what you would find in FastAI’s Quick Start page, or Apple’s Create ML framework.

However, this doesn’t tell you much about what it does. It also doesn’t help that some of these frameworks provide callbacks or hooks into the training loop at various stages. The natural question is: what are the stages?

A Supervised Learning Training Loop

It is actually not hard to imagine what a supervised learning training loop would look like underneath the extremely simplified interface. It may look like this:

var classifier = Classifier()
for example in training_dataset {
  classifier.fit(input: example.data, target: example.target)
}

It goes through all examples in the training dataset and fits them one by one.

For stochastic gradient descent methods, we make a few modifications so the training process is more stable and less sensitive to input order:

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  classifier.fit(inputs: minibatch.data, targets: minibatch.targets)
}

We randomize the order of the inputs (shuffled()), group them into mini-batches, and pass them to the classifier, assuming the classifier operates on a group of examples directly.

For many different types of neural networks, shuffled mini-batches will be the essential part of your training loop for both efficiency and stability reasons.

4 Steps to Fit

The magical fit function doesn’t inspire any deeper understanding of what’s going on. It looks like we simply lifted the train function into a for-loop.

However, if we can assume that our machine learning model is differentiable and based on gradient methods (e.g. neural networks), we can break down the fit function into 4 steps.

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  // Given inputs, a machine learning model can guess what the
  // outputs would be. (Labels of the images, positions of the faces
  // or translated texts from the original.)
  let guesses = classifier.apply(inputs: minibatch.data)
  // Loss measures how far away our guesses are from the targets
  // we knew from the training data. This is supervised learning, we
  // know the answer already.
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Based on the loss, gradients give us the direction and magnitude
  // to update our model parameters.
  let gradients = loss.gradients()
  // Update the parameters with gradients from this mini-batch.
  // Optimizer specifies a particular algorithm we use to update
  // parameters, such as stochastic gradient descent or ADAM.
  optimizer.apply_gradients(gradients, classifier.parameters)
}

For any supervised learning task, you will be able to find these 4 steps. There can be variations: some models may accumulate gradients over several rounds before calling apply_gradients; some may apply additional clipping to the gradients before applying them.
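
As a sketch of those variations (Gradients.zero, the + on gradients and clipped(by:) are made-up helpers for illustration), gradient accumulation and clipping slot into the same 4 steps:

var classifier = Classifier()
var accumulated = Gradients.zero
for (i, minibatch) in training_dataset.shuffled().grouped(by: 32).enumerated() {
  let guesses = classifier.apply(inputs: minibatch.data)
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Accumulate gradients over 4 mini-batches before updating, which
  // behaves like a 4x larger batch size.
  accumulated = accumulated + loss.gradients()
  if (i + 1) % 4 == 0 {
    // Optionally clip the accumulated gradients before applying them.
    optimizer.apply_gradients(accumulated.clipped(by: 1.0), classifier.parameters)
    accumulated = Gradients.zero
  }
}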

You could find the exact 4 steps in frameworks like Keras or PyTorch.

Validation Dataset and Epoch

We haven’t talked about the validation_dataset parameter you saw earlier for the train method!

For first-order gradient-based methods (e.g. neural networks), we need to go over the whole training dataset multiple times to reach a local minimum (a reasonable model). When we have gone over the whole training dataset once, we call it one epoch. Our models can also suffer from the overfitting problem. The validation dataset is data the model never uses when updating its parameters. It is useful for understanding how our model behaves on data it has never seen.

To incorporate the above two insights, our training loop can be further modified to:

var classifier = Classifier()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, classifier.parameters)
  }
  var stats = Stats()
  for example in validation_dataset {
    // Only gather guesses, never update the parameters.
    let guess = classifier.apply(input: example.data)
    // Stats will compare guess to the target, and return some
    // helpful statistics.
    stats.accumulate(guess: guess, target: example.target)
  }
  print("Epoch \(epoch), validation dataset stats: \(stats)")
}

Now I can claim that for any supervised learning task, you will find the above training loop when you dig deep enough through the abstractions. We can call this the universal supervised training loop.

Unsupervised Learning and Generative Networks

The main difference between unsupervised learning and supervised learning in our training loop is that we don’t have the target provided by the training dataset. We derive the target from somewhere else. In unsupervised learning, we derive the target from some transformation of the input. In generative networks, we derive the target from random noise (hence generating something from nothing).

var model = Model()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = model.apply(inputs: minibatch.data)
    // Unsupervised learning.
    let targets = model.targets(from: minibatch.data)
    // Generative networks
    // let targets = model.targets(from: noise)
    let loss = model.loss(guesses: guesses, targets: targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, model.parameters)
  }
}

Oftentimes, for these types of tasks, the targets are derived from another set of neural networks and updated jointly. Because of that, there are more bells and whistles in many frameworks when they implement the above training loop. You can find an example from Keras of how they derive targets from the input data only (get_masked_input_and_labels) for BERT (a popular unsupervised natural language processing model). Or you can find an example from PyTorch of how they generate adversarial examples from noise for DCGAN (deep convolutional generative adversarial network).
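
As a sketch of the masking idea behind that Keras example (the 15% ratio follows BERT’s recipe, but maskToken and the helper itself are made up here), deriving targets from the inputs alone can be as simple as:

// Mask out ~15% of the tokens and ask the model to recover them. Positions
// that are not masked carry no target and are ignored by the loss.
func maskedInputAndTargets(from tokens: [Int]) -> (inputs: [Int], targets: [Int?]) {
  var inputs = tokens
  var targets = [Int?](repeating: nil, count: tokens.count)
  for i in 0..<tokens.count where Float.random(in: 0..<1) < 0.15 {
    targets[i] = tokens[i]
    inputs[i] = maskToken
  }
  return (inputs, targets)
}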

Deep Reinforcement Learning

Deep reinforcement learning generates the training data by having an agent interact with the environment. It has its own loop that looks like this:

while true {
  let action = agent.respond(to: lastObservation)
  let (observation, reward, done) = environment.interact(action: action)
  lastObservation = observation
  if done {
    break
  }
}

The agent takes an action based on the last observation. The environment reacts to that action and gives back a new observation.

This is independent of our training loop. In contrast to supervised learning, in deep reinforcement learning we use this interaction loop to generate training data.

Our training loop can be modified to look like this:

var policy = Policy()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  var data_in_episode = Dataset()
  while true {
    let action = policy(inputs: lastObservation)
    let (observation, reward, done) = environment.interact(action: action)
    data_in_episode.append((action: action, reward: reward, observation: lastObservation))
    lastObservation = observation
    if done {
      for (i, data) in data_in_episode.enumerated() {
        // Use all future rewards to compute our target.
        let target = target_from_future_rewards(data_in_episode[i..<])
        // Our input will be the last observation (Q-learning), and
        // potentially also include the action (Actor-Critic model),
        // or the next observation (model-based methods).
        training_dataset.append((input: (data.action, data.observation), target: target))
      }
      break
    }
  }
  // Rather than shuffling the whole training dataset, we just randomly
  // sample a subset.
  for minibatch in training_dataset.randomSampled().grouped(by: 32) {
    let guesses = policy.apply(inputs: minibatch.data)
    let loss = policy.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, policy.parameters)
  }
}

The training_dataset in the above training loop can be referred to as the replay memory in the literature. If we retain all the old data before training, this is often referred to as off-policy training. If instead we remove all training data after each episode, it is called on-policy training. OpenAI Spinning Up has a better explanation of the differences between on-policy and off-policy training with a bit more detail.
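
As a sketch of that difference (onPolicy and replayMemoryLimit are made-up knobs), the only change is what happens to training_dataset at the end of each epoch:

if onPolicy {
  // On-policy: only ever train on data generated by the current policy.
  training_dataset.removeAll()
} else if training_dataset.count > replayMemoryLimit {
  // Off-policy: keep a bounded replay memory by dropping the oldest data.
  training_dataset.removeFirst(training_dataset.count - replayMemoryLimit)
}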

You can find examples of the above training loops in PyTorch or in OpenAI Baselines.

Distributed Learning

Following the same training loop, we can extend it to train on multiple machines:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    // Compute gradients from the specific data on this machine only.
    let machineGradients = loss.gradients()
    // Use allreduce primitive to compute gradients summed from all
    // machines.
    let allGradients = allreduce(op: +, value: machineGradients, on: machineID)
    // Applying updates with the same gradients should yield the same
    // parameters on all machines.
    optimizer.apply_gradients(allGradients, classifier.parameters)
  }
}

The allreduce primitive goes over all machines and sums the gradients from them. In reality, it is often implemented with a ring-based communication pattern to optimize throughput. Naively, it can look like this:

func allreduce(op: Op, value: Tensor, on: Int) -> Tensor {
  if on == 0 {
    // Machine 0 gathers the values from everyone else, sums them up and
    // sends the result back.
    var tensors = [Tensor](repeating: value, count: MPI_Comm_size())
    for i in 1..<MPI_Comm_size() {
      tensors[i] = MPI_Recv(from: i)
    }
    let sum = tensors.sum()
    for i in 1..<MPI_Comm_size() {
      MPI_Send(sum, to: i)
    }
    return sum
  } else {
    // Every other machine sends its value to machine 0 and waits for the sum.
    MPI_Send(value, to: 0)
    return MPI_Recv(from: 0)
  }
}
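
The ring-based pattern mentioned above can be sketched as follows, assuming the same hypothetical MPI_Send / MPI_Recv primitives and that the tensor splits evenly into MPI_Comm_size() chunks (split(into:) and Tensor(concatenating:) are made up). Each machine only talks to its neighbors, so per-machine bandwidth stays constant as the number of machines grows:

func ringAllreduce(value: Tensor, on rank: Int) -> Tensor {
  let size = MPI_Comm_size()
  var chunks = value.split(into: size)
  let next = (rank + 1) % size
  let prev = (rank + size - 1) % size
  // Reduce-scatter: after size - 1 steps, the chunk at index
  // (rank + 1) % size holds the sum of that chunk across all machines.
  for step in 0..<(size - 1) {
    let sendIndex = (rank + size - step) % size
    let recvIndex = (rank + size - step - 1) % size
    MPI_Send(chunks[sendIndex], to: next)
    chunks[recvIndex] = chunks[recvIndex] + MPI_Recv(from: prev)
  }
  // Allgather: circulate the fully reduced chunks around the ring so that
  // every machine ends up with all of them.
  for step in 0..<(size - 1) {
    let sendIndex = (rank + size - step + 1) % size
    let recvIndex = (rank + size - step) % size
    MPI_Send(chunks[sendIndex], to: next)
    chunks[recvIndex] = MPI_Recv(from: prev)
  }
  // A real implementation would overlap the send and receive (e.g. with
  // MPI_Sendrecv) rather than issuing them back-to-back.
  return Tensor(concatenating: chunks)
}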

This naive data-distributed training loop can be extended to more sophisticated distributed training regimes. For example, in ZeRO-Offload, the distributed strategy can be represented as:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    for (i, gradient) in gradients.enumerated() {
      // Each machine only sums the gradients it is responsible for. This
      // method will return nil if it tries to reduce a gradient it
      // is not responsible for.
      if let reducedGradient = reduce(op: +, id: i, value: gradient, on: machineID) {
        // Copy the summed gradient to CPU.
        cpuGradients[machineID, i] = reducedGradient.copy(to: .CPU)
      }
    }
    // Apply gradients to the model from CPU.
    optimizer.apply_gradients(cpuGradients[machineID], classifier.parameters[machineID])
    // Broadcast the latest parameters to all machines.
    broadcast(classifier.parameters[machineID])
  }
}

The Universal Training Loop

Finally, we arrived at our universal training loop template:

var model = Model()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  // Collect training dataset either from agent-environment
  // interaction, or from the disk.
  training_dataset = collect(...)
  // Go over mini-batch either on the whole training dataset, or from
  // a subset of it.
  for minibatch in training_dataset.extract(...) {
    // Apply the model to generate some guesses.
    let guesses = model.apply(inputs: minibatch.data)
    // Generate targets either from the inputs, from noise, or from what
    // already exists in the training dataset, or a combination of the above.
    let targets = targets_from(...)
    // Compute the loss from the model's guesses w.r.t. the targets.
    let loss = model.loss(guesses: guesses, targets: targets)
    // First-order gradients from the loss.
    let gradients = loss.gradients()
    // Sum gradients from everywhere (other machines, other GPUs) for
    // this particular node to process.
    if let (i, collected) = gradients.collect(...) {
      optimizer.apply_gradients(collected, model.parameters[i])
      // Broadcast the updated parameters to everywhere.
      broadcast(model.parameters[i])
    }
  }
}

*: The pseudo-code in this article uses Swift. One particular syntax concern is a[x..<y]. It is semantically the same as a[x:y] in Python, including cases where part of the subscript is missing: a[x..<] is the same as a[x:].

February 4th, 2021

I started to work on a new data-science-related project last year. Our initial setup was fairly traditional: Python-centric, with Anaconda for dependency management and environment setup.

Anaconda is interesting. It is probably useful if you have a ton of dependencies and want a system that tries really hard to figure out package compatibility based on the packages’ claims (the version numbers of their dependencies). For us though, we only use a handful of packages with clean dependencies on each other (the usual suspects: Pandas, SciPy, numpy), and the version number check just means each package upgrade is half an hour of SMT solving.

On the other hand, I made my fortune in the past decade doing app development (Facebook, Snapchat). I’ve been eyeing Swift since version 1.0. Since then, the language has matured a lot. After gaining some Swift experience with my last employer, it seems to be a good compromise between expressivity and performance. The noise from the language itself is minimal, and the performance can be tuned if you push hard on it, and is otherwise still better than raw Python.

After a few months of probing, investing and bug fixing, we migrated our setup to Swift last December. It has been serving us well so far.

Problems

We have a small project, and our problems with packages are mostly around package management and upgrades. Since Anaconda’s environment is not per-project, we have to switch back and forth when entering / leaving the project.

Our project, although small, is a bit exotic. Our core algorithm was implemented in C, and we didn’t want to ship a Python plugin in-tree. Hence, we opted to talk to the C lib through standard IO (subprocess). It turned out to be hairy, and the core algorithm update process was more than terrible.

Pandas has reasonable performance if you stick to built-in functions, but once we drop down to apply / lambda, going through a few million rows for a particular column can take 30 to 40 seconds. In these cases, we also cannot use the remaining idle cores efficiently.

However, switching to a new language setup is not an easy task. Besides solving the problems above, we still wanted to keep a few things we liked about our old setup:

  • Jupyter notebook: we really liked to do data exploration with Jupyter notebooks. Anything that requires us to compile / run / print would be a no-go. The interactive data exploration experience is essential for our workflow.

  • Pandas: we liked Pandas, it is a Swiss Army knife for data science. It also has a big API surface that would be very hard to reimplement from scratch.

  • PyCharm or a similar IDE: we liked using PyCharm for Python development a lot. Data inspection, re-runs, and in general the debugging experience within an IDE is no comparison to tools like vim (although I still use vim for a lot of development personally).

Other Potential Choices

Before embarking on this quest, we briefly looked at other potential choices:

  • TypeScript / JavaScript: it has some interesting data exploration patterns, such as observablehq.com. However, it doesn’t have good C library integration points, leaving the aforementioned core algorithm integration problem unsolved.

  • Rust: when I did my investigation, I didn’t notice the evcxr project. Other than that, the syntax for modeling would be a bit noisier than I’d like. The way to call Python through PyO3 is a bit clumsy too.

  • Julia: if I had more experience with the language, I might have a different opinion. But as it stands, the language has its own ecosystem, and I didn’t see a good way to call Python libraries from it*. On the C lib integration side, it seems to require dynamic linking, which would be a bit more of a hurdle in my toolchain setup.

New Setup

Monorepo

Our new setup is a monorepo, with Bazel as the build system for our Swift code, C libraries and Python dependencies. In the meantime, we still have some legacy Python libraries that are now managed by Bazel too.

Bazel’s new rules_python has a pretty reasonable pip_install rule for 3rd-party dependencies. As I mentioned, because we use a relatively small number of Python packages, cross-package compatibility is not a concern for us.

All our open-source dependencies are managed through WORKSPACE rules. This works for our monorepo because we don’t really have a large number of open-source dependencies in the Swift ecosystem. The things we import are mostly Swift numerics, algorithms and argument-parser.

Jupyterlab

We don’t use a separate Jupyter installation anymore. Jupyterlab is installed as a pip_install requirement for our monorepo. Opening Jupyterlab is as simple as bazel run :lab. This enables us to keep our Swift Jupyter kernel in-tree. We adopted swift-jupyter and added Bazel dependency support. We also have a pending PR to upstream our sourcekit-lsp integration with the Swift Jupyter kernel.

This complicates plugin management for Jupyterlab a bit. But we haven’t yet found a well-maintained out-of-tree plugin that we would like to use.

Python

To support calling Python from Swift, we opted to use the PythonKit library. We’ve upstreamed a patch to make Pandas work better within Swift. We made a few more enhancements around UnboundedRange syntax and passing Swift closures as Python lambdas, which haven’t been upstreamed at the moment.
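
For a flavor of what this looks like (a minimal sketch with made-up data; Python.import and the implicit Swift-to-Python conversions are PythonKit features), calling Pandas from Swift reads almost like Python:

import PythonKit

let pd = Python.import("pandas")

// PythonKit converts Swift arrays and dictionaries into Python objects, so
// building a DataFrame looks just like it does in Python.
let data: [String: [Double]] = [
  "price": [1.0, 2.0, 3.0],
  "volume": [10.0, 20.0, 30.0],
]
let df = pd.DataFrame(data)

// Regular Pandas methods come back as PythonObject values.
print(df.describe())

// Convert scalars back into native Swift types when needed.
let meanPrice = Double(df["price"].mean()) ?? 0
print(meanPrice)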

One thing that makes calling Python from Swift easy is the use of reference counting in both languages. This makes memory management across the language boundary much more natural.

We also wrote a simple pyswift_binary macro within Bazel such that a Swift binary can declare its Python dependencies, and these will be set up properly before invoking the Swift binary.

We haven’t vendored our Python runtime yet; we are just painstakingly making sure all machines are on Python 3.8. However, we do intend to use py_runtime to solve this discrepancy in the future.

C Libraries

Swift’s interoperability with C is top-notch. Calling and compiling C dependencies (with Bazel) and integrating them with Swift is as easy as it should be. So far, we’ve fully migrated our core algorithms to Swift. Iterations on the core algorithms are much easier.

IDE Integration

For obvious reasons, we cannot use PyCharm with Swift. However, we successfully migrated most of our workflow to VS Code. LLDB support makes debugging easy. However, we did compile our own Swift LLDB with Python support for this purpose (we are still figuring out with the core team why the shipped Swift LLDB on Linux has no Python support).

Sourcekit-lsp doesn’t recognize Bazel targets. The work on Build Server Protocol support, both on the Bazel side and on the sourcekit-lsp side, seems stalled at the moment. We ended up writing a simple script that queries compilation parameters from bazel aquery for all our Swift source code and puts them into compile_commands.json. Sourcekit-lsp has just enough support for Clang’s compilation database format to make code highlighting, auto-complete, go-to-definition and inline documentation work again.

We committed .vscode/settings.json.tpl and .vscode/launch.json.tpl into the codebase. Upon initial checkout, a small script converts these template files into actual settings.json and launch.json files. We did this to work around issues with some VS Code plugins requiring absolute paths. These two files are kept in sync with the templates as part of the build process.

Bonus

Swift’s argument-parser library makes creating command-line tools really easy. Furthermore, it also supports auto-completion in your favorite shells. We implemented one all-encompassing CLI tool for our monorepo to do all kinds of stuff: data downloading, transformation, model evaluation, launching Jupyterlab etc. With auto-completion and help messages, it is much easier to navigate than our previous ./bin directory with twenty-something separate scripts.
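
As a rough sketch of what such a tool looks like (the subcommand names and options here are invented for illustration), argument-parser keeps the structure mostly declarative:

import ArgumentParser

struct Tool: ParsableCommand {
  static var configuration = CommandConfiguration(
    abstract: "All-encompassing helper for the monorepo.",
    subcommands: [Download.self, Lab.self]
  )
}

struct Download: ParsableCommand {
  @Option(help: "Which dataset to download.")
  var dataset: String

  func run() throws {
    print("Downloading \(dataset)...")
  }
}

struct Lab: ParsableCommand {
  @Flag(help: "Open the browser after Jupyterlab starts.")
  var openBrowser = false

  func run() throws {
    print("Launching Jupyterlab (open browser: \(openBrowser))...")
  }
}

Tool.main()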

Is this the End?

We’ve been happy with this new setup since the beginning of this year, but it is far from the end-all be-all solution.

Swift is great for local development; however, the Bazel support on Linux generates binaries that dynamically link to the Swift runtime. This makes deployment a little more involved than we’d like it to be.

Pandas uses an internal data structure for DataFrames (column-based numpy). While we do have efficient numpy-to-tensor conversions, these cannot be leveraged in the context of Pandas (we want to lift the whole data frame, not just the numpy-ready columns). Our current solution calls itertuples, which can be quite slow. We’d like to sponsor an open-source project to implement Apache Arrow support for Swift. This should enable us to pass data from Pandas to Swift through the Arrow in-memory format, which may be faster than what we currently do.


*: notagoodidea pointed out there is a PyCall.jl library that implements Python interoperability.

December 29th, 2020

Every two years, I make a 4-year scenario prediction / planning exercise. The previous prediction for 2020, however, didn’t come close to anticipating how dramatic the year would be. Such a drastic change, though, makes the seemingly gloomy predictions from two and four years ago look rather optimistic.

Right on the mark in the 4-year prediction made in 2016 is the prevalence of driver-assistance technologies in all types of cars. Cruise and Waymo finally started their human-driver-free autonomous vehicle commercialization in San Francisco and Phoenix, Arizona. The surviving mobile messaging services all have around 300 million DAUs by now.

A lot of long-haul flights have been cancelled, but for reasons other than what was predicted. Crude oil prices swung widely and ended the year around $40, but the direction of those swings was the opposite of the prediction.

VR devices did not sell 30 million units by the end of 2019. Only about 3 million were sold that year.

After writing these posts for a few years, my goal has changed from the singular pursuit of accuracy to unmasking new trends while staying accurate. That’s why, after almost ten years of religiously repetitive discussion of autonomous driving, there is none this time. It is on track for real-world productionization now; the rest will happen over time.


2024 should be an optimistic year. After the global pandemic, the predicted depression will not happen, thanks to combined monetary and fiscal policies from the world’s governments. Developed countries, especially the United States after its dollar normalization effort (the dollar will account for 30% or less of global transactions), will be on a strong recovery path.

Technology development won’t slow down a bit, even during the pandemic. More data centers and a faster internet are what people expect post-pandemic. More, and sustained, work digitalization accelerates the migration to the cloud. By 2024, online transactions globally will account for more than 50% of total consumer transactions.

The latency of wireless networks will improve: 90% of connections will be below 50ms, and 25% of the world population will have access to 1Gbps internet connections. The best workstation will do 1 petaflops of full 32-bit floating-point matrix multiplication, and around 10 petaflops of near 32-bit floating-point matrix multiplication. To put this in perspective, the Fujitsu K supercomputer in 2011 could only do about 10 petaflops.

Computation won’t be monopolized by the cloud. In 2024, we will see at least one hyperscaler run a 1-gigawatt data center, but at the same time there will be last-mile edge data centers with less than 15 kilowatts of power consumption. Successful edge computing providers will manage close to 25,000 edge computing data centers. These data centers will be the key to high-speed interactive video and cloud gaming.

Even with the moderate success of cloud gaming, there won’t be many more gamers for AAA titles; existing gamers will simply play more. On the other hand, game engines will penetrate more markets. By 2024, 20% of independent content creators will use game engines for real-time visual effects. More than half of online education kits will use game engines for presentation.

The success of mRNA vaccines and the subsequently streamlined regulation will make it possible that in 2024, a privileged few can enjoy personalized vaccines. These vaccines will start with the common cold, and we will even hear some good news on Alzheimer’s vaccine development.

We cannot finish a full-blown AGI system by 2024. However, a generalized reinforcement learning architecture will be successfully applied to everything from short-term trend analysis to long-term strategic planning. Besides a few areas with unresolved data-input problems (such as multilateral trade negotiations and civil / criminal litigation), many more places will run on this generalized reinforcement learning system, and we will simply be responsible for data input and maintenance. We will be an inch closer to a self-evolving artificial general intelligence system.

The simultaneous online / in-theater movie releases won’t go away post-pandemic. By 2024, 50% of movies will have a digital release within 30 days of their in-theater release. Interactive videos mixing live streams and recorded clips, a few minutes apiece, will occupy the rest of our entertainment attention. We will demand faster feedback loops from content creators: 90% of top content creators will publish daily, compared to twice a week today.

During our economic recovery, globalization and related topics such as climate change won’t make much progress. In between mega wildfires and mega storms, we will witness a large-scale synchronized die-off of marine animals near the continental shelf. Due to ocean acidification, we will see growing investment in kelp farming along the Pacific shelf. It will be so profitable that we will even build kelp farms in the deep sea. Kelp will become a fashionable health food.

With reforestation and kelp farming, by 2024 the lungs of the Earth will slowly start to heal. This is far from enough. With some exceptions, most countries, in particular China, India and the United States, will either give up on or fail to reach their 2024 carbon reduction goals. There is still no viable large-scale carbon capture technology in use. The only good news from climate change is probably a few more year-round ice-free ports for the Russians.

With the stagnation of globalization and slow-motion ideological conflicts, we will still be hopeful in 2024: if we can progress fast enough, as we did in the past 4 years, maybe, just maybe, the worst won’t come.

December 6th, 2020

From the onset of implementing libnnc, it was meant to be a common ground for higher-level language bindings beyond Python. The underlying architecture has been stable for a year or so, and I have been using it for some personal projects for a while. But the raw C interface is not the easiest to use, so it was time to implement some high-level language bindings for the library.

The default high-level language for deep learning is likely Python. However, I am not happy with how it performs on a typical many-core system, even with things that are supposed to help. Swift, on the other hand, has no issues saturating my many-core system, and it has a reasonable Python binding to tap into the rich Python toolkits. Not to mention that calling C functions from Swift is as easy as you can possibly get.

This conviction resulted in s4nnc, a Swift language binding for libnnc. Because s4nnc is a pure interface to interact with the underlying deep learning library, I paid close attention to its API design. Below are some design notes on why it is the way it is, and how Swift as a language fares at such a task. If you want to read an introduction to s4nnc, feel free to visit the GitHub homepage.

What is a Deep Learning Library API

A good deep learning API, to me, can be largely modeled after Keras and PyTorch. It concerns, above all, 2 questions:

  1. How to specify a deep learning model?
  2. How to construct a training loop?

Everything else is nice-to-have and largely orthogonal to these two questions (but those bells-and-whistles are a lot of hard work!).

A training loop consists of a repeated sequence of operations to: evaluate a model, compute gradients against the loss, apply gradients to update model parameters.

The details can be flexible: you could evaluate one part of the model in one round and another part in another round; you could have different losses for different model outputs each round; you could modify the gradients, scale them or truncate them however you like; and you could apply gradients to update different model parameters with different optimizers. But at the core, I didn’t see much change to this 3-step structure across a lot of model training code.

What constitutes a model can be more interesting, but it seems we have converged on a concept where a model consists of some inputs, some outputs, and some parameters. Particularly, parameters are stateful and internal to the model itself. Beyond that, a model could have different inputs / outputs and different input / output shapes during the training loop. However, the shapes and the number of parameters during training are likely to be constant.

Basic Data Types

A deep learning library operates on multi-dimensional arrays (or tensors). In Swift, a concrete tensor can be represented as a value type, like Array itself in Swift. That means the Tensor type needs machinery such as copy-on-write to implement said value-type semantics. Extra attention needs to be paid to make sure that, throughout the implementation of the API, no unnecessary copies are made. This value-type choice is a bigger deal than it sounds (it sounds like a no-brainer given S4TF made exactly the same choice) because in Python, everything is a reference type.
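
A minimal sketch of that copy-on-write machinery, assuming a plain Float buffer as the storage (the real Tensor wraps the underlying C tensor object instead), looks like this:

final class TensorStorage {
  let buffer: UnsafeMutableBufferPointer<Float>
  init(count: Int) {
    buffer = UnsafeMutableBufferPointer.allocate(capacity: count)
    buffer.initialize(repeating: 0)
  }
  init(copying other: TensorStorage) {
    buffer = UnsafeMutableBufferPointer.allocate(capacity: other.buffer.count)
    _ = buffer.initialize(from: other.buffer)
  }
  deinit {
    buffer.deallocate()
  }
}

struct Tensor {
  private var storage: TensorStorage
  init(count: Int) {
    storage = TensorStorage(count: count)
  }
  subscript(i: Int) -> Float {
    get { storage.buffer[i] }
    set {
      // Copy the underlying storage only when it is shared with another
      // Tensor value; unique owners mutate in place, so no unnecessary
      // copies are made.
      if !isKnownUniquelyReferenced(&storage) {
        storage = TensorStorage(copying: storage)
      }
      storage.buffer[i] = newValue
    }
  }
}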

This becomes more interesting when deciding whether tensor variables could be value types or not. A tensor variable is an abstract tensor type against which you can compute gradients (it has a grad property). It is bound to a computation graph, and it can be a parameter updated during the training loop. While it is possible to make many functions on tensor variables take inout parameters and to mark some of their methods as mutating, the hairier part is the updates during the training loop.

In PyTorch, an optimizer takes a list of parameters and then applies new gradients to update those parameters when the step() method is called. This is possible because parameters are reference types in Python: any update to the parameters is reflected in the model that holds them. This is not possible if tensor variables are value types, because in Swift you cannot hold a reference to a value type.

Thus, practically, tensor variables have to be implemented as reference types in Swift.
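
To see why, here is a small standalone illustration; ValueParam and RefParam are made-up types, not part of s4nnc. Copying a struct into an optimizer’s parameter list detaches it from the model, while a class instance stays shared:

// Value types are copied; reference types are shared.
struct ValueParam { var weight: Double }
final class RefParam {
    var weight: Double
    init(weight: Double) { self.weight = weight }
}

let modelValueParam = ValueParam(weight: 1)
var optimizerParams = [modelValueParam]   // This stores a copy.
optimizerParams[0].weight = 0             // The model's copy is unaffected.
print(modelValueParam.weight)             // 1.0

let modelRefParam = RefParam(weight: 1)
let optimizerRefs = [modelRefParam]       // This stores a shared reference.
optimizerRefs[0].weight = 0               // The model sees the update.
print(modelRefParam.weight)               // 0.0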

Despite my best intentions, it turns out most of the objects, including Graph, Model and StreamContext, are still implemented as reference types. It is possible for some of them (for example, Model) to be value types, but the lack of deinit in struct would require wrapping a reference type inside a value type to create such an API. At the end of the day, I don’t see much value, either in API aesthetics or in performance, in making these value types.

Automatic Differentiation

While Swift has a proposal for automatic differentiation, the automatic differentiation here is currently implemented at the library level and is only applicable to models and tensor variables.

On the API side, it is popular to have a backward() method on the final loss variable. That method computes gradients for all variables involved in the computation of the loss.

This also means we need to keep track of all variables in the computation graph until some point is reached where we can free them. In PyTorch, that point is when the step() method is called.

libnnc made the decision early on to avoid holding vast amounts of memory by default. That resulted in the interface backward(to tensors: Sequence<Tensor>), where you have to specify the tensors against which gradients are computed. Because we are doing backward-mode AD, we still compute gradients on all variables up until these tensors, but we don’t compute gradients for variables past that point. In effect, we can rely on reference counting to free the memory associated with tensors beyond that point.

In return for this slightly more complex interface, you don’t have to worry about scoping with no_grad to avoid unbounded memory allocations.

Optimizers in the Training Loop

An optimizer in a deep learning library represents a particular gradient-descent method, associated with the parameters it updates each round.

One particular challenge is how to use multiple optimizers with different parameters in one round. For simpler cases, you could call step() several times in one round, but it may be more efficient to call step() only once.

Swift makes this particular choice easier by supporting extensions on built-in types.

public extension Collection where Element: Optimizer {
  func step() {
    ...
  }
}

This is a good starting point to support more concise updates such as: [adam1, adam2].step().

The same pattern can be applied if you want to support gradients with multiple losses: [loss1, loss2, loss3].backward(to: x).

These extension methods are conditional, and type-safe in Swift.
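
As a standalone illustration of the mechanism (deliberately unrelated to deep learning), a conditional extension only exposes its methods when the constraint is satisfied, and misuse is rejected at compile time:

// sum() exists only for collections of numeric elements.
extension Collection where Element: Numeric {
    func sum() -> Element {
        reduce(0, +)
    }
}

print([1, 2, 3].sum())   // 6
// ["a", "b"].sum()      // Does not compile: String is not Numeric.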

Operators

While Swift allows operator overloading and the introduction of custom operators, the ambiguity of * is somehow not addressed. We cannot have consistency with Python because @ cannot be overloaded in the Swift language. I have to resort to .* and .+ for element-wise multiplication and addition.
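
Declaring such a dot-prefixed custom operator is straightforward. The sketch below uses plain [Float] arrays rather than s4nnc’s tensor types, purely to show the mechanism:

// A custom element-wise multiplication operator on plain arrays (illustrative only).
infix operator .*: MultiplicationPrecedence

func .* (lhs: [Float], rhs: [Float]) -> [Float] {
    precondition(lhs.count == rhs.count, "element-wise multiply requires equal lengths")
    return zip(lhs, rhs).map { $0 * $1 }
}

let c: [Float] = [1, 2, 3] .* [4, 5, 6]   // [4.0, 10.0, 18.0]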

Type-safe Functions

While Swift can enforce some type consistency, without further language-level changes we cannot deduce shapes, and we would still hit runtime errors if shapes don’t match. Even with language-level support, we may still need an escape hatch, because some tensors are loaded from IO. Not to mention it would be nice to support dynamic input shapes to a model while still statically computing its parameter shapes.

Transparent Multi-GPU Data Parallel Training

One annoying thing about the raw C interface is the transition from one GPU to multiple GPUs, even for simple data-parallel models. The inability to abstract tensors at a higher level in C makes the code unnecessarily complex.

Fortunately, with basic generics, this is not a problem in Swift. However, using such generics turns out to be more complicated than I expected on the library-author side. If I want to avoid runtime type checks (the is / as keywords), there is quite a bit of protocol / extension dance I need to do. This isn’t helped by Swift’s borderline hostility toward protocol associated types. Luckily, there are new proposals to lift some of the restrictions (and no, some is not all it needs). You can see this monstrosity here. In the end, I had to introduce some runtime type checks to keep my sanity.

Closing Words

I am pretty happy with the end result of s4nnc. It is small, versatile and does exactly what I set out to do: an ergonomics win over the raw C interface. Generics, type safety and reference counting in the Swift language really made a difference in the API ergonomics. The language itself is not the fastest, but it strikes a good balance between ergonomics, expressivity and performance. In the next blog post, I am going to detail the Swift data science workflow I use, and why I moved to it from the initial Python one. Stay tuned!

No comment yet
November 1st, 2020

For the past few months, I’ve worked on Dflat, among many other things. The theme is to explore a workflow that uses Bazel as the build and deployment system, and Swift as the main language for data exploration, modeling, and the production system.

Dflat appears to be an outlier for this theme. I simply had the urge to implement what I consider to be the one true way of doing structured data persistence on mobile, and to share it with the wider community. It was designed squarely for mobile data-persistence needs, much like Core Data before it. Case in point: the backing data engine has always been SQLite.

That was true up until recently. As I dug deeper into this Swift-for-data-exploration, modeling and production-system theme, it became increasingly likely that I would need some data persistence mechanism. It may not need to be as fancy as on mobile, where I can observe changes and rely on one-way data flow to update the UI. But it is darned nice to specify a schema explicitly, persist to a proper database, and not worry about schema upgrades at all.

For the past couple of days, I’ve worked on porting Dflat to Linux. With some minimal work (mostly figuring out the right SQLite compilation flags and moving code around to make the Swift Linux runtime happy), it is done. I’ve also added some Bazel rules for code generation, so if you use Bazel, Dflat works wonders out of the box.

What does this mean? If you use Swift on Linux, you don’t need to interact with SQLite directly any more. Dflat handles SQLite concurrency control, proper SQLite configuration, strongly-typed queries, data schema management, and transparent (and read-only) backward-compatible upgrades. Beyond that, it offers query subscriptions, so in a long-running program you can simply subscribe to a query and react to the changed results.

An Example

If you happen to do Bazel + Swift on Linux, using Dflat is simple. In your WORKSPACE file, add the following:

load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")
git_repository(
    name = "dflat",
    remote = "https://github.com/liuliu/dflat.git",
    commit = "3dc11274e8c466dd28ee35cdd04e84ddf7d420bc",
    shallow_since = "1604185591 -0400"
)
load("@dflat//:deps.bzl", "dflat_deps")
dflat_deps()

For a binary that needs to persist some user settings, you can start by editing the schema file, user.fbs:

table User {
    username: string (primary);
    accessTime: double;
}
root_type User;

In your BUILD.bazel:

load("@dflat//:dflat.bzl", "dflatc")
dflatc(
    name = "user_schema",
    src = "user.fbs"
)
swift_binary(
    name = "main",
    srcs = ["main.swift", ":user_schema"],
    deps = [
        "@dflat//:SQLiteDflat"
    ]
)

Use them in the main.swift file:

import Dflat
import SQLiteDflat
let workspace = SQLiteWorkspace(filePath: "main.db", fileProtectionLevel: .noProtection)
workspace.performChanges([User.self], changesHandler: { txnContext in
  let creationRequest = UserChangeRequest.creationRequest()
  creationRequest.username = "lliu"
  creationRequest.accessTime = Date().timeIntervalSince1970
  txnContext.try(submit: creationRequest)
})
workspace.shutdown()

Later, read the data back and update accessTime:

import Dflat
import SQLiteDflat
let workspace = SQLiteWorkspace(filePath: "main.db", fileProtectionLevel: .noProtection)
let result = workspace.fetch(for: User.self).where(User.username == "lliu")
let user = result[0]
print(user.accessTime)
workspace.performChanges([User.self], changesHandler: { txnContext in
  let changeRequest = UserChangeRequest.changeRequest(user)
  changeRequest.accessTime = Date().timeIntervalSince1970
  txnContext.try(submit: changeRequest)
})
workspace.shutdown()

iPhone 12 Pro vs. Threadripper 3970x

The Dflat benchmark is now available on both the iPhone 12 Pro and the TR 3970x. It enables some quick apples-to-oranges comparisons.

The iPhone 12 Pro benchmark followed exactly the previous benchmark.

The Linux benchmark runs with bazel run --compilation_mode=opt app:BenchmarksBin.

Work                                     iPhone 12 Pro   TR 3970x
Insert 10,000                            0.072s          0.054s
Fetch 3,334                              0.004s          0.006s
Update 10,000                            0.085s          0.066s
4-Thread Insert 40,000                   0.452s          0.219s
4-Thread Delete 40,000                   0.289s          0.193s
10,000 Updates to 1,000 Subscriptions*   3.249s          3.356s

* Each subscribed query contains roughly 1,000 objects.

Beyond Server / Client Relationship

The Bazel and Swift Package Manager versions of Dflat for Linux package SQLite 3.33.0 in source form. That enables some interesting ideas. With a little more work, we could compile Dflat to run in WebAssembly, and Wasm can be an ideal form to deploy at edge nodes. When I worked on the Snapchat app in the past, I long theorized that we could deploy peering nodes for customers, and that server / device communication could be replaced with server / peering node / device, with the peering node and the device syncing materialized views incrementally through some kind of binlog mechanism. It looks like this kind of architecture can be done if we have some uniformity between peering nodes and mobile devices.

Beyond SQLite

Even though Dflat now supports Linux, that doesn’t mean it is server software. Advanced features such as live query subscription only work if you go through a single Workspace instance. To make Dflat work on Linux, we actually deepened the SQLite-Dflat relationship by promoting some code from SQLiteDflat to Dflat.

This doesn’t mean we cannot change the backing data engine. Dflat could potentially help server-grade databases such as PostgreSQL in other ways. Its smooth schema evolution would be an interesting way to formalize what companies like Uber have already done.

A server-grade database such as PostgreSQL could also potentially support live query subscription across many Workspace instances with some substantial redesign of Dflat internals.

Exploration Ahead

Supporting Dflat on Linux enables some interesting design-space exploration (for one, I need to change the Workspace.shutdown() semantics to work better with short-lived command-line programs). Whether a mobile-focused database works for a variety of Linux-based workflows is still a question without a definitive answer. I will continue my exploration in the meantime and report back on any substantial findings.

No comment yet
October 9th, 2020

I plan to write a series of articles discussing some simple but not embarrassingly parallel algorithms. These will have practical uses and will most likely target many-core CPUs or CUDA GPUs. Today’s is the first one, discussing a parallel implementation of a CSV file parser.

In the old days, when our spinning-disk speed maxed out at 100MiB/s, we only had two choices: either we didn’t care about file loading time at all, treating it as a cost of doing business, or we had a file format entangled with the underlying memory representation to squeeze the last bit of performance out of data loading.

That world is long gone. My current workstation uses software RAID0 (mdadm) over two 1TB Samsung 970 EVO NVMe drives for data storage. This setup usually gives me around 2GiB/s read / write speed (you can read more about my workstation here).

The CSV file format is firmly in the former of the two categories. The thing people who exchange CSV files care about most, above anything else, is interoperability. Serious people who actually care about speed and efficiency have moved to other formats such as Apache Parquet or Apache Arrow. But CSV files live on. It is still by far the most common format in Kaggle contests.

There exist many CSV file parser implementations. Among them, csv2 and Vince’s CSV Parser are two common ones, not counting standard implementations such as the one that ships with Python.

Most of these implementations shy away from utilizing many cores. It is a reasonable choice: in many likely scenarios, you load many small-ish CSV files, and those can be parallelized at the task level. That was OK until recently, when I had to deal with CSV files many GiBs in size. These files can take many seconds to load, even from tmpfs, which indicates a performance bottleneck in CPU parsing time.

The most obvious way to overcome the CPU parsing bottleneck is to fully utilize the 32 cores of the Threadripper 3970x. This would be embarrassingly simple if we could reliably break the parsing down by rows. Unfortunately, RFC 4180 prevents us from simply using line breaks as row delimiters: a quoted cell can contain line breaks, and those must not be treated as row delimiters.

Paratext first implemented a two-pass approach for parallel CSV parsing, later documented in Speculative Distributed CSV Data Parsing for Big Data Analytics. Besides the two-pass approach, the paper discusses a more sophisticated speculative approach suitable for higher-latency distributed environments.

In the past few days, I implemented a variant of the two-pass approach that can max out the NVMe storage bandwidth. It was an interesting journey, as I hadn’t written a serious parser in C for a very long time.

The CSV File Parsing Problem

A CSV file represents simple tabular data with rows and columns. Thus, parsing a CSV file means dividing a text file into cells that can be uniquely identified by row and column index.

In C++, this can be done in a zero-copy fashion with string_view. In C, every string has to be null-terminated, so you need to either manipulate the original buffer or copy it over. I elected to do the latter.

Memory-Mapped File

To simplify the parser implementation, it is assumed we are given a block of memory that is the content of the CSV file. This can be done in C with:

#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>

FILE* file = fopen("file path", "r");
const int fd = fileno(file);
// Seek to the end to find the file size, then rewind.
fseek(file, 0, SEEK_END);
const size_t file_size = ftell(file);
fseek(file, 0, SEEK_SET);
// Map the whole file read-only into our address space.
void *const data = mmap((caddr_t)0, file_size, PROT_READ, MAP_SHARED, fd, 0);

OpenMP

We are going to use OpenMP’s parallel for-loop to implement the core algorithm. Nowadays, Clang has pretty comprehensive support for OpenMP; nevertheless, we will only use a very trivial part of what OpenMP provides.

Find the Right Line Breaks

To parse a CSV file in parallel, we first need to break it down into chunks. We can divide the file into 1MiB byte sequences as our chunks. Within each chunk, we can start to find the right line breaks.

In RFC 4180, a double-quoted cell can contain a line break, which makes finding the right line breaks harder. At the same time, the RFC defines escaping a double-quote as two double-quotes back-to-back. With this, if we count double-quotes from the beginning of the file, we know a line break is inside a quoted cell if we have seen an odd number of double-quotes so far, and that it begins a new row if we have seen an even number.

We can count double-quotes from the beginning of each chunk. However, because we don’t know whether an odd or even number of double-quotes precede the chunk, we cannot tell whether a given line break starts a new row or sits inside a quoted cell. What we do know is that all line breaks following an odd number of in-chunk double-quotes belong to one class, and all line breaks following an even number belong to another; we just don’t yet know which class marks row starts. So we count the two classes separately.

A code excerpt would look like this:

#define CSV_QUOTE_BR(c, n) \
    do { \
        if (c##n == quote) \
            ++quotes; \
        else if (c##n == '\n') { \
            ++count[quotes & 1]; \
            if (starter[quotes & 1] == -1) \
                starter[quotes & 1] = (int)(p - p_start) + n; \
        } \
    } while (0)
    parallel_for(i, aligned_chunks) {
        const uint64_t* pd = (const uint64_t*)(data + i * chunk_size);
        const char* const p_start = (const char*)pd;
        const uint64_t* const pd_end = pd + chunk_size / sizeof(uint64_t);
        int quotes = 0;
        int starter[2] = {-1, -1};
        int count[2] = {0, 0};
        for (; pd < pd_end; pd++)
        {
            // Load 8-bytes at batch.
            const char* const p = (const char*)pd;
            char c0, c1, c2, c3, c4, c5, c6, c7;
            c0 = p[0], c1 = p[1], c2 = p[2], c3 = p[3], c4 = p[4], c5 = p[5], c6 = p[6], c7 = p[7];
            CSV_QUOTE_BR(c, 0);
            CSV_QUOTE_BR(c, 1);
            CSV_QUOTE_BR(c, 2);
            CSV_QUOTE_BR(c, 3);
            CSV_QUOTE_BR(c, 4);
            CSV_QUOTE_BR(c, 5);
            CSV_QUOTE_BR(c, 6);
            CSV_QUOTE_BR(c, 7);
        }
        crlf[i].even = count[0];
        crlf[i].odd = count[1];
        crlf[i].even_starter = starter[0];
        crlf[i].odd_starter = starter[1];
        crlf[i].quotes = quotes;
    } parallel_endfor

This is our first pass.

Columns and Rows

After the first pass, we can sequentially go through each chunk’s statistics to calculate how many rows and columns are in the given CSV file.

The number of line breaks in the first chunk that follow an even number of double-quotes is the number of rows in the first chunk. Because we know the total number of double-quotes in the first chunk, we also know which class of line breaks in the second chunk marks row starts, and so on for later chunks. The sum of these line breaks is the number of rows.

For the number of columns, we can go through the first row and count the number of column delimiters outside of double-quotes.
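
To make the row-counting step concrete, here is a sketch of the sequential aggregation, written in Swift for brevity (the real implementation is in C inside ccv_cnnp_dataframe_csv.c, and this ignores a possible final row without a trailing line break). The fields mirror the per-chunk counters gathered in the first pass:

// Sum up row starts from per-chunk statistics by tracking quote parity across chunks.
struct ChunkStat {
    var even: Int    // line breaks seen after an even number of in-chunk double-quotes
    var odd: Int     // line breaks seen after an odd number of in-chunk double-quotes
    var quotes: Int  // total double-quotes in the chunk
}

func rowCount(chunks: [ChunkStat]) -> Int {
    var rows = 0
    var quotesSoFar = 0
    for chunk in chunks {
        // If an even number of double-quotes precede this chunk, the "even" class
        // starts new rows; otherwise the "odd" class does.
        rows += quotesSoFar % 2 == 0 ? chunk.even : chunk.odd
        quotesSoFar += chunk.quotes
    }
    return rows
}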

Wiring the Cell Strings

The second pass copies the chunks over, null-terminates each cell, and handles escaped double-quotes. We can piggyback this logic on top of the chunks laid out for the first pass. However, unlike the first pass, the parsing logic doesn’t start at the very beginning of each chunk: it starts at the first row start in that chunk and ends at the first row start in the next chunk.

The second pass turns out to take most of our parsing time, simply because most of the string manipulation and copying happens in this pass.

More Optimizations

Both the first and second passes are unrolled into 8-byte batches rather than per-byte parsing. For the second pass, we do some bit-twiddling to quickly check whether a batch contains delimiters, double-quotes, or line breaks that need to be processed, or whether we can simply copy it over.

// Broadcast the delimiter byte into all 8 byte lanes, then XOR: matching bytes become zero.
const uint64_t delim_mask = (uint64_t)0x0101010101010101 * (uint64_t)delim;
const uint64_t delim_v = v ^ delim_mask;
// Classic "any byte is zero" check: non-zero exactly when some byte of delim_v is 0.
if ((delim_v - (uint64_t)0x0101010101010101) & ((~delim_v) & (uint64_t)0x8080808080808080)) {
    // Has delimiters.
}

You can find more discussions about this kind of bit-twiddling logic here.

Is it Fast?

The complete implementation is available at ccv_cnnp_dataframe_csv.c.

The implementation was compared against csv2, Vince’s CSV Parser and Paratext.

The workstation uses an AMD Threadripper 3970x with 128GiB of memory running at 2666MHz. It has two Samsung 1TB 970 EVO drives in mdadm-based RAID0.

For csv2, I compiled csv2/benchmark/main.cpp with:

g++ -I../include -O3 -std=c++11 -o main main.cpp

For Vince’s CSV Parser, I compiled csv-parser/programs/csv_bench.cpp with:

g++ -I../single_include -O3 -std=c++17 -o csv_bench csv_bench.cpp -lpthread

Paratext hasn’t been actively developed for the past 2 years. I built it after patching paratext/python/paratext/core.py to remove the splitunc method. The simple benchmark Python script looks like this:

import paratext
import sys

dict_frame = paratext.load_raw_csv(sys.argv[1], allow_quoted_newlines=True)

I chose the DOHUI NOH dataset, which contains a 16GiB CSV file with 496,782 rows and 3,213 columns.

First, to test the raw performance, I moved the downloaded file to /tmp, which is mounted as an in-memory tmpfs.

Software              Time
Paratext              12.437s
Vince’s CSV Parser    37.829s
csv2                  19.221s
NNC’s Dataframe CSV   4.093s

The above represents the best you can get if file IO is not a concern. With the aforementioned 970 EVO RAID0, we can run another round of benchmarks against real disk IO. Note that for this round, we need to drop the system file cache with sudo bash -c "echo 3 > /proc/sys/vm/drop_caches" before each run.

Software              Time
Paratext              16.747s
Vince’s CSV Parser    39.6075s
csv2                  21.035s
NNC’s Dataframe CSV   7.895s

The performance of our parser approaches 2000MiB/s, not bad!

Is our Implementation Reasonable with Lower Core Count?

csv2 is single-threaded. With only one thread, would our implementation still be reasonable? I switched parallel_for back to a serial for-loop and ran the experiment on tmpfs again.

Software                            Time
csv2                                19.221s
NNC’s Dataframe CSV (Many-core)     4.093s
NNC’s Dataframe CSV (Single-core)   45.391s

It is about 2x slower than csv2. This is expected because we need to null-terminate strings and copy them to a new buffer.

Finish It Up with Fuzzing

You cannot really ship a parser in C without fuzzing it. Luckily, in the past few years it has become incredibly easy to write a fuzz program in C. This time, I chose LLVM’s libFuzzer and turned on AddressSanitizer along the way.

The fuzz program is very concise:

#include <ccv.h>
#include <nnc/ccv_nnc.h>
#include <nnc/ccv_nnc_easy.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int LLVMFuzzerInitialize(int* argc, char*** argv)
{
    ccv_nnc_init();
    return 0;
}

int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size)
{
    if (size == 0)
        return 0;
    int column_size = 0;
    ccv_cnnp_dataframe_t* dataframe = ccv_cnnp_dataframe_from_csv_new((void*)data, CCV_CNNP_DATAFRAME_CSV_MEMORY, size, ',', '"', 0, &column_size);
    if (dataframe)
    {
        if (column_size > 0) // Iterate through the first column.
        {
            ccv_cnnp_dataframe_iter_t* const iter = ccv_cnnp_dataframe_iter_new(dataframe, COLUMN_ID_LIST(0));
            const int row_count = ccv_cnnp_dataframe_row_count(dataframe);
            int i;
            size_t total = 0;
            for (i = 0; i < row_count; i++)
            {
                void* data = 0;
                ccv_cnnp_dataframe_iter_next(iter, &data, 1, 0);
                total += strlen(data);
            }
            ccv_cnnp_dataframe_iter_free(iter);
        }
        ccv_cnnp_dataframe_free(dataframe);
    }
    return 0;
}

./csv_fuzz -runs=10000000 -max_len=2097152 took about 2 hours to finish. I fixed a few issues before the final successful run.

Closing Words

With many-core systems becoming increasingly common, we should expect more programs to use these cores for work traditionally considered single-core, such as parsing. It doesn’t necessarily need to be hard either! With good OpenMP support and some simple algorithm-side tuning, we can easily take advantage of improved hardware to get more things done.

I am excited to share more of my journey into parallel algorithms on modern GPUs next. Stay tuned!


2020-10-18 Update

There are some more discussions on lobste.rs and news.ycombinator. After these discussions, I benchmarked xsv on this particular CSV file and implemented zero-copy parsing (effectively pushing more processing to iteration time) on my side. The zero-copy implementation turned out trickier than I would have liked, because expanding from a pointer to a pointer + length has memory-usage implications, and the latter can be slower if implemented naively.

Here are some results for parsing from NVMe storage:

Software                          Time
xsv index                         26.613s
NNC’s Dataframe CSV               7.895s
NNC’s Dataframe CSV (Zero-copy)   7.142s

When running single-core, the parallel parser is penalized by the two-pass approach. In particular:

Software                                                    Time
NNC’s Dataframe CSV (Single-core)                           45.391s
NNC’s Dataframe CSV (Zero-copy, Single-core)                39.181s
NNC’s Dataframe CSV (Zero-copy, Single-core, First-pass)    9.274s
NNC’s Dataframe CSV (Zero-copy, Single-core, Second-pass)   32.963s