March 22nd, 2022

Once in a while, we ask questions like: what can we do with more computation? When we asked that question in 2010, CUDA came along, and Jensen Huang gifted everyone under the sun a GTX 980 or a GTX Titan in search of problems beyond graphics that these wonderful computation workhorses could help with.

Then suddenly, we found out that with 100x more computation we could solve not only simulation problems (3D graphics, physics emulation, weather forecasting, etc.) but also perception problems. That started the gold rush of deep learning.

Fast-forward to today: as deep learning makes great advances in perception and understanding, the focus has moved from pure computation to interconnects and memory. We can do really interesting things with the computation available today. What could we do if another 100x more computation were available?

To put it more bluntly, I am not interested in supercomputers in data centers being 100x faster. What if a mobile phone, a laptop, or a Raspberry Pi could carry 100x more computation in a similar envelope? What could we do with that?

To answer that question, we need to turn our eyes from the virtual world back to our physical world. Because dynamics in the physical world are complex, for many years we built machines with ruthless simplifications. We built heavy cranes to balance out the heavy things we were going to lift. Even our most advanced robotic arms often have a heavy base so that the dynamics are affine in the control force. More often than not, humans are in the loop to control these dynamics, as in early airplanes or F1 racing cars.

That's why the machines we build today mostly have pretty straightforward dynamics. Even with microcontrollers, our jet fighters and quadcopters actually have pretty simple dynamics control. Only recently has Boston Dynamics started to build machines with whole-body dynamics in mind, machines with truly sophisticated dynamics.

Now, imagine a world where every machine is much more nimble, more biological, because we don't need to simplify the system dynamics but can leverage them instead. To get there, we need to do much more computation.

To control a dynamical system, we normally need to solve optimization problems with hundreds to thousands of variables. These are not crazy numbers; our computers today can compute the eigenvalues of a matrix with rank in the thousands pretty easily. The trick is to do it fast. Active control applied at 1000Hz is much more stable than control applied at 10Hz. That means doing the numerical integrations and matrix inversions all under 1 millisecond. For this, we need to do much more computation in 1 millisecond than we can today.
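To make the budget concrete, here is a back-of-the-envelope sketch of my own (only the 1000Hz figure comes from the text above):

import Foundation

// At 1000Hz each control step has a budget of 1 / 1000 s = 1 ms.
// State estimation, numerical integration, matrix inversion and the
// optimization solve all have to fit inside that budget.
let rateHz = 1000.0
let budgetMs = 1_000.0 / rateHz  // 1.0 ms per control step

func controlStep() {
  // Placeholder for the real work of a whole-body controller.
}

let start = Date()
controlStep()
let elapsedMs = Date().timeIntervalSince(start) * 1_000.0
print("Used \(elapsedMs) ms of the \(budgetMs) ms budget")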

If we are careful, we will plan our gifted 100x computation more strategically. We will work on anything that reduces computation latency, such as sharing memory between CPUs and GPUs. We will mainstream critical work such as PREEMPT_RT into the Linux kernel. We will reduce the number of ECUs so it is easier to compute whole-body dynamics on one beefier computer. We will make our software / hardware packages easier to use, so they scale from small robot vacuums to the biggest cranes.

During our first 100x leap, we solved graphics. With our next 100x leap, we solved simulation and perception. Now it is time for another 100x leap, to solve dynamics. I am convinced this is the only way to build machines as efficient as their biological counterparts. And these more dynamic, more biological machines will be our answer to a sustainable, greener world where we can build more with less.

March 17th, 2022

There are three things in the past decade that I was not only wrong about once, but kept being wrong about while the situation was evolving. Introspecting on how that happened should serve as a useful guide for the future.

Cryptocurrencies

I was exposed to cryptocurrencies, particularly Bitcoin, pretty early on, somewhere around 2009 on Hacker News. I ran the software at the time (weirdly, on a Windows machine). However, the idea of a currency based on Austrian doctrine sounded absurd to me. If you cannot control the money supply, how do you reward productivity improvements? Not to mention it also sounded otherworldly that this would fly with regulators, given its money-laundering potential.

Fast-forward to today: these concerns have all proven true, but it doesn't matter.

Covid-19

One day in February 2020, while walking with a friend, I told him that I thought Covid-19 was almost over: the CDC hadn't reported any new cases for a couple of weeks, and it all seemed under control. In March 2020, everything became real, but I was optimistic: in the worst case, that is, if we didn't do a damn thing, this thing would probably fade away in a year or two. Given that we were doing something, probably a couple of months at most?

Fast-forward to today: the end is near after two and a half years. But the world is not functioning as it was before.

2022 Russian Invasion of Ukraine

On February 22nd, 2022, after Putin announced his recognition of Donetsk and Luhansk as independent regions, I told a friend: this was probably the end of the 8-year war; now there would be a long political battle for Ukrainians.

Fast-forward only 8 days, and we are on the verge of World War 3.

What went wrong? Why, at the time, was I unable to grasp the significance of these events even when everything was laid out in front of me?

For a very long time (really, since I was twelve), I have loved reading non-fiction. Unsurprisingly, many of these books discuss people, companies, or events from modern times. They helped shape my world view. They also turned out to be very helpful for predicting near-future events.

However, non-fiction presents only a very short slice of modern history. A snapshot is static. From many of these snapshots, a static world view was built. A static world view is great locally: easy to understand, easy to predict. But it is terrible for once-in-a-lifetime events.

Equipped with modern economic theory, it is easy to see why cryptocurrencies cannot work. But if history is any guide, this is simply a different group of bankers trying to issue private money again. A fixed-supply system did work a hundred years ago, prior to WW1. Many of the central banks that control our money supply today were privately owned a century ago. Cryptocurrency folks will try to be the new central banks of our time. They might fail. But at that point, it is less about sound economic theory and more about politics and excitement. The device itself can be modified to fit whatever utilities and theories we see fit.

A pandemic is not a linear event. No, I am not talking about infection modeling. China's successful control of SARS at the beginning of this century was an anomaly, not the rule. Once containment failed, the difference in duration between doing a good job and doing a bad job diminished. The virus will run its course. At that point, vaccines and treatments are wonderful for reducing fatalities, but not for reducing the length of our collective suffering. After a century, we still cannot meaningfully shorten a pandemic.

Russia's invasion is a major violation of the post-WW2 international order. An invasion of a sovereign nation without provocation, not to mention casual threats with nuclear arms, is unthinkable if all your reference points come from the past 30 years.

But Putin's speech, with its nationalistic pride and its blaming of the Soviet Union for Russia's suffering, traces clearly back to Peter the Great.

Many people have now flocked to compare what happened in Ukraine to what happened prior to WW2. I am not so certain. For one, Germany, while suffering from WW1, was a country on the rise. The similarity stops at personal ambition and the nationalism running high in that country.

The only way to model a dynamic world is to read more history, much more than I was reading before, although I am not sure where to start.

What will happen next? It is anyone's guess. I would humbly suggest looking beyond WW2 and trying to find relevant examples before it, maybe from about 200 years ago.

December 14th, 2021

After Elon's announcement of the Tesla Bot [1], many people mocked the silly on-stage presentation and joked about the forever Level-5 Tesla Autopilot. The last time humanoid robots caught my eye was the Honda Asimo, when I was a kid. I saved money to buy the Robosapien toy because it was the only bipedal humanoid robot toy available.

A decade has passed, and the world has given us videos such as Atlas from Boston Dynamics [2]. Where are we with the dream of humanoid robots? Can the Tesla Bot be delivered to happy customers in a reasonable time frame (< 5 years)? And the most crucial question: what commercial value do bipedal robots have now?

By no means an expert, I set out to do some research to understand the scope of the problem and the potential answers.

Why

Besides the obvious satisfaction of seeing a man-made object perform the most human-like behavior, walking, there are many more reasons why we need bipedal robots. It is also hard to tell whether these are practical reasons, or whether the technology is cool and we are retrofitting a reason behind it.

The most compelling answer I have seen so far reasons like this: even in the United States, where building codes have strong preferences for accessibility, many private venues are hard to access on wheels alone. Getting on and off a truck with a lift is a hassle. There may be small steps on the way to your balcony or backyard. More than 10 years after the first robot vacuums were introduced, these small wheeled robots still get trapped between wires and chair legs, and cannot vacuum stairs at all. To get through the "last mile" problem, many people believe legged robots will be the ultimate answer. Among bipedal, quadrupedal, and hexapedal configurations, the bipedal ones seem to require the fewest degrees of freedom (which often means fewer actuators, especially high-powered ones), and are thus likely to be more power-efficient.

Many generic tasks we have today are designed for humans. They often require the height of a human, two hands to operate with, and two legs. A humanoid robot makes sense for these generic tasks.

The devil, however, is in the fact that we have been building machines that handle high-value, repeatable tasks for over a century. The generic tasks that remain either require high adaptability or are low-value in themselves. Through a century of standardization, these generic tasks are often both.

Have we reached a point where most high-value tasks are industrialized, and the remaining ones are long-tail, low-value tasks that cumulatively make commercial sense for a humanoid robot to adapt to? There is no clear-cut answer. To make the matter even more interesting, there are second-order effects. If humanoid robots succeed in volume, it may become less economical to devise specialized automated solutions for lower-end tasks at all. If you need examples, look no further than the Android-based, phone-like devices all around us, from the meeting displays outside your conference rooms, to your smart TVs, to the latest digital toys for your babies. That is the economy of scale at work.

How

While people dreamed up humanoid robots that automate day-to-day chores a century ago, whether we now have all the relevant knowledge to build one is a tricky question to answer. We need to break it down.

1. Do we know how to build bipedals that can self-balance?

It seems we have known how since the Honda Asimo [3]. However, Asimo deployed what is often called active dynamic walking. It requires active control over every joint (i.e. every joint requires an actuator). This is not energy-efficient.

Most demonstrations of bipedal robots fall into this category, including Atlas from Boston Dynamics. There are a few with some passive joints. Digit from Agility Robotics [4] is one of the few commercial products that tries to be energy-efficient by leveraging a passive dynamic walking design.

Do we know how to build bipedals that can self-balance in any circumstance? Most of today's research focuses on this area: how to make bipedals walk / run faster, how to balance well under different weights, and how to balance on uneven / slippery / difficult terrain.

For day-to-day chores, we are not going to encounter much difficult terrain, nor run parkour. From a systems engineering perspective, falling gently would be a better proposition. On the other hand, to be of any practical use, a bipedal needs to balance under an unknown weight distribution. Carrying a water bottle probably won't change the weight distribution much. But what about lifting a sofa?

2. Do we know how to build robot arms?

We've been building robot arms for many decades now. The general trend is towards cheaper and more flexible / collaborative designs. Many of these low-cost products consolidated around successful manufacturers such as KUKA or Universal Robots. New entrants that aimed low-cost, such as Rethink Robotics, had their misses. Acquisitions happened in this space, and KUKA, Rethink Robotics, and Universal Robots are all owned by bigger companies now.

However, high precision, high degree-of-freedom robot arms are still rather expensive. A sub-millimeter precision 6 degree-of-freedom robot arm can cost anywhere from 25k to 100k. A UR3e [5] weighs 11kg, with limited payload capacity. Arms with higher payload capacity weigh much more (> 20kg).

For home use, there are fewer constraints on what we need to lift: repeatable precision can probably be relaxed to the ~0.1mm range rather than the ~0.01mm range. Pressure sensors and pose sensors can be camera-based. We haven't yet seen a robot arm that meets these requirements and costs around ~5k.

3. How does a humanoid robot sense the world?

A humanoid robot needs to sense the environment, make smart decisions when navigating, and respond to some fairly arbitrary requests to be useful.

During the past decade, we've come very far on these fronts. Indoor LIDAR sensors [6] are used broadly in robot vacuums, and the volume, in turn, drove down the cost. Any robot vacuum today can build an accurate floor plan within its first run.

Besides LIDAR, cameras have come a long way to be high-resolution and useful in many more settings. They can help guide the last-centimeter grasp [7], sense pressure, or detect materials [8]. The ubiquity of cameras in our technology stack makes them exceptionally cheap. They serve as the basis for many different sensory tasks.

4. How does a humanoid robot operate?

While we have much more knowledge about navigating indoor environments [9] to accomplish tasks, we are not as far along on how the human-computer interface is going to work when operating a humanoid robot. Much of the work surrounding this is based on imitation, i.e. a human performs some task and the robot tries to do the same. No matter how good the imitation is, this kind of interaction is fragile because we cannot immediately grasp how well the imitation will generalize.

If we show a humanoid robot how to fetch a cup and pour water into it, can we be assured that it can do the same with a mug? A plastic cup? A pitcher? What about pouring coke? Coffee? Iced tea? The common knowledge required for such generalization is vast. And if it cannot generalize, it is like teaching a toddler how to walk: it's going to be frustrating.

At the other end of the spectrum, we've come a long way in giving computers an objective and letting them figure out how to reach it. We don't need to tell our rovers on Mars how to drive; we only need to point in a direction, and they can get there themselves. We also don't need to tell our robot arms how to hold a cup; we only need to tell them to lift it without flipping it.

We may be able to compose these discrete yet autonomous actions to accomplish useful tasks. Humanoid robots, particularly those for educational purposes, have had graphical programming interfaces for a long time [10]. However, I cannot help but feel these are much like touchscreens before the iPhone: they exist and work, but are in no way a superior way to interact, nor a productivity booster for accomplishing things.

Where

If someone wants to invest early in this space, where should they start?

SoftBank Robotics [11] had been acquiring companies in this space until recently. Their most prominent products are Pepper and Aldebaran's NAO robot. However, they haven't had any new releases for some years. The sale of Boston Dynamics doesn't send a positive signal about their continued investment in this space.

Agility Robotics [12] is a recent startup focused on efficient bipedal humanoid robots. Their Digit robots are impressive and have been shipping to other companies for experimentation for quite some time. Their earlier Cassie bipedal robots are more open: you can download their models and experiment with MuJoCo [13] today. The Digit bipedal robots focus on last-mile (or last-100-feet?) package delivery. This puts them in direct competition with autonomous vehicles and quadcopters. The pitch is versatility against autonomous vehicles on difficult terrain (lawns, steps, stairs), and efficiency against quadcopters (heavier packages).

Boston Dynamics hasn't been serious about practical humanoid robots so far. On the other hand, Spot Mini has been shipping worldwide for a while now. The difficulty of a practical humanoid robot from Boston Dynamics comes down to technical direction. Spot Mini uses electric actuators, which are easy to maintain and replace. They do require gearboxes, which can introduce other failure points and latencies, but they can be modular and thus serviceable. Atlas uses hydraulic actuators. While these provide high force with low latency, they are expensive to maintain, and breakage often means messy oil leaks [14]. It would be interesting to see whether they have any electric-actuator-based variant in the works.

Then there are the Chinese companies, which are excellent at reducing cost. UBTech's Alpha 1E robot [15], a direct competitor to the NAO robot in the education space, is 1/18th the cost. HiWonder's TonyPi / TonyBot [16], a much less polished product, is at 1/18th the cost too. It features Raspberry Pi / Arduino compatibility and is thus more friendly to tinkerers.

That said, both companies are far from a practical human-sized robot. While UBTech has been touring its human-sized robot Walker X [17] for a few years now, there is no shipping date, and it looks like the Honda Asimo from 15 years ago. The company doesn't seem to have the software / hardware expertise to ship such a highly integrated product. HiWonder gives no indication that it is interested in human-sized robots. While cheap, neither of them seems to be on the right technological path to deliver highly integrated, human-sized humanoids.

Unitree, a robotics company focused on legged locomotion, successfully shipped the quadrupedal Unitree Go1 [18] at 1/25th the cost of Spot Mini. Their previous models have seen some success in the entertainment business. While it looks mostly like a Mini Cheetah derivative [19], it is the first accessible commercial product on the market. It remains to be seen whether the Go1 can find its way into homes, and if so, whether that can help the company fund other quests in the home sector. The company has no official plan to enter the humanoid robot market, and even if it does, the technology requirements would look quite different from a Mini Cheetah variant.

Roborock's robot vacuums [20] have been quietly gathering home data worldwide for some time now. On the software side and the software / hardware integration side, they are quite advanced. Their robot vacuums are generally considered smart at navigation. Their well-rounded robots have been gaining market share around the world against iRobot while raising prices steadily. It has been remarkable to watch them do both with better products. They have no stated plan to ship legged robots, let alone a humanoid one. But their hardware is the most widely accessible among the companies above, and their software has been tested in the wild. It seems they would have quite a bit of synergy with the humanoid robot space.

Finally, there is Tesla. The company hasn't shipped any legged robots, nor any in-home robotics systems. The Tesla Bot looks a lot like the Honda Asimo in technological direction. However, there are no technical details on exactly how. The best guess is that these details are still in flux. That said, Tesla did stellar work on system integration when shipping its vehicles from zero to one. As discussed above, we have these disparate technologies that somewhat work; integrating them well into one coherent product, knowing where to cut features and where to retain maximum utility, is a challenge waiting for an intelligent team to figure out from zero to one. I won't be so quick to claim that Tesla cannot do this again.

This is an incomplete research note from someone who has done nothing significant in this space. You should do your own research to validate the claims I made here. Any insights from insiders would be greatly appreciated. Because of the nature of this research note, unlike other essays I post here, I am going to provide references.

[1] https://www.youtube.com/watch?v=HUP6Z5voiS8

[2] https://www.youtube.com/watch?v=tF4DML7FIWk

[3] https://asimo.honda.com/downloads/pdf/asimo-technical-information.pdf

[4] https://www.youtube.com/watch?v=e0AhxwAKL7s

[5] https://www.universal-robots.com/products/ur3-robot/

[6] https://www.slamtec.com/en/Lidar/A3

[7] https://bair.berkeley.edu/blog/2018/11/30/visual-rl/

[8] https://ai.facebook.com/blog/reskin-a-versatile-replaceable-low-cost-skin-for-ai-research-on-tactile-perception/

[9] https://github.com/UZ-SLAMLab/ORB_SLAM3

[10] http://doc.aldebaran.com/1-14/getting_started/helloworld_choregraphe.html

[11] https://www.softbankrobotics.com/

[12] https://www.agilityrobotics.com/

[13] https://github.com/osudrl/cassie-mujoco-sim/tree/mujoco200

[14] https://www.youtube.com/watch?v=EezdinoG4mk

[15] https://www.ubtrobot.com/collections/premium-robots/products/alpha-1e?ls=en

[16] https://www.hiwonder.hk/products/tonypi-hiwonder-ai-intelligent-visual-humanoid-robot-powered-by-raspberry-pi-4b-4gb, https://www.hiwonder.hk/products/tonybot-hiwonder-humanoid-robot-educational-programming-kit-arduino

[17] https://www.ubtrobot.com/collections/innovation-at-ubtech?ls=en

[18] https://www.unitree.com/products/go1

[19] https://github.com/mit-biomimetics/Cheetah-Software

[20] https://us.roborock.com/pages/robot-vacuum-cleaner

November 8th, 2021

My first time skiing was in my 20s. My software engineering job was already paying me well at the time. Why not just hire a private trainer for my ski lessons, I thought. After all, more money spent and more focused time from the trainer ought to produce better outcomes, right? I definitely enjoyed the lessons for the first two days, but I could not ski. The trainer was fun, and we did some practice runs on the beginner's slope. But no, I could not even do the pizza stop well.

The next year, I switched to a group lesson. Within the first hour, I could ski from top to bottom of the beginner's slope without crashing into anyone or falling once. I started to enjoy the green runs after two hours. The next day, I enjoyed the slopes the whole day without any more lessons.

In theory, if I paid enough, I could likely find a good trainer who could teach me to ski within an hour privately. In practice, it doesn't happen. It seems that paying money can monopolize someone's time, but it doesn't guarantee a better outcome.

It is not only ski lessons. You can observe this across many service industries: private 1-on-1 tutoring vs. public (or private) schooling; family doctors vs. public (or private) hospitals; private nurses vs. assisted living; nannies vs. daycare. There are many more factors in each of these industries for why money cannot buy performance (Gresham's law, etc.). But you can see the theme.

The driving force behind the theme is the market. In these industries, a good practitioner always makes more money serving many people than serving one. When the money is pooled together, it is also cheaper for each client, and in return more people can afford it. When the market is large enough, regulations kick in, which in many cases guarantees a basic quality of service.

But if I am rich, can I pay more than the aggregate to monopolize better services? Potentially, yes. But the market is not an abstract entity with unlimited depth that automatically facilitates transactions given a supply and demand curve. People need to be in the loop, either to standardize the market for low-touch transactions, or to work through high-touch transactions directly.

Unfortunately, for these high-touch transactions, the market is so minuscule that facilitating them exclusively cannot be a full-time job.

That still leaves the door open to paying a lot more money: an amount that not only exceeds the aggregate the best practitioners can get in the market, but is also enough for a good broker to make a living.

At the end of the day, people are complicated. Serving one master is always a risky business. There is limited capacity, meaning a limited upward trajectory. Someone who makes a reasonable amount of money from one rich person has no guarantee of making more next year with the same person. Less capacity also means fewer experiments, fewer exchanges of new ideas, and fewer ways to improve. You cannot fight the economy of scale any other way.

All in all, this brings us to two questions:

  1. Are there any service industries that haven't yet enjoyed the aforementioned economy of scale? We've witnessed the rapid industrialization of home cooking during Covid. Are there more?
  2. We've seen the amazing feats of the internet and software in lowering broker fees in standardized markets (stocks, commodities, and housing). Can this be applied to non-standardized / one-off transactions? Can connected software help with price discovery and performance evaluation in any meaningful way? Airbnb tried in one specific and highly heterogeneous market. It is probably the most successful story we can tell so far, but many still question its performance evaluation metrics.

I don't know the answer to either. But they will be interesting to ponder.

August 25th, 2021

This series of essays explores how cryptocurrencies, despite the frauds and scams in the system, could possibly go beyond the speculative asset category. Rather than investigating the underlying utility value of a cryptocurrency system, this series focuses on power interplays between people, and how those can trace a narrow scenario in which cryptocurrencies become mainstream.

When reading materials on monetary policy and financial history, you find that the inflation-enabling property of currency is universally praised by academics. It gives the government the ability to act properly as the lender of last resort. It promotes consumption by suppressing hoarding behavior. It is generally considered pro-innovation because money needs to seek investments with higher returns / higher risk than itself, which is easier if the baseline is low.

Thus, academics project their liking for inflation-enabling currency onto the cryptocurrency market and predict that cryptocurrencies won't be a worthy money replacement because of their limited supply. Even for proof-of-work cryptocurrencies that don't have a 21-million cap, it is pretty obvious that issuance is front-loaded. Many of their mining rates are not matched to economic growth. If you factor in how many wallets have lost their private keys, these are strictly supply-limited.

What many academics didn't factor in is how higher-level consensus is achieved. In the cryptocurrency world, per-transaction consensus can be achieved through algorithms implemented in C++, Rust, or Go. Higher-level consensus, decisions such as which version of the software to use, what new features need to be included in a certain release, and which new features would be interesting to experiment with, is achieved, to this day, through human negotiations such as EIPs. There can be added motivation to move miners over to newer software through a difficulty bomb, but nothing prevents larger pools from colluding and forking the software.

The supposed supply-limited nature of cryptocurrencies is not an immovable ancient law set in stone. It is, at its core, a brilliant if not deliberate marketing ploy to attract the unsophisticated. Market participants know well that cryptocurrency cannot be the gold standard of our time if there is no circulation. If most participants are HODLers, circulation will be limited. But no matter, this is only the first stage.

For a sense of how ridiculous the marketing gets, look no further than . Brought to you by people with a vested interest in cryptocurrencies, it tries to tie every problem we have in the modern world to the decoupling from the gold standard.

As far as the first stage goes, the cryptocurrency market needs to solve a paradoxical challenge: while remaining low in circulation, it also needs to get as many people as possible to be HODLers. This helps gain wide political support in democracies. More importantly, if cryptocurrencies truly want to be the second coming of commodity money, they need the necessary breadth once the circulation knob is turned on.

Inflation

Before we discuss the "circulation knob": current conventional consensus still treats cryptocurrencies as speculative assets. For the plot to be "the better commodity money", cryptocurrencies have yet to prove themselves in a high-inflation world. It is actually not obvious how, as speculative assets, cryptocurrencies can compete with harder things such as commodities. It is even more unclear how they would fare against safe havens once inflation triggers monetary tightening.

With a brilliant marketing ploy, low circulation, and capital controls, it seems possible to maintain the herd psychology long enough that the hardness of cryptocurrencies becomes self-reinforcing (at a certain point, you can point to a CPI / cryptocurrency price graph and claim that it is "inflation-proof").

“Circulation Knob”

To be "the money" for everyday use, cryptocurrencies need to turn on their circulation knob at a certain point. The circulation knob would be an implementation that allows cryptocurrencies to have better-adjusted supply mechanisms; in a healthy economy, this means no longer being forced into deflation. It is hard to imagine, as it stands, that Central Banks would allow broader circulation of unregulated currencies in their respective economies.

However, this is not impossible. In modern democracies, no matter how many ivory-towered academics sit in the Central Banks and how much they collectively dislike cryptocurrencies, they need to appease the political establishment. At the end of the day, Central Banks only care about the impact on their monetary policy tools. This can be managed by nationalizing mining operations, issuing additional cryptocurrencies to Central Banks, or adding new money-supply mechanisms that Central Banks can use to play the lender of last resort if needed. These new tools in the cryptocurrency world would be enacted through the higher-level consensus mentioned above. It could happen because it would be another validation of the viability of cryptocurrencies at large.

The pursuit of hardness, in addition to the competition between different cryptocurrencies, would be the more challenging part of turning on the circulation knob. Each will hold out on the circulation knob as long as possible to prove its hardness to the others. This will be a complex play between Central Banks, different cryptocurrencies, and their respective philosophies.

Notes

Path Dependence

Monetary history is full of accidental, path-dependent coincidences. On one hand, the prevalence of cryptocurrencies in their current form would deter any Central Bank from issuing its own alternative. It is hard to balance the convertibility of their alternative programmable money in relation to others. Getting it wrong, especially on conversion rates and programmability, could be disastrous to their existing monetary system.

On the other hand, the cryptocurrency world first got noticed on the back of Austrian doctrine. Departing from its limited-supply nature could be dead on arrival for the community. A higher-level consensus that satisfies the Central Banks may never be reached. It is very difficult to imagine how eliminating all existing monetary policy tools, effectively abolishing or fully privatizing Central Banks (cryptocurrency players with large sums would become the new unregulated "Central Banks"), would be a good idea. It is not impossible, given that the United States has done it several times, and already has a semi-private Federal Reserve. The implied transfer of power, were that to happen, would be worth another essay to explore.

What can go wrong?

A lot. This essay simply plotted a narrow path, acting in stages, in which it is imaginable that cryptocurrencies could become a form of money. It requires careful changes to higher-level consensus several times. One early test will be to see how the current slew of cryptocurrencies fares with post-Covid inflation to prove their hardness. So far, the results are mixed.

August 20th, 2021

One thing the academics failed to appreciate in the past decade is the power of belief. In democracies like the United States, if the power of belief can be maintained long enough, a supporting structure will emerge from both the political and economic sides. In every aspect, cryptocurrencies are moving in the right direction to establish such a supporting structure, despite the scams and frauds associated with them.

News of frauds (Poly Network, IRON) and crackdowns at this point only strengthens the belief in the resilience of the system. People will point to these events and claim that the cryptocurrency systems are robust. The centralized exchanges, the centralized protocol maintainers / miners, and the intentionally light regulation (or no regulation at all) seem to form a stable triangle that maintains the cryptocurrency systems.

All of this makes it very interesting to contemplate what exactly the endgame for cryptocurrency is.

Many critics claim that the biggest risks for cryptocurrencies come down to the slew of stablecoins that effectively set the price for other cryptocurrency assets. They suggest that stablecoins such as USDT or USDC are backed by fractional reserves at most. These reserves are so small (less than 4% for USDT) that the critics suspect they will face a liquidation crisis if a bank-run scenario happens.

Unfortunately, these critics failed to read the fine print of said stablecoins. While you can exchange either USDT or USDC against hundreds of alternative coins on decentralized exchanges, you cannot do so for the USDT / USD or USDC / USD pairs. It is the centralized exchanges that control how you can redeem these stablecoins into USD. Like countries with capital controls, the fixed exchange rate can be maintained indefinitely.

The endgame for cryptocurrency won’t be the crash of the stablecoins.

While the stablecoins cannot crash in the absence of regulation, the power of the current slew of stablecoins is tied to the power of the USD. It is possible to have a stablecoin tied to a basket of currencies, but that also means said stablecoin would face not only U.S. regulations but also regulations from the EU, China, or Japan, depending on what is in the basket.

Paradoxically, stablecoins won't crash, but could land safely even as the USD slides from its global reserve currency position. However, that won't be good news for cryptocurrencies at large.

The academics largely got cryptocurrencies wrong in the past decade because they failed to factor in the waning power of the United States. The U.S. has lost both the willingness and the political power to maintain absolute control globally. Its regulators moved too slowly to defend the USD's global interests. Meanwhile, the internet moved from fearing the creation of any digital USD alternative (e-gold) to a world where any individual can do an ICO. Now there is no willingness to regulate cryptocurrencies as long as the market is high and people are making money.

The waning power of the United States is both a blessing and a curse for cryptocurrencies.

At the moment, cryptocurrencies need a powerful democracy that can maintain global hegemony. Without global hegemony, cryptocurrencies will face real issues: intermittent internet connectivity, difficulty converging on consensus, and widely fluctuating exchange rates against commodities. Under any other form of government, powerful authoritarian or dictatorial regimes simply cannot tolerate the loss of capital controls.

There could be alternative scenarios in which cryptocurrencies succeed without the United States. Central operators would need to grow into powerful transnational entities that hold controlling shares in commodities and other life essentials globally, potentially with their own enforcement militias to maintain such hegemony. This scenario has been explored extensively in anarchist writing since the 1980s (or earlier). The problem, of course, is that such a transition of power takes time.

To believe in cryptocurrencies, you have to simultaneously believe that U.S. power is waning and that it can maintain global hegemony for an extended time. However paradoxical this proposition is, it increasingly seems a likely scenario for the next 15 years. Will this be the Goldilocks situation for cryptocurrencies? I don't really know; readers, you have to make up your own minds.

Notes

What about KYC (“Know Your Customer”) rules?

If anything, this limits the number of centralized exchanges that can offer stablecoin / USD pairs. A limited number of centralized exchanges also means it is easier to collude and fix pricing.

What about increasing regulations / self-regulations on USD?

The heavy regulation of USD (such as the KYC rules mentioned above) and recent news about self-regulation around USD (the Mastercard / OnlyFans speculation) suggest a political willingness to regulate. However, USD is an easier target for politicians because no one other than the United States itself profits from USD. Piling regulations on USD to "save our children" is an easy win in democracies. Piling regulations on cryptocurrencies while regular people are making money on them (while the market is high) is political suicide.

The imbalance between regulation of USD and the lack of regulation of stablecoins only reinforces the centralized exchanges' ability to impose capital controls. As long as these regulations exist in one form or another, it is much easier to accept the capital controls imposed by exchanges, because you cannot do X with USD but can with stablecoins.

What about utility values of cryptocurrencies?

While there is utility value in cryptocurrencies, I am trying to explore an alternative framework that assesses power structures rather than the underlying utility value provided by a technology. This essay treats the transition to cryptocurrencies like any other transition: yes, there is utility value to motivate the transition in the first place. But ultimately, it is about people, especially people in power, making decisions about action and inaction.

What can go wrong?

A lot. People can do stupid things. No amount of rosy scenario planning can prevent that.

March 19th, 2021

I am a mediocre marathon runner who hasn't done any long distances since Covid-19. But out of serendipity today, I was thinking about marathons and the interesting choices you make during them.

A flat road marathon with support stations every half mile is … bland. I have probably done that only twice, once in Maui and once in Denver. The only interesting bits of those two were the hot temperature and the mile-high altitude. The more interesting cases are trail runs with support stations every few miles. You have a lot of decisions to make along the way.

Elite runners can probably bulldoze through the trails. But for everyday runners like me, the first decision to make on the trail is: what's going to be my pace for this segment? This is a function of your desired finish time, the elevation gain / loss for the segment, and the weather conditions. On a good day, at the beginning of the race, with massive elevation loss, you probably want to go as fast as you can; this is economical from an energy expenditure perspective. On the other hand, if there are a lot of ups and downs in the beginning, you probably want to keep a steady pace so that your overall finish time stays on track.

The decision is a bit different towards the end. If there are a lot of ups and downs towards the end, you probably want to take it slow on the uphills and speed up on the downhills. This is more efficient energy-wise, and it can still keep you somewhat on track.

Besides questions of pace, if there is a reasonable number of support stations, you need to decide where to stop for refills. If there are only 3 support stations, 6 to 7 miles apart, you probably want to stop at every one, since your 1L water bag may run out at around the 10-mile range. It is more interesting when there are not enough support stations to go without carrying your own water bag, but enough of them that you can decide to skip some.

This decision can be more interesting than it first appears. If you are breezing through at a 7min/mi pace, you probably don't want to break it and stop for a refill; it is simply not economical. However, if you are running low on water, it is more difficult. You never, ever want to go without water for a mile or so; it is soul-crushingly terrible. In that case, you may want to ration your water intake until the next support station, or stop for a big refill now and try to make up the time by skipping the next two.

For an everyday runner, at mile 12 or 13 you probably want to consider taking some food: energy bars, jellies, gels, watermelon chunks, bananas, or M&Ms, all kinds. It is a great way to prepare for the wall. To be honest, I am not great at picking stuff out of my bag while running, so getting food out requires a brief stop. It is not obvious when the most economical time to do so is. Uphill? Probably; it is definitely easier to stop and restart on an uphill. But you also want to get food into your body before glycogen depletion. Gels are probably the fastest, taking about 5 minutes to kick in. Other solids can take longer. A mile or so before depletion is probably required. Depending on these factors, there is a range in which we can take food more optimally during the race.

These decisions can be even more interesting in a self-supported multi-day event. Self-supported here means you carry your own food for all the days (excluding water). Multi-day means a marathon a day. The decisions are much more interesting because you are not just carrying a pack of energy gels. There are proper solids: pasta, ramen, instant rice, beef jerky for your after-race celebrations. Simply put, energy gels for 6 days is not an option psychologically.

In this situation, the decisions about when to take what need to consider not only the optimality of the current race, but also the race the next day. If the next day will be harder, I probably need to ration my energy gels today and have a bigger breakfast before today's race to balance it out. It also cannot be completely rational. You want to go with food that has a high energy-to-weight ratio so that the backpack is lighter on the tougher day. However, you want to keep some celebratory food (beef jerky, anyone?) so that, psychologically, there is strong motivation to finish sooner. It quickly becomes an interesting game between pace, stop-and-restart points, energy expenditure / refills, and psychological reward functions. Even after several rounds of trial and error, I simply don't know a universal optimal strategy for this game yet.

People say that life is a marathon, not a sprint. Well, I guess. As an everyday person, I would love to participate in a game with many more interleaving decisions and consequences. That is certainly much more interesting than the genetic lottery, isn't it?

February 14th, 2021

A while ago, the 人人影视 (YYeTs) fansub group was investigated for copyright infringement, which sparked a big discussion in China about intellectual property protection. In certain periods and in specific fields, intellectual property has real, positive value. But the intellectual property system itself is not a truth of nature; it is merely a human invention.

The essence of intellectual property is to take knowledge, something that belongs to the public commons of humanity, and, through the machinery of the state, grant it full private property rights for a limited time, privatizing it. The four properties that define private property, namely the right to use, the right to transfer, the right to exclude, and the right to destroy, all apply to intellectual property.

One thing worth noting is that privatizing knowledge in this way removes it from humanity's common pool of resources. If this went on forever, the commons we all share would shrink and disappear. That is why intellectual property rights all come with time limits: patents are valid for 20 years, and copyright expires 50 years after the author's death (70 years in the United States).

During the Industrial Revolution, when intellectual property took shape, it was a progressive force. At the same time, however, its impact on re-creation has been devastating. Before intellectual property existed, much literary creation, from the familiar Water Margin and Jin Ping Mei to Records of the Three Kingdoms and Romance of the Three Kingdoms, was itself re-creation built on folk tales and earlier works. Such re-creation (call it fan fiction) is relatively rare in modern times because of copyright. This is, at its core, a trade-off: we believe that by encouraging intellectual property and letting creators profit, we will get more new works, enough to offset the loss from relatively fewer re-creations.

On the other hand, intellectual property, or more narrowly the patent system, had another purpose at its inception: to encourage disclosure. The patent system requires filing an application and making it public. This guarantees that after 20 years the knowledge is not lost, but returns to the human commons.

That is also why, even in modern times, many high-tech innovations, such as engine manufacturing or materials manufacturing, are protected as trade secrets instead. It is why you will never see a patent application for the hydrogen bomb.

As a result, the fields that rely most on intellectual property to protect their interests are those that are easy to reverse-engineer and have marginal costs approaching zero, but carry high research and development costs: cultural products, software, and pharmaceuticals.

Having said all that, does intellectual property really let inventors benefit, and thereby encourage invention and innovation?

The widely recognized inventor of the cathode-ray-tube television is Philo Farnsworth.

Farnsworth was born into a farming family in 1906. At 15 he was already dreaming of a device that could transmit sound and images together, and by 21 he had built the first prototype of a cathode-ray-tube television. Before that, televisions were mostly based on mechanical scanning (Nipkow disks). The cathode-ray-tube television was more reliable and produced a sharper picture. In fact, our displays were still based on cathode-ray tubes until the beginning of the 21st century, which shows how effective the invention was.

Farnsworth moved to California in 1926 and filed a patent for the cathode-ray-tube television. The patent was granted in 1930, and Farnsworth planned to manufacture televisions and make a fortune. Just as he was about to start, RCA claimed that their employee Vladimir Zworykin had invented television before Farnsworth did, and asked the patent office to rule his patent invalid.

In 1931, RCA extended an olive branch: it offered $100,000 for a license, along with a job for Farnsworth at the company. Farnsworth refused.

To keep a complete patent pool for television, Farnsworth went into debt fighting RCA in court. By the end of 1939 he had finally won all the lawsuits and signed a $1 million licensing agreement with RCA. But by then the key television patents were close to expiring, and he was buried in debt. To make matters worse, World War II was starting, and the United States suspended emerging entertainment services such as television to prepare for the war.

Farnsworth's example makes it hard to argue that patent fights favor the inventor. With deep enough pockets, one can simply use legal procedures to drain an inventor's time and energy. This partly explains why, today, it is usually large, well-capitalized companies that hold huge numbers of patents and other intellectual property.

In the fields that use intellectual property protection the most, cultural products, software, and pharmaceuticals, does the protection actually work?

In the cultural sphere, music, film, and television are the best-known fields that profit from intellectual property, and they are the fields the 人人影视 group was accused of infringing. Music in particular, with its low barrier to creation and ease of copying, was the first to be hit by the internet. On the surface, that hit looks like a setback: in the United States alone, the recording industry took in $14 billion in 2000 versus $11 billion in 2019, and the latter figure is only that high thanks to the growth of streaming revenue over the past few years.

Yet creators' incomes have not fallen. Comparing two young top-tier singers, Taylor Swift was worth $65 million in 2020, while Britney Spears was worth $38 million in 2000. As the recording industry declines, more creators prefer to monetize through experiences that cannot be copied, such as live concerts. Hence Radiohead's piece of performance art in 2007: putting the new album In Rainbows online for direct download.

Meanwhile, collaborative internet creation models like SCP are experimenting with sharing and cooperation under a more open attitude toward copyright.

In software, nearly every practitioner agrees that software patents do the industry more harm than good. Because patents are transferable, software in particular has spawned companies like Intellectual Ventures, idealistic in name but in practice in the business of predatory patent trolling. Such companies invent nothing themselves; they buy up vaguely and broadly worded patents and make money by suing over them, for example suing Facebook for infringing a patent on "a method for letting different people connect over a network", or suing Google for infringing a patent on "a method for requesting resources from a server". This abuse of patent litigation can only hurt the people who actually turn inventions into products that benefit humanity.

Precisely because software patents have so many problems, the industry invented the patent pool: companies place their patents on a given technology into a shared pool, anyone who pays the pool gets its protection, and if someone sues over a technology in the pool, everyone chips in to fight the lawsuit. The H.265 codec and the so-called 4G / 5G technologies are, at their core, patent pools of this kind.

In pharmaceuticals, intellectual property protection has always been controversial because of the positive externalities of drugs. The film Dying to Survive (《我不是药神》) centers on the Indian government's refusal to recognize drug patents so that its people could afford cheap medicine. Ironically, it is precisely because India mass-produced patent-protected drugs that it accumulated so much manufacturing experience and became one of the world's main drug producers. Even in the United States, the price of the EpiPen (a portable auto-injector with medication that allergy sufferers use in emergencies) quadrupled over the past decade, prompting a broad debate about whether drug patents should come with price limits.

From another angle, modern drug development is no longer a single pipeline in which a big company carries a drug from initial research through Phase III. More often, early-stage research is funded by the state and carried out in universities and labs. When a drug looks promising, the professor typically founds a small company to commercialize the research. After Phase I and Phase II trials, the small company is packaged up and sold to a big pharmaceutical company like Lilly, which then runs the time-consuming and costly Phase III trials and takes the drug to market. This model has in turn sparked a big debate over why research is done with taxpayer money while the profits go to big pharma.

A direct example of patents shaping technology is E-Ink. E-Ink was once a next-generation display technology on par with OLED, but the E-Ink company monopolized the core patents. For more than a decade after its invention, E-Ink would not license other companies to manufacture it, so refresh rates, sizes, prices, and new applications saw little progress. Now every company is waiting for the core E-Ink patents to expire in the 2020s and for the explosion of new products that will follow.

This is not wishful thinking. Most LED lighting patents expired around 2010, and, coincidentally or not, it was after 2010 that the LED industry boomed. It was also after 2010 that many households started installing LED strips, indirect LED lighting, and color-changing LED bulbs.

It is precisely because knowledge can be copied that our lives keep getting better. In the internet era, acquiring and copying knowledge has become even cheaper. Open-source software is a vivid example of knowledge sharing in practice. As a kid I did not understand the relationship between intellectual property and open source either: at 14 I wrote an article in 电脑报 scolding people for repackaging the open-source FileZilla and selling it in China, calling it disrespectful of intellectual property. Today I believe more strongly that anything that helps spread knowledge is good. If a repackaged and translated version spreads better by charging a fee than the free version does, why not?

SQLite, a widely used embedded database, takes the spirit of knowledge sharing to its extreme. Unlike most open-source software, which requires the license terms to be included when it is repackaged ("Permission is hereby granted … without restriction, including without limitation …"), SQLite, like every invention made before intellectual property existed, simply returns all rights to the public domain.

In the cultural sphere, Project Gutenberg is devoted to making every work whose copyright has expired freely available to download and read over the internet. At the start of every year, Project Gutenberg happily publishes a post introducing which good works have just entered the public domain for everyone to download and read for free.

Projects like these are themselves a critique of the intellectual property system. Ask yourself: Disney has held the rights to a character like Mickey Mouse for nearly a century, and as long as the United States exists it will keep holding them. Does that do the human commons any good at all? If Wu Cheng'en had held the rights to Sun Wukong into the modern era, would we ever have gotten A Chinese Odyssey (《大话西游》)?

With that in mind, let us ask again: has intellectual property achieved its original goals of encouraging market competition and more invention?

In 1774, Britain was the world's textile manufacturing power, and to protect its textile firms it passed laws banning the export of textile technology. Samuel Slater memorized how the textile machines were built and operated, arrived in New York in 1789, and rebuilt the machines from memory. Today Slater is widely regarded as the father of American industrial manufacturing.

Of course, once the United States gained the technological lead, it used international treaties and institutions to write intellectual property protections into the laws of participating countries, protecting its own advantage. One of the most widely criticized aspects of the TPP (Trans-Pacific Partnership) was exactly this: it wrote America's overly harsh and lengthy intellectual property rules into its partners' laws.

Intellectual property grants every attribute of property to knowledge that can be copied and spread, walling it off from the human commons for long stretches of time. In today's era of tight collaboration and fast-iterating knowledge, whether such complete intellectual property rights promote invention or merely protect the interests of capital is very much in doubt. We should not accept the concept of intellectual property wholesale without thinking: encourage what is good about it, and criticize what is bad. Intellectual property should be a dynamic concept. As Liu Cixin said, think more, and think ahead of time; that is always right.

February 7th, 2021

For many machine learning practitioners, the training loop is such a universally agreed-upon concept that numerous documentation pages and conference papers use the term without any reference. It is a helpful concept for beginners to get familiar with before diving into the rabbit holes of the many deep learning tools.

The field of machine learning has become vast, diverse, and ever more open in the past 10 years. We have all kinds of open-source software, from XGBoost, LightGBM, and Theano to TensorFlow, Keras, and PyTorch, to simplify various machine learning tasks. We have supervised learning, unsupervised learning, generative networks, and reinforcement learning: choose your own pill. It can be dazzling for beginners. It doesn't help that much of the popular software we use hides many details from beginners behind abstractions like classes, functions, and callbacks.

But fear not: for machine learning beginners, there is one universal template. Once you understand it, it is straightforward to fit all existing training programs into this template and start digging into how they implement the details. I call it the universal training loop.

An Extremely Simplified Interface

Many high-level frameworks provide an extremely simplified interface that looks like this:

func train(training_dataset: Dataset, validation_dataset: Dataset) -> Classifier

When you do:

let trainedClassifier = Classifier.train(training_dataset: training_dataset, validation_dataset: validation_dataset)

You somehow get the trained classifier from that interface. This is what you would find in FastAI’s Quick Start page, or Apple’s Create ML framework.

However, this doesn't tell you much about what it does. It also doesn't help that some of these frameworks provide callbacks or hooks into the training loop at various stages. The natural question is: what are the stages?

A Supervised Learning Training Loop

It is actually not hard to imagine what a supervised learning training loop would look like underneath the extremely simplified interface. It may look like this:

var classifier = Classifier()
for example in training_dataset {
  classifier.fit(input: example.data, target: example.target)
}

It goes through all examples in the training dataset and tries to fit them one by one.

For stochastic gradient descent methods, we make a few modifications so that the training process is more stable and less sensitive to input ordering:

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  classifier.fit(inputs: minibatch.data, targets: minibatch.targets)
}

We randomize the order of the inputs (shuffled()), group them into mini-batches, and pass them to the classifier, assuming the classifier can operate on a group of examples directly.

For many different types of neural networks, shuffled mini-batches are an essential part of your training loop for both efficiency and stability reasons.
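As an aside, grouped(by:) is not something the Swift standard library provides; it stands in for whatever mini-batching helper your framework has. A minimal sketch of such a helper, assuming it simply chunks an already-shuffled array, could look like this:

extension Array {
  // Split the array into consecutive chunks of at most batchSize elements,
  // so a shuffled dataset becomes a list of mini-batches.
  func grouped(by batchSize: Int) -> [[Element]] {
    stride(from: 0, to: count, by: batchSize).map { start in
      Array(self[start..<Swift.min(start + batchSize, count)])
    }
  }
}

For example, Array(1...5).shuffled().grouped(by: 2) yields mini-batches like [[3, 1], [5, 2], [4]].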

4 Steps to Fit

The magical fit function doesn't inspire any deeper understanding of what's going on. It looks like we simply lifted the train function into a for-loop.

However, if we can assume that our machine learning model is differentiable and based on gradient methods (e.g. neural networks), we can break down the fit function into 4 steps.

var classifier = Classifier()
for minibatch in training_dataset.shuffled().grouped(by: 32) {
  // Given inputs, a machine learning model can guess what the
  // outputs would be. (Labels of the images, positions of the faces
  // or translated texts from the original.)
  let guesses = classifier.apply(inputs: minibatch.data)
  // The loss measures how far our guesses are from the targets we know
  // from the training data. This is supervised learning; we already
  // know the answers.
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Based on the loss, gradients give us the direction and magnitude
  // to update our model parameters.
  let gradients = loss.gradients()
  // Update the parameters with gradients from this mini-batch.
  // Optimizer specifies a particular algorithm we use to update
  // parameters, such as stochastic gradient descent or ADAM.
  optimizer.apply_gradients(gradients, classifier.parameters)
}

For any supervised learning, you will be able to find these 4 steps. The details can vary: some models may accumulate gradients over a few mini-batches and then call apply_gradients once; some may apply additional clipping to the gradients before applying them.
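As a hedged sketch in the same pseudocode style (Gradients, clipped(by:) and the accumulation interval are made up here for illustration, not taken from any real framework), gradient accumulation and clipping only change the middle of the loop:

var classifier = Classifier()
var accumulated = Gradients.zero
for (i, minibatch) in training_dataset.shuffled().grouped(by: 32).enumerated() {
  let guesses = classifier.apply(inputs: minibatch.data)
  let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
  // Accumulate gradients over several mini-batches before updating.
  accumulated += loss.gradients()
  if (i + 1) % 4 == 0 {
    // Optionally clip, so one bad mini-batch cannot move the parameters too far.
    let clipped = accumulated.clipped(by: 1.0)
    optimizer.apply_gradients(clipped, classifier.parameters)
    accumulated = Gradients.zero
  }
}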

You could find the exact 4 steps in frameworks like Keras or PyTorch.

Validation Dataset and Epoch

We haven’t talked about the validation_dataset parameter you saw earlier for the train method!

For first-order gradient-based methods (e.g. neural networks), we need to go over the whole training dataset multiple times to reach a local minimum (a reasonable model). Going over the whole training dataset once is called one epoch. Our models can also suffer from overfitting. The validation dataset is data the model never uses when updating its parameters; it helps us understand how our model behaves on data it has never seen.

To incorporate the above two insights, our training loop can be further modified to:

var classifier = Classifier()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, classifier.parameters)
  }
  var stats = Stats()
  for example in validation_dataset {
    // Only gather guesses, never update the parameters.
    let guess = classifier.apply(input: example.data)
    // Stats will compare guess to the target, and return some
    // helpful statistics.
    stats.accumulate(guess: guess, target: example.target)
  }
  print("Epoch \(epoch), validation dataset stats: \(stats)")
}

Now I can claim that for any supervised learning task, you will find the above training loop when you dig deep enough through the abstractions. We can call this the universal supervised training loop.

Unsupervised Learning and Generative Networks

The main difference between unsupervised learning and supervised learning, as far as our training loop is concerned, is that the target is not provided by the training dataset. We derive the target from somewhere else: in unsupervised learning, from some transformation of the input; in generative networks, from random noise (hence generating something from nothing).

var model = Model()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32) {
    let guesses = model.apply(inputs: minibatch.data)
    // Unsupervised learning.
    let targets = model.targets(from: minibatch.data)
    // Generative networks
    // let targets = model.targets(from: noise)
    let loss = model.loss(guesses: guesses, targets: targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, model.parameters)
  }
}

Oftentimes, for these types of tasks, the targets are derived from another set of neural networks and updated jointly. Because of that, many frameworks add more bells and whistles when they implement the above training loop. You can find an example from Keras of how targets are derived from the input data alone (get_masked_input_and_labels) for BERT (a popular unsupervised natural language processing model), or an example from PyTorch of how adversarial examples are generated from noise for DCGAN (a deep convolutional generative adversarial network).
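
To make the "derive the targets from the input" idea concrete, here is a rough sketch of BERT-style masking in the same pseudocode; the 15% masking ratio and the mask_token are illustrative only, not the actual Keras implementation:

// Randomly mask some tokens; the model's job is to recover the
// originals from the masked sequence.
func targets(from data: Tokens) -> (inputs: Tokens, targets: Tokens) {
  var masked = data
  for i in 0..<data.count where Float.random(in: 0..<1) < 0.15 {
    masked[i] = mask_token
  }
  // In practice, the masked sequence becomes the model input and the
  // original tokens become the targets.
  return (inputs: masked, targets: data)
}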

Deep Reinforcement Learning

Deep reinforcement learning generates training data by having an agent interact with the environment. It has its own loop that looks like this:

// Assume the environment provides an initial observation (a
// hypothetical reset() in this pseudocode).
var lastObservation = environment.reset()
while true {
  let action = agent.respond(to: lastObservation)
  let (observation, reward, done) = environment.interact(action: action)
  lastObservation = observation
  if done {
    break
  }
}

The agent takes an action based on the last observation. The environment reacts to that action and returns a new observation, along with a reward and a flag indicating whether the episode is done.

This interaction loop is independent of our training loop. In contrast to supervised learning, in deep reinforcement learning we use the interaction loop to generate the training data.

Our training loop can be modified to look like this:

var policy = Policy()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  var data_in_episode = Dataset()
  // Assume each episode starts from the environment's initial state
  // (a hypothetical reset() in this pseudocode).
  var lastObservation = environment.reset()
  while true {
    let action = policy(inputs: lastObservation)
    let (observation, reward, done) = environment.interact(action: action)
    data_in_episode.append((action: action, reward: reward, observation: lastObservation))
    lastObservation = observation
    if done {
      for (i, data) in data_in_episode.enumerated() {
        // Use all future rewards to compute our target.
        let target = target_from_future_rewards(data_in_episode[i..<])
        // Our input will be the last observation (Q-learning), and
        // potentially also include the action (Actor-Critic model),
        // or the next observation (model-based methods).
        training_dataset.append((input: (data.action, data.observation), target: target))
      }
      break
    }
  }
  // Rather than shuffling the whole training dataset, we just
  // randomly sample a subset.
  for minibatch in training_dataset.randomSampled().grouped(by: 32) {
    let guesses = policy.apply(inputs: minibatch.data)
    let loss = policy.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    optimizer.apply_gradients(gradients, policy.parameters)
  }
}

The training_dataset in the above training loop is often referred to as the replay memory in the literature. If we retain old data across episodes, this is often called off-policy training; if instead we discard all training data after each episode, it is called on-policy training. OpenAI Spinning Up has a better explanation of the differences between on-policy and off-policy training, with a bit more detail.
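
In the pseudocode above, the difference is simply whether we keep or clear the replay memory between episodes; a sketch, not tied to any specific algorithm:

// Off-policy: keep appending to the replay memory across episodes and
// train on random samples drawn from all of it.
training_dataset.append(contentsOf: data_in_episode)

// On-policy: train only on data collected by the current policy, and
// discard it before the next episode starts.
training_dataset = data_in_episode
// ... run the mini-batch updates ...
training_dataset.removeAll()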

You can find examples of the above training loops in PyTorch or in OpenAI Baselines.

Distributed Learning

Following the same training loop, we can extend it to train on multiple machines:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    // Compute gradients from the specific data on this machine only.
    let machineGradients = loss.gradients()
    // Use allreduce primitive to compute gradients summed from all
    // machines.
    let allGradients = allreduce(op: +, value: machineGradients, on: machineID)
    // Applying the same summed gradients on every machine should
    // yield identical parameters everywhere.
    optimizer.apply_gradients(allGradients, classifier.parameters)
  }
}

The allreduce primitive goes over all machines and sums the gradients from them. In reality, it is often implemented with a ring-based communication pattern to optimize throughput (a sketch of the ring variant follows the naive version below). Naively, it can look like this:

func allreduce(op: Op, value: Tensor, on: Int) -> Tensor {
  if on == 0 {
    // Machine 0 gathers the values from every other machine, sums
    // them, and sends the result back out.
    var tensors = [value]
    for i in 1..<MPI_Comm_size() {
      tensors.append(MPI_Recv(from: i))
    }
    let sum = tensors.sum()
    for i in 1..<MPI_Comm_size() {
      MPI_Send(sum, to: i)
    }
    return sum
  } else {
    // Every other machine sends its value to machine 0 and waits to
    // receive the sum back.
    MPI_Send(value, to: 0)
    return MPI_Recv(from: 0)
  }
}
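
For completeness, here is what the ring variant mentioned above could look like in the same pseudocode. Only the ring topology is shown; real implementations additionally chunk the tensor so every link stays busy at each step, which is where the throughput win comes from:

func ring_allreduce(op: Op, value: Tensor, on: Int) -> Tensor {
  let size = MPI_Comm_size()
  if size == 1 {
    return value
  }
  let next = (on + 1) % size
  let previous = (on + size - 1) % size
  // Phase 1: a running sum travels around the ring starting at
  // machine 0; each machine adds its own value before forwarding.
  var sum = value
  if on > 0 {
    sum = MPI_Recv(from: previous) + value
  }
  if on < size - 1 {
    MPI_Send(sum, to: next)
  }
  // Phase 2: the last machine now holds the full sum and passes it
  // back around the ring so every machine ends up with the same value.
  if on == size - 1 {
    MPI_Send(sum, to: next)
  } else {
    sum = MPI_Recv(from: previous)
    if next != size - 1 {
      MPI_Send(sum, to: next)
    }
  }
  return sum
}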

This naive data-distributed training loop can be extended to more sophisticated distributed training regimes. For example, in ZeRO-Offload, the distributed strategy can be represented as:

var classifier = Classifier()
let machineID = MPI_Comm_rank()
for epoch in 0..<max_epoch {
  for minibatch in training_dataset.shuffled().grouped(by: 32, on: machineID) {
    let guesses = classifier.apply(inputs: minibatch.data)
    let loss = classifier.loss(guesses: guesses, targets: minibatch.targets)
    let gradients = loss.gradients()
    for (i, gradient) in gradients.enumerated() {
      // Each machine only sums the gradients it is responsible for.
      // This method returns nil when asked to reduce a gradient this
      // machine is not responsible for.
      if let reducedGradient = reduce(op: +, id: i, value: gradient, on: machineID) {
        // Copy the summed gradient to CPU.
        cpuGradients[machineID, i] = reducedGradient.copy(to: .CPU)
      }
    }
    // Apply gradients to the model from CPU.
    optimizer.apply_gradients(cpuGradients[machineID], classifier.parameters[machineID])
    // Broadcast the latest parameters to all machines.
    broadcast(classifier.parameters[machineID])
  }
}

The Universal Training Loop

Finally, we arrived at our universal training loop template:

var model = Model()
var training_dataset = Dataset()
for epoch in 0..<max_epoch {
  // Collect training dataset either from agent-environment
  // interaction, or from the disk.
  training_dataset = collect(...)
  // Go over mini-batch either on the whole training dataset, or from
  // a subset of it.
  for minibatch in training_dataset.extract(...) {
    // Apply the model to generate some guesses.
    let guesses = model.apply(inputs: minibatch.data)
    // Generate targets either from inputs, from noise, or it already
    // exists in the training dataset. Or a combination of above.
    let targets = targets_from(...)
    // Compute the loss from the model's guesses w.r.t. the targets.
    let loss = model.loss(guesses: guesses, targets: targets)
    // First-order gradients from the loss.
    let gradients = loss.gradients()
    // Sum gradients from everywhere (other machines, other GPUs) for
    // this particular node to process.
    if let (i, collected) = gradients.collect(...) {
      optimizer.apply_gradients(collected, model.parameters[i])
      // Broadcast the updated parameters to everywhere.
      broadcast(model.parameters[i])
    }
  }
}

*: The pseudo-code in this article uses Swift. One particular bit of syntax worth noting is a[x..<y]. It is semantically the same as a[x:y] in Python, including cases where part of the subscript is missing: a[x..<] should read the same as a[x:].

February 4th, 2021

I started working on a new data-science-related project last year. Our initial setup was fairly traditional: Python-centric, with Anaconda for dependency management and environment setup.

Anaconda is interesting. It is probably useful if you have a ton of dependencies and want a system that tries really hard to figure out whether packages are compatible with each other based on their claims (the version numbers of their dependencies). For us though, we only use a handful of packages with clean dependencies on each other (the usual suspects: Pandas, SciPy, numpy), and the version-number check just means every package upgrade turns into half an hour of SMT solving.

On the other hand, I made my fortune in the past decade doing app development (Facebook, Snapchat). I've been eyeing Swift since version 1.0, and the language has matured a lot since then. After gaining some Swift experience with my last employer, it seems to be a good compromise between expressivity and performance. The noise from the language itself is minimal, and the performance can be tuned if you work hard at it; even without tuning, it is still better than raw Python.

After a few months of probing, investing, and bug fixing, we migrated our setup to Swift last December. It has been serving us well so far.

Problems

We have a small project, and the problems with packages are mostly around package management and upgrades. Since Anaconda's environments are not per-project, we have to switch back and forth when entering / leaving the project.

Our project, although small, is a bit exotic. Our core algorithm was implemented in C, and we didn't want to ship a Python plugin in-tree. Hence, we opted to talk to the C lib through standard IO (a subprocess). It turned out to be hairy, and the process of updating the core algorithm was more than terrible.

Pandas has reasonable performance as long as you stay within its builtin functions; once we drop down to apply / lambda, going through a few million rows for a particular column can take 30 to 40 seconds. In these cases, we also cannot use the remaining idle cores efficiently.

However, switching to a new language setup is not an easy task. Besides solving the problems above, we would still like to keep a few things we liked about our old setup:

  • Jupyter notebook: we really like doing data exploration with Jupyter notebooks. Anything that requires us to compile / run / print would be a no-go. The interactive data exploration experience is essential to our workflow.

  • Pandas: we like Pandas; it is a Swiss Army knife for data science. It also has a big API surface that would be very hard to reimplement from scratch.

  • PyCharm or a similar IDE: we like using PyCharm for Python development a lot. Data inspection, re-running, and the overall debugging experience within an IDE are no comparison to tools like vim (although I still use vim for a lot of development personally).

Other Potential Choices

Before embarking on this quest, we briefly looked at other potential choices:

  • TypeScript / JavaScript: it has some interesting data exploration patterns, such as observablehq.com. However, it doesn't have good C library integration points, which leaves the aforementioned core-algorithm integration problem unsolved.

  • Rust: when I did my investigation, I didn't notice the evcxr project. Other than that, the syntax for modeling would be a bit noisier than I'd like, and the way to call Python through PyO3 is a bit clumsy too.

  • Julia: if I had more experience with the language, I might have a different opinion. But as it stands, the language has its own ecosystem, and I didn't see a good way to call Python libraries from it*. On the C library integration side, it seems to require dynamic linking, which would be a bit more of a hurdle for my toolchain setup.

New Setup

Monorepo

Our new setup is a monorepo, with Bazel as the build system for our Swift code, our C libraries, and our Python dependencies. We also still have some legacy Python libraries, which are now managed by Bazel too.

Bazel's new rules_python has a pretty reasonable pip_install rule for 3rd-party dependencies. As I mentioned, because we use a relatively small number of Python packages, cross-package compatibility is not a concern for us.

All our open-source dependencies are managed through WORKSPACE rules. This works for our monorepo because we don't really have a large number of open-source dependencies in the Swift ecosystem; the things we import are mostly Swift numerics, algorithms, and argument-parser.

Jupyterlab

We don't use a separate Jupyter installation anymore. Jupyterlab is installed as a pip_install requirement for our monorepo, and opening a Jupyterlab instance is as simple as bazel run :lab. This enables us to keep our Swift Jupyter kernel in-tree. We adopted swift-jupyter and added Bazel dependency support. We also have a pending PR to upstream our sourcekit-lsp integration with the Swift Jupyter kernel.

This complicates plugin management for Jupyterlab a bit, but we haven't yet found a well-maintained out-of-tree plugin that we would like to use.

Python

To support calling Python from Swift, we opted to use the PythonKit library. We've upstreamed a patch to make Pandas work better within Swift. We made a few more enhancements around UnboundedRange syntax and passing Swift closures as Python lambdas, which haven't been upstreamed at the moment.
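
As a flavor of what this looks like in practice, here is a minimal sketch of driving Pandas through PythonKit; the CSV path and the price column are made up for illustration:

import PythonKit

// Import Pandas exactly as Python would; PythonKit bridges Python
// objects into Swift as dynamic PythonObject values.
let pd = Python.import("pandas")

// The file path and column name below are purely illustrative.
let df = pd.read_csv("data/example.csv")
print(df.describe())

// Columns and methods are accessed dynamically, just like in Python.
let mean = df["price"].mean()

// Values convert back into native Swift types when needed.
let rows = Int(df.shape[0])!
print("rows: \(rows), mean price: \(mean)")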

One thing that makes calling Python from Swift easy is the use of reference counting in both languages. This makes memory management across the language boundary much more natural.

We also wrote a simple pyswift_binary macro within Bazel such that a Swift binary can declare its Python dependencies, and these will be set up properly before invoking the Swift binary.

We haven't vendored our Python runtime yet; for now we just painstakingly make sure all machines are on Python 3.8. However, we do intend to use py_runtime to resolve this discrepancy in the future.

C Libraries

Swift's interoperability with C is top-notch. Calling and compiling C dependencies (with Bazel) and integrating them with Swift is as easy as it should be. So far, we've fully migrated our core algorithms to Swift, and iterating on them is much easier now.

IDE Integration

For obvious reasons, we cannot use PyCharm with Swift. However, we successfully migrated most of our workflow to VS Code, and LLDB support makes debugging easy. We did compile our own Swift LLDB with Python support for this purpose (we are still figuring out with the core team why the Swift LLDB shipped on Linux has no Python support).

Sourcekit-lsp doesn't recognize Bazel targets, and the work on Build Server Protocol support on both the Bazel side and the sourcekit-lsp side seems stalled at the moment. We ended up writing a simple script that queries compilation parameters from bazel aquery for all our Swift source code and puts them into compile_commands.json. Sourcekit-lsp has just enough support for Clang's compilation database format to make code highlighting, auto-complete, go-to-definition, and inline documentation work again.

We committed .vscode/settings.json.tpl and .vscode/launch.json.tpl into the codebase. Upon initial checkout, a small script converts these template files into actual settings.json and launch.json files. We did this to work around some VS Code plugins requiring absolute paths. These two files have been kept in sync with the templates as part of the build process ever since.

Bonus

Swift's argument-parser library makes creating command-line tools really easy. Furthermore, it supports auto-completion in your favorite shells. We implemented one all-encompassing CLI tool for our monorepo to do all kinds of stuff: data downloading, transformation, model evaluation, launching Jupyterlab, etc. With auto-completion and help messages, it is much easier to navigate than our previous ./bin directory with twenty-something separate scripts.
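
As an illustration, a trimmed-down CLI built on argument-parser can look like the sketch below; the subcommand names here are made up and much simpler than our actual tool:

import ArgumentParser

struct Tool: ParsableCommand {
  static var configuration = CommandConfiguration(
    abstract: "An all-encompassing CLI for the monorepo.",
    subcommands: [Download.self, Lab.self])
}

struct Download: ParsableCommand {
  @Option(help: "Which dataset to download.")
  var dataset: String

  func run() throws {
    print("Downloading \(dataset)...")
  }
}

struct Lab: ParsableCommand {
  func run() throws {
    print("Launching Jupyterlab...")
  }
}

Tool.main()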

Is this the End?

We’ve been happy with this new setup since the beginning of this year, but it is far from the end-all be-all solution.

Swift is great for local development; however, the Bazel support on Linux generates binaries that dynamically link against the Swift runtime. This makes deployment a bit more involved than we'd like it to be.

Pandas uses an internal data structure for DataFrame (column-based numpy). While we do have efficient numpy-to-tensor conversions, these cannot be leveraged in the context of Pandas (we want to lift the whole data frame, not just the numpy-ready columns). Our current solution calls itertuples, which can be quite slow. We'd like to sponsor an open-source project to implement Apache Arrow support for Swift. This should enable us to pass data from Pandas to Swift through the Arrow in-memory format, which may be faster than what we currently do.


*: notagoodidea pointed out there is a PyCall.jl library that implements Python interoperability.
