
I started working on a new data-science project last year. Our initial setup was fairly traditional: Python-centric, with Anaconda for dependency management and environment setup.

Anaconda is interesting. It is probably useful if you have a ton of dependencies: the system tries really hard to figure out which package versions are compatible with each other based on their claims (the version ranges they declare for their dependencies). We, though, only use a handful of packages with clean dependencies on each other (the usual suspects: Pandas, SciPy, NumPy), so the version check just means that every package upgrade turns into half an hour of SAT solving.

On the other hand, I made my fortune in the past decade by doing app development (Facebook, Snapchat). I’ve had my eye on Swift since version 1.0, and the language has matured a lot since then. After I gained some Swift experience at my last employer, it seemed to be a good compromise between expressivity and performance. The noise from the language itself is minimal, and the performance can be tuned if you work hard at it; even untuned, it is still better than raw Python.

After a few months of probing, investing, and bug fixing, we migrated our setup to Swift last December. It has served us well so far.

Problems

We have a small project, and our problems with Python packages were mostly around package management and upgrades. Since Anaconda’s environments are not per-project, we had to switch back and forth when entering / leaving the project.

Our project, although small, is a bit exotic. Our core algorithm was implemented in C, and we didn’t want to ship a Python plugin in-tree. Hence, we opted to talk to the C library through standard I/O (subprocess). It turned out to be hairy, and the process for updating the core algorithm was more than terrible.
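
A bridge of this kind looks roughly like the following. This is a minimal sketch under stated assumptions, not the project's actual code: the child process is a stand-in Python one-liner (so the example runs anywhere) in place of the compiled C binary, and the one-number-per-line protocol and the name run_core are hypothetical.

```python
import subprocess
import sys

# Stand-in for the compiled C executable: reads one number per line from
# stdin and writes its square to stdout. (Hypothetical; the real core
# algorithm would be a native binary with its own protocol.)
CHILD = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(float(line) ** 2)",
]


def run_core(values):
    """Send values to the child over stdin and parse what it prints."""
    payload = "".join(f"{v}\n" for v in values)
    result = subprocess.run(
        CHILD, input=payload, capture_output=True, text=True, check=True
    )
    return [float(line) for line in result.stdout.splitlines()]


print(run_core([1.5, 2.0, 3.0]))
```

Any change to the text protocol has to be mirrored on both sides of the pipe, which is one reason this style of integration gets hairy.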

Pandas performs reasonably as long as we stay with its built-in functions. Once we drop down to apply / lambda, going through a few million rows for a particular column can take 30 to 40 seconds. In these cases, we also cannot use the remaining idle cores efficiently.

However, switching to a new language setup is not an easy task. Besides solving the problems above, we still wanted to keep a few things we liked about our old setup:

  • Jupyter notebook: we really liked doing data exploration with Jupyter notebooks. Anything that requires us to compile / run / print would be a no-go. The interactive data exploration experience is essential for our workflow.

  • Pandas: we liked Pandas; it is a Swiss Army knife for data science. It also has a big API surface that would be very hard to reimplement from scratch.

  • PyCharm or a similar IDE: we liked using PyCharm for Python development a lot. Data inspection, re-runs, and the debugging experience in general within an IDE are beyond comparison with tools like vim (although I still use vim for a lot of development personally).

Other Potential Choices

Before embarking on this quest, we briefly looked at other potential choices:

  • TypeScript / JavaScript: it has some interesting data exploration patterns, such as observablehq.com. However, it doesn’t have good C library integration points, leaving the aforementioned core algorithm integration problem unsolved.

  • Rust: when I did my investigation, I hadn’t noticed the evcxr project (a Rust REPL and Jupyter kernel). Other than that, the syntax for modeling would be a bit noisier than I’d like. The way to call Python through PyO3 is a bit clumsy too.

  • Julia: if I had more experience with the language, I might have a different opinion. But as it stands, the language has its own ecosystem, and I didn’t see a good way to call Python libraries from the language*. On the C lib integration part, it seems to require dynamic linking, and that would be a bit more of a hurdle for my toolchain setup.

New Setup

Monorepo

Our new setup is a monorepo, with Bazel as the build system for our Swift code, our C libraries and our Python dependencies. In the meantime, we still have some legacy Python libraries, and these are now managed by Bazel too.

Bazel’s new rules_python has a pretty reasonable pip_install rule for 3rd-party dependencies. As I mentioned, because we use a relatively small number of Python packages, cross-package compatibility is not a concern for us.

All our open-source dependencies are managed through WORKSPACE rules. This works for our monorepo because we don’t really have a large number of open-source dependencies in the Swift ecosystem. What we import is mostly Swift Numerics, Swift Algorithms and Swift Argument Parser.

Jupyterlab

We don’t use a separate Jupyter installation anymore. Jupyterlab is installed as a pip_install requirement for our monorepo. Opening Jupyterlab is as simple as running bazel run :lab. This enables us to keep our Swift Jupyter kernel in-tree. We adopted swift-jupyter and added Bazel dependency support. We also have a pending PR to upstream our sourcekit-lsp integration with the Swift Jupyter kernel.

This complicates plugin management for Jupyterlab a bit. But we haven’t yet found a well-maintained out-of-tree plugin that we would like to use.

Python

To support calling Python from Swift, we opted to use the PythonKit library. We’ve upstreamed a patch to make Pandas work better within Swift. We also made a few more enhancements around UnboundedRange syntax and passing Swift closures as Python lambdas, which haven’t been upstreamed yet.

One thing that makes calling Python from Swift easy is the use of reference counting in both languages. This makes memory management across the language boundary much more natural.
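
The Python half of that behavior can be observed directly. The following is a minimal sketch, assuming CPython (other Python implementations may defer collection); Resource is a hypothetical stand-in for an object shared across the bridge, and second_owner plays the role of a reference held by the other language.

```python
import weakref


class Resource:
    """Hypothetical stand-in for an object shared across the bridge."""


resource = Resource()
probe = weakref.ref(resource)  # observes the object without keeping it alive
second_owner = resource        # e.g. a reference held on the other side

del resource
# One owner is left, so the object must still exist.
alive_with_one_owner = probe() is not None

del second_owner
# The count reached zero; CPython frees the object right away.
alive_with_no_owner = probe() is not None

print(alive_with_one_owner, alive_with_no_owner)
```

Because the object lives exactly as long as someone holds a reference to it, a wrapper on the other side only needs to retain on creation and release on destruction.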

We also wrote a simple pyswift_binary macro within Bazel such that a Swift binary can declare its Python dependencies, and these will be set up properly before the Swift binary is invoked.

We haven’t vendored our Python runtime yet; we just painstakingly make sure all machines are on Python 3.8. However, we do intend to use py_runtime to resolve this discrepancy in the future.

C Libraries

Swift’s interoperability with C is top-notch. Calling and compiling C dependencies (with Bazel) and integrating with Swift is as easy as it should be. So far, we’ve fully migrated our core algorithms to Swift. The iterations on the core algorithms are much easier.

IDE Integration

For obvious reasons, we cannot use PyCharm with Swift. However, we successfully migrated most of our workflow to VS Code. LLDB support makes debugging easy, although we did have to compile our own Swift LLDB with Python support for this purpose (we are still in the process of figuring out with the core team why the Swift LLDB shipped on Linux has no Python support).

Sourcekit-lsp doesn’t recognize Bazel targets. The work on Build Server Protocol support, on both the Bazel side and the sourcekit-lsp side, seems to be stalled at the moment. We ended up writing a simple script that queries the compilation parameters from bazel aquery for all our Swift source code and puts them into compile_commands.json. Sourcekit-lsp has just enough support for Clang’s compilation database format to make code highlighting, auto-complete, go to definition and inline documentation work again.
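
The core of such a script can be sketched as follows. This is an illustration under assumptions rather than the script described above: it assumes the query was run with something like bazel aquery 'mnemonic("SwiftCompile", //...)' --output=jsonproto, that Swift compile actions carry the SwiftCompile mnemonic, and that source files appear as plain arguments. Real output may wrap the compiler in a worker or use response files, which this sketch ignores. The function name and the sample data are hypothetical.

```python
import json


def compile_commands_from_aquery(aquery, directory):
    """Build compile_commands.json entries from parsed aquery output.

    `aquery` is the dict parsed from the jsonproto output; `directory` is
    the path the recorded arguments are relative to (assumed here to be
    the Bazel execution root).
    """
    entries = []
    for action in aquery.get("actions", []):
        if action.get("mnemonic") != "SwiftCompile":  # assumed mnemonic
            continue
        arguments = action.get("arguments", [])
        # Emit one entry per Swift source file named on the command line.
        for arg in arguments:
            if arg.endswith(".swift"):
                entries.append(
                    {"directory": directory, "file": arg, "arguments": arguments}
                )
    return entries


# Hand-written sample in the shape of aquery's jsonproto output.
sample = {
    "actions": [
        {
            "mnemonic": "SwiftCompile",
            "arguments": ["swiftc", "-module-name", "Lib", "lib/A.swift", "lib/B.swift"],
        },
        {"mnemonic": "CppCompile", "arguments": ["clang", "-c", "core.c"]},
    ]
}

commands = compile_commands_from_aquery(sample, "/path/to/execroot")
print(json.dumps(commands, indent=2))
```

Writing the resulting list to compile_commands.json at the workspace root gives the language server one entry per Swift file.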

We committed .vscode/settings.json.tpl and .vscode/launch.json.tpl into the codebase. Upon initial checkout, a small script converts these template files into the actual settings.json and launch.json. We did this to work around issues with some VS Code plugins requiring absolute paths. From then on, the build process keeps these two files in sync with the templates.
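
A conversion step of this kind can be very small. The sketch below is an assumption about how it might look, not the actual script: the placeholder name WORKSPACE_ROOT, the setting key example.toolchainPath and the function render_template are all hypothetical. It uses safe_substitute so that VS Code's own ${workspaceFolder}-style variables pass through untouched.

```python
import tempfile
from pathlib import Path
from string import Template


def render_template(template_path, workspace_root):
    """Write settings.json next to settings.json.tpl, placeholders filled in."""
    template_path = Path(template_path)
    # safe_substitute leaves unknown placeholders (e.g. VS Code variables) as-is.
    text = Template(template_path.read_text()).safe_substitute(
        WORKSPACE_ROOT=str(workspace_root)  # hypothetical placeholder name
    )
    output_path = template_path.with_suffix("")  # drops the trailing ".tpl"
    output_path.write_text(text)
    return output_path


# Demo in a throwaway directory standing in for a fresh checkout.
with tempfile.TemporaryDirectory() as root:
    vscode_dir = Path(root) / ".vscode"
    vscode_dir.mkdir()
    template = vscode_dir / "settings.json.tpl"
    template.write_text(
        '{"example.toolchainPath": "$WORKSPACE_ROOT/bin", "cwd": "${workspaceFolder}"}\n'
    )
    output = render_template(template, root)
    rendered = output.read_text()
    print(output.name)
    print(rendered)
```

Only the generated files contain machine-specific absolute paths, so the templates stay safe to commit.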

Bonus

Swift’s argument-parser library makes creating command line tools really easy. Furthermore, it also supports auto-complete in your favorite shells. We implemented one all-encompassing CLI tool for our monorepo to do all kinds of stuff: data downloading, transformation, model evaluation, launching Jupyterlab etc. With auto-completion and help messages, it is much easier to navigate than our previous ./bin directory with twenty-something separate scripts.

Is this the End?

We’ve been happy with this new setup since the beginning of this year, but it is far from the be-all and end-all solution.

Swift is great for local development. However, the Bazel support on Linux generates binaries that dynamically link to the Swift runtime. This makes deployment a little bit more involved than we’d like it to be.

Pandas uses its own internal data structure for DataFrame (column-based NumPy arrays). While we do have efficient NumPy-to-tensor conversions, these cannot be leveraged in the context of Pandas (we want to lift the whole data frame, not just NumPy-ready columns). Our current solution calls itertuples, which can be quite slow. We’d like to sponsor an open-source project to implement Apache Arrow support for Swift. This should enable us to pass data from Pandas to Swift through the Arrow in-memory format, which may be faster than what we currently do.


*: notagoodidea pointed out that there is a PyCall.jl library that implements Python interoperability.
