Software is Eating Science

By Vijay Pandurangan (@vijayp)

The poor quality of software engineering in science is unnecessarily delaying the future by 5-10 years. The importance of software to science has grown tremendously: 20 years ago, it was used mainly for statistical analysis of results and other data; today, extremely complex software is so deeply integrated into biology, medicine, and other scientific disciplines that breakthroughs would be impossible without it.

Despite this, much of the software that scientists write does not follow basic software engineering best practices. Over the last two years, I have spent a great deal of time teaching, investing in, advising, and working with scientists. I’ve seen collaborators rarely using source control, code that is never peer reviewed, sparse (or no) documentation, untracked module and library dependencies, code that only runs on one person’s laptop, and data stored in an ad hoc manner. The impacts are as profound and far-reaching as they are predictable: collaboration with peers is onerous, obvious bugs are missed, reusing code is arduous, and security and privacy risks abound.
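
To make just one of these practices concrete, here is a minimal, self-contained sketch of an automated test in Python; the `normalize` function is a hypothetical stand-in for a real analysis step, not code from any particular project. A file like this, run with pytest, turns “it worked on my laptop” into a check that any collaborator can repeat with one command.

```python
# test_normalize.py -- a hypothetical, self-contained example.
# Running `pytest test_normalize.py` gives collaborators a one-command,
# repeatable check that a core analysis step still behaves as expected.
import numpy as np
import pytest


def normalize(counts: np.ndarray) -> np.ndarray:
    """Scale raw counts so they sum to 1 (a stand-in for a real analysis step)."""
    total = counts.sum()
    if total == 0:
        raise ValueError("cannot normalize an all-zero count vector")
    return counts / total


def test_normalize_sums_to_one():
    result = normalize(np.array([2.0, 3.0, 5.0]))
    assert abs(result.sum() - 1.0) < 1e-9


def test_normalize_rejects_all_zeros():
    # Encoding edge-case behaviour in a test documents it for collaborators.
    with pytest.raises(ValueError):
        normalize(np.zeros(3))
```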

This strikes at the core of the scientific method: if critical parts of important research can’t be understood or replicated, if iteration and collaboration are slowed, or if errors burrow their way into seminal work undetected, we all suffer. Just as research conducted without lab notebooks or a clear description of experimental design would be considered poor science, software used as part of the experimental process must be held to the same standard. While it’s easy to think of scientists who write software as unskilled or undisciplined, that would be unproductive and (more importantly) wrong. Most scientists have simply never seen or learned software engineering best practices, and aren’t encouraged to do so by their organizations.

Even amongst professional developers, much of the discipline is learned not in school but in industry, from peers and mentors. I’m a good example of this: in 2002 I started my first job, at Google, working on a combination of devops, SRE, and distributed systems. In grad school at Carnegie Mellon I’d learned much about Computer Science (CS)—how computers work, the tradeoffs between various algorithms, complex theory about how to build distributed systems. Surprisingly, I learned very little about Software Engineering (SWE)—how to develop and build systems in the real world. It was as if I were a physicist hired to build a bridge: I understood much about how bridges work, but little about the practical complexities of actually building a good one. At Google, I learned these skills via code reviews and instruction from my peers. Things aren’t really that different today; I’ve watched senior engineers teach junior ones these skills as I ran larger teams and as the startups I invested in grew into large companies. The hardest problems are often discussed within networks of developers that span companies.

Similarly, while scientists increasingly learn CS in school, SWE instruction is extremely rare. Unfortunately, unlike professional developers, scientists find it much harder to learn these skills from mentors, since senior scientists have had limited exposure to SWE themselves. Focused instruction definitely helps, as does community: for BIODS 253—a class I just taught at Stanford—I created a Slack channel for the students. Having easy access both to peers who were struggling with similar issues but shared domain knowledge, and to me (a software engineer with limited domain knowledge), proved useful. Building on that, I’ve created a Slack workspace where scientists, and engineers interested in science, can interact from around the world. You can join the Slack here, and follow @sweforscience on Twitter!

While education and community address important parts of the problem, they are not the whole story. Working with scientists over the past two years has convinced me that many of the tools and systems scientists use do not encourage best practices. As we know from the stark difference in organ donation rates between opt-in and opt-out jurisdictions, the design of systems and their default policies can radically alter user behaviour. We should strive to redesign scientists’ tools to help them do a better job of software engineering.

Remember that, at the end of the day, the goal of scientists is research. Anything used in pursuit of that goal is merely a tool, a means to an end, be it a DNA sequencing machine, a boat to get to an interesting part of the ocean, or software. Just as cars don’t require drivers to understand how anti-lock brakes improve handling, scientific tools and software should hide unnecessary details from users and guide them towards better decisions. Unfortunately, many fail this test in one of two ways:

  • Some interfaces are too complex and heavy-weight: they are designed for software engineers, not scientists. For instance, running computation in the cloud today might require an understanding of Kubernetes, Docker, Terraform, or IAM security policies.
  • Other interfaces promote ephemeral thinking: some scientific toolkits were designed for ad hoc analysis and retain defaults that make it hard to write well-engineered code. An example is Jupyter notebooks, which let one execute arbitrary code in a webpage (extremely useful) but don’t enable source control or a testing framework by default; see the sketch below.
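
For the second failure mode, one workable pattern today is to move notebook logic into an ordinary module that the notebook then imports; the sketch below (in Python, with hypothetical file and column names) shows the idea. The code becomes something git can diff line-by-line, a peer can review, and a test can exercise, while the notebook keeps its interactive strengths.

```python
# cleaning.py -- a hypothetical module extracted from a notebook.
# The notebook cell that once held this logic shrinks to:
#     from cleaning import drop_low_quality
# and this file can now be version-controlled, peer reviewed, and unit tested.
import pandas as pd

# A named, documented constant instead of a magic number buried in a cell.
QUALITY_THRESHOLD = 30


def drop_low_quality(df: pd.DataFrame,
                     threshold: int = QUALITY_THRESHOLD) -> pd.DataFrame:
    """Remove rows whose `quality` score falls below `threshold`."""
    return df[df["quality"] >= threshold].reset_index(drop=True)
```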

Fixes for these issues may range from small UX modifications to a complete rethinking of the way systems are designed and code is written. Given the number of tools, systems, and use cases involved, the path forward needs to be carefully considered.

There’s no doubt that software is already critical to progress in biology, medicine, and other sciences, and that its importance will continue to grow inexorably. I’m convinced that fixing poor SWE practices will bring us the future a few years sooner, and that doing so is a moral imperative: think of a cure for cancer that reaches the market several years late because data were lost, analyses across datasets were too difficult, or code was buggy or couldn’t be run in the cloud.

To that end, I’ve begun working on this problem in a few ways:

  • Academic Education: I taught BIODS 253 at Stanford in Winter 2021, and hope to teach it again on or before Winter 2022. I’m also teaching some of these skills online this summer; please sign up for the next class.
  • Industry: I’m starting to do limited training engagements at biotech and pharma companies. Please reach out to me (vijayp@olima.vc) if you’d like to book one.
  • Community: Please join our Slack community and follow @sweforscience on Twitter.
  • Tools: Low-friction systems that make it easier for scientists to do the right thing could be invaluable. If you've been thinking about improving scientific tools, please reach out.
  • Stories: I have a few guest bloggers lined up to write about their experiences. If you’d like to share your own, please contribute!

If you’re a scientist reading this, rejoice! Knowing that writing better software will make you more productive is half the battle. I’ll continue to post on Slack and Twitter, and will add additional resources here to help you improve your software skills. If you’re a software engineer currently working on these problems (or interested in doing so), please tweet at me or email me (vijayp@olima.vc).