Wednesday, December 24, 2025
Somewhat recently, both AWS and Cloudflare experienced major outages: these had some pretty widespread impacts on real humans (and also on corporate bottom lines). And just over a year ago, CrowdStrike caused another massive global outage that negatively affected lots of different industries, e.g. air travel, finance, and healthcare.
Everyone relies on software these days! (More or less. Or at least increasingly so.) Events like these, which make very clear just how dependent we are on brittle, complicated, and unreliable software, are...a bit disturbing, to say the least.
I think there are a couple real ways forward, and I’ll try to relatively quickly sketch them out. It’s probably worth prefacing this with my deep distaste for the common refrain that we just need better bureaucratic processes or developer protocols within the companies producing software.
I don’t think that this sort of perspective—which generally results in more administrative overhead, red tape, and an AI/LLM savior complex(3)—actually gets us anywhere meaningfully better: it only offers bandaids for much deeper problems. Top-down policy that relies heavily on human input, coordination, and diligence doesn’t really scale; AI and LLM-based attempts to scale this kind of heuristic administrative oversight only serve to delay a more painful (but more essential) reckoning. What we need, I argue, is structural change at the level of programming tools, abstractions, and infrastructure.
None of this is to say that human factors are unimportant! There are lots of cases where human factors are deeply important, but I think only insofar as those factors shape and are shaped by the structural tools, abstractions, and infrastructure we develop around them.
For example, take supply chain attacks, which seem to be inevitable in a world where (a) software projects often have large dependency graphs and (b) few, if any, software packages or modules come with formal security guarantees. Supply chain attacks (and other, related kinds of coordinated attacks on social networks, e.g. Google bombing or Sybil attacks) abuse and exploit notions of trust, which inherently (I think) cannot be strictly and precisely quantified—these notions must involve human factors!
The best we can do is develop formal models of trust, and then develop structural tools, abstractions, and infrastructure—whether it’s public key infrastructure, PGP keys and the “web of trust,” Byzantine fault-tolerant consensus protocols, decentralized and distributed ledgers, and so on—to help us (a) trust fewer things and (b) make more explicit what we’re trusting.
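To make that concrete, here is a toy sketch of the “web of trust” idea (all the key names and the particular trust policy are invented for illustration): trust is modeled as a directed graph of key signatures, and we trust exactly those keys reachable from our own within a bounded number of hops. The point is that the model makes explicit both what we trust and why.

```python
from collections import deque

# A toy "web of trust": edges map a key to the keys it has signed.
# All names here are made up for illustration.
signatures = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  ["frank"],
    "frank": [],
}

def trusted(root, target, max_hops=2):
    """Trust `target` iff some signature chain of length <= max_hops
    connects our root key to it. The trust decision is explicit: we
    trust exactly the keys our bounded search reaches, nothing more."""
    frontier = deque([(root, 0)])
    seen = {root}
    while frontier:
        key, depth = frontier.popleft()
        if key == target:
            return True
        if depth == max_hops:
            continue  # outside our trust radius; don't follow further
        for signed in signatures.get(key, []):
            if signed not in seen:
                seen.add(signed)
                frontier.append((signed, depth + 1))
    return False

print(trusted("alice", "dave"))   # True: reachable in 2 hops
print(trusted("alice", "frank"))  # False: 3 hops away
```

Real systems (PGP trust signatures, certificate chains) are far richer than a reachability check, but they share this shape: a formal structure that bounds and exposes what is being trusted.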
One major concern that I have with all of these recent outages is that they expose just how centralized the internet really is. In brief: I think the internet (and software in general) should be far more decentralized and local-first. None of this is particularly revolutionary, given all the recent activity around things like open social networks (and associated protocols) and local-first software.
I think work in these directions is all very timely and important. Should we all be (unwittingly) relying on Amazon’s AWS to store and access much of our online data? Should we be collaborating online and sharing information primarily through Google Drive’s suite of applications? Should our mattresses be connected to the internet? Should a remote software update from a single entity deny healthcare providers access to local patient data? Should we be relying, for the most part, exclusively on GitHub (which is now owned by Microsoft) to store, access, and work on open-source code?
I don’t think so. This state of affairs feels increasingly untenable: it harms and disempowers users, frustrates software developers, and seems only to serve an increasingly wealthy and politically powerful technocratic ruling class. Consumer-facing software should not require internet access to function, users should own their data, and computing infrastructure must diversify, decentralize, and perhaps even socialize (although I am less convinced of the latter).
That being said, this is all rather idealistic: there are clear challenges in practice, particularly regarding decentralization. There are good reasons (e.g. convenience, user experience, short-term and small-scale cost effectiveness) why large centralized platforms and infrastructure providers (i.e. the “cloud”) have become hugely popular.
I don’t want to get too in the weeds here, so I’ll just link to some articles and posts that I find interesting in this regard:
The core of my stance on these outages (and, more broadly, on the state of modern software) is this: we need better support from our programming tools and abstractions, both to make greater decentralization practical, and to prevent the bugs that caused these centralized outages from happening in the first place. I’m primarily interested in approaches centered around programming languages (fairly broadly construed), and formal methods.(5)
I think this is, by now, a fairly uncontroversial point. Making distributed systems work correctly and reliably is an immensely difficult task, even for hugely profitable software firms—like Amazon and Google—whose profits depend on these systems working correctly and reliably. There are two complementary responses to this observation:
I think programming languages research has a lot of very fruitful things to say about abstraction design for distributed systems (and abstraction design in general).(6) And I think formal verification—trying to get provable guarantees about global properties of computing systems—should play a much larger role in the software development process, particularly for foundational and complicated distributed systems. Of course, there is a good deal of research work that still needs to be done to make this more widely practical. Case in point: Amazon (AWS) runs a large programming languages/formal verification research group (probably the largest currently in industry).
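For a tiny taste of what “provable guarantees about global properties” means (a toy sketch, nothing like the scale of industrial verification tooling): an explicit-state model checker can exhaustively enumerate every interleaving of a concurrent protocol and check an invariant in all reachable states, rather than testing a handful of runs. Here, mutual exclusion for Peterson’s algorithm:

```python
from collections import deque

def step(state, i):
    """Yield the successor state for process i (0 or 1), if any.
    A state is (program counters, flags, turn)."""
    pc, flag, turn = list(state[0]), list(state[1]), state[2]
    j = 1 - i
    if pc[i] == 0:                       # request entry: raise our flag
        flag[i] = True; pc[i] = 1
    elif pc[i] == 1:                     # give the other process the turn
        turn = j; pc[i] = 2
    elif pc[i] == 2:                     # guard: enter only when safe
        if not flag[j] or turn == i:
            pc[i] = 3                    # enter the critical section
        else:
            return                       # blocked: no successor here
    elif pc[i] == 3:                     # leave the critical section
        flag[i] = False; pc[i] = 0
    yield (tuple(pc), tuple(flag), turn)

def check_mutual_exclusion():
    """Breadth-first search over ALL reachable states, asserting the
    invariant (never both processes at pc == 3) in each one."""
    init = ((0, 0), (False, False), 0)
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        assert not (s[0][0] == 3 and s[0][1] == 3), f"both in CS: {s}"
        for i in (0, 1):
            for nxt in step(s, i):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return len(seen)

print(check_mutual_exclusion(), "reachable states; mutual exclusion holds")
```

Tools like TLA+/TLC do essentially this (with far better modeling languages and optimizations); the appeal is that the check covers every interleaving, not just the ones a test happened to exercise.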
A big reason why people (and companies) reach instinctively for managed, cloud-based services is that it’s just easy. I think there is a really interesting, and pretty practical/industry-facing area of work in making locally-managed, fully-owned, and low-dependency compute infrastructure easier and more usable. This isn’t an area that I think really demands any interesting research advances to make possible(7): we have all the tools we need! It’s simply a matter of the right player(s) coming in and developing better end-to-end tooling, making the user experience palatable, and getting some industry traction. There are already some startups in this space, e.g. Oxide.
Similarly, a big reason why developers often develop in a cloud-oriented way is that it is also just easier. Cloud-based services have lots of industry mindshare, a good deal of developer goodwill, and a cottage industry of developer tooling and related services that make the experience fairly painless, at least at the beginning. This is fair, but not optimal. Lots of software should not be developed in a cloud-first way. I believe that better tooling and better user experience for building local-first software can help a lot in this regard. And again, there is already a good deal of practical, industry-oriented work in this space, e.g. Automerge, Ink & Switch’s work (although this is a bit more research-y), and the local-first development community.
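To give a flavor of the core trick behind local-first sync (a hand-rolled sketch of the idea, not Automerge’s actual API): state-based CRDTs, like this grow-only counter, let replicas edit offline and merge their states in any order, with no central server to go down.

```python
# A minimal grow-only counter CRDT. Each replica increments only its
# own slot; merging takes an elementwise max, so merge is commutative,
# associative, and idempotent -- replicas can sync in any order, any
# number of times, and still converge to the same value.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}              # replica_id -> increments seen

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Elementwise max: applying the same merge twice changes nothing.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

# Two devices edit offline, then sync directly with each other.
laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(3)
phone.increment(2)
laptop.merge(phone)
phone.merge(laptop)
assert laptop.value() == phone.value() == 5
```

Real local-first tools layer much richer data types (text, maps, lists) on the same algebraic foundation; the payoff is that convergence is a property of the data structure, not of some coordination service.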
In all honesty, I don’t really believe that “better testing” (i.e. the current crop of LLM-driven testing tools, simply “spending more time in staging environments,” or even property-based/fuzz testing—which I am admittedly rather interested in) will move the needle meaningfully forward or get us to where I believe we should want to be. I’ve already written a bit about this in a previous post.
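To be concrete about one limit of the testing approach (a contrived sketch with a deliberately planted bug): a property-based test samples inputs at random, so it can pass thousands of trials while a rare bad input goes unexercised. Passing tests are evidence, not proof.

```python
import random

def absolute_value(x):
    # Deliberately buggy: wrong answer at exactly one rare input.
    if x == 123_456_789:
        return -x
    return x if x >= 0 else -x

def property_test(trials=10_000, seed=0):
    """Random property test: |x| should never be negative. With ~2
    billion candidate inputs and only 10,000 samples, the one bad
    input is essentially never drawn, so the test passes even though
    the bug is real."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-10**9, 10**9)
        assert absolute_value(x) >= 0
    return True

print(property_test())
print(absolute_value(123_456_789) < 0)  # True: the bug is still there
```

Good property-based tools shrink this gap with smarter input generation, but the asymmetry remains: testing samples the state space, while verification covers it.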
My main gripes with a testing-centric approach are basically:
I think that what these outages point to is actually a fairly desperate need for better programming abstractions, from first principles. Software programs, particularly distributed systems, are increasingly foundational parts of society as a whole, and I believe we need better ways to build and reason about them, not only for commercial reasons, but also for social reasons. The answer isn’t LLM-based tools(9) that help us build vastly more software in the same ways; we need to build vastly better software, in better ways.
There are a couple quotes from an article in ComputerWorld about the AWS outage that I really liked, and I’ll end with them.
From Chris Ciabarra, the CTO of Athena Security:
And from Catalin Voicu, an engineer at N2W Software:
This is where programming languages and formal methods research is actually very practically useful, and not just an intellectual curiosity (as some would like to claim)! It can help guide the development and design of simpler, more efficient, and more provably reliable distributed systems, in ways that aren’t just bandaids over intractable problems rooted in historical accident and backwards compatibility requirements.(10)