Wednesday, December 24, 2025
Somewhat recently, both AWS and Cloudflare experienced major outages: these had some pretty widespread impacts on real humans (and also on corporate bottom lines). And just over a year ago, CrowdStrike caused another massive global outage that negatively affected lots of different industries, e.g. air travel, finance, and healthcare.
Everyone relies on software these days! (More or less. Or at least increasingly so.) Events like these, which make very clear just how dependent we are on brittle, complicated, and unreliable software, are...a bit disturbing, to say the least.
I think there are a couple real ways forward, and I’ll try to relatively quickly sketch them out. It’s probably worth prefacing this with my deep distaste for the common refrain that we just need better bureaucratic processes or developer protocols within the companies producing software.
I don’t think that this sort of perspective—which generally results in more administrative overhead, red tape, and an AI/LLM savior complex(3)—actually gets us anywhere meaningfully better: it only offers bandaids for much deeper problems. Top-down policy that relies heavily on human input, coordination, and diligence doesn’t really scale; AI and LLM-based attempts to scale this kind of heuristic administrative oversight only serve to delay a more painful (but more essential) reckoning. What we need, I argue, is structural change at the level of programming tools, abstractions, and infrastructure.
None of this is to say that human factors are unimportant! There are lots of cases where human factors are deeply important, but I think only insofar as those factors shape and are shaped by the structural tools, abstractions, and infrastructure we develop around them.
For example, take supply chain attacks, which seem to be inevitable in a world where (a) software projects often have large dependency graphs and (b) few, if any, software packages or modules come with formal security guarantees. Supply chain attacks (and other, related kinds of coordinated attacks on social networks, e.g. Google bombing or Sybil attacks) abuse and exploit notions of trust, which inherently (I think) cannot be strictly and precisely quantified—these notions must involve human factors!
The best we can do is develop formal models of trust, and then develop structural tools, abstractions, and infrastructure—whether it’s public key infrastructure, PGP keys and the “web of trust,” Byzantine fault-tolerant consensus protocols, decentralized and distributed ledgers, and so on—to help us (a) trust fewer things and (b) make more explicit what we’re trusting.
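To make that concrete, here is a toy sketch of the “web of trust” idea (all the key names and the particular trust policy are invented for illustration): trust is modeled as a directed graph of key signatures, and we trust exactly those keys reachable from our own within a bounded number of hops. The point is that the model makes explicit both what we trust and why.

```python
from collections import deque

# A toy "web of trust": edges map a key to the keys it has signed.
# All names here are made up for illustration.
signatures = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  ["frank"],
    "frank": [],
}

def trusted(root, target, max_hops=2):
    """Trust `target` iff some signature chain of length <= max_hops
    connects our root key to it. The trust decision is explicit: we
    trust exactly the keys our bounded search reaches, nothing more."""
    frontier = deque([(root, 0)])
    seen = {root}
    while frontier:
        key, depth = frontier.popleft()
        if key == target:
            return True
        if depth == max_hops:
            continue  # outside our trust radius; don't follow further
        for signed in signatures.get(key, []):
            if signed not in seen:
                seen.add(signed)
                frontier.append((signed, depth + 1))
    return False

print(trusted("alice", "dave"))   # True: reachable in 2 hops
print(trusted("alice", "frank"))  # False: 3 hops away
```

Real systems (PGP trust signatures, certificate chains) are far richer than a reachability check, but they share this shape: a formal structure that bounds and exposes what is being trusted.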
One major concern that I have with all of these recent outages is that they expose just how centralized the internet really is. In brief: I think the internet (and software in general) should be far more decentralized and local-first. None of this is particularly revolutionary, given all the recent activity around things like open social networks (and associated protocols) and local-first software.
I think work in these directions is all very timely and important. Should we all be (unwittingly) relying on Amazon’s AWS to store and access much of our online data? Should we be collaborating online and sharing information primarily through Google Drive’s suite of applications? Should our mattresses be connected to the internet? Should a remote software update from a single entity deny healthcare providers access to local patient data? Should we be relying, for the most part, exclusively on GitHub (which is now owned by Microsoft) to store, access, and work on open-source code?
I don’t think so. This state of affairs feels increasingly untenable: it harms and disempowers users, frustrates software developers, and seems only to serve an increasingly wealthy and politically powerful technocratic ruling class. Consumer-facing software should not require internet access to function, users should own their data, and computing infrastructure must diversify, decentralize, and perhaps even socialize (although I am less convinced of the latter).
That being said, this is all rather idealistic: there are clear challenges in practice, particularly regarding decentralization. There are good reasons (e.g. convenience, user experience, short-term and small-scale cost effectiveness) why large centralized platforms and infrastructure providers (i.e. the “cloud”) have become hugely popular.
I don’t want to get too in the weeds here, so I’ll just link to some articles and posts that I find interesting in this regard:
The core of my stance on these outages (and, more broadly, on the state of modern software) is this: we need better support from our programming tools and abstractions, both to make greater decentralization practical, and to prevent the bugs that caused these centralized outages from happening in the first place. I’m primarily interested in approaches centered around programming languages (fairly broadly construed), and formal methods.(5)
I think this is, by now, a fairly uncontroversial point. Making distributed systems work correctly and reliably is an immensely difficult task, even for hugely profitable software firms—like Amazon and Google—whose profits depend on these systems working correctly and reliably. There are two complementary responses to this observation:
I think programming languages research has a lot of very fruitful things to say about abstraction design for distributed systems (and abstraction design in general).(6) And I think formal verification—trying to get provable guarantees about global properties of computing systems—should play a much larger role in the software development process, particularly for foundational and complicated distributed systems. Of course, there is a good deal of research work that still needs to be done to make this more widely practical. Case in point: Amazon (AWS) runs a large programming languages/formal verification research group (probably the largest currently in industry).
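For a tiny taste of what “provable guarantees about global properties” means (a toy sketch, nothing like the scale of industrial verification tooling): an explicit-state model checker can exhaustively enumerate every interleaving of a concurrent protocol and check an invariant in all reachable states, rather than testing a handful of runs. Here, mutual exclusion for Peterson’s algorithm:

```python
from collections import deque

def step(state, i):
    """Yield the successor state for process i (0 or 1), if any.
    A state is (program counters, flags, turn)."""
    pc, flag, turn = list(state[0]), list(state[1]), state[2]
    j = 1 - i
    if pc[i] == 0:                       # request entry: raise our flag
        flag[i] = True; pc[i] = 1
    elif pc[i] == 1:                     # give the other process the turn
        turn = j; pc[i] = 2
    elif pc[i] == 2:                     # guard: enter only when safe
        if not flag[j] or turn == i:
            pc[i] = 3                    # enter the critical section
        else:
            return                       # blocked: no successor here
    elif pc[i] == 3:                     # leave the critical section
        flag[i] = False; pc[i] = 0
    yield (tuple(pc), tuple(flag), turn)

def check_mutual_exclusion():
    """Breadth-first search over ALL reachable states, asserting the
    invariant (never both processes at pc == 3) in each one."""
    init = ((0, 0), (False, False), 0)
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        assert not (s[0][0] == 3 and s[0][1] == 3), f"both in CS: {s}"
        for i in (0, 1):
            for nxt in step(s, i):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return len(seen)

print(check_mutual_exclusion(), "reachable states; mutual exclusion holds")
```

Tools like TLA+/TLC do essentially this (with far better modeling languages and optimizations); the appeal is that the check covers every interleaving, not just the ones a test happened to exercise.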
A big reason why people (and companies) reach instinctively for managed, cloud-based services is that it’s just easy. I think there is a really interesting, and pretty practical/industry-facing area of work in making locally-managed, fully-owned, and low-dependency compute infrastructure easier and more usable. This isn’t an area that I think really demands any interesting research advances to make possible(7): we have all the tools we need! It’s simply a matter of the right player(s) coming in and developing better end-to-end tooling, making the user experience palatable, and getting some industry traction. There are already some startups in this space, e.g. Oxide.
Similarly, a big reason why developers often develop in a cloud-oriented way is that it is also just easier. Cloud-based services have lots of industry mindshare, a good deal of developer goodwill, and a cottage industry of developer tooling and related services that make the experience fairly painless, at least at the beginning. This is fair, but not optimal. Lots of software should not be developed in a cloud-first way. I believe that better tooling and better user experience for building local-first software can help a lot in this regard. And again, there is already a good deal of practical, industry-oriented work in this space, e.g. Automerge, Ink & Switch’s work (although this is a bit more research-y), and the local-first development community.
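To give a flavor of the core trick behind local-first sync (a hand-rolled sketch of the idea, not Automerge’s actual API): state-based CRDTs, like this grow-only counter, let replicas edit offline and merge their states in any order, with no central server to go down.

```python
# A minimal grow-only counter CRDT. Each replica increments only its
# own slot; merging takes an elementwise max, so merge is commutative,
# associative, and idempotent -- replicas can sync in any order, any
# number of times, and still converge to the same value.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}              # replica_id -> increments seen

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Elementwise max: applying the same merge twice changes nothing.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

# Two devices edit offline, then sync directly with each other.
laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(3)
phone.increment(2)
laptop.merge(phone)
phone.merge(laptop)
assert laptop.value() == phone.value() == 5
```

Real local-first tools layer much richer data types (text, maps, lists) on the same algebraic foundation; the payoff is that convergence is a property of the data structure, not of some coordination service.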
In all honesty, I don’t really believe that “better testing” (i.e. the current crop of LLM-driven testing tools, simply “spending more time in staging environments,” or even property-based/fuzz testing—which I am admittedly rather interested in) will move the needle meaningfully forward or get us to where I believe we should want to be. I’ve already written a bit about this in a previous post.
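To be concrete about one limit of the testing approach (a contrived sketch with a deliberately planted bug): a property-based test samples inputs at random, so it can pass thousands of trials while a rare bad input goes unexercised. Passing tests are evidence, not proof.

```python
import random

def absolute_value(x):
    # Deliberately buggy: wrong answer at exactly one rare input.
    if x == 123_456_789:
        return -x
    return x if x >= 0 else -x

def property_test(trials=10_000, seed=0):
    """Random property test: |x| should never be negative. With ~2
    billion candidate inputs and only 10,000 samples, the one bad
    input is essentially never drawn, so the test passes even though
    the bug is real."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-10**9, 10**9)
        assert absolute_value(x) >= 0
    return True

print(property_test())
print(absolute_value(123_456_789) < 0)  # True: the bug is still there
```

Good property-based tools shrink this gap with smarter input generation, but the asymmetry remains: testing samples the state space, while verification covers it.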
My main gripes with a testing-centric approach are basically:
I think that what these outages point to is actually a fairly desperate need for better programming abstractions, from first principles. Software programs, particularly distributed systems, are increasingly foundational parts of society as a whole, and I believe we need better ways to build and reason about them, not only for commercial reasons, but also for social reasons. The answer isn’t LLM-based tools(9) that help us build vastly more software in the same ways; we need to build vastly better software, in better ways.
There are a couple quotes from an article in ComputerWorld about the AWS outage that I really liked, and I’ll end with them.
From Chris Ciabarra, the CTO of Athena Security:
And from Catalin Voicu, an engineer at N2W Software:
This is where programming languages and formal methods research is actually very practically useful, and not just an intellectual curiosity (as some would like to claim)! It can help guide the development and design of simpler, more efficient, and more provably reliable distributed systems, in ways that aren’t just bandaids over intractable problems rooted in historical accident and backwards compatibility requirements.(10)