The following is a transcript of a speech given by Dr. Dan Geer at the Security of Things Forum on May 7, 2014. The Forum was held at The Sheraton Commander in Cambridge, Massachusetts. The official copy of Dr. Geer’s speech lives on his web site, and can be found here.
Security of Things
Dan Geer, 7 May 14, Cambridge
Thank you for your invitation and to the other speakers for their viewpoints and for the shared experience. With respect to this elephant, each of us is one of those twelve blind men.
We are at the knee of the curve for deployment of a different model of computation. We’ve had two decades where, in round numbers, laboratories gave us twice the computing for constant dollars every 18 months, twice the disk drive storage capacity for constant dollars every 12 months, and twice the network speed for constant dollars every 9 months. That is two orders of magnitude in computes per decade, three for storage, and four for transmission. In constant dollar terms, we have massively enlarged the stored data available per compute cycle, yet that data is more mobile in the aggregate than when there was less of it.
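The orders-of-magnitude figures above follow directly from the doubling periods. A minimal sketch of that arithmetic, assuming steady doubling over a ten-year span (the function name is illustrative, not from the speech):

```python
import math


def orders_of_magnitude_per_decade(doubling_months: float) -> float:
    """Return log10 of the growth factor over 120 months of steady doubling."""
    doublings = 120.0 / doubling_months
    return doublings * math.log10(2)


# Doubling periods quoted in the speech, in months.
for name, months in [("compute", 18), ("storage", 12), ("network", 9)]:
    print(f"{name}: ~{orders_of_magnitude_per_decade(months):.1f} orders of magnitude per decade")
```

Rounded to whole numbers, the three results are two, three, and four, matching the round-number claim.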
It is thus no wonder that cybercrime is data crime. It is thus no wonder that the advanced persistent threat is the targeted effort to obtain, change, or deny information by means that are “difficult to discover, difficult to remove, and difficult to attribute.”[DG]
Yet, as we all know, laboratory results filter out into commercial off the shelf products at rates controlled by the market power of existing players — just because it can be done in the laboratory doesn’t mean that today you can buy it retail. So it has been with that triad of computation, storage, and transmission capacities. As Martin Hilbert’s studies describe, in 1986 you could fill the world’s total storage using the world’s total bandwidth in two days. Today, it would take 150 days of the world’s total bandwidth to fill the world’s total storage, and the measured curve between 1986 and today is all but perfectly exponential.[MH]
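If the curve between those two data points is taken to be a pure exponential, the two figures imply a doubling time for the storage-to-bandwidth ratio. A rough sketch, assuming "today" means 2014 (the year of the speech) and using only the two endpoint values quoted above:

```python
import math


def implied_doubling_time(start_value: float, end_value: float, years: float) -> float:
    """Doubling time, in years, of a quantity growing exponentially between two points."""
    return years * math.log(2) / math.log(end_value / start_value)


# Days of total world bandwidth needed to fill total world storage.
days_1986 = 2.0
days_2014 = 150.0  # assumption: "today" is 2014

doubling_years = implied_doubling_time(days_1986, days_2014, 2014 - 1986)
print(f"fill time doubles roughly every {doubling_years:.1f} years")
```

On these assumptions the fill time doubles about every four and a half years, which is another way of saying that storage has been outrunning transmission the whole time.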
Meanwhile, Moore’s Law has begun slowing. There are two reasons for this. Reason number one is physics: We can’t cool chips at clock rates much beyond what we have now. Reason number two is economics: The cost of new fabrication facilities doubles every two years, which is Moore’s lesser-known Second Law. Intel canceled its Fab42 in January of this year because the capital cost per gate is now rising. By 2018 one new fab will be just as expensive in inflation adjusted terms as was the entire Manhattan Project.[GN] The big players will have to get bigger still, or Moore’s First Law is over because of Moore’s Second Law.
And hardware replacement cycles are no longer driven by customer upgrade lust — by which I mean the need to buy new hardware just because you need new hardware to run new software. “Good enough for everything I need to do” now dominates computing excepting, perhaps, in mobile, but that, too, is a curve that will soon flatten. Only graphics cards are not yet “good enough for everything I need to do”, but every curve has its asymptote. In sum, the commercial off the shelf market is not going to keep allowing us to dream big without regard to the underlying performance costs. We are not going to grow ourselves out of performance troubles of our own making. We were able to do that for a good long run, but that party is over.
We can see that today in cryptography. In the commercial world, cryptographic performance is now a front-and-center topic of discussion both in individual firms, amongst expert discussion groups, and within standards bodies. The commercial world has evidently decided that the time has come to add cryptographic protections to an expanded range of products and services. The question being unevenly debated is whether, on the one hand, to achieve cryptographic performance with ever more adroit algorithm design, especially design that can make full use of parallelization, or to trend more towards hardware implementations. As you well know, going to hardware yields really substantial gains in performance not otherwise possible, but at the cost of zero post installation flexibility. This is not hypothetical; AES performance improvements have of late been because software has been put aside in favor of hardware. At least in the views of some of us, hardware embodiments make the very idea of so-called “algorithm agility” operationally irrelevant because recapitalizing one’s data center so as to get a new hardware-based crypto algorithm spliced in is just not going to happen, nor is turning off some optimized, not to mention amortized, hardware just to be able to use some new software that is consequentially 10X slower. One is reminded of Donald Knuth’s comment that “Premature optimization is the root of all evil.”
This brings us to the hardware question in general terms. The embedded systems space, already bigger than what is normally thought of as “a computer,” makes the attack surface of the non-embedded space trivial if not irrelevant. Perhaps I overstate. Perhaps that isn’t true today, but by tomorrow it will be true. Quoting an authoritative colleague[PG], “[In] the embedded world (which makes the PC and phone and whatnot market seem trivial by comparison), […] performance stays constant and cost goes down. Ten years ago your code had to run on a Cortex-M. Ten years from now your code will need to run on more or less the same Cortex-M, only it’ll be cheaper and have more integrated peripherals.”
Let me pause to ask a teaser question; if those embedded devices are immortal, are they angelic? Let me first talk, though, about the wider world.
Beginning with Stephanie Forrest in 1997,[SF] regular attention has been paid to the questions of monoculture in the networked environment. There is no point belaboring the fundamental observation, but let me state it clearly for the record: cascade failure is so very much easier to detonate in a monoculture — so very much easier when the attacker has only to weaponize one bit of malware, not ten million. The idea is obvious; believing in it is easy; acting on its implications is, evidently, rather hard.
Despite what you may think, I am entirely sympathetic to the actual reason we continue to deploy computing monocultures — making everything almost entirely alike is, and remains, our only hope for being able to centrally manage it all in a consistent manner. Put differently, when you deploy a computing monoculture you are making a fundamental risk management decision: That the downside risk of a black swan [NT] event is more tolerable than the downside risk of perpetual inconsistency. This is a hard question, as all risk management is about changing the future, not explaining the past. So let me repeat, which would you rather have, the inordinately unlikely event of an inordinately severe impact, or the day-to-day burden of perpetual inconsistency?
When we opt for monocultures by choice we had better opt for tight central control. This, of course, supposes that we are willing to face the risks that come with tight central control including the paramount risk of any and all auto-update schemes — namely the hostile control of the auto-update mechanism itself irrespective of whether that hostile control is the result of external takeover of a good controller or the result of a previously good controller going over to the dark side.
But amongst deployed monocultures, computer desktops are not the point; embedded systems are. The trendline in the count of critical monocultures seems to be rising and most of these are embedded systems both without a remote management interface and long lived. That combination — long lived and not reachable — is the trend that must be dealt with, possibly even reversed. Whether to insist that embedded devices self destruct by some predictable age or that remote management of them be a condition of deployment is the question, dare I say the national policy question, that is on the table. In either case, the Internet of Things, which is to say the appearance of network connected microcontrollers in seemingly every device, should raise hackles on every neck. Look at Dan Farmer’s work on IPMI, the so-called Intelligent Platform Management Interface, if you need convincing.[DF] The last sentence before the conclusion of his paper reads “[IPMI] was designed for full control, remote management, and monitoring, and it’s pretty damn good at it.” Farmer tells you, in several ways, that that very fact is why you are hosed.
This is one of my key points for today — that an advanced persistent threat, one that is difficult to discover, difficult to remove, and difficult to attribute, is easier in a low-end monoculture, easier in an environment where much of the computing is done by devices that are deaf and mute once installed or where those devices operate at the very bottom of the software stack, where those devices bring no relevant societal risk by their onesies and twosies, but do bring relevant societal risk at today’s extant scales much less the scales coming soon. As Dave Aitel has put it many, many times, for the exploit writer the hardest part by far is test, not coding.[DA] Put differently, over the years I’ve modified my thinking on monoculture such that I now view monoculture not as an initiator of attack but as a potentiator, not as an oncogene but as angiogenesis.
Fifteen years ago, Laszlo Barabasi argued why it is not possible to design a network that is at once proof against both random faults and targeted faults.[LB] Assuming that his conception of a scale-free network is good enough for our planning purposes, we see that today we have a network that is pretty well immune to failure from random faults but which is hardly immune to targeted faults. Ten years ago, Sean Gorman’s simulations showed a sharp increase in network-wide susceptibility to cascade failure when a single exploitable flaw reached 43% prevalence.[SG] We are way above that 43% threshold in many, many areas, most of them built-in, unseen, silent. Five years ago, Kelly Ziegler calculated that patching a fully deployed Smart Grid would take an entire year to complete, largely because of the size of the per-node firmware relative to the available powerline bandwidth.[KZ] How might we extrapolate from these various researchers’ findings?
The root source of risk is dependence, especially dependence on the expectation of stable system state. Dependence is not only individual but mutual, not only am I dependent or not but rather a continuous scale asking whether we are dependent or not; we are, and it is called interdependence. Interdependence is transitive, hence the risk that flows from interdependence is transitive, i.e., if you depend on the digital world and I depend on you, then I, too, am at risk from failures in the digital world. If individual dependencies were only static, they would be evaluable, but we regularly and quickly expand our dependence on new things, and that added dependence matters because we each and severally add risk to our portfolio by way of dependence on things for which their very newness makes risk estimation, and thus risk management, neither predictable nor perhaps even estimable. Interdependence within society is today absolutely centered on the Internet beyond all other dependencies excepting climate, and the Internet has a time constant five orders of magnitude smaller.
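The transitivity claim can be made concrete as reachability in a dependency graph: whatever you depend on, directly or through others, is a source of risk to you. A minimal sketch, with a made-up three-node graph standing in for "I depend on you, you depend on the digital world":

```python
from collections import deque
from typing import Dict, List, Set


def transitive_dependencies(depends_on: Dict[str, List[str]], start: str) -> Set[str]:
    """Return everything `start` depends on, directly or indirectly."""
    seen: Set[str] = set()
    queue = deque(depends_on.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen or node == start:
            continue  # already counted, or a cycle back to the start
        seen.add(node)
        queue.extend(depends_on.get(node, []))
    return seen


# Hypothetical graph: names are illustrative only.
graph = {
    "me": ["you"],
    "you": ["digital world"],
    "digital world": [],
}

print(transitive_dependencies(graph, "me"))
```

Here "me" never touches the digital world directly, yet it appears in the result, which is the sense in which risk flows along interdependence.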
The Gordian Knot of such tradeoffs — our tradeoffs — is this: As society becomes more technologic, even the mundane comes to depend on distant digital perfection. Our food pipeline contains less than a week’s supply, just to take one example, and that pipeline depends on digital services for everything from GPS driven tractors to drone-surveilled irrigators to robot vegetable sorting machinery to coast-to-coast logistics to RFID-tagged livestock. Is all the technologic dependency, and the data that fuels it, making us more resilient or more fragile? Does it matter that expansion of dependence is where legacy comes from? Is it essential to retain manual means for doing things so that we don’t have to reinvent them under time pressure?
Mitja Kolsek suggests that the way to think about the execution space on the web today is that the client has become the server’s server.[MK] You are expected to intake what amount to Remote Procedure Calls (RPCs) from everywhere and everyone. You are supposed to believe that trust is transitive but risk is not. That is what Javascript does. That is what Flash does. That is what HTML5 does. That is what every embedded Browser Helper Object (BHO) does. How do you think that embedded devices work? As someone who refuses Javascript, I can tell you that the World Wide Web is rapidly shrinking because I choose to not be the server’s server, because I choose to not accept remote procedure calls.
As they say on Marketplace, let’s do the numbers: The HTTP Archive says that the average web page today makes out-references to 16 different domains as well as making 17 Javascript requests per page, and the Javascript byte count is five times the HTML byte count.[HT] A lot of that Javascript is about analytics which is to say surveillance of the user “experience” (and we’re not even talking about getting your visitors to unknowingly mine Bitcoin for you by adding Javascript to your website that does exactly that.[BJ])
To return to the question of whether immortal embedded systems are angelic or demonic, I ask you the most fundamental design question: should or should not an embedded system have a remote management interface? If it does not, then a late discovered flaw cannot be fixed without visiting all the embedded systems — which is likely to be infeasible because some you will be unable to find, some will be where you cannot again go, and there will be too many of them in any case. If it does have a remote management interface, the opponent of skill will focus on that and, once a break is achieved, will use those self-same management functions to ensure that not only does he retain control over the long interval but, as well, you will be unlikely to know that he is there.
Perhaps what is needed is for embedded systems to be more like humans, and I most assuredly do not mean artificially intelligent. By “more like humans” I mean this: Embedded systems, if having no remote management interface and thus out of reach, are a life form and as the purpose of life is to end, an embedded system without a remote management interface must be so designed as to be certain to die no later than some fixed time. Conversely, an embedded system with a remote management interface must be sufficiently self-protecting that it is capable of refusing a command. Inevitable death and purposive resistance are two aspects of the human condition we need to replicate, not somehow imagine that to overcome them is to improve the future.
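The two design rules can be sketched in a few lines. This is an illustration only, under loud assumptions: the class names, the allow-list, and the choice of an HMAC tag as the basis for refusing a command are all hypothetical, not anything the speech prescribes.

```python
import hashlib
import hmac


class UnmanagedDevice:
    """No remote interface, so it is built to stop by a fixed deadline."""

    def __init__(self, born_at: float, lifetime_s: float):
        self.expires_at = born_at + lifetime_s

    def may_operate(self, now: float) -> bool:
        # Inevitable death: refuse to run once the deadline has passed.
        return now < self.expires_at


class ManagedDevice:
    """Has a remote interface, so it must be able to refuse a command."""

    ALLOWED = {"status", "update"}  # hypothetical command set

    def __init__(self, key: bytes):
        self._key = key

    def handle(self, command: str, tag: bytes) -> str:
        expected = hmac.new(self._key, command.encode(), hashlib.sha256).digest()
        # Purposive resistance: refuse unauthenticated or unexpected commands.
        if not hmac.compare_digest(expected, tag):
            return "refused: bad authentication"
        if command not in self.ALLOWED:
            return "refused: command not permitted"
        return f"accepted: {command}"


def sign(key: bytes, command: str) -> bytes:
    """Helper for the example: compute the tag a legitimate controller would send."""
    return hmac.new(key, command.encode(), hashlib.sha256).digest()
```

Note that the second refusal applies even to a correctly authenticated controller, which matters if a previously good controller has gone bad.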
This is perhaps the core of my thesis, that when sentience is available, automation will increase risk whereas when sentience is not available, automation can reduce risk. Note the parsing here, that replacing available sentience with something that is not sentient *will* increase risk but that substituting automation for whatever you have absent sentience *can* make things better. It won’t do so necessarily, but it can. This devolves to a question of what do I mean when I say “sentience is available” and that devolves to some combination of finance and public policy, which is to say the art of the possible both economically and politically. The future, obviously enough, will not be so simple, nor am I making it out to be.
Lest some of you think this is all so much picayune, tendentious, academic perfectionist posturing, here is how to deny the Internet to a large fraction of its users. There are better methods, there are more insidious methods, there are darker paths. My apologies to those of you who are aware of what I am about to describe, but this one example of many is known to several of us, known in the here and now: Home routers have drivers and operating systems that are binary blobs amounting to snapshots of the state of Linux plus the lowest end commodity chips that were extant at the time of the router’s design. Linux has moved on. Device drivers have moved on. Samba has moved on. Chipsets have moved on. But what is sold at Best Buy or the like is remarkably cheap and remarkably old. At the chip level, there are only three major manufacturers, so Gorman’s 43% threshold is surpassed. With certainty born of long engineering experience, I assert that those manufacturers can no longer build their deployed software blobs from source. If, as my colleague Jim Gettys has laboriously measured, the average age of the code base on those ubiquitous low-end routers is 4-5 years,[JG] then you can be assured that the CVE catalog lists numerous methods of attacking those operating systems and device drivers remotely.[CV] If I can commandeer them remotely, then I can build a botnet that is on the *outside* of the home network. It need not ever put a single packet through the firewall, it need never be detectible by any means whatsoever from the interior of the network it serves, but it is most assuredly a latent weapon, one that can be staged to whatever level of prevalence I desire before I ask it to do more. 
All I need is to include in my exploit a way to signal that device to do three things: stop processing anything it henceforth receives, start flooding the network with a broadcast signal that causes other peers to do the same, and zero the on-board firmware thus preventing reboot for all time. Now the only way to recover is to unplug all the devices, throw them in the dumpster, and install new ones — but aren’t the new ones likely to have the same kind of vulnerability spectrum in CVE that made this possible in the first place? Of course they do, so this is not a quick trip to the big box store but rather flushing the entire design space and pipeline inventory of every maker of home routers.
About now you may ask if it isn’t a contradiction to imagine embedded devices that have no management interface for you but are somehow something that can be managed by various clowns. The answer is “No, it is not a contradiction.” As everyone here knows, an essential part of software analysis is fuzzing, piping unusual input to the program for the purpose of testing.[UW] But that is only testing; I refer you instead to the very important work now appearing under the title “language-theoretic security.”[LS] Let me quote just two paragraphs:
The Language-theoretic approach (LANGSEC) regards the Internet insecurity epidemic as a consequence of ad hoc programming of input handling at all layers of network stacks, and in other kinds of software stacks. LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.
When input handling is done in ad hoc way, the de facto recognizer, i.e., the input recognition and validation code ends up scattered throughout the program, does not match the programmers’ assumptions about safety and validity of data, and thus provides ample opportunities for exploitation. Moreover, for complex input languages the problem of full recognition of valid or expected inputs may be [formally] UNDECIDABLE, in which case no amount of input-checking code or testing will suffice to secure the program. Many popular protocols and formats fell into this trap, the empirical fact with which security practitioners are all too familiar.
And that is really and truly the point. The so-called “weird machines” that result from maliciously well chosen input are the machines that, regardless of whether there is a management interface as such, allow the target to be controlled by the attacker. The Dartmouth group has now shown numerous examples of such weird machines in practice, including a 2013 USENIX paper[JB] which begins:
We demonstrate a Turing-complete execution environment driven solely by the IA32 architecture’s interrupt handling and memory translation tables, in which the processor is trapped in a series of page faults and double faults, without ever successfully dispatching any instructions. The “hard-wired” logic of handling these faults is used to perform arithmetic and logic primitives, as well as memory reads and writes. This mechanism can also perform branches and loops…
Therefore, we now see that devices that have no management interface cannot be repaired by their makers but they can be commandeered by others if enough skill is brought to bear. Devices that do have a management interface are better off, but only if they protect that interface at all costs. Because the near entirety of commercial Internet usage beyond HTML v4 relies upon Turing-complete languages, the security of these services cannot be proven because to do so would be to solve the halting problem. When weird machine style attacks begin to involve devices that do not have a human user who might be coherent enough to notice that something is amiss, they will proceed in stealth. There is not even a guarantee that their maker knows with precision what went into any one of them after the model year is over. The longer lived the devices really are, the surer it will be that they will be hijacked within their lifetime. Their manufacturers may die before they do, a kind of unwanted legacy much akin to space junk and Superfund sites. BBC science reporting has already said much the same thing.[BB]
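To make the LANGSEC prescription quoted earlier concrete: the input is defined as a formal language, and a single recognizer accepts or rejects the whole message before any other code touches it. A minimal sketch, assuming a made-up message format of the form key=value (lowercase key of up to 16 letters, value of up to 5 digits). That format is a regular language, so a regular expression with no backreferences is a recognizer of matching power.

```python
import re
from typing import Optional, Tuple

# Hypothetical message grammar: KEY "=" VALUE, nothing else allowed.
_MESSAGE = re.compile(r"(?P<key>[a-z]{1,16})=(?P<value>[0-9]{1,5})")


def recognize(raw: bytes) -> Optional[Tuple[str, int]]:
    """Accept the input only if the entire message is in the language."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return None
    match = _MESSAGE.fullmatch(text)  # whole input must match, not a prefix
    if match is None:
        return None
    return match.group("key"), int(match.group("value"))


print(recognize(b"temp=42"))
print(recognize(b"temp=42; reboot"))
```

The second call is rejected outright rather than partially processed, so no downstream code ever sees input outside the language. The sketch says nothing about richer protocols, where, as the quoted passage warns, full recognition may not be decidable at all.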
As Daniel Bilar showed in his analysis of Conficker,[DB] “attackers and defenders each present moving targets to the other,” that is to say that oscillating advantage is to be expected just as in Nature’s predator/prey dynamics or in game theory. Why? Because a sentient opponent does whatever he can to exploit your code by way of exploiting the assumptions on which your code is built. Sandy Clark showed that if software security is your goal, then “software re-use is more harmful to software security than beneficial.” Why? Because a sentient opponent first has to learn how your code works and you help him by re-using components.[SC1][SC2] In short, is it time to give up on software security or to double down the way the LANGSEC group shows us? Do we need more evidence than LANGSEC, Bilar, and Clark, with their collaborators, have given us? Is it time finally to accept Ken Thompson’s seminal observation that you can only trust a program you wrote entirely and to act accordingly?[KT]
To be brusquely clear, contemporaneously with writing this talk about the future, these very questions about the future may have appeared. We do not know, but the worm called TheMoon that is now working its way through the world’s Linksys routers may be precisely what I have described.[TM] It may be that. It may be not that the forest could burn, but that it is already afire. It may be that we are one event away from not being able to disambiguate hostile action from an industrial accident. That matters a lot.
I don’t expect any of my analysis to change the course of the world, the market, or Capitol Hill. Therefore, let me give my core prediction for advanced persistent threat: In a world of rising interdependence, APT will not be about the big-ass machines; it will be about the little. It will not go against devices with a hostname and a console; it will go against the ones you didn’t even know about. It will not be something you can fix for any of the usual senses of the English word “fix;” it will be avoidable only by damping dependence. It cannot and will not be damped by some laying on of supply chain regulations. You are Gulliver; they are the Lilliputians.
My personal definition of a state of security is “The absence of unmitigatable surprise.” My personal choice for the pinnacle goal of security engineering is “No silent failure.” You, for all values of “you,” need not adopt those, but I rather imagine you will find that in an Internet of More Things Than You Can Imagine an ounce of prevention will be worth way, way more than a pound of cure. We have very little time left — the low-end machines of four years from now are already being deployed. As Omar Khayyam put it a thousand years ago,
The Moving Finger writes: and, having writ,
Moves on: nor all thy Piety nor Wit
Shall lure it back to cancel half a Line,
Nor all thy Tears wash out a Word of it.
There is never enough time. Thank you for yours. /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
[DG] Geer D, “Advanced Persistent Threat,” Computerworld, April 2010, www.computerworld.com/s/article/9175363/Advanced_persistent_threat
[MH] www.martinhilbert.net/WorldInfoCapacityPPT.html (reflecting Hilbert & Lopez, Science:v332/n6025/p60-65) extrapolated to 2014 with concurrence of its author
[PG] Gutmann P, U Auckland, personal communication
[GN] “Slowing Moore’s Law”
[SF] Forrest S, Somayaji, & Ackley, “Building Diverse Computer Systems,” HotOS-VI, 1997
[NT] Taleb NN, _Fooled By Randomness_, Random House, 2001
[DF] Farmer D, “IPMI: Freight Train to Hell v2.01,” 2013
[DA] Aitel D, CTO, Immunity, Miami, personal communication
[LB] Barabasi L & Albert R, “Emergence of scaling in random networks,” Science, v286 p509-512, October 1999
[SG] Gorman S, et al., “The Effect of Technology Monocultures on Critical Infrastructure,” 2004
[KZ] Ziegler K, “The Future of Keeping the Lights On,” USENIX, 2010
[MK] Kolsek M, ACROS, Slovenia, personal communication
[HT] Trends, HTTP Archive
[BJ] Bitcoin Miner for Websites
[JG] Gettys J, former VP Software, One Laptop Per Child, personal communication
[CV] Common Vulnerabilities and Exposures
[UW] Source concepts at U Wisconsin
[LS] The View from the Tower of Babel, langsec.org
[JB] Bangert J, et al., “The Page-Fault Weird Machine: Lessons in Instruction-less Computation,” USENIX, 2013
[BB] “Internet of Things: The ghosts that haunt the machine”
[KT] Thompson K, “On Trusting Trust,” CACM, August 1984
[DB] Bilar D, et al., “Adversarial Dynamics: The Conficker Case Study,” Springer, 2013
[TM] “Linksys Worm “TheMoon” Summary: What we know so far,” 27 Mar 14
[SC1] Clark S, et al., “The Honeymoon Effect and the Role of Legacy Code in Zero-Day Vulnerabilities,” ACSAC, 2010
[SC2] Clark S, et al., “Moving Target: An Empirical Study of the Security Properties of Rapid-Release Cycles”