Multiscale networking, robustness, and rigor

John Doyle

Control and Dynamical Systems

Caltech

 

The twentieth century may be viewed as bringing near closure to the first scientific "revolution," which aimed for a simple, certain, reproducible view of nature, in part by a radical denial of the complex and uncertain.  Quantum mechanics, relativity, the nature of the chemical bond, and the role of DNA in genetics were among the highlights of this reductionist program, as mainstream science focused overwhelmingly on characterizing the fundamental material and device properties of natural systems.  In contrast, it has provided few rigorous and predictive tools for dealing with the complexity and uncertainty of the real world outside the laboratory, particularly the complex networks that are certain to be the focus of both engineering and biology.   Unfortunately, many mainstream advocates of a "new science of complexity" have also neglected rigor and predictability, in favor of attractively vague notions of emergence and self-organization. 

The next decade and century may be the beginning of a second and far more profound scientific and technological revolution associated with networks and systems, with all their complexity.  The management of complex systems to date has depended on taming matter, energy, entropy, and information.  While each has been important throughout human history, substantial progress has resulted when scientific discoveries, based on deep principles, made possible the replacement of ad hoc and implicit treatment by systematic and explicit management.    More recently the structured and systematic management of information in networks and computers, from VLSI design to coding theory to network protocols to object-orientation, has created an astonishing explosion in the complexity of our systems.   The post-genomic era also promises to focus attention in biology on networks and systems, from gene regulation to signal transduction to neural systems to ecosystems. 

This informal and nontechnical essay is an attempt to outline the principles on which a rigorous new science of complex systems might be based, one that places convergent computation, communication, and control clearly at its center.  These principles will be explained in the context of four themes:

  1. Convergent, ubiquitous, pervasive networking
  2. Multi/scale/resolution, heterogeneity
  3. Robustness, reliability, high confidence
  4. Rigor

These four themes will be briefly reviewed, and then a selection of specific examples will be sketched to illustrate them.  We are building increasingly integrated, interconnected, and automated networks for information, energy, transportation, and business, yet we are investing relatively little in research directed at understanding how to make these systems robust and predictable.  The 20th century's understanding of chaos and undecidability, together with relativity and quantum mechanics, changed completely the 19th century's view of the universe as a clockwork mechanism, but otherwise had surprisingly little impact on the reductionist program.  If the 21st century has surprises in store beyond simply unifying existing theories or cataloging the components of cells, they are likely to be in dealing with the complexity of the networks from which we are built and are now building.

Convergent, ubiquitous, pervasive networking

Everyone is aware of the Internet’s impact on the ongoing convergence of data, voice, and video, as well as the convergence of data, commerce, manufacturing, and transportation in e-commerce.  Various deregulations of utility industries have also created growing convergence of information, financial, transportation and energy networks.  These trends are only mild precursors of the future of ubiquitous, pervasive networking where every object in our world will have the equivalent of a telephone number or an IP address.  The currently distinct networks associated with communications, computing, transportation, energy, consumer products, utilities, health, finance, and manufacturing will certainly blur into a single integrated network of networks.

The future of biology is poised to parallel that of engineering systems.  Just as hardware increasingly becomes a commodity and technological value-added occurs at the network level, so too do the component molecules of biology and the experimental techniques to explore them become commodities.  The focus is beginning to shift to understanding the vast networks that biological molecules create that regulate and control life.  We can reasonably expect that viewing biology in terms of complex feedback networks will become as essential as molecular biology has become over the last few decades.

The rest of this essay will assume that the reader is aware of the debate and discussion surrounding the future ubiquitous, pervasive, networked computing continuum in which our lives will be immersed, and is, if not alarmed, then at least concerned as to what the scientific foundation might be for such a technology.

Multi/scale/resolution, heterogeneous

The central multiscale issue is connecting microscopic device and component properties with the macroscopic behavior of networks and systems.  The study of multiscale phenomena is not new in science.  The microscopic behavior of a gas can be described in terms of the position and velocity of molecules.  Ensembles of molecules can be viewed as stochastic processes with stationary distributions.  Macroscopic behavior is typically described by thermodynamic quantities such as temperature and pressure, and statistical physics has been concerned with connecting all these different scales.  The key feature of systems in statistical physics is that the thermodynamic properties are the consequence of the generic microscopic features and thus sets of measure zero can be neglected.

The statistical physics view of multiscale phenomena has led to a corresponding view of complex systems that is typically quite different from what arises in engineering or biology. The complex systems studied in physics are typically homogeneous in their underlying physical properties, and complexity is most interesting when it is not put in by hand, but rather arises as a consequence of bifurcations or dynamical instabilities, which lead to “emergent or self-organizing” phenomena on large length scales.  Even when long-range correlations arise due to, say, critical phenomena, universality justifies representing the microscopic degrees of freedom by simple, often identical components.

The Internet is one example of a system that may superficially appear to be a candidate for the self-organizing and emergent view of complexity. It certainly appears as though new users, applications, workstations, PCs, servers, routers, and whole subnetworks can be added and the entire system naturally self-organizes into a new, robust configuration.  Furthermore, once on-line, users act as individual agents, sending and receiving messages according to their needs. There is no centralized control, and individual computers both adapt their transmission rates to the current level of congestion and recover from network failures, all without user intervention or even awareness. It is thus tempting to imagine that Internet traffic patterns can be viewed as an emergent phenomenon from a collection of independent agents who adaptively self-organize into a complex state, balanced on the edge between order and chaos.  Even the ubiquitous self-similarity of Internet statistics could be taken as a classic hallmark of criticality.

As appealing as this picture is, the reality is that modern internets use sophisticated multi-layer protocols to create the illusion of a robust and self-organizing network, despite substantial uncertainty in the user-created environment as well as the network itself. It is no accident that the Internet has such remarkable robustness properties, as the Internet protocol suite (TCP/IP) in current use was the result of decades of research into building a nationwide computer network that could survive deliberate attack. The high throughput and expandability of internets depend on these highly structured protocols, as well as the specialized hardware (servers, routers, caches, and hierarchical physical links) on which they are implemented. Yet it is an important design objective that this complexity be hidden.

The core of the Internet, the Internet Protocol (IP), presents a carefully crafted illusion of a simple (but possibly unreliable) datagram delivery service to the layer above (typically the Transmission Control Protocol, or TCP) by hiding an enormous amount of heterogeneity behind a simple, very well engineered abstraction. TCP in turn creates a carefully crafted illusion to the applications and users of a reliable and homogeneous network. The internal details are highly structured, heterogeneous, and non-generic, creating apparent simplicity, exactly the opposite from emergent complexity.
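The layering described above can be sketched in a few lines. This is only a toy illustration, not the actual TCP or IP algorithms: the class `LossyDatagramService`, the function `reliable_transfer`, the loss rate, and the stop-and-wait retransmission scheme are all assumptions made for the example. The lower layer may silently drop any packet; the upper layer hides this behind numbered messages, acknowledgments, and retransmission, so the application sees an apparently reliable channel.

```python
import random


class LossyDatagramService:
    """Hypothetical stand-in for an unreliable datagram layer."""

    def __init__(self, loss_rate, seed=0):
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.sent = 0                   # packets handed to the link

    def send(self, packet):
        """Return the packet if it is delivered, or None if it is dropped."""
        self.sent += 1
        if self.rng.random() < self.loss_rate:
            return None
        return packet


def reliable_transfer(messages, link, max_tries=1000):
    """Stop-and-wait sketch: resend each message until it is acknowledged."""
    delivered = []  # what the receiving application sees
    for seq, payload in enumerate(messages):
        for _ in range(max_tries):
            arrived = link.send((seq, payload))
            if arrived is None:
                continue  # data packet lost: time out and retransmit
            # The receiver accepts each sequence number only once, so a
            # retransmission caused by a lost acknowledgment is ignored.
            if len(delivered) == arrived[0]:
                delivered.append(arrived[1])
            ack = link.send(("ack", arrived[0]))
            if ack is not None:
                break  # acknowledgment received: move to the next message
        else:
            raise RuntimeError("link too lossy for this sketch")
    return delivered


link = LossyDatagramService(loss_rate=0.3, seed=1)
messages = ["robust", "yet", "fragile"]
print(reliable_transfer(messages, link), "using", link.sent, "packets")
```

The application receives exactly the messages that were sent, in order, while the count of packets actually placed on the link exposes the hidden cost of creating that simplicity.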

Interestingly and importantly, the increase in robustness, productivity, and throughput created by the enormous internal complexity of the Internet and other complex systems is accompanied by new hypersensitivities to perturbations the system was not designed to handle. Thus while the network is robust to even large variations in traffic, or loss of routers and lines, it has become extremely sensitive to bugs in network software, underscoring the importance of software reliability and justifying the attention given to it.  Computer viruses can now use the very applications that make the Internet so popular to propagate quickly and widely.  The infamous and perhaps overrated Y2K bug, though not necessarily a direct consequence of network connectivity, is nevertheless the best-known example of the general risks of high connectivity for high performance. There are many less well-known examples, and indeed most modern large-scale network crashes can be traced to software problems, as can the failures of many systems and projects (e.g. the Ariane 5 crash or the Denver Airport Baggage handling system fiasco).

Robustness, reliability, high confidence

The success of the Internet is greatly due to the extreme emphasis that was placed on robustness in its design.  Robustness and uncertainty management is beginning to replace information, entropy, energy, and materials as the dominant issue in complex systems of all types, including especially biology and software.  It now demands the same structured and systematic mathematical and computational approach that has proven so successful with matter, energy, entropy, and information.   However, just as entropy to a 17th century scientist or information theory or quantum mechanics to an 18th century one would have seemed arcane and mystical, so too does robustness as a specific quantity to be managed in explicit and systematic ways appear to many 20th century scientists.  But one need only look at the complex systems around us to see that robustness has become the dominant issue.

The Boeing 777 has millions of parts, mostly rivets, but 150,000 distinct subsystems, many of which are themselves highly complex components, some with millions of subcomponents, and so on.  Despite this astonishing complexity, which certainly overlaps with simple biological organisms in terms of part count, most deaths in commercial aviation are now due not to vehicle malfunctions, but to higher-level system failures, terrorist attacks, or pilot or air traffic controller error.  What's important, though, is that the overwhelming proportion of the millions of parts in a modern commercial aircraft or the thousands of genes in biological organisms is there purely for robustness and uncertainty management.  In both cases the increasing complexity has produced a net improvement in robustness, provided the environment for which the system was designed doesn't change too dramatically.

Our energy, information, and transportation networks are even more complex than a Boeing 777 and have not yet reached the same level of reliability.  Electric power systems are reliable enough that we generally take them for granted until a multimillion customer, multibillion-dollar outage reminds us that this is naïve.  Other dramatic failures in power, communication, transportation, and space systems, ecological problems, and the presence of auto-immune diseases and cancer remind us that this robustness cannot be taken for granted.  Major success stories, such as the Internet, VLSI design, and the Boeing 777, have been the result of highly structured and systematic processes, with an almost obsessive attention to robustness.  These experiences are suggestive and the details are often poorly understood, but they only hint at the kind of highly heterogeneous, nonlinear, dynamic, interconnected, integrated networks that are coming.  A robustness perspective is particularly important in this era of exploding technological complexity and the "better, cheaper, faster" world of virtual prototyping.  And any future golden age of biology is certain to depend as much on understanding systems as on understanding their components.

The highly differential robustness of both biological and engineering systems is not accidental, but is one example of an inherent property of interconnected complex systems.  By differential robustness we mean that insensitivity to certain uncertainties, hopefully the most likely, is obtained at the expense of increased sensitivity to other uncertainties, hopefully unlikely.  Thus designers of modern high performance systems must accept hypersensitivity to the extremely unlikely in return for insensitivity to common uncertainties, and this tradeoff drives the introduction of increasing complexity.  Although the subject is in its infancy, there are even conservation laws associated with robustness and uncertainty management in complex systems that are becoming more important than more familiar limitations due to conservation of mass and energy, speed of light, quanta, entropy, information, and computation. Indeed, many complex systems concepts implicitly assume the equivalent of a perpetual motion machine, but now with respect to robustness and information, rather than energy.

Complex systems engineering, and in particular robustness and uncertainty management, is becoming the central issue throughout technology.  But it is in the biological sciences that a theory of uncertainty management promises to have even more revolutionary impact, and there is a need for the creation of an entirely new discipline focused on theory and its application to real complex technological and biological systems. Example applications include control of cellular processes through signal transduction and cell surface receptors, and the interaction of mechanics and biochemistry in developmental dynamics and in the movement of macromolecules, cells, and tissues.  Engineering applications include robustness and reliability in complex energy, information, and transportation networks.  A central theme will be a theoretical foundation for the "better, cheaper, faster" movement in engineering design.

In both biology and engineering, simple components must be put together to create complex systems, and uncertainty at the component level can have highly unpredictable consequences at the system level.  Theorists in a variety of disciplines are beginning to develop sophisticated tools to address robustness and complexity, but these tools currently exist in too fragmented a form within narrow technical disciplines such as controls, dynamical systems, computation, communications, and statistics.  One essential goal of a research program in complex systems is to create a more integrated theory of robustness.

Rigor

Rigor will be an essential feature of any practical theory of complex networks.  The essence of rigor is having a clear picture of what is known and not known, what is proven or not proven, for what there is strong evidence, and what is conjecture.  Much of the recent mainstream scientific literature in complex systems has been marked by a serious lack of rigor in all these senses.  The origins of our current situation can be traced to longstanding differences in styles between scientists, mathematicians, and engineers, to which has been added the unique style of the hacker culture, which has dominated much of computing.

Mathematicians have long viewed physicists as sloppy and cavalier, while physicists retort that mathematics’ emphasis on rigor would stifle physicists’ creativity.  Both points of view have merit, and it is natural that these two communities have developed such different styles.  Physicists are seeking to explain the basic workings of the simplest aspects of the universe and ultimately have experiments to sort out good ideas from bad.  The theoretical physics community rewards bold ideas that are ultimately proven correct experimentally, and there is little or no consequence for being wrong.

As mathematics has become increasingly abstract, physical intuition, let alone experiment, has been lost as a means of sorting out true from false assertions.  Mathematics has thus come to rely on a theorem/proof style intended to allow the math community to have as much rigor as is possible without experimental verification.  Working mathematicians share a remarkably uniform view of what constitutes a rigorous proof, yet the issue is not without controversy, and the arguments are not easily explained to nonexperts.

Engineers who deal with complex systems are often closer in spirit to mathematicians than to physicists, because the consequences of believing and implementing a bold and provocative but wrong conjecture can be disastrous.  The exception to this is the “hacker culture” which has dominated much of software development.  The essence of this culture is an extreme emphasis on never letting perfection stand in the way of good enough, or the next software release, and an emphasis on features over robustness and reliability.

The “mainstream” complex systems literature that appears in such journals as Science and Nature is largely dominated by the physicist and to some extent hacker cultures, while the more mathematical and engineering approaches are found in more specialized journals or conferences.  This creates the ironic situation where a paper on, say, the Internet will appear in Phys. Rev. Letters, or even Science or Nature, that could not be published in a conference on the Internet.  The authors will argue that the so-called experts in Internetworking are conservative and resistant to new ideas, while the latter will typically argue that the research is obviously wrong.

There are positive aspects to all these cultures that must be blended together properly to create a culture that is appropriate for a research community in complex networks.  There is no doubt that we need the development of entirely new mathematics, and for the most part, the mathematician’s style will be needed.  More importantly, complex networks will have little that is directly accessible to either physical intuition or controlled experiments, except on their component parts.  Thus one motivation for rigor mirrors the mathematician’s: it provides the only hope of understanding in the absence of physical insight or experimental verification.  This does not mean that developing appropriate intuition is not essential as well, just that it will be more like that of the pure mathematician than the physicist.

The hacker’s commitment to taking complexity head on will be valuable.  If theory is to make any contribution to the practical needs of engineering and biological networks, it cannot be done in the context of purely academic examples.  The systems must work in the real world.  Ultimately, however, we need to have the engineer’s perspective that systems must work reliably in the real world, not just in the lab or the demo.  And finally, the physicist’s desire for the simplest possible story will be needed if we are to develop deeper intuition about the intrinsic nature of complexity, and communicate the important ideas to a broad community.

One of the most important distinctions between the physics culture and those of the engineer or hacker, or even mathematician, is that the former is concerned with understanding the basic material and component properties of a given universe, while the latter are all concerned with the products of design.  This is so completely taken for granted that its implications are often ignored.

 

Illustrative examples

The remainder of this paper will discuss general issues associated with robustness and uncertainty in complex systems.  The examples that follow, air bags, compact disk players, Mars Pathfinder, and Formula 1 racing, are used to illustrate some of the key issues in complex networks, but in simple and familiar settings.  Admittedly, these are "toy" examples when compared with the complex networks that exist and are planned, but it is exactly for that reason that they are pedagogically useful.

Often robustness is obtained at the expense of major compromises in efficiency. Digital systems generally are the most extreme example, but examples such as the Internet, VLSI design, and Mars Pathfinder are also success stories where robustness was given priority. They are also the results of highly structured and systematic processes. These can be contrasted with the abandoned upgrades to the air traffic control system and IRS software and the problems with the Denver airport baggage handling system.  While the failures of Challenger, Galileo, TWA 800, the Ariane 5 and the recent outages of power, telephone, and satellite networks can be traced to specific subsystems, it was the catastrophic cascading of tiny failures into system failures that is the most striking feature of these events.  Complex interconnected networks will have numerous component failures, but the network must be robust enough that such failures remain isolated.

The most telling examples are often the most mundane. For example, estimates of the corporate cost of operating a single PC are around $24K per year.  Hardware is only $1K of this, and the low cost and high reliability of computer hardware is a consequence of highly structured and systematic design procedures.  Software, networking infrastructure and tech support cost a few thousand dollars more.  Most of the total cost is associated with informal "futzing" because of poor software robustness.  This futzing is not an official part of information technology budgets, but represents the dominant cost of computing.  Neither faster hardware nor new software features, which are the focus of most attention, would affect this cost, which is due entirely to the gap between the system's behavior in a demo and its behavior in real use.  This example, like the Y2K problem and others, also reminds us that we should not look to information technology alone as a solution to our problems of robustness of complex systems.

Complexity is driven by robustness

To help visualize the idea that complexity is driven by robustness, we can do a simple thought experiment. Take any familiar complex system, say the 777 (millions of parts), a laptop computer (billions of transistors), or a biological organism (thousands to millions of genes).   List both the primary design goal of the system (e.g. to deliver passengers, computation, or to reproduce its genome) and the sources of uncertainty that the system must be robust to.  For the 777, some uncertainties are flight timing, weather, routing, other traffic, turbulence in the boundary layer, payload size and location, uncertainty in components due to manufacturing and aging, and so on. The organism faces an even longer list of uncertainties.  For the laptop, the dominant uncertainty is that the computation to be done is not fixed and known in advance.  Otherwise, computers, and digital systems more generally, are unique in being perfectly repeatable.

Now imagine an idealized laboratory setting in which uncertainty is greatly reduced or eliminated.  For the 777 this would be virtually impossible to actually do, but we can imagine it in principle. For the case of the idealized 777, a working vehicle could probably be built with a few thousand subsystems, rather than 150,000.  For the laptop and the organism, creating an idealized laboratory environment is not only possible, but wouldn't differ much from the type of laboratory experiments that standard reductionist science demands.  The laptop would have exactly one simple computational problem, say producing digits of π, and perfect sources of power and a benign environment.  The organism would have a steady source of nutrients and constant environmental conditions.  How many parts would be needed?  The laptop only required to compute digits of π could be organized with an entirely different architecture, and if speed were an issue, then a purely digital design would no longer be optimal.  Digital systems accept tremendous performance penalties in exchange for robustness, and the complexity of our idealized laptop could be drastically reduced if robustness ceased to be an issue.

In biology, there actually exists a free-living organism, Mycoplasma, with about 500 genes (humans have about 30,000 genes), so that gives an upper bound on required organism complexity.  Researchers are working to knock out genes to lower this upper bound.  It is interesting to note more generally how frequently gene knockouts that are expected to be lethal yield either no apparent phenotype or a phenotype with completely unexpected features.  Often the "no phenotype" cases are later found to have yielded an organism that lacks robustness to uncertainties that don't arise in a typical laboratory environment.  Biologists often try to conceptualize this in terms of historicity, accident, and redundancy, when they really are much more subtle examples of highly evolved robust systems design. 

Air bags

A familiar example of uncertainty management is automobile airbags. Airbags provide reduced vulnerability to some uncertainties in the environment, such as head-on high-speed collisions with drunk drivers. In return, there is both an increased cost and an increased vulnerability to component failures. For example, a sensor failure leading to spurious deployment could cause injury even in a parked car. For this reason, great care was taken in designing the crash sensors and a diagnostic unit to monitor the operational effectiveness of the airbag, so that such possibilities are minimized.

Even without component failures, air bags can make certain circumstances more dangerous, such as low speed collisions with small passengers. There is a great net reduction in injuries and fatalities, but increased danger in certain unusual circumstances. The next generation of "smart" air bags will include additional sensors for even greater complexity, and hopefully, better robustness. It is interesting to note that experts estimate that an airbag is approximately equal in effectiveness to an additional 300 pounds of metal in the front of the car. Thus a larger vehicle would seem to avoid the tradeoffs of an air bag, but actually just shifts them elsewhere into a greater threat to other vehicles and increased pollution.

Airbags illustrate an important general principle of feedback systems: providing robustness somewhere means increased sensitivity somewhere else. Indeed, it can be proven that there are mathematical formulations of "conservation principles" that make precise this "waterbed effect" in certain cases.  Like energy and entropy, attempts to violate these robustness principles are constantly proposed and fail miserably. But also like energy and entropy, an understanding of the principles allows us to move conserved quantities around to our advantage. These robustness principles are so important to understanding complex systems, and so completely misunderstood that a major priority is to articulate this theory to a broader audience and expand the research effort in this area.

Informally, though, the dominant tradeoffs involved in the design of air bags involve risk and cost. Greater robustness to uncertainty not only costs money but actually increases sensitivity to component failures and certain, hopefully rare, conditions. To perform these tradeoffs requires explicit models of uncertainty and an understanding of feedback. Increasing complexity means that evaluating this tradeoff can be conceptually and computationally overwhelming without a sophisticated supporting technology.
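The "waterbed effect" mentioned above can be checked numerically in one simple case. The sketch below assumes that a classical result of this type, Bode's sensitivity integral, is a fair illustration; the essay does not name a specific theorem. For a stable feedback loop whose open-loop gain rolls off fast enough, the integral of ln|S(jw)| over all frequencies is zero, where S = 1/(1 + L) is the sensitivity function. Any band where disturbances are attenuated (|S| < 1) must therefore be paid for by a band where they are amplified (|S| > 1). The loop L(s) = k/((s + 1)(s + 2)), the gain k = 10, and the function names are hypothetical choices for this example.

```python
import math


def sensitivity_mag(w, k=10.0):
    """|S(jw)| for the assumed loop L(s) = k / ((s + 1)(s + 2)), S = 1 / (1 + L)."""
    s = complex(0.0, w)
    loop = k / ((s + 1.0) * (s + 2.0))
    return abs(1.0 / (1.0 + loop))


def log_sensitivity_area(n=20000, k=10.0):
    """Midpoint-rule estimate of the integral of ln|S(jw)| for w from 0 to infinity.

    The substitution w = tan(theta) maps the infinite frequency range
    onto the finite interval (0, pi/2), with dw = d(theta) / cos(theta)^2.
    """
    h = (math.pi / 2.0) / n
    total = 0.0
    for i in range(n):
        theta = (i + 0.5) * h
        w = math.tan(theta)
        total += math.log(sensitivity_mag(w, k)) / math.cos(theta) ** 2 * h
    return total


print("sensitivity at w = 0:       ", sensitivity_mag(0.0))
print("sensitivity at w = sqrt(12):", sensitivity_mag(math.sqrt(12.0)))
print("area under ln|S|:           ", log_sensitivity_area())
```

At low frequency the feedback suppresses disturbances, near w = sqrt(12) it amplifies them, and the two areas cancel to within numerical error. Raising the hypothetical gain k deepens the low-frequency attenuation and raises the peak, but does not change the total.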

Compact disk players

Another familiar example of an engineering system is a portable CD player, which gives high-fidelity music reproduction that is linear across large dynamic ranges in both the music volume and frequency content, despite large variations in ambient temperature and humidity, and accelerations due to listener motion. A CD player connected with appropriate electronics can amplify microscopic optical variations on the CD into sound levels that can cause physiological damage to an auditorium audience, and even CD players from different manufacturers give indistinguishable performance. CD players are not typically robust with respect to, say, being crushed by a large weight or immersed in water, although presumably these could be designed for, at additional expense. Our transportation, energy, information, and communication systems also exhibit similar robustness properties, and dramatic but rare failures remind us that this should not be taken for granted.

Complexity is another striking feature of engineering and biological systems. We now routinely build multimillion component systems on scales from microchips to airliners to global networks, and components themselves often have many millions of subcomponents, and so on. Biological systems are obviously extremely complex in this sense. An obvious question is whether the robustness of these "systems of systems" is simply a direct consequence of carefully manufacturing extremely robust components at every level. Have biological systems evolved in such a way that they have finely tuned internal dynamics that are themselves robust to environmental and intraorganism uncertainties? Similarly, are the electromechanical components of the CD player manufactured in such a way that they have the same robustness properties as the entire CD player system?

The advantage for the CD designer in using "perfect" components is obvious, as it would make the design process much easier. Imagine that we collected all the components of a CD player and considered all the possible interconnections of these components. There is a combinatorially huge number of possible designs given a set of components, and only a tiny fraction of interconnections will actually yield a working CD player, even with ideal components. Even more, an almost vanishingly small fraction (a set of measure zero in the thermodynamic limit of a large number of parts) of these potentially viable designs will work if there are nontrivial uncertainties in the components, which creates a daunting design challenge. It may then seem surprising to find that, in fact, CD players have large component uncertainties. The electromechanical parts are highly nonlinear, have limited dynamic range, and have large variation with signal amplitude and frequency, as well as temperature and vibration. Other systems have similar properties.

Thus CD players exhibit robustness to two kinds of uncertainty, those in the environment (temperature variations, vibration and movement, frequency and amplitude variations in the music) and those in the components from which they are built (variations in electromechanical components due to environment and manufacturing tolerances). It is obvious from its external behavior that a CD player has the first kind of robustness, but different functionally equivalent designs could have widely varying robustness of the second kind, and thus have widely varying costs of manufacturing. While it is useful pedagogically to distinguish environmental and component uncertainties, systems must be robust to both and the distinction is a little artificial.

The specific robust design of the interconnection is what allows the system to be much more robust than its components. Cheap parts are highly uncertain, so affordable manufacturability dictates robust design, and design of complex systems is thus effectively dominated by the tradeoffs of uncertainty management. The level and nature of robustness is a critical design choice. While the CD player is highly robust to both environmental and manufacturing uncertainties, the additional level of robustness necessary to protect against deliberate, malicious attack would be prohibitively expensive. It would typically be much cheaper to simply have extra CD players as backup, unless malicious attack was an important and common feature of the environment.

Another issue that is somewhat subtler but turns out to be mathematically more tractable is the issue of frequency and amplitude response of the CD player.  Since human hearing has a limited range of sensitivity and our music has a limited range of content, in both frequency and amplitude, CD technology is tuned to those ranges.  It actually performs rather badly outside those ranges, introducing large distortions that we would not normally be aware of because we can't hear them (and there is no "music" there anyway).  This is an excellent example of tuning a design to be insensitive to specific uncertainties, in this case variations in frequency and amplitude within the limited range, while accepting high sensitivity to uncertainties outside this range.  Indeed, for some signals, higher fidelity reproduction would be obtained by turning off the CD player, as the error it makes is larger than the signal itself.

Mars Pathfinder

The Mars Pathfinder mission offers a convenient example of uncertainty management and the use of virtuality in design. The highest risk phase of the mission was the descent and landing on the surface of Mars. Without going into excessive detail we can sketch a few of the relevant issues. This "cartoon" will be instructive, grossly oversimplified, but not essentially wrong. For more information, see the Mars Pathfinder website, or particularly the entry, descent and landing (EDL) website.

The problem boiled down to one of uncertainty management. If the mission itself were repeated a large number of times, the outcomes (in this case, impact velocities) would likely vary by several meters/second across the missions. The systems engineer's job is to estimate this distribution accurately enough to allow cost-effective design trades to be made. (We'll ignore the actual problem of design for now, and focus on the issue of analyzing a given design.) Traditionally, uncertainty was avoided, rather than managed, through highly conservative "stacked margins" and exhaustive physical prototyping to make sure that the impact velocities were safely within a large margin of error. This is an effective but expensive combination that is no longer an option. The "better, cheaper, faster" imperative is primarily achieved by more careful evaluation of costs and benefits in terms of probabilities of failure, and by more use of virtual prototyping to focus (but in no way eliminate the need for) physical testing. For the rationale and implications of the JPL "better, cheaper, faster" vision, see the Develop New Products (DNP) website.
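To see why stacked margins are conservative, compare two ways of combining several error sources. The sketch below assumes that "stacking" means adding each source's worst-case (k-sigma) contribution, and that the sources are independent, so that a statistical treatment adds variances instead. The function names and sample values are hypothetical and are not taken from the Pathfinder analysis.

```python
import math

def stacked_margin(sigmas, k=3.0):
    """Worst-case stacking: add the k-sigma bound of every source."""
    return sum(k * s for s in sigmas)

def statistical_margin(sigmas, k=3.0):
    """k-sigma bound on the total, assuming independent sources
    (variances add, so standard deviations combine in quadrature)."""
    return k * math.sqrt(sum(s * s for s in sigmas))

# Hypothetical velocity-error contributions in m/s (illustrative only).
contributions = [1.0, 1.0, 1.0, 1.0]
stacked = stacked_margin(contributions)
statistical = statistical_margin(contributions)
print(f"stacked: {stacked:.1f} m/s, statistical: {statistical:.1f} m/s")
```

With four equal contributions the stacked margin is twice the statistical one, and under these assumptions the ratio grows as the square root of the number of sources.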

A high-fidelity simulation of the dynamics of the entire descent and landing was combined with selected component tests to make overall assessments of probabilities of success. The core of the simulation consisted of commercial CAD software that essentially takes assumptions and initial conditions about the vehicle and the environment and generates a single trajectory. As an outer loop, the systems engineers wrote UNIX scripts that performed repeated Monte Carlo trials, varying parameters and initial conditions, to get an estimate of the distribution of possible outcomes (again, impact velocity).
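The outer-loop structure described here can be sketched in a few lines. The single-trajectory function below is a deliberately crude stand-in for the commercial simulation: it assumes a parachute terminal velocity, an idealized braking burn, and a drag-free fall from the release height. All names, distributions, and numerical values are illustrative placeholders, not mission data.

```python
import math
import random

MARS_GRAVITY = 3.71  # m/s^2

def simulate_descent(mass, drag_area, density, rocket_dv, release_height):
    """Toy single-trajectory model; returns impact speed in m/s."""
    # Terminal velocity under the parachute (drag_area stands for Cd * A).
    terminal = math.sqrt(2.0 * mass * MARS_GRAVITY / (density * drag_area))
    # Speed left over after the braking burn (negative means moving upward).
    residual = terminal - rocket_dv
    # Energy balance for the drag-free fall from the release height.
    return math.sqrt(residual ** 2 + 2.0 * MARS_GRAVITY * release_height)

def run_trials(trials, seed=0):
    """Outer Monte Carlo loop: sample parameters, collect sorted impact speeds."""
    rng = random.Random(seed)
    speeds = []
    for _ in range(trials):
        speeds.append(simulate_descent(
            mass=rng.gauss(580.0, 5.0),            # kg (assumed)
            drag_area=rng.gauss(55.0, 3.0),        # m^2 (assumed)
            density=rng.uniform(0.012, 0.020),     # kg/m^3 (assumed)
            rocket_dv=rng.gauss(65.0, 3.0),        # m/s (assumed)
            release_height=rng.uniform(10.0, 30.0),  # m (assumed)
        ))
    speeds.sort()
    return speeds

def quantile(sorted_values, q):
    """Simple empirical quantile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

speeds = run_trials(2000)
print(f"median impact speed: {quantile(speeds, 0.5):.1f} m/s")
print(f"99th percentile:     {quantile(speeds, 0.99):.1f} m/s")
```

Comparing the median with a high quantile gives a first look at the spread of impact speeds that such an outer loop is meant to estimate.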

While the Pathfinder systems engineers' approach is a step in the right direction, and they did a truly brilliant job in pulling it off, there are serious inadequacies in their approach (that are not their fault) that are exactly the issues that should be driving research.  Currently, there are few efforts in this direction.  One obvious problem is that commercial CAD packages have limited support for explicit representations of uncertainty so that ad hoc methods such as writing UNIX script outer loops are necessary. While there is no substitute for engineering judgment in choosing the distribution of parameters, there is little in the way of software or theoretical support for the systems engineers in this task.

Another equally obvious problem is that most Monte Carlo trials were wasted on benign events, because a very large number of samples is needed to gain adequate confidence in the tails of the distributions. While there are well-known statistical methods aimed at addressing this issue, none gets around the fact that avoiding large numbers of samples requires guarantees that regions of parameter space are benign, and obtaining such guarantees is a known NP-hard problem. The high dimension of the space of uncertain yet important parameters makes exhaustive search prohibitive.
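A back-of-the-envelope calculation shows how quickly plain Monte Carlo becomes expensive in the tails. If a failure occurs with probability p per trial, the relative standard error of the estimated failure rate after N independent trials is sqrt((1 - p) / (p N)). The sketch below, with hypothetical function names, inverts that formula. It assumes independent trials and simple counting of failures, with no variance-reduction method applied.

```python
import math

def trials_needed(failure_prob, relative_error):
    """Trials required so that the estimated failure rate has the
    requested relative standard error (plain counting estimator)."""
    if not 0.0 < failure_prob < 1.0:
        raise ValueError("failure_prob must lie strictly between 0 and 1")
    if relative_error <= 0.0:
        raise ValueError("relative_error must be positive")
    return math.ceil((1.0 - failure_prob) / (failure_prob * relative_error ** 2))

def expected_failures(failure_prob, trials):
    """Expected number of trials that actually land in the failure region."""
    return failure_prob * trials

n = trials_needed(1e-3, 0.1)
print(n, "trials, of which about", round(expected_failures(1e-3, n)), "are failures")
```

For a failure probability of one in a thousand and a 10 percent relative error, this gives on the order of 100,000 trials, of which only about 100 are expected to be failures; the rest sample benign behavior.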

A subtler problem with this Monte Carlo approach is that many of the uncertainties in the CAD models are not parametric. For example, the uncertainty in high frequency flexible modes of structural members, the fine-scale behavior of fabric airbags in contact with rocks and the resulting effects on tearing, the interaction of structural and acoustic modes with combustion in the thrusters, and so on, are not explicitly represented in the CAD models (the assumptions are that these effects are neglected) so their impact is difficult to evaluate. The engineers tried to reflect these in an ad hoc way by increasing related parameter ranges, but this is a dangerous practice. Perhaps somewhat less dangerous, but still lacking in sufficient software or theoretical support, is the replacement of chaotic fine scale dynamics with stochastic approximations that can be directly included in the Monte Carlo simulations.

[Note: It would be interesting to update this with a description of the failure of two recent Mars missions.]

Complexity and Formula 1 racing

The technical issues underlying any discussion of complex systems are necessarily difficult to grasp, and no one example can illustrate more than a few points.  Formula One automobile racing is a high technology example with some features of complex systems of systems, but in a relatively simple setting.  Formula One (F1, also called Grand Prix) is the ultimate automotive event in terms of both money and technology. It is a huge spectator sport, primarily outside the US, as well as a testing ground for advanced automotive technology.  Everything from disc brakes to fully electronic fuel injection has first been tried in F1, and then later found its way into our passenger cars.   Since our passenger cars are restricted in their speed and their normal operating conditions are relatively benign, we fail to notice most of the technical changes that have taken place.  In F1 new technology immediately and dramatically translates into victories, and is thus more visible. Billions of dollars and decades of research have been devoted to refining F1 cars, and it is a domain unequalled in fostering a single-minded pursuit of technological excellence. 

In the last decade or so, something striking has happened. Passenger cars have begun to overtake F1 in technical sophistication, and it is this story that is particularly relevant to our discussion on complexity. Passenger cars have had an explosion in complexity, almost entirely in electronics, computers, and control systems. Automated and active braking, active engine and drive train control, automatic air-bag deployment, and active suspensions are common, and sophisticated drive-by-wire traction, steering control, and obstacle avoidance are in development. Our cars are now safer and much more reliable and robust, thanks to a combination of computer and control technology. We pollute less, visit the mechanic less often, and survive a greater variety of dangerous conditions, although this is largely unnoticed because it has no apparent effect on day-to-day operation.

Surprisingly, these active control systems are not used in F1.  They were introduced in a preliminary way a decade ago, with dramatic results.  Again, unlike in passenger cars, new technology in F1 is immediately apparent in performance.  It quickly became clear that actively controlling aerodynamics, steering, suspension, traction, and braking would have such a profound and revolutionary impact on F1 that the sport would be completely changed.  Furthermore, while the use of sensors and computers is ubiquitous in F1, the addition of active feedback was viewed, quite correctly, as an entirely new and distinct technology that was both enormously powerful and poorly understood outside a narrow community.  Rather than accept such a radical transformation of their sport, the ruling body of F1 simply banned all active control.  Of course, the rules don't include a specific blanket ban, but in each section of the rules dealing with engines, brakes, traction, suspension, aerodynamics, structures, etc. there is careful wording that implicitly bans active control. 

Interestingly, our national research emphasis mirrors the F1 rules. The deep research issues associated with understanding and designing robust complex systems, which are likely to be among the most fundamental questions in the coming century, are relatively neglected. As in F1, there is no explicit global ban; the emphasis on components rather than systems creates an implicit ban. F1 manufacturers can and do spend enormous sums on exotic materials, extensive wind tunnel testing, and sophisticated computer-aided design, modeling, and simulation. They can and do place sensors, transponders, and video cameras on the cars, and connect these with computers for monitoring and diagnostics. And they are allowed to provide actuation for power assist in steering and braking. All these are viewed as providing incremental and evolutionary effects. But they cannot use feedback for active control.

It is actually difficult for most people, even among scientists and engineers, to grasp the significance of what is banned in F1.  Passive control is allowed.  F1 designers can and do use a variety of surfaces and shapes to control the aerodynamics, but they are forbidden from having surfaces that move under automatic control.  They can and do use all the sensors and computers and many of the actuators necessary to implement sophisticated active control.  What is banned is closing the loop with active control.  This is such a subtle difference that it can be difficult for F1 officials to verify.  Understanding how such a subtle and often relatively inexpensive difference could make a far more revolutionary impact than much more expensive materials, structures, and aerodynamic technology requires substantial expertise not only in control engineering but also in supporting F1 technology.  The value-added technology in active control occurs at a very high level of abstraction relative to these other technologies.  It would completely break the F1 "paradigm" as surely as it falls outside the usual training of most scientists and engineers.
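The distinction between having sensors, computers, and actuators and actually closing the loop can be illustrated with a toy simulation. The sketch below assumes a hypothetical first-order plant whose actuator gain differs from the value the designer expected; it is not a model of any F1 subsystem. In open loop the command is computed once from the nominal model. In closed loop the same actuator is driven by a proportional-integral law acting on the measured error. All names and numbers are illustrative.

```python
def simulate(actual_gain, closed_loop, nominal_gain=1.0, decay=1.0,
             target=1.0, kp=2.0, ki=2.0, dt=0.01, duration=20.0):
    """Euler simulation of dx/dt = -decay * x + actual_gain * u.
    Returns the output x at the end of the run."""
    x = 0.0
    integral = 0.0
    for _ in range(int(duration / dt)):
        if closed_loop:
            # Feedback: the command depends on the measured error.
            error = target - x
            integral += error * dt
            u = kp * error + ki * integral
        else:
            # Open loop: command precomputed from the nominal model only.
            u = decay * target / nominal_gain
        x += dt * (-decay * x + actual_gain * u)
    return x

for gain in (0.5, 1.0, 1.5):
    open_x = simulate(gain, closed_loop=False)
    closed_x = simulate(gain, closed_loop=True)
    print(f"gain {gain:.1f}: open loop {open_x:.3f}, closed loop {closed_x:.3f}")
```

In this sketch the open-loop output settles at the actual gain times the command, so a 50 percent gain error becomes a 50 percent output error, while the closed-loop output settles at the target for each of the gains tried.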

Although we may wish that F1 had allowed active control, and allowed us to see the dramatic consequences in action, the F1 officials are not simply antitechnology. One important issue that the F1 ban on active control wisely avoids is the consequence of having not just faster cars, but cars that respond automatically to each other and their environment on time-scales much faster than drivers' reaction times. Allowing active control would ultimately result in the cars acting as a highly interconnected network. The driver would become the limiting factor, and safety concerns would demand that drivers be eliminated in favor of a fully automated system. The alternative would be for drivers to become mere passengers in a system they had little control over and one that could fail catastrophically due to unpredictable and "emergent" multi-vehicle interactions. In any case, the spectator value of F1, which relies very much on the human element in addition to the technological, would probably suffer.

Perhaps unfortunately, society as a whole is much less prudent than F1. We are proceeding to build complex systems of systems with many of the dangers that F1 sought to avoid with a ban. It is impossible, and probably undesirable, to try to stop this trend, but we need to take the consequences more seriously.

Other examples of virtual design and uncertainty management

A popular myth that is deeply ingrained in our technical culture is that the problem of uncertainty due to unmodeled dynamics is adequately handled by simply increasing the resolution of the model until all the relevant phenomena are included. While a detailed discussion of all the flaws associated with this position would take too long, we can point out a few, at the risk of being rather abstract. The fine scale dynamics have a large number of unknown parameters whose distributions have high variance. Examples of such parameters are those that quantify viscosity, elasticity, conductance, capacitance, permeability, etc.

Even high-resolution models make assumptions about material properties that are uncertain, so the unmodeled dynamics may actually grow in size and complexity. The cost of higher resolution modeling would be prohibitive for very marginal return in fidelity, because, for example, the Pathfinder descent and landing is ultimately dominated by the intrinsic variability of the environment as much as it is by the inadequate resolution of the models. The critical issue is what the appropriate level of fidelity is for a given problem, and the Pathfinder system engineers were again left on their own without software or theoretical support.

It may seem surprising that, in spite of the great success of the scientific enterprise and the enormous power of CAD tools and their supporting infrastructure, Pathfinder system engineers, who are sophisticated and well-educated, were forced to resort to so many ad hoc fixes and so much hacking. One problem is that mathematical modeling in science and engineering has traditionally been used to understand relatively simple systems in a single domain. The abstractions, assumptions, and approximations made in the modeling process were part of the domain expertise and rarely represented explicitly.

Modern computer-based modeling grew out of this tradition. A classic example and one of the triumphs of scientific computing is computational fluid dynamics (CFD). Here, the Navier-Stokes equations have been around for more than a century, and are widely believed to reliably capture fluids on a scale below our measurement capability. Thus it is not unreasonable to view much of fluid modeling and simulation as primarily a matter of numerical solutions to PDEs and building faster computers. It is interesting to note that "digital wind tunnels" have not replaced physical ones, despite predictions to the contrary, and despite computer hardware improving at rates that have matched or exceeded all expectations.  Even a cursory overview of CFD is well beyond the scope of this discussion, but there is growing evidence that the traditional focus in CFD has greatly limited its practical applicability.  Particularly for the kind of shear flows that are important in aircraft design, the macroscopic aerodynamic vehicle properties are dominated by large-scale structures in the flow, as well as uncertainty in the boundary conditions. 

Experimentalists have made substantial progress using flow visualization to get a reasonably clear picture of the origin of vorticity in the boundary layer that is the critical feature of turbulent shear flows. Nevertheless, to quote Kline of Stanford, who has been one of the major contributors to this research:

"… the structure results now seem to provide, at long last, a reasonably complete picture of how turbulence is produced and maintained in the boundary layer and of the major eddies in the various regions of the layer.  In nearly every other case in physics such increased knowledge has translated into improved models for computation. That has not been the case in turbulent boundary layers."

Fortunately, approaches focusing more on uncertainty management and robustness may soon offer attractive practical alternatives to traditional CFD.

Another important success that has created somewhat unreasonable expectations about the ease of computer modeling is VLSI CAD, where the Mead-Conway design rules, if followed, allow the uncertainties of the physical (silicon) level and the manufacturing process to be almost completely suppressed at the functional design level. (This is one of the all-time brilliant examples of a successful protocol-based uncertainty management strategy, but unfortunately this fascinating topic is far too broad to be adequately handled here.) While this approach sacrifices performance, it is essential if designs are to be done at a fast enough pace to take advantage of the continuing advances in fabrication technology. Deep submicron designs threaten to undermine this design paradigm, because the uncertainty at the physical level cannot be completely suppressed.

The Pathfinder descent problem has little in common with CFD or VLSI CAD, except at the lowest subsystem level. The models are heterogeneous combinations of models of structures, fluids, propulsion, and electronics. There are no canonical PDEs or design rules available to isolate the high level logical functioning from the physical uncertainty. Indeed, the traditional approach to spacecraft design did approximate the Mead-Conway philosophy by building in huge conservative safety margins.

It is interesting to compare the relatively ad hoc approach that was taken in the Pathfinder problem with the more systematic analysis that is routinely done for the space shuttle reentry, which is superficially similar. The similarities are that both missions require a vehicle to use combinations of jets and aerodynamics to follow a highly nonlinear trajectory from the vacuum of space through the planet's atmosphere to land on the surface. The space shuttle has simplifying features, however, that allow for successful use of the primarily linear robustness analysis tools that have been developed at Caltech and Honeywell over the last 20 years. The shuttle engineers also perform repeated analyses of similar missions, and so can afford to invest more effort in learning to use such tools.

More sophisticated tools could probably have made some impact on the Pathfinder program, but their application there would have been much more difficult. Indeed, a focus of current research is to extend the existing tools that have been so successful on the shuttle and other programs to make them more easily usable for problems like Pathfinder. The exact nature of such an extension and the details of the issues involved are unfortunately well beyond the scope of this discussion.  It will require a blending of tools from robust control with bifurcation analysis from dynamical systems.

Uncertainty modeling and management

There are other important examples that would need detailed expositions to begin getting a full picture, but even these simple examples suggest that in complex engineering systems the current implicit treatment of uncertainty is inadequate. While it is quite natural to distinguish between parametric uncertainty, noise, and unmodeled dynamics, it is also important to treat them in a unified way. Noise is typically used to describe dynamics that are not modeled in detail and are often considered part of the environment. Noise is often modeled as a stochastic process when, strictly speaking, it might be more appropriate to think of it as chaotic. Unmodeled dynamics is used as a catchall for uncertainty that is neither parametric nor noise. The problem of uncertainty modeling is discussed in more detail elsewhere.
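The remark that noise might be better thought of as chaotic can be made concrete with a standard textbook system that is not part of the original discussion: the logistic map x -> 4x(1 - x). The sketch below (function names are illustrative) generates a fully deterministic sequence and computes two statistics that an observer might use to decide that the sequence is "noise".

```python
def logistic_sequence(x0, length):
    """Iterate the deterministic map x -> 4 x (1 - x) starting from x0."""
    values = []
    x = x0
    for _ in range(length):
        x = 4.0 * x * (1.0 - x)
        values.append(x)
    return values

def mean(values):
    return sum(values) / len(values)

def lag1_correlation(values):
    """Sample correlation between each value and its successor."""
    m = mean(values)
    numerator = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    denominator = sum((a - m) ** 2 for a in values)
    return numerator / denominator

sequence = logistic_sequence(0.2, 5000)
print(f"mean: {mean(sequence):.3f}")
print(f"lag-1 correlation: {lag1_correlation(sequence):.3f}")
```

For this map the sample mean is close to 1/2 and the lag-one correlation is close to zero, so second-order statistics alone do not distinguish it from an uncorrelated random sequence, even though each value is an exact function of the previous one.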

Complex engineering systems have relied on the fact that the final design step has traditionally been to add automatic controls to an otherwise completed system, and the control engineer has had the responsibility for doing system wide uncertainty management. Uncertainty has often been treated by building in conservative and expensive safety margins. The "better, faster, cheaper" and "systems of systems" design paradigms that are currently being promoted should motivate a more integrated and systematic treatment of uncertainty throughout the design process.