How a Large Install Base Leads to Robustness
By John Sloan, Senior ACIS Architect
Recently, I was thinking about the way that ACIS’ large install base has affected our development of robust algorithms over the years. This led me to a startling realization: the functions within a large commercial software package form an evolutionary ecosystem, in the technical sense of a collection of evolving actors that interact among themselves. This means that such packages are complex adaptive systems, which in turn leads to an understanding of the difficulties and successful methods involved in modifying these packages to meet business or robustness goals.
The story begins last year, when I was attending the Solid and Physical Modeling 2008 conference. There, I had the opportunity to discuss with members of the academic community some of the work we’ve been doing within the ACIS development group on the mathematical theory underlying tolerant modeling. This led me to realize a way in which those of us working in industry have a big advantage over researchers in academia – our large install base acts as a real-world laboratory which regularly serves up highly unusual corner cases as defects. These corner cases can be (and are) captured in a defect database, which we can use both to test our algorithms and to give us insight into the obscure pathologies that can arise in a solid model. This defect database is an incredibly valuable part of ACIS’ intellectual property base, one that cannot be matched by academic researchers. At this year’s conference, I was asked to give a talk on robust modeling from an industrial perspective – this led to my thoughts about the effects that this resource has had upon the robustness of ACIS algorithms, and from there to the concept of ACIS as an ecosystem of functions.
The heart of these ideas flows from two observations about ACIS: 1) well over a million seats of ACIS have been deployed, which implies that the number of function calls being made within ACIS every year is measured in the billions, and 2) Spatial is not overwhelmed with defect reports, which implies that these function calls almost always work correctly. To be more precise, the probability of any function within ACIS failing when called in a particular way, times the probability of the function being called in that way, must be of order one in a billion. In other words, the heavily used functions within ACIS are experimentally observed to be incredibly robust (at the level of nine nines) when called in the way that ACIS typically calls them. This can be understood intuitively by thinking in evolutionary terms: ACIS’ large install base provides an enormous selection pressure to drive bugs out of typical workflows, resulting in functions which are highly optimized to be robust when called as part of these workflows.
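The order-of-magnitude claim is easy to check with back-of-envelope arithmetic. Only the seat count below comes from the text; the per-seat call rate and the defect inflow are hypothetical round numbers chosen purely to illustrate the calculation.

```python
# Back-of-envelope per-call reliability estimate.
# Only `seats` reflects the article; the other two figures are
# hypothetical round numbers used to illustrate the arithmetic.
seats = 1_000_000
calls_per_seat_per_year = 10_000   # hypothetical workload
defects_per_year = 10              # hypothetical defect inflow

total_calls = seats * calls_per_seat_per_year   # 1e10 calls per year
p_fail = defects_per_year / total_calls         # ~1e-9 per call
reliability = 1.0 - p_fail                      # "nine nines"

print(f"failure probability per call: {p_fail:.0e}")
print(f"per-call reliability:         {reliability:.9f}")
```

With these inputs the failure probability per call works out to one in a billion, i.e. a per-call reliability of 0.999999999.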
You might have noticed above that I have been careful not to claim absolute robustness for these functions, but instead to discuss the robustness of a function in terms of the ways in which it is called. This is because selection pressure applies to a function only in the ways in which it is called; a “dead code path” function which is never called by ACIS obviously has no selection pressure to make it robust. If a change were made to another function which caused the dead code path to come alive, the newly activated function would suddenly become a potential source of defects. As a more subtle example, consider a change in the ACIS curve approximator which would cause it to generate quintic, rather than cubic, B-spline approximations. One property of such approximations is that they would have fewer, more coarsely spaced knots. Since many ACIS functions use the knots of the approximation as seed values in relaxation algorithms, it’s likely that some particularly difficult geometric corner cases would be tipped over the edge from working to not working.
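The seeding risk is easy to demonstrate in miniature. The sketch below is not ACIS code; it is a generic Newton-style relaxation in Python, applied to arctangent, whose basin of convergence is famously narrow: a seed at 1.0 converges to the root, while a seed only slightly farther out at 2.0 diverges. A change that coarsens the spacing of seed values can push a hard case across exactly this kind of boundary.

```python
import math

def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Generic Newton relaxation (an illustration, not ACIS code).
    Returns the root it converges to, or None on divergence."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        x -= fx / df(x)
        if not math.isfinite(x) or abs(x) > 1e150:
            return None  # iterate blew up: divergence
    return None

f = math.atan                        # sole root at x = 0
df = lambda x: 1.0 / (1.0 + x * x)   # derivative of atan

good_root = newton(f, df, x0=1.0)    # seed inside the basin: converges
bad_root = newton(f, df, x0=2.0)     # seed just outside it: diverges
print(good_root, bad_root)
```

The failure is not caused by any change to `newton` itself, only by a change in what it is fed – which is the point of the ecosystem view.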
In both of these cases, the appearance of these defects would not be caused by changes to the function itself, but rather would be due to changes to the calling environment in which the function operates. This calling environment, in turn, is created by the other functions within ACIS. This interdependency is what drives the understanding of ACIS as an “ecosystem” of functions, with selection pressure (the analogue of the physical environment) being applied through modifications to the code which fix defects in customer workflows. To put it another way, the functions within ACIS form a complex adaptive system which has been adapted to reduce defects to a low level. The utility of this understanding is that it leads to observations about change management in large commercial software packages in general, and in ACIS in particular.
The most important of these observations is that complex adaptive systems operate in an equilibrium state which is typically too complex to be understood by a human. This has many ramifications:
- Attempting to put all of the pieces in place statically and then pushing a “go” button tends not to work, because the pieces will not be in equilibrium. Instead, the system should be grown gradually from an understandable system, allowing re-equilibration after each step. This is the underlying driver behind incremental design practices – each iteration of the design adds complexity and then allows the equilibrium to shift. This also makes the risk inherent in large waterfall projects obvious – it is much more difficult to find equilibrium in a large system.
- Change is non-local. Even with good object-oriented practices in place, changing the behavior of any one function will change the environment in which other functions operate, which in turn can move them out of the situation for which their robustness is optimized.
- Large-scale rewrites are difficult. The functions that are already in place are highly optimized for robustness and form the environment for other functions which will not be replaced. The functions which replace them must both perform at the existing, incredibly high level of robustness and avoid disrupting the environment of the other functions in a way that moves them into their non-robust regimes.
- Test suites should be designed to match customer scenarios and workflows as closely as possible; otherwise, they will apply selection pressure on the code to be robust in an environment that does not match the one it will need to operate in “in the wild” (in actual customer workflows).
- Early design mistakes can become unfixable from a practical point of view. The classical biological example of this involves the “blind spot” in the human eye, where the optic nerve passes through the retina. A better design would have the nerves attaching at the back of the retina, but this design mistake was presumably made in an early iteration of the eye. At this point, fixing the design mistake would involve a large-scale “rewrite” of the eye, which is impractical for the reasons cited in the previous bullet.
- “Provably correct” algorithms are extremely important. By this I mean algorithms which can be shown mathematically to converge to the correct answer or fail in a graceful way that forces client code to recognize the failure. The reason this is important is that a function which implements a provably correct algorithm is more likely to be insensitive (read: robust) to changes in its environment than a function which uses heuristics to arrive at an answer. For example, in the quintic B-spline case mentioned above, relaxation algorithms which can be proved to always find the correct answer, regardless of the seeds used, are likely to be much more robust against the change from cubic to quintic approximations than those which have no such proof.
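Bisection is the textbook instance of such a provably correct method, and makes a useful contrast with seed-driven relaxation. The sketch below is a generic Python illustration, not an ACIS algorithm: as long as the root is bracketed, the bracket halves on every iteration, so convergence is guaranteed no matter how lopsided the starting interval is – and when the precondition fails, the function fails loudly rather than returning garbage.

```python
import math

def bisect(f, lo, hi, tol=1e-12):
    """Provably convergent root finder: given f(lo) and f(hi) with opposite
    signs, the bracket halves each step, locating a root to within tol in
    ceil(log2((hi - lo) / tol)) iterations -- no sensitivity to seeding."""
    if f(lo) * f(hi) > 0:
        # Graceful failure that forces the caller to notice.
        raise ValueError("root must be bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Wildly lopsided brackets both converge to the root of atan at x = 0.
r1 = bisect(math.atan, -10.0, 50.0)
r2 = bisect(math.atan, -0.001, 1000.0)
print(r1, r2)
```

Compare this with a Newton-style iteration, whose convergence depends on where in the interval it starts: the bisection proof holds for any bracket, which is exactly the kind of environmental insensitivity this bullet is advocating.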
This list is not intended to be comprehensive; instead, it is intended to stimulate a new way of thinking about the issues confronting large-scale software design and change management. I know it has helped me in my understanding of the issues we confront every day in managing ACIS; I hope it will also help you in understanding your own product.