By taking the approach that this was a software problem, the initial automation bought us enough time to turn cluster management into something autonomous, as opposed to automated.
Human operators are progressively relieved of useful direct contact with the system as the automation covers more and more daily activities over time. Inevitably, then, a situation arises in which the automation fails, and the humans are now unable to successfully operate the system. The fluidity of their reactions has been lost due to lack of practice, and their mental models of what the system should be doing no longer reflect the reality of what it is doing. This situation arises more often when the system is nonautonomous — i.e., where automation replaces manual actions, and the manual actions are presumed to be always performable and available just as they were before. Sadly, over time, this ultimately becomes false: those manual actions are not always performable, because the functionality to permit them no longer exists.
Reliability is the fundamental feature, and autonomous, resilient behavior is one useful way to get that.
"Instead, our system takes a partnership approach which attempts to automate network changes but can escalate the request to human being when needed.
This partnership or collaboration is made possible by a simple but important design concept which we chose early on – a change is a process.
Rather than implementing changes in one blow, our system processes network changes as a series of tasks. This approach enables exit points in which the system can escalate a task to human being when needed, either as a pre-configured condition, which was designed as part of the workflow or as a contingency which was determined in "run-time"."
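To make the idea concrete, here is a minimal sketch of tasks-as-exit-points in Python. The names (`Task`, `apply_change`, `escalate_to_human`) and the structure are my own illustrative assumptions, not the system's actual API: each step can hand the change off to a human either through a gate designed into the workflow or when a failure surfaces at runtime.

```python
from dataclasses import dataclass
from typing import Callable

# Sketch of "a change is a process": a network change modeled as an
# ordered list of tasks, each of which is a potential exit point where
# the workflow hands off to a human. All names here are hypothetical.

@dataclass
class Task:
    name: str
    run: Callable[[], bool]       # returns True on success
    needs_approval: bool = False  # pre-configured exit point in the workflow

def escalate_to_human(task: Task, reason: str) -> None:
    # Placeholder: a real system would page an operator or open a
    # ticket, then pause the change until a human resolves it.
    print(f"escalated at {task.name!r}: {reason}")

def apply_change(tasks: list[Task]) -> bool:
    """Apply a change as a series of tasks; return True if fully automated."""
    for task in tasks:
        if task.needs_approval:
            # Exit point designed into the workflow as a pre-configured condition.
            escalate_to_human(task, "pre-configured approval gate")
            return False
        try:
            ok = task.run()
        except Exception as exc:
            # Contingency determined at runtime: an unanticipated failure.
            escalate_to_human(task, f"runtime contingency: {exc}")
            return False
        if not ok:
            # The task itself decided it could not proceed automatically.
            escalate_to_human(task, "task reported failure")
            return False
    return True

# Example: the middle task is gated on human approval, so the change
# escalates there instead of completing automatically.
apply_change([
    Task("drain traffic", run=lambda: True),
    Task("push config", run=lambda: True, needs_approval=True),
    Task("restore traffic", run=lambda: True),
])
```

The point of the shape is that escalation is a normal return path of the process, not an error bolted on afterward: because the change is decomposed into tasks, every boundary between tasks is a place where the automation can stop and hand control to a person.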
I find it odd how people talk about moving through the levels sequentially without really acknowledging that there's a massive phase change somewhere between levels two and four, where a self-driving car could easily be way more dangerous than the average human driver.
That comes as soon as you have a situation where the automation has to handle all foreseeable scenarios without failing, because the human is no longer paying attention. It's a fixture of complex-adaptive-systems analysis that one encounters scenarios where anything short of perfection is as good as useless. The implicit assumption in this post is that there are linear returns to incremental progress. That's not necessarily true at all.
It's quite possible -- I would argue likely -- that we'll be tantalizingly close-but-not-good-enough to full automation for a decade or three before we actually change the fundamental nature of transportation.