Meta-approach: apply six-thinking-hats
Design for appropriate trust, not greater trust.
Show the past performance of the automation.
Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators (see the sketch after these guidelines).
Simplify the algorithms and operation of the automation to make it more understandable.
Show the purpose of the automation, design basis, and range of applications in a way that relates to the users' goals.
Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust.
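As a concrete illustration of revealing intermediate results, here is a minimal sketch. The function name, signals, and thresholds (`choose_rollout_speed`, `error_rate`, the 0.01 and 10,000 qps cutoffs) are all invented for illustration, not drawn from any real system:

```python
import logging

log = logging.getLogger("deployer")

# All names and thresholds here are invented for illustration.
def choose_rollout_speed(error_rate: float, traffic_qps: float) -> str:
    # Report the intermediate signals, not just the final answer.
    log.info("observed error_rate=%.4f, traffic=%.0f qps", error_rate, traffic_qps)
    if error_rate > 0.01:
        decision = "pause"      # errors trump everything else
    elif traffic_qps > 10_000:
        decision = "slow"       # high traffic: smaller blast radius per step
    else:
        decision = "normal"
    log.info("rollout speed=%s (thresholds: error_rate 0.01, traffic 10000 qps)",
             decision)
    return decision
```

An operator reading this log can answer both "what did it decide?" and "why?" without reverse-engineering the code.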
"As the tools were completed, they replaced their respective manual processes. However the tools provided extensive visibility as to what they were doing and why."
"Even though it is unusual for automation systems to fail, pilots still need to be ready to cope with them if they occur. Unfortunately, it is more common that the pilot is not understanding what the automation is doing or telling them."
Of course, for effective troubleshooting, the details of internal operation that the introspection relies upon should also be exposed to the humans managing the overall system.
I work with teams of both automators and operators. There is a current of mistrust between them, rooted in a history of inhumane assumptions and practices on both sides regarding each other's skills and intelligence. These teams are software engineers who must learn the language of automated deployment systems, systems engineers who monitor the deployment process and must decide when to intervene when things go wrong, and operations engineers who are responsible for the results of the automated deployment in production.
What the thing might do in the future:
Later in the film, there was a shot of Carnegie Mellon's CHIMP robot considering a simulation of its future actions. In effect, it was predicting what it expected to be able to do and rehearsing it, much like an athlete visualizing a race beforehand. It was a real-time rendered 3D animation that also gave the researchers insight into how CHIMP was thinking about its next movements.
concept-distinction-between-automation-and-autonomy
Simple releases are generally better than complicated releases. It is much easier to measure and understand the impact of a single change than of a batch of changes released simultaneously. If we release 100 unrelated changes to a system at the same time and performance gets worse, understanding which changes impacted performance, and how, will take considerable effort or additional instrumentation. If the release is performed in smaller batches, we can move faster and with more confidence, because each code change can be understood in isolation within the larger system. This approach to releases can be compared to gradient descent in machine learning, in which we find an optimal solution by taking small steps, checking after each one whether the change is an improvement or a degradation.
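A minimal sketch of that analogy, assuming hypothetical `deploy`, `rollback`, and `measure_error_rate` helpers rather than any real deployment API:

```python
# Hypothetical sketch: the deploy/rollback/measure_error_rate helpers are
# assumptions, not a real deployment API.
def release_incrementally(changes, deploy, rollback, measure_error_rate,
                          tolerance=0.001):
    """Apply changes one at a time, keeping each only if the metric holds."""
    baseline = measure_error_rate()
    for change in changes:
        deploy(change)
        current = measure_error_rate()
        if current > baseline + tolerance:
            rollback(change)    # the culprit is unambiguous: the last change
        else:
            baseline = current  # accept the small step and continue from here
```

If 100 changes had been deployed together, the same failing measurement would only implicate the batch, not the individual change.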
An earlier push of the new feature had involved a thorough canary but didn't trigger the same bug, as it had not exercised a very rare and specific configuration keyword in combination with the new feature. The specific change that triggered this bug wasn't considered risky, and therefore followed a less stringent canary process. When the change was pushed globally, it used the untested keyword/feature combination that triggered the failure. Ironically, improvements to canarying and automation were slated to become higher priority in the following quarter. This incident immediately raised their priority and reinforced the need for thorough canarying, regardless of the perceived risk.
"As a first step, you might consider autoscaling based on multiple custom metrics. This is possible to do, but I don't advise it for two reasons. Most important, I think, is that a multi-metric autoscale policy makes communication about its behavior difficult to reason about. "Why did the group scale?" is a very important question, one which should be answerable without elaborate deduction."
"Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them."
Components should be self-healing. For example, if you must keep some state (e.g., a cache), refresh its contents periodically so that an erroneously stored item or a missed deletion event is corrected soon, ideally on timescales shorter than those that would attract human attention.
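A minimal sketch of that idea, assuming a hypothetical `fetch` callback to the source of truth:

```python
import time

# Minimal sketch; `fetch` is a hypothetical callback to the source of truth.
class SelfHealingCache:
    """Every entry expires, so a bad write or a missed deletion event is
    corrected within one TTL by re-fetching from the source of truth."""

    def __init__(self, fetch, ttl_s=60):
        self._fetch = fetch
        self._ttl = ttl_s
        self._data = {}    # key -> (value, expiry timestamp)

    def get(self, key):
        value, expiry = self._data.get(key, (None, 0.0))
        if time.monotonic() >= expiry:
            value = self._fetch(key)    # refresh, overwriting any stale entry
            self._data[key] = (value, time.monotonic() + self._ttl)
        return value
```

A wrong value or a missed deletion persists for at most one TTL before the refresh overwrites it.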
Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure.
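One common way to implement this is priority-based load shedding; the priority levels and load thresholds below are invented for illustration:

```python
# Sketch of priority-based load shedding; the priority levels and load
# thresholds are invented for illustration.
CRITICAL, NORMAL, BEST_EFFORT = 0, 1, 2

def admit(priority: int, load: float) -> bool:
    """Decide whether to accept work, given current load (1.0 = capacity)."""
    if load > 1.2:
        return priority == CRITICAL    # heavy overload: critical work only
    if load > 0.9:
        return priority <= NORMAL      # near capacity: shed best-effort first
    return True                        # healthy: admit everything
```

Under partial failure the component keeps doing its most important work instead of failing all requests equally.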
The minute we can automate a task, we downgrade the relevant skill involved to one of mere mechanism.
rollout
situational awareness