Meta-approach: apply six-thinking-hats
Design for appropriate trust, not greater trust.
Show the past performance of the automation.
Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators (see the sketch after these guidelines).
Simplify the algorithms and operation of the automation to make it more understandable.
Show the purpose of the automation, design basis, and range of applications in a way that relates to the users' goals.
Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust.
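As a concrete illustration of revealing intermediate results, here is a minimal sketch. The function name, signals, and thresholds (`choose_rollout_speed`, `error_rate`, the 0.01 and 10,000 qps cutoffs) are all invented for illustration, not drawn from any real system:

```python
import logging

log = logging.getLogger("deployer")

# All names and thresholds here are invented for illustration.
def choose_rollout_speed(error_rate: float, traffic_qps: float) -> str:
    # Report the intermediate signals, not just the final answer.
    log.info("observed error_rate=%.4f, traffic=%.0f qps", error_rate, traffic_qps)
    if error_rate > 0.01:
        decision = "pause"      # errors trump everything else
    elif traffic_qps > 10_000:
        decision = "slow"       # high traffic: smaller blast radius per step
    else:
        decision = "normal"
    log.info("rollout speed=%s (thresholds: error_rate 0.01, traffic 10000 qps)",
             decision)
    return decision
```

An operator reading this log can answer both "what did it decide?" and "why?" without reverse-engineering the code.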
"As the tools were completed, they replaced their respective manual processes. However the tools provided extensive visibility as to what they were doing and why."
"Even though it is unusual for automation systems to fail, pilots still need to be ready to cope with them if they occur. Unfortunately, it is more common that the pilot is not understanding what the automation is doing or telling them."
Of course, for effective troubleshooting, the details of internal operation that the introspection relies upon should also be exposed to the humans managing the overall system.
I work with teams of both automators and operators. There is a current of mistrust between them, rooted in a history of inhumane assumptions and practices on both sides regarding each other's skills and intelligence. These teams are software engineers who must learn the language of automated deployment systems, systems engineers who monitor the deployment process and must decide when to intervene when things go wrong, and operations engineers who are responsible for the results of the automated deployment in production.
What the thing might do in the future:
Later in the film, there was a shot of Carnegie Mellon's CHIMP robot considering a simulation of its future actions. In effect, it was predicting what it expected to be able to do and rehearsing it, much like an athlete visualizing a race beforehand. It was a real-time rendered 3D animation that also gave the researchers insight into how CHIMP was thinking about its next movements.
concept-distinction-between-automation-and-autonomy
Simple releases are generally better than complicated releases. It is much easier to measure and understand the impact of a single change than of a batch of changes released simultaneously. If we release 100 unrelated changes to a system at the same time and performance gets worse, understanding which changes impacted performance, and how, will take considerable effort or additional instrumentation. If the release is performed in smaller batches, we can move faster and with more confidence, because each code change can be understood in isolation within the larger system. This approach to releases can be compared to gradient descent in machine learning, in which we find an optimal solution by taking small steps, checking after each one whether the change is an improvement or a degradation.
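A minimal sketch of that analogy, assuming hypothetical `deploy`, `rollback`, and `measure_error_rate` helpers rather than any real deployment API:

```python
# Hypothetical sketch: the deploy/rollback/measure_error_rate helpers are
# assumptions, not a real deployment API.
def release_incrementally(changes, deploy, rollback, measure_error_rate,
                          tolerance=0.001):
    """Apply changes one at a time, keeping each only if the metric holds."""
    baseline = measure_error_rate()
    for change in changes:
        deploy(change)
        current = measure_error_rate()
        if current > baseline + tolerance:
            rollback(change)    # the culprit is unambiguous: the last change
        else:
            baseline = current  # accept the small step and continue from here
```

If 100 changes had been deployed together, the same failing measurement would only implicate the batch, not the individual change.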
An earlier push of the new feature had involved a thorough canary but didn't trigger the same bug, as it had not exercised a very rare and specific configuration keyword in combination with the new feature. The specific change that triggered this bug wasn't considered risky, and therefore followed a less stringent canary process. When the change was pushed globally, it used the untested keyword/feature combination that triggered the failure. Ironically, improvements to canarying and automation were slated to become higher priority in the following quarter. This incident immediately raised their priority and reinforced the need for thorough canarying, regardless of the perceived risk.
"As a first step, you might consider autoscaling based on multiple custom metrics. This is possible to do, but I don't advise it for two reasons. Most important, I think, is that a multi-metric autoscale policy makes communication about its behavior difficult to reason about. "Why did the group scale?" is a very important question, one which should be answerable without elaborate deduction."
"Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them."
Components should be self-healing. For example, if you must keep some state (e.g., a cache), refresh its contents periodically so that an erroneously stored item or a missed deletion event is corrected soon, ideally on timescales shorter than those that would attract human attention.
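A minimal sketch of that idea, assuming a hypothetical `fetch` callback to the source of truth:

```python
import time

# Minimal sketch; `fetch` is a hypothetical callback to the source of truth.
class SelfHealingCache:
    """Every entry expires, so a bad write or a missed deletion event is
    corrected within one TTL by re-fetching from the source of truth."""

    def __init__(self, fetch, ttl_s=60):
        self._fetch = fetch
        self._ttl = ttl_s
        self._data = {}    # key -> (value, expiry timestamp)

    def get(self, key):
        value, expiry = self._data.get(key, (None, 0.0))
        if time.monotonic() >= expiry:
            value = self._fetch(key)    # refresh, overwriting any stale entry
            self._data[key] = (value, time.monotonic() + self._ttl)
        return value
```

A wrong value or a missed deletion persists for at most one TTL before the refresh overwrites it.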
Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure.
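One common way to implement this is priority-based load shedding; the priority levels and load thresholds below are invented for illustration:

```python
# Sketch of priority-based load shedding; the priority levels and load
# thresholds are invented for illustration.
CRITICAL, NORMAL, BEST_EFFORT = 0, 1, 2

def admit(priority: int, load: float) -> bool:
    """Decide whether to accept work, given current load (1.0 = capacity)."""
    if load > 1.2:
        return priority == CRITICAL    # heavy overload: critical work only
    if load > 0.9:
        return priority <= NORMAL      # near capacity: shed best-effort first
    return True                        # healthy: admit everything
```

Under partial failure the component keeps doing its most important work instead of failing all requests equally.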
The minute we can automate a task, we downgrade the relevant skill involved to one of mere mechanism.
rollout
situational awareness