Today, I want to talk about the DevOps Three Ways. The idea comes from an article on Gene Kim’s blog (“The Three Ways, The Principles Underpinning DevOps”).
We begin with the first way (“The First Way”).
The First Way, Systems Thinking, is illustrated by the following diagram:
The idea behind Systems Thinking is to move the product as quickly as possible from left to right, that is, from development to operations.
To do this, we must optimize the system globally, not locally. At the very least, local optimizations must take the overall flow into account. After all, why optimize one part of the system (say, at the developer level) if the product cannot be put into production at the operations level?
Concretely, the first consequence of the First Way is never to pass a known defect down the flow. For example, you should never let software with known bugs move forward on the assumption that someone downstream will take care of it. The software is likely to come back to you for correction anyway.
Next, when you make a local optimization, make sure it does not degrade the system as a whole. For example, speeding up software development may seem like a good idea, but it is not if the result is software that cannot be delivered because its quality is too low.
The most obvious example of improvement under the First Way is increasing the rate of Continuous Delivery. But improving Continuous Delivery is not enough if the next step, Continuous Deployment (deploying the software to production), slows down the overall process.
Once this is done, we can look into improving the Continuous Delivery pipeline itself. In general, everything in this pipeline should be done as quickly and as safely as possible. It is therefore a matter of finding a balance between speed of execution and quality of execution. If you can combine the two, you can speak of an efficient organization.
This cannot be achieved simply by putting pressure on the teams, at the risk of overworking them, because fatigue leads to a decline in quality. On the contrary, when people feel good, performance can improve.
We can also turn the question around. Instead of asking what should be done to improve the system, we can ask what degrades it most.
The first element is the handoff of information between teams, which wastes a great deal of time. This is why we try to avoid functional teams, which involve many handoffs and therefore interruptions and waiting times. Instead, we seek to build cross-functional teams, which have everything they need to take the software from development to production.
The goal is to identify this wasted time. One technique that works very well for this is Value Stream Mapping. It involves studying how work flows through the system, identifying where value is created and where the bottlenecks are, in order to maximize throughput.
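To make the idea concrete, here is a minimal sketch of the arithmetic behind a value stream map. The stage names and hour figures are hypothetical, and real Value Stream Mapping is a workshop exercise rather than a script; the code only shows how separating active work from waiting time exposes where the flow stalls.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    work_hours: float  # time spent actively adding value
    wait_hours: float  # time the work item sits idle before this stage starts


def summarize(stages):
    """Return total lead time, flow efficiency, and the stage with the longest wait."""
    total_work = sum(s.work_hours for s in stages)
    total_wait = sum(s.wait_hours for s in stages)
    lead_time = total_work + total_wait
    # Flow efficiency: share of the lead time during which value is actually created.
    flow_efficiency = total_work / lead_time if lead_time else 0.0
    bottleneck = max(stages, key=lambda s: s.wait_hours)
    return lead_time, flow_efficiency, bottleneck.name


# Hypothetical figures for one feature moving through the system.
stages = [
    Stage("development", work_hours=16, wait_hours=4),
    Stage("code review", work_hours=2, wait_hours=20),
    Stage("testing", work_hours=6, wait_hours=10),
    Stage("deployment", work_hours=1, wait_hours=41),
]

lead_time, efficiency, bottleneck = summarize(stages)
print(f"lead time: {lead_time} h, flow efficiency: {efficiency:.0%}, "
      f"longest wait before: {bottleneck}")
```

With these made-up numbers, most of the elapsed time is waiting rather than working, and the longest wait sits in front of deployment. That is the kind of signal that argues for a global rather than a local optimization.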
What must also be understood is that, contrary to common belief, increasing speed is more important than reducing costs, because speeding up mechanically brings costs down. Instead of focusing on cutting costs, we should optimize for speed.
The time it takes to go from idea to production is called Lead Time.
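As a rough illustration, Lead Time can be computed from two timestamps per work item: when the idea was recorded and when it reached production. The dates below are invented, and the choice of the median as a summary is an assumption of this sketch, not something the Three Ways prescribe.

```python
from datetime import datetime
from statistics import median


def lead_time_days(idea_at, in_production_at):
    """Lead Time in days between recording an idea and running it in production."""
    return (in_production_at - idea_at).total_seconds() / 86400


# Hypothetical work items: (idea recorded, deployed to production).
items = [
    (datetime(2024, 1, 2), datetime(2024, 1, 12)),
    (datetime(2024, 1, 5), datetime(2024, 1, 9)),
    (datetime(2024, 1, 8), datetime(2024, 1, 29)),
]

lead_times = [lead_time_days(idea, prod) for idea, prod in items]
# The median is less sensitive than the mean to one unusually slow item.
median_lead_time = median(lead_times)
print(f"lead times (days): {lead_times}, median: {median_lead_time}")
```

Tracking this figure over time is what shows whether the optimizations described above are actually shortening the path to production.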
The first gain from reducing Lead Time is that your organization can reach the market before its competitors and so get ahead of them. It can then capture more market share and, at the same time, increase profitability.
The second way (“The Second Way”), feedback loops, is illustrated by the following diagram:
The purpose of the “Second Way” is to shorten and amplify feedback loops. The value of feedback is that it tells us how we can improve.
This involves collecting signals from the system and sending them back to the beginning of the system so that it can be improved. More generally, this is what allows an organization to improve.
The idea is to measure the flow of work through the system. If cycle times are measured, the feedback makes it possible to reduce them. For example, measuring the loading time of web pages lets you take action to reduce it, and faster pages make website users happier (multiple studies highlight this relationship).
Another purpose of the “Second Way” is to understand and meet the needs of the customer. One way to achieve this is A/B testing. It consists of running tests that present customers with one of two possible features, A or B, then collecting feedback on how customers respond to each and using it to decide which feature (A or B) will be deployed.
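One common way to turn A/B feedback into a decision is to compare the conversion rates of the two variants with a two-proportion z-test. The sketch below assumes that "feedback" means a yes/no outcome per customer (for example, clicked or did not click); the function name and the sample counts are hypothetical, and other statistical approaches are equally valid.

```python
from math import erf, sqrt


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns (rate A, rate B, z statistic, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (everyone or no one converted): nothing to distinguish.
        return p_a, p_b, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a, p_b, z, p_value


# Hypothetical experiment: 1000 customers saw each variant.
rate_a, rate_b, z, p_value = ab_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"A: {rate_a:.1%}, B: {rate_b:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between A and B is unlikely to be noise. The significance threshold, and whether the measured outcome really reflects customer needs, remain judgment calls for the team.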
The third objective of the “Second Way” is to make sure that people have the information they need.
Also take the time to identify where you can get metrics:
- on the servers (available disk space, …),
- in unit tests (number of tests that pass or fail),
- in the KPIs (turnover, gross margin, …),
- in the Learning Reviews (sometimes called Post-mortems),
- in recovery times, used to establish an MTTR (Mean Time To Recovery), …
Then use this information to optimize the system. While it is useful to study what went wrong, it is also useful to look at what went well. For example, a significant failure rate when running unit tests points to a development problem. And even if the rate is fairly good, there is still room to improve it.
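As an example of turning one of these sources into a number, here is a minimal sketch of an MTTR calculation. It assumes each incident is recorded as a pair of timestamps (detected, recovered); the incident data is invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average duration between detection and recovery over a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (detected, recovered).
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 7, 14, 0), datetime(2024, 3, 7, 15, 30)),
    (datetime(2024, 3, 20, 9, 0), datetime(2024, 3, 20, 9, 15)),
]

mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr}")
```

Recomputing this value after each change to tooling or process is one way to check whether an improvement actually moved the measure.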
But once these improvements are in place, you still have to monitor them. If the measurements do not show improvement, it means one of two things:
- the idea was bad,
- or you are not looking at the right measurements.
It is therefore crucial to measure the right things.
The third way (“The Third Way”), the culture of experimentation (“Culture of Experimentation”), is illustrated by the following diagram:
The first idea of the Third Way is a culture of continual experimentation. This means, first of all, allocating time for the team to experiment, to try new ideas or technologies, to look for ways to go faster, … Examples include the concept of Hack Days, or Google allocating 20% of time to this type of activity. During this time, the team can for example run tests on the Continuous Delivery pipeline, on the infrastructure, etc. The objective is to learn in order to improve.
The second idea is learning from failure. For that, you have to find a way to reward the team for taking risks. Otherwise, how can an organization remain a market leader if it never takes risks?
The third idea is that repetition is the prerequisite for mastery. Of course, this means you need opportunities to practice. The idea is therefore to introduce faults into the system and to practice fixing them. The goal is to increase resilience by lowering the MTTR. For example, Google uses the concept of Game Days, days during which a failure is introduced into the system (cutting an Internet connection, cutting power to a Data Center, …).
These repetitions ensure that teams are well prepared to deal with this kind of situation, especially in terms of training and tools. They also reveal things that the teams would not have imagined simply by thinking about them.
By feeding these results back into the system, the team can make it more resilient and more reliable. The best example of this is Netflix’s Chaos Monkey tooling. It consists of disabling AWS (Amazon Web Services) instances, introducing latency in the communication between the client and the server, etc. Another typical example is switching from one Data Center to another.
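The principle of this kind of fault injection can be sketched with a purely in-memory simulation. This is not Chaos Monkey's actual interface and it touches no real AWS resources; `InstancePool`, the instance names, and the `min_healthy` threshold are all hypothetical, chosen only to show the loop of "break something at random, then check whether the service survives".

```python
import random


class InstancePool:
    """Toy stand-in for a group of servers behind one service."""

    def __init__(self, names, min_healthy):
        self.healthy = set(names)
        self.min_healthy = min_healthy  # instances needed to keep serving traffic

    def terminate(self, name):
        self.healthy.discard(name)

    def is_serving(self):
        return len(self.healthy) >= self.min_healthy


def run_chaos_round(pool, rng):
    """Terminate one randomly chosen healthy instance; report whether service survives."""
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim, pool.is_serving()


rng = random.Random(42)  # seeded so the experiment can be repeated
pool = InstancePool(["web-1", "web-2", "web-3"], min_healthy=2)

victim, still_serving = run_chaos_round(pool, rng)
print(f"terminated {victim}; service still up: {still_serving}")
```

In this toy setup the service tolerates losing one instance but not two. Running the experiment is what makes that limit visible before a real failure does.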
The idea behind all this is that failures will happen one day, so how can we pretend that they will not and never practice repairing them?
The Third Way, of course, assumes an effective instrumentation system, so that we can use this feedback to improve the system. It also assumes that these experiments can be set up quickly, run quickly, and that the system can be restored quickly.
You should now have a better understanding of what DevOps is. It’s not only automation, and it’s not only tools. It’s also efficient feedback loops, a culture of continual experimentation, and much more.