Understanding the Well-Architected Framework: Part 1 (Operational Excellence)
The Well-Architected Framework is a set of best practices for designing and operating applications in the cloud. It was originally set out by Amazon Web Services (AWS), and other cloud providers, such as Microsoft Azure, now promote their own versions of it.
The framework consists of six pillars, each focusing on a different area of cloud application design. These pillars are:
Operational Excellence
Security
Reliability
Performance Efficiency
Cost Optimisation
Sustainability
It is important to be aware of these pillars to ensure that your cloud applications are designed with the appropriate considerations. This article will focus on the first pillar, operational excellence.
The Operational Excellence Pillar
The Operational Excellence pillar focuses on building software, infrastructure, and engineering teams that can operate with agility and efficiency. It aims to allow teams to develop new features rapidly, and spend less time fixing bugs or firefighting production issues.
One of the key design principles of this pillar is everything as code (EaC). This means scripting as much of your workload as possible so that your processes can be automated reliably. In addition, processes defined in code can be committed to a version control system such as Git, which provides a complete record of the change history. The concept of Infrastructure as Code (IaC) falls under this principle. In IaC, your virtual servers, databases, and other resources are defined in a declarative template file, typically written in a format such as YAML or JSON, which your cloud service provider reads to provision those resources accordingly. For example, CloudFormation is the AWS IaC tool and can be used to set up EC2 instances, Lambda functions, and many other AWS services using code.
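To make the IaC idea concrete, here is a minimal sketch in Python that builds a CloudFormation-style template as plain data and prints it as JSON. The helper name build_template, the logical resource name WebServer, and the image ID are illustrative assumptions; a real template would need a valid AMI ID for your region and would normally live in its own file under version control.

```python
import json


def build_template(instance_type: str, image_id: str) -> dict:
    """Return a minimal CloudFormation-style template describing one EC2 instance.

    build_template and the "WebServer" logical name are hypothetical,
    chosen only for this illustration.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative single-instance stack",
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": instance_type,
                    "ImageId": image_id,
                },
            }
        },
    }


# "ami-EXAMPLE" is a placeholder, not a real image ID.
template = build_template("t3.micro", "ami-EXAMPLE")

# Because the template is just text, it can be committed to Git and reviewed
# like any other code change.
print(json.dumps(template, indent=2))
```

Because the infrastructure is described as data rather than configured by hand, changing the instance type becomes a one-line change that can be reviewed and reverted like any other commit.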
A continuous integration/continuous delivery (CI/CD) pipeline also falls under the EaC principle. CI/CD provides an automated pipeline for building, testing, and deploying your code. Traditionally, releasing applications has been risky because it was a largely manual, error-prone process that often involved significant downtime. With CI/CD, the risk of mistakes is limited, downtime is often reduced, and it is usually straightforward to roll back to a previous version of the code should something go wrong.
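The build, test, and deploy flow described above can be sketched as an ordered list of stages where a failure stops everything that follows. This is a simplified illustration rather than how any particular CI/CD product works; run_pipeline and the stage functions are hypothetical names, and real stages would call out to compilers, test runners, and deployment tools.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of those that completed.

    The first failing stage stops the run, so a broken build or a failing
    test can never reach the deploy stage.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            print(f"stage '{name}' failed; skipping remaining stages")
            break
        completed.append(name)
    return completed


# Hypothetical stand-ins for real build, test, and deploy commands.
def build() -> bool:
    return True


def test() -> bool:
    return 2 + 2 == 4


def deploy() -> bool:
    return True


completed = run_pipeline([("build", build), ("test", test), ("deploy", deploy)])
print(completed)
```

The important property is the ordering: deployment is only attempted once every earlier stage has passed, which is what removes much of the risk from a manual release.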
This leads to the next principle, which is to make frequent, small, reversible changes. The idea behind this is that 1) the platform is continuously improved, and 2) it is easier to identify the cause of any issues that arise, since the cause can be isolated to a small part of the code. To understand this, we should again look at the traditional approach. Here we may have version 1 of an application running on our servers. We plan and build many new features, and after 6 months of work, we are ready to release version 2. This goes against the small, reversible changes principle because 1) a lot of code sits unreleased for a long time without providing any value to the business, and 2) if any issues arise, it is very difficult to understand why, since there have been many changes across many areas of the application. CI/CD makes small, frequent releases practical because, for example, code can be built and released every time it is merged to a certain branch of your Git repository.
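One way to picture a reversible change is as a release history in which the most recent entry can always be undone. The sketch below is a toy model under that assumption; ReleaseHistory and the version strings are hypothetical, and a real rollback would redeploy the earlier build artifact rather than pop an item from a list.

```python
from typing import List


class ReleaseHistory:
    """Tracks deployed versions so the latest one can be reverted."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def deploy(self, version: str) -> None:
        """Record a new release as the current version."""
        self._versions.append(version)

    @property
    def current(self) -> str:
        if not self._versions:
            raise RuntimeError("nothing has been deployed yet")
        return self._versions[-1]

    def rollback(self) -> str:
        """Discard the latest release and return the version now live."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        return self.current


history = ReleaseHistory()
history.deploy("v1.0.0")
history.deploy("v1.0.1")  # a small change on top of v1.0.0

# Something went wrong with v1.0.1, so revert to the previous release.
restored = history.rollback()
print(restored)
```

When each release contains only a small change, the version you roll back to differs very little from the one that failed, which keeps both the rollback and the subsequent diagnosis simple.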
The next two principles are somewhat similar in their application. These are: 1) refine operations procedures frequently, and 2) anticipate failure. Overall, these call for regular review of your systems. In particular, so-called game days are recommended, in which various failure scenarios are tested so that potential bottlenecks and points of failure can be identified and fixed before they ever become an issue, and so that team members are well practiced in your processes should a real failure event occur.
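A game day can be thought of as deliberately injecting a failure and checking that the system copes. The following sketch simulates this at a very small scale: a primary data source is made to fail on purpose, and the drill confirms that a fallback is used instead. All names here (fetch_with_fallback, failing_primary, replica, run_drill) are hypothetical, and a real game day would exercise actual infrastructure and the team's response procedures, not just a function.

```python
from typing import Callable, Dict


def fetch_with_fallback(primary: Callable[[], str], fallback: Callable[[], str]) -> str:
    """Try the primary source and fall back if the connection fails."""
    try:
        return primary()
    except ConnectionError:
        return fallback()


def healthy_primary() -> str:
    return "primary"


def failing_primary() -> str:
    # Injected failure: simulates the primary being unreachable.
    raise ConnectionError("injected failure for the drill")


def replica() -> str:
    return "replica"


def run_drill() -> Dict[str, str]:
    """Run each scenario and record which source served the request."""
    return {
        "normal": fetch_with_fallback(healthy_primary, replica),
        "primary_down": fetch_with_fallback(failing_primary, replica),
    }


results = run_drill()
print(results)
```

If the second scenario raised an error instead of returning the replica, the drill would have exposed a single point of failure before a real outage did.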
Finally, the last principle is to learn from all operational failures. Despite best efforts to adhere to the preceding principles, issues and incidents within the system will still arise. Such occasions should be taken as an opportunity to learn and to improve processes so that the risk of a repeat is reduced. Lessons learned should be documented and shared to ensure widespread understanding and an accurate knowledge base for future reference.

