The third pillar of the AWS Well-Architected Framework is reliability. It deals with ensuring that your workloads and applications perform their intended functions correctly and consistently, every time they are expected to.

Using the five design principles of the reliability pillar, you can create workloads and applications that are reliable for their entire lifecycle.

Automatically Recover From Failure
Automation is a vital element of the reliability pillar. Set up systems that monitor key performance indicators (KPIs) tied to your business value. When a KPI reads too low or too high, your monitoring system should automatically notify you and continue tracking the problem.

You can also set up automatic recovery systems that your monitoring triggers when there’s a problem.

To prepare for failure as much as possible, you can set up systems that track trends and thereby predict future problems before they occur.
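As a minimal sketch of what such recovery automation can look like (assuming boto3 credentials are configured; the instance ID and SNS topic are hypothetical), the following creates a CloudWatch alarm that notifies a topic and triggers EC2’s built-in recover action when the system status check fails:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")

# Hypothetical identifiers -- replace with your own resources.
INSTANCE_ID = "i-0123456789abcdef0"
SNS_TOPIC_ARN = "arn:aws:sns:eu-central-1:111122223333:ops-alerts"

# Alarm on the EC2 system status check; when it fails, notify the team
# and trigger the built-in EC2 recover action.
cloudwatch.put_metric_alarm(
    AlarmName="ec2-auto-recover",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[
        SNS_TOPIC_ARN,
        # The automate action asks EC2 to recover the instance on healthy hardware.
        "arn:aws:automate:eu-central-1:ec2:recover",
    ],
)
```

The recover action keeps the instance ID, private IP, and metadata but moves the instance to healthy hardware; it only addresses system-level failures, so alarms on application KPIs would typically trigger notifications or custom automation instead.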

Test Recovery Procedures
Just as you test your workload’s operating procedures, you should also evaluate its recovery methods. While working in the cloud, use automation to cause a failure in your workload and observe how well the recovery systems and procedures work.

It’s also possible to use automation to recreate past failures. If you’re unsure exactly where a failure occurred, a recreation can help you determine its causes and ensure it doesn’t happen again.
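As an illustration, here is a hedged sketch of such an induced failure (the Auto Scaling group name is hypothetical; run this only against pre-production): it terminates one random instance so you can observe whether the group and your alarms respond as documented:

```python
import random
import boto3

# Hypothetical pre-production Auto Scaling group -- never point this at production.
ASG_NAME = "preprod-web-asg"

autoscaling = boto3.client("autoscaling")
groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
instances = groups["AutoScalingGroups"][0]["Instances"]  # assumes at least one instance

# Pick one instance at random and terminate it. The group should detect the
# failure and launch a replacement; observe how long that takes and whether
# your alarms and runbooks behave as documented.
victim = random.choice(instances)["InstanceId"]
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId=victim,
    ShouldDecrementDesiredCapacity=False,  # keep capacity, force a replacement
)
print(f"Terminated {victim}; watch the ASG activity history for the replacement.")
```

AWS also offers a managed service for this kind of experiment, AWS Fault Injection Simulator, which can inject faults such as instance terminations, API throttling, and latency under controlled stop conditions.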

Scale Horizontally
Instead of running your workload on one large resource, consider spreading it across several smaller ones. If a single large resource fails, you might have to shut down your entire system for the repair; with many small resources, one failure affects only a fraction of your capacity.

Ensure you spread requests across the smaller resources so they don’t share a common point of failure.
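A minimal sketch of this principle (assuming an existing launch template named web-server; all identifiers are hypothetical): an Auto Scaling group that spreads many small instances across three Availability Zones so no single failure domain carries all the traffic:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-central-1")

# Hypothetical launch template and subnets; the subnets span three
# Availability Zones so that no single failure domain takes out the fleet.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",
    LaunchTemplate={"LaunchTemplateName": "web-server", "Version": "$Latest"},
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333",
)
```

In front of such a group you would typically place an Elastic Load Balancer, so requests are distributed across the instances and a failed instance simply stops receiving traffic.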

Don’t Guess Capacity
Don’t just assume that your workload can handle the demands you place on it. One of the most common causes of workload failure is resource saturation.

Use AWS tools to monitor the demands placed on your workload and its utilization level. Create systems that automatically add or remove resources as your workload approaches saturation, so capacity tracks demand.
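Continuing the hypothetical web-fleet group from the previous sketch, a target tracking scaling policy removes the capacity guesswork by letting the group follow demand:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking keeps average CPU around 60%: the group adds instances as
# demand grows and removes them again when it subsides -- no capacity guessing.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-fleet",
    PolicyName="keep-cpu-at-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```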

Manage Change
Use automated systems to change your workload. Automation reduces human error, lowering your risk.

Changes to those automated systems should themselves be tracked and reviewed, preferably by another automated system.
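One hedged way to implement this with AWS primitives (stack name, change set name, and template file are hypothetical) is a CloudFormation change set: the change is proposed by automation, is reviewable as data, and is only applied by an explicit, auditable execute step:

```python
import boto3

cloudformation = boto3.client("cloudformation")

with open("template.yaml") as f:  # hypothetical template describing the change
    template_body = f.read()

# Propose the change as a change set instead of applying it directly.
cloudformation.create_change_set(
    StackName="web-stack",
    ChangeSetName="add-read-replica",
    TemplateBody=template_body,
)
cloudformation.get_waiter("change_set_create_complete").wait(
    StackName="web-stack", ChangeSetName="add-read-replica"
)

# The proposed changes can now be reviewed -- by a human or by another
# automated check -- before anything calls execute_change_set().
changes = cloudformation.describe_change_set(
    StackName="web-stack", ChangeSetName="add-read-replica"
)
for change in changes["Changes"]:
    rc = change["ResourceChange"]
    print(rc["Action"], rc["LogicalResourceId"])
```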

Work with an AWS Well-Architected Partner
To ensure you comply with all five design principles of the reliability pillar, consider working with an experienced AWS Partner. The WOLK team stays up to date with the current design principles and best practices of the AWS Well-Architected Framework.

After performing a Well-Architected Review, we can identify any non-compliance issues and mitigate them for you.

This blog post is part of a series about the AWS Well-Architected Framework, what it is, why it makes sense, and how we at kreuzwerker do it. In this entry, we will focus on the Reliability Pillar.

What it is - A quick recap

Using their architects’ and clients’ collective knowledge and experience, AWS continuously develops the Well-Architected Framework: key concepts, design principles, and best practices for architecting and running workloads in the AWS Cloud. AWS created the framework to understand why some customers succeed in the cloud while others fail, and to identify common problems, decisional and architectural patterns, and anti-patterns. In other words: what is Well-Architected and what is not? The goal is to make this knowledge available to everyone, whether they are just considering migrating to the cloud or already running thousands of workloads there.

The Well-Architected Framework is built on six pillars:

  1. operational excellence 👨🏽‍💻
  2. security 🔒
  3. reliability 💪🏾
  4. performance efficiency 🚀
  5. cost optimization 💵
  6. sustainability 🌳

The AWS Well-Architected Review process provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs. It is based on the previously mentioned six pillars.

It’s important to note that the Well-Architected Review is not an audit. It’s nothing to be afraid of; there are no penalty points for not getting things right the first time. A Well-Architected Review is a way of working together to improve your architecture. The process leads you through several foundational questions and checks derived from years of experience working with the AWS cloud on security, cost efficiency, and performance, so it provides sound advice on improvements. It helps you build secure, high-performing, resilient, and efficient infrastructure for your applications and workloads.

The hard facts about AWS Well-Architected reviews in 2022 are:

  • it consists of 58 questions in total across all pillars
  • it takes around 4-6 hours for one workload (without tool support)
  • the goal is to remediate 45% of the high-risk findings with a minimum of 20 questions answered.

We describe the process from our perspective in more detail here.

How we do it at kreuzwerker

Why should you do it with us?

As a Well-Architected Partner, we perform at least 20 Well-Architected Reviews per year and have built deep architectural expertise and hands-on experience across every pillar.

How do we perform such a review?

For us, it’s an iterative process: we inspect and adapt every time we run it, requesting feedback from our clients and holding a short internal retrospective. As of now, we perform it as follows:

  • We do it in two blocks, 09:00-12:00 and 13:00-15:00, with a lunch break in between. We adapt as needed, e.g., shifting the break if we are faster, and we are flexible about running the review remotely or at your office.
  • We run it in an interactive, storytelling mode. This means: you talk, we listen, and then we dig deeper into specific areas, often covering multiple questions at once.
  • Our process is supported by tools (more on that in another part of this blog post series 🥳)

We do not just ask the questions; we give guidance on answering them and can tell you how and why improvements could be made.

Reliability Pillar

The reliability pillar is about the ability of a workload to perform its intended function correctly and consistently. This includes operating and testing the workload through its entire lifecycle.

In a nutshell

We all want our workloads to be reliable: available 99.9…9% of the time and free of failure. And when failures do occur, we want them handled gracefully. As Netflix says:

The best way to avoid failure is to fail constantly

Achieving this is all about the foundations: how you architect your workload, how you apply and monitor changes, and how your workload detects and handles failures. It depends on:

  1. Resiliency is the ability to recover from infrastructure or service disruptions, dynamically acquire resources to meet demand, and mitigate disruptions such as network issues. In the cloud, everything can fail at any time, and the more loosely coupled your architecture is, the more it needs to handle network issues like timeouts and high latency.
  2. Availability is the percentage of time that a workload is available for use. For example, an availability of 99.999% allows a maximum unavailability of only about 5 minutes per year. Dependencies compound: with three services in the request chain, each at 99.99% availability, the overall availability drops to 99.97%.
  3. Disaster Recovery Objectives are recovery strategies in the event of a disaster. Two metrics are important here: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). And of course the cost: the lower the RPO and RTO values you want, the more you pay, as the following graphic illustrates:

[Graphic: the lower the target RPO and RTO values, the higher the cost of the disaster recovery strategy]
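To make the availability math above concrete, here is a small sketch that computes both the composite availability of a request chain and the downtime a given availability target allows per year:

```python
# Composite availability of serial dependencies, and allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60

def composite(availabilities):
    """Availability of a request chain where every service must respond."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

chain = composite([0.9999, 0.9999, 0.9999])      # three services at 99.99% each
print(f"chain availability: {chain:.4%}")        # ~99.97%

downtime = (1 - 0.99999) * MINUTES_PER_YEAR      # five nines
print(f"allowed downtime at 99.999%: {downtime:.1f} minutes/year")  # ~5.3
```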

After summarizing this pillar from our point of view, let’s talk briefly about the design principles that guide us through each pillar.

Design principles

All pillars have their design principles, and they guide us through them. For the reliability pillar, they are as follows:

  • Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can trigger automation when a threshold is breached. These KPIs should measure business value as well as technical aspects. This is the basis for automatically notifying about and tracking failures, and for installing automated recovery processes that work around or repair the failure. Going one step further, more sophisticated automation makes it possible to anticipate and remediate failures before they occur.
  • Test recovery procedures: In the cloud, you can and should test how your workload fails and validate your recovery procedures; typically, most companies don’t. You can use automation, e.g., in pre-production environments, to simulate different failures or recreate scenarios that led to failures in the past. This proactive approach exposes failure pathways that you can test and fix before an actual failure scenario occurs.
  • Scale horizontally to increase aggregate workload availability: Replace large resources with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across these smaller resources so they don’t share a common point of failure.
  • Stop guessing capacity: A common cause of failure in on-premises workloads is resource saturation, when demand exceeds the workload’s capacity; scaling on-premises is not easy, and saturation can strike quickly, for example during a denial-of-service attack. In the cloud, by contrast, you can monitor demand and workload utilization and automate the addition or removal of resources, maintaining the optimal level to satisfy demand without over- or under-provisioning, for example based on specific utilization metrics. There are still limits, but some quotas can be controlled and others can be managed, while still others are unchangeable (see Manage Service Quotas and Constraints).
  • Manage change in automation: Changes to your infrastructure should be made using automation and an IaC tool such as CDK or Terraform. The changes that need to be managed include changes to the automation itself, which can then be tracked and reviewed, for example in a VCS such as Git, with a service such as GitHub and a branching model such as git-flow.

Improvement process

The architectural improvement process includes understanding what you already have and what you can do to improve the current state of your workload architecture. You select targets for improvement, test and adapt them, and quantify your success. Afterward, you share what you have learned so it can be replicated elsewhere, and then you repeat the cycle ♻️

  • Setting the Foundations
    • Being aware of service quotas is the first step; we notice that many clients only become aware of a quota when they first hit it. Our approach is to inform our clients, set alarms when usage approaches a limit, and incorporate quotas into architecture decisions (see the quota-check sketch after this list). For example, when designing a multi-tenant architecture where each tenant gets a separate S3 bucket: buckets are limited per account (by default 100, raisable to 1,000), so it might make sense to think about a proper prefix schema within a single bucket instead.
    • For the network topology, we recommend using AWS’s DNS service Route 53 and CloudFront; both are protected by default through AWS Shield, AWS’s DDoS protection service. Furthermore, we recommend AWS Transit Gateway if we hear that the network is planned to expand or that multiple VPN and Direct Connect connections are planned.
  • Rethink the Workload Architecture
    • We suggest the Amazon Builders’ Library, which describes how AWS builds and operates its own software.
    • The next big topic is workload segmentation: what do the contracts look like? How tight or loose is the coupling? Could services like SQS or EventBridge enable a more event-driven architecture in the future? How are requests structured, e.g., are they idempotent? Are retries, backoff strategies, throttling, and timeouts in place? These are crucial in distributed systems (see the retry sketch after this list).
    • How does your workload scale? Do you have a mechanism in place to perform load testing?
  • Properly implement Change Management
    • Many clients have monitoring in place, but not all of them have adapted it to watch for the moments when changes occur. We ask which metrics they generate, how they aggregate them, and whether and how they get alarmed. We also consider tracing crucial for quickly finding the root cause and location of failures. Most clients have never heard of automated responses and remediation, such as Systems Manager (SSM) automation; we create awareness and add example implementations.
    • Some clients have runbooks in place. We find them crucial, as they are well-defined responses and procedures for known events such as deployments. Most clients have tests as part of their deployment pipeline; however, very few test how to roll back in case of failure. We point out such cases and provide solutions.
  • Do Failure Management with grace
    • What runbooks are for changes, playbooks are for failures: well-defined procedures for such cases. Additionally, we explain blameless post-mortems and create awareness of chaos engineering and game days.
    • Backups?! Are they in place? If so, are they encrypted, and do you regularly test restoring from them? Most clients cover the first two points but have never tested the third. We tell them about the GitLab.com database incident of 2017, in which five out of five backup mechanisms failed and the team had to fall back on a six-hour-old backup from a staging database.
    • We bring up Disaster Recovery (DR), explain the different types, and define which one is the most suitable as a balance between cost and RPO & RTO. DR is like backups: it only works if you test it regularly!
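As an illustration of the quota alarms mentioned under Setting the Foundations (a sketch: boto3 is assumed, and the quota is looked up by its display name, which is less stable than its code), this checks how close the current VPC count is to its regional quota:

```python
import boto3

service_quotas = boto3.client("service-quotas")
ec2 = boto3.client("ec2")

# Look up the "VPCs per Region" quota by display name (illustrative; quota
# codes are more stable if you want this in production).
quotas = service_quotas.list_service_quotas(ServiceCode="vpc")["Quotas"]
vpc_quota = next(q for q in quotas if q["QuotaName"] == "VPCs per Region")

in_use = len(ec2.describe_vpcs()["Vpcs"])
if in_use >= 0.8 * vpc_quota["Value"]:
    print(f"WARNING: {in_use}/{vpc_quota['Value']:.0f} VPCs in use -- "
          "request a quota increase or adjust the design.")
```

And as a sketch of the retries and backoff strategies mentioned under Rethink the Workload Architecture (the exception type is hypothetical; real code would catch the specific transient errors of its client library), exponential backoff with full jitter avoids synchronized retry storms:

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, throttling)."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # exhausted all attempts; surface the failure
            # Full jitter: sleep a random duration up to the exponential cap,
            # so many clients retrying at once don't hammer the service in sync.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```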

Conclusion

Based on the pillar principles and improvement process, our conclusion is:

  • KPIs tied to business value need to be in place. However, we see that many clients are not aware of them.
  • Most clients are unaware of service quotas until they hit them. Furthermore, backoff strategies and proper timeouts for service calls are often not in place.
  • CI/CD pipelines are in place; however, rollbacks are not.
  • Load testing: for example, you can use a prebuilt solution from AWS to generate load on your application and use Aurora’s cloning feature to get a copy of your production data into a pre-prod environment (see the sketch after this list).
  • Backups and DR are sometimes implemented. However, they are not tested regularly or at all.
  • Generally, we encourage clients to test for failure with practices such as chaos engineering.
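As a hedged sketch of the Aurora cloning mentioned in the load-testing point (all cluster identifiers are hypothetical), a copy-on-write clone provides production-like data for a pre-prod load test without touching the production cluster:

```python
import boto3

rds = boto3.client("rds")

# Create a copy-on-write clone of the production cluster for load testing.
# The clone shares storage pages with the source until either side writes,
# so it is fast to create and cheap until the test mutates data.
rds.restore_db_cluster_to_point_in_time(
    SourceDBClusterIdentifier="prod-aurora-cluster",   # hypothetical
    DBClusterIdentifier="loadtest-aurora-clone",
    RestoreType="copy-on-write",
    UseLatestRestorableTime=True,
)
```

Note that the restored cluster has no instances yet; you still need to add a DB instance to it before the load test can connect.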

Take care! And as final words: we’re happy to perform an AWS Well-Architected Review with you and tackle these issues together.


If you want to know more about the AWS Well-Architected Framework, check out the other parts of our series.
