Red Hat Cluster Update | Thi Le's Portfolio

IMPACT AT A GLANCE

PROBLEM

Our Voice of Customers (VoC) department reported an increase in the number of support tickets regarding long maintenance times during cluster updates.

This trend implies that extended maintenance periods lead to long application downtime, which is highly undesirable.

SOLUTION

I focused on exploring full and partial cluster updates while ensuring a consistent experience between the Command Line Interface (CLI) and web UI.

This gives the users, who are cluster admins, the flexibility to allocate their time and resources for maintenance while keeping their applications running.

RESULTS

As of Q2 2023, support tickets about long maintenance times during cluster updates have decreased by 63% (target was 50%).

Notifications and emails prompting users to resume partial updates have an 87% click-through rate, which helps reduce cluster failures.

As of Q1 2023, there have been very few support tickets related to conditional updates.

NOT BORED YET? LET'S DIG DEEPER THEN :)

OVERVIEW

BIG PROBLEM

Our Voice of Customers (VoC) department has reported an increase in the number of support tickets regarding long maintenance times during cluster updates. To address this issue, the Technical PM and I focused on exploring full and partial cluster updates while ensuring a consistent experience between the Command Line Interface (CLI) and web UI.

This gives cluster admins the flexibility to update the control plane and worker nodes separately.

THE TEAM

Lead UX designer - Me
UX writer
UX researcher
Product manager
Front-end & Back-end engineers
Voice of Customers (VoC) team

TIMELINE

January 2022 - May 2022

This project consisted of 3 parts:

Designing the workflow for full and partial cluster updates (January 2022 - March 2022)
Implementing conditional updates (March 2022 - May 2022)
Establishing a notification system (May 2022)

CHALLENGES

1. CUSTOMER CONVENIENCE VS. RED HAT EFFORTS

Offering users separate control plane and worker node updates can address maintenance time challenges, but it necessitates additional safeguards from Red Hat to ensure successful updates and prevent cluster failures. It's a balance between customer convenience and our efforts to deliver a seamless and efficient service.

To tackle this, I proposed a notification system allowing users to opt in for updates, ensuring they complete them and avoid critical failures. This system aims to enhance user accountability and provide proactive measures for a smoother and more secure update process.

2. SIGNIFICANT TECHNICAL LEARNING CURVE

When I started working on cluster updates, I faced a steep learning curve with no prior knowledge. To overcome this, I dedicated extensive time to studying update documentation and relied on my product manager and engineering colleagues for valuable insights. While UX primarily focuses on improving user experiences, it's crucial to have a strong understanding of the product foundation and the underlying decision-making process.

3. THE LACK OF DIRECT USER FEEDBACK

Direct feedback from actual users was difficult to obtain beyond initial feedback from the VoC department due to security and corporate red tape. As a result, the design was validated through multiple internal tests with SREs, whose responsibilities closely resemble those of the targeted personas. Since the user tests were mainly focused on technical knowledge, it is possible that there were biases as the participants were Red Hat employees. More details on this will be discussed in the user testing section.

GOALS

BUSINESS GOALS

1. REDUCING SUPPORT TICKETS FOR CLUSTER UPDATE

We value customer feedback and have actively listened to their opinions. By improving the cluster update experience, we aim to benefit our customers while also staying at the forefront of industry standards.

2. INCREASE USER SATISFACTION

The focus was not so much on educating the target user, the cluster administrator, on cluster updates since they typically have a basic understanding of clusters. Instead, the priority was to ensure consistency in the experience between the CLI and web UI. By improving the cluster update experience, users may feel more satisfied with the product and have a better overall experience, potentially increasing customer retention and loyalty.

USER GOALS

In addition to addressing users' concerns about long maintenance time, these are the primary goals for improving the update experience:

1. ENHANCE EFFICIENCY

Improving the update experience can help reduce the time and effort required by administrators to perform updates, increasing efficiency and potentially enabling them to focus on other tasks.

2. BOOST PRODUCTIVITY

By simplifying the update process, administrators may be able to perform updates more quickly and accurately, potentially improving their productivity and contributing to overall team efficiency.

APPROACHES & DESIGN

PART 1: UPDATE STRATEGIES | FULL & PARTIAL CLUSTER UPDATES

TL;DR

Offering users the option to update the full cluster or only part of it can provide benefits such as reduced downtime, improved control plane functionality, and compatibility with custom configurations, while also requiring fewer resources. This, in turn, reduces the support tickets about long maintenance times.

1. UNDERSTAND THE CLI WORKFLOW

To ensure a consistent user experience across both the web UI and CLI, I researched how users typically update their clusters using the CLI, based on existing documentation. While there are numerous sub-steps involved in ensuring a successful update, this simplified version gave me a clear understanding of the update workflow on the CLI, which enabled me to develop a workable workflow for the web UI.


2. DEVELOP WEB UI WORKFLOW

In collaboration with a back-end engineer and a PM, I defined various use cases for update strategies and created workflows for each of them. I particularly enjoy creating workflows as they provide a bird's-eye view that allows me to understand various use cases. Moreover, my workflow designs are helpful for collaborating with other designers and developers.


3. HI-FI DESIGNS

USE CASE 1: When everything looks good to update

Full cluster update

Partial cluster update


USE CASE 2: There are paused worker nodes, but users can still update


USE CASE 3: There are paused worker nodes, but users cannot update until paused worker nodes are resumed and done updating


PART 2: IMPLEMENTING CONDITIONAL UPDATES

1. WHAT IS A CONDITIONAL UPDATE?

TL;DR

Prior to version 4.9, it was not recommended to use versions with known bugs. However, with the release of version 4.9 and onwards, known risks can be identified and supported if users choose to update to these versions. These versions are known as conditional updates.

Once again, providing users with more flexibility in choosing updates that suit their needs requires additional guardrails from Red Hat. We strive to strike a balance between user autonomy and ensuring the necessary safeguards are in place to maintain the stability and security of the system.

2. CONDITIONAL UPDATE AND UPGRADEABLE=FALSE

The upgradeable=false issue can hinder users from updating to specific minor versions due to a previously addressed bug associated with conditional updates. To effectively tackle this, understanding the relevant use cases through documentation or collaborating with peers is crucial. Let's shift our attention from technical terms to the user interactions necessary to proceed when encountering these use cases.

SCENARIO 1
Upgradeable=False is most likely to happen in the next minor version

WHAT CAN USERS DO?
In order to update to the next minor version or a patch version (z-stream), users are required to take action to resolve this issue. The steps to resolve this issue will be listed in the alert banner.

SCENARIO 2
Conditional updates are most likely to happen in patch versions, but can also happen in BOTH patch and minor versions.

WHAT CAN USERS DO?
Once users become aware of known risks, they have the freedom to decide whether they want to proceed with the update, choose a different recommended version, or postpone the update entirely.

SCENARIO 3
When BOTH Upgradeable=False and conditional updates happen to a minor version, Upgradeable=False will precede conditional update issues.

WHAT CAN USERS DO?
Before updating to a certain minor version, users may encounter an Upgradeable=False issue that needs to be resolved first. After that, they may come across conditional update issues, which they need to review before deciding whether to proceed with the update, opt for another recommended version, or wait.
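
The precedence described in the three scenarios can be sketched as a small decision function. This is an illustrative sketch only, not Red Hat's implementation: the names `TargetVersion`, `NextStep`, and `next_step`, the field names, and the placeholder version strings are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class NextStep(Enum):
    """What the UI asks the user to do next (hypothetical labels)."""
    RESOLVE_UPGRADEABLE_FALSE = "resolve the Upgradeable=False issue listed in the alert banner"
    REVIEW_KNOWN_RISKS = "review known risks, then proceed, pick another version, or wait"
    UPDATE = "proceed with the update"


@dataclass
class TargetVersion:
    """A candidate update target. All fields are assumptions for illustration."""
    name: str
    upgradeable_false: bool = False                        # Scenario 1
    known_risks: List[str] = field(default_factory=list)   # non-empty => conditional update (Scenario 2)


def next_step(target: TargetVersion) -> NextStep:
    # Scenario 3: when both apply, Upgradeable=False is handled first.
    if target.upgradeable_false:
        return NextStep.RESOLVE_UPGRADEABLE_FALSE
    # Scenario 2: the user is informed of known risks and decides.
    if target.known_risks:
        return NextStep.REVIEW_KNOWN_RISKS
    return NextStep.UPDATE


# Example: a target affected by both issues surfaces Upgradeable=False first.
both = TargetVersion("next-minor", upgradeable_false=True, known_risks=["example risk"])
print(next_step(both).value)
```

Checking Upgradeable=False before known risks keeps the blocking issue in front of the user, so the optional risk decision only appears once an update is actually possible.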

3. DESIGN

WORKFLOW

I will never stop talking about how much I love creating workflows.


USE CASES

PART 3: ESTABLISHING A NOTIFICATION SYSTEM

Notifications are important during cluster updates to keep users informed about the progress and any issues that may arise. This helps users take the necessary actions and avoid potential problems.

I created a notification system for cluster updates with input from the PM and internal SREs. Built with PatternFly components, it includes severity levels and actionable calls to action (CTAs) while staying unobtrusive.
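
To make the idea of severity levels paired with actionable CTAs concrete, here is a minimal sketch of how such a notification could be modeled. Everything in it is an assumption for illustration: the `Severity` levels, the `Notification` fields, the `resume_reminder` helper, and the message text are hypothetical and do not reflect the shipped design or any PatternFly API.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    """Hypothetical severity levels; higher value means more urgent."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Notification:
    severity: Severity
    message: str
    cta_label: Optional[str] = None  # actionable CTA, e.g. a button label


def resume_reminder(paused_worker_nodes: int) -> Optional[Notification]:
    """Remind the admin to finish a partial update; stay silent if nothing is paused."""
    if paused_worker_nodes <= 0:
        return None
    return Notification(
        severity=Severity.WARNING,
        message=f"{paused_worker_nodes} paused worker node(s) still need to be updated.",
        cta_label="Resume update",
    )


def most_urgent_first(notifications: List[Notification]) -> List[Notification]:
    """Order notifications so the most severe ones are shown first."""
    return sorted(notifications, key=lambda n: n.severity, reverse=True)
```

Returning `None` when nothing is paused reflects the goal of avoiding intrusiveness: no notification is raised unless the user has something to act on.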

USER TESTING

Note: We tested only the update strategies (full and partial cluster updates) and the notification system.

Due to security and corporate policies, obtaining direct feedback from customers was challenging. Therefore, the UXR team approached 5 internal SREs, whose responsibilities are similar to those of cluster admins, to validate the new update design.

OBJECTIVES

During the design process, I conducted minor tests with SREs to validate specific interactions. However, the primary objective of the study was to assess whether the target users could comprehend the update strategies and effectively navigate the new design to meet their requirements.

Questions to ask:

What is your experience with updating clusters in the past?
What are the most common issues that you have encountered during the update process?
How do you prefer to receive notifications about cluster updates and their associated risks?
Are there any specific features or capabilities that you would like to see in a cluster update design?

Test use cases (Show prototype):

Interact as you normally would when you update a cluster
How do you interpret the update strategies?
How do you proceed to resolve the issue presented?

OVERALL FINDINGS

In general, all test users expressed satisfaction with the ability to choose various update strategies. They also noted that the notification system would be highly beneficial in addressing concerns and resolving issues as they arise.

Here are actual quotes from the test users:

“I really like the new update options. It gives us more control over the update process and allows us to better manage our workload. It's great to have this kind of flexibility.”

“Being able to choose between full or partial cluster updates has been a game-changer for us. It's really helped us to better manage our resources and avoid unnecessary downtime.”

“I am really impressed with the new notification system for cluster updates. It's intuitive and provides us with the information we need to make informed decisions about when and how to update. The severity levels are especially helpful in determining how urgently we need to act, and the actionable CTAs make it easy to take action without having to dig through documentation or run scripts.”

RESULTS & LEARNING

RESULTS

As of Q2 2023, support tickets about long maintenance times during cluster updates have decreased by 63% (target was 50%).

Notifications and emails prompting users to resume partial updates have an 87% click-through rate, which helps reduce cluster failures.

As of Q1 2023, there have been very few support tickets related to conditional updates, according to Red Hat VoC. It's also possible that users are opting to either select a different update version or wait for the next release.

REFLECTION

Don't underestimate users: This experience highlighted the importance of acknowledging that users possess more knowledge and expertise than we may initially assume. It taught me the significance of not underestimating users and always prioritizing their needs throughout the design process.

Collaboration is key: Designing for cluster updates involves many different teams and stakeholders, such as product management, engineering, and SREs. It's important to work closely with these groups to ensure that everyone's needs are being met.

Keep it simple: Cluster updates can be extremely complicated, so it's important to simplify the process as much as possible for users. This could involve breaking down the update process into smaller, more manageable steps or providing clear, concise instructions for users.

User testing is critical: It can be difficult to anticipate all of the different use cases and scenarios that might arise during a cluster update. User testing can help identify potential issues or pain points in the update process and inform improvements to the system.

Communication is essential: Cluster updates can be stressful for users, particularly when there are potential risks or downtime involved. Providing clear and timely communication throughout the update process can help ease anxiety and build trust between users and the update system.

Overall, the new cluster update design has significantly improved the user experience of updating the OpenShift Container Platform, and is a testament to the importance of UX in complex technical systems.
