Cloudera Machine Learning Workspace Provisioning Pre-Flight Checks
At Cloudera, we believe that data can make what is impossible today possible tomorrow.
There are many good uses of data. With data, we can monitor our business, either as a whole or by specific business unit. We can segment customers by vertical or by whether they run in the public or private cloud. We can understand customers better and see their usage patterns and main consumption drivers. We can find customer pain points, see where they get stuck, and understand how different bugs affect them. With data, we can discover new market opportunities and review where we stand compared to the global market. We can track feature adoption, seeing how new features are picked up and what usage and consumption they generate. With data, we can set better goals, because we know where we are and where we want to go. And in the end, we can make better decisions.
At Cloudera, we practice what we preach. As the Cloudera Data Platform (CDP) gains popularity and more and more customers make it a critical piece of their infrastructure, we set out to create the best data platform in the enterprise. Today, we will highlight a new feature that showcases one great example of using data in the service of our customers.
“I’m excited to share this feature because this is a success story!”
Late last year, we saw our customers struggling to get CML Workspaces up. The elevated escalation count put a strain on our engineering team, trials were slowed down, and, even worse, our customers had a very bad experience with our product. We needed to figure this out.
We asked ourselves “is this a systemic issue?” and “how widespread is this problem?”, and the results were alarming. Customers experienced issues 57% of the time, more than half of all attempts.
There are two phases of CML Workspace creation. First, we create a K8s cluster via the liftie APIs; this is the ‘Provision’ step. Second, we install the CML service. The above chart shows the workspace provisioning results broken down for CML releases between June ’20 and Jan ’21.
Once we saw the results, we dug in and analyzed dozens of failure modes. We discovered that actual product bugs caused only a small portion of the failures.
The most common failures we found were instance types requested in unsupported regions, failures due to conflicts between the admin-provided CIDR address ranges, and environments where CML Workspaces were failing due to an unhealthy DataLake.
Okay, we had identified the problem: we were attempting to create CML Workspaces even when we could tell in advance that they would fail.
Preflight checks to the rescue
Liftie and CML engineers teamed up to solve this problem. They built a framework and released a series of checks over the course of the last few quarters to catch issues early. The results are astonishing.
For the most recent release (Aug ’21), customers experienced issues with workspace creation just 7% of the time. For 39% of the attempts, we caught issues early and showed a meaningful error message, which saved hours and hours of work for support, engineering, and our customers.
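To make the idea concrete, here is a minimal sketch of what a preflight check runner could look like, covering the three common failure modes described earlier. It is not the actual Liftie or CML framework: the names `WorkspaceRequest`, `SUPPORTED_INSTANCE_TYPES`, `run_preflight`, and the individual check functions are all hypothetical, the region and instance-type table is made-up sample data, and "CIDR conflict" is assumed here to mean overlapping ranges.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Callable, List, Optional

# Hypothetical sample data; a real check would ask the cloud provider.
SUPPORTED_INSTANCE_TYPES = {
    "us-west-2": {"m5.2xlarge", "p3.2xlarge"},
    "eu-west-1": {"m5.2xlarge"},
}


@dataclass
class WorkspaceRequest:
    """Hypothetical description of a workspace provisioning request."""
    region: str
    instance_type: str
    cidrs: List[str]
    datalake_healthy: bool


# A check returns None on success, or a human-readable error message.
Check = Callable[[WorkspaceRequest], Optional[str]]


def check_instance_type(req: WorkspaceRequest) -> Optional[str]:
    supported = SUPPORTED_INSTANCE_TYPES.get(req.region, set())
    if req.instance_type not in supported:
        return (f"Instance type {req.instance_type} is not available "
                f"in region {req.region}.")
    return None


def check_cidr_conflicts(req: WorkspaceRequest) -> Optional[str]:
    # Assumption: a "conflict" means two admin-provided ranges overlap.
    networks = [ip_network(cidr) for cidr in req.cidrs]
    for i, first in enumerate(networks):
        for second in networks[i + 1:]:
            if first.overlaps(second):
                return f"CIDR ranges {first} and {second} overlap."
    return None


def check_datalake(req: WorkspaceRequest) -> Optional[str]:
    if not req.datalake_healthy:
        return "The environment's DataLake is unhealthy; fix it before provisioning."
    return None


PREFLIGHT_CHECKS: List[Check] = [
    check_instance_type,
    check_cidr_conflicts,
    check_datalake,
]


def run_preflight(req: WorkspaceRequest) -> List[str]:
    """Run every check and collect all error messages (empty list = pass)."""
    errors = []
    for check in PREFLIGHT_CHECKS:
        message = check(req)
        if message is not None:
            errors.append(message)
    return errors
```

In this sketch the runner collects every failure rather than stopping at the first one, so an admin could fix all reported problems in a single pass before any cluster is provisioned. Provisioning would proceed only when `run_preflight` returns an empty list.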
This was a data-driven project. We used data to qualify the problem, to understand the issues, and to measure our progress and the outcome. The result is a significantly more stable platform and a new framework that all other CDP Data Services will benefit from.
Get started with Cloudera Machine Learning in CDP now: you can start here.