How To Avoid Downtime When Migrating To Kubernetes

Leonid Mirsky, CEO
October 2, 2023

If you’re running your applications in Docker containers, one factor largely determines how efficient your team will be: the Docker orchestration platform you use in production.

A few years ago, the only way to run Docker containers on a fleet of Linux machines was by writing your own orchestration code with configuration management tools like Puppet or Chef.

Nowadays, specialized Docker orchestration platforms like Kubernetes or DC/OS make deploying and managing Docker containers in production much easier. There is no need to teach your development team how to write configuration management code. They can operate and debug their applications by themselves from an easy-to-use CLI or a dashboard.

After helping several businesses make the transition to Kubernetes, I can say that the increase in efficiency and the ability to give your development team full control over production is definitely worth the risk — if you take a few precautions.

Kubernetes is still a pretty young project. It’s evolving rapidly, which can complicate the migration process. In the last month alone, the project received more than 600 commits, roughly 20 per day (GitHub Pulse). Unfortunately, this pace means that features change so fast that it can be difficult to track current best practices. Documentation has a hard time keeping up, too.

We helped a few B2B companies, ranging in size from 10+ to 100+ people, migrate their legacy deployment systems to Kubernetes. It’s possible to make the switch and do it without service interruptions.

With some advance planning, you can make the changeover smooth and fairly risk-free. Here’s the process I’ve developed to make the transition easier.

Spend Time Fine-Tuning The Readiness And Liveness Configuration

In Kubernetes, each container in a pod can specify two configuration options that directly affect its uptime.

  1. The readiness configuration indicates when your service is ready for traffic.
  2. The liveness configuration specifies when the service is compromised and should be restarted.

Why are these parameters important to your overall system’s health?

These settings affect how your application will behave in 2 critical scenarios: during a new version’s release (deployment) or during unexpected service errors.

For example, a new version of your application won’t start receiving traffic until its readiness check comes back with a good status. We can utilize this behavior to wait for important resources or to make sure that your application has enough time for initialization.

There are other protections in place, too. If we want to restart an application in case of an error and the application does not crash on its own, we can specify an appropriate livenessProbe setting to signal to Kubernetes when the application should be restarted.
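
As a rough illustration of the two settings described above, here is a minimal Python sketch that builds a Deployment manifest with both probes and prints it as JSON, which kubectl accepts in place of YAML. The application name, image, port, endpoint paths (/ready and /healthz), and every timing value are assumptions made for the example, not recommendations.

```python
import json


def build_deployment(name, image, port):
    """Return a Deployment manifest (as a dict) with readiness and liveness probes.

    All names, paths, and timings below are illustrative assumptions.
    """
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Readiness: the pod receives traffic only while this check passes.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},  # hypothetical endpoint
            "initialDelaySeconds": 5,
            "periodSeconds": 5,
            "failureThreshold": 3,
        },
        # Liveness: the container is restarted after repeated failures.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},  # hypothetical endpoint
            "initialDelaySeconds": 15,
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_deployment("web", "example/web:1.0", 8080), indent=2))
```

Keeping the two probes on separate endpoints lets the application report "alive but not ready yet" while it waits for a dependency, without being restarted for it.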

As you do this preliminary work before the actual migration, it’s important not to rush. Spend time fine-tuning these settings to test how they affect the overall stability of your individual components and the whole system. It’s especially important to check how the whole system functions during new version rollouts.
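
To make the tuning trade-off concrete, the sketch below estimates how long a hung container might keep running before the liveness check gives up on it. It is a back-of-the-envelope bound, assuming probes fire exactly every periodSeconds and each failing probe takes at most timeoutSeconds to report; the function name is hypothetical and real timing will vary.

```python
def worst_case_detection_seconds(period_seconds=10, failure_threshold=3, timeout_seconds=1):
    """Rough upper bound, in seconds, on how long a hung container keeps
    running before its liveness probe has failed often enough to trigger a restart.

    Assumes probes fire exactly every period_seconds and that each failing
    probe is reported within timeout_seconds. Treat the result as an estimate.
    """
    return period_seconds * failure_threshold + timeout_seconds


# Kubernetes' documented defaults: probe every 10s, 1s timeout, 3 failures.
print(worst_case_detection_seconds())  # 31

# A more aggressive, hypothetical setting detects hangs sooner but tolerates
# fewer slow responses before restarting the container.
print(worst_case_detection_seconds(period_seconds=5, failure_threshold=2))  # 11
```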

Migrate Gradually From The Staging Environment

Standard advice is to keep production and staging as similar as possible so that bugs and configuration issues are easy to discover before they reach production and affect end users.

However, when you deal with such a risky task as migrating your cloud infrastructure to a new platform, it’s important to make a gradual transition.

I found that migrating the staging environment first helps team members try out the new tools and allows them to discover issues that could cause problems in production down the road.

The downsides of running your applications differently on staging for a short period of time are nothing compared to the risks of migrating both staging and production all at once.

In some cases, it’s possible to take the gradual migration process even one step beyond the staging environment. Instead of switching all users to a Kubernetes-based production environment, it might be possible to start migrating the clients to the new platform one by one.
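
One way to picture a client-by-client migration is a small routing rule that decides, per client, which platform serves the request. The sketch below is only an illustration: it assumes you have a routing layer in front of both environments that can call such a function, and the names bucket, use_kubernetes, and pinned_clients are hypothetical.

```python
import hashlib


def bucket(client_id):
    """Map a client ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_kubernetes(client_id, rollout_percent, pinned_clients=frozenset()):
    """Decide whether a client should be served by the new platform.

    pinned_clients are always sent to Kubernetes (for example, internal
    test accounts); everyone else is admitted by bucket as the rollout grows.
    """
    if client_id in pinned_clients:
        return True
    return bucket(client_id) < rollout_percent


if __name__ == "__main__":
    for client in ["client-a", "client-b", "client-c"]:  # made-up client IDs
        print(client, use_kubernetes(client, rollout_percent=25))
```

Because the bucket depends only on the client ID, a client that has been moved stays on the new platform as the percentage grows, and setting the percentage back to 0 returns everyone except the pinned clients to the old one.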

Leave Time For The Team To Get Comfortable With The New Tools

There are so many tools and platforms out there that it’s hard to assemble a team with a common operational experience. Some team members will be more familiar with Heroku, and others will have experience writing Puppet code to describe how the application should be deployed.

Even with a highly competent team, you’ll still need to adapt to Kubernetes’ CLI and YAML configurations. It’s important to leave enough time for all team members to try these tools in a risk-free environment before you move the whole Kubernetes stack to production.

Don’t Go Live Without Sufficient Container Level And Node Level Monitoring

When I say monitoring, I mean actionable dashboards.

Fortunately, Kubernetes has a pretty straightforward process with well-documented instructions on how to set up a decent monitoring system.

The hard part — which is important and not documented at all — is how to create meaningful dashboards using these monitoring tools. The dashboards should expose all the information you might need to diagnose a production issue as soon as it happens. The last thing you want is to start scanning through hundreds of metrics while your system is down and your customers are affected.
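
The difference between raw metrics and an actionable view can be shown at a very small scale. The sketch below reduces the parsed output of `kubectl get pods -o json` to a short list of containers that need attention. The function name and the restart threshold are assumptions; a real dashboard would be built in your monitoring tool rather than in a script like this.

```python
def pods_needing_attention(pod_list, max_restarts=3):
    """Return (pod, container, reason) tuples for containers that look unhealthy.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    max_restarts is an arbitrary threshold chosen for this example.
    """
    findings = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        # Pods that have not started yet may have no containerStatuses.
        for status in pod.get("status", {}).get("containerStatuses", []):
            if not status.get("ready", False):
                findings.append((pod_name, status["name"], "not ready"))
            restarts = status.get("restartCount", 0)
            if restarts > max_restarts:
                findings.append((pod_name, status["name"], "restarts=%d" % restarts))
    return findings
```

Feed it the parsed JSON, for example with `json.load(sys.stdin)`. Note that containers of completed jobs also report as not ready, so a real view would need to filter those out.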

Manually Break Things To Test Your Recovery Assumptions

You can guess at your weak spots, but you can’t assume your system will recover from unexpected errors without running a few relevant scenarios first. It’s important to test your assumptions about your system’s resilience.

Find potential problems by stopping one or two database or queue instances and watching for unexpected errors in your applications. Or try killing one of the Kubernetes nodes and watch how your whole system is affected while the pods are re-scheduled to other nodes.
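
While you run experiments like these, it helps to measure what users would have seen instead of relying on impressions. The following sketch polls a health check at a fixed interval and counts failures over the duration of a test. The function names are hypothetical, and the URL you pass to http_ok would be whatever endpoint your own service exposes.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx HTTP errors.
        return False


def watch(check, duration_seconds, interval_seconds=1.0,
          sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly for duration_seconds; return (total, failures)."""
    total = failures = 0
    deadline = clock() + duration_seconds
    while clock() < deadline:
        total += 1
        if not check():
            failures += 1
        sleep(interval_seconds)
    return total, failures
```

A run might look like `watch(lambda: http_ok("http://staging.example.internal/healthz"), 300)`, started just before you stop a database instance or take a node away; the hostname here is made up. The sleep and clock parameters exist only so the loop can be exercised without waiting in real time.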

Spending just a few hours running these tests can uncover scenarios you couldn’t have anticipated. It’s better to discover and fix the relevant settings in a controlled environment before moving on with the migration to production.

Conclusion

I’ve helped supervise several Kubernetes migrations now, and in every single one, the developers were very happy with the transition. Given this feedback, I’m very comfortable saying that if you currently deploy your applications using configuration management tools like Puppet or Chef, your developers will love switching to Kubernetes. Its CLI is one of the best in the industry and allows developers to take full control over how their applications are run in production.

Even with support, switching the infrastructure of a live environment without an interruption in services is not a trivial task.

For a smooth transition, it’s crucial to pay attention to important Kubernetes configurations and to make the migration process gradual. Taking your time with each of the steps mentioned above will safeguard the migration process and help you transition to Kubernetes, which in turn will make your whole team more efficient.
