Optibus helps transit providers run mass transportation more effectively through advanced artificial intelligence and optimization algorithms. Its SaaS platform plans and schedules the movements of every vehicle and driver, providing detailed insight into how these plans affect operations, on-time performance, and costs. Some of the largest and most complex transit operations worldwide use the Optibus platform to improve quality of service, reduce costs, streamline operations, and cut congestion and emissions.
The Optibus system is highly CPU-intensive: it supports many concurrent users, who in turn run heavy optimization algorithms. Although the system was fully Dockerized, scaling out resources to handle the high load was cumbersome and slow, involving many moving parts and scripts (Ansible, CloudFormation, and custom Python code).
Making infrastructure changes was difficult because much of the deployment process relied on custom Python code written by a developer who was no longer with the company.
To solve these issues, Optibus wanted to migrate and modernize their existing cloud infrastructure but didn’t have in-house DevOps expertise to do so.
Our scaling cycle was complex, which made it error-prone. This caused us to do a lot of over-provisioning of our resources, which was expensive. We wanted to scale our resources in a smart way, based on customer demands. We also wanted to have an easier way for team members to add a new microservice to the system, so that they won’t need to understand all the internal workings of our homemade tools. We wanted to use something that is ready-made and had much more flexibility than what we had before.
Eitan Yanovsky, CTO & Co-Founder
We were looking to hire an in-house DevOps engineer, but we couldn’t find someone suitable for us. My concern with going with a consultant instead of hiring someone was that we would end up in a similar state that we’d been in before, where the external consultant would do all the DevOps work and we would not gain the in-house knowledge and experience needed to maintain and improve the infrastructure in the future. Leonid addressed my concerns, helping us devise a plan to teach the team members and involve them in the work so they would gain the hands-on experience needed to operate the new system in the future.
Eitan Yanovsky, CTO & Co-Founder
Following a deep-dive analysis of the Optibus system and infrastructure, we developed a plan to gradually migrate parts of the infrastructure and custom deployment scripts to Kubernetes. Because Kubernetes is open source and backed by a large community and ecosystem of ready-made tools, team members can add new services without having to develop in-house automation.
To address the scaling challenges, we developed a custom plugin for Kubernetes that automatically scales Optibus worker components based on the number of customer requests waiting in a queue.
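The core idea of queue-based scaling can be sketched as a small control function: pick a replica count proportional to queue depth, clamped between a floor and a ceiling. This is a minimal illustrative sketch, not Optibus's actual plugin; all names and thresholds here are assumptions.

```python
# Hypothetical sketch of queue-based worker autoscaling logic.
# Function names and thresholds are illustrative, not the actual Optibus plugin.

def desired_replicas(queue_length: int,
                     jobs_per_worker: int = 5,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Return a replica count so each worker handles ~jobs_per_worker queued jobs."""
    # Ceiling division: 11 queued jobs at 5 per worker -> 3 workers.
    needed = -(-queue_length // jobs_per_worker)
    # Clamp so we never scale to zero or beyond the budgeted maximum.
    return max(min_replicas, min(needed, max_replicas))

# A control loop would periodically read the queue depth and apply the result,
# e.g. with the official Kubernetes Python client (AppsV1Api):
#   apps_v1.patch_namespaced_deployment_scale(
#       "worker", "default", {"spec": {"replicas": desired_replicas(qlen)}})
```

The clamp is what keeps this safe in practice: a burst of requests can never provision more machines than the configured ceiling, and an empty queue still leaves a minimum number of warm workers.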
We also replaced the custom Python deployment logic with standard Kubernetes tools such as Helm and Kops and created a secure way for team members to access the Kubernetes clusters from home.
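With Helm, deploying or upgrading a service becomes a single versioned command instead of a chain of custom scripts. The chart, release, and value names below are hypothetical, shown only to illustrate the workflow:

```shell
# Illustrative Helm workflow; chart/release names are hypothetical.
# Each microservice is packaged as a chart and deployed with one command:
helm upgrade --install worker ./charts/worker \
  --namespace production \
  --set image.tag=v1.2.3 \
  --wait   # block until the rollout is healthy, fail otherwise

# Roll back to the previous release if the new version misbehaves:
helm rollback worker
```

Because `--install` makes the command idempotent, the same invocation works for first deploys and upgrades alike, which is what lets team members ship a new service without learning bespoke tooling.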
To help the team members get up to speed with the new tools, we ran multiple hands-on training sessions which covered various aspects of production-grade Kubernetes maintenance – crisis simulations, an overview of monitoring and logging tools, and common failure scenarios.
Toward the end of our work, we provided a roadmap for how Optibus can continue to improve their cloud operations. We also helped the team create a new DevOps position and interview candidates.
Leonid helped us replace our custom deployment logic, which combined Ansible and CloudFormation to deploy our services on fixed dedicated machines, with services that run on Kubernetes and autoscale based on demand. He helped us structure our deployment logic so that it will be easy for us to add more microservices in the future and helped us improve our monitoring, which was not that sophisticated before.
Eitan Yanovsky, CTO & Co-Founder
Leonid knows very well how to differentiate between what’s important to do right now and what is less important and can wait until later. Discussions around scalability can get quite theoretical and you can end up with a very complicated plan. But Leonid focuses on what will bring the highest value with the least amount of time in order to prioritize and build the right roadmap of changes. He’s also very articulate and easy going, and makes you feel like he’s part of the team.
Eitan Yanovsky, CTO & Co-Founder
As a result of working with Opsfleet, the migration to Kubernetes is now complete, giving Optibus a much more flexible way to deploy and maintain its services and making it easier to add new services in the future. Transitioning from static environments to Kubernetes created the dynamic infrastructure Optibus needed to continue evolving its architecture toward microservices.
If it’s quite complicated for a developer to add a new service to the system, then they will choose the easy option of adding more code or more APIs to an already existing codebase. This tendency makes it hard to switch from a single monolith to a dynamic microservices architecture. We don’t want to handle all the mess of how to deploy, scale and provision machines needed for our services. Kubernetes saves us a lot of time because most of these tasks are automated.
Eitan Yanovsky, CTO & Co-Founder
With improvements in monitoring and autoscaling logic, we helped Optibus cut server provisioning by two-thirds. With improved insight and the ability to scale automatically, the Optibus team can better identify and avoid resource waste.
With the training they’ve received, the Optibus team now better understands how to operate their Kubernetes-based cloud infrastructure and is continuing to build in-house knowledge so they can investigate and fix future problems faster. Optibus has also hired someone to fill the newly created DevOps position, giving the company the in-house expertise it sought to keep improving its infrastructure as its needs evolve.
"We use very expensive machines to run our algorithms so it was very important for us to understand how we can better utilize them. Before, once we hit some memory error or we suspected that we might need more resources, we provisioned more machines to prepare for the worst-case scenario. Now our infrastructure scales automatically and our resources match exactly our processing needs. We can now tune our infrastructure usage and not waste a lot of resources."