DevOps Culture

Before you start reading this blog post, I want you to consider this question: "Why does Quiqup need DevOps?" It may also help to ask yourself what DevOps actually is.

If you think DevOps engineers are just the magicians who handle your infrastructure and services, keep reading this blog post to learn what "DevOps" really means.

Why DevOps

DevOps was born because Devs wanted to ship their features fast to lead the marketplace and make customers love their product. Ops, on the other hand, valued stability above everything else, and new features meant instability. This created a big conflict within companies.

Both sides needed to work together towards the same goal. Companies started to study how to achieve this and borrowed some of the Lean principles to eliminate bottlenecks and enhance productivity. This is where concepts like Agile, continuous delivery and DevOps came from.

Note: Lean is a management philosophy derived from Toyota's manufacturing practices, which helped the company grow from a small firm into one of the world's largest automakers.

So what is DevOps?

The 2017 State of DevOps Report describes DevOps as follows:

DevOps is an understood set of practices and cultural values that has been proven to help organizations of all sizes improve their software release cycles, software quality, security, and ability to get rapid feedback on product development.

The same report also compares high-performing companies with lower-performing ones, and shows that high performers increase speed (that is, reduce lead times) while also improving stability.

Keep in mind DevOps is not a team, it is a culture, so engaged leadership is essential for successful DevOps transformations.

DevOps Principles

As DevOps practitioners, our main goal is to help make software delivery fast and reliable. Gene Kim, John Willis, Jez Humble and Patrick Debois defined three ways to reach this goal in their book The DevOps Handbook. I am going to try to give a quick summary of those three ways:

1 - Flow of work: this principle pursues a continuous release of features to our customers. Our goal here is to reduce our lead time. Practices that help reduce lead times are:

  • Eliminate constraints: all the work in the workflow should be identified. In general, this includes design (queue, analysis, work and approvals), development (queue, estimation, development, tests and approvals), QA (queue, automated tests and manual tests) and deployment (queue, ops work, approvals, deployment, verification). Once we have mapped the whole workflow, bottlenecks should be identified and reduced (or eliminated).
  • Continuous delivery: this means making deployments part of daily work. It is usually achieved through continuous integration, smaller batch sizes, easy rollbacks and similar practices. Its main objective is to make releases boring, so that we can deliver frequently and get quick feedback on what users care about.

  • Reduce waste: waste includes partially done work, extra processes that add no value, features that are not needed, task switching, blocked tickets, motion, bugs (the longer it takes to find a bug, the more expensive it becomes to fix), manual work (or toil) and heroics.
  • Improve our daily work: if we keep accumulating problems and technical debt, we can end up doing nothing but workarounds. Mike Orzen observed that "Even more important than daily work is the improvement of daily work". At least 20% of all Development and Operations cycles should be invested in refactoring, automating work and non-functional requirements (NFRs).
  • Integrate designated Ops and QAs into dev teams, so that you have independent teams that don't need to open tickets with other teams to complete a feature. The structure of the teams is highly important: according to Conway's Law, "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations".
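
To make the "eliminate constraints" practice a bit more concrete, here is a minimal Python sketch of how a mapped workflow could be used to compute lead time and point at the biggest bottleneck. The Stage type, the function names and all the hour figures are assumptions made up for illustration; they are not real Quiqup measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the workflow (hypothetical structure for this sketch)."""
    name: str
    queue_hours: float  # time a ticket waits before anyone picks it up
    work_hours: float   # time somebody is actively working on it


def lead_time(stages):
    """Total elapsed hours from the first stage to the last one."""
    return sum(s.queue_hours + s.work_hours for s in stages)


def bottleneck(stages):
    """The stage where a ticket spends the most elapsed time."""
    return max(stages, key=lambda s: s.queue_hours + s.work_hours)


# Made-up numbers, only to show the calculation.
WORKFLOW = [
    Stage("design", queue_hours=16, work_hours=8),
    Stage("development", queue_hours=8, work_hours=24),
    Stage("qa", queue_hours=40, work_hours=8),
    Stage("deployment", queue_hours=24, work_hours=2),
]

if __name__ == "__main__":
    worst = bottleneck(WORKFLOW)
    print(f"lead time: {lead_time(WORKFLOW)} hours")
    print(f"biggest bottleneck: {worst.name} ({worst.queue_hours} hours queued)")
```

In this invented example most of the time in QA is spent waiting in a queue rather than testing, which is the kind of constraint we would try to reduce first.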

2 - Feedback: in order to have a resilient system, we should increase the information flow from as many areas as possible (sooner, faster and cheaper).

  • Create telemetry and analyse it. We should collect events from our system at the business level, application level, infrastructure level, client software level and deployment pipeline level. This data should be analysed, and alerts should be raised when it does not behave according to the expected pattern.
  • Test releases and test hypotheses by enabling different types of deployment (canary deployments, A/B testing and blue-green deployments).

  • Continuous integration: each change is verified by an automated build that should detect errors as quickly as possible. This gives devs quick feedback so they can fix bugs as early as possible. Devs should be able to run unit tests, acceptance tests and integration tests (and have visible test coverage). We could also run performance tests by executing the same integration test multiple times in parallel. Static code analysis could also be automated to check for security issues or clean code.
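
As an illustration of the telemetry point above (raising an alert when a metric does not behave according to its usual pattern), here is a small Python sketch. It assumes the "pattern" is simply the mean and standard deviation of recent samples, which is only one of many possible approaches; the function name, the threshold of 3 standard deviations and the sample values are all assumptions for this example.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Return True when value is more than `threshold` standard
    deviations away from the mean of the recent history."""
    if len(history) < 2:
        # Not enough data to know what "normal" looks like yet.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical orders-per-minute samples.
    recent = [100, 102, 98, 101, 99]
    for sample in (101, 140):
        status = "ALERT" if is_anomalous(recent, sample) else "ok"
        print(f"{sample}: {status}")
```

A real setup would normally rely on the monitoring tool's own anomaly detection; the sketch only shows the idea behind it.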

3 - Continual Learning and Improvement: this focuses on organizational knowledge. It also aims to teach people how to think: changing behaviour creates culture. To achieve this we need:

  • Safety culture. This makes it possible to give your boss bad news, so everyone can focus on what caused a problem instead of who caused it. The boss should also be able to share bad news: workers are problem solvers, but they can only help if they know about the problem. We should also run blameless postmortems to encourage learning rather than punishment, avoiding solutions like "be more careful". Google uses something called an "error budget" to allow a certain level of errors to happen, so teams can test hypotheses and learn from their failures until the error budget is used up.
  • Test resilience. One of the main examples of this is Netflix's Chaos Monkey, which ensures resilience by actively testing it. Netflix's original idea was that practice makes perfect, so the only way to become better at handling failure is to break things.
  • Convert local discoveries into global improvements. This is achieved by adding telemetry or writing documentation about them. Postmortems should be available for everyone to read and learn from.
  • Plan training. This could be done within the company, where a team teaches new concepts or skills to other people, or by attending courses and conferences, or by meeting weekly to read and discuss a book.
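
The "error budget" idea mentioned above can be sketched with a bit of arithmetic. The Python below is a simplified illustration, assuming the budget is measured in failed requests against an availability target (SLO); the function names and the numbers are hypothetical and are not taken from Google's or Quiqup's actual tooling.

```python
def allowed_failures(slo, total_requests):
    """Number of failed requests the SLO tolerates in the period."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1 (exclusive)")
    return (1.0 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1.0 - failed_requests / allowed_failures(slo, total_requests)


def can_take_risks(slo, total_requests, failed_requests):
    """True while there is budget left for experiments and risky releases."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Made-up month: 99.9% target, one million requests, 250 failures.
    left = budget_remaining(0.999, 1_000_000, 250)
    print(f"error budget left: {left:.0%}")
    print("risky releases allowed:", can_take_risks(0.999, 1_000_000, 250))
```

While some budget is left, the team can keep experimenting; once it is spent, the focus moves to stability until the next period.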

Conclusion

At Quiqup, we follow some of these good practices: we have lightning talks, automated pipelines, blog posts, Datadog, Tableau... but we do not do so well with others, such as resilience testing, autonomous teams, A/B testing and reducing manual testing.

It is fine, and even normal, to not be perfect, but we should always work towards improving our practices and culture. Different teams and leaders promote some of these practices and the DevOps team has tried to cover some others, but Quiqup needs to work as a company to meet these objectives.

I hope you liked this blog post and that you now understand the DevOps culture a bit better 🙂