Day 2: Operating in the cloud

Applications, IT systems and infrastructures all go through a similar lifecycle: from planning (Day 0) to implementation (Day 1) to operations (Day 2).

This blog post sheds light on how the phases change through the use of cloud technologies and how IT teams can meet the resulting challenges in the area of “Day 2 Operations”.

Day 0, Day 1, Day 2 - The classic life cycle of IT systems

Day 0: In this phase, the planning, preparation and first steps for the implementation of a new system or a new infrastructure are undertaken. Day 0 activities usually include:

  1. Planning:
    Defining the requirements, objectives and scope of the project. Identifying the resources that will be needed and creating an implementation plan.
  2. Evaluation:
    Assessing the available options, technologies or solutions that meet the requirements. Selecting the most suitable solution for implementation.
  3. Design:
    Creating a detailed design for the new system or infrastructure. This includes determining the architecture, configuration and integration of components as well as defining the required resources.
  4. Procurement:
    Acquiring the necessary hardware, software or services for implementation - a relic from traditional IT that is virtually eliminated for cloud setups.

Day 1: The implementation or deployment of the system takes place in this phase. It typically includes:

  1. Installation / deployment:
    Installation of the hardware components (not applicable in a cloud setup) and installation/deployment of the software applications in accordance with the previous design and planning.
  2. Configuration:
    Configuration of the systems and applications according to the company's specific requirements. This includes setting up network connections, user settings, security policies, etc.
  3. Integration:
    Integration of the new system into the existing IT infrastructure or other applications to ensure smooth communication and collaboration.
  4. Testing:
    Performing tests and checks to ensure that the new system works properly and meets requirements. This includes checking the functions, performance tests and troubleshooting.

Day 2: Day 2 Operations is about ensuring that systems run smoothly, performance targets are met and user requirements are fulfilled. It includes monitoring, maintenance, troubleshooting, updates, scaling and optimization.

Day 2 Operations tasks mainly include:

  1. Monitoring and analysis:
    Continuously monitoring system performance, resource consumption, utilization and other relevant metrics to identify and analyze potential issues early.
  2. Troubleshooting and maintenance:
    Identifying and rectifying errors, faults or security vulnerabilities that may occur during operation. Regular maintenance work such as installing patches, updates or configuration changes.
  3. Scaling and capacity planning:
    Monitoring resource requirements and scaling the infrastructure to ensure it can keep pace with increasing user demands.
  4. Security and compliance:
    Ensuring the security of the systems and compliance with legal regulations and company guidelines relating to data protection and security.
  5. Optimization and improvement:
    Identifying bottlenecks or areas where improvements can be made to optimize performance, efficiency or user experience.
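
As a minimal sketch of the monitoring task in the list above, the following Python snippet classifies metric readings against warning and critical thresholds. The metric names, the threshold values and the `Threshold` and `evaluate` helpers are hypothetical and chosen purely for illustration. In practice, teams would usually rely on the monitoring service of their platform rather than hand-written checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (illustrative values only)."""
    metric: str
    warn: float
    critical: float


def evaluate(samples: dict[str, float], thresholds: list[Threshold]) -> dict[str, str]:
    """Classify each configured metric as 'ok', 'warn', 'critical' or 'missing'."""
    status: dict[str, str] = {}
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            # No reading arrived for this metric; that is itself worth flagging.
            status[t.metric] = "missing"
        elif value >= t.critical:
            status[t.metric] = "critical"
        elif value >= t.warn:
            status[t.metric] = "warn"
        else:
            status[t.metric] = "ok"
    return status


if __name__ == "__main__":
    # Hypothetical readings and limits, in percent.
    limits = [
        Threshold("cpu_percent", warn=80, critical=95),
        Threshold("disk_percent", warn=80, critical=95),
        Threshold("memory_percent", warn=80, critical=95),
    ]
    readings = {"cpu_percent": 85.0, "disk_percent": 50.0}
    print(evaluate(readings, limits))
```

Treating a missing reading as its own state is a deliberate choice in this sketch: a metric that stops reporting is often an early sign of a problem in its own right.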

Agile project management as part of the DevOps strategy means that Day 0, Day 1 and Day 2 form a feedback loop. The data collected and the experience gained on Day 2 feed into Day 0 of the next iteration of that loop.

Day 2 Operations: Tasks before the cloud

In the pre-cloud era, IT operations were often manual and time-consuming. Day 2 operations tasks included:

  • Hardware and network management: IT teams had to manage physical servers and network devices in the data center. This included monitoring hardware and network faults, maintaining hardware components and configuring network settings.
  • Operating system and application management: IT teams had to manually install and configure operating systems and applications on each server. Regular updates and patches had to be applied to keep the systems secure and stable.
  • Monitoring and troubleshooting: The system had to be continuously monitored by the IT teams to detect problems at an early stage. If errors occurred, they had to be investigated and rectified manually.
  • Backup and recovery: The IT teams had to make regular backups to ensure that the systems could be restored in an emergency.

Day 2 operations in the cloud: what has changed

With the advent of cloud technology, Day 2 operations tasks have changed significantly. The cloud automates many time-consuming tasks that were previously performed manually. This allows IT teams to focus on strategic tasks and add value to the business instead of dealing with manual tasks. The most important changes are as follows:

  • Hardware and network management: In the cloud, IT teams no longer have to worry about physical servers and network devices. Cloud providers such as AWS, Azure and Google Cloud offer virtual servers and networks that can be automatically scaled, managed and configured.
  • Operating system and application management: In the cloud, IT teams no longer need to manually install or configure operating systems and applications. Cloud providers offer pre-built images that can be deployed with one click. They also offer automatic updates and patches to ensure the system remains secure and stable.
  • Monitoring and troubleshooting: Cloud providers offer tools and services for monitoring applications and systems in real time. This allows IT teams to quickly identify and fix problems before they lead to major outages.
  • Backup and recovery: Cloud providers offer automated backup and recovery services that enable IT teams to respond quickly to outages and restore data.
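
The automatic scaling mentioned in the first point can be illustrated with a simple proportional rule: the number of instances is adjusted so that the observed utilization moves back towards a target value. The following Python sketch is an assumption-laden illustration, not the behaviour of any specific cloud provider. The function name `desired_replicas`, the replica limits and the utilization figures are hypothetical.

```python
import math


def desired_replicas(
    current: int,
    observed_utilization: float,
    target_utilization: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Return a replica count that moves utilization towards the target.

    Proportional rule: desired = ceil(current * observed / target),
    clamped to the range [min_replicas, max_replicas].
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 instances at 90 % utilization against a 60 % target: scale out.
    print(desired_replicas(4, 90.0, 60.0))
    # 4 instances at 30 % utilization against a 60 % target: scale in.
    print(desired_replicas(4, 30.0, 60.0))
```

The clamp to a minimum and maximum keeps a single noisy reading from scaling the system to zero or without bound. Choosing these limits remains a capacity planning decision for the team.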

Figure: Complexity of Day 2 Operations in different setups

Challenge Day 2 Operations in a cloud-native setup

A cloud-native setup has many advantages, such as faster software development and full use of the cloud's potential for operating applications. Complexity is reduced for developers, but it does not disappear: it shifts to the system architecture.

A microservices architecture brings challenges for maintenance and support, because a distributed system is harder to maintain and harder to keep an overview of. The software being operated is also updated more frequently, which makes it more difficult to track these changes and their effects on operations.

The number of tools for developing, provisioning, monitoring and supporting software is enormous. Building up expertise and an overview takes time, especially as these tools often work independently of each other.

Another challenge is the shift that DevOps itself brings with it, i.e. the change from centralized IT teams to decentralized development teams that work together with DevOps and SecOps teams to operate their software, true to the principle of “you build it, you run it”.

It is therefore fair to say that cloud native brings with it a shift in complexity from Day 1 to Day 2, which is increasingly falling on the shoulders of developers due to DevOps. In addition, the current shortage of skilled workers makes it difficult to form in-house teams of experts.

DevOps is still justified. In the cloud-native context, however, its implementation needs to be reconsidered, because the complexity of Day 2 operations leaves developers less and less time for developing software. It is important to develop an Ops strategy that does not drive Dev and Ops back into isolation from each other.

Solution approaches for day 2 operations in a cloud-native context

In a cloud-native context, work and team structures need to be rethought in order to relieve the burden on developers. Two approaches have emerged that are intended to rebalance the workload for the teams. They are not mutually exclusive. On the contrary, they are often combined.

SRE (Site Reliability Engineer)

  • SRE is a role within a DevOps team. This role has the task of preventing bottlenecks by supporting Ops and Dev in the event of additional work that could hinder the workflow.

IDP (Internal Developer Platform)

  • An IDP is a collection of tools, services and processes that support and accelerate the work of software development teams while abstracting the underlying infrastructure.
  • The platform engineering team is responsible for managing the IDP and provides developers with centralized expertise on IDP usage.
  • The IDP thus forms an interface for developers and the platform team to ensure shared responsibility and communication between devs and ops.

Something that is often underestimated: In a DevOps setup where developers are given tools to develop, deploy and monitor, you still need teams to maintain these platform tools. This includes updates that fix functional errors or close security gaps, as well as adjustments and extensions to the platform.

Conclusion

Day 2 Operations is rightly becoming more and more of a focus for companies using cloud technologies. The complexity that is being shifted to Day 2 Operations hampers development teams and harms the competitiveness of the business. SRE sub-teams or roles and the use of an IDP have proven effective at untying this knot.

Would you like to find out more or are you looking for support in operating your cloud systems? We look forward to hearing from you.
