
The Importance of Error Handling, Quality Control, and Documentation from the Perspective of an IT Operations Team

Long-term responsibility for operating IT systems brings experiences that are all too familiar to many in this field. A good example of this is the process of launching a new product. Here, we’ll skip the initial phases of a project and focus directly on implementation:

After the quality assurance team has completed thorough functional and technical testing, the developers have fixed the identified errors, and the responsible IT operations team has set up the production environment, the big day arrives: the product goes live.

Initially, the launch seems successful, but then performance unexpectedly deteriorates until the application comes to a complete standstill. The questions that follow, why this happened and how it can be solved, are all too familiar. Even though this example represents an extreme case, such situations and similar ones have occurred repeatedly over the years.

In this article, we explore, from the perspective of an IT operations team, the potential causes of such incidents and possible solutions.

Challenges in Error Diagnosis

Often, the first step in troubleshooting is to check the application and system log files to identify potential causes. But this is precisely where difficulties often begin.

In the worst case, errors are not logged at all. If error messages do exist, they are often too general to pinpoint the problem. Moreover, without a deep understanding of the code, it can be challenging to comprehend what caused the issues. The latter, at least in our experience, occurs primarily with Java applications.

A practical example illustrates the problem: an application generates a stack trace over a hundred lines long, the actual cause of which is a “FileNotFoundException”. However, the stack trace does not reveal which file is missing. Only after consulting the responsible developer could the problem be solved.
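
To make this tangible, here is a minimal, hypothetical Java sketch of how such a “FileNotFoundException” could be enriched with the missing context before it reaches the log. The class, helper, and path names are purely illustrative and not taken from the actual incident:

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ConfigLoader {

    // Hypothetical helper: re-throws the exception with the missing path in the
    // message, so the log line tells operations which file is absent without
    // anyone having to read the code.
    public static InputStream openConfig(String path) {
        try {
            return new FileInputStream(path);
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(
                    "Configuration file not found: " + path
                            + " (check deployment and file permissions)", e);
        }
    }
}
```

A single line such as “Configuration file not found: /etc/myapp/app.conf” turns a hundred-line stack trace into an actionable message for the on-call team.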

Such challenges will be familiar to many. But how can they be avoided? After numerous discussions with software development professionals, various reasons have emerged, some of which we would like to highlight.

Lack of Experience in Daily Operation of Systems and Applications

A stack trace can show the problematic line of code. With code access and appropriate development experience, this is helpful information.

However, the situation is often different for operations teams. Code access is not always available, and even if it is, programming knowledge may be lacking. Additionally, there’s the pressure to restore functionality as quickly as possible, often at night, when monitoring alerts the on-call team.

It is also important to consider that the format of log messages not only determines how efficiently they can be evaluated but also influences the load on central logging systems and can greatly simplify or complicate data analysis.
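
To illustrate this point, the following hypothetical Java snippet contrasts a free-text log line with a key-value formatted one; the scenario and field names are invented, and the same idea applies to JSON or any other structured format:

```java
import java.util.logging.Logger;

public class OrderService {

    private static final Logger LOG = Logger.getLogger(OrderService.class.getName());

    void logPaymentFailure(String orderId, String provider, int httpStatus, long durationMs) {
        // Free-text message: readable, but hard to filter or aggregate centrally.
        LOG.warning("Payment failed for order " + orderId + " while talking to " + provider);

        // Key-value message: the same information, but trivially parseable by a
        // central log system and cheap to index, filter, and graph.
        LOG.warning("event=payment_failed order_id=" + orderId
                + " provider=" + provider
                + " http_status=" + httpStatus
                + " duration_ms=" + durationMs);
    }
}
```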

Tunnel Vision

A common phenomenon is “tunnel vision”: one is so familiar with the subject matter that the cause of a problem seems obvious, and it is easy to forget that outsiders cannot readily see those connections.

Time and Budget

Often, time and budget are only allocated to ensure that the application meets the basic requirements and provides all necessary functions. Structured error handling can be time-consuming and resource-intensive, and unfortunately it often receives a lower priority.

The Role of Staging and Load Testing

In this section, we explore the importance and benefits of different environments for operating a product, troubleshooting it, and developing it further. Providing dedicated environments for different teams can prove extremely useful. How these environments are designed can vary depending on budget, product requirements, team size, and available time.

Staging

We outline an approach for staging that applies in the context of a more traditional setup with servers and virtual machines but can also serve as a basis for modern cloud and container environments:

Development Environment

Here, developers have full control and can test their applications and try out different approaches and solutions. It is advantageous if this environment resembles the layout of the production environment as closely as possible, so that potential sources of error are identified early. Some errors only become apparent when the individual components run separately.

QA Environment

Here, the QA team conducts manual and automated tests to detect code errors and malfunctions. Load tests are also often carried out in this environment, which should simulate the layout of the production environment.

PreLive Environment

This optional environment simulates the live environment and can be used by the QA team and/or the operations team to conduct load tests (more on this later).

It also offers several advantages to the operations team. The automated provisioning of systems with tools such as Puppet or Ansible can be prepared here and verified for error-free operation. It also provides a platform for evaluating monitoring checks and setting appropriate thresholds. Additionally, application deployments can be tested for potential problems, such as compatibility issues with certain library versions or missing dependencies.

Live Environment

This is the final environment where the product is operated for end-users. Ideally, careful preparation and testing in the previous environments lead to smooth operation in the live environment.

A small example where a problem became apparent only after careful preparation:

A product had a special web interface for various administrative tasks. During initial tests, everything worked flawlessly. However, it was later discovered by chance that when several people used the web interface simultaneously, their processes blocked each other, and as a result actions succeeded only in some cases. The web interface was, in effect, only “single-user capable.”
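
We never saw the code behind this interface, but the observed behavior is consistent with a pattern like the following hypothetical Java sketch: a single coarse lock around all administrative tasks, combined with a timeout, so that concurrent users block each other and some actions simply fail:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class AdminActions {

    // One lock guarding every administrative task, which makes the
    // interface effectively single-user.
    private static final ReentrantLock GLOBAL_LOCK = new ReentrantLock();

    // Hypothetical admin task: while another user holds the lock, this attempt
    // times out after five seconds and the action quietly fails.
    public boolean rebuildSearchIndex() throws InterruptedException {
        if (!GLOBAL_LOCK.tryLock(5, TimeUnit.SECONDS)) {
            return false;
        }
        try {
            Thread.sleep(10_000); // stands in for long-running maintenance work
            return true;
        } finally {
            GLOBAL_LOCK.unlock();
        }
    }
}
```

A simple test that fires two such requests in parallel would have revealed the limitation deliberately rather than leaving the discovery to chance.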

Load Testing

The importance of realistic load tests can be illustrated with an example:

A web application developed by an external service provider, displaying content based on IP address (geolocation), failed due to overload shortly after going live.

The service provider had conducted load tests beforehand. So how could this happen despite the load tests? It turned out the tests did not reflect “normal” usage. Some readers may already guess what happened.

In the load tests conducted by the service provider, all requests came from the same IP address and with identical content. As a result, every request triggered the same query against the underlying database. The result of this query was, of course, held in the database cache, so every further request to the system could be answered without real load, within milliseconds.
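
A load test that avoids this trap has to vary its inputs. The following hypothetical Java sketch illustrates the idea: every request carries a different spoofed client IP in the X-Forwarded-For header, so the geolocation lookup and the database are exercised realistically instead of one cached result being served over and over. The target URL, and the assumption that the application reads the client address from this header, are invented for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

public class GeoLoadTest {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Hypothetical target; in a real test this would point at the PreLive environment.
    private static final String TARGET = "https://prelive.example.com/content";

    public static void main(String[] args) throws Exception {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 10_000; i++) {
            // A different client IP per request defeats the "everything comes
            // from the same address" effect seen in the original load test.
            String fakeClientIp = rnd.nextInt(1, 224) + "." + rnd.nextInt(0, 256) + "."
                    + rnd.nextInt(0, 256) + "." + rnd.nextInt(1, 255);

            HttpRequest request = HttpRequest.newBuilder(URI.create(TARGET))
                    .header("X-Forwarded-For", fakeClientIp)
                    .GET()
                    .build();

            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " for " + fakeClientIp);
        }
    }
}
```

In practice one would also vary the request paths and run the clients from several machines, but even this small change would likely have exposed the real database load before going live.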

Why do we emphasize that the environment used for load tests should be as similar as possible to the live environment? Various factors contribute to this, including:

  • Differences in network topology and latency times can significantly affect the behavior of the application.
  • The hardware configuration, such as the use of different CPU types or switching from hard drives to SSDs, can lead to performance bottlenecks elsewhere.
  • The behavior of virtual machines or containers often significantly differs from that of physical servers.

These factors can distort test results if the test environment does not reflect the real conditions of the production environment.

Summary

A well-thought-out staging concept is beneficial not only during the initial project phase but also for later troubleshooting and product development. It maximizes the benefits of the resources invested throughout the product’s life cycle.

In conclusion, it should be emphasized that this model is not a universal solution and does not protect against all unforeseen problems, but in our experience it has proven extremely effective and has saved us many sleepless nights.

The Importance of Documentation

For many of us in the IT industry, regardless of specialization, documentation often feels like a tedious obligation, yet it plays a crucial role in software, processes, and operational workflows. Too often, however, documentation is treated as a secondary task that is pushed aside by new projects or seemingly more urgent work. It is not uncommon for us to be responsible for systems that have no documentation at all, or only incomplete or outdated documentation. The same applies to many commercial and open-source products.

Most of us have certainly encountered the following challenges:

  • Installation instructions contain command-line commands that do not work or only worked in older versions.
  • Despite meticulously following the step-by-step instructions, several attempts fail to achieve the desired result.
  • Provided sample configurations prevent the application from starting, result in different behavior than described, or set user rights incorrectly. Often, the options used in the sample configurations and elsewhere in the documentation are not explained.
  • Documentation that merely refers to the source code is incomprehensible without the corresponding expertise.
  • Our own documentation contains inserted jokes, uncommented and unsorted commands, or a note that the documentation will be written at a later date.

Given the constantly growing complexity of the IT landscape (I also wrote about this in my last blog post, Innovation vs. Stability), high-quality documentation is more important than ever. This is relevant not only for trainees or professionals new to the field who need to understand the basics, but also for experts with decades of experience in their area.

Therefore, an appeal to all of us: good documentation can save everyone a lot of frustration, whether it is an additional comment in the source code or a guide that is understandable for people who have never dealt with the topic before, or for someone who has to fix a complex problem half-asleep at three in the morning.

Other Perspectives

Finally, we would like to suggest that interested individuals – whether as software developers, quality assurance staff, project managers, product managers, or in another role – can share their personal perspectives on the topic as a guest post on our blog.
