Level L2 and above
I'll be posting a series of Linux-related questions covering various skill levels. Feel free to share your insights and expertise. Your contributions will benefit learners at all stages, from those in current roles to those preparing for Linux interviews.
Q.) A production database server is experiencing random restarts during its peak load times. Database logs show that these are abrupt terminations, not proper shutdowns. System logs reveal the Linux OOM (Out Of Memory) killer is responsible for ending the database process. Analyse and provide resolution / approach.
Answer:
Since this is a production database server, I'm going to treat it as a critical process. What that means is that I'm going to go to extremes to ensure it is not terminated.
If the Linux OOM Killer is responsible for ending the database process, it's doing so because the available RAM and swap space are being exhausted by some processes, and apparently that production database is being treated as the main culprit by the OOM Killer. The OOM Killer may be terminating other processes as well, but our primary concern right now is the production database server, so the focus will be on how to prevent it from being terminated by the OOM Killer.
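Before touching anything, I'd confirm the OOM kills from the kernel log. A quick sketch (the exact message wording varies by kernel version, so adjust the grep patterns as needed):
# dmesg -T | grep -iE 'out of memory|killed process'
# journalctl -k --since "today" | grep -i 'oom'
This should turn up entries along the lines of "Out of memory: Killed process <PID> (...)", which ties the abrupt database terminations back to the OOM Killer.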
There are several kernel parameters that can be adjusted to modify the behavior of the OOM Killer, but the one I'm going to use is the oom_score_adj parameter. Every process has an OOM score associated with it, which the kernel uses to decide what to kill first, and oom_score_adj is an adjustment to that score with a value between -1000 and +1000. Setting a process's oom_score_adj to -1000 effectively exempts it from the OOM Killer. So, I will set the oom_score_adj value of the database server process to -1000.
Note: I'll leave it to you to choose which tool/utility to use to find the PID of the database server process.
Once you have the PID of the database server process, you can adjust the OOM Score of the process, using the following command:
# echo -1000 > /proc/<PID_of_database_server_process>/oom_score_adj
After making this adjustment to the oom_score_adj parameter for the database server process, the OOM Killer will have to go pick on some other process(es) to terminate when the RAM and swap space is at a critically low level. This adjustment should put an end to the abrupt terminations of our production database server!
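One caveat worth adding: writing to /proc only lasts for the life of that PID, so if the database is restarted the setting is lost. If the database runs as a systemd service, a more durable sketch is a drop-in override - I'm assuming a unit named mariadb.service purely as an example, so substitute your actual database unit:
# systemctl edit mariadb.service
(add the following two lines in the editor that opens)
[Service]
OOMScoreAdjust=-1000
# systemctl restart mariadb.service
The OOMScoreAdjust= directive applies the same -1000 adjustment automatically every time the service starts.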
The End
Q.) Users reported major slowdowns across several applications running on a Linux server. Initially, monitoring showed that CPU and memory usage were only around 60-70%, which made the cause unclear. The operations team noticed the system becoming very unresponsive, with simple commands sometimes taking minutes to complete. This issue worsened during peak usage times and improved slightly during periods of low traffic. Analyse and provide resolution / approach.
Answer:
Since CPU and memory are my performance workhorses, I'm going to use one of my tools/utilities - top, htop, ps, pidstat - to examine which of these components is being taxed the heaviest during peak usage. When I discover the resource hogs, I might go to tools like cpulimit and cgroups to limit the CPU and memory resources that a process can use.
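As a quick illustration of the cgroups route - a minimal sketch using systemd-run on a cgroup-v2 host, where the command being launched and the limit values are just placeholders:
# systemd-run --scope -p MemoryMax=2G -p CPUQuota=50% /usr/local/bin/report-batch
This launches the job in a transient scope capped at 2 GB of memory and half of one CPU, so a single hog can't drag the whole box down.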
As part of my investigation, I may discover that the unresponsiveness is due to heavy swapping. When the system has to rely on the hard disk for memory, the CPU is kept busy managing the swapping process - moving pages between RAM and the much slower swap space. We don't necessarily think about the CPU's involvement when we're talking about swap space, but managing that page movement drives CPU utilization up, and that is exactly what can leave a system unresponsive even when executing simple commands.
I'll see what information the free -m command is showing me. What I'm looking for here is the amount of RAM and swap space being used. If I see that the system is making heavy use of swap, I'll investigate the reason or, if the system permits, just go ahead and invest in adding additional RAM. If my hands are tied, and my system won't accommodate any additional RAM, then I'll have a look at a tool like ulimit to rein in any memory-hungry applications and/or services.
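To make that concrete, here's roughly what I'd run (the vmstat interval and count are arbitrary):
# free -m
# vmstat 5 6
In the vmstat output, the si and so columns show memory swapped in and out per second; sustained non-zero values there, combined with heavily used swap in the free -m output, point to the system thrashing against the disk.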
Q.) Production servers unexpectedly stopped responding. Users faced application errors, and SSH access became unreliable. Our investigation pinpointed the cause: the root filesystem was 100% full. Analyse and provide resolution / approach.
Answer:
Knowing in advance that a 100% full root filesystem is the cause of these issues allows me to focus on what's filling it to capacity.
I'll begin by verifying that the root filesystem is indeed 100% full, by using the following command:
# df -h /
My next move is to locate the directories that are consuming the most space. I'll interrogate that by using the following command:
# du -xh --max-depth=1 / | sort -hr
Note: I'm only interested in the top-level directories on the root filesystem, so that's why I'm using the "--max-depth=1" option with the du command, and the "-x" option keeps du from wandering into other mounted filesystems.
Once I identify the directory that is consuming the most disk space, I'll drill down into it further, using that same du command construct above:
# du -h --max-depth=1 /<directory-in-root-consuming-most-space> | sort -hr
As you would suspect, this will give me a close-up look at any excessively large files and/or directories within it.
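If I'd rather jump straight to individual large files instead of walking down directory by directory, a quick sketch (the 500M threshold is only an example, and -xdev keeps find on the root filesystem):
# find / -xdev -type f -size +500M -exec ls -lh {} \; 2>/dev/null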
Next, I'm going to clean out the package cache. Depending on how many updates or installs have been performed over time, this area can grow significantly, so we'll clean it up. It's a simple and harmless way to free up space. The command to make that happen:
# dnf clean all
Note: This command will remove all cached package files from /var/cache/dnf, which can accumulate over time and consume significant, unnecessary space.
My next act is to make some adjustments to my log files. Since the majority of my log files are in the /var/log directory, I'm going to investigate there using the following command:
# du -sh /var/log/*
Note: This is going to let me see how much space each file and/or directory is consuming.
Once the hogs (i.e., fat log files) are identified, the contents of each log file can be cleared without having to delete the actual file. This can be achieved using the following command:
# truncate -s 0 /var/log/<filename-to-be-cleared>
If you only want to hold on to journal entries for a specific amount of time, the journalctl command gives you a bit more flexibility. For example:
# journalctl --vacuum-time=9d
* This removes archived systemd journal entries older than 9 days in one shot!
* Keep in mind it's a one-time purge of the journal, not a persistent retention policy, and it doesn't touch the plain-text logs under /var/log, so those still need the truncate treatment above.
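If you'd rather enforce that retention automatically instead of vacuuming by hand, journald has its own persistent limits. A minimal sketch, assuming the rest of /etc/systemd/journald.conf stays at its defaults (the size value is just an example):
# vi /etc/systemd/journald.conf
(uncomment/add these lines under the [Journal] section)
MaxRetentionSec=9day
SystemMaxUse=500M
# systemctl restart systemd-journald
MaxRetentionSec= drops journal entries older than the given age, and SystemMaxUse= caps how much disk the journal may consume in total.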
These steps should get you back on the road with a system that no longer has a root filesystem that is 100% full.
Additional food for thought:
I didn't mention anything about getting rid of unused kernels (files in the /boot directory), but it might be worth having a look at which installed kernels are absolutely not being used. These files are not humongous, but if there's no use for them, why keep them around - a penny saved is a penny earned.
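On a DNF-based system, one way to prune them - this is meant to keep the running kernel and the newest installonly packages (which include kernels), but review the transaction before confirming:
# dnf remove --oldinstallonly
The number of kernels retained going forward is governed by the installonly_limit setting in /etc/dnf/dnf.conf, so that's also worth a glance.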
Finally, to prevent running into this "100% full" situation again, it would be a very good idea to put something in place to routinely monitor the utilization of the root filesystem and send some sort of alert to notify the sysadmin when a certain threshold is reached.
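As a sketch of what that could look like - a small cron-driven check where the script path, threshold, and alert mechanism are all assumptions to adapt to your environment:
# cat /usr/local/sbin/check_rootfs.sh
#!/bin/bash
# Alert when the root filesystem crosses a usage threshold (the percentage is just an example)
THRESHOLD=85
USAGE=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    logger -p user.warning "Root filesystem usage is at ${USAGE}% (threshold ${THRESHOLD}%)"
    # If a mail transfer agent is configured, a mail to the sysadmin could go here as well
fi
Then schedule it, for example every 30 minutes via /etc/crontab:
*/30 * * * * root /usr/local/sbin/check_rootfs.sh
Swap logger out for whatever alerting your environment already uses.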
Well, I've rambled enough, so I think I'll stop here!
Thanks @Trevor for pointing in the right direction. Your contributions and engagement in this Linux interview series have been unparalleled from the beginning. You have offered correct, thought-provoking approaches and troubleshooting strategies for almost all of the questions posted. I am 100% sure that any budding Linux admin who just goes through them will be able to ace any Linux interview.
1000 kudos to you!