Q.) You have received a high-priority incident in your support queue about a particular Linux server being too slow to respond. As an admin, how would you handle, troubleshoot, and gather information about the issue?
Q.) Explain what is happening here:
sysctl -a | grep -i Commit && grep -i Commit /proc/meminfo
vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 1
vm.overcommit_ratio = 500
CommitLimit: 23449596 kB
Committed_AS: 23578974 kB
Q.) How would you approach troubleshooting this error:
fsck.ext4: Bad magic number in super-block while trying to open /dev/vdb
/dev/vdb: The superblock could not be read or does not describe a correct ext4 filesystem
I'll be posting a series of Linux-related questions covering various skill levels. Feel free to share your insights and expertise. Your contributions will benefit learners at all stages, from those in current roles to those preparing for Linux interviews.
I'd like to respond to that 2nd question, the one showing some output based on the
/proc/meminfo file.
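For anyone following along, here is a minimal sketch of the arithmetic behind that output, using only the numbers quoted in the question. It is one way to read them, not necessarily the explanation given in this thread: Committed_AS is above CommitLimit, which the kernel allows because vm.overcommit_memory = 1 means "always overcommit", and CommitLimit is only enforced in mode 2 (strict accounting).

```shell
#!/bin/sh
# Values copied from the output quoted in the question (all in kB).
commit_limit=23449596
committed_as=23578974
overcommit_memory=1

# How far the committed address space exceeds the limit.
over_kb=$((committed_as - commit_limit))
echo "Committed_AS exceeds CommitLimit by ${over_kb} kB"

# CommitLimit is only adhered to in mode 2 (strict overcommit accounting).
if [ "$overcommit_memory" -eq 2 ]; then
    echo "mode 2: allocations beyond CommitLimit would be refused"
else
    echo "mode ${overcommit_memory}: CommitLimit is informational, not enforced"
fi
```

With the quoted values the difference works out to 129378 kB.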
@Trevor Thanks for pointing that out! I have modified the question accordingly. Your explanation is spot on!
Here's my take on that first question, the slow server.
My research shows that the three primary causes of a sluggish Linux system are:
1) CPU
2) RAM
3) Disk I/O
Several tools can be used to investigate how these areas of a system are
performing. Two of the more popular and useful ones are: 1) top and
2) sar
One of the beauties of the top utility is that it provides a real-time look at what's
happening on a Linux system.
When looking at the output of the top utility, one of the first pieces of information to check is the "load average". Three values are shown for the load average, covering the past one, five, and fifteen minutes of system operation. To put those values into perspective, an ideal load average is lower than the number of CPUs in the Linux server. For example, with only one CPU in the server, it's best if the load average stays below 1. Generally speaking, if the 1-minute average is above the number of CPUs on the system, the system is most likely CPU bound. Keep in mind that on Linux the load average also counts tasks blocked in uninterruptible I/O wait, so a high value can point to disk I/O as well.
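As a rough sketch of that rule of thumb, the comparison can be scripted. The helper name load_status is made up for this post, and the sample numbers are invented; on a live system the inputs could come from /proc/loadavg and nproc (note that nproc reports logical CPUs available to the process, which may differ from the physical count).

```shell
#!/bin/sh
# load_status LOAD NCPU: compare a load average against the CPU count.
# awk does the floating-point comparison that plain sh arithmetic cannot.
load_status() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load + 0 > ncpu + 0) print "over"; else print "ok"
    }'
}

# On a live Linux box the inputs could be read like this:
#   read one five fifteen rest < /proc/loadavg
#   load_status "$one" "$(nproc)"

# Invented values for illustration:
load_status 0.75 1   # one CPU, load below 1    -> ok
load_status 6.20 4   # four CPUs, load above 4  -> over
```

An "over" result is only a hint to dig further with the CPU and I/O figures described next, not a diagnosis on its own.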
Other information worth checking in the output of the top utility is the following CPU breakdown (press 1 inside top to show each CPU separately if multiple CPUs exist on the system):
- us: This percentage represents the amount of CPU time consumed by user-space processes.
- sy: This percentage represents the amount of CPU time spent in the kernel (system mode).
- id: This percentage represents how idle each CPU is.
Again, this information represents real-time statistics. The values for us, sy, and id will
identify if the CPU (or CPUs) is bound by user processes or system processes.
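To pull just those three values out of a single top snapshot, something like the sketch below could work. It assumes the summary line looks like the one printed by recent procps-ng versions of top ("%Cpu(s):  5.9 us,  2.0 sy, ..."); older versions and some locales format the line differently. The helper name cpu_breakdown and the sample line are invented for illustration.

```shell
#!/bin/sh
# cpu_breakdown: read a top "%Cpu(s):" summary line on stdin and print
# the us, sy and id percentages as name=value pairs.
cpu_breakdown() {
    sed -e 's/^[^:]*://' |      # drop the "%Cpu(s):" label
        tr ',' '\n' |           # one "value name" pair per line
        awk '$2 == "us" || $2 == "sy" || $2 == "id" { printf "%s=%s\n", $2, $1 }'
}

# On a live system the line could come from batch mode:
#   top -bn1 | grep -i 'cpu(s)' | cpu_breakdown

# Invented sample line:
printf '%s\n' '%Cpu(s):  5.9 us,  2.0 sy,  0.0 ni, 91.2 id,  0.7 wa,  0.0 hi,  0.2 si,  0.0 st' |
    cpu_breakdown
```

A high us with a low id suggests user processes are the bottleneck; a high sy points at kernel work instead.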
Moving on to the sar utility: it reports system data that is collected every 10 minutes by default. However, this collection interval can be changed by editing the /etc/cron.d/sysstat file.
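Here is a small sketch of what that interval change amounts to. The sample cron entry is an assumption based on what sysstat packages that use cron typically ship (the sa1 path can differ between releases, and on systems where sysstat is driven by a systemd timer the cron file may not exist at all). The helper set_sa1_interval is hypothetical and only prints the rewritten text; it does not touch the real file.

```shell
#!/bin/sh
# set_sa1_interval MINUTES: read cron text on stdin and print it with the
# sa1 collection schedule changed to every MINUTES minutes.
set_sa1_interval() {
    sed -e "/sa1/s|^\*/[0-9][0-9]*|*/$1|"
}

# Assumed typical entry, changed from every 10 minutes to every 5:
printf '%s\n' '*/10 * * * * root /usr/lib64/sa/sa1 1 1' | set_sa1_interval 5
# prints: */5 * * * * root /usr/lib64/sa/sa1 1 1
```

A shorter interval gives finer-grained history at the cost of larger daily data files.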
The command sar -u provides information about all CPUs on the system, starting at midnight.
As is the case with the top utility, the main things to view in the sar -u command output are %user, %system, %iowait, and %idle. This information can tell you how far back the server has been having issues.
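To find how far back the trouble started, the sar -u output can be filtered for samples where the CPU was barely idle. The sketch below assumes %idle is the last column, which is the case in the sar -u layouts I have seen, and that numbers use a decimal point. The helper low_idle and the sample rows are invented.

```shell
#!/bin/sh
# low_idle THRESHOLD: read `sar -u` output on stdin and print the samples
# whose %idle (assumed to be the last column) is below THRESHOLD.
low_idle() {
    awk -v t="$1" '
        $1 == "Average:" { next }    # skip the summary row
        # Header and banner lines do not end in a number, so they drop out.
        $NF ~ /^[0-9]+[.][0-9]+$/ && $NF + 0 < t + 0 { print }
    '
}

# On a live system:  sar -u | low_idle 20

# Invented sample data:
low_idle 20 <<'EOF'
12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      3.10      0.00      1.20      0.40      0.00     95.30
12:20:01 AM     all     71.55      0.00     12.30      9.80      0.00      6.35
Average:        all     37.33      0.00      6.75      5.10      0.00     50.82
EOF
```

The earliest line printed is a reasonable first guess at when the problem began.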
To check RAM usage, the sar -r command reports memory usage for the current day, starting at midnight.
The main things to look for in RAM usage are %memused and %commit. An important note about the %commit field: it can show above 100% because the Linux kernel routinely overcommits RAM. If %commit is consistently over 100%, that could be an indicator that the system needs more RAM.
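Since the column layout of sar -r differs between sysstat versions, one cautious way to pick out the %commit samples above 100% is to locate the column from the header row rather than by position. This is only a sketch; the helper over_commit and the sample rows are invented.

```shell
#!/bin/sh
# over_commit [LIMIT]: read `sar -r` output on stdin and print the samples
# whose %commit value is above LIMIT (default 100).
over_commit() {
    awk -v limit="${1:-100}" '
        $1 == "Average:" { next }    # summary row has a different layout
        {
            # Remember which column the header row labels "%commit".
            for (i = 1; i <= NF; i++) if ($i == "%commit") { col = i; next }
        }
        col && NF >= col && $col + 0 > limit + 0 { print }
    '
}

# On a live system:  sar -r | over_commit

# Invented sample data:
over_commit 100 <<'EOF'
12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
12:10:01 AM    812340   7312000     90.00      1024   2048000   6500000     80.01
12:20:01 AM    402112   7722228     95.05      1024   2048000   9100000    112.01
EOF
```

If most of the day's samples show up here, that matches the "consistently over 100%" case described above.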
For disk I/O performance, use sar -d, which reports activity for each block device. For this output, looking at %util and await will give you a good overall picture of disk I/O on the system. The %util field refers to the utilization of that device. The await field is the average time, in milliseconds, that I/O requests take to be served, including the time spent waiting in the queue as well as the time spent servicing them. There is no one-size-fits-all await value that signals a problem; the value that matters will depend on the specific environment.
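As a side note, to get the name of the disk device(s), use the command: # sar -dp (lowercase -p pretty-prints the device names; uppercase -P selects processors and expects an argument)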
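To get that overall picture per disk, the sar -d samples can be boiled down to an average await and a peak %util for each device. The sketch below looks up the DEV, await, and %util columns from the header row, since the other columns vary between sysstat versions. The helper disk_summary and the sample rows are invented for illustration.

```shell
#!/bin/sh
# disk_summary: read `sar -d` output on stdin and print, per device,
# the mean await (ms) and the highest %util seen.
disk_summary() {
    awk '
        $1 == "Average:" { next }
        {
            hdr = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "DEV")   { d = i; hdr = 1 }
                if ($i == "await") a = i
                if ($i == "%util") u = i
            }
            # Skip header rows and anything seen before the columns are known.
            if (hdr || !d || !a || !u || NF < u) next
            n[$d]++
            sum[$d] += $a
            if ($u + 0 > max[$d]) max[$d] = $u + 0
        }
        END {
            for (dev in n)
                printf "%s await_avg=%.2f util_max=%.2f\n", dev, sum[dev] / n[dev], max[dev]
        }
    ' | sort
}

# On a live system:  sar -dp | disk_summary

# Invented sample data:
disk_summary <<'EOF'
12:00:01 AM DEV   tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
12:10:01 AM sda 10.00   100.00   200.00    30.00     0.10  4.00  1.00 12.00
12:10:01 AM sdb 50.00   900.00   800.00    34.00     3.20 60.00  2.00 97.50
12:20:01 AM sda 12.00   110.00   210.00    26.67     0.12  6.00  1.00 15.00
12:20:01 AM sdb 55.00   950.00   850.00    32.73     3.90 80.00  2.00 99.10
EOF
```

In this made-up data sdb stands out on both measures, which is the kind of contrast worth looking for before deciding what counts as a bad await in your own environment.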
Whatever your tool of choice for gleaning performance information, you'll want something that reports on the all-important resources: CPU, RAM, and disk I/O.
Red Hat
Learning Community
A collaborative learning environment, enabling open source skill development.