ATB Team

Effective Linux Server Performance Troubleshooting to Maximize Efficiency


Linux servers are the backbone of many high-performance computing systems. However, like any complex system, they occasionally suffer performance bottlenecks or degradation. When that happens, systematic performance troubleshooting is what gets the server back to optimal functioning.

In this guide, we’ll walk you through the best practices, tools, and techniques for troubleshooting common performance issues on Linux servers. From analyzing CPU usage to optimizing memory and disk I/O, we’ll cover it all.

1. Check System Resource Utilization

CPU Usage

top

Example Output:

top - 15:25:45 up 1 day,  3:52,  3 users,  load average: 1.15, 1.34, 1.56
Tasks: 123 total,   2 running, 121 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.3 us,  2.7 sy,  0.0 ni, 85.0 id,  0.0 wa,  1.3 hi,  0.7 si,  0.0 st
MiB Mem :  16202.1 total,  12134.5 free,   1256.3 used,   1811.3 buff/cache
MiB Swap:  2048.0 total,   2048.0 free,      0.0 used.   13123.4 avail Mem
  • Interpretation: The load averages over the last 1, 5, and 15 minutes are 1.15, 1.34, and 1.56, which is a moderate load for a 2-core system. If these values consistently exceed the number of CPU cores (for example, above 4 on a 4-core system), the system is likely overloaded.
  • CPU usage shows that 10.3% of CPU time is spent in user space (us), 2.7% in the kernel (sy), and 85% is idle (id). I/O wait (wa) is 0.0%, so the CPU is not stalled waiting on disk either. This means the system is not under CPU pressure.
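
To track the idle figure from a script rather than reading it by eye, the summary line can be parsed. The following is a minimal sketch, assuming a procps-ng style top and the C locale; the helper name cpu_busy_pct is hypothetical and not part of any tool.

```shell
# cpu_busy_pct: read a top-style "%Cpu(s)" summary line on stdin and print
# the non-idle percentage (100 minus the "id" value).
cpu_busy_pct() {
    awk '
    /Cpu\(s\)/ {
        # top can omit the space after a comma or colon (e.g. "ni,100.0 id"),
        # so turn both into spaces before the line is split into fields.
        gsub(/[,:]/, " ")
        for (i = 2; i <= NF; i++)
            if ($i == "id") { printf "%.1f\n", 100 - $(i - 1); exit }
    }'
}

# Possible usage (batch mode, one iteration):
#   LC_ALL=C top -bn1 | cpu_busy_pct
```

Fed the example line above, the function prints 15.0, matching the 85% idle reading.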

Memory Usage

free -h

Example Output:

              total        used        free      shared  buff/cache   available
Mem:           16Gi        3.5Gi        11Gi        322Mi        1.5Gi        12Gi
Swap:          2.0Gi        0.0Gi        2.0Gi
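  • Interpretation: The system has 16 GiB of RAM, of which 3.5 GiB is used, 11 GiB is free, and 1.5 GiB holds buffers and cache; swap is unused. The available column (12Gi) is the most useful figure: it estimates how much memory new workloads can claim without swapping, so there is no memory pressure here.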
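
The same check can be scripted from /proc/meminfo, which is where free gets its numbers. This is a small sketch, assuming a kernel recent enough to expose the MemAvailable field; mem_available_pct is a hypothetical helper name.

```shell
# mem_available_pct: read /proc/meminfo-formatted text on stdin and print
# MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}

# Possible usage:
#   mem_available_pct < /proc/meminfo
```

A low percentage combined with growing swap usage is a stronger sign of memory pressure than a low "free" value alone.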

Disk Usage

df -h

Example Output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   30G   18G  63% /
tmpfs            4.0G  1.2M  4.0G   1% /dev/shm
  • Interpretation: /dev/sda1 is 63% full, which is acceptable, but keep monitoring it as usage grows. The tmpfs mount at /dev/shm is almost empty, which is typical for in-memory temporary storage.
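
For routine monitoring, the Use% column can be filtered against a threshold. The sketch below assumes the POSIX output format of df -P and mount points without spaces; fs_over_threshold and the 80% figure are illustrative choices, not recommendations from any tool.

```shell
# fs_over_threshold LIMIT: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above LIMIT percent.
fs_over_threshold() {
    awk -v limit="$1" '
    NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # "63%" -> "63"
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Possible usage:
#   df -P | fs_over_threshold 80
```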

iostat -xz 1

Example Output:

Device            r/s     w/s   rkB/s   wkB/s   avgrq-sz   avgqu-sz   await   svctm  %util
sda               2.15    1.32   123.5   45.1    83.1      0.03      5.5    4.4    20%
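  • Interpretation: The disk sda is performing well. The average request size (avgrq-sz) is 83.1 sectors, which at 512 bytes per sector is roughly 41 KiB, and the average wait time (await) is 5.5 ms. At 20% utilization, there is no apparent disk I/O bottleneck. Newer sysstat releases may rename some of these columns (for example, aqu-sz in place of avgqu-sz) and no longer report svctm, so your header can differ from this example.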

2. Identify High-CPU Processes

ps aux --sort=-%cpu | head -n 10

Example Output:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1234  45.2  3.0  786432 24576 ?       S    15:10   0:15 /usr/bin/python3 app.py
user1     2345  15.1  2.2  256000 10240 ?       S    15:20   0:04 /bin/bash
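  • Interpretation: The Python application (/usr/bin/python3 app.py) is consuming about 45% of one CPU core, which is significant. Note that ps reports %CPU as CPU time divided by the time the process has been running, so it is a lifetime average; use top to confirm current usage. If this is unexpected, further investigation into the application is needed.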
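
When the process list is long, it can help to keep only entries above a cutoff. A minimal sketch, assuming the three-column layout produced by ps -eo pid,pcpu,comm; cpu_hogs and the 20% cutoff are hypothetical.

```shell
# cpu_hogs LIMIT: read "PID %CPU COMMAND" rows on stdin (header on line 1)
# and print the rows whose %CPU is at or above LIMIT.
cpu_hogs() {
    awk -v limit="$1" 'NR > 1 && $2 + 0 >= limit + 0 { print $1, $2, $3 }'
}

# Possible usage:
#   ps -eo pid,pcpu,comm --sort=-pcpu | cpu_hogs 20
```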

3. Check Load Average

uptime

Example Output:

 15:25:45 up 1 day,  3:52,  3 users,  load average: 1.15, 1.34, 1.56
  • Interpretation: The load averages (1.15, 1.34, 1.56) are all below 2, so a system with two or more cores still has spare CPU capacity. If they stayed above 2 on a 2-core system, it would suggest CPU overload.
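
Because the raw load average only means something relative to the core count, dividing one by the other gives a number that is comparable across machines. The helper below is a sketch; load_per_core is a hypothetical name, and a result persistently above 1.00 is the rough warning sign described above.

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES with two decimals.
# Returns a non-zero status if CORES is not a positive number.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores + 0 <= 0) exit 1
        printf "%.2f\n", load / cores
    }'
}

# Possible usage (1-minute load average and online core count):
#   load_per_core "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

For the example above, load_per_core 1.56 2 prints 0.78, meaning the 15-minute load used about 78% of the capacity of two cores.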

4. Disk I/O Bottlenecks

iostat -x 1

Example Output:

Device            r/s     w/s   rkB/s   wkB/s   avgrq-sz   avgqu-sz   await   svctm  %util
sda               2.15    1.32   123.5   45.1    83.1      0.03      5.5    4.4    20%
  • Interpretation: The disk sda is performing well with low await time (5.5 ms), and the disk utilization is at 20%. If await time is consistently high (e.g., >100 ms), it could indicate a disk I/O bottleneck.
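
Since column positions in iostat -x output differ between sysstat versions, a script is safer locating %util by its header name than by position. This is a sketch under that assumption; busy_disks and the 80% cutoff are hypothetical.

```shell
# busy_disks LIMIT: read `iostat -x` output on stdin and print each device
# whose %util value is at or above LIMIT.
busy_disks() {
    awk -v limit="$1" '
    $1 ~ /^Device/ {
        # Find the %util column by name on every header line.
        util = 0
        for (i = 1; i <= NF; i++) if ($i == "%util") util = i
        next
    }
    util && NF >= util {
        v = $util
        sub(/%/, "", v)                # tolerate a trailing percent sign
        if (v + 0 >= limit + 0) print $1, v
    }'
}

# Possible usage (two reports, one second apart):
#   iostat -dx 1 2 | busy_disks 80
```

The first report iostat prints covers the time since boot, so the second report in the usage line is the one that reflects current activity.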

5. Network Performance Issues

netstat -i

Example Output:

Iface   MTU  RX-OK  RX-ERR  RX-DRP  TX-OK  TX-ERR  TX-DRP  Flg
eth0    1500  100000  0       0       50000  0       0       BMRU
lo      65536 2000    0       0       2000   0       0       LRU
  • Interpretation: The network interface eth0 is receiving and transmitting data normally, with no errors or drops (RX-ERR, TX-ERR, RX-DRP, and TX-DRP are all 0). Non-zero values that keep growing in these columns point to packet loss or other network problems.
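
Scanning the error and drop columns can be automated by looking them up in the header row. A minimal sketch, assuming the netstat -i table layout shown above; iface_errors is a hypothetical helper name.

```shell
# iface_errors: read `netstat -i` output on stdin and print one line per
# interface and counter for every non-zero error or drop counter.
iface_errors() {
    awk '
    $1 == "Iface" {
        # Remember which columns hold RX/TX error and drop counters.
        n = 0
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[RT]X-(ERR|DRP)$/) { col[++n] = i; name[n] = $i }
        next
    }
    n {
        for (k = 1; k <= n; k++)
            if ($(col[k]) + 0 > 0) print $1, name[k] "=" $(col[k])
    }'
}

# Possible usage:
#   netstat -i | iface_errors
```

The counters are cumulative since the interface came up, so compare two runs a few minutes apart to see whether they are still increasing.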

6. Check System Logs for Errors

dmesg | grep -i error

Example Output:

[ 123.456789] blk_update_request: I/O error, dev sda, sector 12345678
  • Interpretation: This log indicates a disk I/O error on device sda. If these errors persist, the disk may be failing and should be checked further with tools like smartctl.
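
To judge whether such errors persist, it can help to count them per device. The sketch below assumes kernel messages of the form shown in the example ("I/O error, dev NAME"); io_errors_by_dev is a hypothetical helper name.

```shell
# io_errors_by_dev: read kernel log text on stdin and print
# "device count" for every device that logged an I/O error.
io_errors_by_dev() {
    awk '
    match($0, /I\/O error, dev [A-Za-z0-9_-]+/) {
        # Skip the 15-character prefix "I/O error, dev " to get the name.
        dev = substr($0, RSTART + 15, RLENGTH - 15)
        count[dev]++
    }
    END { for (d in count) print d, count[d] }' | sort
}

# Possible usage (reading the kernel log may require root):
#   dmesg --level=err,warn | io_errors_by_dev
```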

7. Investigate Running Services

systemctl list-units --type=service

Example Output:

UNIT                            LOAD   ACTIVE SUB     DESCRIPTION
apache2.service                 loaded active running Apache2 web server
mysql.service                   loaded active running MySQL Community Server
...
  • Interpretation: Check if unnecessary services (like unused web servers or databases) are consuming system resources. If they’re not needed, stop them with systemctl stop and keep them from starting at boot with systemctl disable.
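
To see what each running service actually costs, the unit names can be extracted and passed to systemctl show. This is a sketch, assuming a systemd version that supports --value and has memory accounting enabled (otherwise MemoryCurrent may read "[not set]"); service_names is a hypothetical helper name.

```shell
# service_names: read `systemctl list-units` rows on stdin and print only
# the unit names that end in ".service".
service_names() {
    awk '$1 ~ /\.service$/ { print $1 }'
}

# Possible usage: print current memory use (in bytes) per running service.
#   systemctl list-units --type=service --state=running --no-legend --plain |
#       service_names |
#       while read -r unit; do
#           printf "%s %s\n" "$unit" "$(systemctl show -p MemoryCurrent --value "$unit")"
#       done
```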

8. Check for Resource Limits

ulimit -a

Example Output:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
...
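  • Interpretation: ulimit -a shows the limits that apply to the current shell session. If the limit for open files (nofile, shown by ulimit -n) or processes (nproc, shown by ulimit -u) is set too low, programs can fail, for example with "Too many open files" errors. Adjust limits for login sessions in /etc/security/limits.conf. Services started by systemd take their limits from unit-file directives such as LimitNOFILE= instead.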
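
Because ulimit only reports the current shell, the limit of an already running process is better read from /proc/PID/limits. A minimal sketch, assuming the standard layout of that file; nofile_soft_limit is a hypothetical helper name and $pid stands for the process you are investigating.

```shell
# nofile_soft_limit: read /proc/PID/limits-formatted text on stdin and print
# the soft limit for open files (fourth field of the "Max open files" row).
nofile_soft_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Possible usage: compare the limit with the number of descriptors in use.
#   nofile_soft_limit < "/proc/$pid/limits"
#   ls "/proc/$pid/fd" | wc -l
```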

9. Review Cron Jobs

crontab -l
cat /etc/crontab

Example Output:

# m h  dom mon dow   user  command
0 2 * * * root /usr/bin/backup.sh
  • Interpretation: The example entry runs backup.sh once a day at 02:00. Check if any cron jobs are running too frequently or consuming excessive resources. For example, if a backup job is scheduled to run every minute, it could create unnecessary load.
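
Entries that fire every minute are the usual suspects, and they can be picked out by their minute field. The sketch below only catches the literal forms * and */1 and skips comment lines; every_minute_jobs is a hypothetical helper name and poll.sh in the test data is made up.

```shell
# every_minute_jobs: read crontab lines on stdin and print the entries whose
# minute field is "*" or "*/1", i.e. jobs scheduled every minute.
every_minute_jobs() {
    awk '$1 !~ /^#/ && NF >= 6 && ($1 == "*" || $1 == "*/1")'
}

# Possible usage:
#   crontab -l | every_minute_jobs
#   every_minute_jobs < /etc/crontab
```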

This process allows you to identify and interpret performance bottlenecks in different areas, such as CPU, memory, disk, network, and running processes. Depending on your results, you can take action such as optimizing code, upgrading hardware, or adjusting system configurations.

Frequently Asked Questions (FAQ) for Linux Server Performance Troubleshooting

1. How do I know if my server is under heavy load?

Q: How do I check if my server is under heavy load?

A: You can check the load average using the uptime or top command. On Linux, the load average is the average number of processes that are running, waiting for CPU time, or blocked in uninterruptible sleep (usually waiting on disk I/O), measured over 1, 5, and 15-minute intervals. If these values consistently exceed the number of CPU cores, the server is overloaded.


2. How do I diagnose high CPU usage?

Q: What should I do if my server shows high CPU usage?

A: Use the ps command to list the processes consuming the most CPU. Focus on high-CPU processes, and identify if they are expected or require optimization. If a particular process is consuming too much CPU unexpectedly, investigate or optimize that application.


3. How can I check memory usage on my server?

Q: How do I check if my server is running low on memory?

A: Use the free -h command to check used, free, cached, and available memory. If the available column is close to zero and swap is being used heavily, your server is likely under memory pressure. A low free value on its own is normal, because Linux uses spare RAM for cache.


4. How can I troubleshoot disk performance issues?

Q: How do I check for disk performance issues on my Linux server?

A: Use the iostat command to monitor disk I/O performance. If disk utilization is high (close to 100%), or the average wait time (await) is large, it may indicate disk I/O bottlenecks. Also, check for disk space usage using the df -h command.


5. How can I check network performance issues?

Q: How can I check if there are network issues affecting my server?

A: Use netstat -i to check for errors in network interfaces. If you see packet loss or errors, it could be a sign of network issues. You can also use ping or traceroute to diagnose latency or connectivity problems.


6. What should I do if my server is running low on disk space?

Q: How do I check disk usage and resolve low disk space issues?

A: Use the df -h command to see disk usage across all filesystems. If any filesystem is near 100% usage, identify large files or directories using the du -sh command to free up space. Consider moving or archiving old files.


7. How do I check if any hardware issues are affecting performance?

Q: How can I detect hardware issues such as disk failures or memory errors?

A: Use dmesg | grep -i error to check the kernel log for hardware-related errors, such as disk I/O errors or memory issues. Additionally, tools like smartctl can help monitor the health of disks, and a memory tester such as memtest86+ can be used to check RAM for errors.


8. How can I investigate running services consuming too many resources?

Q: How do I check if any services are consuming too many resources?

A: Use systemctl list-units --type=service to list active services. Check the resource usage of each service (using top or ps) and stop or optimize any unnecessary services.


9. How can I monitor system limits and adjust resource usage?

Q: How do I check and adjust system resource limits like open files or processes?

A: Use ulimit -a to check current resource limits. If needed, adjust these limits in /etc/security/limits.conf for users or system-wide.


10. How do I troubleshoot high disk I/O?

Q: How can I check and fix high disk I/O on my Linux server?

A: Use iostat to check for high disk utilization (%util) or long wait times (await). If disk utilization is high, consider moving some data to another disk, optimizing the application, or upgrading the disk.
