Linux server performance troubleshooting
Linux servers are the backbone of many high-performance computing systems. Like any complex system, they occasionally suffer performance bottlenecks or degradation, and troubleshooting them methodically is critical to keeping services running well.
This guide walks through the best practices, tools, and techniques for troubleshooting common performance issues on Linux servers, from analyzing CPU usage to checking memory, disk I/O, network interfaces, logs, services, resource limits, and cron jobs.
1. Check System Resource Utilization
CPU Usage
top
Example Output:
top - 15:25:45 up 1 day, 3:52, 3 users, load average: 1.15, 1.34, 1.56
Tasks: 123 total, 2 running, 121 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.3 us, 2.7 sy, 0.0 ni, 85.0 id, 0.0 wa, 1.3 hi, 0.7 si, 0.0 st
MiB Mem : 16202.1 total, 12134.5 free, 1256.3 used, 1811.3 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 13123.4 avail Mem
- Interpretation: The load average is 1.15, 1.34, 1.56 (the 1-, 5-, and 15-minute averages), which is comfortable on a 2-core system. If these values stay above the number of CPU cores (e.g., above 4 on a 4-core machine), the system is likely overloaded.
- CPU usage shows that 10.3% of CPU time is spent in user space (us), 2.7% in the kernel (sy), and 85% idle (id). I/O wait (wa) is 0.0, so the CPU is not stalled waiting on disk either. This means the system is not under CPU pressure.
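The percentages on the %Cpu(s) line come from counters the kernel exposes on the first line of /proc/stat. As a rough sketch of that arithmetic (the function names and the two sample lines are made up for illustration), the Python below takes two readings of the counters and reports the share of elapsed time in each state. The sample values were chosen so the result matches the top output above.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat:
# user, nice, system, idle, iowait, irq, softirq, steal.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:9]]


def cpu_percentages(before, after):
    """Share of elapsed CPU time per state between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("the second sample must be later than the first")
    return {name: round(100.0 * d / total, 1) for name, d in zip(FIELDS, deltas)}


if __name__ == "__main__":
    # Made-up samples. On a live system, read the first line of /proc/stat
    # twice, about a second apart, and pass each through parse_cpu_line().
    first = parse_cpu_line("cpu  1000 0 200 8000 50 10 5 0 0 0")
    second = parse_cpu_line("cpu  1103 0 227 8850 50 23 12 0 0 0")
    print(cpu_percentages(first, second))
```

Working from deltas rather than the raw counters matters because the counters accumulate since boot; a single reading only tells you the average since the machine started.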
Memory Usage
free -h
Example Output:
              total        used        free      shared  buff/cache   available
Mem:           16Gi       3.5Gi        11Gi       322Mi       1.5Gi        12Gi
Swap:         2.0Gi       0.0Gi       2.0Gi
- Interpretation: The system has 16 GB of RAM, of which 3.5 GB is used and 11 GB is free; swap is unused. The available figure (12Gi), which estimates how much memory new processes can claim without swapping, indicates there’s no memory pressure.
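The same figures can be read programmatically from /proc/meminfo, which is where free gets them. The sketch below is illustrative only: the function names, the sample text, and the 10% threshold are assumptions rather than established rules. It flags pressure only when little memory is available and swap is actually in use.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (KiB for sized fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_pressure(info, min_available_pct=10.0):
    """Summarize available memory and swap use; the threshold is an assumption."""
    available_pct = 100.0 * info["MemAvailable"] / info["MemTotal"]
    swap_used_kib = info["SwapTotal"] - info["SwapFree"]
    return {
        "available_pct": round(available_pct, 1),
        "swap_used_kib": swap_used_kib,
        "under_pressure": available_pct < min_available_pct and swap_used_kib > 0,
    }


# Made-up sample in the /proc/meminfo layout. On a live system, pass
# open("/proc/meminfo").read() to parse_meminfo() instead.
SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:        11500000 kB
MemAvailable:   12000000 kB
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
"""

if __name__ == "__main__":
    print(memory_pressure(parse_meminfo(SAMPLE_MEMINFO)))
```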
Disk Usage
df -h
Example Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 30G 18G 63% /
tmpfs 4.0G 1.2M 4.0G 1% /dev/shm
- Interpretation: /dev/sda1 is 63% full, which is acceptable but worth monitoring as it grows. The tmpfs mount is almost empty, which is normal: it is a memory-backed filesystem used for temporary files.
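Note that Used plus Avail (30G + 18G) is less than Size (50G); filesystems commonly reserve some blocks, so df appears to compute Use% from used and available space rather than from the total. A small sketch of that arithmetic, assuming df rounds the percentage up (the helper name is hypothetical):

```python
import shutil


def use_percent(used, available):
    """Use% as used / (used + available), rounded up to a whole percent."""
    denominator = used + available
    if denominator == 0:
        return 0
    # Negating, floor-dividing, and negating again gives ceiling division.
    return -(-used * 100 // denominator)


if __name__ == "__main__":
    # shutil.disk_usage reports space available to unprivileged users as "free".
    usage = shutil.disk_usage("/")
    print(f"/ is {use_percent(usage.used, usage.free)}% full")
```

With the figures from the example output, use_percent(30, 18) gives 63, matching the Use% column.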
iostat -xz 1
Example Output:
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 2.15 1.32 123.5 45.1 83.1 0.03 5.5 4.4 20%
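- Interpretation: The disk sda shows no sign of trouble: the average wait time (await) is 5.5 ms and the device is 20% utilized, so there’s no apparent disk I/O bottleneck. The average request size (avgrq-sz) of 83.1 is reported in 512-byte sectors, so it corresponds to roughly 41 KB, not 83.1 KB. Newer sysstat releases rename or split some of these columns (for example, r_await and w_await in place of await), so the header on your system may differ.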
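To make await and %util less opaque, here is a hedged sketch of how they can be derived from two readings of /proc/diskstats. The field positions follow the kernel's documented layout; the function names and sample lines are invented for illustration, and iostat itself may differ in details.

```python
def parse_diskstats_line(line):
    """Pick out the counters needed for await and %util from one diskstats line."""
    p = line.split()
    return {
        "name": p[2],
        "reads": int(p[3]),      # reads completed
        "read_ms": int(p[6]),    # milliseconds spent reading
        "writes": int(p[7]),     # writes completed
        "write_ms": int(p[10]),  # milliseconds spent writing
        "busy_ms": int(p[12]),   # milliseconds the device had I/O in progress
    }


def io_metrics(before, after, interval_ms):
    """Average wait per request and device utilization over the interval."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    wait_ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    busy_ms = after["busy_ms"] - before["busy_ms"]
    return {
        "await_ms": round(wait_ms / ios, 2) if ios else 0.0,
        "util_pct": round(100.0 * busy_ms / interval_ms, 1),
    }


if __name__ == "__main__":
    # Made-up samples taken one second apart. On a live system, read the line
    # for your device from /proc/diskstats twice instead.
    t0 = parse_diskstats_line("8 0 sda 1000 0 50000 4000 500 0 20000 3000 0 6000 7000")
    t1 = parse_diskstats_line("8 0 sda 1002 0 50200 4010 501 0 20080 3005 0 6200 7015")
    print(io_metrics(t0, t1, interval_ms=1000))
```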
2. Identify High-CPU Processes
ps aux --sort=-%cpu | head -n 10
Example Output:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1234 45.2 3.0 786432 24576 ? S 15:10 0:15 /usr/bin/python3 app.py
user1 2345 15.1 2.2 256000 10240 ? S 15:20 0:04 /bin/bash
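- Interpretation: The Python application (/usr/bin/python3 app.py) is consuming about 45% of a CPU core, which is significant. Note that ps reports %CPU as CPU time divided by the time the process has been running, so it is a lifetime average; use top to see current usage. If the load is unexpected, further investigation into the application is needed.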
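If you want to act on this listing from a script, for example to log the top consumers periodically, the output of ps aux can be parsed directly. The following is a minimal sketch; the function name is hypothetical and the sample text is the example output above.

```python
def top_cpu_processes(ps_output, limit=5):
    """Parse `ps aux` text and return the busiest processes by %CPU."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        # Split into at most 11 fields so the command keeps its arguments.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        rows.append({
            "user": fields[0],
            "pid": int(fields[1]),
            "cpu": float(fields[2]),
            "command": fields[10],
        })
    rows.sort(key=lambda row: row["cpu"], reverse=True)
    return rows[:limit]


SAMPLE_PS = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user1 2345 15.1 2.2 256000 10240 ? S 15:20 0:04 /bin/bash
root 1234 45.2 3.0 786432 24576 ? S 15:10 0:15 /usr/bin/python3 app.py
"""

if __name__ == "__main__":
    # On a live system, the text could come from
    # subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    for proc in top_cpu_processes(SAMPLE_PS):
        print(proc["pid"], proc["cpu"], proc["command"])
```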
3. Check Load Average
uptime
Example Output:
15:25:45 up 1 day, 3:52, 3 users, load average: 1.15, 1.34, 1.56
- Interpretation: The load averages (1.15, 1.34, 1.56) are all below 2, so a system with two or more cores is keeping up. Sustained values above 2 on a 2-core system would suggest CPU overload. Run nproc to get the core count to compare against.
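The comparison against the core count is easy to automate. A small sketch using Python's standard library (the helper names are made up, and treating "all three averages above the core count" as overload is a simplification of the rule of thumb above):

```python
import os


def load_per_core(load_averages, cores):
    """Normalize each load average by the number of CPU cores."""
    return tuple(round(value / cores, 2) for value in load_averages)


def is_overloaded(load_averages, cores):
    """True when the 1-, 5-, and 15-minute averages all exceed the core count."""
    return all(value > cores for value in load_averages)


if __name__ == "__main__":
    averages = os.getloadavg()      # same three numbers uptime prints
    cores = os.cpu_count() or 1     # cpu_count() can return None
    print("load per core:", load_per_core(averages, cores))
    print("overloaded:", is_overloaded(averages, cores))
```

Requiring all three windows to exceed the core count filters out short bursts, which raise the 1-minute figure long before the 15-minute one.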
4. Disk I/O Bottlenecks
iostat -x 1
Example Output:
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 2.15 1.32 123.5 45.1 83.1 0.03 5.5 4.4 20%
- Interpretation: The disk sda is performing well, with a low await time (5.5 ms) and disk utilization at 20%. If await time is consistently high (e.g., >100 ms), it could indicate a disk I/O bottleneck.
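"Consistently" is the important word: a single slow sample is usually a burst, not a bottleneck. One way to encode that, sketched here with a hypothetical helper and an arbitrary run length, is to require several back-to-back samples above the threshold before raising an alarm.

```python
def sustained_breach(samples, threshold, min_consecutive=5):
    """True if at least min_consecutive back-to-back samples exceed threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Made-up await readings in milliseconds, one per iostat interval.
    spiky = [5.5, 140.0, 6.1, 180.0, 5.9, 7.2]
    stuck = [5.5, 120.0, 135.0, 150.0, 160.0, 170.0]
    print(sustained_breach(spiky, threshold=100, min_consecutive=3))
    print(sustained_breach(stuck, threshold=100, min_consecutive=3))
```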
5. Network Performance Issues
netstat -i
Example Output:
Iface MTU RX-OK RX-ERR RX-DRP TX-OK TX-ERR TX-DRP Flg
eth0 1500 100000 0 0 50000 0 0 BMRU
lo 65536 2000 0 0 2000 0 0 LRU
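- Interpretation: The network interface eth0 is receiving and transmitting data normally, with no errors (RX-ERR and TX-ERR are 0) and no drops (RX-DRP and TX-DRP are 0). Non-zero or steadily growing values in these columns can indicate packet loss or other network issues. On distributions that no longer ship netstat, ip -s link reports the same counters.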
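The same counters are available in /proc/net/dev, which makes them easy to check from a script. A minimal sketch follows; the function names and the sample text are illustrative, with column positions taken from the standard /proc/net/dev layout.

```python
def parse_net_dev(text):
    """Extract error and drop counters per interface from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, counters = line.partition(":")
        if not sep:
            continue
        values = [int(v) for v in counters.split()]
        stats[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return stats


def interfaces_with_problems(stats):
    """Names of interfaces with any non-zero error or drop counter."""
    return sorted(name for name, c in stats.items() if any(c.values()))


# Made-up sample mirroring the netstat output above. On a live system, pass
# open("/proc/net/dev").read() to parse_net_dev() instead.
SAMPLE_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 9000000 100000 0 0 0 0 0 0 4500000 50000 0 0 0 0 0 0
    lo: 160000 2000 0 0 0 0 0 0 160000 2000 0 0 0 0 0 0
"""

if __name__ == "__main__":
    print(interfaces_with_problems(parse_net_dev(SAMPLE_NET_DEV)))
```

Because these counters accumulate since the interface came up, comparing two readings taken some time apart tells you whether errors are still occurring.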
6. Check System Logs for Errors
dmesg | grep -i error
Example Output:
[ 123.456789] blk_update_request: I/O error, dev sda, sector 12345678
- Interpretation: This log entry indicates a disk I/O error on device sda. If these errors persist, the disk may be failing and should be checked further with tools like smartctl.
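When the kernel log is long, it helps to count errors per device rather than read them one by one. The sketch below assumes messages in the same shape as the example line; the regular expression and function name are illustrative, and other kernel versions or drivers may word the message differently.

```python
import re
from collections import Counter

# Matches messages shaped like "... I/O error, dev sda, sector 12345678".
IO_ERROR = re.compile(r"I/O error, dev ([\w-]+), sector (\d+)")


def io_errors_by_device(log_text):
    """Count I/O error messages per block device in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    # The example line from above. On a live system the text could come from
    # the output of dmesg, which may require elevated privileges.
    sample = "[ 123.456789] blk_update_request: I/O error, dev sda, sector 12345678"
    print(io_errors_by_device(sample))
```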
7. Investigate Running Services
systemctl list-units --type=service
Example Output:
UNIT LOAD ACTIVE SUB DESCRIPTION
apache2.service loaded active running Apache2 web server
mysql.service loaded active running MySQL Community Server
...
- Interpretation: Check if unnecessary services (like unused web servers or databases) are consuming system resources. You can stop them if they’re not needed.
8. Check for Resource Limits
ulimit -a
Example Output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
...
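- Interpretation: If the ulimit for open files or processes is set too low, it might cause issues. For example, if nofile is set too low, a program might not be able to open enough files and fail with "Too many open files". Adjust as needed in /etc/security/limits.conf. That file applies to login sessions; for a service managed by systemd, set the limit in its unit file (e.g., LimitNOFILE=) instead.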
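A process can also inspect its own limits at runtime, which is useful for logging a warning at startup. The sketch below uses Python's standard resource module; the helper name and the descriptor count of 4096 are assumptions chosen for illustration.

```python
import resource


def describe_nofile(soft, hard, needed):
    """Say whether a process needing `needed` descriptors fits its limits."""
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or needed <= soft:
        return "ok"
    if hard == unlimited or needed <= hard:
        return "raise soft limit"
    return "raise hard limit"


if __name__ == "__main__":
    # Soft and hard limits on open file descriptors for this process,
    # the same values `ulimit -n` and `ulimit -Hn` report in a shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft:", soft, "hard:", hard)
    print(describe_nofile(soft, hard, needed=4096))
```

The distinction matters in practice: an unprivileged process can raise its own soft limit up to the hard limit, whereas raising the hard limit requires a configuration change by an administrator.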
9. Review Cron Jobs
crontab -l
cat /etc/crontab
Example Output:
# m h dom mon dow user command
0 2 * * * root /usr/bin/backup.sh
- Interpretation: Check if any cron jobs are running too frequently or consuming excessive resources. For example, if a backup job is scheduled to run every minute, it could create unnecessary load.
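Spotting over-frequent jobs by eye gets harder as crontabs grow. As a hedged sketch, the Python below flags entries whose minute field is * or */1, meaning they fire every minute whenever the other fields match. The function name and the poll.sh entry are hypothetical; only the backup.sh line comes from the example above.

```python
def jobs_firing_every_minute(crontab_text, has_user_column=True):
    """Return commands whose minute field is '*' or '*/1'.

    /etc/crontab has a user column before the command; the output of
    `crontab -l` does not, so pass has_user_column=False for that.
    """
    field_count = 7 if has_user_column else 6
    flagged = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, field_count - 1)
        if len(fields) < field_count:
            continue  # variable assignments, @reboot-style entries, malformed lines
        if fields[0] in ("*", "*/1"):
            flagged.append(fields[-1])
    return flagged


SAMPLE_CRONTAB = """\
# m h dom mon dow user command
0 2 * * * root /usr/bin/backup.sh
* * * * * root /usr/bin/poll.sh --fast
"""

if __name__ == "__main__":
    print(jobs_firing_every_minute(SAMPLE_CRONTAB))
```

This deliberately ignores lists and ranges in the minute field (such as 0-59), so treat an empty result as "nothing obvious" rather than proof that no job runs too often.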
This process allows you to identify and interpret performance bottlenecks in different areas, such as CPU, memory, disk, network, and running processes. Depending on your results, you can take action such as optimizing code, upgrading hardware, or adjusting system configurations.
Frequently Asked Questions (FAQ) for Linux Server Performance Troubleshooting
1. How do I know if my server is under heavy load?
Q: How do I check if my server is under heavy load?
A: You can check the load average using the uptime or top command. On Linux, the load average is the average number of processes that are running, waiting for CPU time, or blocked in uninterruptible I/O, measured over 1-, 5-, and 15-minute intervals. If these values consistently exceed the number of CPU cores, the server is likely overloaded.
2. How do I diagnose high CPU usage?
Q: What should I do if my server shows high CPU usage?
A: Use the ps command (for example, ps aux --sort=-%cpu | head -n 10) to list the processes consuming the most CPU. Determine whether each high-CPU process is expected; if one is consuming CPU unexpectedly, investigate or optimize that application.
3. How can I check memory usage on my server?
Q: How do I check if my server is running low on memory?
A: Use the free -h command to check available, used, and cached memory. If the available column is close to zero and swap is being used heavily, your server is likely under memory pressure.
4. How can I troubleshoot disk performance issues?
Q: How do I check for disk performance issues on my Linux server?
A: Use the iostat command to monitor disk I/O performance. If disk utilization is high (close to 100%), or the average wait time (await) is large, it may indicate disk I/O bottlenecks. Also, check disk space usage with the df -h command.
5. How can I check network performance issues?
Q: How can I check if there are network issues affecting my server?
A: Use netstat -i to check for errors on network interfaces. If you see dropped packets or errors, it could be a sign of network issues. You can also use ping or traceroute to diagnose latency or connectivity problems.
6. What should I do if my server is running low on disk space?
Q: How do I check disk usage and resolve low disk space issues?
A: Use the df -h command to see disk usage across all filesystems. If any filesystem is near 100% usage, identify large files or directories with du (for example, du -sh /var/* | sort -h) and free up space. Consider moving or archiving old files.
7. How do I check if any hardware issues are affecting performance?
Q: How can I detect hardware issues such as disk failures or memory errors?
A: Use dmesg | grep -i error to check the kernel log for hardware-related errors, such as disk I/O errors or memory issues. Additionally, tools like smartctl can help monitor the health of disks, and memtest can be used to check RAM for errors.
8. How can I investigate running services consuming too many resources?
Q: How do I check if any services are consuming too many resources?
A: Use systemctl list-units --type=service to list active services. Check the resource usage of each service (using top or ps) and stop or optimize any unnecessary services.
9. How can I monitor system limits and adjust resource usage?
Q: How do I check and adjust system resource limits like open files or processes?
A: Use ulimit -a to check current resource limits. If needed, adjust these limits in /etc/security/limits.conf for individual users or system-wide.
10. How do I troubleshoot high disk I/O?
Q: How can I check and fix high disk I/O on my Linux server?
A: Use iostat to check for high disk utilization (%util) or long wait times (await). If disk utilization is high, consider moving some data to another disk, optimizing the application, or upgrading the disk.