🖥️

Server Maintenance Checklist

Monthly process to ensure that your servers are safeguarded against system failures.

Introduction:

Preparation:

Record basic details

Software and system checks:

Check and update software

Update your control panel

Check remote management tools

Check the server resource usage

Troubleshoot CPU utilization

Troubleshoot RAM utilization

Troubleshoot network utilization

Free up server storage space

Data checks:

Verify your backups are working

Review user accounts

Security checks:

Perform server malware scan

Change server passwords

Hardware checks:

Check Redundant Array of Independent Disks (RAID) fault tolerance

Check cable integrity

Sources:

Related checklists:

Introduction:

As a systems administrator or IT technician, one of your roles is the maintenance of server computers. When your service offering is dependent on your machines running like clockwork, you can't afford to be without a tight process for regular maintenance.

That's exactly why we've made this checklist - to make your life easier and give you a solid, best-practice process for server maintenance.

Whether you're responsible for an array of remote virtual private servers, managing a cloud server farm or the servers for a localized intranet, this checklist will help you save time and effort in ensuring you're getting the most out of your machines.

Preparation:

Record basic details

To kick off our server maintenance checklist, you must first ensure that all details of the maintenance procedure (and the server itself) are recorded.

Do this by completing the following form fields.

Static IP address of server computer

MAC address of server computer

First name of maintenance technician

Last name of maintenance technician

Date of server maintenance

Are you using remote management tools?

1. Yes
2. No

If you wish to record any extra information, feel free to do so below.

Additional comments
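If you'd rather collect some of these details from the server itself, the sketch below is one way to do it. It assumes a Linux machine with the iproute2 `ip` tool; the helper names `field` and `record_details` are made up for this example, and the technician's name still has to be entered by hand.

```shell
#!/bin/sh
# Print one "label: value" line; an empty value is recorded as "unknown".
field() {
    printf '%s: %s\n' "$1" "${2:-unknown}"
}

# Gather the machine-side details this step asks for.
record_details() {
    field "Hostname" "$(uname -n)"
    field "Date of server maintenance" "$(date +%Y-%m-%d)"
    if command -v ip >/dev/null 2>&1; then
        # Brief per-interface listings: MAC addresses first, then IP addresses.
        ip -br link show || field "MAC address" ""
        ip -br addr show || field "IP address" ""
    else
        field "IP and MAC address" ""
    fi
}

record_details
```

Copy the relevant lines into the form fields above.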

Software and system checks:

Check and update software

Linux operating systems are very frequently updated, and more often than not these updates come with important security and vulnerability patches.

Even if you're using Update Manager to automatically configure your updates, or you've recently updated your system, be sure to check just in case.

On Debian- and Ubuntu-based distributions, updating is very straightforward - it can be done with a single line in the terminal:

sudo apt-get update && sudo apt-get upgrade

As well as the operating system, other software components will need to be updated as well. Use the sub-checklist below to make sure everything is running the latest version.

(Subtasks)

1. Operating system updates have been installed
2. Other application updates have been installed
3. Server rebooted (if kernel update was installed)

Once the updates have finished installing, you should be good to go.

If you have a kernel update, you will need to reboot your server unless you use a tool like Ksplice.
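To see what an update would do before committing to it, and whether a reboot is pending afterwards, something like the following may help. It is a sketch that assumes a Debian- or Ubuntu-based system: it relies on the simulate mode of `apt-get` and on the `/var/run/reboot-required` flag file those distributions create. The function names are invented for this example.

```shell
#!/bin/sh
# Count the packages "apt-get upgrade" would install, using simulate mode (-s).
# Nothing on the system is changed. Prints 0 if apt-get is unavailable.
pending_upgrades() {
    apt-get -s upgrade 2>/dev/null | grep -c '^Inst ' || true
}

# Succeed if the reboot flag file exists (the path can be overridden for testing).
reboot_required() {
    [ -f "${1:-/var/run/reboot-required}" ]
}

echo "Packages waiting to be upgraded: $(pending_upgrades)"
if reboot_required; then
    echo "A reboot is required to finish applying updates"
else
    echo "No reboot flag found"
fi
```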

Update your control panel

Make sure to update your server (or hosting) control panel as well - and don't assume that doing so also updates the software it controls.

Example:

With WHM/cPanel, you must manually update PHP versions to fix known issues. Simply updating the control panel does not also update the underlying Apache and PHP versions used by your OS.

Check remote management tools

Next up in the server maintenance checklist, you need to check in on your remote management tools.

Use the sub-checklist below to tick off the tools as you check that there are no bugs or quirks in the system.

(Subtasks)

1. Remote console
2. Remote reboot
3. Rescue mode

If your server is co-located or with a dedicated server provider, check that your remote management tools work.

Check the server resource usage

The most important thing is to keep your server monitored so that you know what's going on with the server at every moment. Keep track of details like CPU load, disk space and RAM utilization.

This will also help you understand whether or not you need to upgrade certain components, add more servers or migrate to a different hosting plan.

Performance monitoring tools like sysstat will provide baseline performance data for each of your server components. Fill out the sub-checklist below after checking each component.

(Subtasks)

1. Disk
2. CPU
3. RAM
4. Network

Should you find that the server is pushing a component particularly hard, it's best that you perform a spot of quick troubleshooting.

High server load could just be the result of temporary high traffic or application utilization but it could also be the symptom of a much deeper malady and may require components be upgraded or certain processes be deactivated.
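One rough way to judge whether the load is actually high is to compare the 1-minute load average with the number of CPU cores. The sketch below assumes Linux (`/proc/loadavg`), and the "load greater than core count" rule is only a common rule of thumb, not a hard limit. `load_is_high` is a name made up for this example.

```shell
#!/bin/sh
# Succeed when the load average ($1) is greater than the core count ($2).
load_is_high() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

if [ -r /proc/loadavg ]; then
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load1 rest < /proc/loadavg
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if load_is_high "$load1" "$cores"; then
        echo "Load $load1 exceeds $cores core(s): worth investigating"
    else
        echo "Load $load1 is within $cores core(s)"
    fi
fi
```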

Declare which of the checked components were under particularly high strain in the multiple choice field below.

Which resources were under high strain?

1. Disk
2. CPU
3. RAM
4. Network

Troubleshoot CPU utilization

So you've determined that the server's CPU utilization is high - but you want to know why it's high.

To find out, input the following into the terminal:

top

to get a quick sample of recent server load statistics.

The result will look something like this:

top - 14:08:25 up 38 days, 8:02, 1 user, load average: 1.70, 1.77, 1.68
Tasks: 107 total,   3 running, 104 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, .7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   1024176k total,   997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,     4360k used,   999692k free,   286040k cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql   16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios  16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios  17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios  24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl

From this, you can determine where exactly those load strains are coming from. The %CPU and COMMAND columns show which application (if any) is hogging the CPU power, and the Process ID (PID) column identifies the process itself.

In the table above, the mysqld process (MySQL) is the heaviest consumer at 53% CPU.

If user or server loads are particularly high, you may want to consider expanding server resources so that there is more CPU power to better handle the strain.
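If you want a non-interactive snapshot for your maintenance notes, `ps` can produce a similar ranking. This is a sketch rather than a replacement for `top`: on Linux, `ps` reports CPU use averaged over each process's lifetime, so the figures may differ from what `top` shows. The function names are invented for this example.

```shell
#!/bin/sh
# Sort "PID USER %CPU COMMAND" rows on the third column, highest first.
sort_by_cpu() {
    sort -k3,3nr
}

# Print the N processes using the most CPU (default 5), without the header row.
top_cpu() {
    ps -eo pid,user,pcpu,comm | sed 1d | sort_by_cpu | head -n "${1:-5}"
}

top_cpu 5
```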

Troubleshoot RAM utilization

If you are trying to diagnose memory issues, your first port of call should be the "htop" command.

Enter it into the terminal:

htop

and sort by "MEM%".

The "MEM%" column clearly shows how much RAM each process is utilizing. If usage is consistently high, then you may want to consider expanding your server resources by upgrading the RAM.
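For a single overall figure to note down, you can also work out the percentage of RAM in use from `/proc/meminfo`. The sketch below assumes a Linux kernel recent enough to report `MemAvailable`; `mem_used_pct` is a helper name made up for this example.

```shell
#!/bin/sh
# Percentage of memory in use, given total and available amounts in the same unit.
# The result is truncated to a whole number.
mem_used_pct() {
    awk -v total="$1" -v avail="$2" 'BEGIN { printf "%d\n", (total - avail) * 100 / total }'
}

if [ -r /proc/meminfo ]; then
    total=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    if [ -n "$total" ] && [ -n "$avail" ]; then
        echo "RAM in use: $(mem_used_pct "$total" "$avail")%"
    fi
fi
```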

Troubleshoot network utilization

If your network is under high strain, you can use the open source command line program NetHogs to monitor real-time network traffic bandwidth used by processes and applications running on the server.

To execute it, you will need root permissions, so run the following command in the terminal:

sudo nethogs

This simple output shows the amount of traffic for each application currently accessing the network. You will be able to isolate troublesome bandwidth-hogging applications and deal with them accordingly.

If you need to delve deeper into NetHogs troubleshooting, run the following command to open the manual page (no root permissions are needed for this):

man nethogs
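If NetHogs isn't installed, you can still get a rough per-interface (not per-process) view from the kernel's own counters. The sketch below assumes Linux and the usual `/proc/net/dev` layout; `iface_bytes` is a name invented for this example.

```shell
#!/bin/sh
# Print "interface received_bytes transmitted_bytes" for each network interface.
# The first two lines of /proc/net/dev are headers. After the interface name,
# column 1 is bytes received and column 9 is bytes transmitted.
iface_bytes() {
    awk 'NR > 2 { sub(/:/, " "); print $1, $2, $10 }' "${1:-/proc/net/dev}"
}

if [ -r /proc/net/dev ]; then
    iface_bytes
fi
```

The counters are totals since boot, so run it twice a few seconds apart and compare the numbers to see which interface is busy.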

Free up server storage space

By making sure that you're keeping your server's partition clean and trim of unnecessary data, old unused program versions, and archival logs, you can minimize the strain on your system and make sure no problems arise from not having enough free space!

Conventional IT wisdom advises that at least 20 to 30% of disk space should be left free at all times.

If you're running low on storage space, you should definitely prioritize expanding server resources and invest in additional disk drives.

Above 80% capacity, performance is significantly affected - especially on mechanical hard disk drives - and a disk that fills up completely puts data at risk of corruption.
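To check where you stand against that 80% mark, you could flag every filesystem at or above it. This is a minimal sketch using POSIX `df -P`; it assumes device names and mount points contain no spaces, and `usage_level` and `disk_report` are names made up for this example.

```shell
#!/bin/sh
# Label a usage percentage: HIGH at or above the threshold (default 80), else ok.
usage_level() {
    if [ "$1" -ge "${2:-80}" ]; then echo "HIGH"; else echo "ok"; fi
}

# Print usage, label and mount point for every filesystem df reports.
disk_report() {
    df -P | awk 'NR > 1 && $5 ~ /^[0-9]+%$/ { sub(/%/, "", $5); print $5, $6 }' |
    while read -r pct mount; do
        printf '%3s%% %-5s %s\n' "$pct" "$(usage_level "$pct")" "$mount"
    done
}

disk_report
```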

Data checks:

Verify your backups are working

First up, make sure that your backups are actually working.

Whilst you should already have automatic system backups scheduled regularly, these efforts are in vain if you haven't even tested if the backups are doing what they're supposed to be doing.

You'll want to run some test recoveries, especially before you delete any critical data. Remember to double check that you are saving to the correct backup location.
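A test recovery can be as simple as restoring a backup into a scratch directory and comparing it with the live data. The sketch below assumes your backups are plain tar archives created relative to the source directory (for example with `tar -cf backup.tar -C /srv/data .`); your backup tool may well work differently, and `verify_backup` is a name invented for this example.

```shell
#!/bin/sh
# Restore a tar archive into a scratch directory and compare it with the source.
# Succeeds only if the restored tree matches the source tree exactly.
verify_backup() {
    archive=$1
    source_dir=$2
    scratch=$(mktemp -d) || return 1
    status=0
    if tar -xf "$archive" -C "$scratch"; then
        diff -r "$source_dir" "$scratch" || status=$?
    else
        status=1
    fi
    rm -rf "$scratch"
    return "$status"
}

# Example with hypothetical paths:
#   verify_backup /backups/data.tar /srv/data && echo "Backup restores cleanly"
```

Files that changed since the backup was taken will show up as differences, so run the comparison soon after a backup completes.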

Review user accounts

Now you need to review the user accounts on your server.

Especially if you're managing server hosting for your clients, you will most likely have had client cancellations or other user changes that you will need to stay on top of.

Even old employee logs and data should be removed. Keeping old website and user data is not just a security risk, but a legal risk as well - what's more, much of this forgotten data will just be sitting in the dark recesses of your servers soaking up precious resources.

Primarily, you should be looking for any old or outdated sites and users so you can promptly erase this data.
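A quick way to get a list of accounts to review is to pull the ones that can actually log in from `/etc/passwd`. The sketch below assumes regular users start at UID 1000, which is the default on many distributions but not all, and it only covers local accounts; `login_accounts` is a name made up for this example.

```shell
#!/bin/sh
# List accounts with a UID at or above the minimum (default 1000) whose shell
# is not "nologin" or "false". Arguments: passwd file, minimum UID.
login_accounts() {
    awk -F: -v min_uid="${2:-1000}" '
        $3 >= min_uid && $7 !~ /(nologin|false)$/ { print $1 }
    ' "${1:-/etc/passwd}"
}

if [ -r /etc/passwd ]; then
    login_accounts
fi
```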

Check against your archives with the sub-checklist below.

(Subtasks)

1. Old employee data
2. Obsolete / inactive website files
3. Additional outdated user data

Security checks:

Perform server malware scan

It should be part of your routine process to run a malware check on your server machines. ClamAV is a useful tool for scanning against known databases of viruses and malware for Linux machines.

Input the following into the terminal to update virus definitions:

sudo freshclam

Then input the following to scan all files on the computer in the background, displaying only infected files as they are found:

sudo clamscan -r -i / &

Note that clamscan only reports the infected files it finds. To delete or quarantine them, run the scan again with the --remove or --move=<directory> option once you have reviewed the results.
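Because the scan runs unattended, it helps to keep a log and to interpret the exit code: `clamscan` exits with 0 when nothing was found, 1 when infected files were found, and 2 when errors occurred. The wrapper below is a sketch; the function names and the paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# Translate a clamscan exit code into a short message.
scan_result() {
    case "$1" in
        0) echo "clean" ;;
        1) echo "infected files found" ;;
        *) echo "scan error" ;;
    esac
}

# Scan a directory recursively, list only infected files, and keep a log.
# Arguments: directory to scan, log file to write.
run_scan() {
    status=0
    clamscan -r -i --log="$2" "$1" || status=$?
    scan_result "$status"
    return "$status"
}

# Example with hypothetical paths:
#   run_scan /srv/www /var/log/clamscan-monthly.log
```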

Change server passwords

You're almost there! Password security is essential to proper server maintenance, so it follows that the next task in this checklist is to change the passwords associated with your server.

Consider using an automatic random password generator together with a password organizer to maximize security and durability of your passwords.
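If you don't have a generator to hand, the shell can produce a random password from the system's random source. This is a minimal sketch assuming `/dev/urandom` is available; the 24-character default and the name `gen_password` are choices made for this example, so adjust the length and character set to your own password policy.

```shell
#!/bin/sh
# Print a random alphanumeric password of the given length (default 24).
gen_password() {
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "${1:-24}"
    echo
}

gen_password 24
```

Store the result in your password organizer, then set it with your usual tool (for example `passwd` for a local account).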

Hardware checks:

Check Redundant Array of Independent Disks (RAID) fault tolerance

Now you need to perform some basic checks to ensure your RAID system is working properly, and will continue to do so.

First, check that the error notification system is set up and working - this is especially important as most RAID levels tolerate only a single disk failure. If you miss a RAID notification, a simple disk replacement could turn into a complete system failure, so make sure that you thoroughly check each of the disk drives.

There are a number of additional steps you can take to ensure that you are in line with contemporary RAID best practices, outlined in the sub-checklist below. Tick each one off as you go.

(Subtasks)

1. Check for disk read errors
2. Install Adaptec Storage Manager
3. Perform all recommended driver, controller firmware, and storage management application updates
4. Run system consistency check
5. Replace drives that have either failed completely or are starting to show signs of failing (medium errors, S.M.A.R.T. errors, etc.) immediately
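The checks above are aimed at hardware RAID controllers. If your server uses Linux software RAID (md) instead, which is an assumption this checklist doesn't cover, a degraded array can be spotted in `/proc/mdstat`, where a missing member shows up as an underscore in the status brackets (for example `[U_]` instead of `[UU]`). A rough sketch, with `md_degraded` as a made-up name:

```shell
#!/bin/sh
# Succeed if the mdstat file (default /proc/mdstat) shows an array with a
# missing member, i.e. an underscore inside the [UU...] status brackets.
md_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]' "${1:-/proc/mdstat}"
}

if [ -r /proc/mdstat ]; then
    if md_degraded; then
        echo "WARNING: at least one md array is degraded"
    else
        echo "No degraded md arrays reported"
    fi
else
    echo "No /proc/mdstat found: software RAID status not available"
fi
```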

Check cable integrity

Cable wear and tear is a big factor that is often overlooked when determining points of system failure. Checking it is a simple task that should be incorporated into the routine maintenance process for all wire-dependent hardware.

Check all connection points carefully to ensure the following is true:

(Subtasks)