The Stale pidfile Syndrome

Many Unix daemons record their process ids in pidfiles. Startup and shutdown scripts read these pidfiles to determine which process to signal. Examples include Apache, Postgres and OpenSSH.

This method is unreliable.

Is the PID valid?

Some daemons create pidfiles, but do not remove them when the process exits. This could be because the daemon was never programmed to remove the file, or it could be because the process crashed before it could remove the file.

Sometimes the whole system crashes, and processes get no chance to remove their pidfiles even if they are programmed to do so.

In all of these cases, the pidfile remains but its contents are invalid: the daemon it names is no longer running.
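The crash case can be reproduced with a small sketch. The names `DAEMON_SOURCE`, `leave_stale_pidfile` and `exampled.pid` are hypothetical, introduced here only for illustration, and the stand-in "daemon" is just a Python child process. It registers a cleanup handler, but it is killed with SIGKILL, which cannot be caught, so the handler never runs and the pidfile survives.

```python
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical stand-in for a daemon: it writes its pid, registers a
# cleanup handler that would remove the pidfile, then sleeps.
DAEMON_SOURCE = """
import atexit, os, sys, time
with open(sys.argv[1], "w") as f:
    f.write(str(os.getpid()))
atexit.register(os.remove, sys.argv[1])
time.sleep(60)
"""


def leave_stale_pidfile(pidfile, timeout=5.0):
    """Start the stand-in daemon, kill it abruptly, and return the pid
    that is still recorded in the pidfile afterwards."""
    child = subprocess.Popen([sys.executable, "-c", DAEMON_SOURCE, pidfile])
    deadline = time.monotonic() + timeout
    try:
        # Wait until the child has written its pid.
        while time.monotonic() < deadline:
            if os.path.exists(pidfile) and os.path.getsize(pidfile) > 0:
                break
            time.sleep(0.05)
    finally:
        child.kill()   # SIGKILL on Unix: the atexit handler cannot run
        child.wait()   # reap the child so the process is really gone
    with open(pidfile) as f:
        return int(f.read())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "exampled.pid")
        stale_pid = leave_stale_pidfile(path)
        print("pidfile still present:", os.path.exists(path))
        print("it names a dead process:", stale_pid)
```

A daemon that exits normally would have removed the file; only the abrupt kill leaves it behind.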

Is the process running?

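Some scripts attempt to compensate for the above problems by sending signal 0 to the process named in the pidfile. Signal 0 delivers nothing; the system only checks whether the process exists and whether the caller is allowed to signal it. The assumption is that if no error is returned, then the daemon is still running.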
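A minimal sketch of the signal 0 test, as such a script might implement it, is shown here in Python. The function names `pid_exists` and `pidfile_claims_running` are hypothetical, chosen for this illustration; `os.kill(pid, 0)` is the standard way to issue the probe from Python.

```python
import os


def pid_exists(pid):
    """Return True if some process currently has this pid.

    This says nothing about WHICH process it is.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        return False
    try:
        os.kill(pid, 0)          # signal 0: existence and permission check only
    except ProcessLookupError:   # ESRCH: no such process
        return False
    except PermissionError:      # EPERM: it exists, but we may not signal it
        return True
    return True


def pidfile_claims_running(path):
    """Apply the signal 0 test to the pid recorded in a pidfile."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or garbled pidfile: treat as not running.
        return False
    return pid_exists(pid)
```

Note that a `True` result only means the pid is in use. The sketch has no way to tell whether the process holding it is the daemon that wrote the file.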

This test is inadequate. The daemon may have died and a different process may now have that process id. In this case, a script may mistakenly think the daemon is still running and send a signal to the wrong process. This is most likely and most dangerous when the script is run as root, because root is permitted to signal any process on the system.

This test can also fail at boot time. If a stale pidfile remains when the system starts, another process may now have the pid named in it. When the startup script runs, it may apply the signal 0 test, conclude that the daemon is already running, and decline to start it.
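The false positive is easy to simulate. In the sketch below, the names `signal0_says_running`, `simulate_recycled_pid` and `exampled.pid` are hypothetical. Real pid reuse is hard to provoke on demand, so the "stale" pidfile is simply filled with the pid of the test program itself, standing in for an unrelated process that happened to receive the dead daemon's pid.

```python
import os
import tempfile


def signal0_says_running(pidfile):
    """The signal 0 test: True if the pid in the pidfile is in use."""
    with open(pidfile) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def simulate_recycled_pid():
    """Return what the signal 0 test reports for a stale pidfile whose
    pid now belongs to an unrelated process."""
    with tempfile.TemporaryDirectory() as tmp:
        pidfile = os.path.join(tmp, "exampled.pid")
        # Assumption for the demo: no daemon is running. This process
        # plays the unrelated program that inherited the old pid.
        with open(pidfile, "w") as f:
            f.write(str(os.getpid()))
        return signal0_says_running(pidfile)


if __name__ == "__main__":
    print("test reports daemon running:", simulate_recycled_pid())
```

The test reports that the daemon is running even though no daemon exists. A startup script relying on it would skip starting the daemon, and a shutdown script would signal the wrong process.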

Suggestions

There may be no foolproof way to determine whether a given daemon is still running, but here are some suggestions that may help: