We recently spent an inordinate amount of time dealing with a serious system problem which, after we'd pushed the panic button with the vendor, turned out to be of our own making. For it transpires that:
- an essential system daemon, PNSD, creates a unix domain socket in /tmp when it starts;
- the access times of unix sockets do not change on AIX; and
- the fuser command shows the socket as unused, although lsof does report the right information.
Consequently, an apparently minor change to the housekeeping of temporary directories caused all LAPI services to fail spectacularly until the problem was corrected by a PNSD restart.
The lessons are obvious, but still worth noting.
Firstly, system daemons should never put communication files into /tmp. Instead they should always use a directory under /var. And what applies to communication files applies doubly to log files: logging to a file in a temporary file system with a generic name like serverlogs is just asking for trouble.
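To make the first lesson concrete, here is a minimal sketch of a daemon binding its socket in a dedicated directory rather than in /tmp. It is not how PNSD works; the function name bind_daemon_socket, the default socket name daemon.sock and the directory passed in are all made up for illustration.

```python
import os
import socket
import stat


def bind_daemon_socket(run_dir, name="daemon.sock"):
    """Bind a Unix domain socket inside run_dir (hypothetical helper).

    run_dir is assumed to be a directory owned by the daemon, for example
    somewhere under /var, that no housekeeping job ever sweeps.
    """
    os.makedirs(run_dir, mode=0o755, exist_ok=True)
    path = os.path.join(run_dir, name)

    # Remove a stale socket left behind by a previous run, but refuse to
    # delete anything that is not a socket.
    try:
        if stat.S_ISSOCK(os.lstat(path).st_mode):
            os.unlink(path)
        else:
            raise RuntimeError(f"{path} exists and is not a socket")
    except FileNotFoundError:
        pass

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server, path
```

A daemon would call this once at start-up with its own directory. Writing under /var normally needs suitable privileges, so the directory is best created and owned by the daemon's installation rather than by the code above.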
Secondly, it is always a mistake to make assumptions about the access and permission settings on non-standard files. Just because something is an object in the file system doesn't mean it's going to behave in the same way as a regular file. Rather, these things should be tested, along with any commands used to query them, before putting anything into a production script.
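One defensive approach, sketched here under my own assumptions rather than taken from our actual housekeeping script, is to make the clean-up consider regular files only, so that sockets, FIFOs, device nodes, symlinks and directories are never candidates for deletion whatever their timestamps say. The function name stale_regular_files is hypothetical.

```python
import os
import stat
import time


def stale_regular_files(directory, max_age_days):
    """Yield regular files in directory not accessed for max_age_days days."""
    cutoff = time.time() - max_age_days * 86400
    with os.scandir(directory) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            # Anything that is not a regular file is skipped outright:
            # its access time cannot be trusted to reflect real use.
            if not stat.S_ISREG(info.st_mode):
                continue
            if info.st_atime < cutoff:
                yield entry.path
```

The caller decides what to do with the result. Printing the list for a few days before letting the script delete anything is a cheap way of testing the assumptions first.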
Thirdly, always review recent changes when a serious problem occurs, especially if it spans two or more separate systems or logical entities, and even when it does not seem likely that a change might have caused the problem. It's better to find out that the problem is one that you've induced yourself before you go complaining to vendor support about it.
As for /tmp, I'm beginning to wonder whether it wouldn't be better to phase it out altogether in favour of temporary space in user home directories. This would greatly simplify system management by removing the need to clear out the directory every day and to apply quotas to prevent one user from hogging the space, to the detriment of everyone else. It would also avoid the endless problems that hit all users whenever the system runs short of temporary space because of one person's misjudgement. Worst of all, on clustered systems at least, the performance of /tmp is generally that of a single SCSI or SAS disc, whereas the user file systems will often have been configured to stripe across a large number of discs or servers in order to maximise bandwidth.
There are obvious problems with this approach, especially in a clustered environment, where certain of the assumptions used to generate unique temporary file names (usually just appending the PID of the creating process to the end of a string) no longer hold, because a PID is only unique within a single node. But these problems could be easily addressed by changing tmpnam() and its more secure siblings to use better name generation patterns.
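As a rough illustration of what a better pattern might look like, here is a sketch using Python's tempfile.mkstemp, which picks a random suffix and creates the file exclusively instead of relying on the PID. The helper name make_scratch_file, the scratch prefix and the idea of embedding the host name are my own assumptions, and whether exclusive creation behaves properly on a particular shared file system is exactly the kind of thing that should be tested first.

```python
import os
import socket
import tempfile


def make_scratch_file(base_dir, prefix="scratch"):
    """Create a uniquely named temporary file under base_dir.

    Returns (fd, path). base_dir is assumed to be a per-user directory,
    for example a subdirectory of the user's home directory.
    """
    os.makedirs(base_dir, mode=0o700, exist_ok=True)
    # The host name only makes it easy to see which node created a file.
    # Uniqueness comes from mkstemp's random suffix and exclusive creation.
    host = socket.gethostname()
    return tempfile.mkstemp(prefix=f"{prefix}.{host}.", dir=base_dir)
```

The caller owns the returned descriptor and is responsible for closing it and removing the file when finished.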
Although it seems unlikely that /tmp might disappear anytime soon, it's a nice dream, isn't it?
ETA: I have it on good authority that IBM are making changes to the latest version of LAPI to allow the PNSD socket file to be relocated to a sensible directory. Definitely a victory for common sense.