sawyl: (A self portrait)
Asked to investigate why large numbers of access times had been updated on a file system, I confidently ruled out an NFS client on the grounds that the file system had been exported read-only from the AIX server. However, when I examined the situation more closely, I noticed that reading a file on a client did in fact result in an update of the atime on the server, something that doesn't appear to happen when serving NFS file systems from Linux.

According to our best hypothesis:

  • the server exports the file system read-only
  • the client mounts the file system normally
  • a program on the client traverses the file system reading each file
  • the read triggers the NFS daemon to access the file on the server. It is this action that changes the atime
  • the program on the client cannot use utime() to reset the atime because the file system is read-only

The behaviour is odd, but I'm not sure whether it's contrary to the RFC or whether it's just a detail of the AIX implementation of NFS...
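
For the record, a quick way to confirm the behaviour from a client, sketched in python. The path is invented, and the sleep is just a guess at how long the client's attribute cache takes to expire; checking the atime directly on the server with ls -lu is more conclusive.

    import os
    import time

    path = '/mnt/readonly/somefile'   # hypothetical file on the read-only mount

    before = os.stat(path).st_atime
    open(path).read()                 # a plain read over NFS
    time.sleep(65)                    # wait out the client's attribute cache
    after = os.stat(path).st_atime

    if after > before:
        print 'atime updated on the server despite the read-only export'
    else:
        print 'atime unchanged'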

sawyl: (A self portrait)
A little while ago, I tried to get SciPy to build on AIX using IBM's XL series of compilers and the ESSL library. The results were disappointing. I couldn't get SciPy to link correctly and the only procedure I was able to find online was fiendishly complicated. After spending some days working on the problem, I gave up, put it aside and shifted my attention to more important things.

So I wasn't exactly delighted when SciPy's priority increased sufficiently to put it back on my radar. But, full of brio after the holidays and in search of an interesting puzzle to crack, I decided to give it another go. And what do you know? It only took me a few hours to come up with a solution, saved here for posterity and for any other AIX admin foolish enough to want to give this a try.

The gory details... )

Onwards to the horrible — and more important — mess that is netCDF...

sawyl: (Default)
This may have been fixed in later versions of Python — I've been using 2.6.6 — but I've noticed that the standard distutils don't seem to work properly when (a) the interpreter has been built using the XL compilers; and (b) the extension being built needs to be linked with the C++ compiler, e.g. in SciPy. When both these conditions are met, distutils generates an invalid link command that features both the C++ compiler as the linker (correct) and the C compiler as the second argument (horribly incorrect), as this example from an existing problem ticket shows:
error: Command "xlC xlC -qlanglvl=extc89 ..." failed with exit status 252

Not being willing (or able) to build each component by hand, I came up with a rough and ready fix — the if block that follows the platform test for 'aix5' — for distutils.unixccompiler.link() that seems to work around the problem:

Source code... )

It's a long way from perfect — the test shouldn't really be for AIX but something like self.compiler_cxx.startswith("xl") and the code should search through the list, rather than simply removing element 1 — but it's a reasonable first cut at a solution to a deeply annoying problem.
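
For anyone curious about the shape of the thing, here is a rough, hypothetical equivalent of that fix written as a monkey-patch that could live in setup.py rather than as an edit to distutils itself. It keys off the duplicated compiler name at the front of the link command instead of the 'aix5' platform test, so treat it as an illustration rather than the actual patch behind the cut.

    import sys
    from distutils import unixccompiler

    _orig_link = unixccompiler.UnixCCompiler.link

    def _xl_link(self, *args, **kwargs):
        # Temporarily wrap spawn() so that the stray compiler name distutils
        # leaves in position 1 ("xlC xlC -qlanglvl=extc89 ...") is dropped
        # before the link command is executed.
        orig_spawn = self.spawn

        def spawn(cmd):
            if sys.platform.startswith('aix') and len(cmd) > 1 and cmd[0] == cmd[1]:
                cmd = [cmd[0]] + cmd[2:]
            return orig_spawn(cmd)

        self.spawn = spawn
        try:
            return _orig_link(self, *args, **kwargs)
        finally:
            del self.spawn          # restore the class-level spawn()

    unixccompiler.UnixCCompiler.link = _xl_link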

sawyl: (Default)
I've been doing quite a lot of work with python and numpy and matplotlib of late and I've made a couple of useful discoveries:

  • numpy structured arrays are an almost perfect replacement for tables in R, provided that you don't try to use the dtype parameter to explicitly request a field type of |O4, e.g. to accommodate datetime objects, because (a) this seems to cause the current version of numpy to complain; and (b) it seems to be unnecessary.
  • using matplotlib.ticker.FixedLocator to override the standard X-axis ticks provides cleaner labelling than most of the other methods when working with time/date sequences.
  • setting yaxis.get_major_ticks()[0].label1On = False switches off the first label of the Y-axis, avoiding ugly collisions between labels on the two different axes (both matplotlib tricks are sketched after this list).
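
A minimal sketch of the two matplotlib tricks, assuming a matplotlib roughly contemporary with the post; the time series is invented for illustration.

    import datetime
    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FixedLocator, FixedFormatter

    # Two days of hourly data standing in for the real nmon/topas numbers.
    times = [datetime.datetime(2011, 1, 1) + datetime.timedelta(hours=h)
             for h in range(48)]
    load = np.random.rand(48)

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(range(len(times)), load)

    # Fixed ticks every six hours give cleaner labels than the default
    # locator when the X axis is really a time/date sequence.
    ticks = range(0, len(times), 6)
    ax.xaxis.set_major_locator(FixedLocator(ticks))
    ax.xaxis.set_major_formatter(
        FixedFormatter([times[i].strftime('%d %H:%M') for i in ticks]))

    # Switch off the first Y-axis label to avoid a collision at the origin.
    ax.yaxis.get_major_ticks()[0].label1On = False

    plt.show()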

I've used these discoveries to put together a script that plots out nmon and topas data for one or more machines, making it easy to compare and contrast the performance of nodes that share the provision of a service, e.g. GPFS, LoadLeveler etc.

sawyl: (Default)
I've been looking into an annoying network problem which caused connections to a particular system to fail apparently at random. After initial investigations with tcpdump showed that large amounts of traffic were being lost, I suspected a problem with the EtherChannel configuration. But I later discovered that the device had been configured in failover mode, the backup adaptor had never been connected, and dropping the backup out of the configuration made no difference.

So I went off in a different direction, combing through netstat output in search of errors. Although I found a suspiciously high number of duplicate packets and a large number of drops due to a full listener queue in the statistics output (netstat -s), I failed to find anything conclusive. In desperation, I tried listing the number of packets dropped at each layer of the subsystem using netstat -D (which is, I think, AIX specific). This showed that almost all the outbound packets were being dropped at the interface level, which made me suspect an IP configuration problem.

Sure enough, when I checked the routing table, I discovered that the system had added a second default route to an old and invalid gateway, probably during a recent reboot. When I removed this, the system abruptly returned to normal and the network went back to its old reliable self. I'm not quite sure where this route came from. I suspect that when it was removed after the gateway became invalid, it was deleted using route delete rather than chdev -l inet0. This disabled the route on the running system but left it in the ODM, where it was reloaded when the system was rebooted.

The lessons here? Always check the routing table before checking anything else. Always, always use chdev to change the routing on AIX, and if you're not sure of the options to use, use smitty rmroute.
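
In that spirit, a tiny hypothetical check that could be dropped into monitoring: flag any host whose running table contains more than one default route. It only relies on netstat -rn, and the parsing is deliberately naive.

    import subprocess

    def default_gateways():
        """Return the gateways of the 'default' entries in netstat -rn."""
        out = subprocess.Popen(['netstat', '-rn'],
                               stdout=subprocess.PIPE).communicate()[0]
        gateways = []
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0] in ('default', '0.0.0.0'):
                gateways.append(fields[1])
        return gateways

    gws = default_gateways()
    if len(gws) > 1:
        # On AIX the stale entry should be removed with chdev (or smitty
        # rmroute) so that the running table and the ODM stay in step.
        print 'WARNING: multiple default routes:', ', '.join(gws)
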
sawyl: (Default)
An interesting AI koan from today: when your system hangs immediately after a network boot, is it really running? And if it is, how can you tell, when it isn't far enough up to have started any services?
sawyl: (Default)
I've spent an inordinate amount of time dealing with a serious system problem which, after we'd pushed the panic button with the vendor, turned out to be a problem of our own making. For it transpires that:

  • an essential system daemon, PNSD, creates a unix domain socket in /tmp when it starts;
  • the access times of unix sockets do not change on AIX; and
  • the fuser command shows the socket as unused, although lsof does report the right information.

Consequently, an apparently minor change to the housekeeping of temporary directories caused all LAPI services to fail spectacularly until the problem was corrected by a PNSD restart.

The lessons are obvious, but still worth noting.

Firstly, system daemons should never put communication files into /tmp. Instead they should always use a directory under /var. And what applies to communication files applies doubly to log files — logging to a file in a temporary file system with a generic name like serverlogs is just asking for trouble.

Secondly, it is always a mistake to make assumptions about the access and permission settings on non-standard files. Just because something is an object in the file system doesn't mean it's going to behave in the same way as a regular file. Rather, these things should be tested, along with any commands used to query them, before putting anything into a production script.

Thirdly, always review changes when a serious problem occurs, especially if it spans two or more separate systems or logical entities, and even when it does not seem likely that the change might have caused the problem. It's better to find out that the problem is one that you've induced yourself before you go complaining to vendor support about it.

As for /tmp, I'm beginning to wonder whether it wouldn't be better to phase it out altogether in favour of temporary space in user home directories. This would greatly simplify system management by removing the need to clear out the directory every day and to apply quotas to prevent one user from hogging the space, to the detriment of everyone else. It would also avoid the endless problems that hit all users whenever the system runs short of temporary space because of one person's misjudgement. Worst of all, on clustered systems at least, the performance of /tmp is generally that of a single SCSI or SAS disc, whereas the user file systems will often have been configured to stripe across a large number of discs or servers in order to maximise bandwidth.

There are obvious problems with this approach, especially in a clustered environment, where certain of the assumptions used to generate unique temporary file names — usually just appending the PID of the creating process to the end of a string — no longer work particularly well. But these problems could be easily addressed by changing tmpnam() and its more secure siblings to use better name generation patterns.
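
As a sketch of the kind of pattern I mean, folding the hostname and the user into the name as well as the PID so that names stay unique across a cluster sharing the same space, something along these lines might do (the helper name is made up, and tempfile already randomises the suffix anyway):

    import getpass
    import os
    import socket
    import tempfile

    def cluster_tmpfile(suffix=''):
        """Create a temp file whose name is unlikely to collide across nodes."""
        prefix = '%s.%s.%d.' % (socket.gethostname(),
                                getpass.getuser(), os.getpid())
        return tempfile.mkstemp(prefix=prefix, suffix=suffix)

    fd, path = cluster_tmpfile('.log')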

Although it seems unlikely that /tmp might disappear anytime soon, it's a nice dream, isn't it?


ETA: I have it on good authority that IBM are making changes to the latest version of LAPI to allow the PNSD socket file to be relocated to a sensible directory. Definitely a victory for common sense.

sawyl: (Default)
I think I've finally cracked my monster task, the one that has been haunting me for months: to rewrite the hosts files for the main clusters.

A simple task you might think, except that each machine has 10 network interfaces, each machine has around four aliases per interface, and there are something like 240 machines. And as if that wasn't simple enough, some of the aliases are redefined on different clusters, but each machine needs to know the principal hostname of every other machine regardless of cluster. Oh, and yes, the aliases (and potential duplicates) need to be defined in a very precise order or Breakage Will Occur.

Now all I have to do is spend the rest of the week applying relational algebra to my various different versions of the files to ensure that everything is correctly defined. What fun...
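
In practice the "relational algebra" is mostly set arithmetic: treat each hosts file as a relation of (address, name) pairs and diff the versions, roughly as below. The file names are invented for illustration.

    def hosts_relation(path):
        """Read a hosts file as a set of (address, name) pairs."""
        pairs = set()
        for line in open(path):
            fields = line.split('#')[0].split()
            if fields:
                addr, names = fields[0], fields[1:]
                pairs.update((addr, name) for name in names)
        return pairs

    old = hosts_relation('hosts.old')
    new = hosts_relation('hosts.new')

    print 'dropped:', sorted(old - new)
    print 'added:  ', sorted(new - old)
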
sawyl: (Default)
I've encountered another interesting LoadLeveler oddity, again involving reservations, but this time due, I suspect, to a timing bug rather than a design effect.

Essentially, what appears to have happened is that a reservation was created with a start time of T+2 minutes and the request was granted by the first scheduler. Less than a minute later, the second scheduler allocated a queued job to the nodes that had just been reserved by the first scheduler. Then, when the reservation became active, the job associated with it was unable to run because it lacked the necessary resources.

My suspicion is that the problem is due to some sort of race between the two scheduler instances: that the two daemons are not communicating frequently enough to ensure that they both have up to date copies of the current reservation list. I'm not entirely surprised there are problems. I'm deliberately creating a reservation as near to the current time as possible — not something anyone in their right mind would normally want to do — and I'm relying on LoadL to coordinate across multiple machines, which is asking for trouble.

Time to dig through the manual: perhaps there's some parameter I can tweak to increase the scheduler synchronisation frequency...
sawyl: (Default)
I've discovered a nasty little problem with LoadLeveler reservations: a problem which means that sometimes, just sometimes, the reservation actually prevents the job from running, but which is kind of hard to describe.

Imagine a pair of reservations. Both reservations have at least one node in common between them, but because the first reservation is due to finish before the second is due to start, this is not a problem.

Now, consider what happens when RESERVATION_CAN_BE_EXCEEDED is set to true in the LoadL_config file and a job is submitted into the first reservation. If the job has a wallclock time that is longer than the reservation, but which means that it will finish before the second reservation becomes active, there is no problem. But if the wallclock time of the job means that it will end after the second reservation is due to start — i.e. if the current time plus the wallclock time of the job is larger than the start time of the second reservation — the two resource requests will clash and, consequently, the job will not run.
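
Reduced to arithmetic, the clash condition looks something like this (the times and durations are invented):

    from datetime import datetime, timedelta

    now = datetime(2010, 6, 1, 9, 0)                # job starts in the first reservation
    second_res_start = datetime(2010, 6, 1, 11, 0)  # shared nodes reserved again from here
    wallclock = timedelta(hours=3)

    # With RESERVATION_CAN_BE_EXCEEDED the job may run on past the end of
    # its own reservation, but only if it would be finished before the
    # overlapping reservation becomes active.
    if now + wallclock > second_res_start:
        print 'clash: the job would still hold the shared nodes at', second_res_start
    else:
        print 'ok: the job finishes before the second reservation starts'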

Thus, under certain circumstances, running a job in a reservation may actually hamper its execution, rather than assist it. And, worse still, because the placement of reservations is dependent on the state of the system, the problem does not necessarily occur repeatably...
sawyl: (Default)
Today's excitement came in the form of an attempt — successful, naturally — to interface sissasraidmgr with Nagios. Today's challenge involved trying to work out how to get a whole bunch of cool data out of a stupid Java-generated interface that only seems to support access via an interactive web browser.
sawyl: (Default)
After months of talking about it, I've finally implemented a set of LoadLeveler prolog and epilog programs. In the process I've discovered:

  • any output generated by the prolog gets stamped on by the job itself
  • the epilog has to be run as the job user in order to update the output files
  • information about prolog or epilog problems is written to the StarterLog of the node where the master task executed
  • an easy solution to the epilog output problem is to write the program in perl or C and dup stdout and stderr to $LOADL_STEP_OUT and $LOADL_STEP_ERR, because this also redirects any child processes (the same trick is sketched below)
  • not all LOADL variables are available on all nodes of a multi-node job (the important ones are usually only set on the master)
  • the programs are run by the job starter daemon and are unaffected by environment changes in the job script
  • the precise details of writing prologs and epilogs don't seem to be terribly well documented

And, most importantly, that while it doesn't matter much if the epilog programs don't work, if the prolog doesn't work the system will start to trash the jobs until the problems are fixed...
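
For what it's worth, here is the dup trick from the list above translated into python rather than perl or C. It assumes that LOADL_STEP_OUT, LOADL_STEP_ERR and LOADL_STEP_ID are set in the epilog's environment, which is worth verifying on your own system.

    import os
    import sys

    out = os.environ.get('LOADL_STEP_OUT')
    err = os.environ.get('LOADL_STEP_ERR')

    # Re-open the job's own output and error files and dup them onto fds 1
    # and 2, so anything the epilog or its children print lands in the job
    # output rather than being lost.
    if out:
        os.dup2(os.open(out, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0644),
                sys.stdout.fileno())
    if err:
        os.dup2(os.open(err, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0644),
                sys.stderr.fileno())

    print 'epilog: step %s finished' % os.environ.get('LOADL_STEP_ID', 'unknown')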

sawyl: (Default)
I've had yet another frantically busy week, with my time split between my own work and acting as consulting unix guru to others. Of my own projects, the most interesting has involved an attempt to monitor the temperatures of the p575s in real time. The actual data capture isn't too complicated and simply involves logging into each HMC in turn and:

  1. running lssyscfg to obtain a list of the managed frames;
  2. running lshwinfo against each frame to obtain the temperatures of each node in the frame.

The data from lshwinfo can then be massaged into an appropriate form and dumped out to a file for further analysis.
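
A rough sketch of that capture loop follows. The HMC hostnames and the hscroot user are invented, and the exact lssyscfg/lshwinfo options are from memory, so check them against the HMC documentation before trusting any of this.

    import subprocess
    import time

    HMCS = ['hmc1', 'hmc2']          # hypothetical HMC hostnames

    def on_hmc(host, command):
        """Run a command on an HMC over ssh and return its output lines."""
        out = subprocess.Popen(['ssh', 'hscroot@' + host, command],
                               stdout=subprocess.PIPE).communicate()[0]
        return out.splitlines()

    log = open('temps.csv', 'a')
    while True:
        stamp = time.strftime('%Y-%m-%d %H:%M:%S')
        for host in HMCS:
            for frame in on_hmc(host, 'lssyscfg -r frame -F name'):
                for line in on_hmc(host, 'lshwinfo -r sys -e ' + frame):
                    # Massage the raw output into a timestamped pseudo-CSV
                    # record for LiveGraph/Excel; the field handling here is
                    # deliberately naive.
                    log.write('%s,%s,%s\n' % (stamp, frame, line.strip()))
        log.flush()
        time.sleep(60)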

I decided to use the excellent LiveGraph utility to visualise the data in real time. LiveGraph accepts data in a pseudo-CSV form — something that makes it trivial to import the data into Excel for post-processing and analysis — and displays the results in a constantly updating chart window that can be configured to show either the entire data file or just the tail data.

My only minor quibble with it as a tool is that it doesn't seem to be possible to open a particular data source from the command line — something that matters to me because I want to package the whole thing up so that it can be launched with a single command that frees the end user from having to know anything about the underlying data source. Perhaps, if I have time next week, I'll investigate the API in more detail to see if I can't use that to solve the problem.

sawyl: (Default)
I've spent the last week or two working on a script to create instant LoadLeveler reservations, even in situations where resource shortages would normally prevent them from being created. In the process, I've uncovered a couple of LoadLeveler gotchas.

I've found that a job that starts and finds that it cannot write to its stdout or stderr files will go into user hold. Unless the initial directory is set with "# @ initialdir", it defaults to the current working directory of the llsubmit used to submit the job. If that directory is not writable by the job owner and the error and output paths are not explicitly routed to another directory, the job will not start to run.

This unpleasant fact caused us much worry when we discovered that jobs submitted using a script — a script that resided in a directory that was owned by a user other than the one submitting the job — went into user hold when submitted interactively; while near-identical jobs submitted as part of a job chain from a running batch job, whose working directory was set to $HOME, ran flawlessly.

Also, if an existing job is cancelled to make room for a new reservation, it takes a few seconds to release its resources. Once the resources are released, the next idle job will jump into the gap, causing the attempt to create a new reservation to fail, unless scheduling is temporarily paused. If the scheduling is paused by draining the schedulers, no new work will be started and (here's the real gotcha) no new jobs will be accepted for submission into the batch system until the schedulers are resumed.

There may be a solution to this dilemma. Perhaps, if all jobs are allocated floating resources at submit time, the resource limit could be reduced slightly prior to the attempt to create the reservation in order to stop any new jobs from starting. Then, once the reservation has grabbed the newly released resources, the limit could be brought back up to allow normal work to continue to schedule.

Still, these minor problems aside, it looks as though I might be on course to have something flawed but functional working by the end of the week — which precisely matches my initial estimate that it would take two weeks to get sorted out.
sawyl: (Default)
Rather to my horror, I've realised that my one big outstanding localisation, essentially a change to all the hostnames and a reconfiguration of LoadLeveler, can't be done on a per-cluster basis. Instead, it seems as though it all needs to be done in one change to prevent the applications people from having to create two sets of configuration files.

Given the complexities of the host files and the nature of the change, I'm not at all confident that it's going to work. Still, it's not like I have any choice. The changes have got to be made and the LoadLeveler queues have to be empty for it to work, so it's better to do it now, when only a handful of people are using the machines in earnest, than to wait until next week when the queues will be full...
sawyl: (Default)
Another training day, this time learning about Tivoli Storage Manager — a backup product markedly similar to our old friend Unitree — on a course cut down from five days to one to fit in with busy schedules and, presumably, to save money.

The topics covered included the gory details of how to set up a library and configure its devices and tapes; how to create and manage storage pools; how to repack sparse tapes and recycle old tapes; and how to manage the database and recovery logs. All worthy and, to some extent, interesting stuff presented by an enthusiastic and knowledgeable trainer with much sage advice to offer.

There was, however, one slight oversight. Despite all the information provided, by the end of the day we still didn't know how to either (a) back up or (b) restore data using TSM...
sawyl: (Default)
Spent the day on an excellent course learning all about QLogic's SilverStorm Infiniband switches. The training, provided by someone from QLogic with detailed knowledge of our set-up, was wonderfully concise and covered exactly what we needed to know.

The morning covered the topological details of the switches, each of which is built from a series of 24 port Mellanox switch chips linked together in a fat tree, and the IBM host channel adapters, which are actually switches in their own right in order to allow the Infiniband connections to be shared across LPARs.

Then, in the afternoon, we moved on to look at the subnet manager software and some of the query commands — something that required us to remember our theory from the morning. We looked at the web and command line interfaces, discovered how to interpret the error reports and learnt what was meant by a symbol error. We were also given an important warning: never to request statistics from the AIX host channel adapters because this, apparently, can cause the card to lock up...
sawyl: (Default)
Slightly more successful day today, running through the details of creating new NSDs, setting up GPFS file systems on them and replacing failing discs. After the guys in France resolved some annoying problems with the cluster — GPFS had not been reconfigured following the replacement of one of the servers — we managed to run through a bunch of useful exercises in time to finish by early afternoon.

While it wasn't the best course I've ever been on, the content was good, the instructors were enthusiastic, and I came away knowing more than when I started; the whole thing was rather let down, though, by the sheer flakiness of the lab cluster we were supposed to be using.
sawyl: (Default)
Distinctly mixed results today: the presentations on LoadLeveler were pretty interesting and I picked up a few good hints, although much of it was familiar. The practical was rather less successful. Following yesterday's problems, the nodes were still well and truly screwed — it looked like there was a problem with RSCT — so some of the exercises were distinctly problematic. Fortunately, I'd already run through most of the scenarios — trying out WLM, setting up fairshare, trying out a submit filter — on the test cluster, so I wasn't really too bothered by the problems. I'm not sure what the others thought though...
sawyl: (Default)
An unfortunate problem with the IO subsystem on the lab cluster, coupled with the absence of the sysadmin — apparently in transit to us from Montpellier — forced us to abandon much of today's course. Instead, the instructor, P, and I hung around in the classroom batting about ideas while the rest of the gang went back to the office to get on with something more productive. My excuse for not doing the same? I'd earmarked this week for design and planning rather than coding, so nothing was lost by my sitting around and waiting for the lab cluster to come back.
