sawyl: (A self portrait)
During testing of my current project — a file tree analyser that uses the GPFS API to speed up access to the metadata — we discovered an interesting bug that resulted in some of the file ownerships being incorrectly attributed.

Investigating, I quickly realised that the fault was caused by a race condition: because the dump of the file namespace took a non-trivial amount of time to complete, it was possible for a file to be removed after its name had been retrieved but before its inode data had been read, and for the inode number to be reused by a new file. This meant that when the file names were joined with the file attributes using the inode number as a unique key, the original file path was assigned the attributes — including the ownership — of the new file. Once I understood this, I changed the join to use the inode number and generation ID in place of the inode number alone, only to find that this completely broke the matching process.

Digging deeper into the code and printing out some of the generation numbers, I discovered that the values returned by gpfs_ireaddir() in the form of a gpfs_direntx_t failed to match those returned by gpfs_next_inode() in a gpfs_iattr_t structure. From the sorts of values being returned, I wondered whether the problem might be caused by a mismatch between the variable types, so I replaced the 32-bit routines with their 64-bit equivalents, only to experience exactly the same problem.

Looking more closely, I eventually realised that the lowest 16 bits of the two generation IDs were the same, while the highest 16 bits were only set in gpfs_iattr_t.ia_gen. Masking the field appropriately, I was able to combine the generation ID with the inode number to create a compound key that joined the two structures coherently, trading one form of inaccuracy — incorrectly assigned ownerships — for a more acceptable one — ignoring files deleted and recreated during the scan.
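The masked compound-key join can be sketched in Python (the record layouts and field names here are hypothetical stand-ins for the gpfs_direntx_t and gpfs_iattr_t members, not the real structures):

```python
# Keep only the low 16 bits of the generation ID, the part common to
# both the gpfs_direntx_t and gpfs_iattr_t values.
GEN_MASK = 0xFFFF

def compound_key(inode, generation):
    """Combine the inode number with the masked generation ID."""
    return (inode, generation & GEN_MASK)

def join_records(names, attrs):
    """Join name records to attribute records on the compound key.

    Files deleted and recreated mid-scan get a fresh generation ID,
    so their stale name records simply fail to match and are dropped.
    """
    attr_index = {compound_key(a["inode"], a["gen"]): a for a in attrs}
    return [
        (n["path"], attr_index[compound_key(n["inode"], n["gen"])])
        for n in names
        if compound_key(n["inode"], n["gen"]) in attr_index
    ]
```

With this scheme a reused inode number no longer pairs a stale path with a new file's attributes; the mismatched generation bits make the key miss instead.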
sawyl: (A self portrait)
I've discovered a couple of elegant spin-offs from my current work on GPFS. Playing around with the pool structures and calls in the API, I've found it to be much faster than mmdf, largely because the latter goes out to each NSD and queries it for its usage information — something that's not of much interest if you just want to know whether the pools are full or not.

I've also realised that if I combine a couple of other calls, I can both identify the fileset that owns a particular directory and pull out the three levels of quota — user, group, fileset — for it. This makes it trivially easy to summarise all the limits imposed on a particular user in a way that indicates their current level of consumption, how much they have left, and the level at which the resource is being constrained.
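The summary step might look something like this, assuming the usage and limit figures have already been pulled out of the quota calls (the data layout, and the idea of picking the level with the least headroom as the binding constraint, are my own illustration rather than anything from the API):

```python
def summarise_quota(quotas):
    """Summarise per-level quotas and find the binding constraint.

    quotas: dict mapping level name ("user", "group", "fileset") to a
    (used, limit) pair, e.g. assembled from the three quota queries.
    Returns the per-level summary and the level with least headroom.
    """
    rows = {
        level: {"used": used, "limit": limit, "left": limit - used}
        for level, (used, limit) in quotas.items()
    }
    # The effective constraint is whichever level runs out first.
    binding = min(rows, key=lambda level: rows[level]["left"])
    return rows, binding
```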

And best of all, the code to do all this can be wrapped up in a handful of Cython calls. OK, so the fileset identification code seems to need an inode scan, which seems to mean it can't be run as a normal user, and I haven't yet found a way to dump out a complete list of filesets defined within the file system — although this information must be available, since mmlsfileset and mmrepquota can produce it — but it's an extremely handy thing to get out of something originally intended for a completely different purpose...
sawyl: (A self portrait)
Having considered recent difficulties, I've concluded that most of our problems have been brought on by a lack of data. So, as a good scientist, I've tried to come up with a way to remedy the situation. My solution? To run a series of tests that involve deliberately breaking the data mirror by taking half the logical disks out of service for a fixed period and timing the resynchronisation process.

According to my theory, this ought to show whether the resync time is proportional to the duration of the split — and, by implication (assuming a constant load on the file system), the amount of data written during the period — or whether it is dependent on the size of the metadata, which can be assumed to be relatively constant over the period.

If the initial experiment indicates that the time to resynchronise varies with the duration, then it ought to be possible to repeat the tests with gradually increasing intervals and to use the resulting data to create a model. If the resync time appears to be constant, at least within the time frame of the split, then we can be relatively confident that, for short outages at least, we can predict the duration and hence the impact of an outage of the mirror.
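The proposed analysis amounts to a simple linear fit of resync time against split duration. A rough sketch, with the data points left as an exercise for the experiment:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b.

    Here x would be the split duration and y the measured resync time.
    A slope near zero suggests the cost is dominated by the (roughly
    constant) metadata; a clear positive slope suggests it tracks the
    amount of data written during the outage.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```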

While I'm not sure anyone will actually go for this — the impact of the data gathering process may be too severe to tolerate — I think it's worth pitching...
sawyl: (A self portrait)
Today's incredibly neat discovery? GPFS callbacks. I know these have been around for a while, but until a suggestion from one of my colleagues prodded me into reading the documentation, I hadn't realised just how many things you can do with them.

My immediate ideas include: starting and stopping the batch software when its spool file system mounts or unmounts; automatically exporting file systems via NFS to other systems; requesting hardware replacements when the declustered arrays reach their maximum number of failures; raising alerts on reconstruct failures; and those are just my first ideas. The potential opportunities for laziness, impatience and hubris are vast!
sawyl: (Default)
Arriving early this morning I noticed something odd hanging from the atrium ceiling. On closer inspection, I discovered it was a modest sized weather observation blimp installed as part of this week's expo.

The rest of the morning proceeded more or less as expected, up to and including the depressingly predictable meltdown that occurred when the storage was shut down for an OS upgrade. After much fiddling we decided to cut our losses and reschedule the storage update for later in the week, by which time we hoped to have come up with a strategy to ameliorate the impact of taking the nodes out of service.
sawyl: (Default)
A little while ago, we discovered an interesting GPFS glitch that caused panicky unmounts on some of the systems. Investigating in detail, we traced the problem to revoke messages from a specific NSD server in the storage cluster, and when we examined the network connectivity, we realised that the topology forbade direct connections to the problem server from the compute environment. Consequently, during token operations involving more than one compute cluster — operations that required arbitration from the NSD server in order to resolve the lack of direct connectivity between the compute clusters — the NSD server could not be contacted, triggering a revoke on one of the requesting clusters.

Following this discovery, we've been working on ways to work around the communications problem. The US suggested configuring the NSD server with an IP alias in a different subnet and using one of the nodes with universal connectivity to act as a router. But during a chance conversation with the colleague who was implementing the solution, I pointed out that we might be able to solve the problem by altering the subnet mask on the NSD server, partitioning the topology into directly connected systems and systems that needed to be accessed via a gateway. The catch was that we would still need a way to route the traffic back from the indirectly connected systems without changing their netmasks — a problem I thought was intractable. So imagine my surprise when my colleague suggested exactly that as a solution!

According to my understanding of IP routing, the determination of whether a system is local or not is made by comparing the network components of an address with the network components of the machine's own interfaces. If the components match, the address is on the local network and the routing table is not examined. If the address is remote, attempts are made to match the target address to entries in the routing table, starting with host routes and widening out until the default route is reached as a destination of last resort.

However, it turns out that AIX, post 5.1 at least, does not behave like this. Rather, it seems to examine the routing table regardless of the network component of the target address, allowing host-specific routes to be matched ahead of the local interfaces. Thus my colleague was able to increase the size of the mask on the NSD server, add host-specific routes on the other side, and have the traffic route through the gateway — contrary to (my, at least) theoretical expectations — giving us a nice clean solution to our underlying problem.
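The difference between the two lookup orders can be shown with a toy model (the addresses are invented, and this is only the decision order described above, not how either stack actually implements routing):

```python
import ipaddress

# A made-up topology: one local network and one host-specific route
# that sends traffic for an address *inside* that network via a gateway.
INTERFACES = [ipaddress.ip_network("10.1.0.0/16")]
HOST_ROUTES = {ipaddress.ip_address("10.1.0.42"): "gateway"}

def classic_lookup(dst):
    """Classic model: local interfaces win; table consulted only for remotes."""
    dst = ipaddress.ip_address(dst)
    if any(dst in net for net in INTERFACES):
        return "direct"
    return HOST_ROUTES.get(dst, "default")

def aix_lookup(dst):
    """AIX-style model: host routes checked even for local addresses."""
    dst = ipaddress.ip_address(dst)
    if dst in HOST_ROUTES:
        return HOST_ROUTES[dst]
    if any(dst in net for net in INTERFACES):
        return "direct"
    return "default"
```

In the classic model the host route for 10.1.0.42 is never consulted because the address looks local; in the AIX-style model it wins, which is exactly what made the narrowed-mask trick work.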
sawyl: (Default)
After much consideration and many, many experiments I think I've finally cracked the annoying performance problem I've been investigating for most of the past couple of weeks. Essentially, the problem was that the run times for a series of standard builds which used to complete in around 900 seconds had suddenly started taking much longer than expected — up to three times longer in some cases.

With the times varying from routine to routine across runs — indicating that the problem was independent of the amount of compute work being performed — I immediately suspected that the root cause might lie somewhere in the IO subsystem. I ran a series of experiments: parallel builds with a range of different source, destination and compiler configurations, all bundled up in a resubmitting job chain to control for the usual diurnal fluctuations in load. Initial results were disappointing until I added a step to control for the location of the temporary directory. This generated a clear signal: changing $TMPDIR to point somewhere other than the default location — the root directory of a busy GPFS file system optimised for large block IO — gave noticeably better performance.

A further set of experiments, this time using a trivial bit of code to time the mkstemp() call under a range of different conditions, confirmed the problem: the large block GPFS file systems offered suboptimal performance when working with small files, and the performance of the call fell off noticeably as the target directory filled up with files. This last test offered some interesting insights into the behaviour of GPFS metadata caching, with performance falling off when all the inodes were created in quick succession but not if a few minutes were allowed to pass between file creation runs (presumably because this gives sufficient time for the metadata to synchronise).
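The timing test can be sketched along these lines (a rough reconstruction, not the original code, which would simply be pointed at a directory on the file system under investigation):

```python
import os
import tempfile
import time

def time_mkstemp(directory, count):
    """Create `count` temporary files in `directory` via mkstemp(),
    returning the per-call wall-clock times.

    Running this repeatedly as the directory fills up exposes the
    fall-off in metadata performance described above.
    """
    times = []
    for _ in range(count):
        start = time.perf_counter()
        fd, path = tempfile.mkstemp(dir=directory)
        times.append(time.perf_counter() - start)
        os.close(fd)  # keep the file so the directory keeps growing
    return times
```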

All of which suggests that it ought to be possible to significantly improve compilation performance by changing the location of $TMPDIR to point to a user-specific empty directory on a GPFS file system optimised for small block IO.

ETA: Finally traced the root cause of the problem back to an application configuration file that was changed a couple of weeks ago, overriding the default value of $TMPDIR and changing it to point to the GPFS file system. Mystery solved!
