sawyl: (Default)
After a day of stressing over yet another urgent OS upgrade, I found myself forced to spend my evening reconfiguring my mail server following my ISP's decision to migrate to Exchange. I'd been dreading the change, thinking my mail server still ran sendmail. But to my surprise I discovered that it was running postfix, making it a breeze to reconfigure.

All I had to do was:

  1. change fetchmail to point to the new POP server along with the old one
  2. change /etc/postfix/sasl_passwd to add my username and password for the SMTP server
  3. run postmap hash:/etc/postfix/sasl_passwd to rebuild the DB file
  4. change relayhost in /etc/postfix/main.cf
  5. bounce fetchmail and postfix
  6. send myself a couple of test messages

And it all worked like a charm. It's days like this that make me feel like I might actually be worthy of counting myself a professional sysadmin...
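
For reference, steps 2 and 4 come down to a couple of short text fragments. What follows is a minimal sketch in Python of what they might look like, not the actual configuration: the relay name `smtp.example.net`, port 587 and the credentials are all made up, and the helper names are hypothetical. `relayhost`, `smtp_sasl_auth_enable` and `smtp_sasl_password_maps` are standard Postfix parameters, but which ones a given ISP needs is an assumption.

```python
def sasl_passwd_line(host, port, user, password):
    """One lookup-table entry: the relay in [host]:port form, then user:password."""
    return f"[{host}]:{port} {user}:{password}"


def main_cf_lines(host, port, table="/etc/postfix/sasl_passwd"):
    """The main.cf settings that point Postfix at the relay and the password table."""
    return [
        f"relayhost = [{host}]:{port}",
        "smtp_sasl_auth_enable = yes",
        f"smtp_sasl_password_maps = hash:{table}",
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(sasl_passwd_line("smtp.example.net", 587, "someuser", "s3cret"))
    print("\n".join(main_cf_lines("smtp.example.net", 587)))
```

The key at the start of the sasl_passwd entry has to match the relayhost value exactly, which is why both helpers build it the same way.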

sawyl: (Default)
And so, after finishing the Work That Must Be Finished and doing a global stop and start, LoadLeveler has settled itself back down and started scheduling again as normal. I'm more than slightly relieved: not only has my vision of days of instability followed by a cold start vanished, but my original hypothesis about the cause of the problem seems to have been justified.
sawyl: (Default)
Rather dispiriting day spent crawling through the logs, investigating a series of connected scheduling problems. Despite their superficial dissimilarities, almost all the problems had a common fingerprint: a timeout of some sort. Having concluded that the timeouts were caused by on-going maintenance on the storage subsystem, I found myself in the unfortunate position of being required to tell a number of unlucky users that there was no immediate solution, and the only thing we could do was push on with the work and hope for a happy ending...
sawyl: (Default)
Following the difficulties encountered with various computer systems after the introduction of a leap second this morning, here is an interesting post describing how Google add an extra second to the day without introducing a clock jump:

Computers traditionally accommodate leap seconds by setting their clock backwards by one second at the very end of the day. But this “repeated” second can be a problem. For example, what happens to write operations that happen during that second? ...

The solution we came up with came to be known as the “leap smear.” We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day.

And even more apposite now than it was when it was written, Falsehoods programmers believe about time.
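
The smear idea in the quoted post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear ramp, the 20-hour window and the function name `smeared_fraction` are all choices made here for simplicity, not details taken from the quoted text.

```python
from datetime import datetime, timedelta, timezone


def smeared_fraction(now, leap_at, window=timedelta(hours=20)):
    """Return how much of the extra second (0.0 to 1.0) has been absorbed by `now`."""
    start = leap_at - window
    if now <= start:
        return 0.0  # the smear has not begun yet
    if now >= leap_at:
        return 1.0  # the whole second has already been absorbed
    return (now - start) / window  # linear ramp across the window (an assumption)


if __name__ == "__main__":
    # Illustrative leap instant; any timezone-aware datetime would do.
    leap_at = datetime(2012, 7, 1, tzinfo=timezone.utc)
    halfway = leap_at - timedelta(hours=10)
    print(smeared_fraction(halfway, leap_at))
```

Because the adjustment is spread over the window, no client ever sees the clock step backwards or repeat a second.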

sawyl: (Default)
The following, from Ken MacLeod's latest, made me laugh with recognition:

Reminded, Ferguson invoked the [Police National AI]. The overnight correction work had reset the system's suspicion parameters, but in the wrong direction. After requesting arrest warrants for every one who had been on Easter Road for the past week, the PNIA was once more in the process of being talked down.

MacLeod, K., (2008), The Night Sessions, Orbit: London, 82

Even if the balky technological entities I curate don't actually suffer from paranoia, my attempts to coax them back into production often feel more like a form of psychoanalysis than they do real computing.

sawyl: (Default)
In general, our user community is pretty clueful. I guess it comes with the territory: to know that you've got a requirement for HPC, you generally have to know what you're doing. But every so often you get a question that makes you stop and think.

For example, on Friday, I was contacted by a user wanting to know how to telnet in to the system. But, lest you think I'm being unfair, the user had (a) the name of the remote system; (b) the name of the command they needed to use, i.e. telnet; and (c) ten years of experience with Unix! Full of cheerfulness following lunch in Topsham, I bit back a curmudgeonly retort and pinged off a reply containing the required command and thought no more about it.

Thought no more about it until, that is, today when I logged on to investigate the failure of one of the tape drives. After getting my overtime payment for doing not much more than telling the operations people to follow their normal procedures, I checked my mail to discover that I'd got two replies to my telnet email. The first contained a literal transcript of a telnet session that stopped at the point where the user had been prompted for their password, followed by a plaintive request that I inform them of their password — this despite repeated assurances over the years that sysadmins do not know user passwords. The second request, dated a few minutes later, was to inform me that I no longer needed to inform them of their password because they'd remembered it.

I guess that on balance I should just be thankful it was such an easy request to sort out...
sawyl: (Default)
Had a classic user request today. The guy had a helpful set of data manipulation tools, which he'd sold to the head of his division as the cat's pyjamas, which he wanted us to install and support centrally, and he would have given us detailed instructions on how to install it except that, oops, he couldn't actually get it to build. In the end, it turned out to be even worse than it sounded: he couldn't even get it to pass configure, so he hadn't realised that the yacc code wasn't accepted by our incredibly retro version of bison and the parts that would actually compile wouldn't link.

No wonder he wanted someone else to take on the support...
sawyl: (Default)
You know you're having a dull week when the most interesting problems to be passed your way relate to job scheduling, workload management and sendmail.
sawyl: (Default)
Noticed that the Register has got something about some open source support companies getting together in an attempt to persuade businesses to use Nagios, Zenoss etc, rather than proprietary horrors like Tivoli.

Sounds like nothing but a good thing...
sawyl: (Default)
Perhaps it was my fault. Perhaps I was unclear.

When the two HSM guys turned up to consult with my colleague, they tried to persuade him to find some spare disc space to allow them to mess around with stuff. When my colleague asked me if I knew whether a particular file system could be recycled, I said, "It's Friday. There's a bank holiday coming up. I really wouldn't make any changes to the system. If I were you, I'd leave it completely alone and sort it out next week." He looked suitably comprehending and went off. Then a few hours later, under the watchful eyes of my colleagues, the visitors were allowed to unmount one of the HSM managed file systems, the one that had spent the last 18 hours kicking up a stream of errors, and surprise surprise, the system panicked in a heap. Unwilling to sacrifice my precisely timetabled plans for this afternoon, I left others to dump the system and went home.

To my mind, today's problem simply emphasises two important rules of system administration: never, ever make changes on a Friday; and hierarchical storage managers are a total waste of time, far better to buy a bunch of cheapo discs than arse around with tapes and virtual storage.
sawyl: (Default)
While the case for sudo-vs-root is slightly different for large, multi-admin systems, this article is vaguely interesting, not so much for what it says as the thoughts it provokes.

I suppose the main point of using sudo in a production environment isn't so much security as CYA: sudo generates a nice audit trail of events, giving you proof that your minor change wasn't the one that screwed the system. Of course there's still the problem of people just starting root shells and bypassing the audit trail that way, but that can be easily dealt with by coming down like the wrath of God on anyone who breaks the rules. After all, what's the point in having a security policy if it's casually violated?
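
Spotting the root-shell offenders can be at least partly automated. The following Python is a rough sketch, assuming sudo logs via syslog in its usual `user : TTY=... ; PWD=... ; USER=... ; COMMAND=...` layout; the list of shells and the function name `root_shell_users` are assumptions made for illustration.

```python
import re

# Programs that take an admin outside the per-command audit trail (an assumed list).
SHELLS = {"sh", "bash", "ksh", "csh", "tcsh", "zsh", "su"}

# Matches the usual sudo syslog payload: "sudo: user : ... ; COMMAND=/path args".
ENTRY = re.compile(r"sudo:\s+(?P<user>\S+) : .*COMMAND=(?P<command>.+)$")


def root_shell_users(log_lines):
    """Yield (user, command) for sudo entries whose command is a shell or su."""
    for line in log_lines:
        match = ENTRY.search(line)
        if not match:
            continue  # not a sudo command entry
        command = match.group("command").strip()
        parts = command.split()
        # Compare only the program's basename, ignoring its directory and arguments.
        if parts and parts[0].rsplit("/", 1)[-1] in SHELLS:
            yield match.group("user"), command
```

Feeding it the lines of the relevant log file gives a list of names to have a quiet word with. It won't catch a shell started from inside an editor or pager, so it complements the policy rather than replacing it.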
sawyl: (Default)
Cron is driving me crazy. I've been looking into a problem whereby a whole bunch of my jobs have been failing because they're unable to find python. Fair enough, I thought, maybe I need to set the path explicitly from within the crontab, so I did.

It didn't help, but I have no idea why. Displaying the environment with printenv shows the path to be correct. Running python -c "import os; print os.environ['PATH']" reports the correct path but running python -c "import os; os.system('python2.4 -V')" fails even though the directory with python2.4 is blatantly in the path.

One for tomorrow, I think...
Updated: Turns out my problem was wood-for-the-trees related. When, rested and restored, I revisited the problem, I noticed that instead of adding the directory with the python2.4 binary to the path, I'd added the binary itself to the path. No wonder everything was hinky.
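
This kind of mistake is easy to check for mechanically. Here is a minimal sketch in Python 3 syntax (unlike the Python 2 one-liners above), with the hypothetical helper `non_directory_entries` listing any PATH entry that exists but is a file rather than a directory:

```python
import os


def non_directory_entries(path_value):
    """Return PATH entries that exist but are files rather than directories."""
    return [
        entry
        for entry in path_value.split(os.pathsep)
        if entry and os.path.exists(entry) and not os.path.isdir(entry)
    ]


if __name__ == "__main__":
    # Check the PATH this process actually inherited, e.g. when run from cron.
    for entry in non_directory_entries(os.environ.get("PATH", "")):
        print("not a directory:", entry)
```

Run from the crontab itself, it reports on the environment cron really provides rather than the one in an interactive shell.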
sawyl: (Default)
Although I generally resent being on call, sorry, providing reasonable endeavours support, it does have its redeeming moments. Today, for example, the sense of pathetic relief in the voice of the console jockey was almost sufficient compensation for taking the call.

Almost. But not quite. No, sufficient compensation involves putting in a claim for three hours of Sunday overtime, despite only having actually worked on the problem for two minutes and twenty-three seconds...
sawyl: (Default)
Had a meeting today to try to thrash out an answer to last week's question about the necessity of root passwords. I attended in my usual meeting role of domine canis, Aquinas to the HC's Doctor Universalis, and enjoyed myself rather more than usual.

Maybe it's the result of an upbringing heavy on dialectics, but I reckon there's nothing better than a good argument to clarify which of the points up for discussion are in question. There's nothing like a challenge for firming up one's own beliefs, for as Mill says, if you don't challenge stuff it becomes, "deprived of its vital effect on the character and conduct, the dogma becomes a mere formal profession." And we wouldn't want that, now, would we?

Got root?

Feb. 22nd, 2006 05:11 pm
sawyl: (Default)
Today's big question: to what extent do sysadmins require the root password? My general feeling is that, given a correctly set up system with a decent sudo configuration, the answer is probably a lot less than people think they do.
sawyl: (Default)
Today looks like being a bad day. First I spent five painful minutes trying to put my left contact lens in only to belatedly discover that I had it inside out, then when I got in to work I found that the workstation network was out of action after an automatic job had accidentally nuked the system configuration files, including the password file.

It's like one of those geekish unix games that involve escaping from a deliberately convoluted disaster situation, like having a completely empty password file, without reinstalling from scratch: how do you roll out a set of fixes to a cluster of workstations if none of the machines have a root account? Is there a neat, hackerish solution or do you just have to go around with a whole bunch of boot CDs?

Updated: It turned out that the best way to fix the problem was to arm 7–8 people with CDs with copies of the correct password file, then send them out to boot each affected machine into single user and fix things that way. Unpleasant and time consuming when you've got a couple of hundred boxes to do, but ultimately effective.
sawyl: (Default)
On several occasions over the last few weeks, I've found myself unconsciously calling to mind Neal Stephenson's rhyme about King Coyote from The Diamond Age:

Castles, gardens, gold and jewels
Contentment signify, for fools
Like Princess Nell; but those
Who cultivate their wit
Like King Coyote and his crows
Compile their power bit by bit
And hide it places no one knows.

I suspect it's a sign of my dismissive attitude to documenting arcane system voodoo: the docs are the gardens, gold and jewels; the real power, compiled bit by bit, comes from years of immersion in the ongoing lossage that is the Inorganic Ideologue.

sawyl: (Default)
Here's just one way to totally nark your sysadmin: make all your files world writable and then bitch when one of your colleagues runs a brain dead, super broken script that does an rm -Rf $UNSET_VARIABLE/, does a delete on everything from the root directory down and trashes them all. You can, of course, make up for this act of shocking incompetence in the eyes of your system wrangler by interrupting them when they're in the middle of hauling stuff back from tape to explain why you're a special case, why you absolutely positively have to leave a bunch of writable stuff lying around and why it's not the result of lazy design on your part. Yeah, that ought to soothe any frayed nerves and put you back in your admin's good books.

Does all that sound like too much work? Well, why not try a simpler approach: just run an rm -Rf / to see what it does. I bet it won't do all that much damage. I mean, it's not like you're root or anything, so it's not like anything really bad could happen...

I swear, next time I do a serious system installation, I'm going to configure an impossibly fascist MLS regime that makes it totally impossible for such things to happen, one that makes it totally impossible for users to log on, create any files, submit any jobs, stuff like that...
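
Short of a full MLS regime, the unset-variable disaster can be headed off in the script itself. The Python below is a minimal sketch of the sort of guard that would have helped; the function names are hypothetical and the checks are deliberately conservative, refusing an unset or empty variable and anything that resolves to the root directory.

```python
import os
import shutil


def checked_target(var_name, environ=None):
    """Return the directory named by an environment variable, or raise ValueError."""
    environ = os.environ if environ is None else environ
    value = environ.get(var_name, "")
    if not value.strip():
        raise ValueError(f"{var_name} is unset or empty; refusing to continue")
    target = os.path.realpath(value)
    if target == os.path.realpath(os.sep):
        raise ValueError(f"{var_name} resolves to the root directory")
    return target


def remove_tree(var_name, environ=None):
    """Delete the tree named by the variable, but only after it has been checked."""
    shutil.rmtree(checked_target(var_name, environ))
```

The point of the design is that the failure mode becomes an error message rather than a tape restore.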

NFS blues

Dec. 15th, 2005 07:43 pm
sawyl: (Default)
This afternoon, I was asked to look into some of the NFS performance problems and to come up with a plan to remove the worst of the bottlenecks. As soon as the subject came up I felt my heart sink and then later, after the discussion had finished and the full horror of the thing had sunk in, I found that I wanted to do nothing more than to lock myself in the lavatory and spend the rest of the day crying quietly...

Slow moves

Nov. 16th, 2005 11:07 pm
sawyl: (Default)
A slight, totally slight, gain in order: discovered that the netmask on the mac was way out, so although it could talk to localnet, the rest of the world was like shrouded in silence. Also "fixed" DNS by dumping its nasty ass but despite yesterday's upbeat, SMTP auth is still splorking with "makeconnection... failed: Invalid argument". Probs just a wrong parameter in my client-info file, but question is which one?

Page generated Jul. 27th, 2017 06:47 pm