Rambles around computer science

Diverting trains of thought, wasting precious time

Tue, 15 Sep 2009

clone() and the unusual process dynamics

I had a weird problem with my X login scripts recently on my Lab machine---I noticed that for X sessions, the LD_LIBRARY_PATH environment variable wasn't being set, even though every other variable set from my ~/.profile was still appearing correctly. After a bit of digging, I discovered why, but that only opened up an extra can of mystery.

The basic problem was that ssh-agent is a setuid program, so at the start of loading, the Linux loader removes all dynamic linker options (including LD_PRELOAD and LD_LIBRARY_PATH) from the environment to avoid running user code with elevated privileges. (It would make more sense just to ignore them, but still to propagate them to children... but anyway.) I was still a bit puzzled though, because I wasn't knowingly running my session as a child of ssh-agent---my login scripts do start one up if it's not already running, but it's supposed to run as a sibling of my main X session script, rather than using the ssh-agent <command> mechanism to start a child session. And ps agreed with me---my session wasn't descended from an ssh-agent process... but I did have two agents running, where I thought my scripts went to pains to ensure there was only one.

  PID TTY      STAT   TIME COMMAND
19233 ?        Ss     0:00 /bin/bash -l /home/srk31/.xsession
19573 ?        S      0:00  \_ ssh-agent
19589 ?        S      0:00  \_ fvwm
19593 ?        Ss     0:00      \_ <... more X clients>
19433 ?        Ssl    0:00 /bin/dbus-daemon --fork --print-pid 4 --print-address
19432 ?        S      0:00 /usr/bin/dbus-launch --exit-with-session /home/srk31/

The explanations for this were somewhat convoluted. Firstly, the Fedora xinit scripts (FC7) do this (in /etc/X11/xinit/xinitrc-common, sourced from /etc/X11/xinit/Xsession):

# Prefix launch of session with ssh-agent if available and not already running.
SSH_AGENT=
if [ -x /usr/bin/ssh-agent -a -z "$SSH_AGENT_PID" ]; then
    if [ "x$TMPDIR" != "x" ]; then
        SSH_AGENT="/usr/bin/ssh-agent /bin/env TMPDIR=$TMPDIR"
    else
        SSH_AGENT="/usr/bin/ssh-agent"
    fi
fi

...and later (in /etc/X11/xinit/Xsession)...

# otherwise, take default action
if [ -x "$HOME/.xsession" ]; then
    exec -l $SHELL -c "$SSH_AGENT $DBUS_LAUNCH $HOME/.xsession"
elif [ -x "$HOME/.Xclients" ]; then
    exec -l $SHELL -c "$SSH_AGENT $DBUS_LAUNCH $HOME/.Xclients"
elif [ -x /etc/X11/xinit/Xclients ]; then
    exec -l $SHELL -c "$SSH_AGENT $DBUS_LAUNCH /etc/X11/xinit/Xclients"
else
    # should never get here; failsafe fallback
    exec -l $SHELL -c "xsm"
fi

In other words, they test whether an ssh-agent is running, and arrange to start one if not. But in between testing and starting one, they run a shell---which naturally starts my login scripts. These check for ssh-agent themselves and, finding none, start one. Then later, the Fedora scripts start another one. It's a classic “unrepeatable read” race condition, but without any concurrency---just interleaving of foreign code (my login scripts).
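
To make the interleaving concrete, here is a toy simulation of that check-then-act sequence. It starts no real agent: start_agent, login_scripts, system_session and the AGENTS_STARTED counter are all made-up names for illustration, and the "PID" is a fake string.

```shell
#!/bin/sh
# Simulation only: no real ssh-agent is launched. AGENTS_STARTED counts
# how many agents the (hypothetical) scripts would have started.
AGENTS_STARTED=0
SSH_AGENT_PID=

start_agent() {
    AGENTS_STARTED=$((AGENTS_STARTED + 1))
    SSH_AGENT_PID="fake-$AGENTS_STARTED"
}

# Stand-in for my login scripts: start an agent only if none is known.
login_scripts() {
    [ -n "$SSH_AGENT_PID" ] || start_agent
}

# Stand-in for the system scripts.
system_session() {
    prefix=
    [ -n "$SSH_AGENT_PID" ] || prefix=start_agent   # the check (the "read")
    login_scripts                                   # foreign code runs in between
    [ -z "$prefix" ] || $prefix                     # the stale decision is acted on
}

system_session
echo "agents started: $AGENTS_STARTED"
```

Run as written, this reports two agents started: the decision recorded in prefix is already out of date by the time it is used.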

Next, why wasn't my session showing up as a child of one of the ssh-agent processes? ps's output was doubly confusing because the top of my process tree was a bash -l .xsession process, when that's the last to be launched by the sequence initiated in the Fedora scripts! Well, strace revealed that my processes were using the clone() system call to spawn new processes (which is subtly different from fork(), in that it allows shared address spaces and hence multithreaded programming). As we know, when a process clones itself in order to start a new process, one of the resulting pair replaces itself with the new process image, while the other continues on its merry way. In the case of both ssh-agent and dbus-launch, the parent process was the one which replaced itself, leaving the child to continue the work of SSH agentery or DBUS launchery or whatever. This is really confusing because it contradicts the usual expectations about causal ordering from parent to child processes---but it's perfectly allowed, and has always been possible in Unix.
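
The same inversion is easy to reproduce from a shell, since & forks a child and exec replaces the current process image. The following is only a sketch of the pattern, not what ssh-agent or dbus-launch literally do: the inner shell prints its PID, leaves a background child behind (a sleep standing in for the long-lived agent), and then execs a new program, which reports its own PID.

```shell
#!/bin/sh
# The parent prints its PID, forks a lingering child, then exec()s a new
# program. The new program runs under the same PID, i.e. it is the *parent*
# that became the new image, while the child carries on independently.
out=$(sh -c '
    echo "before exec: $$"
    sleep 1 >/dev/null &                    # child: stands in for the agent
    exec sh -c "echo after exec: \$\$"      # parent: replaced by the "session"
')
echo "$out"

# Pull the two PIDs back out of the captured output for comparison.
before=$(printf '%s\n' "$out" | sed -n 's/^before exec: //p')
after=$(printf '%s\n' "$out" | sed -n 's/^after exec: //p')
[ "$before" = "$after" ] && echo "same PID before and after exec"
```

The two PIDs match because exec does not create a process. To ps, the sleep is a child of whatever that PID is running now, which is how an agent can appear underneath the session it was supposed to wrap.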

What was the fix? Sadly, there isn't a good one---I don't have permission to edit the Fedora scripts on my Lab machine, and there's no configurable flexibility for disabling the ssh-agent launching or fixing the racy logic. So I added a hack to my .xsession shell script which detects the case where SSH_AGENT_PID is already set to a child of the shell process (since the ssh-agent my scripts created is a sibling) and if so, kills that process and resets SSH_AGENT_PID to the one I created earlier (which, handily, I keep stored in ${HOME}/.ssh/agent-$(uname -n)). As usual, completely horrible.
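
For the curious, the hack looks roughly like the sketch below. This is a reconstruction, not the actual script: the function names is_child_of and fix_duplicate_agent are invented here, the parent lookup reads /proc and so is Linux-specific, and it assumes the agent file holds the output of ssh-agent -s (the SSH_AUTH_SOCK and SSH_AGENT_PID assignments).

```shell
#!/bin/sh
# Hypothetical sketch of the .xsession workaround; names are illustrative.

# Succeeds if process $1 exists and its parent is process $2.
# Linux-specific: reads the PPid field from /proc/<pid>/status.
is_child_of() {
    _ppid=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$1/status" 2>/dev/null)
    [ -n "$_ppid" ] && [ "$_ppid" = "$2" ]
}

# Assumption: this file contains the output of `ssh-agent -s` for my own agent.
AGENT_FILE="${HOME}/.ssh/agent-$(uname -n)"

# $1: PID of the session shell (normally $$).
fix_duplicate_agent() {
    if [ -n "$SSH_AGENT_PID" ] && is_child_of "$SSH_AGENT_PID" "$1"; then
        # This agent was started by the system scripts; get rid of it...
        kill "$SSH_AGENT_PID"
        # ...and point the session back at the agent I started earlier.
        # Sourcing the file resets SSH_AGENT_PID and SSH_AUTH_SOCK.
        [ -r "$AGENT_FILE" ] && . "$AGENT_FILE" >/dev/null
    fi
}

# In .xsession this would be invoked as:
#   fix_duplicate_agent "$$"
```

Nothing is killed unless SSH_AGENT_PID really does name a direct child of the given shell, so a correctly started sibling agent is left alone.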
