futex deadlock stump

Discussion:

(too old to reply)

Brilte

2006-08-14 19:30:29 UTC

I am a little new to C and programming C on Unix (in my case Linux) but
not to programming. I am writing a system monitor, in C for Linux
(http://sf.net/projects/netnms).

My code runs fine for some random amount of time, anywhere between 1
hr+ to 2 minutes, before it meets a futex deadlock on my 2.6 kernel.

I've performed some checks and tried debugging by writing to stderr and
catching the output through an strace redirected using nohup &.

After the several weeks trying to find the cause of the deadlock i am
completely stumped as to whats happening and why the deadlock occurs.

Basically it always happens inside a logging library that i have
written for my system monitor when the signal sigusr1 is delivered.
Sigusr1 is a mechanism i have chosen to use to alert a parent process,
from a child process.

Can anyone give any pointers as to the possible source/reason for futex
deadlocks

My logging source code that the deadlock occurs in is....

} else if (env->log_env_pntr->logmethod_set == 2) {
/* Log to logfile */
time_t curtime = 0;
struct tm timestamp = {0,0,0,0,0,0,0,0,0};
int time_errorcount = 0;

if (!netnms_signal_block(env, SIGUSR1)) {
return 0;
}

if (!netnms_signal_block(env, SIGCHLD)) {
return 0;
}

while (time(&curtime) == -1) {
if (time_errorcount < ATTEMPT_LIMIT) {
time_errorcount++;
continue;

} else {
/* Failed to get epoch */
return 0;
}
}

fprintf(stderr, "1\n");

if (!netnms_signal_unblock(env, SIGUSR1)) {
return 0;
}

if (!netnms_signal_unblock(env, SIGCHLD)) {
return 0;
}

fprintf(stderr, "1.1\n");
fprintf(stderr, "%X\n", &curtime);
fprintf(stderr, "%X\n", &timestamp);

if (localtime_r(&curtime, &timestamp) == NULL) {
fprintf(stderr, "1.3\n");
return 0;
}

fprintf(stderr, "2\n");
}

From the above code snippet, netnms_signal_block and

netnms_signal_unblock are used to control the delivery of signals. In
this case, when the signal handler is called to process the sigusr1,
the block/unblock functions pick up that fact and do not change the
signal mask.

The strace below shows the futex deadlock in action and i've found it
to occurs anywhere inside the above code block. At first i thought it
was a problem with thread-safety but im using local variables, blocking
the non thread-safe time function and using localtime_r.

The strace output is as follows...

rt_sigprocmask(SIG_SETMASK, NULL, [USR1 CHLD], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, NULL, [CHLD], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
write(2, "1.1\n", 4) = 4
write(2, "BFA0BB14\n", 9) = 9
write(2, "BFA0BAE8\n", 9) = 9
write(2, "2\n", 2) = 2
write(2, "3\n", 2) = 2
write(2, "4\n", 2) = 2
write(2, "5\n", 2) = 2
write(2, "6\n", 2) = 2
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, NULL, [USR1 CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, NULL, [USR1 CHLD], 8) = 0
time(NULL) = 1155581415
futex(0x3f80bc, FUTEX_WAIT, 2, NULL

Maxim Yegorushkin

2006-08-14 20:05:41 UTC

Permalink

Post by Brilte
I am a little new to C and programming C on Unix (in my case Linux) but
not to programming. I am writing a system monitor, in C for Linux
(http://sf.net/projects/netnms).
My code runs fine for some random amount of time, anywhere between 1
hr+ to 2 minutes, before it meets a futex deadlock on my 2.6 kernel.

[]

Post by Brilte
The strace below shows the futex deadlock in action and i've found it
to occurs anywhere inside the above code block. At first i thought it
was a problem with thread-safety but im using local variables, blocking
the non thread-safe time function and using localtime_r.
The strace output is as follows...
rt_sigprocmask(SIG_SETMASK, NULL, [USR1 CHLD], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, NULL, [CHLD], 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
write(2, "1.1\n", 4) = 4
write(2, "BFA0BB14\n", 9) = 9
write(2, "BFA0BAE8\n", 9) = 9
write(2, "2\n", 2) = 2
write(2, "3\n", 2) = 2
write(2, "4\n", 2) = 2
write(2, "5\n", 2) = 2
write(2, "6\n", 2) = 2
rt_sigprocmask(SIG_SETMASK, NULL, [USR1 CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, NULL, [USR1 CHLD], 8) = 0
time(NULL) = 1155581415
futex(0x3f80bc, FUTEX_WAIT, 2, NULL

strace is a wise first step. Now that you know the thread is blocked on
a futex, you can attach gdb to the process, find the thread blocked in
that call and find the caller in the backtrace.

Brilte

2006-08-15 10:05:54 UTC

Permalink

(gdb) bt
#0 0x002aa402 in ?? ()
#1 0x0039ce3e in __lll_mutex_lock_wait () from /lib/libc.so.6
#2 0x00345627 in _L_lock_1849 () from /lib/libc.so.6
#3 0x003453cb in __tz_convert () from /lib/libc.so.6
#4 0x00343979 in localtime_r () from /lib/libc.so.6
#5 0x001141e8 in netnms_log_write (env=0x804f918, loglevel=8,
format=0x804e1e4 "Sigusr1 received") at log.c:277
#6 0x0804bf5b in sigusr1 () at controlproc_signals.c:205
#7 <signal handler called>
#8 0x003458cb in __tzfile_compute () from /lib/libc.so.6
#9 0x0034540d in __tz_convert () from /lib/libc.so.6
#10 0x00343979 in localtime_r () from /lib/libc.so.6
#11 0x001141e8 in netnms_log_write (env=0x804f918, loglevel=8,
format=0x124604 "recordlock_control.c::check_recordlock: Checking
recordlock")
at log.c:277
#12 0x00123db4 in netnms_check_recordlock (env=0x804f918,
recordlock_id=13, locknum=2) at recordlock_control.c:199
#13 0x0804ad3e in run_poll_time () at run_poll_time.c:190
#14 0x0804a0c6 in main () at controlproc.c:99
(gdb)

This looks to me like the localtime_r function is not thread safe. I am
wrong? Whats going on?

Maxim Yegorushkin

2006-08-15 12:25:32 UTC

Permalink

Post by Brilte
(gdb) bt
#0 0x002aa402 in ?? ()
#1 0x0039ce3e in __lll_mutex_lock_wait () from /lib/libc.so.6
#2 0x00345627 in _L_lock_1849 () from /lib/libc.so.6
#3 0x003453cb in __tz_convert () from /lib/libc.so.6
#4 0x00343979 in localtime_r () from /lib/libc.so.6
#5 0x001141e8 in netnms_log_write (env=0x804f918, loglevel=8,
format=0x804e1e4 "Sigusr1 received") at log.c:277
#6 0x0804bf5b in sigusr1 () at controlproc_signals.c:205
#7 <signal handler called>
#8 0x003458cb in __tzfile_compute () from /lib/libc.so.6
#9 0x0034540d in __tz_convert () from /lib/libc.so.6
#10 0x00343979 in localtime_r () from /lib/libc.so.6
#11 0x001141e8 in netnms_log_write (env=0x804f918, loglevel=8,
format=0x124604 "recordlock_control.c::check_recordlock: Checking
recordlock")
at log.c:277
#12 0x00123db4 in netnms_check_recordlock (env=0x804f918,
recordlock_id=13, locknum=2) at recordlock_control.c:199
#13 0x0804ad3e in run_poll_time () at run_poll_time.c:190
#14 0x0804a0c6 in main () at controlproc.c:99
(gdb)
This looks to me like the localtime_r function is not thread safe. I am
wrong? Whats going on?

It appears to me that you deadlock yourself.

Post by Brilte
From the stacktrace I gather localtime_r is interrupted by a signal and

then you call it again from the signal handler. localtime_r is not in
the list of async signal safe functions, so one should not be calling
it from a signal context. (please see the list
http://www.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_04.html#tag_02_04_03)

Apart from that, man localtime_r states that it is a thread-safe
function. It does not state it is reentrant (the one that can be
interrupted by another invocation of itself). They probably use a
non-recursive mutex in its implementation, this may be why the thread
blocks trying to acquire the mutex which is already held by the very
same thread.

Brilte

2006-08-15 13:01:57 UTC

Permalink

Ok, understand that. I took the assumption that thread-safety and
reentrant were the same thing.

What would you recommend? The only solution i can think ok, is that i
move the call to localtime_r in between the signal block code? But
then, i might as well switch to localtime.

Is there an alternative to localtime altogether that is re-entrant.

Thanks for your help.

Nils O. Selåsdal

2006-08-15 13:17:02 UTC

Permalink

Post by Brilte
Ok, understand that. I took the assumption that thread-safety and
reentrant were the same thing.

There is also the term asynch-signal-safe that needs to be concerned.
That's the issue you seem to have hit here :)

Post by Brilte
What would you recommend? The only solution i can think ok, is that i
move the call to localtime_r in between the signal block code? But
then, i might as well switch to localtime.

Then you have to make sure you don't call localtime from any other
threads at the same time - assuming your application is multithreaded.

Post by Brilte
Is there an alternative to localtime altogether that is re-entrant.

The best thing is to minimize code in signal handlers, or atleast
make very sure the signal handler is never called when any
data it accessed /directly or indirectly e.g. via library calls).
This gets tedious, and perhaps even more so if threads are involved
as the signal handler might be "randomly" executed in any thread
that doesn't block the signal.

Preferrably set a global flag which one inspect each iteration of a main
loop if that is feasible.

Brilte

2006-08-15 15:13:22 UTC

Permalink

Ok, this explains a lot. It has all been very informative. Thanks for
your help.

Maxim Yegorushkin

2006-08-15 14:50:34 UTC

Permalink

Post by Brilte
Ok, understand that. I took the assumption that thread-safety and
reentrant were the same thing.
What would you recommend? The only solution i can think ok, is that i
move the call to localtime_r in between the signal block code? But
then, i might as well switch to localtime.

I would think of not handling signals in signal handlers. If your
application's main loop is select/poll/... based it should be easy. In
the signal handler I would write the signal number to a pipe and then
handle it later in the main loop.