Discussion:
Why 8 bit exit status codes?
Andreas Kempe
2024-02-02 16:05:14 UTC
Permalink
Hello everyone,

I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

This bit some colleagues at work at one point: they were porting a
modem manager from a real-time OS to Linux and were returning
negative status codes for errors. We fixed it by changing the status
codes and I never really thought about why this is the state of
things... until now!

Having a look at man 3 exit on my FreeBSD system, it states
Both functions make the low-order eight bits of the status argument
available to a parent process which has called a wait(2)-family
function.
and that it is conforming to the C99 standard
The exit() and _Exit() functions conform to ISO/IEC 9899:1999
(“ISO C99”).
C99 7.20.4.3 § 5 states
Finally, control is returned to the host environment. If the value of
status is zero or EXIT_SUCCESS, an implementation-defined form of the
status successful termination is returned. If the value of status is
EXIT_FAILURE, an implementation-defined form of the status
unsuccessful termination is returned. Otherwise the status returned
is implementation-defined.
which I read as the C standard leaving it to the implementation to
decide how to handle the int type argument.

Having a look at man 2 _exit, the system call man page, it says
nothing about the lower 8 bits, but claims conformance with
IEEE Std 1003.1-1990 ("POSIX.1") which says
in Part 1: System Application Program Interface (API) [C Language], 3.2.2.2 § 2
If the parent process of the calling process is executing a wait() or
waitpid(), it is notified of the termination of the calling process
and the low order 8 bits of status are made available to it; see
3.2.1.
that only puts a requirement on making the lower 8 bits available.
Looking at a more modern POSIX, IEEE Std 1003.1-2017, which has
waitid() defined, it has the following for _exit():
The value of status may be 0, EXIT_SUCCESS, EXIT_FAILURE, or any
other value, though only the least significant 8 bits (that is,
status & 0377) shall be available from wait() and waitpid(); the
full value shall be available from waitid() and in the siginfo_t
passed to a signal handler for SIGCHLD.
so the mystery of why the implementation is the way it is was
dispelled.
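
A quick way to see the truncation in action (and the exact failure
mode that bit my colleagues) is to hand exit() an out-of-range value
and look at what the parent gets back. A minimal sketch using nothing
beyond standard fork(), waitpid() and the W* macros:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0)
        exit(-1);       /* a "negative error code" */

    int status;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        /* prints 255: only status & 0377 survives the trip through wait() */
        printf("child exit status: %d\n", WEXITSTATUS(status));
    return 0;
}

The same thing happens with exit(259): the parent sees 3.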

The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?

I don't know if I have ever come into contact with software that deals
with status codes that actually looks at the full value. My daily
driver shell, fish, certainly does not.
--
Best regards,
Andreas Kempe
Scott Lurndal
2024-02-02 16:33:40 UTC
Permalink
Post by Andreas Kempe
Hello everyone,
I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
<snip>
Post by Andreas Kempe
The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?
The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer and wait needed to be able
to include metadata along with the exit status.
Andreas Kempe
2024-02-02 20:02:16 UTC
Permalink
Post by Scott Lurndal
Post by Andreas Kempe
Hello everyone,
I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
<snip>
Post by Andreas Kempe
The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?
The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer
I'm afraid that's a tall order. I had yet to learn how to read when
they went out of production. :) Please excuse my ignorance.
Post by Scott Lurndal
and wait needed to be able to include metadata along with the exit
status.
I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?
Scott Lurndal
2024-02-02 20:15:24 UTC
Permalink
Post by Andreas Kempe
Post by Scott Lurndal
Post by Andreas Kempe
Hello everyone,
I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
<snip>
Post by Andreas Kempe
The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?
The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer
I'm afraid that's a tall order. I had yet to learn how to read when
they went out of production. :) Please excuse my ignorance.
Post by Scott Lurndal
and wait needed to be able to include metadata along with the exit
status.
I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?
The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).


Here's the modern 32-bit layout (in little endian form):

unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;

It's just the PDP-11 unix 16-bit version with 16 unused padding bits.

SVR4 added the waitid(2) system call which via the siginfo argument has
access to the full 32-bit program exit status.
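
Portable code never pokes at those bits directly, of course; the
<sys/wait.h> macros do the unpacking. A minimal sketch of a parent
reading both halves (just standard waitpid() and the W* macros,
nothing implementation-specific):

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        raise(SIGKILL);          /* child dies from a signal, no exit code */
        _exit(1);                /* not reached */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return 1;
    if (WIFEXITED(status))           /* the __w_retcode half */
        printf("exited with %d\n", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))    /* the __w_termsig half */
        printf("killed by signal %d\n", WTERMSIG(status));
    return 0;
}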
Andreas Kempe
2024-02-02 21:20:22 UTC
Permalink
Post by Scott Lurndal
Post by Andreas Kempe
I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?
The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).
unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;
It's just the PDP-11 unix 16-bit version with 16 unused padding bits.
Thank you for the clarification, but I don't think I have any problem
grasping how the implementation works. My thoughts are about why they did
what they did.

Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.

If we already have exit() and wait() using ints and want to stuff our
extra information in there without changing the API, it also makes
sense.
Post by Scott Lurndal
SVR4 added the waitid(2) system call which via the siginfo argument has
access to the full 32-bit program exit status.
Lawrence D'Oliveiro
2024-02-02 21:40:32 UTC
Permalink
Post by Andreas Kempe
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it makes
sense, but otherwise I don't understand from an API perspective why one
would use a data type with the caveat that only half is used.
The other half contains information like whether the low half is actually
an explicit exit code, or something else like a signal that killed the
process. Or an indication that the process has not actually terminated,
but is just stopped.
Keith Thompson
2024-02-03 02:17:52 UTC
Permalink
Post by Andreas Kempe
Post by Scott Lurndal
Post by Andreas Kempe
I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on PDP-11 with POSIX
opting to mask out the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?
The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).
unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;
It's just the PDP-11 unix 16-bit version with 16 unused padding bits.
Thank you for the clarification, but I don't think I have any problem
grasping how the implementation works. My thoughts are about why they did
what they did.
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).

In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined. Supporting exit values from 0 to
255 is fairly reasonable. Using an int to store that value is also
fairly reasonable -- especially since main() returns int, and exit(n) is
very nearly equivalent to return n in main().

Ignoring all but the low-order 8 bits is not specified by C. Non-POSIX
systems can use all 32 (or 16, or ...) bits of the return value.

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Andreas Kempe
2024-02-03 13:21:29 UTC
Permalink
Post by Keith Thompson
Post by Andreas Kempe
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).
I thought the reason for the int return type was to have an error code
outside of the range of the valid data, with EOF being defined as
being a negative integer. That reason isn't applicable to the
argument passed to exit() by a program.
Post by Keith Thompson
In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.
I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.
Post by Keith Thompson
Supporting exit values from 0 to 255 is fairly reasonable. Using an
int to store that value is also fairly reasonable -- especially
since main() returns int, and exit(n) is very nearly equivalent to
return n in main(). Ignoring all but the low-order 8 bits is not
specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
bits of the return value.
Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.

If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this

- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX

or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.

I can speculate about a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.
Janis Papanagnou
2024-02-03 15:38:39 UTC
Permalink
Post by Andreas Kempe
Post by Keith Thompson
Post by Andreas Kempe
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).
I thought the reason for the int return type was to have an error code
outside of the range of the valid data, with EOF being defined as
being a negative integer. That reason isn't applicable to the
argument passed to exit() by a program.
Post by Keith Thompson
In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.
I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.
Post by Keith Thompson
Supporting exit values from 0 to 255 is fairly reasonable. Using an
int to store that value is also fairly reasonable -- especially
since main() returns int, and exit(n) is very nearly equivalent to
return n in main(). Ignoring all but the low-order 8 bits is not
specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
bits of the return value.
Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.
If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this
- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX
or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.
I can speculate about a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.
AFAICT: "historical reasons". You have some bits to carry the exit
status, some bits to carry other termination information (signals),
and optionally some more bits to carry supplementary information.
If you want all that information carried in a single primitive
data type, you have to draw a line somewhere. Given that one cannot
assume more than 16 bits are guaranteed in the default 'int' type,
splitting at 8 bits seems quite obvious. (For practical purposes,
being able to differentiate 255 error codes seems more than enough,
if we consider what evaluating and individually acting on all of
them at the calling/environment level would mean.)

Janis
Scott Lurndal
2024-02-03 21:34:29 UTC
Permalink
Post by Andreas Kempe
Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.
The restriction predates both. It was how unix v6 worked; every
version of unix thereafter continued that so that existing applications
would not need to be rewritten.

It was documented in the SVID (System V Interface Definition) which
was part of the source materials used by X/Open when developing
the X Portability Guides (xpg) (which became the SuS).

Ken and Dennis chose to implement the wait system call (which
the shell uses to collect the exit status) with an 8-bit value
so they could use the other 8 bits of the 16-bit int for metadata.

This could never be changed without breaking applications, so
we still have it today in unix, linux and other POSIX-compliant
operating environments.
Keith Thompson
2024-02-03 23:29:08 UTC
Permalink
Post by Andreas Kempe
Post by Keith Thompson
Post by Andreas Kempe
Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.
C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).
I thought the reason for the int return type was to have an error code
outside of the range of the valid data, with EOF being defined as
being a negative integer. That reason isn't applicable to the
argument passed to exit() by a program.
I don't think there's one definitive reason for either decision.
Post by Andreas Kempe
Post by Keith Thompson
In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined.
I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.
The exit() function predates unsigned char (see K&R1). It probably even
predates char. (C's predecessor B was untyped, with characters being
stored in words which were effectively of type int. There was an exit()
function, but it apparently took no arguments.)

Changing exit()'s parameter type to reflect the range of valid values
undoubtedly wasn't considered worth doing -- especially since a wider
range of values might be valid on some systems.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Rainer Weikusat
2024-02-05 16:11:09 UTC
Permalink
[...]
Post by Andreas Kempe
If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this
- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX
or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.
I can speculate a bunch different reasons, but I'm curious if anyone
knows what the actual reasoning was.
This should be pretty obvious: A C int is really a machine data type in
disguise, namely, whatever fits into a common general purpose register
of a certain machine. C was created for porting UNIX to
the PDP-11 (or rather, rewriting UNIX for the PDP-11 with the goal of
not having to rewrite it again for the next type of machine that would need
to be supported by it). Putting a value into a certain register is a
common convention for returning values from functions (or rather, Dennis
Ritchie probably thought it would be a sensible convention at that
time). Hence, having main return an int was the 'natural' idea and
allocating the lower half of this int to applications wishing to return
status codes and the upper half to the system for returning
system-specific metadata was also the 'natural' idea.

Surely, eight whole bits must be enough for everyone! :-)
Andreas Kempe
2024-02-05 19:02:24 UTC
Permalink
Thank you everyone for the different informative replies and
historical insight! I think I have gotten what I can out of this
thread.
Lawrence D'Oliveiro
2024-02-03 21:37:55 UTC
Permalink
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.

Signed characters make no sense.
Joe Pfeiffer
2024-02-04 03:33:19 UTC
Permalink
Post by Lawrence D'Oliveiro
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.
Signed characters make no sense.
Except in architectures where they do. If you're doing something where
it matters (or even if you want your code to be more readable) use
signed char or unsigned char as appropriate.
Lawrence D'Oliveiro
2024-02-04 06:41:25 UTC
Permalink
Post by Joe Pfeiffer
Post by Lawrence D'Oliveiro
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed
did sign extension when loading a byte quantity into a (word-length)
register.
Signed characters make no sense.
Except in architectures where they do.
There are no character encodings which assign meanings to negative codes.
Scott Lurndal
2024-02-04 16:25:03 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Joe Pfeiffer
Post by Lawrence D'Oliveiro
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed
did sign extension when loading a byte quantity into a (word-length)
register.
Signed characters make no sense.
Except in architectures where they do.
There are no character encodings which assign meanings to negative codes.
But then 'signed char' doesn't necessarily need to be used
for character encoding (consider int8_t, for example, which
defines a signed arithmetic type from -128..+127).

On the 16-bit PDP-11, signed 8-bit values would not have been uncommon,
if only because of the limited address space.
Richard Kettlewell
2024-02-04 08:49:13 UTC
Permalink
Post by Joe Pfeiffer
Post by Lawrence D'Oliveiro
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.
Signed characters make no sense.
Except in architectures where they do.
Such as?
Post by Joe Pfeiffer
If you're doing something where it matters (or even if you want your
code to be more readable) use signed char or unsigned char as
appropriate.
Signed 8-bit integers are perfectly sensible, signed characters not so
much.
--
https://www.greenend.org.uk/rjk/
Kees Nuyt
2024-02-05 17:22:59 UTC
Permalink
On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer
Post by Lawrence D'Oliveiro
Signed characters make no sense.
Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.
Welcome to what was then called mini or midrange computers.

(Yes, looking at you, Harris, with its Vulcan Operating System)
--
Regards,
Kees Nuyt
Lawrence D'Oliveiro
2024-02-05 22:41:39 UTC
Permalink
Post by Lawrence D'Oliveiro
Signed characters make no sense.
3 characters in a 24 bit word.
I see your sixbit and raise you Radix-50, which packed 3 characters into a
16-bit word.

None of these used signed character codes, by the way. So my point still
stands.
Keith Thompson
2024-02-05 23:51:37 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Lawrence D'Oliveiro
Signed characters make no sense.
3 characters in a 24 bit word.
I see your sixbit and raise you Radix-50, which packed 3 characters into a
16-bit word.
None of these used signed character codes, by the way. So my point still
stands.
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that. I don't remember the details, but I'm sure it wouldn't be
difficult to find out.

At the time, making such code a little more efficient was worth the
effort -- and character data with the high-order bit set to 1 was rare,
so it didn't make much difference in practice.

I don't know whether there are efficiency issues on modern platforms. If
modern CPUs have similar characteristics to the PDP-11, that could
impose some pressure to keep signed characters. And the representation
requirements for the character types (especially with C23 requiring
2's-complement) mean that signed characters don't cause many practical
problems.

Since C code has always had to work correctly if plain char is signed,
there wasn't much pressure to make it unsigned (though some platforms do
so).

I'd be happy if some future C standard mandated that plain char is
unsigned, just because I think it would make more sense, but I don't
think that's likely to happen. But the historical reasons for allowing
plain char to be signed are valid.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Scott Lurndal
2024-02-06 00:16:56 UTC
Permalink
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Post by Lawrence D'Oliveiro
Signed characters make no sense.
3 characters in a 24 bit word.
I see your sixbit and raise you Radix-50, which packed 3 characters into a
16-bit word.
None of these used signed character codes, by the way. So my point still
stands.
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that. I don't remember the details, but I'm sure it wouldn't be
difficult to find out.
The PDP-11 had two move instructions:

MOV (r1)+,r2
MOVB (r2)+,r3

MOV moved source to destination. MOVB always sign-extended the byte
to the destination register size (16-bit).
Lawrence D'Oliveiro
2024-02-06 00:58:31 UTC
Permalink
Post by Keith Thompson
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that.
The move-byte instruction did sign-extension when loading into a register,
not storing into memory.

There was no convert-byte-to-word instruction as such.
Keith Thompson
2024-02-06 02:31:36 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Keith Thompson
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that.
The move-byte instruction did sign-extension when loading into a register,
not storing into memory.
There was no convert-byte-to-word instruction as such.
Right, so if you wanted to copy an 8-bit value into a 16-bit register
with sign-extension, you do it in one instruction, whereas zeroing the
top 8 bits would require at least one additional instruction, probably a
BIC (bit-clear) following the MOVB. You'd probably need more
instruction space to store the mask value of 0xff00 -- pardon me,
0177400. And I expect that copying a character into a register would
have been a common operation.

Given those constraints, I'd say it made sense *at the time* for char to
be signed on the PDP-11, especially since it was pretty much assumed
that text would be plain ASCII that would never have the high bit set.

If the PDP-11 had had an alternative MOVB instruction that did
zero-extension, we might not be having this discussion.

Question: Do any more modern CPUs have similar characteristics that make
either signed or unsigned char more efficient?
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Lawrence D'Oliveiro
2024-02-06 03:10:52 UTC
Permalink
Post by Keith Thompson
If the PDP-11 had had an alternative MOVB instruction that did
zero-extension, we might not be having this discussion.
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.

Signed characters make no sense.
Keith Thompson
2024-02-06 04:00:33 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Keith Thompson
If the PDP-11 had had an alternative MOVB instruction that did
zero-extension, we might not be having this discussion.
The signedness of plain char is implementation-defined.
Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Richard Kettlewell
2024-02-06 17:00:25 UTC
Permalink
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
--
https://www.greenend.org.uk/rjk/
Rainer Weikusat
2024-02-06 17:35:01 UTC
Permalink
Post by Richard Kettlewell
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Kaz Kylheku
2024-02-06 18:04:16 UTC
Permalink
Post by Rainer Weikusat
Post by Richard Kettlewell
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Sure, except for the part where "abcd" denotes an object that is a
null-terminated array of these *char* integers, that entity being formally
called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).

If *char* is signed (and CHAR_BIT is 8), then '\xff` produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.

This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.

The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.

All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.
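
A minimal illustration of the <ctype.h> pitfall, assuming CHAR_BIT is
8 and plain char is signed (the usual x86 Linux/BSD configuration):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char s[] = "caf\xe9";   /* 0xe9: 'e' with acute accent in Latin-1 */

    /* If char is signed, s[3] is -23; passing a value that is neither
       EOF nor representable as unsigned char to isalpha() is undefined
       behaviour, so the obvious isalpha(s[3]) is broken. */

    /* The portable idiom: convert through unsigned char first. */
    printf("%d\n", isalpha((unsigned char)s[3]));
    return 0;
}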

Speaking of synonyms, *char* is a distinct type, and not a synonym for either
*signed char* or *unsigned char*. It has to be that way, given the way it is
defined, but it's just another complication that need not have existed:

#include <stdio.h>

int main(void)
{
char *cp = 0;
unsigned char *ucp = 0;
signed char *scp = 0;
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
printf("%d\n", '\xff');
}

char.c: In function ‘main’:
char.c:8:27: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
^~
char.c:8:38: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
^~
char.c:8:50: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Rainer Weikusat
2024-02-06 18:30:46 UTC
Permalink
[...]
Post by Kaz Kylheku
Post by Rainer Weikusat
Post by Richard Kettlewell
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Sure, except for the part where "abcd" denotes an object that is a
null-terminated array of these *char* integers, that entity being formally
called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).
If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.
This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.
The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.
All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.
All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types. It further supported declaring pointers to some type and
pointers were basically unsigned integer indices into a linear memory
array. Char couldn't have been an unsigned integer type, regardless of
whether this would have made more sense², because unsigned integer types didn't
exist in the language.

¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits. Had they been avoided,
human ingenuity would have found something else to fuck up.

² Being wise in hindsight is always easy. But that's not an option for
people who need to create something which doesn't yet exist and not be
wisely critical of something that does.
Kaz Kylheku
2024-02-06 18:38:06 UTC
Permalink
Post by Rainer Weikusat
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

A fractured skull reveals a human trait (accident proneness, weak bone)
rather than the workplace trait of not enforcing helmet use.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Rainer Weikusat
2024-02-06 19:02:00 UTC
Permalink
Post by Kaz Kylheku
Post by Rainer Weikusat
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
I wrote about C types and somewhat more generally, programming language
features, and not "safety devices" supposed to protect human bodies from
physical injury.
Kaz Kylheku
2024-02-06 21:22:57 UTC
Permalink
Post by Rainer Weikusat
Post by Kaz Kylheku
Post by Rainer Weikusat
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
I wrote about C types and somewhat more generally, programming language
features, and not "safety devices" supposed to protect human bodies from
physical injury.
Type systems are safety devices. That's why we have terms like "type
safe" and "unsafe code".

Type safety helps prevent misbehavior, which results in problems like
incorrect results and data loss, which can have real economic harm.

In a safety-critical embedded system, a connection between type safety
and physical safety is readily identifiable.

"Type safety" it's not just some fanciful metaphor like "debugging";
there is a literal interpretation which is true.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Rainer Weikusat
2024-02-06 21:37:50 UTC
Permalink
Post by Kaz Kylheku
Post by Rainer Weikusat
Post by Kaz Kylheku
Post by Rainer Weikusat
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits.
Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
I wrote about C types and somewhat more generally, programming language
features, and not "safety devices" supposed to protect human bodies from
physical injury.
Type systems are safety devices. That's why we have terms like "type
safe" and "unsafe code".
They're not, at least not when "safety device" is supposed to mean
something like a hard hat. That's just an inappropriate analogy some
people like to employ. This is, however, completely beside the point
of my original text, which was about providing an explanation for why
char is signed in C, even though all kinds of smart alecs with fifty
years of hindsight Ritchie didn't have in 1972 are extremely convinced
that it was an extremely bad idea.
Lew Pitcher
2024-02-06 19:25:27 UTC
Permalink
Post by Rainer Weikusat
[...]
Post by Kaz Kylheku
Post by Rainer Weikusat
Post by Richard Kettlewell
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Sure, except for the part where "abcd" denotes an object that is a
null-terminated array of these *char* integers, that entity being formally
called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).
If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.
This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.
The <ctype.h> functions require an argument that is either EOF
or a value in the range of 0 to UCHAR_MAX, and so are incompatible
with string elements.
All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.
All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types.
This view ignores the early implementation of (K&R) C on IBM 370 systems,
where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric
characters have their high bit set (alphabetics range from 0x80 through
0xe9, while numerics range from 0xf0 through 0xf9). A char in this
implementation, by necessity, was unsigned, as C "guarantees that any
character in the machine's standard character set will never be negative"
(K&R "The C Programming Language", p40)
Post by Rainer Weikusat
It further supported declaring pointers to some type and
pointers were basically unsigned integer indices into a linear memory
array. Char couldn't have been an unsigned integer type, regardless if
this would have made more sense², because unsigned integer types didn't
exist in the language.
¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits. Had they been avoided,
human ingenuity would have found something else to fuck up.
² Being wise in hindsight is always easy. But that's not an option for
people who need to create something which doesn't yet exist and not be
wisely critical of something that does.
--
Lew Pitcher
"In Skills We Trust"
Rainer Weikusat
2024-02-06 20:01:43 UTC
Permalink
[Why-oh-why is char not unsigned?!?]
Post by Lew Pitcher
Post by Rainer Weikusat
All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types.
This view ignores the early implementation of (K&R) C on IBM 370 systems,
where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric
characters have their high bit set (alphabetics range from 0x80 through
0xe9, while numerics range from 0xf0 through 0xf9).
Indeed. It refers to the C language as it existed / was created when UNIX
was brought over to the PDP-11. This language didn't have any unsigned
integer types as the concept didn't yet exist.
Keith Thompson
2024-02-06 19:15:13 UTC
Permalink
Post by Rainer Weikusat
Post by Richard Kettlewell
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointer, were a later addition.
Here's a quote from the 1974 and 1975 C reference manuals:

A char object may be used anywhere an int may be. In all cases the
char is converted to an int by propagating its sign through the
upper 8 bits of the resultant integer. This is consistent with the
two’s complement representation used for both characters and
integers. (However, the sign-propagation feature disappears in other
implementations.)

In more modern terms, that last sentence suggests that plain char was
unsigned in some implementations.

K&R1, 1978, is more explicit:

There is one subtle point about the conversion of characters
to integers. The language does not specify whether variables
of type char are signed or unsigned quantities. When a
char is converted to an int, can it ever produce a negative
integer? Unfortunately, this varies from machine to machine,
reflecting differences in architecture. On some machines
(PDP-11, for instance), a char whose leftmost bit is 1 will be
converted to a negative integer ("sign extension"). On others,
a char is promoted to an int by adding zeros at the left end,
and thus is always positive.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Keith Thompson
2024-02-06 19:08:45 UTC
Permalink
Post by Richard Kettlewell
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for the PDP-11 implementation, purely because of
performance issues.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
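
Something like this is enough to see it (a sketch, not my exact test;
the file name is made up). Compile with "cc -O2 -S" and read the
generated assembly:

/* char_widen.c */
int widen_signed(signed char c)     { return c; }   /* movsbl on x86-64 */
int widen_unsigned(unsigned char c) { return c; }   /* movzbl on x86-64 */
int widen_plain(char c)             { return c; }   /* whichever plain char is on the target */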
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Andreas Kempe
2024-02-06 23:13:21 UTC
Permalink
Post by Keith Thompson
Post by Richard Kettlewell
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

The bench was done by moving a byte from the stack to eax using a loop
of 10 movzbl/movsbl running 10M times. Both instructions gave on
average about 0.7 cycles per instruction measured using rdtsc. The
highest bit in the byte being set or unset made no difference.
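
For anyone who wants to poke at this without writing an asm harness,
here's a rough C-level sketch of the same kind of measurement (not the
exact loop I ran; __rdtsc() comes from <x86intrin.h> on gcc/clang, and
the volatile qualifiers keep the conversion from being optimised away):

#include <stdio.h>
#include <x86intrin.h>

#define N 10000000L

volatile signed char byte = 0x40;   /* swap in unsigned char for movzbl */
volatile int sink;

int main(void)
{
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; i++)
        sink = byte;                /* a movsbl load on x86-64 */
    unsigned long long t1 = __rdtsc();
    printf("%.2f TSC ticks per conversion\n", (double)(t1 - t0) / (double)N);
    return 0;
}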
Scott Lurndal
2024-02-06 23:27:23 UTC
Permalink
Post by Andreas Kempe
Post by Keith Thompson
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Andreas Kempe
2024-02-07 00:26:08 UTC
Permalink
Post by Scott Lurndal
Post by Andreas Kempe
Post by Keith Thompson
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.

Funnily enough, the zero extend was the more performant in these
tests, making unsigned char possibly the faster choice.

My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?
Scott Lurndal
2024-02-07 00:46:17 UTC
Permalink
Post by Andreas Kempe
Post by Scott Lurndal
Post by Andreas Kempe
Post by Keith Thompson
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.
The logic for sign extension (MOVSX) isn't complex; the added gate delay
wouldn't affect the instruction timing. Fan the sign bit out
to the higher bits through a couple of gates to either select the
sign bit or the high order bits when storing into the new register.

Sign extension on load (MOV from memory) will happen in the load unit before
it hits the register file, most likely.

The x86 MOVBE instruction is a slightly more complex example.
Post by Andreas Kempe
Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.
Within what margin of measurement error?
Post by Andreas Kempe
My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?
We use uint8_t extensively because the data is unsigned in the range 0-255.

And generally want wrapping behavior modulo 2^8.
Andreas Kempe
2024-02-07 02:11:26 UTC
Permalink
Post by Scott Lurndal
Post by Andreas Kempe
Post by Scott Lurndal
Post by Andreas Kempe
I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.
A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.
Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.
The logic for sign extension (MOVSX) isn't complex, the added gate delay
wouldn't affect the instruction timing. Fan the sign bit out
to the higher bits through a couple of gates to either select the
sign bit or the high order bits when storing into the new register.
Sign extension on load (MOV from memory) will happen in the load unit before
it hits the register file, most likely.
The x86 MOVBE instruction is a slightly more complex example.
Post by Andreas Kempe
Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.
Within what margin of measurement error?
Here's an example of a test I played around with. The body of my loop
does this 10M times for this test. movzbl is switched for movsbl when
testing the other configuration.

movzbl -24(%rsp), %eax
movb %al, -25(%rsp)
movzbl -25(%rsp), %eax
movb %al, -26(%rsp)
movzbl -26(%rsp), %eax
movb %al, -27(%rsp)
movzbl -27(%rsp), %eax
movb %al, -28(%rsp)
movzbl -28(%rsp), %eax
incl %eax
movb %al, -24(%rsp)

This is the data, unit is total cycles for a run, from 2000 runs of
10M each for the two different instructions:

movzbl:
mean = 1.24E+08
variance = 3.95E+12

movsbl:
mean = 1.38E+08
variance = 3.44E+12

ratio movsbl/movzbl = 1.11

Performing a two-tailed Student's t-test gives

p-value: 0.00E+00

Something is causing these two test runs to give different performance
results. I will not pretend I know enough about the inner workings of
Intel's magic box to explain why.
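
For reference, here is a rough C harness in the same spirit (the kernel
and iteration count are illustrative, not the exact benchmark above; the
idea is to let the compiler choose movzbl or movsbl through the
signedness of the byte type, with volatile keeping the byte accesses
from being optimised away). Checking the generated assembly with -S is
the only way to be sure which instructions you actually got:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERS 10000000u

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Reading a volatile byte and widening it forces a zero extension;
 * on x86-64 this is typically a movzbl. */
static unsigned kernel_unsigned(void)
{
    volatile unsigned char b = 0;
    unsigned acc = 0;
    for (unsigned i = 0; i < ITERS; i++) {
        acc += b;
        b = (unsigned char)(b + 1);
    }
    return acc;
}

/* Same loop with a signed byte; the widening load is typically a movsbl. */
static unsigned kernel_signed(void)
{
    volatile signed char b = 0;
    unsigned acc = 0;
    for (unsigned i = 0; i < ITERS; i++) {
        acc += b;
        b = (signed char)(b + 1);
    }
    return acc;
}

int main(void)
{
    uint64_t t0 = now_ns();
    unsigned r1 = kernel_unsigned();
    uint64_t t1 = now_ns();
    unsigned r2 = kernel_signed();
    uint64_t t2 = now_ns();

    printf("unsigned char: %llu ns (acc %u)\n",
           (unsigned long long)(t1 - t0), r1);
    printf("signed char:   %llu ns (acc %u)\n",
           (unsigned long long)(t2 - t1), r2);
    return 0;
}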
Post by Scott Lurndal
Post by Andreas Kempe
My intention wasn't really to claim they're exactly the same, but that
I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % difference is real, I
wonder how much software is actually using char types in a way where
it would make a difference?
We use uint8_t extensively because the data is unsigned in the range 0-255.
And generally want wrapping behavior modulo 2^8.
Sure, but if you are using uint8_t, you have sidestepped the whole
issues of char being signed or unsigned so a change wouldn't really
affect you.
Scott Lurndal
2024-02-07 15:22:04 UTC
Permalink
Post by Andreas Kempe
Post by Scott Lurndal
Post by Andreas Kempe
Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.
Within what margin of measurement error?
Here's an example of a test I played around with. The body of my loop
does this 10M times for this test. movzbl is switched for movsbl when
testing the other configuration.
movzbl -24(%rsp), %eax
movb %al, -25(%rsp)
movzbl -25(%rsp), %eax
movb %al, -26(%rsp)
movzbl -26(%rsp), %eax
movb %al, -27(%rsp)
movzbl -27(%rsp), %eax
movb %al, -28(%rsp)
movzbl -28(%rsp), %eax
incl %eax
movb %al, -24(%rsp)
This is the data, unit is total cycles for a run, from 2000 runs of
10M each for the two different instructions:
movzbl:
mean = 1.24E+08
variance = 3.95E+12
movsbl:
mean = 1.38E+08
variance = 3.44E+12
ratio movsbl/movzbl = 1.11
Very interesting. I don't know why that is.
Post by Andreas Kempe
Post by Scott Lurndal
We use uint8_t extensively because the data is unsigned in the range 0-255.
And generally want wrapping behavior modulo 2^8.
Sure, but if you are using uint8_t, you have sidestepped the whole
issues of char being signed or unsigned so a change wouldn't really
affect you.
While most C compilers have a compile-time option to select the signedness of
char, using uint8_t sidesteps the issue completely.
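
A small standalone sketch of that point: the plain char line depends on
the implementation (and on -fsigned-char/-funsigned-char), while the
uint8_t lines come out the same everywhere:

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
    char c = (char)0xFF;   /* -1 or 255, depending on the implementation */
    uint8_t u = 0xFF;      /* always 255 */

    printf("plain char 0xFF as int: %d (CHAR_MIN = %d)\n", (int)c, CHAR_MIN);
    printf("uint8_t 0xFF as int:    %d\n", (int)u);

    u = (uint8_t)(u + 1);  /* wraps modulo 2^8: 255 + 1 == 0 */
    printf("uint8_t 255 + 1:        %d\n", (int)u);
    return 0;
}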
Richard Kettlewell
2024-02-07 10:29:58 UTC
Permalink
Post by Keith Thompson
Post by Richard Kettlewell
Post by Keith Thompson
Post by Lawrence D'Oliveiro
Signed characters make no sense.
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)
I still don’t see any explanation for signed characters as such making
sense.
I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
Having a basic 8-bit integer type be a signed type makes sense (in
context) for performance reasons and perhaps for usability reasons too.

But that’s really not the same as “signed characters make sense”. For
signed characters to make sense there has to be an encoding where some
characters (or control codes, etc.) are encoded as negative values. I’ve
never heard of one.

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values. If the purpose was
purely the latter it would have been called ‘short short int’ or
something like that.
Post by Keith Thompson
I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
i.e. we’re still suffering the locked-in side-effects of an ancient
decision even though the original justification has become irrelevant.
It might or might not have been a reasonable trade-off at the time,
disregarding what were then hypotheticals about the future, but (indeed
with hindsight) I think from today’s point of view it was clearly the
wrong decision.
--
https://www.greenend.org.uk/rjk/
Rainer Weikusat
2024-02-07 15:30:23 UTC
Permalink
[...]
Post by Richard Kettlewell
Post by Keith Thompson
I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.
Having a basic 8-bit integer type be a signed type makes sense (in
context) for performance reasons and perhaps for usability reasons too.
But that’s really not the same as “signed characters make sense”. For
signed characters to make sense there has to be an encoding where some
characters (or control codes, etc.) are encoded as negative values. I’ve
never heard of one.
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Richard Kettlewell
2024-02-07 20:20:12 UTC
Permalink
Post by Rainer Weikusat
Post by Richard Kettlewell
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
--
https://www.greenend.org.uk/rjk/
Lawrence D'Oliveiro
2024-02-07 20:58:01 UTC
Permalink
Post by Richard Kettlewell
Language designers do, however, have an idea of “characters”.
Unicode uses the terms “grapheme” and “text element”. Actually it also
uses “character”, but it seems less clear on what that means. It is not
the same as a “code point” or “glyph”.

<https://www.unicode.org/faq/char_combmark.html>
Richard Kettlewell
2024-02-08 11:21:56 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Richard Kettlewell
Language designers do, however, have an idea of “characters”.
Unicode uses the terms “grapheme” and “text element”. Actually it also
uses “character”, but it seems less clear on what that means. It is not
the same as a “code point” or “glyph”.
<https://www.unicode.org/faq/char_combmark.html>
Sure, but this was all happening in the 1970s, long before Unicode
existed.

K&R1 explicitly says char is “capable of holding one character in the
local character set” (and mentions EBCDIC as a concrete example on the
same page - the problem must have been obvious already).
--
https://www.greenend.org.uk/rjk/
Rainer Weikusat
2024-02-08 16:34:10 UTC
Permalink
Post by Richard Kettlewell
Post by Rainer Weikusat
Post by Richard Kettlewell
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
I don't quite understand what that's supposed to communicate. As far as
the machine is concerned, a character is nothing but an integer, and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.
Keith Thompson
2024-02-08 16:53:33 UTC
Permalink
Post by Rainer Weikusat
Post by Richard Kettlewell
Post by Rainer Weikusat
Post by Richard Kettlewell
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
I don't quite understand what that's supposed to communicate. As far as
the machine is concerned, a character is nothing but an integer, and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.
Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
char was effectively unsigned in some implementations, in that
converting a char value to int would zero-fill the result rather than
doing sign-extension.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Rainer Weikusat
2024-02-08 17:46:20 UTC
Permalink
Post by Keith Thompson
Post by Rainer Weikusat
Post by Richard Kettlewell
Post by Rainer Weikusat
Post by Richard Kettlewell
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
I don't quite understand what that's supposed to communicate. As far as
the machine is concerned, a character is nothing but an integer, and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.
Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
char was effectively unsigned in some implementations, in that
converting a char value to int would zero-fill the result rather than
doing sign-extension.
According to Ritchie's "The Development of the C Language"

,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----
Keith Thompson
2024-02-08 18:23:29 UTC
Permalink
Post by Rainer Weikusat
Post by Keith Thompson
Post by Rainer Weikusat
Post by Richard Kettlewell
Post by Rainer Weikusat
Post by Richard Kettlewell
“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values.
Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.
Language designers do, however, have an idea of “characters”.
I don't quite understand what that's supposed to communicate. As far as
the machine is concerned, a character is nothing but an integer, and a
data type sufficient to hold a character is thus necessarily an integer
type of some size. In a language without unsigned integer types, it'll
necessarily also be a signed integer type.
Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
char was effectively unsigned in some implementations, in that
converting a char value to int would zero-fill the result rather than
doing sign-extension.
According to Ritchie's "The Development of the C Language"
,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----
Right. K&R1 (1978) had "unsigned", but only for unsigned int. Still,
the signedness of char was effectively implementation-defined, though it
wasn't stated in those terms.

From K&R1:

A character or a short integer may be used wherever an
integer may be used. In all cases the value is converted to
an integer. Conversion of a shorter integer to a longer always
involves sign extension; integers are signed quantities. Whether
or not sign-extension occurs for characters is machine dependent,
but it is guaranteed that a member of the standard character
set is non-negative. Of the machines treated by this manual,
only the PDP-11 sign-extends. On the PDP-11, character variables
range in value from -128 to 127; the characters of the ASCII
alphabet are all positive. A character constant specified with
an octal escape suffers sign extension and may appear negative;
for example, '\377' has the value -1.

The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".

signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
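
A minimal way to make that implementation-defined choice visible on a
given system, using nothing but <limits.h> and the '\377' example quoted
above:

#include <stdio.h>
#include <limits.h>

int main(void)
{
#if CHAR_MIN < 0
    puts("plain char is signed here");
#else
    puts("plain char is unsigned here");
#endif
    /* Typically prints -1 where plain char is signed, 255 where it is
     * unsigned. */
    printf("'\\377' converted to int: %d\n", '\377');
    return 0;
}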
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Lawrence D'Oliveiro
2024-02-08 21:57:57 UTC
Permalink
Post by Keith Thompson
The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".
signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is implementation-
defined; newer C specs say that left-shifting a negative value is simply
“undefined”, and right-shifting a negative value is “implementation-
defined”.
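
A small sketch of those rules and of the usual workaround, which is to
do the shifting in an unsigned type:

#include <stdio.h>

int main(void)
{
    int n = -8;

    /* Implementation-defined: commonly -4 (sign bits shifted in). */
    printf("-8 >> 1 = %d\n", n >> 1);

    /* n << 1 would be undefined behaviour for negative n in current C,
     * so shift the bit pattern as unsigned instead. */
    unsigned int bits = (unsigned int)n;
    printf("bits << 1 = 0x%x\n", bits << 1);

    return 0;
}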
Scott Lurndal
2024-02-08 22:30:52 UTC
Permalink
Post by Keith Thompson
The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".
signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is implementation-
defined; newer C specs say that left-shifting a negative value is simply
“undefined”, and right-shifting a negative value is “implementation-
defined”.
There were extant hardware implementations exhibiting both behaviors. So they
made the behavior implementation-defined in the compiler.
Lawrence D'Oliveiro
2024-02-08 23:26:55 UTC
Permalink
Post by Scott Lurndal
Post by Lawrence D'Oliveiro
Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is implementation-
defined; newer C specs say that left-shifting a negative value is simply
“undefined”, and right-shifting a negative value is “implementation-
defined”.
There were extant hardware implementations exhibiting both behaviors.
Except the current spec doesn’t mention a choice between two behaviours.
Keith Thompson
2024-02-08 22:31:01 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Keith Thompson
The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".
signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is implementation-
defined; newer C specs say that left-shifting a negative value is simply
“undefined”, and right-shifting a negative value is “implementation-
defined”.
True. On the other hand, there are no shift operations on character
types (or short or unsigned short). The integer promotions are
performed on both operands of "<<" or ">>", so the value that's shifted
is at least as wide as int or unsigned int.
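
A quick way to see those promotions in action: the result of shifting a
char already has the width of int:

#include <stdio.h>

int main(void)
{
    signed char c = 1;

    printf("sizeof c        = %zu\n", sizeof c);         /* 1 */
    printf("sizeof (c << 1) = %zu\n", sizeof (c << 1));  /* sizeof(int) */
    return 0;
}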
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Kaz Kylheku
2024-02-08 19:54:12 UTC
Permalink
Post by Rainer Weikusat
According to Ritchie's "The Development of the C Language"
,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----
It seems like a very odd rationale, given how things played out.

The difference between two pointers ended up signed (ptrdiff_t).
So pointer arithmetic doesn't work exactly like unsigned. That's mostly
a good thing, except that pointers farther from each other than half the
address space cannot be subtracted. (ISO C mostly takes that away anyway
since pointers to different objects may only be compared for exact
equality, and cannot be subtracted. If no object is half the address
space or larger, subtraction overflow will never occur.)
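
For illustration, a minimal example of that subtraction; both pointers
have to point into (or one past) the same object for it to be defined at
all:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    int a[10];
    int *p = &a[2];
    int *q = &a[7];

    ptrdiff_t fwd = q - p;   /*  5 */
    ptrdiff_t back = p - q;  /* -5, hence the signed type */

    printf("q - p = %td, p - q = %td\n", fwd, back);
    return 0;
}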

Moreover, unsigned ended up necessary for representing a simple byte
in a nice way.

Not only that, but unsigned types are useful for bit manipulation,
without running into nonportable behaviors around shifting into and out
of the sign bit.

If you have a 32 bit int and want a 32 bit field, you want unsigned int.

Very odd to see the existence of unsigned math justified in terms of
some story about pointers.

It seems Ritchie really didn't think much about portability; he
probabably thought it was fine to do 1 << 15 with a 16 bit signed int
to calculate a mask for the highest bit, since that happened to work in
the systems he designed. If someone wanted C on their weird machine
where that misbehaves, or produces an alternative zero that compares
equal to regular zero, that was their problem.
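
A short sketch of the mask case: forming the top bit of a 32-bit field
is well defined with an unsigned type, while the signed equivalent
shifts into the sign bit:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint32_t high = (uint32_t)1 << 31;   /* well defined: 0x80000000 */
    /* int bad = 1 << 31;    undefined behaviour when int is 32 bits */

    printf("high bit mask: 0x%08" PRIx32 "\n", high);
    return 0;
}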
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Lawrence D'Oliveiro
2024-02-02 21:13:41 UTC
Permalink
I'm wondering why, at least on Linux and FreeBSD, a process exit status
was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
I’ve never used that many different values. E.g. 0 for some test condition
passed, 1 for failed, 2 for unexpected error.
This did bite some colleagues at work at one point who were porting a
modem manager from a real-time OS to Linux because they were returning
negative status codes for errors.
True enough:

***@theon:~> python3 -c "import sys; sys.exit(1)"; echo $?
1
***@theon:~> python3 -c "import sys; sys.exit(-1)"; echo $?
255

But you could always sign-extend it.
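
A sketch of what that sign extension could look like on the parent side
(the cast of values above 127 back to int8_t relies on the usual
two's-complement wraparound, which is what every platform discussed
here does):

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return 1;
    if (pid == 0)
        _exit(-1);                  /* parent sees the low 8 bits: 255 */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return 1;

    if (WIFEXITED(status)) {
        int raw = WEXITSTATUS(status);   /* 0..255 */
        int back = (int8_t)raw;          /* reinterpret as a signed byte */
        printf("raw = %d, sign-extended = %d\n", raw, back);
    }
    return 0;
}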
Keith Thompson
2024-02-02 21:23:52 UTC
Permalink
Post by Lawrence D'Oliveiro
I'm wondering why, at least on Linux and FreeBSD, a process exit status
was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().
I’ve never used that many different values. E.g. 0 for some test condition
passed, 1 for failed, 2 for unexpected error.
The curl command defines nearly 100 error codes ("man curl" for
details). That's the most I've seen. 8 bits is almost certainly
plenty if the goal is to enumerate specific error conditions.
It's not enough if you want to pass more information through the
error code, which is why most programs don't try to do that.

Since int is typically 32 bits (but only guaranteed by C to be at
least 16), the exit() function could theoretically be used to pass
32 bits of information, but that's not really much more useful than 8
bits. If a program needs to return more than 8 bits of information,
it will typically print a string to stdout or something similar.

(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */
Lawrence D'Oliveiro
2024-02-02 21:38:54 UTC
Permalink
Post by Keith Thompson
The curl command defines nearly 100 error codes ("man curl" for
details). That's the most I've seen.
Another reason for staying away from curl, I would say. It needlessly
replicates the functionality of a whole lot of different protocol clients,
when all you need is HTTP/HTTPS (maybe FTP/FTPS as well). That’s why I
stick to wget.
Post by Keith Thompson
(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)
What, not a JSON object?
Rainer Weikusat
2024-02-05 16:12:52 UTC
Permalink
Keith Thompson <Keith.S.Thompson+***@gmail.com> writes:

[...]
Post by Keith Thompson
(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)
That's interesting to know, as I have been using the same convention for
validation functions in Perl for some years: these return nothing when
everything is OK, or a textual error message otherwise.