2021-04-12

Don't use fixed-width integers to detect a type's size

As a compiler developer, I see a lot of code that deals with data types, and many different ways of using them.

A problem that I have encountered a few times, rare as it is, is the use of fixed-width integer types to parameterize code on the size of types. When writing compilers or other programs that need to operate on data types directly, you often need code that behaves differently for different static types. For example, a serialization framework might need to change behaviour depending on the size of each integer type, or a compiler might encode the size of a type in a data structure like its Intermediate Language. One attempt to implement such behaviour is to overload a function on the fixed-width integer types, with each overload handling a specific size. However, a few subtleties in the C++ type system make this approach problematic for code maintainability and portability.

In this post, I explain the different problems that can happen when using fixed-width integer types to overload functions and provide alternative approaches for achieving the intended goal.

The naive approach

Let’s consider a simple function, foo(), with four overload definitions, one for each signed integer bit-width:

#include <cstdint>

void foo(int8_t) {}
void foo(int16_t) {}
void foo(int32_t) {}
void foo(int64_t) {}

Naively, one might think that foo() is callable with any integer type. This is, however, not the case.

Trying to compile this simple example:

int main() {
    uint32_t x = 42;
    foo(x);
}

we quickly get errors about an ambiguous call:

error: call to 'foo' is ambiguous
    foo(x);
    ^~~
<source>:4:6: note: candidate function
void foo(int8_t) {}
     ^
<source>:5:6: note: candidate function
void foo(int16_t) {}
     ^
<source>:6:6: note: candidate function
void foo(int32_t) {}
     ^
<source>:7:6: note: candidate function
void foo(int64_t) {}
     ^
1 error generated.

The compiler even helpfully lists all the functions we’ve defined as possible overload candidates.

The call is ambiguous because implicit conversions do not impose an ordering on overload resolution: all of the candidate conversions have the same rank, so none of them is preferred. Since we are passing a uint32_t to foo(), and foo() only has overloads for signed types, the compiler must apply an implicit conversion. Unfortunately for us, as far as the compiler is concerned, int8_t and int64_t (and every int*_t in between) are equally appropriate targets for an implicit conversion from uint32_t. The compiler doesn’t consider the fact that int32_t and uint32_t have the same bit-width when deciding which overload to pick for the call.

Note that changing all the overloads to use uint*_t instead won’t help because that will just invert the problem.

The wrong solution

To work around the problem, one might be tempted to define overloads for both signed and unsigned types:

#include <cstdint>

void foo(int8_t) {}
void foo(int16_t) {}
void foo(int32_t) {}
void foo(int64_t) {}
void foo(uint8_t) {}
void foo(uint16_t) {}
void foo(uint32_t) {}
void foo(uint64_t) {}

However, besides possible issues with code duplication, there are still portability problems with this approach.

Let’s consider this example now:

int main() {
    unsigned long x = 42;
    foo(x);
}

At first glance, the code seems like it should work fine. And it will… sometimes.

If you compile the code with clang, everything is fine. With gcc, everything is still OK. But with MSVC, you get an error like this:

error C2668: 'foo': ambiguous call to overloaded function

No, this is not an MSVC bug -_-

Let’s try changing the unsigned long to an unsigned long long:

int main() {
    unsigned long long x = 42;
    foo(x);
}

Now MSVC accepts the code, but gcc and clang say the call is ambiguous!

So, what’s going on?

The problem

If you’re using fixed-width integer types, you probably already know that C++ doesn’t guarantee the bit-width of the primitive integer types. For example, long is 64 bits on some systems but 32 bits on others.

A subtle implication is that multiple primitive integer types can have the same bit-width. For example, with gcc and clang on x86-64, long and long long are both 64 bits. With MSVC (also on x86-64), int and long are both 32 bits.

Now, because fixed-width integer types are just typedefs for some primitive type, one of the two types will not have an associated fixed-width typedef. For gcc and clang, int64_t can only map to one of long or long long. Similarly, for MSVC, int32_t can only map to one of int or long.

As a result, calling foo() with whichever type is not covered by a typedef will require an implicit conversion. And, as we previously saw, implicit conversions are not prioritized, so we get ambiguity.
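You can check which primitive type a fixed-width typedef maps to on your platform with std::is_same. A small sketch (the platform examples in the comment follow the gcc/clang and MSVC behaviour described above):

```cpp
#include <cstdint>
#include <type_traits>

// int64_t is a typedef for exactly one of long or long long; the other one
// is the "leftover" 64-bit type that needs an implicit conversion. Which is
// which is platform-dependent: gcc/clang on x86-64 Linux pick long, while
// MSVC picks long long.
constexpr bool i64_is_long      = std::is_same_v<int64_t, long>;
constexpr bool i64_is_long_long = std::is_same_v<int64_t, long long>;

static_assert(i64_is_long != i64_is_long_long,
              "int64_t maps to exactly one of long and long long");
```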

The special char problem

The type char has an additional, subtle oddity that breaks the code. For other integer types like int, signed is implied when not explicitly specified (i.e. int is the same as signed int). However, C++ requires that char, signed char, and unsigned char all be distinct types. Also, char can be either signed or unsigned. It’s up to the compiler to decide which it will be.

As a result, the exact behaviour of, for example, foo('a') will depend on the compiler you use. It might call foo(int8_t) or foo(uint8_t)… or something else. Using clang as an example, int8_t is a typedef for signed char and uint8_t is a typedef for unsigned char, so foo('a') does not exactly match any of the overloads we have defined. However, 'a', which is of type char, is subject to integer promotion, which is preferred over other implicit conversions during overload resolution. So, because clang also happens to define int32_t as a typedef for int, foo('a') will actually call foo(int), which is foo(int32_t).
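The distinctness of the three char types is easy to verify at compile time; this stands in contrast to int, which really is the same type as signed int:

```cpp
#include <type_traits>

// char is a distinct type from both signed char and unsigned char, even
// though its representation must match one of them. Whether char is signed
// is implementation-defined.
static_assert(!std::is_same_v<char, signed char>, "char != signed char");
static_assert(!std::is_same_v<char, unsigned char>, "char != unsigned char");

// For the other integer types, the signed keyword is redundant.
static_assert(std::is_same_v<int, signed int>, "int == signed int");
```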

You can see this for yourself by looking at the unoptimized assembly code clang generates for the main function in:

#include <cstdint>

void foo(int8_t) {}
void foo(int16_t) {}
void foo(int32_t) {}
void foo(int64_t) {}
void foo(uint8_t) {}
void foo(uint16_t) {}
void foo(uint32_t) {}
void foo(uint64_t) {}

int main() {
    foo('a');
}

which looks something like this (see here):

main:
 push rbp
 mov rbp,rsp
 mov edi,0x61
 call 401130 <foo(int)>  // <-- this is the call to foo('a')
 xor eax,eax
 pop rbp
 ret 
 nop WORD PTR cs:[rax+rax*1+0x0]
 nop DWORD PTR [rax+0x0]

The proper solution(s)

The basic solution is to define overloads for the primitive integer types and use sizeof() to get each type’s size.
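A sketch of that overload-on-primitive-types approach is below. Every integer argument now matches one overload exactly, so the platform’s choice of typedefs for the int*_t types no longer matters. (Returning sizeof(x) is not essential; it is just to make the example observable.)

```cpp
#include <cstddef>

// One overload per primitive integer type, including all three char types.
std::size_t foo(char x)               { return sizeof(x); }
std::size_t foo(signed char x)        { return sizeof(x); }
std::size_t foo(unsigned char x)      { return sizeof(x); }
std::size_t foo(short x)              { return sizeof(x); }
std::size_t foo(unsigned short x)     { return sizeof(x); }
std::size_t foo(int x)                { return sizeof(x); }
std::size_t foo(unsigned int x)       { return sizeof(x); }
std::size_t foo(long x)               { return sizeof(x); }
std::size_t foo(unsigned long x)      { return sizeof(x); }
std::size_t foo(long long x)          { return sizeof(x); }
std::size_t foo(unsigned long long x) { return sizeof(x); }
```

With this overload set, foo('a') calls the char overload directly, with no promotion involved, and the long vs. long long examples from earlier compile on every compiler.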

However, a more compact approach is to use templates and sizeof():

template <typename T>
void foo(T x) {
    constexpr auto size = sizeof(T);
}

If you want to be sure that only integer types are accepted, add a static_assert with std::is_integral_v<T> (or std::is_integral<T>::value if you only have C++11):

#include <type_traits>

template <typename T>
void foo(T x) {
    static_assert(std::is_integral_v<T>, "foo() can only be called with integer arguments");
    constexpr auto size = sizeof(T);
}

If you need to implement different behaviour depending on the size, use if constexpr (a plain if would also work):

template <typename T>
void foo(T x) {
    static_assert(std::is_integral_v<T>, "foo() can only be called with integer arguments");
    constexpr auto size = sizeof(T); 
    if constexpr (size == 1) {}
    else if constexpr (size == 2) {}
    else if constexpr (size == 4) {}
    else if constexpr (size == 8) {}
}

You can also use std::enable_if (yes, this is going to get ugly):

template <typename T>
typename std::enable_if<std::is_integral_v<T> && sizeof(T) == 1, void>::type
foo(T) {}

template <typename T>
typename std::enable_if<std::is_integral_v<T> && sizeof(T) == 2, void>::type
foo(T) {}

template <typename T>
typename std::enable_if<std::is_integral_v<T> && sizeof(T) == 4, void>::type
foo(T) {}

template <typename T>
typename std::enable_if<std::is_integral_v<T> && sizeof(T) == 8, void>::type
foo(T) {}

A final note about sizeof()

An astute reader will have noted that using sizeof() can also pose portability problems.

sizeof() evaluates to the number of bytes used to store a given type.

While we usually assume that a byte is 8 bits long, C++ doesn’t actually guarantee this. Specifically, the number of bits in a byte is defined as whatever number of bits is used to store a char, i.e. CHAR_BIT (see cppreference.com). Because C++ also doesn’t specify the exact number of bits in a char (only that it must be at least 8), a compiler could define char as being larger than 8 bits. In fact, every integer type could be 64 bits long and sizeof() would evaluate to 1 for all of them, and the compiler would still be spec-compliant.

However, in practice, platforms where char (and therefore a byte) is not 8 bits are extremely rare. So, for most practical purposes, assuming that a byte is 8 bits long is OK and sizeof() will behave as we expect :)
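If your code does silently rely on 8-bit bytes, it costs nothing to state the assumption explicitly; on the exotic platform where it fails, you get a clear compile-time error instead of subtle misbehaviour:

```cpp
#include <climits>

// Document the 8-bit-byte assumption and enforce it at compile time.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
```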