Binary library packages using C++ modules

0. Introduction

As I wrote previously about C++ modules, I have come to the conclusion that the Standard Committee has largely ignored today’s practice of software development when designing C++ modules, just as it did before with “export templates”: C++ cannot exist without an environment that turns the language statements into executable code, and the only such environment in use today is clumsy and prone to ABI compatibility problems.

As you know, the first C++ compiler was created as a translator to C, whose output was then compiled with the C compiler, as this was the easiest way to do it. Later compilers translated C++ directly, but they still produced the same kind of output as the C compiler: an object file containing variables and functions. Distribution of libraries followed the same model: whatever could be expressed as a variable or a function was put into the library, and everything else was provided by header files.

It wasn’t the best system even for the C language, but in the case of C++ it caused many more problems. Some things had to be emulated, some generated in a special way, and new features were even required from the linker, such as “vague linkage” support for entities with no single home translation unit (the C language gets by with ordinary strong symbols). But the most important problem appeared when shared libraries were introduced: suddenly language entities were split into a “static” and a “dynamic” part, a distinction that not only isn’t defined in the language, but whose resulting ABI incompatibilities no development tools are even capable of checking (it’s possible only with special tools that keep access to multiple versions of the source code guarding the compatibility line).

God knows that C++ needs a new modularization system, and urgently. But this requires creating at least some new standard for linkage, libraries, and distribution, and – what is most problematic for C++ – a compiler-independent (though it may well remain platform-dependent) format for the interface.

1. The present situation

Software distribution and dependencies have so far relied on the only modularization system available, that is, the one created for the C language, which rests on the following rules:

  1. There is a library file that contains two types of entities, each mapped to a named symbol:
    • A variable (reserved space of a particular size, possibly with some initial contents)
    • A function (an executable code fragment)
  2. As the interface part, a header file is distributed, containing language-level declarations, including the symbols that are to be mapped to entities found in the library
  3. The code being compiled uses the language entities provided by the header file, and may leave named references to symbols to be provided “elsewhere” (that is, by a library file)
  4. Once the individual parts are compiled, the linker creates the target by matching the entities from the library to the object files that refer to them by name.
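
To make this concrete, here is a minimal sketch of that scheme, with hypothetical names (mylib.h as the distributed header, mylib.cpp compiled into the library file):

    // mylib.h – the distributed interface part
    int mylib_add(int a, int b);     // declaration only: the symbol is resolved by the linker
    extern int mylib_counter;        // variable provided by the library

    // mylib.cpp – compiled into the library file
    #include "mylib.h"
    int mylib_counter = 0;
    int mylib_add(int a, int b) { ++mylib_counter; return a + b; }

An application that includes mylib.h compiles into an object file carrying unresolved references to mylib_add and mylib_counter; the linker then completes them from libmylib.a or libmylib.so.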

There are also several different forms in which the result can be provided:

  1. An archive of object files with assigned symbolic names, which may also contain references to symbols not provided in this archive (a.k.a. “static library”)
  2. A linked executable-format file with entities assigned to symbols, which can be used in two forms:
    • as an executable file, if it provides the main execution entry
    • as a shared library otherwise

All of these forms can contain linkage entities – variables or functions – which may in turn refer to other variables or functions that have to be found, possibly in other libraries used for linkage. Linkage can be done as part of the build process, to resolve the entities already available there, or by the OS-provided tools when the executable file is run.

2. Problems before C++

As I said, this is not the best modularization system even for the C language – it just wasn’t perceived as a problem there. There were actually two reasons for that: first, the things provided in header files were minor and had little influence on the overall build process; second, well, C language programmers are – as my experience shows – used to simply living with problems, or patching around them with adhesive tape, instead of solving them.

The C language has existed since 1978, many of its problems were identified already at the beginning of the 1980s, and lots of them have still not been solved (see the Wikipedia page for C23 to check what kind of problems got solved “after all these years”, and note that there is still no function that returns true when two strings are equal). The linkage and modularization system is no different. C++ has brought resolutions for many of the problems (not that the new features didn’t bring in new problems of their own), and it is still improving. I hope that making you aware of how many problems haven’t been solved for such a long time, and that they still apply to C++ simply because it uses the same modularization system, can convince you how urgently a new and improved modularization system is needed for C++. Even the C language might benefit from it.

For starters then: there are entity types in the C language that can be used in the sources and influence the result of compilation, but cannot be provided in any way in the library files:

  1. Preprocessor macros
  2. Global constants (compile-time constants – for example, one that defines the size of a global array)
  3. Structures
  4. Inline functions
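
Here is a hypothetical header showing all four kinds at once – none of the entities below leaves any trace in the library file, yet each of them shapes the code compiled against it:

    // geometry.h
    #define GEO_VERSION 3                        // 1. preprocessor macro
    enum { MAX_POINTS = 64 };                    // 2. compile-time constant (e.g. a global array size)
    struct Point { double x, y; };               // 3. structure: size and layout known only here
    static inline double point_x(const Point* p) // 4. inline function
    { return p->x; }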

Let’s say that the preprocessor is kind of a separate story. A macro defined as #define ESIZE 64 and one defined as #define END } are treated the same way by the language: both are allowed and both can be placed in a header file and this way shared between compiled source files. By providing global compile-time constants and inline functions in the first place, C++ created an opportunity to take these out of the preprocessor and hence increase the chance of providing them somewhere other than header files – but this hasn’t happened so far. And since inline functions were later added to C as well – although I’m not sure this changed much beyond the inlining hint, as C doesn’t give inline functions the vague linkage that C++ does and one translation unit still has to provide the single external definition – the same could be done there without the use of the preprocessor.

As libraries feature only variables and functions, they are never complete. That by itself wasn’t a problem until shared libraries came to life. For static linkage it makes no difference which parts come from the library and which have to be completed from information provided in the header – in the end everything lands in one place. With shared libraries things changed: entities provided in the header became the “static part”, and those in the library – the “dynamic part”. The difference is that:

  1. The “static part” consists of the things taken by the compiler when compiling a single object file; whatever they provide is “encoded inside” (“hardcoded”, I would say). The header file only allows them all to be encoded the same way (although the compiler is incapable of checking whether the entities tied to particular symbols were provided using the same header file – not to mention whether they were compiled with a different environment of macros, compiler options, etc.).
  2. The “dynamic part” consists of the things contained in the libraries – variables and functions – that are linked to the requiring executable file when it is executed. This happens at a completely different time than building the requiring target file, and it is not guaranteed to always access the same library file (which is also an advantage, known as “separate upgrade”).

A short example: if you have a function defined in the source file, and the header provides only its signature, it’s dynamic. The same function provided in the header file with the inline modifier is static. A class definition provided in the public header of the library is completely static. Every method defined inside it is static – unless you only declare the method so that it is defined separately in the source file, in which case it’s dynamic. Do you have any language marker, any modifier, to mark these things? Of course not. The language and the compiler are not even aware of them. But they do exist, and every programmer must be aware of them. That’s why the “pimpl” pattern is so popular. Not only because it speeds up compilation during development, but also because you can freely change whatever you want in the “implementation” class – add or delete fields, change them, add or modify methods, whatever. Only one thing you can’t change: the existing methods of the “interface class”, which is the only class visible in the public header file. As the implementation class isn’t provided in the public header file, it belongs to the dynamic part of the library. Of course, the pimpl pattern then causes problems with the basic feature of OO programming, namely derivation, but without it classes are prone to ABI compatibility problems.
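
A minimal pimpl sketch (a hypothetical Widget class) shows which parts are static and which are dynamic here:

    // widget.h – public header: the interface class, entirely static
    #include <memory>
    class Widget {
    public:
        Widget();
        ~Widget();
        void draw();                  // only declared: resolved from the library (dynamic)
    private:
        struct Impl;                  // defined only in widget.cpp
        std::unique_ptr<Impl> impl;   // size and layout of Impl never reach the header
    };

    // widget.cpp – inside the library: Impl belongs to the dynamic part
    #include "widget.h"
    struct Widget::Impl { int x = 0, y = 0; };   // fields can change freely between versions
    Widget::Widget() : impl(new Impl) {}
    Widget::~Widget() = default;
    void Widget::draw() { /* uses impl->x, impl->y */ }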

Imagine that you have a library that provides a structure in its header file. Your application uses this header and this structure, as the library provides a function that fills in the fields of this structure, and you call this function in your application. The library is shared and links to your application at execution time. Now you get a new version of the library and install it on your system. You run your application and it crashes. Why? Because the developer of the library added two extra fields to the structure in this new version. The function that fills it in refers to these fields. But your object is the old version of the structure from before the upgrade, so it is smaller, and the fields that this function tries to fill in refer to memory already outside of this object (note that what really happens here is a memory overwrite, and a crash is only one of the possibilities).
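
For illustration only – a hypothetical libinfo whose structure grows between versions:

    // libinfo.h as shipped with version 1 – what the application was compiled against
    struct Info { int id; int size; };
    void fill_info(Info* out);        // implemented inside the shared library

    // application code, built against version 1
    void use_library() {
        Info info;                    // the old size (typically 8 bytes) is hardcoded here
        fill_info(&info);
    }

    // libinfo.h as shipped with version 2 – what the installed library was built against
    struct Info { int id; int size; int flags; int owner; };
    // fill_info() now writes 16 bytes into an 8-byte object: 'flags' and 'owner'
    // land past the end of 'info' – a silent memory overwrite.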

How easy do you think it is to cause a problem like this? Extremely easy. It doesn’t happen that often only because today’s developers are well aware of it. There are also tools that can verify problems like this – just as there are lint tools, code style tools, coverage tools, and all that. But there is nothing defined at the language level and nothing that the compiler can warn you about.

But the fact that such problems are not trivial to avoid is only one side of the story. The other is: imagine that this structure is somehow added to the library. Its fields are known – maybe not their exact types, but at least their sizes and hence their offsets, as well as the size of the whole structure. And when fields are referred to, there aren’t just assembly instructions with an appropriate immediate-mode offset to reach the particular field; instead – just like with variables and functions – there is a reference to an appropriate name symbol in the library. This way, when your application allocates an object of this structure, it allocates the size specified by the library and uses the field offsets specified by the library, through constant values provided in the library information. It doesn’t mean that all ABI compatibility problems disappear, but at least they are easier to track. There are still rules, but they are much lighter and easier to follow: in this case, all you are not allowed to do is remove fields or change the type of existing fields, but you can still add new ones or reorder existing ones. And the worst that can happen if you violate this rule is a runtime linkage error, because your application refers to a nonexistent field. That does not compare to the memory overwrite that the same situation causes today.

3. Adjustment to C++

There are things that the standard committee doesn’t give a damn about, and so they had to be solved with adhesive tape by the compiler vendors. It starts with a very simple mechanism: the first compiler vendor provides its own way of doing things, and then the other compiler vendors, volens nolens, adjust to the existing situation, and that’s how the “standard” comes to life. Fortunately, there is no need for a fully portable standard here – things must be compatible only within the frame of a single operating system. But still, some standard must be defined, or otherwise libraries compiled by one compiler cannot be used by applications compiled with a different one.

In the early days – I remember that in the 1990s – there were two different compilers on Windows, one provided by Microsoft, the other by Borland (that company is a different story in itself, but not important here); there was also Watcom, and at some point MinGW was screwed in, each one having a different scheme for mangling C++ entities. At some point all of them likely aligned to Microsoft’s definition, as without it it was not possible to distribute software dependencies; early library vendors were also providing binary packages in alternative “Microsoft” and “Borland” versions. On Linux things looked much better from the beginning, because only one compiler was available: GCC (“GNU C Compiler” initially, renamed to “GNU Compiler Collection” after support for C++ and Fortran was added), and this one has been defining the standard there. I’m not sure how it was with other compiler vendors on other POSIX systems (Sun’s C++ compiler, for example), but as gcc has been available on all POSIX systems, it has de facto dictated the standard there.

What is even funnier, this standard kept changing, as changes were needed for the evolving language, making libraries compiled against the earlier standard incompatible – and so on. Things look much better now, but I have still never heard of any standards committee standardizing it. A pity, but that’s our reality. The standard for this has been defined by:

  • the gcc development team, in the case of Linux (with clang following it, for example)
  • Microsoft, in the case of Windows
  • Apple, in the case of Mac (I’m not sure which compiler they were using before clang – it might be that with clang they even adjusted to the latest standard provided by GCC; I didn’t check)

What were these C++-specific things? For example:

  1. The way to encode the function-like or global-variable C++ entity identifier as a linker symbol name (“mangling”)
  2. The way to encode the “physical” part of a class, that is, the “class characteristic object” (containing, at least in current C++ implementations, the vtable and the RTTI data).

4. Linker features required

Let’s start with this most general feature: The linker should feature immediate constants.

I’m not a linker expert, and I think an appropriate feature may already exist, but it is not used by compilers. The idea is to have a value assigned to a symbol such that the value itself is filled in as a replacement for the symbol, not the address of some entity (as is done for functions and variables, even in the read-only flavor). Such a value should also be usable and interpretable by the linker itself, if it defines, for example, the size of some other data.

What is needed here is this: when an object is referred to, its address is filled in by the linker, but when you need to reach one of its fields of primitive type (an integer or a pointer), an offset is used, together with some base address, and that offset is defined by the compiler directly in the assembly code (“hardcoded”). With this feature, the offset, instead of being encoded directly, would also use a symbol replacement – only this time the linker would replace it not with the address of an entity, but with the contents written in the symbol’s definition.

A similar, though simpler, case is the size of a structure. Whenever you use the sizeof (MyStruct) expression, it would refer to an appropriate symbol in the object file, a constant whose value is that size. This constant is then directly substituted in every place where the reference to this symbol is used. It might even get an appropriate naming convention – as long as linkers support appropriate types of symbols – so that it is not just a “constant”, but specifically a constant being the size of a structure, with its own name mangling.
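
A sketch of the difference, on a hypothetical MyStruct – today the value of sizeof is an immediate constant baked into the caller’s object file; with the proposed feature it would become a reference to a size symbol:

    // mystruct.h (from the library)
    struct MyStruct { int a; long b; };

    // user code
    #include <cstdlib>
    void* make_one() {
        return std::malloc(sizeof(MyStruct));   // today: the size is hardcoded at compile time
        // proposed: the compiler emits a reference to a mangled "size of MyStruct" symbol,
        // and the linker pastes in the value provided by the library.
    }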

This can then be extended to compile-time constants. You theoretically compile things in here, but still, there is no need to physically read them from memory if they can be provided as an immediate value anyway.

A bigger problem is that constants in the language can be used in far more ways than address-based entities. Offsets in structures, structure sizes, or just constants used in some evaluations are a piece of cake. Worse are cases such as a constant that defines the size of an array which is a global variable. Here the linker would have to provide not only support for constants, but also the ability to use constants when resolving a variable: the linker must also “link itself”, by taking the size of the variable from the constant found at a given symbol.
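
For example (hypothetical names), this is how such a constant is consumed today, entirely at compile time:

    // table.h (from the library)
    #include <cstddef>
    constexpr std::size_t TABLE_SIZE = 64;   // compile-time constant
    extern int table[TABLE_SIZE];            // global variable sized by that constant

    // table.cpp (inside the library)
    #include "table.h"
    int table[TABLE_SIZE] = {};              // the value 64 is frozen into both sides
    // If a newer library version raises TABLE_SIZE, an application built against the old
    // header still assumes 64 elements. With the proposed feature the linker itself would
    // size 'table' from a constant symbol, keeping both sides consistent.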

An even bigger problem is with inline functions and templates. But then, very little of that is to be solved by the linker. From the linker we need only these things:

  1. Symbolic entity types, which define the convention that a name uses and how it should be understood
  2. Support immediate constants, where the linker fills in the contents of the symbol, not its address
  3. Support global variables with symbol-defined size

5. C++ entities that could use new linker features

This doesn’t mean that the C++ compiler would have to make use of these new linker features in the new modularization. Whether such a linker exists depends on the system, and its presence only determines whether particular language features, such as sharable classes or templates, can be provided. Many of them might be provided in the future. Such as:

  1. External classes. This is a class that has its size and field layout defined in the library. As such, it comes with fewer restrictions regarding ABI compatibility.
  2. Compile-time external constants. These do exist today, but with the new feature the compiler could generate a constant symbol that resolves to an immediate value.

This can also be extended to functions, as far as their signature is concerned. A function itself is referred to in the linker by its address only, but the size and layout of the stack frame for its arguments depend on the signature. This, too, can be encoded with appropriate constants, so that the call form remains compatible even if, for example, someone replaces int32_t with int64_t in one of the arguments. This doesn’t free the library developer from all compatibility requirements, but at least it gives them more flexibility.
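
The kind of change meant here, shown on a hypothetical function:

    #include <cstdint>
    void set_limit(std::int32_t value);   // library version 1
    void set_limit(std::int64_t value);   // library version 2: a different call frame
    // With plain C-style linkage both versions would resolve to the same symbol, and a
    // caller built against version 1 would pass 32 bits where 64 are expected. C++ name
    // mangling at least turns this into a missing-symbol error; encoding the frame layout
    // as linkable constants, as suggested above, could make such a change survivable.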

6. C++ interface form files

And here is the thing absolutely required for modules: a standardized interface file format that is portable, at least within the frame of a single operating system. That is, it may contain some variants specific to the system, but the format is at least interchangeable within a single operating system and platform – simply as portable as binary packages can be. It is not easy to define, as it would have to be a mixture of C++-language-specific things and platform-specific things.

The first and biggest problem is the preprocessor. It would be best to get rid of it, but that is not easy. Preprocessor macros are still used in header files, and translating projects into a form that doesn’t use them won’t be that simple. They are used for different purposes, and not all of them could be allowed in this form. You might want, for example:

  1. To reshape a constant according to the application’s needs.
  2. To enable or disable a library feature – because an enabled feature turns on a dependency on some external library, or because it requires some specific support from the operating system that is not always available (see the sketch after this list).
  3. To turn on or reshape some development-specific facility, for which library support isn’t needed, or is provided anyway because it is considered cheap.
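
A typical shape of case 2, using a hypothetical library header – exactly the situation described in the next paragraph, where a -D option on the user’s side changes how the library’s header is interpreted:

    // netlib.h
    #ifdef NETLIB_WITH_TLS                         // set via -DNETLIB_WITH_TLS when compiling
    struct Connection { void* tls_ctx; int fd; };  //   the *user's* code, not the library
    #else
    struct Connection { int fd; };
    #endif
    // If the installed library binary was built with TLS but the user compiles without
    // -DNETLIB_WITH_TLS (or the other way around), the two sides disagree about the size
    // and layout of Connection – and nothing checks it.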

This is what I know is being used in various projects, but this list definitely isn’t complete; it is just to give you an idea. The list would be better completed by reviewing various projects to see how they use the preprocessor, and especially the -D compiler option. The interesting case is when a -D option is used in the command that compiles the library user’s code (not the library itself!) and influences how this library’s header file is interpreted. Every such case had better be removed or somehow replaced, for example:

  1. Any specific value should be contained in the library-specific configuration, so that it can be specific to the installation.
  2. If there is a feature that can be enabled or disabled for the whole library, then this should be applied to both the library itself and its interface. The final library form may also contain a manifest file that enumerates the features, with the expectation that a compatible library must have a feature list identical to that of the requiring library user, or otherwise linkage is not possible. This should also be included in the format, because this feature list has to be attached to the library. Something similar is done today by providing versioning for functions in the library, but that relies on linker tricks; this time it would have to be an official feature of the linker, coordinated with the features in the interface file.
  3. This kind of configuration could be provided in the runtime configuration, but if there is any reason to hardcode it in the target, it could in perspective also use the immediate-constant feature of the linker. There should then be some method of transforming a compiler option into a constant – likely something like a symbol that is provided with some default value and can be overridden by a compiler option.

Anyway, the influence of preprocessor macros on the form of the interface must be completely eliminated. The definition of C++ modules should make things motivating enough – I am not sure to what degree it does – but if you have a module-instrumented C++ source file, the following restrictions should apply to it:

  • The use of any preprocessor directive is restricted to the implementation only.
  • Module interfaces can provide a preprocessor macro, but it must be syntactically complete. That is, it can resolve to a single instruction, a single declaration, or a single value, but never to a fragment of one.
  • The use of #include in module-instrumented source files is only allowed in the Global Module Fragment. Symbols provided by it can only be used in the implementation. Re-exporting any part of it, including when such a symbol is part of a definition provided explicitly in the module, requires that the export be explicitly declared.
  • Preprocessor directives may be used to conditionally include some part of a declaration, but then this information must be added to the feature list (so that a binary package with this part included is distinct from a binary package with it excluded, both in the interface and in the implementation).
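
A sketch of a module interface unit that respects these rules (the module and its contents are made up):

    // geometry.cppm – module interface unit
    module;                        // the global module fragment begins here
    #include <cmath>               // #include is allowed only in this fragment
    export module geometry;        // from this point on, no preprocessor dependencies

    export struct Point { double x, y; };

    export double length(Point p)  // body carried by the interface, like an inline function
    {
        return std::sqrt(p.x * p.x + p.y * p.y);
    }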

With these restrictions in place, the module interface extracted from the source file should be free of any preprocessor dependencies – that is, the use of the -D compiler option doesn’t influence anything in the interface.

Next, every language entity defined in this interface must have its form appropriately encoded: the name, the signature, encoded built-in types, encoded references to library-defined or user-defined types. This is something that must be agreed upon by all compiler vendors, because it is the kind of intermediate form that the compiler builds directly after it parses and interprets the contents of a header file. Every compiler does this, and each one generates some form of internal database from it. The point here is to have an interchangeable format for this database, so that it can be loaded quickly by the compiler, without the need to preprocess and parse the header files with all their indirect includes.

I know that the compiler doesn’t actually distinguish between a header file and a source file, but there is the already-known practice of so-called “precompiled headers”, which is a “half-compiled” form of a header. This is a good place to start, but it can’t really be the final solution – among other reasons because, even on Linux, gcc and clang use different formats, different methods of creating them, and different filename extensions, since this was never thought of for any use other than the current build process. This time we need something that is produced identically by every compiler on the platform, so that it is interchangeable and can be distributed together with libraries instead of header files.

For example: such things as structures, classes (assuming we don’t provide method bodies), or even class templates should look the same everywhere, regardless of the platform and compiler vendor – these are language-specific entities. At the same time, the form should be such that every compiler can quickly turn it into its internal database, ready for building the rest of it when compiling the source code.

A slightly different story is the inline functions. Inline functions could be provided with an out-of-line version (I think in the new modularization system this should be required; today it is optional), but the inline version must be completely contained in the interface (which is why inline functions shall not be allowed to be recursive, and multiple inline functions shall not be mutually recursive either). For inline functions it must be anticipated that at the place of their call they will be “dissolved” into the calling environment. Therefore it is unlikely that they can be compiled into target assembly code. Perhaps some fragments can, as long as they can be framed independently, but normally the compiler should perform optimizations on the code after the inline function has been dissolved (for example: there is usually no argument passing on the stack, argument values are taken directly from the code where they were evaluated, writing to the variable being returned is done by writing directly to the variable it is assigned to, and so on). That is why some smart binary encoding would have to be created, into which the function body can be translated, while the compiler still translates it into its own assembly code. This binary code need not be portable in general, only within the frame of the particular operating system – so it can without problems be specific to the target system and machine platform. But it is still unlikely to simply be the same form of assembly code that the compiler would produce for normal functions. Not only because of the required simplifications specific to the place of usage (the alleged call), but also because the same would likely have to be done for function templates, whose details after compilation into normal code depend on the type parameters.
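
A small example of that “dissolving”, on a hypothetical function:

    inline int clamp_to_limit(int v, int limit)
    {
        return v > limit ? limit : v;
    }

    int use(int x)
    {
        return clamp_to_limit(x * 2, 100);
        // After inlining there is no call and no argument passing: the body is folded
        // into 'use' and optimized together with it – x*2 stays in a register and is
        // compared against the immediate value 100. A pre-compiled assembly form of the
        // inline body would be of little use here; the interface has to carry the body
        // itself in some compiler-digestible encoding.
    }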

7. Binding together

Since I have already mentioned the binary code for inline functions, you might ask: would it be possible to have a smart compiler that translates it into assembly code on the fly, called by the linker? I am not sure how complicated this is, but if it is possible, you could simply have a C++-specific dynamic linker, which in perspective could free us from all the problems known as “ABI compatibility” – or at least most of them, while shifting the remaining ones into the category of “linker errors” (which is a large improvement over undefined behavior).

If you have a function template, for example, its final form needs to be a function containing assembly code to execute, but some specifics need to be resolved depending on the size and properties of the types used as template parameters. Templates, however – in contrast to preprocessor macros – have no syntax-level dependencies; all their properties are at the C++ level and, as such, can be encoded. If you have a specific inline function to expand, you use the general method, as for all functions. If you need a type size, or a specific built-in operation, this should be defined in the binary code standard for inline functions. All of this must be simple enough that it can finally be done quickly by a subroutine called by the dynamic linker.
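
For example, here is a (hypothetical) function template whose final machine code depends on its type parameter, yet whose every dependency is expressible at the C++ level:

    #include <cstddef>

    template <typename T>
    T sum(const T* data, std::size_t n)
    {
        T acc{};                                // the element size, the stride of data[i],
        for (std::size_t i = 0; i < n; ++i)     // and the meaning of '+' all depend on T
            acc = acc + data[i];
        return acc;
    }
    // sum<int> and sum<double> need different machine code, but everything that differs
    // (sizeof(T), the selected operator+) is known from the C++-level encoding, so the
    // body could be shipped in the interface form and instantiated on demand – even by
    // the dynamic linker sketched above.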

But that is not all. What we currently have is this: libraries are opened when an application needs them, loaded into memory (the text part only is shared), and then there are per-application instances of the data. The latter must be created anyway, but as for the “text” part, we don’t really need the whole library in memory – only the text entities that are in use.

This, however, means that the library would no longer be a separate namespace, as it is today. Today’s model unfortunately allows a symbol of the same name to be provided by different libraries, while the potential conflict is neither properly prevented nor easy to detect. In the new model you would request a particular module contained in a library package, possibly requiring its specific version, and maybe even a variant (as a set of features – mentioned above). Those should be able to coexist, as long as the system can identify the symbols provided by a particular variant of the library. So you may have a specific library and a symbol provided by this very library package, but it is known up front which library package provides this symbol (because it is encoded in the interface files that the target was requesting), and that symbol is loaded into memory if not yet present.

Once we have this, the system can simply use single C++ “module form files” that provide both interface and implementation in one file, and the unit of caching in the system is still the single linker entity, as required by the application. As long as dependency tracking between entities is possible, the system need load only those entities that are in use. The system also records when an entity was last opened, and when the next application requires access to the same entity, it will reuse the already cached entity, or load the entity anew if the library file is newer. Either way, what gets loaded and cached in memory are single entities, not whole libraries.

This way we don’t even need library files anymore – that is, the *.a and *.so files. Every module is simply compiled into a module form file, which ultimately consists of interface and implementation, provided as a single file. When compiling a library user, the request for a given module is resolved by opening the module form file in order to reach its interface, and then exactly the same module file is used for linkage – both static and shared.

8. Considerations

Not that I have made any feasibility study of this. It is only a pure concept, which could also be implemented partially. The most important thing for modules, as I mentioned in the previous article, is that the transition to modules can be done smoothly, according to the procedure I have shown. This means that there must initially exist a method to provide “pure” module interface files together with the traditional library files, as well as single module form files containing the interface and implementation in one, and also a way to generate a header file out of the binary interface file (yes, including “decoding” the inline functions).

I admit, I have only sketched the concept and haven’t even checked whether all these things are possible to do, but I believe they just require a bit of work and general development – which includes giving up a bit of the market competition – and most of the work would have to be done by the compiler vendors. But I do believe that it is worth it.
