Format string considered not exactly that harmless

Introduction

I wrote this article after getting unexpectedly involved with the formatting feature added in C++20, earlier available as the {fmt} library, which from the very beginning has looked to me like an unfortunate attempt to revive C’s printf function in a new light. Note that as a software developer I am interested mainly in how a particular solution improves productivity, and I evaluate solutions exclusively in that regard.

Software development consists of three types of activities:

  1. Writing the code the first time
  2. Reading the code someone else has written
  3. Fixing the code someone else has written

Having something written “shorter” helps only with the first one, which happens exactly once in the lifetime of the software, while the other two repeat many times. That’s simple mathematics, and claiming otherwise is lying to yourself. There’s also a well-known term in software development for solutions that help with the first activity but make life hard in the others: “write-only software”.

Productivity improves, then, even with a solution that seems harder for the first activity, if it clears the way for the other two for good. An example is choosing longer or shorter names for local variables, especially those that live throughout a long function. If they are short, they are easy to write, but it’s hell for whoever will be fixing the code. Conversely, if they are long, it’s tougher for the writer (although good dev tools help here with completion), but the code is clearer for those who will be fixing it.

I make this introduction first because every solution that puts formatting specifiers in a string is always justified by its enthusiasts with “it is short” and therefore “easy to write”. So, yes, this refers exactly to activity 1, without mentioning the truth about the rest.

Some historical background

How exactly did it happen that the C language got that printf function?

Because someone wanted it to be a function, not a built-in language feature (which is a good idea in general: Pascal made it a language feature and ended up with something inconsistent and of limited use), and once that was decided, it could only be done within what the language’s capabilities allowed. Wherever those were lacking, some replacement had to be found. That’s more or less it.

And what exactly did other languages have at that time?

Let’s take a short example: we have the data of a rectangle: left, top, right, bottom, in variables with exactly these names. We want to display the origin point in Cartesian coordinates (so x=left, y=bottom) and the size (width=right-left, height=top-bottom), and it should look like: origin=x,y dimensions=width x height. We also have a color in RGB format and we display it as color=#RRGGBB (hex).

In Fortran:

write(*, *) 'origin=', left, ',', bottom, ' dimensions=', (right-left), ' x ', (top-bottom)
write(*, '(A,Z6.6)') 'color=#', color

Visual Basic has evolved too far from the original to serve as a historical example, even though it retains many of the original concepts, so let’s use the BASIC of the Commodore 64; the Atari XE version was similar. Note that the semicolon differed from the comma: a comma added spacing between the printed items, while a semicolon glued them together and, when left alone at the end, prevented the end of line from being added.

PRINT "origin="; left; ","; bottom; " dimensions="; (right-left); " x "; (top-bottom)
PRINT "color=#"; HEX$(color)

And in Pascal:

Writeln('origin=', left, ',', bottom, ' dimensions=', (right-left), ' x ', (top-bottom));
Writeln('color=#', HexStr(color, 6));

(Let’s also not forget that a floating-point value in Pascal is formatted with the value:width:precision expression inside the Writeln arguments.)

OK, but now let’s turn to something more contemporary: scripting languages. Most of them have a feature called “interpolated string”, that is, you specify everything you need inside a string. Let’s try it in Tcl:

puts "origin=$left,$bottom dimensions=[expr {$right-$left}] x [expr {$top-$bottom}]"
puts "color=#[format %06X $color]"

Of course, not every language has interpolated strings, so let me also show how it would look without them. The puts instruction in Tcl takes only one string to print (Tcl is known for “everything is a string”, so that’s not a big deal for it anyway), but there’s also a concat instruction to glue values together, which can be used instead. So let’s imagine that there is no string interpolation and that concat is the only way to turn a value into a string:

puts [concat "origin=" $left "," $bottom " dimensions=" [expr {$right-$left}] " x " [expr {$top-$bottom}]]
puts [concat "color=#" [format %06X $color]]

Of course, let’s not forget that the format command in Tcl was modeled after C’s printf function, but here we use just a simple format specifier for a single value. This isn’t the only way to do it in Tcl, because both possibilities are available: either this way, or the same way as with C’s printf, by simply using puts [format "%d %d %d" $a $b $c], which looks similar to printf("%d %d %d\n", a, b, c);.

OK, I may also mention Python, where all the variations are available, but you can still simply do:

print("origin=", left, ",", bottom, " dimensions=", (right-left), " x ", (top-bottom), sep="")
print("color=#", format(color, '06X'), sep="")

Not that this is the only way available; printf-like string formatting and interpolated strings (f-strings in Python) are available as well.

So, what do all these things have in common?

The way to format the printable string is to glue all the values together, while the format configuration details are specified directly at the value being formatted.

Suffice it to say that when the new stream system was being developed for C++, the first thing done was to return to the good old method of formatting, which predated the C language, on the grounds that C++ already had the capabilities to provide it in “userspace” (not as a built-in language feature):

cout << "origin=" << left << "," << bottom << " dimensions=" << (right-left) << " x " << (top-bottom) << endl;
cout << "color=#" << hex << setfill('0') << setw(6) << color << endl;

Did people like it?

Well, some did, but not all. This C++ version attempted to be free of the format string, but one annoying trait of it is that the formatting flags modify the stream state. This is not only annoying for the programmer, but also causes performance problems.

In the meantime, many derivative works have been developed in other languages based on the original idea of the format string: slightly improved, without the most serious problems of C’s printf, but still a format string. I personally find that obnoxious.
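
The state problem mentioned above can be shown with a minimal sketch. The two function names are hypothetical, made up only for this illustration; the stream calls themselves are standard iostream.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical demo: only the first number is meant to be hexadecimal.
std::string sticky_hex_demo()
{
    std::ostringstream out;
    out << std::hex << 255 << ' ';  // prints "ff" and leaves the stream in hex mode
    out << 16;                      // still hex, so this prints "10", not "16"
    return out.str();
}

// Same output, but the caller saves and restores the flags by hand.
std::string restored_hex_demo()
{
    std::ostringstream out;
    const std::ios_base::fmtflags saved = out.flags();  // remember the current flags
    out << std::hex << 255 << ' ';
    out.flags(saved);                                    // back to decimal
    out << 16;                                           // prints "16"
    return out.str();
}
```

The second variant behaves as intended, but only because every call site remembers to restore the state, which is exactly the annoyance described above.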

Sinking in C

Now, why did C introduce this kind of formatting? There’s one and only one reason: this language’s limitations. Of course, it could just as well have been added as a language feature, as was done in many other languages before, but there were good reasons not to do it this way, even if that meant resorting to the weird, concept-mixing style of a format string.

The following limitations of the C language have influenced this solution:

  • Inability to handle dynamic strings. This has lasted up until today and no one seems willing to solve it. It means that you can’t simply have a format function that takes a value and a format and returns a string, because someone would have to manage the lifetime of the dynamic string produced this way.
  • Inability to describe the various types of data passed to a function. Every type that can be passed by value is passed in some standard way (usually aligned to 16 or 32 bits, with 64-bit integers and double as two 32-bit slices, etc.), and the layout of the passed arguments must be known upfront by the function that extracts them.

In effect, the following would already be hard to implement (the call mixes static strings with something that could only be a dynamically allocated string, whose lifetime must be controlled), and it already looks clumsy:

print("Height: ", format(d, "0.4f"), " index: ", format(i, "02d"), "\n", NULL);

so it was decided it will be easier to do this way:

printf("Height: %0.4f index: %02d\n", d, i);

Variadic functions in C were carried over from “K&R” C to ANSI C after the latter introduced function signatures. Signatures brought the “fixed convention”: the values and types passed are exactly as the function declares, and type conversions are applied if needed. “K&R” C had only the “forced convention”, that is, whatever parameters were passed to a call were accepted as they were, and the stack frame was constructed based on the actual types of the passed values. In a variadic function the convention is fixed only up to the last explicit parameter; all further parameters use the “forced convention”.

The “forced convention” is the only way to pass parameters of possibly different kinds to the same function. The problem is that the called function just gets the stack frame with the parameters, but knows nothing about its size, the sizes of particular parameters, or how to interpret them. Therefore there are exactly three known conventions for using variadic functions in C:

  1. Everything you pass to the function must be a pointer to the same type, and you always pass NULL as the last parameter. This is the model of the execl* family of functions. A similar method is often used for arrays, so that you can pass the array as a single pointer without the size. A subvariant of this method is to provide values with tags: the last fixed argument is the first tag, after which a specific value is expected, and then either the next such pair or a termination tag. Theoretically a preprocessor macro could even provide some syntactic sugar that removes the need for the termination value, but this requires variadic macros, which were standardized only in C99 and, for C++, in C++11.
  2. The last fixed argument provides complete information about every following argument. Their number and characteristics (the size to grab and the way to interpret it) must exactly match the types of the values being passed. This is the model of all “formatting” functions, be it the printf or the scanf family.
  3. The function is variadic, but it actually expects exactly one parameter after the last fixed argument. The fixed arguments determine what type of value is expected there.
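
Convention 1 can be sketched as follows. This is only an illustration: concat_all is a hypothetical name, and the function is written in C++ with <cstdarg> rather than in plain C.

```cpp
#include <cstdarg>
#include <string>

// Hypothetical example of convention 1: every variadic argument is a
// `const char*`, and a null pointer marks the end of the list.
std::string concat_all(const char* first, ...)
{
    std::string result;
    va_list args;
    va_start(args, first);
    // The function has no other way to learn the argument count
    // than to keep reading until it meets the terminator.
    for (const char* s = first; s != nullptr; s = va_arg(args, const char*))
        result += s;
    va_end(args);
    return result;
}

// Usage: concat_all("a", "b", "c", static_cast<const char*>(nullptr));
```

Note that the terminator is cast explicitly so that it is passed as a pointer of the expected type. Forgetting it altogether makes the loop read past the real arguments, and nothing in the language will stop that.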

What’s interesting is that a quoted string is the only sensible way to provide the type information in case 2, because this is the only kind of array that automatically gets a single 0 value, the NUL character, appended at its end. The NUL-terminated string is itself another consequence of C language limitations dating back to “K&R” times: it exists only because this was the only way to pass a string as a pointer to an array without also passing its size explicitly (for arrays of integers, 0 is as good a value as any other). There was no other merit in it, because a NUL-terminated string is more error prone and less performant than a pointer to an array combined with an explicit size (for example, unlike memcpy, the strcpy function doesn’t know the length upfront, so it can’t simply copy in large blocks and has to look for the terminator as it goes, which is especially a problem on 4-byte aligned machines).

As a slight digression, it’s worth noting that some languages much younger than C, although scripting ones, repeated the same mistake by not requiring functions to have fixed signatures: Perl and JavaScript. It seems to be less of a problem there, because you can’t get a crash or a memory overwrite in such languages, but the increased opportunity for programmer mistakes remains. Somehow other languages, like Tcl or Python, managed to have fixed function signatures.

So, printf uses approach 2: the format string contains tags marked with the % character, and for every such tag an appropriate number of arguments is taken from the call, usually one (there are special cases where no argument or two arguments are required, but that’s a technical detail). Based on this, the user must make sure that every percent-sign tag corresponds to the right arguments in the call, and that the type specifier in the tag matches the type of the argument, so that the function grabs the right number of bytes from the call frame and interprets them the right way.

Just after first writing this article it occurred to me that they could actually have used a different approach: convention 1, with the requirement that the last fixed argument is a string that declares the first format. The formatting tag should always be at the end of a string, and if it’s not present, there are no further arguments. It would look like this:
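
To make the mechanism concrete, here is a deliberately tiny sketch of approach 2. The name mini_format is hypothetical and only %d, %s and %% are handled; it is not how any real printf is implemented, just the same principle in a few lines.

```cpp
#include <cstdarg>
#include <string>

// Hypothetical miniature of the printf convention: the format string is the
// only source of information about how many arguments follow and what they are.
std::string mini_format(const char* format, ...)
{
    std::string result;
    va_list args;
    va_start(args, format);
    for (const char* p = format; *p != '\0'; ++p) {
        if (*p != '%') {          // ordinary character: copy as is
            result += *p;
            continue;
        }
        ++p;                      // look at the type specifier after '%'
        if (*p == '\0')
            break;                // lone '%' at the very end: stop
        switch (*p) {
        case 'd':                 // the tag says "int", so an int is grabbed
            result += std::to_string(va_arg(args, int));
            break;
        case 's':                 // the tag says "C string", so a pointer is grabbed
            result += va_arg(args, const char*);
            break;
        default:                  // "%%" and unknown tags: emit the character itself
            result += *p;
            break;
        }
    }
    va_end(args);
    return result;
}

// Usage: mini_format("x=%d name=%s 100%%", 7, "abc");
```

The function never sees the real types of its arguments. If the caller writes %s and passes an int, the switch will still happily grab a pointer from the call frame.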

print("Height: %0.4f", d, " index: %02d", i, "\n");

I think that could also be an acceptable solution, but likely no one thought it would be a good idea, and of course with an unchecked method of passing parameters it could be more error prone. It would definitely suffer from the same problem as printf, where you can’t print a string with printf(get_str()); because you risk having % characters in it, so you should use printf("%s", get_str()); (and a literal % in the format string must be doubled). Here it would be even worse: you would need print("%s", get_str(), "");. Forgetting the last empty string would cost you a crash, because only the last untagged string terminates the argument list. Alternatively the list could be terminated by NULL, but then we are no longer talking about a rarely used function like execl, but about one of the most often used ones.

And here is the whole explanation. And no, it wasn’t any kind of “new approach” or “comfortable enough”, “better readability” or whatever other bullshit you can invent here. The true reason to invent such a sorry solution was just one: to meet half way with the limitations of the C language.

Note that it also mixes up two completely unrelated solutions: the format specification of a single value and “string stencil filling”, that is, replacing tags in a tagged string with specified values (I’ll elaborate on that later). The latter was added only because it’s more comfortable than elaborate formatting of every value when default settings are enough. In C++ you can write cout << "Height: " << d << " index: " << i << endl; and happily use the default format, while if C had resorted to the format function shown above, you couldn’t simply pass d to it; you would have to preformat it by calling format(d, "e"). Even with some support for default settings and some ability to recognize the type (C11 added such a feature), the best you could hope for is to call the same function with an empty format argument.

Therefore the people who developed this solution are excused. They just worked with what they had at that time, and (what is extremely important to take into account) in the 1970s and a bit later high-level compiled languages were a novelty (Fortran was merely a wrapper for assembly language, while anything high-level, like Smalltalk, was interpreted, at best using a virtual machine). Likewise, the amount of software and its level of complexity were negligible, at least in comparison to what we have today. There was no experience, including bad experience, with particular solutions, and people simply used what was available and had to get used to solutions that had no better alternative. Therefore also excused are the people who had to work with this for such a long time and have found it good enough, even if better alternatives already exist.

But preferring this solution over language-provided flags and gluing values in their right order reminds me of people from some deep provinces in some forgotten lands who still eat meals made of dog meat (I didn’t try myself, but my grandpa, who survived WW2, told me that it stinks), even though they could just as well eat pork or beef. The same thing: people, who were doing it in the far past, did it because they had no other choice, except for starving. But even they could never prefer this over pork, beef or lamb!

But then, am I not simply trying to convince you about my personal preferences?

No. String formatting is objectively bad – and here I’m going to explain, why.

Short introduction to Esoteric Languages

So, you probably heard about esoteric programming languages – like Intercal, Befunge, Brainfuck, Unlambda, Smetana or Malbolge. Ah, I almost forgot – there is one more, which is even today still used to create software: Perl.

There are also esoteric algorithms. For example, there are esoteric sort algorithms, such as dropsort. Beats all algorithms because it has linear complexity. How does it work? Well, it copies the elements to the output, skipping those that are not in order. Who said that all elements must be preserved?

The general idea of esoteric languages is to implement language solutions that make programming harder, the absolute opposite of being readable, error-avoiding, comfortable and useful. Intercal is a good example because it accumulates not just one, but multiple stupid ideas, among which there are some interesting ones:

  • The PLEASE instruction was required to be used for some number of instructions, otherwise the program is ill-formed as too unkind. But not too many because this way it might be also ill-formed as too fawning.
  • Instead of parentheses, Intercal uses single and double quotes. For example, (x-y*(a-(f-e))) in Intercal would be written as 'x-y*"a-'f-e'"'.
  • And a later addition, COME FROM, which is an inverted version of GO TO (note that in older languages lines had their explicit numbers, as the way to edit the source was only to enter and record a single line at a time – and GO TO referred to that line). It has to be defined in the jump destination location and should specify the line from which the jump should be done. In other words, when subsequent statements are executed, the next one is not, and the jump is made because this line is mentioned at the destination location with COME FROM.

A specific category of esoteric languages are parodies of the Turing machine: Brainfuck, Doublefuck, or Befunge. Brainfuck contains only simple instructions for moving along the tape and increasing or decreasing the integers on it. Befunge is a one-feature language, effectively an extended, two-dimensional version of Brainfuck, where the execution cursor moves through a matrix of instructions. There are additional instructions for changing the direction to horizontal or vertical, and the instructions executed are those the cursor meets on the matrix. To make a loop you simply have to arrange your instructions so that the execution cursor runs along a rectangular path. So that’s reliance on the text layout of the source code driven to absurdity.

Why do such things exist? Why do people work so hard to write language specifications and implement compilers, which from the business point of view seems completely wasteful? Because this can serve as a warning and a tangible proof that some ideas are just stupid and counter-productive, and without an implementation you don’t have a proof. Besides playing with ideas and intellectual amusement, they do have a value: they openly declare (by sneering) which language features are bad because they contradict readability and usefulness, are as error prone as possible, and simply make programming harder.

Some didn’t get the joke

Not that it didn’t, unfortunately, inspire creators of other languages. I have already mentioned Perl as one of the languages that often prefers crazy and nice-looking statements to stable rules and usefulness, especially for people who also use other languages. For example, you can do this in Perl:

if (is_higher($n))
{
    slip($x);
}

but this won’t work:

if (is_higher($n))
    slip($x);

Not that this is impossible without braces. You simply have to specify it this way:

    slip($x)
if (is_higher($n));

I wouldn’t even guess which function call is executed first.

It also has things like unless, which can be used instead of if with not, and similarly until as a complement to while, and many alternative ways to do the very same thing, all of them equally useful, and all of which you have to know when you try to understand someone else’s code.

It even inspired me once to create another esoteric language, which I called Legal (although I had neither enough time nor enough desire to make something out of it, and I abandoned the project), in which every instruction was a single sentence starting with an uppercase letter and ending with a period. Instructions were grouped in paragraphs, and single instructions, especially those with conditions, formed sections. Because if you can write an instruction like slip($x) if (is_higher($n)), then you can go even further and say something like Is the value greater than 5? If not, execute the statement mentioned in section 5., can’t you?

Many bad language decisions have later been repeated in other languages as well. For example, someone didn’t get the joke of Befunge, or even more of Whitespace (in which the only characters you can use in the source code are space, tab and end of line, and which otherwise is based on the same idea as Brainfuck), when creating the make tool, whose configuration script, the Makefile, has to distinguish between tab and space: depending on which one is the very first character in the line, the following statement is interpreted either as a plain script statement or as a shell command to execute. In my early programming days I sometimes used Dos Navigator (a clone of it still exists in some incarnation) to edit files, and I once spent a long time looking for an error after it had replaced all tabs with spaces in a makefile.

That was condemned a long time ago (likely that’s why some are trying to switch to ninja, also my agmake is free of that problem), but it still didn’t stop the creators of Python from making it rely on the source code formatting in the language correctness. Some people even praise this feature as something that “makes you keep the program source decently formatted” (playing idiots, of course, because the problem is not that you are forced to keep the program formatted, but that the meaning and syntactic correctness of the program depend on it).

That’s also not the only stupid thing developed in Python. There is a well-known problem with multiple inheritance in C++ when a conflict arises: two identical method signatures in two different (and unrelated) base classes of a common derived class. The most sensible solution (although not so simple to introduce in the language without a deprecation period) would be to disable this possibility and keep the method hidden solely in its base class (so that it doesn’t automatically get exposed in the derived interface); that could also be a desirable solution for C++. But no, Python not only allows this, it also allows specifying in the class configuration which parts of the base class should be derived in which way. Only for the official documentation to then warn that multiple inheritance can lead to lots of problems and should be absolutely avoided. Congratulations.

All of the above proves that even when a solution has already been condemned as counter-productive in software development, such traits still have fans who can’t live without them. When I looked for a tutorial on how to print values in Python (which also included the methods I showed above), its author said that the “format method of string” is actually the “best solution” (???). Yes, that’s exactly what he said. Doing print("a={} b={}".format(a, b)) is, according to him, better than print(f"a={a} b={b}"). I wouldn’t have believed it if I hadn’t seen it with my own eyes. And it was so badly inspiring that someone even created a library called {fmt}, which does the same thing for C++, and which was later accepted into the C++20 standard and is already available in most compilers.

I wouldn’t even mind that people are so fixated on string formatting in their own software (for open source projects I don’t really care that someone decided to use a counter-productive technique), except that in C++20 and {fmt} the format configuration structure isn’t even part of the public API, so there’s no way to make extensions that improve it, nor to reuse the good old C++ iostream manipulators to configure the output. Apparently someone really doesn’t understand why string formatting is so counter-productive, especially since C++ doesn’t suffer from the C language limitations and there’s no reason to follow these solutions.

Those who know the history and still chose to repeat it

Many people even misunderstand what the real reason is that printf is so counter-productive.

No, it’s not a matter of the explicit type specifier, which can be solved by automatic type recognition. In today’s C it’s a minor problem, since compilers check and report wrong specifiers, including for other functions following this scheme. The biggest problems fall under a factor generally called “productivity”. Let’s try this example, already written in the {fmt} style without type specifiers, so that the minor problem is taken care of:

print_to(output, "KXDrawTool: configured: color=#{:02X}{:02X}{:02X} thickness={} depth={}\n",
                    settings.color.r, settings.color.g, settings.color.b,
                    settings.thickness, settings.picture_depth);

Let’s say it’s not a problem if the format string contains up to 4 tags, well dispersed throughout the string (good for you if you have an IDE that helps here with highlighting). But the more parameters you have, the farther each of them is from the place where its format specifier is declared. That’s exactly the reason why I mentioned COME FROM from Intercal.

A similar problem is in general in C and C++ (for the latter being even more of a burden) that functions and methods need to be usually declared twice – once for the header and once for the implementation – and they have to be kept manually in sync, and often not even copied 1-to-1 (due to namespace shortcuts, default parameters etc.). Modules in C++ were designed to help here, but I can’t see them quickly adopted (which is another story I described elsewhere).

So, take a look at this and tell me: what exactly should you do to identify which of the function call arguments lands where in the formatted line? There’s no other way than simply counting the format tags one by one and then applying this count to the variadic arguments. And why did I mention method declarations in C++? Because here, too, you have two separate things to keep in sync: the format tag in the string and the value. Want to add one in the middle? You need to find which value precedes the place you are inserting at, and then find, by counting, exactly the same place among the variadic arguments. The compiler will check the type specification, but won’t check whether you mistook x for y, both being int.
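
The last point can be illustrated with a small sketch. Point and both function names are hypothetical, introduced only for this example; the first function contains a deliberate mistake.

```cpp
#include <cstdio>
#include <string>

// Hypothetical type used only for this illustration.
struct Point { int x; int y; };

// Intended output: "x=<x> y=<y>". The arguments are swapped by mistake,
// yet both are int, so the compiler's format check has nothing to report.
std::string describe_swapped(const Point& p)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "x=%d y=%d", p.y, p.x);
    return buffer;
}

// Gluing each label to its value keeps the pair in one place,
// so a mismatch is visible without any counting.
std::string describe_glued(const Point& p)
{
    return "x=" + std::to_string(p.x) + " y=" + std::to_string(p.y);
}
```

For a point with x=1 and y=2 the first function produces "x=2 y=1" without a single warning, while in the second one the same mistake would have to be written as "x=" next to p.y, which is much harder to overlook.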

Of course, in order to keep things a little less frustrating you can also indent every argument so that it starts in the position of the { character corresponding with this argument. And here is the real deal:

    print_to(output, "KXDrawTool: configured: "
            "color=#{:02X}{:02X}{:02X} thickness={} depth={}\n",
                    settings.color.r,         // |        |
                          settings.color.g,    //|        |
                                settings.color.b,  //     |
                                                 settings.thickness,
                                                          settings.picture_depth);

And now you can see, how this thing could be improved: follow the Chinese, mate!

The expressions for the values to be printed should be written top to bottom instead of left to right!

Even so, as you can see, this doesn’t exactly suffice, because the more the argument list grows, the more lines separate the format tag from its argument. OK, some column highlighting could help, but this is annoying for some (for me it is), and it’s poor editor support half-solving a language problem. And it can even grow into a real horror from some “C-based performance programmer”, who once placed something like this in the code:

sprintf(output,"%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X,%02X", arr[0], arr[1], arr[2], arr[3], arr[4], arr[5], arr[6], arr[7], arr[8], arr[9], arr[10], arr[11], arr[12], arr[13], arr[14], arr[15], arr[16], arr[17], arr[18], arr[19], arr[20], arr[21], arr[22], arr[23], arr[24], arr[25], arr[26], arr[27], arr[28], arr[29], arr[30], arr[31], arr[32], arr[33], arr[34], arr[35], arr[36], arr[37], arr[38], arr[39]);

My boss, who saw this at that time, just asked “and this is certainly in order, right?”.
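
For comparison, here is a sketch of how the same output could be produced with a loop, where there is nothing to count. The name hex_join is hypothetical, and the element type of arr is not shown in the original call, so unsigned char (with 8-bit bytes) is an assumption made here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original code): joins `count` bytes as
// two-digit uppercase hex values separated by commas.
std::string hex_join(const unsigned char* bytes, std::size_t count)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string result;
    for (std::size_t i = 0; i < count; ++i) {
        if (i != 0)
            result += ',';                  // separator between items only
        result += digits[bytes[i] >> 4];    // high nibble
        result += digits[bytes[i] & 0x0F];  // low nibble
    }
    return result;
}

// Usage: hex_join(arr, 40) instead of forty tags and forty arguments.
```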

Yeah, I hear already screaming eagles lecturing me that “it is cleaner if I have a format string with just tagged locations for the values”!

First, it’s not “cleaner” – at best it is the same clean for simple examples using default format, like this:

print(output, "Rectangle : ", left, " ", top, " ", right, " ", bottom, "\n");

Making it a single pre-formatted line doesn’t make it any cleaner, while already introducing the eye-jumping problem (though still minor in this case):

print(output, "Rectangle: {} {} {} {}\n", left, top, right, bottom);

Of course, cleanest would be to have the interpolated string, but it’s not simple to add to C++ syntax (the expressions in the interpolations must be anyway parsed by the compiler, so such a feature would have to be added on the language level):

print(output, fs_"Rectangle: {left} {top} {right} {bottom}\n");

Second, if I wanted to have a “string template” to be filled with values replacing tags, then this is the example of the right solution:

print(output, "Rectangle %left %top %right %bottom\n",
               "top", top,
               "bottom", bottom,
               "left", left,
               "right", right);

This is a completely separate feature to formatting and there’s completely no reason in mixing them both in a single format string (this was done in C only because it was easiest to implement, not because it was anyhow useful). Moreover, if you need formatting, in this case it should be specified near the expression to be printed, not near the tag in the string:

print(output, "Rectangle %top %left %right %bottom\n",
               "top", setw(6), top,
               "bottom", setw(6), bottom,
               "left", setw(6), left,
               "right", setw(6), right);

I’ll expand on this idea below.
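
The “string template” idea from this section, taken separately from any formatting, can be sketched like this. fill_template is a hypothetical name, and passing the values as a map of ready-made strings is an assumption of this sketch; the calls shown above pass name/value pairs directly instead.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch: replaces every %name tag in `text` with the value
// registered under `name`. Unknown tags are left untouched, and "%%"
// escaping is not handled here.
std::string fill_template(const std::string& text,
                          const std::map<std::string, std::string>& values)
{
    std::string result;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != '%') {
            result += text[i++];
            continue;
        }
        // Collect the identifier following '%'.
        std::size_t end = i + 1;
        while (end < text.size() &&
               (std::isalnum(static_cast<unsigned char>(text[end])) || text[end] == '_'))
            ++end;
        const std::string name = text.substr(i + 1, end - i - 1);
        const auto found = values.find(name);
        if (found != values.end())
            result += found->second;            // known tag: substitute the value
        else
            result += text.substr(i, end - i);  // unknown tag: keep it verbatim
        i = end;
    }
    return result;
}
```

Because the tags are looked up by name, the order in which the values are supplied doesn’t matter, and moving a tag within the text requires no change on the value side.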

What the programmers actually need

You know what this operator<< for std::ostream actually is? It’s only a workaround for the lack of variadic templates, which were added only in C++11. Had they been available from the start, it would have been very simple to define the “print” function known from BASIC, Python, and many other languages:

#include <utility> // std::forward

template <class Stream>
inline Stream& print_to(Stream& sout) { return sout; }

template <class Stream, class Arg1, class... Args>
inline Stream& print_to(Stream& sout, Arg1&& arg1, Args&&... args)
{
    // write_fmt is a hypothetical named alias; in C++ today it's ostream::operator<<() that has none
    sout.write_fmt(std::forward<Arg1>(arg1));
    return print_to(sout, std::forward<Args>(args)...);
}
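A runnable variant of the same idea, as a minimal sketch. It assumes C++17 (for the fold expression) and calls operator<< directly, since write_fmt does not exist in the standard library:

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>

// Sends every argument to the stream, in order, using operator<<.
// Works with any stream-like type that supports operator<< for the arguments,
// including iostream manipulators such as std::hex or std::setw.
template <class Stream, class... Args>
Stream& print_to(Stream& sout, Args&&... args)
{
    (sout << ... << std::forward<Args>(args)); // binary left fold, C++17
    return sout;
}
```

The fold expression replaces the recursive pair of overloads: with an empty argument list it simply evaluates to the stream itself.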

With this function, you can very easily write the above this way:

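    // setw() applies only to the next printed value, so it is repeated for each component
    print_to(output, "KXDrawTool: configured: color=#", setfill('0'), hex, uppercase,
             setw(2), settings.color.r, setw(2), settings.color.g, setw(2), settings.color.b,
             dec, nouppercase, setfill(' '),
             " thickness=", settings.thickness,
             " depth=", settings.picture_depth, "\n");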

Personally, though, I don’t like runtime state changes, and it would be much better to do it this way:

    print_to(output, "KXDrawTool: configured: color=#",
             fmt(settings.color.r, setw(2), setfill('0'), hex, uppercase),
             fmt(settings.color.g, setw(2), setfill('0'), hex, uppercase),
             fmt(settings.color.b, setw(2), setfill('0'), hex, uppercase),
             " thickness=", settings.thickness,
             " depth=", settings.picture_depth, "\n");

You can find here how to define the above fmt. Note, though, that it performs poorly, especially in comparison to the C++20 format library (and the equivalent {fmt} library available for earlier standards), but it’s just an idea that can be improved further. The poor performance comes mainly from iostream, especially from saving and restoring the stream’s format state, which this function does.
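For illustration only, here is one possible shape of such a helper. This is not the implementation referred to above: it is a simplified sketch that assumes C++17 and sidesteps saving and restoring the stream state by formatting into a private std::ostringstream and returning a string:

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical fmt(): applies the given manipulators to a temporary stream,
// prints the value there, and returns the resulting text. The caller's
// stream never sees the manipulators, so its state is left untouched.
template <class Value, class... Manip>
std::string fmt(const Value& value, Manip&&... manip)
{
    std::ostringstream tmp;
    (tmp << ... << std::forward<Manip>(manip)); // apply manipulators in order
    tmp << value;
    return tmp.str();
}
```

With this definition, fmt(10, setw(2), setfill('0'), hex, uppercase) yields "0A". One caveat: if the color components are of type unsigned char, iostream prints them as characters, so they would have to be converted to an integer type first (for example with a unary plus).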

Note that if you prefer formatting specifiers given as a format string, you can still do it like this (the above fmt does not define it, but it is a possible feature). This is very much like the format() function in Python:

    print_to(output, "KXDrawTool: configured: color=#",
             fmt(settings.color.r, "02X"),
             fmt(settings.color.g, "02X"),
             fmt(settings.color.b, "02X"),
             " thickness=", settings.thickness,
             " depth=", settings.picture_depth, "\n");

Now you can see that:

  • every expression is visible exactly at the place where its value is printed
  • every expression has exactly one instance of itself in this instruction
  • the format specifier is attached directly to the expression it touches upon
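A string-specified fmt like the one used above could be sketched as follows. The mini-language accepted here ([0][width][.precision][type], with type one of d, x, X, f) is an assumption made for this example only. It is far smaller than what printf, Python or {fmt} accept:

```cpp
#include <cassert>
#include <cctype>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical fmt(value, spec): translates a tiny printf-like spec into
// iostream manipulators applied to a private stream.
template <class Value>
std::string fmt(const Value& value, const std::string& spec)
{
    std::ostringstream out;
    std::size_t i = 0;

    // Reads a run of decimal digits starting at spec[i]; returns 0 if none.
    auto read_number = [&spec, &i]() {
        int n = 0;
        while (i < spec.size() && std::isdigit(static_cast<unsigned char>(spec[i])))
            n = n * 10 + (spec[i++] - '0');
        return n;
    };

    if (i < spec.size() && spec[i] == '0') {   // leading 0: zero fill after the sign
        out << std::setfill('0') << std::internal;
        ++i;
    }
    const int width = read_number();
    if (i < spec.size() && spec[i] == '.') {
        ++i;
        out << std::setprecision(read_number());
    }
    if (i < spec.size()) {
        switch (spec[i++]) {
        case 'd': out << std::dec; break;
        case 'x': out << std::hex; break;
        case 'X': out << std::hex << std::uppercase; break;
        case 'f': out << std::fixed; break;
        default:  throw std::invalid_argument("unsupported format type in: " + spec);
        }
    }
    if (i != spec.size())
        throw std::invalid_argument("trailing characters in format spec: " + spec);

    out << std::setw(width) << value;  // setw last: it applies to one insertion only
    return out.str();
}
```

The spec still travels together with the value it describes, which is the point of the list above. An unsupported spec is reported at run time here, whereas a manipulator-based fmt is checked by the compiler.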

But then, if you think it would be “more readable” with format string tags, it could still be done this way:

    print_to(output, "KXDrawTool: configured: color=#%RR%GG%BB thickness=%thicks depth=%depth\n",
             arg("RR", fmt(settings.color.r, "02X")),
             arg("GG", fmt(settings.color.g, "02X")),
             arg("BB", fmt(settings.color.b, "02X")),
             arg("thicks", settings.thickness),
             arg("depth", settings.picture_depth));

Still, the main differences from the formatting library are that:

  • The tag-replacement feature is separate from format specification: it is just a simple tag replacement, nothing more
  • Formatting specifiers are still in use, but they are tied to the value being printed, and hence they are part of the arg(...) declaration for the tag, not of the format string. You can still use markers in the tag names that suggest how the values are intended to be formatted, but this is only information for the reviewer, not a specification for the language.

You may even resort to a crazy idea like this:

    print_to(output, "KXDrawTool: configured: color=#",
             settings.color.r %fmt("02X"),
             settings.color.g %fmt("02X"),
             settings.color.b %fmt("02X"),
             " thickness=", settings.thickness,
             " depth=", settings.picture_depth, "\n");

What, then, do I think would be the ideal solution?

My personal preference is the interpolated string. The above solution is not the same, but close enough. What problems exactly would I have to solve when format specifiers use language primitives (instead of a string specifier)? I would like to be able to specify them more briefly. But this could be solved by having something like this before the printing call:

    auto H02 = make_fmt(setw(2), setfill('0'), hex, uppercase);
    print_to(output, "KXDrawTool: configured: color=#",
             H02(settings.color.r), H02(settings.color.g), H02(settings.color.b),
             " thickness=", settings.thickness,
             " depth=", settings.picture_depth, "\n");

Or, should that be more comfortable:

    auto H02 = make_fmt(setw(2), setfill('0'), hex, uppercase);
    print_to(output, "KXDrawTool: configured: color=#",
             settings.color.r %H02, settings.color.g %H02, settings.color.b %H02,
             " thickness=", settings.thickness,
             " depth=", settings.picture_depth, "\n");

That additionally allows you to have a predefined format for a particular kind of data for the whole application, which can be changed in one central place when needed. These are the only problems I can imagine needing to be solved with this kind of formatting.
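A possible make_fmt supporting both spellings, H02(value) and value %H02, might look like this. It is a sketch under the assumption of C++17. The Formatter wrapper type is hypothetical and exists only so that operator% can be overloaded for it without affecting the built-in remainder operator:

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Wraps a formatting callable so that it can be used both as H02(value)
// and as value % H02.
template <class Apply>
struct Formatter {
    Apply apply;

    template <class Value>
    std::string operator()(const Value& value) const { return apply(value); }
};

// Stores copies of the manipulators and replays them on a private stream
// for every value that is formatted.
template <class... Manip>
auto make_fmt(Manip... manip)
{
    auto apply = [manip...](const auto& value) {
        std::ostringstream tmp;
        (tmp << ... << manip);   // apply manipulators in order (C++17 fold)
        tmp << value;
        return tmp.str();
    };
    return Formatter<decltype(apply)>{apply};
}

// value % formatter: same as formatter(value).
template <class Value, class Apply>
std::string operator%(const Value& value, const Formatter<Apply>& format)
{
    return format(value);
}
```

A format such as H02 can then be defined once, for example in a shared header, and changed in that one place.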

Yes, I can again hear people screaming that, obviously, if you have values of type double and you want to print one zero-filled to 4 digits with precision 8 and another with precision 6 only, with this system you have to write:

print_to(output, "value: ", fmt(value, setfill('0'), setw(4), setprecision(8)),
         " confidence: ", fmt(confidence, setprecision(6)), endl);

or at best with string-specified formatters:

print_to(output, "value: ", fmt(value, "04.8f"),
         " confidence: ", fmt(confidence, ".6f"), endl);

which would be much shorter to write even with printf:

fprintf(output, "value: %04.8f confidence: %0.6f\n", value, confidence);

and the same with {fmt} or std::format:

print("value: {:04.8f} confidence: {:.6f}\n", value, confidence);

I can admit one thing: this is shorter. I cannot agree that it is in any way better for productivity, as I explained in the beginning.

On the other hand, how often have you actually used explicit format parameters, other than because they are always required (as in sprintf) or because you needed library- or application-wide consistent formatting, for which you could use a preconfigured format? I can imagine software that prints lots of formatted floating-point values and rows of strictly aligned hexadecimal values, each with exactly 8 digits. But even then, isn’t it simpler to just create a simple function, say “HEX8”, which formats the given value this way (and whose correctness is checked by the compiler), than to use "%08X" every single time? Not to mention a case that once happened to me: I had an integer value of some logical type that always had to be formatted the same way, I decided to format it as a 4-digit hexadecimal value, and then the project manager told me that it was actually required to be printed as a 4-digit decimal…

So, even if I thought that string-based format specifiers were better for some reason, the best I can imagine is:

format_to(output, "value: ", "04.8"_F, value, " confidence: ", "0.6"_F, confidence);

Here I used _F as a UDL (user-defined literal) applied to a string to turn it into a formatter, as "04.8"_F, value is shorter than fmt(value, "04.8") or value %fmt("04.8").

Ah, of course, it is additionally better to have a line with short, predefined expression symbols instead of elaborate expressions inside the call. That’s what the tagging feature of the formatting library was meant to solve, right? You think it’s nice to have the tags this way and then use more elaborate expressions for the actual values:

format_to(output, "value: {value:04.8} confidence: {confidence:0.6}" ... );

If so, you can still use intermediate variables. They go just above the general line specifier, not below it, but that changes nothing. This can matter if you use some complicated expression pattern, but if so, see below.

So many features!

So, I had been so negligent of the printf function that I have only recently learned that it features argument positioning (the POSIX "%N$" extension). That is, you can specify the argument number in every format specifier instead of relying on the argument order. This way, instead of counting the percent signs up to ten and then doing the same in the argument list, you just have to read the number 10 in the specifier and then… again count to ten in the argument list. A perfect solution. And that’s only one of many things that the {fmt} library also provides, with its {N:} format.

You can also specify tag names, and then tag every argument passed after the format string, so that it is known which argument a format specifier refers to. Nice, but how does that differ from defining intermediate variables before the printing instruction and then using them directly in the sequence of printable pieces? Of course, it can be useful to separate the format pattern from the arguments, see below.

In the {fmt} library you can additionally nest the specification, so that the width or precision can also be a runtime value, by writing "{:{}}", value, width (the value comes first, followed by the arguments for its nested fields). Nice, but then the form looks like this:

int valp = 8, valw = 4, confp = 6;
print(output, "value: {:0{}.{}} confidence: {:.{}}\n", value, valw, valp, confidence, confp);

PFFF, REALLY? And can you really quickly figure out here which witches watch which watch? With the iostream manipulator style plus the helper fmt function, I’d have this instead:

int valp = 8, valw = 4, confp = 6;
print_to(output, "value: ", fmt(value, setfill('0'), setw(valw), setprecision(valp)),
         " confidence: ", fmt(confidence, setprecision(confp)), endl);

When I call setw(N) or setprecision(N), I can also use a runtime value, and these functions could also be overloaded as constexpr if need be. All I need to write is fmt(value, setw(width)). And it doesn’t require a format string that looks like expressions in INTERCAL. This is evidently nothing more than a pile of sorry workarounds to allow what comes naturally with function-like formatting tags.

Given that this has been added to the C++20 standard, I’m really wondering what kind of guys sit on the C++ standard committee today. I really think that I am one of just a few of the youngest C++ programmers, even though I am barely 50: one of those who learned C++ first, had their first contact with console printing through iostream and cout, and only later learned that “in that old, outdated C language there was some ugly printf, but no sane person uses it anymore”. That was in the ’90s, and the last thing I expected to see in 2020 was some epigone reviving the dead printf horse in new tatters. Even more astonishing is that no one seems to perceive how counter-productive it is, in general, to pre-specify a format string to be filled with a subsequent, or preselected, list of arguments. Actually, the only thing that the creators of the {fmt} library can be proud of is that they have achieved excellent performance by having the format string interpreted at compile time. Nice, but somehow that reminds me of C programmers who declare that they will never write software in C++.

Note that I never said that a format string with tags is useless in general. It is useful, but only in a form where it is a feature of its own and not mixed with formatted printing, and even if formatted printing is also used, it happens in a different place. This is what I once wrote myself when I had to print some simple JSON (I didn’t know about the {fmt} library at that time):

    out << TemplateSubst(
            R"({ "level" : %level, "id" : "%name", )"
                R"("first" : {"time" : "+%basetime", "offset" : %baseoffset}, )"
                R"("last" : {"time" : "+%lasttime", "offset" : %lastoffset}, )"
                R"("details" : "%details", )"
                R"("repeated" : %repeated })",

            "level", fmt(Rep::level(firsterror.value)),
            "name", Rep::name(firsterror.value),
            "basetime", fmt(rel_basetime),
            "baseoffset", fmt(ers.baseoffset),
            "lasttime", fmt(rel_lasttime),
            "lastoffset", fmt(ers.lastoffset),
            "details", ers.details,
            "repeated", fmt(ers.repeated));

The very same thing could be done using the fmt::arg facility together with the formatter:

   fmt::format(
            R"({{ "level" : {level}, "id" : "{name}", )"
                R"("first" : {{"time" : "+{basetime}", "offset" : {baseoffset}}}, )"
                R"("last" : {{"time" : "+{lasttime}", "offset" : {lastoffset}}}, )"
                R"("details" : "{details}", )"
                R"("repeated" : {repeated} }})",
            fmt::arg("level", Rep::level(firsterror.value)),
            fmt::arg("name", Rep::name(firsterror.value)),
            fmt::arg("basetime", rel_basetime),
            fmt::arg("baseoffset", ers.baseoffset),
            fmt::arg("lasttime", rel_lasttime),
            fmt::arg("lastoffset", ers.lastoffset),
            fmt::arg("details", ers.details),
            fmt::arg("repeated", ers.repeated));


That could be tempting, especially because fmt::format also gives me the possibility of compile-time interpretation of the format string and a nice speedup this way. But from the interface and visual presentation perspective, the {fmt} version still has disadvantages for me:

  • It requires an additional explicit fmt::arg specification, just because in the general case there can also be direct values here (and the fmt calls in the TemplateSubst version are needed only because it is written in a simple way and handles only strings)
  • If I require a format specifier for a particular value, I must specify it in the format string, not in the argument tag declaration
  • The use of {} in this formatting is very unfortunate for JSON: it clashes with the braces used by JSON itself, requires them to be escaped by doubling, and makes the whole format string messier than the version with %tag specifiers

You may say that this last problem is a JSON-specific unfortunate case, but the truth is that JSON is currently one of the most popular data interchange formats, and various other formats with hierarchical structure also use braces (among popular formats, only YAML is a notable exception). I also know that % can be problematic for other formats, but this can be easily solved by making this character customizable (also as a begin-end pair), which is easy if you have a facility whose only purpose is to fill the tagged text with replacement values. The pursuit of a single multi-purpose solution, where the two features would be at least as easy to use separately, is what has created these limitations.
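To illustrate how little it takes to make the tag markers customizable, here is a sketch of a substitution function that takes the begin-end pair as parameters. The name fill_tags and the "$(" / ")" markers used in the usage note are assumptions made for this example. This is not the TemplateSubst shown earlier:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: replaces every open+name+close occurrence in `pattern`
// with values[name]. Unknown tags are kept as written. The markers are
// ordinary parameters, so they can be chosen to suit the target format.
std::string fill_tags(const std::string& pattern,
                      const std::map<std::string, std::string>& values,
                      const std::string& open, const std::string& close)
{
    if (open.empty() || close.empty())
        return pattern;                       // nothing sensible to search for

    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t start = pattern.find(open, pos);
        if (start == std::string::npos) break;
        const std::size_t name_begin = start + open.size();
        const std::size_t name_end = pattern.find(close, name_begin);
        if (name_end == std::string::npos) break;   // unterminated tag: stop

        out.append(pattern, pos, start - pos);      // text before the tag
        const std::string name = pattern.substr(name_begin, name_end - name_begin);
        const auto found = values.find(name);
        if (found != values.end())
            out += found->second;
        else
            out.append(pattern, start, name_end + close.size() - start);
        pos = name_end + close.size();
    }
    out.append(pattern, pos, std::string::npos);    // remaining tail
    return out;
}
```

With "$(" and ")" as markers, a JSON pattern such as { "level" : $(level), "id" : "$(name)" } needs no brace doubling at all, and a format in which those markers clash can simply pick another pair.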

Just to sum up

So, again: for me, the general case of formatted printing in C++ is specifying the values to be concatenated one after another. An interpolated string would be ideal, but in C++ that is as much as you can count on. The format string facility provides some interesting features that can sometimes be useful, but these are rare corner cases, while in most cases the use of a format string is counter-productive for software development.

Therefore I value solutions where you have multiple options at hand, not just one, and certainly not just the one that is the worst choice for the majority of cases. Adding support for formatting tags and for formatting functions for intermediate calls to the C++20 format library could solve this problem.
