Defensive vs. Paranoid Programming

0. Introduction

I have encountered many flavors of programming style in my experience. I have already written about things like ridiculous checks for NULL, as well as how “underdocumented” many libraries are. Among the many positively perceived styles of programming there is one important one: Defensive Programming.

I have also already written about how checking for errors all the time, and “doing something” in response, results in code that is never really tested, and when the situation it was written for finally occurs, it fails to do anything that would help solve the problem.

Being defensive covers a lot of aspects and analysis. Simple-minded programmers use a highly simplified vision of it, and the result is something that the next reader of the source code (especially someone more experienced) perceives as “paranoid programming”. Simple-minded, because the code looks prepared for every least kind of error that might occur during operation, while its author has forgotten what they are actually trying to prevent: some “predictable” runtime situation, or a completely unpredictable one, that is, a bug. Well, it would be really nice to write code that prevents a bug, but well… how can I put it… 🙂

1. A gun in a kid’s hands

The underlying layers of software – the kernel, the operating system, the libraries – not only save you from writing overly complicated low-level code; they are also supposed to provide basic reliability (so that you don’t have to provide it yourself).

If you want to program at the lowest level, dealing directly with the hardware, there are still some domains in which you can do it – but you’re rarely allowed to do that in today’s software. In practice it happens when you are developing the operating system itself, the kernel, or drivers, but not when you have to provide a business solution. It’s not that there is some law behind it. It is, to all appearances, a matter of trust. You can say that you don’t trust the underlying software layers to work without bugs – but then why should the customer trust you? In particular, when you replace an existing lower layer with your own code, do you really think the customer will trust you more than somebody whose library is used by multiple software projects all over the world?

If that’s so, then why do you worry whether the library is written properly, and whether your software will behave correctly in case of a possible problem in the library? Do you think that your software will be free of bugs? Most likely, if you replaced some library code with your hand-written code, you would at best make the same number of bugs, and usually more.

2. What is really defensive

When some part of the software you’re using – it can even be some other part of software you’ve written yourself – may result in an error, the first and foremost thing to check is at which level a possible problem may occur. The mere fact that a function returns a special error value or throws an exception (the latter especially if you write in Java) doesn’t yet mean that this is something you should take care of. Some procedures return an error value because in some contexts they may be used at a very low level, or the API simply allows this function to be implemented in a very unreliable environment as well as in a completely reliable one. So, if the error that would occur in your particular case is something that your underlying software layer should never report, treat the case when it did as something that happened outside of normal conditions.

A short example from POSIX: the close() function, which closes a file descriptor, may return -1 in case of error. But practically it happens in one case only: the descriptor refers to a device open for writing, writes to it are buffered, and there is still some data in the buffer waiting to be written to the device; it is going to be written when the buffer fills up after a subsequent write, on demand when you call fsync(), or – as in our case – before disconnecting from the device, when you call close(). When writing the buffered data to the device fails, close() returns -1. But of course, it still closes the descriptor! This -1 doesn’t mean that “closing failed” – closing cannot fail! It means, rather, that if you anticipate a possible failure, you’d better flush the buffers yourself with fsync() and handle the error while you still can do something about it.

So, the condition for close() to possibly return failure is that the file is open for writing, it uses buffering (which it usually does), and there is still some data in the buffer waiting to be written. It means, simply, that if you call fsync() first (and it succeeded), then your subsequent close() will always return 0 – there is no possibility for it to do otherwise. The same holds when the file is not open for writing at all (open for reading only, for example) – in this case close() will always return 0.

Of course, if you read the man page for close(), you’ll see that it’s a common error to ignore its return value – but, at the same time, close() in /usr/include/unistd.h is somehow not declared with “__wur” (glibc’s shorthand for gcc’s __attribute__((warn_unused_result))), as read() is, for example. That’s because there are only specific situations in which close() may return an error.

So, there can be two completely different reasons why close() would return -1. The first is that the final buffer flush failed. But if you make sure this cannot be the reason, then a -1 from close() means something completely different: it simply means that the operating system has gone out of its mind. There are some theoretical possibilities for this to happen – for example, the file has been reconfigured as open for writing although you didn’t request it. I say “theoretical” because I can imagine it happening. The only catch is that the system’s basic reliability guarantees state that this will never happen.
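A minimal sketch of this approach, on a plain POSIX file descriptor (save_and_close() is just a made-up name for illustration):

#include <unistd.h>   // write(), fsync(), close()
#include <cstdio>     // std::perror()
#include <cstddef>

// Push the data out and flush it while we can still react to an error,
// so that the final close() no longer carries any error information we care about.
bool save_and_close(int fd, const char* data, size_t len)
{
    size_t done = 0;
    while (done < len) {                               // handle partial writes
        ssize_t n = write(fd, data + done, len - done);
        if (n < 0) {
            std::perror("write");                      // a real failure of the requested operation
            close(fd);
            return false;
        }
        done += static_cast<size_t>(n);
    }
    if (fsync(fd) < 0) {                               // flush kernel buffers now, while the error is still actionable
        std::perror("fsync");
        close(fd);
        return false;
    }
    close(fd);   // nothing left to flush; a failure here would mean the system went out of its mind
    return true;
}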

Another example: you’re opening a file. Wait. Not opening yet. First you check the file using the “stat” function. You confirm that the file exists and that you have the rights to open it, so opening this file should succeed. So, you open the file right after making sure of that. If the opening function reports an error, it can now be one of two possible things:

  • the file has been deleted (or its rights changed) in the meantime
  • there was some very serious system error

If you want to distinguish these cases, you can call “stat” again. If the second call to “stat” confirms that the file has indeed been deleted or its rights changed, you can treat it as if it simply didn’t exist. Otherwise we have a case of system error.
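A minimal sketch of this distinction (POSIX; the OpenResult classification and the function name are made up for illustration):

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

enum class OpenResult { Ok, Gone, SystemError };   // made-up classification

OpenResult open_checked(const char* path, int& fd)
{
    struct stat st;
    if (stat(path, &st) < 0)
        return OpenResult::Gone;          // it didn't exist (or rights were missing) even before we tried

    fd = open(path, O_RDONLY);
    if (fd >= 0)
        return OpenResult::Ok;

    // open() failed although stat() said it should succeed: check again to distinguish.
    if (stat(path, &st) < 0)
        return OpenResult::Gone;          // deleted or rights changed in between
    return OpenResult::SystemError;       // the layer below broke its own guarantees
}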

That’s not all. This file may mean various things. It might be, for example, a file like “/etc/passwd” or “/lib/ld-linux.so.2” on Linux – a file that must exist, or else there is some serious system damage.

The point is that the fact that the file cannot be opened can mean different things in each particular case, so there can be no universal advice on what you should do when the file can’t be opened. On one hand, you can state that this condition, combined with some other call, is the only thing that can give you the full information. On the other hand, why should you care? If the file was vital for your program to continue, you have to crash. If not – just state quickly that the file can’t be opened and continue without it. You may want to log this to a log file, you may want to display this information to the user – but whether to do so should also be decided explicitly, by knowing what the file really means to the program. None of these things has to be done “always”. Even if you determined that this is a serious system error – especially if that is only one of the possibilities – displaying it to the user is the wrong thing to do. For the user it doesn’t matter whether your program crashed because the file /x/y/z did not exist, or because it was killed by a SIGSEGV signal, or whatever else you’d like to say.

Being really defensive means, first, realizing at what level you are. Once you realize it, you can decide whether the runtime condition that failed is just one of the alternative ways for the runtime to work, or something that should not occur at all. In the second case you should crash. But there is also a third case.

3. The Unpredicted

There’s always one more group of error handling: checking for a condition that is declared explicitly to never occur.

For example: there is a function which returns a pointer to an object. The objects are collected in an array, and there is always an object at position 0. For indices greater than 0 that exist in the array, the object at that position is returned; otherwise it’s the object at position 0.

This function, then, has it in its definition that it never returns NULL. And it really never does, by construction. What’s the point, then, in checking whether the object returned by this function is NULL?
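A minimal sketch of such a function (Registry and its method name are made up for illustration):

#include <cstddef>
#include <vector>

struct Object { /* ... */ };

class Registry {
    std::vector<Object> objects;       // invariant: there is always an object at position 0
public:
    Registry() : objects(1) {}         // create the default object at position 0

    // Never returns NULL by construction: it falls back to the object at position 0.
    Object* get(std::size_t index) {
        if (index < objects.size())
            return &objects[index];
        return &objects[0];
    }
};

// A paranoid caller would still write:
//   if (registry.get(i) == nullptr) { ... }   // checking for what the contract already excludes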

The usual explanation is more or less: you never know. But what does that mean? Does it mean that this is an expected situation, and you are checking for one of the possible results of the function? Or do you rather expect a situation that is explicitly declared to never occur? Of course, it’s the second one. And if so, there can be only one reason for it to occur: the program has gone out of control. So, effectively, what you’re trying to do is to keep under control the situation of the program getting out of control.

Do you see now how absurd this situation is? And, to be clear: no, it does not put the program back under control. If in any context and under any basic conditions you predict that the program can go out of control, then it could have happened in virtually any part of the program. In short, virtually anything may happen. I say “virtually” because the only things that may really happen are problems at the machine level, coming from the compromised reliability of any of the software layers. Anyway, even if we shrink this “virtually anything” to what is really possible, it is still a situation in which the program went out of control. If you consider that any of the things that should never happen may have happened, then the number of degrees of freedom in any of today’s application software is so great that such a silly null-check will decrease the probability of a crash by… less than 1%. And that of a probability which is already very small, if only because the situation is explicitly qualified as “should never happen”.

Some people say that even this may help in some extreme situation. Oh, really? Then why don’t you quit your job and gamble instead? Why do you waste your time on something that can decrease the probability of a bug by less than 1%, instead of defining the full state conditions and planning tests based on reduced degrees of freedom and full-coverage path traversal? Why don’t you simply focus on bugs that are much more likely to occur?

4. The game and the worth of the candle

If you think that you have some situation in which there is a certain probability of an unexpected result, and you need a tool to catch it at the development stage, use asserts. This is exactly the tool designed for this purpose. But remember: assertions are only meant to work when you run your application for testing, in your incubator, in your environment, compiled with special debugging instrumentation. They are not meant to work in a production environment. Not because they may crash your program – only because they may harm the performance of your software (that is, unless you were going to check the same condition manually anyway).
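In C and C++ this is the standard assert() macro, which compiles to nothing when NDEBUG is defined for the production build – a minimal sketch:

#include <cassert>
#include <cstddef>

int* find_slot(int* table, std::size_t size, std::size_t index)
{
    // A development-time check of our own invariant; it disappears entirely
    // when the code is compiled with -DNDEBUG for production.
    assert(index < size && "index must have been validated by the caller");
    return &table[index];
}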

But… well, there can be a situation where there is more than a very small probability of an error in a crucial resource! If so, you should crash.

Come on, this condition check doesn’t cost me much focus, doesn’t cost much work, it’s just a simple instruction – and even if it gives me only a negligible advantage, it’s still better to have it than not to have it…

Really? So, if this is only a simple instruction to write, you probably don’t test this code, do you? Maybe the condition check is a simple instruction to write. But this instruction makes an additional situation potentially happen in your code. It’s not just the condition. It’s the condition, plus all the actions that can be taken when it’s satisfied, treated globally, together with all the aftermath. This is not just a side path of the program. It is usually a serious problem, and if you want the program to be prepared to handle this situation and recover from the problem, it gets really complicated – and as such it must be very thoroughly tested: the program must be run several times in a controlled environment in which you can make this problem appear to happen (at least from the program’s point of view), to see how it recovers and whether it does so properly. Because, certainly, you’d like to prevent a crash of the whole application. You don’t write your error handling just so that it causes another error, do you?

So, ask yourself first: will your program really be able to recover from this error, and are you able to test whether it really does? If your answer is anything other than “yes, certainly” (even if it’s “most probably yes”), doing anything other than crashing is a waste of time and resources. Now ask yourself additionally: this whole cost of preparing the recovery is paid only to prevent a problem that can occur with a probability similar to that of your customer being killed by a bomb explosion while using your software – is it worth the effort?

And again, I remind you: it’s not that “bugs that are too improbable should not be prevented”. It’s that “bugs cannot be prevented” – not by a runtime condition check, anyway. If something could have happened only through a very, very unlucky bug, you can’t do much about it. I say “can’t do much” because you can lie to yourself that you can, and you may even do something – but there’s still only a very small probability that the action you take will repair it. With much greater probability this action will cause many more problems than the one you have caught (that’s why it’s usually better to crash).

5. Unforeseen and unpredictable differ in level

Defensive programming is a method of programming that prevents problems that may occur as part of the particular way the software is used. It has absolutely nothing to do with problems that may occur due to the unreliability of an underlying software layer (well, actually we should say just “underlying layer”, because it’s not only about software – hardware can also cause problems, and if not the hardware, then the power supply or the like). Defensive programming simply keeps the program from going out of control, relying on the fact that things declared as “should never happen” really never happen; at most, things that users “should never do” are nevertheless often done. Defensive programming is simply “doing your homework” and writing your software decently, applying this rule over the widest possible range.

Defensive programming is simply doing things within the range of declared possibilities for how a function may behave – not guarding against a situation in which the function did something that isn’t described in its documentation, or in which the system fails to adhere to the rules it guarantees to honor.

I’m pointing this out because there are many developers who think they are doing Defensive Programming, or just making their programs safe, while actually they are doing Paranoid Programming. This doesn’t make the program any better or safer. It doesn’t make the program have fewer bugs or be harder to destroy when a bug occurs. Actually, there are just two practical results that Paranoid Programming can have in software production:

  • no observable value added at all, just time wasted on writing and possibly testing (if you’re lucky!) the useless code; sometimes even performance wasted on doing useless checks
  • making the program harder to test properly and therefore to wipe bugs out of; sometimes it may even increase the vulnerability of the software

I’ll try to describe the second one in more detail.

6. Preventing self-destruction prevents self-destruction

Do you know what the purpose of a “self-destruct” button in a military device was? I admit I know it more from science fiction than from real life, but it still had a purpose: to protect the information and the resources, so that the enemy cannot use them to their advantage (and our disadvantage at the same time).

It’s more than obvious that no one is going to press the self-destruct button in daily use. That would be the last day, of course. It is meant to be used in the ultimate situation, in which all possible protection, recovery, defense, and whatever else can be undertaken has failed, and there is a serious threat that the enemy will reach our resources and use them, which will make them stronger, or make us more vulnerable. Even though we will be in serious trouble when we destroy our ship, the whole organization may be in much tougher trouble if the enemy gets access to our resources. In the case of software it’s not even a matter of security: a program running in an environment whose resources contain false data can do completely crazy things and can potentially destroy other, valid data.

That’s why preventing a crash is a kind of “software pacifism”. Preventing a crash achieves nothing but preventing a crash. If someone blocks the self-destruct button from ever firing, they do not “prevent the ship from being destroyed” (the ship can still be destroyed in many other ways); they only prevent an intended self-destruction in the exceptional case when we actually want it.

This is maybe not the main purpose of checking runtime conditions, but it always happens when somebody is doing “Paranoid Programming”. The biggest problem is that when some function is called which returns a fail-prone value (not “which may fail” – it’s enough that it returns a pointer), the first thing they think about is error checking. I don’t say “error prevention”; it’s only error checking. Having checked that the returned pointer value is NULL, they at best print something to the log and return NULL. What, our function doesn’t return a pointer? Ah, it’s an integer? Good, let’s return -1. -1 is still a valid return value? Then let’s return, hmm… maybe INT_MAX? And I haven’t even mentioned the question “isn’t it the case, by the way, that the function we called already guarantees that it never returns NULL?”.
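A minimal sketch of this “shut up” style (the names are made up, and this is the anti-pattern, not a recommendation):

#include <climits>
#include <cstdio>

struct Config { int entries; };

Config* parse_config(const char* path);    // hypothetical parser, defined elsewhere

// Each layer just invents a "special" value and pushes the problem up
// to callers that are not prepared for it.
Config* load_config(const char* path)
{
    Config* cfg = parse_config(path);
    if (cfg == nullptr) {
        std::fprintf(stderr, "config is NULL\n");   // "print something on the log"
        return nullptr;                             // the problem is merely forwarded, not handled
    }
    return cfg;
}

int count_entries(const Config* cfg)
{
    if (cfg == nullptr)
        return INT_MAX;                             // a "special" value nobody ever checks for
    return cfg->entries;
}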

The real problem is not that this is such a “shut up” method of error handling. The real problem is that if the error handling for this function has only been INTRODUCED this way, the rest of the software using this API is probably not prepared to handle this kind of error at all. Often the return value is checked in a different way, some kinds of errors are not caught, or the return value is simply ignored. It happens very often.

7. The program writes you

The “metaproblem”, let’s call it that, is that people often do not “design” their programs. I’m not talking about “doing a complete design before writing”. I’m talking about designing at all: just before you write a single function, spend some time thinking about what resources you’ll be using, for what reasons the accessors of those resources may fail, and what a particular failure means for you.

I have a feeling that some people do not “write their programs”; they rather “discover them”. They just have the intent of the function in their head, and then they write what they think should happen in the optimistic scenario. Whenever they call a function that carries some notion of possible failure, they automatically put in a condition check, effectively discovering this part of the function as they go. The function isn’t being written – it’s being discovered. At various points some failure condition urgently pops up, and they just swat it with a condition check, the way you swat a fly.

I don’t always write functions this way – but when I do, I do no error handling at all! I write the function with the optimistic scenario only, and that’s all. Later, when I confirm that the function in this form makes sense and works as it should (by testing the positive scenario only), I add appropriate, well-thought-out error handling. This is what I do when I have no idea at all how something should work and I must code it up to see it before my eyes and help my imagination – and this is not always the case. Usually I know which functions I should use to get access to the other resources, and what their failure conditions are; having that, I can at least plan the API so that the appropriate error handling conditions are preserved before the function is written. In particular, I know which kinds of failure should be propagated and which should be handled otherwise.

And some programmers probably also think that they are good coders when they write complete code “in one shot”. That is, they’ll write a function that from the very beginning contains everything it needs: the business logic, the error handling, possibly also logging (but no, of course not the comments!). And once it’s written, it works completely correctly and needs no fixes in the future. Well, if there were people who could do that, they wouldn’t even have a chance to express themselves on internet forums, because they would quickly be snapped up by some software company to write bugless code for them, before anyone else could get to them.

So, another rule: making the function logically complete is much more important than any error handling – and error handling is something you can always do later. This way, when you finally do it, you are really focused on error handling and on nothing else, so you can think it through and be more certain of not missing any part that requires error handling – more certain, at least, than if you wrote the error handling on the first pass.

Before you write any error handling, you must first (!) – yes, first, that is, before you add any error handling instruction for any of the functions you call inside – make sure that your function is prepared to PROPAGATE the error, or that the error means an ALTERNATIVE PATH for your function itself. I can tell you that the second case is very rare – usually an “I can’t do what you ask me to do” response from a callee means that your function is also unable to perform the task it was written for. And that means your function must report it accordingly. Propagation also means that every other part of the code that calls your function must be prepared for the fact that it may report failure, and must handle this failure properly. I assume you write your functions before you use them, or at least you plan a function’s API completely before you start using it.
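A minimal sketch of what propagation means in practice (the function names are made up; std::optional stands in for whatever error-reporting convention your API uses):

#include <fstream>
#include <iterator>
#include <optional>
#include <string>

// A callee that may legitimately fail, and says so in its return type.
std::optional<std::string> read_file(const std::string& path)
{
    std::ifstream in(path);
    if (!in)
        return std::nullopt;                        // "I can't do what you ask me to do"
    return std::string(std::istreambuf_iterator<char>(in), {});
}

// The caller PROPAGATES the failure: if the file can't be read, this function
// cannot perform its own task either, so it reports that accordingly.
std::optional<std::string> load_greeting(const std::string& path)
{
    auto text = read_file(path);
    if (!text)
        return std::nullopt;                        // propagate; don't invent a fake value
    return "Hello, " + *text;
}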

Only this is defensive programming.

8. This function may, but need not, end up with a failure

And the most important thing: never, never ever do things like handing a function to someone else (including distributing a library) and saying in the documentation something like “this function returns XYZ in case of success and -1 in case of failure” (or NULL, or it throws an exception in case of failure). If you are considering using some open-source library which has something like that in its documentation, try to find better documentation for it (it happens that such documentation exists!), and if it cannot be found, then – seriously – don’t use the library and find an alternative. If there’s really no alternative and you must use this library, give it to somebody (or do it yourself) to reverse-engineer the source code and find out the exact conditions under which failures may occur, then prepare good documentation out of these research results. If you don’t keep an eye on this, you won’t be able to know what it really means for you that the function has failed (and this is key information for planning the error handling). You’ll then handle every erroneous value the same way, by abandoning the current path completely. This is the straight road to paranoid programming.

If you are using exceptions to handle problems, you are actually in a better situation – because you’re certain that either your caller will catch the exception and secure the program against the bad behavior of your function, or the program will crash. At least your function will not be responsible for making a mess of the program. The only problem is that, occasionally, your function might have been called just to check whether something works or not. That’s why exceptions should be used only in the following cases:

  • when it concerns a condition that the function explicitly requires to be satisfied before it is called, and the caller really could have ensured it
  • when the function could not continue and provide the expected results, and reporting this by an exception was explicitly requested (by a flag, or by an alternative version of the function)

If exceptions are used in any other way, and especially when the language uses checked exceptions (Java, Vala), exception handling is practically nothing more than a different (and usually awkward) way to handle error-signalling return values. They are even easier to ignore. It’s very simple to catch all exceptions with an empty block at the end and… continue the program, just as if all the results the function needed to continue were already available.
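In C++ the same anti-pattern looks like this (a sketch; parse_number() and handle_request() are made-up names):

#include <string>

int parse_number(const std::string& text)
{
    return std::stoi(text);   // throws std::invalid_argument or std::out_of_range on bad input
}

void handle_request(const std::string& input)
{
    int value = 0;
    try {
        value = parse_number(input);
    } catch (...) {
        // swallow everything and carry on, as if 'value' were already available
    }
    // ... the code below happily uses 'value', which may be a meaningless 0 ...
}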

9. Catch me! If you can…

It’s very good when assertions are used in the code and can then be used to trace cases where something goes wrong. It’s very bad, though, when you don’t even start to think about what the cause of such an error might be. The fact that you can catch some erroneous situation doesn’t mean that you have to.

Let’s take an example:

ptr = new Object;
assert( ptr != 0 );

Of course, let’s assume that overloading operator new is not allowed in this code. What exactly are you checking in this particular assert() call? Are you checking whether you made a mistake in the program? How can this pointer ever be null, if the definition of operator new explicitly states that it never returns null (it throws std::bad_alloc on failure instead)?
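For reference, the standard contract looks like this – only the nothrow form may ever return a null pointer, so only there does a null check mean anything (a minimal illustration):

#include <new>

struct Object {};

int main()
{
    Object* p1 = new Object;                    // throws std::bad_alloc on failure; never returns null
    Object* p2 = new (std::nothrow) Object;     // only this form may return nullptr
    if (p2 == nullptr) { /* only here is a null check meaningful */ }
    delete p1;
    delete p2;
}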

What? You might happen to come across a non-compliant compiler? OK, I guess you are really serious, so if you are uncertain about the compiler’s compliance, create an additional small test program containing various instructions intended to test the compiler’s standard compliance. This program will be compiled and run before compilation of the whole project even starts. Everything you have doubts about can be checked beforehand and confirmed to work. Once it’s confirmed, you don’t need any checks for it in your production code!

Of course, the compiler may turn out to be non-compliant and you may still want to use it. In this case you can provide some specific workarounds that solve some of the non-compliance problems, BUT, again: it is extremely unlikely that you’ll be doing that. First, because it’s hard to find a non-compliant and still in-use compiler that returns NULL from new. Second, because even if you find one, you most likely don’t want to use it (you’d rather use some C++-to-C translator, or some other language that is easily translated to C, and use a C compiler for that platform – it would be easier than dealing with the various non-compliances of the C++ compiler).

This is not the only case – there is a variety of possibilities here. Take even some of the things that are provided by libraries – like boost::shared_ptr. OK, in C++11 it’s standard, but I just happened to have a good example for it:

shared_ptr<Object> o( new Object );
assert( o.use_count() == 1 );

Think about it for a while. It’s not just that it’s stupid – that’s obvious, and it’s not the point. The point is that in this case you are not testing your code. You are testing the code of shared_ptr. Believe me, if there were ever any opportunity for this problem to happen, the guys from the Boost team would have caught it before you ever had a chance to see this code. And on the other hand, if there were any error like that in your third-party library, you couldn’t do much about it anyway!

What is the actual reason for writing assertions like these? Why do people do it, even though they may well know that these things are unlikely to happen? The only thing that comes to my mind: they would like to ease their conscience that they “did their best to prevent errors”. Or maybe not even that. They just “found something that is very easy to check”. Because this is easier, much easier, than doing good design, good path analysis, good unit and component tests. Because the latter is boring, while adding such checks is so easy and eases the conscience so much!

Paranoid programming is, in the end, not the result of misunderstood rules of defensive programming. It is exactly like all the other bad practices in programming: a lack of thinking and a lack of good work organization. Which makes programming not unlike any other domain of work.
