The risks of protection

No action can be 100% exempt from risk. On the one hand, actions have effects in the world, where causal chains are changeable, hugely complex and not always obvious, so it is impossible to rule out with absolute certainty that some unintended consequence will manifest itself. On the other hand, each action has a certain opportunity cost, meaning that there are other things that cannot be done (or at least not at that precise moment) because of the original action.

Among the actions with the highest potential for trouble are those related to safety and security. They are intended to prevent "bad things" from happening, from a house burglary to your carpet catching fire, but precisely because of that, in the best scenarios you never get to see them in action. It is very hard to know whether your house never got burgled because of your countermeasures or simply because the crooks in your neighborhood did not happen to set their sights on it. On the other hand, whenever these measures fail, things tend to go south really fast, which is why the temptation to "overdo" safety is quite appealing. But, as I have mentioned in the past, there is a point of excess for everything, and preventive measures are no exception. For illustration purposes, just check this video channel on YouTube displaying the problems that a devoted techie has when his devices start to act up.

Photo: Martin Cathrae

But coming back to the real world, preventive measures are not only advisable but even essential in many aspects of life. Many years ago I took part in a personnel selection process at a chemical company where they made a very strong case for the safety regulations on the production floor: neckties were forbidden, as were necklaces, long hair and beards, because of the risk that a fast-moving component could catch on them and harm a worker who would otherwise escape unscathed; the same applied to rings, which have the potential to amplify the harm experienced by manual workers; but the flagship commandment was to always use the handrail when walking the stairs, which had the corollary that bulky objects (requiring both hands to be carried) could not be transported up or down the stairs and, instead, had to be moved in a suitable carrier (with the assistance of a forklift or equivalent). I mention these measures not because they sound foolish to me, but because, being completely reasonable, they undoubtedly come at a cost for the company: to say the least, it has to allow the workers some time at the beginning of their working day to remove their dangerous objects, as well as establish protocols to ensure compliance. Similarly, the company has to provide rolling carts and scissor lifts to give their workers the means to fulfill their duties while following the standards. And even with full compliance, safety was not 100%, because there are many situations that cannot be foreseen: even if the most common types of accidents can be prevented once you have seen them happen, it is very hard to prevent something that has not happened before, and fortune never misses a chance to surprise us.

In space the situation is not much different: instruments are subject to an extremely harsh environment, and the operators only have short windows in which they can monitor the performance of the observations and take corrective action if necessary. This is the reason why one of the most important requirements in many space missions is that every system shall be able to survive if the operators do not manage to get in touch with it for two weeks. This might seem like an awful lot of time, but in the end most space operators are limited to one 8-hour contact per day (if any), and if problems arise in the ground station, several days can go by before a solution is found. Luckily, even the most sensitive devices can normally be powered down, even if this requires a relatively long and convoluted process, one that the spacecraft can perform on its own.
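The two-week survival requirement can be pictured as a simple contact watchdog. This is just an illustrative sketch, not the actual flight software; the class and method names are made up:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the survival requirement: if no ground
# contact is registered within the two-week window, the system
# falls back to a safe, powered-down state on its own.
SURVIVAL_WINDOW = timedelta(days=14)

class ContactWatchdog:
    def __init__(self, now):
        self.last_contact = now

    def register_contact(self, now):
        """Called whenever a ground contact succeeds."""
        self.last_contact = now

    def check(self, now):
        """Return 'safe_mode' once the survival window has elapsed."""
        if now - self.last_contact > SURVIVAL_WINDOW:
            return "safe_mode"
        return "nominal"
```

The point is that the decision is taken on board, from the spacecraft's own clock, without any intervention from the ground.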

The other frequent wish of the scientists (please note that it is a wish, not a requirement) is that failures should not propagate down the plan: if something prevents the instrument from performing an observation at 8 a.m., it is normally OK to skip that one, but it is important that, when the time comes for the next observation (e.g. at 12:30), the instrument still tries to start it even if the previous one failed. Of course, if the conditions for both observations are the same, the chances are that the second one will fail too, but the key is that the failure of the second is linked to its own conditions, not just to the fact that the previous activity failed.
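The failure-isolation idea above can be sketched in a few lines. The names (`execute_timeline`, the observation dictionaries) are invented for illustration, not taken from the real planning software:

```python
# Hypothetical sketch: each observation is attempted independently,
# so a failure at 08:00 does not stop the 12:30 attempt.

def run_observation(obs):
    """Attempt one observation; report whether it succeeded."""
    try:
        obs["action"]()
        return True
    except RuntimeError:
        return False

def execute_timeline(timeline):
    """Attempt every observation regardless of earlier failures."""
    return {obs["name"]: run_observation(obs) for obs in timeline}

def broken():
    raise RuntimeError("filter wheel not powered")

timeline = [
    {"name": "obs_0800", "action": broken},
    {"name": "obs_1230", "action": lambda: None},
]
results = execute_timeline(timeline)
# results -> {'obs_0800': False, 'obs_1230': True}
```

The 12:30 observation is still tried, and if it fails, it fails on its own merits.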

Our current instrument is very complex: there are a few nominal modes of operation which determine which parts have to be switched on and which ones can stay off, but manual configuration is also allowed. The problem with manual configuration is that you can command the instrument to perform a task with the wrong systems powered on, and the instrument might not notice until it is halfway through the procedure: it would be unable to complete the task nominally, and it cannot invent a recovery procedure of its own, so instead it remains stuck. The problem with this situation is that any additional activities that could in principle be conducted fail as well, because the instrument is still waiting for someone to get it out of an unexpected situation.

The way we normally work around this kind of problem is that we let the instrument verify its own chances of completing the task. This means, in particular, checking that all the necessary systems are powered up and properly configured. And that is a very robust solution for the instrument in flight: the number of operational errors has come down over time, but more noticeably, the number of critical errors has gone down to virtually zero because, even if the operators fail to configure the instrument properly, it is able to detect the error and just skip the impossible activity. In summary, the protection measures are doing their job adequately.
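A minimal sketch of this self-check, with made-up task names and subsystem sets (the real instrument's tables are of course different):

```python
# Hypothetical mapping from task to the subsystems it needs.
REQUIRED_SYSTEMS = {
    "take_image": {"detector", "optical_filter"},
    "calibrate": {"detector", "calibration_lamp"},
}

def try_task(task, powered):
    """Verify the preconditions first; skip the activity cleanly
    instead of getting stuck halfway through the procedure."""
    missing = REQUIRED_SYSTEMS[task] - powered
    if missing:
        return f"skipped {task}: missing {sorted(missing)}"
    return f"completed {task}"
```

With the check in place, a mis-configured instrument reports a skipped activity and stays available for the rest of the plan, rather than hanging mid-procedure.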

However, when you are working on the ground, developing additional software with only partial copies of the instrument, this safety becomes a problem. Previously it was not a problem if one of the development units did not have an optical filter: the on-board computer would send a command into the blue yonder, skip over the fact that nobody replied to that command and complete the activity to the best of its abilities. It goes without saying that images taken without the filter are not going to be of scientific value, but that was not the point. We just intended to demonstrate that the software was safe to operate and that, when used on a fully featured instrument in space, it would provide the intended images. But now, with the newly introduced safety checks, we are faced with the choice between hacking the code so that it does not use the non-existent filter or just stopping the testing. This is a case of the protection measures being an obstacle.

Later on I checked with Martin and he confirmed that the model should have a filter, so probably the plug was disconnected accidentally when moving parts around. Still, one thing is clear: every action carries a certain amount of risk and we have to be ready to live with it. We could always choose not to do anything, keep the instrument off and thus make sure that we do not make any mistakes at all, but that is not why we are doing this job. Have a nice evening.
