Investing time in good timing

It is remarkable how often the technical world and real life show a surprising level of cross-talk, producing lessons that apply in both directions. Today I would like to tell you how I happened to run into a surprisingly elegant solution that not only helped my work but also gave me an interesting hint for dealing with the world.

The problem I was trying to solve is rather common in autonomous systems, but not frequent enough that someone has built THE solution that is definitely correct and needs no further discussion. Our instrument has to take a long series of images at regular intervals over a long period, typically once per minute for 48 hours, while keeping the timing precise enough that the spacing is as uniform as possible. As is common in science, we do not have a hard requirement but only "do the best you can". On the one hand, this allows you to call it quits whenever you want (because you already did what you could); on the other hand, it provides no guarantee that your best effort will produce useful results, so the incentive to work harder is always there and there is no "finish line".

Photo: Mark Mathosian

Computer systems usually keep track of time through an internal counter called the "master clock", which essentially sets the rhythm for everything happening inside the computer: nothing can happen in less than a "beat" and some actions even take several beats to complete. This circumstance implies that more powerful computers tend to have faster clocks, so that they can perform more operations per second. In the case of our instrument, we have a fairly slow clock at 400 beats per second but we know that almost every single possible action takes just one beat, which is quite practical for synchronization purposes.

The problem with the master clock is that it cannot count infinitely high: as with any other type of counter, it has a fixed number of digits and, once it reaches the maximum, it simply "rolls over" and starts again from zero. For most tasks, which are executed immediately, this is not a problem, but when you have to do something over a long period of time, the chance that the master clock resets to zero in the middle of it is no longer negligible. In our case, the clock counts with 32 binary digits, which at 400 beats per second is approximately 11 million seconds, or roughly 124 days. This means that, running an observation for 48 hours, we have about a 1.6% chance (2 days out of roughly 124) of seeing the clock reset. While this might seem a very slim chance, we cannot risk having our computer crash because we were busy during a roll-over, so we had to do something.
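For readers who want to check the roll-over arithmetic, here is a minimal Python sketch. It only uses the two figures quoted above (a 32-bit counter and 400 beats per second); the names `TICK_HZ`, `COUNTER_BITS` and `rollover_period_seconds` are made up for this illustration and are not taken from the instrument software.

```python
TICK_HZ = 400        # master clock rate quoted in the text (beats per second)
COUNTER_BITS = 32    # counter width quoted in the text


def rollover_period_seconds(bits: int = COUNTER_BITS, hz: int = TICK_HZ) -> float:
    """Seconds a free-running counter of `bits` bits takes to wrap back to zero."""
    return (2 ** bits) / hz


period_s = rollover_period_seconds()
period_days = period_s / 86_400  # 86,400 seconds in a day

print(f"roll-over every {period_s:,.0f} s, about {period_days:.1f} days")
```

The same function shows how sensitive the period is to the clock rate: a faster clock with the same 32-bit counter would roll over proportionally sooner.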

My first approach was to consider all the possible combinations of where the clock could be at the time of the last observation and where it would be when the next one was due. The algorithm turned out to have nine different cases and I was not even sure that I had covered all of them, but that was (as usual) the best I was able to provide. When I discussed it with one of the guys in the team, he said plainly how nice it would be to have a "clean algorithm", but I argued that I did not have the time to build a solution from scratch, so I went ahead, implemented the complicated one and started testing. Luckily, the tests had to run for at least 12 hours to provide any level of confidence, and while I was waiting I had a sudden bout of inspiration and managed to put together the "clean" solution that my friend had hoped for. The tests of the complicated one went well, but we still decided to switch to the safer solution and test again.
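The post does not describe the clean solution in detail, so the following is not the algorithm that ended up in the instrument. It is a minimal Python sketch of one common way to avoid the case analysis, assuming a 32-bit counter at 400 beats per second: compute elapsed beats with a subtraction modulo 2^32, which gives the right answer whether or not the counter rolled over in between (as long as it rolled over at most once). All names here (`ticks_elapsed`, `is_due`, `next_deadline`) are hypothetical.

```python
COUNTER_MASK = 0xFFFFFFFF       # 32-bit counter, as described in the text
TICK_HZ = 400                   # beats per second, as described in the text
INTERVAL_TICKS = 60 * TICK_HZ   # one observation per minute = 24,000 beats


def ticks_elapsed(now: int, since: int) -> int:
    """Beats from `since` to `now`, valid across at most one roll-over."""
    # Masking makes the subtraction wrap around like unsigned 32-bit arithmetic.
    return (now - since) & COUNTER_MASK


def is_due(now: int, last_obs: int, interval: int = INTERVAL_TICKS) -> bool:
    """True once at least `interval` beats have passed since `last_obs`."""
    return ticks_elapsed(now, last_obs) >= interval


def next_deadline(last_deadline: int, interval: int = INTERVAL_TICKS) -> int:
    """Counter value of the next observation, wrapped to 32 bits."""
    # Stepping from the previous deadline (not from "now") keeps the spacing
    # uniform even if one observation starts a little late.
    return (last_deadline + interval) & COUNTER_MASK


# Example: the last observation happened 101 beats before the roll-over.
last = 0xFFFFFFFF - 100
print(is_due(23_898, last), is_due(23_899, last))
print(next_deadline(last))
```

The point of the design is that there is a single code path: the roll-over is absorbed by the modular arithmetic instead of being handled as a special case.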

Even with the "good" algorithm, in a small fraction of the cases the observation did not happen when it was supposed to, but up to one second later. The subsequent observation did make up the lost time, but around 5% of them were still off. Checking the timestamps of the images, I mostly saw a neat row: 01:15:41.255, 01:16:41.255, 01:17:41.255 and then, suddenly, 01:18:41.700! It was such a pity to see everything go so well most of the time that I decided I had to fix it, so I spent five days trying different combinations to identify the conditions under which the time was off. I even asked the instrument to report the time at which each phase of the observation took place, trying to find which phase was taking too long, but there was no obvious culprit.

Then, on one fortunate occasion, I saw that the second phase was reported as finished before the first phase. This did not make any sense at all, since the second phase simply cannot happen before the first one is complete. This led me to think that perhaps it was the time information that was wrong. Besides the master clock, which is a plain counter, the computer keeps a second clock that gives the current date and time, including minutes, seconds and milliseconds. What if this information had "jumps" and therefore could not be trusted? The solution was to ask the instrument to report the value of the master clock instead and, sure enough, the observations happened exquisitely on time (with 1/400th of a second precision). I still do not know why the date-and-time clock suffers these jumps, but at least I know that the observations get the right timing.
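The check against the master clock can be sketched in a few lines of Python. This is an illustration under assumptions, not the instrument's reporting code: it supposes that each observation logs the raw 32-bit counter value, and the function name `spacing_errors` and the sample readings are invented for the example.

```python
TICK_HZ = 400
EXPECTED_TICKS = 60 * TICK_HZ   # one minute between observations
COUNTER_MASK = 0xFFFFFFFF       # 32-bit master clock


def spacing_errors(tick_readings, expected=EXPECTED_TICKS, mask=COUNTER_MASK):
    """Deviation, in beats, of each interval between consecutive readings."""
    pairs = zip(tick_readings, tick_readings[1:])
    # The masked subtraction keeps the result correct across a roll-over.
    return [((later - earlier) & mask) - expected for earlier, later in pairs]


# Invented readings: four observations, the third one after a roll-over.
readings = [4_294_930_000, 4_294_954_000, 10_704, 34_704]
errors = spacing_errors(readings)
print(errors)

# One beat is 1/400 s = 2.5 ms, so errors convert to milliseconds like this:
print([e * 1000 / TICK_HZ for e in errors])
```

A list of zeros means the spacing is uniform to within one beat, regardless of what the date-and-time clock reports for the same observations.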

The two main lessons that I drew from this experience are: if I have time for long tests, then I probably have time for the clean solution; and if the information does not match my expectations, it is possible that it gets altered along the way, so I should try to trace it back to its origin. In real life this means that I should more often consider taking the apparently longer road to the solution if it is going to give a better result. It also reminded me that sometimes the problems are not where I am looking for them, so I have to keep an open mind in case they are just the propagation of another error somewhere else.

I humbly apologize if this technical rant is not to your taste, but I was so thrilled and relieved at having found a workable solution that I could not help sharing it with you. I promise that tomorrow I will provide a more thoughtful discussion. Have a nice evening.
