Time out

Cometh the year 2000, cometh THAT problem. And it will not go away. But which are the high-risk items in your office? Peter Mitchell reports.

Embedded control software poses special Year 2000 compliance difficulties. The engineer – whether supplier or user – has far less control over the equipment’s functions and has correspondingly fewer test options. Sometimes one cannot even change the system clock.
The good news is that Y2K failure rates for embedded systems are much lower than for IT systems. Software expert Martyn Thomas, of the Deloitte Touche subsidiary Praxis, estimates the non-compliance rate at only three to five per cent; most simple control systems and PLCs are compliant, he says. But Gerry Docherty of RTE, a specialist consultancy which has examined over 10,000 systems for compliance, estimates the failure rate at five per cent just for the very simplest devices like temperature controllers, and more like 15 or 20 per cent for instruments with obvious calendar functions. For complex systems, the failure rates rise to 80 per cent, he says: “It’s difficult to find any compliant distributed control or SCADA systems”. And newer devices are if anything less trustworthy than old ones, he warns – since they are more likely to be designed around the kind of off-the-shelf building-block logic that so often contains calendar functions.
The question is, how to identify high-risk items? Manufacturers’ advice is unreliable. According to Docherty, 30 per cent of suppliers wrongly claim their systems are compliant – usually because their testing was too simple-minded. And Thomas says that many smaller manufacturers are deciding it is not in their interests to tell the truth at the moment. So, failing credible reassurance from the supplier, you must make your own checks – and, given the limited time, start with the most critical equipment. First, does the equipment actually contain an embedded controller?
The most obvious flag is a time or date display, or dates and times printed on reports. Here, a quick check for compliance is to look at the trend reports, historical logs or calibration status. If the year appears in two-digit form, watch out. The next question is, does the controller do any sort of averaging, say responding to temperature readings over a period of time? “That period might only be 20s, but the logic could still be using a faulty calendar routine”, warns Thomas.
Even if the instrument uses no time functions at all, it may still have an internal calendar/clock chip. Quite a lot of PLCs, and of course all PCs, have them. The device may even have two separate clocks which behave differently – as they do in PCs. Sometimes a look in the user manual will show that a top-of-the-range model in the same series does provide date functions. If so, it is quite likely that the standard models also contain calendar logic, even though the output may be disabled in these versions. In that case – and if the calendar is not Year 2000-compliant – the software may generate an illegal, out-of-range value when the clock crosses into the next century. When the controller’s error checking routines spot this, they will probably force a system close-down. According to Docherty, about 50 per cent of embedded system failures occur in this way.
Check this with the standard roll-over test, in which you set the system clock to just before midnight on December 31 1999, then stand well back. Proceed with great caution, though: never test a live system, always back up data first, and always bear in mind that the equipment might fail irrecoverably. It’s a good plan to check now that there are manual overrides or work-arounds you can invoke, and that they work properly. Document them – you may need them. Note there are several other key dates or ‘cusps’ worth testing, including February 29 2000; March 1 2000; and December 31 2000.
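To see why the roll-over and the leap-day cusps matter, here is a minimal sketch in Python of the kind of calendar routine that goes wrong. The routine `faulty_next_day` is entirely hypothetical: it assumes a two-digit year field with the century hard-wired to 19xx, which is one plausible design and not the logic of any particular device. The helper `check_cusp` compares it with Python's own calendar.

```python
from datetime import date, timedelta

DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def faulty_next_day(yy, month, day):
    """Advance one day the way a hypothetical two-digit-year routine might."""
    # Assumption: the century is hard-wired, so yy == 0 is read as 1900.
    year = 1900 + yy
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    last = 29 if (month == 2 and leap) else DAYS[month - 1]
    if day < last:
        return yy, month, day + 1
    if month < 12:
        return yy, month + 1, 1
    # No wrap-around: 99 becomes 100, which does not fit a two-digit field.
    return yy + 1, 1, 1

def check_cusp(start):
    """Return (year_in_range, matches_real_calendar) for the day after start."""
    expected = start + timedelta(days=1)
    yy, month, day = faulty_next_day(start.year % 100, start.month, start.day)
    in_range = 0 <= yy <= 99
    matches = in_range and (yy, month, day) == (
        expected.year % 100, expected.month, expected.day)
    return in_range, matches

if __name__ == "__main__":
    # A control date, the century roll-over, and the day before the leap day.
    for start in (date(1998, 6, 15), date(1999, 12, 31), date(2000, 2, 28)):
        print(start.isoformat(), check_cusp(start))
```

In this sketch the 1999 roll-over produces a year value of 100, the sort of out-of-range result an error-checking routine would trap, while February 28 2000 advances straight to March 1 because the year is read as 1900. A real device may fail in quite different ways, so this illustrates the principle and is no substitute for testing the equipment itself.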
Let’s suppose the equipment seems to continue operating normally through rollover – is it probably OK? Far from it. The application routines may not use these clock functions, but the power-up self-test may, for instance to check calibration status. So the first time the device is switched off and on in the new century, it may report an internal hardware failure and refuse to re-boot. “Then you’ve really got a problem, because if it won’t power up you can’t even roll back the century as a quick work-around”, warns Thomas. The next test to try, then, is to set the clock into the next century, and then power-down and up. If it re-boots successfully, check the system clock has remembered the ‘right time’ (most PCs won’t). Get into the habit of doing this with any equipment that has been closed down for operational reasons (with appropriate precautions).
What if the post-cusp power-up succeeds? According to Docherty, this is where many manufacturers stop, believing they have proved compliance. But it is vital now to compare the equipment’s output with what it should be. Check also that it will accept new data – and that it acts on it. “Some systems fail by throwing away the new inputs and taking control actions based on the old data”, he says. Another common failure mode to watch for is the wholesale deletion of archive data which the system thinks is time-expired.
If you have access to the source code, you can do a static check for date routines – either manually with a walk-through or desk check, or using one of the many automatic scanning tools around. The code will probably contain comments in natural language, so scan for ordinary words as well as variable names. The programmers’ documentation should also be checked, though it is a notoriously unreliable guide to the actual compiled code.
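A static scan of this kind can be sketched in a few lines of Python. The word list `DATE_WORDS` and the sample source in `SAMPLE` are illustrative assumptions, not taken from any real scanning tool or product. The match is deliberately loose, so it catches fragments inside comments and variable names alike, at the cost of some false alarms.

```python
import re

# Assumption: an illustrative word list, neither exhaustive nor standard.
DATE_WORDS = ["date", "year", "yy", "century", "calendar", "clock",
              "time", "leap", "month", "day", "expire"]
PATTERN = re.compile("|".join(DATE_WORDS), re.IGNORECASE)

def scan_source(text):
    """Return (line_number, line) pairs for lines mentioning a date-related word."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits

# Hypothetical fragment of controller source, invented for this example.
SAMPLE = """\
int cal_due;      /* calibration due, stored as YYMMDD */
int setpoint;     /* target temperature */
void purge_old(void);  /* drop log records older than 2 years */
"""

if __name__ == "__main__":
    for number, line in scan_source(SAMPLE):
        print(number, line)
```

Here the scan flags the first and third lines: one because of the `YYMMDD` comment, the other because of the word "years". Every hit still needs a human to read the surrounding logic, since the scan only shows where to look.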
But usually, critical parts of the source are not available – either because the work was sub-contracted, or because the date and time routines are in third-party function libraries or chip sets. And even where the whole source is available, there may be several versions, with nobody knowing which devices contain which version.
For this reason it is not good enough to check ‘representative samples’ of your equipment. “They vary”, says Thomas. “Manufacturers can change the internal architecture and control circuitry quite a lot, without updating the model number. We’ve found medical systems where one model will work through the new century and a seemingly equivalent instrument will fail. You really have to check every single piece of critical equipment”. And if you can’t test it? There is only one alternative: replace it.
