Network equipment, whether for a phone system, digital broadcast or wide area or enterprise network, must have high reliability and high availability. Traditionally, network equipment companies have provided this through proprietary architectures, but they have been moving to Linux and open source solutions over the past few years.
These solutions vary in effectiveness, and security concerns are increasingly the major risk for network outages. Operators must find ways to protect their systems against a wide range of attacks in order to provide maximum availability. This can be done through a combination of hardware redundancy and hardened software, but only the right implementation delivers the most effective solution.
To illustrate, consider a “basic” blade server that uses a non-carrier-grade operating system and has no hardware redundancy, except for power and fans. It employs memory error correction and a five-minute auto-restart feature to recover from transient hardware and software failures. Such a blade server can achieve an unplanned system downtime of 25 minutes per year, or 99.9952 percent availability. However, this estimate does not include denial of service (DoS) attacks.
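The relationship between annual downtime and percentage availability is simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def availability_percent(downtime_minutes_per_year: float) -> float:
    """Percentage of the year the system is in service."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


def downtime_minutes(availability_pct: float) -> float:
    """Annual downtime, in minutes, implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1.0 - availability_pct / 100.0)


# 25 minutes of unplanned downtime per year, as in the basic blade example
print(f"{availability_percent(25):.4f} percent")
```

For 25 minutes per year this gives 99.9952 percent, matching the figure quoted for the basic blade server.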
Researchers at Bell Labs estimate the average IP address is attacked nine times per year, increasing the annual unplanned system downtime for the basic blade server to between 30 and 300 minutes per year, depending on the server’s network function. The Institute for Advanced Professional Studies reports significantly higher downtime statistics for enterprise blade servers, with around 900 minutes per year of unplanned downtime.
This difference is typically due to quality problems associated with design implementation, manufacturing and operations training. For mission-critical services this can be costly. Some applications have outage costs of more than $1 million per hour. If the enterprise blade servers studied by the Institute for Advanced Professional Studies are used for such applications, the annual costs can exceed $15 million.
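The cost figure follows directly from the downtime numbers. As a minimal sketch (the function name is hypothetical, and a flat hourly outage cost is assumed):

```python
def annual_outage_cost(downtime_minutes_per_year: float, cost_per_hour: float) -> float:
    """Annual outage cost, assuming every outage minute costs the same."""
    return downtime_minutes_per_year / 60.0 * cost_per_hour


COST_PER_HOUR = 1_000_000  # $1 million per hour, the figure used in the text

# Downtime figures quoted above: basic blade, the DoS-adjusted range, enterprise blades
for minutes in (25, 30, 300, 900):
    print(minutes, "min/yr ->", f"${annual_outage_cost(minutes, COST_PER_HOUR):,.0f}")
```

At $1 million per hour, 900 minutes of downtime is 15 hours, or $15 million per year.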
Table 1 shows the impact of outage contributors, including DoS attacks, for different types of server systems modelled using industry data and assuming a desirable DoS target. The types are a single nonredundant server blade that is not using a hardened Carrier Grade Linux (CGL) solution; a single nonredundant system using CGL; and an enhanced cluster of blade servers using CGL.
The improvement in the basic system from using CGL is due to enhanced fault management features, which give higher fault coverage and faster repair and recovery times, and to a hardened operating system, which significantly reduces the number of successful DoS attacks. The enhanced cluster is capable of significant availability improvement because of the added protection provided by the redundant hardened server systems using fast failover.
To meet the customer availability requirements and avoid the high outage and maintenance costs, several “carrier-grade” capabilities are key.
Carrier-grade servers are intended to provide high availability service 24/7, so they must be capable of in-service upgrades. For large servers that support many users and provide mission-critical services, the requirement is that the upgrade must not cause a service outage or drop established sessions.
Typically, the cut-over should take less than 10 to 30 seconds, depending on the application. The software upgrade must also be capable of automatically returning to the previous release if a software insertion fails. The hardware blades must be hot-swappable, with the system continuing to function as the blade is replaced, tested and returned to service.
Hardware Fault Management
Hardware redundancy can result in a hundred times improvement in system availability, but this has to be balanced with the cost of providing redundant equipment.
Effective hardware redundancy implements both real-time and background tests with fault coverage exceeding 95 percent, contains the failure modes within protection zones, and recovers – sometimes saving state information – to achieve hitless recovery. Multi-unit load-sharing clusters are one of the most cost-effective, scalable redundancy schemes requiring minimal over-provisioning. Comprehensive diagnostics result in the ability to isolate faults and to quickly repair the system. This reduces downtime, unnecessary circuit pack returns (classified as “no fault found”), repair times and costs.
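As a rough illustration of why redundancy pays off, the sketch below computes the availability of a load-sharing cluster that needs at least k of its n units working. It assumes units fail independently and that failover is instantaneous and always succeeds, which real systems only approximate (fault coverage below 100 percent lowers the result). The function name and the 99 percent unit availability are illustrative assumptions, not figures from the data above.

```python
from math import comb


def cluster_availability(n: int, k: int, unit_availability: float) -> float:
    """Probability that at least k of n independent units are up."""
    a = unit_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


UNIT = 0.99  # assumed availability of one unit (illustrative only)

single = cluster_availability(1, 1, UNIT)       # no redundancy
duplex = cluster_availability(2, 1, UNIT)       # 1+1 redundancy
n_plus_one = cluster_availability(5, 4, UNIT)   # four units needed, one spare

print(f"unavailability: single={1 - single:.6f} duplex={1 - duplex:.6f} "
      f"4+1={1 - n_plus_one:.6f}")
```

Under these assumptions, duplicating a 99 percent unit cuts unavailability from 1 percent to 0.01 percent, a hundredfold reduction. The 4+1 case shows how a load-sharing cluster buys much of that benefit with only one spare unit.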
Hardware fault management features also record failure mode data, which vendors use for causal analysis to identify and resolve design issues.
Software Hardening and Fault Management
Software failure rates can vary widely based on the level of software quality and traffic load. Carrier-grade software has low failure rates, is robust to the extremes of traffic load and employs fault management features to detect and recover from software failures.
This includes graduated recovery schemes that reduce outage duration and modular recovery schemes that target software components and, similar to hardware, record failure mode data to support causal analysis to identify and resolve software design issues.
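A graduated recovery scheme can be pictured as an escalation ladder: try the least disruptive action first and escalate only if it does not restore service. The sketch below is a hypothetical illustration, not any vendor's implementation; the step names and the `graduated_recovery` function are assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A recovery step is a name plus an action that returns True if service is restored.
RecoveryStep = Tuple[str, Callable[[], bool]]


def graduated_recovery(steps: Sequence[RecoveryStep],
                       failure_log: List[Tuple[str, str]]) -> Optional[str]:
    """Try steps from least to most disruptive; return the one that worked.

    Every unsuccessful step is appended to failure_log so the data is
    available later for causal analysis.
    """
    for name, action in steps:
        try:
            restored = action()
        except Exception as exc:  # a failing recovery action must not stop escalation
            failure_log.append((name, repr(exc)))
            continue
        if restored:
            return name
        failure_log.append((name, "did not restore service"))
    return None  # every step failed; manual intervention is needed


# Hypothetical ladder: the lambdas stand in for real restart actions.
ladder = [
    ("restart_process", lambda: False),
    ("restart_subsystem", lambda: True),
    ("reboot_system", lambda: True),
]
log: List[Tuple[str, str]] = []
print(graduated_recovery(ladder, log), log)
```

Because the ladder stops at the first step that works, a fault that a process restart can clear never costs a full system reboot.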
Malicious software can cause lengthy server outages. The volume, sophistication and scope of such attacks make it difficult for equipment vendors to stay ahead of them. A system that is hardened to prevent intrusions and robust to overloads is required to prevent or mitigate the impact of cyber attacks.
Examples are oversize packet detection, rate limiting, modular software design, disabled RPC (remote procedure call), isolation of users from system files and directories, encryption and password access. Monitoring and recording of intrusion data is very important because it can identify and close security vulnerabilities.
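Of the measures listed, rate limiting is the easiest to show in a few lines. The token bucket below is one common way to do it, offered as a minimal sketch rather than a description of how any particular CGL distribution implements it. The class name and parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate_per_sec` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable so the limiter can be tested
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be dropped."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per source address would limit each sender independently.
bucket = TokenBucket(rate_per_sec=100, burst=20)
accepted = sum(bucket.allow() for _ in range(1000))
print("accepted", accepted, "of 1000 back-to-back requests")
```

A flood from one source exhausts its bucket quickly and is dropped, while traffic that stays within the configured rate is unaffected.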
Human Factors Design
Procedural errors are often a major contributor to server downtime, especially when there is no hardware redundancy. Human factors design includes both mechanical and computer interface design features, such as command-line checking, to prevent human error.
It is not sufficient to simply add these carrier-grade capabilities to a server. Carrier-grade servers can vary widely depending on how well the carrier-grade features have been implemented, tested and improved using field feedback.
To ensure successful updates and upgrades, my own company’s Carrier Grade Linux distribution, Wind River Platform for Network Equipment, includes full version tracking and compatibility checking. It also supports early boot cycle detection and recovery, so that when a kernel upgrade on a node or blade fails to boot, the OS automatically reverts to the last kernel that booted successfully.
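The general idea behind boot-failure fallback can be sketched as a small state machine: a newly installed kernel is tried a limited number of times and is only promoted once the system confirms a healthy boot. This is a generic illustration under assumed names (`BootState`, `select_kernel`, `confirm_boot`, `MAX_ATTEMPTS`); it does not describe the internals of Platform for Network Equipment.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 1  # assumed: one try for a new kernel before reverting


@dataclass
class BootState:
    """State that would be kept in storage surviving a reboot (assumption)."""
    last_good: str                    # kernel that most recently booted successfully
    candidate: Optional[str] = None   # upgraded kernel awaiting its first good boot
    attempts: int = 0                 # boots attempted with the candidate so far


def select_kernel(state: BootState) -> str:
    """Run early in the boot cycle: choose which kernel to load."""
    if state.candidate is None:
        return state.last_good
    if state.attempts >= MAX_ATTEMPTS:
        # The candidate was tried and never confirmed: revert.
        state.candidate = None
        state.attempts = 0
        return state.last_good
    state.attempts += 1
    return state.candidate


def confirm_boot(state: BootState, running_kernel: str) -> None:
    """Run once the system is up and healthy: promote the candidate."""
    if running_kernel == state.candidate:
        state.last_good = running_kernel
        state.candidate = None
        state.attempts = 0


state = BootState(last_good="kernel-A", candidate="kernel-B")
print(select_kernel(state))  # first boot tries the upgrade
print(select_kernel(state))  # no confirmation arrived, so the next boot reverts
```

The key design choice is that a kernel counts as good only after an explicit confirmation, so a hang or crash during boot leads to a revert on the next cycle.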
With a suite of high-coverage software detection features combined with graduated auto-restart capabilities, Platform for Network Equipment provides recovery from software failures in times ranging from as little as 20 seconds for a complete system to subsecond restarts for individual applications. The OS can automatically switch to a new kernel and bypass all firmware operations for a more stable, cleaner recovery environment.
The fault monitoring and hardening in Platform for Network Equipment supports a high level of redundancy that allows multiple redundant communication paths to hard drives to ensure that data can be written when links have failed. Its cluster storage solution is based on the Oracle Cluster File System version 2, which is more effective than the basic Linux Logical Volume and provides superior hardware fault detection, resulting in reliable internode messaging with fast failover to alternate nodes. It also supports error detection and correction for PCI and memory failures, while the OS fault-handling capabilities incorporate fault detection and reporting on the ATCA/IPMI to prevent outages.
Wind River’s CGL 4.0 implementation provides the capability to monitor and report low memory conditions to the applications so they can take corrective action. System stability is ensured by the Completely Fair Scheduler, which includes techniques to improve the system’s robustness to traffic extremes.
Because of the increasing risk of malicious software, the security policy includes the well-established grsecurity patches as well as the assurance that all user-space applications and libraries are compiled using full stack protection, which is not commonly used across the industry. This protects the system from malicious applications as well as from application bugs that affect the stack.
Platform for Network Equipment also includes a suite of tools for active log monitoring (sentry tools), local and remote full system binary, log and configuration file validation, password integrity checking, password quality control that prevents users from changing their passwords to ones that can be easily cracked, and a system jail that is immune to all known chroot jail attacks.
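File validation of this kind is commonly built on cryptographic digests: record a baseline of known-good hashes, then compare later. The sketch below shows that general technique using SHA-256. It is an assumption about the approach, not a description of the tools shipped with the product, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def file_digest(path) -> str:
    """SHA-256 of a file, read in chunks so large binaries are handled."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_baseline(paths: Iterable) -> Dict[str, str]:
    """Record known-good digests, ideally stored off the monitored system."""
    return {str(p): file_digest(p) for p in paths}


def changed_files(baseline: Dict[str, str]) -> List[str]:
    """Return paths that are missing or whose content no longer matches."""
    changed = []
    for path, digest in baseline.items():
        if not Path(path).is_file() or file_digest(path) != digest:
            changed.append(path)
    return sorted(changed)
```

A periodic job would call `changed_files` against the stored baseline and raise an alarm for any path it returns. Keeping the baseline on a remote host makes it harder for an intruder to alter both a file and its recorded digest.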
Even for a single nonredundant blade, Platform for Network Equipment can provide benefits. The system’s high-coverage fault-handling features and overload and security hardening result in faster software recovery times. The hardware monitoring features result in reduced repair times and early warning of impending problems.
Low-outage recovery and repair times
The end result is low-outage recovery and repair times, meaning low maintenance costs. Using an assumed network service where outage costs are $1 million per hour, the latest CGL version running on a fully fault-tolerant server can keep annual outage costs below $50,000 per server, one to two orders of magnitude lower than the reported data from other implementations.
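To put the $50,000 figure in terms of downtime, the calculation can be run in reverse. This is a minimal sketch assuming a flat hourly outage cost and a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # leap years ignored (assumption)


def downtime_budget_minutes(annual_cost_budget: float, cost_per_hour: float) -> float:
    """Annual downtime, in minutes, that a given outage-cost budget allows."""
    return annual_cost_budget / cost_per_hour * 60.0


budget = downtime_budget_minutes(50_000, 1_000_000)
availability = 100.0 * (1.0 - budget / MINUTES_PER_YEAR)
print(f"{budget:.1f} minutes per year, {availability:.4f} percent availability")
```

At $1 million per hour, $50,000 corresponds to three minutes of outage per year, or about 99.9994 percent availability.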
By Glenn Seiler, Senior Director of Market Development, Wind River, Networking Solutions