The High Availability Imperative
Nearly every person and organization across the globe engages with communications networks on a daily basis. The ability to communicate electronically has become intertwined with many important aspects of how we maintain safety and security, conduct business, entertain, manage our lives, and connect with friends and family. When the ability to communicate is interrupted, the impact can range from frustration to catastrophe. Providers of communications network equipment bear a significant responsibility to ensure that the ability to communicate is available 24x7 without interruption.
This daunting challenge can be costly and time-consuming to meet. However, alternatives such as OpenSAF are enabling the major equipment providers and the growing number of smaller companies to deliver cost-effective high availability services and equipment in less time than traditional methods.
High Availability Requirements
To better understand the value and technology that enable equipment providers to maintain 24x7 availability of services, a quick, high-level review of high availability requirements is in order. Most simply stated, the requirement is the ability to provide a service without interruption regardless of hardware or software faults.
This level of availability is typically measured as a percentage of uptime, commonly referred to as the number of nines. Two nines indicates that the service is up 99 percent of the time. While that may sound impressive to some, it means that the service is unavailable for over 87 hours a year, or close to 15 minutes a day. As chance would have it, that 15 minutes of downtime occurs at the most inopportune times. Table 1 shows the amount of downtime for each level of availability measured in nines. For a system to be considered highly available, it typically must operate at a 5 nines level or better.
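The downtime figures above follow directly from the availability percentage. As a minimal sketch of that arithmetic (the function name `downtime_per_year` and the 365-day year are assumptions made for illustration, not part of the original article):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def downtime_per_year(nines: int) -> float:
    """Return the yearly downtime in hours for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 2 nines -> 0.01 (down 1 percent of the time)
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(1, 7):
        hours = downtime_per_year(n)
        availability = 100.0 * (1 - 10.0 ** (-n))
        print(f"{n} nines ({availability:.4f}%): "
              f"{hours:.4f} hours/year, {hours * 60 / 365:.4f} minutes/day")
```

Under these assumptions, 2 nines works out to 87.6 hours of downtime a year, while 5 nines allows only about 5.3 minutes a year.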
Availability of 5 nines or greater cannot be achieved by improving the quality of components alone. It requires specific hardware and software technology that can detect faults and recover from them without impacting the service being delivered. Redundancy of hardware and software components is key to achieving high availability. For example, an application that is processing mobile phone call connections would run on one processor while an identical standby application runs on a second processor. Should a fault be detected that affects the active application, the service is immediately switched over to the standby application. The user of the service is never aware that a fault occurred.
A faulty component must also be repairable or replaceable without affecting the service. For example, a processor board that fails must be able to be powered down, removed and replaced, and the new processor powered up, loaded with the appropriate software, and brought back online as an active or standby resource, all without resetting the system.
While redundancy and fault management are critical requirements, they are not enough to achieve 5 nines availability. Upgrades of hardware and software are necessary, and taking a service offline to accommodate an upgrade eliminates any hope of achieving high availability. Thus the ability to upgrade hardware and software components in real time without service interruption is an absolute requirement.
These management actions all occur in real time, autonomously, without human intervention, according to pre-established policies maintained by the HA software. Throughout these management actions, the network management system must be notified of any faults that have occurred and of any associated repair actions and configuration changes.
Management of redundant components, fault detection, recovery and repair, in-service upgrades, and notification is typically provided by specialized software that is layered between the hardware and the applications, and is thus called High Availability (HA) middleware.
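The active/standby switchover described above can be sketched in a few lines. This is an illustrative simplification rather than how OpenSAF or any particular HA middleware is implemented: the `Replica` and `RedundantService` classes are hypothetical, and the fault is detected when a request fails, whereas real middleware typically relies on continuous health monitoring.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One instance of the application running on its own processor."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return f"{self.name} handled {request}"


class RedundantService:
    """Hypothetical redundant pair: one active replica and one standby."""

    def __init__(self, active: Replica, standby: Replica) -> None:
        self.active = active
        self.standby = standby

    def handle(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except RuntimeError:
            # Fault detected: promote the standby, then retry the request
            # so that the caller never sees the failure.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


if __name__ == "__main__":
    service = RedundantService(Replica("cpu-1"), Replica("cpu-2"))
    print(service.handle("call-setup-1"))
    service.active.healthy = False  # simulate a fault on the active side
    print(service.handle("call-setup-2"))
```

In this sketch the second request is served by the former standby, and the caller receives a normal response in both cases. If both replicas have failed, the retry raises an error, which is why the failed unit must be repaired and returned to service.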
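One common way to meet the in-service upgrade requirement is to upgrade redundant units one at a time, so that at least one unit is always carrying the service. The sketch below illustrates that idea only; the function `rolling_upgrade`, the node names, and the version strings are hypothetical and do not represent an OpenSAF interface.

```python
from typing import Dict, List


def rolling_upgrade(nodes: Dict[str, str], new_version: str) -> List[str]:
    """Upgrade nodes one at a time, keeping at least one node in service.

    `nodes` maps a (hypothetical) node name to its software version and is
    updated in place. Returns a log of the steps taken.
    """
    if len(nodes) < 2:
        raise ValueError("need redundancy: at least two nodes are required")

    in_service = set(nodes)
    log = []
    for name in list(nodes):
        in_service.remove(name)      # take exactly one node out of service
        if not in_service:
            raise RuntimeError("service would be interrupted")
        nodes[name] = new_version    # upgrade while the peers carry the load
        in_service.add(name)         # return it to service before moving on
        log.append(f"{name} upgraded to {new_version}")
    return log


if __name__ == "__main__":
    boards = {"board-a": "1.0", "board-b": "1.0"}
    for step in rolling_upgrade(boards, "2.0"):
        print(step)
```

The sketch refuses to run against a single node, which reflects the point made above: without redundancy, an upgrade necessarily takes the service offline.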
A final and critical requirement is standardization. Network equipment and the required HA middleware are composed of many building blocks. It is extremely rare that one organization would develop and deploy all the building blocks. Building blocks are sourced from an extensive ecosystem of hardware and software providers as well as from various departments within a single company. Therefore standards and open specifications must be defined that ensure the building blocks can be easily and predictably integrated. The Service Availability Forum (SA Forum, www.saforum.org) is the most important organization focused on the well-established standards that apply to HA middleware. Its open specifications define the interfaces and frameworks used by the HA middleware to ensure interoperability and portability.
The SA Forum specifications have been implemented by a number of companies as the core of high availability middleware implementations. These include proprietary commercial offerings and internal developments, the latter generally favored by some larger companies. Given the high cost of development, the complexity of high availability middleware, and its essential but largely non-differentiating role in the marketplace, the OpenSAF project is an increasingly popular alternative to purely commercial solutions.
The Open Service Availability Framework (OpenSAF) is an open source project developing state-of-the-art high availability middleware for communications network equipment. It is backed by the OpenSAF Foundation, which provides financial, legal and marketing support to the project. The objective is to establish a cost-effective, broadly adopted high availability middleware implementation that meets the challenging requirements for 5 nines and beyond service availability. In addition, OpenSAF is expected to enhance portability and interoperability of the communications equipment building blocks.
The OpenSAF Foundation and project have expert management and engineering participation from major telecom equipment manufacturers (Ericsson, Huawei and Nokia Siemens Networks), enterprise computing companies (HP and Sun Microsystems), embedded computing supplier Emerson Network Power-Embedded Computing, and software companies (ENEA and Wind River Systems). This represents many of the best-in-class high availability hardware, software, and system providers in the commercial ecosystem.
The project is guided by a Technical Leadership Council that oversees the development roadmap, architecture, and technical work groups developing the OpenSAF middleware. Project members work in a controlled development and test environment that ensures carrier-class quality is achieved. With support from the OpenSAF Foundation, the project supplements its online development and meeting activities with Developer Days symposiums, the next of which will be held in Shenzhen, 3–4 June 2009.
High availability middleware that enables communications equipment to achieve 5 nines or greater service availability is a significant undertaking that requires hundreds of expert resource years to develop and test. The OpenSAF project was initiated with a significant code base from a major provider of high availability systems, made available under the LGPL v2.1 license.
The initial OpenSAF code base includes an implementation of the SA Forum’s Application Interface Specification (AIS) release B.01.01, which initially includes six of the key services:
1. Availability Management Framework
2. Cluster Membership Service
3. Checkpoint Service
4. Message Service
5. Event Service
6. Global Lock Service
OpenSAF adds further capabilities that provide an underlying infrastructure and management functionality to complement the SA Forum services. The key capabilities include:
• Message Distribution Service to provide a high-performance messaging infrastructure based on the Transparent Inter-process Communication (TIPC) protocol
• Message-Based Checkpoint Service to enable faster recovery from a hardware failure than is provided by the SA Forum Checkpoint Service
• Distributed Tracing Service to facilitate debugging in a distributed environment
• System Resource Monitoring Service to enable OpenSAF users to detect and recover from an overload situation
• Hardware Platform Integration (HPI) service, which integrates the HPI managed hardware platform with OpenSAF
• Management Access and System Description services, which provide a single access point for all management operations and a structured mechanism to define hardware and software components and their relationship from a high availability and failover perspective
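To illustrate the general idea behind the checkpoint services listed above, the following minimal sketch shows an active application saving its state so that a standby can resume from it at takeover. It is a conceptual illustration only: `CheckpointStore` and `CallProcessor` are hypothetical names, the store is a plain in-memory dictionary, and none of this reflects the actual SA Forum AIS or OpenSAF programming interfaces.

```python
class CheckpointStore:
    """Hypothetical in-memory stand-in for a replicated checkpoint."""

    def __init__(self) -> None:
        self._sections = {}

    def write(self, section: str, data: dict) -> None:
        self._sections[section] = dict(data)  # store a copy, not a reference

    def read(self, section: str) -> dict:
        return dict(self._sections.get(section, {}))


class CallProcessor:
    """Application instance that checkpoints its state after every change."""

    def __init__(self, store: CheckpointStore) -> None:
        self.store = store
        self.calls = {}

    def connect(self, call_id: str, callee: str) -> None:
        self.calls[call_id] = callee
        self.store.write("calls", self.calls)  # checkpoint after each update

    def take_over(self) -> None:
        """Called on the standby when it becomes the active instance."""
        self.calls = self.store.read("calls")


if __name__ == "__main__":
    store = CheckpointStore()
    active, standby = CallProcessor(store), CallProcessor(store)
    active.connect("call-1", "alice")
    standby.take_over()  # standby resumes with the active instance's state
    print(standby.calls)
```

Because the standby restores the last checkpointed state when it takes over, calls that were already connected on the failed instance are still known to the new active instance.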
OpenSAF is designed to be processor and operating system agnostic through an operating system portability layer. Currently, OpenSAF supports PowerPC and Intel x86 processor architectures and the Linux operating system.
The OpenSAF Advantage
High availability middleware was once considered a key differentiating feature in communications systems, and many companies developed their own implementations, which were ultimately specific to their equipment and applications. Today, time to market, rapid application development and easy integration of applications onto a single platform, all in a cost-sensitive environment, are critical business drivers for next generation network providers. OpenSAF and the LGPL v2.1 license enable multiple companies and individuals to contribute to the code base, sharing the cost and accelerating development. Additionally, ISVs and software application developers can use OpenSAF as their high availability software platform with the knowledge that broad adoption means easy integration and portability of their applications. Businesses can now make the pragmatic choice to allocate resources to creating their own value-add IP rather than developing and integrating their own internal and proprietary HA middleware.
Meeting the Challenge
With our dependence on electronic communications for safety and security, business, entertainment, management of our lives, and connection with friends and family, 24x7 availability of communications services is an absolute requirement. High availability middleware provides the capability to deploy next generation network equipment that delivers 5 nines or greater service availability. OpenSAF offers an HA middleware solution that allows equipment providers at all levels to deliver cost-effective, timely, and competitive solutions that meet the HA imperative.
John Fryer is Director of Technology Marketing for the Embedded Computing business of Emerson Network Power and President of the OpenSAF Foundation, and represents Emerson Network Power on the board of directors for the Service Availability Forum.
Jim Lawrence works for Enea as Chief Software Standards Officer.
Richard Grigonis is Executive Editor of TMC’s IP Communications Group.
Edited by Greg Galitzine