Increasing Business Success in Today's Service Provider Market
Minimizing Service Outages Maximizes Profitability and Ensures Success
One of the greatest challenges in today's service provider market is achieving the profit margin
goals established as business objectives. Under ever-increasing pressure to achieve higher profit
margins, service providers must both increase revenues and reduce costs in order to maximize their
profits.
Ensuring that revenues remain high means that service outages must be kept to a minimum,
since service outages directly translate into revenue losses. This can easily be seen in the following
example.
A service provider bills an average of $0.05 per call per minute. Assume that the provider
suffers a service outage on a switch with 160 T1 lines during a period when switch usage would have been
75%. A single T1 circuit carries 24 voice channels, and 2 channels are required for a single voice call
(one incoming, one outgoing). In this case, revenue loss per minute is equal to:
Loss/minute = (160 T1 lines) * (24 channels/line) / (2 channels/call) * 0.75 * $0.05
When that switch is out of service, revenue loss occurs at the rate of $72.00 per minute.
Table A demonstrates how quickly this loss grows.
| Service Outage | Revenue Loss |
| 5 minutes | $360.00 |
| 1 hour | $4,320.00 |
| 2 hours | $8,640.00 |
| 4 hours | $17,280.00 |
Table A: Revenue Losses Due to Service Outages
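The arithmetic behind Table A can be reproduced with a short script. This is a minimal sketch that uses only the figures from the example above; the function and constant names are illustrative and do not come from any billing system.

```python
# Figures taken from the example in the text.
T1_LINES = 160
CHANNELS_PER_T1 = 24
CHANNELS_PER_CALL = 2          # one incoming, one outgoing
UTILIZATION = 0.75             # expected switch usage during the outage
RATE_PER_CALL_MINUTE = 0.05    # dollars


def loss_per_minute(t1_lines=T1_LINES, utilization=UTILIZATION,
                    rate=RATE_PER_CALL_MINUTE):
    """Revenue lost per minute of outage, in dollars."""
    concurrent_calls = t1_lines * CHANNELS_PER_T1 / CHANNELS_PER_CALL
    return concurrent_calls * utilization * rate


def outage_loss(minutes, **kwargs):
    """Total revenue lost over an outage lasting `minutes`."""
    return loss_per_minute(**kwargs) * minutes


for label, minutes in [("5 minutes", 5), ("1 hour", 60),
                       ("2 hours", 120), ("4 hours", 240)]:
    print(f"{label:>10}: ${outage_loss(minutes):,.2f}")
```

Changing the keyword arguments (for example, a different utilization) shows how sensitive the loss is to the assumptions in the example.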
Beyond the loss of revenue, other serious consequences of service outages may include
increased customer service costs, loss of customers, contractual penalties, and litigation fees. All of
these result in even further losses. Intangibles are also at stake, such as a poor reputation in the
marketplace and the loss of future opportunities.
Service outages can be mitigated or prevented by selecting high-quality equipment. In the
telecommunications industry, such equipment is described by terms such as Carrier Grade, Carrier Class,
or Telco Grade, and these designations are generally accepted as assurance that the products are designed
to meet the quality objectives needed to maintain profitability. To understand how equipment achieves the
Carrier Grade designation, the reliability and availability of the product must be analyzed.
Reliability
One of the best ways to express reliability quantitatively is by studying equipment failure
data and providing figures for MTBF (Mean Time Between Failures) [1]. Telcordia's Reliability Prediction
Procedure TR-332 is a well-recognized method of predicting equipment MTBF figures for new product
designs [2].
Using TR-332, MTBF values were calculated for the major components of the EdgeIQ Intelligent
Media Gateway manufactured by Versatel Networks, a leader in telecommunications equipment. Table B shows
the MTBF values for each of the product components.
| Component | MTBF |
| Main CPU SBC (Single Board Computer) | 150,000 hrs |
| Main CPU SBC I/O | 500,000 hrs |
| Power supply unit | 300,000 hrs |
| Fans | 300,000 hrs |
| Backplane | 1,000,000 hrs |
| VoIP i/f card | 111,894 hrs |
| T1/E1 i/f card | 85,692 hrs |
Table B: MTBF Values of Versatel Networks' EdgeIQ Intelligent Media Gateway Components
An MTBF value of 43,800 hours, or 5 years, does not indicate that the component will operate
continually for 5 years, and then fail at hour 43,800. The MTBF value is a statistical value used to
determine the probability or likelihood of failure. From MTBF, the probability that the component will
operate without failure for a certain time period is given by [3]:
$$ R(t) = e^{-t/\mathrm{MTBF}} $$
Where
t is the period of time of interest
MTBF is the calculated MTBF
R(t) is the reliability function
An example calculation follows:
Given that the MTBF for the Main CPU SBC is 150,000 hours, what is the probability that
this card will operate without failure for 5 years (43,800 hours)?
Using the equation above:
$$ R(t) = e^{-43{,}800/150{,}000} = 74.68\% $$
The reliability, or probability of successful operation, varies depending on the time
period in question. Table C shows the reliability at different times:
| Time | Reliability |
| 2.5 years | 86.42% |
| 5 years | 74.68% |
| 10 years | 55.77% |
Table C: Reliability Values for Differing Time Intervals
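The values in Table C can be checked with a few lines of code. The sketch below assumes exponentially distributed failure times, as the reliability equation does, and a year of 8,760 hours (which matches 5 years = 43,800 hours); the function name is illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, so 5 years = 43,800 hours


def reliability(hours, mtbf_hours):
    """Probability of operating without failure for `hours`,
    assuming exponentially distributed failure times."""
    return math.exp(-hours / mtbf_hours)


MAIN_CPU_MTBF = 150_000  # hours, from Table B

for years in (2.5, 5, 10):
    r = reliability(years * HOURS_PER_YEAR, MAIN_CPU_MTBF)
    print(f"{years:>4} years: {r:.2%}")
```

Passing the MTBF itself as the time period, `reliability(MAIN_CPU_MTBF, MAIN_CPU_MTBF)`, gives e^-1 whatever the MTBF value is.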
Another value to consider is the probability that a component is operational after a period
of time equal to the MTBF. This is a constant value that can be computed using the reliability equation:
$$ R(\mathrm{MTBF}) = e^{-\mathrm{MTBF}/\mathrm{MTBF}} = e^{-1} = 36.79\% $$
This indicates that the probability that any component, including the Main CPU SBC, will operate
without failure for a period equal to its MTBF is 36.79%.
Logically speaking, when looking at individual components, higher MTBF values equate to better
reliability. However, component MTBF values can affect overall system reliability to varying degrees.
The overall MTBF of a system of n components, all of which must operate for the system to operate, is
given by the equation:
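$$ \mathrm{MTBF}_{\mathrm{sys}} = \left( \frac{1}{\mathrm{MTBF}_1} + \frac{1}{\mathrm{MTBF}_2} + \cdots + \frac{1}{\mathrm{MTBF}_n} \right)^{-1} $$
Where
$\mathrm{MTBF}_{\mathrm{sys}}$ is the total system MTBF
$\mathrm{MTBF}_n$ is the MTBF of component n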
In general terms, the overall MTBF of the system is lowered each time a component is added
to the system. However, the impact of additional components can have more significance in some situations.
Consider the following example:
From Table B, the SBC is made up of the Main CPU board and the I/O board. The total MTBF is:
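$$ \mathrm{MTBF}_{\mathrm{SBC}} = \left( \frac{1}{\mathrm{MTBF}_{\text{Main CPU}}} + \frac{1}{\mathrm{MTBF}_{\text{I/O Board}}} \right)^{-1} $$
Or
$$ \mathrm{MTBF}_{\mathrm{SBC}} = \left( \frac{1}{150{,}000} + \frac{1}{500{,}000} \right)^{-1} = 115{,}385 \text{ hours} $$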
The probability that the SBC will operate without failure for 5 years is 68.41%, versus 74.68%
for the Main CPU by itself. This demonstrates that multiple components can have quite an effect on
overall reliability. To ensure that overall system reliability is acceptable, manufacturers strive toward
higher and higher MTBF values on individual system components.
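The SBC example can be reproduced as follows. This is a sketch that assumes every component is required for operation (a series arrangement) and that failure times are exponentially distributed; `series_mtbf` is an illustrative name.

```python
import math


def series_mtbf(*component_mtbfs):
    """Combined MTBF when every listed component must work:
    the reciprocal of the summed failure rates (1/MTBF)."""
    return 1.0 / sum(1.0 / m for m in component_mtbfs)


# Main CPU board and I/O board values from Table B, in hours.
sbc_mtbf = series_mtbf(150_000, 500_000)
print(f"SBC MTBF: {sbc_mtbf:,.0f} hours")

five_years = 43_800  # hours
print(f"Main CPU alone, 5 years: {math.exp(-five_years / 150_000):.2%}")
print(f"Complete SBC, 5 years:   {math.exp(-five_years / sbc_mtbf):.2%}")
```

Because failure rates add, the combined MTBF is always lower than the lowest individual MTBF in the list.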
Another common strategy for increasing overall system MTBF is redundancy. Redundancy means
adding one or more duplicate components that perform the same functions as the primary component, so
that if the primary component fails, one of the redundant components takes over. The standard
reliability calculation shows that MTBF increases by 50% in a system with a primary component and a
single identical redundant backup, commonly referred to as a hot/standby or '1+1' configuration,
assuming the failed unit is not repaired. (This arrangement is also referred to as a 1-out-of-2
configuration, which means that of the 2 components, only 1 component is needed for operation.)
For example, in a redundant SBC configuration where each SBC has an MTBF of 115,385 hours, the
redundant configuration would have an MTBF of 173,078 hours. This is a common strategy used for power
supplies, fans, controllers, and many other items with high uptime requirements.
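The 50% figure can be derived from the reliability equation used earlier. The derivation below is a sketch under stated assumptions: two identical units that are both powered, independent and exponentially distributed failure times, perfect switchover, and no repair. The symbol $\lambda = 1/\mathrm{MTBF}$ is introduced here for brevity.

```latex
% The pair fails only when both units have failed.
R_{1/2}(t) = 1 - \left(1 - e^{-\lambda t}\right)^{2}
           = 2e^{-\lambda t} - e^{-2\lambda t}

% The mean time to failure is the integral of the reliability function.
\mathrm{MTBF}_{1/2} = \int_{0}^{\infty} R_{1/2}(t)\,dt
                    = \frac{2}{\lambda} - \frac{1}{2\lambda}
                    = \frac{3}{2\lambda}
                    = 1.5 \times \mathrm{MTBF}

% With MTBF = 115,385 hours:
\mathrm{MTBF}_{1/2} = 1.5 \times 115{,}385 \approx 173{,}078 \text{ hours}
```

If any of these assumptions changes, for example a standby unit that is unpowered until needed, the multiplier differs from 1.5.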
A redundant configuration with one SBC failure would be expected to continue operation for an
extended period of time before a failure of the redundant SBC would render it inoperable. However, in
reality, repair of the failed SBC component would be likely to occur before the second SBC failed. In
situations like this, repair of failed components must also be taken into account when looking at
overall system uptime. In these cases, the measurement of availability is often used.
Availability
While reliability is the probability that a system operates without failure over a given
period, availability can be thought of as the percentage of time a system is available for use.
Availability therefore takes into account both the failures and the repairs that contribute to
non-operational time.
Availability is usually expressed as a percentage, given by the equation:
$$ \text{Availability (\%)} = \frac{\mathrm{MTBF}_{\mathrm{sys}}}{\mathrm{MTBF}_{\mathrm{sys}} + \mathrm{MTTR}_{\mathrm{sys}}} \times 100 $$
Where
MTBF is Mean Time Between Failures, as previously described
MTTR is Mean Time To Repair
MTTR is a measure of serviceability that represents the mean downtime required to perform
repairs and maintenance. This number should reflect all activities that affect the mean time required to
restore service operation, such as dispatching service personnel, waiting for replacement parts, fault
isolation, and replacing faulty components.
From the equation above, it is apparent that as MTTR approaches 0, availability approaches
100%, regardless of the system MTBF. This is why engineers focus on technologies such as redundancy and
hot-swap repairs (where a repair can be done "on-line" while the system is still operational) to maintain
a high level of system availability.
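The effect of repair time on availability can be seen numerically. In the sketch below, the MTBF is the 115,385-hour SBC figure calculated earlier, while the MTTR values are hypothetical and chosen only to show the trend.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is available for use."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


SBC_MTBF = 115_385  # hours, from the SBC example in the text

# Hypothetical repair times in hours; not taken from the source.
for mttr in (24, 4, 1, 0.25):
    print(f"MTTR {mttr:>5} h -> availability "
          f"{availability(SBC_MTBF, mttr):.5%}")
```

Shorter repair times push availability toward 100% while the MTBF stays the same, which is the point made above about hot-swap repairs.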
A term commonly associated with high-availability systems is "five nines," which refers to
an availability of 99.999%. Five nines is often considered the minimum acceptable availability for
telecom systems. It translates to 5.26 minutes of downtime per year.
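The downtime figure follows directly from the availability percentage. The sketch below converts availability into minutes of downtime per year, and also rearranges the availability equation to give the longest mean repair time that still meets a target. Function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def max_mttr_hours(mtbf_hours, availability_percent):
    """Largest MTTR that still meets the target, obtained by solving
    A = MTBF / (MTBF + MTTR) for MTTR."""
    a = availability_percent / 100
    return mtbf_hours * (1 - a) / a


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")

# For the 115,385-hour SBC MTBF calculated in the text:
print(f"Max MTTR for five nines: {max_mttr_hours(115_385, 99.999):.2f} hours")
```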
Here are some examples of how Versatel Networks' EdgeIQ Intelligent Media Gateway can help
meet the five nines objective:
- No single point of failure within an EdgeIQ Intelligent Media Gateway.
- No impact on traffic when switching between redundant components.
- Minimal use of shared resources.
- All replaceable parts are hot-swappable.
- Automatic component diagnostics or Built In Test Equipment (BITE).
- Real-time diagnostic reporting.
Operation and Maintenance
Along with evaluating the measurable values that affect your overall business goal of
increased profitability, there are other factors to consider. For example, the ease of reporting
failures to service personnel and of tracking repair progress is critical for minimizing service response
time, and therefore ultimately affects system availability. Similarly, the accessibility of repair crews
affects availability. It is also important to track information so that personnel can become more
knowledgeable at problem identification and resolution.
For a real-world example of ensuring smooth operation and maintenance, the following list
details some of the attributes of the Versatel Networks EdgeIQ Intelligent Media Gateway that allow for
improved operations and maintenance to ensure high availability:
- Easy installation of components.
- Use of COTS (Commercial Off-The-Shelf) components instead of proprietary designs to benefit from
vendor improvement in CPU price/performance/reliability.
- Fully featured OAMP (Operations, Administration, Maintenance, and Provisioning) such as remote
access, multi-user concurrent access, and an application program interface (API).
- Extensive fault detection systems including the use of heart beat mechanisms, built-in tests, and
diagnostics.
- Flexible fault reporting.
- Performance reporting.
- Emergency technical support 24/7.
Physical
Lastly, physical system requirements must be taken into account in the telecommunications
industry to ensure that equipment operates reliably and safely in a central office environment, without
adverse effect on network operation. To this end, service providers often rely on two key Telcordia
documents that establish the physical/environmental and electromagnetic/electrical safety requirements
that network equipment must meet:
- GR-63-CORE "NEBS- Network Equipment Building System Generic Requirement"
- GR-1089-CORE "Electromagnetic Compatibility & Electrical Safety - Generic Criteria for Network
Telecommunications Equipment"
Meeting these standards must be a diligent effort. NEBS compliance must be taken into account
from day one of design, and must be an established objective throughout the product lifecycle. Additionally,
OEM components must be selected based on NEBS compliance. Testing beyond the NEBS requirements often will
help to ensure that NEBS compliance is achieved.
For details on how Versatel Networks achieves NEBS compliance, please refer to Table D at the
end of this document.
Summary
Service providers in the process of selecting a platform on which to build new and existing
services should consider some key questions:
- What elements of the system are redundant?
- Are all critical operating elements covered by redundancy?
- Are all critical operating elements repairable without powering down the system?
- What are my options in selecting a redundant versus non-redundant system?
- What are my options in converting a non-redundant system into a redundant system?
- Are there service provisioning or scheduled maintenance tasks that require service outages in
order to perform them?
- Do software upgrades require a service outage?
- What is the service impact when adding more features to the system?
Answers to these questions and the resulting decisions will impact the reliability and
availability of equipment. By factoring in these important measurements, service providers will help to
ensure key business objectives are met.
By selecting Carrier Grade, NEBS-compliant equipment, service providers can be assured that
service outages are minimized. Versatel Networks' EdgeIQ Intelligent Media Gateway platform is an example
of equipment that meets these goals and ultimately maximizes profitability and ensures success.
Notes
1. MTBF is the average time expected between failures of a repairable component over some specified
time period. MTTF is the average time expected to the failure of a non-repairable component. However,
to avoid confusion, MTBF is often used for both cases. This document adopts this convention.
2. Telcordia was formerly known as Bellcore. Both terms are commonly used.
3. This equation assumes an exponential distribution for the times of failures.
| Component |
Certifications |
IQ4000 IQ1500 |
NEBS: Level 3 tested per Telcordia SR-3580
EMC: FCC Part 15 Class A, EN55022 Class A, AS3548 Class A,
VCCI Class A, CNS13438
Safety: UL60950, CAN/CSA C22.2 No. 60950- 00, EN60950,
CSA C/US and CE Marks, AS3260
NEBS: Designed to NEBS Level 3 per Telcordia SR-3580
ETSI EN 300-019-2-1
ETSI EN 300-019-2-2
ETSI EN 300-019-2-3
EN 55022 Class B
EN 61000-6-2 (EN55024)
EN 60950 / (c)UL 60950 |
| T1 Card |
FCC Part 68 and FCC 47 Part 15
CS03
IEC 60950
GR-1089 Class A and Class B |
| E1 Card |
European Regulatory Certification CE 2002
Radio and Telecom Terminal Equipment Directive 99/5/EEC
Low Voltage Directive 73/23/EEC
IEC 60950 1999 3rd edition
EN 60950 2000 3rd edition
Electromagnetic Compatibility Directive 89/336/EEC
Immunity: EN 55024 1998
Emission: EN 55022 1998 Class B, EN61000-3-2 1995,
EN61000-3-3 1995
Amending Directive 93/68/EEC |
| VoIP Card |
FCC 47 Part 15 Class B (emission)
Low Voltage Directive 73/23/EEC
. IEC 60950 1999 3rd edition
. EN 60950 2000 3rd edition
EN 55022 for CE mark (Emission)
EN 55024 for CE mark (Immunity)
CSA C22.2 no 950 for Canada & US (safety) |
Table D: Versatel Networks' NEBS Compliance Listing
About Relex Software Corporation
Relex Software Corporation is a world leader in reliability analysis software. Its products are used
by thousands of engineers in a variety of businesses around the globe. In business since 1986, Relex Software
Corporation produces a superior line of high-quality software tools for reliability
analysis. Long-recognized for their user-friendly, state-of-the-art features, the modular tools in the Relex
Reliability Software Suite include an intuitive graphical user interface, support for scientific graphical charts,
an enhanced CAD interface, visual system modeling with redundancy support, completely customizable output reports,
extensive parts libraries, and a comprehensive online help system. For more information on Relex Software
Corporation, an ISO-9001 and TickIT 2000 certified company, call 724.836.8800 or visit
www.relexsoftware.com.
Copyright © 2008. Versatel Networks and Relex Software Corporation.