Thursday, July 28, 2016

"Five Nines" Now is Effectively Impossible for Consumer Web Experience

It probably goes without saying that the Internet is a complex system, with lots of servers, transmission paths, networks, devices and software all working together to create a complete value chain.

And since the availability of any complex system is the product of the availabilities of all the elements that can fail, it should not come as a surprise that a complete end-to-end user experience is not “five nines.”

Consider a 24×7 e-commerce site with lots of single points of failure. In this example (see the table below), no single part of the delivery chain has availability of more than 99.99 percent, and some portions have availability as low as 85 percent.

The expected availability of the site would be 85%*90%*99.9%*98%*85%*99%*99.99%*95%, or  59.87 percent. Redundancy is the way performance typically is enhanced at a data center or on a transmission network.
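The multiplication above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the failures of the individual components are independent (which is what multiplying the percentages implies); the names `components` and `series_availability` are illustrative only.

```python
from math import prod

# Availability figures from the example above, expressed as fractions.
components = {
    "Web": 0.85,
    "Application": 0.90,
    "Database": 0.999,
    "DNS": 0.98,
    "Firewall": 0.85,
    "Switch": 0.99,
    "Data Center": 0.9999,
    "ISP": 0.95,
}

def series_availability(availabilities):
    """Availability of a chain in which every element must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)

site = series_availability(components.values())
print(f"End-to-end availability: {site:.2%}")  # prints 59.87%
```

Because every factor is below 1, the result is always lower than the weakest single element, here the 85 percent web and firewall tiers.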

For consumers, “hot” redundancy generally is not possible for devices. One can keep spare devices around, but manual restoration (switch to a different device, power it up) is required. Most often, “rebooting” is the restoration protocol, just as “I will call you back” is the restoration protocol for a dropped mobile call.

Component      Availability
Web            85%
Application    90%
Database       99.9%
DNS            98%
Firewall       85%
Switch         99%
Data Center    99.99%
ISP            95%

Some of us are old enough to remember joking about “rebooting your TV,” a quip meant to suggest what would happen as TV signal formats switched from analog to digital, from standard to high-definition formats, from playback devices to Internet-connected devices.

Of course, we sometimes find we actually must reboot our TVs, set-top decoders, Wi-Fi and other access routers, so the quip was not without foundation.

In the past, some might have contrasted the availability (uptime) of televisions with that of computing devices. There are many issues.

Software with lots of code, and little fault isolation, can lead to some amount of crashing, and therefore lower availability. Drivers are known to cause faults.

One study of server availability found that 58 percent of IBM servers operated at 99.999 percent availability, compared with 46 percent of Hewlett-Packard servers and 40 percent of Oracle servers. Such issues normally are dealt with by building in automatic failover to redundant machines.

But many servers have “two nines” (99 percent) availability, off the shelf.

Still, although a 79 percent majority of corporations now require a minimum of 99.99 percent uptime or better for mission critical hardware, operating systems and main line of business applications, that target obviously is less than the “five nines” standard for telecom services.

On the other hand, IBM “fault tolerant” servers are supposed to operate at “six nines” of availability, higher than the telecom standard.

Whether software is as reliable as, or less reliable than, a “five nines” network is debatable. But most would agree that software and hardware (without redundancy) operate at less than 99.999 percent availability.

There is a big difference between 99 percent availability (88 hours of downtime per year) and 99.9 percent availability (8.8 hours of downtime per year), or between 99.99 percent availability (53 minutes each year) and 99.999 percent availability (a bit more than five minutes a year).
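Those downtime figures follow directly from the availability percentages. The sketch below shows the conversion, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def downtime_hours_per_year(availability_percent):
    """Expected annual downtime, in hours, for a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each added “nine” cuts the allowed downtime by a factor of ten, which is why the step from four nines to five nines leaves only about five minutes a year.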

It is a myth that “five nines” remains the operational definition of availability for modern IP-based systems supporting voice, web and other over-the-top applications, even if service providers can produce reams of data proving that their core networks actually perform at that level.

In other words, even if networks are highly reliable, human beings use devices and applications that never work close to “five nines” in terms of availability.

The fundamental problem is that end user appliances, applications and operating systems cannot reach “five nines” levels of performance. And the whole calculation of availability is based on concatenated chains of devices. Element A might operate at “five nines.”

But, without redundancy, the availability of any transmission chain with three such elements would be calculated as 99.999 percent times 99.999 percent times 99.999 percent, or roughly 99.997 percent. By definition, the total chain involves the downtime caused by any single element in the chain.
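To make the three-element case concrete, here is a small sketch that computes the chain availability and the annual downtime it implies. It assumes three identical, independent elements and a 365-day year; `chain_availability` is an illustrative name.

```python
FIVE_NINES = 0.99999
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def chain_availability(element_availability, n_elements):
    """Availability of n identical elements in series (independent failures assumed)."""
    return element_availability ** n_elements

chain = chain_availability(FIVE_NINES, 3)
print(f"Three-element chain: {chain:.5%}")
print(f"Downtime: {(1 - chain) * MINUTES_PER_YEAR:.1f} minutes per year")
```

Under those assumptions the chain's downtime is roughly three times that of a single element, so even a chain built entirely from “five nines” parts falls short of “five nines.”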

Traditionally, telecom networks have considered 99.999 percent availability the standard for fixed network voice services.

These days, it is hard to find anyone arguing that actual end user application or service experience ever approaches “five nines.” The reason is that most of the applications people want access to on the Internet are processed in data centers whose servers cannot operate at five nines availability.

To cope with that issue, data centers use redundancy. In other words the issue is not how reliable any server is. The issue is how fast an entity can detect a fault and switch to a backup server.
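The effect of redundancy and of failover speed can be sketched with two standard reliability formulas. The numbers below (a fault every 1,000 hours, restoration in one hour or in 30 seconds) are hypothetical figures chosen only for illustration, and the model is idealized: it assumes independent failures and that failover always succeeds.

```python
def parallel_availability(availability, n_servers):
    """Probability that at least one of n redundant servers is up.

    Idealized: assumes independent failures and instant, perfect failover.
    """
    return 1 - (1 - availability) ** n_servers

def availability_from_failover(mtbf_hours, restore_seconds):
    """Steady-state availability: uptime / (uptime + restoration time)."""
    restore_hours = restore_seconds / 3600
    return mtbf_hours / (mtbf_hours + restore_hours)

# Hypothetical figures, not taken from any study cited in this post.
print(f"Two 99% servers in parallel: {parallel_availability(0.99, 2):.4%}")
for restore_seconds in (3600, 30):
    a = availability_from_failover(mtbf_hours=1000, restore_seconds=restore_seconds)
    print(f"Fault every 1,000 hours, restored in {restore_seconds} s: {a:.5%}")
```

With these assumed inputs, the same fault rate yields about three nines when restoration takes an hour and better than five nines when a backup takes over within 30 seconds, which is the sense in which detection and switchover speed matter more than the reliability of any one server.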

That same approach (redundancy) is used by transport networks and business access networks.

But many apps still are delivered over networks that are unmanaged, even if availability is “five nines” on the part of the delivery chain that any single network can control.

A new way of thinking about reliability or availability is that modern application delivery systems cannot meet the old “five nines” standard, end to end, because the end-to-end systems are going to crash often enough that five nines is not possible, even when there are redundant “five nines” access and transport systems.

In other words, loss of local power alone is a threat to five nines for end user experience. Operating systems crash, access to websites hiccups, mobile phone calls drop. Devices run out of battery power.

Wi-Fi is the typical device connection in homes and offices, and no matter how well other elements and systems work, Wi-Fi operations alone would crash “five nines” performance, in terms of the actual experience of application and service availability.


The point is that “five nines” is a myth, when considered from the standpoint of a consumer end user of any Internet service or app, on any consumer device.
