Showing posts with label outage. Show all posts

Wednesday, September 29, 2010

82% of Enterprise Outages Caused by Power, Hardware or Telecom Service Failure

Loss of electrical power, hardware failure or loss of telecom service accounted for about 82 percent of the outages experienced by some 200 medium and large businesses over roughly the last year, CDW has found.

While 82 percent of the 200 businesses completing the survey felt confident that their IT resources could sustain disruptions and support operations effectively, 97 percent admitted network disruptions had detrimental effects on their businesses in the last year.

Also, about 1800 smaller businesses reported network disruption of four hours or more within the last year. CDW estimates that such network outages cost U.S. businesses $1.7 billion in lost profits last year.

"The survey confirms that while many businesses believe they are prepared for an unplanned network disruption, many are not – and yet the three most common causes of IT outages are addressable," said Norm Lillis, CDW vice president, system solutions. Power loss ranked as the top cause of business disruptions over the past year, with one third of businesses reporting it prompted their most recent disruption. Hardware failures caused 29 percent of network outages, followed by a loss of telecom services to facilities (21 percent).

The survey also revealed that businesses need to take advance preparation more seriously and support employees more effectively with network accessibility.

While 53 percent of respondents said employees are instructed or given the option to work from home when a foreseeable network disruption approaches (a weather event, for example), only a third of businesses activate standby communications and network systems to support increased remote access when warned of such an event.

In fact, while respondents reported that, on average, 44 percent of the workforce normally has telework options, they said that only 39 percent of employees could telework during their most recent network outage.

link to full study

Sunday, November 1, 2009

The Way to Deal with an Outage

Communications network outages occur more often than most users realize. When they do, the only way to respond, beyond restoring service as rapidly as possible, is to apologize. Quite often, perhaps most of the time, it also helps to explain what happened.

Junction Networks, for example, had an unexpected outage Oct. 26, 2009 for about an hour and a half, and the company's apology and explanation is a good example of what to do when the inevitable outage does occur. First, apologize.

"We do sincerely apologize for this service interruption. We know that you have many choices for your phone service, and we deeply appreciate your patience and understanding during yesterday's interruption of service. Below are the full details of the service issue."

Then remind users where they can get information if an outage ever occurs again.

"One of the first things we do when a service issue occurs is update our Network Alert Blog and Twitter page with as much information as we have at that time. We then post comments to that original post as we learn more. Our Network Alert blog is here: http://www.junctionnetworks.com/blog/category/network-alerts"

"Our Twitter account is: http://www.twitter.com/onsip."

Junction Networks then provides a detailed description of its normal maintenance activities, which can cause "planned outages" with an intentional shift to backup systems.

"As a rule, Junction Networks maintains three different types of maintenance windows:
1.) Weekend - early morning: The maintenance performed will produce a service disruption and could affect multiple systems.
2.) Weekday - early morning: The maintenance performed may produce a service disruption, but is isolated to a single system.
3.) Intra-day: The work performed should not affect our customers.
All maintenance, even that which is known to cause a service disruption, is not expected to cause a disruption for more than a few fractions of a second. For anything that would cause a more serious disruption (one second or more), backup services are swapped in to take the place of the maintenance system."

The company then explains why the specific Oct. 26 outage happened, in some detail, and then the remedies it applied.

Nobody likes outages, but they are a fact of life. If you think about it, there is a very simple reason. Consider today's electronic devices, designed to work with anywhere from minutes to hours to several days' worth of "outages" each year. If you've ever had to reboot a device, that's an outage. If you've ever had software "hang," requiring a reboot, that's an outage.

Now imagine the number of normally reliable devices that have to be connected in series to complete any point-to-point communications link. That's the number of applications running, on the servers, switches, routers and gateways, on the active opto-electronics in all networks that must be connected for any single point-to-point session to occur.

Don't forget the power supplies, power grid, air conditioners and potential accidents that can take a session out. If a backhoe cuts an optical line, you get an outage. If a car knocks down a telephone pole, you can get an outage.

Now remember your mathematics. Any number less than "one," when multiplied by any other number less than "one," necessarily results in a number that is smaller than the original quantity. In other words, as one concatenates many devices, each individually quite reliable, the reliability or availability of the whole system gets worse.

A single device with 99-percent reliability is expected to be down for about 3 days, 15 hours and 40 minutes every year. But that's just one device. If any session has 50 possible devices in series, each with that same 99-percent reliability, the system as a whole is only as reliable as the product of the availabilities of each discrete device.

In other words, you have to multiply a number less than "one" by 49 other numbers, each less than "one," to determine overall system reliability.

As an example, consider a system of just 12 devices in series, each 99.99 percent reliable and each expected to be down about 52 minutes, 36 seconds each year. The whole network would then be expected to be down about 10.5 hours each year.

Networks with less reliability than 99.99 percent or with more discrete elements will fail for longer periods of time.
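
The arithmetic above can be checked with a minimal sketch. It assumes device failures are independent, that every device sits in series, and that a year averages 365.25 days; the function names are illustrative only and do not come from any library.

```python
# Availability of devices connected in series, assuming independent failures.
HOURS_PER_YEAR = 365.25 * 24  # assumed average year length: 8,766 hours


def series_availability(availabilities):
    """Multiply the availability of every device in the chain."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# One device at 99.99 percent: a little under 53 minutes a year.
single_minutes = annual_downtime_hours(0.9999) * 60

# Twelve such devices in series: roughly 10.5 hours a year.
chain12_hours = annual_downtime_hours(series_availability([0.9999] * 12))

# Fifty devices at 99 percent each: overall availability near 60 percent.
chain50 = series_availability([0.99] * 50)

print(f"single device: {single_minutes:.1f} minutes/year")
print(f"12 in series:  {chain12_hours:.1f} hours/year")
print(f"50 at 99%:     {chain50:.1%} available")
```

Real networks add redundant paths, which this series-only sketch ignores, so it should be read as a worst-case illustration rather than a model of any particular network.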

The point is that outages can be minimized, but not prevented entirely. Knowing that, one might as well have a process in place for the times when service is disrupted.




Thursday, October 15, 2009

T-Mobile USA Sidekick Data Nearly Fully Recovered

T-Mobile USA and Microsoft now say they have “recovered most, if not all, customer data for those Sidekick customers whose data was affected by the recent outage,” says Roz Ho, Microsoft corporate VP.

"We plan to begin restoring users’ personal data as soon as possible, starting with personal contacts, after we have validated the data and our restoration plan," Ho says. "We will then continue to work around the clock to restore data to all affected users, including calendar, notes, tasks, photographs and high scores, as quickly as possible."

"We now believe that data loss affected a minority of Sidekick users," Ho added. Despite that good news, two class action lawsuits have been filed against T-Mobile USA, alleging that the company misled consumers into believing that their data was more secure than was the case.

Tuesday, February 12, 2008

Slow Email? BlackBerry Outage

Research In Motion Ltd. says an outage left users in North America without access to their BlackBerry email service on Monday, beginning about 3:30 p.m. Eastern Standard Time and lasting about three hours.

RIM says no messages were lost during the incident, which caused intermittent delivery delays. No explanation for the outage has been given.

Outages of this sort are the reason many of us are giving more thought to backup and redundancy strategies. On a recent business trip, for the first time in my life, I accidentally left my laptop at home, and was going to be gone for 14 days. True, I had the BlackBerry and another mobile as well.

But in my line of work, access to the Web is arguably more important than either of those two sorts of devices, as important as they are. Because of Google Documents & Spreadsheets and Google Browser Sync, I was able to keep working using public terminals and loaned machines, with access to Microsoft Office.

I also learned to live without access to Outlook for a bit. The BlackBerry helped, of course. The lasting change so far is that I have kept using Google Documents more than I have in the past. That's why sampling is so important. Behavior can change.

Monday, February 4, 2008

Another Cable Cut in Persian Gulf


What are the odds four undersea cables are cut in a single week? Whatever those odds, it has happened. First two cables snap off Egypt. Then a separate cable in the Persian Gulf, and now yet another Middle East cable.

In the latest incident, an undersea telecoms cable linking Qatar to the United Arab Emirates was damaged, disrupting services, telecommunications provider Qtel has reported.
The cable was damaged between the Qatari island of Haloul and the UAE island of Das. The cause of the damage is not yet known.

Qtel's loss of capacity seems to be disrupting voice capacity more than Internet services. Qtel says it was operating at 40 percent over the weekend because alternative cables exist. Nevertheless, disruption to Internet and telephone services in the Gulf state is likely to continue for another 10 days or so.

Not since the December 2006 earthquake off Taiwan have so many cables been taken out of service almost at once.

Saturday, February 2, 2008

Cable Cuts Highlight Opportunity


Several recent undersea cable cuts that interfered with Internet connections in India and the Middle East might ultimately focus attention on other ways to get call center and business process outsourcing handled. By some reports 20 to 25 percent of outsourced call centers initially were unable to do any work at all while many had only 50 percent of capacity once restoration work began and traffic was rerouted.

At some point, at least some providers and some customers will conclude that if the price is equivalent, it makes more sense to base call centers and other business process outsourcing operations on shore. The issue is how to operate them more efficiently.

Perhaps there is a role here for IP voice interconnections. Though other costs are less malleable, it ought to be possible to create highly-distributed call overflow mechanisms using "voice over private network" IP connections in ways that allow economical call center operations in lots of rural areas that are more protected from cable cuts.

Friday, February 1, 2008

FLAG Telecom Loses Undersea Cable

As a reminder of how important undersea cable redundancy is, FLAG Telecom has lost a cable of its own in the Persian Gulf. FLAG, a wholly-owned subsidiary of India's number two mobile operator Reliance Communications, says its Falcon cable was reported cut at 0559 GMT, 56 km (35 miles) from Dubai on a segment between the United Arab Emirates and Oman.

Thursday, January 31, 2008

at&t Wireless Outage

In case you are having trouble sending and receiving email on your at&t Wireless smart phone, or are unable to get connected using your data card, there is a wireless network outage affecting at&t Wireless users in the Midwest and Southeast.

Taiwan Earthquake Just a Year Ago

And speaking of cable cuts that massively disrupt global communications, it was just over a year ago, in December 2006, when an earthquake took out a number of Pacific cables.

Those cable cuts took out much voice and Internet communications in many parts of Asia, as well as 60 percent of capacity between Asia and the United States.

The 2006 Hengchun earthquake occurred on December 26, 2006 at 12:25 UTC (20:25 local time), with an epicenter off the southwest coast of Taiwan, approximately 22.8 km west southwest of Hengchun, Pingtung County, Taiwan, with an exact hypocenter 21.9 km deep in the Luzon Strait (21.89° N 120.56° E), which connects the South China Sea with the Philippine Sea.

Cable Cuts Not That Rare


In late 2000, Telstra, Australia's biggest Internet service provider, had a cable cut of its own on Nov. 19, when its Internet backbone cable, sitting in less than 100 feet of seawater about 40 miles off Singapore, was damaged by unknown causes.

Telstra at that time relied on the cable, known as SEA-ME-WE 3 (for Southeast Asia, Middle East and Western Europe) for more than 60 percent of its Internet transmission capacity.


About 23,600 miles long, the cable connected 33 countries, touching places as diverse as Singapore, Malaysia, Thailand, India, Saudi Arabia, Egypt, Djibouti, Turkey, Greece, Italy, Portugal, France and the U.K.

Cable Cut Disrupts India Call Centers

Cuts to two undersea Internet cables off Egypt's coast now are disrupting call centers in India, the Wall Street Journal reports. Reportedly, about half of India's Internet bandwidth now is disrupted, and voice traffic to the United States and Europe also is affected.

It could take a week or two to fix the cables, in part because of bad weather, some executives say.

Users in India, Egypt, Qatar, Saudi Arabia, the United Arab Emirates, Kuwait and Bahrain are affected by the outages.

Observers think an anchor might have snagged the cables. At least that's what Flag Telecom Group Ltd. now believes. The incident took place 8.3 kilometers (5.2 miles) from Alexandria beach in northern Egypt.

Emirates Integrated Telecommunications Co., the United Arab Emirates' second-biggest mobile-phone company, is working with the cable operators, Flag Telecom and SEA-ME-WE 4, to find out why the cables were cut and to determine when service can be restored.

The outage is a reminder that physical infrastructure, however mundane, underlies all of modern computing and communications. It's also a reminder that if your business or life depends on Internet-based communications, commerce and content, you need a diversity strategy. It costs more money. But so does inability to do your work.

Monday, December 3, 2007

at&t Internet Outage in former BellSouth Areas

Users are reporting outages in the former BellSouth territory on Monday Dec. 3, apparently caused by a Domain Name Server issue. IP services are really useful. They just aren't generally as reliable as the old public switched telephone network, though. These days, end users have to spend at least some time, and some money, creating backup systems for their crucial communications and information services.

Outage reports are posted from Georgia, Florida, Louisiana, South Carolina and Mississippi.

Tuesday, October 9, 2007

T-Mobile Goes Down


It wasn't your imagination: if you use T-Mobile data services, you had no connectivity for as much as four hours on Tuesday. Personally, I thought it was the coverage inside the convention center I am working inside of. Nope. There was an outage. I thought it was the BlackBerry server at one point. But no.

The latest outage just illustrates an important element of digital life: you really can't trust any service or application to remain "always available." Everything is going to crash, or be unusable, for some amount of time. So either you get used to the idea of periodic outages or, if that isn't satisfactory, you are going to have to back up all your mission-critical services, devices, data or applications. Personally, I don't worry too much about application diversity, though most of us have some of that. I do make sure broadband and mobile access, as well as computing devices, are redundant.

Sunday, September 9, 2007

Another Outage for BlackBerry


U.S. Internet-based users of the Research in Motion BlackBerry service might have noticed, and might still be noticing, odd behavior from their handhelds. Like no mail for parts of Friday, and then huge dumps of what you thought was archived mail thereafter. If so, it might be because RIM had another outage of some significance last Friday, Sept. 7. That's two significant outages this year.

All of us may someday lament the fact that no service we now enjoy and rely upon has the ruggedness and uptime of the old public switched network.

Wednesday, August 22, 2007

GrandCentral Number Porting Affects 434


GrandCentral has had a few number porting issues of its own. CEO Craig Walker says GrandCentral had issues with 434 customers whose numbers could not be seamlessly transitioned from one underlying supplier to another.

What happened is that a supplier of numbers and connections "sent us a notice that they’d be exiting certain markets and disconnecting some phone numbers in 30 days," says Walker. GrandCentral immediately began porting the numbers to a larger carrier partner. But 434 couldn't transparently be moved.

Those users had to be assigned new telephone numbers in the same area codes they already were using. Going forward, GrandCentral is emphasizing working with large, reliable providers committed to providing these services long term.

"Although this affected only 15 of the local areas where we offer services, out of nearly 8,000, we take this matter seriously and have done everything to make the disruptions as limited as possible," says Walker.

That is the way to handle an unplanned outage.

Wells Fargo Outage Yesterday

Wells Fargo, the fifth-largest U.S. bank, had an outage of its own on Tuesday, which knocked out the company's Internet, telephone and ATM banking services for at least an hour and 40 minutes. Five nines? Not likely.

U.K. VoIP Provider Also Has Outage


U.K. VoIP provider VoIP.co.uk had an outage of its own last Monday. Users could call other VoIP.co.uk users, but were unable to place or receive calls from users on the public telephone network. Service was out for the better part of a day.

Monday, August 20, 2007

Skype: The Ultimate Windows Externality


"On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption triggered by a massive restart of our users’ Windows-based computers across the globe within a very short time frame as they re-booted after receiving a routine set of patches through Windows Update," Skype says.

Not everybody buys that explanation. But, if true, it has to rank as the most massive, unexpected software interaction Windows ever has inadvertently caused.

The high number of restarts apparently caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction, Skype says. Some have argued that the outage proves peer-to-peer networks are inherently unstable.

It's hard to test that assertion since Skype uses a modified P2P architecture with a sign-in process that is more "client-server" and centralized than most other P2P networks.

Some think there was some sort of hacker attack, but Skype denies it. "We can confirm categorically that no malicious activities were attributed."

If the Microsoft routine updates were, in fact, contributory or causal, it would rank as the most significant network-wide interaction anybody ever has seen. Just another example of the way applications are reshaping the way global networks perform.

As some of you know, I have recently been dealing with interactions caused by a Vista upgrade, mostly of the "we don't talk to Vista" sort. I will say one thing, however. Vista seems to be much more robust than XP was about handling "hibernation" operations. XP used to become unstable after several hibernation operations, at least on my machines. I have not found that to be the case with Vista.

Friday, August 17, 2007

Skype Outage Not Over

Skype initially said its outage is over, but that clearly is not the case everywhere, and we are nearing 24 hours since the log-in problem began. Now Skype warns that the outage is likely to continue through Friday. My U.S. log-in still hangs.

The service had been sporadic but gradually improving during the business day in Asia on Friday, some report.

"There are about 2.5 million people logged in right now, where normally there would be over 8 million, and it's been going on and off every 10 minutes," says Mark Main, senior analyst at Ovum in London.

You may draw your own conclusions about which other application or service providers might benefit, but urges to gloat should generally be suppressed. Nobody whose service uses IP and the public networks is safe from outages or service disruptions.

That's why businesses and networks have redundancy. People who scream and yell about losing their service have only themselves to blame if they didn't build some level of diversity and redundancy even into their personal communications. Use Skype, other IM applications, mobiles, POTS-replacement VoIP, and POTS, email and anything else you can get your hands on. Some of us use multiple mobiles from different providers and multiple broadband providers. But never hang everything on any one service or provider, especially if your business depends on it. Personally, I wouldn't even hang my personal communications on a "single provider" strategy.

Thursday, August 16, 2007

And Cisco Goes Down, Also...

Cisco's main www.cisco.com page was offline at 11 a.m. Pacific Time on Aug. 8 and stayed offline for more than two and a half hours. It returned at about 1:45 p.m. The outage was an unintended byproduct of routine maintenance.
