Wednesday, May 4, 2016

SEVEN REASONS FOR DATA CENTER FAILURE


Improper System Authorization
In a data center environment, very few administrators (if any) should have full, unrestricted access to every system. Access should be tightly regulated.


Ineffective Fallback Procedures
One major step that is often ignored when planning maintenance windows is the fallback procedure. Usually, the documented process has not been consistently vetted and does not fully revert every change to its original state.
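As a rough illustration of a fallback that actually reverts everything, the sketch below pairs each change with its own revert action and undoes completed changes in reverse order when a later change fails. The function name `run_with_fallback`, the step names, and the in-memory `state` dictionary are all hypothetical; a real maintenance window would act on real devices, and this sketch assumes a failed step leaves nothing half-applied.

```python
from typing import Callable, List, Tuple

# Each step is (name, apply, revert). All names here are illustrative only.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_fallback(steps: List[Step]) -> bool:
    """Apply steps in order; on failure, revert completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        name, apply, _ = step
        try:
            apply()
            done.append(step)
        except Exception as exc:
            print(f"step {name!r} failed: {exc}; falling back")
            # Undo the most recent change first so dependencies unwind cleanly.
            for prev_name, _, revert in reversed(done):
                revert()
                print(f"reverted {prev_name!r}")
            return False
    return True


# Hypothetical example: the second change fails, so the first is rolled back.
state = {"vlan": 10, "mtu": 1500}


def set_vlan() -> None:
    state["vlan"] = 20


def unset_vlan() -> None:
    state["vlan"] = 10


def set_mtu() -> None:
    raise RuntimeError("simulated failure")


def unset_mtu() -> None:
    state["mtu"] = 1500


succeeded = run_with_fallback([
    ("change vlan", set_vlan, unset_vlan),
    ("change mtu", set_mtu, unset_mtu),
])
```

Writing the revert action next to each change makes it harder to document a change without also documenting how to undo it, and the fallback path can be rehearsed before the window opens.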


Making Too Many Changes
Administrators should avoid making too many changes during a single maintenance window, as this can be very taxing. When administrators are under pressure to complete a large number of tasks in a short period, mistakes are bound to occur. In addition, when many changes are made in the same time frame, troubleshooting post-change problems becomes far more difficult.


Insufficient, Old, Or Misconfigured Backup Power
Power failure is the best-known cause of data center downtime. Power outages happen all the time, so redundant power sources such as batteries and/or generators are needed as a backup. The challenge is that batteries are sometimes not replaced in a timely manner, generators are not tested, and power failure tests are not performed. Any of these oversights can leave redundant power unavailable when it is needed.
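The three oversights named above lend themselves to a simple scheduled check. The sketch below is one possible way to flag them. The record fields, the function name `overdue_items`, and especially the three intervals are assumptions made for illustration, not recommended values; actual limits should come from the equipment vendor and site policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class BackupPowerRecord:
    """Hypothetical maintenance record for one backup power system."""
    battery_installed: date
    last_generator_test: date
    last_power_failure_test: date


# Assumed intervals, for illustration only.
BATTERY_MAX_AGE = timedelta(days=4 * 365)
GENERATOR_TEST_INTERVAL = timedelta(days=30)
POWER_FAILURE_TEST_INTERVAL = timedelta(days=180)


def overdue_items(rec: BackupPowerRecord, today: date) -> List[str]:
    """Return a list of maintenance items that are past their interval."""
    findings: List[str] = []
    if today - rec.battery_installed > BATTERY_MAX_AGE:
        findings.append("battery replacement overdue")
    if today - rec.last_generator_test > GENERATOR_TEST_INTERVAL:
        findings.append("generator test overdue")
    if today - rec.last_power_failure_test > POWER_FAILURE_TEST_INTERVAL:
        findings.append("power failure test overdue")
    return findings


# Hypothetical record: old batteries, recent generator test, stale outage test.
record = BackupPowerRecord(
    battery_installed=date(2011, 1, 1),
    last_generator_test=date(2016, 4, 20),
    last_power_failure_test=date(2015, 6, 1),
)
findings = overdue_items(record, today=date(2016, 5, 4))
```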


Cooling Failures
Data centers generate a huge amount of heat, which is why cooling is so important. It is therefore essential that temperature sensor readings and alerts are sent to administrators, leaving sufficient time to implement backup cooling procedures.
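A minimal sketch of such alerting might use two thresholds, so that a warning goes out well before temperatures become critical. The sensor names, the function name `temperature_alerts`, and the threshold values below are assumptions for illustration; appropriate limits depend on the equipment and the facility.

```python
from typing import Dict, List

# Assumed thresholds in degrees Celsius, for illustration only.
WARN_C = 27.0
CRITICAL_C = 32.0


def temperature_alerts(
    readings: Dict[str, float],
    warn: float = WARN_C,
    critical: float = CRITICAL_C,
) -> List[str]:
    """Return one alert message per sensor at or above a threshold."""
    alerts: List[str] = []
    for sensor, temp in sorted(readings.items()):
        if temp >= critical:
            alerts.append(f"CRITICAL {sensor}: {temp:.1f} C")
        elif temp >= warn:
            # The warning tier is what buys time to start backup cooling.
            alerts.append(f"WARNING {sensor}: {temp:.1f} C")
    return alerts


# Hypothetical readings keyed by sensor name.
alerts = temperature_alerts({"rack-a1": 24.5, "rack-b2": 28.0, "rack-c3": 33.2})
```

In practice, the returned messages would be handed to whatever paging or e-mail system the administrators already use.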

Changes Outside Maintenance Windows
Sometimes a request comes in to make a slight change to a server or piece of network equipment. Although data center protocol technically requires this change request to pass through the change-control committee, people feel it can easily be made outside the formal change-control process and maintenance window. That is true much of the time, but a minor change can still have unforeseen implications.


Hanging Onto Legacy Hardware
Hardware is likely to fail at some point or another. In fact, the longer hardware is kept, the more likely it is to fail. This is common knowledge, yet critical applications still run on very old hardware. These problems usually result from the lack of a structured, comprehensive plan for migrating to a new hardware or software platform.
