With all the advancements we have made in technology, the human element still plays a critical role in its success, and with that comes the potential for human error. Mistakes are a part of business, relationships, and life. Your mistakes don’t define you; how you handle them does.
Mistakes are not only learning opportunities; they are also a chance to show how well you take responsibility, address the issue, and grow from it. Nobody likes it when you pass the buck, but when you own the mistake and take responsibility, people are more inclined to respect you.
In October 2016 we received an email from our partner MageMojo regarding a router issue that had caused 50% packet loss on its internal network. The email struck us with its honesty and its determination to resolve the issue, and the level of professionalism and commitment on display boosted our confidence in the partnership. We were so impressed that we decided to share their letter with you as a guide to writing a letter that owns up to your mistakes. The email was broken down into five sections:
- Summary of Events
First, it is important to give an overview of what occurred, what the problem was, and what was done to address it. The goal is to put things into context, NOT to shift the blame. Not every bad outcome is a mistake, and not every mistake is your responsibility, so it is okay to frame the events in a way that sheds new light on them. However, this should not read as a defense; be sure to include what your role was and to acknowledge the error.
Here’s what they wrote:
On October 26, 2016, at around 2 PM ET we received an alert that a WAN uplink in one of the core routers was down. We determined that the optic had failed and needed to be replaced. A technician was dispatched to replace the dead optic. At 4:05 PM ET, upon inserting the optic in the Secondary router, the Primary router panicked and rebooted into read-only mode. Our tech and network operators immediately restarted the Primary. The reboot took 8 minutes. By 4:15 PM ET the core routers had re-initialized and we were back online. Our monitoring systems cleared up except for a few servers.
We immediately noticed that our server for magemojo.com was experiencing about 50% packet loss on its internal network, and we continued to receive a few customer reports of problems. We scanned the internal network for all customer problems and isolated the packet loss to one /24 VLAN. We thought the problem was related to the abrupt failover of the core routers and decided it was best to switch back to the original Primary. At 6:40 PM ET we initiated the core router failover back to the previous Primary. The failover did not go smoothly and resulted in another reboot, which caused another 8-minute network disruption. The core routers returned with the old Secondary as Primary, and the problem remained.
We reviewed the configuration to ensure there were no configuration errors with the new Primary but did not find any. We ran diagnostics on all network hardware and interfaces to identify the problem and found nothing. We ran diagnostics a second time to check for anything we might have missed; again, we found no problems. At this point we started going through the changelog, working our way backward to look for any changes that could have caused the problem. We found a change from September that looked possibly related to the packet loss. We reverted the change, and the packet loss stopped at 9:10 PM ET. The core remains stable with the new Primary.
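It is worth pausing on how they narrowed roughly 50% packet loss down to a single /24 VLAN. The sweep described above can be approximated with a very small script: ping every host in the suspect subnet, report per-host loss, and look for where the lossy hosts cluster. The sketch below is an illustration only; the subnet, probe count, and sequential ping approach are assumptions, not MageMojo’s actual tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch: sweep a /24 and report per-host packet loss.
The subnet and probe count are assumptions for illustration only."""
import ipaddress
import subprocess

SUBNET = "10.0.42.0/24"   # hypothetical internal VLAN
PROBES = 10               # pings per host

def loss_percent(host: str) -> float:
    """Send PROBES ICMP echoes and return the reported packet-loss %."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(PROBES), host],
            capture_output=True, text=True, timeout=PROBES * 2,
        )
    except subprocess.TimeoutExpired:
        return 100.0  # ping never finished; treat the host as unreachable
    for line in result.stdout.splitlines():
        if "packet loss" in line:
            # e.g. "10 packets transmitted, 5 received, 50% packet loss"
            return float(line.split("%")[0].split()[-1])
    return 100.0

if __name__ == "__main__":
    # Sequential and slow, but fine for a one-off sweep of one subnet.
    for ip in ipaddress.ip_network(SUBNET).hosts():
        loss = loss_percent(str(ip))
        if loss > 0:
            print(f"{ip}: {loss:.0f}% loss")
```

A pattern like “every host in one VLAN shows roughly the same loss” is the kind of clustering that lets a team rule out individual servers and focus on the network path, which is exactly where MageMojo ended up.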
- The Nitty Gritty Details
Not everyone wants to know the nitty-gritty details about what happened, but some people do, and it is your responsibility to offer them. A more detailed version of the events shows that you are on top of things and taking action. It also helps pinpoint the exact issues: what your role was, what was justified, and what was out of your control.
Here’s what they wrote:
Our core network uses 2 x Cisco 6500E Series switches, each with its own Supervisor 2T XL. Both Sup2Ts are combined using Cisco’s Virtual Switching System (VSS) to create a single virtual switch in HA. Each rack has 2 x Cisco 3750X switches stacked for HA using StackWise, with 10G fiber uplinks to the core 6500s. Servers then have 2 x 1G uplinks in HA, one to each 3750X. Our network is fully HA all the way through, and at no point should a single networking device cause an outage. We have thoroughly tested our network HA and confirmed all failover scenarios worked perfectly. Why inserting an optic into a 6500E blade would cause the other switch, the Primary, to reboot is completely unexpected and unknown at this moment. Why a single switch rebooting would cause both to go down is unknown. Why the previous Primary failed to resume its role is also unknown. We have Cisco support researching these events with us.
The packet loss problem was challenging to figure out. First, it wasn’t clear what the common denominator was among the servers with packet loss: it happened on servers across all racks and all physical hardware. We thought for sure the problem was related to the abrupt switchover of the core. We focused our search on isolating the packet loss and narrowed it down to one particular VLAN. Why one VLAN would have packet loss, but not be completely down, was a big mystery. We ran thorough diagnostics and combed through every bit of hardware, routing tables, ARP tables, etc. We isolated the exact location where the problem was happening, but why remained unknown. Finally, we concluded the problem was not caused by the network disturbance earlier. That’s when we focused on the change log.
The change was related to an “ip redirects” statement in the config and how we use route-maps to keep internal traffic routing internally and external traffic routing through the F5 Viprion cluster. During tuning of the Sup2T CPU performance, this line changed for one particular VLAN. At the time it created no problems and packets routed correctly. However, after the core failover, the interfaces changed, and subsequently the F5 Viprion cluster could not consistently route all packets coming from that internal VLAN back to the internal network interface from which they originated.
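One detail that stands out in this section is the claim that every failover scenario had been tested, set against two unplanned 8-minute outages. A simple way to put a number on a planned failover test is to ping a gateway behind the HA pair continuously and record the longest gap without a reply. The sketch below assumes a hypothetical gateway address and test duration; it is not MageMojo’s test harness.

```python
#!/usr/bin/env python3
"""Rough sketch of a failover check: ping a gateway behind the HA pair
once per second and report the longest stretch without a reply.
The address and duration are assumptions for illustration only."""
import subprocess
import time

GATEWAY = "10.0.0.1"      # hypothetical VLAN gateway
DURATION_S = 120          # run the check for two minutes

def longest_gap() -> float:
    """Return the longest interval (in seconds) with no successful ping."""
    last_reply = time.monotonic()
    worst = 0.0
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        ok = subprocess.run(
            ["ping", "-c", "1", GATEWAY],
            capture_output=True,
        ).returncode == 0
        now = time.monotonic()
        if ok:
            worst = max(worst, now - last_reply)
            last_reply = now
        time.sleep(1)
    # If the gateway never answered, the gap is the whole test window.
    return max(worst, time.monotonic() - last_reply)

if __name__ == "__main__":
    print(f"Longest gap without a reply: {longest_gap():.1f}s")
```

Run during a maintenance window, a result like “2.3 seconds” versus “8 minutes” is the difference between HA working as designed and the kind of surprise this letter describes.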
- Where We Went Wrong
Now that all the details about the event have been explained, it is time to step up, admit your mistake, explain what you could have done differently, and clarify whether you acted appropriately in your attempts to resolve it. Your honesty and transparency will be noticed and appreciated. If anyone was hurt by your mistake, this is a good time to issue an apology as well.
Here’s what they wrote:
Honestly, we did a few things wrong. First, we should not have made any changes to the core network during the afternoon unless those changes were 100% mission critical. Second, we jumped right into fixing the problem and replying to customers but never publicly acknowledged that a problem had started. Third, our support desk was overloaded with calls and tickets. We tried very hard to respond to everyone, but it’s just not possible to speak to hundreds of customers on the phone at once, and we were not able to communicate with everyone.
- How We’re Doing Better
Tell them what you have learned. Prove that you are taking action and can be trusted in the future. Find the right balance between acknowledging your mistake and showing your eagerness to move forward. Strength and perseverance are admired; this is your chance to prove yourself. Explain the quantifiable steps you are taking, what you have learned, and what you will now do differently as a result. Mistakes happen and it takes courage to admit them, but don’t just stop there. This is an opportunity to differentiate yourself by showing how you will move forward.
Here’s what they wrote:
We’re re-evaluating our prioritization of events to reclassify mission-critical repairs. All work, even if we think there is zero chance of a problem, should be done at night and scheduled in advance. Customers will be notified, again, even if we don’t expect any disruption of services.
We also know that we need to post on Twitter and our status page, and enable a pre-recorded voice message that says “We are aware that we are currently experiencing a problem and we are working on the issue.” as soon as we first identify a problem. We’re working on setting up a new status page where we can post event updates. We’re also working with our engineers to guide them on how to provide better status updates to us internally so that we can relay information to customers. Unfortunately, during troubleshooting there is no ETA, and any given ETA will be wildly inaccurate. But we understand that, at the very least, an update of “no new updates at this time” is better than no update at all.
Finally, we’re working with Cisco engineers to find the cause of the reboot, upgrade IOS, and replace any hardware that might have caused the problem.
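The commitment to acknowledge problems the moment they are identified is the kind of thing that is easiest to keep when it is automated. Here is a minimal sketch, assuming a generic JSON status-page API; the endpoint, token, and payload fields are hypothetical, not MageMojo’s actual stack.

```python
#!/usr/bin/env python3
"""Sketch of the 'acknowledge first, troubleshoot second' idea:
push a holding message to a status page as soon as an alert fires.
The endpoint and token below are hypothetical placeholders."""
import json
import urllib.request

STATUS_API = "https://status.example.com/api/incidents"  # hypothetical
API_TOKEN = "REPLACE_ME"                                  # hypothetical

def post_incident(title: str, body: str) -> None:
    """Create an incident entry so customers see an acknowledgement
    even before there is an ETA or a root cause."""
    payload = json.dumps(
        {"title": title, "body": body, "status": "investigating"}
    ).encode()
    req = urllib.request.Request(
        STATUS_API,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("status page responded:", resp.status)

if __name__ == "__main__":
    post_incident(
        "Network disruption",
        "We are aware that we are currently experiencing a problem "
        "and we are working on the issue.",
    )
```

Wiring something like this into the same alerting pipeline that pages the on-call engineer turns “we should have posted sooner” into a default behavior rather than a judgment call made under pressure.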
- Thank You
Always end on a positive note. Say thank you. It is hard to own your mistake, but they have just given you a listening ear. Let them know how much you appreciate and value the relationship and that you don’t take it for granted. Thank them for their understanding and patience throughout the process and for the opportunity to maintain the relationship moving forward. Reiterate that they can trust you going forward and that you learn from your mistakes, and then thank them again.
Here’s what they wrote:
Thank you for being a customer here at Mage Mojo. We want you to know that we work incredibly hard every day to provide you with the highest level of support, the best performance, and the lowest prices. Our entire team is dedicated to making your experience with Magento a good one. When something like this happens, we hope you’ll understand how hard we work and how much we care about you. Everything we do is for you, the customer, and we appreciate your business. Trust us to learn from this experience and know that we’ll grow stronger, providing an even better service for you in the future. Thank you again for being a customer.