Let’s face it: as a Product Manager, you have your glory days when all the charts are pointing upwards, but you cannot avoid issues, or incidents as we all like to call them. While only the engineering team can fix the issues, the other teams can contribute indirectly. So let’s talk about how you, as a Product Manager, can add value when such incidents occur.
Before we dive in, let’s create a hypothetical incident:
Let’s say your company sells beauty products online through an e-commerce site. Customers are completing their orders, but your internal dashboard shows “payment pending.” However, you can see the successful payments in the Payment Gateway’s dashboard, so the issue seems to be only with the internal dashboard. This causes problems downstream in all the workflows that depend on your Admin dashboard’s status, especially for the order fulfilment teams.
Incident analysis: What’s really happening here?
First things first, try to get a sense of what is happening. You will have to work with the Engineering Manager or whoever is leading the team to fix it. As a first step, identify the people who are working on this.
Next, try to understand the issue better. This is where your technical knowledge will help you. If you don’t understand something, research the topic and learn more about it. Sometimes a simple Google search or a question to ChatGPT might help.
In this case, you might have to learn a bit more about how exactly the order status gets updated when the payment is successful. Also, you might have to find out more about how APIs work.
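To make this concrete, here is a minimal sketch of one common pattern: the payment gateway sends a notification (a webhook) to our system, and a handler marks the order as paid. Everything in it is hypothetical. The event fields `order_id` and `status`, the function `handle_payment_webhook`, and the in-memory `orders` dictionary are stand-ins for whatever your system actually uses, so treat it as a conversation starter with your engineers rather than a description of your stack.

```python
# Hypothetical sketch: how a payment notification might update an order.
# Field names and statuses are assumptions, not a real gateway's API.

def handle_payment_webhook(event: dict, orders: dict) -> int:
    """Apply a payment notification to the order store.

    Returns an HTTP-style status code so the gateway knows whether
    the notification was processed.
    """
    order_id = event.get("order_id")
    if order_id not in orders:
        return 404  # we do not recognise this order
    if event.get("status") == "success":
        orders[order_id]["payment_status"] = "paid"
    return 200


orders = {"A100": {"payment_status": "payment pending"}}
code = handle_payment_webhook({"order_id": "A100", "status": "success"}, orders)
print(code, orders["A100"]["payment_status"])
```

If a handler like this fails or never receives the notification, the order stays at “payment pending” even though the gateway shows a successful payment, which is the symptom in our hypothetical incident.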
Document the knowledge that you have collected on the issue as much as possible.
Here are a few questions you can answer:
- What is the incident exactly?
- When did the incident occur?
- Do we know if this incident has occurred earlier? If yes, is there any past documentation on how it might have been solved?
- Who noticed it first and how was the issue reported? (This information will be helpful later to devise a mitigation plan).
Key Takeaway: Get to the depth of the incident and document initial findings.
Impact analysis: How does it impact the customers and the company?
Depending on the severity of the incident, the impact will be different. It may affect the entire product and bring everything to a standstill, or it may affect only a small part of the workflow. It might affect customers directly or only the internal teams’ workflows.
To analyse the impact, work with your engineering team. In our hypothetical scenario, we might discover that only payments from Country X are affected, while the rest of the payments work fine.
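A quick way to check a hunch like this is to group the affected orders by a suspected dimension. The sketch below assumes you can get an export of the stuck orders with a `country` field; the data and the `count_by_country` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical export: orders the gateway shows as paid but the
# admin dashboard still shows as "payment pending".
stuck_orders = [
    {"order_id": "A100", "country": "X"},
    {"order_id": "A101", "country": "X"},
    {"order_id": "A102", "country": "X"},
]


def count_by_country(orders: list[dict]) -> Counter:
    """Count affected orders per country to show where the impact sits."""
    return Counter(order["country"] for order in orders)


impact = count_by_country(stuck_orders)
print(impact.most_common())
```

If every stuck order comes from one country, that narrows both the impact statement and the list of teams you need to inform.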
Document everything you notice during this analysis.
Here are some more questions to look at:
- Is the customer impacted directly? If yes, which part of the product are they not able to use?
- What is the criticality of the impact? In our example, not being able to see the updated payment status is more critical than not being able to download the order receipt.
- Which teams should be informed about this incident?
- Do we know why it occurred? At this stage, you may or may not know the root cause. If you know the cause, write it down. If you don’t, then move on to the next thing.
Your work does not end here. In fact, it has only begun. So buckle up.
Key Takeaway: When doing impact analysis, look at all angles and every part of the workflow. A checklist can help here.
Mitigation plan:
In most situations, it is very likely that the team needs more time to fix the issue. However, you have to look for alternate solutions because you cannot let the workflow stop.
The alternatives will differ from problem to problem, and this is something you can work out with the Engineering team and the team that is most heavily impacted. In our example, you will be talking to the Engineering and Operations teams to come up with an alternative. Since we get the status of the payment in the Payment Gateway’s dashboard, can that be used directly? Can we manually update something? Can we have an engineer manually update the database until the issue is resolved?
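If the team settles on manual updates, someone first needs a list of which orders to touch. Here is a minimal sketch of that comparison, assuming you can export order statuses from both the Payment Gateway and the internal dashboard. The function name, status values, and sample data are all hypothetical.

```python
# Hypothetical sketch: find orders that are paid at the gateway
# but still pending in the internal dashboard.

def find_orders_to_update(gateway_payments: dict, dashboard_orders: dict) -> list:
    """Return order IDs the gateway reports as paid but the dashboard does not."""
    return sorted(
        order_id
        for order_id, gateway_status in gateway_payments.items()
        if gateway_status == "success"
        and dashboard_orders.get(order_id) == "payment pending"
    )


gateway_payments = {"A100": "success", "A101": "failed", "A102": "success"}
dashboard_orders = {
    "A100": "payment pending",
    "A101": "payment pending",
    "A102": "paid",
}
to_update = find_orders_to_update(gateway_payments, dashboard_orders)
print(to_update)  # ['A100']
```

Only order A100 needs a manual fix here: A101 genuinely failed at the gateway, and A102 is already marked as paid. A list like this keeps the manual workaround small and auditable.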
You get the drift. Some solutions are obviously going to sound stupid but that is exactly why you are brainstorming with the team. You need to drive these discussions to fruition.
Key Takeaway: Look for the shortest possible path to an alternate solution that will work. In some cases, it may be advisable to just wait until the issue is fixed.
Communication Plan
This is where you contribute the most. All the documentation that you did in the previous steps will be put to use here.
Firstly, set up a communication line with your Engineering Team.
The Engineering Manager or team lead will be your point of contact. Be in touch with them at a regular interval that you are both comfortable with. Tip: Don’t go asking “what’s the update?” every 5 minutes. Instead, enable the team to use common channels.
Secondly, prepare the communication with the internal teams.
Thirdly, partner up with the marketing team or equivalent team to draft customer-facing communication. This is required only if the customers are directly impacted. In our example, we won’t need to do it.
What should you communicate?
Talk to your teams as if they are your friends. This means you automatically cut out jargon and tech-heavy phrases. No matter what medium of communication you choose, you can always include the following components:
- What is the issue?
- What caused this?
- What is the impact?
- When will it be fixed?
- How do we mitigate or handle until fixed?
Be candid and open about the issue. Keep the message short. Keep the language as simple as you can.
For example, instead of saying, “the status is not getting updated because the webhook is returning a 500 error”, tell them, “Every time a payment is successful, our payment gateway partner sends a notification to us and we update our admin dashboard. But in this case, we are not getting this notification.”
Post fix:
Create an Incident Report with full details. Generally, the Engineering Team works on this. You can either use their report as it is or create your own version in simplified language.
If you had a customer-facing incident, then team up with Engineering and Product Marketing Teams and draft an incident report. Some startups may not have specialised roles and therefore, you might be the only person to handle all communication.
What if you don’t know the root cause or the estimate to fix it?
Let’s admit it, communication is hard. It is especially hard when you don’t have any updates to give. For example, even after 2 hours of debugging, the team is not able to find out what went wrong. However, you still have to respond to stakeholders.
I try to take the approach of an “as is” update. This means you don’t have to fabricate a story; just tell people exactly how things are going.
For example, your update could be, “At this point, we do not know what is causing this issue. Our team of 4 members has been on it for the past 2 hours. They have their suspicions about what might be causing it but have yet to confirm them. Here is the plan of action in the meantime…”
You can take this time to inform the team about the alternate solution and prepare them to adopt it until then.
My mistakes from the past:
- Mistake #1 — At times, my only update was “We are fixing it.”
This gave the team no confidence because I was not telling them what the issue was, how long they would have to wait, or how they could mitigate it. When there is a bug in the system, we all have a tendency to treat it as our personal failure. We fail to acknowledge that no product is perfect and failures are inevitable. Avoid doing this.
- Mistake #2 — Not being transparent about requiring help from others.
You can use help from other PMs, or teams that have dealt with such issues. In my initial days, I tried to do it all by myself and suffered. Avoid doing this and seek help.
