Monitoring and alerting
What is monitoring and alerting?
Monitoring, in the context of software systems, is the activity of observing system parameters over a period of time.
Alerting is warning about a danger, threat, or problem, typically with the intention of having it avoided or dealt with.
Why should you care about monitoring?
As an Engineering Manager, you and your team are usually responsible for one or more systems. Any service outage might have a severe impact on the business. Your job is to make sure that you and your team know the health of the systems you own at any point in time. The last thing you want is for a user of your services to come to you and describe an issue that you, as the system owner, were not aware of.
There are at least a few reasons why you should consider improving in this area.
Reason #1: People
Your team members are your biggest asset. There is no success without a great team standing behind it. You want to keep them happy.
You might be wondering: "How does a lack of monitoring affect happiness?"
A lack of proper monitoring and alerting leads to unexpected and unplanned events. I am speaking here about situations where you need to fix something ASAP. This results in context switching and stress, which can severely impact team morale and lead to burnout. No one wants to be on a team that is constantly firefighting.
Reason #2: Your company's customers
Without monitoring and alerting, you might be losing customers on that latest version of Safari where the button stopped working properly, and not even know it. This can have serious long-term consequences for your business.
Reason #3: Incident resolution speed
If you are monitoring each part of your system, figuring out what went wrong is much faster than when all you have is an email from an annoyed user stating that "the website doesn't work at all!". Incident response time can make a real difference to the size of the company's financial loss.
Reason #4: Your reputation
Most companies I have worked for were very professional, and in emergencies everyone was supportive. However, if the situation repeats, it starts to become a problem, and even the most tolerant leaders and colleagues will start to notice. No one likes to be dragged into the war room during a long weekend. You want to be on top of your system health in order to prevent that.
Reason #5: Increased productivity
Monitoring helps improve performance.
It lets you answer questions such as: how many issues did we have last week, month, or quarter?
Best practices
Checklist (to prioritize)
API latency
API response times
API health check
Connectivity
API success ratio
Top errors
Certificate validity
JavaScript/client-side errors
Memory usage
Disk usage
Processor usage
Real-time data is important
Notifications, but not spam
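A few of the checklist items above (API success ratio, API latency, and notifications that do not turn into spam) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RequestSample record, the thresholds (99% success, 500 ms at the 95th percentile), and the 15-minute cooldown are hypothetical values chosen for the example, not recommendations from this guide or the behavior of any particular monitoring tool.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RequestSample:
    """One observed API call (hypothetical record shape for this sketch)."""
    status_code: int
    latency_ms: float


def success_ratio(samples):
    """Fraction of requests that did not fail with a 5xx status."""
    if not samples:
        return 1.0  # assumption: no traffic is treated as healthy
    ok = sum(1 for s in samples if s.status_code < 500)
    return ok / len(samples)


def latency_percentile(samples, pct):
    """Nearest-rank percentile of latency, in milliseconds."""
    if not samples:
        return 0.0
    ordered = sorted(s.latency_ms for s in samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(samples, min_success=0.99, max_p95_ms=500.0):
    """Return the alert keys triggered by the current window of samples."""
    alerts = []
    if success_ratio(samples) < min_success:
        alerts.append("api_success_ratio")
    if latency_percentile(samples, 95) > max_p95_ms:
        alerts.append("api_latency_p95")
    return alerts


@dataclass
class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown window."""
    cooldown_s: float = 900.0  # assumed 15-minute cooldown
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, alert_key, now_s):
        last = self._last_sent.get(alert_key)
        if last is not None and now_s - last < self.cooldown_s:
            return False  # same alert fired recently: stay quiet
        self._last_sent[alert_key] = now_s
        return True
```

In use, evaluate would run on each window of collected samples, and every returned key would be passed through should_notify before anyone is paged. That way a problem that persists across many windows produces one notification per cooldown period instead of one per window.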
Types
Availability monitoring
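As a rough illustration of what availability monitoring can report, the sketch below turns a series of periodic probe results into an availability percentage and a longest-outage figure. The ProbeResult shape and the fixed probe interval are assumptions made for this example. How the probes are actually collected (for instance, by calling an API health check endpoint) is left out.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one periodic check (hypothetical record shape)."""
    timestamp_s: float
    up: bool


def availability_percent(results):
    """Share of probes that found the service up, as a percentage."""
    if not results:
        raise ValueError("no probe results in the window")
    up_count = sum(1 for r in results if r.up)
    return 100.0 * up_count / len(results)


def longest_outage_s(results, interval_s):
    """Longest run of consecutive failed probes, expressed in seconds."""
    longest = current = 0
    for r in sorted(results, key=lambda r: r.timestamp_s):
        current = 0 if r.up else current + 1
        longest = max(longest, current)
    return longest * interval_s
```

For example, ten probes taken one minute apart with two consecutive failures give 80% availability and a longest outage of 120 seconds.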