Overview of how we monitor our Production systems
All services that we look after produce logs, metrics, or both. We monitor both to make sure the systems are healthy and performing well.
We process logs in the following format:
The generally accepted format for a log string is the following:
Logs include key information such as what the system is doing, whether it had any errors, and its external interactions with other services.
Metrics are time-series streams of timestamped values. For example, an API might receive 4 requests for a specific endpoint each minute. A metric would record that count every minute so we can track usage over time.
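To make the requests-per-minute example concrete, here is a minimal sketch in Python of turning raw request events into a per-minute time series. The function names, the "/orders" endpoint and the sample timestamps are all hypothetical and chosen for illustration; this is not how our services or DataDog compute metrics internally.

```python
from collections import Counter
from datetime import datetime, timezone


def minute_bucket(ts: datetime) -> str:
    """Truncate a timestamp to the start of its minute (UTC, ISO 8601)."""
    return ts.astimezone(timezone.utc).replace(second=0, microsecond=0).isoformat()


def requests_per_minute(events):
    """Count requests per (endpoint, minute) from (timestamp, endpoint) pairs."""
    counts = Counter()
    for ts, endpoint in events:
        counts[(endpoint, minute_bucket(ts))] += 1
    return counts


# Hypothetical sample data: four requests to one endpoint within the same minute.
events = [
    (datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc), "/orders")
    for second in (5, 20, 35, 50)
]
series = requests_per_minute(events)
```

Each key in series is one point in the time series (an endpoint and the minute it belongs to), and the value is the request count for that minute.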
We use DataDog to consume all of our metrics and logs and to integrate with our cloud providers. It is a SaaS DevOps monitoring solution that allows the SRE team to create monitors against all the logs and metrics we receive. With these monitors we can send alerts to PagerDuty, our incident management system, and to Slack.
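Conceptually, a monitor compares recent metric values against a threshold and notifies one or more destinations when the threshold is crossed. The sketch below illustrates that idea in plain Python under stated assumptions: the Monitor class, the averaging rule, the threshold value and the two notifier functions are hypothetical stand-ins, and none of this is the DataDog, PagerDuty or Slack API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Monitor:
    """Hypothetical monitor: alerts when a window's average exceeds a threshold."""
    name: str
    threshold: float
    notify: List[Callable[[str], None]]

    def evaluate(self, values: Sequence[float]) -> bool:
        """Return True and notify every destination if the threshold is breached."""
        if not values:
            return False  # no data in the window, nothing to evaluate
        average = sum(values) / len(values)
        if average > self.threshold:
            message = f"{self.name}: average {average:.1f} exceeds {self.threshold}"
            for send in self.notify:
                send(message)
            return True
        return False


# Stand-in destinations that record messages instead of calling real services.
sent = []


def page_on_call(message: str) -> None:
    sent.append(("pager", message))


def post_to_chat(message: str) -> None:
    sent.append(("chat", message))


monitor = Monitor("error rate", threshold=5.0, notify=[page_on_call, post_to_chat])
quiet = monitor.evaluate([1, 2, 3])   # average 2.0, below the threshold
fired = monitor.evaluate([6, 8, 10])  # average 8.0, above the threshold
```

In this sketch only the second window triggers a notification, and both destinations receive the same message.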
We use PagerDuty to manage our incidents, schedule who's on call and automate the calling of a team member when an incident occurs.