Concepts | Intermediate | 6-8 min reference

Monitoring & Observability

Testing doesn't stop at release. Observability is how you see what the system actually does in production — and it's a QA tool, not just an ops one: it tells you whether a release is healthy, surfaces the bugs your tests missed, and gives you the server-side signal that load and integration tests can't. Pair this with the Performance Testing sheet for the load side.

Why observability matters for QA

It catches what tests miss. No test suite covers every real-world input; error tracking surfaces the exceptions real users hit.
It validates a release in production. Monitoring dashboards and error rates are the fastest signal that a deploy is healthy or needs rolling back.
It gives load tests their server-side half. A load tool tells you that something is slow; observability tells you where (see the performance sheet's bottleneck section).
It is your test environment's health check. Flaky tests are sometimes a symptom of a flaky environment; observability lets you tell the two apart.

Monitoring vs observability

	Monitoring	Observability
Question	Is a known thing OK? (CPU, error rate)	Why is the system behaving this way?
Based on	Predefined metrics & alerts	Rich, high-cardinality telemetry you can explore
Good for	Known failure modes	Unknown-unknowns — novel issues

Monitoring tells you something is wrong; observability lets you ask why without shipping new code to find out. Modern tools blur the line, but the distinction shapes what you can diagnose.
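
To make the distinction concrete, here is a minimal sketch in Python. The event records and their field names (route, status, app_version, region) are invented for illustration and do not come from any particular tool. The first function answers a predefined question, as monitoring does; the second slices the same data by any field you pick after the fact, which is the kind of exploration observability is meant to allow.

```python
from collections import Counter

# Hypothetical request events; the field names are assumptions for illustration.
EVENTS = [
    {"route": "/checkout", "status": 500, "app_version": "2.4.1", "region": "eu"},
    {"route": "/checkout", "status": 200, "app_version": "2.4.0", "region": "eu"},
    {"route": "/checkout", "status": 500, "app_version": "2.4.1", "region": "us"},
    {"route": "/search", "status": 200, "app_version": "2.4.1", "region": "us"},
    {"route": "/search", "status": 200, "app_version": "2.4.0", "region": "eu"},
]


def error_rate(events):
    """Monitoring-style check: one predefined number (share of 5xx responses)."""
    if not events:
        return 0.0
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)


def errors_by(events, field):
    """Observability-style question: count errors grouped by any field."""
    return Counter(e[field] for e in events if e["status"] >= 500)


rate = error_rate(EVENTS)                      # is something wrong?
by_version = errors_by(EVENTS, "app_version")  # which version is affected?
by_route = errors_by(EVENTS, "route")          # which route is affected?
```

In this toy data the error rate alone only says that 40% of requests fail; grouping by version and route shows that every failure comes from one version on one route, without any new instrumentation being shipped.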

The three pillars

Pillar	What it is	What it answers
Logs	Timestamped event records	What happened, exactly, at this moment
Metrics	Numeric measurements over time	How much / how many / how fast (trends)
Traces	A request's path across services	Where the time/failure went in a distributed call

Most platforms ingest all three. Logging is the most familiar to testers; traces are the key to debugging microservices, where one request crosses many services.
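
The three pillars can be sketched with nothing but the Python standard library. This is an illustrative toy, not how any real platform or SDK works: the in-memory stores (metrics, spans, log_lines) and the helper names (log, span, handle_request) are all hypothetical. The point it shows is that a shared trace ID is what lets you join a log line to the span and the request it belongs to.

```python
import json
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

metrics = defaultdict(int)  # metric name -> counter value
spans = []                  # finished trace spans
log_lines = []              # structured log records, stored as JSON strings


def log(trace_id, message, **fields):
    """Log: a timestamped event record, tagged with the trace it belongs to."""
    record = {"ts": time.time(), "trace_id": trace_id, "message": message, **fields}
    log_lines.append(json.dumps(record))


@contextmanager
def span(trace_id, name):
    """Trace span: times one step of a request and records it when it ends."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(order_id):
    """Hypothetical request handler that emits all three signal types."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1  # Metric: a number tracked over time
    with span(trace_id, "handle_request"):
        with span(trace_id, "load_order"):
            log(trace_id, "order loaded", order_id=order_id)
    return trace_id


tid = handle_request(42)
```

Inner spans finish first, so load_order is recorded before handle_request; both carry the same trace ID as the log line, which is how a tester can go from one failing request to everything the system recorded about it.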

The tool landscape

Need	Tools
Full-platform APM / observability	Datadog, New Relic, Grafana, Splunk
Log management / aggregation	Graylog, Kibana, Mezmo
Error / exception tracking	Sentry, Rollbar
Alerting & incident response	PagerDuty

Many overlap — Datadog and Splunk span metrics, logs and traces; Grafana visualises across sources; Kibana is the visualisation layer of the Elastic Stack.

Error tracking for QA

Error trackers (Sentry, Rollbar) capture unhandled exceptions with stack traces, breadcrumbs and the release/version they appeared in. For QA they're high-value:

A spike in a new error right after a deploy is a regression you can catch in minutes.
The stack trace + context often reproduces a bug faster than a vague user report.
Grouping by release tells you whether your change introduced it.

Wiring error tracking into the release process turns production into an extension of your test feedback loop.
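
As a rough sketch of the "grouping by release" idea, the Python below takes a list of error events and reports which error groups appear only in a given release. The event shape (fingerprint, release) and the sample values are assumptions made for illustration; real error trackers expose this through their own UI and export formats, which are not modelled here.

```python
from collections import defaultdict

# Hypothetical error events, shaped roughly as an error tracker might export them.
ERRORS = [
    {"fingerprint": "TypeError:cart.total", "release": "1.8.0"},
    {"fingerprint": "TypeError:cart.total", "release": "1.9.0"},
    {"fingerprint": "KeyError:user.locale", "release": "1.9.0"},
    {"fingerprint": "KeyError:user.locale", "release": "1.9.0"},
]


def new_errors_in(release, events):
    """Return {fingerprint: count} for errors seen in `release` and no other release."""
    releases_by_fp = defaultdict(set)
    counts = defaultdict(int)
    for event in events:
        releases_by_fp[event["fingerprint"]].add(event["release"])
        if event["release"] == release:
            counts[event["fingerprint"]] += 1
    return {fp: n for fp, n in counts.items() if releases_by_fp[fp] == {release}}


regressions = new_errors_in("1.9.0", ERRORS)
```

Here the TypeError already existed in 1.8.0, so it is not attributed to the new release, while the KeyError is flagged as a likely regression with two occurrences.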

Alerting and on-call

PagerDuty and the alerting in the platforms above turn signals into action — routing the right alert to the right person. The QA-relevant discipline is alert quality: an alert that fires constantly gets ignored (the same "trusted gate" lesson as flaky tests). Alert on symptoms users feel (error rate, latency SLO breaches), not on every metric wiggle.
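
One common way to keep an alert from firing on every wiggle is to require the symptom to persist. The sketch below is a plain-Python illustration of that rule, not the configuration syntax of PagerDuty or any other tool; the function name, the 5% threshold and the three-sample window are all assumed values.

```python
def should_alert(error_rates, threshold=0.05, sustained=3):
    """Alert only if the last `sustained` samples all exceed `threshold`.

    `error_rates` is a list of error-rate samples (0.0 to 1.0), oldest first.
    """
    if len(error_rates) < sustained:
        return False
    return all(rate > threshold for rate in error_rates[-sustained:])


blip = [0.01, 0.09, 0.01, 0.02]    # one bad sample, then recovery
outage = [0.01, 0.07, 0.08, 0.11]  # three consecutive bad samples

blip_alert = should_alert(blip)      # False: a single spike is ignored
outage_alert = should_alert(outage)  # True: the breach is sustained
```

The trade-off is detection delay: a longer window means fewer false alarms but a slower page when users really are affected.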

How QA uses observability

Post-release verification — watch error rate, latency and key metrics for a defined window after each deploy; treat a regression as a release gate.
Bug reproduction — pull the logs/trace/error context for a reported issue instead of guessing.
Test-environment health — confirm the env is healthy before blaming the test (flaky env vs flaky test).
Correlating performance tests — line up load-test timings with server-side metrics to locate bottlenecks.
Defining SLOs — agree the latency/error targets that performance tests and production alerts both measure against.
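
Post-release verification and SLOs can be combined into a simple gate. The sketch below assumes you can collect latency samples and request/error counts for the post-deploy window from your observability platform; how you fetch them is not shown. The function names and the SLO targets (p95 under 300 ms, error rate under 1%) are illustrative, not recommendations.

```python
def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def release_gate(latencies_ms, errors, requests, slo_p95_ms=300, slo_error_rate=0.01):
    """Return a list of SLO breaches for the post-deploy window (empty = healthy)."""
    failures = []
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / requests if requests else 0.0
    if p95 > slo_p95_ms:
        failures.append(f"p95 latency {p95} ms exceeds {slo_p95_ms} ms")
    if error_rate > slo_error_rate:
        failures.append(f"error rate {error_rate:.2%} exceeds {slo_error_rate:.2%}")
    return failures


healthy = release_gate([120] * 19 + [280], errors=1, requests=1000)
degraded = release_gate([120] * 10 + [450] * 10, errors=30, requests=1000)
```

Because the same thresholds can be passed to a load-test report and to the production check, performance tests and production alerts end up measuring against one agreed target.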

Quick observability checklist

Error tracking wired into releases, grouped by version
Post-deploy monitoring window defined as a release gate
Key metrics (error rate, latency percentiles) dashboarded
Logs/traces accessible for bug reproduction, not just ops
Alerts fire on user-felt symptoms, not noisy metric wiggles
Test-environment health observable (env vs test flakiness)
Load-test results correlated with server-side telemetry
SLOs agreed and measured in both testing and production