Concepts | Intermediate | 6-8 min reference

Monitoring & Observability

Testing doesn't stop at release. Observability is how you see what the system actually does in production — and it's a QA tool, not just an ops one: it tells you whether a release is healthy, surfaces the bugs your tests missed, and gives you the server-side signal that load and integration tests can't. Pair this with the Performance Testing sheet for the load side.

Why observability matters for QA

  • It catches what tests miss. No test suite covers every real-world input; error tracking surfaces the exceptions real users hit.
  • It validates a release in production. Monitoring dashboards and error rates are the fastest signal that a deploy is healthy or needs rolling back.
  • It gives load tests their server-side half. A load tool says that something is slow; observability says where (see the performance sheet's bottleneck section).
  • It is your test environment's health check. A flaky test is sometimes really a flaky environment, and observability lets you tell the two apart.

Monitoring vs observability

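|          | Monitoring                             | Observability                                    |
|----------|----------------------------------------|--------------------------------------------------|
| Question | Is a known thing OK? (CPU, error rate) | Why is the system behaving this way?             |
| Based on | Predefined metrics & alerts            | Rich, high-cardinality telemetry you can explore |
| Good for | Known failure modes                    | Unknown unknowns (novel issues)                  |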

Monitoring tells you something is wrong; observability lets you ask why without shipping new code to find out. Modern tools blur the line, but the distinction shapes what you can diagnose.
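
To make the distinction concrete, here is a minimal sketch that assumes request events are stored as dictionaries with arbitrary fields. The field names (`status`, `app_version`, `region`) and the helper `error_rate_by` are hypothetical, not taken from any particular tool. A monitoring-style check would track one predefined error rate; the observability-style question is answered by slicing the same events along a dimension chosen after the fact.

```python
from collections import defaultdict


def error_rate_by(events, dimension):
    """Return the share of 5xx responses for each value of `dimension`."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        # Events missing the field are bucketed together rather than dropped.
        key = event.get(dimension, "unknown")
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical events; real telemetry would carry many more fields.
events = [
    {"status": 200, "app_version": "2.1", "region": "eu"},
    {"status": 500, "app_version": "2.2", "region": "eu"},
    {"status": 200, "app_version": "2.1", "region": "us"},
    {"status": 503, "app_version": "2.2", "region": "us"},
    {"status": 200, "app_version": "2.2", "region": "us"},
]

by_version = error_rate_by(events, "app_version")
by_region = error_rate_by(events, "region")
```

In this made-up data, slicing by `app_version` shows every error coming from version 2.2, while slicing by `region` shows nothing decisive. No new code had to be shipped to ask either question.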

The three pillars

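| Pillar  | What it is                       | What it answers                                   |
|---------|----------------------------------|---------------------------------------------------|
| Logs    | Timestamped event records        | What happened, exactly, at this moment            |
| Metrics | Numeric measurements over time   | How much / how many / how fast (trends)           |
| Traces  | A request's path across services | Where the time/failure went in a distributed call |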

Most platforms ingest all three. Logging is the most familiar to testers; traces are the key to debugging microservices, where one request crosses many services.
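
As a rough illustration of how the three pillars relate, the sketch below records all three signals for one request using only the Python standard library. It is a toy in-memory model, not how any real platform's SDK works: `LOGS`, `METRICS`, `SPANS`, `log`, `span` and `handle_checkout` are all hypothetical names. The point is that a shared `trace_id` lets you join a log line to the spans of the same request.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

LOGS = []            # logs: timestamped event records
METRICS = Counter()  # metrics: numeric measurements (simple counters here)
SPANS = []           # traces: timed steps that share a trace_id


def log(trace_id, message, **fields):
    """Append one structured (JSON) log record tagged with the trace id."""
    record = {"ts": time.time(), "trace_id": trace_id, "msg": message, **fields}
    LOGS.append(json.dumps(record))


@contextmanager
def span(trace_id, name):
    """Time a step of the request and record it as a span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        SPANS.append({"trace_id": trace_id, "name": name, "duration_ms": duration_ms})


def handle_checkout(fail_payment=False):
    """Simulated request handler that emits a log, metrics and spans."""
    trace_id = uuid.uuid4().hex
    METRICS["checkout.requests"] += 1
    with span(trace_id, "checkout"):
        with span(trace_id, "payment"):
            if fail_payment:
                METRICS["checkout.errors"] += 1
                log(trace_id, "payment declined", level="error")
                return trace_id
        log(trace_id, "checkout ok", level="info")
    return trace_id
```

Calling `handle_checkout(fail_payment=True)` increments the error counter (the metric says how many), writes one error record (the log says what happened), and leaves `payment` and `checkout` spans under the same `trace_id` (the trace says where).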

The tool landscape

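| Need                              | Tools                               |
|-----------------------------------|-------------------------------------|
| Full-platform APM / observability | Datadog, New Relic, Grafana, Splunk |
| Log management / aggregation      | Graylog, Kibana, Mezmo              |
| Error / exception tracking        | Sentry, Rollbar                     |
| Alerting & incident response      | PagerDuty                           |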

Many overlap — Datadog and Splunk span metrics, logs and traces; Grafana visualises across sources; Kibana is the visualisation layer of the Elastic Stack.

Error tracking for QA

Error trackers (Sentry, Rollbar) capture unhandled exceptions with stack traces, breadcrumbs and the release/version they appeared in. For QA they're high-value:

  • A spike in a new error right after a deploy is a regression you can catch in minutes.
  • The stack trace + context often reproduces a bug faster than a vague user report.
  • Grouping by release tells you whether your change introduced it.

Wiring error tracking into the release process turns production into an extension of your test feedback loop.
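
The "grouping by release" idea can be sketched as follows. This is a simplified model, assuming each error event has already been reduced to a grouping key and a release tag. The field names `fingerprint` and `release` and the function `new_error_groups` are hypothetical and do not reflect the export format of Sentry or Rollbar.

```python
from collections import defaultdict


def new_error_groups(events, current_release):
    """Return {fingerprint: count} for errors seen only in current_release."""
    releases_by_fingerprint = defaultdict(set)
    counts = defaultdict(int)
    for event in events:
        releases_by_fingerprint[event["fingerprint"]].add(event["release"])
        if event["release"] == current_release:
            counts[event["fingerprint"]] += 1
    # An error group is "new" if no other release has ever produced it.
    return {
        fingerprint: counts[fingerprint]
        for fingerprint, releases in releases_by_fingerprint.items()
        if releases == {current_release}
    }


# Hypothetical events spanning two releases.
events = [
    {"fingerprint": "TimeoutError:search", "release": "1.4.0"},
    {"fingerprint": "TimeoutError:search", "release": "1.5.0"},
    {"fingerprint": "TypeError:cart.total", "release": "1.5.0"},
    {"fingerprint": "TypeError:cart.total", "release": "1.5.0"},
]

print(new_error_groups(events, "1.5.0"))  # {'TypeError:cart.total': 2}
```

The timeout already existed in 1.4.0, so it is not flagged; the `TypeError` appears only in 1.5.0 and is the likely regression to investigate first.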

Alerting and on-call

PagerDuty and the alerting built into the platforms above turn signals into action by routing the right alert to the right person. The QA-relevant discipline is alert quality: an alert that fires constantly gets ignored, just as a flaky test stops being trusted as a gate. Alert on symptoms users feel (error rate, latency SLO breaches), not on every metric wiggle.
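
One common way to keep an alert from firing on every wiggle is to require the symptom to persist across several consecutive samples. The sketch below assumes error-rate samples arrive at a fixed interval; the function name `should_page`, the 2% threshold and the three-sample window are illustrative assumptions, not recommended values.

```python
def should_page(error_rates, threshold=0.02, sustained=3):
    """Return True only if the last `sustained` samples all exceed `threshold`."""
    if len(error_rates) < sustained:
        return False  # not enough data to call the breach sustained
    return all(rate > threshold for rate in error_rates[-sustained:])


# A single spike that recovers does not page anyone.
print(should_page([0.01, 0.09, 0.01, 0.01]))  # False

# A breach that persists for three samples does.
print(should_page([0.01, 0.03, 0.04, 0.05]))  # True
```

The trade-off is detection delay: a longer `sustained` window means fewer false pages but a slower reaction to a real outage.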

How QA uses observability

  1. Post-release verification — watch error rate, latency and key metrics for a defined window after each deploy; treat a regression as a release gate.
  2. Bug reproduction — pull the logs/trace/error context for a reported issue instead of guessing.
  3. Test-environment health — confirm the env is healthy before blaming the test (flaky env vs flaky test).
  4. Correlating performance tests — line up load-test timings with server-side metrics to locate bottlenecks.
  5. Defining SLOs — agree the latency/error targets that performance tests and production alerts both measure against.
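
Items 1 and 5 above can be combined into a small automated check: after the post-deploy window closes, compare observed latency and error rate against the agreed SLOs. The following is a sketch under stated assumptions. `release_gate` and `percentile` are hypothetical helpers, the SLO values (500 ms p95, 1% error rate) are placeholders, and the inputs are assumed to have been exported from whatever observability platform is in use.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def release_gate(latencies_ms, errors, requests, slo_p95_ms=500, slo_error_rate=0.01):
    """Check a post-deploy window against latency and error-rate SLOs."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / requests if requests else 0.0
    failures = []
    if p95 > slo_p95_ms:
        failures.append(f"p95 latency {p95} ms exceeds SLO of {slo_p95_ms} ms")
    if error_rate > slo_error_rate:
        failures.append(f"error rate {error_rate:.2%} exceeds SLO of {slo_error_rate:.2%}")
    return {
        "passed": not failures,
        "p95_ms": p95,
        "error_rate": error_rate,
        "failures": failures,
    }


# Hypothetical post-deploy windows.
healthy = release_gate([100] * 95 + [900] * 5, errors=2, requests=1000)
degraded = release_gate([100] * 90 + [900] * 10, errors=50, requests=1000)
```

Here `healthy["passed"]` is true, while `degraded` fails on both latency and error rate. Using the same thresholds in performance tests and in this gate keeps testing and production measured against one target.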

Quick observability checklist

  • Error tracking wired into releases, grouped by version
  • Post-deploy monitoring window defined as a release gate
  • Key metrics (error rate, latency percentiles) dashboarded
  • Logs/traces accessible for bug reproduction, not just ops
  • Alerts fire on user-felt symptoms, not noisy metric wiggles
  • Test-environment health observable (env vs test flakiness)
  • Load-test results correlated with server-side telemetry
  • SLOs agreed and measured in both testing and production