Logging and Monitoring: The Stuff That Saves You at 3am

TL;DR

Logging captures what your code is doing. Monitoring watches if it’s healthy. Together, they let you debug production when things go wrong. Structured logging (JSON, not strings) makes analysis easy. Alerting on metrics prevents surprises. You find problems at 9am, not 3am.

The system went down at 2 AM. By the time the on-call engineer noticed, it had been down for 20 minutes. They frantically checked logs. Millions of lines. No structure, no timestamps. Took three hours to find the bug. If we’d had proper monitoring, an alert would have woken them in two minutes. The difference between “we know something’s wrong” and “we’re debugging blind at 3am.”

Most developers treat logging and monitoring as afterthoughts. Infrastructure stuff. Not their problem. Wrong. Logging and monitoring are how you survive running production code.

The Difference: Logging vs. Monitoring

Logging: What happened. Detailed records of events in your code.

2026-03-12T15:30:45Z INFO User logged in: user_id=123
2026-03-12T15:30:46Z ERROR Database connection failed: timeout=5000ms
2026-03-12T15:30:47Z WARN Slow query: SELECT * FROM users took 2000ms

Monitoring: Is it healthy? Metrics that tell you the system’s state right now.

CPU usage: 45%
Memory usage: 60%
Error rate: 0.1% (10 errors per 10,000 requests)
Response time (p99): 450ms
Database connections: 8/10

Logs tell you what happened. Metrics tell you if something’s wrong. Together, they let you understand and fix problems.

Structured Logging: Make Logs Parseable

Bad logging: write strings

console.log('User 123 logged in from 192.168.1.1 at 2026-03-12T15:30:45Z');

// Later, you need to find all login attempts
// grep "logged in" | grep "192.168" | ...
// Fragile, error-prone

Good logging: structured data

logger.info({
  event: 'user_login',
  user_id: 123,
  ip_address: '192.168.1.1',
  timestamp: '2026-03-12T15:30:45Z'
});

// Output as JSON:
// {"event":"user_login","user_id":123,"ip_address":"192.168.1.1","timestamp":"2026-03-12T15:30:45Z"}

// Now searching is easy:
// grep '"event":"user_login"' | jq '.user_id'
// Find all login attempts by user: jq 'select(.user_id == 123)'

Use a logging library like Pino (Node), Logback (Java), or Python's logging module with a JSON formatter.

const pino = require('pino');
const logger = pino();

app.post('/login', async (req, res) => {
  logger.info({
    event: 'login_attempt',
    email: req.body.email
  });

  try {
    const user = await authenticate(req.body.email, req.body.password);
    logger.info({
      event: 'login_success',
      user_id: user.id,
      email: user.email
    });
    res.json({ token: user.token });
  } catch (err) {
    logger.error({
      event: 'login_failed',
      email: req.body.email,
      error: err.message
    });
    res.status(401).json({ error: 'Invalid credentials' });
  }
});

Now every log entry is structured JSON. Easy to parse, filter, and analyze.
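On the wire, each structured entry is just one JSON object per line. A minimal sketch of what a logger emits per call (`makeLogEntry` is a hypothetical helper for illustration, not part of Pino):

```javascript
// Minimal sketch of what a structured logger emits per call.
// makeLogEntry is a hypothetical helper, not part of any library.
function makeLogEntry(level, fields) {
  return JSON.stringify({
    level,
    timestamp: new Date().toISOString(),
    ...fields
  });
}

// One JSON object per line, so jq can filter on any field.
const line = makeLogEntry('info', { event: 'user_login', user_id: 123 });
```

Because every line is a standalone JSON object, tools like jq, Elasticsearch, or Datadog can index and query each field without fragile regexes.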

Log Levels: What to Log

ERROR: Something broke. Needs immediate attention.

logger.error({
  event: 'database_connection_failed',
  error: err.message
});

WARN: Something unexpected but not broken. Worth noting.

logger.warn({
  event: 'slow_query',
  query: 'SELECT * FROM users',
  duration_ms: 5000
});

INFO: Important business events. User logins, payments, significant actions.

logger.info({
  event: 'payment_completed',
  order_id: 123,
  amount: 99.99
});

DEBUG: Detailed information for debugging. Usually disabled in production.

logger.debug({
  event: 'request_received',
  method: 'POST',
  path: '/api/orders',
  headers: req.headers
});

Adjust the log level per environment. Production: INFO and above (DEBUG off). Development: everything, including DEBUG.
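With Pino you can set the level from the environment, e.g. `pino({ level: process.env.LOG_LEVEL || 'info' })`. The filtering a logger does reduces to a numeric severity comparison, sketched here:

```javascript
// Numeric severity per level (these values mirror Pino's defaults).
const LEVELS = { debug: 20, info: 30, warn: 40, error: 50 };

// A logger emits an entry only if its severity meets the configured level.
function shouldLog(configuredLevel, entryLevel) {
  return LEVELS[entryLevel] >= LEVELS[configuredLevel];
}

shouldLog('info', 'debug'); // false — DEBUG is filtered out at INFO
shouldLog('info', 'error'); // true  — errors always get through
```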

Monitoring: Metrics That Matter

You can’t monitor everything. Pick metrics that tell you if something’s wrong:

Golden Signals (Google’s SRE Book):

  • Latency: Response time. If p99 latency jumps from 100ms to 1000ms, something’s wrong.
  • Traffic: Request volume. If traffic drops 50%, maybe something crashed.
  • Errors: Error rate. If error rate goes from 0.01% to 1%, you have a problem.
  • Saturation: How full is the system? CPU 95%, memory 90%, database connections at max.

Track these and you’ll catch most problems.
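p99 latency means the value 99% of requests come in under. A quick sketch of the nearest-rank computation over a window of response times (monitoring systems use smarter streaming estimates, but the idea is the same):

```javascript
// Nearest-rank percentile: sort the samples and pick the value at
// ceil(p * n) - 1. This is the simplest definition; production systems
// use streaming approximations over the same idea.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// 100 requests at 100ms plus one outlier at 2000ms:
const latencies = [...Array(100).fill(100), 2000];
percentile(latencies, 0.99); // 100 — a single outlier doesn't move p99
percentile(latencies, 1);    // 2000 — but it dominates the maximum
```

This is why percentiles beat averages for latency: one slow request barely moves p99, but a sustained slowdown shows up immediately.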

// Express middleware to track metrics
const prometheus = require('prom-client');

const httpDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpErrors = new prometheus.Counter({
  name: 'http_errors_total',
  help: 'Total HTTP errors',
  labelNames: ['method', 'route', 'status_code']
});

app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // req.route is an object (or undefined if no route matched), so use
    // its path — otherwise every raw URL becomes its own label value.
    const route = req.route ? req.route.path : req.path;

    httpDuration
      .labels(req.method, route, String(res.statusCode))
      .observe(duration);

    if (res.statusCode >= 400) {
      httpErrors
        .labels(req.method, route, String(res.statusCode))
        .inc();
    }
  });

  next();
});

Now you’re collecting latency and error metrics. Send these to a monitoring system (Prometheus, DataDog, New Relic).
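The `http_request_duration_seconds` metric above is a Prometheus histogram: each observation increments every bucket whose upper bound it fits under, plus a catch-all `+Inf` bucket. A sketch of that bookkeeping (the bucket bounds here are illustrative):

```javascript
// Sketch of how a Prometheus histogram records an observation.
// Buckets are cumulative: each counts observations <= its bound.
const buckets = [0.1, 0.5, 1, 5]; // seconds; illustrative bounds

function observe(counts, durationSeconds) {
  buckets.forEach((bound, i) => {
    if (durationSeconds <= bound) counts[i] += 1;
  });
  counts[buckets.length] += 1; // the +Inf bucket counts everything
}

const counts = new Array(buckets.length + 1).fill(0);
observe(counts, 0.3); // lands in the 0.5, 1 and 5 buckets (and +Inf)
observe(counts, 2);   // lands in the 5 bucket (and +Inf)
// counts is now [0, 1, 1, 2, 2]
```

These cumulative bucket counts are what `histogram_quantile()` interpolates over when you query p99 in Prometheus.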

Alerting: Wake Someone Up When It Matters

Monitoring without alerting is pointless. You can’t watch graphs 24/7. Alerts wake you when something’s broken.

# Alert rules (Prometheus format)

# If the error rate exceeds 1% of requests for 5 minutes, alert
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) / rate(http_request_duration_seconds_count[5m]) > 0.01
  for: 5m
  annotations:
    summary: "High error rate detected"

# If p99 latency exceeds 1 second for 5 minutes, alert
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  annotations:
    summary: "High latency detected"

# If CPU usage exceeds 90% for 5 minutes, alert
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
  for: 5m
  annotations:
    summary: "High CPU usage"

Configure alerts to go to Slack, PagerDuty, or your phone. The key: only alert on things that need immediate human attention.

Alert fatigue is real. Too many alerts = people ignore them. Alert only on genuine problems.

Correlation: Logs and Metrics Together

A user reports slow requests. You check the metrics.

// Metrics show:
// - p99 latency spiked from 100ms to 2000ms
// - CPU jumped to 95%
// - Database connections at 9/10
// - No increase in error rate

// Likely: heavy database query under load

// Check logs:
// filter by timestamp of the spike
// look for slow queries
// find which user was affected
// reproduce the issue

Logs tell you what happened. Metrics tell you something changed. Together they tell you why.
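Once logs are structured JSON, the first step of that correlation, narrowing entries to the spike window, is a simple filter. The entries below are illustrative:

```javascript
// Filter structured log entries down to the window around the spike.
// These entries are illustrative sample data.
const entries = [
  { timestamp: '2026-03-12T15:30:40Z', event: 'slow_query', duration_ms: 2000 },
  { timestamp: '2026-03-12T15:25:00Z', event: 'user_login', user_id: 123 }
];

function inWindow(entry, startIso, endIso) {
  const t = Date.parse(entry.timestamp);
  return t >= Date.parse(startIso) && t <= Date.parse(endIso);
}

const spike = entries.filter(e =>
  inWindow(e, '2026-03-12T15:30:00Z', '2026-03-12T15:31:00Z'));
// spike contains only the slow_query entry
```

The same filter in a log pipeline is a one-liner in jq or a time-range query in Elasticsearch; the point is that structured timestamps make the window cut trivial.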

Distributed Tracing: Following Requests Through Services

In a microservices system, a single request touches multiple services: User Service, Order Service, Payment Service. One of them is slow, but which?

Distributed tracing tracks a request across services:

// Request arrives with trace ID
X-Trace-ID: abc123

// User Service logs:
// trace_id=abc123 event=request_received path=/api/user/123

// User Service calls Order Service:
// trace_id=abc123 event=calling_order_service

// Order Service logs:
// trace_id=abc123 event=request_received path=/api/orders

// Order Service calls Payment Service:
// trace_id=abc123 event=calling_payment_service

// Payment Service logs:
// trace_id=abc123 event=request_received path=/api/pay took=2000ms

// Back through the chain, all with same trace_id

// In the trace viewer, you see the full request path and timing
// You see Payment Service took 2000ms (bottleneck)

Use Jaeger or Zipkin for distributed tracing.

When NOT to Log Everything

Don’t log passwords, API keys, or sensitive data. It ends up in logs where anyone with access can read it.
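Pino supports this natively via its `redact` option (an array of field paths to censor). The core idea reduces to a sketch like this, where the `SENSITIVE` field list is illustrative:

```javascript
// Replace sensitive fields before an object ever reaches the log.
// The SENSITIVE list here is illustrative; tune it to your payloads.
const SENSITIVE = ['password', 'api_key', 'authorization'];

function redact(obj) {
  const copy = { ...obj }; // don't mutate the caller's object
  for (const key of SENSITIVE) {
    if (key in copy) copy[key] = '[REDACTED]';
  }
  return copy;
}

redact({ event: 'login_attempt', email: 'a@b.com', password: 'hunter2' });
// → { event: 'login_attempt', email: 'a@b.com', password: '[REDACTED]' }
```

Doing this at the logger level, rather than at each call site, means one forgotten `logger.info(req.body)` can't leak credentials.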

Don’t log at DEBUG level in production by default. Too much noise. Enable it for specific requests to diagnose issues.

Don’t use logging as your monitoring. Logs are for debugging. Metrics are for monitoring. They’re different tools.

Common Mistakes

Logging without structure. String logs are useless at scale. JSON is mandatory.

Not logging errors. If something fails, log it. Include the error message and stack trace.

Logging too much. Logs become noise. Be selective. Log important events and errors.

No alerting. Metrics are useless if nobody checks them. Wire up alerts.

Alerting on everything. Alert fatigue kills monitoring. Alert only on real problems.

FAQ

Should I log to a file or send to a service?

Send to a centralized logging service (ELK, Datadog, Splunk). Files are hard to search and rotate. Centralized logging scales better.

How long should I keep logs?

Depends on compliance. Usually 30-90 days for operational logs. Longer for audit logs. Old logs are searchable but expensive to store.

What’s the difference between Prometheus and Grafana?

Prometheus collects and stores metrics. Grafana visualizes them. Prometheus is the database. Grafana is the pretty dashboard.

Can I trace requests without distributed tracing tools?

Yes, manually. Pass a trace ID through all services and log it. Distributed tracing tools automate this and add timing information.

Logging and monitoring are the difference between confident deployments and panic at 3am. Get them right and you sleep better. Get them wrong and you’re debugging blind while the system burns. Invest in good logging infrastructure. Your future self will thank you.


DevelopersCodex

Real-world dev tutorials. No fluff, no filler.

© 2026 DevelopersCodex. All rights reserved.