## Curation Note
Production debugging without observability is like flying blind. This skill compiles patterns from SRE teams at major tech companies where observability is considered essential infrastructure. The three pillars (logs, metrics, traces) give complementary views of a system: what happened, how often it happens, and where a request spent its time. The emphasis on correlation IDs and structured logging addresses the primary challenge: connecting related events across distributed systems.
## The Three Pillars
### 1. Logs (Events)
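```typescript
// Structured logging with correlation
import { randomUUID } from 'node:crypto';
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    // Emit the level as its label ("info") rather than pino's numeric value
    level: (label) => ({ level: label })
  },
  base: {
    service: 'user-service',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV
  }
});

// Assumption: the trace ID comes from the active OpenTelemetry span, if any
function getTraceId(): string | undefined {
  return trace.getActiveSpan()?.spanContext().traceId;
}

// Create child logger with request context
function createRequestLogger(requestId: string, userId?: string) {
  return logger.child({
    requestId,
    userId,
    traceId: getTraceId()
  });
}

// Usage in request handler (assumes an Express `app`, with `req.log` and
// `req.user` declared on the Request type)
app.use((req, res, next) => {
  // Reuse the caller's request ID when present so logs correlate across services;
  // randomUUID() stands in for the ID generator here
  const requestId = req.get('x-request-id') || randomUUID();
  req.log = createRequestLogger(requestId, req.user?.id);
  req.log.info(
    {
      method: req.method,
      path: req.path,
      query: req.query
    },
    'Request received'
  );
  next();
});
```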
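The handler above logs `req.query` verbatim, and query strings can carry tokens or email addresses. A minimal sketch of masking such fields before they reach the logger follows. `redactFields` and the key list are hypothetical names chosen for illustration; pino also has a built-in `redact` option that may be a better fit in practice.

```typescript
// Hypothetical list of keys to mask; adjust to the fields your API accepts.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'email']);

// Hypothetical helper: returns a shallow copy with sensitive keys masked.
// Nested objects are not traversed in this sketch.
function redactFields(input: Record<string, unknown>): Record<string, unknown> {
  const output: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    output[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : value;
  }
  return output;
}

// Example: the token is masked, the page number is kept.
const safeQuery = redactFields({ page: '2', token: 'abc123' });
```

In the request handler this would replace `query: req.query` with something like `query: redactFields(req.query)`.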
### 2. Metrics (Aggregates)
```typescript
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

const register = new Registry();

// Request metrics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register]
});

const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Business metrics
const activeUsers = new Gauge({
  name: 'active_users_total',
  help: 'Number of currently active users',
  registers: [register]
});

// Middleware to collect metrics
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Use the route pattern, not the raw URL, to keep label cardinality low
    const route = req.route?.path || 'unknown';
    httpRequestDuration.labels(req.method, route, res.statusCode.toString()).observe(duration);
    httpRequestTotal.labels(req.method, route, res.statusCode.toString()).inc();
  });
  next();
});
```
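The `buckets` array determines which latency ranges the histogram can tell apart. The sketch below shows how cumulative buckets count observations, assuming Prometheus's "less than or equal" bucket semantics. `MiniHistogram` is a hypothetical stand-in written for illustration, not prom-client's implementation.

```typescript
// Hypothetical illustration of cumulative histogram buckets.
class MiniHistogram {
  private readonly upperBounds: number[];
  private readonly counts: number[];
  private total = 0;

  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds;
    this.counts = upperBounds.map(() => 0);
  }

  observe(value: number): void {
    this.total += 1;
    // Cumulative: every bucket whose upper bound is >= value is incremented.
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.counts[i] += 1;
    });
  }

  // Finite buckets followed by the implicit +Inf bucket, which counts everything.
  snapshot(): { le: string; count: number }[] {
    const finite = this.upperBounds.map((le, i) => ({ le: String(le), count: this.counts[i] }));
    return [...finite, { le: '+Inf', count: this.total }];
  }
}

const latencyHistogram = new MiniHistogram([0.01, 0.05, 0.1, 0.5, 1, 5]);
[0.02, 0.3, 0.3, 7].forEach((seconds) => latencyHistogram.observe(seconds));
const bucketSnapshot = latencyHistogram.snapshot();
```

The 7-second request lands only in the `+Inf` bucket, so anything slower than the largest finite bound becomes indistinguishable. That is one reason to place bucket bounds around the latencies you intend to alert on.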
### 3. Traces (Requests)
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('user-service');

async function createUser(userData: UserData): Promise<User> {
  // Create span for this operation
  return tracer.startActiveSpan('createUser', async (span) => {
    try {
      // Note: an email address is personal data; consider hashing or omitting it
      span.setAttribute('user.email', userData.email);
      // Database operation as child span
      const user = await tracer.startActiveSpan('db.insert', async (dbSpan) => {
        try {
          dbSpan.setAttribute('db.system', 'postgresql');
          dbSpan.setAttribute('db.operation', 'INSERT');
          return await db.users.create(userData);
        } finally {
          // End the child span even if the insert throws
          dbSpan.end();
        }
      });
      // External service call as child span
      await tracer.startActiveSpan('email.send', async (emailSpan) => {
        try {
          emailSpan.setAttribute('email.type', 'welcome');
          await emailService.sendWelcome(user.email);
        } finally {
          emailSpan.end();
        }
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return user;
    } catch (error) {
      // `error` is `unknown` in TypeScript, so normalize it before reading `.message`
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message
      });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
}
```
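Every span in the example above has to be ended on every code path, which is easy to get wrong when the pattern is repeated by hand. One possible way to centralize it is a small wrapper. `runInSpan` and `SpanLike` are hypothetical names; `SpanLike` mirrors only the three span methods used above so the sketch runs without the OpenTelemetry package installed.

```typescript
// Minimal structural type covering only the span methods used in this sketch.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(error: Error): unknown;
  end(): void;
}

// Numeric values of SpanStatusCode.OK and SpanStatusCode.ERROR in '@opentelemetry/api'.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Hypothetical helper: runs `work`, records the outcome, and always ends the span.
async function runInSpan<T>(span: SpanLike, work: () => Promise<T>): Promise<T> {
  try {
    const result = await work();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({ code: STATUS_ERROR, message: err.message });
    span.recordException(err);
    throw error; // rethrow so callers still see the failure
  } finally {
    span.end();
  }
}
```

With a helper like this, the database step could shrink to something like `tracer.startActiveSpan('db.insert', (s) => runInSpan(s, () => db.users.create(userData)))`, assuming the real span type is compatible with `SpanLike`.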
## Log Levels
```typescript
// When to use each level:
logger.trace('Detailed debugging'); // Development only
logger.debug('Variable values'); // Development/debugging
logger.info('Normal operations'); // Business events
logger.warn('Unexpected but handled'); // Potential issues
logger.error('Operation failed'); // Errors requiring action
logger.fatal('System cannot continue'); // Critical failures
```
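Levels act as a threshold: setting `LOG_LEVEL=info` suppresses `trace` and `debug` while keeping everything more severe. The sketch below illustrates that comparison, assuming pino's default numeric level values. `isLevelEnabled` here is a standalone illustration, not a call into the logger.

```typescript
// Numeric severities, assumed to match pino's default levels.
const LEVEL_VALUES = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type LevelName = keyof typeof LEVEL_VALUES;

// A message is emitted only when its severity is at or above the threshold.
function isLevelEnabled(messageLevel: LevelName, threshold: LevelName): boolean {
  return LEVEL_VALUES[messageLevel] >= LEVEL_VALUES[threshold];
}

// Which levels survive a threshold of 'info'?
const emittedAtInfo = (Object.keys(LEVEL_VALUES) as LevelName[]).filter((lvl) =>
  isLevelEnabled(lvl, 'info')
);
```

Here `emittedAtInfo` contains `info`, `warn`, `error`, and `fatal`, so routine business events are kept in production while development-only detail is dropped.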
## Alerting Rules
```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'High error rate detected'
      - alert: SlowRequests
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: '95th percentile latency above 1 second'
```
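The `SlowRequests` rule relies on `histogram_quantile`, which estimates a percentile from bucket counts rather than from raw latencies. The sketch below is a simplified version of that estimate using linear interpolation inside the matching bucket. `estimateQuantile` and the sample counts are made up for illustration, and the sketch assumes `0 < q <= 1`, buckets sorted by upper bound, and a final `+Inf` bucket preceded by at least one finite bucket.

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

// Simplified sketch of quantile estimation from cumulative buckets.
function estimateQuantile(q: number, sortedBuckets: Bucket[]): number {
  const last = sortedBuckets.length - 1;
  const total = sortedBuckets[last].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const index = sortedBuckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (index === last) return sortedBuckets[last - 1].le;
  const lower = index === 0 ? 0 : sortedBuckets[index - 1].le;
  const below = index === 0 ? 0 : sortedBuckets[index - 1].count;
  const inBucket = sortedBuckets[index].count - below;
  // Assume observations are spread evenly within the bucket.
  return lower + (sortedBuckets[index].le - lower) * ((rank - below) / inBucket);
}

// Made-up sample: 100 requests, 8 of them between 1s and 5s, 2 slower than 5s.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 5, count: 98 },
  { le: Infinity, count: 100 }
];
const p95 = estimateQuantile(0.95, sampleBuckets);
```

For this sample the estimate is roughly 3.5 seconds, which would exceed the rule's 1-second threshold. Because the estimate is interpolated, its accuracy depends on how tightly the bucket bounds bracket the threshold.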
## Best Practices
1. **Correlate everything** - Use request IDs across services
2. **Structure logs** - JSON format for parsing
3. **Right level, right time** - Match the level to the event's severity, and never log sensitive data
4. **Sample high-volume** - 1% of traces may suffice
5. **Alert on symptoms** - User impact, not causes
6. **Retention policies** - Balance cost and usefulness
7. **Dashboard hierarchy** - Service > Endpoint > Instance
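The sampling advice in item 4 works best when every service makes the same keep-or-drop decision for a given trace, so sampled traces stay complete. A minimal sketch of deterministic sampling keyed on the trace ID follows. `shouldSample` is a hypothetical function and a simplification; OpenTelemetry SDKs ship ratio-based samplers that are likely the better choice in practice.

```typescript
// Simplified sketch: deterministic sampling keyed on the trace ID.
// Assumes trace IDs are hex strings of at least 8 characters (OpenTelemetry uses 32).
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // Keep the trace when it falls in the lowest `ratio` share of the ID space.
  return bucket < ratio * 0x100000000;
}

// Example decisions at a 1% sampling ratio.
const keptTrace = shouldSample('00000000ffffffffffffffffffffffff', 0.01);
const droppedTrace = shouldSample('ffffffff000000000000000000000000', 0.01);
```

Because the decision depends only on the trace ID and the ratio, two services configured with the same ratio agree without coordinating, provided trace IDs are close to uniformly random.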