
Expert · Code Development

Multi-System Debugging Architect

Advanced debugging workflow for complex, cross-system bugs requiring sustained reasoning and methodical root cause analysis.

When to Use This Skill

  • Cross-service bugs involving microservices, APIs, databases, and frontend
  • Race conditions and timing-dependent failures
  • Memory leaks or performance degradation across system boundaries
  • Integration failures between third-party services
  • Non-deterministic bugs that are difficult to reproduce

How to Use This Skill

1. Copy the AI Core Logic from the System Directives section below.

2. Paste it into your AI's System Instructions or send it as your first message (see the API sketch after these steps).

3. Provide your raw data or requirements as requested by the AI.
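
If you drive the model through an API instead of a chat UI, the same three steps apply: the copied System Directives become the system prompt and your bug report becomes the first user message. Below is a minimal sketch using the Anthropic Python SDK; the file name, model string, and example bug report are placeholders rather than part of the skill itself.

```python
# Minimal sketch: use the System Directives as a persistent system prompt via the Anthropic Python SDK.
# Assumptions: the `anthropic` package is installed, ANTHROPIC_API_KEY is set, the directives were
# saved to debugging_skill.md, and the model name below is a placeholder for your Opus-class model.
import anthropic

with open("debugging_skill.md") as f:
    skill_directives = f.read()

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model name
    max_tokens=2000,
    system=skill_directives,  # the copied System Directives act as persistent instructions
    messages=[
        {
            "role": "user",
            "content": (
                "Symptom: intermittent 500s on checkout. "
                "Affected systems: api-gateway, orders-service, postgres. "
                "Error: 'Connection pool exhausted' in orders-service logs."
            ),
        }
    ],
)

print(response.content[0].text)
```

From there, each subsequent phase of the workflow is simply another user message in the same conversation.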

#debugging #software-engineering #systems-analysis #root-cause-analysis #troubleshooting

System Directives

## The Debugging Workflow

### Phase 1: Bug Landscape Mapping

```
I need help debugging a complex issue in my system:

**Symptom:** [Describe the observable problem]
**Affected Systems:** [List all involved services/components]
**Error Messages:** [Paste any error logs or stack traces]
**Reproduction Steps:** [How to trigger the bug, if known]
**Environment:** [Production/staging/local, versions, dependencies]

Please perform a comprehensive landscape analysis:
1. Identify all system touchpoints involved in this bug
2. Map the data flow across services
3. List potential failure points at each boundary
4. Identify hidden dependencies or side effects
5. Propose 3-5 most likely root cause hypotheses, ranked by probability
```

### Phase 2: Hypothesis-Driven Investigation

```
Based on your analysis, let's investigate [hypothesis #X]:

**Investigation Strategy:**
1. What specific logs, metrics, or traces should I collect?
2. What experiments can isolate this hypothesis?
3. What would confirm or rule out this root cause?
4. Walk me through the debugging steps methodically

For each step, explain:
- What we're testing
- What to look for
- How to interpret the results
- Next steps based on findings
```

### Phase 3: Root Cause Verification

```
I've gathered the following evidence:
[Paste logs, stack traces, metrics, or experimental results]

Please analyze this data:
1. Does this confirm or rule out our hypothesis?
2. Are there any surprising patterns or anomalies?
3. What is the definitive root cause?
4. Trace the bug from symptom back to origin
5. Explain the causal chain: why does X lead to Y?
```

### Phase 4: Fix Design & Implementation

```
Now that we've identified the root cause as [explanation], please:

1. **Design the Fix:**
   - Propose 2-3 fix approaches with trade-offs
   - Recommend the best approach and explain why
   - Identify any risks or side effects of the fix

2. **Implementation Plan:**
   - Provide step-by-step code changes
   - List all files that need modification
   - Include test cases to verify the fix
   - Add regression tests to prevent recurrence

3. **Validation Strategy:**
   - How to verify the fix works in isolation
   - How to test across system boundaries
   - What metrics to monitor post-deployment
```

### Phase 5: Preventive Measures

```
To prevent similar bugs in the future:

1. What architectural weaknesses allowed this bug?
2. What observability gaps should we address?
3. What code patterns or practices should we adopt?
4. Draft preventive measures or refactoring suggestions
```

## Advanced Techniques

### For Non-Reproducible Bugs

```
This bug occurs intermittently and I can't reliably reproduce it.

**Known Information:**
- Frequency: [How often it happens]
- Conditions: [When it seems to occur]
- Logs: [Available evidence]

Help me:
1. Design a logging/tracing strategy to capture the bug in action
2. Identify what data points would be most diagnostic
3. Suggest monitoring or alerting to catch the next occurrence
4. Propose theories about timing, state, or environmental factors
```

### For Performance Degradation

```
My system is experiencing gradual performance degradation:

**Symptoms:** [Response times, throughput, resource usage]
**Timeline:** [When it started, how it progresses]
**System Profile:** [Architecture, scale, load patterns]

Analyze:
1. Is this a resource leak (memory, connections, handles)?
2. Is this algorithmic complexity scaling poorly?
3. Is this a cascading failure or backpressure issue?
4. What profiling or benchmarking would isolate the cause?
```
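
If the answers above point toward a memory leak, a quick heap comparison can confirm it before any cross-system digging. Below is a minimal sketch using Python's standard-library `tracemalloc`; the `handle_request` function and its growing cache are hypothetical stand-ins for real service code.

```python
# Sketch: spot a growing allocation site with the standard-library tracemalloc module.
# `handle_request` and `_cache` are hypothetical stand-ins for real service code.
import tracemalloc

_cache = []  # grows forever: the deliberate "leak" for this demo

def handle_request() -> None:
    # Keeps per-request state and never releases it.
    _cache.append(b"x" * 10_000)

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for _ in range(1_000):
    handle_request()

snapshot = tracemalloc.take_snapshot()
# The top entries point at the allocation sites that grew the most since the baseline.
for stat in snapshot.compare_to(baseline, "lineno")[:5]:
    print(stat)
```
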
### For Integration Failures

```
I'm experiencing failures when System A integrates with System B:

**Context:**
- System A: [Technology, version, behavior]
- System B: [Technology, version, expected behavior]
- Integration Point: [API, message queue, database, etc.]
- Failure Mode: [Error messages, unexpected behavior]

Debug the integration:
1. Verify contract assumptions on both sides
2. Check serialization, authentication, timeouts
3. Identify protocol mismatches or version incompatibilities
4. Propose integration tests to prevent future breakage
```

## Best Practices

1. **Start Broad, Then Narrow**: Begin with system-wide analysis before diving into specific code
2. **Document as You Go**: Maintain a debugging log of hypotheses tested and results
3. **Think in Systems**: Consider network partitions, eventual consistency, and distributed state
4. **Use Controlled Experiments**: Change one variable at a time to isolate causes
5. **Leverage Tooling**: Request specific observability tools, profilers, or debuggers

## Example Session

**User**: "My microservices app returns 500 errors randomly. Sometimes it works, sometimes it doesn't. I see 'Connection pool exhausted' in logs but only on Service C."

**Opus Response**:

> Let me map the system landscape first. Based on "connection pool exhausted" in Service C, this suggests:
>
> **Hypothesis 1 (Probability: 65%)**: Service C is leaking connections - opening but not properly closing them.
>
> **Hypothesis 2 (Probability: 25%)**: Upstream services are creating more requests than expected, overwhelming C's pool.
>
> **Hypothesis 3 (Probability: 10%)**: Connection pool size is misconfigured for the current load.
>
> Let's investigate Hypothesis 1 first:
>
> 1. Check Service C's database/HTTP client code for missing `.close()` calls or improperly handled exceptions
> 2. Add connection pool metrics: active, idle, waiting
> 3. Profile Service C under load to see if connections accumulate...

## Tips for Success

- Provide as much context as possible upfront (logs, architecture diagrams, code snippets)
- Be specific about your constraints (e.g., can't change the database schema, must maintain backwards compatibility)
- Ask for explanations when the reasoning isn't clear
- Use Opus's ability to "think step-by-step" by explicitly requesting methodical analysis
- Iterate: debugging is rarely a one-shot process

## Related Resources

- [AWS X-Ray](https://aws.amazon.com/xray/) for distributed tracing
- [Datadog APM](https://www.datadoghq.com/product/apm/) for performance monitoring
- [Sentry](https://sentry.io/) for error tracking
- [OpenTelemetry](https://opentelemetry.io/) for observability instrumentation
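
Closing the observability gaps identified in Phase 5 usually starts with a little manual instrumentation. The sketch below is an assumed minimal setup with the `opentelemetry-api` and `opentelemetry-sdk` Python packages, exporting spans to the console; the service, span, and attribute names are illustrative only.

```python
# Sketch: minimal manual tracing with OpenTelemetry, exporting spans to stdout.
# Assumes `opentelemetry-api` and `opentelemetry-sdk` are installed; names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to the console; swap the exporter for an OTLP/X-Ray/Datadog backend later.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("service-c")  # hypothetical service name

def checkout(order_id: str) -> None:
    # Wrap the suspect code path so each request records timing, attributes, and errors.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("db.query"):
            pass  # stand-in for the real database or HTTP call

checkout("order-123")
```

Swapping `ConsoleSpanExporter` for an OTLP exporter later lets the same spans flow into the tracing backends listed above.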

Procedural Integration

This skill is formatted as a set of persistent system instructions. When integrated, it provides the AI model with specialized workflows and knowledge constraints for Code Development.

Model Compatibility
🤖 Claude Opus
Code Execution: Required
MCP Tools: Optional
Footprint: ~1,730 tokens