Principal Engineer | Huntsville, AL

Distributed systems engineering with a focus on operability and safe recovery.

I design and harden event-driven systems, data pipelines, and control-plane services so failures are visible, recovery is safe, and operations stay ready. Currently employed full time; open to limited advisory and moonlighting work subject to employer policies.

Event-driven systems | Observability and MTTR | Safe replay and idempotency | AWS + Kubernetes

Current focus

  • Reliability under partial failure and dependency instability.
  • Safe recovery: DLQ strategy, quarantine, and replay runbooks.
  • Low-noise telemetry with correlation IDs and traces.
  • Game Days and readiness checks that keep systems honest.

Experience

Principal-level engineering across finance, defense, and data platforms.

I build high-throughput systems and data platforms that stay reliable under pressure, with a bias for clarity, repeatability, and safe recovery.

Background

Built high-throughput distributed systems, data pipelines, and cloud-native platforms in commercial and defense contexts.

Working style

Structured, documentation-first delivery with clear acceptance criteria and measurable outcomes.

Focus areas

Where I add the most value.

Hands-on engineering and architecture work that improves reliability, visibility, and recovery speed.

Operability and incident readiness

Telemetry baselines, alert tuning, and runbooks that make failures diagnosable and actionable.

Resilient workflows and pipelines

Retry classification, backoff, backpressure, and safe failure handling for event-driven systems.

Safe recovery design

Idempotency, quarantine, and replay strategies to prevent corruption during reprocessing.

Strengths

Depth across systems, platforms, and reliability patterns.

Systems and architecture

  • Event-driven workflows and streaming ingestion
  • Control-plane reconciliation loops
  • Partition-tolerant and async patterns
  • Performance-critical services and data platforms

Tools and languages

  • AWS (Lambda, SQS, Step Functions, EventBridge)
  • Kubernetes/EKS, Gateway API, Envoy
  • OpenTelemetry, Prometheus, Grafana
  • Go, Python, SQL

Selected highlights

Selected outcomes from production systems.

Improved latency stability

Tuned gRPC retry behavior for latency-sensitive workflows under volatile load.

Faster incident recovery

Designed idempotent consumer patterns and safe replay paths for streaming pipelines.

Contact

Open to limited advisory and moonlighting engagements.

If the scope is clear and bounded, I am happy to talk. Evening and weekend availability varies by week.

Joshua A. Sorrell

Huntsville, AL

me@joshuasorrell.com

256.654.5707

joshuasorrell.com