Senior Observability Specialist
Johannesburg, ZA

Why This Role Matters
In a world where milliseconds matter and reliability is non-negotiable, observability is the superpower that keeps enterprise systems alive, agile, and trusted.
As our Senior Observability Specialist, you will architect the nervous system of our digital estate, bringing clarity to complexity across hybrid cloud, distributed platforms, and mainframe environments. Your work will help prevent incidents where possible, detect those that do occur faster, and resolve them with confidence and speed.
You won’t just monitor systems; you will illuminate them, championing Site Reliability Engineering (SRE) principles, automation-first thinking, and a truly measure-first culture.
Your Mission
To deliver enterprise-grade observability solutions that provide real-time insights into system health, performance, and resilience, using best-in-class platforms like Dynatrace and ServiceNow.
You’ll partner with technology and business leaders to design scalable, integrated monitoring ecosystems that empower teams, reduce noise, and turn data into decisive action.
What You’ll Be Doing
Design & Build World-Class Observability
- Architect and implement end-to-end observability across applications, infrastructure, and services
- Define enterprise standards for monitoring and event management
- Instrument services with telemetry (logs, metrics, and traces), APM, RUM, and synthetic monitoring
- Design synthetic user journeys for proactive, early-warning detection
- Build high-signal alerts and visually compelling dashboards that matter
- Ensure scalability, availability, and recoverability across hybrid cloud environments
Integrate, Automate, Accelerate
- Integrate observability platforms with ServiceNow Event Management
- Design event normalization, enrichment, CI binding, and noise reduction strategies
Drive Operational Excellence
- Strengthen operational readiness with clear procedures and response plans
- Partner with operations teams to close monitoring gaps and refine incident response
Champion Continuous Improvement
- Research emerging observability tools, trends, and practices
- Design advanced, cross-technology observability solutions at enterprise scale
Lead, Influence & Enable
- Serve as an observability and SRE consultant across enterprise initiatives
- Support SRE practices including SLIs, SLOs, error budgets, and post-incident reviews
Experience & Expertise
- 8–10 years of IT experience, including at least 7 years in monitoring, APM, observability, or SRE at enterprise scale
- Deep, hands-on expertise with Dynatrace, ServiceNow APM/Event Management, or equivalent platforms
- Strong background in event management design, ITSM integration, and automation
- Proven experience designing SLIs, SLOs, alert strategies, and driving MTTR reduction
- Production experience in AWS, Azure, or GCP, including Kubernetes and containerized environments
Technical Strengths
- Strong programming logic and scripting expertise
- Ability to design solutions that span complex, interdependent technologies
- Solid understanding of hybrid cloud and distributed system architectures
Qualifications
- IT Diploma / Degree or equivalent practical experience
The Way You Work
- Strategic thinker who sees the big picture while mastering the details
- Calm under pressure, leading incidents with clarity and confidence
- Able to influence without authority, coaching teams to shift left and own their SLOs
- Collaborative across all organizational levels
- Emotionally intelligent, decisive, resilient, and quality-driven
The Impact You’ll Make
You’ll redefine how reliability is engineered, how incidents are experienced, and how visibility empowers teams. Your influence will be felt across platforms, portfolios, and people, turning observability into a competitive advantage.

---------------------------------------------------------------------------------------
Please contact the Nedbank Recruiting Team at +27 860 555 566