Posted Apr 18, 2026
As a Site Reliability Engineer at Ascendion, you will embed directly within engineering teams to become an expert practitioner of the systems you support. Your role will involve understanding how systems are built and how they behave in production, and troubleshooting them forensically when issues arise. You will be expected to have both engineering rigor and operational instinct.

Key Responsibilities:
- Embed within engineering squads to build deep system knowledge, including understanding architecture, data flows, failure modes, and dependencies.
- Instrument systems with comprehensive observability such as metrics, logs, traces, and alerting to provide a full forensic picture of production behavior.
- Participate in on-call rotas and lead technical incident response using structured troubleshooting and tooling to diagnose and resolve production issues rapidly.
- Proactively identify reliability risks and collaborate with engineering teams to address them before they impact production.
- Build and maintain runbooks, playbooks, and diagnostic tooling to support efficient incident management.
- Continuously monitor system performance, validating infrastructure health, functional correctness of data pipelines, and application behavior.
- Support the SRE Lead in establishing team-wide standards for monitoring, alerting, and incident response.

Essential Skills & Experience:
- Solid software engineering or platform engineering background with production operations experience.
- Hands-on experience with observability and monitoring tooling such as Datadog, Grafana, ELK stack, Prometheus, or equivalent.
- Experience troubleshooting complex distributed systems with strong diagnostic skills and a methodical approach to incident investigation.
- Comfortable reading and understanding application code as well as infrastructure configuration.
- Experience working in Agile engineering teams with shared ownership of reliability outcomes.