Posted Apr 22, 2026
We’re a global team of over 400 people, working together to push the boundaries of open-source technology and multi-cloud solutions. Our vision is to help developers, builders, and creators bring their ideas to life with speed and simplicity, by providing a cloud data platform that makes open-source databases, search, streaming, and application infrastructure easily accessible to everyone.

### The Role:
We are seeking a Director of Site Reliability Engineering to lead a global organization responsible for the reliability and operational excellence of the Aiven platform. You will lead a high-performing SRE team, setting the vision and strategy to ensure resilient, scalable, and highly automated systems across our 24/7/365 operations. Your team will proactively manage platform health, lead incident response and cross-functional coordination, and drive continuous improvement in reliability and performance. As a senior leader, you will partner closely with engineering, product, and support teams worldwide, influence system architecture, and invest in tooling and automation to reduce toil and enhance production reliability. This role combines strategic leadership, customer centricity, and deep operational accountability, with a focus on delivering reliable services at global scale while developing strong technical leaders within your organization.

### What You'll Do:
- Define and drive global SRE operating strategy in partnership with regional SRE leaders across EMEA, AMER, and APAC, ensuring alignment on reliability goals, operating models, and execution across a 24/7/365 follow-the-sun organization.
- Build and lead a multi-regional SRE organization through managers, developing leadership capability, mentoring teams, and ensuring consistent performance, culture, and delivery across geographies.
- Set the vision and roadmap for reliability engineering, enabling teams to deliver high-impact tools, automation, and process initiatives that improve platform resilience, scalability, and efficiency.
- Own global incident management strategy and operating model, including on-call design, coverage, and escalation frameworks, ensuring seamless coordination and high availability across regions.
- Establish a metrics-driven operating cadence, defining KPIs, SLIs, SLOs, and error budgets, driving data-informed prioritization, and embedding operational rigor and continuous improvement across the SRE organization.

###
- Proven experience leading and scaling global SRE or infrastructure organizations through managers, ideally across multiple regions and time zones.
- Strong track record of defining and executing reliability strategy at scale, including ownership of SLIs/SLOs, incident management frameworks, and operational excellence programs.
- Demonstrated ability to build, develop, and mentor senior leaders, creating high-performing, inclusive teams and strong leadership pipelines.
- Experience operating in a 24/7/365 production environment, with deep understanding of follow-the-sun models, on-call design, and large-scale incident response.
- Ability to partner cross-functionally at the executive level (Engineering, Product, Support) to influence architecture, prioritization, and long-term platform investments.
- Strong data-driven leadership approach, with experience defining SLIs/SLOs and using metrics to drive prioritization, accountability, and continuous improvement.
- Solid technical foundation in distributed systems, cloud infrastructure, and automation, with the ability to engage credibly with senior engineers and influence technical direction.
- Experience driving large-scale change and organizational design, including scaling teams, evolving operating models, and improving efficiency and reliability at the company level.

### Global Benefits:
Our global benefits are designed to help you thrive and grow, personally and professionally: