As a Site Reliability Engineer (SRE), your role involves combining software engineering and systems engineering to build, operate, and support large-scale, distributed, fault-tolerant systems. Your focus will be on ensuring high availability, performance, security, and reliability across cloud-native and hybrid environments through automation, observability, and operational excellence.
Key Responsibilities:
Manage system uptime and reliability across cloud-native (AWS, GCP) and hybrid architectures
Design and implement Infrastructure as Code (IaC) solutions meeting security and engineering standards using tools like Terraform, cloud CLIs, and cloud SDKs
Build and maintain CI/CD pipelines for application and infrastructure deployment using tools such as Jenkins and cloud-native toolchains
Develop automated tooling for deploying production changes and managing service requests effectively
Create and maintain comprehensive runbooks for detecting, remediating, and restoring services
Troubleshoot and triage complex issues in distributed systems, including participation in on-call rotations for high-severity incidents
Continuously improve runbooks and operational processes to reduce Mean Time to Recovery (MTTR)
Lead blameless postmortems for availability incidents and own remediation actions to prevent recurrence
Key Skills to Develop:
DevSecOps
Operational Excellence
Systems Thinking
Troubleshooting
Technical Communication and Presentation
Required Experience & Qualifications:
Bachelor's degree in Computer Science or a related technical field involving coding (or equivalent practical experience)
5-7 years of experience across software engineering, systems administration, database administration, or networking
At least 2 years of experience developing or administering systems on public cloud platforms
Experience monitoring infrastructure and application availability to meet performance and reliability objectives
Proficiency in one or more programming/scripting languages such as Python, Bash, Java, Go, JavaScript, or Node.js
Strong cross-functional understanding of systems, networking, storage, security, and databases
System administration and automation experience using tools such as Terraform, Chef, Ansible, and containers (Docker, Kubernetes)
Strong experience with CI/CD tools and practices
Cloud certifications are strongly preferred