Staff Site Reliability Engineer - Data Center

Boston, MA or Remote

Posted Apr 22, 2026

## Where You Fit

We're looking for a skilled staff-level Site Reliability Engineer focused on designing, building, and operating our hybrid cloud/on-prem environment.

## What You’ll Do

If you're the right candidate, you'll be exercising all the skills you have and building new ones along the way:

- Advancing the state of our operations by implementing SRE best practices, with a focus on users, monitoring, and automation.
- Engineering infrastructure patterns for cloud environments in Amazon Web Services, building in security, reliability, and scalability.
- Designing, building, and operating our data center to support our rapidly growing Machine Learning team.
- Integrating on-premises data center environments with existing cloud infrastructure to create a seamless hybrid cloud environment.
- Improving the reliability and resilience of our infrastructure through root-cause analysis and by reviewing gaps in its design and implementation.
- Participating in platform on-call rotations and assisting with urgent incident response.

## What You Bring

Our employees' skills come in all shapes and sizes, but to be successful in this role with us, you'll at least need:

- 8+ years of relevant experience.
- Automation: You work hard to eliminate toil by automating everything through scripting, configuration management tools (Ansible), and code (Python/Go).
- You’ve built monitoring infrastructure with modern observability tools (Datadog/Grafana/Prometheus).
- You’ve worked with infrastructure as code (Terraform/CloudFormation).
- You’ve administered physical hardware stacks in production settings (iDRAC/IPMI/NVIDIA UFM/Juniper Systems).
- You’re opinionated on storage solutions and how they can be optimized for high-performance workloads (Quobyte/S3/FSx/EFS).
- Familiarity with modern network designs and comfort operating across network layers.
- Some experience with, and opinions on, virtualization, containerization, or container orchestration platforms (EKS/ClusterAPI/KVM).
- Operations experience: You’ve managed critical production infrastructure and are familiar with incident response, scaling, and the challenges of rapid growth.
- A bachelor's degree in Computer Science or equivalent experience.
- An insatiable intellectual curiosity and the ability to learn quickly in a complex space.
- Travel: Willingness to travel up to 25% of the time.
