Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Cerebras Systems Inc. has multiple openings for Sr. Member of Technical Staff.
Title: Sr. Member of Technical Staff
Job Duties:
- Design and develop software features that support system resiliency and high availability, including automated recovery mechanisms and fault-tolerant architecture across distributed environments.
- Develop and maintain cloud-based deployment workflows for AI inference software using AWS tools and services to support low-latency and scalable system performance.
- Develop Python-based scripts and APIs to streamline data preprocessing, inference execution, and post-processing for real-time inference tasks.
- Use parallel programming techniques (e.g., multi-threading, asynchronous processing) to maximize resource efficiency on AWS compute instances.
- Develop software components to support visualization and analysis of system performance metrics, enhancing the monitoring and usability of inference services.
- Develop inference software in Docker containers and define Kubernetes orchestration strategies that ensure software reliability and efficient scaling.
- Develop automated scripts to detect and mitigate common failure modes, improving software system reliability.
- Debug issues related to model deployment, container orchestration, and networking configurations, documenting steps to reproduce and root-cause defects.
- Triage and resolve defects in the software service by analyzing logs, metrics, and distributed traces using tools like AWS CloudWatch, Grafana, or custom Python scripts.
- Work with product management and user experience teams to define requirements for inference service interfaces, including configuration, monitoring, and event logging.
- Author detailed technical documentation for infrastructure configurations, inference workflows, and APIs, ensuring clarity for internal teams and external customers.
- Document and track defects, enhancements, and release notes using tools like Jira and Git, ensuring version control and traceability.
Minimum Requirements:
Master’s degree or foreign equivalent degree in Computer Science or a related field, and 18 months of experience as an Information Security Analyst, Software Engineer, Sr. Member of Technical Staff, IT Senior Applications Engineer, or a related occupation required. The required experience must include 18 months of experience with the following:
Infrastructure-as-Code and deployment automation: Terraform, AWS CloudFormation, AWS CDK, and Ansible;
Containerization and orchestration: Docker, Kubernetes, AWS EKS, AWS Elastic Container Service (ECS), AWS Fargate, and Helm;
Compute and serverless services: AWS EC2, AWS Lambda functions, and Auto Scaling Groups;
Monitoring, logging, and distributed tracing: AWS CloudWatch, AWS X-Ray, ELK (Elasticsearch, Logstash, Kibana), Prometheus, and Grafana;
Programming languages and frameworks: Python, Node.js, JavaScript, and Flask;
Data storage and caching: PostgreSQL, Redis, and NFS; and
CI/CD and version control: Jenkins and Git.
Additional Information:
Employer’s name: Cerebras Systems Inc.
Job site: 1237 E Arques Avenue, Sunnyvale, CA 94085