Posted Apr 27, 2026
Key Responsibilities:
- Lead technical workshops to design Sovereign AI and Private Cloud AI platforms using Dell Validated Designs (DVD).
- Act as a Subject Matter Expert (SME) on the integration of NVIDIA AI Enterprise (NVAIE) with Dell PowerEdge XE servers (H100/H200/B200).
- Develop high-level and low-level designs (HLD/LLD) that incorporate GPU/Network Operators and high-speed InfiniBand/RoCE fabrics.
- Deploy and optimize Red Hat OpenShift and upstream Kubernetes in air-gapped or hybrid-cloud enterprise environments.
- Implement advanced workload scheduling and fractional GPU slicing using Run:ai or Slurm to maximize client ROI on hardware.
- Guide customers in choosing and implementing the right orchestration layer (e.g., BCM for bare metal vs. Kubernetes for microservices).
- Architect end-to-end MLOps pipelines utilizing Kubeflow, MLflow, or ClearML to streamline the "data-to-model" lifecycle.
- Enable distributed training and fine-tuning (LLMs/GenAI) for clients using Ray and PyTorch on Dell infrastructure.
- Integrate Rafay for clients requiring decentralized or multi-cluster AI management across edge and core data centres.
- Contribute to the CoE by developing reusable IP, deployment playbooks, and automated Ansible/Helm/Terraform scripts.
- Mentor junior consultants and lead technical proof-of-concepts (PoCs) that demonstrate the performance of Dell-NVIDIA stacks.

Qualification Required:
- 10+ years in professional services or consulting, with a heavy focus on AI, Big Data, or HPC infrastructure.
- Mastery of NVIDIA GPU Operator, Network Operator, and NVIDIA Base Command Manager (BCM).
- Expert-level Kubernetes (CKA/CKS) or Red Hat OpenShift skills, including complex security, CNI (Cilium/Multus) and storage (CSI) configurations.
- Experience with Run:ai, Slurm, or Altair PBS for high-concurrency AI environments.
- Hands-on experience with Kubeflow, MLflow, Ray, and ClearML.
- Advanced Ansible, Helm, Terraform, and Python skills for "Infrastructure as Code" delivery.
- Deep expertise in Dell PowerEdge (XE/R series), PowerScale, and PowerSwitch networking.