- Deploy and manage production-grade Kubernetes clusters tailored for GPU-intensive workloads in AI/ML environments.
- Build and maintain GPU node pools using the NVIDIA Device Plugin together with a container runtime that supports GPU scheduling, such as CRI-O (see the pod sketch after this list).
- Orchestrate containerized distributed model training on Kubernetes with frameworks such as PyTorch, TensorFlow, or Hugging Face Transformers (see the training sketch after this list).
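
A minimal sketch of scheduling a GPU workload onto such a node pool, assuming a cluster reachable from the local kubeconfig, a node pool that advertises the `nvidia.com/gpu` resource via the NVIDIA Device Plugin, and a placeholder CUDA image; the pod name and image tag are illustrative, not prescribed.

```python
# Sketch: request one GPU for a smoke-test pod via the Kubernetes Python client.
from kubernetes import client, config


def create_gpu_pod(namespace: str = "default") -> None:
    # Load credentials from the local kubeconfig (use in-cluster config inside a pod).
    config.load_kube_config()

    container = client.V1Container(
        name="gpu-smoke-test",
        image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image tag
        command=["nvidia-smi"],
        resources=client.V1ResourceRequirements(
            # The nvidia.com/gpu resource is exposed by the NVIDIA Device Plugin.
            limits={"nvidia.com/gpu": "1"}
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)


if __name__ == "__main__":
    create_gpu_pod()
```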
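
A minimal sketch of the per-pod entrypoint for data-parallel PyTorch training, assuming the launcher (for example a training operator or a headless Service plus env injection) sets `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` in each pod; the model and data here are stand-ins.

```python
# Sketch: distributed data-parallel training step using torch.distributed.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    # NCCL is the usual backend for GPU-to-GPU collectives; rank/world size
    # are read from the environment (env:// rendezvous).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Stand-in model wrapped for synchronous gradient averaging across pods.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    inputs = torch.randn(32, 128, device=local_rank)
    targets = torch.randint(0, 10, (32,), device=local_rank)

    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```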