Job Description
The Platform Engineer – AI & GPU Services will implement and maintain AI/ML platforms and manage GPU resources across cloud (GCP) and on-premises infrastructure. This role combines expertise in cloud services, AI/ML technologies, and infrastructure automation to support both product engineering and platform engineering functions. The ideal candidate will have experience with generative AI services, GPU management, and container orchestration platforms.
Responsibilities
• Architect, build, and maintain AI/ML platforms using Google Cloud Platform (GCP) services such as Compute Engine, Cloud Storage, IAM, and VPC.
• Manage NVIDIA GPU resources across projects using Run.ai or similar tools.
• Develop and maintain MLOps pipelines on platforms like Vertex AI, supporting AI/ML model training and deployment.
• Write Python scripts for model development, automation, and infrastructure management.
• Use Terraform for Infrastructure as Code (IaC) to automate provisioning and deployment of cloud resources.
• Deploy and manage AI/ML models on container orchestration platforms such as OpenShift and GKE.
• Collaborate with AI teams to deploy LLMs (e.g., Llama, Mistral) and optimize GPU utilization.
• Automate and improve CI/CD pipelines for reliable integration and deployment of services.
• Monitor performance and capacity with Prometheus, Grafana, and other observability tools to ensure system stability.
• Apply DevOps practices, including containerization, orchestration, and infrastructure management.
Qualifications
• Strong experience with Google Cloud Platform (GCP) and its core services (Compute Engine, Cloud Storage, IAM, VPC).
• Experience with GPU resource management tools (e.g., Run.ai).
• Proficiency with Python for AI/ML workflows and automation.
• Hands-on experience with MLOps platforms like Vertex AI.
• Experience managing cloud infrastructure with Terraform, following Infrastructure as Code (IaC) practices.
• Knowledge of Kubernetes and container orchestration platforms such as OpenShift and GKE.
• Familiarity with monitoring and logging tools like Prometheus, Grafana, and the ELK Stack.
• Proven track record of working with CI/CD pipelines and DevOps automation tools.