Staff AI Infrastructure Engineer
Posted on 9/11/2023
Scale AI

51-200 employees

Data platform for AI
Company Overview
Scale AI's mission is to accelerate the development of AI applications.
Locations
San Francisco, CA, USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
AWS
Docker
Google Cloud Platform
Terraform
Kubernetes
Python
Categories
AI & Machine Learning
DevOps & Infrastructure
Software Engineering
Requirements
  • 5+ years of experience building machine learning training pipelines or inference services in a production setting
  • Experience leading teams to accomplish shared goals
  • Experience with distributed training techniques such as DeepSpeed or FSDP
  • Experience building, deploying, and monitoring complex microservice architectures
  • Experience with Python, Docker, Kubernetes, and infrastructure as code (e.g. Terraform)
  • Experience working with a cloud technology stack (e.g. AWS or GCP)
Responsibilities
  • Build highly available, observable, performant, and cost-effective APIs for model training
  • Participate in our team's on-call process to ensure the availability of our services
  • Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment
  • Exercise good taste in building systems and tools and know when to make build vs. buy tradeoffs, with an eye for cost efficiency
  • Provide mentorship and guidance to junior engineers, fostering their professional growth and development within the team
Desired Qualifications
  • Experience with LLM inference latency optimization techniques, e.g. kernel fusion, quantization, and dynamic batching