Full-Time

ML Infrastructure Engineer

Posted on 7/17/2025

Phizenix

Compensation Overview

$180k - $200k/yr

Menlo Park, CA, USA

In Person

Category
DevOps & Infrastructure
Required Skills
Kubernetes
Rust
Microsoft Azure
Python
TensorFlow
PyTorch
Docker
AWS
Go
Observability
C/C++
Google Cloud Platform
Requirements
  • Master's or PhD in Computer Science, Engineering, or a related field (or equivalent experience)
  • Strong foundation in software engineering, systems design, and distributed systems
  • Experience with cloud platforms (AWS, GCP, or Azure)
  • Proficient in Python and at least one systems-level language (C++/Rust/Go)
  • Hands-on experience with Docker, Kubernetes, and CI/CD workflows
  • Familiarity with ML frameworks like PyTorch or TensorFlow from a systems perspective
  • Understanding of GPU programming and high-performance infrastructure
Responsibilities
  • Design and manage distributed infrastructure for ML training at scale
  • Optimize model serving systems for low-latency inference
  • Build automated pipelines for data processing, model training, and deployment
  • Implement observability tools to monitor performance in production
  • Maximize resource utilization across GPU clusters and cloud environments
  • Translate research requirements into robust, scalable system designs
Desired Qualifications
  • Experience with large-scale ML training clusters and GPU orchestration
  • Knowledge of LLM-serving tools (vLLM, TensorRT, ONNX Runtime)
  • Experience with distributed training strategies (e.g., data/model/pipeline parallelism)
  • Familiarity with orchestration tools like Kubeflow or Airflow
  • Background in performance tuning, system profiling, and MLOps best practices

INACTIVE