Full-Time

AI Compute Infrastructure Engineer

Cerebras

201-500 employees

Produces large-scale AI computing systems

Hardware
AI & Machine Learning

Mid, Senior

Sunnyvale, CA, USA

Required Skills
TCP/IP
Kubernetes
Python
TensorFlow
PyTorch
Docker
Linux/Unix
Requirements
  • BS or MS in CS/EE
  • 5+ years of relevant experience managing compute infrastructure
  • Hands-on technical expert
  • Proficiency with Python and other common programming languages
  • Demonstrated high impact in a variety of products and roles
  • Experience with container orchestration and scheduling platforms such as Kubernetes and SLURM
  • Experience with ML frameworks such as PyTorch and TensorFlow (see the sketch after this list)
  • Strong knowledge of and demonstrated experience with:
    • Linux-based compute systems, virtualization, and Docker containers
    • Scheduling and orchestration applications such as SLURM and Kubernetes
  • Good understanding of cloud infrastructure design, deployment, and maintenance
  • Knowledge of technologies such as Ethernet, RoCE, and TCP/IP is desired
  • Past experience with cross-functional team projects
  • Past experience interacting with high-value customers
  • Proven track record of owning and driving challenges to completion
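
As a rough illustration of the Python and ML-framework familiarity these requirements describe (not part of the original posting), the sketch below runs a single PyTorch training step to verify that a node's framework install and accelerator are usable. The function name `smoke_test` and the tiny linear model are assumptions for the example; Cerebras systems are programmed through their own software stack rather than plain `torch.cuda`.

```python
# Illustrative sanity check an infrastructure engineer might run on a compute
# node before handing it to ML users: one forward/backward pass of a tiny model.
import torch


def smoke_test(device: str = "cuda" if torch.cuda.is_available() else "cpu") -> float:
    """Run one training step of a small linear model and return the loss."""
    model = torch.nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    print(f"smoke test loss: {smoke_test():.4f}")
```
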
Responsibilities
  • Operate and manage multiple advanced ML accelerator solutions from Cerebras Systems (Condor Galaxy, or CG)
  • Maximize available compute capacity, providing high uptime at maximum performance for CG deployments
  • Monitor and oversee CG health to ensure stability and security
  • Manage and customize Kubernetes, cluster, and cloud features on CGs
  • Provide solutions to ML users using tools and components from the broad Linux-based ecosystem: compute, storage, and networking.
  • Configure, deploy, and debug container-based services on orchestration platforms like Kubernetes (see the monitoring sketch after this list).
  • Provide 24/7 monitoring and support using automated tools and hands-on manual troubleshooting
  • Support training and inference in the data center: LLMs (50B to 500B parameter models), multi-modal models, Mistral, etc.
  • Adapt and make progress in a fast-paced and constantly evolving environment.
  • Document processes and procedures needed to efficiently operate CGs.
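
As an illustrative aside (not from the posting), the sketch below shows the kind of automated health check implied by the monitoring and Kubernetes responsibilities above, using the official `kubernetes` Python client to flag pods outside a healthy phase. The function name `unhealthy_pods` and the default namespace are assumptions for the example.

```python
# Minimal monitoring sketch: list pods that are not Running or Succeeded,
# the kind of check an automated cluster-health tool might run periodically.
from kubernetes import client, config


def unhealthy_pods(namespace: str = "default") -> list[str]:
    """Return 'name (phase)' strings for pods outside the Running/Succeeded phases."""
    # Uses the local kubeconfig; inside a cluster, config.load_incluster_config() applies.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    flagged = []
    for pod in v1.list_namespaced_pod(namespace).items:
        phase = pod.status.phase
        if phase not in ("Running", "Succeeded"):
            flagged.append(f"{pod.metadata.name} ({phase})")
    return flagged


if __name__ == "__main__":
    for entry in unhealthy_pods():
        print("needs attention:", entry)
```

In practice a check like this would feed an alerting pipeline rather than print to stdout.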

Cerebras Systems provides cutting-edge technology for artificial intelligence work with its CS-2 AI computer, which features the world's largest chip, the Wafer Scale Engine (WSE-2). The technology significantly reduces AI training times and enhances productivity, fostering a culture of rapid innovation. Employees work in a pioneering environment with industry-leading technology for accelerating AI, making the company a strong place for career growth in AI technology.

Company Stage: Series F
Total Funding: $720M
Headquarters: Sunnyvale, California
Founded: 2016

Growth & Insights
Headcount growth: 7% (6 months), 2% (1 year), -10% (2 years)