Full-Time

AI Compute Infrastructure Engineer

Cerebras

201-500 employees

Develops AI accelerators for efficient computing

Data & Analytics
Enterprise Software
AI & Machine Learning

Senior

Locations: Toronto, ON, Canada; Sunnyvale, CA, USA

Category
DevOps & Infrastructure
Platform Engineering
Cloud Engineering
DevOps Engineering
Required Skills
TCP/IP
LLM
Kubernetes
Python
TensorFlow
PyTorch
Docker
Linux/Unix
Requirements
  • BS or MS in CS/EE
  • 5+ years of relevant experience managing compute infrastructure
  • Hands-on technical expertise
  • Proficiency with Python and other common programming languages
  • Demonstrated high impact across a variety of products and roles
  • Experience with orchestration and scheduling platforms such as Kubernetes and SLURM
  • Experience with ML frameworks such as PyTorch, TensorFlow, etc.
  • Strong knowledge of and demonstrated experience with Linux-based compute systems, virtualization, Docker containers, and scheduling/orchestration applications such as SLURM and Kubernetes (a brief scheduling sketch follows this list)
  • Good understanding of cloud infrastructure design, deployment, and maintenance
  • Past experience with cross-functional team projects
  • Past experience working and interacting with high-value customers
  • Proven track record of owning and driving challenges to completion
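As a rough illustration of the SLURM scheduling experience called out above, here is a minimal Python sketch that submits a batch job via sbatch. The job name, partition, time limit, and train.py script are placeholder assumptions for illustration, not details from this role.

    import subprocess

    def submit_training_job(script: str = "train.py", partition: str = "ml") -> str:
        """Submit a batch job via sbatch and return sbatch's confirmation line."""
        cmd = [
            "sbatch",
            "--job-name=llm-train",
            f"--partition={partition}",
            "--nodes=1",
            "--time=04:00:00",
            "--wrap", f"python {script}",
        ]
        # On success, sbatch prints a line such as "Submitted batch job 12345"
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        print(submit_training_job())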
Responsibilities
  • Operate and manage multiple advanced ML accelerator deployments from Cerebras Systems (Condor Galaxy)
  • Maximize the available compute capacity, providing high uptime at maximum performance for the CG deployments
  • Monitor and oversee CG health to ensure stability and security (a monitoring sketch follows this list)
  • Manage and customize Kubernetes, cluster, and cloud features on CGs
  • Provide solutions to ML users using tools and components from the broad Linux-based ecosystem: compute, storage, and networking.
  • Configure, deploy, and debug container-based services on orchestration platforms like Kubernetes.
  • Provide 24/7 monitoring and support using automated tools and hands-on manual troubleshooting
  • Support training and inference workloads in the data center, including LLMs (50B to 500B parameter models), multi-modal models, Mistral, etc.
  • Adapt and make progress in a fast-paced and constantly evolving environment.
  • Document processes and procedures needed to efficiently operate CGs.
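As a rough illustration of the automated health monitoring described above, the sketch below uses the official kubernetes Python client to flag nodes whose Ready condition is not True. Access via a local kubeconfig is assumed, and nothing here is specific to Condor Galaxy deployments.

    from kubernetes import client, config

    def unhealthy_nodes() -> list[str]:
        """Return the names of nodes whose Ready condition is not True."""
        # Assumes a local kubeconfig; use config.load_incluster_config() inside a pod
        config.load_kube_config()
        v1 = client.CoreV1Api()
        bad = []
        for node in v1.list_node().items:
            ready = next(
                (c.status for c in node.status.conditions if c.type == "Ready"),
                "Unknown",
            )
            if ready != "True":
                bad.append(node.metadata.name)
        return bad

    if __name__ == "__main__":
        nodes = unhealthy_nodes()
        print("Unhealthy nodes:", nodes or "none")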
Desired Qualifications
  • Knowledge of technologies such as Ethernet, RoCE, and TCP/IP

Cerebras Systems accelerates artificial intelligence (AI) processes with its CS-2 system, which replaces traditional clusters of graphics processing units (GPUs). This system simplifies AI tasks by removing the complexities of parallel programming and cluster management, allowing for faster results in critical applications like cancer drug response prediction. Cerebras serves clients across various industries, including pharmaceuticals and government research labs, and generates revenue through the sale of its hardware and software solutions. The company's goal is to enhance the speed and efficiency of AI training and inference, reducing costs in AI research and development.

Company Stage: Series F
Total Funding: $700.4M
Headquarters: Sunnyvale, California
Founded: 2016

Growth & Insights

Headcount
  • 6-month growth: 0%
  • 1-year growth: -5%
  • 2-year growth: -10%

Simplify's Take

What believers are saying

  • Growing AI model efficiency demand aligns with Cerebras' energy-efficient accelerators.
  • AI democratization increases need for user-friendly systems like Cerebras' CS-2.
  • Pharmaceutical industry's push for faster drug discovery boosts demand for Cerebras' technology.

What critics are saying

  • Competition from NVIDIA and Graphcore could impact Cerebras' market share.
  • Rapid AI model evolution may necessitate frequent hardware updates, increasing R&D costs.
  • Supply chain vulnerabilities could delay production of Cerebras' hardware.

What makes Cerebras unique

  • Cerebras' Wafer-Scale Engine is the largest chip ever built for AI.
  • The CS-2 system replaces traditional GPU clusters, simplifying AI computations.
  • Cerebras serves diverse industries, including pharmaceuticals and government research labs.

Benefits

  • Professional Development Budget
  • Flexible Work Hours
  • Remote Work Options
  • 401(k) Company Match
  • 401(k) Retirement Plan
  • Mental Health Support
  • Wellness Program
  • Paid Sick Leave
  • Paid Holidays
  • Paid Vacation
  • Parental Leave
  • Family Planning Benefits
  • Fertility Treatment Support
  • Adoption Assistance
  • Childcare Support
  • Elder Care Support
  • Pet Insurance
  • Bereavement Leave
  • Employee Discounts
  • Company Social Events