Full-Time

AI Infrastructure Operations Engineer

Cerebras

201-500 employees

Develops AI acceleration hardware and software

No salary listed

Senior

Locations: Sunnyvale, CA, USA | Toronto, ON, Canada | Bengaluru, Karnataka, India

Candidates can be based in Sunnyvale, CA; Toronto, Canada; or Bangalore, India.

Category
Applied Machine Learning
AI & Machine Learning
Required Skills
Kubernetes
Python
Docker
Linux/Unix
Requirements
  • 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
  • Strong proficiency in Python scripting for automation and system administration.
  • Deep understanding of Linux-based compute systems and command-line tools.
  • Extensive knowledge of Docker containers and of cluster orchestration and workload-management platforms such as Kubernetes and SLURM.
  • Proven ability to troubleshoot and resolve complex technical issues quickly and efficiently.
  • Experience with monitoring and alerting systems.
  • Proven track record of owning and driving challenges to completion.
  • Excellent communication and collaboration skills.
  • Ability to work effectively in a fast-paced environment.
  • Willingness to participate in a 24/7 on-call rotation.
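The Python, Linux, and SLURM/Kubernetes requirements above come together in day-to-day cluster automation. As a minimal, hedged sketch of that kind of work, the snippet below parses SLURM `sinfo -N -h -o "%N %t"`-style output and flags nodes whose state suggests they need attention; the node names and the exact state list are hypothetical examples, not taken from the posting.

```python
# Illustrative sketch: flag unhealthy nodes from SLURM `sinfo -N -h -o "%N %t"`
# output -- the kind of Python automation this role describes.
# Node names and the state list below are hypothetical examples.

UNHEALTHY_STATES = {"down", "drain", "drng", "fail", "maint"}

def flag_unhealthy(sinfo_output: str) -> list[str]:
    """Return node names whose SLURM state suggests they need attention."""
    unhealthy = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        # sinfo suffixes a state with '*' when the node is unreachable
        if state.endswith("*") or state.rstrip("*").lower() in UNHEALTHY_STATES:
            unhealthy.append(node)
    return unhealthy

sample = """\
cs2-node01 idle
cs2-node02 alloc
cs2-node03 down*
cs2-node04 drain
"""
print(flag_unhealthy(sample))  # -> ['cs2-node03', 'cs2-node04']
```

In practice the input would come from running `sinfo` via `subprocess` on the cluster head node rather than a hard-coded sample.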
Responsibilities
  • Manage and operate multiple advanced AI compute infrastructure clusters.
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues.
  • Maximize compute capacity through optimization and efficient resource allocation.
  • Deploy, configure, and debug container-based services using Docker.
  • Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
  • Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
  • Contribute to the development and improvement of our monitoring and support processes.
  • Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.
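The monitoring and proactive-issue-detection duties above usually reduce to threshold-based alerting over per-node metrics. Below is a minimal sketch of that pattern; the metric names, thresholds, and node names are hypothetical and not Cerebras-specific.

```python
# Minimal sketch of threshold-based alerting over cluster metrics, in the
# spirit of the monitoring work described above. Metric names and thresholds
# are hypothetical assumptions, not Cerebras-specific values.

THRESHOLDS = {
    "node_temp_c": 85.0,    # alert above this temperature (Celsius)
    "mem_used_pct": 95.0,   # alert above this memory utilization (%)
}

def evaluate_alerts(metrics: dict[str, dict[str, float]]) -> list[str]:
    """Return alert messages for any node metric exceeding its threshold."""
    alerts = []
    for node, readings in sorted(metrics.items()):
        for name, value in sorted(readings.items()):
            limit = THRESHOLDS.get(name)
            if limit is not None and value > limit:
                alerts.append(f"{node}: {name}={value} exceeds {limit}")
    return alerts

snapshot = {
    "node01": {"node_temp_c": 72.0, "mem_used_pct": 88.0},
    "node02": {"node_temp_c": 91.5, "mem_used_pct": 96.2},
}
print(evaluate_alerts(snapshot))
```

A production setup would feed these checks from a metrics pipeline (e.g. Prometheus) and route alerts to the on-call rotation rather than printing them.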
Desired Qualifications
  • Experience operating large-scale GPU clusters.
  • Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
  • Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
  • Familiarity with machine learning frameworks and tools.
  • Experience with cross-functional team projects.

Cerebras Systems specializes in accelerating artificial intelligence (AI) processes with its CS-2 system, which is designed to replace traditional clusters of graphics processing units (GPUs) used in AI computations. The CS-2 system simplifies the complexities of parallel programming, distributed training, and cluster management, making AI tasks more efficient. Clients from various sectors, including pharmaceuticals, government research labs, healthcare, finance, and energy, benefit from the system's ability to deliver faster results, which is essential for critical applications like cancer drug response predictions. Cerebras generates revenue by selling its proprietary hardware and software solutions, including the CS-2 systems and related cloud services. The company's goal is to provide a comprehensive solution that enables clients to achieve quicker AI training and lower latency in AI inference, ultimately reducing the costs associated with AI research and development.

Company Size

201-500

Company Stage

Series F

Total Funding

$700.4M

Headquarters

Sunnyvale, California

Founded

2016

Simplify's Take

What believers are saying

  • Growing AI model efficiency demand aligns with Cerebras' energy-efficient accelerators.
  • AI democratization increases need for user-friendly systems like Cerebras' CS-2.
  • Pharmaceutical industry's push for faster drug discovery boosts demand for Cerebras' technology.

What critics are saying

  • Competition from NVIDIA and Graphcore could impact Cerebras' market share.
  • Rapid AI model evolution may necessitate frequent hardware updates, increasing R&D costs.
  • Supply chain vulnerabilities could delay production of Cerebras' hardware.

What makes Cerebras unique

  • Cerebras' Wafer-Scale Engine is the largest chip ever built for AI.
  • The CS-2 system replaces traditional GPU clusters, simplifying AI computations.
  • Cerebras serves diverse industries, including pharmaceuticals and government research labs.

Benefits

Professional Development Budget

Flexible Work Hours

Remote Work Options

401(k) Company Match

401(k) Retirement Plan

Mental Health Support

Wellness Program

Paid Sick Leave

Paid Holidays

Paid Vacation

Parental Leave

Family Planning Benefits

Fertility Treatment Support

Adoption Assistance

Childcare Support

Elder Care Support

Pet Insurance

Bereavement Leave

Employee Discounts

Company Social Events