Full-Time

Cluster Deployment Operations Engineer

Cerebras

201-500 employees

Develops AI acceleration hardware and software

Data & Analytics
Enterprise Software
AI & Machine Learning

Senior

Toronto, ON, Canada; Sunnyvale, CA, USA

Category
Applied Machine Learning
AI & Machine Learning
Required Skills
Python
Go
Linux/Unix

Requirements
  • Proficiency in scripting and practical coding, particularly in Shell and Python (Go is a plus).
  • Strong experience troubleshooting, analyzing, and administering large-scale, distributed systems.
  • 5+ years of experience in data center operations and Linux system administration.
  • Knowledge and hands-on experience with network configuration and operations.
  • Expertise in hardware operations including networking components (e.g., cabling, switches, routers).
Responsibilities
  • Plan and execute cluster deployments, from small-scale to massive distributed systems.
  • Manage hands-on aspects of the deployments, coordinating with data center staff for hardware configurations and necessary maintenance.
  • Troubleshoot networking issues (e.g., BGP, cluster creation hurdles, or cabling errors) and hardware issues (e.g., dead-on-arrival (DOA) hardware).
  • Monitor and maintain systems to ensure uptime, performance, and reliability.
  • Collaborate with cross-functional teams including hardware vendors, data center operations, and network engineers to manage the entire lifecycle of deployment.
  • Ensure comprehensive documentation is created and maintained for deployments, configurations, and operational processes.
  • Develop tools, scripts, or playbooks to automate routine tasks and deployment processes.
Desired Qualifications
  • Experience with Kubernetes and the Prometheus monitoring stack.
  • Experience with version control and CI/CD tools (e.g., Git, Jenkins).
  • Familiarity with BGP and other networking protocols, including troubleshooting at Layers 1, 2, and 3.
  • Experience with automation tools for deployments, monitoring, and operational efficiency (such as creating playbooks or automated scripts).

Cerebras Systems accelerates artificial intelligence (AI) processes with its CS-2 system, which replaces traditional clusters of graphics processing units (GPUs) and simplifies AI operations by eliminating complex programming and management. The CS-2 system provides faster results for clients in various industries, including pharmaceuticals and government research labs, enabling quicker AI training and lower latency in AI inference. Cerebras generates revenue through the sale of its hardware and software solutions, distinguishing itself with the largest processor in the industry. The company's goal is to simplify and accelerate AI tasks, reducing costs associated with AI research and development.

Company Size

201-500

Company Stage

Series F

Total Funding

$700.4M

Headquarters

Sunnyvale, California

Founded

2016

Simplify's Take

What believers are saying

  • Growing AI model efficiency demand aligns with Cerebras' energy-efficient accelerators.
  • AI democratization increases need for user-friendly systems like Cerebras' CS-2.
  • Pharmaceutical industry's push for faster drug discovery boosts demand for Cerebras' technology.

What critics are saying

  • Competition from NVIDIA and Graphcore could impact Cerebras' market share.
  • Rapid AI model evolution may necessitate frequent hardware updates, increasing R&D costs.
  • Supply chain vulnerabilities could delay production of Cerebras' hardware.

What makes Cerebras unique

  • Cerebras' Wafer-Scale Engine is the largest chip ever built for AI.
  • The CS-2 system replaces traditional GPU clusters, simplifying AI computations.
  • Cerebras serves diverse industries, including pharmaceuticals and government research labs.

Benefits

Professional Development Budget

Flexible Work Hours

Remote Work Options

401(k) Company Match

401(k) Retirement Plan

Mental Health Support

Wellness Program

Paid Sick Leave

Paid Holidays

Paid Vacation

Parental Leave

Family Planning Benefits

Fertility Treatment Support

Adoption Assistance

Childcare Support

Elder Care Support

Pet Insurance

Bereavement Leave

Employee Discounts

Company Social Events

Growth & Insights

Headcount

6 month growth

-9%

1 year growth

-5%

2 year growth

-9%