HPC Operations Engineer
Posted on 9/18/2023
CoreWeave

201-500 employees

Locations
Hillsboro, OR, USA
Experience Level
Entry
Junior
Mid
Senior
Expert
Desired Skills
Bash
PowerShell
Linux/Unix
Kubernetes
Python
Categories
Software Engineering
Requirements
  • 2 or more years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network or a mix)
  • Strong understanding of Linux system administration and networking concepts
  • Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
  • Experience with software development or scripting languages (Bash, Python, PowerShell, etc.)
  • Experience with Grafana, Prometheus, PromQL queries, or similar observability platforms
  • Familiarity with data center environments, including server racks, HVAC systems, and fiber trays
  • Experience with Kubernetes administration
  • Be Curious at your Core
  • Act like an Owner
  • Empower Employees
  • Deliver Best-in-Class Client Experience
  • Achieve More Together
Responsibilities
  • Install, configure, and maintain large-scale high-performance supercomputing clusters running state-of-the-art GPUs
  • Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network and platform teams to drive resolution
  • Monitor and analyze system performance and take appropriate remediation actions to maintain cloud health
  • Approach your work with flexibility and optimism, anticipating shifting business and technical priorities
  • Create and maintain documentation of team processes, knowledge and best practices for system management
  • Think critically about your day-to-day work and work collaboratively to improve team processes and efficiency