Full-Time

HPC Engineer

Machine Learning Infrastructure, US Remote

Posted on 4/6/2024

Hugging Face

201-500 employees

AI collaboration platform with advanced tools

AI & Machine Learning

Senior, Expert

Remote in USA

Category
DevOps & Infrastructure
Site Reliability Engineering
Cloud Engineering
DevOps Engineering
Required Skills
Rust
Python
Git
Data Structures & Algorithms
AWS
Go
Development Operations (DevOps)
Linux/Unix
Google Cloud Platform
Requirements
  • 7+ years of experience in a DevOps or infrastructure engineering role, building machine learning infrastructure and working with large GPU clusters
  • Knowledge of cloud providers such as AWS and GCP, infrastructure-as-code frameworks, and observability tools
  • Familiarity with the Python scientific stack and PyTorch
  • Experience with data structures, data modeling, and database management as well as object and file storage systems
  • Strong communication, collaboration, and documentation skills
  • Experience with Linux, Git, containers, networking, and command-line tools
  • Strong programming skills in Python, Golang, and/or Rust
Responsibilities
  • Design, develop, deploy, and maintain reliable and scalable infrastructure that enables efficient training workloads
  • Manage large compute clusters for AI training and development
  • Create tooling and infrastructure that abstract compute and storage in ML workflows
  • Measure and optimize system performance
  • Monitor and troubleshoot infrastructure issues, ensuring high availability and performance of AI workloads
  • Stay up to date with the latest advancements in AI infrastructure technologies and recommend improvements to enhance system efficiency and performance
  • Work closely with AI software engineering teams to ensure infrastructure can handle all system requirements
  • Provide primary operational support and engineering for multiple teams

Hugging Face is a platform for the machine learning community to collaborate on models, datasets, and applications, offering state-of-the-art technologies such as Transformers for PyTorch, TensorFlow, and JAX, Diffusers for image and audio generation, and Tokenizers optimized for research and production. The company also provides paid Compute and Enterprise solutions for deploying on optimized Inference Endpoints.

Company Stage

Series D

Total Funding

$395.2M

Headquarters

Paris, France

Founded

2016

Growth & Insights
Headcount

6 month growth

23%

1 year growth

41%

2 year growth

130%

Benefits

Flexible Work Environment

Health Insurance

Unlimited PTO

Equity

Growth, Training, & Conferences

Generous Parental Leave

INACTIVE