Full-Time

HPC Engineer

Machine Learning Infrastructure, US Remote

Hugging Face

201-500 employees

AI collaboration platform with state-of-the-art technologies

AI & Machine Learning

Senior

Remote in USA

Required Skills
Python
Git
Data Structures & Algorithms
PyTorch
AWS
Go
Development Operations (DevOps)
Linux/Unix
Google Cloud Platform
Requirements
  • 7+ years of experience in a DevOps or infrastructure engineer role building machine learning infrastructure and working with large GPU clusters
  • Knowledge of cloud providers such as AWS and GCP, infrastructure-as-code frameworks, and observability tools
  • Familiarity with the Python scientific stack and PyTorch
  • Experience with data structures, data modeling, and database management as well as object and file storage systems
  • Strong communication, collaboration, and documentation skills
  • Experience with Linux, Git, containers, networking, and command-line tools
  • Strong programming skills in Python, Golang, and/or Rust
Responsibilities
  • Design, develop, deploy, and maintain reliable and scalable infrastructure that enables efficient training workloads
  • Manage large compute clusters for AI training and development
  • Create tooling and infrastructure that abstract compute and storage in ML workflows
  • Measure and optimize system performance
  • Monitor and troubleshoot infrastructure issues, ensuring high availability and performance of AI workloads
  • Stay up to date with the latest advancements in AI infrastructure technologies and recommend improvements to enhance system efficiency and performance
  • Work closely with AI software engineering teams to ensure infrastructure can handle all system requirements
  • Provide primary operational support and engineering for multiple teams

Hugging Face is a leader in collaboration platforms for the machine learning community, known for libraries such as Transformers and Diffusers. Its community-focused environment and tools for machine learning applications make it a strong choice for professionals who want to advance their AI skills and contribute to meaningful projects. Its Compute and Enterprise offerings also give team members optimized tools for both research and production environments.

Company Stage

Series D

Total Funding

$395.2M

Headquarters

Paris, France

Founded

2016

Growth & Insights
Headcount

6 month growth

26%

1 year growth

58%

2 year growth

159%

Benefits

Flexible Work Environment

Health Insurance

Unlimited PTO

Equity

Growth, Training, & Conferences

Generous Parental Leave