Full-Time
HPC Engineer
Machine Learning Infrastructure, US Remote
AI collaboration platform with state-of-the-art technologies
Senior
Remote in USA
Requirements
- 7+ years of experience in a DevOps or infrastructure engineering role, building machine learning infrastructure and working with large GPU clusters
- Knowledge of cloud providers such as AWS and GCP, infrastructure-as-code frameworks, and observability tools
- Familiarity with the Python scientific stack and PyTorch
- Experience with data structures, data modeling, and database management, as well as object and file storage systems
- Strong communication, collaboration, and documentation skills
- Experience with Linux, Git, containers, networking, and command-line tools
- Strong programming skills in Python, Golang, and/or Rust
Responsibilities
- Design, develop, deploy, and maintain reliable, scalable infrastructure for efficient training workloads
- Manage large compute clusters for AI training and development
- Build tooling and infrastructure that abstract compute and storage in ML workflows
- Measure and optimize system performance
- Monitor and troubleshoot infrastructure issues to keep AI workloads highly available and performant
- Stay current with advances in AI infrastructure and recommend improvements to system efficiency and performance
- Work closely with AI software engineering teams to ensure the infrastructure meets their system requirements
- Provide primary operational support and engineering for multiple teams
Hugging Face builds collaboration tools for the machine learning community and develops open-source libraries such as Transformers and Diffusers. Its community-focused culture and machine learning tooling make it a good fit for professionals who want to grow their AI skills and contribute to meaningful projects. The company also offers Compute and Enterprise products, so team members work with tools used in both research and production.
Company Stage
Series D
Total Funding
$395.2M
Headquarters
Paris, France
Founded
2016
6-month growth
↑ 26%
1-year growth
↑ 58%
2-year growth
↑ 159%
Benefits
Flexible Work Environment
Health Insurance
Unlimited PTO
Equity
Growth, Training, & Conferences
Generous Parental Leave