Our mission
Genmo makes it easy for anyone to create movies, as if it were magic. Using our web application, any user can create cinematic video using a simple text prompt.
We imagine a world where high-quality cinematic video content is as plentiful as water. Our mission is to empower the next billion video creators to tell their stories.
The Role
Our in-house training supercomputer is central to this mission, enabling researchers to train large, distributed machine learning models efficiently. We are seeking a contractor to maintain and enhance this supercomputer, ensuring its high performance, availability, and scalability.
Responsibilities
Cluster Maintenance and Support: Provide ongoing, on-call technical support to resolve issues, ensuring minimal downtime. This includes performing maintenance tasks such as draining impacted nodes, rebooting problematic nodes, and configuring the RDMA network for optimal performance.
Incident Response: Monitor and report GPU node failures, responding swiftly to minimize impact.
Access Control Management: Manage user access to the cluster, adding and removing users as necessary and maintaining the security of access points.
System and Software Updates: Update system settings and software to maintain security and efficiency. Proactively schedule and manage node reboots to optimize performance and stability. Improve the performance and stability of the GPU container solution.
Monitoring: Set up and manage monitoring solutions (e.g., New Relic, Datadog, Prometheus) and active GPU monitoring tools (e.g., NVIDIA DCGM).
Qualifications
Proven experience managing and supporting HPC infrastructure, especially in a GPU-intensive environment.
Strong familiarity with Linux distributions, container technologies (Singularity, Docker, Kubernetes), and host management technologies (Ansible).
Experience with HPC job schedulers (Slurm, LSF) and monitoring tools (Prometheus, NVIDIA DCGM).
Knowledge of configuring and optimizing RDMA networks and NVMe-backed storage solutions for high-performance computing.
Effective problem-solving skills, with the ability to manage incidents and maintenance tasks efficiently.
Excellent communication skills, with the capability to provide on-call support and respond to urgent issues.
Bonus Points
Experience with HPC.
Prior involvement in setting up and managing containerized environments, specifically with Enroot or Singularity.
Contribution to open-source HPC or container technology projects.
Background in deploying and optimizing large-scale, distributed machine learning training environments.
Pay Range: $55-85/hr
This contract position is pivotal to maintaining the backbone of our AI model training capabilities. If you are passionate about high-performance computing and want to contribute to the cutting edge of AI research and development, we would love to hear from you.
Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.