Full-Time

HPC Infrastructure Engineer

AHEAD

1,001-5,000 employees

Digital transformation and cloud solutions provider

Consulting
Enterprise Software

Compensation Overview

$135k - $165k Annually

+ Target Bonus

Senior, Expert

Remote in USA

Category
DevOps & Infrastructure
Cloud Engineering
Required Skills
Bash
Kubernetes
Python
Grafana
Machine Learning
Docker
Prometheus
Terraform
Ansible
Customer Service
Linux/Unix

Requirements
  • Bachelor’s degree or equivalent in Information Systems or a related field. Unique education, specialized experience, skills, knowledge, training, or certification may be substituted for education
  • 5+ years of expert-level experience managing infrastructure in high-performance computing environments, including configuration, troubleshooting, and best practices
  • Strong understanding of Kubernetes architecture, components, and networking
  • Linux engineer with experience in RedHat, Ubuntu, and Rocky distributions
  • Experience with deploying and managing Kubernetes clusters in production environments, including those with GPU acceleration
  • Experience with HPC workloads, schedulers (e.g., SLURM, PBS, Torque), and applications, particularly in the context of AI/ML and deep learning
  • Experience with containerization technologies (e.g., Docker, Singularity)
  • Experience with Infrastructure-as-Code (IaC) tools (e.g., Terraform, Ansible)
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana) and experience integrating with Elastic Observability
  • Strong scripting skills (e.g., Bash, Python)
  • Excellent problem-solving and troubleshooting skills
  • Experience configuring, maintaining and troubleshooting Kubernetes
  • Experience with storage technology (e.g., Ceph, Vast Data Platform) and distributed file systems (e.g., Lustre, GPFS, NFS, GlusterFS)
  • Managed Services or consulting experience is required
  • Strong background with customer service
  • High-level problem-solving and communication skills
  • Strong oral and written communication skills
Responsibilities
  • Provide enterprise-level operational support to Managed Services customers for incident, problem, and change management activities
  • Design, deploy, and manage Kubernetes clusters optimized for HPC workloads, with a focus on integrating and managing NVIDIA DGX systems
  • Optimize cluster performance, resource utilization, and cost-effectiveness, specifically addressing the unique requirements of DGX systems
  • Implement monitoring, logging, and alerting solutions for HPC Linux clusters, Kubernetes, and DGX infrastructure
  • Ensure the security of the Kubernetes infrastructure and HPC workloads, including the protection of sensitive data processed by DGX systems
  • Troubleshoot and resolve issues related to Kubernetes, DGX systems, HPC applications, and infrastructure
  • Stay up to date on the latest technologies and trends in Kubernetes, HPC, and NVIDIA DGX systems, including new hardware and software releases
  • Work across technical teams to troubleshoot complex infrastructure issues
  • Create and maintain detailed documentation
  • Serve as a subject matter expert and escalation point for HPC technologies
  • Work with vendors to resolve infrastructure issues
  • Communicate transparently with customers and internal teams
  • Participate in on-call rotation
  • Complete training and certification as assigned to further skills and knowledge
Desired Qualifications
  • Hands-on experience with deploying, managing, and optimizing NVIDIA DGX systems preferred
  • Experience configuring, maintaining, and troubleshooting NVIDIA/Mellanox (Cumulus OS) switches a plus
  • Experience with both Ethernet and InfiniBand networking a plus
  • 1+ years working with an enterprise ITSM system; ServiceNow is a bonus
  • Related certifications are a bonus

AHEAD specializes in digital transformation services, focusing on helping medium to large enterprises modernize their IT infrastructure. The company offers a range of services including cloud migration, automation, and infrastructure optimization, primarily utilizing Microsoft Azure. AHEAD's products work by providing tailored consulting and managed services that guide clients through the complexities of digital transformation. What sets AHEAD apart from its competitors is its deep expertise in cloud solutions and a strong client-centric approach, which has been validated by achieving Gold Cloud Platform Competency with Azure. The company's goal is to empower organizations to take control of their digital transformation journeys and achieve sustainable success.

Company Size

1,001-5,000

Company Stage

Acquired

Total Funding

N/A

Headquarters

Chicago, Illinois

Founded

2007

Simplify Jobs

Simplify's Take

What believers are saying

  • Increased demand for hybrid cloud solutions aligns with AHEAD's expertise.
  • Growing interest in AI-driven IT operations presents new opportunities for AHEAD.
  • The rise of edge computing offers AHEAD opportunities in edge-to-cloud solutions.

What critics are saying

  • Emerging cloud service providers offer similar services at lower costs.
  • Rapid technological advancements may outpace AHEAD's solution updates.
  • Economic downturns could reduce spending on digital transformation projects.

What makes AHEAD unique

  • AHEAD specializes in digital transformation with a focus on cloud solutions.
  • The company achieved Gold Cloud Platform Competency with Microsoft Azure.
  • AHEAD offers tailored solutions for cloud migration and infrastructure optimization.

Benefits

Health Insurance

401(k) Retirement Plan

Paid Vacation

Paid Sick Leave
