Full-Time

Infrastructure Management Engineer

Posted on 9/1/2025

TensorWave

51-200 employees

Scalable AI cloud computing for PyTorch

No salary listed

Las Vegas, NV, USA

In Person

Category
DevOps & Infrastructure
Required Skills
Bash
PHP
Kubernetes
Python
Grafana
Perl
Go
Prometheus
Linux/Unix
Helm
Requirements
  • Proven experience managing enterprise-grade hardware at scale
  • Strong understanding of out-of-band management systems (IPMI/BMC/Redfish)
  • Hands-on expertise with monitoring systems (Prometheus, Grafana, SNMP, Nagios, CheckMK, or similar)
  • Solid knowledge of network administration, including firewalls, routing, VPNs, NAT, and managed switches
  • Linux system administration experience (installation, configuration, troubleshooting)
  • Experience with filesystems, RAID, partitioning, and general storage management
  • Familiarity with certificate management, key-based authentication, and basic cryptographic functions
  • Experience with bare metal provisioning (MAAS, Foreman, or similar)
  • Understanding of PXE/UEFI/HTTP boot systems
  • Ability to write functional, maintainable bash scripts for automation
Responsibilities
  • Manage and maintain enterprise-grade server hardware and infrastructure components
  • Utilize out-of-band management systems (iLO, iDRAC, IPMI, Redfish, etc.) for remote operations
  • Use automated hardware management tools (BMC/Redfish-based) to streamline provisioning and maintenance
  • Perform hardware diagnostics and troubleshooting (CPU, memory, disks, PSUs, NICs, etc.)
  • Handle vendor interactions, including RMAs, part replacements, and inventory tracking
  • Oversee datacenter hardware operations, including racking, cabling, PDU installation, and physical layout
  • Use Data Center Infrastructure Management (DCIM) tools for inventory, capacity planning, and environmental tracking
  • Manage power delivery and consumption across racks and nodes
  • Configure and monitor managed PDU systems for power cycling, monitoring, and alerts
  • Collaborate with colocation providers on connectivity, power, security, and maintenance tasks
  • Build and maintain infrastructure monitoring and alerting using tools such as Prometheus/Grafana, SNMP, Nagios, CheckMK, or similar platforms
  • Implement automated alerting for hardware health, network status, power issues, and service-level metrics
  • Create dashboards to give internal teams visibility into system performance and reliability
  • Manage and configure firewalls, routing, and network segmentation
  • Configure and troubleshoot VPN technologies (IPsec, OpenVPN, WireGuard)
  • Oversee subnetting, IP address allocation, and network architecture planning
  • Configure managed switches, VLANs, port settings, and trunking
  • Manage NAT, port forwarding, and related gateway/edge network configurations
  • Install, configure, and manage Linux servers (Ubuntu/Debian preferred)
  • Perform system-level troubleshooting (boot issues, login problems, service failures)
  • Manage networking configuration (static IPs, DHCP)
  • Configure and maintain filesystems: partitioning, MD RAID, ext4/XFS, LVM, resizing/growing volumes
  • Implement secure access using public key authentication and proper SSH hardening
  • Manage certificates for internal systems, including issuance, revocation, HTTPS installation, and rotation
  • Handle basic BIOS configuration relevant to bare metal provisioning or system bring-up
  • Deploy and manage hardware provisioning tools such as MAAS, Foreman, or similar systems
  • Configure and troubleshoot network boot mechanisms (PXE, UEFI Boot, HTTP Boot)
  • Automate provisioning pipelines to rapidly bring new nodes online
  • Work with Kubernetes clusters at a foundational level (cluster access, basic resource troubleshooting)
  • Deploy workloads using Helm charts and maintain cluster application lifecycle
  • Assist with cluster scaling, node replacements, and security hardening
  • Write shell scripts (bash) for automation of system tasks, monitoring, or provisioning
  • Use CLI tooling such as jq, sed, awk, grep, and rsync
  • Optionally automate workflows using languages like Python, Go, PHP, or Perl
Desired Qualifications
  • Experience with Kubernetes beyond the basics (operators, cluster scaling, CRDs)
  • Experience with Helm chart customization
  • Familiarity with automation languages such as Python, Go, PHP, or Perl
  • Previous datacenter operations or colocation management experience
  • Exposure to high-availability or distributed compute environments
  • Knowledge of infrastructure security and hardening practices

TensorWave provides cloud-based AI computing resources for large-scale workloads, with a focus on PyTorch. It runs on AMD MI300X hardware with a scalable fabric, offering pools of VRAM exceeding 1 petabyte accessible via subscription or pay-as-you-go. It differentiates itself by using AMD-based hardware and a large VRAM pool to offer cost-efficient, scalable compute without customers owning hardware. The goal is to become a leading provider of scalable, high-performance AI cloud infrastructure that enables enterprises and research institutions to train, fine-tune, and deploy large models at predictable, competitive costs.

Company Size

51-200

Company Stage

Series A

Total Funding

$143M

Headquarters

Las Vegas, Nevada

Founded

2023

Simplify's Take

What believers are saying

  • Hits $100M ARR in 16 months since December 2023 launch.
  • Secures $100M Series A from Magnetar, AMD Ventures, Prosperity7.
  • Expands 20MW with TECfusions in Tucson, Keystone by H1 2026.

What critics are saying

  • CoreWeave dominates 70% of the market with superior NVIDIA H100 clusters.
  • Lambda Labs offers 40% cheaper AMD MI300X hybrid options.
  • AMD MI400 delays to Q4 2026 leave the MI325X obsolete against the NVIDIA B200.

What makes TensorWave unique

  • TensorWave auctions GPUs eBay-style for long-term contracts.
  • Deploys the largest AMD MI325X cluster in North America, with 8,192 GPUs.
  • Partners with Credo for ZeroFlap cables, boosting cluster reliability 1,000x.

Benefits

Stock Options

Health Insurance

Dental Insurance

Vision Insurance

Life Insurance

Disability Insurance

Health Savings Account/Flexible Spending Account

Unlimited Paid Time Off

Flexible Work Hours

Remote Work Options

Paid Vacation

Paid Sick Leave

Paid Holidays

Sabbatical Leave

Hybrid Work Options

401(k) Retirement Plan

401(k) Company Match

Parental Leave

Mental Health Support

Wellness Program

Growth & Insights and Company News

Headcount

6 month growth

-1%

1 year growth

0%

2 year growth

20%
Business Wire
Feb 25th, 2026
TensorWave deploys Credo's ZeroFlap cables and optics for AMD-based AI clusters

TensorWave, an AMD-exclusive AI cloud provider, has partnered with Credo Technology to deploy Credo's ZeroFlap Active Electrical Cables and optical transceivers across its next-generation AI cluster infrastructure. The collaboration aims to deliver faster deployment times and higher cluster reliability for AI workloads. Credo's ZeroFlap technology offers a mean time between failures of 100 million hours, and Credo claims reliability 1,000 times better than legacy interconnect solutions. The system integrates with Credo's PILOT telemetry platform for real-time monitoring and fault isolation. TensorWave, which has raised over $166 million from investors including Magnetar, AMD Ventures and Nexus Venture Partners, operates one of the world's largest all-AMD GPU clouds. The partnership supports the company's mission to provide production-grade AI infrastructure for enterprise customers and AI labs.

AMD
Aug 27th, 2025
Your AI Journey, Accelerated: How Enterprise Teams Are Scaling AI From Concept to Impact

At the AMD Advancing AI 2025 event, global innovators including Dell Technologies, Supermicro, Vultr, TensorWave, and AWS shared how they're helping organizations move through each critical phase of AI adoption with open, flexible solutions using AMD technology.

WAYA Media
May 15th, 2025
Aramco's Prosperity7 Backs US AI Startup TensorWave in USD 100M Round

Prosperity7, Aramco's VC arm, leads USD 100M Series A funding in US-based AI infrastructure startup TensorWave.

Grit Daily
May 14th, 2025
TensorWave Raises $100M to Build the Future of AI Infrastructure

In a big swing that signals just how hot the AI infrastructure race has become, TensorWave announced a $100 million Series A funding round co-led by Magnetar and AMD Ventures, with participation from Maverick Silicon, Nexus Venture Partners, and new investor Prosperity7.

FinSMEs
May 14th, 2025
TensorWave Raises $100M in Series A Funding

TensorWave, a Las Vegas, NV-based company developing AMD-powered AI infrastructure solutions, raised $100M in Series A funding.

INACTIVE