Full-Time

Senior Software Engineer

Distributed Training

Posted on 6/20/2025

Clockwork Systems

1-10 employees

Advanced clock synchronization for distributed systems

Compensation Overview

$150k - $230k/yr

+ Stock options + Equity awards

Palo Alto, CA, USA

In Person

Category
Software Engineering
Required Skills
Bash
Kubernetes
Python
PyTorch
Requirements
  • Deep experience with PyTorch and torch.distributed (c10d)
  • Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
  • Proficiency in Python and Linux shell scripting
  • Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
  • Strong understanding of NCCL, collective communication, and GPU topology
  • Familiarity with debugging tools and techniques for distributed systems
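
The debugging skills above often come down to knowing which logging knobs to turn before a hang is reproduced. As a hedged illustration (the script name `train.py` and the rendezvous host `head-node` are placeholders, not part of this posting), a two-node launch with verbose NCCL and c10d diagnostics enabled might look like:

```shell
# Illustrative 2-node, 8-GPU-per-node launch with verbose distributed logging.
# train.py and head-node:29500 are placeholders for your script and rendezvous endpoint.
export NCCL_DEBUG=INFO                 # log NCCL ring/tree setup, topology, and errors
export NCCL_DEBUG_SUBSYS=INIT,COLL     # restrict NCCL logs to init and collective calls
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra c10d consistency checks (slower; debug only)

torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29500 \
  train.py
```

With these flags set, mismatched collectives and stuck ranks tend to surface in the logs instead of manifesting as silent NCCL hangs.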
Responsibilities
  • Develop and support distributed PyTorch training jobs using torch.distributed / c10d
  • Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
  • Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
  • Optimize performance across communication, I/O, and memory bottlenecks
  • Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
  • Write tooling and scripts to streamline training workflows and experiment management
  • Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
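
The fault-tolerance responsibility above usually starts with checkpoints that cannot be torn by a crash, plus resume logic that finds the newest valid one. A minimal, framework-agnostic sketch (JSON state and the `ckpt_NNNNNNNN` naming are illustrative choices, not a Clockwork convention; real training state would be a model/optimizer snapshot):

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, ckpt_dir: str, step: int) -> str:
    """Write a checkpoint atomically: write to a temp file in the same
    directory, then os.replace() it into place, so a crash mid-write
    never leaves a truncated checkpoint behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"ckpt_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    finally:
        if os.path.exists(tmp_path):  # only reached if the write failed
            os.remove(tmp_path)
    return final_path


def latest_checkpoint(ckpt_dir: str):
    """Return the highest-step checkpoint path, or None if none exist.
    Zero-padded step numbers make lexicographic sort == numeric sort."""
    if not os.path.isdir(ckpt_dir):
        return None
    ckpts = sorted(p for p in os.listdir(ckpt_dir)
                   if p.startswith("ckpt_") and p.endswith(".json"))
    return os.path.join(ckpt_dir, ckpts[-1]) if ckpts else None


def resume_or_init(ckpt_dir: str) -> dict:
    """Resume from the newest checkpoint if present, else start fresh."""
    path = latest_checkpoint(ckpt_dir)
    if path is None:
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)
```

The same write-then-rename pattern underlies most long-running-job recovery schemes: restart logic only ever sees fully written checkpoints, so a job killed mid-save simply resumes from the previous one.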
Desired Qualifications
  • Experience scaling LLM training across 8+ GPUs and multiple nodes
  • Knowledge of tensor, pipeline, and data parallelism
  • Familiarity with containerized training environments (Docker, Singularity)
  • Exposure to HPC environments or cloud GPU infrastructure
  • Experience with training workload orchestration tools or custom job launchers
  • Comfort with large-scale checkpointing, resume/restart logic, and model I/O
  • Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent
  • Experience with performance tuning in distributed training environments
  • Contributions to ML infrastructure open-source projects
  • Familiarity with storage, networking, or RDMA/GPU Direct technologies
  • Understanding of observability in ML pipelines (metrics, logs, dashboards)
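
Of the parallelism styles listed above, data parallelism is the simplest to state precisely: each worker computes gradients on its shard of the batch, an all-reduce averages them, and every worker applies the same update. The toy single-process sketch below (a 1-D linear model with squared error; all function names are illustrative, and the all-reduce is simulated in-process rather than using any real communication library) shows why this is numerically equivalent to full-batch SGD when shards are equal-sized:

```python
def local_gradient(w, batch):
    """Gradient of 0.5 * (w*x - y)^2 for a 1-D linear model,
    averaged over this worker's shard of the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    m = sum(values) / len(values)
    return [m] * len(values)


def data_parallel_step(w, batch, world_size, lr=0.1):
    """One data-parallel SGD step: shard the batch, compute per-worker
    gradients, average them, and apply the identical update everywhere."""
    shard = len(batch) // world_size
    shards = [batch[i * shard:(i + 1) * shard] for i in range(world_size)]
    grads = [local_gradient(w, s) for s in shards]  # one grad per "worker"
    g = all_reduce_mean(grads)[0]                   # all ranks see the same g
    return w - lr * g
```

Because the mean of equal-sized shard gradients equals the full-batch gradient, the data-parallel step matches single-worker large-batch SGD exactly; tensor and pipeline parallelism instead split the model itself and require more intricate communication patterns.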

Clockwork Systems provides clock synchronization technology for mission-critical distributed systems, ensuring precise timing across operations. Its solutions run in both cloud and on-premises environments and are delivered through software licensing, subscriptions, and professional services. The company differentiates itself with deep timing expertise and end-to-end timing visibility across networks, exemplified by tools like Latency Sensei for cloud latency monitoring. Its goal is to help customers achieve reliable, accurate synchronization that boosts performance and reduces timing-related issues in time-sensitive applications.

Company Size

1-10

Company Stage

Early VC

Total Funding

$41.6M

Headquarters

Palo Alto, California

Founded

2018

Simplify's Take

What believers are saying

  • Raised a $20.6M Series A led by NEA to launch FleetIQ.
  • Serves hyperscalers, banks, and pharma labs optimizing model training.
  • Hired NetApp exec Suresh Vasudevan as CEO in 2025.

What critics are saying

  • NVIDIA's software stack bundles GPU management, which could block market penetration within 12-24 months.
  • Hyperscalers like Meta and Google may build proprietary fabrics within 18-36 months.
  • Customer concentration creates a risk of 30-50% revenue loss if a single cloud customer exits.

What makes Clockwork Systems unique

  • FleetIQ delivers microsecond visibility into GPU clusters for AI workloads.
  • Software-driven fabric runs on NVIDIA, AMD, InfiniBand, RoCE, Ethernet.
  • Stanford spinout founded 2018 extends clock sync to AI training.


Benefits

Competitive Salary

Company News

The Associated Press
Mar 11th, 2026
Clockwork.io launches TorchPass to eliminate GPU failure waste in AI training, saving $6M per 2,048-GPU cluster

Clockwork.io has launched TorchPass Workload Fault Tolerance, a software solution that eliminates costly GPU training failures through Live GPU Migration technology. The system allows AI training workloads to continue running through hardware failures, network disruptions and node crashes without requiring checkpoint restarts. The company claims TorchPass can save over $6 million annually in a typical 2,048-GPU deployment by reducing wasted training progress by 95%. In large clusters, it cuts lost time from approximately three hours per day to under ten minutes. Independent testing by SemiAnalysis found TorchPass delivered faster fault-tolerant performance than standard checkpoint-restart approaches and higher Model FLOPs Utilisation than leading open-source alternatives. The solution typically completes recovery in approximately three minutes whilst training continues uninterrupted. TorchPass is now available as part of Clockwork.io's FleetIQ platform.

The SaaS News
Sep 11th, 2025
Clockwork Raises $20.57 Million in Funding

TechStartups.com
Sep 10th, 2025
Clockwork Systems Raises $20.6M for FleetIQ

Stanford spinout Clockwork has raised $20.6 million, led by NEA with participation from notable investors, to address AI's GPU inefficiency. The funding coincides with the launch of FleetIQ, a software solution aimed at enhancing GPU performance by improving communication between GPUs, clusters, and clouds. This innovation seeks to reduce crashes, shorten restarts, and increase utilization rates, making AI infrastructure more efficient and sustainable.

INACTIVE