Full-Time

LLM Training Dataset and Checkpoint Optimization Engineer

Together AI

51-200 employees

Decentralized cloud services for AI development

Enterprise Software
AI & Machine Learning

Compensation Overview

$160k - $230k Annually

+ Equity + Benefits

Senior

San Francisco, CA, USA

Category
Applied Machine Learning
AI & Machine Learning
Required Skills
Python
TensorFlow
PyTorch
Go
C/C++
Data Analysis

Requirements
  • 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
  • Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow tf.data, DALI).
  • Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).
  • Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).
  • Proficient in Python, C++, or Go for performance-critical systems.
  • Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
  • Familiarity with compression and serialization for large datasets and checkpoints.
  • Analytical and problem-solving mindset.
  • Strong communication and collaboration skills across teams.
Responsibilities
  • Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
  • Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
  • Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).
  • Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
  • Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
  • Develop incremental and differential checkpointing solutions to reduce storage costs.
  • Profile and debug bottlenecks in data pipelines and checkpoint systems.
  • Optimize for GPU/TPU utilization by ensuring efficient data feeding and checkpoint recovery times.
  • Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
  • Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
  • Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
  • Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.
Desired Qualifications
  • Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
  • Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
  • Contributions to, or knowledge of, open-source projects related to data pipelines or checkpointing.
  • Experience with incremental and real-time checkpointing solutions.

Together AI focuses on enhancing artificial intelligence through open-source contributions. The company offers decentralized cloud services that allow developers and researchers to train, fine-tune, and deploy generative AI models. Their platform is designed to support a wide range of clients, from small startups to large enterprises and academic institutions, by providing cloud-based solutions that simplify the development and deployment of AI models. Unlike many competitors, Together AI emphasizes open and transparent AI systems, which fosters innovation and aims to achieve beneficial outcomes for society. The company's goal is to empower users with the tools they need to advance AI technology while maintaining a commitment to openness.

Company Size

51-200

Company Stage

Series B

Total Funding

$519M

Headquarters

Menlo Park, California

Founded

2022

Simplify Jobs

Simplify's Take

What believers are saying

  • Raised $305M in Series B funding, boosting its AI Acceleration Cloud expansion.
  • Growing demand for open-source AI models enhances Together AI's market position.
  • Partnerships with major firms like Salesforce and Zoom strengthen its enterprise solutions.

What critics are saying

  • Emerging models like Flux 1.1 Pro could challenge Together AI's offerings.
  • Market volatility from DeepSeek R1's release may impact Together AI's valuation.
  • Over-reliance on NVIDIA GPUs could lead to supply chain vulnerabilities.

What makes Together AI unique

  • Together AI focuses on open-source contributions, setting it apart in the AI industry.
  • The company offers decentralized cloud services, reducing dependency on centralized providers.
  • Together AI supports over 450,000 developers, showcasing its extensive reach and influence.

Benefits

Health Insurance

Company Equity

Growth & Insights and Company News

Headcount

6 month growth

-4%

1 year growth

0%

2 year growth

0%

36Kr
Feb 21st, 2025
Together AI raises $305M, valued $3.3B

Together AI, a U.S. AI cloud service provider, announced a $305 million Series B funding round, valuing the company at $3.3 billion. The investment, led by General Catalyst and Prosperity7 Ventures, will accelerate the development of AI applications based on open-source models like DeepSeek-R1. Together AI offers over 200 model API services and GPU rentals, with annual revenue exceeding $100 million. The company is expanding its infrastructure, deploying NVIDIA GPU clusters across North America.

VentureBeat
Feb 21st, 2025
Together AI's $305M Bet: Reasoning Models Like DeepSeek-R1 Are Increasing, Not Decreasing, GPU Demand

When DeepSeek-R1 first emerged, the prevailing fear that shook the industry was that advanced reasoning could be achieved with less infrastructure. As it turns out, that's not necessarily the case. At least, according to Together AI, the rise of DeepSeek and open-source reasoning has had the exact opposite effect: Instead of reducing the need for infrastructure, it is increasing it. That increased demand has helped fuel the growth of Together AI's platform and business. Today the company announced a $305 million Series B round of funding, led by General Catalyst and co-led by Prosperity7. Together AI first emerged in 2023 with an aim to simplify enterprise use of open-source large language models (LLMs).

VC News Daily
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and co-led by Prosperity7. Notable investors include Salesforce Ventures, DAMAC Capital, NVIDIA, and others. The funding will support Together AI's mission to enhance AI model training and deployment with a focus on performance, control, and cost-efficiency.

PR Newswire
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and Prosperity7, valuing the company at $3.3 billion. The investment will enhance its AI Acceleration Cloud, focusing on open source models and NVIDIA Blackwell GPU deployment. Together AI supports over 450,000 developers and partners with major firms like Salesforce and Zoom. The platform offers enterprise-grade AI solutions with advanced infrastructure and research innovations for improved efficiency and cost-effectiveness.

Maginative
Feb 20th, 2025
Together AI Raises $305M for Expansion

Together AI has raised $305 million in a Series B funding round, increasing its valuation to $3.3 billion. The round was led by General Catalyst and co-led by Prosperity7, with participation from NVIDIA, Salesforce Ventures, Kleiner Perkins, and Coatue. The funds will be used to expand its AI Acceleration Cloud by deploying NVIDIA Blackwell GPUs. Together AI supports over 200 open-source AI models and has surpassed 450,000 developers and $100 million in annualized revenue.