Full-Time

LLM Training Frameworks and Optimization Engineer

Together AI

51-200 employees

Decentralized cloud services for AI development

Enterprise Software
AI & Machine Learning

Compensation Overview

$160k - $230k Annually

+ Equity + Benefits

Senior

San Francisco, CA, USA

Category
  • Deep Learning
  • AI & Machine Learning
Required Skills
  • Python
  • TensorFlow
  • CUDA
  • PyTorch
  • C/C++

Requirements
  • 5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.
  • Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).
  • Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).
  • Familiarity with GPU/TPU hardware and deep learning performance optimizations.
  • Proficient in Python and C++ or CUDA for high-performance computing.
  • Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).
  • Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.
  • Analytical problem-solving skills and a focus on performance improvement.
  • Strong collaboration and communication skills across teams.
Responsibilities
  • Design, implement, and optimize distributed training frameworks tailored for large language models.
  • Develop custom modules, plugins, and features to enhance framework scalability and performance.
  • Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
  • Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.
  • Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
  • Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.
  • Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
  • Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.
  • Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
  • Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.
Desired Qualifications
  • Familiarity with graph optimization and compiler-level performance tuning.
  • Contributions to open-source deep learning or distributed training projects.
  • Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).

Together AI focuses on advancing artificial intelligence through open-source contributions. The company offers decentralized cloud services that let developers and researchers train, fine-tune, and deploy generative AI models. Its platform serves clients ranging from small startups to large enterprises and academic institutions, providing cloud-based solutions that simplify the development and deployment of AI models. Unlike many competitors, Together AI emphasizes open and transparent AI systems, with the aim of fostering innovation and achieving beneficial outcomes for society. The company's goal is to give users the tools they need to advance AI technology while maintaining a commitment to openness.

Company Size: 51-200
Company Stage: Series B
Total Funding: $519M
Headquarters: Menlo Park, California
Founded: 2022

Simplify Jobs

Simplify's Take

What believers are saying

  • Raised $305M in Series B funding, boosting its AI Acceleration Cloud expansion.
  • Growing demand for open-source AI models enhances Together AI's market position.
  • Partnerships with major firms like Salesforce and Zoom strengthen its enterprise solutions.

What critics are saying

  • Emerging models like Flux 1.1 Pro could challenge Together AI's offerings.
  • Market volatility from DeepSeek R1's release may impact Together AI's valuation.
  • Over-reliance on NVIDIA GPUs could lead to supply chain vulnerabilities.

What makes Together AI unique

  • Together AI focuses on open-source contributions, setting it apart in the AI industry.
  • The company offers decentralized cloud services, reducing dependency on centralized providers.
  • Together AI supports over 450,000 developers, showcasing its extensive reach and influence.

Benefits

Health Insurance

Company Equity

Growth & Insights and Company News

Headcount

6 month growth: -4%
1 year growth: 0%
2 year growth: 0%

36Kr
Feb 21st, 2025
Together AI raises $305M, valued $3.3B

Together AI, a U.S. AI cloud service provider, announced a $305 million Series B funding round, valuing the company at $3.3 billion. The investment, led by General Catalyst and Prosperity7 Ventures, will accelerate the development of AI applications based on open-source models like DeepSeek-R1. Together AI offers over 200 model API services and GPU rentals, with annual revenue exceeding $100 million. The company is expanding its infrastructure, deploying NVIDIA GPU clusters across North America.

VentureBeat
Feb 21st, 2025
Together AI’s $305M Bet: Reasoning Models Like DeepSeek-R1 Are Increasing, Not Decreasing, GPU Demand

When DeepSeek-R1 first emerged, the prevailing fear that shook the industry was that advanced reasoning could be achieved with less infrastructure. As it turns out, that’s not necessarily the case. At least, according to Together AI, the rise of DeepSeek and open-source reasoning has had the exact opposite effect: Instead of reducing the need for infrastructure, it is increasing it. That increased demand has helped fuel the growth of Together AI’s platform and business. Today the company announced a $305 million series B round of funding, led by General Catalyst and co-led by Prosperity7. Together AI first emerged in 2023 with an aim to simplify enterprise use of open-source large language models (LLMs).

VC News Daily
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and co-led by Prosperity7. Notable investors include Salesforce Ventures, DAMAC Capital, NVIDIA, and others. The funding will support Together AI's mission to enhance AI model training and deployment with a focus on performance, control, and cost-efficiency.

PR Newswire
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and Prosperity7, valuing the company at $3.3 billion. The investment will enhance its AI Acceleration Cloud, focusing on open source models and NVIDIA Blackwell GPU deployment. Together AI supports over 450,000 developers and partners with major firms like Salesforce and Zoom. The platform offers enterprise-grade AI solutions with advanced infrastructure and research innovations for improved efficiency and cost-effectiveness.

Maginative
Feb 20th, 2025
Together AI Raises $305M for Expansion

Together AI has raised $305 million in a Series B funding round, increasing its valuation to $3.3 billion. The round was led by General Catalyst and co-led by Prosperity7, with participation from NVIDIA, Salesforce Ventures, Kleiner Perkins, and Coatue. The funds will be used to expand its AI Acceleration Cloud by deploying NVIDIA Blackwell GPUs. Together AI supports over 200 open-source AI models and has surpassed 450,000 developers and $100 million in annualized revenue.