Full-Time

GPU Cluster Resource Scheduling and Optimization Engineer

Together AI

51-200 employees

Decentralized cloud services for AI development

Enterprise Software
AI & Machine Learning

Compensation Overview

$160k - $230k Annually

+ Equity + Benefits

Senior

San Francisco, CA, USA

Category
Applied Machine Learning
AI Research
AI & Machine Learning
Required Skills
Kubernetes
Microsoft Azure
Python
Machine Learning
AWS
Go
C/C++
Google Cloud Platform

Requirements
  • 5+ years of experience in resource scheduling, distributed systems, or large-scale machine learning infrastructure
  • Proficiency in distributed computing frameworks (e.g., Kubernetes, Slurm, Ray)
  • Expertise in designing and implementing resource allocation algorithms and scheduling frameworks
  • Hands-on experience with cloud platforms (e.g., AWS, GCP, Azure) and GPU orchestration
  • Proficient in Python, C++, or Go for building high-performance systems
  • Strong understanding of operational research techniques, such as linear programming, graph algorithms, or evolutionary strategies
  • Analytical mindset with a focus on problem-solving and performance tuning
  • Excellent collaboration and communication skills across teams
Responsibilities
  • Develop and implement intelligent scheduling algorithms tailored for distributed AI workloads in multi-cluster and multi-tenant environments
  • Ensure efficient allocation of GPUs, TPUs, and CPUs across diverse workloads, balancing resource utilization and job performance
  • Design optimization techniques for dynamic resource allocation, addressing real-time variations in workload demand
  • Implement load balancing, job preemption, and task placement strategies to maximize throughput and minimize latency
  • Build systems that efficiently scale to thousands of nodes and petabytes of data
  • Optimize training and inference pipelines to reduce runtime and cost while maintaining accuracy and reliability
  • Build tools for real-time monitoring and diagnostics of resource utilization, job scheduling efficiency, and bottlenecks
  • Leverage telemetry data and machine learning models for predictive analytics and proactive optimization
  • Collaborate with researchers, data scientists, and platform engineers to understand workload requirements and align resource management solutions
  • Stay updated with the latest trends in distributed systems, AI model training, and cloud-native technologies
Desired Qualifications
  • Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, JAX)
  • Familiarity with AI-specific workloads like DDP, sharded training, or reinforcement learning
  • Knowledge of auto-scaling and cost-optimization strategies in cloud environments
  • Contributions to open-source scheduling or orchestration projects

Together AI focuses on enhancing artificial intelligence through open-source contributions. The company offers decentralized cloud services that allow developers and researchers to train, fine-tune, and deploy generative AI models. Their platform is designed to support a wide range of clients, from small startups to large enterprises and academic institutions, by providing cloud-based solutions that simplify the development and deployment of AI models. Unlike many competitors, Together AI emphasizes open and transparent AI systems, which fosters innovation and aims to achieve beneficial outcomes for society. The company's goal is to empower users with the tools they need to advance AI technology while maintaining a commitment to openness.

Company Size

51-200

Company Stage

Series B

Total Funding

$519M

Headquarters

Menlo Park, California

Founded

2022

Simplify Jobs

Simplify's Take

What believers are saying

  • Raised $305M in Series B funding, boosting its AI Acceleration Cloud expansion.
  • Growing demand for open-source AI models enhances Together AI's market position.
  • Partnerships with major firms like Salesforce and Zoom strengthen its enterprise solutions.

What critics are saying

  • Emerging models like Flux 1.1 Pro could challenge Together AI's offerings.
  • Market volatility from DeepSeek R1's release may impact Together AI's valuation.
  • Over-reliance on NVIDIA GPUs could lead to supply chain vulnerabilities.

What makes Together AI unique

  • Together AI focuses on open-source contributions, setting it apart in the AI industry.
  • The company offers decentralized cloud services, reducing dependency on centralized providers.
  • Together AI supports over 450,000 developers, showcasing its extensive reach and influence.

Benefits

Health Insurance

Company Equity

Growth & Insights and Company News

Headcount

6 month growth

-4%

1 year growth

0%

2 year growth

0%

36Kr
Feb 21st, 2025
Together AI raises $305M, valued $3.3B

Together AI, a U.S. AI cloud service provider, announced a $305 million Series B funding round, valuing the company at $3.3 billion. The investment, led by General Catalyst and Prosperity7 Ventures, will accelerate the development of AI applications based on open-source models like DeepSeek-R1. Together AI offers over 200 model API services and GPU rentals, with annual revenue exceeding $100 million. The company is expanding its infrastructure, deploying NVIDIA GPU clusters across North America.

VentureBeat
Feb 21st, 2025
Together AI’s $305M Bet: Reasoning Models Like DeepSeek-R1 Are Increasing, Not Decreasing, GPU Demand

When DeepSeek-R1 first emerged, the prevailing fear that shook the industry was that advanced reasoning could be achieved with less infrastructure. As it turns out, that’s not necessarily the case. At least, according to Together AI, the rise of DeepSeek and open-source reasoning has had the exact opposite effect: Instead of reducing the need for infrastructure, it is increasing it. That increased demand has helped fuel the growth of Together AI’s platform and business. Today the company announced a $305 million series B round of funding, led by General Catalyst and co-led by Prosperity7. Together AI first emerged in 2023 with an aim to simplify enterprise use of open-source large language models (LLMs)

VC News Daily
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and co-led by Prosperity7. Notable investors include Salesforce Ventures, DAMAC Capital, NVIDIA, and others. The funding will support Together AI's mission to enhance AI model training and deployment with a focus on performance, control, and cost-efficiency.

PR Newswire
Feb 20th, 2025
Together AI Secures $305M Series B Funding

Together AI announced a $305 million Series B funding round led by General Catalyst and Prosperity7, valuing the company at $3.3 billion. The investment will enhance its AI Acceleration Cloud, focusing on open source models and NVIDIA Blackwell GPU deployment. Together AI supports over 450,000 developers and partners with major firms like Salesforce and Zoom. The platform offers enterprise-grade AI solutions with advanced infrastructure and research innovations for improved efficiency and cost-effectiveness.

Maginative
Feb 20th, 2025
Together AI Raises $305M for Expansion

Together AI has raised $305 million in a Series B funding round, increasing its valuation to $3.3 billion. The round was led by General Catalyst and co-led by Prosperity7, with participation from NVIDIA, Salesforce Ventures, Kleiner Perkins, and Coatue. The funds will be used to expand its AI Acceleration Cloud by deploying NVIDIA Blackwell GPUs. Together AI supports over 200 open-source AI models and has surpassed 450,000 developers and $100 million in annualized revenue.