Full-Time

Machine Learning Operations Engineer

MLOps

Posted on 10/8/2024

Together AI

201-500 employees

Open-source AI research via decentralized cloud

Compensation Overview

$160k - $240k/yr

+ Equity + Benefits

San Francisco, CA, USA

In Person

Category
AI & Machine Learning
Required Skills
LLM
Scikit-learn
Kubernetes
Microsoft Azure
Python
TensorFlow
PyTorch
Docker
AWS
Go
DevOps
Google Cloud Platform
Requirements
  • 5+ years of experience working on a production-level ML training or inference system.
  • Bachelor’s degree in computer science or equivalent industry experience.
  • Strong understanding of the state of the art in machine learning, especially LLMs.
  • Experience with DevOps practices like CI/CD, automation, containerization (Docker), and orchestration (Kubernetes).
  • Proficiency in cloud platforms like AWS, Google Cloud, or Azure.
  • Expertise in programming (Python, Go, etc.) and frameworks for ML (TensorFlow, PyTorch, Scikit-learn).
Responsibilities
  • Work closely with engineering, research, and sales on deploying, evaluating, and operating inference systems for both customers and internal use.
  • Build and maintain tools, services, and documentation for automation and testing.
  • Analyze and improve efficiency, scalability, and stability of various system resources.
  • Conduct design and code reviews.
  • Participate in an on-call rotation to respond to critical incidents as needed.

Together AI provides open-source AI tools and decentralized cloud services to train, fine-tune, and deploy generative models for researchers, developers, and organizations. It operates a cloud platform where users run training jobs, manage model versions, and deploy applications, monetized through subscriptions and usage fees. It differentiates itself by prioritizing open source, transparency, and a decentralized cloud approach instead of a proprietary stack. Its goal is to broaden access to powerful AI and build open, verifiable AI systems that benefit society through shared technology.

Company Size

201-500

Company Stage

Series B

Total Funding

$533.5M

Headquarters

Menlo Park, California

Founded

2022

Simplify Jobs

Simplify's Take

What believers are saying

  • $1B deal drives 10x YoY revenue growth past $100M ARR.
  • NVIDIA GTC 2026 partnerships enable Nemotron 3 Super day-0 access.
  • Cursor leverages Blackwell for real-time coding inference scaling.

What critics are saying

  • NVIDIA OpenShell erodes inference edge within 6-12 months.
  • Cursor in-house optimizations trigger client exodus in 12-18 months.
  • Alibaba Wan 2.7 on Atlas commoditizes pricing in 3-6 months.

What makes Together AI unique

  • Together AI deploys Alibaba's Wan 2.7 video suite at $0.10 per second.
  • Mamba-3 SSM outperforms transformers at 16K decode speeds openly.
  • Voice platform achieves sub-700ms latency via co-located Deepgram-Cartesia.

Benefits

Health Insurance

Company Equity

Growth & Insights and Company News

Headcount

6 month growth

-3%

1 year growth

-5%

2 year growth

8%
TEI
Apr 3rd, 2026
Together AI launches Wan 2.7 video suite at $0.10 per second.

Together AI has rolled out Alibaba's Wan 2.7 video generation model on its cloud platform, pricing the text-to-video capability at $0.10 per second of generated footage. The deployment marks the first major cloud availability for the four-model suite that Alibaba released in late March. The text-to-video model, accessible via the endpoint Wan-AI/wan2.7-t2v, supports 720p and 1080p resolution with outputs ranging from 2 to 15 seconds. Audio input can drive generation, and multi-shot narrative control works directly through prompt language - a meaningful upgrade over basic prompt-to-video systems that force creators into fragmented workflows.

What's actually shipping. Right now, only text-to-video is live. Together AI says image-to-video and reference-to-video capabilities are "coming soon," with video editing tools to follow. The image-to-video model will support first-frame, first-and-last-frame, and continuation generation - useful for storyboarding workflows. A 3x3 grid-to-video feature targets teams building structured content from static assets. Reference-to-video gets more interesting for production work: it will accept both reference images and reference videos as inputs, handling multi-character interactions and complex scene composition at up to 1080p for 10-second clips.

The editing play. Video Edit, the fourth model in the suite, addresses what is arguably the biggest pain point in AI video: the inability to revise without starting from scratch. Together AI's implementation will support instruction-based editing via text, reference image-based modifications, style transfer, and temporal feature cloning - motion, camera work, and effects lifted from source media. For creative teams, keeping these capabilities within one API surface eliminates the handoff chaos that currently plagues AI video production, where most workflows involve generating in one tool, editing in another, and manually patching the results.

Competitive positioning. The $0.10 per second pricing puts Together AI within striking distance of competitors, though direct comparisons depend heavily on resolution and duration parameters. Wan 2.7 has drawn attention since its March release - reviews have called it potentially the strongest AI video model of 2026, though some skepticism about the hype remains. Alibaba built Wan 2.7 within its Qwen ecosystem, and earlier versions (2.1 and 2.2) were open-sourced. Whether 2.7 follows that path hasn't been confirmed, but the model is now accessible through multiple cloud providers, including Atlas Cloud and WaveSpeedAI alongside Together AI.

Integration details. For developers already on Together AI's platform, adding video generation requires no new authentication or billing setup; the same SDKs work across text, image, and video inference. The company offers serverless endpoints for development, with volume pricing available for production workloads. Teams evaluating the technology can test directly in Together AI's playground before committing to API integration. Full documentation covers parameters including audio inputs, resolution control, and the polling loop required for asynchronous video generation jobs.
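Asynchronous video jobs like these are typically submitted, then polled until they finish. A minimal, provider-agnostic sketch of such a polling loop follows; the status values ("queued", "running", "completed", "failed") and the injected fetch callable are assumptions for illustration, not Together AI's documented job schema.

```python
import time

def poll_until_done(fetch_status, interval_s=2.0, timeout_s=600.0):
    """Poll an asynchronous job until it completes or fails.

    fetch_status: a callable (e.g. wrapping an HTTP GET on the job's
    status endpoint) that returns a dict with at least a "status" key.
    Returns the final job dict on completion; raises on failure/timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job  # caller reads the output URL from the job dict
        if status == "failed":
            raise RuntimeError(f"video job failed: {job.get('error')}")
        time.sleep(interval_s)  # back off between polls
    raise TimeoutError("video job did not finish within timeout")
```

Injecting the fetch callable keeps the loop testable without network access and reusable across any async endpoint that exposes a job-status resource.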

CoinsNews
Mar 17th, 2026
Mamba-3 SSM drops with inference-first design beating transformers at decode.

Together.ai releases Mamba-3, an open-source state space model built for inference that outperforms Mamba-2 and matches Transformer decode speeds at 16K sequences.
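The decode-speed claim rests on a general property of state space models: generation is a fixed-size recurrence, so per-token cost stays constant regardless of sequence length, whereas transformer attention grows with context. A toy scalar sketch of that recurrence (Mamba-3's actual selective, multi-dimensional parameterization is far more elaborate; A, B, C here are illustrative constants):

```python
def ssm_decode_step(h, x, A, B, C):
    """One decode step of a scalar linear state space model:
    h_t = A*h_{t-1} + B*x_t,  y_t = C*h_t.
    The step touches only the fixed-size state h, never the full history,
    which is why SSM decode speed stays flat at long contexts."""
    h_new = A * h + B * x
    return h_new, C * h_new

def ssm_decode(xs, A=0.9, B=0.5, C=1.0, h0=0.0):
    """Decode a whole input sequence, returning one output per step."""
    h, ys = h0, []
    for x in xs:
        h, y = ssm_decode_step(h, x, A, B, C)
        ys.append(y)
    return ys
```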

Together AI
Mar 16th, 2026
Together AI at NVIDIA GTC 2026: explore our latest innovations across research and products.

Join Together AI from March 16-19 in San Jose as the team showcases its latest research breakthroughs and new platform capabilities across open-source LLMs, voice AI, production-scale inference, and AI factories. This year, Together AI is part of NVIDIA GTC with multiple major announcements and conversations shaping the AI ecosystem - from cutting-edge model releases to new voice AI capabilities to technical sessions with its research and engineering leaders.

Key announcements. At GTC 2026, the announcements Together AI is participating in highlight a core theme: AI systems are becoming more open, agentic, and production-ready. Together AI, the AI Native Cloud, is designed to support this shift - helping developers train, shape, and deploy large-scale AI systems with the performance and cost-efficiency required for real-world applications.

Use NVIDIA Dynamo 1.0 in Together AI. NVIDIA has launched NVIDIA Dynamo 1.0, open-source software for generative and agentic inference at scale. Together AI has already been using Dynamo as part of its inference stack to deliver more optimized performance in production use cases, and, committed to open innovation, it looks forward to exploring further use cases for Dynamo 1.0.

Connect to Together's high-performance inference through NVIDIA OpenShell. Together AI and NVIDIA are collaborating on NVIDIA NemoClaw - an open-source stack that simplifies running OpenClaw always-on assistants more safely, with a single command. As part of the NVIDIA Agent Toolkit, it installs the NVIDIA OpenShell runtime, a secure environment for running autonomous agents and open-source models like NVIDIA Nemotron. Together AI hosts the NVIDIA OpenShell runtime for customers who want high-performance models for building agents: its library of over 150 optimized models can now be accessed via NemoClaw, and paired with Together's dedicated endpoints, developers get the speed and cost efficiency of its inference engine at production scale.

Leverage NVIDIA Nemotron 3 Super for multi-agent workflows. NVIDIA Nemotron 3 Super is a hybrid mixture-of-experts model designed for high-performance reasoning and multi-agent workflows. It combines a Mamba-Transformer architecture with a 1M-token context window to support long-horizon reasoning and complex agent interactions. With 120B total parameters (12B active per token), the model is optimized to run multiple collaborating agents efficiently - even on a single GPU - making it well suited for AI-native workflows like software development agents, financial analysis, and cybersecurity automation. Nemotron 3 Super can be deployed through Together AI's Dedicated Model Inference, providing developers with a simple and scalable way to run advanced reasoning models in production.

Build voice agents with NVIDIA Parakeet TDT 0.6B V3. As part of Together AI's recent voice solutions launch, the NVIDIA Parakeet TDT 0.6B V3 automatic speech recognition (ASR) model is now available in the Together AI Model Library, giving developers access to high-performance, low-latency transcription optimized for real-time voice applications. By combining Parakeet's ASR accuracy with Together's high-performance inference infrastructure, teams can build production-ready voice agents that deliver fast, reliable, and scalable transcription.

Together sessions. The Together AI team, along with customers like Cursor and Decagon, will share insights across multiple GTC sessions, covering topics from production inference to open AI research. Sessions include:
  • Engineering Real-World LLM Inference: Bridging Open-Source and Production Systems - March 17, 2:00 PM - Yineng Zhang, Principal AI Researcher, Together AI
  • Hard-Won Lessons From Production Inference at Scale - March 17, 4:00 PM - Yuchen Wu, Engineer, Cursor; Ce Zhang, CTO, Together AI
  • Build Trust and Discovery Through Open-Source AI in Research - March 18, 2:00 PM - Percy Liang, Co-Founder, Together AI
  • Under the Hood of Building and Scaling AI-Native Applications - March 18, 4:00 PM - Alan Yiu, VP of Product, Decagon; Charles Zedlewski, Chief Product Officer, Together AI

Visit Together AI at booth #1213. Beyond sessions, the Together team will host booth activations and side events throughout the week, including curated executive meetups focused on next-generation AI infrastructure and AI-native applications. Stop by to:
  • See live demos of Together AI infrastructure and models
  • Learn how teams are scaling production inference and agentic systems
  • Meet researchers and engineers building the future of open AI models and infrastructure

Asia Token Fund
Mar 13th, 2026
Together AI Launches Voice Agent Platform With Sub-700ms Latency

Together AI rolled out a unified voice agent platform that keeps speech-to-text, language models, and text-to-speech processing on the same infrastructure cluster. The $3.3 billion AI cloud startup claims the setup delivers end-to-end latency under 700 milliseconds - fast enough for natural conversation flow. The platform integrates natively with Deepgram for transcription and Cartesia for voice synthesis, both running on Together's co-located servers rather than bouncing audio across multiple cloud providers.

Why co-location matters for voice. Most production voice systems stitch together separate vendors for each pipeline stage: audio hits one provider for transcription, routes to another for the LLM response, then bounces to a third for speech synthesis. Each handoff adds network latency and failure points. Together's pitch is to keep everything in the same datacenter. The company reports sub-500ms latency in optimal conditions, though the 700ms figure represents its stated ceiling for end-to-end processing. "Voice agents live or die by latency, and every network hop between providers is a place where the experience breaks down," said Abe Pursell, Deepgram's VP of Partnerships.

Model flexibility without the patchwork. The platform supports Whisper Large v3, Minimax Speech 2.6 Turbo, Rime Arcana, and Kokoro alongside Together's full LLM catalog. Developers can swap components without rebuilding integrations - useful for teams testing different voice characteristics or transcription accuracy for specific use cases. Cartesia brings its Sonic-3 and Sonic-2 TTS models to the platform; Deepgram contributes Nova-3 and Nova-3 Multilingual for transcription, Flux for conversational STT, and Aura-2 for synthesis. Unlike opaque speech-to-speech systems, Together's modular approach preserves access to intermediate transcripts and response text, so teams can inspect, modify, and route data mid-stream - a requirement for many enterprise compliance workflows.

Enterprise requirements and production use. The platform targets regulated industries with zero-data-retention options, SOC 2 Type II certification, HIPAA compliance, and dedicated data residency. Decagon, which runs customer support voice agents handling billing inquiries and technical troubleshooting, already operates on the stack. Together AI raised $305 million in February 2025 at a $3.3 billion valuation, with reports suggesting the company is now in talks to raise at $7.5 billion. The company has surpassed 450,000 developers and crossed $100 million in annualized revenue. The voice platform launch represents Together's expansion beyond its core LLM inference business into the growing voice AI market, where latency and reliability remain persistent pain points for production deployments.
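The modular STT -> LLM -> TTS design described above can be sketched as pluggable stages with per-stage latency measurement, which is also how the "inspect and route data mid-stream" property falls out naturally. The stage callables here are hypothetical stand-ins, not Together AI's actual API.

```python
import time

def run_voice_turn(stt, llm, tts, audio_in):
    """Run one voice-agent turn through pluggable STT -> LLM -> TTS stages.

    Each stage is any callable (real SDK client or stub). Returns the
    synthesized audio plus the intermediate transcript and response text
    (kept inspectable, per the modular design) and per-stage latencies in ms.
    """
    timings = {}
    t0 = time.perf_counter()
    transcript = stt(audio_in)            # speech-to-text
    timings["stt_ms"] = (time.perf_counter() - t0) * 1000
    t1 = time.perf_counter()
    response_text = llm(transcript)       # LLM response generation
    timings["llm_ms"] = (time.perf_counter() - t1) * 1000
    t2 = time.perf_counter()
    audio_out = tts(response_text)        # text-to-speech
    timings["tts_ms"] = (time.perf_counter() - t2) * 1000
    timings["total_ms"] = (time.perf_counter() - t0) * 1000
    return audio_out, transcript, response_text, timings
```

When all three stages run in the same datacenter, `total_ms` is dominated by compute rather than the cross-provider network hops the article describes.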

Together AI
Mar 11th, 2026
Together AI brings NVIDIA Nemotron 3 to developers on day 0.

Together AI brings NVIDIA Nemotron 3 to developers on day 0. Together Computer Inc is excited to bring NVIDIA Nemotron 3 Super to Together AI, the AI Native Cloud. Built for multi-agent orchestration and complex reasoning, Nemotron 3 Super is a 120B-parameter (12B active) hybrid model that combines Transformer and Mamba architectures. Running Nemotron 3 Super on Together AI Dedicated Inference allows engineering teams to deploy this open-weights model on managed infrastructure designed for high-throughput inference workloads. Architectural capabilities for agentic workflows. Modern agentic systems that analyze massive document stores or orchestrate multi-step planning require models that can maintain state across long contexts without sacrificing generation speed. Nemotron 3 Super introduces several architectural innovations that make it well suited for these workloads: * Hybrid MoE Architecture (Transformer + Mamba): By combining Mamba's efficient sequence processing with Transformer attention, the model maintains strong reasoning capability while keeping active parameters (12B out of 120B) manageable for faster inference. Its Latent MoE design enables the model to call four experts for the inference cost of one, improving efficiency for reasoning-heavy workloads. * 1M-Token Context Window: The 1-million-token context length allows applications to process entire codebases, maintain state across long agent trajectories, and inject significantly larger retrieval payloads directly into prompts. * Multi-Token Prediction: Nemotron 3 Super is trained to generate several tokens simultaneously in a single forward pass. For applications that produce large outputs such as code generation or structured responses, this drastically reduces generation latency, delivering over 50% higher token generation speeds compared to current leading open models. 
To achieve leading accuracy across benchmarks like AIME 2025 and SWE Bench verified, the model was trained using multi-environment reinforcement learning (RL) and NVIDIA-generated high-quality synthetic data. Because NVIDIA provides the model with open weights, datasets, and development recipes, engineering teams maintain full control to customize and fine-tune the model for their specific environments. Running Nemotron 3 Super on Together AI. Serving a 120B-parameter hybrid model with a 1M-token context window typically requires distributed compute across multiple nodes. Nemotron 3 Super is available through Together AI Dedicated Inference, offering an infrastructure environment tailored for both experimentation and production scale without the overhead of GPU provisioning: * Single-GPU Deployment: The model is optimized to run collaborating agents on a single GPU footprint, supporting deployment on single NVIDIA H200 or H100 GPUs. Together AI handles the underlying infrastructure orchestration, allowing teams to deploy these workloads without provisioning or managing GPUs directly. * Research-Optimized Performance: Running hybrid MoE architectures efficiently requires highly tuned serving software. Together AI accelerates model execution through the Together Inference Engine and custom CUDA kernels. This stack helps teams achieve lower latency and higher throughput during live inference. * Production-Grade Isolation: Dedicated Inference isolates workloads on reserved hardware to support predictable throughput and consistent performance at scale. The platform operates on enterprise-ready infrastructure, including a 99.9% uptime SLA and SOC 2 compliance. Get started. Run large-context reasoning workloads, deploy multi-agent systems, and scale production reference without managing GPU infrastructure. Faq. What is NVIDIA Nemotron 3 Super? 
NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) reasoning model designed for complex AI workflows and multi-step problem solving. It combines Transformer and Mamba components to deliver strong reasoning capability with efficient inference. What architecture does Nemotron 3 Super use? Nemotron 3 Super uses a hybrid Mixture-of-Experts architecture that combines Transformer attention with Mamba sequence processing. This design improves compute efficiency while maintaining strong reasoning performance. What context length does Nemotron 3 Super support? Nemotron 3 Super supports context windows of up to 1 million tokens, enabling applications to analyze large document collections, maintain long conversations, and incorporate extensive retrieval context into reasoning workflows. What types of applications can use Nemotron 3 Super? Nemotron 3 Super is well suited for applications that coordinate multiple agents or operate across large knowledge sources. Examples include developer assistants that analyze and refactor codebases, enterprise systems that process large document collections, cybersecurity workflows that triage vulnerabilities or analyze system logs, and orchestration systems that route tasks across specialized agents based on user intent. How do developers run Nemotron 3 Super on Together AI? Nemotron 3 Super is deployed on Together AI through Dedicated Model Inference. Dedicated deployments allow teams to run models on reserved infrastructure designed for production workloads with predictable performance. Do developers need to manage GPUs? No. Together AI manages the underlying infrastructure, allowing developers to deploy and scale AI workloads without provisioning GPU resources directly. Why use Together AI for these workloads? Together AI provides infrastructure designed for large-scale AI systems, including reliable inference, serverless scaling, and managed infrastructure for modern AI applications.
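For developers, a dedicated deployment is typically consumed through the same OpenAI-compatible chat-completions interface Together AI exposes elsewhere on its platform. The sketch below only assembles such a request payload; the model slug and helper are illustrative assumptions (check the Together AI model library for real identifiers), and sending it requires an Authorization: Bearer header with your API key.

```python
import json

# Hypothetical model slug, for illustration only.
MODEL = "nvidia/nemotron-3-super"

def build_chat_request(model, user_message, max_tokens=512, temperature=0.7):
    """Assemble an OpenAI-compatible chat-completions payload of the kind
    Together AI's inference API accepts, ready to be serialized as the
    JSON body of a POST to the chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request(
    MODEL, "Summarize the tradeoffs of hybrid Mamba-Transformer models."
)
body = json.dumps(payload)  # hand this to the HTTP client of your choice
```

Keeping payload construction separate from transport makes the request shape unit-testable and lets the same code target serverless or dedicated endpoints.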

INACTIVE