Full-Time

Senior Software Engineer

Model Performance

Inference

1-10 employees

Serverless AI model inference across distributed compute.

Compensation Overview

$220k - $320k/yr

+ Equity

San Francisco, CA, USA

Hybrid

Hybrid role; 4 days/week in SF office (Bay Area candidates).

Category
Software Engineering
Requirements
  • 2+ years of experience in ML systems, inference optimization, or GPU programming
  • Strong proficiency in Python and familiarity with C++
  • Hands-on experience with LLM inference frameworks (vLLM, SGLang, TensorRT-LLM, or similar)
  • Deep understanding of GPU architecture and experience profiling GPU workloads
  • Familiarity with LLM optimization techniques (quantization, speculative decoding, continuous batching, KV cache management)
  • Experience with PyTorch and understanding of how models execute on hardware
  • Track record of measurably improving system performance
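For a flavor of the KV cache management mentioned above: cache sizing is largely a back-of-envelope memory-budgeting exercise. The sketch below uses the public Llama-3.1-8B architecture shape (32 layers, 8 grouped-query KV heads, head dimension 128); the batch size and sequence length are illustrative assumptions.

```python
# Back-of-envelope KV-cache size for a Llama-3.1-8B-like model.
# Architecture figures are the published ones; batch/seq are assumptions.
layers = 32          # transformer blocks
kv_heads = 8         # grouped-query-attention KV heads
head_dim = 128
bytes_per_elem = 2   # fp16/bf16

def kv_cache_bytes(batch, seq_len):
    # 2 tensors (K and V) per layer, per cached token
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

gib = kv_cache_bytes(batch=32, seq_len=8192) / 2**30
print(f"{gib:.1f} GiB")  # → 32.0 GiB
```

At 32 concurrent 8K-token sequences the cache alone fills a 40 GB GPU's working headroom, which is why paging, quantized caches, and eviction policies matter in practice.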
Responsibilities
  • Implement and productionize optimization techniques including quantization, speculative decoding, KV cache optimization, continuous batching, and LoRA serving
  • Deep dive into inference frameworks (vLLM, SGLang, TensorRT-LLM) and underlying libraries to debug and improve performance
  • Profile and optimize CUDA kernels and GPU utilization across our serving infrastructure
  • Add support for new model architectures, ensuring they meet our performance standards before going to production
  • Experiment with novel inference techniques and bring successful approaches into production
  • Build tooling and benchmarks to measure and track inference performance across our fleet
  • Collaborate with applied ML engineers to ensure trained models can be served efficiently
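To illustrate one of the techniques named above, here is a toy sketch of greedy speculative decoding. The two "models" are hypothetical stand-ins, not any framework's API: a cheap draft model proposes k tokens, the target model verifies them (a single batched pass in a real system), and we accept the longest agreeing prefix plus the target's correction token.

```python
# Toy greedy speculative decoding over integer "tokens" (hypothetical models).

def draft_model(ctx):
    # Cheap draft model: next token is last + 1 (mod 10).
    return (ctx[-1] + 1) % 10

def target_model(ctx):
    # Expensive target model: same rule, except it emits 0 after a 5.
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        proposal.append(t)
        tmp.append(t)
    # 2) Verify: accept while the target agrees; on first disagreement,
    #    substitute the target's token and stop.
    accepted, tmp = [], list(ctx)
    for t in proposal:
        expect = target_model(tmp)
        if expect == t:
            accepted.append(t)
            tmp.append(t)
        else:
            accepted.append(expect)  # target's correction token
            break
    else:
        accepted.append(target_model(tmp))  # bonus token when all k match
    return accepted

print(speculative_step([1], k=4))  # → [2, 3, 4, 5, 0]
```

The payoff is that one expensive verification pass can emit up to k+1 tokens, trading cheap draft compute for target-model latency.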
Desired Qualifications
  • Experience with CUDA programming
  • Familiarity with serving non-LLM models (TTS, vision, embeddings)
  • Experience with distributed inference and multi-GPU serving
  • Contributions to open-source inference frameworks
  • Experience with Docker and Kubernetes
You don't need to tick every box; curiosity and the ability to learn quickly matter more.

Inference.net provides a distributed, serverless platform that lets developers run open-source AI models without managing infrastructure. It operates a global network of compute providers and leverages underutilized data center capacity to offer cost-effective LLM inference via a simple API, supporting models like Llama 3.1 8B. Customers are charged based on compute usage, providing a scalable foundation for building AI-enabled applications. Unlike traditional clouds, it emphasizes serverless access to high-quality models with cloud-like reliability at a lower cost. The goal is to democratize access to AI technology by removing infrastructure complexity and cost barriers for developers and companies.
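A "simple API" of this kind is typically OpenAI-compatible; the sketch below builds such a request. The base URL and model id are illustrative assumptions, not verified values from Inference.net's documentation.

```python
# Sketch of a serverless-inference request payload (OpenAI-style schema).
# BASE_URL and the model id are assumptions for illustration only.
import json

BASE_URL = "https://api.inference.net/v1"  # assumed endpoint

def chat_request(model, prompt):
    # Build an OpenAI-style chat-completions request; with usage-based
    # billing there are no servers to provision or manage.
    return {
        "url": f"{BASE_URL}/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = chat_request("meta-llama/llama-3.1-8b-instruct", "Hello!")
print(req["url"])
```

In practice this payload would be POSTed with an API key header; only the schema is shown here.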

Company Size

1-10

Company Stage

Seed

Total Funding

$11.8M

Headquarters

San Francisco, California

Founded

2023

Simplify's Take

What believers are saying

  • $11.8M seed from Multicoin Capital and a16z CSX fuels R&D expansion.
  • Teams like GravityAds train GPT-5 quality models at lightning speed.
  • Grants program attracts open-source developers with free compute resources.

What critics are saying

  • OpenAI o1 erodes cost edge with 50% cheaper superior reasoning now.
  • Together AI captures workloads with 2x lower latency on Arm chips.
  • DeepSeek-V3 commoditizes SLMs as users self-host at 10% compute cost.

What makes Inference unique

  • Catalyst platform uses production traffic for self-improving AI models.
  • Aggregates underutilized data center capacity for 90% cost savings.
  • Full-stack LLM lifecycle from monitoring to specialized model deployment.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

Unlimited Paid Time Off

Hybrid Work Options

401(k) Company Match

Commuter Benefits

Phone/Internet Stipend

Gym Membership

Wellness Program

Mental Health Support

Stock Options

Performance Bonus

Profit Sharing

Company Equity

Remote Work Options

Sabbatical Leave

Company News

RootData
Oct 15th, 2025
Inference.net raises $11.8M seed funding

Open-source AI provider Inference has completed an $11.8 million seed round led by Multicoin Capital and a16z CSX, with participation from Topology Ventures, Founders, Inc., and angel investors. The funding will expand Inference's R&D in model and infrastructure performance and its capacity to serve more companies.

Inference.net
Oct 14th, 2025
Announcing our $11.8M Series Seed

Inference is excited to announce that it has raised $11.8M in Series Seed funding, led by Multicoin Capital and a16z CSX, with participation from Topology Ventures, Founders, Inc., and an exceptional group of angel investors. Inference.net enables companies to train and deploy custom AI models that outperform general-purpose alternatives at a fraction of the cost. This capital will accelerate its mission to help businesses take control of their AI destiny.

A fork in the road. Every company building with AI faces a critical challenge: pay unsustainable prices to OpenAI, Anthropic, and Google for general-purpose models, or compromise on quality with cheaper alternatives. This dependency on frontier labs creates three fundamental risks. First, spiraling costs limit scale: as usage grows from thousands to billions of requests, API costs can consume entire budgets. Second, companies lack control over core business infrastructure, leaving them vulnerable to price changes, model deprecations, and service disruptions. Third, when everyone uses the same models, true differentiation becomes impossible. Companies shouldn't have to choose between quality and cost. They shouldn't be forced to send sensitive customer data to third-party servers. And they shouldn't build their competitive advantage on infrastructure they don't control.

Where Inference stands. Over the past year, Inference has trained and deployed custom language models for some of the fastest-growing AI-native companies in the world. Its approach is straightforward: identify the specific, repeatable tasks that businesses run millions of times and train purpose-built models that excel at exactly those tasks. Whether extracting data from documents, captioning images, or classifying content, its models deliver superior results in their specialized domains. The results speak for themselves.
Custom models match or exceed frontier-model performance while running 2-3x faster and costing up to 90% less. These models, up to 100x smaller than GPT-5-class systems, show that optimizing for specific tasks beats general capability on cost-to-performance. Specialized models transform the economics of using AI at scale. Companies spending millions annually on API calls reduce costs by up to 90%. Applications previously constrained by latency can now serve real-time use cases. Businesses concerned about data privacy run models on their own infrastructure. Most importantly, companies gain full control of the AI models powering their core products. Beyond economics, custom models provide a lasting competitive advantage. When every company has access to the same frontier models, differentiation disappears. Custom models trained on proprietary data and optimized for specific workflows become a moat that competitors cannot replicate. Your AI becomes yours, and yours alone.

Moving forward. The next decade will see two parallel tracks in AI development. Frontier labs will continue pushing boundaries with massive, general-purpose models for open-ended tasks like coding, creative writing, and complex reasoning. These models will remain expensive but essential for exploratory use cases. Simultaneously, a new ecosystem of specialized models will power the repetitive, high-volume tasks that constitute the majority of business AI usage. Companies will rely on frontier labs for cutting-edge capabilities while owning and operating custom models for core operations. As companies scale from prototypes to production, the cost of relying on frontier labs becomes untenable. Meanwhile, the open-source ecosystem has matured dramatically, and new post-training techniques make it possible to match frontier capabilities with far fewer parameters.
This funding enables Inference to expand its research and development into new frontiers of model and infrastructure performance while scaling its ability to serve more companies.

Join Inference. The transition from renting to owning intelligence has begun, and Inference aims to accelerate it. If you're spending more than $50,000 per month on closed-source AI providers, Inference can help you cut costs and improve performance in as little as four weeks. Own your model. Scale with confidence. Schedule a call with its research team to learn more about custom training; Inference will propose a plan that beats your current SLA and unit cost.