Full-Time

ML Ops Engineer

Founding Team

Fabrion

AI-native platform for industrial manufacturing

No salary listed

San Francisco, CA, USA

In Person

Category
Operations & Logistics
Required Skills
LLM
Kubernetes
MLOps
React.js
Github Actions
Docker
RAG
LangGraph
Terraform
Next.js
Observability
REST APIs
LangChain
DevOps
Requirements
  • 5+ years of experience as a full stack or backend engineer.
  • Proven experience owning and delivering production systems end-to-end.
  • Familiarity with modern frontend frameworks such as React or Next.js.
  • Familiarity with building APIs, databases, cloud infrastructure, or deployment workflows at scale.
  • Comfort working in early-stage startups and operating autonomously.
Responsibilities
  • Build and maintain secure, scalable, automated pipelines for LLM fine-tuning, covering supervised fine-tuning (SFT), low-rank adaptation (LoRA), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) training.
  • Develop retrieval-augmented generation (RAG) embedding pipelines with dynamic updates.
  • Manage model conversion, quantization, and inference rollout.
  • Manage hybrid compute infrastructure for training and inference workloads using Kubernetes, Ray, and Terraform across cloud and on-premises GPU clusters.
  • Containerize models and agents using Docker, with reproducible builds and CI/CD via GitHub Actions or ArgoCD.
  • Implement and enforce model governance, including versioning, metadata, lineage, reproducibility, and evaluation capture.
  • Create and manage evaluation and benchmarking frameworks such as OpenLLM-Evals, RAGAS, and LangSmith.
  • Integrate with security and access-control layers such as Open Policy Agent (OPA), attribute-based access control (ABAC), and Keycloak to enforce per-tenant model policies.
  • Instrument observability for model latency, token usage, performance metrics, error tracing, and drift detection.
  • Support the deployment of agentic applications with LangGraph, LangChain, and custom inference backends.
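As a toy illustration of the "RAG embedding pipelines with dynamic updates" responsibility above (not Fabrion's actual stack), the sketch below chunks documents, embeds them with a stand-in hashing embedder, and supports upserts into an in-memory index. All names here (`VectorIndex`, `embed`, `chunk`) are hypothetical; a real pipeline would call an embedding model and a vector database such as Weaviate or Qdrant:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy deterministic embedding: hash character trigrams into a
    fixed-size, L2-normalized vector. Stands in for a real model."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 80) -> list[str]:
    """Split a document into fixed-size character chunks."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

class VectorIndex:
    """Minimal in-memory vector store supporting dynamic upserts."""
    def __init__(self):
        self.entries: dict[str, tuple[list[float], str]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Re-embedding on upsert is what makes updates "dynamic":
        # a changed source document overwrites its old vectors.
        for i, c in enumerate(chunk(text)):
            self.entries[f"{doc_id}:{i}"] = (embed(c), c)

    def query(self, q: str, k: int = 2) -> list[str]:
        qv = embed(q)
        scored = sorted(
            self.entries.values(),
            key=lambda e: -sum(a * b for a, b in zip(qv, e[0])),
        )
        return [text for _, text in scored[:k]]
```

Production versions would add stale-chunk deletion on re-upsert, metadata filters for tenancy, and batched embedding calls, but the upsert-then-query shape is the same.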
Desired Qualifications
  • 4+ years in MLOps, ML platform engineering, or infrastructure-focused ML roles.
  • Deep familiarity with model lifecycle management tools such as MLflow, Weights & Biases, Data Version Control (DVC), and HuggingFace Hub.
  • Experience with large model deployments, particularly open-source LLMs such as LLaMA, Mistral, Falcon, or Mixtral.
  • Comfortable with tuning libraries including HuggingFace Trainer, DeepSpeed, Fully Sharded Data Parallel (FSDP), and Quantized Low-Rank Adaptation (QLoRA).
  • Familiarity with inference serving tools such as vLLM, Text Generation Inference (TGI), Ray Serve, and Triton Inference Server.
  • Proficiency with Terraform, Helm, Kubernetes, and container orchestration.
  • Experience with CI/CD for machine learning, such as GitHub Actions workflows and model checkpoint management.
  • Experience managing hybrid workloads across GPU clouds such as Lambda, Modal, HuggingFace Inference, or SageMaker.
  • Familiarity with cost optimization techniques including spot instance scaling, batch prioritization, and model sharding.
  • Familiarity with LangChain, LangGraph, LlamaIndex or similar Retrieval-Augmented Generation and agent orchestration tools.
  • Experience building embedding pipelines for multi-source documents including PDF, JSON, CSV, and HTML.
  • Experience integrating with vector databases such as Weaviate, Qdrant, FAISS, and Chroma.
  • Experience implementing model-level role-based access control (RBAC), usage tracking, and audit trails.
  • Experience integrating with API rate limits, tenant billing, and SLA observability.
  • Experience with policy-as-code systems such as Open Policy Agent (OPA) and its Rego language.
  • Experience with monitoring tools such as Prometheus, Grafana, OpenTelemetry, and LangSmith.
  • Experience with security tools like Keycloak and Vault.
  • Proficiency in Python and Bash, with optional knowledge of Rust or Go for tooling.
  • Experience with SOC 2, HIPAA, or GovCloud-grade model operations.
  • Prior experience as a founder, founding engineer, or in a 0-1 pre-seed startup.
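The quantization work referenced above (model conversion and QLoRA-style deployment) reduces, at its core, to mapping float weights onto a small integer range plus a scale factor. Below is a minimal, purely illustrative sketch of symmetric per-tensor int8 quantization; it is not any specific toolchain's implementation (real tools such as bitsandbytes or GPTQ use per-channel scales and calibration data):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map floats into [-127, 127]
    using a single per-tensor scale derived from the max magnitude."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights; error is bounded by scale/2."""
    return [v * scale for v in q]
```

The trade-off this illustrates is the one inference rollout cares about: 4x smaller weights at the cost of a bounded reconstruction error, which evaluation harnesses then verify has not degraded model quality.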

Fabrion builds an AI-native platform for industrial manufacturing, designed to accelerate AI adoption across complex, multi-tier value chains. The product applies artificial intelligence and machine learning to optimize manufacturing, supply-chain, and value-chain operations, with the aim of improving speed, resilience, and overall productivity. Unlike generic AI tools, Fabrion targets real-world industrial workflows, differentiating itself through a focused mission on the industrial sector, a dedicated platform built for multi-tier value chains, and backing from investors such as 8VC to fund its early-stage development. Fabrion's goal is to help manufacturers operate more efficiently and robustly by enabling AI-driven decision-making across their value chains.

Company Size

N/A

Company Stage

N/A

Total Funding

N/A

Headquarters

N/A

Founded

N/A

Simplify Jobs

Simplify's Take

What believers are saying

  • Telit Cinterion's deviceWISE Suite validates market demand for OT-AI integration at scale.
  • Foundational Industries' autonomous factory buildout creates partnership opportunities for Fabrion's AI stack.
  • Multi-tier value chain optimization addresses $2T+ manufacturing inefficiency across supply chains globally.

What critics are saying

  • Telit Cinterion directly competes with unified data backbone for industrial AI integration.
  • Foundational Industries' hardware-software integration outpaces Fabrion's software-only platform capabilities.
  • 8VC's exclusive funding creates existential risk if reallocated to competing autonomous operations startups.

What makes Fabrion unique

  • Custom fine-tuned SLMs with RLHF governance differentiate from foundation model wrappers.
  • Bare-metal acceleration for ETL/training enables federated deployments across hyperscalers and edge.
  • Industry-specific knowledge graph connects fragmented data, teams, and suppliers across production lifecycle.


Benefits

Health Insurance

401(k) Retirement Plan

Remote Work Options