Full-Time

Machine Learning Data Engineer

Systems & Retrieval

Posted on 7/26/2025

Zyphra

51-200 employees

Open-source AI company building multimodal agents

No salary listed

San Francisco, CA, USA

In Person

Category
Data & Analytics
Required Skills
Python
NoSQL
SQL
Machine Learning
Observability
DevOps
Requirements
  • Strong software engineering background with fluency in Python
  • Experience designing, building, and maintaining data pipelines in production environments
  • Deep understanding of data structures, storage formats, and distributed data systems
  • Familiarity with indexing and retrieval techniques for large-scale document corpora
  • Understanding of database systems (SQL and NoSQL), their internals, and performance characteristics
  • Strong attention to security, access controls, and compliance best practices (e.g., GDPR, SOC2)
  • Excellent debugging, observability, and logging practices to support reliability at scale
  • Strong communication skills and experience collaborating across ML, infra, and product teams
Responsibilities
  • Designing and implementing distributed data ingestion and transformation pipelines
  • Building retrieval and indexing systems that support RAG and other LLM-based methods
  • Mining and organizing large unstructured datasets, both in research and production environments
  • Collaborating with ML engineers, systems engineers, and DevOps to scale pipelines and observability
  • Ensuring compliance and access control in data handling, with security and auditability in mind
Desired Qualifications
  • Experience building or maintaining LLM-integrated retrieval systems (e.g., RAG pipelines)
  • Academic or industry background in data mining, search, recommendation systems, or information retrieval (IR)
  • Experience with large-scale ETL systems and tools like Apache Beam, Spark, or similar
  • Familiarity with vector databases (e.g., FAISS, Weaviate, Pinecone) and embedding-based retrieval
  • Understanding of data validation and quality assurance in machine learning workflows
  • Experience working on cross-functional infra and MLOps teams
  • Knowledge of how data infrastructure supports training pipelines, inference serving, and feedback loops
  • Comfort working across raw, unstructured data, structured databases, and model-ready formats

Zyphra builds open-source, open-science AI focused on multimodal models and efficient systems that run on a wide range of hardware. It develops autonomous agent platforms that give enterprises conversational AI, automation, and personalized assistants that work offline. Its Maia project combines neural architectures with long-term memory, reinforcement learning, and continual learning for text and audio modalities. Zyphra also releases open-source assets such as the Zamba and Zamba2-2.7B models and the Zyda dataset. Its investor backing supports work on offline, hardware-flexible AI with openly published research resources.

Company Size

51-200

Company Stage

N/A

Total Funding

N/A

Headquarters

San Francisco, California

Founded

2019

Simplify Jobs

Simplify's Take

What believers are saying

  • AMD MI355X delivers 35x inference gains since October 2025, boosting Zyphra Cloud throughput.
  • Hugging Face reports 40% Q1 2026 rise in MoE model downloads, driving ZAYA1-8B adoption.
  • IBM Cloud's February 2026 Polara upgrades cut latency 60%, accelerating Maia super-agent training.

What critics are saying

  • AMD integrates Zyphra's kernels into MI355X firmware within 12-24 months, commoditizing platform.
  • DeepSeek R1-0528 and Mistral Small-4-119B beat ZAYA1-8B benchmarks, shifting developers in 6-12 months.
  • Google DeepMind poaches Zyphra's ex-Anthropic talent in 6-12 months, halting Maia development.

What makes Zyphra unique

  • Zyphra builds full-stack AI platform on AMD MI355X GPUs via TensorWave for long-context agents.
  • ZAYA1-8B MoE model uses under 1B active parameters, rivaling Nemotron-3-Nano-30B on reasoning.
  • ZUNA open-sources EEG-to-text brain-computer interface model outperforming spline interpolation.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Retirement Plan

401(k) Company Match

Relocation Assistance

Wellness Program

Growth & Insights and Company News

Headcount

6 month growth

3%

1 year growth

0%

2 year growth

21%
PR Newswire
Feb 18th, 2026
Zyphra releases ZUNA, open-source brain-computer interface model for thought-to-text AI

Zyphra has released ZUNA, a foundation model for brain-computer interfaces that processes electroencephalography data and advances towards thought-to-text communication. The 380-million-parameter diffusion autoencoder model reconstructs high-fidelity brain signals from imperfect EEG data, improving diagnostics and research workflows. ZUNA works across various EEG systems, from consumer headsets to 256-electrode research equipment, predicting missing channels from sparse inputs. The model outperforms traditional spherical-spline interpolation methods, particularly with incomplete or noisy data. The San Francisco-based company released ZUNA as open-source software under an Apache 2.0 licence, with model weights available on Hugging Face and code on GitHub. Zyphra is seeking collaborations to improve future versions for specific use cases across medical devices, neuroscience research and consumer neurotechnology sectors.

Surperformance
Oct 1st, 2025
Zyphra partners with IBM, AMD: $1B AI boost

IBM and AMD have partnered with Zyphra, an AI startup valued at $1 billion after its Series A funding, to provide next-generation AI infrastructure. The multi-year agreement includes deploying a large cluster of AMD Instinct MI300X GPUs and AMD Pensando network accelerators on IBM Cloud. The first phase was delivered in September, with expansion planned for 2026. Zyphra will use this to train its 'Maia' super-agent for enhancing business productivity through language, image, and sound processing.

VentureBeat
Jun 7th, 2024
Zyphra Debuts Zyda, A 1.3T Language Modeling Dataset It Claims Outperforms Pile, C4, Arxiv

Zyphra Technologies is announcing the release of Zyda, a massive dataset designed to train language models. It consists of 1.3 trillion tokens and is a filtered and deduplicated mashup of existing premium open datasets, specifically RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv. The company claims its ablation studies reveal that Zyda performs better than the datasets it was built on. An early dataset version powers Zyphra’s Zamba model and will eventually be available for download on Hugging Face.

“[We] came up with Zyda when [we] were trying to create a pretraining dataset for [our] Zamba series of models,” Zyphra Chief Executive Krithik Puthalath tells VentureBeat in an email

INACTIVE