Full-Time

Research Scientist

DatologyAI

11-50 employees

Automated data curation for GenAI training

Compensation Overview

$180k - $300k/yr

+ Equity + Relocation Assistance + 401(k) Match

Company Does Not Provide H1B Sponsorship

San Carlos, CA, USA

Hybrid

Category
AI & Machine Learning
Required Skills
Neural Networks
PyTorch
Apache Spark
Snowflake
Requirements
  • 3+ years of deep learning research experience
  • Experience with post-training large vision, language, and multimodal models
  • Post-training algorithm development, data curation, and/or synthetic data methods for:
      • Preference-based tuning (e.g. DPO, RLVR, RRHF)
      • Alternative supervision & self-supervision techniques such as self-training and chain-of-thought distillation
      • SFT (e.g. instruction tuning and demonstration fine-tuning)
  • Post-training tooling development and engineering experience
  • Strong understanding of the fundamentals of deep learning
  • Sufficient software engineering and deep learning framework skills (PyTorch, or a willingness to learn it) to conduct large-scale research experiments and build production prototypes
  • Demonstrated track record of success in deep learning research, whether papers, tools, or other research artifacts
Responsibilities
  • Post-training data curation. You’ll conduct research on how to algorithmically curate post-training data: for example, how to generate and refine preference and instruction-following data, how to curate capability- and domain-specific data, and how to make post-training more effective, controllable, and generalizable.
  • Unifying pre-training and post-training data curation. Pushing the bounds on model capabilities requires unifying post-training and pre-training data curation. You will pursue research on end-to-end data curation: how to curate pre-training data to improve the post-trainability of models and how to jointly optimize pre- and post-training data curation, all in service of maximizing the final performance of post-trained models.
  • Transform messy literature into practical improvements. The research literature is vast, rife with ambiguity, and constantly evolving. You will use your skills as a scientist to source, vet, implement, and improve promising ideas from the literature and of your own creation.
  • Conduct science driven by real-world needs. At DatologyAI, we understand that conference reviewers and academic benchmarks don’t always incentivize the most impactful research. Your research will be guided by concrete customer needs and product improvements.
Desired Qualifications
  • Experience with data management and distributed data processing solutions (e.g. Spark, Snowflake)
  • Experience building + shipping ML products
  • Candidates do not need a PhD or extensive publications

DatologyAI offers automated data curation tools to optimize GenAI training by selecting high-quality, relevant data and removing noisy or harmful data. The core tech analyzes datasets and plugs into existing training pipelines, requiring minimal code changes, and scales from small to petabyte-scale data with usage-based pricing. It differentiates itself with end-to-end automated curation at scale and easy integration, supported by recognized research work and contributions to ImageNet, plus a team with CMU PhD expertise and immigrant-founder VC backing. The goal is to help organizations train better AI models more efficiently and cost-effectively by ensuring high-quality data throughout the training lifecycle.

Company Size

11-50

Company Stage

Series A

Total Funding

$57.7M

Headquarters

Redwood City, California

Founded

2023

Simplify's Take

What believers are saying

  • Raised $46M in 2024 to fuel automated data curation expansion.
  • Founders Ari Morcos, Matthew Leavitt, and Bogdan Gaza bring Amazon and Twitter expertise.
  • Plans to grow from 10 to 25 employees by end of 2024.

What critics are saying

  • OpenAI replicates DatologyAI pipelines internally, commoditizing tech by 2025.
  • CleanLab undercuts with 30% lower pricing, stealing healthcare clients in 2026.
  • EU AI Act blocks sales to Europe due to black-box non-compliance in 2026.

What makes DatologyAI unique

  • DatologyAI offers modality-agnostic curation for text, images, video, audio, genomic data.
  • Deploys on-premises or VPC, scaling to petabytes without training code changes.
  • Automatically optimizes batching and augmentation for specific model applications.

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Company Match

Unlimited Paid Time Off

Annual Wellness Stipend

Annual Learning and Development Stipend

Relocation Assistance

Company News

SiliconANGLE Media
May 9th, 2024
DatologyAI raises $46M to streamline AI model training data diets

DatologyAI
Feb 23rd, 2024
Introducing DatologyAI — Making models better through better data, automatically

Models are what they eat. AI models trained on large-scale datasets have demonstrated jaw-dropping abilities and have the power to transform every aspect of our daily lives, from work to play. This massive leap in capabilities has largely been driven by corresponding increases in the amount of data we train models on, shifting from millions of data points several years ago to billions or trillions of data points today. As a result, these models are a reflection of the data on which they’re trained.

SiliconANGLE Media
Feb 23rd, 2024
DatologyAI raises $11.65M to automate data curation for more efficient AI training

TechCrunch
Feb 22nd, 2024
DatologyAI is building tech to automatically curate AI training datasets

A new startup, DatologyAI, claims to be able to automatically curate the massive data sets on which AI models train.