Full-Time

Product Engineer

AI Data Platform

Confirmed live in the last 24 hours

Labelbox

201-500 employees

Provides data labeling solutions for AI

Compensation Overview

$130k - $260k/yr

Junior, Mid

San Francisco, CA, USA

Hybrid model with 2 days per week in office.

Category
Backend Engineering
Software Engineering
Required Skills
RabbitMQ
Python
MySQL
NoSQL
Node.js
Data Structures & Algorithms
Nest.js
Apache Kafka
Java
Postgres
TypeScript
Elasticsearch
MongoDB
Cassandra
Requirements
  • Bachelor’s degree in Computer Science, Data Engineering, or a related field. Advanced degree preferred.
  • 2+ years of work experience in a software or data-focused company, with significant expertise in data infrastructure and backend engineering.
  • Deep knowledge of designing and managing scalable database systems, including relational databases (e.g., PostgreSQL, MySQL), NoSQL stores (e.g., MongoDB, Cassandra), and cloud-native solutions (e.g., Google Spanner, AWS DynamoDB).
  • Strong experience with data infrastructure components such as data pipelines, streaming systems, and storage architectures (e.g., Cloud Buckets, Key-Value Stores).
  • Proficiency in optimizing databases for performance (e.g., schema design, indexing, query tuning) and integrating them with broader data workflows.
  • Previous experience with distributed systems tools (e.g., queues, message brokers like Kafka or RabbitMQ, job orchestration frameworks) for real-time data processing and other use cases.
  • Previous experience with search engines (e.g., Elasticsearch).
  • Knowledge of backend development using languages like Python, Java, or TypeScript; familiarity with NodeJS and NestJS is a plus.
  • Proficient in data structures, algorithms, and system design for large-scale data management.
  • Demonstrated ability to keep up with trends in data infrastructure and database technologies.
  • Excellent communication and collaboration skills.
  • Strong sense of ownership and ability to thrive in a fast-paced environment.
  • Comfortable with ambiguity, methodically breaking high-level requirements down into actionable data infrastructure tasks.
  • Resourceful problem-solver with attention to detail, eager to take initiative and deliver results.
  • High proficiency in leveraging AI tools for daily development (e.g., Cursor, GitHub Copilot).
Responsibilities
  • Design and build scalable data infrastructure, integrating high-performance databases (relational, NoSQL, cloud-native) with distributed systems for data processing, storage, and streaming.
  • Optimize database systems for performance, reliability, and scalability, ensuring efficient data retrieval, indexing, and querying to support AI workflows.
  • Develop and maintain data pipelines using distributed queues, message brokers, and job management mechanisms to enable high-throughput import/export operations.
  • Collaborate with team members and stakeholders to align data infrastructure with platform goals and customer needs.
  • Participate in Sprint Planning, Standups, and related activities to drive data-focused initiatives forward.
  • Mentor and guide less experienced engineers, sharing expertise in data infrastructure and database optimization.
  • Support the team’s area of ownership by working with the Support organization to resolve customer-facing data issues.
  • Stay abreast of industry trends in data infrastructure and database technologies, incorporating relevant innovations into our systems.
  • Contribute to technical documentation, research publications, blog posts, and presentations at conferences and forums.
  • Enhance data infrastructure capabilities for an AI platform used by leading AI labs to develop powerful multi-modal large language models (LLMs).
Desired Qualifications
  • Familiarity with data warehousing solutions (e.g., Snowflake, BigQuery).
  • Experience with container orchestration systems (e.g., Kubernetes) for deploying data infrastructure components.
  • Experience with one or more public cloud platforms: Google Cloud Platform (GCP) (preferred), Amazon Web Services (AWS), Microsoft Azure.
  • Understanding of the Data + AI ecosystem and its relevance to large-scale AI platforms.
  • Knowledge of memory management and optimization in data-intensive systems.
  • Experience with DevOps tools (e.g., ArgoCD, DataDog) for monitoring and managing data infrastructure.
  • Previous experience using LLM-backed AI services, such as those from OpenAI, Anthropic, or Google, to develop product features.

Labelbox offers data labeling solutions for artificial intelligence applications, helping businesses label images, videos, text, and documents efficiently. Their tools create workflows that manage labeling tasks, ensuring high-quality results for clients in industries like agriculture, healthcare, and technology. Operating on a software-as-a-service (SaaS) model, Labelbox generates revenue through subscription fees and additional workforce services. The company's goal is to enhance AI development by providing effective data labeling solutions that improve the efficiency and quality of AI model training.

Company Size

201-500

Company Stage

Series D

Total Funding

$188.9M

Headquarters

San Francisco, California

Founded

2018

Simplify Jobs

Simplify's Take

What believers are saying

  • Partnership with Google Cloud enhances Labelbox's generative AI capabilities.
  • Auto-computed metrics improve model debugging and performance before production.
  • Increased demand for AI-driven data labeling in healthcare boosts Labelbox's market potential.

What critics are saying

  • Emerging competition from companies like DeepSeek threatens Labelbox's market position.
  • Google's AI model upgrades could attract businesses away from Labelbox.
  • Shift towards competitive protection in AI limits Labelbox's collaborative opportunities.

What makes Labelbox unique

  • Labelbox offers a data-centric AI platform with human supervision and automation.
  • The platform supports Fortune 500 companies like Walmart, P&G, and Adobe.
  • Labelbox provides advanced data labeling solutions for images, videos, text, and documents.

Benefits

Competitive remuneration

Flexible vacation policy (we don't count PTO Days)

401k Program

College savings account

HSA

Daily lunches paid for by the company (especially convenient while working from home)

Virtual wellness and guided meditation programs

Dog-friendly office

Regular company social events (happy hours, off-sites)

Professional development benefits and resources

Remote friendly (we hire in-office and remote employees)

Growth & Insights and Company News

Headcount

6 month growth

1%

1 year growth

1%

2 year growth

-6%
Tech in Asia
Jun 4th, 2025
DeepSeek's New AI Model May Be Trained on Google's Gemini

Chinese AI lab DeepSeek has released an updated reasoning model, R1-0528, which is reported to perform well on math and coding benchmarks. However, concerns have been raised regarding the potential use of data from Google’s Gemini AI family in training the model. Developer Sam Paech, based in Melbourne, shared evidence on social media indicating that R1-0528’s outputs show similarities to Google’s Gemini 2.5 Pro, and another developer, known for creating SpeechMap, noted that R1-0528’s reasoning patterns resemble those of Gemini. DeepSeek has not disclosed the sources of the data used to train the model. (Source: TechCrunch)

Food for thought:

1. Model distillation creates an ethical gray area amid fierce AI competition. Distillation, the practice of training smaller models on the outputs of larger ones, has become a contentious but widespread technique in AI development, especially for companies with limited computing resources. While distillation itself is legitimate, DeepSeek’s alleged use of competitors’ models highlights the intellectual property challenges it raises; previous accusations suggested the lab used OpenAI’s outputs without authorization. For companies that are short on GPUs but flush with cash, it can be economically rational to generate synthetic data from competitors’ models rather than build everything from scratch. The protective measures major AI labs are adopting, such as OpenAI requiring ID verification from a list of supported countries that excludes China, and Google summarizing model traces, show how seriously they view the threat of unauthorized knowledge transfer. These measures reflect a broader industry recognition that model weights represent the culmination of substantial investment and are valuable intellectual property worth safeguarding.

2. AI contamination creates attribution challenges for researchers and companies. Proving model copying definitively is difficult partly because the open web is increasingly “contaminated” with AI-generated content, making it hard to determine a model’s true training sources. As content farms flood the internet with AI-generated text and bots populate platforms like Reddit and X, the line between human-created content and AI output blurs, complicating efforts to build “clean” training datasets. Similar word choices and expression patterns across different models may therefore simply reflect training on the same AI-generated web content rather than direct copying. Attribution is further complicated by the fact that many models naturally converge on similar linguistic patterns through shared training methodologies and objectives, making definitive evidence of unauthorized distillation hard to establish. These difficulties carry significant implications for intellectual property protection in AI, as companies struggle to determine whether similarities between models indicate legitimate convergence or improper copying.

3. AI security measures signal a shift from open collaboration to competitive protection. The increasingly sophisticated protections AI labs are implementing, such as OpenAI requiring ID verification, Google “summarizing” model traces, and Anthropic explicitly protecting “competitive advantages,” signal a new phase of AI development in which knowledge protection trumps open sharing. This defensive posture is emerging in a context where the stakes are enormous: training a single large AI model can cost millions in computing resources and produce emissions equivalent to five cars’ lifetimes, making the resulting intellectual property extremely valuable. The measures are particularly notable amid international AI competition, with some U.S. legislators even proposing criminal penalties for downloading certain Chinese AI models such as DeepSeek. The tension between collaboration and protection reflects a maturing industry in which companies increasingly treat their training methodologies and model capabilities as critical competitive assets rather than academic research to be shared openly.
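
To make the distillation idea above concrete: the core mechanism is a loss that pushes a smaller "student" model to imitate a larger "teacher" model's full output distribution, not just its top answers. The sketch below is a minimal, framework-free illustration of the temperature-softened KL-divergence loss commonly used for this; it is not DeepSeek's or any lab's actual training code, and the logit values are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's.

    The student is trained to minimize this, i.e. to imitate the teacher's
    output distribution rather than only its hard top-1 labels. A higher
    temperature spreads probability mass over more classes, exposing the
    teacher's "dark knowledge" about near-miss answers.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student whose logits match the teacher's incurs zero loss.
teacher = [3.0, 1.0, 0.2]
assert abs(distillation_loss(teacher, teacher)) < 1e-12
# A mismatched student incurs a positive loss, which training would reduce.
assert distillation_loss(teacher, [0.2, 1.0, 3.0]) > 0.0
```

The same loss applies whether the teacher is one's own larger model or, as alleged here, a competitor's model queried through an API, which is why labs now guard raw model outputs and traces so carefully.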

PYMNTS
Sep 24th, 2024
Google Slashes Prices, Upgrades and Boosts Performance of AI Models

Google’s latest artificial intelligence models could accelerate AI adoption in eCommerce and retail, developers say, as the tech giant unveils upgrades designed to attract more businesses to its Gemini platform. The company announced two updated production-ready models in a Tuesday (Sept. 24) blog post, Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002, which offer enhanced capabilities across a range of tasks, including product recommendations, inventory management and customer service automation. “The new release introduces advanced capabilities in math and vision tasks,” Sujan Abraham, a senior software engineer at AI firm Labelbox, told PYMNTS. “These models are designed for a wide range of tasks, including text, code and multimodal applications. They can process larger and much more complex inputs like 1,000-page PDFs, massive code repos and hour-long videos.”

Reworked
Sep 12th, 2023
Labelbox Introduces Large Language Model (LLM) Solution to Help Enterprises Innovate With Generative AI, Expands Partnership With Google Cloud

Labelbox introduces Large Language Model (LLM) solution to help enterprises innovate with generative AI, expands partnership with Google Cloud.

Datanami
Sep 12th, 2023
Labelbox Introduces LLM Solution to Help Enterprises Innovate with Generative AI, Expands Partnership with Google Cloud

Labelbox introduces LLM solution to help enterprises innovate with generative AI, expands partnership with Google Cloud.

Labelbox
Dec 20th, 2022
Debugging models made easy with auto-computed metrics

In the next week, Labelbox will be releasing auto-computed model metrics to debug your model, find and fix labeling errors, and improve the overall performance of your model before it hits production on real-world data.