Essential AI’s mission is to deepen the partnership between humans and computers, unlocking collaborative capabilities that far exceed what is possible today. We believe that building delightful end-user experiences requires innovating across the stack, from the UX all the way down to models that deliver the best user value per FLOP.
We believe that a small, focused team of motivated individuals can create outsized breakthroughs. We are building a world-class, multi-disciplinary team that is excited to solve hard real-world AI problems. We are well-capitalized and supported by March Capital and Thrive Capital, with participation from AMD, Franklin Venture Partners, Google, KB Investment, and NVIDIA.
The Role
The Data Acquisition (Crawler) Engineer will build and maintain the systems that collect, store, and process data from a wide range of sources. Your primary responsibility will be to design, develop, and maintain efficient, reliable web crawlers and data acquisition systems that support our model training.
What you’ll be working on
Architect and build a large-scale distributed web crawling system.
Design and implement web crawlers and scrapers to automatically extract data from websites, handling challenges like dynamic content and scaling to large data volumes.
Develop data acquisition pipelines to ingest, transform, and store large volumes of data.
Build the system for high scalability and optimize crawler performance.
Monitor and troubleshoot crawler activities to detect and resolve issues promptly.
Work closely with the data infrastructure and data research teams to improve data quality.
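To make the responsibilities above concrete, here is a minimal sketch, in Go, of the core of such a system: a pool of worker goroutines pulling from a shared, deduplicated URL frontier. The `crawl` and `fetch` names are illustrative, and the injected `fetch` function stands in for real HTTP requests so the sketch runs without network access; a production crawler would also honor robots.txt, apply per-host politeness delays, and back the frontier with a persistent queue.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl runs nWorkers goroutines over a shared, deduplicated frontier so
// each URL is fetched at most once. fetch returns the links found on a
// page; here it is injected so the sketch is self-contained.
func crawl(seeds []string, nWorkers int, fetch func(url string) []string) []string {
	frontier := make(chan string, 1024) // bounded in-memory work queue
	var mu sync.Mutex
	seen := make(map[string]bool)
	var visited []string
	var pending sync.WaitGroup // URLs enqueued but not yet fully processed

	enqueue := func(url string) {
		mu.Lock()
		dup := seen[url]
		seen[url] = true
		mu.Unlock()
		if dup {
			return
		}
		pending.Add(1)
		frontier <- url
	}

	for _, s := range seeds {
		enqueue(s)
	}

	var workers sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		workers.Add(1)
		go func() {
			defer workers.Done()
			for url := range frontier {
				links := fetch(url)
				mu.Lock()
				visited = append(visited, url)
				mu.Unlock()
				for _, l := range links {
					enqueue(l) // newly discovered links extend the frontier
				}
				pending.Done()
			}
		}()
	}

	pending.Wait()  // no unprocessed URLs remain
	close(frontier) // let workers drain and exit
	workers.Wait()
	sort.Strings(visited) // deterministic order for the example
	return visited
}

func main() {
	// Hypothetical in-memory link graph standing in for real pages.
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"c", "d"},
		"c": {"d"},
		"d": {},
	}
	fmt.Println(crawl([]string{"a"}, 4, func(u string) []string { return graph[u] }))
	// prints [a b c d]
}
```

The same shape scales out by replacing the channel with a distributed queue and sharding the `seen` set, which is where the distributed-systems experience listed below comes in.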
What we are looking for
Previous large-scale web crawling experience is a must for this role.
Minimum of 5 years of experience in data-intensive applications and distributed systems.
Proficiency in high-performance programming languages such as Go, Rust, or C++.
Strong understanding of containerization and orchestration frameworks such as Docker and Kubernetes.
Experience building on GCP or AWS services.
Bonus: You have deep expertise working with headless browsers and Chrome DevTools Protocol.
Bonus: You are curious to learn and develop an understanding of how data sources and data quality affect LLM capabilities.
We encourage you to apply for this position even if you don’t meet all of the above requirements but are eager to push on these techniques.
We are based in-person in SF. We offer relocation assistance to new employees.