Full-Time

Lead Reliability Engineer

Celestial AI

51-200 employees

Optical interconnects for hyperscale data centers

Hardware
AI & Machine Learning

Compensation Overview

$175k - $200k Annually

+ Equity

Expert

Santa Clara, CA, USA

Category
DevOps & Infrastructure
Site Reliability Engineering

Requirements
  • Bachelor's degree in Engineering or related field; Master's or PhD degree preferred.
  • 15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing (HPC) applications at the component, board, and system levels.
  • Very strong understanding of the physics of failure, applied to drive material and process improvements for components.
  • Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments, such as reliability modeling, fault tolerance techniques, and performance optimization strategies.
  • Experience working with industry standards and guidelines specific to datacenter and HPC reliability, such as GR-468 and other relevant datacenter component qualification requirements.
  • Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments.
  • Excellent problem-solving skills and the ability to perform detailed root cause analysis in complex systems.
  • Effective communication skills and the ability to collaborate with internal teams and external stakeholders in the datacenter and HPC ecosystem.

Responsibilities
  • Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications, addressing unique challenges such as thermal management, power integrity, and workload variability.
  • Lead reliability testing and qualification activities tailored for datacenter and HPC environments, including stress testing, thermal cycling, and performance degradation analysis.
  • Collaborate closely with cross-functional teams, including hardware design, systems engineering, and datacenter operations, to integrate reliability considerations into product development and deployment processes.
  • Conduct thorough reliability analyses specific to datacenter and HPC applications, such as MTBF (Mean Time Between Failures) calculations, system-level fault tolerance assessments, and risk mitigation strategies.
  • Define reliability requirements and specifications for new products targeting datacenter and HPC markets, working closely with design teams to ensure compliance with industry standards and customer expectations.
  • Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments, driving continuous improvement initiatives and implementing best practices.
  • Stay abreast of emerging technologies and industry trends in datacenter and HPC reliability engineering, leveraging this knowledge to enhance the reliability and performance of our systems.

Celestial AI develops Photonic Fabric™, an optical compute interconnect designed to improve the performance of hyperscale data centers. By enabling memory sharing over optical links, the technology can reduce total DRAM requirements by up to 35%, leading to significant cost savings for data centers. The optical interconnects also lower power consumption and increase bandwidth capacity, which is crucial for multi-tenant cloud environments, where lower-latency memory pooling can save around 23%. In addition, Photonic Fabric™ enables the disaggregation of High Bandwidth Memory (HBM) over optics instead of traditional PCIe connections, addressing future needs for higher-bandwidth connections among compute units. The company aims to provide this optical connectivity to data centers in support of generative AI and other complex computing tasks.

Company Stage

Series C

Total Funding

$337.9M

Headquarters

Sunnyvale, California

Founded

2020

Growth & Insights
Headcount

6 month growth

12%

1 year growth

22%

2 year growth

120%
Simplify's Take

What believers are saying

  • The recent $175 million Series C funding round, led by prominent investors, underscores strong financial backing and growth potential.
  • Appointment of industry veteran Diane Bryant to the Board of Directors brings valuable expertise and credibility to the company.
  • Adoption of Photonic Fabric™ by leading hyperscalers indicates market validation and potential for widespread industry impact.

What critics are saying

  • The highly competitive nature of the high-performance computing market could challenge Celestial AI's ability to maintain its technological edge.
  • Dependence on the successful integration and adoption of Photonic Fabric™ technology by hyperscalers and other partners poses a risk if market acceptance is slower than anticipated.

What makes Celestial AI unique

  • Celestial AI's Photonic Fabric™ technology offers a unique optical compute interconnect that significantly reduces DRAM requirements and power consumption, setting it apart from traditional data center solutions.
  • The company's focus on disaggregating HBM over optics instead of PCIe provides a forward-looking approach to meet future bandwidth and latency needs, unlike competitors relying on conventional interconnects.
  • Strategic collaborations with hyperscalers and AI computing providers enable Celestial AI to address critical performance chokepoints, positioning it as a leader in advanced AI infrastructure.