Full-Time

Preparedness Reliability Engineer

OpenAI

5,001-10,000 employees

Develops safe and beneficial AI technologies

AI & Machine Learning

Senior, Expert

San Francisco, CA, USA

Relocation assistance offered to new employees.

Category
DevOps & Infrastructure
Site Reliability Engineering
Required Skills
Datadog
Kubernetes
Grafana
CloudFormation
Microservices
Prometheus
Terraform
Splunk

Requirements
  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent work experience)
  • 7+ years of professional software engineering experience
  • Proven experience as a reliability engineer or a similar role in a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure
  • Proficiency in programming/scripting languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Knowledge of IaC tools such as Terraform or CloudFormation
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments
Responsibilities
  • Scale our infrastructure to support a wide variety of evaluations, supporting systems, and automation
  • Collaborate with development teams to make our systems more reliable (owning Production Readiness Reviews)
  • Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment
  • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
  • Implement fault-tolerant and resilient design patterns to minimize service disruptions
  • Build and maintain automation tools to streamline repetitive tasks and improve system reliability
  • Partner with engineers and researchers at OpenAI to help bring frontier research capabilities to the world
  • Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability
Desired Qualifications
  • Enjoy seeking out and addressing bottlenecks and areas for performance improvement in our systems
  • Utilize Infrastructure as Code (IaC) principles to automate infrastructure provisioning and configuration management
  • Experienced in collaborating with cross-functional teams to ensure that reliability and scalability are considered in the design and development of new features and services
  • Track record of accelerating engineering reliability by empowering fellow engineers with excellent tooling and systems
  • Help create a diverse, equitable, and inclusive culture that makes everyone feel welcome while enabling radical candor and challenges to groupthink
  • Humble attitude, eagerness to help colleagues, and a desire to do whatever it takes to make the team succeed
  • Own problems end-to-end and willingly pick up whatever knowledge you're missing to get the job done

OpenAI develops artificial intelligence technologies aimed at benefiting humanity. The company creates advanced AI models that can perform various tasks, such as automating processes and enhancing creativity. OpenAI's products, like Sora, allow users to generate videos from text descriptions, showcasing the capabilities of its AI systems. What sets OpenAI apart from competitors is its capped-profit model, which limits the amount of profit the company can make so that excess profits are used to maximize the social benefits of AI. OpenAI's goal is to ensure that artificial general intelligence (AGI) is developed safely and ethically, benefiting all of humanity.

Company Stage

Debt Financing

Total Funding

$18.4B

Headquarters

San Francisco, California

Founded

2015

Growth & Insights
Headcount

6 month growth

0%

1 year growth

-7%

2 year growth

-9%

Simplify's Take

What believers are saying

  • OpenAI's involvement in Project Stargate boosts its infrastructure and strategic partnerships.
  • The 'Operator' AI agent positions OpenAI as a leader in practical AI applications.
  • Collaboration with academia and government enhances OpenAI's role in responsible AI deployment.

What critics are saying

  • Meta's $65 billion AI investment could overshadow OpenAI's advancements.
  • Apple's AI division enhancement may increase competition for OpenAI.
  • Project Stargate's scale may lead to resource allocation challenges for OpenAI.

What makes OpenAI unique

  • OpenAI's capped-profit model emphasizes ethical AI development and social responsibility.
  • The 'Operator' AI agent showcases OpenAI's innovation in autonomous web-based task performance.
  • OpenAI's focus on inference-time compute enhances AI security against adversarial attacks.

Benefits

Health insurance

Dental and vision insurance

Flexible spending account for healthcare and dependent care

Mental healthcare service

Fertility treatment coverage

401(k) with generous matching

20-week paid parental leave

Life insurance (complimentary)

AD&D insurance (complimentary)

Short-term/long-term disability insurance (complimentary)

Optional buy-up life insurance

Flexible work hours and unlimited paid time off (we encourage 4+ weeks per year)

Annual learning & development stipend

Regular team happy hours and outings

Daily catered lunch and dinner

Travel to domestic conferences