Full-Time

Software Engineering Reliability PMTS

Posted on 1/21/2025

Salesforce

10,001+ employees

Cloud-based Customer Relationship Management solutions

Compensation Overview

$211.5k - $334.6k Annually

+ Incentive Compensation + Equity

Senior, Expert

Company Historically Provides H1B Sponsorship

San Francisco, CA, USA + 1 more

More locations: Bellevue, WA, USA

For Washington-based roles, the base salary hiring range is $211,500 to $306,600. For California-based roles, the base salary hiring range is $230,800 to $334,600.

Category
Backend Engineering
Software QA & Testing
Software Engineering
Required Skills
Kubernetes
Python
Machine Learning
Docker
AWS
Go
Jenkins
Development Operations (DevOps)
Linux/Unix
Requirements
  • Bachelor’s degree or equivalent experience in Computer Science, Engineering, or a related technical field.
  • Demonstrated expertise in implementing robust reliability processes across full-stack, end-to-end ML platforms, with in-depth understanding of Generative AI architecture and systems.
  • 8+ years of experience in production support and triaging roles with a focus on end-to-end, infrastructure, and operational reliability.
  • Experience in DevOps or data center management roles with expertise in Linux system engineering.
  • Strong knowledge of cloud services (AWS preferred), container technologies (Docker, Kubernetes), and CI/CD tools (Jenkins, GitLab).
  • Proficiency in scripting languages (Python, Shell, Golang) and knowledge of AI model deployment and scaling.
Responsibilities
  • Passion for triaging and solving complex problems in production systems.
  • You will establish the reliability process and collaborate closely with lead engineers.
  • Multi-System Debugging and Triage (must-have): Agentforce integrates multiple Salesforce platforms, such as Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, in addition to LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. Expertise in diagnosing and triaging performance and scalability issues across these diverse systems and vendors, as well as addressing scaling challenges, is essential.
  • Capable of investigating alerts and customer-reported issues, comprehensively analyzing the end-to-end stack. This includes first-level triage to assess all systems involved in a specific use case, identifying root causes, and generating detailed reports. Escalate to relevant engineering contacts and work to resolve the issue when necessary.
  • Production Support & Issue Triage: Lead and shape the production triage process for Agentforce, focusing on service, infrastructure deployment, configuration, performance, and latency issues.
  • Collaborate with cross-functional teams and external partners to ensure scalable and reliable services.
  • Maintain comprehensive documentation of production issues, workflows, and areas for improvement.
  • Infrastructure & Scaling Management: Understand and support capacity modeling and forecasting to ensure adequate capacity for Agentforce services in production.
  • Drive the scaling of Large Language Models (LLMs) and associated services in production so that it stays in line with projected capacity requirements based on usage patterns. Regularly review chatbot and AI model utilization and optimize capacity based on usage trends to prevent outages.
  • Automation & Operational Excellence: Create and maintain playbooks and detailed knowledge articles for future analysis and troubleshooting. Automate manual processes to maintain high availability and repeatability of production systems.
  • Monitoring & Trust Management: Use the availability and trust dashboards, and adjust SLOs and SLIs based on production feedback.
  • Identify automation gaps in production and compare them against the established critical user journey (CUJ) benchmarks for reliability and trustworthiness.
  • Cross-functional Collaboration: Establish strong partnerships with the Customer Support Groups (CSG) team to streamline escalations and minimize disruptions.
  • Be part of the 24x7 on-call support and multi-GEO coverage to maintain service reliability during peak periods.
  • Stakeholder Collaboration: Collaborate with business and engineering stakeholders for operational excellence, processes, and SLAs. Drive improvements based on key metrics, KPIs, and customer feedback.
Desired Qualifications
  • Experience in leading large-scale AI applications and services, including monitoring and diagnostic techniques.
  • Expertise in deploying and leading LLMs and technologies like Retrieval-Augmented Generation (RAG).
  • Background in monitoring tools such as Splunk, Prometheus, Grafana, and ELK stack.
  • Knowledge of Java profilers (e.g., Java Flight Recorder) and OpenTelemetry.
  • Knowledge of TCP/IP networking protocols and infrastructure services in IaaS environments.
  • Familiarity with MLOps tools and practices for supporting the machine learning lifecycle.
  • AWS or Salesforce certifications are a plus.

Salesforce offers cloud-based software solutions that focus on Customer Relationship Management (CRM). Its main product, Customer 360, provides tools for businesses to manage customer interactions across various functions like marketing, sales, and service. The company operates on a subscription model, allowing clients to access its services without the need for expensive installations, and it stands out by offering customizable solutions tailored to different industries. Salesforce aims to help businesses enhance customer satisfaction and drive growth through effective relationship management.

Company Size

10,001+

Company Stage

IPO

Headquarters

San Francisco, California

Founded

1999

Simplify Jobs

Simplify's Take

What believers are saying

  • Salesforce's $1 billion investment in Singapore expands its Southeast Asian presence.
  • The launch of prebuilt AI agents positions Salesforce as a key player in healthcare.
  • Collaboration with Singapore Airlines showcases Salesforce's AI-powered customer service solutions.

What critics are saying

  • Layoffs of 10,000 employees may impact innovation and service delivery.
  • Increased competition from Microsoft in healthcare could challenge Salesforce's market share.
  • $1 billion investment in Singapore could strain financial resources if returns don't materialize.

What makes Salesforce unique

  • Salesforce's Customer 360 offers a comprehensive suite of CRM applications.
  • The subscription-based model provides a steady revenue stream and continuous innovation.
  • Salesforce tailors solutions to meet specific industry needs, enhancing customer satisfaction.

Benefits

Health Insurance

Life Insurance

401(k) Retirement Plan

Remote Work Options

Flexible Work Hours

Parental Leave

Wellness Program

Growth & Insights and Company News

Headcount

6 month growth

1%

1 year growth

7%

2 year growth

-1%
Marketing Interactive
Mar 14th, 2025
Singapore Airlines picks Salesforce for AI-powered customer service

This collaboration will integrate Salesforce technologies, including Agentforce, Einstein in Service Cloud, and Data Cloud into Singapore Airlines' customer case management system, with an aim to enhance the personalisation and consistency of customer services provided by the airline.

Yahoo Finance
Mar 12th, 2025
Last week, Salesforce announced a significant collaboration with Singapore Airlines, integrating its technologies to enhance customer service through AI solutions.

Macho Levante
Mar 12th, 2025
Salesforce's Billion-Dollar Bet: Pioneering AI Evolution in Singapore

Salesforce, the American colossus in cloud software, has just announced a bold $1 billion investment in this Southeast Asian hub.

Gapingvoid
Mar 11th, 2025
Big Tech: Proceed with Caution

Recently, Salesforce laid off 10,000 employees.

CXO Today
Mar 11th, 2025
Pothys Swarna Mahal Collaborates with Salesforce to Elevate Customer Experience

Pothys Swarna Mahal collaborates with Salesforce to elevate customer experience.

INACTIVE