AWS Infrastructure Engineer
Confirmed live in the last 24 hours
United States
Experience Level
Desired Skills
Customer Service
Development Operations (DevOps)
  • A skilled Engineer. At least 5 years in Cloud Operations/Platform Services with a keen interest in solving problems using automation
  • Understand SRE and DevOps methodologies. You understand the build and deployment cycle of an application, and how to operate a resilient system
  • Strong experience in Incident & Event Management (NOC, App Support…)
  • Experience supporting and troubleshooting 24x7 high-volume transactional web applications
  • Knowledge of Windows and Linux systems
  • Experience with cloud infrastructure and platform services (we run on AWS)
  • Familiarity with Terraform and IaC best practices
  • Solid experience with GitHub or other version control systems
  • Experience with APM systems such as Dynatrace, AppDynamics and/or New Relic
  • Experience with alerting tools such as Grafana OnCall, PagerDuty or OpsGenie
  • Experience with scripting languages such as Python, Bash and PowerShell
  • Strong verbal and written communication skills. Ability to take ownership of issues
  • Systematic problem-solving approach. You should understand how to analyze and troubleshoot large-scale distributed systems
  • Happy in the Clouds. Our Cloud Native platform is hosted on AWS. You'll be comfortable working with a system that supports users from around the world, at scale. Experience working for a digital company delivering real-time transactional services (Finance/regulated) is preferred
  • Bias for action. You see a problem, you fix a problem. You get buy-in for your solutions and keep tickets moving. We're always looking for ways to ship at pace
  • Growth mindset. A willingness to use your skills and experience to mentor less-experienced engineers. A desire to learn from others and make yourself better every day
  • Agile outlook. You need to be excited about working in a fast-changing environment. Products, tools, frameworks and processes change; we evolve and take the best bits with us. The teams drive the evolution
  • Monitor our Production systems and react to alerts swiftly
  • Work with the Tech teams to ensure 24x7 availability of our product platform
  • Participate in the development of our monitoring & alerting strategies with the SRE team across multiple cloud environments, in particular AWS, using advanced monitoring tools like Grafana, AppDynamics and Splunk
  • Experience with cloud and on-premises infrastructure, including the challenges and considerations of migrating workloads from on-premises to AWS. Understanding of SQL, Windows Server, Active Directory, DNS, VMware and networking, and a willingness to research and advise on new technologies and developments
  • Manage incidents: categorization, triage, resolution and escalation
  • Communicate appropriately with our business stakeholders on incidents (Customer Service…)
  • Participate in an on-call/shift rotation
  • Use code to solve problems. Configuration, infrastructure, tooling, and automation: everything must be solved by writing high-quality code that performs and scales
  • Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change and troubleshooting across all our Tech services
  • Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted
  • Lead or be involved in the troubleshooting of complex incidents and problems
  • Maintain end-to-end visibility of the service we provide to our customers, and ensure their journey is stable and consistent across all microservices and third-party dependencies, using the observability tooling you will have implemented with the Engineering teams
  • Perform various technical operations in collaboration with the DevOps and Infrastructure teams (patching, log management, space management…)
  • Develop various technical runbooks in collaboration with other tech teams
  • Participate in the continuous improvement of our operational processes (Incident, Problem, Change…)
  • Provide input to Post-Incident Reviews / Post-Mortems and take the initiative to prevent and reduce incidents

51-200 employees