Senior Site Reliability Engineer - Java/Python

🌍 Bengaluru, Karnataka, India 📅 03/28/2023

Job Description

Now, more than ever, the Toast team is committed to our customers. We’re
taking steps to help restaurants navigate these unprecedented times with
technology, resources, and community. Our focus is on building the restaurant
platform that helps restaurants adapt, take control, and get back to what they
do best: building the businesses they love. And because our technology is
purpose-built for restaurants, by restaurant people, restaurants can trust
that we’ll deliver on their needs for today while investing in experiences
that will power their restaurant of the future.

At Toast, our Site Reliability Engineers (SREs) are responsible for keeping
all customer-facing services and other Toast production systems running
smoothly. SREs are a blend of pragmatic operators and software craftspeople
who apply sound software engineering principles, operational discipline, and
mature automation to our environments and our codebase. Our decisions are
based on instrumentation and continuous observability, as well as on
predictions and capacity planning.

**About this _roll_\* (Responsibilities)**

* Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
* Partner with development teams to improve services through rigorous testing and release procedures
* Participate in system design consulting, platform management, and capacity planning
* Create sustainable systems and services through automation and uplifts
* Balance feature development speed and reliability with well-defined service level objectives

**Troubleshooting and Supporting Escalations:**

* Diagnose performance bottlenecks and implement optimizations across infrastructure, database, web, and mobile applications
* Implement strategies to increase system reliability and performance through on-call rotation and process optimization
* Run blameless root cause analyses (RCAs) on incidents and outages, aggressively looking for answers that will prevent each incident from happening again

**Do you have the right _ingredients_\*? (Requirements)**

* Polyglot technologist/generalist with a thirst for learning
* Deep understanding of cloud and microservice architecture, and the JVM
* Experience with tools such as APM, Terraform, Ansible, GitHub, Jenkins, Docker
* Experience developing software or software projects in at least four languages, ideally including two of Go, Python, and Java
* Experience with cloud computing technologies (AWS preferred)
* Extensive and broad industry experience, with 7+ years in SRE and/or DevOps roles

**\*Bread puns encouraged but not required**