This is a developed professional level role for an SRE. Individuals are responsible for challenging reliability and toil reduction... Contributes to SRE knowledge documentation Functional Competencies/Technical Skills: Design for Reliability Can support...
time, take increasing responsibility for leading incidents end-to-end. Improve operational reliability: Identify... recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process...
for observability and automation across complex, large-scale service mesh architectures. Responsibilities Engage in and improve the... and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R...
for observability and automation across complex, large-scale service mesh architectures. Responsibilities Engage in and improve the... and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R...
for observability and automation across complex, large-scale service mesh architectures. Responsibilities Engage in and improve the... and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R...
for observability and automation across complex, large-scale service mesh architectures. Responsibilities: Engage in and improve the... and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R...
for observability and automation across complex, large-scale service mesh architectures. Responsibilities: Engage in and improve the... and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R...
Owns reliability architecture and end-to-end service understanding (dependencies, failure modes, and customer journeys... readiness criteria. Drives cross-team reliability reviews and recommends design changes, runbooks, and safe rollout/rollback...
and be responsible for observability and automation across complex, large-scale service mesh architectures. Responsibilities Engage..., operation and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations...
and be responsible for observability and automation across complex, large-scale service mesh architectures. In order to enhance... and refinement Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R...
Description Responsibilities Implement observability tooling to monitor AWS EKS-based systems focusing... on performance, reliability, and scalability. Participate in on-call rotations, providing critical support as needed. Ensure timely...
career opportunity that will light a fire within you. The Senior Cloud SRE works to improve the reliability... and occurrence of outages. A Typical Day Might Include the Following: Create a new dashboard to provide observability...
). About the Role and Team: You will play a key role in modernizing critical applications with a focus on improving observability.... Reliability: Ensure the reliability, availability, and performance of applications and services. Develop and track new service...
observability, automation, and resiliency. You’ll work across both Mainframe technologies (COBOL, RPG) and modern server-based... deployments. Reliability: Ensure the reliability, availability, and performance of applications and services. Develop and track...
projects from end-to-end that are focused on managing and maintaining optimum platform infrastructure performance, reliability..., and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes...
stable, efficient, and cost-effective at global scale. We focus on enhancing the observability and operability... stability and reliability of TikTok's core services; respond quickly to production incidents and build mechanisms and platforms...
stable, efficient, and cost-effective at global scale. We focus on enhancing the observability and operability... assessment. Responsibilities: - Ensure the stability and reliability of TikTok's core services; respond quickly to production...
, and monitoring compute resources across Slurm and Kubernetes environments. Develop observability, alerting, and auto-healing systems... utilization, and data flow. Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands...
stable, efficient, and cost-effective at global scale. We focus on enhancing the observability and operability... and reliability of TikTok's core services; respond quickly to production incidents and build mechanisms and platforms to continuously...
stable, efficient, and cost-effective at global scale. We focus on enhancing the observability and operability... and reliability of TikTok's core services; respond quickly to production incidents and build mechanisms and platforms to continuously...