[Remote] Senior Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. HavocAI is a leader in collaborative autonomy, focused on solving complex human problems through advanced technology. The company is seeking a Senior Site Reliability Engineer to ensure the availability, performance, and resilience of mission-critical services, and to collaborate with teams across engineering to improve operational maturity and reliability standards.

Responsibilities

Design and evolve reliability architecture for distributed and cloud-hosted systems
Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
Partner with platform and application teams to design systems for reliability, scalability, and operability
Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
Conduct root cause analysis for complex production incidents and drive long-term corrective actions
Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
Reduce operational toil through tooling, automation, and process improvements
Help build a culture of ownership, accountability, and continuous improvement across production systems
Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
Ensure services and data pipelines are observable, debuggable, and performant in production
Drive performance analysis and tuning across infrastructure, application, and service layers
Improve alert quality, reduce noise, and ensure operational signals are actionable
Partner with engineering teams to define meaningful reliability and performance metrics
Build automation to improve system reliability, deployment safety, and recovery processes
Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
Support and improve Kubernetes-based environments and containerized workloads
Contribute to infrastructure-as-code practices and platform automation
Help define operational standards for cloud infrastructure, deployment workflows, and production services
Collaborate with security teams to ensure secure and resilient system design
Participate in disaster recovery planning, backup strategy, and resilience testing
Maintain strong operational practices around access control, secrets management, change management, and production access
Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases

Skills

7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
Strong experience operating large-scale distributed production systems
Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
Hands-on experience with Kubernetes and container orchestration
Programming or scripting experience in Go, Python, or similar languages
Experience designing and operating observability systems for production environments
Proven ability to lead incident response and drive reliability improvements
Strong communication skills and ability to collaborate across engineering teams
Ability to operate calmly and effectively under pressure
Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required
Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
Familiarity with AWS and large-scale cloud infrastructure
Experience with chaos engineering, fault injection, or resilience testing
Knowledge of CI/CD systems and progressive delivery practices
Experience working in high-reliability, safety-critical, defense, or mission-critical environments
Experience with Infrastructure as Code tools such as Terraform or Pulumi
Experience with Prometheus, Grafana, OpenTelemetry, Datadog, ELK/OpenSearch, or similar observability tools

Benefits

100% Employer-Paid Health, Dental, and Vision Insurance for you and your family
Life Insurance (Employer Paid)
Ability to participate in the company's 401(k) program (with matching)
Unlimited PTO policy with an enforced 2-week minimum
Equity Package
Work / Home Office Stipend
Global Entry
16 Weeks of Paid Parental Leave
Monthly Health and Wellness Stipend
