OpenShift Site Reliability Engineer

FormusPro offers awesome career opportunities in software engineering, data science and much, much more.

United Kingdom
Hybrid
Full Time

Infrastructure

We are seeking a talented Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of the OpenShift Virtualization platforms that deliver enterprise VM hosting services. This role combines software engineering and systems administration to build highly reliable and scalable systems, automate operations, and drive continuous improvement through SRE principles.

You will define and monitor Service Level Objectives (SLOs), implement observability solutions, respond to incidents, automate toil, and conduct chaos engineering experiments to validate platform resilience. This is a hands-on role that requires strong technical skills, an automation mindset, and a passion for operational excellence.

This is a full-time, hybrid role.

To be successful in the role, you will need to be a good team player with excellent communication skills, have the ability to manage your own workload, and work well on your own initiative and under direction.

Key Responsibilities

Service Level Management & Reliability

  • Define, track, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets
  • Monitor platform reliability metrics and identify areas for improvement
  • Establish SLO-based alerting strategies to reduce noise and alert fatigue
  • Conduct quarterly SLO reviews with stakeholders
  • Balance velocity and stability through error budget policies
  • Build reliability dashboards for real-time platform health visibility

Observability & Monitoring

  • Design, implement, and maintain the observability stack (Prometheus, Grafana, EFK, Alertmanager)
  • Build and optimize Grafana dashboards for infrastructure, platform, and application metrics
  • Write and tune PromQL queries for complex metric analysis
  • Configure intelligent alerting with proper thresholds and escalation policies
  • Implement distributed tracing for VM performance troubleshooting
  • Develop custom exporters and instrumentation for specialized monitoring needs
  • Maintain centralized logging infrastructure (Elasticsearch, Fluentd, Kibana)

Incident Response & Management

  • Serve as Incident Commander for critical platform incidents (Sev 1/2)
  • Lead blameless postmortem reviews and drive action item closure
  • Perform deep root cause analysis using data-driven approaches
  • Coordinate cross-team incident response and communication
  • Develop and maintain incident runbooks and playbooks
  • Track MTTR, MTTD, and other incident metrics
  • Participate in a 24×7 on-call rotation for production support

Automation & Toil Reduction

  • Identify and eliminate toil through automation and process improvements
  • Build automation for repetitive operational tasks (Ansible, Python, Bash)
  • Develop self-healing automation for common failure scenarios
  • Implement event-driven automation using OpenShift operators
  • Create automated remediation workflows integrated with monitoring
  • Measure and report on toil reduction metrics quarterly

Platform Performance & Optimization

  • Conduct performance analysis and tuning of OpenShift platform components
  • Optimize VM performance (CPU, memory, storage, network)
  • Tune Red Hat CoreOS kernel parameters for hypervisor workloads
  • Analyze resource utilization patterns and recommend rightsizing
  • Identify and resolve performance bottlenecks
  • Implement capacity forecasting models
  • Optimize storage performance (ODF/Ceph tuning, IOPS optimization)

Chaos Engineering & Resilience Testing

  • Design and execute chaos engineering experiments to validate platform resilience
  • Conduct regular failure injection testing (node failures, network partitions, storage disruptions)
  • Test disaster recovery procedures and document findings
  • Validate backup/restore processes through regular DR drills
  • Build automated chaos testing pipelines
  • Create gameday scenarios for team preparedness

Knowledge, Skills and Experience

Must-Have Skills & Experience

Experience Requirements:

  • 5-8 years of overall IT infrastructure and operations experience
  • 3+ years of hands-on experience with Red Hat OpenShift Container Platform (4.x)
  • 2+ years of SRE, DevOps, or Platform Reliability Engineering experience
  • 2+ years of experience with OpenShift Virtualization, KubeVirt, or VM platforms
  • 2+ years of experience with Linux system administration (RHEL/CentOS/Ubuntu)

Technical Skills:

  • Advanced OpenShift administration and troubleshooting
  • Advanced observability platforms (Prometheus, Grafana, PromQL, Alertmanager)
  • Advanced logging systems (EFK/ELK stack, Loki, Splunk)
  • Expert automation scripting (Python, Bash, Go)
  • Advanced Ansible for automation and configuration management
  • Advanced Linux performance tuning and troubleshooting
  • Intermediate OpenShift Virtualization (VM lifecycle, performance, troubleshooting)
  • Intermediate Red Hat CoreOS and systemd
  • Intermediate Kubernetes concepts (Operators, Custom Resources, Controllers)
  • Intermediate storage systems (ODF/Ceph, block/file storage, performance tuning)
  • Intermediate networking (OVN-Kubernetes, TCP/IP, network troubleshooting)

SRE Practices:

  • Strong understanding of SRE principles (SLO/SLI, error budgets, toil management)
  • Experience with incident management frameworks and tooling (e.g., ITIL, PagerDuty)
  • Proven track record of reducing MTTR and improving platform reliability
  • Experience conducting blameless postmortems
  • Knowledge of chaos engineering principles
  • Experience with capacity planning and forecasting

Software Engineering:

  • Strong coding skills in Python, Go, or similar languages
  • Experience with Git and version control workflows
  • Understanding of CI/CD pipelines and automation
  • Ability to write clean, maintainable, and testable code

Certifications Required (one or more):

  • Red Hat Certified Engineer (RHCE)
  • Red Hat Certified Specialist in OpenShift Administration
  • OR equivalent demonstrable experience

Desirable Skills & Experience

Highly Desirable:

  • Red Hat Certified Architect (RHCA) certification
  • Certified Kubernetes Administrator (CKA) certification
  • Experience with Google SRE practices and methodologies
  • GitOps expertise (ArgoCD, Flux)
  • Experience with chaos engineering tools (Litmus, Chaos Mesh, Gremlin)
  • Advanced knowledge of Grafana Loki for log aggregation
  • Experience with distributed tracing (Jaeger, Tempo, OpenTelemetry)
  • Proficiency in Go programming for operator development

Nice to Have:

  • Professional-level cloud architect certification (AWS, Azure, or GCP)
  • Experience with machine learning for anomaly detection
  • Knowledge of APM tools (Dynatrace, New Relic, AppDynamics)
  • Experience with load testing tools (k6, Locust, JMeter)
  • Familiarity with FinOps and cost optimization
  • Experience with multi-cluster management (RHACM)
  • Public speaking experience at conferences or meetups
  • Contributions to open source projects

Key Success Metrics

  • Platform reliability: Meet or exceed 99.9% availability SLO

Ready To Join The Team?

Our amazing team is growing rapidly and we are always looking for talented individuals.

Company Benefits

In addition to the competitive salary, you will also be entitled to the following benefits:

  • Time allocated for personal development and training
  • Pension contributions
  • Death in Service Cover
  • Income Protection
  • 20 days holiday per annum, plus 5 lifestyle days and the option to buy additional days
  • Access to wellbeing and legal assistance 24 hours a day, 7 days a week
  • Access to a range of savings, including supermarket shopping, discounted days out, your daily coffee, or a summer holiday; there’s something to suit everyone’s lifestyle
  • Company social events throughout the year
  • Access to an electric vehicle scheme 
  • Access to private medical insurance 

Why join FormusPro

Empower.  Impact.  Change.

That’s our motto, and we mean it.  Our focus is helping our clients solve problems with Microsoft technology that enables them to achieve their goals.  We’re a technical bunch (and proud of it).  That combination of how we partner with our clients and our technical DNA has created an environment that allows our team to thrive.

Apply Now

Fill out the form, drop us your CV and we’ll be in touch to discuss working with our awesome staff. 
