OpenShift Site Reliability Engineer

FormusPro offers awesome career opportunities in software engineering, data science and much, much more.

United Kingdom

Hybrid

Full Time

Infrastructure

We are seeking a talented Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of the OpenShift Virtualization platforms that deliver enterprise VM hosting services. This role combines software engineering and systems administration to build highly reliable and scalable systems, automate operations, and drive continuous improvement through SRE principles.

You will define and monitor Service Level Objectives (SLOs), implement observability solutions, respond to incidents, automate toil, and conduct chaos engineering experiments to validate platform resilience. This is a hands-on role that requires strong technical skills, an automation mindset, and a passion for operational excellence.

This is a full-time, hybrid role.

To be successful in the role, you will need to be a good team player with excellent communication skills, have the ability to manage your own workload, and work well on your own initiative and under direction.

Key Responsibilities

Service Level Management & Reliability

Define, track, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets
Monitor platform reliability metrics and identify areas for improvement
Establish SLO-based alerting strategies to reduce noise and alert fatigue
Conduct quarterly SLO reviews with stakeholders
Balance velocity and stability through error budget policies
Build reliability dashboards for real-time platform health visibility

Observability & Monitoring

Design, implement, and maintain the observability stack (Prometheus, Grafana, EFK, Alertmanager)
Build and optimize Grafana dashboards for infrastructure, platform, and application metrics
Write and tune PromQL queries for complex metric analysis
Configure intelligent alerting with proper thresholds and escalation policies
Implement distributed tracing for VM performance troubleshooting
Develop custom exporters and instrumentation for specialized monitoring needs
Maintain centralized logging infrastructure (Elasticsearch, Fluentd, Kibana)

Incident Response & Management

Serve as Incident Commander for critical platform incidents (Sev 1/2)
Lead blameless postmortem reviews and drive action item closure
Perform deep root cause analysis using data-driven approaches
Coordinate cross-team incident response and communication
Develop and maintain incident runbooks and playbooks
Track MTTR, MTTD, and other incident metrics
Participate in a 24×7 on-call rotation for production support

Automation & Toil Reduction

Identify and eliminate toil through automation and process improvements
Build automation for repetitive operational tasks (Ansible, Python, Bash)
Develop self-healing automation for common failure scenarios
Implement event-driven automation using OpenShift operators
Create automated remediation workflows integrated with monitoring
Measure and report on toil reduction metrics quarterly

Platform Performance & Optimization

Conduct performance analysis and tuning of OpenShift platform components
Optimize VM performance (CPU, memory, storage, network)
Tune Red Hat CoreOS kernel parameters for hypervisor workloads
Analyze resource utilization patterns and recommend rightsizing
Identify and resolve performance bottlenecks
Implement capacity forecasting models
Optimize storage performance (ODF/Ceph tuning, IOPS optimization)

Chaos Engineering & Resilience Testing

Design and execute chaos engineering experiments to validate platform resilience
Conduct regular failure injection testing (node failures, network partitions, storage disruptions)
Test disaster recovery procedures and document findings
Validate backup/restore processes through regular DR drills
Build automated chaos testing pipelines
Create gameday scenarios for team preparedness

Knowledge, Skills and Experience

Must-Have Skills & Experience

Experience Requirements:

5-8 years of overall IT infrastructure and operations experience
3+ years of hands-on experience with Red Hat OpenShift Container Platform (4.x)
2+ years of SRE, DevOps, or Platform Reliability Engineering experience
2+ years of experience with OpenShift Virtualization, KubeVirt, or VM platforms
2+ years of experience with Linux system administration (RHEL/CentOS/Ubuntu)

Technical Skills:

Advanced OpenShift administration and troubleshooting
Advanced observability platforms (Prometheus, Grafana, PromQL, Alertmanager)
Advanced logging systems (EFK/ELK stack, Loki, Splunk)
Expert automation scripting (Python, Bash, Go)
Advanced Ansible for automation and configuration management
Advanced Linux performance tuning and troubleshooting
Intermediate OpenShift Virtualization (VM lifecycle, performance, troubleshooting)
Intermediate Red Hat CoreOS and systemd
Intermediate Kubernetes concepts (Operators, Custom Resources, Controllers)
Intermediate storage systems (ODF/Ceph, block/file storage, performance tuning)
Intermediate networking (OVN-Kubernetes, TCP/IP, network troubleshooting)

SRE Practices:

Strong understanding of SRE principles (SLO/SLI, error budgets, toil management)
Experience with incident management frameworks and tooling (ITIL, PagerDuty)
Proven track record of reducing MTTR and improving platform reliability
Experience conducting blameless postmortems
Knowledge of chaos engineering principles
Experience with capacity planning and forecasting

Software Engineering:

Strong coding skills in Python, Go, or similar languages
Experience with Git and version control workflows
Understanding of CI/CD pipelines and automation
Ability to write clean, maintainable, and testable code

Certifications Required (one or more):

Red Hat Certified Engineer (RHCE)
Red Hat Certified Specialist in OpenShift Administration
OR equivalent demonstrable experience

Desirable Skills & Experience

Highly Desirable:

Red Hat Certified Architect (RHCA) certification
Certified Kubernetes Administrator (CKA) certification
Experience with Google SRE practices and methodologies
GitOps expertise (ArgoCD, Flux)
Experience with chaos engineering tools (Litmus, Chaos Mesh, Gremlin)
Advanced knowledge of Grafana Loki for log aggregation
Experience with distributed tracing (Jaeger, Tempo, OpenTelemetry)
Proficiency in Go programming for operator development

Nice to Have:

Professional-level cloud architect certification (AWS, Azure, GCP)
Experience with machine learning for anomaly detection
Knowledge of APM tools (Dynatrace, New Relic, AppDynamics)
Experience with load testing tools (k6, Locust, JMeter)
Familiarity with FinOps and cost optimization
Experience with multi-cluster management (RHACM)
Public speaking experience at conferences or meetups
Contributions to open source projects

Key Success Metrics

Platform reliability: Meet or exceed 99.9% availability SLO
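
To put that target in concrete terms, the short Python sketch below converts an availability SLO into an error budget. It assumes a 30-day rolling window and counts only full-outage minutes; the posting states neither, and the function names are illustrative only.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO permits within the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = allowed_downtime_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    # Assumed 30-day window: 99.9% leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
    # Hypothetical example: 10.8 minutes of outage already recorded.
    print(round(budget_remaining(99.9, 10.8), 2))
```

Under these assumptions, a 99.9% SLO allows roughly 43 minutes of downtime per 30 days, which is the budget that error budget policies would trade against release velocity.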

Ready To Join The Team?

Our amazing team is growing rapidly and we are always looking for talented individuals.

Company Benefits

In addition to the competitive salary, you will also be entitled to the following benefits:

Time allocated for personal development and training
Pension contributions
Death in Service Cover
Income Protection
20 days holiday per annum, plus 5 lifestyle days and the option to buy additional days
Access to wellbeing and legal assistance 24 hours a day, 7 days a week
Access to a range of discounts, including supermarket savings, discounted days out, your daily coffee, or a summer holiday, with something to suit every lifestyle
Company social events throughout the year
Access to an electric vehicle scheme
Access to private medical insurance

Why join FormusPro

Empower. Impact. Change.

That’s our motto, and we mean it. Our focus is helping our clients solve problems with Microsoft technology that enables them to achieve their goals. We’re a technical bunch (and proud of it). That combination of how we partner with our clients and our technical DNA has created an environment that allows our team to thrive.