Rackspace

Site Reliability Engineer - OpenStack Private Cloud

Req #: 36683
Location(s): US-Remote, US-TX-Austin, US-TX-San Antonio
Category: Customer Relationship & Support, System Administration / Engineering

Job Overview

Overview & Responsibilities

Rackspace is seeking a Site Reliability Engineer - OpenStack Private Cloud to join our team full time. 

 

As a Site Reliability Engineer, you will work with other SREs, Engineers, Developers and our support & operations teams to ensure maximum performance, reliability and automation of our Private Cloud deployments and infrastructure.

We recognize that manual approaches to operations do not scale, and we are launching a new team in Private Cloud Engineering to tackle the significant problems of managing many discrete Private Cloud installations, with multiple offerings and form factors, at scale worldwide.

 

Our Site Reliability Engineer is someone who is familiar with both software and systems engineering and who wants not just to resolve a problem but to prevent it in the future. You should have excellent written and verbal communication skills and be comfortable operating in a fast-paced environment.

 

You will be working with many new and cutting-edge technologies, such as Kubernetes, Docker & LXC containers, software-defined networking, security tools, and other Cloud Native Computing Foundation projects, as well as our OpenStack private cloud as a service, OpenShift and Managed Kubernetes product offerings.

In addition to resolving and automating issues internally and downstream, you will submit patches to the upstream projects whenever an issue is better served by a fix in the upstream open source code, improving their operational and reliability aspects.

 

Responsibilities:

  • Design, architect, and maintain operational solutions for managing our customer environments and infrastructure across data centers and technologies, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
  • Implement and maintain monitoring and alerting solutions that help discover failures in a timely fashion, while working with engineers to identify root causes and fix issues.
  • Provide basic to intermediate network administration and troubleshooting.
  • Handle day-to-day operational management, including response, incident, event and problem management activities, alongside our service delivery and engineering teams.
  • Participate in on-call rotation duties.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
  • Support services & deployments before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.

Qualifications

  • Experience with algorithms, data structures, complexity analysis and software design.
  • Experience in one or more of the following is a must: Python, Go, or cross-platform scripting.
  • Experience with Linux systems administration and tuning.
  • Experience with automation tools such as Docker, Jenkins, Ansible, and Terraform.
  • Understanding of, and hands-on experience implementing, Docker and other container-based systems.
  • Experience in one or more of the following: OpenStack, OpenShift, Kubernetes, Docker/Docker Swarm.
  • Comfort with collaboration, open communication and remote teams.

Preferred qualifications:

  • Interest in designing, analyzing and troubleshooting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Ability to debug and optimize code and automate routine tasks.
  • Treat infrastructure and automation as code and as critical engineering tasks.

 
