• Site Reliability Engineer - Managed Kubernetes - AWS/Azure (REMOTE)

    Location(s) US-TX-San Antonio | US-TX-Remote
    Req #
    Software Development, System Administration / Engineering
  • About Rackspace

    Rackspace is modernizing IT in today’s multi-cloud world. We have been honored by Fortune, Forbes, Glassdoor and others as one of the best places to work. We serve over 50% of the Fortune 100 companies & customers in 120 countries around the globe. Our achievements are powered by our people – we call them Rackers. We grow & thrive through world-class development opportunities, learning & selling bleeding-edge technologies & solutions, and most importantly, connecting with each other (the best & brightest in the industry). Are you a Racker? Join us!


    More on Rackspace


    Rackers aren’t all alike. We look different. We think uniquely. We are from many places and our beliefs & backgrounds vary. But being a Racker, a valued member of a winning team on an inspiring mission, is what connects us all. Rackers are encouraged to bring their whole selves to work every day, as we know that unique perspectives fuel innovation and enable us to best serve our customers & communities around the globe. We welcome you to apply today and want you to know that we are committed to offering equal employment opportunity without regard to age, color, disability, gender, gender reassignment or identity or expression, genetic information, marital or civil partner status, pregnancy or maternity status, military or veteran status, nationality, ethnic or national origin, race, religion or belief, sexual orientation, or any legally protected characteristic. If you have a disability or special need that requires accommodation, please let us know.

    Overview & Responsibilities

    As a Site Reliability Engineer, you will work with other SREs, Engineers, Developers and our support & operations teams to ensure maximum performance, reliability and automation of our Managed Kubernetes deployments and infrastructure on top of Azure / AWS.


    We recognize that manual approaches to operations do not scale, and we have a dedicated Site Reliability Engineering team to tackle the significant problems of managing many discrete Private Cloud and Public Cloud Kubernetes deployments, with multiple offerings and form factors, at scale worldwide.


    Our Site Reliability Engineer is someone familiar with both software and systems engineering, with a desire not just to resolve a problem but to prevent it in the future. You should have excellent written and verbal communication skills and be comfortable operating in a fast-paced environment.


    You will be working with many new and cutting-edge technologies, such as Kubernetes, Docker & LXC containers, software-defined networking, security tools, and other Cloud Native Computing Foundation projects, as well as our extended platform support for Managed Kubernetes on top of Azure / AWS.


    In addition to resolving and automating issues internally and downstream, you will submit patches upstream whenever a problem is better served by a fix in the upstream Open Source code, improving the operational and reliability aspects of the upstream projects.



    • Design, architect, and maintain operational solutions for managing our customer environments and infrastructure across data centers and technologies, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
    • Implement and maintain monitoring and alerting solutions that help discover failures in a timely fashion, while working with engineers to identify root causes and fix issues.
    • Provide basic to intermediate network administration and troubleshooting.
    • Handle day-to-day operational management, including incident response and event and problem management activities, alongside our service delivery and engineering teams.
    • Participate in on-call rotation duties.
    • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
    • Support services & deployments before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
    • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
    • Practice sustainable incident response and blameless postmortems.


    Required qualifications:

    • Experience in one or more of the following: Microsoft Azure, Amazon Web Services.
    • Experience with Kubernetes and Docker/container runtimes is a must.
    • Experience in one or more of the following is a must: Python, Go, cross-platform scripting.
    • Experience with algorithms, data structures, complexity analysis and software design.
    • Experience with Linux systems administration and tuning.
    • Experience with automation tools such as Docker, Jenkins, Ansible, Terraform.
    • Understanding of, and hands-on experience implementing, containerized systems.
    • Comfort with collaboration, open communication and remote teams.

    Preferred qualifications:

    • Interest in designing, analyzing and troubleshooting large-scale distributed systems.
    • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
    • Ability to debug and optimize code and automate routine tasks.
    • A mindset that treats infrastructure and automation as code and as critical engineering tasks.