As a Site Reliability Engineer, you will work with other SREs, Engineers, Developers and our support & operations teams to ensure maximum performance, reliability and automation of our Private Cloud deployments and infrastructure.
We recognize that manual approaches to operations do not scale, and are launching a new team in Private Cloud Engineering to tackle the significant problems of managing many discrete Private Cloud installations with multiple offerings and form factors at scale worldwide.
Our Site Reliability Engineer is someone who is familiar with both software and systems engineering, with a desire not just to resolve a problem but to prevent it from recurring. You should have excellent written and verbal communication skills and be comfortable operating in a fast-paced environment.
In this role, you will be focused on our various private cloud product offerings, mostly around Rackspace Private Cloud - OpenStack itself, but the role will include other Private Cloud product offerings in the future. This is a mix of different configurations and deployment methods offered as part of the private cloud product. The role requires understanding how OpenStack is installed, configured, and upgraded, as well as operational expertise and debugging skills.
In addition to resolving and automating issues internally and downstream, you will submit patches upstream when a problem is better served by a fix in the upstream open source code, improving the operational and reliability aspects of the upstream projects.
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Rackspace's managed service offerings & customer deployments have reliability and uptime appropriate to users' needs and a fast rate of improvement while monitoring and validating capacity and performance. SRE focuses on reliability, scalability, and the development of automation to manage repetitive tasks at scale.
Supports high-complexity deployments and internal teams on an as-needed basis. Responsible for the roll-out and operations of medium-complexity systems automation. Collaborates with other teams on tools for systems automation. Works in conjunction with multiple teams to ensure uptime and reliability of customer deployments.
Experience in one or more of: Ansible, Chef, Puppet, C, C++, Java, Perl, Ruby, Python, Bash or Go.
Intermediate experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols.
Networking: e.g. TCP/IP, UDP, ICMP, MAC addresses, IP packets, DNS, SDN, OSI layers, and load balancing.
Expertise in designing, analyzing and troubleshooting large-scale distributed systems.
Intermediate knowledge of operating systems.
Familiarity with algorithms, data structures, and complexity analysis.
Intermediate experience designing complex SaaS applications for cloud reliability and scalability.
Strong experience with GCP, AWS, or OpenStack APIs, or with OpenStack administration.
Intermediate experience with cloud infrastructure automation and CI/CD pipeline design.
Expertise in operational monitoring and management tools (Nagios, Datadog, etc.).
Intermediate written & verbal communication skills, both highly technical and non-technical.
Ability to work closely with non-technical stakeholders and executives.
Systematic problem-solving approach, coupled with a strong sense of ownership and drive.
Additional skills may be required depending on role; for example, Ansible, iPXE, Jenkins, and other modern tools/technologies.
EXPERIENCE/EDUCATION: High school diploma or equivalent required. Bachelor's degree in Computer Science or equivalent experience. Usually requires 5+ years of information systems design/architecture/development experience. May require additional certifications depending on specialization.