Site Reliability Engineer

Boston, MA
$110,000 - $130,000
Job Type
Direct Hire
Mar 20, 2017
As a Site Reliability Engineer, you will own foundational DevOps services and have a big impact on our product teams. The client's infrastructure, event processing, and team have grown 300% year over year, so there are always new skills to learn and technical challenges to solve the right way. This role is full-time and based in Boston.

Responsibilities:
•Design, write and deliver software to improve the availability, scalability, latency, and efficiency of services.
•Perform quantitative analysis to understand high-impact events that break product functionality and manage the cross-functional effort to resolve those events
•Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
•Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
•Uncover and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies
•Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities
•Identify and drive opportunities to improve operational workflows
•Conduct periodic on-call duties
•Educate other engineers on the best practices for building and operating highly reliable systems

Requirements:
•BA or BS Degree in Computer Science, related field, or equivalent experience
•Technical, Engineering or Quantitative background
•Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
•Experience working on team software projects
•Experience in one or more of: Python, Ruby, Go.
•Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.)

Bonus Points:
•Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems.
•Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
•Experience with Amazon Web Services (AWS) or similar cloud compute offerings
•Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
•Experience with building and scaling highly reliable distributed Python systems (we use Django extensively)
•Experience with instrumenting and monitoring production systems (Nagios, StatsD/Graphite, APM, etc.)
•Systematic problem-solving approach, coupled with a strong sense of ownership and drive