Technology Interest: Cloud and Data Center, Networking, Software Development, Testing
Area of interest: Engineer - Software
Job type: Professional
The Meraki cloud serves millions of customer devices from 8 datacentres around the world. As a Senior Site Reliability Engineer on the Observability team, you will be responsible for designing useful, scalable and secure monitoring systems that make sure we stay online. You’re passionate about data and about using automation to raise the bar.
In this role you will join a small engineering team based in our London, UK office. You will lead the design, development and operation of the monitoring, log/event collection and metric processing systems that support our private cloud. We believe in automating manual tasks with the right tools.
As SREs at Meraki, we are responsible for building and scaling the cloud that supports millions of Meraki devices across the world. Meraki’s customer base has grown by a factor of 2-3 every year, and we serve more than 4 billion HTTP requests per day across six datacentres. Our customers depend on our products to run their critical infrastructure of network switches, security appliances, wireless APs and security cameras. We embrace the *nix way, automate away tedious tasks and build infrastructure as code.
Example projects of a Senior Site Reliability Engineer (Observability):
- Lead the discussion around our Graphite architecture to handle the next five years of metric growth.
- Design and build ElasticSearch clusters holding 10-1000TB of data, for a variety of use cases.
- Gather requirements for, design and build an alerting system that lets developers construct alerts from multiple data sources and alerting workflows.
- Develop comprehensive meta-monitoring tools that provide new insights into our complex event and metric pipelines.
- Write libraries and APIs that give other developers a simple, unified interface to our monitoring, logging and event processing systems.
- Automate cluster scaling so monitoring resources can be requested and automatically deployed.
You are an ideal candidate if you:
- Have 6+ years of experience designing, deploying and operating mid- to large-scale enterprise or cloud environments.
- Have 3+ years of experience scripting or coding in languages like Ruby, Scala, Python or Bash.
- Fearlessly dive into other people's source code to solve a problem.
- Know your way around *nix systems. We run Debian.
- Consult with other teams on how they can better monitor their services, and evangelise best practice.
- Automate all the things.
- Care about and empathise with the customer experience, and have experience supporting an externally facing production environment, ideally in a team that follows the sun.
- Bonus points for experience with: ElasticSearch, Logstash, Kibana, Graphite, Grafana, statsd, collectd, Snowflake, Ansible, Ruby.
Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, ElasticSearch, Logstash, Kibana, ELK, Grafana, Graphite, statsd, collectd, Snowflake, Ansible, Ruby.
Cisco is an Affirmative Action and Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis. Cisco will consider for employment, on a case by case basis, qualified applicants with arrest and conviction records.