Portfolio Jobs

Looking for your next role? Take a look at these exciting jobs at Sapphire Ventures’ portfolio companies. Our Talent team is passionate about connecting you to your dream job!

Senior Site Reliability Engineer

Dremio

This job is no longer accepting applications

Software Engineering

Lisbon, Portugal

Posted 6+ months ago

Be Part of Building the Future

Dremio is the unified lakehouse platform for self-service analytics and AI, serving hundreds of global enterprises, including Maersk, Amazon, Regeneron, NetApp, and S&P Global. Customers rely on Dremio for cloud, hybrid, and on-prem lakehouses to power their data mesh, data warehouse migration, data virtualization, and unified data access use cases. Based on open source technologies, including Apache Iceberg and Apache Arrow, Dremio provides an open lakehouse architecture enabling the fastest time to insight and platform flexibility at a fraction of the cost. Learn more at www.dremio.com.

About the role

Dremio’s SREs ensure that internal and externally visible services have reliability and uptime appropriate to users' needs and a fast rate of improvement. You will be joining a small team of experienced SREs helping to deliver a world-class experience to Dremio Cloud customers. Our systems, like many, are joint-cognitive, made up of both people and software: complex and therefore intrinsically hazardous. We understand and expect that catastrophe is always just around the corner.

What you’ll be doing

Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm
Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure
Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews
Help define and instrument service level indicators and objectives (SLIs/SLOs) with service owners in the Engineering teams. Develop SLO-based on-call strategies for service owners and their teams
Collaborate within our virtual Observability team: develop and improve observability (tracing, events, metrics, profiling, logging and exceptions) of the Dremio Cloud product
Build internal tools and pipelines for production use and to automate routine tasks, and debug and optimize code written by others. You recognize complexity and are familiar with multiple techniques to manage it, but you also recognize the potential cost of complete rewrites
Evangelize and advocate for resilience engineering and reliability practices across our organization
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity
Join an on-call rotation for systems and services that the SRE team owns
Practice sustainable incident response and post-incident investigation and analysis
Use techniques developed in and around the Learning from Incidents community
Drive the cultural, technical, and process changes to move towards a true continuous delivery model within the company

What we’re looking for

5+ years of relevant experience in the following areas: SRE, Distributed Systems, Cloud Operations, Software Engineering
Expertise in Kubernetes, Istio, Terraform, ArgoCD/Flux
Expertise with software defined networking infrastructure: dedicated and partner interconnects, VPNs, BGP
Excellent command of cloud services on GCP/AWS/Azure, CI/CD pipelines
Moderate to advanced experience in Python/Go, and at least reading knowledge of Java
Interested in designing, analyzing and troubleshooting large-scale distributed systems
A systematic problem-solving approach, coupled with strong communication skills and a sense of ownership, drive, and determination
Great ability to write, debug, and optimize code and automate routine tasks
Interested in and able to identify, conceptualize, and solve complex problems within the SRE domain, finding existing solutions or building new ones as necessary
A solid background in software development and in architecting resilient and reliable applications

Bonus points if you have

Hands-on experience with large-scale production Kubernetes clusters (>=1000 nodes)
Developed SLIs/SLOs for production systems
Hands-on experience using Honeycomb for OpenTelemetry trace analysis

What we value

At Dremio, we hold ourselves to high standards when it comes to People, Thinking, and Action. Our Gnarlies (that's what we call our employees) communicate with clarity, drive accountability, and are respectful towards each other. We confront brutal facts and focus on results while operating with a sense of urgency and building a "flywheel". People who like to jump in and drive momentum will thrive in our #GnarlyLife.

Dremio is an equal opportunity employer supporting workforce diversity. We do not discriminate on the basis of race, religion, color, national origin, gender identity, sexual orientation, age, marital status, protected veteran status, disability status, or any other unlawful factor.

Dremio is committed to providing any necessary accommodations for individuals with disabilities within our application and interview process. To request accommodation due to a disability, please inform your recruiter.

Dremio has policies in place to protect the personal information that employees and applicants disclose to us. Please click here to review the privacy notice.

Important Security Notice for Candidates

At Dremio, we uphold trust and transparency as paramount values in all our interactions with customers, partners, employees, and the general public. We have been targeted by individuals creating fake domains similar to ours to scam prospects and candidates. Please note that all official communications from us will be from an @dremio.com domain. If you suspect you've been targeted by a scam, it's imperative to report the incident to your local law enforcement agencies. For more information about this type of scam, please refer to Dremio's official statement here.
