Senior Systems Reliability Engineer (Technical Support & Customer Facing)
ThoughtSpot
Systems Reliability Engineer (Technical Support)
About Us
ThoughtSpot is an AI-powered analytics platform that enables users to explore and analyze data through natural language queries, making insights accessible to all. Our mission is to deliver reliable, high-performing applications that empower our customers.
The Role
As part of the ThoughtSpot SRE team, you will be on the cutting edge of operational intelligence. You will not only ensure service reliability but also act as a trusted partner for our customers — proactively leveraging AI/ML to deliver timely updates, meaningful solutions, and predictive improvements. You are the bridge between our customers and engineering, combining deep systems expertise with a genuine passion for customer success. If you thrive in dynamic environments and are committed to building resilient, self-optimizing systems, this role is for you.
What You'll Do
Technical & Customer Support
- Act as the primary point of contact for customer-facing technical issues related to our SaaS platform, including data connectivity, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges.
- Understand and empathize with the challenges ThoughtSpot users face, offering tailored solutions to improve their experience.
- Provide timely, accurate, and clear updates to customers, consistently meeting SLAs and driving issues through to full resolution via tickets and calls.
- Translate complex technical issues into clear, concise updates for both technical and non-technical stakeholders.
- Create and maintain knowledge-base articles to empower customer self-service and improve support efficiency.
System Reliability & Monitoring
- Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure using tools like Grafana, Prometheus, Datadog, and Splunk.
- Monitor system health and performance through metrics, logs, and dashboards to detect and prevent issues proactively.
- Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting to enhance service reliability and reduce Mean Time to Resolution (MTTR).
- Understand and apply NetOps and SecOps principles for cloud and on-premises deployments.
- Develop and implement automation and best practices to streamline operations and strengthen system reliability.
- Optimize SRE workflows with AI tools to boost operational effectiveness.
Incident Management & Continuous Improvement
- Participate in on-call rotations, lead incident reviews, and conduct thorough root cause analyses to drive continuous improvement.
- Work cross-functionally with Engineering to define and implement tools that enhance debuggability, supportability, availability, scalability, and performance.
- Build deep expertise in both cloud and on-premises infrastructure, and apply it by developing automation and best practices.
What You'll Bring
- B.S. in Computer Science or equivalent relevant experience.
- Proven experience troubleshooting complex Linux systems and managing virtualization and cloud platforms (VMware, AWS, Azure, GCP).
- Hands-on experience with monitoring tools such as Grafana, Prometheus, Datadog, or Splunk.
- Demonstrated experience and a keen interest in leveraging AI/ML principles to address SRE challenges — including AIOps, predictive maintenance, and intelligent automation.
- Prior experience in enterprise customer support, including on-call rotations and incident management, with the ability to lead root cause analyses.
- Strong problem-solving and algorithmic thinking with a solid understanding of system internals.
- Excellent verbal and written communication skills with the ability to work independently and cross-functionally in fast-paced environments.
- Familiarity with scripting and programming languages such as Python, Go, Bash, or Java.
- Exposure to infrastructure and service monitoring frameworks, and the ability to analyze monitoring data to maintain high availability.
Good to Have
- Experience partnering with Engineering to design and implement mission-critical tooling and automation that advances system debuggability, high availability, elastic scalability, and performance.
- Experience designing alerting strategies and tuning monitoring systems to minimize alert fatigue and reduce Mean Time to Acknowledge (MTTA).
- Familiarity with C/C++ or other low-level systems languages.
Ideal Candidate Profile
You have a balanced mix of technical expertise in cloud operations and a proven record of handling support incidents and end-user queries. This sets you apart from candidates with purely systems or cloud engineering backgrounds. You move fluidly between deep technical investigation and customer-facing communication — equally at home diagnosing a complex infrastructure issue and presenting findings clearly to an enterprise stakeholder.
What We Offer
- Competitive salary and benefits package.
- Opportunities for professional growth and career advancement.
- A collaborative work environment where your input and expertise directly impact customer experience and platform reliability.
If you're ready to leverage your technical skills in a role that directly influences customer success and BI user satisfaction, we'd love to hear from you.