- Lead, mentor, and grow a team of Site Reliability Engineers. Foster a culture of innovation and technical excellence where engineers feel empowered to do their best work. Provide personalized coaching, create professional development plans, and guide the careers of senior and emerging talent within the team.
- Establish equitable, sustainable on-call practices (including global coverage where applicable) that protect focus time and avoid burnout.
- Define team rituals - runbook reviews, game days, and incident retros - that reinforce quality and learning.
- Strategic Planning & Vision: Define and drive the multi-year technical strategy and vision for Tubi's observability and automation platforms. Partner with the infra lead to align Tubi's infrastructure and SRE roadmaps. Partner with tech leaders to align the SRE roadmap with business objectives. Champion a data-driven approach to reliability, using Service Level Objectives (SLOs) and error budgets to facilitate productive conversations about risk and feature velocity.
- Operational Excellence & Incident Management: Own the end-to-end availability, performance, and efficiency of our critical user-facing services. Evolve our incident response practice to reduce MTTR and increase MTBF. Champion a rigorous, blameless, and data-driven post-mortem culture to ensure we learn from both successes and failures, driving engineering teams toward systemic fixes and automation that prevent incidents from recurring.
- Streamline existing processes and practices, and collaborate with other teams to raise our production release standards.
- Define and tune a 24x7 on-call rotation for low noise and fast response; act as executive escalation partner during major incidents.
- Own disaster-recovery strategy (playbooks, failover drills, recovery simulations) and track SLO gaps with time-bound remediations.
- Financial & Vendor Management: Own the SRE budget, tooling, and headcount. Manage relationships with key third-party vendors for observability and SRE-related AI platforms, and work with the infra lead and finance team on contract negotiations to ensure value from investments.
- Cross-Functional Collaboration: Act as a key influencer and strategic partner to leaders in Software Engineering, Product Management, and Infra/Sec. Drive the adoption of SRE best practices and principles throughout the organization, ensuring new services are designed for reliability, scalability, and observability from day one.
- 8+ years of experience in a technical field, with at least 3 years in an engineering leadership position managing SRE, DevOps, or Production Engineering teams.
- A deep, principled understanding of SRE tenets, including SLIs, SLOs, error budgets, toil reduction, and capacity planning.
- Exceptional communication, negotiation, and influencing skills, with the ability to articulate complex technical concepts and strategies to both technical and non-technical stakeholders at all levels of the organization.
- A strong technical background as a hands-on software engineer or site reliability engineer prior to moving into management. Deep knowledge of AWS services (networking, IAM, EKS, ALBs/NLBs, Route 53, CloudWatch). Proven experience with Kubernetes in production (EKS preferred), including service exposure, networking, and availability engineering.
- Hands-on familiarity with modern SRE tools and technologies, including Infrastructure as Code (Terraform, Ansible), container orchestration (Kubernetes), observability platforms (Prometheus, Grafana, Datadog, Splunk), incident tooling (PagerDuty, FireHydrant), deployment-safety tooling (Argo Rollouts, LaunchDarkly), and observability standards (OpenTelemetry).
- Executive-caliber incident communication/storytelling skills (clear status, stakeholder alignment, and post-incident narratives).
- Demonstrated success in hiring, developing, and mentoring high-performing engineers, including managing senior and principal-level talent.
- Experience managing globally distributed teams and developing equitable and sustainable on-call rotation practices.
- Experience in financial planning, budget management, and vendor contract negotiation for technical infrastructure and tooling.
- AIOps Strategy Development: Developing and executing the strategy for integrating AIOps and machine learning into our observability stack. Move the team from a reactive monitoring posture to predictive maintenance and automated anomaly detection, fundamentally changing how we ensure reliability.
- Accelerating Automation with AI: Championing the effective and responsible use of AI-assisted coding tools within the SRE team. Set standards and practices to leverage these tools to accelerate automation, tooling, and infrastructure code.
- Building the Business Case: Building the techno-economic case for new AI tooling, managing vendor relationships, and ensuring cost-effective and secure implementation. Articulate ROI in terms of reduced downtime, improved efficiency, and faster incident resolution.
- Fostering Critical AI Literacy: Fostering a culture that can evaluate, debug, and learn from AI outputs, extending blameless post-mortems to AI-driven actions and recommendations.
Senior Manager, Site Reliability Engineering - Toronto - Tubi Tv
Description
Overview
About Tubi: Boldly built for every fandom, Tubi is a free streaming service that entertains over 100 million monthly active users. Tubi offers the world's largest collection of Hollywood movies and TV shows, thousands of creator-led stories and hundreds of Tubi Originals made for the most passionate fans. Headquartered in San Francisco and founded in 2014, Tubi is part of Tubi Media Group, a division of Fox Corporation.
About the Role
Site Reliability Engineering (SRE) at Tubi is not a traditional operations team. We are a software engineering organization that applies a developer mindset and toolkit to the challenges of building and running large-scale, distributed systems. Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience. We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation.
We are seeking an experienced and visionary Senior SRE Manager to lead and grow our newly built Site Reliability Engineering team. You are more than a people manager or a tech lead; you are the strategic leader responsible for architecting our reliability roadmap. You will build and mentor a team of talented engineers, foster a culture of blameless learning and continuous improvement, and champion the engineering practices that allow us to balance rapid innovation with rock-solid stability. You will be a key influencer in our engineering leadership, partnering with peers across the organization to ensure reliability is a shared responsibility and a core tenet of our engineering culture.
What You'll Do
Your Background
Preferred Qualifications (Nice-to-Haves)
The AI Mandate: Building the Future of Observability with AI
You will not just manage a team that uses AI; you will lead the charge in building an AI-native SRE function. This is a strategic mandate that requires a forward-thinking leader who understands both the potential and the pitfalls of integrating intelligent systems into critical operations. This includes:
#LI-Hybrid
EEO Statement: We are an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, gender identity, disability, protected veteran status, or any other characteristic protected by law. We will consider qualified applicants with criminal histories consistent with applicable law.