Meta
Sr. Production Operations Engineer
Nov. 2023
Worked as part of a team managing a globally distributed fleet of over 12 million servers across multiple data centers. Analyzed complex technical issues spanning automated tooling, hardware failures, and network problems.
Used data-driven analysis to identify root causes, develop solutions, and implement process improvements that supported system performance, reliability, and efficiency at scale.
Developed and implemented automated action plans to streamline repair processes, reducing server downtime and improving system availability.
Served as the first point of contact for break-fix technicians, assisting with projects, repairs, and retrofits.
Debugged hardware and Linux OS issues, and contributed to process improvements and best practices in data center operations.
Mentored junior team members, helping them build technical skills and confidence while fostering a collaborative, knowledge-sharing environment.
Participated in the on-call rotation to maximize server fleet uptime and utilization rates.