Position Introduction
We are looking for a developer who can keep the service running under heavy traffic without outages or delays: Zeta handles an enormous volume of traffic every day, comparable to that of game servers, because hundreds of thousands of people use our service for over 2 hours a day and more than 8 hours a week. This traffic is more than doubling every month. Zeta's SRE must keep the service stable, without interruptions or delays, while running a range of ongoing A/B tests smoothly and rigorously. To that end, we are looking for an engineer who can configure infrastructure efficiently and respond to changes in traffic.
You can gain experience operating an optimized AI service: Zeta's SRE directly operates and manages the AI model serving infrastructure at the core of the product, working with ML Engineers to optimize cost and speed. Because we serve our self-developed AI models (including LLMs) directly from the cloud, we use more than 100 GPUs in real time and apply a variety of cost and speed optimization techniques to them. These techniques reflect the know-how we have accumulated over more than 3 years of operating AI services. We are confident that the experience you gain in this position, now that the AI era has arrived, will become a valuable asset to your skills and career.
You can experience running the AI Data Flywheel from start to finish: It is not easy to build a successful, consistent Data Flywheel, in which better data pipelines produce data useful for AI training, that data leads to better AI models, and better models raise user satisfaction so that users generate more high-quality data. Zeta collects various events and logs occurring within the service in real time and uses them directly for product improvement and AI model training. Zeta's SRE builds and improves efficient data pipelines so that this data can be used seamlessly and appropriately. In other words, Zeta's SRE does more than operate a stable service; it contributes directly to the competitiveness of our products. Moreover, in an AI era where data is the most crucial ingredient, the experience and know-how gained from successfully operating the Flywheel will significantly strengthen your own competitiveness.
Key Responsibilities
• Engineering for stable operation and monitoring of the service
- Operates Zeta's service stably; hundreds of thousands of users spend an average of 3 hours on it daily.
- Builds a scalable AI serving infrastructure in a multi-cloud environment.
- Establishes a monitoring system capable of quickly responding to system failures.
- Resolves bottlenecks to achieve high throughput and low latency for the service.
• System/Utility Operations for DevOps
- Develops internal automation utilities for deployment, monitoring, etc.
- Automates repetitive tasks arising from service operation and infrastructure to enhance the overall productivity of the team.
• Building and operating data pipeline infrastructure
- Configures and operates pipelines that refine logs from the service infrastructure into data usable for research and analysis.
- Develops tools needed for operations while building pipelines that perform various tasks such as log streaming, data refinement, and large-scale batch processing.