Scalable SRE Practices for AI Service Reliability: Monitoring and Alerting in Production ML Systems
Scalable Site Reliability Engineering (SRE) practices prioritise integrating software engineering norms to maintain highly scalable and reliable software systems. Traditional SRE models, however, did not specifically address ML-specific issues such as data drift, model degradation, and inference latency. This paper investigates SRE initiatives to improve the monitoring and alerting of production ML systems. The aim is to establish actionable insights that narrow the reliability gap between conventional software systems and emerging ML-driven infrastructures. Using a mixed-method approach that combines case studies and performance graphs, the study identifies the essential reliability measures, analyses state-of-the-art observability architectures, and examines integration risks. The analysis informs the design of ML-aware diagnostics, automation, and adaptive alerting systems. Furthermore, the paper recommends applications of ML-aware SRE, automated monitoring, and related practices.