Course Outline

Foundations of Agentic Systems in Production

  • Agentic architectures: loops, tools, memory, and orchestration layers.
  • The lifecycle of agents: development, deployment, and continuous operation.
  • Challenges associated with managing agents at production scale.

Infrastructure and Deployment Models

  • Deploying agents within containerized and cloud environments.
  • Scaling patterns: horizontal vs. vertical scaling, concurrency, and throttling.
  • Multi-agent orchestration and workload balancing.

Monitoring and Observability

  • Key metrics: latency, success rate, memory usage, and agent call depth.
  • Tracing agent activity and call graphs.
  • Instrumenting observability using Prometheus, OpenTelemetry, and Grafana.
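
As an illustration of the key metrics listed above, the following is a minimal, standard-library-only sketch of how latency, success rate, and call depth might be aggregated per agent. The `AgentMetrics` class and its fields are hypothetical names chosen for this example; in a real deployment these values would more likely be exported through a Prometheus or OpenTelemetry client than held in memory.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AgentMetrics:
    """Hypothetical in-memory aggregator for one agent's run statistics."""

    latencies_s: list = field(default_factory=list)
    successes: int = 0
    failures: int = 0
    max_call_depth: int = 0

    def record(self, latency_s: float, ok: bool, call_depth: int) -> None:
        """Record one completed agent run."""
        self.latencies_s.append(latency_s)
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        # Track the deepest chain of nested tool or agent calls seen so far.
        self.max_call_depth = max(self.max_call_depth, call_depth)

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0

    @property
    def mean_latency_s(self) -> float:
        return mean(self.latencies_s) if self.latencies_s else 0.0


metrics = AgentMetrics()
metrics.record(latency_s=0.2, ok=True, call_depth=1)
metrics.record(latency_s=0.4, ok=False, call_depth=3)
print(metrics.success_rate, metrics.max_call_depth)
```

Keeping the aggregation separate from the export path makes it straightforward to swap the in-memory lists for real counters and histograms later.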

Logging, Auditing, and Compliance

  • Centralized logging and structured event collection.
  • Compliance and auditability within agentic workflows.
  • Designing audit trails and replay mechanisms for debugging.

Performance Tuning and Resource Optimization

  • Reducing inference overhead and optimizing agent orchestration cycles.
  • Model caching and lightweight embeddings for faster retrieval.
  • Load testing and stress scenarios for AI pipelines.

Cost Control and Governance

  • Understanding cost drivers for agents: API calls, memory, compute, and external integrations.
  • Tracking agent-level costs and implementing chargeback models.
  • Automation policies to prevent agent sprawl and idle resource consumption.
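
The agent-level cost tracking and chargeback ideas above can be sketched roughly as follows. This is an illustrative example only: the `UsageEvent` record, the `chargeback` function, and the per-token prices are all assumptions made for the sketch, not figures or APIs from any provider.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICE_PER_1K_INPUT = 0.01
PRICE_PER_1K_OUTPUT = 0.03


@dataclass(frozen=True)
class UsageEvent:
    """One model call made by an agent on behalf of a team (assumed schema)."""

    agent_id: str
    team: str
    input_tokens: int
    output_tokens: int


def event_cost(event: UsageEvent) -> float:
    """Cost of a single call under the assumed per-token prices."""
    return (
        event.input_tokens / 1000 * PRICE_PER_1K_INPUT
        + event.output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )


def chargeback(events: list) -> dict:
    """Sum call costs per team so each team can be billed for its agents."""
    totals = defaultdict(float)
    for event in events:
        totals[event.team] += event_cost(event)
    return dict(totals)


events = [
    UsageEvent("agent-1", "search", input_tokens=1000, output_tokens=500),
    UsageEvent("agent-2", "search", input_tokens=2000, output_tokens=0),
    UsageEvent("agent-3", "support", input_tokens=0, output_tokens=1000),
]
print(chargeback(events))
```

The same grouping could be keyed on `agent_id` instead of `team` to find individual agents that are driving spend.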

CI/CD and Rollout Strategies for Agents

  • Integrating agent pipelines into CI/CD systems.
  • Testing, versioning, and rollback strategies for iterative agent updates.
  • Progressive rollouts and safe deployment mechanisms.

Failure Recovery and Reliability Engineering

  • Designing for fault tolerance and graceful degradation.
  • Retry, timeout, and circuit breaker patterns for agent reliability.

  • Incident response and post-mortem frameworks for AI operations.
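
To make the retry and circuit breaker patterns mentioned above concrete, here is a minimal sketch that combines a simple circuit breaker with exponential-backoff retries. The class and function names, thresholds, and delays are assumptions for illustration; per-call timeouts are left out to keep the example short.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, so one more failure reopens it.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def call_with_retry(breaker, fn, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Retry fn through the breaker with exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # retrying against an open circuit is pointless
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

Passing `clock` and `sleep` as parameters is a design choice that keeps the sketch testable without real waiting.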

Capstone Project

  • Build and deploy an agentic AI system with comprehensive monitoring and cost tracking.
  • Simulate load, measure performance, and optimize resource usage.
  • Present the final architecture and monitoring dashboard to peers.

Summary and Next Steps

Requirements

  • Solid understanding of MLOps and production machine learning systems.
  • Experience with containerized deployments (Docker/Kubernetes).
  • Familiarity with cloud cost optimization and observability tools.

Target Audience

  • MLOps engineers.
  • Site Reliability Engineers (SREs).
  • Engineering managers overseeing AI infrastructure.

Duration: 21 Hours
