The Client
Katalist is a generative AI-powered video and storyboarding platform that enables creators to transform scripts into cinematic, production-ready visuals. With consistent AI characters, dynamic scenes, and automated video generation, Katalist streamlines the journey from concept to final cut.
- Industry: Technology, Information and Internet
- Company Size: 1-10
- Country: United States
“Getting the infrastructure right, especially ML infrastructure, is a challenging and time-consuming task even for a large business with readily available resources. For startups like Katalist it can be life and death. With Cloudvisor's help we were able to get it right and make the most of the precious AWS credits we had at our disposal with minimum headache.”
Challenges
- Multi-cloud complexity – AI and supporting workloads were running across AWS, GCP, Azure, and Vercel, increasing operational overhead and reducing visibility.
- Inefficient autoscaling – Due to scaling issues in Kubernetes, infrastructure had to be provisioned for peak load, leading to higher-than-necessary costs.
- Manual deployments – Helm-based deployments were largely manual and could take up to a full day to complete.
- Security gaps – Improvements were needed in IAM configuration, CloudTrail auditing, GuardDuty monitoring, and overall account governance.
- Network architecture limitations – VPC configuration required adjustments, including NAT Gateways, VPC endpoints, and secure access mechanisms.
- Limited cost optimisation visibility – The infrastructure lacked a structured review to identify savings opportunities and architectural improvements.
Before Cloudvisor’s involvement, the client’s Kubernetes-based AI platform was operational, but there were opportunities to improve autoscaling efficiency, strengthen security controls, and streamline deployment processes to enhance agility and cost management.
Solutions
Cloudvisor started with a comprehensive infrastructure and security review, followed by targeted architectural and automation improvements focused on autoscaling, governance, security, and deployment efficiency.
Well-Architected Framework Review
A free AWS Well-Architected Framework Review was conducted to identify architectural gaps in security, reliability, automation, and cost optimisation, to define remediation priorities, and to identify potential funding opportunities for improvements.
EKS Autoscaling Optimisation with Karpenter
The node autoscaling issue in Amazon EKS was addressed by planning and configuring Karpenter to dynamically provision nodes based on workload demand, reducing the need to size for peak load.
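As an illustration of what demand-based provisioning looks like in configuration, the sketch below builds a Karpenter NodePool manifest as a plain Python dictionary and prints it as JSON (which is also valid YAML). It is not Katalist's actual configuration: the pool name, instance category, CPU limit, consolidation window, and the `default` EC2NodeClass are all assumptions made for the example, and the field layout follows the `karpenter.sh/v1` API, which may differ from the version deployed here.

```python
import json


def build_node_pool(name: str, instance_categories: list[str], cpu_limit: int) -> dict:
    """Return a Karpenter NodePool manifest as a dict (all values are illustrative)."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumed: an EC2NodeClass named "default" already exists.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",
                    },
                    # Constrain which instances Karpenter may launch.
                    "requirements": [
                        {
                            "key": "karpenter.k8s.aws/instance-category",
                            "operator": "In",
                            "values": instance_categories,
                        },
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["on-demand"],
                        },
                    ],
                }
            },
            # Upper bound on total CPU the pool may provision.
            "limits": {"cpu": str(cpu_limit)},
            # Remove or repack nodes once demand drops, instead of
            # keeping peak-sized capacity running.
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "5m",
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical pool for GPU-backed workloads.
    print(json.dumps(build_node_pool("gpu-inference", ["g"], 256), indent=2))
```

The `limits` and `disruption` sections are the parts most relevant to cost: the first caps how far the pool can grow, and the second lets Karpenter scale back down when pods no longer need the capacity.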
IAM and Account Governance Improvements
Best-practice guidance was provided for IAM configuration, helping improve role management, access control, and overall AWS account governance.
Enhanced Security Monitoring
CloudTrail and GuardDuty configurations were reviewed and improved to ensure stronger auditing capabilities and proactive threat detection.
VPC and Network Architecture Enhancements
The VPC was adjusted by introducing NAT Gateways and S3 VPC endpoints, and the EKS control plane was moved to a private subnet. A Client VPN solution was designed to enable secure access to private resources within the AWS network, such as the EKS control plane, databases, and EC2 nodes.
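For readers unfamiliar with S3 VPC endpoints, the following sketch shows the parameters such an endpoint typically needs. It only assembles the keyword arguments for the EC2 `create_vpc_endpoint` API and does not call AWS; the VPC ID, region, and route table ID are placeholders, and the helper name `s3_gateway_endpoint_params` is hypothetical rather than anything used in this engagement.

```python
def s3_gateway_endpoint_params(vpc_id: str, region: str, route_table_ids: list[str]) -> dict:
    """Build keyword arguments for an S3 gateway endpoint (placeholder values only)."""
    return {
        "VpcId": vpc_id,
        # S3 endpoint service names follow the com.amazonaws.<region>.s3 pattern.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Gateway endpoints add routes to the listed route tables, so S3
        # traffic from private subnets does not traverse the NAT Gateway.
        "VpcEndpointType": "Gateway",
        "RouteTableIds": list(route_table_ids),
    }


if __name__ == "__main__":
    params = s3_gateway_endpoint_params(
        "vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]
    )
    print(params)
    # With boto3 installed and credentials configured, the call would be:
    #   boto3.client("ec2").create_vpc_endpoint(**params)
```

Routing S3 traffic through a gateway endpoint is a common companion to NAT Gateways, since large image and model downloads would otherwise incur NAT data processing charges.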
Customised EKS Node Configuration
EKS node AMIs were customised to pre-load Docker images and ML models, simplifying automation and reducing initialisation time for AI workloads.
Automated CI/CD with ArgoCD
An automated CI/CD pipeline using ArgoCD was implemented, significantly reducing manual deployment effort and improving release reliability. Additional changes were made to enable future implementation of canary deployments.
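To make the shift from manual Helm releases to GitOps more concrete, here is a minimal sketch of an Argo CD Application that tracks a Helm chart in Git and syncs it automatically. The repository URL, chart path, values file, and namespaces are invented for the example and do not describe the client's repositories.

```python
import json


def build_argocd_application(name: str, repo_url: str, chart_path: str, namespace: str) -> dict:
    """Return an Argo CD Application manifest as a dict (all values are illustrative)."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumed: Argo CD is installed in the conventional "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": chart_path,
                "targetRevision": "main",
                # Render the existing Helm chart with an environment values file.
                "helm": {"valueFiles": ["values.yaml"]},
            },
            "destination": {
                # Deploy into the same cluster Argo CD runs in.
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {
                # Apply Git changes automatically, delete resources removed
                # from Git, and revert manual drift in the cluster.
                "automated": {"prune": True, "selfHeal": True},
            },
        },
    }


if __name__ == "__main__":
    app = build_argocd_application(
        "video-api",                               # hypothetical service name
        "https://example.com/org/deployments.git",  # placeholder repository
        "charts/video-api",
        "production",
    )
    print(json.dumps(app, indent=2))
```

With a definition along these lines, a release becomes a Git commit: Argo CD notices the change and reconciles the cluster, rather than an engineer running Helm commands by hand.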
AWS Services Used
- Amazon EKS (Elastic Kubernetes Service)
- Karpenter
- AWS IAM (Identity and Access Management)
- AWS CloudTrail
- Amazon GuardDuty
- AWS Client VPN
- Amazon VPC (with NAT Gateways and VPC Endpoints)
Results
Following infrastructure optimisation and automation improvements, the client gained stronger cost control, improved security posture, and significantly more efficient deployment processes.
- Optimised Autoscaling – AWS costs were reduced through AWS credits and infrastructure optimisation.
- Improved Security Visibility – Enhanced CloudTrail and GuardDuty configurations strengthened monitoring and auditing.
- Faster Deployments – ArgoCD automation replaced manual deployments that could take up to a day with a streamlined release process.
- Stronger Network Setup – Private subnets, VPN access, and VPC endpoints increased security and reliability.
- Clear Improvement Roadmap – The Well-Architected Review delivered actionable next steps and identified funding opportunities.
- Production Replication – After successful validation, the same autoscaling configuration was rolled out to the production environment.
