About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Three years of locking down SageMaker environments across regulated industries taught me one thing early: your networking decisions on day one determine whether the ML infrastructure passes an audit six months later. Teams treat SageMaker networking as an afterthought. Notebook instances get default settings. Models train with full internet access. Then the security review arrives and everybody scrambles. Give the networking layer the same rigor you'd give any production VPC workload.
Understanding SageMaker Network Architecture
SageMaker resources run in one of two modes: internet-facing or VPC-isolated. Get this distinction right early or plan on a painful re-architecture later.
Internet-Facing vs. VPC Mode
Out of the box, SageMaker notebook instances, training jobs, and endpoints all get direct internet access. Quick to set up. Terrible for production. I watched teams run like this for months, then face a brutal migration when compliance caught up. VPC mode places SageMaker resources inside your own Virtual Private Cloud, giving you actual control over what talks to what. Every production SageMaker deployment I build starts here.
VPC Configuration Best Practices
Use Private Subnets
Deploy SageMaker resources in private subnets. Full stop. Public subnets give your ML infrastructure a direct path to and from the internet; that attack surface buys you nothing. I have yet to see a legitimate production use case requiring SageMaker resources in a public subnet.
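Here's a minimal boto3 sketch of that setup. Every identifier (subnet, security group, role ARN, instance name) is a placeholder, and the actual API call is commented out because it needs live credentials:

```python
# Placeholder identifiers -- substitute your own VPC resources and role ARN.
notebook_params = {
    "NotebookInstanceName": "secure-notebook",
    "InstanceType": "ml.t3.medium",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "SubnetId": "subnet-aaaaa",          # private subnet, no route to an IGW
    "SecurityGroupIds": ["sg-notebook"],
    "DirectInternetAccess": "Disabled",  # force all traffic through the VPC
}

# Requires AWS credentials, so the call itself stays commented:
# import boto3
# boto3.client("sagemaker").create_notebook_instance(**notebook_params)
```

The `DirectInternetAccess: "Disabled"` flag is the piece teams miss; a notebook in a private subnet with direct internet access enabled still gets a SageMaker-managed path to the internet.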
Creating a SageMaker notebook instance in a private subnet
Implement Multiple Availability Zones
For production workloads (especially SageMaker endpoints handling real-time inference), spread resources across at least two Availability Zones. Learned this one the hard way. An AZ outage in us-east-1 took down a single-AZ inference endpoint for 47 minutes. The multi-AZ version of that same service? Up the entire time.
Network Isolation for Training Jobs
Working with sensitive data? Run training jobs in network isolation mode. This blocks the training container from making any outbound network calls. A healthcare client had a compliance requirement that no patient data could leave the VPC boundary during model training. Network isolation was the cleanest way to satisfy that requirement.
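A sketch of the request at the boto3 level, with hypothetical names throughout (image, role, bucket, and VPC IDs are all placeholders) and the credentialed call commented out:

```python
training_params = {
    "TrainingJobName": "isolated-training-job",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "AlgorithmSpecification": {
        "TrainingImage": "my-training-image",
        "TrainingInputMode": "File",
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output"},
    "VpcConfig": {
        "Subnets": ["subnet-aaaaa", "subnet-bbbbb"],
        "SecurityGroupIds": ["sg-training"],
    },
    # The key setting: blocks all outbound calls from the training container.
    # S3 input/output still works because SageMaker moves the data on the
    # container's behalf, outside the isolated network namespace.
    "EnableNetworkIsolation": True,
}

# import boto3
# boto3.client("sagemaker").create_training_job(**training_params)
```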
Enabling network isolation for SageMaker training jobs
Security Groups and NACLs
Principle of Least Privilege
Most teams over-permit SageMaker security groups because troubleshooting connectivity issues during training is painful and nobody wants to be the bottleneck. Resist that urge. Notebook instances typically need inbound HTTPS on port 443 for Jupyter access through the console. That's it.
Inbound Rules:
- Type: HTTPS
- Protocol: TCP
- Port: 443
- Source: Your VPC CIDR or specific IP ranges
Outbound Rules:
- Allow necessary AWS service endpoints
- Block all other traffic by default
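The inbound rule above, expressed as a boto3 call. The group ID and CIDR are placeholders, and the call is commented out since it needs credentials:

```python
# Inbound: HTTPS only, restricted to the VPC CIDR.
https_ingress = {
    "IpProtocol": "tcp",
    "FromPort": 443,
    "ToPort": 443,
    "IpRanges": [{"CidrIp": "10.0.0.0/16", "Description": "Jupyter over HTTPS"}],
}

# import boto3
# boto3.client("ec2").authorize_security_group_ingress(
#     GroupId="sg-notebook", IpPermissions=[https_ingress]
# )
```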
Separate Security Groups by Function
Maintain distinct security groups for each SageMaker component type. Merging them into a single "sagemaker-sg" feels convenient until you need to debug why your endpoint can reach a resource that only training jobs should access.
| Security Group Type | Purpose |
| --- | --- |
| Notebook Security Group | For development and experimentation |
| Training Security Group | For training jobs |
| Endpoint Security Group | For model serving |
| Data Source Security Group | For databases and data storage |
The payoff comes during incident response. Traffic from a specific security group in your flow logs tells you immediately which component generated it.
Use Network ACLs as a Second Layer
Security groups handle instance-level filtering. NACLs operate at the subnet level. Think of NACLs as a safety net: coarse-grained rules that catch anything the security groups miss. They matter most for subnets hosting production inference endpoints, where a misconfigured security group could expose your model to unauthorized callers.
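As a sketch, one NACL entry allowing HTTPS into an endpoint subnet, with everything else caught by the NACL's implicit deny-all. All identifiers are placeholders; remember NACLs are stateless, so response traffic needs its own outbound rule on ephemeral ports:

```python
nacl_entry = {
    "NetworkAclId": "acl-xxxxx",
    "RuleNumber": 100,               # rules evaluate in ascending order
    "Protocol": "6",                 # TCP
    "RuleAction": "allow",
    "Egress": False,                 # inbound rule
    "CidrBlock": "10.0.0.0/16",
    "PortRange": {"From": 443, "To": 443},
}

# import boto3
# boto3.client("ec2").create_network_acl_entry(**nacl_entry)
```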
VPC Endpoints for AWS Services
Why VPC Endpoints Matter
Without VPC endpoints, every API call from your VPC-isolated SageMaker resources to S3, ECR, or CloudWatch routes through a NAT Gateway (or fails entirely). VPC endpoints keep that traffic on the AWS backbone: lower latency, better security posture, and no more hemorrhaging money on NAT Gateway data processing charges. On one project, switching S3 access from NAT to a gateway endpoint cut the monthly data transfer bill by $2,400.
Essential VPC Endpoints for SageMaker
Six endpoints at minimum: SageMaker API, SageMaker Runtime, ECR API, ECR Docker, S3, and CloudWatch Logs. Teams forget one (usually CloudWatch Logs, or STS if their jobs assume roles) and then spend hours debugging why training jobs fail silently.
Interface endpoints (SageMaker API, ECR, CloudWatch Logs) need multiple AZs and their own security groups. Put them in the same subnets as your SageMaker resources so traffic stays local to the AZ whenever possible.
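Here's the list as I'd declare it in Python. S3 is the lone gateway endpoint (attached to route tables); the other five are interface endpoints (ENIs in your subnets). Region and resource IDs are placeholders, and the creation calls are commented out since they need credentials:

```python
region = "us-east-1"  # adjust to your region

# Interface endpoints: ENIs in your subnets; need security groups and private DNS.
interface_services = [
    f"com.amazonaws.{region}.sagemaker.api",
    f"com.amazonaws.{region}.sagemaker.runtime",
    f"com.amazonaws.{region}.ecr.api",
    f"com.amazonaws.{region}.ecr.dkr",
    f"com.amazonaws.{region}.logs",
]
# Gateway endpoint: free; attaches to route tables instead of subnets.
gateway_services = [f"com.amazonaws.{region}.s3"]

# import boto3
# ec2 = boto3.client("ec2")
# for service in interface_services:
#     ec2.create_vpc_endpoint(
#         VpcId="vpc-xxxxx", ServiceName=service, VpcEndpointType="Interface",
#         SubnetIds=["subnet-aaaaa", "subnet-bbbbb"],
#         SecurityGroupIds=["sg-endpoints"], PrivateDnsEnabled=True,
#     )
# ec2.create_vpc_endpoint(
#     VpcId="vpc-xxxxx", ServiceName=gateway_services[0],
#     VpcEndpointType="Gateway", RouteTableIds=["rtb-xxxxx"],
# )
```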
Both enableDnsSupport and enableDnsHostnames must be turned on in your VPC. Skip this and your VPC endpoints silently fail to resolve. I've wasted more hours debugging this than I care to admit; usually somebody provisioned the VPC manually through the console and missed the checkbox.
```hcl
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# For a VPC created outside Terraform, there is no standalone resource for
# these attributes. Import the VPC into state and manage the flags on the
# aws_vpc resource itself:
#   terraform import aws_vpc.main vpc-xxxxx
# or flip them with the CLI:
#   aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxx --enable-dns-support
#   aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxx --enable-dns-hostnames
```
```python
import pulumi_aws as aws

vpc = aws.ec2.Vpc(
    "main",
    cidr_block="10.0.0.0/16",
    enable_dns_support=True,
    enable_dns_hostnames=True,
)

# For an existing VPC, look it up and manage the DNS attributes through the
# Vpc resource's properties (e.g. after adopting it with `pulumi import`)
existing_vpc = aws.ec2.get_vpc(id="vpc-xxxxx")
```
```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const vpc = new ec2.Vpc(this, 'MainVPC', {
  maxAzs: 2,
  // enableDnsSupport and enableDnsHostnames default to true in CDK
});

// For an existing VPC, these attributes must already be set on the VPC itself;
// CDK cannot modify them on an imported VPC
```
Enabling DNS resolution and hostnames in VPC
Private Hosted Zones
For hybrid architectures or custom internal endpoints, Route 53 private hosted zones give you fine-grained DNS control within the VPC. Useful when SageMaker needs to reach on-premises feature stores or model registries over Direct Connect.
Data Encryption in Transit
Enforce TLS for All Communications
SageMaker supports TLS 1.2 and higher across all API calls and notebook access. Enforce this at the VPC endpoint policy level so no unencrypted traffic slips through, even if someone misconfigures a client.
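A sketch of what that policy can look like: an explicit deny on any request arriving without TLS, using the `aws:SecureTransport` condition key. The allow statement is deliberately broad for illustration; tighten it to your actual principals and resources. Attaching it (e.g. via `modify_vpc_endpoint` with a `PolicyDocument`) is left out since it needs credentials:

```python
import json

tls_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Permit normal access...
        {"Effect": "Allow", "Principal": "*", "Action": "*", "Resource": "*"},
        # ...but explicitly deny anything not sent over TLS.
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "*",
            "Resource": "*",
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
    ],
}

policy_document = json.dumps(tls_only_policy)
```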
VPN or Direct Connect for Hybrid Environments
ML pipelines pulling data from on-premises sources need AWS VPN or Direct Connect. Routing training data over the public internet is a non-starter for any regulated workload. I've set up Direct Connect for three separate ML platforms; the latency improvement alone (typically 30-40% reduction for large dataset transfers) justifies the cost beyond the security benefits.
Monitoring and Logging
VPC Flow Logs
Turn on VPC Flow Logs from day one. Cannot overstate how much time these save during incident investigation. Last year I traced a mysterious training job failure to a misconfigured NACL rule in under ten minutes; the flow logs showed rejected packets right there. Without them, that's a half-day debugging session.
```python
import json

import pulumi_aws as aws

log_group = aws.cloudwatch.LogGroup(
    "vpc-flow-logs",
    name="/aws/vpc/flowlogs",
    retention_in_days=7,
)

# IAM role the VPC Flow Logs service assumes to write to CloudWatch
flow_logs_role = aws.iam.Role(
    "vpc-flow-logs-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {"Service": "vpc-flow-logs.amazonaws.com"},
        }],
    }),
)

# Attach the CloudWatch Logs permissions to the role
flow_logs_policy = aws.iam.RolePolicy(
    "vpc-flow-logs-policy",
    role=flow_logs_role.id,
    policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
            ],
            "Effect": "Allow",
            "Resource": "*",
        }],
    }),
)

# Create the VPC Flow Log, capturing accepted and rejected traffic
flow_log = aws.ec2.FlowLog(
    "vpc-flow-log",
    vpc_id="vpc-xxxxx",
    traffic_type="ALL",
    log_destination_type="cloud-watch-logs",
    log_destination=log_group.arn,
    iam_role_arn=flow_logs_role.arn,
)
```
```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as logs from 'aws-cdk-lib/aws-logs';

const logGroup = new logs.LogGroup(this, 'VPCFlowLogs', {
  logGroupName: '/aws/vpc/flowlogs',
  retention: logs.RetentionDays.ONE_WEEK,
});

// Look up the existing VPC; CDK creates the flow-log IAM role automatically
const vpc = ec2.Vpc.fromLookup(this, 'VPC', { vpcId: 'vpc-xxxxx' });
vpc.addFlowLog('FlowLog', {
  destination: ec2.FlowLogDestination.toCloudWatchLogs(logGroup),
  trafficType: ec2.FlowLogTrafficType.ALL,
});
```
Creating VPC Flow Logs for network monitoring
CloudWatch Metrics
Pair flow logs with SageMaker-specific CloudWatch metrics. The four I watch most closely:
- Endpoint invocation latency
- Model container CPU/memory utilization
- Training job network throughput
- Failed API calls
A spike in failed API calls combined with rejected packets in flow logs almost always points to a security group or endpoint policy issue.
AWS Config Rules
Set up Config rules to catch drift before it becomes an incident:
- Verify SageMaker notebooks are in VPCs
- Confirm network isolation is enabled for training jobs
- Validate security group configurations
These rules caught misconfigurations three times in the past year that would have made it to production otherwise.
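For the notebook check, there's an AWS-managed rule you can wire up directly. The identifier below is, to my knowledge, the managed rule that flags notebooks with direct internet access enabled; verify it against the managed rules list for your region. The rule name is my own, and the call is commented out since it needs credentials:

```python
config_rule = {
    "ConfigRuleName": "sagemaker-notebook-no-direct-internet",
    "Source": {
        "Owner": "AWS",  # AWS-managed rule, no custom Lambda needed
        "SourceIdentifier": "SAGEMAKER_NOTEBOOK_NO_DIRECT_INTERNET_ACCESS",
    },
}

# import boto3
# boto3.client("config").put_config_rule(ConfigRule=config_rule)
```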
Cost Optimization
Use S3 Gateway Endpoints
S3 gateway endpoints cost nothing. Zero. They eliminate NAT Gateway data processing charges for S3 traffic within the same region. ML workloads shuffle large datasets between S3 and training instances, so the savings add up fast; it's common to see monthly bills drop by thousands of dollars from this one change.
Right-Size VPC Endpoints
Interface endpoints run about $7.50 per AZ per month, plus $0.01 per GB of data processed. Sounds small until you realize a six-endpoint setup across two AZs costs $90/month before data charges. Only provision endpoints for services your SageMaker workloads actually call, and share them across workloads in the same VPC.
NAT Gateway Optimization
Sometimes you genuinely need internet access: pip installs, pulling external datasets, calling third-party APIs. Ground rules for keeping NAT costs under control:
- Share NAT Gateways across multiple private subnets
- Consider NAT instances for dev/test environments
- Use VPC endpoints instead of NAT for AWS service access
Compliance and Governance
Network Segmentation
Separate dev, staging, and production SageMaker environments into distinct VPCs (or, at minimum, dedicated subnet groups with strict security group boundaries). I watched a data scientist's notebook in a shared VPC accidentally query a production database during experimentation. Segmentation prevents those accidents.
Tagging Strategy
Tag every network resource: every VPC endpoint, every security group, every flow log. During a compliance audit, the auditor asks you to prove which resources belong to which workload. Tags let you answer that question in minutes instead of days.
PrivateLink for Third-Party Integrations
When your ML pipeline integrates with third-party tools (model monitoring services, feature platforms, data providers), use PrivateLink to keep that connectivity private. Poking holes in your VPC's internet gateway for vendor integrations undermines the isolation you spent weeks building.
Troubleshooting Common Network Issues
Connection Timeouts
Connection timeouts in VPC-isolated SageMaker resources usually boil down to one of four causes. I work through them in this order because it matches their frequency:
1. Verify security group rules allow outbound traffic
2. Confirm route tables direct traffic correctly
3. Check VPC endpoint configuration and policies
4. Validate IAM role permissions
Nine times out of ten: the security group.
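A small helper for that first check, written against the shape `describe_security_groups` returns. The function is pure so you can test it offline; the live lookup (placeholder group ID) stays commented:

```python
def has_https_egress(sg):
    """Return True if a security group dict (as returned by
    describe_security_groups) allows outbound TCP 443 -- the first
    thing to check when a VPC-isolated job times out."""
    for perm in sg.get("IpPermissionsEgress", []):
        all_traffic = perm.get("IpProtocol") == "-1"
        https = (
            perm.get("IpProtocol") == "tcp"
            and perm.get("FromPort", 0) <= 443 <= perm.get("ToPort", 0)
        )
        if all_traffic or https:
            return True
    return False

# import boto3
# resp = boto3.client("ec2").describe_security_groups(GroupIds=["sg-training"])
# print(has_https_egress(resp["SecurityGroups"][0]))
```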
DNS Resolution Failures
DNS issues with VPC endpoints are sneaky; the error messages rarely mention DNS. If API calls hang or return "could not resolve host" errors:
- Ensure DNS resolution is enabled in the VPC
- Verify private DNS is enabled for interface endpoints
- Check DHCP options sets are correct
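For the first check, here's a sketch: a pure helper that reports which DNS attributes are still off, with the live `describe_vpc_attribute` lookups (placeholder VPC ID) commented out since they need credentials:

```python
def dns_misconfigurations(attrs):
    """Given {'enableDnsSupport': bool, 'enableDnsHostnames': bool},
    return the attributes still disabled -- either one will break
    DNS resolution for interface endpoints."""
    return [name for name, enabled in attrs.items() if not enabled]

# import boto3
# ec2 = boto3.client("ec2")
# attrs = {
#     "enableDnsSupport": ec2.describe_vpc_attribute(
#         VpcId="vpc-xxxxx", Attribute="enableDnsSupport"
#     )["EnableDnsSupport"]["Value"],
#     "enableDnsHostnames": ec2.describe_vpc_attribute(
#         VpcId="vpc-xxxxx", Attribute="enableDnsHostnames"
#     )["EnableDnsHostnames"]["Value"],
# }
# print(dns_misconfigurations(attrs))
```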
Performance Issues
Slow data transfers or high inference latency usually point to a network topology problem. Before reaching for a bigger instance type:
- Review VPC endpoint placement across AZs
- Check for cross-AZ data transfer
- Validate that network bandwidth has sufficient headroom
- Consider using Enhanced Networking for EC2-based components
Once tracked a 200ms latency spike on an inference endpoint to cross-AZ traffic between the endpoint and its VPC endpoint for S3 model artifact loading. Moving them to the same AZ cut p99 latency in half.
Example: Secure Production Architecture
Below is the Python (boto3 + SageMaker SDK) setup I use as a starting point for production SageMaker deployments: VPC endpoints, network-isolated training, encrypted inter-container traffic, multi-AZ endpoint deployment.
```python
import boto3
from sagemaker import Session, get_execution_role
from sagemaker.estimator import Estimator

ec2 = boto3.client('ec2')
sagemaker_client = boto3.client('sagemaker')

# VPC configuration
vpc_id = 'vpc-xxxxx'
private_subnet_1 = 'subnet-aaaaa'  # us-east-1a
private_subnet_2 = 'subnet-bbbbb'  # us-east-1b

# Security groups
training_sg = 'sg-training'
endpoint_sg = 'sg-endpoint'

# The six essential VPC endpoints (provision these before running jobs)
endpoints = [
    'com.amazonaws.us-east-1.sagemaker.api',
    'com.amazonaws.us-east-1.sagemaker.runtime',
    'com.amazonaws.us-east-1.ecr.api',
    'com.amazonaws.us-east-1.ecr.dkr',
    'com.amazonaws.us-east-1.s3',
    'com.amazonaws.us-east-1.logs',
]

# Training job with network isolation
session = Session()
role = get_execution_role()

estimator = Estimator(
    image_uri='my-secure-training-image',
    role=role,
    instance_count=2,
    instance_type='ml.p3.2xlarge',
    volume_size=100,
    subnets=[private_subnet_1, private_subnet_2],
    security_group_ids=[training_sg],
    enable_network_isolation=True,
    encrypt_inter_container_traffic=True,
)

# Deploy across both subnets (two AZs); the endpoint gets its own security
# group via vpc_config_override instead of inheriting the training one
predictor = estimator.deploy(
    initial_instance_count=2,
    instance_type='ml.m5.xlarge',
    endpoint_name='secure-ml-endpoint',
    vpc_config_override={
        'Subnets': [private_subnet_1, private_subnet_2],
        'SecurityGroupIds': [endpoint_sg],
    },
)
```
The networking layer in a SageMaker deployment is where security, cost, and operational reliability come together or fall apart. Private subnets, VPC endpoints, function-specific security groups, flow logs, network isolation for training: these represent the minimum viable network architecture for any production ML workload on AWS.
Start with VPC isolation and the six essential endpoints while still in development. Add network isolation for training, multi-AZ endpoints, and Config-based drift detection as you approach production. Review security group rules and flow logs quarterly. Networking configurations drift, and that drift always moves toward more permissive.
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.