Best Practices for Networking in AWS SageMaker

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Three years of locking down SageMaker environments across regulated industries taught me one thing early: your networking decisions on day one determine whether the ML infrastructure passes an audit six months later. Teams treat SageMaker networking as an afterthought. Notebook instances get default settings. Models train with full internet access. Then the security review arrives and everybody scrambles. Give the networking layer the same rigor you'd give any production VPC workload.

Understanding SageMaker Network Architecture

SageMaker resources run in one of two modes: internet-facing or VPC-isolated. Get this distinction right early or plan on a painful re-architecture later.

Internet-Facing vs. VPC Mode

Out of the box, SageMaker notebook instances, training jobs, and endpoints all get direct internet access. Quick to set up. Terrible for production. I watched teams run like this for months, then face a brutal migration when compliance caught up. VPC mode places SageMaker resources inside your own Virtual Private Cloud, giving you actual control over what talks to what. Every production SageMaker deployment I build starts here.

VPC Configuration Best Practices

Use Private Subnets

Deploy SageMaker resources in private subnets. Full stop. Public subnets give your ML infrastructure a direct path to and from the internet; that attack surface buys you nothing. I have yet to see a legitimate production use case requiring SageMaker resources in a public subnet.

import boto3

sagemaker = boto3.client('sagemaker')

response = sagemaker.create_notebook_instance(
    NotebookInstanceName='my-secure-notebook',
    InstanceType='ml.t3.medium',
    SubnetId='subnet-xxxxx',  # Private subnet ID
    SecurityGroupIds=['sg-xxxxx'],
    DirectInternetAccess='Disabled',
    RoleArn='arn:aws:iam::account-id:role/SageMakerRole'
)
Creating a SageMaker notebook instance in a private subnet

Implement Multiple Availability Zones

For production workloads (especially SageMaker endpoints handling real-time inference), spread resources across at least two Availability Zones. Learned this one the hard way. An AZ outage in us-east-1 took down a single-AZ inference endpoint for 47 minutes. The multi-AZ version of that same service? Up the entire time.
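A minimal boto3 sketch of that setup, with hypothetical IDs throughout: subnets from two different AZs go into the model's VpcConfig, and an InitialInstanceCount of at least 2 lets SageMaker spread instances across them.

```python
# Hypothetical IDs throughout; adjust to your account.
model_params = {
    "ModelName": "my-multi-az-model",
    "PrimaryContainer": {"Image": "my-inference-image"},
    "ExecutionRoleArn": "arn:aws:iam::account-id:role/SageMakerRole",
    "VpcConfig": {
        # Subnets in two different AZs (e.g. us-east-1a and us-east-1b)
        "Subnets": ["subnet-aaaaa", "subnet-bbbbb"],
        "SecurityGroupIds": ["sg-xxxxx"],
    },
}

endpoint_config_params = {
    "EndpointConfigName": "my-multi-az-config",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "my-multi-az-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 2,  # >= 2 so an AZ outage leaves capacity
    }],
}

def deploy_multi_az():
    import boto3  # local import: the sketch reads without AWS credentials
    sm = boto3.client("sagemaker")
    sm.create_model(**model_params)
    sm.create_endpoint_config(**endpoint_config_params)
    sm.create_endpoint(EndpointName="my-multi-az-endpoint",
                       EndpointConfigName="my-multi-az-config")
```

SageMaker places the variant's instances across the AZs covered by the subnets you supply, so a single-AZ failure degrades capacity rather than availability.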

Network Isolation for Training Jobs

Working with sensitive data? Run training jobs in network isolation mode. This blocks the training container from making any outbound network calls. A healthcare client had a compliance requirement that no patient data could leave the VPC boundary during model training. Network isolation was the cleanest way to satisfy that requirement.

import boto3

sagemaker_client = boto3.client('sagemaker')

response = sagemaker_client.create_training_job(
    TrainingJobName='my-isolated-training-job',
    RoleArn='arn:aws:iam::account-id:role/SageMakerRole',
    AlgorithmSpecification={
        'TrainingImage': 'my-training-image',
        'TrainingInputMode': 'File'
    },
    ResourceConfig={
        'InstanceType': 'ml.p3.2xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 50
    },
    VpcConfig={
        'Subnets': ['subnet-xxxxx'],
        'SecurityGroupIds': ['sg-xxxxx']
    },
    EnableNetworkIsolation=True,
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    OutputDataConfig={
        'S3OutputPath': 's3://my-bucket/output'
    }
)
Enabling network isolation for SageMaker training jobs

Security Groups and NACLs

Principle of Least Privilege

Most teams over-permit SageMaker security groups because troubleshooting connectivity issues during training is painful and nobody wants to be the bottleneck. Resist that urge. Notebook instances typically need inbound HTTPS on port 443 for Jupyter access through the console. That's it.

Inbound Rules:
- Type: HTTPS
- Protocol: TCP
- Port: 443
- Source: Your VPC CIDR or specific IP ranges

Outbound Rules:
- Allow necessary AWS service endpoints
- Block all other traffic by default
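The rules above translate into two boto3 calls. A sketch, with a placeholder group ID and CIDR:

```python
# Placeholder group ID and CIDR; a mirror of the rules listed above.
ingress_params = {
    "GroupId": "sg-xxxxx",
    "IpPermissions": [{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.0.0.0/16",
                      "Description": "HTTPS for Jupyter access"}],
    }],
}

# New security groups allow all outbound traffic by default, so revoking
# that rule is what actually enforces "block all other traffic".
revoke_egress_params = {
    "GroupId": "sg-xxxxx",
    "IpPermissions": [{"IpProtocol": "-1",
                       "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
}

def lock_down():
    import boto3  # local import: the sketch reads without AWS credentials
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(**ingress_params)
    ec2.revoke_security_group_egress(**revoke_egress_params)
    # ...then authorize egress only toward your VPC endpoints.
```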

Separate Security Groups by Function

Maintain distinct security groups for each SageMaker component type. Merging them into a single "sagemaker-sg" feels convenient until you need to debug why your endpoint can reach a resource that only training jobs should access.

  • Notebook Security Group: development and experimentation
  • Training Security Group: training jobs
  • Endpoint Security Group: model serving
  • Data Source Security Group: databases and data storage

The payoff comes during incident response. Traffic from a specific security group in your flow logs tells you immediately which component generated it.

Use Network ACLs as a Second Layer

Security groups handle instance-level filtering. NACLs operate at the subnet level. Think of NACLs as a safety net: coarse-grained rules that catch anything the security groups miss. They matter most for subnets hosting production inference endpoints, where a misconfigured security group could expose your model to unauthorized callers.
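A sketch of that safety net in boto3, assuming a hypothetical ACL ID and VPC CIDR: allow HTTPS in, allow return traffic on ephemeral ports, and let the NACL's implicit deny catch everything else.

```python
# Hypothetical ACL ID and CIDR; coarse rules behind the security groups.
nacl_entries = [
    # Allow inbound HTTPS from within the VPC only
    {"NetworkAclId": "acl-xxxxx", "RuleNumber": 100, "Protocol": "6",
     "RuleAction": "allow", "Egress": False, "CidrBlock": "10.0.0.0/16",
     "PortRange": {"From": 443, "To": 443}},
    # Allow return traffic on ephemeral ports
    {"NetworkAclId": "acl-xxxxx", "RuleNumber": 110, "Protocol": "6",
     "RuleAction": "allow", "Egress": False, "CidrBlock": "10.0.0.0/16",
     "PortRange": {"From": 1024, "To": 65535}},
]

def apply_nacl_entries():
    import boto3  # local import: the sketch reads without AWS credentials
    ec2 = boto3.client("ec2")
    for entry in nacl_entries:
        ec2.create_network_acl_entry(**entry)
```

Unlike security groups, NACLs are stateless, which is why the ephemeral-port rule is needed for response traffic.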

VPC Endpoints for AWS Services

Why VPC Endpoints Matter

Without VPC endpoints, every API call from your VPC-isolated SageMaker resources to S3, ECR, or CloudWatch routes through a NAT Gateway (or fails entirely). VPC endpoints keep that traffic on the AWS backbone: lower latency, better security posture, and no more hemorrhaging money on NAT Gateway data processing charges. On one project, switching S3 access from NAT to a gateway endpoint cut the monthly data transfer bill by $2,400.

Essential VPC Endpoints for SageMaker

Six endpoints at minimum. Teams forget one (usually STS or CloudWatch Logs) and then spend hours debugging why training jobs fail silently.

  • S3 Gateway Endpoint: access to training data and model artifacts
  • SageMaker API Endpoint: SageMaker service API calls
  • SageMaker Runtime Endpoint: inference requests to endpoints
  • CloudWatch Logs Endpoint: logging and monitoring
  • ECR API and DKR Endpoints: pulling custom container images
  • STS Endpoint: assuming IAM roles

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-xxxxx \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-xxxxx
Creating an S3 gateway endpoint

Interface Endpoints Configuration

Interface endpoints (SageMaker API, ECR, CloudWatch Logs) need multiple AZs and their own security groups. Put them in the same subnets as your SageMaker resources so traffic stays local to the AZ whenever possible.

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-xxxxx \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.sagemaker.api \
    --subnet-ids subnet-xxxxx subnet-yyyyy \
    --security-group-ids sg-xxxxx
Creating a SageMaker API interface endpoint

DNS Resolution and Private Hosted Zones

Enable DNS Resolution

Both enableDnsSupport and enableDnsHostnames must be turned on in your VPC. Skip this and your VPC endpoints silently fail to resolve. I've wasted more hours debugging this than I care to admit; usually somebody provisioned the VPC manually through the console and missed the checkbox.

aws ec2 modify-vpc-attribute \
    --vpc-id vpc-xxxxx \
    --enable-dns-support "{\"Value\":true}"

aws ec2 modify-vpc-attribute \
    --vpc-id vpc-xxxxx \
    --enable-dns-hostnames "{\"Value\":true}"
Enabling DNS resolution and hostnames in VPC

Private Hosted Zones

For hybrid architectures or custom internal endpoints, Route 53 private hosted zones give you fine-grained DNS control within the VPC. Useful when SageMaker needs to reach on-premises feature stores or model registries over Direct Connect.
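A sketch of creating such a zone with boto3, assuming a hypothetical internal domain and VPC ID:

```python
import uuid

# Hypothetical zone name and VPC ID; names in this zone resolve only
# from inside the associated VPC (e.g. an on-prem feature store reached
# over Direct Connect).
hosted_zone_params = {
    "Name": "ml.internal.example.com",
    "CallerReference": str(uuid.uuid4()),  # idempotency token
    "VPC": {"VPCRegion": "us-east-1", "VPCId": "vpc-xxxxx"},
    "HostedZoneConfig": {"PrivateZone": True,
                         "Comment": "Private DNS for the ML platform"},
}

def create_zone():
    import boto3  # local import: the sketch reads without AWS credentials
    route53 = boto3.client("route53")
    route53.create_hosted_zone(**hosted_zone_params)
```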

Data Encryption in Transit

Enforce TLS for All Communications

SageMaker supports TLS 1.2 and higher across all API calls and notebook access. Enforce this at the VPC endpoint policy level so no unencrypted traffic slips through, even if someone misconfigures a client.
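One way to enforce that (a sketch, with a placeholder endpoint ID) is an endpoint policy that denies any request where aws:SecureTransport is false:

```python
import json

# Placeholder endpoint ID; the Deny statement rejects unencrypted requests
# even when a broad Allow statement is present.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "AllowAll", "Effect": "Allow", "Principal": "*",
         "Action": "*", "Resource": "*"},
        {"Sid": "DenyUnencrypted", "Effect": "Deny", "Principal": "*",
         "Action": "*", "Resource": "*",
         "Condition": {"Bool": {"aws:SecureTransport": "false"}}},
    ],
}

def attach_policy():
    import boto3  # local import: the sketch reads without AWS credentials
    ec2 = boto3.client("ec2")
    ec2.modify_vpc_endpoint(VpcEndpointId="vpce-xxxxx",
                            PolicyDocument=json.dumps(policy))
```

An explicit Deny always wins over an Allow in IAM policy evaluation, which is what makes this pattern safe to layer on top of existing permissions.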

VPN or Direct Connect for Hybrid Environments

ML pipelines pulling data from on-premises sources need AWS VPN or Direct Connect. Routing training data over the public internet is a non-starter for any regulated workload. I've set up Direct Connect for three separate ML platforms; the latency improvement alone (typically 30-40% reduction for large dataset transfers) justifies the cost beyond the security benefits.

Monitoring and Logging

VPC Flow Logs

Turn on VPC Flow Logs from day one. Cannot overstate how much time these save during incident investigation. Last year I traced a mysterious training job failure to a misconfigured NACL rule in under ten minutes; the flow logs showed rejected packets right there. Without them, that's a half-day debugging session.

aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids vpc-xxxxx \
    --traffic-type ALL \
    --log-destination-type cloud-watch-logs \
    --log-group-name /aws/vpc/flowlogs \
    --deliver-logs-permission-arn arn:aws:iam::account-id:role/FlowLogsRole
Creating VPC Flow Logs for network monitoring

CloudWatch Metrics

Pair flow logs with SageMaker-specific CloudWatch metrics. The four I watch most closely:

  • Endpoint invocation latency
  • Model container CPU/memory utilization
  • Training job network throughput
  • Failed API calls

A spike in failed API calls combined with rejected packets in flow logs almost always points to a security group or endpoint policy issue.
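As a starting point, an alarm on failed endpoint invocations; names and thresholds below are placeholders to tune for your traffic:

```python
# Placeholder names and thresholds; AWS/SageMaker publishes
# Invocation4XXErrors per endpoint variant.
alarm_params = {
    "AlarmName": "secure-ml-endpoint-4xx",
    "Namespace": "AWS/SageMaker",
    "MetricName": "Invocation4XXErrors",
    "Dimensions": [{"Name": "EndpointName", "Value": "secure-ml-endpoint"},
                   {"Name": "VariantName", "Value": "AllTraffic"}],
    "Statistic": "Sum",
    "Period": 300,
    "EvaluationPeriods": 2,
    "Threshold": 10,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",
}

def put_alarm():
    import boto3  # local import: the sketch reads without AWS credentials
    boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```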

AWS Config Rules

Set up Config rules to catch drift before it becomes an incident:

  • Verify SageMaker notebooks are in VPCs
  • Confirm network isolation is enabled for training jobs
  • Validate security group configurations

These rules caught misconfigurations three times in the past year that would have made it to production otherwise.
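The first two checks map onto AWS managed Config rules; a sketch (the rule names are mine, while the SourceIdentifier values are the AWS-managed identifiers as I know them):

```python
# Managed Config rules for the notebook checks above.
config_rules = [
    {"ConfigRuleName": "sagemaker-notebook-in-vpc",
     "Source": {"Owner": "AWS",
                "SourceIdentifier": "SAGEMAKER_NOTEBOOK_INSTANCE_INSIDE_VPC"}},
    {"ConfigRuleName": "sagemaker-notebook-no-internet",
     "Source": {"Owner": "AWS",
                "SourceIdentifier": "SAGEMAKER_NOTEBOOK_NO_DIRECT_INTERNET_ACCESS"}},
]

def put_rules():
    import boto3  # local import: the sketch reads without AWS credentials
    config = boto3.client("config")
    for rule in config_rules:
        config.put_config_rule(ConfigRule=rule)
```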

Cost Optimization

Use S3 Gateway Endpoints

S3 gateway endpoints cost nothing. Zero. They eliminate NAT Gateway data processing charges for S3 traffic within the same region. ML workloads shuffle large datasets between S3 and training instances, so the savings add up fast; it's common to see monthly bills drop by thousands of dollars from this one change.

Right-Size VPC Endpoints

Interface endpoints run about $7.50 per AZ per month, plus $0.01 per GB of data processed. Sounds small until you realize a six-endpoint setup across two AZs costs $90/month before data charges. Only provision endpoints for services your SageMaker workloads actually call, and share them across workloads in the same VPC.

NAT Gateway Optimization

Sometimes you genuinely need internet access: pip installs, pulling external datasets, calling third-party APIs. Ground rules for keeping NAT costs under control:

  • Share NAT Gateways across multiple private subnets
  • Consider NAT instances for dev/test environments
  • Use VPC endpoints instead of NAT for AWS service access

Compliance and Governance

Network Segmentation

Separate dev, staging, and production SageMaker environments into distinct VPCs (or, at minimum, dedicated subnet groups with strict security group boundaries). I watched a data scientist's notebook in a shared VPC accidentally query a production database during experimentation. Segmentation prevents those accidents.

Tagging Strategy

Tag every network resource: every VPC endpoint, every security group, every flow log. During a compliance audit, the auditor asks you to prove which resources belong to which workload. Tags let you answer that question in minutes instead of days.

Environment: Production
Project: ML-Pipeline
Owner: DataScience-Team
CostCenter: ML-Operations
Compliance: HIPAA
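Applying that tag set to the EC2-side resources (VPC endpoints, security groups, the VPC itself) is a single boto3 call; a sketch with placeholder IDs:

```python
# The tag set above as boto3 structures; resource IDs are placeholders.
tags = [
    {"Key": "Environment", "Value": "Production"},
    {"Key": "Project", "Value": "ML-Pipeline"},
    {"Key": "Owner", "Value": "DataScience-Team"},
    {"Key": "CostCenter", "Value": "ML-Operations"},
    {"Key": "Compliance", "Value": "HIPAA"},
]

def tag_resources():
    import boto3  # local import: the sketch reads without AWS credentials
    boto3.client("ec2").create_tags(
        Resources=["vpc-xxxxx", "sg-xxxxx", "vpce-xxxxx"], Tags=tags)
```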

PrivateLink for Third-Party Integrations

When your ML pipeline integrates with third-party tools (model monitoring services, feature platforms, data providers), use PrivateLink to keep that connectivity private. Poking holes in your VPC's internet gateway for vendor integrations undermines the isolation you spent weeks building.

Troubleshooting Common Network Issues

Connection Timeouts

Connection timeouts in VPC-isolated SageMaker resources usually boil down to one of four causes. I work through them in this order because it matches their frequency:

  1. Verify security group rules allow outbound traffic
  2. Confirm route tables direct traffic correctly
  3. Check VPC endpoint configuration and policies
  4. Validate IAM role permissions

Nine times out of ten: the security group.
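Step 1 can be automated as a pure check over a describe_security_groups response; the helper and sample data below are illustrative:

```python
def missing_egress(security_groups):
    """Return IDs of groups with no outbound rules at all: the classic
    cause of silent connection timeouts from VPC-isolated resources."""
    return [sg["GroupId"] for sg in security_groups
            if not sg.get("IpPermissionsEgress")]

# Sample shaped like an EC2 describe_security_groups response
sample_groups = [
    {"GroupId": "sg-training", "IpPermissionsEgress": []},
    {"GroupId": "sg-endpoint",
     "IpPermissionsEgress": [{"IpProtocol": "tcp", "FromPort": 443,
                              "ToPort": 443}]},
]
# missing_egress(sample_groups) flags only sg-training
```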

DNS Resolution Failures

DNS issues with VPC endpoints are sneaky; the error messages rarely mention DNS. If API calls hang or return "could not resolve host" errors:

  1. Ensure DNS resolution is enabled in the VPC
  2. Verify private DNS is enabled for interface endpoints
  3. Check DHCP options sets are correct
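The first two checks can be expressed as a pure function over already-fetched API responses (the shapes follow the EC2 API; the sample data is illustrative):

```python
def dns_findings(vpc_attrs, vpc_endpoints):
    """vpc_attrs: {'enableDnsSupport': bool, 'enableDnsHostnames': bool};
    vpc_endpoints: entries shaped like describe_vpc_endpoints output."""
    issues = []
    for name, enabled in vpc_attrs.items():
        if not enabled:
            issues.append(f"VPC attribute {name} is disabled")
    for ep in vpc_endpoints:
        if (ep.get("VpcEndpointType") == "Interface"
                and not ep.get("PrivateDnsEnabled")):
            issues.append(f"{ep['ServiceName']}: private DNS disabled")
    return issues

# One finding per misconfiguration in this sample
sample = dns_findings(
    {"enableDnsSupport": True, "enableDnsHostnames": False},
    [{"ServiceName": "com.amazonaws.us-east-1.sagemaker.api",
      "VpcEndpointType": "Interface", "PrivateDnsEnabled": False}],
)
```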

Performance Issues

Slow data transfers or high inference latency usually point to a network topology problem. Before reaching for a bigger instance type:

  1. Review VPC endpoint placement across AZs
  2. Check for cross-AZ data transfer
  3. Validate that network bandwidth has sufficient headroom
  4. Consider using Enhanced Networking for EC2-based components

Once tracked a 200ms latency spike on an inference endpoint to cross-AZ traffic between the endpoint and its VPC endpoint for S3 model artifact loading. Moving them to the same AZ cut p99 latency in half.

Example: Secure Production Architecture

Below is the boto3 and SageMaker Python SDK setup I use as a starting point for production SageMaker deployments: VPC endpoints, network-isolated training, encrypted inter-container traffic, and a multi-AZ endpoint deployment.

import boto3
from sagemaker import Session, get_execution_role
from sagemaker.estimator import Estimator

ec2 = boto3.client('ec2')

# VPC configuration
vpc_id = 'vpc-xxxxx'
private_subnet_1 = 'subnet-aaaaa'  # us-east-1a
private_subnet_2 = 'subnet-bbbbb'  # us-east-1b
route_table_id = 'rtb-xxxxx'       # route table for the private subnets

# Function-specific security groups
training_sg = 'sg-training'
endpoint_sg = 'sg-endpoint'

# Create the essential VPC endpoints: interface endpoints span both AZs,
# and S3 gets a gateway endpoint attached to the route table
interface_services = [
    'com.amazonaws.us-east-1.sagemaker.api',
    'com.amazonaws.us-east-1.sagemaker.runtime',
    'com.amazonaws.us-east-1.ecr.api',
    'com.amazonaws.us-east-1.ecr.dkr',
    'com.amazonaws.us-east-1.logs',
    'com.amazonaws.us-east-1.sts'
]
for service in interface_services:
    ec2.create_vpc_endpoint(
        VpcId=vpc_id,
        VpcEndpointType='Interface',
        ServiceName=service,
        SubnetIds=[private_subnet_1, private_subnet_2],
        SecurityGroupIds=[training_sg, endpoint_sg],
        PrivateDnsEnabled=True
    )
ec2.create_vpc_endpoint(
    VpcId=vpc_id,
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=[route_table_id]
)

# Training job with network isolation and encrypted inter-container traffic
sagemaker_session = Session()
role = get_execution_role()

estimator = Estimator(
    image_uri='my-secure-training-image',
    role=role,
    instance_count=2,
    instance_type='ml.p3.2xlarge',
    volume_size=100,
    subnets=[private_subnet_1, private_subnet_2],
    security_group_ids=[training_sg],
    enable_network_isolation=True,
    encrypt_inter_container_traffic=True,
    sagemaker_session=sagemaker_session
)

# Deploy across both AZs; vpc_config_override swaps in the endpoint
# security group in place of the training group
model = estimator.create_model(
    vpc_config_override={
        'Subnets': [private_subnet_1, private_subnet_2],
        'SecurityGroupIds': [endpoint_sg]
    }
)
predictor = model.deploy(
    initial_instance_count=2,
    instance_type='ml.m5.xlarge',
    endpoint_name='secure-ml-endpoint'
)
Complete secure production SageMaker architecture

Closing Thoughts

The networking layer in a SageMaker deployment is where security, cost, and operational reliability come together or fall apart. Private subnets, VPC endpoints, function-specific security groups, flow logs, network isolation for training: these represent the minimum viable network architecture for any production ML workload on AWS.

Start with VPC isolation and the six essential endpoints while still in development. Add network isolation for training, multi-AZ endpoints, and Config-based drift detection as you approach production. Review security group rules and flow logs quarterly. Networking configurations drift, and that drift always moves toward more permissive.

Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.