
iOS Telemetry Pipeline with Kinesis, Glue, and Athena

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Any iOS app with real users generates telemetry. Session starts, feature usage, error events, performance metrics, purchase funnels. Most teams start by shipping all of it to Amplitude or Mixpanel and calling it done. That works for a while. Then the monthly invoice triples, you discover the vendor's data model cannot answer a question your PM asked three days ago, and you realize you are paying somebody else to store your data in a format optimized for their business.

I have deployed the pipeline documented here across several production iOS applications. Pinpoint handles ingestion, Kinesis Data Firehose delivers records reliably to S3, Glue discovers the schema automatically, and Athena gives you full SQL over the raw data. No servers. Scales from zero to billions of events per month. I provide the full infrastructure code in both Terraform and Pulumi so you can pick whichever fits your team.

Why a Dedicated Telemetry Pipeline

The first pushback I always hear: "Amplitude already does this." Sure. It does, at a price, with their query model, and with your data locked in their infrastructure. Once you evaluate cost, flexibility, and data ownership together, the case for owning the pipeline gets hard to argue against.

| Dimension | Third-Party Platform | AWS Telemetry Pipeline |
|---|---|---|
| Data ownership | Vendor stores your data; export often limited or delayed | You own the S3 bucket and query anytime with any tool |
| Query flexibility | Constrained to vendor's UI and query model | Full SQL via Athena; join with any other dataset |
| Data residency | Limited region selection; dependent on vendor's infrastructure | Deploy in any AWS region; meets sovereignty requirements |
| Retention | Often tiered pricing for longer retention | S3 lifecycle policies; pennies per GB-month |
| Integration | API/webhook exports; often batched or rate-limited | Direct S3 access; feed into ML pipelines, data lakes, BI tools |
| Schema control | Vendor's event schema with limited customization | Your schema, your structure, your partitioning |

The numbers at scale speak for themselves:

| Monthly Event Volume | Amplitude (Growth) | Mixpanel (Growth) | AWS Pipeline (estimated) |
|---|---|---|---|
| 10M events | ~$1,000/mo | ~$1,100/mo | ~$25/mo |
| 100M events | ~$4,500/mo | ~$5,000/mo | ~$180/mo |
| 1B events | ~$20,000+/mo (Enterprise) | ~$25,000+/mo (Enterprise) | ~$1,400/mo |

Those AWS estimates include Pinpoint, Firehose, S3, Glue crawler runs, and moderate Athena query volume. At a billion events per month, this pipeline costs roughly 15x less. The gap only widens with longer retention windows because S3 storage is pennies compared to what analytics vendors charge for historical data access.

Pipeline Architecture

Standard streaming ingestion pattern. Each component does one thing:

[Figure: iOS telemetry pipeline architecture — Amplify SDK → Amazon Pinpoint → Kinesis Data Firehose → partitioned S3 storage → Glue crawler (schema discovery) → Glue Data Catalog → Athena]

I chose each component for a specific reason:

| Component | Role | Scaling Model | Cost Driver |
|---|---|---|---|
| AWS Amplify SDK | Client-side event capture and batching | Runs on device; batches events automatically | Free (client-side) |
| Amazon Pinpoint | Event ingestion and fan-out | Fully managed; scales automatically | $0.000001 per event collected |
| Kinesis Data Firehose | Reliable buffered delivery to S3 | Auto-scales; no shard management | $0.029 per GB ingested |
| Amazon S3 | Durable, partitioned event storage | Infinite scale; pay per GB stored | $0.023/GB (Standard), $0.0125/GB (IA) |
| AWS Glue Crawler | Automatic schema discovery from S3 data | Runs on schedule; DPU-hour billing | ~$0.44 per run (minimal DPU) |
| Glue Data Catalog | Centralized metadata and table definitions | Managed; free tier covers most use cases | Free for first 1M objects |
| Amazon Athena | Ad-hoc SQL queries over S3 data | Serverless; per-query billing | $5.00 per TB scanned |

There are zero always-on compute resources here. Everything is either event-driven (Pinpoint, Firehose) or invoked on demand (Glue crawler, Athena queries). Process zero events, pay close to nothing. Process a billion events a month, still pay a fraction of what Amplitude would charge. I have run this pipeline on a side project that generated maybe 500 events per day and the bill rounded to zero for months.

Data Flow: From Tap to Query

Follow a single telemetry event from the user tapping a button all the way to an analyst running a query. This walkthrough shows why each component exists and where the handoffs happen:

[Figure: telemetry event lifecycle from device to query — iOS App → Amplify SDK (recordEvent, local batching) → Pinpoint (submitEvents) → Firehose (PutRecord, size/time buffering) → S3 (GZIP PUT under data/year=2026/month=02/day=22/) → Glue Crawler (new partitions, schema updates) → Glue Catalog → Athena (metadata lookup, partitioned scan, results)]

Step 1: Client-side event capture. The iOS app uses the AWS Amplify Analytics SDK to record events. The SDK batches locally, retries on failure, and queues events when the device is offline. Batching keeps network overhead low and avoids hammering the user's battery.

Step 2: Pinpoint ingestion. Pinpoint receives the event batches and tacks on metadata: application ID, client context, session information. Then it forwards each event to the Kinesis Data Firehose delivery stream you configure in the event stream settings.
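
In the infrastructure code, this handoff is a single event stream resource. A minimal Pulumi (Python) sketch of what that wiring can look like; the pinpoint_app, firehose, and pinpoint_firehose_role names are assumptions about how the rest of the stack is organized:

import pulumi_aws as aws

# Point the Pinpoint application's event stream at the Firehose delivery
# stream. pinpoint_app, firehose, and pinpoint_firehose_role are assumed to
# be defined elsewhere in the stack.
event_stream = aws.pinpoint.EventStream(
    "telemetry-event-stream",
    application_id=pinpoint_app.application_id,
    destination_stream_arn=firehose.arn,
    role_arn=pinpoint_firehose_role.arn,
)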

Step 3: Firehose buffering and delivery. Firehose accumulates incoming records until either the buffer size (default 5 MB) or the buffer interval (default 300 seconds) trips, whichever happens first. It GZIP-compresses the batch and writes one object to S3 using a Hive-style partitioned prefix.

Step 4: S3 partitioned storage. Objects land in S3 with keys like data/year=2026/month=02/day=22/firehose-telemetry-1-2026-02-22-14-30-00-abc123.gz. This partitioning scheme is what makes Athena queries both fast and cheap. Skip it and you will regret it within a month.

Step 5: Glue schema discovery. A Glue crawler runs on schedule (every 6 hours in my configuration), picks up new partitions, infers the JSON schema from the data, and updates the Glue Data Catalog.

Step 6: Athena queries. Analysts run standard SQL through Athena. Automated reports do the same. Athena pulls table metadata from the Glue Catalog and reads directly from S3, scanning only the partitions your WHERE clause specifies.

S3 Partitioning Strategy

Partitioning is the single most impactful design decision in the entire pipeline. Full stop. Athena charges $5 per TB scanned. A query that scans 100 GB costs $0.50. Partition properly and that same query touches 1 GB: $0.005. I learned this the expensive way on an early project where I skipped partitioning to "move fast" and ended up with a $400 Athena bill in the first month.

The Firehose delivery stream uses Hive-style partitioning with this prefix pattern:

data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/

That produces S3 keys Athena recognizes as partition columns:

s3://myapp-dev-telemetry/
├── data/
│   ├── year=2026/
│   │   ├── month=01/
│   │   │   ├── day=15/
│   │   │   │   ├── telemetry-1-2026-01-15-00-05-00-abc123.gz
│   │   │   │   ├── telemetry-1-2026-01-15-00-10-00-def456.gz
│   │   │   │   └── ...
│   │   │   ├── day=16/
│   │   │   └── ...
│   │   ├── month=02/
│   │   └── ...
│   └── ...
└── errors/
    └── ...

I tier the data with S3 lifecycle policies based on how often it gets queried:

| Storage Tier | Days After Creation | Cost per GB-Month | Use Case |
|---|---|---|---|
| S3 Standard | 0-30 | $0.023 | Active analysis, recent events |
| S3 Standard-IA | 30-90 | $0.0125 | Historical lookback, trend analysis |
| Expired | 90+ | $0.000 | Data deleted per retention policy |

All of these thresholds are configurable. In production deployments with compliance requirements, I extend expiration to 365 days or longer and add a Glacier Deep Archive tier at 180 days for audit retention.
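
A rough sketch of how those tiers translate into Pulumi (Python), assuming telemetry_bucket is the bucket resource defined elsewhere in the stack and using the default thresholds from the table above:

import pulumi_aws as aws

# Tier the telemetry prefix: Standard -> Standard-IA at 30 days, delete at 90.
# Extend expiration or add a Glacier Deep Archive transition for compliance retention.
lifecycle = aws.s3.BucketLifecycleConfigurationV2(
    "telemetry-lifecycle",
    bucket=telemetry_bucket.id,
    rules=[{
        "id": "telemetry-tiering",
        "status": "Enabled",
        "filter": {"prefix": "data/"},
        "transitions": [{"days": 30, "storage_class": "STANDARD_IA"}],
        "expiration": {"days": 90},
    }],
)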

Kinesis Data Firehose Configuration

I use Firehose here instead of Kinesis Data Streams. Firehose requires no shard capacity planning, no consumer code, and no scaling logic. Data Streams gives you lower latency and consumer fan-out, but for telemetry where a few minutes of lag is fine, Firehose removes an entire category of operational work. I have operated Data Streams pipelines in other contexts and the shard management alone justified switching to Firehose for analytics use cases.

| Parameter | Value | Tradeoff |
|---|---|---|
| Buffer size | 5 MB (configurable 1-128 MB) | Larger buffers = fewer S3 objects = lower S3 request costs, but higher delivery latency |
| Buffer interval | 300 seconds (configurable 60-900s) | Shorter intervals = lower latency but more small files; longer intervals = better compression ratios |
| Compression | GZIP | 70-85% compression ratio for JSON telemetry; Athena reads GZIP natively |
| Error handling | Separate error prefix in same bucket | Failed records land in errors/ with diagnostic metadata for debugging |

Firehose flushes when either the buffer size or interval threshold is reached, whichever comes first. During high-volume periods, the size threshold triggers more frequently, producing well-compressed files. During low-volume periods, the interval threshold ensures data is delivered within a predictable window. This adaptive behavior is why Firehose works well from prototype scale to production scale without reconfiguration.

You need the extended_s3 destination type specifically. The basic s3 destination only supports a static prefix, which dumps all data into one directory and forces Athena to scan everything regardless of the time range in your query. I have seen teams make this mistake and wonder why their Athena bills are ten times what they expected.

AWS Glue: Schema Discovery and Cataloging

Glue does two things here: infer the schema and register partitions. The crawler reads sample files from S3, figures out the JSON structure, registers new date partitions as they appear, and keeps everything updated in the Glue Data Catalog. Athena uses that catalog as its metastore.

| Crawler Setting | Value | Why |
|---|---|---|
| Schedule | Every 6 hours (cron(0 */6 * * ? *)) | Balances partition freshness against crawler cost |
| Recrawl behavior | CRAWL_NEW_FOLDERS_ONLY | Only scans new partitions; avoids re-reading historical data |
| Schema change policy | UPDATE_IN_DATABASE / LOG | Evolves schema forward (new columns added); never deletes columns |
| Grouping | CombineCompatibleSchemas | Merges slightly different schemas across partitions into a single table |

Crawler schedule is a cost/freshness tradeoff:

| Schedule | Approx. Monthly Cost | Partition Delay |
|---|---|---|
| Every hour | ~$13/mo | Up to 1 hour |
| Every 6 hours | ~$2.20/mo | Up to 6 hours |
| Every 12 hours | ~$1.10/mo | Up to 12 hours |
| Daily | ~$0.55/mo | Up to 24 hours |

Six hours between data landing in S3 and being queryable in Athena is fine for most analytics work. If you absolutely need near-real-time access, add a Lambda triggered by S3 events that calls batch_create_partition to register partitions immediately. I have done this on one project where the product team wanted a live dashboard. It works, but it adds moving parts that are rarely worth it for batch analytics.
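
For reference, the near-real-time variant is only a short Lambda handler. A sketch with boto3, assuming the telemetry_db database and data table used in the Athena examples later; it copies the table's storage descriptor and points it at the new partition location:

import boto3

glue = boto3.client("glue")

DATABASE = "telemetry_db"  # assumed names, matching the Athena examples below
TABLE = "data"

def register_partition(bucket: str, year: str, month: str, day: str) -> None:
    """Register a date partition so Athena can query it before the next crawl."""
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    storage = dict(table["StorageDescriptor"])
    storage["Location"] = f"s3://{bucket}/data/year={year}/month={month}/day={day}/"
    response = glue.batch_create_partition(
        DatabaseName=DATABASE,
        TableName=TABLE,
        PartitionInputList=[{
            "Values": [year, month, day],
            "StorageDescriptor": storage,
        }],
    )
    # Partitions that already exist come back in response["Errors"]; safe to ignore.

def handler(event, context):
    # Triggered by S3 ObjectCreated events; keys look like
    # data/year=2026/month=02/day=22/telemetry-....gz
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        parts = dict(segment.split("=") for segment in key.split("/")[1:4])
        register_partition(bucket, parts["year"], parts["month"], parts["day"])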

Schema evolution just works. Ship a new iOS build with additional event attributes and the crawler picks up the new columns on its next run. Older data returns NULL for those columns. No migrations, no backfills. I have gone through a dozen schema changes on a single pipeline and never had to touch the Glue configuration.

Athena: Querying Telemetry Data

Athena gives you serverless SQL directly against the S3 data. The workgroup configuration is where you enforce cost controls and keep everyone honest:

| Workgroup Setting | Value | Purpose |
|---|---|---|
| Enforce configuration | true | Prevents users from overriding output location or encryption |
| Bytes scanned cutoff | 1 GB | Kills queries that would scan more than 1 GB (cost protection) |
| Result encryption | SSE-KMS | Encrypts query results at rest |
| CloudWatch metrics | Enabled | Tracks query counts, data scanned, and execution times |

That byte scan cutoff matters more than you think. One careless query without partition filters scans the entire dataset. At $5/TB, a full scan of 100 GB costs $0.50. Sounds small. Now multiply that by a team of analysts running queries all day. The 1 GB cutoff caps any single query at roughly $0.005 and trains analysts to include partition filters fast. People learn when their queries fail.
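
The workgroup itself is only a few lines of infrastructure code. A Pulumi (Python) sketch matching the settings above; results_bucket and athena_kms_key are illustrative names for resources defined elsewhere in the stack:

import pulumi_aws as aws

# Cost-capped workgroup: enforced configuration, encrypted results,
# and a 1 GB per-query scan limit.
workgroup = aws.athena.Workgroup(
    "telemetry-workgroup",
    name="telemetry",
    configuration={
        "enforce_workgroup_configuration": True,
        "publish_cloudwatch_metrics_enabled": True,
        "bytes_scanned_cutoff_per_query": 1024 ** 3,  # 1 GB
        "result_configuration": {
            "output_location": results_bucket.id.apply(lambda name: f"s3://{name}/athena-results/"),
            "encryption_configuration": {
                "encryption_option": "SSE_KMS",
                "kms_key_arn": athena_kms_key.arn,
            },
        },
    },
)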

Some example queries showing how to use partitioning correctly:

-- Daily event counts for the past week (scans ~7 partitions)
SELECT year, month, day, COUNT(*) as event_count
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day >= '15'
GROUP BY year, month, day
ORDER BY day DESC;

-- Top event types for a specific day (scans 1 partition)
SELECT event_type, COUNT(*) as occurrences
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day = '22'
GROUP BY event_type
ORDER BY occurrences DESC
LIMIT 20;

-- Session duration analysis (scans 1 month of data)
SELECT
  DATE(from_iso8601_timestamp(event_timestamp)) as event_date,
  AVG(session_duration) as avg_session_seconds,
  APPROX_PERCENTILE(session_duration, 0.95) as p95_session_seconds
FROM telemetry_db.data
WHERE year = '2026' AND month = '02'
  AND event_type = '_session.stop'
GROUP BY DATE(from_iso8601_timestamp(event_timestamp))
ORDER BY event_date;

Notice every query includes year, month, and day in the WHERE clause. Make this a habit. It keeps your Athena bill predictable and your queries fast.
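
Automated reports can run the same queries through boto3. A minimal sketch, assuming the telemetry workgroup and database names used above:

import time
import boto3

athena = boto3.client("athena")

def run_query(sql: str, database: str = "telemetry_db", workgroup: str = "telemetry"):
    """Run a query in the cost-capped workgroup and return the result rows."""
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Poll until the query finishes; production code should back off and time out.
    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]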

IAM Architecture

Three IAM roles, each scoped to the bare minimum for its integration point. No wildcards. Each trust policy locks down to the specific AWS service that assumes the role:

[Figure: IAM role architecture with trust and permission relationships — pinpoint.amazonaws.com, firehose.amazonaws.com, and glue.amazonaws.com each assume a dedicated role scoped to the Firehose delivery stream, the S3 bucket, and the Glue crawler's S3 read access respectively]

| Role | Trust Principal | Permissions | Scope |
|---|---|---|---|
| Pinpoint → Firehose | pinpoint.amazonaws.com | firehose:PutRecord, firehose:PutRecordBatch | Specific Firehose stream ARN only |
| Firehose → S3 | firehose.amazonaws.com | s3:PutObject, s3:GetObject, s3:ListBucket, multipart upload actions | Specific S3 bucket and objects only |
| Glue Crawler | glue.amazonaws.com | s3:GetObject, s3:ListBucket + AWSGlueServiceRole managed policy | Specific S3 bucket and objects only |

All three roles include aws:SourceAccount conditions in their trust policies (where supported) to block confused deputy attacks. The Firehose role also needs s3:AbortMultipartUpload and s3:ListBucketMultipartUploads because Firehose uses multipart uploads for large buffered deliveries. I have forgotten those permissions before and spent an hour staring at cryptic access denied errors in the Firehose error log.
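
To make that shape concrete, here is a sketch of the Firehose delivery role in Pulumi (Python), with the aws:SourceAccount condition and the multipart upload actions included. The telemetry_bucket reference is an assumption about how the rest of the stack names the bucket resource:

import json
import pulumi_aws as aws

account_id = aws.get_caller_identity().account_id

# Trust policy: only Firehose, acting for this account, may assume the role.
firehose_s3_role = aws.iam.Role(
    "firehose-s3-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
        }],
    }),
)

# Permissions scoped to the telemetry bucket, including the multipart upload
# actions Firehose needs for large buffered deliveries.
aws.iam.RolePolicy(
    "firehose-s3-policy",
    role=firehose_s3_role.id,
    policy=telemetry_bucket.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:AbortMultipartUpload",
                "s3:ListBucketMultipartUploads",
            ],
            "Resource": [arn, f"{arn}/*"],
        }],
    })),
)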

Cost Estimation

Cost predictability is one of the reasons I keep coming back to this architecture. Every component bills on usage with no minimums:

| Component | 10M Events/Month | 100M Events/Month | 1B Events/Month |
|---|---|---|---|
| Pinpoint | $10.00 | $100.00 | $1,000.00 |
| Firehose (ingestion) | $0.87 | $8.70 | $87.00 |
| S3 (storage, 90-day retention) | $1.50 | $15.00 | $150.00 |
| S3 (requests) | $0.50 | $5.00 | $50.00 |
| Glue Crawler (4x daily) | $2.20 | $2.20 | $2.20 |
| Athena (moderate queries) | $5.00 | $25.00 | $75.00 |
| CloudWatch Logs | $1.00 | $5.00 | $30.00 |
| Total | ~$21 | ~$161 | ~$1,394 |

Assumptions: 3 KB average event size (JSON), 80% GZIP compression ratio, Athena scanning 50 GB/month at the 10M tier and scaling proportionally, S3 Standard for the first 30 days then Standard-IA. Pinpoint dominates costs at every tier. If you already have an ingestion layer (API Gateway + Lambda, say), drop Pinpoint and cut costs 40-70%.
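
If you want to sanity-check those estimates against your own event size and volume, the math for the two volume-driven components is simple enough to script. A back-of-the-envelope sketch using the unit prices from the component table (illustrative only; verify against current AWS pricing for your region):

# Rough monthly cost for the two volume-driven components, using the unit
# prices from the component table above. Verify against current AWS pricing.
PINPOINT_PER_EVENT = 0.000001  # $ per event collected
FIREHOSE_PER_GB = 0.029        # $ per GB ingested (uncompressed)

def ingest_cost(events_per_month: int, avg_event_kb: float = 3.0) -> float:
    """Estimated monthly Pinpoint + Firehose cost for a given event volume."""
    ingested_gb = events_per_month * avg_event_kb / 1_000_000
    return events_per_month * PINPOINT_PER_EVENT + ingested_gb * FIREHOSE_PER_GB

# 10M events at 3 KB each: ~$10.00 Pinpoint + ~$0.87 Firehose, in line with the table.
print(f"${ingest_cost(10_000_000):.2f}")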

Where to cut costs further:

| Lever | Impact | Tradeoff |
|---|---|---|
| Replace Pinpoint with API Gateway + Lambda | 40-70% cost reduction | More infrastructure to manage; lose Pinpoint's campaign features |
| Increase Firehose buffer size | Fewer S3 PUT requests | Higher delivery latency |
| Shorten S3 retention | Linear storage cost reduction | Less historical data available |
| Add Glacier tier instead of expiration | 90%+ storage savings for archive data | Minutes-to-hours retrieval time |
| Use columnar format (Parquet via Firehose transform) | 50-80% Athena cost reduction | Requires Firehose data transformation (Lambda) |

Terraform vs. Pulumi: Side-by-Side

I maintain both a Terraform and a Pulumi implementation of this pipeline. They produce identical infrastructure; pick whichever fits your team's workflow. For a deeper comparison of IaC tools, see Infrastructure as Code: CloudFormation, CDK, Terraform, and Pulumi Compared.

| Aspect | Terraform | Pulumi (Python) |
|---|---|---|
| Repository | tf-config-telemetry-pipeline | pul-py-config-telemetry-pipeline |
| Language | HCL (HashiCorp Configuration Language) | Python |
| File structure | One .tf file per resource group | One .py module per resource group |
| State management | Terraform state file (local or remote) | Pulumi service or self-managed backend |
| Variable handling | variables.tf with type constraints | Pulumi.yaml config with Python helpers |
| Dynamic values | String interpolation, jsonencode() | Output.apply() with lambda functions |
| IAM policies | Inline jsonencode() blocks | json.dumps() inside apply() callbacks |
| Total lines of code | ~350 | ~400 |

The Firehose delivery stream is the most interesting resource to compare because it has complex nested configuration and cross-resource references:

Terraform:

resource "aws_kinesis_firehose_delivery_stream" "telemetry" {
  name        = "${local.name_prefix}-telemetry"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_s3.arn
    bucket_arn = aws_s3_bucket.telemetry.arn
    prefix     = "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"

    buffering_size     = var.firehose_buffer_size_mb
    buffering_interval = var.firehose_buffer_interval_seconds
    compression_format = "GZIP"
  }
}

Pulumi (Python):

firehose = aws.kinesis.FirehoseDeliveryStream(
    "telemetry-firehose",
    name=f"{name_prefix}-telemetry",
    destination="extended_s3",
    extended_s3_configuration={
        "role_arn": firehose_s3_role.arn,
        "bucket_arn": telemetry_bucket.arn,
        "prefix": "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "buffering_size": config["firehose_buffer_size_mb"],
        "buffering_interval": config["firehose_buffer_interval_seconds"],
        "compression_format": "GZIP",
    },
)

The structural similarity is intentional; both tools use declarative resource definitions backed by the same AWS provider. Pulumi gives you Python's full type system, IDE support, and real programming constructs for dynamic logic. Terraform's HCL is purpose-built for infrastructure and keeps things simpler for engineers who do not want to think in Python while writing infra. I tend to reach for Pulumi on projects where the infrastructure logic has conditionals and loops, and Terraform when the setup is straightforward.

Common Failure Modes

Every one of these has bitten me in production at least once. Save yourself the debugging time:

| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Firehose delivery failures | Records appear in errors/ prefix | IAM role permissions insufficient; bucket policy conflict | Verify Firehose role has s3:PutObject on the exact bucket ARN including /* suffix |
| Glue crawler finds no tables | Empty database after crawler completes | Crawler S3 target path does not match Firehose prefix; missing trailing slash | Ensure crawler path is s3://bucket/data/ (with trailing slash) matching Firehose prefix |
| Athena query scans entire dataset | High query cost; slow execution | Query missing partition filters (year, month, day in WHERE clause) | Always include partition columns in WHERE; use workgroup byte cutoff as safety net |
| Schema mismatch after app update | NULL values in new columns; query errors | Crawler has not run since new event attributes were deployed | Run crawler manually after deploying app changes; or reduce crawler interval |
| Small file problem | Thousands of tiny S3 objects per day | Buffer interval too short for the event volume | Increase buffer interval to 900s and buffer size to 64-128 MB for high-volume streams |
| Pinpoint event stream lag | Events delayed minutes-to-hours | Pinpoint event stream disabled or Firehose throttled | Check Pinpoint event stream configuration; verify Firehose is not in SUSPENDED state |
| Athena workgroup query failures | Queries fail immediately | Byte scan cutoff too low for the query pattern | Increase bytes_scanned_cutoff_per_query; or optimize query to scan fewer partitions |
| Crawler DPU timeout | Crawler fails to complete | Too many small files; schema too complex | Enable CRAWL_NEW_FOLDERS_ONLY; use CombineCompatibleSchemas grouping |

Key Architectural Recommendations

  1. Always partition by date. This is the one decision that makes or breaks the economics of the whole pipeline. Without Hive-style date partitions, every Athena query scans every byte you have ever stored.
  2. Set Firehose buffer interval to at least 300 seconds. Shorter intervals create swarms of tiny S3 objects. Each one adds GET request overhead during Athena scans. Five minutes balances latency against the small-file problem well enough for analytics.
  3. Use GZIP compression. 70-85% storage reduction. Athena reads GZIP natively. Firehose handles compression; your pipeline sees zero performance impact. There is no reason to skip this.
  4. Configure CRAWL_NEW_FOLDERS_ONLY. I cannot stress this enough. Without it, the Glue crawler re-reads every historical file on every run. On a pipeline with six months of data, that turns a $2/month crawler into a $20/month one for no benefit.
  5. Enforce the Athena workgroup configuration. Set enforce_workgroup_configuration = true. Otherwise someone will accidentally write query results to an unencrypted bucket or blow past the byte scan cutoff.
  6. Scope IAM roles to specific resource ARNs. Never Resource: "*" on Firehose or Glue roles. If an ARN exists, use it. Limits the blast radius when (not if) something goes sideways.
  7. Set up lifecycle policies from day one. Telemetry accumulates faster than you expect. A 90-day retention policy with a 30-day Standard-to-IA transition keeps storage costs from surprising you in month three.
  8. Monitor the Firehose errors/ prefix. Delivery failures are completely silent unless you watch for them. Set up an S3 event notification or a CloudWatch alarm; a sketch of the notification setup follows this list. I once lost two days of telemetry before noticing because I skipped this step on an early deployment.
  9. Start with Firehose, not Data Streams. Shard management is complexity you do not need for analytics telemetry. If sub-second latency or consumer fan-out becomes a requirement later, you can add a Data Stream upstream without redesigning anything.
  10. Codify everything. Terraform or Pulumi, I do not care which. Just do not make Firehose buffer or Glue crawler changes in the console. Manual tweaks drift, and when they cause an incident at 2 AM, nobody remembers what was changed.
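
On the error monitoring point in recommendation 8, the notification itself is a few lines. A Pulumi (Python) sketch, assuming an existing SNS topic named alerts_topic wired to whatever paging or chat integration you already use:

import pulumi_aws as aws

# Fire a notification whenever Firehose writes a failed record under errors/.
# telemetry_bucket and alerts_topic are assumed to exist elsewhere in the stack.
error_notification = aws.s3.BucketNotification(
    "telemetry-error-notification",
    bucket=telemetry_bucket.id,
    topics=[{
        "topic_arn": alerts_topic.arn,
        "events": ["s3:ObjectCreated:*"],
        "filter_prefix": "errors/",
    }],
)

The SNS topic also needs a resource policy that lets s3.amazonaws.com publish to it, or the notification configuration will fail to apply.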


Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.