
iOS Telemetry Pipeline with Kinesis, Glue, and Athena

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Any iOS app with real users generates telemetry. Session starts, feature usage, error events, performance metrics, purchase funnels. Most teams start by shipping all of it to Amplitude or Mixpanel and calling it done. That works for a while. Then the monthly invoice triples, you discover the vendor's data model cannot answer a question your PM asked three days ago, and you realize you are paying somebody else to store your data in a format optimized for their business.

I have deployed the pipeline documented here across several production iOS applications. Pinpoint handles ingestion, Kinesis Data Firehose delivers records reliably to S3, Glue discovers the schema automatically, and Athena gives you full SQL over the raw data. No servers. Scales from zero to billions of events per month. I provide the full infrastructure code in both Terraform and Pulumi so you can pick whichever fits your team.

Why a Dedicated Telemetry Pipeline

The first pushback I always hear: "Amplitude already does this." Sure. It does, at a price, with their query model, and with your data locked in their infrastructure. Once you evaluate cost, flexibility, and data ownership together, the case for owning the pipeline gets hard to argue against.

| Dimension | Third-Party Platform | AWS Telemetry Pipeline |
|---|---|---|
| Data ownership | Vendor stores your data; export often limited or delayed | You own the S3 bucket and query anytime with any tool |
| Query flexibility | Constrained to vendor's UI and query model | Full SQL via Athena; join with any other dataset |
| Data residency | Limited region selection; dependent on vendor's infrastructure | Deploy in any AWS region; meets sovereignty requirements |
| Retention | Often tiered pricing for longer retention | S3 lifecycle policies; pennies per GB-month |
| Integration | API/webhook exports; often batched or rate-limited | Direct S3 access; feed into ML pipelines, data lakes, BI tools |
| Schema control | Vendor's event schema with limited customization | Your schema, your structure, your partitioning |

The numbers at scale speak for themselves:

| Monthly Event Volume | Amplitude (Growth) | Mixpanel (Growth) | AWS Pipeline (estimated) |
|---|---|---|---|
| 10M events | ~$1,000/mo | ~$1,100/mo | ~$25/mo |
| 100M events | ~$4,500/mo | ~$5,000/mo | ~$180/mo |
| 1B events | ~$20,000+/mo (Enterprise) | ~$25,000+/mo (Enterprise) | ~$1,400/mo |

Those AWS estimates include Pinpoint, Firehose, S3, Glue crawler runs, and moderate Athena query volume. At a billion events per month, this pipeline costs roughly 15x less. The gap only widens with longer retention windows because S3 storage is pennies compared to what analytics vendors charge for historical data access.

Pipeline Architecture

Standard streaming ingestion pattern. Each component does one thing:

[Figure: iOS telemetry pipeline architecture — Amplify SDK → Amazon Pinpoint → Kinesis Data Firehose → partitioned S3 storage → Glue crawler (schema discovery) → Glue Data Catalog → Athena]

I chose each component for a specific reason:

| Component | Role | Scaling Model | Cost Driver |
|---|---|---|---|
| AWS Amplify SDK | Client-side event capture and batching | Runs on device; batches events automatically | Free (client-side) |
| Amazon Pinpoint | Event ingestion and fan-out | Fully managed; scales automatically | $0.000001 per event collected |
| Kinesis Data Firehose | Reliable buffered delivery to S3 | Auto-scales; no shard management | $0.029 per GB ingested |
| Amazon S3 | Durable, partitioned event storage | Infinite scale; pay per GB stored | $0.023/GB (Standard), $0.0125/GB (IA) |
| AWS Glue Crawler | Automatic schema discovery from S3 data | Runs on schedule; DPU-hour billing | ~$0.44 per run (minimal DPU) |
| Glue Data Catalog | Centralized metadata and table definitions | Managed; free tier covers most use cases | Free for first 1M objects |
| Amazon Athena | Ad-hoc SQL queries over S3 data | Serverless; per-query billing | $5.00 per TB scanned |

There are zero always-on compute resources here. Everything is either event-driven (Pinpoint, Firehose) or invoked on demand (Glue crawler, Athena queries). Process zero events, pay close to nothing. Process a billion events a month, still pay a fraction of what Amplitude would charge. I have run this pipeline on a side project that generated maybe 500 events per day and the bill rounded to zero for months.

Data Flow: From Tap to Query

Follow a single telemetry event from the user tapping a button all the way to an analyst running a query. This walkthrough shows why each component exists and where the handoffs happen:

[Figure: telemetry event lifecycle from device to query — iOS App → Amplify SDK (recordEvent, local batching) → Pinpoint (submitEvents) → Firehose (PutRecord, size/time buffering) → S3 (GZIP PUT under data/year=2026/month=02/day=22/) → Glue Crawler (new partitions, schema updates) → Glue Catalog → Athena (metadata lookup, partitioned scan, results)]

Step 1: Client-side event capture. The iOS app uses the AWS Amplify Analytics SDK to record events. The SDK batches locally, retries on failure, and queues events when the device is offline. Batching keeps network overhead low and avoids hammering the user's battery.

Step 2: Pinpoint ingestion. Pinpoint receives the event batches and tacks on metadata: application ID, client context, session information. Then it forwards each event to the Kinesis Data Firehose delivery stream you configure in the event stream settings.
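
In the infrastructure code, this handoff is a single event stream resource. A minimal Pulumi (Python) sketch of what that wiring can look like; the pinpoint_app, firehose, and pinpoint_firehose_role names are assumptions about how the rest of the stack is organized:

import pulumi_aws as aws

# Point the Pinpoint application's event stream at the Firehose delivery
# stream. pinpoint_app, firehose, and pinpoint_firehose_role are assumed to
# be defined elsewhere in the stack.
event_stream = aws.pinpoint.EventStream(
    "telemetry-event-stream",
    application_id=pinpoint_app.application_id,
    destination_stream_arn=firehose.arn,
    role_arn=pinpoint_firehose_role.arn,
)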

Step 3: Firehose buffering and delivery. Firehose accumulates incoming records until either the buffer size (default 5 MB) or the buffer interval (default 300 seconds) trips, whichever happens first. It GZIP-compresses the batch and writes one object to S3 using a Hive-style partitioned prefix.

Step 4: S3 partitioned storage. Objects land in S3 with keys like data/year=2026/month=02/day=22/firehose-telemetry-1-2026-02-22-14-30-00-abc123.gz. This partitioning scheme is what makes Athena queries both fast and cheap. Skip it and you will regret it within a month.

Step 5: Glue schema discovery. A Glue crawler runs on schedule (every 6 hours in my configuration), picks up new partitions, infers the JSON schema from the data, and updates the Glue Data Catalog.

Step 6: Athena queries. Analysts run standard SQL through Athena. Automated reports do the same. Athena pulls table metadata from the Glue Catalog and reads directly from S3, scanning only the partitions your WHERE clause specifies.

S3 Partitioning Strategy

Partitioning is the single most impactful design decision in the entire pipeline. Full stop. Athena charges $5 per TB scanned. A query that scans 100 GB costs $0.50. Partition properly and that same query touches 1 GB: $0.005. I learned this the expensive way on an early project where I skipped partitioning to "move fast" and ended up with a $400 Athena bill in the first month.

The Firehose delivery stream uses Hive-style partitioning with this prefix pattern:

data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/

That produces S3 keys Athena recognizes as partition columns:

s3://myapp-dev-telemetry/
├── data/
│   ├── year=2026/
│   │   ├── month=01/
│   │   │   ├── day=15/
│   │   │   │   ├── telemetry-1-2026-01-15-00-05-00-abc123.gz
│   │   │   │   ├── telemetry-1-2026-01-15-00-10-00-def456.gz
│   │   │   │   └── ...
│   │   │   ├── day=16/
│   │   │   └── ...
│   │   ├── month=02/
│   │   └── ...
│   └── ...
└── errors/
    └── ...

I tier the data with S3 lifecycle policies based on how often it gets queried:

| Storage Tier | Days After Creation | Cost per GB-Month | Use Case |
|---|---|---|---|
| S3 Standard | 0-30 | $0.023 | Active analysis, recent events |
| S3 Standard-IA | 30-90 | $0.0125 | Historical lookback, trend analysis |
| Expired | 90+ | $0.000 | Data deleted per retention policy |

All of these thresholds are configurable. In production deployments with compliance requirements, I extend expiration to 365 days or longer and add a Glacier Deep Archive tier at 180 days for audit retention.
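
A rough sketch of how those tiers translate into Pulumi (Python), assuming telemetry_bucket is the bucket resource defined elsewhere in the stack and using the default thresholds from the table above:

import pulumi_aws as aws

# Tier the telemetry prefix: Standard -> Standard-IA at 30 days, delete at 90.
# Extend expiration or add a Glacier Deep Archive transition for compliance retention.
lifecycle = aws.s3.BucketLifecycleConfigurationV2(
    "telemetry-lifecycle",
    bucket=telemetry_bucket.id,
    rules=[{
        "id": "telemetry-tiering",
        "status": "Enabled",
        "filter": {"prefix": "data/"},
        "transitions": [{"days": 30, "storage_class": "STANDARD_IA"}],
        "expiration": {"days": 90},
    }],
)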

Kinesis Data Firehose Configuration

I use Firehose here instead of Kinesis Data Streams. Firehose requires no shard capacity planning, no consumer code, and no scaling logic. Data Streams gives you lower latency and consumer fan-out, but for telemetry where a few minutes of lag is fine, Firehose removes an entire category of operational work. I have operated Data Streams pipelines in other contexts and the shard management alone justified switching to Firehose for analytics use cases.

| Parameter | Value | Tradeoff |
|---|---|---|
| Buffer size | 5 MB (configurable 1-128 MB) | Larger buffers = fewer S3 objects = lower S3 request costs, but higher delivery latency |
| Buffer interval | 300 seconds (configurable 60-900s) | Shorter intervals = lower latency but more small files; longer intervals = better compression ratios |
| Compression | GZIP | 70-85% compression ratio for JSON telemetry; Athena reads GZIP natively |
| Error handling | Separate error prefix in same bucket | Failed records land in errors/ with diagnostic metadata for debugging |

Firehose flushes when either the buffer size or interval threshold is reached, whichever comes first. During high-volume periods, the size threshold triggers more frequently, producing well-compressed files. During low-volume periods, the interval threshold ensures data is delivered within a predictable window. This adaptive behavior is why Firehose works well from prototype scale to production scale without reconfiguration.

You need the extended_s3 destination type specifically. The basic s3 destination only supports a static prefix, which dumps all data into one directory and forces Athena to scan everything regardless of the time range in your query. I have seen teams make this mistake and wonder why their Athena bills are ten times what they expected.

AWS Glue: Schema Discovery and Cataloging

Glue does two things here: infer the schema and register partitions. The crawler reads sample files from S3, figures out the JSON structure, registers new date partitions as they appear, and keeps everything updated in the Glue Data Catalog. Athena uses that catalog as its metastore.

| Crawler Setting | Value | Why |
|---|---|---|
| Schedule | Every 6 hours (cron(0 */6 * * ? *)) | Balances partition freshness against crawler cost |
| Recrawl behavior | CRAWL_NEW_FOLDERS_ONLY | Only scans new partitions; avoids re-reading historical data |
| Schema change policy | UPDATE_IN_DATABASE / LOG | Evolves schema forward (new columns added); never deletes columns |
| Grouping | CombineCompatibleSchemas | Merges slightly different schemas across partitions into a single table |

Crawler schedule is a cost/freshness tradeoff:

| Schedule | Approx. Monthly Cost | Partition Delay |
|---|---|---|
| Every hour | ~$13/mo | Up to 1 hour |
| Every 6 hours | ~$2.20/mo | Up to 6 hours |
| Every 12 hours | ~$1.10/mo | Up to 12 hours |
| Daily | ~$0.55/mo | Up to 24 hours |

Six hours between data landing in S3 and being queryable in Athena is fine for most analytics work. If you absolutely need near-real-time access, add a Lambda triggered by S3 events that calls batch_create_partition to register partitions immediately. I have done this on one project where the product team wanted a live dashboard. It works, but it adds moving parts that are rarely worth it for batch analytics.
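
For reference, the near-real-time variant is only a short Lambda handler. A sketch with boto3, assuming the telemetry_db database and data table used in the Athena examples later; it copies the table's storage descriptor and points it at the new partition location:

import boto3

glue = boto3.client("glue")

DATABASE = "telemetry_db"  # assumed names, matching the Athena examples below
TABLE = "data"

def register_partition(bucket: str, year: str, month: str, day: str) -> None:
    """Register a date partition so Athena can query it before the next crawl."""
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    storage = dict(table["StorageDescriptor"])
    storage["Location"] = f"s3://{bucket}/data/year={year}/month={month}/day={day}/"
    response = glue.batch_create_partition(
        DatabaseName=DATABASE,
        TableName=TABLE,
        PartitionInputList=[{
            "Values": [year, month, day],
            "StorageDescriptor": storage,
        }],
    )
    # Partitions that already exist come back in response["Errors"]; safe to ignore.

def handler(event, context):
    # Triggered by S3 ObjectCreated events; keys look like
    # data/year=2026/month=02/day=22/telemetry-....gz
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        parts = dict(segment.split("=") for segment in key.split("/")[1:4])
        register_partition(bucket, parts["year"], parts["month"], parts["day"])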

Schema evolution just works. Ship a new iOS build with additional event attributes and the crawler picks up the new columns on its next run. Older data returns NULL for those columns. No migrations, no backfills. I have gone through a dozen schema changes on a single pipeline and never had to touch the Glue configuration.

Athena: Querying Telemetry Data

Athena gives you serverless SQL directly against the S3 data. The workgroup configuration is where you enforce cost controls and keep everyone honest:

| Workgroup Setting | Value | Purpose |
|---|---|---|
| Enforce configuration | true | Prevents users from overriding output location or encryption |
| Bytes scanned cutoff | 1 GB | Kills queries that would scan more than 1 GB (cost protection) |
| Result encryption | SSE-KMS | Encrypts query results at rest |
| CloudWatch metrics | Enabled | Tracks query counts, data scanned, and execution times |

That byte scan cutoff matters more than you think. One careless query without partition filters scans the entire dataset. At $5/TB, a full scan of 100 GB costs $0.50. Sounds small. Now multiply that by a team of analysts running queries all day. The 1 GB cutoff caps any single query at roughly $0.005 and trains analysts to include partition filters fast. People learn when their queries fail.
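
The workgroup itself is only a few lines of infrastructure code. A Pulumi (Python) sketch matching the settings above; results_bucket and athena_kms_key are illustrative names for resources defined elsewhere in the stack:

import pulumi_aws as aws

# Cost-capped workgroup: enforced configuration, encrypted results,
# and a 1 GB per-query scan limit.
workgroup = aws.athena.Workgroup(
    "telemetry-workgroup",
    name="telemetry",
    configuration={
        "enforce_workgroup_configuration": True,
        "publish_cloudwatch_metrics_enabled": True,
        "bytes_scanned_cutoff_per_query": 1024 ** 3,  # 1 GB
        "result_configuration": {
            "output_location": results_bucket.id.apply(lambda name: f"s3://{name}/athena-results/"),
            "encryption_configuration": {
                "encryption_option": "SSE_KMS",
                "kms_key_arn": athena_kms_key.arn,
            },
        },
    },
)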

Some example queries showing how to use partitioning correctly:

-- Daily event counts for the past week (scans ~7 partitions)
SELECT year, month, day, COUNT(*) as event_count
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day >= '15'
GROUP BY year, month, day
ORDER BY day DESC;

-- Top event types for a specific day (scans 1 partition)
SELECT event_type, COUNT(*) as occurrences
FROM telemetry_db.data
WHERE year = '2026' AND month = '02' AND day = '22'
GROUP BY event_type
ORDER BY occurrences DESC
LIMIT 20;

-- Session duration analysis (scans 1 month of data)
SELECT
  DATE(from_iso8601_timestamp(event_timestamp)) as event_date,
  AVG(session_duration) as avg_session_seconds,
  APPROX_PERCENTILE(session_duration, 0.95) as p95_session_seconds
FROM telemetry_db.data
WHERE year = '2026' AND month = '02'
  AND event_type = '_session.stop'
GROUP BY DATE(from_iso8601_timestamp(event_timestamp))
ORDER BY event_date;

Notice every query includes year, month, and day in the WHERE clause. Make this a habit. It keeps your Athena bill predictable and your queries fast.
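
Automated reports can run the same queries through boto3. A minimal sketch, assuming the telemetry workgroup and database names used above:

import time
import boto3

athena = boto3.client("athena")

def run_query(sql: str, database: str = "telemetry_db", workgroup: str = "telemetry"):
    """Run a query in the cost-capped workgroup and return the result rows."""
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Poll until the query finishes; production code should back off and time out.
    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]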

IAM Architecture

Three IAM roles, each scoped to the bare minimum for its integration point. No wildcards. Each trust policy locks down to the specific AWS service that assumes the role:

[Figure: IAM role architecture with trust and permission relationships — pinpoint.amazonaws.com, firehose.amazonaws.com, and glue.amazonaws.com each assume a dedicated role scoped to the Firehose delivery stream, the S3 bucket, and the Glue crawler's S3 read access respectively]

| Role | Trust Principal | Permissions | Scope |
|---|---|---|---|
| Pinpoint → Firehose | pinpoint.amazonaws.com | firehose:PutRecord, firehose:PutRecordBatch | Specific Firehose stream ARN only |
| Firehose → S3 | firehose.amazonaws.com | s3:PutObject, s3:GetObject, s3:ListBucket, multipart upload actions | Specific S3 bucket and objects only |
| Glue Crawler | glue.amazonaws.com | s3:GetObject, s3:ListBucket + AWSGlueServiceRole managed policy | Specific S3 bucket and objects only |

All three roles include aws:SourceAccount conditions in their trust policies (where supported) to block confused deputy attacks. The Firehose role also needs s3:AbortMultipartUpload and s3:ListBucketMultipartUploads because Firehose uses multipart uploads for large buffered deliveries. I have forgotten those permissions before and spent an hour staring at cryptic access denied errors in the Firehose error log.
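
To make that shape concrete, here is a sketch of the Firehose delivery role in Pulumi (Python), with the aws:SourceAccount condition and the multipart upload actions included. The telemetry_bucket reference is an assumption about how the rest of the stack names the bucket resource:

import json
import pulumi_aws as aws

account_id = aws.get_caller_identity().account_id

# Trust policy: only Firehose, acting for this account, may assume the role.
firehose_s3_role = aws.iam.Role(
    "firehose-s3-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
        }],
    }),
)

# Permissions scoped to the telemetry bucket, including the multipart upload
# actions Firehose needs for large buffered deliveries.
aws.iam.RolePolicy(
    "firehose-s3-policy",
    role=firehose_s3_role.id,
    policy=telemetry_bucket.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:AbortMultipartUpload",
                "s3:ListBucketMultipartUploads",
            ],
            "Resource": [arn, f"{arn}/*"],
        }],
    })),
)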

Cost Estimation

Cost predictability is one of the reasons I keep coming back to this architecture. Every component bills on usage with no minimums:

| Component | 10M Events/Month | 100M Events/Month | 1B Events/Month |
|---|---|---|---|
| Pinpoint | $10.00 | $100.00 | $1,000.00 |
| Firehose (ingestion) | $0.87 | $8.70 | $87.00 |
| S3 (storage, 90-day retention) | $1.50 | $15.00 | $150.00 |
| S3 (requests) | $0.50 | $5.00 | $50.00 |
| Glue Crawler (4x daily) | $2.20 | $2.20 | $2.20 |
| Athena (moderate queries) | $5.00 | $25.00 | $75.00 |
| CloudWatch Logs | $1.00 | $5.00 | $30.00 |
| Total | ~$21 | ~$161 | ~$1,394 |

Assumptions: 3 KB average event size (JSON), 80% GZIP compression ratio, Athena scanning 50 GB/month at the 10M tier and scaling proportionally, S3 Standard for the first 30 days then Standard-IA. Pinpoint dominates costs at every tier. If you already have an ingestion layer (API Gateway + Lambda, say), drop Pinpoint and cut costs 40-70%.
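
If you want to sanity-check those estimates against your own event size and volume, the math for the two volume-driven components is simple enough to script. A back-of-the-envelope sketch using the unit prices from the component table (illustrative only; verify against current AWS pricing for your region):

# Rough monthly cost for the two volume-driven components, using the unit
# prices from the component table above. Verify against current AWS pricing.
PINPOINT_PER_EVENT = 0.000001  # $ per event collected
FIREHOSE_PER_GB = 0.029        # $ per GB ingested (uncompressed)

def ingest_cost(events_per_month: int, avg_event_kb: float = 3.0) -> float:
    """Estimated monthly Pinpoint + Firehose cost for a given event volume."""
    ingested_gb = events_per_month * avg_event_kb / 1_000_000
    return events_per_month * PINPOINT_PER_EVENT + ingested_gb * FIREHOSE_PER_GB

# 10M events at 3 KB each: ~$10.00 Pinpoint + ~$0.87 Firehose, in line with the table.
print(f"${ingest_cost(10_000_000):.2f}")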

Where to cut costs further:

| Lever | Impact | Tradeoff |
|---|---|---|
| Replace Pinpoint with API Gateway + Lambda | 40-70% cost reduction | More infrastructure to manage; lose Pinpoint's campaign features |
| Increase Firehose buffer size | Fewer S3 PUT requests | Higher delivery latency |
| Shorten S3 retention | Linear storage cost reduction | Less historical data available |
| Add Glacier tier instead of expiration | 90%+ storage savings for archive data | Minutes-to-hours retrieval time |
| Use columnar format (Parquet via Firehose transform) | 50-80% Athena cost reduction | Requires Firehose data transformation (Lambda) |

Terraform vs. Pulumi: Side-by-Side

I maintain both a Terraform and a Pulumi implementation of this pipeline. They produce identical infrastructure; pick whichever fits your team's workflow. For a deeper comparison of IaC tools, see Infrastructure as Code: CloudFormation, CDK, Terraform, and Pulumi Compared.

| Aspect | Terraform | Pulumi (Python) |
|---|---|---|
| Repository | tf-config-telemetry-pipeline | pul-py-config-telemetry-pipeline |
| Language | HCL (HashiCorp Configuration Language) | Python |
| File structure | One .tf file per resource group | One .py module per resource group |
| State management | Terraform state file (local or remote) | Pulumi service or self-managed backend |
| Variable handling | variables.tf with type constraints | Pulumi.yaml config with Python helpers |
| Dynamic values | String interpolation, jsonencode() | Output.apply() with lambda functions |
| IAM policies | Inline jsonencode() blocks | json.dumps() inside apply() callbacks |
| Total lines of code | ~350 | ~400 |

The Firehose delivery stream is the most interesting resource to compare because it has complex nested configuration and cross-resource references:

Terraform:

resource "aws_kinesis_firehose_delivery_stream" "telemetry" {
  name        = "${local.name_prefix}-telemetry"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_s3.arn
    bucket_arn = aws_s3_bucket.telemetry.arn
    prefix     = "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"

    buffering_size     = var.firehose_buffer_size_mb
    buffering_interval = var.firehose_buffer_interval_seconds
    compression_format = "GZIP"
  }
}

Pulumi (Python):

firehose = aws.kinesis.FirehoseDeliveryStream(
    "telemetry-firehose",
    name=f"{name_prefix}-telemetry",
    destination="extended_s3",
    extended_s3_configuration={
        "role_arn": firehose_s3_role.arn,
        "bucket_arn": telemetry_bucket.arn,
        "prefix": "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "buffering_size": config["firehose_buffer_size_mb"],
        "buffering_interval": config["firehose_buffer_interval_seconds"],
        "compression_format": "GZIP",
    },
)

The structural similarity is intentional; both tools use declarative resource definitions backed by the same AWS provider. Pulumi gives you Python's full type system, IDE support, and real programming constructs for dynamic logic. Terraform's HCL is purpose-built for infrastructure and keeps things simpler for engineers who do not want to think in Python while writing infra. I tend to reach for Pulumi on projects where the infrastructure logic has conditionals and loops, and Terraform when the setup is straightforward.

Common Failure Modes

Every one of these has bitten me in production at least once. Save yourself the debugging time:

| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Firehose delivery failures | Records appear in errors/ prefix | IAM role permissions insufficient; bucket policy conflict | Verify Firehose role has s3:PutObject on the exact bucket ARN including /* suffix |
| Glue crawler finds no tables | Empty database after crawler completes | Crawler S3 target path does not match Firehose prefix; missing trailing slash | Ensure crawler path is s3://bucket/data/ (with trailing slash) matching Firehose prefix |
| Athena query scans entire dataset | High query cost; slow execution | Query missing partition filters (year, month, day in WHERE clause) | Always include partition columns in WHERE; use workgroup byte cutoff as safety net |
| Schema mismatch after app update | NULL values in new columns; query errors | Crawler has not run since new event attributes were deployed | Run crawler manually after deploying app changes; or reduce crawler interval |
| Small file problem | Thousands of tiny S3 objects per day | Buffer interval too short for the event volume | Increase buffer interval to 900s and buffer size to 64-128 MB for high-volume streams |
| Pinpoint event stream lag | Events delayed minutes-to-hours | Pinpoint event stream disabled or Firehose throttled | Check Pinpoint event stream configuration; verify Firehose is not in SUSPENDED state |
| Athena workgroup query failures | Queries fail immediately | Byte scan cutoff too low for the query pattern | Increase bytes_scanned_cutoff_per_query; or optimize query to scan fewer partitions |
| Crawler DPU timeout | Crawler fails to complete | Too many small files; schema too complex | Enable CRAWL_NEW_FOLDERS_ONLY; use CombineCompatibleSchemas grouping |

Key Architectural Recommendations

  1. Always partition by date. This is the one decision that makes or breaks the economics of the whole pipeline. Without Hive-style date partitions, every Athena query scans every byte you have ever stored.
  2. Set Firehose buffer interval to at least 300 seconds. Shorter intervals create swarms of tiny S3 objects. Each one adds GET request overhead during Athena scans. Five minutes balances latency against the small-file problem well enough for analytics.
  3. Use GZIP compression. 70-85% storage reduction. Athena reads GZIP natively. Firehose handles compression; your pipeline sees zero performance impact. There is no reason to skip this.
  4. Configure CRAWL_NEW_FOLDERS_ONLY. I cannot stress this enough. Without it, the Glue crawler re-reads every historical file on every run. On a pipeline with six months of data, that turns a $2/month crawler into a $20/month one for no benefit.
  5. Enforce the Athena workgroup configuration. Set enforce_workgroup_configuration = true. Otherwise someone will accidentally write query results to an unencrypted bucket or blow past the byte scan cutoff.
  6. Scope IAM roles to specific resource ARNs. Never Resource: "*" on Firehose or Glue roles. If an ARN exists, use it. Limits the blast radius when (not if) something goes sideways.
  7. Set up lifecycle policies from day one. Telemetry accumulates faster than you expect. A 90-day retention policy with a 30-day Standard-to-IA transition keeps storage costs from surprising you in month three.
  8. Monitor the Firehose errors/ prefix. Delivery failures are completely silent unless you watch for them. Set up an S3 event notification or a CloudWatch alarm; a sketch of the notification setup follows this list. I once lost two days of telemetry before noticing because I skipped this step on an early deployment.
  9. Start with Firehose, not Data Streams. Shard management is complexity you do not need for analytics telemetry. If sub-second latency or consumer fan-out becomes a requirement later, you can add a Data Stream upstream without redesigning anything.
  10. Codify everything. Terraform or Pulumi, I do not care which. Just do not make Firehose buffer or Glue crawler changes in the console. Manual tweaks drift, and when they cause an incident at 2 AM, nobody remembers what was changed.
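
On the error monitoring point in recommendation 8, the notification itself is a few lines. A Pulumi (Python) sketch, assuming an existing SNS topic named alerts_topic wired to whatever paging or chat integration you already use:

import pulumi_aws as aws

# Fire a notification whenever Firehose writes a failed record under errors/.
# telemetry_bucket and alerts_topic are assumed to exist elsewhere in the stack.
error_notification = aws.s3.BucketNotification(
    "telemetry-error-notification",
    bucket=telemetry_bucket.id,
    topics=[{
        "topic_arn": alerts_topic.arn,
        "events": ["s3:ObjectCreated:*"],
        "filter_prefix": "errors/",
    }],
)

The SNS topic also needs a resource policy that lets s3.amazonaws.com publish to it, or the notification configuration will fail to apply.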


Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.