Amazon Web Services

Contents: [ Exporting Mixpanel Data to Redshift Spectrum , S3 Bucket Permissions , Glue Configurations , Schema , Table Example , Queries ]
[Mixpanel Amazon AWS Export Design, Data Modification Policy, Server-Side Encryption, Encryption with Amazon S3-Managed Keys (SSE-S3), Encryption with AWS KMS-Managed Keys (SSE-KMS), S3 Access Role, Configuring Glue for Mixpanel Direct Export, Note , Configuring Glue to Use Crawlers, Using One Table for All Events, Using One Table for Each Event, Note , One Table for All Events, One Table for Each Event, Nested and Repeated Fields]

Visit the Dev Docs

A better version of this page exists at https://developer.mixpanel.com/docs/mixpanel-amazon-s3-export.

Mixpanel's Data Warehouse Export lets you export your Mixpanel data directly into an S3 bucket, allowing the use of Glue to query it.

After your data is cataloged in Glue, you can use SQL with multiple AWS products, including Amazon Athena and Redshift Spectrum, to query the imported data.

Mixpanel Amazon AWS Export Design

To set up the Mixpanel AWS pipeline, you must configure AWS to receive the exported data, then set up the pipeline using the Data Warehouse Export API.

We recommend the use of Spectrum and Glue to query the imported data.

Mixpanel applies transformation rules to make the data compatible with data warehouses.

Exporting Mixpanel Data to Redshift Spectrum

The following summarizes the steps to export Mixpanel data to an S3 bucket. Consult AWS documentation for AWS specific tasks, such as creating an S3 bucket and permissions.

To prepare S3 for the incoming data:

  1. Create a S3 bucket.
  2. Give Mixpanel the required permissions to write to the bucket.
  3. Configure Glue as shown below.

S3 Bucket Permissions

Mixpanel supports a wide range of configurations to secure and manage your data on S3. To access resources, the pipeline uses AWS cross-account roles.

This section highlights the permissions you must give Mixpanel depending on the configuration of the target S3 bucket.

Data Modification Policy

All exports from Mixpanel to AWS require that you create a new data modification policy, or add the following permissions to an existing data modification policy.

Replacing <BUCKET_NAME> with your bucket name before inserting this JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SomeSidYouChoose",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>",
                "arn:aws:s3:::<BUCKET_NAME>/*"
            ]
        }
    ]
}

Server-Side Encryption

Mixpanel always sends data to your S3 bucket on a TLS encrypted connection. To secure your data at rest on S3, you can enable Server-Side Encryption (SSE).

There are two options when using SSE: Encryption with Amazon S3-Managed Keys (SSE-S3) and Encryption with AWS KMS-Managed Keys (SSE-KMS)

Encryption with Amazon S3-Managed Keys (SSE-S3)

This setting on your bucket encrypts data at rest using the AES-256 algorithm that uses keys managed by S3.

If you are using this type of SSE, you only need to configure your pipeline by passing the s3_encryption=aes parameter when calling the Mixpanel Data Warehouse Export API. See Additional AWS S3 and Glue Parameters.

Encryption with AWS KMS-Managed Keys (SSE-KMS)

You have a choice of keys if you use the Key Management Service (KMS).

For S3 buckets, you can pick a default key named aws/s3. If you opt to use the default key you don’t need any further configuration on AWS, and only need to configure your pipeline by passing s3_encryption=kms when calling the Mixpanel Data Warehouse Export API.

If you choose to use your own custom keys for encrypting the contents of your bucket, you will need to allow Mixpanel to use the key to encrypt the data properly as it is written to your bucket.

To achieve this, create an IAM policy that gives permission to Mixpanel to use the KMS key. Use the following JSON snippet and replace <KEY_ARN> with your custom key’s ARN:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SomeSidYouChooseAgain",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ReEncryptTo",
                "kms:GenerateDataKeyWithoutPlaintext",
                "kms:DescribeKey",
                "kms:ReEncryptFrom"
            ],
            "Resource": "<KEY_ARN>"
        }
    ]
}

You must configure your pipeline by passing s3_encryption=kms and s3_kms_key_id=<KEY_ARN> when calling the Mixpanel Data Warehouse Export API.

S3 Access Role

After creating the policies in the sections above, you must create a cross account IAM Role to assign the policies to the role.

  • Go to the AWS IAM service on the console.
  • Click Roles in the sidebar.
  • Click Create Role.
  • Select Other AWS Accounts on the trust policy page and enter "485438090326" for the account ID.
  • In the Permissions page, find and select the policies you created above.
  • In the Review page, enter a name and description for the role and click Save.

Next, limit the trust relationship to the Mixpanel export user to ensure only Mixpanel has the ability to assume this specific role.

  • Navigate to the AWS IAM service in the console.
  • Click Roles in the sidebar.
  • Find and click the role you just created.
  • Navigate to the Trust Relationships tab.
  • Click Edit trust relationship.
  • Replace the contents with the following JSON:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::485438090326:user/mixpanel-export"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}

Glue Configurations

Glue offers a data catalog service that will facilitate access to the S3 data from other services on your AWS account.

This section describes how to connect Glue to the exported data in S3. You can select either of the following options:

  • Configure Glue for Mixpanel direct export (recommended)
  • Configure Glue to use crawlers.

Configuring Glue for Mixpanel Direct Export

Mixpanel can write and update a schema in your Glue instance as soon as new data is available. To set this up:

  1. Create a Glue database.
    • In the AWS console, go to "Glue". (Make sure you are in the same region as your S3 exported data).
    • Click Databases in the sidebar.
    • Click Add Database.
    • Enter a database name and click Create.
  1. Create an IAM policy.
    • Go to the "AWS IAM" service on the console.
    • Click Policies in the sidebar.
    • Click Create Policy on the top of the screen.
    • In the new screen, click the JSON tab.
    • Paste the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MixpanelGlueAccessStatement",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateTable",
                "glue:GetTables",
                "glue:GetTableVersions",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTable"
            ],
            "Resource": "*"
        }
    ]
}
  • Click Review Policy and give the policy a name in the next screen.
  • Click Create Policy to save.

Note

AWS does not support granular resources when granting Glue access. So you will have to use * (asterisk) in the Resources field until AWS can support more granular resources.

  1. Create an IAM Role and assign the Policy you created in Step 2.

    • Go to the "AWS IAM" service on the console.
    • Click Roles in the sidebar.
    • Click Create Role.
    • Select Other AWS Accounts on the trust policy page and enter "485438090326" for the account ID.
    • In the "Permissions" page, find and select the policy you created in Step 2.
    • In the Review page, enter a name and description for the role and click Save.
  2. Limit the trust relationship to the Mixpanel export user.

    • Navigate to the "AWS IAM" service in the console.
    • Click Roles in the sidebar.
    • Find and click the role you created in Step 3.
    • Navigate to the "Trust Relationships" tab.
    • Click Edit trust relationship.
    • Replace the contents with the following JSON:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::485438090326:user/mixpanel-export"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
  • Save the contents.

Configuring Glue to Use Crawlers

You can configure Glue to crawl the S3 bucket with Mixpanel data. Glue crawlers are convenient because they automatically extract the schema from the data files and update the Glue schema.

This convenience, however, can also pose issues, such as:

  • The run time of the crawlers could be out of sync with Mixpanel’s export schedule. As a result, it can prevent you from using your data as soon as new properties are available.
  • The time column is interpreted as an integer. As a result, it is more difficult to perform timestamp operations.
  • When using the one table per event schema option, Glue crawlers can merge data from multiple events in one table based on similarity.

To avoid these issues, Mixpanel can write and update a schema in your Glue instance as soon as new data is available. Follow these instructions to enable Mixpanel to write your data catalog to AWS Glue.

To use crawlers, you must point the crawler to the top level folder with your Mixpanel project ID.

For more details to configure Glue to use crawlers, see Cataloging Tables with a Crawler.

Setting up Redshift

When the data catalog and table definitions are available in Glue through either of the aforementioned means, you can connect your Redshift cluster to the catalog and query it from Redshift. AWS documentation walks through the process in detail.

The following is an overview of the process. To properly configure Redshift:

  1. Create an IAM role with read access to Glue and the S3 bucket containing your Mixpanel data.
  2. Assign that role to your Redshift cluster.
  3. After you insert your role and the database name in the following SQL query, run it as cluster admin:
CREATE EXTERNAL SCHEMA mixpanel FROM DATA CATALOG 
DATABASE '<YOUR_GLUE_DATABASE_NAME>' -- defined when you configured Glue
IAM_ROLE '<YOUR_ROLE_ARN>' -- this is the ARN for the role with access to Glue+S3
CREATE EXTERNAL DATABASE IF NOT EXISTS;

You only need to connect Redshift to Glue once. As the Mixpanel pipeline exports more data to your bucket and Glue catalog, you will automatically have access to all the new data and schema changes.

You can also join data stored in Redshift with Mixpanel data available in S3 through the external schema.

Schema

This section describes the schemas for AWS. Event data and people data are loaded into separate tables. People profile data is loaded into one table, where events can be loaded into either a single table for each event or one table for all events.

Using One Table for All Events

In this schema, all your Mixpanel data exists in a single table. Mixpanel recommends a single table schema because it makes querying simpler.

If you select the single table schema, the export pipeline creates a mp_master_event table. The table has one column per unique property name across all events in the history of the dataset.

Suppose you have an event with a purchase_amount property and one with a referrer property, the table will have a purchase_amount column and a referrer column. Events that don’t have a property contain a NULL value in that column.

For examples of one table for all events, see One Table for All Events.

Using One Table for Each Event

Note

One table for each event is not available during the trial period.

Each Mixpanel event exists in a single table. This schema is a useful workaround for limitations in the number of columns that are allowed in a S3 table.

If you select the multi-table option, Mixpanel creates one table per unique event name. Each table is named for the event. For example, if you have an event called signup there will be a table named signup.

Each table for each event has one column per unique property name across all events in the history of the dataset.

For an example of one table for each event, see One Table for Each Event.

Table Example

Here is an example dataset to view different schema options in either one table for all events or one table for each event.

{
       “event”: “Signup”, 
       “properties”: {
       “plan”: “free”, 
       “browser”: “chrome”, 
       “distinct_id”: 1, 
       “time”: 123
       }
}
{
       “event”: “Purchase”, 
       “properties”: {
       “product_id”: “123”, 
       “browser”: “chrome”, 
       “distinct_id”: 1, 
       “time”: 124
       }
}
{
       “event”: “Signup”, 
       “properties”: {
       “plan”: “paid”, 
       “browser”: “firefox”, 
       “ab_test”: 1, 
       “distinct_id”: 2, 
       “time”: 125
       }

One Table for All Events

Single table: mixpanel_events

mp_event_name
time
distinct_id
plan
browser
product_id
ab_test

“Signup”

123

1

“free”

“chrome”

NULL

NULL

“Purchase”

124

1

NULL

“chrome”

"123"

NULL

“Purchase”

125

2

“paid”

“chrome”

NULL

1

One Table for Each Event

Two tables: signup and purchase

Table: signup

time
distinct_id
plan
browser
ab_test

123

1

“free”

“chrome”

NULL

125

2

“paid”

“firefox”

1

Table: purchase

time
distinct_id
browser
product_id

124

1

“chrome”

“123”

Queries

You can query data with a single table schema or with a multiple table schema in Redshift Spectrum.

To query a single table schema, use this snippet.

SELECT count(*)
FROM mixpanel.mixpanel_events
WHERE mp_event_name = “signup”

To query a multiple table schema, use this snippet.

SELECT count(*)
FROM mixpanel.signup;

Nested and Repeated Fields

Redshift Spectrum does not support nested and repeated fields.

Mixpanel exports array and object properties as a string. You can use built in functions to convert the string to JSON at query time.

Is this article helpful?
0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.