使用redshift加载S3桶数据

1. 创建S3桶

区域选择东京，创建S3桶s3-redshift-dw

2. 准备数据

2.1. 创建策略从其它账号S3桶复制数据

2.1.1. 创建策略 AcrossAccountCopyS3

2.1.2. 对dw_user用户添加策略 AcrossAccountCopyS3

2.1.3. 创建并下载dw_user的访问密钥

2.1.4. 使用访问密钥从其它账号S3桶中复制测试数据

3. 创建redshift

3.1. 对dw_user用户添加策略

https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2.html

AmazonRedshiftReadOnlyAccess / AmazonRedshiftQueryEditor / AWSKeyManagementServicePowerUser / AmazonRedshiftQueryEditorV2FullAccess / AmazonRedshiftQueryEditorV2ReadSharing / AmazonRedshiftQueryEditorV2NoSharing / AmazonRedshiftQueryEditorV2ReadWriteSharing

3.2. 创建用于redshift的S3只读权限的角色

3.3. 使用控制台创建Redshift

3.3.1. 在Redshift控制台点击右上角区域，选择东京

3.3.2. 创建Redshift

点击集群 → 创建集群，在新打开的页面中进行如下操作：

输入集群标识符 redshift-dw-test01
集群用途选择生产
节点大小选择 ra3.4xlarge
AQUA (高级查询加速器) 默认自动
节点数改成 4
不勾选加载示例数据
勾选自动生成密码或输入数据管理员用户密码（此步骤输入的密码 _1234Qwer）
集群权限选择 3.2 中创建的 RedshiftRole-S3ReadOnly
点击页面下方创建集群，等待集群创建

3.4. 连接Redshift数据库

点击集群 → 连接到Redshift集群 → 查询数据，打开Redshift操作页面

3.5. 创建表

将创建表的SQL语句粘贴到命令窗口，点击Run，执行SQL语句创建表：

create table slim_shred (
  user_pseudo_id varchar(60),
  advertising_id varchar(60),
  event_timestamp int8,
  event_name varchar(60),
  country varchar(20),
  version varchar(10),
  id varchar(60),
  user_first_touch_timestamp int8,
  event_params super
);

创建完成后，在public下可以看到表slim_shred

4. 从S3加载数据

4.1. 执行以下命令，从S3加载数据

copy slim_shred from 's3://s3-redshift-dw/trans03/part'
iam_role 'arn:aws:iam::961692981863:role/RedshiftRole-S3ReadOnly'
json 'auto ignorecase' MAXERROR 1000;

4.2. 执行过程报错

在执行过程中出现因country字段长度超过20，version字段长度超过10，导致插入数据失败

4.3. 对表结构进行调整

alter table slim_shred ALTER COLUMN country TYPE varchar(50);
alter table slim_shred ALTER COLUMN version TYPE varchar(20);

4.4. 重新执行加载命令

再次执行命令，经过22分钟21秒加载完成

4.5. 执行查询SQL

执行以下SQL，经过59秒查询出结果，共921条

SELECT
  first_open_date, country, event_name,
  DATE_DIFF('d', retention_date, first_open_date) AS date_diff,
  COUNT(DISTINCT first_uid) AS users,
  COUNT(first_uid) AS events
FROM ((
  SELECT
    DATE(date_add('h',-8,date_add('us',event_timestamp,'1970-01-01'))) AS first_open_date,
    country, user_pseudo_id AS first_uid
  FROM slim_shred
  WHERE event_name = '19d2da1d0d3c1cd08a318cbd5c04a3b0'
  GROUP BY first_open_date, country, first_uid)
  INNER JOIN (
  SELECT
    DATE(date_add('h',-8,date_add('us',event_timestamp,'1970-01-01'))) AS retention_date,
    event_name, user_pseudo_id AS retention_uid
  FROM slim_shred
  WHERE event_name IN ('e37b4b622992267149d25437cbb75237')
  GROUP BY retention_date, event_name, retention_uid)
  ON first_uid = retention_uid)
GROUP BY first_open_date, country, event_name, date_diff
HAVING date_diff BETWEEN 0 AND 15
ORDER BY first_open_date, country, date_diff

4.6. 执行重复SQL

在新窗口中再次执行上述SQL，经过7毫秒，查询出结果

返回技术博客

1. Create an S3 Bucket

Select the Tokyo region and create an S3 bucket named s3-redshift-dw.

2. Prepare Data

2.1. Create a Policy to Copy Data from Another Account's S3 Bucket

2.1.1. Create the policy AcrossAccountCopyS3

2.1.2. Attach the AcrossAccountCopyS3 policy to the dw_user user

2.1.3. Create and download access keys for dw_user

2.1.4. Use the access keys to copy test data from the other account's S3 bucket

3. Create a Redshift Cluster

3.1. Attach the required policies to dw_user

Attach the following policies: AmazonRedshiftReadOnlyAccess, AmazonRedshiftQueryEditor, AWSKeyManagementServicePowerUser, AmazonRedshiftQueryEditorV2FullAccess, AmazonRedshiftQueryEditorV2ReadSharing, AmazonRedshiftQueryEditorV2NoSharing, AmazonRedshiftQueryEditorV2ReadWriteSharing

3.2. Create an IAM Role with S3 Read-Only Access for Redshift

3.3. Create the Redshift Cluster via Console

3.3.1. In the Redshift console, click the region selector in the top-right corner and choose Tokyo

3.3.2. Create the Redshift cluster

Click Clusters → Create cluster and configure as follows:

Cluster identifier: redshift-dw-test01
Cluster purpose: Production
Node type: ra3.4xlarge
AQUA (Advanced Query Accelerator): Auto (default)
Number of nodes: 4
Uncheck "Load sample data"
Auto-generate a password or enter an admin password (e.g., _1234Qwer)
Cluster permissions: select the RedshiftRole-S3ReadOnly role created in step 3.2
Click Create cluster and wait for provisioning to complete

3.4. Connect to the Redshift Database

Click Clusters → Connect to Redshift cluster → Query data to open the Redshift query interface.

3.5. Create a Table

Paste the following SQL into the query window and click Run:

create table slim_shred (
  user_pseudo_id varchar(60),
  advertising_id varchar(60),
  event_timestamp int8,
  event_name varchar(60),
  country varchar(20),
  version varchar(10),
  id varchar(60),
  user_first_touch_timestamp int8,
  event_params super
);

After creation, the table slim_shred will appear under the public schema.

4. Load Data from S3

4.1. Run the COPY command to load data from S3

copy slim_shred from 's3://s3-redshift-dw/trans03/part'
iam_role 'arn:aws:iam::961692981863:role/RedshiftRole-S3ReadOnly'
json 'auto ignorecase' MAXERROR 1000;

4.2. Error During Execution

The load failed because the country field exceeded 20 characters and the version field exceeded 10 characters.

4.3. Alter the Table Schema

alter table slim_shred ALTER COLUMN country TYPE varchar(50);
alter table slim_shred ALTER COLUMN version TYPE varchar(20);

4.4. Re-run the Load Command

Run the COPY command again. The load completed in 22 minutes and 21 seconds.

4.5. Run a Query

Execute the following SQL. The query returned 921 rows in 59 seconds.

SELECT
  first_open_date, country, event_name,
  DATE_DIFF('d', retention_date, first_open_date) AS date_diff,
  COUNT(DISTINCT first_uid) AS users,
  COUNT(first_uid) AS events
FROM ((
  SELECT
    DATE(date_add('h',-8,date_add('us',event_timestamp,'1970-01-01'))) AS first_open_date,
    country, user_pseudo_id AS first_uid
  FROM slim_shred
  WHERE event_name = '19d2da1d0d3c1cd08a318cbd5c04a3b0'
  GROUP BY first_open_date, country, first_uid)
  INNER JOIN (
  SELECT
    DATE(date_add('h',-8,date_add('us',event_timestamp,'1970-01-01'))) AS retention_date,
    event_name, user_pseudo_id AS retention_uid
  FROM slim_shred
  WHERE event_name IN ('e37b4b622992267149d25437cbb75237')
  GROUP BY retention_date, event_name, retention_uid)
  ON first_uid = retention_uid)
GROUP BY first_open_date, country, event_name, date_diff
HAVING date_diff BETWEEN 0 AND 15
ORDER BY first_open_date, country, date_diff

4.6. Re-run the Same Query

Running the same query again in a new window returned results in just 7 milliseconds (cached).

Back to Tech Blog

Loading S3 Bucket Data with Redshift