Migrating Data from BigQuery to S3 with an AWS Glue Custom Connector

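1. Download the service account credentials JSON file from Google Cloud

1.1. Click IAM & Admin → Service Accounts to open the service account management page

1.2. Select the account and open its key management page

1.3. Create a new key and select the JSON type; after you click Create, the JSON key file is downloaded to your machine automatically

1.4. Base64-encode the key file

On Windows, use an online tool to do the encoding.

On Linux and Mac, run base64 service_account_json_file.json and copy the printed output into a text file, making sure to remove any line breaks.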
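As an alternative to the tools above, the encoding can be done locally on any operating system with a few lines of Python. This is a minimal sketch, not part of the original walkthrough; the function name encode_key_file is hypothetical, and the file name is the one used in step 1.4.

```python
import base64
from pathlib import Path


def encode_key_file(path: str) -> str:
    """Return the contents of the key file as a single-line base64 string."""
    raw = Path(path).read_bytes()
    # b64encode does not insert line breaks, so no manual cleanup is needed.
    return base64.b64encode(raw).decode("ascii")
```

Calling encode_key_file("service_account_json_file.json") returns the string to paste into Secrets Manager in step 3.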

2. Get the Google Cloud project ID

Click the project name to find the corresponding project ID, and save it for later use.

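3. Store the credentials in AWS Secrets Manager as the secret bigquery_credentials_poc

In the Secrets Manager console, create a secret and choose "Other type of secret". Enter credentials as the key and the base64-encoded string as the value.

Enter the secret name bigquery_credentials_poc, leave the other options at their defaults, and click Store to create the secret.

In the secret list, click bigquery_credentials_poc to open its details, and copy the secret ARN for later use:
arn:aws:secretsmanager:us-west-2:260527533511:secret:bigquery_credentials_poc-wCHyT3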
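The same secret could also be created from a script instead of the console. The sketch below assumes boto3 is installed and AWS credentials are configured; the helper names are hypothetical, and the region is taken from the ARN shown above.

```python
import json


def build_secret_string(encoded_key: str) -> str:
    """Build the key/value payload: key 'credentials', value = base64 string."""
    return json.dumps({"credentials": encoded_key})


def create_bigquery_secret(
    encoded_key: str,
    name: str = "bigquery_credentials_poc",
    region: str = "us-west-2",
) -> str:
    """Create the secret and return its ARN (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.create_secret(
        Name=name,
        SecretString=build_secret_string(encoded_key),
    )
    return response["ARN"]
```

The returned ARN is the value that the policy in step 5 refers to.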

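4. Create the S3 bucket s3-redshift-glue

4.1. Create the bucket s3-redshift-glue in the Tokyo region (ap-northeast-1), with public access blocked

Copy the bucket ARN for later use: arn:aws:s3:::s3-redshift-glue

4.2. Create a folder in the bucket to hold the data

The folder can be named after the table being exported; for this walkthrough, create the folder 311_service_requests.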
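For readers who prefer scripting step 4, the following is a rough boto3 sketch under the same assumptions as the console steps (bucket s3-redshift-glue, Tokyo region, public access blocked). The helper names are hypothetical, and bucket names are globally unique, so the name would need to change in practice.

```python
# Settings that block every form of public access on the bucket.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def create_bucket_request(bucket: str, region: str) -> dict:
    """Keyword arguments for create_bucket in a region other than us-east-1."""
    return {
        "Bucket": bucket,
        "CreateBucketConfiguration": {"LocationConstraint": region},
    }


def folder_key(folder: str) -> str:
    """S3 has no real folders; a key ending in '/' is shown as one in the console."""
    return folder.strip("/") + "/"


def create_private_bucket(
    bucket: str = "s3-redshift-glue",
    region: str = "ap-northeast-1",
    folder: str = "311_service_requests",
) -> None:
    """Create the bucket, block public access, and add the folder placeholder."""
    import boto3  # imported lazily; needs AWS credentials at call time

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_request(bucket, region))
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK,
    )
    s3.put_object(Bucket=bucket, Key=folder_key(folder))
```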

5. Create the policy policy_secrets_s3

Create a policy from the following JSON, which allows access to the secret bigquery_credentials_poc and the S3 bucket s3-redshift-glue:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GetDescribeSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "arn:aws:secretsmanager:us-west-2:260527533511:secret:bigquery_credentials_poc-wCHyT3"
    },
    {
      "Sid": "S3Policy",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation","s3:ListBucket","s3:GetBucketAcl",
        "s3:GetObject","s3:PutObject","s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::s3-redshift-glue",
        "arn:aws:s3:::s3-redshift-glue/*"
      ]
    }
  ]
}
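When the secret ARN or bucket name differs from the ones in this walkthrough, the policy document can be generated rather than edited by hand. The sketch below reproduces the JSON above from two parameters; the function name build_policy is hypothetical.

```python
import json


def build_policy(secret_arn: str, bucket: str) -> dict:
    """Return the policy document above for a given secret ARN and bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GetDescribeSecret",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetResourcePolicy",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:ListSecretVersionIds",
                ],
                "Resource": secret_arn,
            },
            {
                "Sid": "S3Policy",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation", "s3:ListBucket", "s3:GetBucketAcl",
                    "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                ],
                # Bucket-level actions need the bucket ARN, object-level
                # actions need the "/*" form, so both are listed.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }


def policy_json(secret_arn: str, bucket: str) -> str:
    """Serialize the policy for pasting into the IAM console's JSON editor."""
    return json.dumps(build_policy(secret_arn, bucket), indent=2)
```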

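6. Create the IAM role roleGlueBigqueryS3

Create the role roleGlueBigqueryS3, choose Glue as the trusted entity type, and attach the following three policies:

AmazonEC2ContainerRegistryReadOnly
AWSGlueServiceRole
policy_secrets_s3 (the policy created in step 5)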
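Choosing Glue as the trusted entity amounts to a trust policy that lets the Glue service assume the role. The sketch below shows that document and the ARNs of the three policies; the function names and the account_id parameter are hypothetical, and the customer-managed policy is assumed to be the one created in step 5 (policy_secrets_s3).

```python
import json


def glue_trust_policy() -> dict:
    """Trust policy allowing the AWS Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def role_policy_arns(account_id: str) -> list:
    """ARNs of the two AWS managed policies plus the policy from step 5."""
    return [
        "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
        f"arn:aws:iam::{account_id}:policy/policy_secrets_s3",
    ]
```

With boto3, json.dumps(glue_trust_policy()) would be passed as AssumeRolePolicyDocument to create_role, and each ARN to attach_role_policy.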

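7. Subscribe to AWS Glue Connector for Google BigQuery on AWS Marketplace

7.1. Search for and open AWS Glue Connector for Google BigQuery

7.2. Choose Continue to Subscribe

7.3. Review the terms and conditions, pricing, and other details

7.4. Choose Continue to Configuration

7.5. Select the fulfillment option and software version, then choose Continue to Launch

7.6. Under Usage Instructions, choose to activate the Glue connector in AWS Glue Studio

7.7. Create the Glue connection: bigquery

Enter the connection name bigquery and, for the secret, select bigquery_credentials_poc.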

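8. Create an ETL job in AWS Glue Studio

8.1. Create the job

In Glue Studio, select Jobs
For the source, choose AWS Glue Connector for Google BigQuery
For the target, choose S3
Click Create

8.2. Remove ApplyMapping

Select the ApplyMapping node and delete it.

8.3. Configure the AWS Glue Connector

8.3.1. For the connection, select bigquery

8.3.2. Add the following key-value pairs to the connection options

Key: parentProject, value: the project ID obtained in step 2
Key: table, value: bigquery-public-data.austin_311.311_service_requests

8.4. Configure S3

8.4.1. Select the data format for the export to S3 (JSON for this POC)

8.4.2. Select the compression type

8.4.3. Specify the S3 target location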
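The settings in 8.3 and 8.4 correspond to a source read and a sink write in the job script. The following is only a sketch of what such a script might look like, not the script Glue Studio generates: the connection type "marketplace.spark" and the connectionName option are assumptions based on how Marketplace connectors are usually referenced, and the helper names are hypothetical. Compare it with the script shown in the job's Script tab before relying on it.

```python
def bigquery_source_options(
    parent_project: str,
    table: str,
    connection_name: str = "bigquery",
) -> dict:
    """Connection options from step 8.3 plus the connection created in 7.7."""
    return {
        "parentProject": parent_project,
        "table": table,
        "connectionName": connection_name,
    }


def s3_sink_options(bucket: str, folder: str) -> dict:
    """Target location from step 8.4.3, expressed as an s3:// path."""
    return {"path": f"s3://{bucket}/{folder}/"}


def run_job(parent_project: str) -> None:
    """Read from BigQuery and write JSON to S3.

    Runs only inside an AWS Glue job, where awsglue and pyspark exist.
    """
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",  # assumption, see lead-in
        connection_options=bigquery_source_options(
            parent_project,
            "bigquery-public-data.austin_311.311_service_requests",
        ),
        transformation_ctx="bigquery_source",
    )
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        format="json",
        connection_options=s3_sink_options(
            "s3-redshift-glue", "311_service_requests"
        ),
        transformation_ctx="s3_target",
    )
```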

8.5. Configure Job details

Name: Glue_BigQuery_S3
IAM Role: roleGlueBigqueryS3 (the role created in step 6)
Type: Spark
Glue version: Glue 2.0 - Supports Spark 2.4, Scala 2, Python 3
Leave the remaining options at their defaults and click Save

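9. Run the ETL job

9.1. Click Run at the top right of the job details page to start the job

9.2. Click Runs to check the execution status

9.3. After the run succeeds, check the files in the S3 bucket

9.4. Download a file and inspect its format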
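A downloaded file can be inspected with a short script. The sketch below assumes the job wrote JSON with one object per line (the usual layout for Spark JSON output) and that gzip-compressed files end in .gz; the function name read_records is hypothetical.

```python
import gzip
import json
from itertools import islice


def read_records(path: str, limit: int = 5) -> list:
    """Return up to `limit` records from a JSON Lines file, gzip or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # Skip blank lines so trailing newlines do not break parsing.
        lines = (line for line in handle if line.strip())
        return [json.loads(line) for line in islice(lines, limit)]
```

Printing the keys of the first record gives a quick view of the exported columns, which is useful when defining the Redshift table later.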

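10. Summary

This completes the synchronization of data from BigQuery to an S3 bucket using the Glue custom connector. To load the data into Redshift, create the corresponding table in Redshift, then use the COPY command to import the S3 data into it.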
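As a starting point for the Redshift load, the sketch below assembles a COPY statement for the JSON files written by the job. The table name service_requests_311 and the Redshift role ARN are hypothetical placeholders (the table must already exist, and unquoted Redshift identifiers cannot start with a digit); the REGION clause is included because the bucket is in Tokyo and may not be in the same region as the cluster.

```python
def build_copy_statement(
    table: str,
    s3_path: str,
    iam_role_arn: str,
    region: str = "",
    gzip_compressed: bool = False,
) -> str:
    """Assemble a Redshift COPY statement for JSON files stored in S3."""
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        # 'auto' maps JSON keys to table columns with matching names.
        "FORMAT AS JSON 'auto'",
    ]
    if gzip_compressed:
        parts.append("GZIP")
    if region:
        parts.append(f"REGION '{region}'")
    return "\n".join(parts) + ";"


# Hypothetical values for illustration only.
EXAMPLE_COPY = build_copy_statement(
    table="service_requests_311",
    s3_path="s3://s3-redshift-glue/311_service_requests/",
    iam_role_arn="arn:aws:iam::123456789012:role/hypothetical-redshift-role",
    region="ap-northeast-1",
)
```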

