You need to enable JavaScript to run this app.
导航

Iceberg与Trino/Presto集成

最近更新时间2023.09.04 10:28:53

首次发布时间2022.04.27 18:51:54

在 Trino 或 Presto 上可以实现对 Icerberg 表的操作。火山引擎 E-MapReduce(EMR)默认配置了Trino(或 Presto)和 Icerberg 的连接器。本文介绍下在 Trino 中操作 Iceberg 表,同样的方法也可以操作在 Presto 中操作。

1 前提条件

  1. 适用于 EMR1.2.0 以后版本(包含1.2.0版本)

  2. 已创建 EMR 集群,且安装有 Iceberg 组件和 Trino 组件。有两种方式可以安装 Iceberg 和 Trino 组件:

    1. 在创建 EMR 集群时,选择 Icerberg 和Trino作为可选组件,详见:创建集群

    2. 对已安装 EMR 集群,参考 服务管理章节 添加 Iceberg 和 Trino 服务。

2 操作步骤

2.1 打开 Trino 客户端并使用 Iceberg 连接器获取Metastore

参考 Trino组件操作, 在 Trino的 cli 命令后面添加下 Iceberg 的连接信息

--catalog iceberg --schema default

参数描述说明如下:

参数说明
catalog iceberg:iceberg 为使用的 Iceberg 的 catalog 名称
schema defaultdefault 为使用的 Iceberg 的 schema 名称

2.2 基本操作

详情参考 Trino官网Presto官网

2.2.1 创建表

CREATE TABLE iceberg.default.customer_orders (
    order_id BIGINT,
    order_date DATE,
    account_number BIGINT,
    customer VARCHAR,
    country VARCHAR
 )
 WITH (
    format = 'ORC',
    partitioning = ARRAY['bucket(account_number, 10)','country']
 )

2.2.2 写入数据

INSERT INTO iceberg.default.customer_orders VALUES (2, CAST('2021-01-12' AS DATE), 670011, 'bcd', 'ca');
INSERT INTO iceberg.default.customer_orders VALUES (2, CAST('2021-01-12' AS DATE), 670012, 'bcde', 'cn');

2.2.3 查询数据

查询语句示例:

SELECT * FROM iceberg.default.customer_orders;

输出示例:

order_id | order_date | account_number | customer | country
----------+------------+----------------+----------+---------
       1 | 2021-10-11 |        1002123 | abc      | us
       2 | 2021-01-12 |         670011 | bcd      | ca
       2 | 2022-02-11 |        9070011 | efg      | cn
       2 | 2022-01-01 |         560012 | zxc      | us
(4 rows)

增加筛选条件:

SELECT * FROM iceberg.default.customer_orders WHERE account_number = 670012;

2.2.4 查看执行计划

EXPLAIN ANALYZE SELECT * SELECT iceberg.default.customer_orders WHERE account_number = 670012;

2.2.5 查看表信息

DESC customer_orders;

2.2.6 时间旅行(Time Travel)

查看快照:

SELECT * FROM "customer_orders$snapshots"

输出示例:

committed_at              |     snapshot_id     |      parent_id      | operation |
---------------------------------------+---------------------+---------------------+-----------+------------------------------------------------------------------
 2022-01-13 13:34:00.657 Asia/Shanghai | 1942301704603642054 |                NULL | append    | hdfs://emr-master-1:8020/warehouse/tablespace/managed/hive/custom
 2022-01-13 13:39:18.665 Asia/Shanghai | 2127546335909550653 | 1942301704603642054 | append    | hdfs://emr-master-1:8020/warehouse/tablespace/managed/hive/custom
 2022-01-17 19:56:28.807 Asia/Shanghai | 6215604771056148566 | 2127546335909550653 | append    | hdfs://emr-master-1:8020/warehouse/tablespace/managed/hive/custom
 2022-01-17 19:57:12.160 Asia/Shanghai | 7079051850206849569 | 6215604771056148566 | append    | hdfs://emr-master-1:8020/warehouse/tablespace/managed/hive/custom
 2022-01-17 19:57:55.779 Asia/Shanghai | 3727192433427494890 | 7079051850206849569 | append    | hdfs://emr-master-1:8020/warehouse/tablespace/managed/hive/custom
(5 rows)

时间旅行查询:

SELECT count(*) FROM "customer_orders@6215604771056148566";
 _col0
-------
     2
(1 row)

SELECT count(*) FROM "customer_orders@3727192433427494890";
 _col0
-------
     4
(1 row)

2.2.7 回滚到特定版本

CALL iceberg.system.rollback_to_snapshot('default', 'customer_orders', 3727192433427494890);

2.2.8 分区

SELECT * FROM "customer_orders$partitions";
partition               | record_count | file_count | total_size |
---------------------------------------+--------------+------------+------------+---------------------------------------------------------------------------------
 {account_number_bucket=9, country=us} |            1 |          1 |        714 | {order_id={min=2, max=2, null_count=0}, order_date={min=2022-01-01, max=2022-01-
 {account_number_bucket=2, country=ca} |            1 |          1 |        711 | {order_id={min=2, max=2, null_count=0}, order_date={min=2021-01-12, max=2021-01-
 {account_number_bucket=1, country=us} |            1 |          1 |        711 | {order_id={min=1, max=1, null_count=0}, order_date={min=2021-10-11, max=2021-10-
 {account_number_bucket=5, country=cn} |            1 |          1 |        716 | {order_id={min=2, max=2, null_count=0}, order_date={min=2022-02-11, max=2022-02-