To support metadata disaster-recovery backups for LAS Catalog, LAS Catalog provides a metadata export tool. This article walks through a concrete hands-on exercise to help users understand the whole export process.
First, create a database under this catalog and write some tables and partitions into it.
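As a sketch, the test objects might be created like this (the database name, table names, and year=2017-N partition scheme mirror the output shown later in this walkthrough; the exact DDL and the id column are assumptions):

```sql
-- Hypothetical DDL matching the objects seen later in this walkthrough:
-- a database las_exporter and partitioned tables exporter_table_1000_N
-- with 1000 partitions of the form year=2017-<i>.
CREATE DATABASE IF NOT EXISTS las_exporter;
USE las_exporter;
CREATE TABLE IF NOT EXISTS exporter_table_1000_0 (id INT)
PARTITIONED BY (year STRING);
-- add the first few of the 1000 partitions
ALTER TABLE exporter_table_1000_0 ADD PARTITION (year='2017-0');
ALTER TABLE exporter_table_1000_0 ADD PARTITION (year='2017-1');
```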
Set metastore.catalog.default=las_exporter and restart HiveMetastore.
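On an EMR cluster this is typically configured in hive-site.xml (the property name comes from the step above; the file path is the one referenced later in the exporter configuration, and the value is the catalog name used in this walkthrough):

```xml
<!-- /etc/emr/hive/conf/hive-site.xml -->
<property>
  <name>metastore.catalog.default</name>
  <value>las_exporter</value>
</property>
```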
mode: las_to_hive
lasClientInfo:
  akMode: AUTO
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
hmsClientInfo:
  hiveConfPath: /etc/emr/hive/conf/hive-site.xml
runOptions:
  includeCatalogPrefixs: [las_exporter]
  includeDatabasePrefixs: [las_exporter]
  includeTablePrefixs: [exporter_table_1000]
  batchSize: 1000
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
hadoop fs -put export_to_hive.yaml ./export_to_hive.yaml
wget https://lasformation-cn-beijing.tos-cn-beijing.ivolces.com/las-exporter/application.jar
hadoop fs -put application.jar ./application.jar
If there are many partitions, you can increase num-executors accordingly, for example 100 executors for 1,000,000 partitions, which speeds up the write. The last parameter must be set to the user's TOS path.
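That sizing rule of thumb can be sketched as follows (the 1:10,000 ratio is simply the "100 executors for 1,000,000 partitions" example above, not a documented formula):

```shell
# Heuristic only: scale num-executors with the partition count,
# roughly 1 executor per 10,000 partitions (from the example above).
PARTITIONS=1000000
NUM_EXECUTORS=$(( (PARTITIONS + 9999) / 10000 ))   # round up
echo "num-executors: ${NUM_EXECUTORS}"
```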
spark-submit --master yarn --deploy-mode cluster --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 5 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class bytedance.olap.las.Exporter ./application.jar ./export_to_hive.yaml
Checking with the hive CLI shows that the databases and tables were imported correctly, and the partition count matches expectations.
hive> show databases;
OK
las_exporter
Time taken: 0.026 seconds, Fetched: 1 row(s)
hive> use las_exporter;
OK
Time taken: 0.109 seconds
hive> show tables;
OK
exporter_table_1000_0
exporter_table_1000_1
exporter_table_1000_2
exporter_table_1000_3
exporter_table_1000_4
Time taken: 0.074 seconds, Fetched: 5 row(s)
hive> show partitions exporter_table_1000_4;
.....
year=2017-997
year=2017-998
year=2017-999
Time taken: 0.14 seconds, Fetched: 1000 row(s)
hive>
Verify the backfill capability of incremental import:
Now delete one of the tables, exporter_table_1000_4, and one partition of exporter_table_1000_3, year='2017-999', then add a new table exporter_table_1000_5 (1000 partitions) and rerun the command above. The expectation is that the deleted table and partition, together with the newly added table, are all imported successfully:
hive> drop table exporter_table_1000_4;
OK
Time taken: 5.565 seconds
hive> alter table exporter_table_1000_3 drop partition(year='2017-999');
Dropped the partition year=2017-999
OK
Time taken: 0.702 seconds
hive> show tables;
OK
exporter_table_1000_0
exporter_table_1000_1
exporter_table_1000_2
exporter_table_1000_3
Time taken: 0.054 seconds, Fetched: 4 row(s)
hive>
Result of running the Spark command:
Note
Judging from the returned results: this matches expectations.
hive> show tables;
OK
exporter_table_1000_0
exporter_table_1000_1
exporter_table_1000_2
exporter_table_1000_3
exporter_table_1000_4
exporter_table_1000_5
Time taken: 0.073 seconds, Fetched: 6 row(s)
hive> show partitions exporter_table_1000_3;
...
...
year=2017-993
year=2017-994
year=2017-995
year=2017-996
year=2017-997
year=2017-998
year=2017-999    # the re-added partition
Time taken: 0.116 seconds, Fetched: 1000 row(s)
export_to_hive.yaml
mode: las_to_hive
lasClientInfo:
  akMode: AUTO
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
hmsClientInfo:
  hiveConfPath: /etc/emr/hive/conf/hive-site.xml
runOptions:
  includeCatalogPrefixs: [las_exporter]
  includeDatabasePrefixs: [las_exporter]
  includeTablePrefixs: [exporter_table_20w]
  batchSize: 1000
  outputBaseDir: xxx
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
Run the Spark command.
root@master-1-1(10.1.0.27):~$ time spark-submit --master yarn --deploy-mode cluster --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 5 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class bytedance.olap.las.Exporter ./application.jar ./export_to_hive.yaml
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

real    2m28.509s
user    0m8.486s
sys     0m0.410s
Execution time: 2m28s
The result matches expectations:
hive> show tables;
OK
exporter_table_1000_0
exporter_table_1000_1
exporter_table_1000_2
exporter_table_1000_3
exporter_table_1000_4
exporter_table_1000_5
exporter_table_20w_0
exporter_table_20w_1
exporter_table_20w_2
exporter_table_20w_3
exporter_table_20w_4
hive> show partitions exporter_table_20w_0;
...
...
year=2017-99999
Time taken: 1.012 seconds, Fetched: 2000000 row(s)
The las_to_tos mode serializes the las-catalog metadata to JSON and stores it in TOS. The outputBaseDir parameter is required. The Spark export job creates a folder under this directory named by date in year-month-day format; if a folder with the same name already exists under outputBaseDir, it will be overwritten.
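For illustration, the dated output path can be derived as follows (the bucket path here is hypothetical; the year-month-day naming is taken from the description above):

```shell
# OUTPUT_BASE_DIR is a hypothetical TOS path; the exporter writes
# into a year-month-day subfolder under it, per the description above.
OUTPUT_BASE_DIR="tos://mybucket/exporter_las"
RUN_DATE=$(date +%Y-%m-%d)
echo "${OUTPUT_BASE_DIR}/${RUN_DATE}"
```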
mode: las_to_tos
lasClientInfo:
  akMode: AUTO
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
hmsClientInfo:
  hiveConfPath: /etc/emr/hive/conf/hive-site.xml
runOptions:
  includeCatalogPrefixs: [las_exporter]
  includeDatabasePrefixs: [las_exporter]
  includeTablePrefixs: [exporter_table_1000]
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
  outputBaseDir: tos://xxx/exporter_las
spark-submit --master yarn --deploy-mode cluster --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 5 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class bytedance.olap.las.Exporter ./application.jar ./export_to_tos.yaml
In the TOS path tos://xxx/exporter_las, open the generated folder to inspect the metadata directory layout:
Manually delete the partitions directory from TOS and rerun the export; you can see that the deleted partitions folder and the data inside it are rewritten.
Normally, the tos_to_hive mode is used together with the las_to_tos mode; before using tos_to_hive, make sure the data already exists in TOS.
So we reuse the data produced by the las_to_tos run above, and first drop two tables in Hive, exporter_table_1000_0 and exporter_table_1000_1. We expect the tos_to_hive mode to successfully restore both tables.
Time taken: 0.102 seconds, Fetched: 11 row(s)
hive> drop table exporter_table_1000_0;
OK
Time taken: 4.239 seconds
hive> drop table exporter_table_1000_1;
Time taken: 3.411 seconds
hive> show tables;
OK
exporter_table_1000_2
exporter_table_1000_3
exporter_table_1000_4
exporter_table_1000_5
exporter_table_20w_0
exporter_table_20w_1
exporter_table_20w_2
exporter_table_20w_3
exporter_table_20w_4
Time taken: 0.055 seconds, Fetched: 9 row(s)
hive>
tos_to_hive.yaml
Note
In the configuration, inputBaseDir is the TOS address where the metadata is stored.
mode: tos_to_hive
lasClientInfo:
  akMode: AUTO
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
hmsClientInfo:
  hiveConfPath: /etc/emr/hive/conf/hive-site.xml
runOptions:
  includeCatalogPrefixs: [las_exporter]
  includeDatabasePrefixs: [las_exporter]
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
  inputBaseDir: tos://xxx/exporter_las/2025-02-19
time spark-submit --master yarn --deploy-mode cluster --driver-memory 12G --executor-memory 8G --executor-cores 2 --num-executors 5 --conf spark.sql.shuffle.partitions=200 --conf spark.sql.adaptive.enabled=false --class bytedance.olap.las.Exporter ./application.jar ./tos_to_hive.yaml
From the returned results we can confirm that exporter_table_1000_0 and exporter_table_1000_1 have been restored.
hive> show tables;
exporter_table_1000_0
exporter_table_1000_1
exporter_table_1000_2
exporter_table_1000_3
exporter_table_1000_4
exporter_table_1000_5
exporter_table_20w_0
exporter_table_20w_1
exporter_table_20w_2
exporter_table_20w_3
exporter_table_20w_4
Time taken: 0.071 seconds, Fetched: 11 row(s)
hive> show partitions exporter_table_1000_1;
...
year=2017-991
year=2017-992
year=2017-993
year=2017-994
year=2017-995
year=2017-996
year=2017-997
year=2017-998
year=2017-999
Time taken: 0.119 seconds, Fetched: 1000 row(s)