You need to enable JavaScript to run this app.
导航
SQL 语法
最近更新时间:2024.06.11 14:50:19首次发布时间:2021.09.30 14:31:53

1. 概述

LAS SQL 语法标准以 ANSI SQL 2011 为基础,增加了 OLAP 相关语法,同时基于 Spark 3.0,支持了大部分的 Spark SQL build-in functions。

2. 阅读说明

  • 中括号[] 括起来的部分代表 可选
    比如 CREATE TABLE [ IF NOT EXISTS ] [database_name.]table_name,代表这个语句可以是 CREATE TABLE table_name,也可以是 CREATE TABLE IF NOT EXISTS [database_name.]table_name。
  • [ A | B ] 代表 A 和 B 都是可选的,但 至多只能有一个
    比如 ORDER BY column_name [ ASC | DESC ],代表这个语句可以是 ORDER BY column_name,也可以是 ORDER BY column_name ASC 或者 ORDER BY column_name DESC。
  • { A | B } 代表 A 和 B 有且只有一个
    比如 SHOW { DATABASES | SCHEMAS },代表这个语句可以是 SHOW DATABASES,也可以是 SHOW SCHEMAS,但不能只有 SHOW。
  • <expression, ...> 代表与 expression 格式相同的串,还可以通过 逗号分隔可重复多次
    比如 WITH DBPROPERTIES ( <property, ...> ) 代表 property 这个格式可以在括号里出现一次,也可以出现多次。
  • 所有语法中:单词为 大写 时,代表是语法 关键字 ;单词为 小写 时,代表为一个 Sql 语法 片段 / 参数 ,这些会有单独的定义或者在参数里有相应解释。
    比如 CREATE TABLE table_name [ ( <column_defination, ...> ) ] 中,CREATE TABLE 为两个关键字, column_defination 可参考下文紧邻的【参数】中描述的格式替换为具体语句。
  • 语法参数的解释全文只出现一次,解释一次后,后续在语法出现时将不再赘述。
  • 每一种语法后都会提供一些十分常见的 Demo 以供参考。

3. DDL 语句

3.1 库操作

3.1.1 创建库

  • 语法
CREATE { DATABASE | SCHEMA } [ IF NOT EXISTS ] database_name 
[ WITH DBPROPERTIES ( <property, ...>) ]
[ COMMENT comment ]

  • 参数
    • property
      通过键=值格式来表示的键值对,键和值的两端都需要有单引号,比如 'propKey'='propValue'
    • comment
      指定数据库备注内容。
  • 示例
CREATE SCHEMA test_olap COMMENT 'this is a database for test';

CREATE DATABASE IF NOT EXISTS test_olap;

CREATE SCHEMA IF NOT EXISTS test_olap;
WITH DBPROPERTIES ('scope' = 'inner', 'ownerName' = 'user1')
COMMENT 'this is a database for test';

3.1.2 使用库

  • 语法
USE database_name

  • 示例
USE test_olap

3.1.3 删除库

  • 语法
DROP { DATABASE | SCHEMA } [ IF EXISTS ] 
database_identifier [ RESTRICT | CASCADE ]

  • 参数
    • RESTRICT
      如果指定,将限制删除非空数据库,并默认启用。
    • CASCADE
      如果指定,将删除所有关联的表和函数。
  • 示例
DROP SCHMEA IF EXISTS test_hive_db;

3.2 表操作

3.2.1 创建表

3.2.1.1 创建新表
  • 语法
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [database_name.]table_name
    [ ( <column_defination, ...> ) ]
    [ COMMENT 'table_comment' ]
    [ PARTITIONED BY ( <column_defination, ...> ) ]
    [ bucket_spec ]
    [ skew_spec ]
    [ row_format ]
    [ create_file_format ]
    [ TBLPROPERTIES ( <property, ...> ) ]

column_defination:
    column_name column_type [ COMMENT comment ]

bucket_spec:
    CLUSTERED BY ( <column_name, ...> ) 
    [ SORTED BY ( <column_name [ ASC | DESC ], ...> ) ] 
    INTO num_buckets BUCKETS
        
skew_spec:
    SKEWED BY ( column_name_1, column_name_2, ..., column_name_n )
    ON ( 
        <( column_value_for_name_1, ..., column_value_for_name_n ), ...> 
    )
    [STORED AS DIRECTORIES]
        
row_format:
    ROW FORMAT SERDE serde_class [ WITH SERDEPROPERTIES ( <property, ...> ) ]
    | ROW FORMAT DELIMITED [ FIELDS TERMINATED BY fields_termiated_char [ ESCAPED BY escaped_char ] ] 
        [ COLLECTION ITEMS TERMINATED BY collection_items_termiated_char ] 
        [ MAP KEYS TERMINATED BY map_key_termiated_char ]
        [ LINES TERMINATED BY row_termiated_char ]
        [ NULL DEFINED AS null_char ]

create_file_format:
    STORED AS file_format
    | STORED BY storage_handler
    
file_format:
    INPUTFORMAT 'input_format_class' OUTPUTFORMAT 'output_format_class'
    | {TEXTFILE | PARQUET | ORCFILE | RCFILE | SEQUENCEFILE}

storage_handler:
    'hander_class' [WITH SERDEPROPERTIES ( <property, ...> )]


  • 参数
    • column_defination
      定义一个列。
    • column_name
      指定列名,格式可以是字母下划线的字符串,也可以使用```进行引用。
    • row_format
      使用SERDE子句指定一个自定义 SerDe,或者使用DELIMITED子句使用原生 SerDe 并指定分隔符、转义字符、空字符等。
    • SERDE
      指定自定义 SerDe。
    • serde_class
      指定自定义 SerDe 类的全限定名。
    • SERDEPROPERTIES
      用于标记 SerDe 定义的键值对列表。
      例如: ROW FORMAT SERDE 'org.....serde.ParquetHiveSerDe'
    • DELIMITED
      DELIMITED子句可用于指定原生 SerDe 并声明分隔符、转义字符、空字符等。
    • FIELDS TERMINATED BY
      用于定义列分隔符。
    • COLLECTION ITEMS TERMINATED BY
      用于定义集合项分隔符。
    • MAP KEYS TERMINATED BY
      用于定义字典项分隔符。
    • LINES TERMINATED BY
      用于定义行分隔符。
    • NULL DEFINED AS
      用于定义 NULL 的特定值。
      例如: ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '。
  • 示例
-- 创建一个内表(主键表)
CREATE TABLE db.tb (
    user_id         BIGINT,
    product_id      BIGINT,
    action_type     STRING,
    gmv             DOUBLE,
    ts              BIGINT,
    PRIMARY KEY (user_id, product_id), -- 指定主键
    PRECOMBINE KEY (ts) -- 指定排序字段,一般为业务时间戳字段,保证乱序情况下,数据符合预期。
                        -- 不指定时可能导致查询结果不符合预期。
                        -- 主键相同时,基于此字段排序选择最大的一条进行去重。
)
PARTITIONED BY (date STRING, hour STRING) -- 指定分区列,非分区表可以不指定
CLUSTERED BY (user_id, product_id) -- bucket 字段,需为主键列的子集(如果没有特殊需求,建议和主键列保持一致)
INTO 32 BUCKETS -- bucket 数,决定数据读写并发度,基于实际数据量预估,单个 bucket 不超过 1G,建议为 2 次幂
STORED AS ByteLake;

-- 创建一个内表(非主键表)
CREATE TABLE db.tb (
    user_id         BIGINT,
    product_id      BIGINT,
    action_type     STRING,
    gmv             DOUBLE,
    ts              BIGINT
)
PARTITIONED BY (date STRING, hour STRING) -- 指定分区列,非分区表可以不指定
STORED AS ByteLake;
 

-- 创建一个hive分区表
CREATE TABLE IF NOT EXISTS test_olap.student (
    id INT COMMENT 'id', 
    name STRING COMMENT 'name', 
    age INT COMMENT 'age'
) PARTITIONED BY (date string COMMENT 'date partition')
    COMMENT 'this is a comment'
    STORED AS PARQUET
    TBLPROPERTIES ('created.date' = 'xxxx-xx-xx');

-- 使用 Row Format 和 file format
CREATE TABLE test_olap.student (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
    
-- 使用复杂类型 (外表需要指定 Location)
CREATE EXTERNAL TABLE test_olap.family(
        name STRING,
        friends ARRAY<STRING>,
        children MAP<STRING, INT>,
        address STRUCT<street: STRING, city: STRING>
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
    COLLECTION ITEMS TERMINATED BY '_'
    MAP KEYS TERMINATED BY ':'
    LINES TERMINATED BY '\n'
    NULL DEFINED AS 'foonull'
    STORED AS TEXTFILE
    LOCATION "tos://xxxxx/xxxx";

3.2.1.2 基于已存在的表结构创建表
  • 功能
    基于已存在的表创建具备相同结构的新表但不复制数据。
  • 语法
CREATE TABLE [IF NOT EXISTS] table_name LIKE source_table_name
    [ ROW FORMAT row_format ]
    [ STORED AS create_file_format ]
    [ TBLPROPERTIES ( <property, ...> ) ]

  • 示例
-- 根据一个已存在的表创建一张表
CREATE TABLE test_olap.student_copy like test_olap.student;

-- 使用 row format
CREATE TABLE test_olap.student_copy like test_olap.student
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES ('prop1'='xxxx');

3.2.2 修改表

3.2.2.1 增加列

Alter Table 大部分操作适用于 LAS 外表,对于LAS 内表的相关分区等属性通常由系统自动管理,不允许直接通过接口变更。

  • 语法
ALTER TABLE [database_name.]table_name ADD COLUMNS ( <column_defination, ...> )

增删列语法支持内表和外表,属于 Schema Evolution 的特性,需要说明的是,对于LAS内表而言,有如下限制:
对于 LAS 标准内表,alter table add column 执行后需要至少写入一条数据,否则 presto 执行查询时,将会报错不支持,这种情况建议增加数据确保 schema 演化功能生效。对于外表没有该限制。

  • 示例
-- 添加一个列
ALTER TABLE test_olap.student ADD COLUMNS (grade STRING COMMENT 'student grade')

3.2.2.2 增加分区(仅支持外表)
  • 语法
ALTER TABLE [database_name.]table_name ADD [IF NOT EXISTS] PARTITION ( <partition_spec, ...> )

  • 参数
    partition_spec
    要添加的分区。
    语法 partition_column_name = 'partition_column_value'
  • 示例
ALTER TABLE test_olap.student_part ADD IF NOT EXISTS PARTITION(date='20200101')

3.2.2.3 重命名表(仅支持外表)
  • 语法
ALTER TABLE [database_name.]table_name RENAME TO [database_name.]table_name

  • 示例
ALTER TABLE test_olap.student RENAME TO test_olap.student_2

3.2.2.4 重命名分区(仅支持外表)
  • 语法
ALTER TABLE [database_name.]table_name partition_spec RENAME TO partition_spec

  • 参数
    part_from RENAME TO part_to
    将原始分区值重命名到新分区值。
  • 示例
ALTER TABLE test_olap.student_part 
    PARTITION (date = '20200101') RENAME TO PARTITION (date = '220200102')

3.2.2.5 删除分区
  • 语法
ALTER TABLE [database_name.]table_name DROP [ IF EXISTS ] partition_spec [PURGE]

  • 参数
    PURGE
    如果指定 PURGE,数据将立即删除,不会移动到垃圾桶。
  • 示例
ALTER TABLE test_olap.student_part 
    DROP IF EXISTS PARTITION(date='20200101') PURGE

3.2.2.6 设置/去除表属性
  • 语法
ALTER TABLE [database_name.]table_name SET TBLPROPERTIES ( <property, ...> )

ALTER TABLE [database_name.]table_name UNSET TBLPROPERTIES [ IF EXISTS ] ( <property_key, ...> )

  • 示例
-- 添加 Table 属性
ALTER TABLE test_olap.student SET TBLPROPERTIES ('ownerName' = 'xxx', 'size' = '10M')

-- 去掉 Table 属性
ALTER TABLE test_olap.student UNSET TBLPROPERTIES IF EXISTS ('ownerName', 'size')


3.2.2.7 设置序列化参数
  • 语法
-- Set SERDE Properties
ALTER TABLE [database_name.]table_name [ partition_spec ]
    SET SERDEPROPERTIES ( <property, ...> )

ALTER TABLE [database_name.]table_name [ partition_spec ]
    SET SERDE serde_class_name
    [ WITH SERDEPROPERTIES ( <property, ...> ) ]

  • 参数
    serde_class_name
    指定 Serde 的 class 全限定名。
  • 示例
-- Set SERDE
-- Before set SERDE
DESC EXTENDED test_olap.student
...
|Serde Library           
|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
...

ALTER TABLE test_olap.student SET SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe';

-- After set SERDE
DESC EXTENDED test_olap.student
...
|Serde Library               
|org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerD
...

-- Set SERDE properties
-- Before set SERDE properties
DESC EXTENDED test_olap.student
...
|Storage Properties          
|[serialization.format=1]
...

ALTER TABLE test_olap.student SET SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')

-- After set SERDE properties
DESC EXTENDED test_olap.student
...
|Storage Properties          
|[kay=vee, k=v, serialization.format=1]
...

3.2.2.8 修复分区
  • 语法
    请注意: 请注意,ALTER TABLE RECOVER分区仅适用于分区表。
ALTER TABLE database_name.table_name RECOVER PARTITIONS
  • 示例
ALTER TABLE test_olap.student_part RECOVER PARTITIONS

3.2.2.9 设置表分桶数
  • 语法
ALTER TABLE [database_name.]table_name
CLUSTERED BY ( <column_name, ...> )
[ SORTED BY ( <column_name [ ASC | DESC ], ...> ) ]
INTO bucket_num BUCKETS

  • 参数
    • CLUSTERED BY
      指定要设置为 clusted 的列名列表。
    • SORTED BY
      指定要设置为 sorted 的列名列表。
    • bucket_num
      指定桶的数量。
  • 示例
ALTER TABLE test_olap.student 
CLUSTERED BY (name ,age)
SORTED BY (name, age)
INTO 8 BUCKETS

3.2.2.10 修改多个列定义
  • 语法
ALTER TABLE database_name.table_name REPLACE COLUMNS ( <column_defination, ...> )
  • 示例
-- Before replace columns
DESC test_olap.student
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|id      |int      |null   |
|name    |string   |null   |
|age     |int      |null   |
+--------+---------+-------+

ALTER TABLE test_olap.student REPLACE COLUMNS
(
    id INT COMMENT 'id comment',
    age STRING,
    idd INT
)

-- After replace columns
DESC test_olap.student
+--------+---------+----------+
|col_name|data_type|comment   |
+--------+---------+----------+
|id      |int      |id comment|
|age     |string   |null      |
|idd     |int      |null      |
+--------+---------+----------+

3.2.2.11 单个修改列定义
  • 语法
ALTER TABLE database_name.table_name
CHANGE [COLUMN] column_name column_defination
  • 示例
ALTER TABLE test_olap.student CHANGE age age INT COMMENT 'new age comment'

3.2.3 删除表

  • 语法
DROP TABLE [ IF EXISTS ] [database_name.]table_name

  • 示例
DROP TABLE test_olap.student

3.3 视图操作

3.3.1 创建视图

  • 语法
CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [IF NOT EXISTS] database_name.view_name
    [(<column_name [COMMENT 'column_comment'], ...>)]
    [COMMENT 'view_comment']
    [TBLPROPERTIES (<property, ...>)]
AS select_statement
  • 参数
- select_statement
           查询语句,具体见后面[DQL语句]部分
  • 示例
-- 创建一个view
CREATE OR REPLACE VIEW test_olap.older_student (
    id COMMENT 'id', 
    name COMMENT 'name'
) COMMENT 'view for older student'
  TBLPROPERTIES('created.date' = 'xxxx-xx-xx')
AS 
    SELECT 
        id, 
        name 
    FROM 
        test_olap.student
    WHERE 
        age > 18;

3.3.2 修改视图

3.3.2.1 重命名视图
  • 语法
ALTER VIEW database_name.view_name RENAME TO database_name.view_name
  • 示例
ALTER VIEW test_olap.older_student RENAME TO test_olap.older_student_copy

3.3.2.2 设置/去除视图属性
  • 语法
-- Set View Properties
ALTER VIEW database_name.view_name 
{ SET | UNSET } TBLPROPERTIES ( <property, ...> )

-- Unset View Properties
ALTER VIEW database_name.view_name 
UNSET TBLPROPERTIES [ IF EXISTS ]  ( <property_key, ...> )
  • 示例
-- Set View Properties
ALTER VIEW test_olap.older_student SET TBLPROPERTIES ('owner' = 'xxx', 'size' = '10M')

--Show View Properties after set using command 
SHOW TBLPROPERTIES test_olap.older_student
+---------------------+----------+
|key                  |value     |
+---------------------+----------+
|size                 |10M       |
|created.date         |2020-10-10|
|transient_lastDdlTime|1603265772|
|last_modified_time   |1603264813|
|owner                |xxx       |
+---------------------+----------+

-- Unset View properties
ALTER VIEW test_olap.older_student UNSET TBLPROPERTIES IF EXISTS ('owner', 'size')

--Show View Properties after set using command 
SHOW TBLPROPERTIES test_olap.older_student
+---------------------+----------+
|key                  |value     |
+---------------------+----------+
|created.date         |2020-10-10|
|transient_lastDdlTime|1603266451|
|last_modified_time   |1603264813|
+---------------------+----------+

3.3.2.3 修改视图定义
  • 语法
ALTER VIEW database_name.view_name AS query
  • 示例
ALTER VIEW test_olap.older_student AS
SELECT id, name FROM test_olap.student
WHERE age > 20;

3.3.3 删除视图

  • 语法
DROP VIEW [ IF EXISTS ] database_name.view_name
  • 示例
DROP VIEW IF EXISTS test_olap.older_student;

4. DML 语句

4.1 插入数据

  • 语法
INSERT INTO [ TABLE ] [database_name.]table_name [ partition_spec ]
{ 
    VALUES <( <{ value | NULL }, ...> ), ...>
    | select_statement 
}

  • 参数
    • VALUES
      指定要插入的值。可以插入显式指定的值或 NULL。必须使用逗号分隔子句中的每个值。可以指定多个值集来插入多行。
    • select_statement
      生成要插入的行的查询。它可以是以下格式之一:
      SELECT子句
      TABLE子句
      FROM子句
  • 示例
-- Single row insert using a VALUES clause
INSERT INTO test_olap.student VALUES
    (1, 'Tom', 20);
-- Single row insert using VALUES clause on partitioned table 
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO test_olap.student_part2 VALUES
    (1, 'Tom', 21, '20200101');

--Multi-row insert using a VALUES clause
INSERT INTO test_olap.student VALUES
    (1, 'Tom', 20),
    (2, 'Jack', 22);

-- Insert using a SELECT statement
INSERT INTO test_olap.student_part2 PARTITION (date = '20200201')
    SELECT id, name, age FROM test_olap.student WHERE age > 10;

-- Insert using a TABLE statement
INSERT INTO test_olap.student TABLE test_olap.student2;

-- Insert using a FROM statement
INSERT INTO test_olap.student_part2 PARTITION (date = '20200201')
    FROM test_olap.student SELECT id, name, age WHERE age > 10;

4.2 覆盖写入数据

  • 语法
INSERT OVERWRITE TABLE [database_name.]table_name [ partition_spec [ IF NOT EXISTS ] ]
{ 
    VALUES <( <{ value | NULL }, ...> ), ...>
    | select_statement 
}

  • 示例
-- Insert using a VALUES clause
INSERT OVERWRITE TABLE test_olap.student VALUES
    (1, 'Tom', 20),
    (2, 'Jack', 22);

-- Insert using a SELECT statement
INSERT OVERWRITE TABLE test_olap.student_part2 PARTITION (date = '20200101')
    SELECT id, name, age FROM test_olap.student WHERE age > 20;

-- Insert using a TABLE statement
INSERT OVERWRITE TABLE test_olap.student TABLE test_olap.student2;

-- Insert using a FROM statement
INSERT OVERWRITE TABLE test_olap.student_part2 PARTITION (date = '20200101')
    FROM test_olap.student  SELECT id, name, age WHERE age > 20;

4.3 Update/Delete

delete 操作仅支持 ACID 表,参考https://www.volcengine.com/docs/6492/101914。
非主键 LAS 内表/外表无法支持 delete。

已知限制:
需要说明的是,目前 DML 语句暂不支持和 DDL 语句放在同一个任务中使用。原因为目前 DML 语句会针对性的对 LAS 表提前进行优化判断。当在同一个任务中使用 DDL 建表并立刻运行 DML 插入时,LAS 内表目前暂时无法应用内表优化规则,导致查询失败。对于 LAS 外表在 DDL+DML 时,DML 则无法立刻获得 TOS 认证信息。因此,如果您原来的查询语句中有 DDL+DML 一起的情况下,需要先起一个任务运行 DDL,再进行 DML 修改。

5. DQL 语句

5.1 查询数据

语法概览

[ cte_defination ]
select_statement 
[ <{ UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_statement, ...> ]
[ ORDER BY <expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ], ...> ]
[ SORT BY <expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ], ...> ]
[ CLUSTER BY <expression, ...> ]
[ DISTRIBUTE BY <expression, ...> ]
[ <WINDOW named_window, ...> ]
[ LIMIT { ALL | expression } ]

参数概览

  • cte_defination

指定主查询块前的公共表表达式(Common Table Expression, CTE)。这些表表达式允许在稍后的从句中引用。这对于从句中抽象出重复的子查询块很有用,并提高查询的可读性。

  • select_statement

指定查询逻辑

  • UNION & INTERSECT & EXCEPT
    对查询结果数据集执行交集(intersectintersect allintersect distinct)、并集(unionunion allunion distinct)或补集(exceptexcept allexcept distinct)操作
  • ORDER BY

ORDER BY子句用于按用户指定的顺序以排序方式返回结果行。与SORT BY子句不同,该子句保证输出中的总顺序。

  • SORT BY

指定每个分区内行的排序。此参数与ORDER BYCLUSTER BY互斥,不能一起指定。
请注意,这与保证输出总顺序的ORDER BY子句不同,SORT BY子句用于返回按用户指定顺序在每个分区内排序的结果行。当有多个分区时,SORT BY 可能返回部分有序的结果。

  • CLUSTER BY

指定一组表达式,用于对行进行重新分区和排序。使用此子句的效果与同时使用DISTRIBUTE BYSORT BY相同。

  • DISTRIBUTE BY

指定将结果行根据分区 key 进行重分区的一组表达式。此参数与ORDER BYCLUSTER BY互斥,不能一起指定。

  • WINDOW

窗口函数对一组行进行操作,并根据该组行计算每行的返回值。窗口函数对于处理任务很有用,例如计算移动平均值、计算累积统计数据或访问给定当前行相对位置的行的值。

  • LIMIT

指定语句或子查询可返回的最大行数。此子句主要与ORDER BY一起使用,以生成确定性结果。

5.1.1 WITH 子句/CTE 语法(cte_defination)

5.1.1.1 语法
WITH <cte_name [ AS ] ( select_statement ), ...>

5.1.1.2 参数
  • cte_name
    指定 CTE 的名称,允许在稍后的从句中引用。不能与当前with子句中的其他 CTE 的名称相同。

5.1.1.3 示例
-- CTE
-- Single CTE
WITH t AS (
    SELECT id, name, age FROM test_olap.student WHERE age > 20
)
SELECT * FROM t;

--Multi CTEs
WITH t1 AS (
    SELECT id, name, age FROM test_olap.student WHERE age > 20
),
t2 AS (
    SELECT id, name FROM t1 WHERE age < 25
)
SELECT * FROM t2;

-- CTE in CTE definition
WITH t AS (
    WITH t2 AS (
    SELECT id, name, age FROM test_olap.student WHERE age > 20
    )
    SELECT id, name FROM t2 WHERE age < 25
)
SELECT * FROM t;

-- CTE in subquery
SELECT
    max(t2.id)
FROM
    (
        WITH t AS (
            SELECT id, name, age FROM test_olap.student WHERE age > 20
        )
        SELECT * FROM t
    ) t2

5.1.2 SELECT 子句(select_statement)

语法概览

SELECT [ <hints, ...> ] [ ALL | DISTINCT ] <named_expression, ...>
    FROM <from_item, ...>
    [ LATERAL VIEW clause ] [ LATERAL VIEW clause ... ]
    [ WHERE boolean_expression ]
    [ GROUP BY <expression, ...> ]
    [ HAVING boolean_expression ]
    [ ORDER BY <expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ], ...> ]
    [ SORT BY <expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ], ...> ]
    [ CLUSTER BY <expression, ...> ]
    [ DISTRIBUTE BY <expression, ...> ]
    [ <WINDOW named_window, ...> ]
    [ LIMIT { ALL | expression } ]

参数概览

  • expression
    指定生成值的一个或多个值、运算符和 SQL 函数的组合。
  • hints
    可以指定Hints来帮助引擎做出更好的规划决策。例如,目前 Spark 支持影响 join 策略选择和数据 repartition 的Hints
  • ALL | DISTINCT
    All代表从关系中选择所有匹配的行,并默认启用。DISTINCT代表删除结果中的重复项后,从关系中选择所有匹配行。
  • named_expression
    具有指定名称的表达式。通常,它表示列表达式。
    语法: expression [ [ AS ] alias ]
  • from_item
    指定查询的输入源。它可以是以下之一:
    • Join Relation
    • Inline Table
    • 子查询
  • LATERAL VIEW
    LATERAL VIEW子句需要与EXPLODE等生成器函数一起使用,EXPLODE将生成包含一行或多行的虚拟表,LATERAL VIEW将把这些行应用到每个原始输出行。
  • WHERE
    根据所提供的谓词来筛选FROM子句的结果。
  • GROUP BY
    指定用于对行进行分组的表达式。这将与聚合函数(MIN、MAX、COUNT、SUM、AVG 等)一起使用。)根据分组表达式和每个组中的聚合值对行进行分组。当 FILTER 子句附加到聚合函数时,仅将匹配的行传递给该函数。
  • HAVING
    指定用于筛选GROUP BY生成的行的谓词。HAVING子句用于在执行分组后筛选行。如果指定HAVING而不指定GROUP BY,则表示GROUP BY不包含分组表达式(全局聚合)。

5.1.2.1 查询输入源(from_item)

5.1.2.1.1 Join Relation

语法

relation 
{ 
    [ join_type ] JOIN relation [ join_criteria ]
    | NATURAL join_type JOIN relation 
}

参数

  • relation
    指定要联接的关系。
  • join_type
    指定连接类型。
    语法:
    [ INNER ]
    | CROSS
    | LEFT [ OUTER ]
    | [ LEFT ] SEMI
    | RIGHT [ OUTER ]
    | FULL [ OUTER ]
  • join_criteria
    指定如何连接两个关系的记录。
    语法: ON boolean_expression | USING ( <column_name, ...> )

示例

-- Join Hints
SELECT /*+ BROADCASTJOIN (t1) */ * FROM test_olap.student t1
    LEFT JOIN (SELECT * FROM test_olap.student_part2 WHERE date = '20200101') t2
    ON t1.id = t2.id;

5.1.2.1.2 Inline Table

语法

VALUES ( <expression, ...> ) [ table_alias ]

参数

  • expression
    指定生成值的一个或多个值、运算符和 SQL 函数的组合。
  • table_alias
    使用可选列名称列表指定临时名称。
    语法: [ AS ] table_name [ ( <column_name, ...> ) ]

示例

-- inline table
SELECT * FROM (VALUES ("one", 1));

5.1.2.2 LATERAL VIEW

LATERAL VIEW子句需要与EXPLODE等生成器函数一起使用,EXPLODE将生成包含一行或多行的虚拟表,LATERAL VIEW将把这些行应用到每个原始输出行。
语法

LATERAL VIEW [ OUTER ] 
generator_function ( <expression, ... ) [ table_alias ] 
AS <column_alias, ...>

参数

  • OUTER
    如果指定了OUTER,当输入数组/映射为空或 null 时,则返回 null。
  • generator_function
    指定生成器函数(EXPLODE、INLINE 等)。
  • table_alias
    generator_function的别名,这是可选的。
  • column_alias
    指定generator_function的列的别名。如果generator_function有多个输出列,则可能有多个别名。
  • WHERE
    根据所提供的谓词来筛选FROM子句的结果。

示例

-- LATERAL VIEW
SELECT t1.age, t2.c_age, t3.d_age FROM test_olap.student t1
    LATERAL VIEW EXPLODE(ARRAY(30, 60)) t2 AS c_age
    LATERAL VIEW EXPLODE(ARRAY(40, 80)) t3 AS d_age;

5.1.2.3 GROUP BY

指定用于对行进行分组的表达式。这将与聚合函数(MIN、MAX、COUNT、SUM、AVG 等)一起使用。根据分组表达式和每个组中的聚合值对行进行分组。当 FILTER 子句附加到聚合函数时,仅将匹配的行传递给该函数。
语法

{
    GROUP BY <group_expression, ...> [ { WITH ROLLUP | WITH CUBE | GROUPING SETS ( <grouping_set, ...> ) } ]
    | GROUP BY GROUPING SETS ( <grouping_set, ...> )
}

其中 group_expression 定义如下:

aggregate_name ( [ DISTINCT ] <expression, ...> ] ) 
[ FILTER ( WHERE boolean_expression ) ]

参数

  • GROUPING SETS
    对分组集中指定的表达式的每个子集的行进行分组。例如,GROUP BY GROUPING SETS (warehouse, product)在语义上等同于GROUP BY warehouse仓库和GROUP BY product的结果的联合。该子句是UNION ALL的简写,其中UNION ALL运算符的每个分支执行GROUPING SETS子句中指定的列子集的聚合。
  • grouping_set
    分组集由括号中的零个或多个逗号分隔表达式指定。
    语法: ( [ <expression, ...> ] )
  • grouping_expression
    指定将行分组在一起所依据的条件。根据分组表达式的结果值执行行分组。分组表达式可以是列别名或表达式。
  • ROLLUP
    按层级聚合的方式产生grouping sets,也可以和grouping sets组合使用。
  • CUBE
    使用指定列的所有可能组合作为grouping sets,也可以和grouping sets组合使用。
  • aggregate_name
    指定集合函数名(MIN、MAX、COUNT、SUM、AVG 等)。
  • DISTINCT
    删除输入行中的重复项,然后将其传递给聚合函数。
  • FILTER
    筛选WHERE子句中boolean_expression为 true 的输入行,将其传递给集合函数;其他行将被丢弃。

示例

-- Group by
-- Grouping SETS
SELECT id, count(1) FROM test_olap.student GROUP BY id, name
    GROUPING SETS((id, name), (id), ());

-- CUBE
SELECT id, count(1) FROM test_olap.student GROUP BY id, name WITH CUBE;

-- ROLLUP
SELECT id, count(1) FROM test_olap.student GROUP BY id, name WITH ROLLUP;

-- Distinct in aggregate functions
SELECT id, sum(DISTINCT age) FROM test_olap.student GROUP BY id;

-- Aggregate filter
SELECT id,
     sum(age) FILTER (WHERE name in ('Tom', 'Jack'))
     FROM 
        test_olap.student
     GROUP BY 
        id;

-- Group by alias
SELECT 
     regexp_replace(name, 'T', 'S') as name_alias,
     count(1)
     FROM 
        test_olap.student
     GROUP BY
        name_alias

5.1.2.4 CASE子句

语法

CASE [ expression ] { WHEN boolean_expression THEN then_expression } [ ... ]
    [ ELSE else_expression ] END

参数

  • then_expression
    指定基于boolean_expression条件的然后表达式。
  • else_expression
    指定默认表达式。then_expressionelse_expression都应该是相同的类型或可强制为通用类型。

示例

-- Case clause
-- Case clause specifies expression
SELECT
    id,
    case age
        when 18 then 'young'
        when 20 then 'old'
        else 'others'
    end as stu_tag
FROM test_olap.student;
    
-- Case clause specifes no expression
SELECT
    id,
    case
        when age = 18 then 'young'
        when age = 20 then 'old'
        else 'others'
    end as stu_tag
FROM test_olap.student;

-- Case clause as WHERE condtion
SELECT
    id,
    name
FROM test_olap.student
WHERE
        case
            when age = 18 then 'young'
            when age = 20 then 'old'
            else 'others'
        end = 'young';

5.1.3 WINDOW 子句

语法

window_function OVER ( 
  [ { PARTITION | DISTRIBUTE } BY <column_spec, ...> ]
  [ { ORDER | SORT } BY <expression [ ASC | DESC ][ NULLS { FIRST | LAST } ], ...> ]
  [ window_frame ] 
)

参数

  • window_function
    Ranking Functions
    语法: RANK | DENSE_RANK | PERCENT_RANK | NTILE | ROW_NUMBER
  • Ranking Functions
    语法: RANK | DENSE_RANK | PERCENT_RANK | NTILE | ROW_NUMBER
  • Analytic Functions
    语法: CUME_DIST | LAG | LEAD
  • Aggregate Functions
    语法: MAX | MIN | COUNT | SUM | AVG | ...
  • window_frame
    指定窗口在哪条记录开启、在哪条记录结束。
    • 语法与参数如下:
{ RANGE | ROWS } { frame_start | BETWEEN frame_start AND frame_end }

frame_startframe_end具有以下语法:

UNBOUNDED PRECEDING | offset PRECEDING | CURRENT ROW | offset FOLLOWING | UNBOUNDED FOLLOWING

offset指定frame_end与当前行位置的偏移量。如果省略frame_end,则默认为当前行。
示例

-- Window
-- Simple in-line window
SELECT 
    id,
    name,
    RANK() OVER (PARTITION BY name ORDER BY age) AS rank
FROM test_olap.student;

-- Rows between
SELECT
    id,
    name,
    DENSE_RANK() OVER (PARTITION BY name ORDER BY age ROWS BETWEEN
    UNBOUNDED PRECEDING AND CURRENT ROW) AS rank
FROM test_olap.student;

-- range between
SELECT
    id,
    name,
    DENSE_RANK() OVER (PARTITION BY name ORDER BY age Range BETWEEN
    UNBOUNDED PRECEDING AND CURRENT ROW) AS rank
FROM test_olap.student;

-- Aggregate functions with window
SELECT 
    id,
    name,
    max(age) OVER (PARTITION BY name ORDER BY age) AS rank
FROM test_olap.student;

-- Window expression
SELECT 
    id,
    name,
    max(age) OVER w as rank
    FROM test_olap.student
    WINDOW w as (PARTITION BY name ORDER BY age)

5.1.4 其他示例

5.1.4.1 HAVING

-- Having
-- `HAVING` clause referring to column in `GROUP BY`.
SELECT id,
     sum(age) FILTER (WHERE name in ('Tom', 'Jack')) as sum
     FROM test_olap.student
     GROUP BY id
     Having sum > 50;

-- `HAVING` clause referring to aggregate function.
SELECT id,
     name
     FROM test_olap.student
     GROUP BY id, name
     Having sum(age) > 50;

-- `HAVING` clause referring to aggregate function by its alias
SELECT id,
     name,
     sum(age) as sum
     FROM test_olap.student
     GROUP BY id, name
     Having sum > 50;

5.1.4.2 ORDER BY

-- Order by
-- Sort rows by id. By default rows are sorted in ascending manner with NULL FIRST.
SELECT * FROM test_olap.student ORDER BY id;

-- Sort rows in ascending manner keeping null values to be last.
SELECT * FROM test_olap.student ORDER BY id NULLS LAST;

-- Sort rows by id in descending manner, which defaults to NULL LAST.
SELECT * FROM test_olap.student ORDER BY id DESC;

-- Sort rows in ascending manner keeping null values to be first.
SELECT * FROM test_olap.student ORDER BY id DESC NULLS FIRST;

-- Sort rows based on more than one column with each column having different-- sort direction.
SELECT * FROM test_olap.student ORDER BY id DESC, name ASC;

5.1.4.3 SORT BY

-- Sort by
-- Sort rows by `name` within each partition in ascending manner
SELECT id, name, age FROM test_olap.student SORT BY name;

-- Sort rows within each partition using column position.
SELECT id, name ,age FROM test_olap.student SORT BY 1;

-- Sort rows within partition in ascending manner keeping null values to be last.
SELECT id, name, age FROM test_olap.student SORT BY id NULLS LAST;

-- Sort rows by id within each partition in descending manner, which defaults to NULL LAST.
SELECT id, name, age FROM test_olap.student SORT BY id DESC;

-- Sort rows by id within each partition in descending manner keeping null values to be first.
SELECT id, name, age FROM test_olap.student SORT BY id DESC NULLS FIRST;

-- Sort rows within each partition based on more than one column with each column having different sort direction.
SELECT id, name, age FROM test_olap.student SORT BY id DESC, name;

5.1.4.4 CLUSTER BY & DISTRIBUTE BY

-- Cluster by & distribute by
-- Cluster by
SELECT * FROM test_olap.student CLUSTER BY id;

-- The above CLUSTER BY statement equals to DISTRIBUTE BY with SORT BY like the following SQL
SELECT * FROM test_olap.student DISTRIBUTE BY id SORT BY id;

-- DISTRIBUTE by
SELECT * FROM test_olap.student DISTRIBUTE BY id;

5.1.4.5 LIMIT

-- LIMIT
-- LIMIT 1000 rows
SELECT * FROM test_olap.student LIMIT 1000;
-- LIMIT ALL
SELECT * FROM test_olap.student LIMIT ALL;

5.2 查询执行计划

5.2.1 语法

EXPLAIN [ EXTENDED | CODEGEN | COST] statement

5.2.2 参数

  • CODEGEN
    为语句(如果有的话)和物理计划生成代码。
  • COST
    如果逻辑计划节点统计信息可用,则生成逻辑计划和统计信息。
  • statement
    指定要解释的 SQL 语句。

5.2.3 示例

-- Default Output
EXPLAIN
    SELECT id,
     sum(age) FILTER (WHERE name in ('Tom', 'Jack'))
     FROM test_olap.student
     GROUP BY id;

-- Using Extended
EXPLAIN EXTENDED
    SELECT id,
     sum(age) FILTER (WHERE name in ('Tom', 'Jack'))
     FROM test_olap.student
     GROUP BY id;

-- Using Formatted
EXPLAIN FORMATTED
    SELECT id,
     sum(age) FILTER (WHERE name in ('Tom', 'Jack'))
     FROM test_olap.student
     GROUP BY id;

6. 其他语句

6.1 CACHE 命令

6.1.1 缓存表

  • 语法
CACHE [ LAZY ] TABLE [database_name.]table_name [ [ AS ] select_statement ]

  • 参数
    LAZY
    仅在首次使用时缓存表,而不是立即缓存。
  • 示例
CACHE LAZY TABLE cacheTable AS SELECT * FROM test_olap.student

6.1.2 取消缓存表

  • 语法
UNCACHE TABLE [ IF EXISTS ] [database_name.]table_name

  • 示例
UNCACHE TABLE cacheTable

6.2 DESCRIBE 命令

6.2.1 获取库结构信息

  • 语法
{ DESC | DESCRIBE } { DATABASE | SCHEMA } [ EXTENDED ] database_identifier

  • 示例
DESC DATABASE EXTENDED test_olap

6.2.2 获取表结构信息

  • 语法
{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] [database_name.]table_name [ partition_spec ] [ col_name ]

  • 参数
    FORMATTED
    指定描述输出的可选格式。如果指定EXTENDED,则返回其他元数据信息(如父数据库、所有者和访问时间)。
  • 示例
-- Describe table
DESC FORMATTED test_olap.student

-- Describe partition
DESC EXTENDED test_olap.student_part PARTITION (date = '20200101')

-- Describe column
DESC test_olap.student name

6.2.3 获取函数信息

  • 语法
{ DESC | DESCRIBE } FUNCTION function_name

  • 示例
DESC FUNCTION from_json

6.3 SHOW 命令

6.3.1 展示表的所有列

  • 语法
SHOW COLUMNS { IN | FROM } [database_name.]table_name [ { IN | FROM } database_identifier ]

  • 示例
-- Specifies table qualified with a database name
SHOW COLUMNS IN test_olap.student;

-- Specifies database
SHOW COLUMNS in student in test_olap;

-- Specifies table qualified with a database name also specifies database
SHOW COLUMNS in test_olap.student in test_olap;

6.3.2 查看建表语句

  • 语法
SHOW CREATE TABLE [database_name.]table_name

  • 示例
SHOW CREATE TABLE test_olap.student 

6.3.3 展示库

  • 语法
SHOW { DATABASES | SCHEMAS } [ LIKE regex_pattern ]

  • 参数
    • regex_pattern
      指定用于筛选语句结果的正则表达式模式。
    • 除了*|字符之外,该模式的工作方式类似于正则表达式。
    • * alone 匹配 0 个或更多字符,|用于分隔多个不同的正则表达式,其中任何一个都可以匹配。
    • 在处理之前,在输入模式中修剪前置和后置空白。模式匹配不区分大小写。
  • 示例
-- Show all databases
SHOW DATABASES

-- Filter with regex pattern
SHOW SCHEMAS LIKE 'test_olap'

6.3.4 展示函数

  • 语法
SHOW [ function_kind ] FUNCTIONS [ [ LIKE ] { 'function_name' | 'regex_pattern' } ]

  • 参数
    • function_kind
      指定要搜索的函数的名称空间。有效的名称空间为
    • USER:在用户定义的函数中查找函数。
    • SYSTEM:在系统定义的函数中查找函数。
    • ALL:查找用户和系统定义函数之间的函数。
  • 示例
-- Specify function kind
SHOW SYSTEM FUNCTIONS

-- Specify function name
SHOW FUNCTIONS 'from_json'

-- Specify regex_pattern
SHOW FUNCTIONS 'a*'

6.3.5 展示表的分区

  • 语法
SHOW PARTITIONS [database_name.]table_name

  • 示例
-- Show all partitions
SHOW PARTITIONS test_olap.student_part2

6.3.6 展示表

  • 语法
SHOW TABLES [ { FROM | IN } database_identifier ] [ LIKE 'regex_pattern' ]

  • 示例
-- Show all of the tables in the database
SHOW TABLES IN test_olap;

-- Show tables which names match the regex pattern
SHOW TABLES LIKE 'a*'

-- Show tables in the database which names match the regex pattern
SHOW TABLES FROM test_olap LIKE 'a*'

6.4 SET 命令

  • 语法
SET [ -v ]
SET property_key [ = property_value ]

  • 参数
    • -v
      输出现有 SQLConf 属性的键、值和含义。
    • property_key
      返回指定属性键的值。
    • property_key=property_value
      设置给定属性键的值。如果给定属性键的旧值存在,则它会被新值重写。
  • 示例
-- Set a property.
SET spark.sql.variable.substitute=false;

-- List all SQLConf properties with value and meaning.
SET -v;

-- List all SQLConf properties with value for current session.
SET;

-- List the value of specified property key.
SET spark.sql.variable.substitute;