You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

在Presto中计算偏度与峰度:内置函数及实现方法问询

Skewness & Kurtosis Calculation in Presto

Hey there! Let's break down how to handle skewness and kurtosis in Presto, since it doesn't offer built-in functions for these two stats (yet!). Since you're already familiar with mean() and variance(), we can build on those with standard statistical formulas.

Calculating Skewness

Skewness measures how asymmetric your data distribution is. The formula for sample skewness is:

Skewness = (n / ((n-1)*(n-2))) * Σ((xᵢ - μ)/σ)³

Where:

  • n = total number of samples
  • μ = mean of the dataset (mean() in Presto)
  • σ = standard deviation (stddev() in Presto, which is the square root of variance())

Presto SQL Example

WITH dataset_stats AS (
    SELECT
        COUNT(*) AS sample_count,
        mean(your_column) AS avg_value,
        stddev(your_column) AS std_dev,
        SUM(POWER((your_column - mean(your_column)) / stddev(your_column), 3)) AS cubed_deviation_total
    FROM your_table
)
SELECT
    -- Compute sample skewness
    (sample_count / ((sample_count - 1) * (sample_count - 2))) * cubed_deviation_total AS skewness
FROM dataset_stats
-- Filter out cases where sample size is too small (n must be >2)
WHERE sample_count > 2;

Calculating Kurtosis

Kurtosis measures how "peaked" or "flat" your data distribution is compared to a normal distribution. There are two common variants:

  1. Sample Kurtosis: The raw metric relative to a normal distribution (which has a kurtosis of 3)
  2. Excess Kurtosis: Sample kurtosis minus 3 (so a normal distribution has excess kurtosis of 0)

The formula for sample kurtosis is:

Kurtosis = [(n*(n+1)) / ((n-1)*(n-2)*(n-3))] * Σ((xᵢ - μ)/σ)⁴ - [(3*(n-1)²) / ((n-2)*(n-3))]

Presto SQL Example

WITH dataset_stats AS (
    SELECT
        COUNT(*) AS sample_count,
        mean(your_column) AS avg_value,
        stddev(your_column) AS std_dev,
        SUM(POWER((your_column - mean(your_column)) / stddev(your_column), 4)) AS fourth_power_deviation_total
    FROM your_table
)
SELECT
    -- Compute sample kurtosis
    (sample_count * (sample_count + 1) / ((sample_count - 1) * (sample_count - 2) * (sample_count - 3))) * fourth_power_deviation_total 
    - (3 * POWER(sample_count - 1, 2) / ((sample_count - 2) * (sample_count - 3))) AS kurtosis,
    -- Compute excess kurtosis (more commonly used)
    ((sample_count * (sample_count + 1) / ((sample_count - 1) * (sample_count - 2) * (sample_count - 3))) * fourth_power_deviation_total 
    - (3 * POWER(sample_count - 1, 2) / ((sample_count - 2) * (sample_count - 3)))) - 3 AS excess_kurtosis
FROM dataset_stats
-- Filter out cases where sample size is too small (n must be >3)
WHERE sample_count > 3;

Quick Notes

  • Sample Size Limits: Skewness requires at least 3 samples, kurtosis needs at least 4—otherwise the formulas will hit division-by-zero errors, hence the WHERE clauses.
  • Population vs Sample: If you're working with an entire population (not a sample), adjust the formulas to use population-based coefficients (e.g., replace n-1 with n in variance/stddev calculations, and tweak the scaling factors accordingly).

内容的提问来源于stack exchange,提问作者user2020282

火山引擎 最新活动