在Presto中计算偏度与峰度:内置函数及实现方法问询
Skewness & Kurtosis Calculation in Presto
Hey there! Let's break down how to handle skewness and kurtosis in Presto, since it doesn't offer built-in functions for these two stats (yet!). Since you're already familiar with mean() and variance(), we can build on those with standard statistical formulas.
Calculating Skewness
Skewness measures how asymmetric your data distribution is. The formula for sample skewness is:
Skewness = (n / ((n-1)*(n-2))) * Σ((xᵢ - μ)/σ)³
Where:
n= total number of samplesμ= mean of the dataset (mean()in Presto)σ= standard deviation (stddev()in Presto, which is the square root ofvariance())
Presto SQL Example
WITH dataset_stats AS ( SELECT COUNT(*) AS sample_count, mean(your_column) AS avg_value, stddev(your_column) AS std_dev, SUM(POWER((your_column - mean(your_column)) / stddev(your_column), 3)) AS cubed_deviation_total FROM your_table ) SELECT -- Compute sample skewness (sample_count / ((sample_count - 1) * (sample_count - 2))) * cubed_deviation_total AS skewness FROM dataset_stats -- Filter out cases where sample size is too small (n must be >2) WHERE sample_count > 2;
Calculating Kurtosis
Kurtosis measures how "peaked" or "flat" your data distribution is compared to a normal distribution. There are two common variants:
- Sample Kurtosis: The raw metric relative to a normal distribution (which has a kurtosis of 3)
- Excess Kurtosis: Sample kurtosis minus 3 (so a normal distribution has excess kurtosis of 0)
The formula for sample kurtosis is:
Kurtosis = [(n*(n+1)) / ((n-1)*(n-2)*(n-3))] * Σ((xᵢ - μ)/σ)⁴ - [(3*(n-1)²) / ((n-2)*(n-3))]
Presto SQL Example
WITH dataset_stats AS ( SELECT COUNT(*) AS sample_count, mean(your_column) AS avg_value, stddev(your_column) AS std_dev, SUM(POWER((your_column - mean(your_column)) / stddev(your_column), 4)) AS fourth_power_deviation_total FROM your_table ) SELECT -- Compute sample kurtosis (sample_count * (sample_count + 1) / ((sample_count - 1) * (sample_count - 2) * (sample_count - 3))) * fourth_power_deviation_total - (3 * POWER(sample_count - 1, 2) / ((sample_count - 2) * (sample_count - 3))) AS kurtosis, -- Compute excess kurtosis (more commonly used) ((sample_count * (sample_count + 1) / ((sample_count - 1) * (sample_count - 2) * (sample_count - 3))) * fourth_power_deviation_total - (3 * POWER(sample_count - 1, 2) / ((sample_count - 2) * (sample_count - 3)))) - 3 AS excess_kurtosis FROM dataset_stats -- Filter out cases where sample size is too small (n must be >3) WHERE sample_count > 3;
Quick Notes
- Sample Size Limits: Skewness requires at least 3 samples, kurtosis needs at least 4—otherwise the formulas will hit division-by-zero errors, hence the
WHEREclauses. - Population vs Sample: If you're working with an entire population (not a sample), adjust the formulas to use population-based coefficients (e.g., replace
n-1withnin variance/stddev calculations, and tweak the scaling factors accordingly).
内容的提问来源于stack exchange,提问作者user2020282




