You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

如何在Apache Spark ML API中从DataFrame创建Vector(含DenseVector、SparseVector)?

Creating DenseVector and SparseVector from DataFrames in Apache Spark ML API

Hey there! Let's walk through how to create vectors from the org.apache.spark.mllib.linalg package (specifically DenseVector and SparseVector) using Spark's DataFrame API. I'll cover practical, common methods with code examples you can test out right away.


1. Creating a DenseVector

A dense vector stores all elements explicitly—even zeros. This is ideal when your data doesn't have many zero values and you want straightforward storage.

Method 1: Use VectorAssembler (Most Common)

VectorAssembler (from org.apache.spark.ml.feature) is the easiest way to combine multiple numeric columns into a single DenseVector column. It's optimized for this use case and requires minimal code.

First, import the necessary classes:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.DenseVector

Then, initialize your SparkSession and create a sample DataFrame:

val spark = SparkSession.builder()
  .appName("VectorCreationExample")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val df = Seq(
  (1.0, 2.0, 3.0),
  (4.0, 5.0, 6.0)
).toDF("col1", "col2", "col3")

Now, use VectorAssembler to build your dense vector column:

val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("dense_vector")

val dfWithDenseVector = assembler.transform(df)

// Check the result
dfWithDenseVector.select("dense_vector").show(false)

The output will show rows like [1.0,2.0,3.0]—this is a DenseVector under the hood.

Method 2: Custom UDF (For Manual Control)

If you need to combine values in a custom way (e.g., filtering or transforming values before creating the vector), use a User-Defined Function (UDF):

import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.Vectors

// Define a UDF that takes three doubles and returns a DenseVector
val createDenseVector = udf((a: Double, b: Double, c: Double) => {
  Vectors.dense(Array(a, b, c)).asInstanceOf[DenseVector]
})

val dfWithCustomDense = df.withColumn("custom_dense", createDenseVector($"col1", $"col2", $"col3"))

dfWithCustomDense.select("custom_dense").show(false)

2. Creating a SparseVector

A sparse vector only stores non-zero values and their indices, which saves massive amounts of memory for high-dimensional datasets with lots of zeros (like text features or one-hot encoded data). It requires three components:

  • Size: Total length of the vector
  • Indices: Array of positions where values are non-zero
  • Values: Array of non-zero values corresponding to the indices

Method 1: UDF for Sparse Vector Construction

Let's create a sample DataFrame with the components needed for a sparse vector, then use a UDF to assemble it:

First, add the sparse vector import to your existing imports:

import org.apache.spark.mllib.linalg.SparseVector

Create a sample DataFrame with size, indices, and values:

val sparseDF = Seq(
  (5, Array(0, 2), Array(1.0, 3.0)),  // Represents: [1.0, 0, 3.0, 0, 0]
  (5, Array(1, 3), Array(2.0, 4.0))   // Represents: [0, 2.0, 0, 4.0, 0]
).toDF("vec_size", "indices", "values")

Define a UDF to build the SparseVector:

val createSparseVector = udf((size: Int, indices: Array[Int], values: Array[Double]) => {
  Vectors.sparse(size, indices, values).asInstanceOf[SparseVector]
})

val dfWithSparseVector = sparseDF.withColumn("sparse_vector", createSparseVector($"vec_size", $"indices", $"values"))

// View the result
dfWithSparseVector.select("sparse_vector").show(false)

The output will display sparse vectors like (5,[0,2],[1.0,3.0]).

Method 2: Convert Dense to Sparse (Optional)

If you already have a dense vector column, you can convert it to a sparse vector using a UDF to save space:

val denseToSparse = udf((denseVec: DenseVector) => {
  val nonZeroPairs = denseVec.toArray.zipWithIndex.filter(_._1 != 0.0)
  Vectors.sparse(denseVec.size, nonZeroPairs.map(_._2), nonZeroPairs.map(_._1))
})

val dfWithSparseFromDense = dfWithDenseVector.withColumn("sparse_from_dense", denseToSparse($"dense_vector"))

dfWithSparseFromDense.select("dense_vector", "sparse_from_dense").show(false)

Key Notes

  • org.apache.spark.ml.linalg.Vector is an alias for org.apache.spark.mllib.linalg.Vector, so vectors created via ML components are fully compatible with the mllib package.
  • Stick with VectorAssembler for dense vectors whenever possible—it's more efficient and easier to maintain than custom UDFs.
  • Sparse vectors are a game-changer for high-dimensional data—always use them when most of your values are zero.

内容的提问来源于stack exchange,提问作者Aman Raturi

火山引擎 最新活动