How do I create a Vector (including DenseVector and SparseVector) from a DataFrame in the Apache Spark ML API?
DenseVector and SparseVector from DataFrames in Apache Spark ML API
Hey there! Let's walk through how to create DenseVector and SparseVector columns using Spark's DataFrame API. The DataFrame-based ML API works with the classes in org.apache.spark.ml.linalg; the older RDD-based package org.apache.spark.mllib.linalg has classes with the same names, but they are separate types. I'll cover practical, common methods with code examples you can test out right away.
1. Creating a DenseVector
A dense vector stores every element explicitly, zeros included. This is ideal when your data has few zero values and you want straightforward storage.
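To make the storage model concrete, here is a minimal sketch that builds a dense vector locally, with no DataFrame involved. It assumes the org.apache.spark.ml.linalg package (the one used by the DataFrame-based ML API); the values and the name `dv` are made up for illustration.

```scala
import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}

// Build a local dense vector; the zero at index 1 is stored like any other entry
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

println(dv.size)                       // number of elements, zeros included
println(dv.toArray.mkString(", "))     // the stored values, in order
println(dv.isInstanceOf[DenseVector])  // Vectors.dense builds a DenseVector
```

The same Vectors.dense call is what a custom UDF would use to build a vector per row.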
Method 1: Use VectorAssembler (Most Common)
VectorAssembler (from org.apache.spark.ml.feature) is the easiest way to combine multiple numeric columns into a single vector column. It's built for exactly this use case and requires minimal code.
First, import the necessary classes:
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.VectorAssembler
    // VectorAssembler produces ml.linalg vectors, so import DenseVector from ml, not mllib
    import org.apache.spark.ml.linalg.DenseVector
Then, initialize your SparkSession and create a sample DataFrame:
    val spark = SparkSession.builder()
      .appName("VectorCreationExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1.0, 2.0, 3.0),
      (4.0, 5.0, 6.0)
    ).toDF("col1", "col2", "col3")
Now, use VectorAssembler to build your dense vector column:
    val assembler = new VectorAssembler()
      .setInputCols(Array("col1", "col2", "col3"))
      .setOutputCol("dense_vector")

    val dfWithDenseVector = assembler.transform(df)

    // Check the result
    dfWithDenseVector.select("dense_vector").show(false)
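The output will show rows like [1.0,2.0,3.0], which are DenseVector instances from org.apache.spark.ml.linalg. Keep in mind that VectorAssembler stores each row in whichever representation is more compact, so rows that are mostly zeros come out as SparseVector instead.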
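One caveat worth checking yourself: because VectorAssembler keeps the more compact representation per row, a single output column can mix dense and sparse vectors. The sketch below assumes a local SparkSession and a made-up four-column DataFrame (`mixedDf`, its column names, and `kinds` are hypothetical), and pattern-matches on each assembled vector to report its concrete type.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

val spark = SparkSession.builder()
  .appName("AssemblerOutputTypes")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: the first row has no zeros, the second is mostly zeros
val mixedDf = Seq(
  (1.0, 2.0, 3.0, 4.0),
  (0.0, 0.0, 0.0, 5.0)
).toDF("a", "b", "c", "d")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b", "c", "d"))
  .setOutputCol("features")
  .transform(mixedDf)

// Collect the vectors and label each one by its runtime class
val kinds = assembled.select("features").collect().map { row =>
  row.getAs[Vector]("features") match {
    case _: DenseVector  => "dense"
    case _: SparseVector => "sparse"
  }
}
println(kinds.mkString(", "))
```

With this data the first row should be reported as dense and the second as sparse. The practical consequence is that downstream UDFs should accept the general Vector type rather than DenseVector.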
Method 2: Custom UDF (For Manual Control)
If you need to combine values in a custom way (e.g., filtering or transforming values before creating the vector), use a User-Defined Function (UDF):
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // UDF that takes three doubles and returns a dense ml.linalg vector.
    // Returning the general Vector type avoids a cast and works with ML estimators.
    val createDenseVector = udf((a: Double, b: Double, c: Double) => Vectors.dense(a, b, c))

    val dfWithCustomDense = df.withColumn("custom_dense", createDenseVector($"col1", $"col2", $"col3"))
    dfWithCustomDense.select("custom_dense").show(false)
2. Creating a SparseVector
A sparse vector stores only the non-zero values and their indices, which saves a lot of memory for high-dimensional data that is mostly zeros (like text features or one-hot encoded data). It requires three components:
- Size: Total length of the vector
- Indices: Array of positions where values are non-zero
- Values: Array of non-zero values corresponding to the indices
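These three components map directly onto the arguments of Vectors.sparse. A minimal local sketch, assuming the org.apache.spark.ml.linalg package and illustrative values (the name `sv` is made up):

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vectors}

// size = 5, with non-zero entries at positions 0 and 2
val sv: SparseVector = Vectors.sparse(5, Array(0, 2), Array(1.0, 3.0)).toSparse

println(sv.size)                    // total length, including the implicit zeros
println(sv.indices.mkString(", "))  // positions of the stored values
println(sv.values.mkString(", "))   // the stored values themselves
println(sv.toDense)                 // the same vector with the zeros written out
```

The indices must be in increasing order and line up one-to-one with the values.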
Method 1: UDF for Sparse Vector Construction
Let's create a sample DataFrame with the components needed for a sparse vector, then use a UDF to assemble it:
First, add the vector imports from the ml package to your existing imports:
    import org.apache.spark.ml.linalg.{Vector, Vectors}
Create a sample DataFrame with size, indices, and values:
    val sparseDF = Seq(
      (5, Array(0, 2), Array(1.0, 3.0)), // Represents: [1.0, 0.0, 3.0, 0.0, 0.0]
      (5, Array(1, 3), Array(2.0, 4.0))  // Represents: [0.0, 2.0, 0.0, 4.0, 0.0]
    ).toDF("vec_size", "indices", "values")
Define a UDF to build the SparseVector. Spark hands array columns to a Scala UDF as a Seq, and declaring the parameters as Array[Int] fails with a ClassCastException on Spark 2.x, so take Seq and call toArray:
    val createSparseVector = udf((size: Int, indices: Seq[Int], values: Seq[Double]) =>
      Vectors.sparse(size, indices.toArray, values.toArray)
    )

    val dfWithSparseVector = sparseDF.withColumn("sparse_vector", createSparseVector($"vec_size", $"indices", $"values"))

    // View the result
    dfWithSparseVector.select("sparse_vector").show(false)
The output will display sparse vectors like (5,[0,2],[1.0,3.0]).
Method 2: Convert Dense to Sparse (Optional)
If you already have a vector column that is mostly zeros, you can convert it to a sparse vector with a UDF to save space:
    import org.apache.spark.ml.linalg.Vector

    // Accept the general Vector type: the column holds ml.linalg vectors,
    // and VectorAssembler may emit dense or sparse rows.
    val denseToSparse = udf[Vector, Vector](vec => vec.toSparse)

    val dfWithSparseFromDense = dfWithDenseVector.withColumn("sparse_from_dense", denseToSparse($"dense_vector"))
    dfWithSparseFromDense.select("dense_vector", "sparse_from_dense").show(false)
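When you don't know up front whether sparse storage is the better fit, ml.linalg vectors also offer `compressed`, which returns whichever representation takes less storage. A small local sketch with made-up values (all names here are illustrative):

```scala
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vectors}

val mostlyZeros = Vectors.dense(0.0, 0.0, 0.0, 0.0, 7.0)
val noZeros = Vectors.dense(1.0, 2.0, 3.0)

// toSparse always converts; compressed lets Spark pick the representation
val forced = mostlyZeros.toSparse
val chosenForZeros = mostlyZeros.compressed
val chosenForFull = noZeros.compressed

println(forced)
println(chosenForZeros.getClass.getSimpleName)
println(chosenForFull.getClass.getSimpleName)
```

Forcing a vector with few zeros into sparse form can use more memory than the dense original, so `compressed` is the safer default inside a general-purpose UDF.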
Key Notes
- org.apache.spark.ml.linalg.Vector and org.apache.spark.mllib.linalg.Vector are separate types, not aliases. DataFrame-based ML components such as VectorAssembler produce and expect the ml version, so vectors must be converted explicitly (for example with Vectors.fromML and asML) before they can be used with the mllib package.
- Stick with VectorAssembler whenever possible. It avoids the overhead of a custom UDF and is easier to maintain.
- Sparse vectors pay off for high-dimensional data. Prefer them when most of your values are zero.
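Converting between the ml and mllib vector types takes an explicit call. The sketch below uses renamed imports to keep the two packages apart; the aliases (`MLVector`, `OldVector`, and so on) and the sample values are my own choices for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}

val mlVec: MLVector = MLVectors.dense(1.0, 2.0, 3.0)

// ml -> mllib
val oldVec: OldVector = OldVectors.fromML(mlVec)

// mllib -> ml
val backToMl: MLVector = oldVec.asML

println(oldVec.getClass.getName)
println(backToMl == mlVec)
```

For whole DataFrame columns, MLUtils.convertVectorColumnsToML and MLUtils.convertVectorColumnsFromML in org.apache.spark.mllib.util perform the same conversion without a hand-written UDF.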
This question was originally asked on Stack Exchange by Aman Raturi.




