Lake AI Service

Special character ratio calculator

Operator introduction

Description

The special character ratio calculator extracts a text feature: the proportion of special characters in each input text.

Key features

  • Special character ratio calculation: Accurately calculates the proportion of special characters in text
  • Multi-granularity support: Allows selection of different types of special characters for calculation
  • Flexible configuration: Supports calculating the ratio of all special characters or specific types of characters

Application scenarios

  • Text quality assessment
  • Data cleaning and preprocessing
  • Text classification feature extraction
  • Content security detection
  • Multilingual text analysis

Technical features

  • Supports multiple character types:
    • all: All special characters (default)
    • whitespace: Whitespace characters
    • punctuation: Punctuation marks
    • digits: Digit characters
    • emoji: Emoji symbols
  • Supports multilingual Unicode character recognition
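The character types listed above can be illustrated with a minimal pure-Python sketch. It is not the operator's implementation: the function name `special_ratio`, the predicate table, and the emoji code-point ranges are assumptions made for illustration, and the sketch does not try to reproduce the operator's handling of edge cases such as whitespace-only input.

```python
from __future__ import annotations

import unicodedata

# Assumption: rough emoji code-point ranges, not the operator's real definition.
_EMOJI_RANGES = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))


def _is_emoji(ch: str) -> bool:
    """Return True if the character falls in one of the assumed emoji ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _EMOJI_RANGES)


# Hypothetical predicates, one per documented character_type value.
_PREDICATES = {
    "whitespace": str.isspace,
    "punctuation": lambda ch: unicodedata.category(ch).startswith("P"),
    "digits": str.isdigit,
    "emoji": _is_emoji,
}


def special_ratio(text: str, character_type: str = "all") -> float | None:
    """Share of characters in `text` that match `character_type`."""
    if character_type != "all" and character_type not in _PREDICATES:
        raise ValueError(f"unsupported character_type: {character_type!r}")
    if not text:
        return None  # no characters, so no ratio can be computed
    if character_type == "all":
        checks = list(_PREDICATES.values())
    else:
        checks = [_PREDICATES[character_type]]
    # Count each character once, even if it matches several predicates.
    hits = sum(1 for ch in text if any(check(ch) for check in checks))
    return hits / len(text)


sample = "你好 Hello \U0001F60A 123 !!!"
print(special_ratio(sample))                 # all special characters
print(special_ratio(sample, "punctuation"))  # punctuation only
```

Because punctuation is detected through Unicode general categories rather than an ASCII list, full-width marks such as "。" are counted as well, which is one way to read the multilingual support noted above.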

Daft invocation

Operator parameters

Input

Input column name | Note
texts | The text column to be processed. Elements must be of string type.

Output

The ratio result column. Elements are floating-point numbers representing the proportion of special characters in each input text.
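For the data cleaning scenario, the ratio column can be used as a filter. The sketch below is a pure-Python illustration and not part of the operator: `keep_clean_rows`, `max_ratio`, and the 0.5 threshold are hypothetical, and the rows are assumed to be held in a column-oriented dict. Rows whose ratio is None are dropped here, which is only one possible policy.

```python
from __future__ import annotations


def keep_clean_rows(
    rows: dict[str, list],
    max_ratio: float = 0.5,  # hypothetical threshold, tune per dataset
    ratio_column: str = "special_ratio",
) -> dict[str, list]:
    """Keep only rows whose ratio is present and at most `max_ratio`."""
    keep = [r is not None and r <= max_ratio for r in rows[ratio_column]]
    return {
        name: [value for value, flag in zip(values, keep) if flag]
        for name, values in rows.items()
    }


rows = {
    "text": ["Hello world!", "1234567890", "     "],
    "special_ratio": [2 / 12, 1.0, None],
}
cleaned = keep_clean_rows(rows)
print(cleaned["text"])
```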

Parameters

A parameter without a default value is required.

Parameter name | Type | Default value | Description
character_type | str | all | The type of special character to count. Optional values: all, whitespace, punctuation, digits, emoji.

Examples

The following code demonstrates how to use Daft to run the operator and calculate the proportion of special characters in each text.

from __future__ import annotations

import os

import daft
from daft import col
from daft.las.functions.text.special_characters_ratio_calculator import SpecialCharactersRatioCalculator
from daft.las.functions.udf import las_udf

if __name__ == "__main__":

    if os.getenv("DAFT_RUNNER", "native") == "ray":
        import logging

        import ray

        def configure_logging():
            logging.basicConfig(
                level=logging.INFO,
                format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
            logging.getLogger("tracing.span").setLevel(logging.WARNING)
            logging.getLogger("daft_io.stats").setLevel(logging.WARNING)
            logging.getLogger("DaftStatisticsManager").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaScheduler").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaDispatcher").setLevel(logging.WARNING)

        ray.init(dashboard_host="0.0.0.0", runtime_env={"worker_process_setup_hook": configure_logging})
        daft.set_runner_ray()
    daft.set_execution_config(actor_udf_ready_timeout=600)
    daft.set_execution_config(min_cpu_per_task=0)

    samples = {
        "text": [
            "Hello world!",
            "1234567890",
            "     ",
            "这是中文文本",
            "你好 Hello 😊 123 !!!",
        ]
    }

    ds = daft.from_pydict(samples)
    ds = ds.with_column(
        "special_ratio",
        las_udf(
            SpecialCharactersRatioCalculator,
            construct_args={"character_type": "all"},
        )(col("text")),
    )

    ds.show()
    # ╭───────────────────────┬─────────────────────╮
    # │ text                  ┆ special_ratio       │
    # │ ---                   ┆ ---                 │
    # │ String                ┆ Float64             │
    # ╞═══════════════════════╪═════════════════════╡
    # │ Hello world!          ┆ 0.16666666666666666 │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ 1234567890            ┆ 1                   │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │                       ┆ None                │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ 这是中文文本          ┆ 0                   │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ 你好 Hello 😊 123 !!! ┆ 0.6111111111111112  │
    # ╰───────────────────────┴─────────────────────╯
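The sample output is consistent with the ratio being the number of matching characters divided by the total number of characters. This reading is inferred from the figures above and is not stated explicitly by the source. For the last row, the 18 characters contain 4 spaces, 3 punctuation marks, 3 digits, and 1 emoji; for the first row, the 12 characters contain 1 space and 1 punctuation mark:

```latex
\[
\text{ratio}_{\text{row 5}} = \frac{4 + 3 + 3 + 1}{18} = \frac{11}{18} \approx 0.6111,
\qquad
\text{ratio}_{\text{row 1}} = \frac{1 + 1}{12} = \frac{1}{6} \approx 0.1667
\]
```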
Last updated: 2026.05.12 19:06:37