Lake AI Service

Maximum English word length calculator

Operator introduction

Description

Maximum English word length calculator: calculates the length of the longest English word in each input text.

Key features

  • English word recognition: Uses regular expressions to identify English words in text
  • Maximum length calculation: Calculates the maximum length among all English words
  • Batch processing: Supports calculating the maximum English word length for batch text

Application scenarios

  • Text quality inspection
  • Data preprocessing and filtering
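As an illustration of the filtering scenario, the following plain-Python sketch drops samples whose longest English word exceeds a limit. It does not call the operator itself; the helper names and the threshold of 20 are hypothetical and chosen only for this example.

```python
import re

# Same pattern the operator is documented to use for English words.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")

# Hypothetical threshold, chosen only for illustration.
MAX_ALLOWED_WORD_LENGTH = 20


def longest_english_word(text: str) -> int:
    """Length of the longest run of ASCII letters, or 0 if there is none."""
    return max((len(w) for w in ENGLISH_WORD.findall(text)), default=0)


def keep_sample(text: str, limit: int = MAX_ALLOWED_WORD_LENGTH) -> bool:
    """Keep a sample unless its longest English word exceeds the limit."""
    return longest_english_word(text) <= limit


samples = [
    "The quick brown fox jumps over the lazy dog",
    "supercalifragilisticexpialidocious is a very long word",
    "x" * 40,  # a 40-letter run, likely noise rather than a real word
]
kept = [s for s in samples if keep_sample(s)]
```

In a real pipeline the same decision could be made by filtering on the operator's output column; the right limit depends on the dataset.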

Technical features

  • Recognizes only English words: Uses the regular expression [A-Za-z]+, so a word is a maximal run of ASCII letters
  • Non-English text is ignored: Characters outside A-Z and a-z never count toward a word; text with no English words yields 0 and a null input yields null, as the example output below shows
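The matching rule can be sketched in a few lines of standard-library Python. This is a minimal approximation based on the documented pattern and the example output, not the operator's actual implementation; the function names are hypothetical.

```python
from __future__ import annotations

import re

# Pattern quoted in the documentation: one or more ASCII letters.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def max_english_word_length(text: str | None) -> int | None:
    """Longest [A-Za-z]+ match in text; 0 if none; None for null input."""
    if text is None:
        return None
    return max((len(w) for w in ENGLISH_WORD.findall(text)), default=0)


def max_english_word_lengths(texts: list[str | None]) -> list[int | None]:
    """Batch variant: apply the same rule to every element."""
    return [max_english_word_length(t) for t in texts]
```

With this pattern, digits, punctuation, and non-Latin characters all act as word boundaries, so "don't" counts as two words ("don" and "t") with a maximum length of 3.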

Daft usage

Operator parameters

Input

Input column name    Description
texts                The text column to be processed; element type must be string

Output

Maximum word length column, element type is integer

Examples

The following code demonstrates how to use Daft to run the operator and calculate the maximum English word length for each text in a column.

from __future__ import annotations

import os

import daft
from daft import col
from daft.las.functions.text.maximum_word_length_calculator import MaximumWordLengthCalculator
from daft.las.functions.udf import las_udf

if __name__ == "__main__":

    if os.getenv("DAFT_RUNNER", "native") == "ray":
        import logging

        import ray

        def configure_logging():
            logging.basicConfig(
                level=logging.INFO,
                format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
            logging.getLogger("tracing.span").setLevel(logging.WARNING)
            logging.getLogger("daft_io.stats").setLevel(logging.WARNING)
            logging.getLogger("DaftStatisticsManager").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaScheduler").setLevel(logging.WARNING)
            logging.getLogger("DaftFlotillaDispatcher").setLevel(logging.WARNING)

        ray.init(dashboard_host="0.0.0.0", runtime_env={"worker_process_setup_hook": configure_logging})
        daft.set_runner_ray()
    daft.set_execution_config(actor_udf_ready_timeout=600)
    daft.set_execution_config(min_cpu_per_task=0)

    samples = {
        "text": [
            "Hello world 你好世界",
            "Python编程 is fun",
            "这是一个中文句子",
            "The quick brown fox jumps over the lazy dog",
            "supercalifragilisticexpialidocious is a very long word",
            None,
        ]
    }

    ds = daft.from_pydict(samples)
    ds = ds.with_column(
        "max_word_length",
        las_udf(
            MaximumWordLengthCalculator,
            construct_args={},
        )(col("text")),
    )

    ds.show()
    # ╭────────────────────────────────┬─────────────────╮
    # │ text                           ┆ max_word_length │
    # │ ---                            ┆ ---             │
    # │ String                         ┆ Int64           │
    # ╞════════════════════════════════╪═════════════════╡
    # │ Hello world 你好世界           ┆ 5               │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ Python编程 is fun              ┆ 6               │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ 这是一个中文句子               ┆ 0               │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ The quick brown fox jumps ove… ┆ 5               │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ supercalifragilisticexpialido… ┆ 34              │
    # ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    # │ None                           ┆ None            │
    # ╰────────────────────────────────┴─────────────────╯
Last updated: 2026.05.12 19:06:37