如何基于Excel某列中的相似内容对行进行分组?求VBA/Python实现方案及算法建议
Hey there! Let's break down how to tackle your problem of grouping Excel rows based on similar link text—whether you're leaning into VBA (since you're new to it) or open to a Python alternative. I'll start with the core algorithm logic, then dive into actionable code snippets and learning resources tailored to your skill level.
Before jumping into code, let's outline the key steps to group similar links:
- Extract core text from links: Most links have extra fluff (like domain names, query parameters, or random IDs) that don't affect similarity. We'll strip that away to focus on the meaningful part (e.g., turning
https://blog.example.com/post/learn-vbaintopost/learn-vba). - Measure text similarity: Use a method to compare how "alike" two core texts are. Options include:
- Simple substring matching (e.g., checking if one text contains another)
- Levenshtein Distance (counts how many edits are needed to turn one string into another—lower = more similar)
- Fuzzy matching (calculates a similarity percentage, e.g., 80% match)
- Assign groups: Set a similarity threshold (e.g., Levenshtein distance ≤ 3, or 80% fuzzy match) and group rows that meet this threshold. Rows that don't match any existing group get a new group ID.
Since you're new to VBA, we'll start with straightforward, commented code you can adapt. First, enable the Developer tab in Excel, then insert a new module (Developer > Visual Basic > Insert > Module).
Step 1: Helper Function to Extract Core Link Text
This function strips out non-essential parts of the link—tweak it based on your actual link structure!
Function GetCoreLinkText(link As String) As String ' Split the link by slashes to isolate path parts Dim urlParts As Variant urlParts = Split(link, "/") ' Example: Grab the last two path segments (adjust based on your links!) If UBound(urlParts) >= 2 Then GetCoreLinkText = urlParts(UBound(urlParts) - 1) & "/" & urlParts(UBound(urlParts)) Else ' Fallback: return the full link if it's too short GetCoreLinkText = link End If End Function
Step 2: Main Grouping Macro
This macro will add a "Group ID" column and assign groups based on similar core text:
Sub GroupSimilarLinks() Dim ws As Worksheet Set ws = ActiveSheet ' Replace with Sheets("YourSheetName") if needed Dim lastRow As Long lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row ' Assume links are in column A Dim coreTexts() As String ReDim coreTexts(1 To lastRow) Dim i As Long, j As Long Dim groupNum As Long groupNum = 1 ' First, extract core text for all links For i = 2 To lastRow ' Skip header row (row 1) coreTexts(i) = GetCoreLinkText(ws.Cells(i, "A").Value) Next i ' Set up the Group ID column (column B) ws.Cells(1, "B").Value = "Group ID" ws.Cells(2, "B").Value = groupNum ' Compare each row to previous rows and assign groups For i = 3 To lastRow Dim isMatch As Boolean isMatch = False ' Check against all already grouped rows For j = 2 To i - 1 ' Use Levenshtein Distance for accurate similarity (threshold = 3) ' Or swap with InStr for simple substring matching: InStr(1, coreTexts(i), coreTexts(j), vbTextCompare) > 0 If LevenshteinDistance(coreTexts(i), coreTexts(j)) <= 3 Then ws.Cells(i, "B").Value = ws.Cells(j, "B").Value isMatch = True Exit For End If Next j ' If no match found, create a new group If Not isMatch Then groupNum = groupNum + 1 ws.Cells(i, "B").Value = groupNum End If Next i End Sub ' Helper Function: Calculate Levenshtein Distance (string similarity) Function LevenshteinDistance(s1 As String, s2 As String) As Integer Dim len1 As Integer, len2 As Integer Dim i As Integer, j As Integer Dim matrix() As Integer len1 = Len(s1) len2 = Len(s2) ReDim matrix(0 To len1, 0 To len2) ' Initialize matrix boundaries For i = 0 To len1 matrix(i, 0) = i Next i For j = 0 To len2 matrix(0, j) = j Next j ' Fill matrix to calculate distance For i = 1 To len1 For j = 1 To len2 If Mid(s1, i, 1) = Mid(s2, j, 1) Then matrix(i, j) = matrix(i - 1, j - 1) Else matrix(i, j) = 1 + WorksheetFunction.Min(matrix(i - 1, j), _ matrix(i, j - 1), _ matrix(i - 1, j - 1)) End If Next j Next i LevenshteinDistance = matrix(len1, len2) End Function
VBA Learning Tips for Beginners
- Start with macro recording: Record a simple task (like formatting cells) to see how VBA translates actions into code.
- Use Excel's built-in help: Press
F1in the VBA editor to access official documentation for functions and syntax. - Practice small: Try modifying the
GetCoreLinkTextfunction to match your specific link structure first—this will help you get comfortable with string manipulation.
If you're open to Python, this approach is faster for large datasets and uses powerful libraries for fuzzy matching.
Step 1: Install Required Libraries
First, install the packages you'll need:
pip install pandas fuzzywuzzy python-Levenshtein
Step 2: Python Code to Group Links
import pandas as pd from fuzzywuzzy import process from urllib.parse import urlparse # Load your Excel file df = pd.read_excel("your_excel_file.xlsx") # Replace with your file path link_column_name = "Link" # Replace with your actual link column name # Function to extract core text from links def get_core_link_text(link): parsed_url = urlparse(link) # Extract path and split into segments (adjust based on your links!) path_segments = parsed_url.path.strip("/").split("/") # Grab the last two meaningful segments if len(path_segments) >= 2: return "/".join(path_segments[-2:]) return parsed_url.path # Add core text column to the dataframe df["CoreLinkText"] = df[link_column_name].apply(get_core_link_text) # Function to assign group IDs based on fuzzy similarity def assign_group_id(row, existing_groups, threshold=80): # Check if we have any groups yet if not existing_groups: existing_groups.append(row["CoreLinkText"]) return 1 # Find the most similar existing group best_match, similarity_score = process.extractOne(row["CoreLinkText"], existing_groups) if similarity_score >= threshold: return existing_groups.index(best_match) + 1 # Create new group if no good match existing_groups.append(row["CoreLinkText"]) return len(existing_groups) # Initialize groups and assign IDs existing_groups = [] df["GroupID"] = df.apply(lambda row: assign_group_id(row, existing_groups), axis=1) # Save the grouped data to a new Excel file df.to_excel("grouped_links_result.xlsx", index=False)
Python Learning Tips
- Start with pandas basics: Learn how to read/write Excel files and manipulate dataframes—this is essential for Excel-related tasks.
- Experiment with fuzzy matching: Adjust the
thresholdvalue (e.g., 70 vs. 90) to see how it affects grouping. - Use the official docs: Pandas and fuzzywuzzy have great documentation with examples tailored to common data tasks.
- Link core text extraction: This is the most critical part! Tweak the
GetCoreLinkText(VBA) orget_core_link_text(Python) functions to match your link structure. For example, if your links have query parameters (?id=123), add code to strip those out. - Similarity threshold: Adjust the threshold (Levenshtein distance in VBA, percentage in Python) based on how strict you want the grouping to be. Test with a small subset of your data first!
内容的提问来源于stack exchange,提问作者weblasky




