How do I extract specific text after a given word in a Python Pandas DataFrame? Also: resolving a TypeError when extracting the contents of the text field
Hey there! Let's tackle your two Pandas questions one by one:
For the first question (extracting the text that follows a specific word), a reliable approach is a regular expression with str.extract, which lets you target the exact word and capture everything that follows it. Here's a concrete example:
Suppose you have a DataFrame with a column containing strings like "text: your desired content" and you want to pull out everything after "text:":
    import pandas as pd

    # Sample DataFrame
    df = pd.DataFrame({
        'content': [
            "text: extract this part",
            "random leading text text: another example",
            "no matching keyword here"
        ]
    })

    # Extract text after "text:" (ignores any spaces right after the keyword)
    df['extracted_text'] = df['content'].str.extract(r'text:\s*(.*)', expand=False)
Breakdown of the regex:
- text:\s* matches the exact keyword "text:" plus any whitespace that follows it
- (.*) captures all characters after that point, up to the end of the line (the () creates a capture group that str.extract will return)
- expand=False ensures we get a Series back, which fits directly into a new DataFrame column

Rows that don't contain the keyword, like the third sample row, get NaN in the new column.
If you need to target a different word, just replace "text:" in the regex with your desired keyword (e.g., r'keyword:\s*(.*)').
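If the keyword changes often, or contains characters that are special in a regex (such as "." or "("), wrapping the pattern in a small helper can help. The sketch below is only an illustration: extract_after is a hypothetical helper name, not a Pandas function, and it uses re.escape so the keyword is always matched literally.

```python
import re

import pandas as pd


def extract_after(series: pd.Series, keyword: str) -> pd.Series:
    """Return the text that follows the first occurrence of `keyword`.

    Hypothetical helper for illustration; rows without the keyword get NaN.
    """
    # re.escape keeps regex metacharacters in the keyword literal
    pattern = re.escape(keyword) + r"\s*(.*)"
    return series.str.extract(pattern, expand=False)


df = pd.DataFrame({
    "content": [
        "text: extract this part",
        "random leading text text: another example",
        "no matching keyword here",
    ]
})

df["extracted_text"] = extract_after(df["content"], "text:")
print(df)
```

The first two rows yield "extract this part" and "another example"; the third has no "text:" in it, so it stays NaN.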
For the second question, the TypeError happens because some values in the fields.description.content column are missing (NaN). In Pandas, NaN is a float, and you can't use dictionary-style indexing (x['text']) on a float.
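To see the cause in isolation, here is a minimal reproduction. The sample data is made up for illustration (one dict and one missing value); your real column will hold different content.

```python
import numpy as np
import pandas as pd

# Made-up sample: one dict and one missing value
issues_df = pd.DataFrame({
    "fields.description.content": [{"text": "first issue"}, np.nan]
})

# The missing entry is a plain Python float, not a dict
missing = issues_df["fields.description.content"].iloc[1]
print(type(missing))  # <class 'float'>

error_message = None
try:
    # Fails on the NaN row because a float cannot be indexed with ['text']
    issues_df["fields.description.content"].apply(lambda x: x["text"])
except TypeError as err:
    error_message = str(err)
    print(error_message)
```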
Here are two simple fixes:
Option 1: Use str.get() (cleanest approach)
Pandas has a built-in str.get() method that safely pulls values from dictionary-like objects, and returns NaN for non-dictionary values (like missing data):
issues_df['new_column'] = issues_df['fields.description.content'].str.get('text')
Option 2: Add a type check in your lambda
If you prefer using apply, add a check to make sure x is a dictionary before accessing its 'text' key:
    import pandas as pd

    issues_df['new_column'] = issues_df['fields.description.content'].apply(
        lambda x: x['text'] if isinstance(x, dict) else pd.NA
    )
Both approaches will handle missing values gracefully without throwing errors. The first option is more concise and idiomatic for Pandas, so it's my go-to recommendation.
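Here is a small side-by-side check of the two options. The sample rows and the column names via_str_get and via_apply are assumptions made for the demo; only the fields.description.content column name comes from your question.

```python
import numpy as np
import pandas as pd

# Made-up sample rows: two dicts and one missing value
issues_df = pd.DataFrame({
    "fields.description.content": [
        {"text": "first issue"},
        np.nan,
        {"text": "third issue"},
    ]
})
col = issues_df["fields.description.content"]

# Option 1: str.get looks up the key in each dict and leaves NaN for missing rows
issues_df["via_str_get"] = col.str.get("text")

# Option 2: explicit type check before indexing
issues_df["via_apply"] = col.apply(
    lambda x: x["text"] if isinstance(x, dict) else pd.NA
)

print(issues_df[["via_str_get", "via_apply"]])
```

One difference to keep in mind: if a dict is present but has no 'text' key, x['text'] in the lambda raises a KeyError. Using x.get('text') inside the lambda avoids that.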
The question comes from Stack Exchange; it was asked by Junior P.




