Apache Hive字符串转时间戳及CSV时间字段转NULL问题求助
Hey there! Let's work through your Apache Hive timestamp conversion questions step by step:
Hive offers two straightforward ways to handle this, since your string uses the standard yyyy-MM-dd HH:mm:ss format that Hive recognizes by default:
- Using
CAST(): This is the simplest method for standard formats:SELECT CAST('2017-10-31 20:42:00' AS TIMESTAMP); - Using
to_timestamp(): This function is more flexible if you ever need to handle non-standard formats (you can explicitly define the pattern). For your case, both versions work:-- Omit the format pattern (Hive auto-detects standard formats) SELECT to_timestamp('2017-10-31 20:42:00'); -- Explicitly specify the format for clarity/consistency SELECT to_timestamp('2017-10-31 20:42:00', 'yyyy-MM-dd HH:mm:ss');
It sounds like the issue is likely related to how your CSV data was parsed during import (specifically those surrounding double quotes) or unhandled invalid data. Here's how to resolve it:
First, diagnose the root cause
If converting returns NULL, check if your string columns still contain double quotes (like "2017-10-23 11:07:00" instead of 2017-10-23 11:07:00). Hive can't parse quoted strings as timestamps directly. Run this query to verify:
SELECT start_time FROM your_table WHERE start_time LIKE '"%"';
Solutions to fix NULL values
Clean existing quoted strings before conversion
If your columns have leftover quotes, strip them first usingregexp_replace()before converting:SELECT to_timestamp(regexp_replace(start_time, '"', ''), 'yyyy-MM-dd HH:mm:ss') AS start_time_ts, to_timestamp(regexp_replace(end_time, '"', ''), 'yyyy-MM-dd HH:mm:ss') AS end_time_ts, to_timestamp(regexp_replace(last_updated, '"', ''), 'yyyy-MM-dd HH:mm:ss') AS last_updated_ts FROM your_table;Import CSV correctly with quote handling (for future imports)
To avoid this issue entirely when importing, use Hive's OpenCSVSerde to automatically parse quoted values. Create your table like this before uploading the CSV:CREATE TABLE your_table ( start_time STRING, end_time STRING, last_updated STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"", "escapeChar" = "\\" ) STORED AS TEXTFILE;This will strip the double quotes during import, so your string columns will have clean timestamp values ready for conversion.
Create a new table with Timestamp types (permanent solution)
If you want to permanently switch the columns to Timestamp, create a new table with the correct types and insert converted data:-- Create the new table with Timestamp columns CREATE TABLE your_table_clean ( start_time TIMESTAMP, end_time TIMESTAMP, last_updated TIMESTAMP ) STORED AS TEXTFILE; -- Match your original table's storage format (ORC, Parquet, etc.) -- Insert converted data from the old table INSERT INTO your_table_clean SELECT to_timestamp(regexp_replace(start_time, '"', ''), 'yyyy-MM-dd HH:mm:ss'), to_timestamp(regexp_replace(end_time, '"', ''), 'yyyy-MM-dd HH:mm:ss'), to_timestamp(regexp_replace(last_updated, '"', ''), 'yyyy-MM-dd HH:mm:ss') FROM your_table;Check for invalid timestamp values
If you still get NULLs after cleaning quotes, check for invalid dates/times (like February 30th, or times like 25:00:00) with this query:SELECT start_time FROM your_table WHERE start_time NOT RLIKE '^"\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"$';Fix any invalid entries before converting.
内容的提问来源于stack exchange,提问作者Shantesh




