AWS Glue is a serverless ETL service for processing large datasets from various sources for analytics and data processing. Recently I came across a "CSV data source does not support map data type" error in a newly created Glue job.

In a nutshell, the job was performing the steps below:

1. Read the data from S3 using create_dynamic_frame_from_options.
2. Apply some basic transformations.
3. Write the transformed data to Amazon Redshift using write_dynamic_frame_from_jdbc_conf.

And it was during this write step that the Glue job was failing. Let's look into it in a little more detail.

The PySpark script started by reading the data from the source S3 location into a Glue DynamicFrame:

```python
datasource0 = glueContext.create_dynamic_frame_from_options(...)  # options truncated in the original post
```

As part of the script I was performing some basic transformations. For example, the created_at and last_updated columns with long datatype were converted to timestamp datatype:

```python
withColumn('created_at', F.from_unixtime(F.col('created_at')/1000).cast(T.TimestampType()))
withColumn('last_updated', F.from_unixtime(F.col('last_updated')/1000).cast(T.TimestampType()))
```

The final step in the script was to convert the Spark dataframe into a Glue DynamicFrame and write it to the Amazon Redshift database using the write_dynamic_frame_from_jdbc_conf method of the glueContext class:

```python
datasink = glueContext.write_dynamic_frame_from_jdbc_conf(...)  # arguments truncated in the original post
```

The AWS Glue job was failing with the 'CSV data source does not support map data type.' error during this last step, while writing the DynamicFrame to the database.

To debug the issue, look at the schema of the data: you will notice that the columns rel_metadata and mod_metadata are of map datatype.

Now, the way the AWS Glue service internally handles the write_dynamic_frame_from_jdbc_conf method for Redshift is to write the Glue DynamicFrame data into multiple CSV files and create a manifest file in the S3 location specified by the temporary directory parameter of the Glue job. These CSV files are then loaded into the Redshift database via the COPY command using the manifest file. And since CSV does not support the map data type, the Glue job was failing.

The way I handled this issue was with the to_json PySpark function, which converts a map column into a JSON string:

```python
# Convert the map datatype to a JSON string
withColumn("mod_metadata", F.to_json(df1.mod_metadata))
withColumn("rel_metadata", F.to_json(df1.rel_metadata))
```

With this change the Glue job execution was successful and the data was loaded into Amazon Redshift.

To query the map columns (now converted into JSON strings) you can use the JSON_EXTRACT_PATH_TEXT function in the Redshift database:

```sql
select JSON_EXTRACT_PATH_TEXT(mod_metadata, 'QCN') as QCN from ...
```
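The two transformations above (epoch milliseconds to timestamp, map column to JSON string) can be mimicked in plain Python to see what ends up in Redshift. This is only an illustrative analogue of the PySpark calls, not the Glue job itself; the sample row and the "QCN" value are made up for the sketch:

```python
import json
from datetime import datetime, timezone

# A sample row shaped like the Glue data: epoch milliseconds plus a map column.
row = {
    "created_at": 1700000000000,          # long (epoch millis)
    "mod_metadata": {"QCN": "ABC-123"},   # map<string,string>
}

# F.from_unixtime(F.col('created_at')/1000) divides millis by 1000
# before interpreting the value as seconds.
created_at = datetime.fromtimestamp(row["created_at"] / 1000, tz=timezone.utc)

# F.to_json(...) serializes the map column into a JSON string,
# which CSV (and hence the Redshift COPY step) can carry safely.
mod_metadata_json = json.dumps(row["mod_metadata"])

# JSON_EXTRACT_PATH_TEXT(mod_metadata, 'QCN') in Redshift would then
# pull the value back out of that JSON string.
qcn = json.loads(mod_metadata_json)["QCN"]

print(created_at.isoformat())  # 2023-11-14T22:13:20+00:00
print(mod_metadata_json)       # {"QCN": "ABC-123"}
print(qcn)                     # ABC-123
```

The key point is the division by 1000: from_unixtime expects seconds, while the source columns store milliseconds, so skipping it would yield timestamps tens of thousands of years in the future.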