{"id":64,"date":"2023-12-05T08:35:07","date_gmt":"2023-12-05T08:35:07","guid":{"rendered":"https:\/\/dataaiguru.com\/?p=64"},"modified":"2024-07-02T23:17:29","modified_gmt":"2024-07-02T17:47:29","slug":"how-to-write-deltalake-format-table-from-aws-glue","status":"publish","type":"post","link":"https:\/\/dataaiguru.com\/index.php\/2023\/12\/05\/how-to-write-deltalake-format-table-from-aws-glue\/","title":{"rendered":"How to write deltalake format table from AWS Glue"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"64\" class=\"elementor elementor-64\">\n\t\t\t\t<div class=\"elementor-element elementor-element-4105ea3 e-flex e-con-boxed wpr-particle-no wpr-jarallax-no wpr-parallax-no wpr-sticky-section-no wpr-equal-height-no e-con e-parent\" data-id=\"4105ea3\" data-element_type=\"container\" data-e-type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-6daf5b0e elementor-widget elementor-widget-text-editor\" data-id=\"6daf5b0e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<h2>Introduction<\/h2>\n<p>DeltaLake is a powerful open-source storage layer that brings reliability and performance to data lakes. It provides ACID transactions, schema enforcement, and time travel capabilities, making it ideal for building scalable and reliable data pipelines.<\/p>\n<h2>Getting Started with AWS Glue<\/h2>\n<p>AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. 
It provides a serverless Apache Spark environment for running ETL jobs, supports a wide range of data sources and destinations, and, from Glue 3.0 onwards, ships with built-in support for open table formats such as Delta Lake.<\/p>\n<h2>Writing to a DeltaLake Format Table<\/h2>\n<p>To write data into a DeltaLake format table from AWS Glue, follow these steps:<\/p>\n<ol>\n<li>Enable Delta Lake support for the job by setting the <code>--datalake-formats<\/code> job parameter to <code>delta<\/code> (available on Glue 3.0 and later).<\/li>\n<li>Create a GlueContext object to connect to your data source and destination.<\/li>\n<li>Read the source data through the GlueContext into a DynamicFrame.<\/li>\n<li>Apply any transformations or filters to the DynamicFrame as needed.<\/li>\n<li>Convert the DynamicFrame to a Spark DataFrame with <code>toDF()<\/code> and write it out with the Spark DataFrame writer, specifying <code>format('delta')<\/code>, the target S3 path, and any save-mode or partitioning options.<\/li>\n<li>Run the AWS Glue job to execute the write.<\/li>\n<\/ol>\n<h2>Example Code<\/h2>\n<pre><code>from awsglue.context import GlueContext\nfrom pyspark.context import SparkContext\n\nsc = SparkContext()\nglueContext = GlueContext(sc)\n\n# Read the source data from the Glue Data Catalog into a DynamicFrame\nsource_dyf = glueContext.create_dynamic_frame.from_catalog(\n    database='my_database',\n    table_name='my_source_table'\n)\n\n# Apply transformations (apply_transformations is a placeholder for\n# your own business logic)\ntransformed_dyf = apply_transformations(source_dyf)\n\n# Convert to a Spark DataFrame and write it out in Delta format.\n# This assumes the job was started with --datalake-formats delta.\ntransformed_dyf.toDF().write \\\n    .format('delta') \\\n    .mode('append') \\\n    .save('s3:\/\/my_bucket\/my_table')\n<\/code><\/pre>\n<h2>Conclusion<\/h2>\n<p>Writing data into DeltaLake format tables from AWS Glue jobs is straightforward and lets you take advantage of the reliability and performance benefits that DeltaLake offers. 
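<\/p>\n<p>One caveat worth calling out: Delta Lake support is not loaded into a Glue job by default. On Glue 3.0 and later, the job parameters below enable it; the <code>--conf<\/code> values shown are the Spark session options AWS pairs with the Delta framework, but treat the exact strings as something to verify against your Glue version:<\/p>\n<pre><code>--datalake-formats  delta\n--conf  spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog\n<\/code><\/pre>\n<p>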
By following the steps outlined in this blog post and using the example code provided, you can easily incorporate DeltaLake into your data pipelines and ensure the integrity and scalability of your data lake.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>DeltaLake is an open-source storage layer that brings ACID transactions, schema enforcement, and time travel to data lakes. This post walks through writing a DeltaLake format table from an AWS Glue ETL job, with example code, so you can ensure the integrity and scalability of your data 
lake.<\/p>\n","protected":false},"author":8,"featured_media":80,"comment_status":"open","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[4],"tags":[6,18,17],"class_list":["post-64","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","tag-aws-glue","tag-data-pipelines","tag-deltalake"],"_links":{"self":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts\/64","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/comments?post=64"}],"version-history":[{"count":21,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts\/64\/revisions"}],"predecessor-version":[{"id":565,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts\/64\/revisions\/565"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/media\/80"}],"wp:attachment":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/media?parent=64"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/categories?post=64"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/tags?post=64"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}