{"id":51,"date":"2023-11-17T18:20:04","date_gmt":"2023-11-17T18:20:04","guid":{"rendered":"https:\/\/dataaiguru.com\/?p=51"},"modified":"2023-12-04T14:10:44","modified_gmt":"2023-12-04T14:10:44","slug":"reading-the-deltalake-format-table-in-aws-glue-job","status":"publish","type":"post","link":"https:\/\/dataaiguru.com\/index.php\/2023\/11\/17\/reading-the-deltalake-format-table-in-aws-glue-job\/","title":{"rendered":"Reading a Delta Lake Format Table in an AWS Glue Job"},"content":{"rendered":"<h2>Introduction to Delta Lake Format<\/h2>\n<p>Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, and data versioning to big data workloads. It is designed to address common data lake challenges such as data quality, data reliability, and data lifecycle management.<\/p>\n<h2>Reading Delta Lake Format in AWS Glue<\/h2>\n<p>AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It supports various data formats, including the Delta Lake format.<\/p>\n<p>To read a Delta Lake table in AWS Glue, follow these steps:<\/p>\n<ol>\n<li>Set up an AWS Glue job in AWS Glue Studio.<\/li>\n<li>Specify the Delta Lake format as the data source in the job configuration.<\/li>\n<li>Configure the job to read the Delta Lake format using the appropriate input options.<\/li>\n<li>Run the AWS Glue job to read the data from the Delta Lake table.<\/li>\n<\/ol>\n<h2>Step-by-Step Guide<\/h2>\n<p>1. Set up an AWS Glue job:<\/p>\n<p>Log in to the AWS Management Console and navigate to the AWS Glue service. Click on &#8216;ETL Jobs&#8217; in the navigation pane, click &#8216;Author code with a script editor&#8217;, choose &#8216;Spark&#8217; as the engine, and then click &#8216;Create script&#8217;.<\/p>\n<p>2. 
Specify the Delta Lake format as the data source:<\/p>\n<p>In the &#8216;Job details&#8217; section, provide a name for the job and select an &#8216;IAM Role&#8217;. Under &#8216;Job parameters&#8217;, add a key &#8216;--datalake-formats&#8217; with the value &#8216;delta&#8217;. Create another key named &#8216;--conf&#8217; and set it to the following value.<\/p>\n<pre>spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore<\/pre>\n<p>3. Configure the job script to read the Delta Lake table, either through the Glue libraries or with a plain Spark DataFrame:<br \/>\nUsing the Glue libraries:<\/p>\n<pre>import sys\nfrom awsglue.transforms import *\nfrom awsglue.utils import getResolvedOptions\nfrom pyspark.context import SparkContext\nfrom awsglue.context import GlueContext\nfrom awsglue.job import Job\n\n## @params: [JOB_NAME]\nargs = getResolvedOptions(sys.argv, ['JOB_NAME'])\n\nsc = SparkContext()\nglueContext = GlueContext(sc)\nspark = glueContext.spark_session\njob = Job(glueContext)\njob.init(args['JOB_NAME'], args)\n\n# Extra read options for the Delta source; empty here\nadditional_options = {}\n\ndf = glueContext.create_data_frame.from_catalog(\n    database=\"&lt;your_database_name&gt;\",\n    table_name=\"&lt;your_table_name&gt;\",\n    additional_options=additional_options\n)\ndf.show()\njob.commit()<\/pre>\n<p id=\"aws-glue-programming-etl-format-delta_lake-read-spark\">Using a Spark DataFrame:<\/p>\n<pre>deltaDF = spark.read.format(\"delta\").load(\"&lt;s3path for delta table&gt;\")\ndeltaDF.show()<\/pre>\n<p>4. Run the AWS Glue job:<\/p>\n<p>Click on &#8216;Save&#8217; to proceed to the &#8216;Job details&#8217; section. 
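<\/p>\n<p>(Optional) Because Delta Lake versions every commit in its transaction log, the same spark.read path from step 3 also supports time travel. A minimal sketch, assuming the table already has a committed version 0; the S3 path is the same placeholder as above:<\/p>\n<pre># Read an earlier snapshot of the table by version number (Delta time travel)\nhistoricalDF = (spark.read.format(\"delta\")\n    .option(\"versionAsOf\", 0)\n    .load(\"&lt;s3path for delta table&gt;\"))\nhistoricalDF.show()<\/pre>\n<p>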
Review and modify any necessary job parameters, such as the number of concurrent workers and the maximum capacity.<\/p>\n<p>Finally, click on &#8216;Run job&#8217; to start reading the data from the Delta Lake table in S3.<\/p>\n<h2>Conclusion<\/h2>\n<p>The Delta Lake format in AWS Glue provides a reliable and efficient way to read and process big data workloads. By following the step-by-step guide mentioned above, you can easily set up an AWS Glue job to read the Delta Lake format and leverage its benefits for your data analytics needs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Delta Lake format is a powerful data storage format that combines the reliability of ACID transactions, schema enforcement, and data versioning for big data workloads. It is designed to address common data lake challenges such as data quality, data reliability, and data lifecycle management. &#8230;<\/p>\n","protected":false},"author":8,"featured_media":76,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[4],"tags":[6,7,5],"class_list":["post-51","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","tag-aws-glue","tag-data-analytics","tag-delta-lake"],"_links":{"self":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts\/51","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/comments?post=51"}],"version-history":[{"count":3,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts\/51\/revisions"}],"predecessor-version":[{"id":79,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/posts\/51\/revisions\/79"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/media\/76"}],"wp:attachment":[{"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/media?parent=51"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/categories?post=51"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dataaiguru.com\/index.php\/wp-json\/wp\/v2\/tags?post=51"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}