
Get started with data integration from Amazon S3 to Amazon Redshift using AWS Glue interactive sessions


Organizations are placing a high priority on data integration, especially to support analytics, machine learning (ML), business intelligence (BI), and application development initiatives. Data is growing exponentially and is generated by increasingly diverse data sources. Data integration becomes challenging when processing data at scale, because of the inherent heavy lifting associated with the infrastructure required to manage it. This is one of the key reasons why organizations are constantly looking for easy-to-use and low-maintenance data integration solutions to move data from one location to another or to consolidate their business data from several sources into a centralized location for strategic business decisions.

Most organizations use Spark for their big data processing needs. If you're looking to simplify data integration, and don't want the hassle of spinning up servers, managing resources, or setting up Spark clusters, we have the solution for you.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue provides both visual and code-based interfaces to make data integration simple and accessible for everyone.

If you prefer a code-based experience and want to interactively author data integration jobs, we recommend interactive sessions. Interactive sessions is a recently launched AWS Glue feature that allows you to interactively develop AWS Glue processes, run and test each step, and view the results.

There are different options for using interactive sessions. You can create and work with interactive sessions through the AWS Command Line Interface (AWS CLI) and API. You can also use Jupyter-compatible notebooks to visually author and test your notebook scripts. Interactive sessions provide a Jupyter kernel that integrates almost anywhere that Jupyter does, including integrating with IDEs such as PyCharm, IntelliJ, and Visual Studio Code. This lets you author code in your local environment and run it seamlessly on the interactive session backend. You can also start a notebook through AWS Glue Studio; all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds. When the code is ready, you can configure, schedule, and monitor job notebooks as AWS Glue jobs.
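
To give a flavor of the API route, the following minimal boto3 sketch creates an interactive session and runs a single statement in it. The session name, Region, role ARN, and worker settings are illustrative assumptions, not values from this post.

import time
import boto3

# Placeholder values: adjust the session name, Region, and role ARN for your account
session_id = "my-interactive-session"
glue = boto3.client("glue", region_name="us-east-1")

glue.create_session(
    Id=session_id,
    Role="arn:aws:iam::<AWS account ID>:role/AWSGlueServiceRole-GlueIS",
    Command={"Name": "glueetl", "PythonVersion": "3"},
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=3,
    IdleTimeout=60,
)

# Wait for the session to become ready, then submit a statement to it
while glue.get_session(Id=session_id)["Session"]["Status"] != "READY":
    time.sleep(10)

statement = glue.run_statement(SessionId=session_id, Code="print(spark.version)")
print(glue.get_statement(SessionId=session_id, Id=statement["Id"]))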

If you haven't tried AWS Glue interactive sessions before, this post is highly recommended. We work through a simple scenario where you might need to incrementally load data from Amazon Simple Storage Service (Amazon S3) into Amazon Redshift, or transform and enrich your data before loading it into Amazon Redshift. In this post, we use interactive sessions within an AWS Glue Studio notebook to load the NYC Taxi dataset into an Amazon Redshift Serverless cluster, query the loaded dataset, save our Jupyter notebook as a job, and schedule it to run using a cron expression. Let's get started.

Solution overview

We walk you through the following steps:

  1. Set up an AWS Glue Jupyter notebook with interactive sessions.
  2. Use the notebook's magics, including the AWS Glue connection and bookmarks.
  3. Read data from Amazon S3, and transform and load it into Redshift Serverless.
  4. Save the notebook as an AWS Glue job and schedule it to run.

Prerequisites

For this walkthrough, we must complete the following prerequisites:

  1. Upload the Yellow Taxi Trip Records data and the taxi zone lookup table datasets into Amazon S3. Steps to do that are listed in the next section.
  2. Prepare the required AWS Identity and Access Management (IAM) policies and roles to work with AWS Glue Studio Jupyter notebooks, interactive sessions, and AWS Glue.
  3. Create the AWS Glue connection for Redshift Serverless.

Upload datasets into Amazon S3

Download the Yellow Taxi Trip Records data and the taxi zone lookup table data to your local environment. For this post, we download the January 2022 data for yellow taxi trip records in Parquet format. The taxi zone lookup data is in CSV format. You can also download the data dictionary for the trip record dataset.

  1. On the Amazon S3 console, create a bucket called my-first-aws-glue-is-project-<random number> in the us-east-1 Region to store the data. S3 bucket names must be unique across all AWS accounts in all Regions.
  2. Create folders nyc_yellow_taxi and taxi_zone_lookup in the bucket you just created and upload the files you downloaded.
    Your folder structures should look like the following screenshots. If you prefer to script the upload, see the sketch after this list.
    s3 yellow taxi data
    s3 lookup data
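
As a scripted alternative to the console upload, a minimal boto3 sketch like the following does the same thing. The local file names are assumptions (use the names of the files you downloaded), and the bucket name is the placeholder from step 1.

import boto3

s3 = boto3.client("s3")
bucket = "my-first-aws-glue-is-project-<random number>"  # the bucket created in step 1

# Upload the Parquet trip data and the CSV lookup file into their folders
s3.upload_file("yellow_tripdata_2022-01.parquet", bucket,
               "nyc_yellow_taxi/yellow_tripdata_2022-01.parquet")
s3.upload_file("taxi_zone_lookup.csv", bucket,
               "taxi_zone_lookup/taxi_zone_lookup.csv")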

Prepare IAM policies and role

Let's prepare the required IAM policies and role to work with AWS Glue Studio Jupyter notebooks and interactive sessions. To get started with notebooks in AWS Glue Studio, refer to Getting started with notebooks in AWS Glue Studio.

Create IAM policies for the AWS Glue notebook role

Create the policy AWSGlueInteractiveSessionPassRolePolicy with the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource":"arn:aws:iam::<AWS account ID>:role/AWSGlueServiceRole-GlueIS"
        }
    ]
}

This policy allows the AWS Glue notebook role to be passed to interactive sessions so that the same role can be used in both places. Note that AWSGlueServiceRole-GlueIS is the role that we create for the AWS Glue Studio Jupyter notebook in a later step. Next, create the policy AmazonS3Access-MyFirstGlueISProject with the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<your s3 bucket name>",
                "arn:aws:s3:::<your s3 bucket name>/*"
            ]
        }
    ]
}

This policy allows the AWS Glue notebook role to access data in the S3 bucket.

Create an IAM role for the AWS Glue notebook

Create a new AWS Glue role called AWSGlueServiceRole-GlueIS with the following policies attached to it:

  • The AWS managed policies required for AWS Glue and interactive sessions notebooks (for example, AWSGlueServiceRole and AwsGlueSessionUserRestrictedNotebookPolicy)
  • AWSGlueInteractiveSessionPassRolePolicy (created earlier)
  • AmazonS3Access-MyFirstGlueISProject (created earlier)
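
If you would rather script this step, the following is a minimal boto3 sketch. It assumes the trust principal is glue.amazonaws.com and that the managed policy ARNs match the names above; verify the exact policy ARNs in your account before running it.

import json
import boto3

iam = boto3.client("iam")
account_id = "<AWS account ID>"  # placeholder

# Trust policy so that AWS Glue can assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRole-GlueIS",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed and customer managed policies listed above
# (verify the exact managed policy ARNs in your account)
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AwsGlueSessionUserRestrictedNotebookPolicy",
    f"arn:aws:iam::{account_id}:policy/AWSGlueInteractiveSessionPassRolePolicy",
    f"arn:aws:iam::{account_id}:policy/AmazonS3Access-MyFirstGlueISProject",
]:
    iam.attach_role_policy(RoleName="AWSGlueServiceRole-GlueIS", PolicyArn=policy_arn)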

Create the AWS Glue connection for Redshift Serverless

Now we're ready to configure a Redshift Serverless security group to connect with AWS Glue components.

  1. On the Redshift Serverless console, open the workgroup you're using.
    You can find all the namespaces and workgroups on the Redshift Serverless dashboard.
  2. Under Data access, choose Network and security.
  3. Choose the link for the Redshift Serverless VPC security group.
    redshift serverless vpc security group
    You're redirected to the Amazon Elastic Compute Cloud (Amazon EC2) console.
  4. In the Redshift Serverless security group details, under Inbound rules, choose Edit inbound rules.
  5. Add a self-referencing rule to allow AWS Glue components to communicate:
    1. For Type, choose All TCP.
    2. For Protocol, choose TCP.
    3. For Port range, include all ports.
    4. For Source, use the same security group as the group ID.
      redshift inbound security group
  6. Similarly, add the following outbound rules (a scripted sketch of these rules follows this list):
    1. A self-referencing rule with Type as All TCP, Protocol as TCP, Port range including all ports, and Destination as the same security group as the group ID.
    2. An HTTPS rule for Amazon S3 access. The s3-prefix-list-id value is required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.
      redshift outbound security group
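
If you prefer to add these rules programmatically, a minimal boto3 sketch like the following creates the same inbound and outbound rules. The security group ID and the s3-prefix-list-id value are placeholders you must replace with the values from your account.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
sg_id = "sg-xxxxxxxxxxxxxxxxx"      # Redshift Serverless VPC security group (placeholder)
s3_prefix_list_id = "pl-xxxxxxxx"   # value from the Managed prefix lists page (placeholder)

# Self-referencing inbound rule: allow all TCP traffic from the same security group
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)

# Outbound rules: self-referencing all TCP, plus HTTPS to the S3 prefix list
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": sg_id}],
        },
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "PrefixListIds": [{"PrefixListId": s3_prefix_list_id}],
        },
    ],
)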

If you don't have an Amazon S3 VPC endpoint, you can create one on the Amazon Virtual Private Cloud (Amazon VPC) console.

s3 vpc endpoint

You can check the value for s3-prefix-list-id on the Managed prefix lists page on the Amazon VPC console.

s3 prefix list

Next, go to the Connectors page on AWS Glue Studio and create a new JDBC connection called redshiftServerless to your Redshift Serverless cluster (unless one already exists). You can find the Redshift Serverless endpoint details under your workgroup's General information section. The connection setting looks like the following screenshot.

redshift serverless connection page
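
You could also create this connection programmatically. The following minimal boto3 sketch shows the idea; the JDBC URL, credentials, subnet, security group, and Availability Zone are placeholders that you should replace with your Redshift Serverless endpoint details.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# All values below are placeholders copied from your workgroup's General information section
glue.create_connection(
    ConnectionInput={
        "Name": "redshiftServerless",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://<workgroup-endpoint>:5439/dev",
            "USERNAME": "<username>",
            "PASSWORD": "<password>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxxxxxxxxxxxxx"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)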

Write interactive code in an AWS Glue Studio Jupyter notebook powered by interactive sessions

Now you can get started with writing interactive code using an AWS Glue Studio Jupyter notebook powered by interactive sessions. Note that it's good practice to keep saving the notebook at regular intervals while you work through it.

  1. On the AWS Glue Studio console, create a new job.
  2. Select Jupyter Notebook and select Create a new notebook from scratch.
  3. Choose Create.
    glue interactive session create notebook
  4. For Job name, enter a name (for example, myFirstGlueISProject).
  5. For IAM Role, choose the role you created (AWSGlueServiceRole-GlueIS).
  6. Choose Start notebook job.
    glue interactive session notebook setup
    After the notebook is initialized, you can see some of the available magics and a cell with boilerplate code. To view all the magics of interactive sessions, run %help in a cell to print a full list. Excluding %%sql, running a cell of only magics doesn't start a session, but sets the configuration for the session that starts when you run your first cell of code.
    glue interactive session jupyter notebook initialization
    For this post, we configure AWS Glue with version 3.0, three G.1X workers, an idle timeout, and an Amazon Redshift connection with the help of the available magics.
  7. Let's enter the following magics into our first cell and run it:
    %glue_version 3.0
    %number_of_workers 3
    %worker_type G.1X
    %idle_timeout 60
    %connections redshiftServerless

    We get the following response:

    Welcome to the Glue Interactive Sessions Kernel
    For more information on available magic commands, please type %help in any new cell.

    Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
    Installed kernel version: 0.35
    Setting Glue version to: 3.0
    Previous number of workers: 5
    Setting new number of workers to: 3
    Previous worker type: G.1X
    Setting new worker type to: G.1X
    Current idle_timeout is 2880 minutes.
    idle_timeout has been set to 60 minutes.
    Connections to be included:
    redshiftServerless

  8. Let's run our first code cell (boilerplate code) to start an interactive notebook session within a few seconds:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
      
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)

    We get the following response:

    Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::xxxxxxxxxxxx:role/AWSGlueServiceRole-GlueIS
    Attempting to use existing AssumeRole session credentials.
    Trying to create a Glue session for the kernel.
    Worker Type: G.1X
    Number of Workers: 3
    Session ID: 7c9eadb1-9f9b-424f-9fba-d0abc57e610d
    Applying the following default arguments:
    --glue_kernel_version 0.35
    --enable-glue-datacatalog true
    --job-bookmark-option job-bookmark-enable
    Waiting for session 7c9eadb1-9f9b-424f-9fba-d0abc57e610d to get into ready status...
    Session 7c9eadb1-9f9b-424f-9fba-d0abc57e610d has been created

  9. Next, read the NYC yellow taxi data from the S3 bucket into an AWS Glue dynamic frame:
    nyc_taxi_trip_input_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type = "s3", 
        connection_options = {
            "paths": ["s3://<your-s3-bucket-name>/nyc_yellow_taxi/"]
        }, 
        format = "parquet",
        transformation_ctx = "nyc_taxi_trip_input_dyf"
    )

    Let's count the number of rows, and look at the schema and a few rows of the dataset.

  10. Count the rows with the following code:
    nyc_taxi_trip_input_df = nyc_taxi_trip_input_dyf.toDF()
    nyc_taxi_trip_input_df.count()

    We get the following response:

    2463931

  11. View the schema with the following code:
    nyc_taxi_trip_input_df.printSchema()

    We get the following response:

    root
     |-- VendorID: long (nullable = true)
     |-- tpep_pickup_datetime: timestamp (nullable = true)
     |-- tpep_dropoff_datetime: timestamp (nullable = true)
     |-- passenger_count: double (nullable = true)
     |-- trip_distance: double (nullable = true)
     |-- RatecodeID: double (nullable = true)
     |-- store_and_fwd_flag: string (nullable = true)
     |-- PULocationID: long (nullable = true)
     |-- DOLocationID: long (nullable = true)
     |-- payment_type: long (nullable = true)
     |-- fare_amount: double (nullable = true)
     |-- extra: double (nullable = true)
     |-- mta_tax: double (nullable = true)
     |-- tip_amount: double (nullable = true)
     |-- tolls_amount: double (nullable = true)
     |-- improvement_surcharge: double (nullable = true)
     |-- total_amount: double (nullable = true)
     |-- congestion_surcharge: double (nullable = true)
     |-- airport_fee: double (nullable = true)

  12. View a few rows of the dataset with the following code:
    nyc_taxi_trip_input_df.show(5)

    We get the following response:

    +--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
    |VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
    +--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
    |       2| 2022-01-18 15:04:43|  2022-01-18 15:12:51|            1.0|         1.13|       1.0|                 N|         141|         229|           2|        7.0|  0.0|    0.5|       0.0|         0.0|                  0.3|        10.3|                 2.5|        0.0|
    |       2| 2022-01-18 15:03:28|  2022-01-18 15:15:52|            2.0|         1.36|       1.0|                 N|         237|         142|           1|        9.5|  0.0|    0.5|      2.56|         0.0|                  0.3|       15.36|                 2.5|        0.0|
    |       1| 2022-01-06 17:49:22|  2022-01-06 17:57:03|            1.0|          1.1|       1.0|                 N|         161|         229|           2|        7.0|  3.5|    0.5|       0.0|         0.0|                  0.3|        11.3|                 2.5|        0.0|
    |       2| 2022-01-09 20:00:55|  2022-01-09 20:04:14|            1.0|         0.56|       1.0|                 N|         230|         230|           1|        4.5|  0.5|    0.5|      1.66|         0.0|                  0.3|        9.96|                 2.5|        0.0|
    |       2| 2022-01-24 16:16:53|  2022-01-24 16:31:36|            1.0|         2.02|       1.0|                 N|         163|         234|           1|       10.5|  1.0|    0.5|       3.7|         0.0|                  0.3|        18.5|                 2.5|        0.0|
    +--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
    only showing top 5 rows

  13. Now, read the taxi zone lookup data from the S3 bucket into an AWS Glue dynamic frame:
    nyc_taxi_zone_lookup_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type = "s3", 
        connection_options = {
            "paths": ["s3://<your-s3-bucket-name>/taxi_zone_lookup/"]
        }, 
        format = "csv",
        format_options= {
            'withHeader': True
        },
        transformation_ctx = "nyc_taxi_zone_lookup_dyf"
    )

    Let's count the number of rows, and look at the schema and a few rows of the dataset.

  14. Count the rows with the following code:
    nyc_taxi_zone_lookup_df = nyc_taxi_zone_lookup_dyf.toDF()
    nyc_taxi_zone_lookup_df.count()

    We get the following response:

    265

  15. View the schema with the following code:
    nyc_taxi_zone_lookup_df.printSchema()

    We get the following response:

    root
     |-- LocationID: string (nullable = true)
     |-- Borough: string (nullable = true)
     |-- Zone: string (nullable = true)
     |-- service_zone: string (nullable = true)

  16. View a few rows with the following code:
    nyc_taxi_zone_lookup_df.show(5)

    We get the following response:

    +----------+-------------+--------------------+------------+
    |LocationID|      Borough|                Zone|service_zone|
    +----------+-------------+--------------------+------------+
    |         1|          EWR|      Newark Airport|         EWR|
    |         2|       Queens|         Jamaica Bay|   Boro Zone|
    |         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
    |         4|    Manhattan|       Alphabet City| Yellow Zone|
    |         5|Staten Island|       Arden Heights|   Boro Zone|
    +----------+-------------+--------------------+------------+
    only showing top 5 rows

  17. Based on the data dictionary, let's recalibrate the data types of the attributes in both dynamic frames:
    nyc_taxi_trip_apply_mapping_dyf = ApplyMapping.apply(
        frame = nyc_taxi_trip_input_dyf, 
        mappings = [
            ("VendorID","Long","VendorID","Integer"), 
            ("tpep_pickup_datetime","Timestamp","tpep_pickup_datetime","Timestamp"), 
            ("tpep_dropoff_datetime","Timestamp","tpep_dropoff_datetime","Timestamp"), 
            ("passenger_count","Double","passenger_count","Integer"), 
            ("trip_distance","Double","trip_distance","Double"),
            ("RatecodeID","Double","RatecodeID","Integer"), 
            ("store_and_fwd_flag","String","store_and_fwd_flag","String"), 
            ("PULocationID","Long","PULocationID","Integer"), 
            ("DOLocationID","Long","DOLocationID","Integer"),
            ("payment_type","Long","payment_type","Integer"), 
            ("fare_amount","Double","fare_amount","Double"),
            ("extra","Double","extra","Double"), 
            ("mta_tax","Double","mta_tax","Double"),
            ("tip_amount","Double","tip_amount","Double"), 
            ("tolls_amount","Double","tolls_amount","Double"), 
            ("improvement_surcharge","Double","improvement_surcharge","Double"), 
            ("total_amount","Double","total_amount","Double"), 
            ("congestion_surcharge","Double","congestion_surcharge","Double"), 
            ("airport_fee","Double","airport_fee","Double")
        ],
        transformation_ctx = "nyc_taxi_trip_apply_mapping_dyf"
    )

    nyc_taxi_zone_lookup_apply_mapping_dyf = ApplyMapping.apply(
        frame = nyc_taxi_zone_lookup_dyf, 
        mappings = [ 
            ("LocationID","String","LocationID","Integer"), 
            ("Borough","String","Borough","String"), 
            ("Zone","String","Zone","String"), 
            ("service_zone","String", "service_zone","String")
        ],
        transformation_ctx = "nyc_taxi_zone_lookup_apply_mapping_dyf"
    )

  18. Now let's check their schemas:
    nyc_taxi_trip_apply_mapping_dyf.toDF().printSchema()

    We get the following response:

    root
     |-- VendorID: integer (nullable = true)
     |-- tpep_pickup_datetime: timestamp (nullable = true)
     |-- tpep_dropoff_datetime: timestamp (nullable = true)
     |-- passenger_count: integer (nullable = true)
     |-- trip_distance: double (nullable = true)
     |-- RatecodeID: integer (nullable = true)
     |-- store_and_fwd_flag: string (nullable = true)
     |-- PULocationID: integer (nullable = true)
     |-- DOLocationID: integer (nullable = true)
     |-- payment_type: integer (nullable = true)
     |-- fare_amount: double (nullable = true)
     |-- extra: double (nullable = true)
     |-- mta_tax: double (nullable = true)
     |-- tip_amount: double (nullable = true)
     |-- tolls_amount: double (nullable = true)
     |-- improvement_surcharge: double (nullable = true)
     |-- total_amount: double (nullable = true)
     |-- congestion_surcharge: double (nullable = true)
     |-- airport_fee: double (nullable = true)

    nyc_taxi_zone_lookup_apply_mapping_dyf.toDF().printSchema()

    We get the following response:

    root
     |-- LocationID: integer (nullable = true)
     |-- Borough: string (nullable = true)
     |-- Zone: string (nullable = true)
     |-- service_zone: string (nullable = true)

  19. Let's add the column trip_duration to the taxi trip dynamic frame to calculate the duration of each trip in minutes:
    # Function to calculate trip duration in minutes
    def trip_duration(start_timestamp,end_timestamp):
        minutes_diff = (end_timestamp - start_timestamp).total_seconds() / 60.0
        return(minutes_diff)

    # Transformation function for each record
    def transformRecord(rec):
        rec["trip_duration"] = trip_duration(rec["tpep_pickup_datetime"], rec["tpep_dropoff_datetime"])
        return rec
    nyc_taxi_trip_final_dyf = Map.apply(
        frame = nyc_taxi_trip_apply_mapping_dyf, 
        f = transformRecord, 
        transformation_ctx = "nyc_taxi_trip_final_dyf"
    )

    Let's count the number of rows, and look at the schema and a few rows of the dataset after applying the above transformation.

  20. Get a record count with the following code:
    nyc_taxi_trip_final_df = nyc_taxi_trip_final_dyf.toDF()
    nyc_taxi_trip_final_df.count()

    We get the following response:

    2463931

  21. View the schema with the following code:
    nyc_taxi_trip_final_df.printSchema()

    We get the following response:

    root
     |-- extra: double (nullable = true)
     |-- tpep_dropoff_datetime: timestamp (nullable = true)
     |-- trip_duration: double (nullable = true)
     |-- trip_distance: double (nullable = true)
     |-- mta_tax: double (nullable = true)
     |-- improvement_surcharge: double (nullable = true)
     |-- DOLocationID: integer (nullable = true)
     |-- congestion_surcharge: double (nullable = true)
     |-- total_amount: double (nullable = true)
     |-- airport_fee: double (nullable = true)
     |-- payment_type: integer (nullable = true)
     |-- fare_amount: double (nullable = true)
     |-- RatecodeID: integer (nullable = true)
     |-- tpep_pickup_datetime: timestamp (nullable = true)
     |-- VendorID: integer (nullable = true)
     |-- PULocationID: integer (nullable = true)
     |-- tip_amount: double (nullable = true)
     |-- tolls_amount: double (nullable = true)
     |-- store_and_fwd_flag: string (nullable = true)
     |-- passenger_count: integer (nullable = true)

  22. View a few rows with the following code:
    nyc_taxi_trip_final_df.show(5)

    We get the following response:

    +-----+---------------------+------------------+-------------+-------+---------------------+------------+--------------------+------------+-----------+------------+-----------+----------+--------------------+--------+------------+----------+------------+------------------+---------------+
    |extra|tpep_dropoff_datetime|     trip_duration|trip_distance|mta_tax|improvement_surcharge|DOLocationID|congestion_surcharge|total_amount|airport_fee|payment_type|fare_amount|RatecodeID|tpep_pickup_datetime|VendorID|PULocationID|tip_amount|tolls_amount|store_and_fwd_flag|passenger_count|
    +-----+---------------------+------------------+-------------+-------+---------------------+------------+--------------------+------------+-----------+------------+-----------+----------+--------------------+--------+------------+----------+------------+------------------+---------------+
    |  0.0|  2022-01-18 15:12:51| 8.133333333333333|         1.13|    0.5|                  0.3|         229|                 2.5|        10.3|        0.0|           2|        7.0|         1| 2022-01-18 15:04:43|       2|         141|       0.0|         0.0|                 N|              1|
    |  0.0|  2022-01-18 15:15:52|              12.4|         1.36|    0.5|                  0.3|         142|                 2.5|       15.36|        0.0|           1|        9.5|         1| 2022-01-18 15:03:28|       2|         237|      2.56|         0.0|                 N|              2|
    |  3.5|  2022-01-06 17:57:03| 7.683333333333334|          1.1|    0.5|                  0.3|         229|                 2.5|        11.3|        0.0|           2|        7.0|         1| 2022-01-06 17:49:22|       1|         161|       0.0|         0.0|                 N|              1|
    |  0.5|  2022-01-09 20:04:14| 3.316666666666667|         0.56|    0.5|                  0.3|         230|                 2.5|        9.96|        0.0|           1|        4.5|         1| 2022-01-09 20:00:55|       2|         230|      1.66|         0.0|                 N|              1|
    |  1.0|  2022-01-24 16:31:36|14.716666666666667|         2.02|    0.5|                  0.3|         234|                 2.5|        18.5|        0.0|           1|       10.5|         1| 2022-01-24 16:16:53|       2|         163|       3.7|         0.0|                 N|              1|
    +-----+---------------------+------------------+-------------+-------+---------------------+------------+--------------------+------------+-----------+------------+-----------+----------+--------------------+--------+------------+----------+------------+------------------+---------------+
    only showing top 5 rows

  23. Next, load both dynamic frames into our Amazon Redshift Serverless cluster:
    nyc_taxi_trip_sink_dyf = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame = nyc_taxi_trip_final_dyf, 
        catalog_connection = "redshiftServerless", 
        connection_options =  {"dbtable": "public.f_nyc_yellow_taxi_trip","database": "dev"}, 
        redshift_tmp_dir = "s3://aws-glue-assets-<AWS-account-ID>-us-east-1/momentary/", 
        transformation_ctx = "nyc_taxi_trip_sink_dyf"
    )

    nyc_taxi_zone_lookup_sink_dyf = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame = nyc_taxi_zone_lookup_apply_mapping_dyf, 
        catalog_connection = "redshiftServerless", 
        connection_options = {"dbtable": "public.d_nyc_taxi_zone_lookup", "database": "dev"}, 
        redshift_tmp_dir = "s3://aws-glue-assets-<AWS-account-ID>-us-east-1/momentary/", 
        transformation_ctx = "nyc_taxi_zone_lookup_sink_dyf"
    )

    Now let's validate the data loaded into the Amazon Redshift Serverless cluster by running a few queries in the Amazon Redshift query editor v2. You can also use your preferred query editor.

  24. First, we count the number of records and select a few rows in both target tables (f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup):
    SELECT 'f_nyc_yellow_taxi_trip' AS table_name, COUNT(1) FROM "public"."f_nyc_yellow_taxi_trip"
    UNION ALL
    SELECT 'd_nyc_taxi_zone_lookup' AS table_name, COUNT(1) FROM "public"."d_nyc_taxi_zone_lookup";

    redshift table record count query output

    The number of records in f_nyc_yellow_taxi_trip (2,463,931) and d_nyc_taxi_zone_lookup (265) matches the number of records in our input dynamic frames. This validates that all records from the files in Amazon S3 were successfully loaded into Amazon Redshift.

    You can view some of the records for each table with the following commands:

    SELECT * FROM public.f_nyc_yellow_taxi_trip LIMIT 10;

    redshift fact data select query

    SELECT * FROM public.d_nyc_taxi_zone_lookup LIMIT 10;

    redshift lookup data select query

  25. One of the insights we want to generate from the datasets is the top five routes with their total trip duration. Let's run the SQL for that on Amazon Redshift:
    SELECT 
        CASE WHEN putzl.zone >= dotzl.zone 
            THEN putzl.zone || ' - ' || dotzl.zone 
            ELSE  dotzl.zone || ' - ' || putzl.zone 
        END AS "Route",
        COUNT(1) AS "Frequency",
        ROUND(SUM(trip_duration),1) AS "Total Trip Duration (minutes)"
    FROM 
        public.f_nyc_yellow_taxi_trip ytt
    INNER JOIN 
        public.d_nyc_taxi_zone_lookup putzl ON ytt.pulocationid = putzl.locationid
    INNER JOIN 
        public.d_nyc_taxi_zone_lookup dotzl ON ytt.dolocationid = dotzl.locationid
    GROUP BY 
        "Route"
    ORDER BY 
        "Frequency" DESC, "Whole Journey Length (minutes)" DESC
    LIMIT 5;

    redshift top 5 route query

Transform the notebook into an AWS Glue job and schedule it

Now that we have authored the code and tested its functionality, let's save it as a job and schedule it.

Let's first enable job bookmarks. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process only the new data when rerunning on a scheduled interval.

  1. Add the following magic command after the first cell that contains the other magic commands initialized while authoring the code:
    %%configure
    {
        "--job-bookmark-option": "job-bookmark-enable"
    }

    To initialize job bookmarks, we run the following code with the name of the job as the default argument (myFirstGlueISProject for this post). Job bookmarks store the states for a job. You should always have job.init() at the beginning of the script and job.commit() at the end of the script. These two functions are used to initialize the bookmark service and update the state change to the service. Bookmarks won't work without calling them.

  2. Add the following piece of code after the boilerplate code:
    params = []
    if '--JOB_NAME' in sys.argv:
        params.append('JOB_NAME')
    args = getResolvedOptions(sys.argv, params)
    if 'JOB_NAME' in args:
        jobname = args['JOB_NAME']
    else:
        jobname = "myFirstGlueISProject"
    job.init(jobname, args)

  3. Then comment out all the lines of code that were authored to verify the desired outcome and aren't necessary for the job to deliver its purpose:
    #nyc_taxi_trip_input_df = nyc_taxi_trip_input_dyf.toDF()
    #nyc_taxi_trip_input_df.count()
    #nyc_taxi_trip_input_df.printSchema()
    #nyc_taxi_trip_input_df.show(5)
    
    #nyc_taxi_zone_lookup_df = nyc_taxi_zone_lookup_dyf.toDF()
    #nyc_taxi_zone_lookup_df.count()
    #nyc_taxi_zone_lookup_df.printSchema()
    #nyc_taxi_zone_lookup_df.show(5)
    
    #nyc_taxi_trip_apply_mapping_dyf.toDF().printSchema()
    #nyc_taxi_zone_lookup_apply_mapping_dyf.toDF().printSchema()
    
    #nyc_taxi_trip_final_df = nyc_taxi_trip_final_dyf.toDF()
    #nyc_taxi_trip_final_df.count()
    #nyc_taxi_trip_final_df.printSchema()
    #nyc_taxi_trip_final_df.show(5)

  4. Save the notebook.
    glue interactive session save job
    You can check the corresponding script on the Script tab.
    glue interactive session script tab
    Note that job.commit() is automatically added at the end of the script. Let's run the notebook as a job.
  5. First, truncate the f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup tables in Amazon Redshift using the query editor v2 so that we don't end up with duplicates in either table:
    truncate "public"."f_nyc_yellow_taxi_trip";
    truncate "public"."d_nyc_taxi_zone_lookup";

  6. Choose Run to run the job.
    glue interactive session run job
    You can check its status on the Runs tab.
    glue interactive session job run status
    The job completed in less than 5 minutes with three G.1X DPUs.
  7. Let's check the count of records in the f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup tables in Amazon Redshift:
    SELECT 'f_nyc_yellow_taxi_trip' AS table_name, COUNT(1) FROM "public"."f_nyc_yellow_taxi_trip"
    UNION ALL
    SELECT 'd_nyc_taxi_zone_lookup' AS table_name, COUNT(1) FROM "public"."d_nyc_taxi_zone_lookup";

    redshift count query output

    With job bookmarks enabled, even if you run the job again with no new files in the corresponding folders in the S3 bucket, it doesn't process the same data again. The following screenshot shows a subsequent job run in my environment, which completed in less than 2 minutes because there were no new files to process.

    glue interactive session job re-run

    Now let’s schedule the job.

  8. On the Schedules tab, choose Create schedule.
    glue interactive session create schedule
  9. For Name, enter a name (for example, myFirstGlueISProject-testSchedule).
  10. For Frequency, choose Custom.
  11. Enter a cron expression so the job runs every Monday at 6:00 AM (for example, 0 6 ? * MON * in the AWS cron format).
  12. Add an optional description.
  13. Choose Create schedule.
    glue interactive session add schedule

The schedule has been saved and activated. You can edit, pause, resume, or delete the schedule from the Actions menu.

glue interactive session schedule action

Clean up

To avoid incurring future charges, delete the AWS resources you created. A scripted sketch of part of this cleanup follows the list.

  • Delete the AWS Glue job (myFirstGlueISProject for this post).
  • Delete the Amazon S3 objects and bucket (my-first-aws-glue-is-project-<random number> for this post).
  • Delete the IAM policies and roles (AWSGlueInteractiveSessionPassRolePolicy, AmazonS3Access-MyFirstGlueISProject, and AWSGlueServiceRole-GlueIS).
  • Delete the Amazon Redshift tables (f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup).
  • Delete the AWS Glue JDBC connection (redshiftServerless).
  • Also delete the self-referencing Redshift Serverless security group rules and the Amazon S3 VPC endpoint (if you created it while following the steps for this post).
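
A minimal boto3 sketch for part of this cleanup (assuming the names used in this post; the IAM policies, roles, Redshift tables, and networking changes are left to the console) might look like the following:

import boto3

glue = boto3.client("glue", region_name="us-east-1")
s3 = boto3.resource("s3")

# Delete the AWS Glue job and the JDBC connection
glue.delete_job(JobName="myFirstGlueISProject")
glue.delete_connection(ConnectionName="redshiftServerless")

# Empty and delete the S3 bucket (name is a placeholder; assumes an unversioned bucket)
bucket = s3.Bucket("my-first-aws-glue-is-project-<random number>")
bucket.objects.all().delete()
bucket.delete()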

Conclusion

In this post, we demonstrated how to do the following:

  • Set up an AWS Glue Jupyter notebook with interactive sessions
  • Use the notebook's magics, including the AWS Glue connection and bookmarks
  • Read data from Amazon S3, and transform and load it into Amazon Redshift Serverless
  • Configure magics to enable job bookmarks, save the notebook as an AWS Glue job, and schedule it using a cron expression

The goal of this post is to give you the step-by-step fundamentals to get going with AWS Glue Studio Jupyter notebooks and interactive sessions. You can set up an AWS Glue Jupyter notebook in minutes, start an interactive session in seconds, and greatly improve the development experience with AWS Glue jobs. Interactive sessions have a 1-minute billing minimum with cost control features that reduce the cost of developing data preparation applications. You can build and test applications from the environment of your choice, even on your local environment, using the interactive sessions backend.

Interactive sessions provide a faster, cheaper, and more flexible way to build and run data preparation and analytics applications. To learn more about interactive sessions, refer to Job development (interactive sessions), and start exploring a whole new development experience with AWS Glue. Additionally, check out other posts on the AWS Big Data Blog that walk through more examples of using interactive sessions with different options.


About the Authors

Vikas Omer is a principal analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM), and data monetization, with over 13 years of experience in the industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.

Gal Heyne is a Product Manager for AWS Glue and has over 15 years of experience as a product manager, data engineer, and data architect. She is passionate about developing a deep understanding of customers' business needs and collaborating with engineers to design elegant, powerful, and easy-to-use data products. Gal has a Master's degree in Data Science from UC Berkeley, and she enjoys traveling, playing board games, and going to music concerts.
