
Announcing AWS Glue crawler support for Snowflake


For data lake customers who need to discover petabytes of data, AWS Glue crawlers are a popular way to scan data in the background, so you can focus on using the data to make better informed decisions. You may also have data in data warehouses such as Snowflake and want the ability to discover the data in the warehouse and combine it with data from data lakes to derive insights. AWS Glue crawlers now support Snowflake, making it easier for you to understand updates to Snowflake schemas and extract meaningful insights.

To crawl a Snowflake database, you can create and schedule an AWS Glue crawler with a JDBC URL and credential information from AWS Secrets Manager. A configuration option lets you specify whether you want the crawler to crawl the entire database or limit the tables by providing an include path with the schema or table path, along with exclude patterns to reduce crawl time. With each run, the crawler inspects and catalogs information, such as updates or deletes to Snowflake tables, external tables, views, and materialized views, in the AWS Glue Data Catalog. For Snowflake columns with non-Hive-compatible types, such as geography or geometry, the crawler extracts that information as a raw data type and makes it available in the Data Catalog.
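If you prefer to define the crawler programmatically, the following is a minimal boto3 sketch of the configuration this post builds in the console. The crawler, role, connection, and database names match the walkthrough later in this post; the Region, exclude pattern, and cron schedule are illustrative assumptions, not part of the original setup.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# The connection (created in a later section) stores the Snowflake JDBC URL
# and references an AWS Secrets Manager secret for credentials.
glue.create_crawler(
    Name="glue-blog-snowflake-crawler",
    Role="AWSGlueServiceRole-DefaultRole",
    DatabaseName="gluesnowdb",
    TablePrefix="blog_",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "SA-snowflake-connection",
                # Limit the crawl to one schema; a broader path crawls
                # the entire database.
                "Path": "OSM_NEWYORK/NEW_YORK/%",
                # Hypothetical exclude pattern to reduce crawl time.
                "Exclusions": ["OSM_NEWYORK/NEW_YORK/V_OSM_NY_WATERWAY*"],
                # Harvest column comments and raw source types (for example,
                # geography or geometry) into the Data Catalog.
                "EnableAdditionalMetadata": ["COMMENTS", "RAWTYPES"],
            }
        ]
    },
    # Omit Schedule to run on demand, as this post does; a cron expression
    # runs the crawler on a schedule instead.
    Schedule="cron(0 2 * * ? *)",
)
```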

In this post, we set up an AWS Glue crawler to crawl the OpenStreetMap geospatial dataset, which is freely available through Snowflake Marketplace. This dataset contains all of the OpenStreetMap location data for New York. OpenStreetMap maintains data about businesses, roads, trails, cafes, railway stations, and much more, from all over the world.

Overview of solution

Snowflake is a cloud data platform that provides data solutions from data warehousing to data science. Snowflake Computing is an AWS Advanced Technology Partner with AWS Competencies in Data & Analytics, Machine Learning, and Retail, as well as an AWS service validation for AWS PrivateLink.

In this solution, we use a sample use case involving points of interest in New York City, based on the following Snowflake quickstart. Follow sections 1 and 2 to get access to sample geospatial data from Snowflake Marketplace. We show how to interpret the geography data type and understand the different formats. We use the AWS Glue crawler to crawl this OpenStreetMap geospatial dataset and make it available in the Data Catalog with the geography data type maintained where appropriate.

Prerequisites

To follow along, you need the following:

  • An AWS account.
  • An AWS Identity and Access Management (IAM) user with access to the AWS services used in this post.
  • An IAM role with access to run AWS Glue crawlers.
  • If the AWS account you use to follow this post uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.
  • A Snowflake Enterprise Edition account with permission to create storage integrations, preferably in the AWS us-east-1 Region or the closest available trial Region, such as us-east-2. If necessary, you can subscribe to a Snowflake trial account on AWS Marketplace.
    • On the Marketplace listing page, choose Continue to Subscribe, and then choose Accept Terms. You're redirected to the Snowflake website to start using the software. To complete your registration, choose Set Up Your Account.
    • If you're new to Snowflake, consider completing the Snowflake in 20 Minutes tutorial. By the end of the tutorial, you should know how to create the required Snowflake objects, including warehouses, databases, and tables for storing and querying data.
  • A Snowflake worksheet (query editor) and associated access to a Snowflake virtual warehouse (compute) and database (storage).
  • Access to an existing Snowflake account with the ACCOUNTADMIN role or the IMPORT SHARE privilege.

Create an AWS Glue connection to Snowflake

For this post, an AWS Glue connection to your Snowflake cluster is necessary. For more details about how to create it, follow the steps in Performing data transformations using Snowflake and AWS Glue. The following screenshot shows the configuration used to create the connection to the Snowflake cluster for this post.
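If you want to script this step instead, the following is a minimal boto3 sketch of a comparable connection. The Snowflake account locator, warehouse, and Secrets Manager secret name are placeholders; the secret is assumed to hold the Snowflake user name and password.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

glue.create_connection(
    ConnectionInput={
        "Name": "SA-snowflake-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Placeholder Snowflake account locator and warehouse.
            "JDBC_CONNECTION_URL": (
                "jdbc:snowflake://myaccount.snowflakecomputing.com/"
                "?warehouse=COMPUTE_WH"
            ),
            # Reference an AWS Secrets Manager secret rather than embedding
            # credentials in the connection itself.
            "SECRET_ID": "snowflake/glue-crawler-credentials",
        },
    }
)
```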

Create an AWS Glue crawler

To create your crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter a name (for example, glue-blog-snowflake-crawler).
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, select Not yet.
  6. In the Data sources section, choose Add a data source.

For this post, you use a JDBC dataset as a source.

  7. For Data source, choose JDBC.
  8. For Connection, choose the connection that you created earlier (for this post, SA-snowflake-connection).
  9. For Include path, enter the path to the Snowflake database you created as a prerequisite (OSM_NEWYORK/NEW_YORK/%).
  10. For Additional metadata, choose COMMENTS and RAWTYPES.

This allows the crawler to harvest metadata related to comments and raw types, such as geospatial columns.

  11. Choose Add a JDBC data source.
  12. Choose Next.
  13. For Existing IAM role, choose the role you created as a prerequisite (for this post, we use AWSGlueServiceRole-DefaultRole).
  14. Choose Next.

Now let’s create an AWS Glue database.

  15. Under Target database, choose Add database.
  16. For Name, enter gluesnowdb.
  17. Choose Create database.
  18. On the Set output and scheduling page, for Target database, choose the database you just created (gluesnowdb).
  19. For Table name prefix, enter blog_.
  20. For Frequency, choose On demand.
  21. Choose Next.
  22. Review the configuration and choose Create crawler.

Run the AWS Glue crawler

To run the crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose the crawler you created.
  3. Choose Run crawler.

On the Crawler runs tab, you can see the current run of the crawler.

  4. Wait until the crawler run is complete.

As shown in the following screenshot, 27 tables were added.
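If you want to trigger and monitor the run programmatically, a small boto3 sketch like the following starts the crawler, waits for it to return to the READY state, and reports the table counts from the last run. The Region is an assumption.

```python
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

glue.start_crawler(Name="glue-blog-snowflake-crawler")

# Poll until the run completes; the crawler state returns to READY.
while glue.get_crawler(Name="glue-blog-snowflake-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

metrics = glue.get_crawler_metrics(
    CrawlerNameList=["glue-blog-snowflake-crawler"]
)["CrawlerMetricsList"][0]
print(f"Tables created: {metrics['TablesCreated']}, updated: {metrics['TablesUpdated']}")
```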

Now let's see how these tables look in the AWS Glue Data Catalog.

Explore the AWS Glue tables

Let's explore the tables created by the crawler.

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Search for and choose the gluesnowdb database.

Now you can see the list of tables created by the crawler.

  3. Choose the blog_osm_newyork_new_york_v_osm_ny_amenity table.

In the Schema section, you can see that the raw type was also harvested from the source Snowflake database.

  4. Choose the Advanced properties tab.
  5. In the Table properties section, you can see that the classification is snowflake and the typeOfData is view.
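You can read the same properties back from the Data Catalog with boto3. The following sketch prints the table-level properties and each column's type; the per-column parameter key used for the raw Snowflake type (rawtype below) is an assumption, so inspect the returned parameters in your own account.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

table = glue.get_table(
    DatabaseName="gluesnowdb",
    Name="blog_osm_newyork_new_york_v_osm_ny_amenity",
)["Table"]

# Table-level properties harvested by the crawler.
print(table["Parameters"].get("classification"))  # expected: snowflake
print(table["Parameters"].get("typeOfData"))      # expected: view

# Per-column details, including the raw source type where the Snowflake
# type is not Hive-compatible (for example, GEOGRAPHY).
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"], col.get("Parameters", {}).get("rawtype"))
```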

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.
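The Glue-side cleanup can also be scripted. The following boto3 sketch deletes the crawler and database created in this post; the CloudFormation stack and S3 bucket names are placeholders, since this post does not create them explicitly.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# Deleting the database also removes the tables the crawler created in it.
glue.delete_crawler(Name="glue-blog-snowflake-crawler")
glue.delete_database(Name="gluesnowdb")

# Placeholder stack and bucket names; skip these calls if you did not
# create a stack or bucket for this walkthrough.
boto3.client("cloudformation").delete_stack(StackName="glue-snowflake-blog")
bucket = boto3.resource("s3").Bucket("my-glue-snowflake-bucket")
bucket.objects.all().delete()
bucket.delete()
```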

Conclusion

AWS Glue crawlers now support Snowflake tables, views, and materialized views, offering more options to integrate Snowflake databases into your AWS Glue Data Catalog. You can use AWS Glue crawlers to discover Snowflake datasets, extract schema information, and populate the Data Catalog.

In this post, we provided a procedure to set up AWS Glue crawlers to discover Snowflake tables, which reduces the time and cost needed to incrementally process Snowflake table data updates in the Data Catalog. To learn more about this feature, refer to the documentation.

Special thanks to everyone who contributed to this crawler feature launch: Theo Xu, Hunny Vankawala, and Jessica Cheng.

Happy crawling!

Attribution

OpenStreetMap data by the OpenStreetMap Foundation is licensed under the Open Data Commons Open Database License (ODbL).


About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
