What is Glue?
A cloud-optimized Extract, Transform and Load (ETL) service. Glue differs from other ETL tools in
three ways.
(1) Glue is serverless
- No need to provision, configure, manage and maintain servers for the ETL
processes/jobs
(2) Glue provides automatic schema inference through crawlers
- Crawlers automatically discover your data sets and file types, and define the schema
of both structured and semi-structured data sets.
(3) Glue provides auto-generation of ETL scripts
- Glue does the heavy lifting of coding so developers can focus on customizations.
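To make point (2) more concrete, the following is a minimal sketch of what schema inference over semi-structured data can look like. It is an illustration only and does not reproduce Glue's actual crawler or classifier logic; the function names, the type names and the widening rules are assumptions made for this example.

```python
import json

# Simplified illustration only: this is NOT Glue's actual classifier logic.
def infer_type(value):
    """Map a parsed JSON value to a coarse column type name."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "string"

def infer_schema(json_lines):
    """Scan JSON records and return {column: type}."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for column, value in record.items():
            if value is None:
                continue  # nulls carry no type information
            observed = infer_type(value)
            previous = schema.get(column)
            if previous is None or previous == observed:
                schema[column] = observed
            elif {previous, observed} == {"bigint", "double"}:
                schema[column] = "double"   # widen integers to floating point
            else:
                schema[column] = "string"   # incompatible types: fall back to string
    return schema

# Hypothetical sample records, one JSON document per line.
sample = [
    '{"id": 1, "name": "a", "price": 3}',
    '{"id": 2, "name": "b", "price": 4.5, "tags": ["x"]}',
]
print(infer_schema(sample))
```

The sketch scans every record and merges the observed types per column, which is why a column that is an integer in one record and a decimal in another ends up as a single wider type.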
Glue is used as the main integration tool for loading structured (OLTP/relational database)
data, semi-structured (Amazon S3/JSON) data, or both, into the data warehouse (Amazon
Redshift).
Glue Crawlers crawl the data, index it, and store the resulting information in the Data
Catalog. This makes the data available and ready for analysis by any of the analytics
services that Amazon currently offers, including the BI tools that work on top of those
services.
How to build an ETL Flow?
Crawl and Catalogue your data
The crawler automatically discovers the schema of your data source and its partitioning, and
adds fields for the partitions (in this case, it added year, month and day).
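As a rough illustration of how partition fields such as year, month and day can be derived from the storage layout, the sketch below parses key=value path segments out of an object key. The source does not say how the data is laid out, so the Hive-style `name=value` folder convention, the sample key and the function name are all assumptions for this example.

```python
def partition_columns(object_key):
    """Extract Hive-style name=value path segments from an object key."""
    partitions = {}
    for segment in object_key.split("/")[:-1]:   # last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions

# Hypothetical object key, assuming a year/month/day folder layout.
key = "logs/year=2017/month=11/day=02/part-0000.json"
print(partition_columns(key))
```

Each partition becomes an extra column alongside the columns found inside the files, which is what allows queries to filter by date without reading every file.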
The mapping step is where you can convert the data, cast columns into different data types,
change the order of the columns and map each source column to its target column.
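The mapping step described above can be sketched in plain Python. Each mapping is written as a (source column, source type, target column, target type) tuple, which resembles the form taken by Glue's ApplyMapping transform, but the code below is a standalone simulation and not the Glue API. The `CASTS` table, the column names and the `apply_mapping` helper are hypothetical.

```python
# Hypothetical casts for this sketch; a real ETL tool supports many more types.
CASTS = {"string": str, "int": int, "double": float}

def apply_mapping(record, mappings):
    """Rename, cast and reorder fields; output order follows the mappings list."""
    output = {}
    for source, _source_type, target, target_type in mappings:
        if source in record:
            output[target] = CASTS[target_type](record[source])
    return output

# (source column, source type, target column, target type)
mappings = [
    ("order_total", "string", "total", "double"),
    ("id", "string", "order_id", "int"),
]
row = {"id": "42", "order_total": "19.99", "ignored": "x"}
print(apply_mapping(row, mappings))
```

Columns that are not listed in the mappings are dropped, and the order of the output columns follows the order of the mapping list rather than the order in the source.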
Interactively Edit and Explore with Dev-Endpoints
A dev endpoint can also be connected to a notebook (e.g. Zeppelin) to interactively explore and
experiment with your data.
Once the ETL job is registered with the system, it can be started in several ways: by events,
by triggers, or on a schedule.
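As one possible way to set up the schedule case, the sketch below builds the arguments for a scheduled trigger and passes them to the `create_trigger` call of the boto3 Glue client. The trigger name, job name and cron expression are hypothetical, and the call itself requires the boto3 package and valid AWS credentials, so it is kept in a separate function that is not invoked here.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build keyword arguments for a scheduled Glue trigger."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

def create_trigger(request):
    """Send the request with boto3; needs the boto3 package and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3
    return boto3.client("glue").create_trigger(**request)

# Hypothetical names; the cron expression means 02:00 UTC every day.
request = scheduled_trigger_request("nightly-load", "orders-etl", "0 2 * * ? *")
print(request)
```

Separating the request builder from the network call keeps the schedule definition easy to inspect and test before anything is created in an AWS account.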
Benefits
No server maintenance; cost savings from eliminating over-provisioning or under-provisioning of
resources; support for data sources, including easy integration with Oracle and MS SQL data
sources; and AWS Lambda integration.
As an AWS product, it works well with other AWS services such as Amazon Athena, Amazon
Redshift Spectrum, and AWS Identity and Access Management.
Many of its automated and intelligent features (e.g. Crawlers, auto-generation of ETL code) help
the developer lay the foundation: the data sources and their structures, the schemas and
mappings, and the ETL flow. This gives developers more time to focus on customizations and on
the architecture of the whole ETL process, up to analysis and reporting.