[DOCS] Update catalog docs to show automatic catalog syncs to Snowflake and Glue #549

Open
wants to merge 3 commits into
base: main
50 changes: 43 additions & 7 deletions website/docs/glue-catalog.md
@@ -99,6 +99,7 @@ From your terminal, create a glue database.
aws glue create-database --database-input "{\"Name\":\"xtable_synced_db\"}"
```

#### Method 1: Using Glue Crawler
From your terminal, create a Glue crawler. Replace `<yourAccountId>`, `<yourRoleName>`,
and `<path/to/your/data>` with appropriate values.

@@ -149,6 +150,47 @@ From your terminal, run the glue crawler.
Once the crawler succeeds, you’ll be able to query this Iceberg table from Athena,
EMR and/or Redshift query engines.


#### Method 2: Using XTable APIs to sync with AWS Glue Data Catalog directly
This applies to the Iceberg target format only.

**Pre-requisites:**
* Download iceberg-aws-X.X.X.jar from the [Maven repository](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-aws)
* Download bundle-X.X.X.jar from the [Maven repository](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle)
Contributor

Download AWS Java SDK bundle-X.X.X.jar ?

Contributor

This is unclear from docs.
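
As a rough sketch of the download step, the jars listed in the pre-requisites can be fetched from Maven Central, which lays artifacts out as `<group path>/<artifact>/<version>/<artifact>-<version>.jar`. The helper name `maven_jar_url` is hypothetical, and the versions are the ones used in this guide's sample command; substitute whichever versions match your setup.

```shell
# Hypothetical helper: build the Maven Central URL for a jar.
# Arguments: groupId, artifactId, version.
maven_jar_url() {
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

# Example usage (requires network access):
# curl -fLO "$(maven_jar_url org.apache.iceberg iceberg-aws 1.3.1)"
# curl -fLO "$(maven_jar_url software.amazon.awssdk bundle 2.23.9)"
```

The second URL is the AWS SDK for Java v2 bundle, which is what `bundle-X.X.X.jar` refers to.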


Create a `glue-sync-config.yaml` file:

```yaml md title="yaml"
sourceFormat: HUDI|DELTA # choose only one
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: table_name
    partitionSpec: partitionpath:VALUE
    namespace: xtable_synced_db
```

Create a `glue-sync-catalog.yaml` file:

```yaml md title="yaml"
catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: <catalog_name>
catalogOptions:
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
  warehouse: s3://path/to/source
```

Sample command to sync the table with Glue Data Catalog:

```shell md title="shell"
java -cp /path/to/xtable-utilities-0.2.0-SNAPSHOT-bundled.jar:/path/to/iceberg-aws-1.3.1.jar:/path/to/bundle-2.23.9.jar \
  org.apache.xtable.utilities.RunSync \
  --datasetConfig glue-sync-config.yaml \
  --icebergCatalogConfig glue-sync-catalog.yaml
```
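
The classpath in the command above is a colon-separated list of jars. A small sketch of assembling it, assuming the three jars sit in one directory (the `join_classpath` helper and the `JAR_DIR` variable are illustrative and not part of XTable):

```shell
# Hypothetical helper: join its arguments with ':' to form a Java classpath.
join_classpath() {
  (IFS=:; printf '%s\n' "$*")
}

JAR_DIR=/path/to/jars  # assumed location of the downloaded jars
CP=$(join_classpath \
  "$JAR_DIR/xtable-utilities-0.2.0-SNAPSHOT-bundled.jar" \
  "$JAR_DIR/iceberg-aws-1.3.1.jar" \
  "$JAR_DIR/bundle-2.23.9.jar")

# The sync command then becomes:
# java -cp "$CP" org.apache.xtable.utilities.RunSync \
#   --datasetConfig glue-sync-config.yaml \
#   --icebergCatalogConfig glue-sync-catalog.yaml
```

Keeping the jar list in one place makes it easier to swap versions without retyping the whole command.
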
### Validating the results
Once the sync completes (or, if you used the Glue Crawler method, once the crawler succeeds), you can inspect the catalogued tables in Glue
and query the table in Amazon Athena as shown below:

<Tabs
groupId="table-format"
defaultValue="hudi"
@@ -169,20 +211,14 @@ supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#
</TabItem>
<TabItem value="delta">

### Validating the results
After the crawler runs successfully, you can inspect the catalogued tables in Glue
and also query the table in Amazon Athena like below:

```sql
SELECT * FROM xtable_synced_db.<table_name>;
```

</TabItem>
<TabItem value="iceberg">

### Validating the results
After the crawler runs successfully, you can inspect the catalogued tables in Glue
and also query the table in Amazon Athena like below:


```sql
SELECT * FROM xtable_synced_db.<table_name>;
50 changes: 43 additions & 7 deletions website/docs/snowflake.md
@@ -8,11 +8,6 @@ title: "Snowflake"
Currently, Snowflake supports [Iceberg tables through External Tables](https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/)
and also [Native Iceberg Tables](https://www.snowflake.com/blog/iceberg-tables-powering-open-standards-with-snowflake-innovations/).

:::note NOTE:
Iceberg on Snowflake is currently supported in
[public preview](https://www.snowflake.com/blog/build-open-data-lakehouse-iceberg-tables/)
:::

## Steps:
These are high-level steps to help you integrate Apache XTable™ (Incubating) synced Iceberg tables on Snowflake. For additional information,
refer to the [Getting started with Iceberg tables](https://docs.snowflake.com/LIMITEDACCESS/iceberg-2023/tables-iceberg-getting-started).
@@ -47,7 +42,7 @@ TABLE_FORMAT=ICEBERG
ENABLED=TRUE;
```

### Create an Iceberg table from Iceberg metadata in object storage
### Method 1: Create an Iceberg table from Iceberg metadata in object storage
Refer to additional [examples](https://docs.snowflake.com/LIMITEDACCESS/iceberg-2023/create-iceberg-table#examples)
in the Snowflake Create Iceberg Table guide for more information.

@@ -58,4 +53,45 @@ CATALOG=<catalog_name>
METADATA_FILE_PATH='path/to/metadata/<VERSION>.metadata.json';
```

Once the table creation succeeds you can start using the Iceberg table as any other table in Snowflake.
Once the table creation succeeds you can start using the Iceberg table as any other table in Snowflake.

### Method 2: Using XTable APIs to sync with Snowflake Catalog directly

#### Pre-requisites:

* Build Apache XTable™ (Incubating) from [source](https://github.com/apache/incubator-xtable)
* Download `iceberg-aws-X.X.X.jar` from the [Maven repository](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-aws)
Contributor

[Clarification] Are AWS libraries required?

Contributor Author

Would you suggest keeping it cloud agnostic? I have only tried with AWS S3 for Snowflake. I'm not even sure what libraries would be needed for GCP and Azure.

Contributor

For Snowflake, we don't need iceberg-aws, it contains integrations with glue, dynamodb etc.
https://github.com/apache/iceberg/tree/main/aws/src/integration/java/org/apache/iceberg/aws

> I'm not even sure what libraries would be needed for GCP and Azure

For snowflake we need permissions (IAM for AWS, service account for GCP etc.) and external volume setup.
https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume#create-an-external-volume

XTable can already read from S3/GCS/Azure Blob/HDFS using the hadoop library dependencies.
https://github.com/apache/incubator-xtable/blob/main/pom.xml#L360

Contributor

> For snowflake we need permissions (IAM for AWS, service account for GCP etc.) and external volume setup.

Please confirm if my understanding below is correct.
Iceberg supports various catalogs, including JDBC and REST. The Snowflake catalog appears to be JDBC-based [1]. Therefore, when connecting XTable to the Snowflake catalog and updating Iceberg tables, a Snowflake JDBC driver should be a dependency [2]. Iceberg’s JDBC catalog clients should not need Spark or AWS dependencies. However, if someone wants to follow this tutorial end-to-end, they may need Spark runtime and AWS libraries.

If this is correct, it would be helpful to separate the prereqs into two sections: one for what XTable needs and another for the tutorial prerequisites.

[1] https://www.snowflake.com/en/blog/iceberg-tables-catalog-support-available-now/
[2] https://iceberg.apache.org/docs/1.5.0/jdbc/

* Download `bundle-X.X.X.jar` from the [Maven repository](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle)
* Download `iceberg-spark-runtime-3.X_2.12/X.X.X.jar` from [here](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/1.4.2/)
* Download `snowflake-jdbc-X.X.X.jar` from the [Maven repository](https://mvnrepository.com/artifact/net.snowflake/snowflake-jdbc)
Comment on lines +64 to +66
Contributor

Include AWS Java SDK for aws bundle download.


Create a `snowflake-sync-config.yaml` file:

```yaml md title="yaml"
sourceFormat: DELTA
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://path/to/table
    tableName: <table_name>
    namespace: <db_name>.<schema_name>
```

Create a `snowflake-sync-catalog.yaml` file:

```yaml md title="yaml"
catalogImpl: org.apache.iceberg.snowflake.SnowflakeCatalog
catalogName: <catalog_name>
catalogOptions:
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
  warehouse: s3://path/to/table
  uri: jdbc:snowflake://<account-identifier>.snowflakecomputing.com
  jdbc.user: <snowflake-username>
  jdbc.password: <snowflake-password>
```

Sample command to sync the table with Snowflake:
```shell md title="shell"
java -cp /path/to/iceberg-spark-runtime-3.2_2.12-1.4.2.jar:/path/to/xtable-utilities-0.2.0-SNAPSHOT-bundled.jar:/path/to/snowflake-jdbc-3.13.28.jar:/path/to/iceberg-aws-1.4.2.jar:/path/to/bundle-2.23.9.jar \
  org.apache.xtable.utilities.RunSync \
  --datasetConfig snowflake-sync-config.yaml \
  --icebergCatalogConfig snowflake-sync-catalog.yaml
```
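
Because `snowflake-sync-catalog.yaml` holds a plaintext password, one option is to generate it from environment variables just before running the sync. This is only a sketch: the function name and the environment variable names are assumptions, and the keys simply mirror the catalog file shown above.

```shell
# Hypothetical helper: write the catalog config from environment variables
# so credentials are not kept in a checked-in file.
# Argument: path of the YAML file to write.
write_snowflake_catalog() {
  cat > "$1" <<EOF
catalogImpl: org.apache.iceberg.snowflake.SnowflakeCatalog
catalogName: ${SNOWFLAKE_CATALOG_NAME}
catalogOptions:
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
  warehouse: ${ICEBERG_WAREHOUSE_PATH}
  uri: jdbc:snowflake://${SNOWFLAKE_ACCOUNT}.snowflakecomputing.com
  jdbc.user: ${SNOWFLAKE_USER}
  jdbc.password: ${SNOWFLAKE_PASSWORD}
EOF
  chmod 600 "$1"  # restrict the file to the current user
}

# Example usage, with the variables already set in the environment:
# write_snowflake_catalog snowflake-sync-catalog.yaml
```

Values containing YAML-special characters would need quoting, which this sketch does not handle.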