Refresh cloud documentation
proddata committed Aug 1, 2024
1 parent 36c2e09 commit 636dabb
Showing 13 changed files with 453 additions and 518 deletions.
Binary file removed docs/_assets/img/cluster-export-tab-history.png
Binary file not shown.
Binary file removed docs/_assets/img/cluster-export.png
Binary file not shown.
192 changes: 192 additions & 0 deletions docs/cluster/automation.md
@@ -0,0 +1,192 @@
(cluster-automation)=
# Automation

Automation in CrateDB Cloud allows users to streamline and manage routine
database operations efficiently. Two primary automation features available are
the SQL Scheduler and Table Policies, both of which facilitate the maintenance
and optimization of database tasks.

:::{important}
- Automation is available for all newly deployed clusters.
- For existing clusters, the feature can be enabled on demand. (Contact
[support](https://support.crate.io/) for activation.)

Automation utilizes a dedicated database user `gc_admin` with full cluster
privileges to execute scheduled tasks and persists data in the `gc` schema.
:::

## SQL Scheduler

The SQL Scheduler automates routine database tasks by running SQL statements
at scheduled times. All schedules are evaluated in UTC. Each job is defined by
a valid [cron pattern](https://www.ibm.com/docs/en/db2oc?topic=task-unix-cron-format)
and the SQL to execute, which covers a wide range of tasks. Jobs are managed
through the Cloud UI, where they can be added, removed, edited, activated, and
deactivated as needed.

### Use Cases

- Regularly updating or aggregating table data.
- Automating export and import of data.
- Deleting old/redundant data to maintain database efficiency.
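
As a sketch of the first use case, a scheduled job could maintain an hourly
rollup table. All names here are hypothetical and chosen for illustration: the
source table `sample_data` with its `timestamp_column` and `value` columns, and
the target table `sample_data_hourly` with `hour` and `avg_value`.

```sql
-- Hypothetical rollup target; adjust names and types to your own schema.
CREATE TABLE IF NOT EXISTS "sample_data_hourly" (
    "hour" TIMESTAMP WITH TIME ZONE PRIMARY KEY,
    "avg_value" DOUBLE PRECISION
);

-- Aggregate the previous full hour. Re-running the job for the same hour
-- updates the existing row instead of failing on the primary key.
INSERT INTO "sample_data_hourly" ("hour", "avg_value")
SELECT
    date_trunc('hour', "timestamp_column") AS "hour",
    avg("value") AS "avg_value"
FROM "sample_data"
WHERE "timestamp_column" >= date_trunc('hour', NOW() - INTERVAL '1 hour')
  AND "timestamp_column" < date_trunc('hour', NOW())
GROUP BY date_trunc('hour', "timestamp_column")
ON CONFLICT ("hour") DO UPDATE SET "avg_value" = excluded."avg_value";
```

A schedule such as `5 * * * *` (five minutes past every hour, UTC) would suit
this kind of job, since the previous hour is complete by then.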

### Accessing and Using the SQL Scheduler

The SQL Scheduler can be found in the "Automation" tab in the left-hand
navigation menu. Two tabs are relevant to the SQL Scheduler:


**SQL Scheduler** shows a list of your existing jobs. In the list, you can
activate/deactivate each job with a toggle in the "Active" column. You can
also edit and delete jobs with buttons on the right side of the list.

![SQL Scheduler overview](../_assets/img/cluster-sql-scheduler-overview.png)


**Logs** lists *scheduled* job runs with their status (failed or succeeded),
execution time, run time, and the error message for unsuccessful runs. For a
failed run, you can open more details showing the executed query and a stack
trace. You can filter the logs by status or by a specific job.

![SQL Scheduler logs](../_assets/img/cluster-sql-scheduler-logs.png)

### Examples

#### Cleanup of Old Records

Cleanup tasks represent a common use case for these types of automated jobs.
This example deletes records older than 30 days from a specified table once a
day:

```sql
DELETE FROM "sample_data"
WHERE
"timestamp_column" < NOW() - INTERVAL '30 days';
```

How often you run the job is up to you, but once a day is common for cleanup
tasks. The following cron expression runs it every day at 2:30 PM UTC:

Schedule: `30 14 * * *`

![SQL Scheduler cleanup example](../_assets/img/cluster-sql-scheduler-example-cleanup.png)

#### Copying Logs into a Persistent Table

Another common use is copying data into another table for archival purposes.
The following job copies entries from the `sys.jobs_log` system table into a
user-defined table:

```sql
CREATE TABLE IF NOT EXISTS "logs"."persistent_jobs_log" (
    "classification" OBJECT (DYNAMIC),
    "ended" TIMESTAMP WITH TIME ZONE,
    "error" TEXT,
    "id" TEXT,
    "node" OBJECT (DYNAMIC),
    "started" TIMESTAMP WITH TIME ZONE,
    "stmt" TEXT,
    "username" TEXT,
    PRIMARY KEY (id)
) CLUSTERED INTO 1 SHARDS;

INSERT INTO
    "logs"."persistent_jobs_log"
SELECT
    *
FROM
    sys.jobs_log
ON CONFLICT ("id") DO NOTHING;
```

In this example, we schedule the job to run every hour:

Schedule: `0 * * * *`

![SQL Scheduler log-copying example](../_assets/img/cluster-sql-scheduler-example-copying.png)
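
Once the job has run a few times, the archived entries can be queried like any
other table. As a small usage sketch against the `"logs"."persistent_jobs_log"`
table created above, this lists the most recent statements that ended with an
error:

```sql
-- Ten most recent failed statements from the archived job log.
SELECT "started", "ended", "username", "stmt", "error"
FROM "logs"."persistent_jobs_log"
WHERE "error" IS NOT NULL
ORDER BY "started" DESC
LIMIT 10;
```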

:::{note}
Limitations and Known Issues:
* Only one job can run at a time; subsequent jobs will be queued until the
current one completes.
* Long-running jobs may block the execution of queued jobs, leading to
potential delays.
:::


## Table Policies

Table policies allow automating maintenance operations for **partitioned tables**.
Automated actions can be set up to be executed daily based on a pre-configured
ruleset.

![Table policy list](../_assets/img/cluster-table-policy.png)

### Overview

The table policy overview can be found in the left-hand navigation menu under
"Automation". From the list of policies, you can create, delete, edit, or
(de)activate them. Logs of executed policies can be found in the "Logs" tab.

![Table policy logs](../_assets/img/cluster-table-policy-logs.png)

A new policy can be created with the "Add New Policy" button.

![Create table policy](../_assets/img/cluster-table-policy-create.png)

After naming the policy and selecting the tables/schemas to be impacted, you
must specify the time column. This column, which should be a timestamp used for
partitioning, will determine the data affected by the policy. It is important
that this time column is consistently present across all targeted tables/schemas.
While you can apply the policy to tables without the specified time column,
it will not get executed for those. If your tables have different timestamp
columns, consider setting up separate policies for each to ensure accuracy.

:::{note}
The "Time Column" must be of type `TIMESTAMP`.
:::

Next, a time-based condition determines which partitions are affected. A
partition is eligible for action if its value in the partitioning column is
less than (`<`) or less than or equal to (`<=`) the current date minus
`n` days, months, or years.
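
To make the condition concrete, the following query approximates which
partitions would match "older than 30 days" with `<`. It assumes a hypothetical
table `data_table` partitioned by a `ts_day` column and reads the partition
values from `information_schema.table_partitions`. How the policy engine itself
evaluates the condition, including whether it compares against the start of the
current day, is an assumption in this sketch.

```sql
-- Approximation only: lists partitions whose ts_day value lies more than
-- 30 days before the start of the current day (UTC).
SELECT
    table_schema,
    table_name,
    partition_ident,
    CAST("values"['ts_day'] AS TIMESTAMP WITH TIME ZONE) AS ts_day
FROM information_schema.table_partitions
WHERE table_name = 'data_table'
  AND CAST("values"['ts_day'] AS TIMESTAMP WITH TIME ZONE)
      < date_trunc('day', NOW()) - INTERVAL '30 days';
```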

### Actions

The following actions are supported:
* **Delete:** Deletes eligible partitions along with their data.
* **Set replicas:** Changes the replication factor of eligible partitions.
* **Force merge:** Merges the segments of eligible partitions down to a specified number of segments.

After filling out the form, a preview shows the affected schemas/tables and the
number of partitions that would be affected if the policy were executed right now.
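
For orientation, each action corresponds roughly to a statement you could run
by hand against a single partition. The sketch below assumes a hypothetical
table `data_table` partitioned by a timestamp column `ts_day`. The statements
the policy engine actually issues are not documented here, so treat these as
manual equivalents rather than its implementation.

```sql
-- 1717200000000 is 2024-06-01T00:00:00Z in epoch milliseconds.

-- Set replicas: change the replication factor of a single partition.
ALTER TABLE data_table PARTITION (ts_day = 1717200000000)
    SET (number_of_replicas = 0);

-- Force merge: merge the partition's segments down to one segment.
OPTIMIZE TABLE data_table PARTITION (ts_day = 1717200000000)
    WITH (max_num_segments = 1);

-- Delete: removing all rows that match a partition value drops the partition.
DELETE FROM data_table WHERE ts_day = 1717200000000;
```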

### Examples

Consider a table whose older data you want to store more economically. Data
older than 30 days, which might already be snapshotted, only needs to exist
once in the cluster without replication, because high availability is no longer
a priority for it. After 60 days, the data should be removed entirely.

Assume the following table schema:

```sql
CREATE TABLE data_table (
    ts TIMESTAMP,
    ts_day GENERATED ALWAYS AS date_trunc('day', ts),
    val DOUBLE
) PARTITIONED BY (ts_day);
```

For the outlined scenario, the policies would be as follows:

**Policy 1 - Saving replica space:**
* **Time Column:** `ts_day`
* **Condition:** `older than 30 days`
* **Actions:** `Set replicas to 0.`

**Policy 2 - Data removal:**
* **Time Column:** `ts_day`
* **Condition:** `older than 60 days`
* **Actions:** `Delete eligible partition(s)`
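
After the policies have run, their effect can be checked from SQL. As a
sketch, this query lists the partitions of `data_table` with their current
replica setting. Partitions older than 30 days should report `0` replicas, and
partitions older than 60 days should no longer appear.

```sql
-- One row per existing partition of data_table, oldest first.
SELECT
    partition_ident,
    CAST("values"['ts_day'] AS TIMESTAMP WITH TIME ZONE) AS ts_day,
    number_of_replicas
FROM information_schema.table_partitions
WHERE table_name = 'data_table'
ORDER BY ts_day;
```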
31 changes: 31 additions & 0 deletions docs/cluster/console.md
@@ -0,0 +1,31 @@
(cluster-console)=
# Console

The Console lets you run SQL queries directly against your CrateDB Cloud
cluster. It is available in the left-hand navigation menu within a cluster to
users with the "Organization Admin" role. Its main features are:

- **Table and Schema Tree View:** Easily navigate through your database
structure.
- **Client-Side Query Validation:** Ensure your SQL queries are correct before
execution.
- **Multiple Query Execution:** Run several queries in sequence.
- **Query History:** Access and manage your past queries.

:::{important}
- The Console is available for all newly deployed clusters.
- For older clusters, this feature can be enabled on demand. Contact
[support](https://support.crate.io/) for activation.

The Console currently utilizes a dedicated database user `gc_admin` with full
cluster privileges.
:::

:::{note}
**Multi-Query Execution:**
When running multiple queries at once, the Console executes them sequentially,
not within a single session or transaction. If one query fails, the subsequent
queries will not be executed. Currently, session settings are not persisted
between queries.
:::
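
Because session settings are not persisted between queries, a statement such as
`SET search_path` cannot be relied on to affect the queries that follow it in
the same run. The sketch below uses the hypothetical names `my_schema` and
`my_table` to illustrate the difference:

```sql
-- May not carry over to the next statement in a multi-query run.
SET search_path TO my_schema;
SELECT count(*) FROM my_table;

-- More robust: qualify the table with its schema explicitly.
SELECT count(*) FROM "my_schema"."my_table";
```

If the unqualified `SELECT` fails because the table cannot be resolved, the
statements after it are not executed.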
27 changes: 27 additions & 0 deletions docs/cluster/export.md
@@ -0,0 +1,27 @@
(cluster-export)=
# Export

The "Export" section allows users to download specific tables/views. When you
first visit the Export tab, you can specify the name of a table/view,
format (CSV, JSON, or Parquet) and whether you'd like your data to be
gzip compressed (recommended for CSV and JSON files).

:::{important}
- Size limit for exporting is 1 GiB
- Exports are held for 3 days, then automatically deleted
:::

:::{note}
**Limitations with Parquet**:
Parquet is a highly compressed data format for very efficient storage of
tabular data. Please note that for OBJECT and ARRAY columns in CrateDB,
the exported data will be JSON encoded when saving to Parquet
(effectively saving them as strings). This is due to the complexity of
encoding structs and lists in the Parquet format, where determining the
exact schema might not be possible. When re-importing such a Parquet
file, make sure you pre-create the table with the correct schema.
:::
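
As a sketch of such a pre-created table, the definition below declares the
`OBJECT` and `ARRAY` columns with their intended types before a re-import. The
table and column names are hypothetical; use the schema of the table you
originally exported.

```sql
-- Hypothetical target table for re-importing an exported Parquet file.
CREATE TABLE IF NOT EXISTS "doc"."sensor_readings_reimport" (
    "id" TEXT PRIMARY KEY,
    "ts" TIMESTAMP WITH TIME ZONE,
    "payload" OBJECT(DYNAMIC),  -- JSON-encoded string in the Parquet file
    "tags" ARRAY(TEXT)          -- JSON-encoded string in the Parquet file
);
```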



