Commit b20c531

Documentation updates and refinement

jonmjoyce committed Jun 25, 2024
1 parent e98a4aa commit b20c531
Showing 6 changed files with 36 additions and 11 deletions.
2 changes: 1 addition & 1 deletion docs/architecture/infrastructure.md
@@ -1,7 +1,7 @@
---
layout: default
title: Infrastructure
-nav_order: 3
+nav_order: 2
has_children: true
---

12 changes: 8 additions & 4 deletions docs/ingest/events.md
@@ -1,15 +1,15 @@
---
layout: default
parent: Data Ingest
-nav_order: 3
+nav_order: 4
---

# Event Messaging

Event messaging allows the system to respond and operate in near real-time while supporting scalability and extensibility for future use-cases. Designing around a messaging system provides a centralized mechanism for system components to communicate without coupling those components. That means new components can be added to the system ad hoc, without redesigning core capabilities. Event-driven systems are able to scale by sending messages to many listeners at once. They also tend to distribute data through the system faster than batch systems because the event is raised immediately rather than after a set increment of time.

**Key Points**
-- An event is any change in data state, fundamentally new available data
+- An event is any change in data state, i.e. fundamentally new available data (new, updated, and deleted)
- Event systems make extending the system easier as requirements evolve
- Event-driven design results in data propagating through the system faster than traditional scheduled batch systems
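
As a rough illustration, one way such an event might be represented; this is a hypothetical JSON payload, not a schema defined by this system:

```python
# Hypothetical event payload describing newly available data; all field names
# and values are illustrative assumptions, not a defined system schema.
import json
from datetime import datetime, timezone

event = {
    "type": "object-created",           # could also be object-updated/deleted
    "bucket": "noaa-nodd-example",      # placeholder source bucket
    "key": "gfs.t00z.pgrb2.0p25.f000",  # the newly available file
    "time": datetime.now(timezone.utc).isoformat(),
}

# Publishing this as a message lets any number of listeners react without the
# producer knowing about them; that is the decoupling described above.
message = json.dumps(event)
```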

@@ -25,6 +25,10 @@ We have prototyped messaging using SNS because we are able to receive messages d

## RabbitMQ

-There are many modern messaging frameworks to choose from today. Every cloud platform provides their own brand of messaging (Amazon Simple Queue Service (SQS), Google Pub/Sub, and Azure Service Bus) and there are numerous open-source platforms as well. We initially prototyped RabbitMq as the messaging broker because it is relatively simple to configure, open source, and cloud platform independent. From a system architecture perspective, the main difference between SQS and RabbitMq is that SQS only works on AWS while RabbitMq will work on whatever platform it is installed on. However, SQS is already configured and "comes with" AWS natively.
+There are many modern messaging frameworks to choose from today. Every cloud platform provides its own brand of messaging (Amazon Simple Queue Service (SQS), Google Pub/Sub, and Azure Service Bus), and there are numerous open-source platforms as well. We initially prototyped RabbitMQ as the messaging broker because it is relatively simple to configure, open source, and cloud-platform independent.

> RabbitMQ is the most widely deployed open source message broker. RabbitMQ is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols. (https://www.rabbitmq.com/)

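A minimal publish with the standard `pika` Python client gives a feel for how the broker is used; the queue name and message body here are placeholders for illustration:

```python
# Minimal RabbitMQ publish using the standard pika client; the queue name and
# message body are placeholders.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# queue_declare is idempotent: the queue is created only if it does not exist.
channel.queue_declare(queue="ingest-events")
channel.basic_publish(
    exchange="",                  # default exchange routes by queue name
    routing_key="ingest-events",
    body=b'{"type": "object-created"}',
)
connection.close()
```
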
## Comparison

From a system architecture perspective, the main difference between SQS and RabbitMQ is that SQS only works on AWS, while RabbitMQ works on whatever platform it is installed on. However, SQS is already configured and "comes with" AWS natively. This is not really a limitation: if and when IOOS needs to connect to other cloud providers (e.g. GCP, Azure), those new integrations can be developed without a major change to the underlying technical strategy. All of the major cloud providers support a messaging framework, so there is no technical limitation, but each additional supported platform requires additional developer support and maintenance. On the other hand, using consistent tooling among cloud providers reduces the number of configurations, and therefore the number of platform-specific test cases to be addressed. One could also argue that maintaining and understanding a RabbitMQ deployment is a hidden developer cost that does not arise with a managed service such as AWS SQS, which is [maintained by Amazon 24/7](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/example-implementations-for-availability-goals.html) and just works. For comparison, the SQS equivalent of the publish sketched above follows.
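
The equivalent SQS publish with `boto3` is roughly the same amount of client code (the queue URL is a placeholder); the real cost difference lies in operating the broker, not in calling it:

```python
# Equivalent SQS publish with boto3; the queue URL is a placeholder.
import boto3

sqs = boto3.client("sqs")
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events",
    MessageBody='{"type": "object-created"}',
)
```
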
6 changes: 3 additions & 3 deletions docs/ingest/ingest-prototype.md
@@ -1,11 +1,11 @@
---
layout: default
-title: Prototype
+title: Kerchunk Workflow (Argo/K8s)
parent: Data Ingest
-nav_order: 1
+nav_order: 2
---

-# Data Ingest Prototype
+# Argo Kerchunk Workflow

![Prototype Diagram](modeldata-prototype-diagram.png)

7 changes: 4 additions & 3 deletions docs/ingest/ingest.md
@@ -1,7 +1,7 @@
---
layout: default
title: Data Ingest
-nav_order: 2
+nav_order: 3
has_children: true
---

@@ -13,9 +13,10 @@ Data Ingest starts the process of preparing data for transformation and notifyin
- The system needs a method of incorporating raw data in order to provide more value to that data.
- *Raw data* referred to here may be a traditional GRIB or NETCDF file, but could also be video or imagery.
- The raw data does not necessarily need to be copied to be ingested; if data is already cloud accessible it still needs to be ingested but not copied.
-- Data needs to pass quality control checks. Bad data in causes bad data analysis, and in the case of AI/ML, potentially invalid models.
+- Data needs to pass quality control checks. Bad data inputs corrupt good data analysis.
+- Using bad data to train AI/ML models wastes effort because it will likely produce inaccurate results.
- The metadata needs to be extracted from the raw data files to feed the larger system.
- The system does not enforce raw data standards. If a data product requires reformatting or rechunking then it's the responsibility of that product subdomain to provide that product.
- The system requires that the data is indexable by byte-range requests. Common scientific formats such as GRIB and NetCDF adhere to this requirement and are easily indexed using the [kerchunk process](ingest-prototype.md) (see the sketch after this list).
- This process needs to be constantly monitored. A dedicated data team should be responsible for data ingest in operations.
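
To illustrate the byte-range requirement above, a hedged sketch using `fsspec`; the bucket, key, offset, and length are made-up stand-ins for values a kerchunk index would record:

```python
# Illustration of byte-range indexability with fsspec; the bucket, key, offset,
# and length are made-up stand-ins for values a kerchunk index would record.
import fsspec

url = "s3://noaa-nodd-example/gfs.t00z.pgrb2.0p25.f000"
with fsspec.open(url, "rb", anon=True) as f:
    f.seek(1_048_576)       # jump straight to the recorded byte offset
    chunk = f.read(65_536)  # read only that chunk, never the whole file
```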

## Data Ingest Concepts
20 changes: 20 additions & 0 deletions docs/ingest/kerchunk.md
@@ -0,0 +1,20 @@
---
layout: default
title: Kerchunk Workflow (RPS/Lambda)
parent: Data Ingest
nav_order: 1
---

# Lambda Kerchunk Workflow

![Prototype Diagram](lambda-workflow.png)

[Source Code and Technical Documentation](https://github.com/asascience-open/nextgen-dmac/tree/main/cloud_aggregator)

The data ingest prototype starts by listening for events from the NODD bucket when new files are added. Each new file kicks off an SNS notification, which is then queued in SQS. The reason for doing this is that every message is retained and can be read even if no listener is ready at the exact moment the notification is generated. A sketch of this wiring follows.
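
A hedged sketch of that wiring with `boto3`; the topic ARN and queue name are placeholders, not the real NODD resources (the queue also needs a policy allowing the topic to send to it, omitted here):

```python
# Sketch of subscribing an SQS queue to an SNS topic with boto3; the topic ARN
# and queue name are placeholders, not the actual NODD resources.
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

queue_url = sqs.create_queue(QueueName="ingest-notifications")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Messages now buffer in the queue until a consumer is ready, so nothing is
# lost if no listener is running when a notification fires.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:new-object-topic",  # placeholder
    Protocol="sqs",
    Endpoint=queue_arn,
)
```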

The Lambda functions are triggered when new messages arrive in the SQS queue. Each invocation spins up a temporary virtual machine running the Docker image we built, which executes our custom Python code. This Python code kerchunks the appropriate files and then writes the index zarr files to the destination S3 bucket. A hedged sketch of such a handler is shown below.
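
A sketch of what such a handler might look like, assuming NetCDF/HDF5 input and kerchunk's `SingleHdf5ToZarr`; the bucket names and key layout are assumptions, and the actual prototype code lives in the repository linked above:

```python
# Hypothetical Lambda handler; bucket names and key layout are assumptions.
# The actual prototype code is in the linked repository.
import json

import boto3
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:               # one SQS record per message
        body = json.loads(record["body"])
        s3_event = json.loads(body["Message"])    # unwrap the SNS envelope
        for rec in s3_event["Records"]:
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            url = f"s3://{bucket}/{key}"

            # Scan the file and build the kerchunk reference set.
            with fsspec.open(url, "rb", anon=True) as f:
                refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

            # Write the JSON reference (the "zarr index") to the destination.
            s3.put_object(
                Bucket="dmac-cloud-optimized",     # placeholder bucket
                Key=f"indexes/{key}.json",
                Body=json.dumps(refs).encode(),
            )
```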

The kerchunked data is written to the public Cloud-Optimized DMAC bucket. Note that the kerchunked data is a reference to the NODD data, not a copy, so the NODD data must remain available for the kerchunk reference to work.

The same listener pattern is applied to the destination bucket: when new files are added, events go into a queue, and the `aggregation` Lambda kicks off a workflow that produces the "best forecast" kerchunk for the available data, as well as a single reference file covering the entire model run (one cycle). A sketch of that aggregation step follows.
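
A hedged sketch of that aggregation using kerchunk's `MultiZarrToZarr`; the reference file list and dimension names are illustrative assumptions:

```python
# Hypothetical aggregation step with kerchunk's MultiZarrToZarr; the reference
# list and dimension names are illustrative assumptions.
import json

import fsspec
from kerchunk.combine import MultiZarrToZarr

# Per-file references produced by the first Lambda (placeholder paths).
ref_urls = [
    "s3://dmac-cloud-optimized/indexes/f000.json",
    "s3://dmac-cloud-optimized/indexes/f001.json",
]

refs = []
for u in ref_urls:
    with fsspec.open(u) as f:
        refs.append(json.load(f))

# Concatenate the per-forecast-hour references along time to get one logical
# dataset covering the whole model cycle.
mzz = MultiZarrToZarr(refs, concat_dims=["time"], identical_dims=["lat", "lon"])
combined = mzz.translate()

with open("model_run.json", "w") as out:
    json.dump(combined, out)
```
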
Binary file added docs/ingest/lambda-workflow.png
