Commit b20c531

Documentation updates and refinement

jonmjoyce committed Jun 25, 2024
1 parent e98a4aa commit b20c531
Showing 6 changed files with 36 additions and 11 deletions.
2 changes: 1 addition & 1 deletion docs/architecture/infrastructure.md
@@ -1,7 +1,7 @@
---
layout: default
title: Infrastructure
-nav_order: 3
+nav_order: 2
has_children: true
---

12 changes: 8 additions & 4 deletions docs/ingest/events.md
@@ -1,15 +1,15 @@
---
layout: default
parent: Data Ingest
-nav_order: 3
+nav_order: 4
---

# Event Messaging

Event messaging allows the system to respond and operate in near real-time while supporting scalability and extensibility for future use-cases. Designing around a messaging system provides a centralized mechanism for system components to communicate without coupling those components. That means new components can be added to the system ad hoc, without redesigning core capabilities. Event-driven systems are able to scale by sending messages to many listeners at once. They also tend to distribute data through the system faster than batch systems because the event is raised immediately rather than after a set increment of time.

**Key Points**
-- An event is any change in data state, fundamentally new available data
+- An event is any change in data state, i.e. fundamentally new available data (new, updated, and deleted)
- Event systems make extending the system easier as requirements evolve
- Event-driven design results in data propagating through the system faster than traditional scheduled batch systems
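
As a rough illustration, one way such an event might be represented; this is a hypothetical JSON payload, not a schema defined by this system:

```python
# Hypothetical event payload describing newly available data; all field names
# and values are illustrative assumptions, not a defined system schema.
import json
from datetime import datetime, timezone

event = {
    "type": "object-created",           # could also be object-updated/deleted
    "bucket": "noaa-nodd-example",      # placeholder source bucket
    "key": "gfs.t00z.pgrb2.0p25.f000",  # the newly available file
    "time": datetime.now(timezone.utc).isoformat(),
}

# Publishing this as a message lets any number of listeners react without the
# producer knowing about them; that is the decoupling described above.
message = json.dumps(event)
```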

@@ -25,6 +25,10 @@ We have prototyped messaging using SNS because we are able to receive messages d

## RabbitMQ

-There are many modern messaging frameworks to choose from today. Every cloud platform provides their own brand of messaging (Amazon Simple Queue Service (SQS), Google Pub/Sub, and Azure Service Bus) and there are numerous open-source platforms as well. We initially prototyped RabbitMq as the messaging broker because it is relatively simple to configure, open source, and cloud platform independent. From a system architecture perspective, the main difference between SQS and RabbitMq is that SQS only works on AWS while RabbitMq will work on whatever platform it is installed on. However, SQS is already configured and "comes with" AWS natively.
+There are many modern messaging frameworks to choose from today. Every cloud platform provides its own brand of messaging (Amazon Simple Queue Service (SQS), Google Pub/Sub, and Azure Service Bus), and there are numerous open-source platforms as well. We initially prototyped RabbitMQ as the messaging broker because it is relatively simple to configure, open source, and cloud-platform independent.

> RabbitMQ is the most widely deployed open source message broker. RabbitMQ is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols. (https://www.rabbitmq.com/)

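A minimal publish with the standard `pika` Python client gives a feel for how the broker is used; the queue name and message body here are placeholders for illustration:

```python
# Minimal RabbitMQ publish using the standard pika client; the queue name and
# message body are placeholders.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# queue_declare is idempotent: the queue is created only if it does not exist.
channel.queue_declare(queue="ingest-events")
channel.basic_publish(
    exchange="",                  # default exchange routes by queue name
    routing_key="ingest-events",
    body=b'{"type": "object-created"}',
)
connection.close()
```
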
## Comparison

From a system architecture perspective, the main difference between SQS and RabbitMQ is that SQS only works on AWS, while RabbitMQ works on whatever platform it is installed on. However, SQS is already configured and "comes with" AWS natively. This is not really a limitation: if and when IOOS needs to connect to other cloud providers (e.g. GCP, Azure), those new integrations can be developed without a major change to the underlying technical strategy. All of the major cloud providers support a messaging framework, so there is no technical limitation, but each additional supported platform requires additional developer support and maintenance. On the other hand, using consistent tooling among cloud providers reduces the number of configurations, and therefore the number of platform-specific test cases to be addressed. One could also argue that maintaining and understanding a RabbitMQ deployment is a hidden developer cost that does not arise with a managed service such as AWS SQS, which is [maintained by Amazon 24/7](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/example-implementations-for-availability-goals.html) and just works. For comparison, the SQS equivalent of the publish sketched above follows.
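
The equivalent SQS publish with `boto3` is roughly the same amount of client code (the queue URL is a placeholder); the real cost difference lies in operating the broker, not in calling it:

```python
# Equivalent SQS publish with boto3; the queue URL is a placeholder.
import boto3

sqs = boto3.client("sqs")
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events",
    MessageBody='{"type": "object-created"}',
)
```
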
6 changes: 3 additions & 3 deletions docs/ingest/ingest-prototype.md
@@ -1,11 +1,11 @@
---
layout: default
-title: Prototype
+title: Kerchunk Workflow (Argo/K8s)
parent: Data Ingest
-nav_order: 1
+nav_order: 2
---

-# Data Ingest Prototype
+# Argo Kerchunk Workflow

![Prototype Diagram](modeldata-prototype-diagram.png)

7 changes: 4 additions & 3 deletions docs/ingest/ingest.md
@@ -1,7 +1,7 @@
---
layout: default
title: Data Ingest
-nav_order: 2
+nav_order: 3
has_children: true
---

@@ -13,9 +13,10 @@ Data Ingest starts the process of preparing data for transformation and notifyin
- The system needs a method of incorporating raw data in order to provide more value to that data.
- *Raw data* referred to here may be a traditional GRIB or NETCDF file, but could also be video or imagery.
- The raw data does not necessarily need to be copied to be ingested; if data is already cloud accessible it still needs to be ingested but not copied.
-- Data needs to pass quality control checks. Bad data in causes bad data analysis, and in the case of AI/ML, potentially invalid models.
+- Data needs to pass quality control checks. Bad data inputs corrupt good data analysis.
+- Using bad data to train AI/ML models wastes effort because it will likely produce inaccurate results.
- The metadata needs to be extracted from the raw data files to feed the larger system.
- The system does not enforce raw data standards. If a data product requires reformatting or rechunking then it's the responsibility of that product subdomain to provide that product.
- The system requires that the data is indexable by byte-range requests. Common scientific formats such as GRIB and NetCDF adhere to this requirement and are easily indexed using the [kerchunk process](ingest-prototype.md) (see the sketch after this list).
- This process needs to be constantly monitored. A dedicated data team should be responsible for data ingest in operations.
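
To illustrate the byte-range requirement above, a hedged sketch using `fsspec`; the bucket, key, offset, and length are made-up stand-ins for values a kerchunk index would record:

```python
# Illustration of byte-range indexability with fsspec; the bucket, key, offset,
# and length are made-up stand-ins for values a kerchunk index would record.
import fsspec

url = "s3://noaa-nodd-example/gfs.t00z.pgrb2.0p25.f000"
with fsspec.open(url, "rb", anon=True) as f:
    f.seek(1_048_576)       # jump straight to the recorded byte offset
    chunk = f.read(65_536)  # read only that chunk, never the whole file
```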

## Data Ingest Concepts
20 changes: 20 additions & 0 deletions docs/ingest/kerchunk.md
@@ -0,0 +1,20 @@
---
layout: default
title: Kerchunk Workflow (RPS/Lambda)
parent: Data Ingest
nav_order: 1
---

# Lambda Kerchunk Workflow

![Prototype Diagram](lambda-workflow.png)

[Source Code and Technical Documentation](https://github.com/asascience-open/nextgen-dmac/tree/main/cloud_aggregator)

The data ingest prototype starts by listening for events from the NODD bucket when new files are added. Each new file kicks off an SNS notification, which is then queued in SQS. The reason for doing this is that every message is retained and can be read even if no listener is ready at the exact moment the notification is generated. A sketch of this wiring follows.
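
A hedged sketch of that wiring with `boto3`; the topic ARN and queue name are placeholders, not the real NODD resources (the queue also needs a policy allowing the topic to send to it, omitted here):

```python
# Sketch of subscribing an SQS queue to an SNS topic with boto3; the topic ARN
# and queue name are placeholders, not the actual NODD resources.
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

queue_url = sqs.create_queue(QueueName="ingest-notifications")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Messages now buffer in the queue until a consumer is ready, so nothing is
# lost if no listener is running when a notification fires.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:new-object-topic",  # placeholder
    Protocol="sqs",
    Endpoint=queue_arn,
)
```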

The Lambda functions are triggered when new messages arrive in the SQS queue. Each invocation spins up a temporary virtual machine running the Docker image we built, which executes our custom Python code. This Python code kerchunks the appropriate files and then writes the index zarr files to the destination S3 bucket. A hedged sketch of such a handler is shown below.
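
A sketch of what such a handler might look like, assuming NetCDF/HDF5 input and kerchunk's `SingleHdf5ToZarr`; the bucket names and key layout are assumptions, and the actual prototype code lives in the repository linked above:

```python
# Hypothetical Lambda handler; bucket names and key layout are assumptions.
# The actual prototype code is in the linked repository.
import json

import boto3
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:               # one SQS record per message
        body = json.loads(record["body"])
        s3_event = json.loads(body["Message"])    # unwrap the SNS envelope
        for rec in s3_event["Records"]:
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            url = f"s3://{bucket}/{key}"

            # Scan the file and build the kerchunk reference set.
            with fsspec.open(url, "rb", anon=True) as f:
                refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

            # Write the JSON reference (the "zarr index") to the destination.
            s3.put_object(
                Bucket="dmac-cloud-optimized",     # placeholder bucket
                Key=f"indexes/{key}.json",
                Body=json.dumps(refs).encode(),
            )
```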

The kerchunked data is written to the public Cloud-Optimized DMAC bucket. Note that the kerchunked data is a reference to the NODD data, not a copy, so the NODD data must remain available for the kerchunk reference to work.

The same listener pattern is applied to the destination bucket: when new files are added, events go into a queue, and the `aggregation` Lambda kicks off a workflow that produces the "best forecast" kerchunk for the available data, as well as a single reference file covering the entire model run (one cycle). A sketch of that aggregation step follows.
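
A hedged sketch of that aggregation using kerchunk's `MultiZarrToZarr`; the reference file list and dimension names are illustrative assumptions:

```python
# Hypothetical aggregation step with kerchunk's MultiZarrToZarr; the reference
# list and dimension names are illustrative assumptions.
import json

import fsspec
from kerchunk.combine import MultiZarrToZarr

# Per-file references produced by the first Lambda (placeholder paths).
ref_urls = [
    "s3://dmac-cloud-optimized/indexes/f000.json",
    "s3://dmac-cloud-optimized/indexes/f001.json",
]

refs = []
for u in ref_urls:
    with fsspec.open(u) as f:
        refs.append(json.load(f))

# Concatenate the per-forecast-hour references along time to get one logical
# dataset covering the whole model cycle.
mzz = MultiZarrToZarr(refs, concat_dims=["time"], identical_dims=["lat", "lon"])
combined = mzz.translate()

with open("model_run.json", "w") as out:
    json.dump(combined, out)
```
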
Binary file added docs/ingest/lambda-workflow.png
