Add Kinesis Sink #12

colmsnowplow · 2023-10-25T17:32:36Z

This PR adds kinesis sink. Integration tests are not included here - I'll work on those in a separate branch off this one.

@istreeter assigned you as you have a dependency on this, but I think review can come from anyone so will ask the team.

colmsnowplow · 2023-10-25T17:35:12Z

...es/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sources/kinesis/KinesisSource.scala

@@ -113,7 +114,7 @@ object KinesisSource {
  private def kinesisStream[F[_]: Async](config: KinesisSourceConfig): Stream[F, LowLevelEvents[Map[String, KinesisMetadata[F]]]] = {
    val resources =
      for {
-        region <- Resource.eval(KinesisSourceConfig.getRuntimeRegion)
+        region <- Resource.eval(com.snowplowanalytics.snowplow.kinesis.Util.getRuntimeRegion)


I can replace this with (new DefaultAwsRegionProviderChain).getRegion, and rip out the getRuntimeRegion function entirely - I don't really understand why it was in a function in the first place.

Are the two functionally equivalent, or is there some meaningful difference I'm missing?

I agree it does not need to be its own function. I like your suggestion to replace it with new DefaultAwsRegionProviderChain.getRegion (I think you don't even need those parantheses).

For some reason my editor required the brackets - not sure if they're required or it's just my editor, but I assume it's a style choice regardless so doesn't feel worth spending time on it.

Note that something does change in a potentially meaningful way, in that rather than being part of the effects-y execution flow, these are now just evaluated as values outside of that scope (within the scope of the same function though). Integration tests still pass and I don't think there's any real difference there just not 100% sure of my understanding.

colmsnowplow · 2023-10-25T17:40:54Z

...sis-it/src/test/scala/com/snowplowanalytics/snowplow/sources/kinesis/KinesisSourceSpec.scala

@@ -20,6 +20,7 @@ import software.amazon.awssdk.services.kinesis.KinesisAsyncClient

 import com.snowplowanalytics.snowplow.sources.EventProcessingConfig
 import com.snowplowanalytics.snowplow.sources.EventProcessingConfig.NoWindowing
+import com.snowplowanalytics.snowplow.kinesis._


I tried to import com.snowplowanalytics.snowplow.kinesis.Util and call Util.getRuntimeRegion, but I got an error:

object Util is not a member of package com.snowplowanalytics.snowplow.kinesis

I don't understand why - to me it seems like it should be the same as importing any other object. Would be interested to hear why if someone can explain.

It's because your Util.scala file contains two separate lines both declaring:

package com.snowplowanalytics.snowplow.kinesis

See line 1 and line 10.

It means the fully qualified class name of your Util object is actually com.snowplowanalytics.snowplow.kinesis.com.snowplowanalytics.snowplow.kinesis.Util.

🤦 It's cos I didn't scroll my editor up far enough... Thanks!

istreeter

This looks good @colmsnowplow.

What are your thoughts on adding an integration test? No worries if you think that's too much for the current PR. You already know I am keen to get this merged without too much extra delay!

istreeter · 2023-10-25T21:52:24Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+    val withRegion = KinesisClient.builder().region(region)
+    customEndpoint match {
+      case Some(endpoint) => withRegion.endpointOverride(endpoint).build()
+      case None           => withRegion.build()
+    }


So an alternative way of writing this is....

val builder = KinesisClient.builder().region(region) customEndpoint.foreach(e => builder.endpointOverride(e)) builder.build()

Some people in the team write it the way you did it... I think because in functional programming we pretend that everything is immutable.

But IMO, if we accept that builder is mutable whether we like it or not... then my alternative suggestion looks simpler and is exactly how the builder was intended to be used when someone designed it.

Yeah I like that better too.

istreeter · 2023-10-25T21:55:17Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+    Resource.make(make) { producer =>
+      Sync[F].blocking {
+        producer.close()
+      }
+    }


Nothing wrong with this, but I think you can also do....

Resource.fromAutoCloseable(make)

I copied the pattern from pubsub target. Would lean on not changing it as it feels like both should be the same, but if there's some reason to use esource.fromAutoCloseable(make) for both happy to make that change too.

istreeter · 2023-10-25T22:03:38Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+
+  final case class RequestLimits(recordLimit: Int, bytesLimit: Int)
+
+  object Retries {


Can this object be private?

Anything which is not private ends up being part of the public API of this library. And if we are conscientious library maintainers then the API should be stable between versions, i.e. it restricts us from ever removing this Retries object in future when we want to do a code refactor.

istreeter · 2023-10-25T22:22:56Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+  }
+
+  private def getRecordSize(record: PutRecordsRequestEntry) =
+    record.data.asByteArray().length + record.partitionKey().getBytes(UTF_8).length


I was curious so I looked up sdkbytes in the aws java core repo.

If you call .asByteArray().length then it copes the data into a new array. This is relatively expensive -- we should try to avoid ever copying lots of bytes like this.

If you call .asByteBuffer().limit then it does not copy the bytes. Similarly .asByteArrayUnsafe().length is cheap.

istreeter · 2023-10-25T22:34:46Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+
+  private def toKinesisRecords(records: List[Sinkable]): List[PutRecordsRequestEntry] =
+    records.map { r =>
+      val data = SdkBytes.fromByteArray(r.bytes)


It's very tempting to use SdkBytes.fromByteArrayUnsafe(r.bytes) here, because it saves doing an extra copy of the data in the byte array. See source code here.

It's called "unsafe" because byte arrays are mutable, so the aws sdk needs to trust us that we're not going to modify the shared byte array. But if we promise not to modify it, then the unsafe method will be faster.

istreeter · 2023-10-25T22:37:55Z

.../kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSinkConfig.scala

+
+case class KinesisSinkConfig(
+  streamName: String,
+  partitionKey: Option[String],


Is this partitionKey ever used?

Here I think, no? https://github.com/snowplow-incubator/common-streams/pull/12/files#r1373364675

(Couldn't link to the line of code so commented)

istreeter · 2023-10-25T22:38:41Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+    } yield ()
+  }
+
+  final case class RequestLimits(recordLimit: Int, bytesLimit: Int)


This class could be private because it's never used outside of our private methods.

colmsnowplow · 2023-10-26T09:31:12Z

Thanks @istreeter I'll take a look at the specific comments in a bit, just to address the top line question first:

What are your thoughts on adding an integration test? No worries if you think that's too much for the current PR. You already know I am keen to get this merged without too much extra delay!

Yes keen to implement it but doing so in a separate branch as I'm aware of blockers. I'll prioritise addressing feedback first, and from there it's up to you really - happy to get this merged first or wait for the integration tests.

colmsnowplow · 2023-10-26T15:29:56Z

modules/kinesis/src/main/scala/com/snowplowanalytics/snowplow/sinks/kinesis/KinesisSink.scala

+      val data = SdkBytes.fromByteArray(r.bytes)
+      val prre = PutRecordsRequestEntry
+        .builder()
+        .partitionKey(r.partitionKey.getOrElse(UUID.randomUUID.toString()))


Partition key is used here I think?

That is the partition key from the Sinkable. It is not the partition key from the KinesisSinkConfig.

Ah! OK I follow now. I guess we don't need to configure it this way then, removing.

istreeter · 2023-10-26T15:48:39Z

...sis-it/src/test/scala/com/snowplowanalytics/snowplow/sources/kinesis/KinesisSourceSpec.scala

@@ -33,12 +34,13 @@ class KinesisSourceSpec
  override val Timeout: FiniteDuration = 3.minutes

  /** Resources which are shared across tests */
-  override val resource: Resource[IO, (LocalStackContainer, KinesisAsyncClient, String => KinesisSourceConfig)] =
+  override val resource: Resource[IO, (LocalStackContainer, KinesisAsyncClient, String => KinesisSourceConfig)] = {
+    val region = (new DefaultAwsRegionProviderChain).getRegion


I think we should keep it effects-y

for { region <- Resource.eval(IO.blocking((new DefaultAwsRegionProviderChain).getRegion)) localstack <- Localstack.resource(region, KINESIS_INITIALIZE_STREAMS)

And now I've got to explain why :)

Reason 1: thread pools

If this app is running on a EC2 instance (or EKS or ECS) then the AWS SDK discovers the region by making a HTTP request to a AWS metadata server. This blocks the current thread until we get the response from the server.

This bit of code we're looking at runs on the cats-effect thread pool, which is a relatively small fixed size pool. So while we're waiting for the response it blocks other cats-effect things to run concurrently.

If this bit of code was in a data-processing part of the the app, then it would be a big problem because it could hurt throughput. Luckily this code only runs once at startup, and the lookup is fairly quick, so it would not hurt us too much.

So it is a weak reason. But still it is best practice to never block the cats-effect thread pool.

Reason 2: Exceptions

If the AWS sdk cannot discover the region, then this line is going to raise an exception. With cats-effect stuff, the exception can happen either....

When you create the resource

When you use the resource

Scala developers using your library will expect case 2 to happen. They would be surprised if it's case 1. They might write their code to do some special error-handling around use'ing the resource, but then their error handling does not work properly because the exception happens in the wrong place.

My Reason 2 is stronger than my Reason 1. Maybe neither reason is very strong. But I do think we should write the library in way that does not cause surprises for scala devs who are used to certain conventions on how stuff works.

Just realised I left this comment on the integration test, but the comment was meant more for the production code.

This makes sense! Thanks so much for the explanation that helps a lot. :)

I had an idea of something like point 1 but indeed I figured it wasn't much of a big deal since this gets executed once on startup. Point 2 I didn't consider at all. But still even on just point 1 I see the argument that it should be the other way.

I've made the changes to source and integration tests. The one in sink, I believe was already the way we want it because it just saves the effects-y bit in the make value, then after calls that as a Resource. But I might be wrong about that.

colmsnowplow requested a review from istreeter October 25, 2023 17:32

colmsnowplow commented Oct 25, 2023

View reviewed changes

colmsnowplow force-pushed the kinesis-sink branch from dcc3b9a to 033f71b Compare October 25, 2023 17:44

istreeter reviewed Oct 25, 2023

View reviewed changes

colmsnowplow force-pushed the kinesis-sink branch from 033f71b to ebd3722 Compare October 26, 2023 12:48

colmsnowplow commented Oct 26, 2023

View reviewed changes

istreeter reviewed Oct 26, 2023

View reviewed changes

colmsnowplow force-pushed the kinesis-sink branch 4 times, most recently from 0fbd58e to d2b5092 Compare November 14, 2023 10:38

Add Kinesis Sink

66c40d3

colmsnowplow force-pushed the kinesis-sink branch from d2b5092 to 66c40d3 Compare November 14, 2023 10:50

Add Kinesis Sink tests

9d12850

colmsnowplow merged commit 11153c4 into develop Nov 14, 2023
1 check passed

colmsnowplow deleted the kinesis-sink branch November 14, 2023 12:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Kinesis Sink #12

Add Kinesis Sink #12

colmsnowplow commented Oct 25, 2023

colmsnowplow Oct 25, 2023

istreeter Oct 25, 2023

colmsnowplow Oct 26, 2023 •

edited

Loading

colmsnowplow Oct 25, 2023

istreeter Oct 25, 2023

colmsnowplow Oct 26, 2023

istreeter left a comment

istreeter Oct 25, 2023

colmsnowplow Oct 26, 2023

istreeter Oct 25, 2023

colmsnowplow Oct 26, 2023

istreeter Oct 25, 2023

istreeter Oct 25, 2023

istreeter Oct 25, 2023

istreeter Oct 25, 2023

colmsnowplow Oct 26, 2023 •

edited

Loading

istreeter Oct 25, 2023

colmsnowplow commented Oct 26, 2023

colmsnowplow Oct 26, 2023 •

edited

Loading

istreeter Oct 26, 2023

colmsnowplow Oct 27, 2023

istreeter Oct 26, 2023

istreeter Oct 26, 2023

colmsnowplow Oct 27, 2023


		final case class RequestLimits(recordLimit: Int, bytesLimit: Int)

		object Retries {

Add Kinesis Sink #12

Add Kinesis Sink #12

Conversation

colmsnowplow commented Oct 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colmsnowplow Oct 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

istreeter left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colmsnowplow Oct 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colmsnowplow commented Oct 26, 2023

colmsnowplow Oct 26, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Reason 1: thread pools

Reason 2: Exceptions

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colmsnowplow Oct 26, 2023 •

edited

Loading

colmsnowplow Oct 26, 2023 •

edited

Loading

colmsnowplow Oct 26, 2023 •

edited

Loading