Commit
* Airflow spark
* ignore logs folder
* update .env.template
* update airflow config
* setup java and spark in airflow dockerfile
* adjust spark and docker-compose
* Dag with SparkSubmitOperator
* fix spark Dockerfile
Showing 16 changed files with 219 additions and 568 deletions.
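
Both new DAGs submit their jobs through conn_id="spark_master", so the Airflow deployment needs a Spark connection with that id. A minimal sketch of creating it with the Airflow CLI, assuming a standalone Spark master at spark://spark-master:7077 (host and port are assumptions about the docker-compose setup, which is not shown in this commit):

    airflow connections add spark_master \
        --conn-type spark \
        --conn-host spark://spark-master \
        --conn-port 7077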
.gitignore:
@@ -2,6 +2,7 @@
 __pycache__

 data/
+logs/

 notebooks/.cache
 notebooks/.conda
Airflow DAG "airbnb":
@@ -0,0 +1,13 @@
import os
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

dag = DAG("airbnb", start_date=datetime(2024, 5, 20), schedule=None)

spark_task = SparkSubmitOperator(
    conn_id="spark_master",
    application=os.path.abspath("dags/airbnb_job.py"),
    task_id="run_spark_job",
    dag=dag,
)
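
Because the DAG is declared with schedule=None, it never runs on a timer; it has to be triggered manually, e.g. from the UI or with the CLI:

    airflow dags trigger airbnb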
dags/airbnb_job.py (the Spark job submitted by the DAG above):
@@ -0,0 +1,13 @@
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("airbnb").getOrCreate()
sc = spark.sparkContext

url = "https://data.insideairbnb.com/united-states/ma/boston/2024-03-24/data/listings.csv.gz"

df = spark.createDataFrame(pd.read_csv(url))

print(f"# of rows: {df.count()}")

df.show(20)
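
The listing is downloaded with pandas and handed to Spark via createDataFrame, presumably to avoid configuring Spark to read the gzipped CSV over HTTPS. For a quick check outside Airflow, the same file should also run directly through spark-submit (the master URL is an assumption, matching the connection sketch above):

    spark-submit --master spark://spark-master:7077 dags/airbnb_job.py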
Airflow DAG "spark_job_example":
@@ -0,0 +1,13 @@
import os
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

dag = DAG("spark_job_example", start_date=datetime(2024, 5, 20))

spark_task = SparkSubmitOperator(
    conn_id="spark_master",
    application=os.path.abspath("dags/spark_job.py"),
    task_id="run_spark_job",
    dag=dag,
)
dags/spark_job.py (a small self-contained demo job):
@@ -0,0 +1,15 @@
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [
        ("sue", 32),
        ("li", 3),
        ("bob", 75),
        ("heo", 13),
    ],
    ["first_name", "age"],
)

df.show()
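
If the submit succeeds, the task log should end with roughly this df.show() output:

    +----------+---+
    |first_name|age|
    +----------+---+
    |       sue| 32|
    |        li|  3|
    |       bob| 75|
    |       heo| 13|
    +----------+---+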
Airflow image Dockerfile (the base image switches from python3.12 to python3.11; Java 11 and a Spark distribution are installed on top):
@@ -1,4 +1,39 @@
-FROM apache/airflow:2.9.1-python3.12
+FROM apache/airflow:2.9.1-python3.11

USER root
RUN apt-get update && apt-get install -y curl wget vim

RUN wget --no-verbose -O openjdk-11.tar.gz https://builds.openlogic.com/downloadJDK/openlogic-openjdk/11.0.11%2B9/openlogic-openjdk-11.0.11%2B9-linux-x64.tar.gz
RUN tar -xzf openjdk-11.tar.gz --one-top-level=openjdk-11 --strip-components 1 -C /usr/local
ENV JAVA_HOME=/usr/local/openjdk-11
# ENV SPARK_WORKLOAD=submit

ENV SPARK_VERSION=3.5.1 \
    HADOOP_VERSION=3 \
    SPARK_HOME=/opt/spark \
    PYTHONHASHSEED=1

RUN mkdir ${SPARK_HOME} && chown -R "${AIRFLOW_UID}:0" "${SPARK_HOME}"

USER airflow

# Download and uncompress spark from the apache archive
RUN wget --no-verbose -O apache-spark.tgz "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    && mkdir -p ${SPARK_HOME} \
    && tar -xf apache-spark.tgz -C ${SPARK_HOME} --strip-components=1 \
    && rm apache-spark.tgz

COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

USER root

COPY requirements_spark.txt /tmp/requirements_spark.txt
RUN cd /usr/local \
    && python -m venv pyspark_venv \
    && . pyspark_venv/bin/activate \
    && pip install -r /tmp/requirements_spark.txt

USER airflow

ENV PYSPARK_PYTHON=/usr/local/pyspark_venv/bin/python
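
The image now bundles OpenJDK 11, a Spark 3.5.1 distribution under /opt/spark, and a separate virtualenv at /usr/local/pyspark_venv that PYSPARK_PYTHON points to; per the comments in the requirements files below, that venv pins pandas and setuptools to the same versions as the Spark containers. A local build would look roughly like this (the Dockerfile path and image tag are assumptions, not part of the commit):

    docker build -t airflow-spark -f docker/airflow/Dockerfile docker/airflow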
Requirements for the Airflow image (duckdb and polars move into the commented-out block; the Spark provider and plyvel are added):
@@ -1,3 +1,10 @@
 -c https://raw.githubusercontent.com/apache/airflow/constraints-2.9.1/constraints-3.12.txt
-duckdb==0.10.2
-polars==0.20.26
+apache-airflow-providers-apache-spark==4.7.2
+plyvel==1.5.1
+
+# duckdb==0.10.2
+# polars==0.20.26
+# pyspark==3.5.1
+# apache-airflow-providers-slack
+# deltalake
+# delta-spark
requirements_spark.txt (installed into the pyspark_venv):
@@ -0,0 +1,3 @@
# requirements should be the same as `docker/spark/requirements.txt`
pandas==2.2.2
setuptools==65.5.0
docker/spark/requirements.txt:
@@ -0,0 +1,3 @@
# requirements should be the same as `docker/airflow/requirements_spark.txt`
pandas==2.2.2
setuptools==65.5.0
SSH client config:
@@ -0,0 +1,2 @@
Host *
StrictHostKeyChecking no
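
StrictHostKeyChecking no makes ssh accept unknown host keys without prompting, presumably so the containers can reach each other over SSH non-interactively; the commit does not show where this file is mounted.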
One of the changed files is empty.