diff --git a/.gitignore b/.gitignore
index 8b978ac..73e12bb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 .DS_Store
 *.DS_Store
+.idea
\ No newline at end of file
diff --git a/README.md b/README.md
index 8bbfbc6..ccefc64 100644
--- a/README.md
+++ b/README.md
@@ -1,50 +1,66 @@
-# Spark: The Definitive Guide
+# Spark: The Definitive Guide, 1st Edition (Korean edition)
-This is the central repository for all materials related to [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do) by Bill Chambers and Matei Zaharia.
+This repository contains the source code and sample data referenced by the Korean edition of "Spark: The Definitive Guide", published in Korean by Hanbit. The original book, [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do), was written by Bill Chambers and Matei Zaharia. The original code repository is [here](https://github.com/databricks/Spark-The-Definitive-Guide).
-*This repository is currently a work in progress and new material will be added over time.*
+Please keep the following in mind when using this repository.
+- Like the official code repository for the original book, the contents of the translated edition may change over time as reader feedback is incorporated.
+- Code that did not run correctly in the original book, or sections of code that needed adjustments for the translated text, may differ slightly from the original. The essential content has not been changed.
-![Spark: The Definitive Guide](https://images-na.ssl-images-amazon.com/images/I/51z7TzI-Y3L._SX379_BO1,204,203,200_.jpg)
+# Book cover
-# Code from the book
+![Spark: The Definitive Guide, 1st Edition](https://images-na.ssl-images-amazon.com/images/I/51z7TzI-Y3L._SX379_BO1,204,203,200_.jpg)
-You can find the code from the book in the `code` subfolder where it is broken down by language and chapter.
+# About the translators
+## 우성한 (panelion@gmail.com)
+우성한 is an engineer on R&D Team 2 at kt NexR, where he develops Lean Stream, a Spark-based real-time processing solution. He plans solutions built on big data components such as Spark, Kafka, and Hadoop, and works to introduce the still-unfamiliar field of real-time processing to companies and the public sector. He also took part in building KT's first big data system and used a range of big data open source projects to develop NDAP, kt NexR's big data batch-processing solution. Today he works on Lean Stream as a full-stack engineer, covering everything from big data architecture design to front-end and back-end development.
-# How to run the code
+## 이영호 (diesel.yh.lee@gmail.com)
+이영호 is an engineer on R&D Team 2 at kt NexR and leads the development team for Lean Stream, the Spark-based real-time processing solution. He plans Spark-based solutions, runs proofs of concept, and builds them together with an excellent team. He has built business systems for public-sector organizations such as the National Police Agency and the Small and Medium Business Administration, and has run 멤브로스, a Hadoop-based big data solution company. Since joining kt NexR he has implemented a number of projects that process telecom data in real time with Spark, and he is now working to build a team where happy developers want to stay.
-## Run on your local machine
+## 강재원 (jwon.kang3703@gmail.com)
+강재원 is a data scientist on the Data Science team at kt NexR, carrying out a variety of analytics projects on top of big data platforms. Recently he has been researching analytics methodologies and architectures that combine Hadoop with open source tools such as R, Python, and Spark. As an analytics consultant working with commercial solutions such as SAS and SPSS, he built analytics systems across domains including telecom, manufacturing, finance, and services. After joining kt NexR in 2013 he successfully delivered Korea's first big data analytics project in the financial sector, and he has been running open-source-based analytics projects ever since. He works to spread analytics methodologies suited to each domain so that companies can grow through data analysis.
-To run the example on your local machine, either pull all data in the `data` subfolder to `/data` on your computer or specify the path to that particular dataset on your local machine.
+# Code from the book
-## Run on Databricks
+The `code` folder in this repository contains the examples for each chapter of the book, organized into one file per language.
-To run these modules on Databricks, you're going to need to do two things.
+# How to run the code
-1. Sign up for an account. You can do that [here](https://databricks.com/try-databricks).
-2. Import individual Notebooks to run on the platform
+## Running locally
-Databricks is a zero-management cloud platform that provides:
+To run the examples locally, copy the sample data from the `data` folder to `/data` on your machine, or point each example at whatever path you prefer, as in the sketch below.
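A minimal sketch of what such a locally run read looks like, assuming the repository's flight-data CSV has been copied under `/data` and a `spark` session is already available (as in `spark-shell` or a notebook):

```scala
// Minimal sketch: read one of the book's sample datasets from the local /data folder.
// Assumes data/flight-data/csv/2010-summary.csv from this repository was copied to /data.
val flightData = spark.read
  .option("header", "true")       // the sample CSV files include a header row
  .option("inferSchema", "true")  // let Spark infer column types for this small file
  .csv("/data/flight-data/csv/2010-summary.csv")

flightData.show(5)  // quick sanity check that the path is correct
```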
-- Fully managed Spark clusters
-- An interactive workspace for exploration and visualization
-- A production pipeline scheduler
-- A platform for powering your favorite Spark-based applications
+## Running on Databricks Cloud
-### Instructions for importing
+Running the examples on Databricks takes two steps.
-1. Navigate to the notebook you would like to import
+1. Sign up on the [Databricks site](https://databricks.com/try-databricks).
+2. Import the individual notebook files you want to run.
-For instance, you might go to [this page](https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.py). Once you do that, you're going to need to navigate to the **RAW** version of the file and save that to your Desktop. You can do that by clicking the **Raw** button. *Alternatively, you could just clone the entire repository to your local desktop and navigate to the file on your computer*.
+Databricks is a managed cloud platform that provides:
+- A managed Spark cluster environment
+- Interactive data exploration and visualization
+- A production pipeline scheduler
+- A platform for your favorite Spark-based applications
-2. Upload that to Databricks
+### Importing a notebook
-Read [the instructions](https://docs.databricks.com/user-guide/notebooks/index.html#import-a-notebook) here. Simply open the Databricks workspace and go to import in a given directory. From there, navigate to the file on your computer to upload it. *Unfortunately due to a recent security upgrade, notebooks cannot be imported from external URLs. Therefore you must upload it from your computer*.
+1. **Pick the notebook file you want to import.**
+For example, open the [Python version of the Chapter 3 examples](https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.py), switch to the **Raw** view of the file, and save it to your desktop. Alternatively, you can clone this entire code repository to your machine with git.
-3. You're almost ready to go!
+2. **Upload the file to Databricks.**
+Read [these instructions](https://docs.databricks.com/user-guide/notebooks/index.html#import-a-notebook) on importing notebooks. Open your Databricks workspace, navigate to the directory you want to import into, and select the file to upload from your computer. **Unfortunately, due to a recently tightened security policy, notebooks cannot be imported from external URLs, so you must upload the file from your local machine.**
-Now you just need to simply run the notebooks! All the examples run on Databricks Runtime 3.1 and above so just be sure to create a cluster with a version equal to or greater than that. Once you've created your cluster, attach the notebook.
+3. **You're almost ready.**
+Now you just need to run the notebooks. All of the examples run on Databricks Runtime 3.1 or later, so create a cluster with runtime version 3.1 or higher. Once the cluster is created, attach the notebook to it.
-4. Replacing the data path in each notebook
+4. **Change the sample data path in each notebook.**
+Rather than uploading all of the sample data yourself, change every `/data` path that appears in each chapter's examples to `/databricks-datasets/definitive-guide/data` (see the example below). Once the path is changed, all of the examples run without major issues. Using find and replace makes this straightforward.
-Rather than you having to upload all of the data yourself, you simply have to change the path in each chapter from `/data` to `/databricks-datasets/definitive-guide/data`. Once you've done that, all examples should run without issue. You can use find and replace to do this very efficiently.
+## Running from the Docker image
+Although the original book does not provide one, the Korean edition additionally explains how to set up a local environment using a [Docker image](https://dockr.ly/2OYIbTK). With just a few commands you get a local Zeppelin notebook with the whole environment already prepared. Some of the example code included in the Docker image is commented out where necessary; uncomment it as needed.
+
+# Questions
+
+Please use the issue tab of this repository for any questions about the book.
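To make step 4 concrete, here is a small sketch of the same read before and after the find-and-replace; only the path prefix changes from `/data` to `/databricks-datasets/definitive-guide/data`. The file name is illustrative and assumes the repository's flight-data CSV layout:

```scala
// Local path as written in the book's examples:
val localDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/flight-data/csv/2010-summary.csv")

// Same read on Databricks after replacing /data with the hosted dataset prefix:
val databricksDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/definitive-guide/data/flight-data/csv/2010-summary.csv")
```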
\ No newline at end of file diff --git a/code/A_Gentle_Introduction_to_Spark-Chapter_1_Defining_Spark.scala "b/code/01\354\236\245_\354\225\204\355\214\214\354\271\230_\354\212\244\355\214\214\355\201\254\353\236\200.scala" similarity index 100% rename from code/A_Gentle_Introduction_to_Spark-Chapter_1_Defining_Spark.scala rename to "code/01\354\236\245_\354\225\204\355\214\214\354\271\230_\354\212\244\355\214\214\355\201\254\353\236\200.scala" diff --git a/code/A_Gentle_Introduction_to_Spark_Chapter_2_A_Gentle_Introduction_to_Spark.java "b/code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.java" similarity index 100% rename from code/A_Gentle_Introduction_to_Spark_Chapter_2_A_Gentle_Introduction_to_Spark.java rename to "code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.java" diff --git a/code/A_Gentle_Introduction_to_Spark-Chapter_2_A_Gentle_Introduction_to_Spark.py "b/code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.py" similarity index 100% rename from code/A_Gentle_Introduction_to_Spark-Chapter_2_A_Gentle_Introduction_to_Spark.py rename to "code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.py" diff --git a/code/A_Gentle_Introduction_to_Spark-Chapter_2_A_Gentle_Introduction_to_Spark.scala "b/code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.scala" similarity index 90% rename from code/A_Gentle_Introduction_to_Spark-Chapter_2_A_Gentle_Introduction_to_Spark.scala rename to "code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.scala" index d7477f1..79959ed 100644 --- a/code/A_Gentle_Introduction_to_Spark-Chapter_2_A_Gentle_Introduction_to_Spark.scala +++ "b/code/02\354\236\245_\354\212\244\355\214\214\355\201\254_\355\206\272\354\225\204\353\263\264\352\270\260.scala" @@ -3,13 +3,13 @@ spark // COMMAND ---------- -// in Scala +// 스칼라 버전 val myRange = spark.range(1000).toDF("number") // COMMAND ---------- -// in Scala +// 스칼라 버전 val divisBy2 = myRange.where("number % 2 = 0") @@ -20,7 +20,7 @@ divisBy2.count() // COMMAND ---------- -// in Scala +// 스칼라 버전 val flightData2015 = spark .read .option("inferSchema", "true") @@ -55,7 +55,7 @@ flightData2015.createOrReplaceTempView("flight_data_2015") // COMMAND ---------- -// in Scala +// 스칼라 버전 val sqlWay = spark.sql(""" SELECT DEST_COUNTRY_NAME, count(1) FROM flight_data_2015 @@ -63,7 +63,7 @@ GROUP BY DEST_COUNTRY_NAME """) val dataFrameWay = flightData2015 - .groupBy('DEST_COUNTRY_NAME) + .groupBy("DEST_COUNTRY_NAME") .count() sqlWay.explain @@ -77,7 +77,7 @@ spark.sql("SELECT max(count) from flight_data_2015").take(1) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.max flightData2015.select(max("count")).take(1) @@ -85,7 +85,7 @@ flightData2015.select(max("count")).take(1) // COMMAND ---------- -// in Scala +// 스칼라 버전 val maxSql = spark.sql(""" SELECT DEST_COUNTRY_NAME, sum(count) as destination_total FROM flight_data_2015 @@ -99,7 +99,7 @@ maxSql.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.desc flightData2015 @@ -113,7 +113,7 @@ flightData2015 // COMMAND ---------- -// in Scala +// 스칼라 버전 flightData2015 .groupBy("DEST_COUNTRY_NAME") .sum("count") diff --git a/code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.py 
"b/code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.py" similarity index 100% rename from code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.py rename to "code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.py" diff --git a/code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.r "b/code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.r" similarity index 100% rename from code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.r rename to "code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.r" diff --git a/code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.scala "b/code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.scala" similarity index 90% rename from code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.scala rename to "code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.scala" index c297982..72c661f 100644 --- a/code/A_Gentle_Introduction_to_Spark-Chapter_3_A_Tour_of_Sparks_Toolset.scala +++ "b/code/03\354\236\245_\354\212\244\355\214\214\355\201\254_\352\270\260\353\212\245_\353\221\230\353\237\254\353\263\264\352\270\260.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 import spark.implicits._ case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, @@ -10,7 +10,7 @@ val flights = flightsDF.as[Flight] // COMMAND ---------- -// in Scala +// 스칼라 버전 flights .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada") .map(flight_row => flight_row) @@ -24,7 +24,7 @@ flights // COMMAND ---------- -// in Scala +// 스칼라 버전 val staticDataFrame = spark.read.format("csv") .option("header", "true") .option("inferSchema", "true") @@ -36,7 +36,7 @@ val staticSchema = staticDataFrame.schema // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{window, column, desc, col} staticDataFrame .selectExpr( @@ -71,20 +71,20 @@ streamingDataFrame.isStreaming // returns true // COMMAND ---------- -// in Scala +// 스칼라 버전 val purchaseByCustomerPerHour = streamingDataFrame .selectExpr( "CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate") .groupBy( - $"CustomerId", window($"InvoiceDate", "1 day")) + col("CustomerId"), window(col("InvoiceDate"), "1 day")) .sum("total_cost") // COMMAND ---------- -// in Scala +// 스칼라 버전 purchaseByCustomerPerHour.writeStream .format("memory") // memory = store in-memory table .queryName("customer_purchases") // the name of the in-memory table @@ -94,7 +94,7 @@ purchaseByCustomerPerHour.writeStream // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.sql(""" SELECT * FROM customer_purchases @@ -110,7 +110,7 @@ staticDataFrame.printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.date_format val preppedDataFrame = staticDataFrame .na.fill(0) @@ -120,7 +120,7 @@ val preppedDataFrame = staticDataFrame // COMMAND ---------- -// in Scala +// 스칼라 버전 val trainDataFrame = preppedDataFrame .where("InvoiceDate < '2011-07-01'") val testDataFrame = preppedDataFrame @@ -135,7 +135,7 @@ testDataFrame.count() // COMMAND ---------- -// 
in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.StringIndexer val indexer = new StringIndexer() .setInputCol("day_of_week") @@ -144,7 +144,7 @@ val indexer = new StringIndexer() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.OneHotEncoder val encoder = new OneHotEncoder() .setInputCol("day_of_week_index") @@ -153,7 +153,7 @@ val encoder = new OneHotEncoder() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.VectorAssembler val vectorAssembler = new VectorAssembler() @@ -163,7 +163,7 @@ val vectorAssembler = new VectorAssembler() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.Pipeline val transformationPipeline = new Pipeline() @@ -172,13 +172,13 @@ val transformationPipeline = new Pipeline() // COMMAND ---------- -// in Scala +// 스칼라 버전 val fittedPipeline = transformationPipeline.fit(trainDataFrame) // COMMAND ---------- -// in Scala +// 스칼라 버전 val transformedTraining = fittedPipeline.transform(trainDataFrame) @@ -189,7 +189,7 @@ transformedTraining.cache() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.clustering.KMeans val kmeans = new KMeans() .setK(20) @@ -198,7 +198,7 @@ val kmeans = new KMeans() // COMMAND ---------- -// in Scala +// 스칼라 버전 val kmModel = kmeans.fit(transformedTraining) @@ -209,7 +209,7 @@ kmModel.computeCost(transformedTraining) // COMMAND ---------- -// in Scala +// 스칼라 버전 val transformedTest = fittedPipeline.transform(testDataFrame) @@ -220,7 +220,7 @@ kmModel.computeCost(transformedTest) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF() diff --git a/code/Structured_APIs_Chapter_4_Structured_API_Overview.java "b/code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.java" similarity index 100% rename from code/Structured_APIs_Chapter_4_Structured_API_Overview.java rename to "code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.java" diff --git a/code/Structured_APIs-Chapter_4_Structured_API_Overview.py "b/code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.py" similarity index 100% rename from code/Structured_APIs-Chapter_4_Structured_API_Overview.py rename to "code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.py" diff --git a/code/Structured_APIs-Chapter_4_Structured_API_Overview.scala "b/code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.scala" similarity index 85% rename from code/Structured_APIs-Chapter_4_Structured_API_Overview.scala rename to "code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.scala" index fe0c434..4991932 100644 --- a/code/Structured_APIs-Chapter_4_Structured_API_Overview.scala +++ "b/code/04\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\260\234\354\232\224.scala" @@ -1,11 +1,11 @@ -// in Scala +// 스칼라 버전 val df = spark.range(500).toDF("number") df.select(df.col("number") + 10) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.range(2).toDF().collect() diff --git a/code/Structured_APIs-Chapter_5_Basic_Structured_Operations.py "b/code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.py" similarity index 100% rename from code/Structured_APIs-Chapter_5_Basic_Structured_Operations.py rename to "code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.py" 
diff --git a/code/Structured_APIs-Chapter_5_Basic_Structured_Operations.scala "b/code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.scala" similarity index 88% rename from code/Structured_APIs-Chapter_5_Basic_Structured_Operations.scala rename to "code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.scala" index adbb4d9..7afbca7 100644 --- a/code/Structured_APIs-Chapter_5_Basic_Structured_Operations.scala +++ "b/code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val df = spark.read.format("json") .load("/data/flight-data/json/2015-summary.json") @@ -10,13 +10,13 @@ df.printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.format("json").load("/data/flight-data/json/2015-summary.json").schema // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType} import org.apache.spark.sql.types.Metadata @@ -33,7 +33,7 @@ val df = spark.read.format("json").schema(myManualSchema) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{col, column} col("someColumnName") column("someColumnName") @@ -41,7 +41,7 @@ column("someColumnName") // COMMAND ---------- -// in Scala +// 스칼라 버전 $"myColumn" 'myColumn @@ -58,7 +58,7 @@ df.col("count") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.expr expr("(((someCol + 5) * 200) - 6) < otherCol") @@ -76,14 +76,14 @@ df.first() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.Row val myRow = Row("Hello", null, 1, false) // COMMAND ---------- -// in Scala +// 스칼라 버전 myRow(0) // type Any myRow(0).asInstanceOf[String] // String myRow.getString(0) // String @@ -92,7 +92,7 @@ myRow.getInt(2) // Int // COMMAND ---------- -// in Scala +// 스칼라 버전 val df = spark.read.format("json") .load("/data/flight-data/json/2015-summary.json") df.createOrReplaceTempView("dfTable") @@ -100,7 +100,7 @@ df.createOrReplaceTempView("dfTable") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType} @@ -116,25 +116,25 @@ myDf.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3") // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select("DEST_COUNTRY_NAME").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{expr, col, column} df.select( df.col("DEST_COUNTRY_NAME"), @@ -148,26 +148,26 @@ df.select( // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME")) .show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.selectExpr( "*", // include all original columns "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry") @@ -176,26 +176,26 @@ df.selectExpr( // COMMAND ---------- -// in Scala +// 스칼라 버전 df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2) // COMMAND ---------- -// in 
Scala +// 스칼라 버전 import org.apache.spark.sql.functions.lit df.select(expr("*"), lit(1).as("One")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.withColumn("numberOne", lit(1)).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME")) .show(2) @@ -207,13 +207,13 @@ df.withColumn("Destination", expr("DEST_COUNTRY_NAME")).columns // COMMAND ---------- -// in Scala +// 스칼라 버전 df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.expr val dfWithLongColName = df.withColumn( @@ -223,7 +223,7 @@ val dfWithLongColName = df.withColumn( // COMMAND ---------- -// in Scala +// 스칼라 버전 dfWithLongColName.selectExpr( "`This Long Column-Name`", "`This Long Column-Name` as `new col`") @@ -237,7 +237,7 @@ dfWithLongColName.createOrReplaceTempView("dfTableLong") // COMMAND ---------- -// in Scala +// 스칼라 버전 dfWithLongColName.select(col("This Long Column-Name")).columns @@ -264,20 +264,20 @@ df.where("count < 2").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") =!= "Croatia") .show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count() // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select("ORIGIN_COUNTRY_NAME").distinct().count() @@ -291,14 +291,14 @@ df.sample(withReplacement, fraction, seed).count() // COMMAND ---------- -// in Scala +// 스칼라 버전 val dataFrames = df.randomSplit(Array(0.25, 0.75), seed) dataFrames(0).count() > dataFrames(1).count() // False // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.Row val schema = df.schema val newRows = Seq( @@ -315,7 +315,7 @@ df.union(newDF) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.sort("count").show(5) df.orderBy("count", "DEST_COUNTRY_NAME").show(5) df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5) @@ -323,7 +323,7 @@ df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{desc, asc} df.orderBy(expr("count desc")).show(2) df.orderBy(desc("count"), asc("DEST_COUNTRY_NAME")).show(2) @@ -331,56 +331,56 @@ df.orderBy(desc("count"), asc("DEST_COUNTRY_NAME")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.format("json").load("/data/flight-data/json/*-summary.json") .sortWithinPartitions("count") // COMMAND ---------- -// in Scala +// 스칼라 버전 df.limit(5).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 df.orderBy(expr("count desc")).limit(6).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 df.rdd.getNumPartitions // 1 // COMMAND ---------- -// in Scala +// 스칼라 버전 df.repartition(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.repartition(col("DEST_COUNTRY_NAME")) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.repartition(5, col("DEST_COUNTRY_NAME")) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 val collectDF = df.limit(10) collectDF.take(5) // take works with an Integer count collectDF.show() // this prints it out nicely diff --git a/code/Structured_APIs-Chapter_5_Basic_Structured_Operations.sql "b/code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.sql" similarity index 100% rename from code/Structured_APIs-Chapter_5_Basic_Structured_Operations.sql 
rename to "code/05\354\236\245_\352\265\254\354\241\260\354\240\201_API_\352\270\260\353\263\270_\354\227\260\354\202\260.sql" diff --git a/code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.py "b/code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.py" similarity index 95% rename from code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.py rename to "code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.py" index ed4168d..662f225 100644 --- a/code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.py +++ "b/code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.py" @@ -178,7 +178,7 @@ def color_locator(column, color_string): .cast("boolean")\ .alias("is_" + color_string) selectedColumns = [color_locator(df.Description, c) for c in simpleColors] -selectedColumns.append(expr("*")) # has to a be Column type +selectedColumns.append(expr("*")) # Column 타입이어야 합니다. df.select(*selectedColumns).where(expr("is_white OR is_red"))\ .select("Description").show(3, False) @@ -283,7 +283,7 @@ def color_locator(column, color_string): # COMMAND ---------- from pyspark.sql.functions import size -df.select(size(split(col("Description"), " "))).show(2) # shows 5 and 3 +df.select(size(split(col("Description"), " "))).show(2) # 5와 3 출력 # COMMAND ---------- @@ -310,13 +310,13 @@ def color_locator(column, color_string): # COMMAND ---------- -df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map"))\ +df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\ .selectExpr("complex_map['WHITE METAL LANTERN']").show(2) # COMMAND ---------- -df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map"))\ +df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\ .selectExpr("explode(complex_map)").show(2) @@ -331,7 +331,7 @@ def color_locator(column, color_string): from pyspark.sql.functions import get_json_object, json_tuple jsonDF.select( - get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]") as "column", + get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"), json_tuple(col("jsonString"), "myJSONKey")).show(2) @@ -377,7 +377,7 @@ def power3(double_value): # COMMAND ---------- udfExampleDF.selectExpr("power3(num)").show(2) -# registered in Scala +# 스칼라로 등록된 UDF 사용 # COMMAND ---------- @@ -389,7 +389,7 @@ def power3(double_value): # COMMAND ---------- udfExampleDF.selectExpr("power3py(num)").show(2) -# registered via Python +# 파이썬으로 등록된 UDF 사용 # COMMAND ---------- diff --git a/code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.scala "b/code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.scala" similarity index 90% rename from code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.scala rename to "code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.scala" index fef9d29..00f8718 100644 --- a/code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.scala +++ 
"b/code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val df = spark.read.format("csv") .option("header", "true") .option("inferSchema", "true") @@ -9,14 +9,14 @@ df.createOrReplaceTempView("dfTable") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.lit df.select(lit(5), lit("five"), lit(5.0)) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.col df.where(col("InvoiceNo").equalTo(536365)) .select("InvoiceNo", "Description") @@ -25,7 +25,7 @@ df.where(col("InvoiceNo").equalTo(536365)) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.col df.where(col("InvoiceNo") === 536365) .select("InvoiceNo", "Description") @@ -46,7 +46,7 @@ df.where("InvoiceNo <> 536365") // COMMAND ---------- -// in Scala +// 스칼라 버전 val priceFilter = col("UnitPrice") > 600 val descripFilter = col("Description").contains("POSTAGE") df.where(col("StockCode").isin("DOT")).where(priceFilter.or(descripFilter)) @@ -55,7 +55,7 @@ df.where(col("StockCode").isin("DOT")).where(priceFilter.or(descripFilter)) // COMMAND ---------- -// in Scala +// 스칼라 버전 val DOTCodeFilter = col("StockCode") === "DOT" val priceFilter = col("UnitPrice") > 600 val descripFilter = col("Description").contains("POSTAGE") @@ -66,7 +66,7 @@ df.withColumn("isExpensive", DOTCodeFilter.and(priceFilter.or(descripFilter))) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{expr, not, col} df.withColumn("isExpensive", not(col("UnitPrice").leq(250))) .filter("isExpensive") @@ -83,7 +83,7 @@ df.where(col("Description").eqNullSafe("hello")).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{expr, pow} val fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5 df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2) @@ -91,7 +91,7 @@ df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.selectExpr( "CustomerId", "(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2) @@ -99,21 +99,21 @@ df.selectExpr( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{round, bround} df.select(round(col("UnitPrice"), 1).alias("rounded"), col("UnitPrice")).show(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.lit df.select(round(lit("2.5")), bround(lit("2.5"))).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{corr} df.stat.corr("Quantity", "UnitPrice") df.select(corr("Quantity", "UnitPrice")).show() @@ -121,19 +121,19 @@ df.select(corr("Quantity", "UnitPrice")).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 df.describe().show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{count, mean, stddev_pop, min, max} // COMMAND ---------- -// in Scala +// 스칼라 버전 val colName = "UnitPrice" val quantileProbs = Array(0.5) val relError = 0.05 @@ -142,33 +142,33 @@ df.stat.approxQuantile("UnitPrice", quantileProbs, relError) // 2.51 // COMMAND ---------- -// in Scala +// 스칼라 버전 df.stat.crosstab("StockCode", "Quantity").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 df.stat.freqItems(Seq("StockCode", "Quantity")).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import 
org.apache.spark.sql.functions.monotonically_increasing_id df.select(monotonically_increasing_id()).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{initcap} df.select(initcap(col("Description"))).show(2, false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{lower, upper} df.select(col("Description"), lower(col("Description")), @@ -177,7 +177,7 @@ df.select(col("Description"), // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{lit, ltrim, rtrim, rpad, lpad, trim} df.select( ltrim(lit(" HELLO ")).as("ltrim"), @@ -189,7 +189,7 @@ df.select( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.regexp_replace val simpleColors = Seq("black", "white", "red", "green", "blue") val regexString = simpleColors.map(_.toUpperCase).mkString("|") @@ -201,7 +201,7 @@ df.select( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.translate df.select(translate(col("Description"), "LEET", "1337"), col("Description")) .show(2) @@ -209,7 +209,7 @@ df.select(translate(col("Description"), "LEET", "1337"), col("Description")) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.regexp_extract val regexString = simpleColors.map(_.toUpperCase).mkString("(", "|", ")") // the | signifies OR in regular expression syntax @@ -220,7 +220,7 @@ df.select( // COMMAND ---------- -// in Scala +// 스칼라 버전 val containsBlack = col("Description").contains("BLACK") val containsWhite = col("DESCRIPTION").contains("WHITE") df.withColumn("hasSimpleColor", containsBlack.or(containsWhite)) @@ -230,7 +230,7 @@ df.withColumn("hasSimpleColor", containsBlack.or(containsWhite)) // COMMAND ---------- -// in Scala +// 스칼라 버전 val simpleColors = Seq("black", "white", "red", "green", "blue") val selectedColumns = simpleColors.map(color => { col("Description").contains(color.toUpperCase).alias(s"is_$color") @@ -246,7 +246,7 @@ df.printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{current_date, current_timestamp} val dateDF = spark.range(10) .withColumn("today", current_date()) @@ -261,14 +261,14 @@ dateDF.printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{date_add, date_sub} dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{datediff, months_between, to_date} dateDF.withColumn("week_ago", date_sub(col("today"), 7)) .select(datediff(col("week_ago"), col("today"))).show(1) @@ -280,7 +280,7 @@ dateDF.select( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{to_date, lit} spark.range(5).withColumn("date", lit("2017-01-01")) .select(to_date(col("date"))).show(1) @@ -293,7 +293,7 @@ dateDF.select(to_date(lit("2016-20-12")),to_date(lit("2017-12-11"))).show(1) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.to_date val dateFormat = "yyyy-dd-MM" val cleanDateDF = spark.range(1).select( @@ -304,7 +304,7 @@ cleanDateDF.createOrReplaceTempView("dateTable2") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.to_timestamp cleanDateDF.select(to_timestamp(col("date"), dateFormat)).show() @@ -321,7 +321,7 @@ cleanDateDF.filter(col("date2") > "'2017-12-12'").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import 
org.apache.spark.sql.functions.coalesce df.select(coalesce(col("Description"), col("CustomerId"))).show() @@ -339,7 +339,7 @@ df.na.drop("all") // COMMAND ---------- -// in Scala +// 스칼라 버전 df.na.drop("all", Seq("StockCode", "InvoiceNo")) @@ -350,20 +350,20 @@ df.na.fill("All Null values become this string") // COMMAND ---------- -// in Scala +// 스칼라 버전 df.na.fill(5, Seq("StockCode", "InvoiceNo")) // COMMAND ---------- -// in Scala +// 스칼라 버전 val fillColValues = Map("StockCode" -> 5, "Description" -> "No Value") df.na.fill(fillColValues) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.na.replace("Description", Map("" -> "UNKNOWN")) @@ -379,7 +379,7 @@ df.selectExpr("struct(Description, InvoiceNo) as complex", "*") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.struct val complexDF = df.select(struct("Description", "InvoiceNo").alias("complex")) complexDF.createOrReplaceTempView("complexDF") @@ -398,35 +398,35 @@ complexDF.select("complex.*") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.split df.select(split(col("Description"), " ")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select(split(col("Description"), " ").alias("array_col")) .selectExpr("array_col[0]").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.size df.select(size(split(col("Description"), " "))).show(2) // shows 5 and 3 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.array_contains df.select(array_contains(split(col("Description"), " "), "WHITE")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{split, explode} df.withColumn("splitted", split(col("Description"), " ")) @@ -436,35 +436,35 @@ df.withColumn("splitted", split(col("Description"), " ")) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.map df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map")) .selectExpr("complex_map['WHITE METAL LANTERN']").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map")) .selectExpr("explode(complex_map)").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 val jsonDF = spark.range(1).selectExpr(""" '{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString""") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{get_json_object, json_tuple} jsonDF.select( get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]") as "column", @@ -474,12 +474,12 @@ jsonDF.select( // COMMAND ---------- jsonDF.selectExpr( - "json_tuple(jsonString, '$.myJSONKey.myJSONValue[1]') as column").show(2) - + "get_json_object(jsonString, '$.myJSONKey.myJSONValue[1]') as column", + "json_tuple(jsonString, 'myJSONKey')").show(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.to_json df.selectExpr("(InvoiceNo, Description) as myStruct") .select(to_json(col("myStruct"))) @@ -487,7 +487,7 @@ df.selectExpr("(InvoiceNo, Description) as myStruct") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.from_json import org.apache.spark.sql.types._ val parseSchema = new StructType(Array( @@ -500,7 +500,7 @@ df.selectExpr("(InvoiceNo, Description) as myStruct") // COMMAND ---------- -// in Scala +// 스칼라 버전 val 
udfExampleDF = spark.range(5).toDF("num") def power3(number:Double):Double = number * number * number power3(2.0) @@ -508,20 +508,20 @@ power3(2.0) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.udf val power3udf = udf(power3(_:Double):Double) // COMMAND ---------- -// in Scala +// 스칼라 버전 udfExampleDF.select(power3udf(col("num"))).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.udf.register("power3", power3(_:Double):Double) udfExampleDF.selectExpr("power3(num)").show(2) diff --git a/code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.sql "b/code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.sql" similarity index 100% rename from code/Structured_APIs-Chapter_6_Working_with_Different_Types_of_Data.sql rename to "code/06\354\236\245_\353\213\244\354\226\221\355\225\234_\353\215\260\354\235\264\355\204\260_\355\203\200\354\236\205_\353\213\244\353\243\250\352\270\260.sql" diff --git a/code/Structured_APIs-Chapter_7_Aggregations.py "b/code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.py" similarity index 100% rename from code/Structured_APIs-Chapter_7_Aggregations.py rename to "code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.py" diff --git a/code/Structured_APIs-Chapter_7_Aggregations.scala "b/code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.scala" similarity index 92% rename from code/Structured_APIs-Chapter_7_Aggregations.scala rename to "code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.scala" index 979b644..4b969f3 100644 --- a/code/Structured_APIs-Chapter_7_Aggregations.scala +++ "b/code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val df = spark.read.format("csv") .option("header", "true") .option("inferSchema", "true") @@ -15,56 +15,56 @@ df.count() == 541909 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.count df.select(count("StockCode")).show() // 541909 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.countDistinct df.select(countDistinct("StockCode")).show() // 4070 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.approx_count_distinct df.select(approx_count_distinct("StockCode", 0.1)).show() // 3364 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{first, last} df.select(first("StockCode"), last("StockCode")).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{min, max} df.select(min("Quantity"), max("Quantity")).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.sum df.select(sum("Quantity")).show() // 5176450 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.sumDistinct df.select(sumDistinct("Quantity")).show() // 29310 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{sum, count, avg, expr} df.select( @@ -80,7 +80,7 @@ df.select( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{var_pop, stddev_pop} import org.apache.spark.sql.functions.{var_samp, stddev_samp} df.select(var_pop("Quantity"), var_samp("Quantity"), @@ -95,7 +95,7 @@ df.select(skewness("Quantity"), kurtosis("Quantity")).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import 
org.apache.spark.sql.functions.{corr, covar_pop, covar_samp} df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"), covar_pop("InvoiceNo", "Quantity")).show() @@ -103,7 +103,7 @@ df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"), // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{collect_set, collect_list} df.agg(collect_set("Country"), collect_list("Country")).show() @@ -115,7 +115,7 @@ df.groupBy("InvoiceNo", "CustomerId").count().show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.count df.groupBy("InvoiceNo").agg( @@ -125,13 +125,13 @@ df.groupBy("InvoiceNo").agg( // COMMAND ---------- -// in Scala +// 스칼라 버전 df.groupBy("InvoiceNo").agg("Quantity"->"avg", "Quantity"->"stddev_pop").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{col, to_date} val dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy H:mm")) @@ -140,7 +140,7 @@ dfWithDate.createOrReplaceTempView("dfWithDate") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.col val windowSpec = Window @@ -157,7 +157,7 @@ val maxPurchaseQuantity = max(col("Quantity")).over(windowSpec) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{dense_rank, rank} val purchaseDenseRank = dense_rank().over(windowSpec) val purchaseRank = rank().over(windowSpec) @@ -165,7 +165,7 @@ val purchaseRank = rank().over(windowSpec) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.col dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId") @@ -180,7 +180,7 @@ dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId") // COMMAND ---------- -// in Scala +// 스칼라 버전 val dfNoNull = dfWithDate.drop() dfNoNull.createOrReplaceTempView("dfNoNull") @@ -205,24 +205,24 @@ rolledUpDF.where("Date IS NULL").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 dfNoNull.cube("Date", "Country").agg(sum(col("Quantity"))) .select("Date", "Country", "sum(Quantity)").orderBy("Date").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{grouping_id, sum, expr} dfNoNull.cube("customerId", "stockCode").agg(grouping_id(), sum("Quantity")) -.orderBy(expr("grouping_id()").desc) +.orderBy(col("grouping_id()").desc) .show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val pivoted = dfWithDate.groupBy("date").pivot("Country").sum() @@ -233,7 +233,7 @@ pivoted.where("date > '2011-12-05'").select("date" ,"`USA_sum(Quantity)`").show( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.expressions.MutableAggregationBuffer import org.apache.spark.sql.expressions.UserDefinedAggregateFunction import org.apache.spark.sql.Row @@ -263,7 +263,7 @@ class BoolAnd extends UserDefinedAggregateFunction { // COMMAND ---------- -// in Scala +// 스칼라 버전 val ba = new BoolAnd spark.udf.register("booland", ba) import org.apache.spark.sql.functions._ diff --git a/code/Structured_APIs-Chapter_7_Aggregations.sql "b/code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.sql" similarity index 100% rename from code/Structured_APIs-Chapter_7_Aggregations.sql rename to "code/07\354\236\245_\354\247\221\352\263\204_\354\227\260\354\202\260.sql" diff --git a/code/Structured_APIs-Chapter_8_Joins.py "b/code/08\354\236\245_\354\241\260\354\235\270.py" similarity index 100% rename from 
code/Structured_APIs-Chapter_8_Joins.py rename to "code/08\354\236\245_\354\241\260\354\235\270.py" diff --git a/code/Structured_APIs-Chapter_8_Joins.scala "b/code/08\354\236\245_\354\241\260\354\235\270.scala" similarity index 95% rename from code/Structured_APIs-Chapter_8_Joins.scala rename to "code/08\354\236\245_\354\241\260\354\235\270.scala" index e2c40d8..b1574d3 100644 --- a/code/Structured_APIs-Chapter_8_Joins.scala +++ "b/code/08\354\236\245_\354\241\260\354\235\270.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val person = Seq( (0, "Bill Chambers", 0, Seq(100)), (1, "Matei Zaharia", 1, Seq(500, 250, 100)), @@ -25,13 +25,13 @@ sparkStatus.createOrReplaceTempView("sparkStatus") // COMMAND ---------- -// in Scala +// 스칼라 버전 val joinExpression = person.col("graduate_program") === graduateProgram.col("id") // COMMAND ---------- -// in Scala +// 스칼라 버전 val wrongJoinExpression = person.col("name") === graduateProgram.col("school") @@ -42,7 +42,7 @@ person.join(graduateProgram, joinExpression).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 var joinType = "inner" @@ -93,7 +93,7 @@ graduateProgram.join(person, joinExpression, joinType).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val gradProgram2 = graduateProgram.union(Seq( (0, "Masters", "Duplicated Row", "Duplicated School")).toDF()) @@ -148,6 +148,7 @@ person.join(gradProgramDupe, joinExpr).show() // COMMAND ---------- +// 중복된 "graduate_program" 컬럼으로 인해 오류 발생 person.join(gradProgramDupe, joinExpr).select("graduate_program").show() diff --git a/code/Structured_APIs-Chapter_8_Joins.sql "b/code/08\354\236\245_\354\241\260\354\235\270.sql" similarity index 100% rename from code/Structured_APIs-Chapter_8_Joins.sql rename to "code/08\354\236\245_\354\241\260\354\235\270.sql" diff --git a/code/Structured_APIs-Chapter_9_Data_Sources.py "b/code/09\354\236\245_\353\215\260\354\235\264\355\204\260\354\206\214\354\212\244.py" similarity index 98% rename from code/Structured_APIs-Chapter_9_Data_Sources.py rename to "code/09\354\236\245_\353\215\260\354\235\264\355\204\260\354\206\214\354\212\244.py" index 0c11005..24813c2 100644 --- a/code/Structured_APIs-Chapter_9_Data_Sources.py +++ "b/code/09\354\236\245_\353\215\260\354\235\264\355\204\260\354\206\214\354\212\244.py" @@ -42,7 +42,7 @@ # COMMAND ---------- -csvFile.write.format("orc").mode("overwrite").save("/tmp/my-json-file.orc") +csvFile.write.format("orc").mode("overwrite").save("/tmp/my-orc-file.orc") # COMMAND ---------- diff --git a/code/Structured_APIs-Chapter_9_Data_Sources.scala "b/code/09\354\236\245_\353\215\260\354\235\264\355\204\260\354\206\214\354\212\244.scala" similarity index 89% rename from code/Structured_APIs-Chapter_9_Data_Sources.scala rename to "code/09\354\236\245_\353\215\260\354\235\264\355\204\260\354\206\214\354\212\244.scala" index 8167dd0..44f2873 100644 --- a/code/Structured_APIs-Chapter_9_Data_Sources.scala +++ "b/code/09\354\236\245_\353\215\260\354\235\264\355\204\260\354\206\214\354\212\244.scala" @@ -1,10 +1,10 @@ -// in Scala +// 스칼라 버전 dataFrame.write // COMMAND ---------- -// in Scala +// 스칼라 버전 dataframe.write.format("csv") .option("mode", "OVERWRITE") .option("dateFormat", "yyyy-MM-dd") @@ -19,7 +19,7 @@ spark.read.format("csv") // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.format("csv") .option("header", "true") .option("mode", "FAILFAST") @@ -29,7 +29,7 @@ spark.read.format("csv") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType} val myManualSchema 
= new StructType(Array( new StructField("DEST_COUNTRY_NAME", StringType, true), @@ -46,7 +46,8 @@ spark.read.format("csv") // COMMAND ---------- -// in Scala +// 스칼라 버전 +// 데이터 타입이 맞지 않기 때문에 오류가 발생함. val myManualSchema = new StructType(Array( new StructField("DEST_COUNTRY_NAME", LongType, true), new StructField("ORIGIN_COUNTRY_NAME", LongType, true), @@ -62,7 +63,7 @@ spark.read.format("csv") // COMMAND ---------- -// in Scala +// 스칼라 버전 val csvFile = spark.read.format("csv") .option("header", "true").option("mode", "FAILFAST").schema(myManualSchema) .load("/data/flight-data/csv/2010-summary.csv") @@ -70,7 +71,7 @@ val csvFile = spark.read.format("csv") // COMMAND ---------- -// in Scala +// 스칼라 버전 csvFile.write.format("csv").mode("overwrite").option("sep", "\t") .save("/tmp/my-tsv-file.tsv") @@ -82,14 +83,14 @@ spark.read.format("json") // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.format("json").option("mode", "FAILFAST").schema(myManualSchema) .load("/data/flight-data/json/2010-summary.json").show(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 csvFile.write.format("json").mode("overwrite").save("/tmp/my-json-file.json") @@ -105,33 +106,33 @@ spark.read.format("parquet") // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.format("parquet") .load("/data/flight-data/parquet/2010-summary.parquet").show(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 csvFile.write.format("parquet").mode("overwrite") .save("/tmp/my-parquet-file.parquet") // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.format("orc").load("/data/flight-data/orc/2010-summary.orc").show(5) // COMMAND ---------- -// in Scala -csvFile.write.format("orc").mode("overwrite").save("/tmp/my-json-file.orc") +// 스칼라 버전 +csvFile.write.format("orc").mode("overwrite").save("/tmp/my-orc-file.orc") // COMMAND ---------- -// in Scala +// 스칼라 버전 val driver = "org.sqlite.JDBC" val path = "/data/flight-data/jdbc/my-sqlite.db" val url = s"jdbc:sqlite:/${path}" @@ -148,14 +149,14 @@ connection.close() // COMMAND ---------- -// in Scala +// 스칼라 버전 val dbDataFrame = spark.read.format("jdbc").option("url", url) .option("dbtable", tablename).option("driver", driver).load() // COMMAND ---------- -// in Scala +// 스칼라 버전 val pgDF = spark.read .format("jdbc") .option("driver", "org.postgresql.Driver") @@ -176,13 +177,13 @@ dbDataFrame.select("DEST_COUNTRY_NAME").distinct().explain // COMMAND ---------- -// in Scala +// 스칼라 버전 dbDataFrame.filter("DEST_COUNTRY_NAME in ('Anguilla', 'Sweden')").explain // COMMAND ---------- -// in Scala +// 스칼라 버전 val pushdownQuery = """(SELECT DISTINCT(DEST_COUNTRY_NAME) FROM flight_info) AS flight_info""" val dbDataFrame = spark.read.format("jdbc") @@ -197,7 +198,7 @@ dbDataFrame.explain() // COMMAND ---------- -// in Scala +// 스칼라 버전 val dbDataFrame = spark.read.format("jdbc") .option("url", url).option("dbtable", tablename).option("driver", driver) .option("numPartitions", 10).load() @@ -210,7 +211,7 @@ dbDataFrame.select("DEST_COUNTRY_NAME").distinct().show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val props = new java.util.Properties props.setProperty("driver", "org.sqlite.JDBC") val predicates = Array( @@ -222,7 +223,7 @@ spark.read.jdbc(url, tablename, predicates, props).rdd.getNumPartitions // 2 // COMMAND ---------- -// in Scala +// 스칼라 버전 val props = new java.util.Properties props.setProperty("driver", "org.sqlite.JDBC") val predicates = Array( @@ -233,7 +234,7 @@ spark.read.jdbc(url, tablename, predicates, props).count() // 510 // COMMAND ---------- -// in Scala +// 스칼라 버전 val 
colName = "count" val lowerBound = 0L val upperBound = 348113L // this is the max count in our database @@ -242,33 +243,33 @@ val numPartitions = 10 // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.jdbc(url,tablename,colName,lowerBound,upperBound,numPartitions,props) .count() // 255 // COMMAND ---------- -// in Scala +// 스칼라 버전 val newPath = "jdbc:sqlite://tmp/my-sqlite.db" csvFile.write.mode("overwrite").jdbc(newPath, tablename, props) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.jdbc(newPath, tablename, props).count() // 255 // COMMAND ---------- -// in Scala +// 스칼라 버전 csvFile.write.mode("append").jdbc(newPath, tablename, props) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.jdbc(newPath, tablename, props).count() // 765 @@ -285,14 +286,14 @@ csvFile.select("DEST_COUNTRY_NAME").write.text("/tmp/simple-text-file.txt") // COMMAND ---------- -// in Scala +// 스칼라 버전 csvFile.limit(10).select("DEST_COUNTRY_NAME", "count") .write.partitionBy("count").text("/tmp/five-csv-files2.csv") // COMMAND ---------- -// in Scala +// 스칼라 버전 csvFile.limit(10).write.mode("overwrite").partitionBy("DEST_COUNTRY_NAME") .save("/tmp/partitioned-files.parquet") diff --git a/code/Structured_APIs-Chapter_10_Spark_SQL.py "b/code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.py" similarity index 100% rename from code/Structured_APIs-Chapter_10_Spark_SQL.py rename to "code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.py" diff --git a/code/Structured_APIs-Chapter_10_Spark_SQL.scala "b/code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.scala" similarity index 97% rename from code/Structured_APIs-Chapter_10_Spark_SQL.scala rename to "code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.scala" index 68866b7..688057f 100644 --- a/code/Structured_APIs-Chapter_10_Spark_SQL.scala +++ "b/code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.scala" @@ -3,7 +3,7 @@ spark.sql("SELECT 1 + 1").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.read.json("/data/flight-data/json/2015-summary.json") .createOrReplaceTempView("some_sql_view") // DF => SQL diff --git a/code/Structured_APIs-Chapter_10_Spark_SQL.sql "b/code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.sql" similarity index 96% rename from code/Structured_APIs-Chapter_10_Spark_SQL.sql rename to "code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.sql" index 8367588..13fd1f5 100644 --- a/code/Structured_APIs-Chapter_10_Spark_SQL.sql +++ "b/code/10\354\236\245_\354\212\244\355\214\214\355\201\254_SQL.sql" @@ -165,7 +165,11 @@ USE some_db SHOW tables -SELECT * FROM flights -- fails with table/view not found + +-- COMMAND ---------- + +-- 현재 데이터베이스에 flights 테이블이 없기 때문에 오류 발생 +SELECT * FROM flights -- COMMAND ---------- @@ -305,6 +309,7 @@ SHOW FUNCTIONS LIKE "collect*"; -- COMMAND ---------- +-- 실행 전, 스칼라 언어를 이용해 power3 메소드를 등록해야 함. 
SELECT count, power3(count) FROM flights diff --git a/code/Structured_APIs-Chapter_11_Datasets.scala "b/code/11\354\236\245_Dataset.scala" similarity index 100% rename from code/Structured_APIs-Chapter_11_Datasets.scala rename to "code/11\354\236\245_Dataset.scala" diff --git a/code/Low_Level_APIs-Chapter_12_RDD_Basics.py "b/code/12\354\236\245_RDD.py" similarity index 100% rename from code/Low_Level_APIs-Chapter_12_RDD_Basics.py rename to "code/12\354\236\245_RDD.py" diff --git a/code/Low_Level_APIs-Chapter_12_RDD_Basics.scala "b/code/12\354\236\245_RDD.scala" similarity index 89% rename from code/Low_Level_APIs-Chapter_12_RDD_Basics.scala rename to "code/12\354\236\245_RDD.scala" index ad8e151..01d09c8 100644 --- a/code/Low_Level_APIs-Chapter_12_RDD_Basics.scala +++ "b/code/12\354\236\245_RDD.scala" @@ -3,25 +3,25 @@ spark.sparkContext // COMMAND ---------- -// in Scala: converts a Dataset[Long] to RDD[Long] +// 스칼라 버전: converts a Dataset[Long] to RDD[Long] spark.range(500).rdd // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.range(10).toDF().rdd.map(rowObject => rowObject.getLong(0)) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.range(10).rdd.toDF() // COMMAND ---------- -// in Scala +// 스칼라 버전 val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple" .split(" ") val words = spark.sparkContext.parallelize(myCollection, 2) @@ -29,7 +29,7 @@ val words = spark.sparkContext.parallelize(myCollection, 2) // COMMAND ---------- -// in Scala +// 스칼라 버전 words.setName("myWords") words.name // myWords @@ -51,7 +51,7 @@ words.distinct().count() // COMMAND ---------- -// in Scala +// 스칼라 버전 def startsWithS(individual:String) = { individual.startsWith("S") } @@ -59,49 +59,49 @@ def startsWithS(individual:String) = { // COMMAND ---------- -// in Scala +// 스칼라 버전 words.filter(word => startsWithS(word)).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 val words2 = words.map(word => (word, word(0), word.startsWith("S"))) // COMMAND ---------- -// in Scala +// 스칼라 버전 words2.filter(record => record._3).take(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 words.flatMap(word => word.toSeq).take(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 words.sortBy(word => word.length() * -1).take(2) // COMMAND ---------- -// in Scala +// 스칼라 버전 val fiftyFiftySplit = words.randomSplit(Array[Double](0.5, 0.5)) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.sparkContext.parallelize(1 to 20).reduce(_ + _) // 210 // COMMAND ---------- -// in Scala +// 스칼라 버전 def wordLengthReducer(leftWord:String, rightWord:String): String = { if (leftWord.length > rightWord.length) return leftWord @@ -173,7 +173,7 @@ words.saveAsTextFile("file:/tmp/bookTitle") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.hadoop.io.compress.BZip2Codec words.saveAsTextFile("file:/tmp/bookTitleCompressed", classOf[BZip2Codec]) @@ -190,7 +190,7 @@ words.cache() // COMMAND ---------- -// in Scala +// 스칼라 버전 words.getStorageLevel @@ -207,13 +207,13 @@ words.pipe("wc -l").collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 words.mapPartitions(part => Iterator[Int](1)).sum() // 2 // COMMAND ---------- -// in Scala +// 스칼라 버전 def indexedFunc(partitionIndex:Int, withinPartIterator: Iterator[String]) = { withinPartIterator.toList.map( value => s"Partition: $partitionIndex => $value").iterator @@ -237,7 +237,7 @@ words.foreachPartition { iter => // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.sparkContext.parallelize(Seq("Hello", "World"), 2).glom().collect() // Array(Array(Hello), Array(World)) 
diff --git a/code/Low_Level_APIs-Chapter_13_Advanced_RDDs.py "b/code/13\354\236\245_\352\263\240\352\270\211_RDD_\352\260\234\353\205\220.py" similarity index 93% rename from code/Low_Level_APIs-Chapter_13_Advanced_RDDs.py rename to "code/13\354\236\245_\352\263\240\352\270\211_RDD_\352\260\234\353\205\220.py" index b4bd883..90b33a4 100644 --- a/code/Low_Level_APIs-Chapter_13_Advanced_RDDs.py +++ "b/code/13\354\236\245_\352\263\240\352\270\211_RDD_\352\260\234\353\205\220.py" @@ -59,7 +59,7 @@ def addFunc(left, right): KVcharacters.groupByKey().map(lambda row: (row[0], reduce(addFunc, row[1])))\ .collect() -# note this is Python 2, reduce must be imported from functools in Python 3 +# 이 코드는 파이썬2 기준으로 되어 있습니다. 파이썬 3을 사용하는 경우, functools에서 reduce를 임포트 해야 합니다.(역자주: from functools import reduce 구문을 사용합니다.) # COMMAND ---------- diff --git a/code/Low_Level_APIs-Chapter_13_Advanced_RDDs.scala "b/code/13\354\236\245_\352\263\240\352\270\211_RDD_\352\260\234\353\205\220.scala" similarity index 90% rename from code/Low_Level_APIs-Chapter_13_Advanced_RDDs.scala rename to "code/13\354\236\245_\352\263\240\352\270\211_RDD_\352\260\234\353\205\220.scala" index e050615..441e27f 100644 --- a/code/Low_Level_APIs-Chapter_13_Advanced_RDDs.scala +++ "b/code/13\354\236\245_\352\263\240\352\270\211_RDD_\352\260\234\353\205\220.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple" .split(" ") val words = spark.sparkContext.parallelize(myCollection, 2) @@ -6,31 +6,31 @@ val words = spark.sparkContext.parallelize(myCollection, 2) // COMMAND ---------- -// in Scala +// 스칼라 버전 words.map(word => (word.toLowerCase, 1)) // COMMAND ---------- -// in Scala +// 스칼라 버전 val keyword = words.keyBy(word => word.toLowerCase.toSeq(0).toString) // COMMAND ---------- -// in Scala +// 스칼라 버전 keyword.mapValues(word => word.toUpperCase).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 keyword.flatMapValues(word => word.toUpperCase).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 keyword.keys.collect() keyword.values.collect() @@ -42,7 +42,7 @@ keyword.lookup("s") // COMMAND ---------- -// in Scala +// 스칼라 버전 val distinctChars = words.flatMap(word => word.toLowerCase.toSeq).distinct .collect() import scala.util.Random @@ -54,14 +54,14 @@ words.map(word => (word.toLowerCase.toSeq(0), word)) // COMMAND ---------- -// in Scala +// 스칼라 버전 words.map(word => (word.toLowerCase.toSeq(0), word)) .sampleByKeyExact(true, sampleMap, 6L).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 val chars = words.flatMap(word => word.toLowerCase.toSeq) val KVcharacters = chars.map(letter => (letter, 1)) def maxFunc(left:Int, right:Int) = math.max(left, right) @@ -71,7 +71,7 @@ val nums = sc.parallelize(1 to 30, 5) // COMMAND ---------- -// in Scala +// 스칼라 버전 val timeout = 1000L //milliseconds val confidence = 0.95 KVcharacters.countByKey() @@ -80,7 +80,7 @@ KVcharacters.countByKeyApprox(timeout, confidence) // COMMAND ---------- -// in Scala +// 스칼라 버전 KVcharacters.groupByKey().map(row => (row._1, row._2.reduce(addFunc))).collect() @@ -91,26 +91,26 @@ KVcharacters.reduceByKey(addFunc).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 nums.aggregate(0)(maxFunc, addFunc) // COMMAND ---------- -// in Scala +// 스칼라 버전 val depth = 3 nums.treeAggregate(0)(maxFunc, addFunc, depth) // COMMAND ---------- -// in Scala +// 스칼라 버전 KVcharacters.aggregateByKey(0)(addFunc, maxFunc).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 val valToCombiner = (value:Int) => 
List(value) val mergeValuesFunc = (vals:List[Int], valToAppend:Int) => valToAppend :: vals val mergeCombinerFunc = (vals1:List[Int], vals2:List[Int]) => vals1 ::: vals2 @@ -127,13 +127,13 @@ KVcharacters // COMMAND ---------- -// in Scala +// 스칼라 버전 KVcharacters.foldByKey(0)(addFunc).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 import scala.util.Random val distinctChars = words.flatMap(word => word.toLowerCase.toSeq).distinct val charRDD = distinctChars.map(c => (c, new Random().nextDouble())) @@ -144,7 +144,7 @@ charRDD.cogroup(charRDD2, charRDD3).take(5) // COMMAND ---------- -// in Scala +// 스칼라 버전 val keyedChars = distinctChars.map(c => (c, new Random().nextDouble())) val outputPartitions = 10 KVcharacters.join(keyedChars).count() @@ -153,14 +153,14 @@ KVcharacters.join(keyedChars, outputPartitions).count() // COMMAND ---------- -// in Scala +// 스칼라 버전 val numRange = sc.parallelize(0 to 9, 2) words.zip(numRange).collect() // COMMAND ---------- -// in Scala +// 스칼라 버전 words.coalesce(1).getNumPartitions // 1 @@ -171,7 +171,7 @@ words.repartition(10) // gives us 10 partitions // COMMAND ---------- -// in Scala +// 스칼라 버전 val df = spark.read.option("header", "true").option("inferSchema", "true") .csv("/data/retail-data/all/") val rdd = df.coalesce(10).rdd @@ -184,7 +184,7 @@ df.printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.HashPartitioner rdd.map(r => r(6)).take(5).foreach(println) val keyedRDD = rdd.keyBy(row => row(6).asInstanceOf[Int].toDouble) @@ -197,7 +197,7 @@ keyedRDD.partitionBy(new HashPartitioner(10)).take(10) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.Partitioner class DomainPartitioner extends Partitioner { def numPartitions = 3 @@ -218,7 +218,7 @@ keyedRDD // COMMAND ---------- -// in Scala +// 스칼라 버전 class SomeClass extends Serializable { var someValue = 0 def setSomeValue(i:Int) = { @@ -232,7 +232,7 @@ sc.parallelize(1 to 10).map(num => new SomeClass().setSomeValue(num)) // COMMAND ---------- -// in Scala +// 스칼라 버전 val conf = new SparkConf().setMaster(...).setAppName(...) 
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) diff --git a/code/Low_Level_APIs-Chapter_14_Distributed_Variables.py "b/code/14\354\236\245_\353\266\204\354\202\260\355\230\225_\352\263\265\354\234\240_\353\263\200\354\210\230.py" similarity index 100% rename from code/Low_Level_APIs-Chapter_14_Distributed_Variables.py rename to "code/14\354\236\245_\353\266\204\354\202\260\355\230\225_\352\263\265\354\234\240_\353\263\200\354\210\230.py" diff --git a/code/Low_Level_APIs-Chapter_14_Distributed_Variables.scala "b/code/14\354\236\245_\353\266\204\354\202\260\355\230\225_\352\263\265\354\234\240_\353\263\200\354\210\230.scala" similarity index 90% rename from code/Low_Level_APIs-Chapter_14_Distributed_Variables.scala rename to "code/14\354\236\245_\353\266\204\354\202\260\355\230\225_\352\263\265\354\234\240_\353\263\200\354\210\230.scala" index 03ebd88..e8faef9 100644 --- a/code/Low_Level_APIs-Chapter_14_Distributed_Variables.scala +++ "b/code/14\354\236\245_\353\266\204\354\202\260\355\230\225_\352\263\265\354\234\240_\353\263\200\354\210\230.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple" .split(" ") val words = spark.sparkContext.parallelize(myCollection, 2) @@ -6,26 +6,26 @@ val words = spark.sparkContext.parallelize(myCollection, 2) // COMMAND ---------- -// in Scala +// 스칼라 버전 val supplementalData = Map("Spark" -> 1000, "Definitive" -> 200, "Big" -> -300, "Simple" -> 100) // COMMAND ---------- -// in Scala +// 스칼라 버전 val suppBroadcast = spark.sparkContext.broadcast(supplementalData) // COMMAND ---------- -// in Scala +// 스칼라 버전 suppBroadcast.value // COMMAND ---------- -// in Scala +// 스칼라 버전 words.map(word => (word, suppBroadcast.value.getOrElse(word, 0))) .sortBy(wordPair => wordPair._2) .collect() @@ -33,7 +33,7 @@ words.map(word => (word, suppBroadcast.value.getOrElse(word, 0))) // COMMAND ---------- -// in Scala +// 스칼라 버전 case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt) val flights = spark.read @@ -43,7 +43,7 @@ val flights = spark.read // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.util.LongAccumulator val accUnnamed = new LongAccumulator val acc = spark.sparkContext.register(accUnnamed) @@ -51,7 +51,7 @@ val acc = spark.sparkContext.register(accUnnamed) // COMMAND ---------- -// in Scala +// 스칼라 버전 val accChina = new LongAccumulator val accChina2 = spark.sparkContext.longAccumulator("China") spark.sparkContext.register(accChina, "China") @@ -59,7 +59,7 @@ spark.sparkContext.register(accChina, "China") // COMMAND ---------- -// in Scala +// 스칼라 버전 def accChinaFunc(flight_row: Flight) = { val destination = flight_row.DEST_COUNTRY_NAME val origin = flight_row.ORIGIN_COUNTRY_NAME @@ -74,19 +74,19 @@ def accChinaFunc(flight_row: Flight) = { // COMMAND ---------- -// in Scala +// 스칼라 버전 flights.foreach(flight_row => accChinaFunc(flight_row)) // COMMAND ---------- -// in Scala +// 스칼라 버전 accChina.value // 953 // COMMAND ---------- -// in Scala +// 스칼라 버전 import scala.collection.mutable.ArrayBuffer import org.apache.spark.util.AccumulatorV2 @@ -121,7 +121,7 @@ val newAcc = sc.register(acc, "evenAcc") // COMMAND ---------- -// in Scala +// 스칼라 버전 acc.value // 0 flights.foreach(flight_row => acc.add(flight_row.count)) acc.value // 31390 diff --git a/code/Production_Applications-Chapter_15_How_Spark_Runs_on_a_Cluster.py 
"b/code/15\354\236\245_\355\201\264\353\237\254\354\212\244\355\204\260\354\227\220\354\204\234_\354\212\244\355\214\214\355\201\254_\354\213\244\355\226\211\355\225\230\352\270\260.py" similarity index 96% rename from code/Production_Applications-Chapter_15_How_Spark_Runs_on_a_Cluster.py rename to "code/15\354\236\245_\355\201\264\353\237\254\354\212\244\355\204\260\354\227\220\354\204\234_\354\212\244\355\214\214\355\201\254_\354\213\244\355\226\211\355\225\230\352\270\260.py" index 1f83526..3389601 100644 --- a/code/Production_Applications-Chapter_15_How_Spark_Runs_on_a_Cluster.py +++ "b/code/15\354\236\245_\355\201\264\353\237\254\354\212\244\355\204\260\354\227\220\354\204\234_\354\212\244\355\214\214\355\201\254_\354\213\244\355\226\211\355\225\230\352\270\260.py" @@ -20,3 +20,5 @@ # COMMAND ---------- +step4.explain() + diff --git a/code/Production_Applications-Chapter_15_How_Spark_Runs_on_a_Cluster.scala "b/code/15\354\236\245_\355\201\264\353\237\254\354\212\244\355\204\260\354\227\220\354\204\234_\354\212\244\355\214\214\355\201\254_\354\213\244\355\226\211\355\225\230\352\270\260.scala" similarity index 88% rename from code/Production_Applications-Chapter_15_How_Spark_Runs_on_a_Cluster.scala rename to "code/15\354\236\245_\355\201\264\353\237\254\354\212\244\355\204\260\354\227\220\354\204\234_\354\212\244\355\214\214\355\201\254_\354\213\244\355\226\211\355\225\230\352\270\260.scala" index 8ef7fb9..4ee7fee 100644 --- a/code/Production_Applications-Chapter_15_How_Spark_Runs_on_a_Cluster.scala +++ "b/code/15\354\236\245_\355\201\264\353\237\254\354\212\244\355\204\260\354\227\220\354\204\234_\354\212\244\355\214\214\355\201\254_\354\213\244\355\226\211\355\225\230\352\270\260.scala" @@ -7,16 +7,10 @@ val spark = SparkSession.builder().appName("Databricks Spark Example") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.SparkContext val sc = SparkContext.getOrCreate() - -// COMMAND ---------- - -step4.explain() - - // COMMAND ---------- spark.conf.set("spark.sql.shuffle.partitions", 50) diff --git a/code/Production_Applications-Chapter_16_Spark_Applications.py "b/code/16\354\236\245_\354\212\244\355\214\214\355\201\254_\354\225\240\355\224\214\353\246\254\354\274\200\354\235\264\354\205\230_\352\260\234\353\260\234\355\225\230\352\270\260.py" similarity index 100% rename from code/Production_Applications-Chapter_16_Spark_Applications.py rename to "code/16\354\236\245_\354\212\244\355\214\214\355\201\254_\354\225\240\355\224\214\353\246\254\354\274\200\354\235\264\354\205\230_\352\260\234\353\260\234\355\225\230\352\270\260.py" diff --git a/code/Production_Applications-Chapter_16_Spark_Applications.scala "b/code/16\354\236\245_\354\212\244\355\214\214\355\201\254_\354\225\240\355\224\214\353\246\254\354\274\200\354\235\264\354\205\230_\352\260\234\353\260\234\355\225\230\352\270\260.scala" similarity index 89% rename from code/Production_Applications-Chapter_16_Spark_Applications.scala rename to "code/16\354\236\245_\354\212\244\355\214\214\355\201\254_\354\225\240\355\224\214\353\246\254\354\274\200\354\235\264\354\205\230_\352\260\234\353\260\234\355\225\230\352\270\260.scala" index 194a8fd..3bd7bab 100644 --- a/code/Production_Applications-Chapter_16_Spark_Applications.scala +++ "b/code/16\354\236\245_\354\212\244\355\214\214\355\201\254_\354\225\240\355\224\214\353\246\254\354\274\200\354\235\264\354\205\230_\352\260\234\353\260\234\355\225\230\352\270\260.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 import org.apache.spark.SparkConf val conf = new 
SparkConf().setMaster("local[2]").setAppName("DefinitiveGuide") .set("some.conf", "to.some.value") diff --git a/code/Production_Applications-Chapter_17_Deploying_Spark.scala "b/code/17\354\236\245_\354\212\244\355\214\214\355\201\254_\353\260\260\355\217\254_\355\231\230\352\262\275.scala" similarity index 92% rename from code/Production_Applications-Chapter_17_Deploying_Spark.scala rename to "code/17\354\236\245_\354\212\244\355\214\214\355\201\254_\353\260\260\355\217\254_\355\231\230\352\262\275.scala" index 4f610a6..280c838 100644 --- a/code/Production_Applications-Chapter_17_Deploying_Spark.scala +++ "b/code/17\354\236\245_\354\212\244\355\214\214\355\201\254_\353\260\260\355\217\254_\355\231\230\352\262\275.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 import org.apache.spark.sql.SparkSession val spark = SparkSession.builder .master("mesos://HOST:5050") diff --git a/code/Production_Applications-Chapter_18_Monitoring_and_Debugging.py "b/code/18\354\236\245_\353\252\250\353\213\210\355\204\260\353\247\201\352\263\274_\353\224\224\353\262\204\352\271\205.py" similarity index 100% rename from code/Production_Applications-Chapter_18_Monitoring_and_Debugging.py rename to "code/18\354\236\245_\353\252\250\353\213\210\355\204\260\353\247\201\352\263\274_\353\224\224\353\262\204\352\271\205.py" diff --git a/code/Production_Applications-Chapter_18_Monitoring_and_Debugging.scala "b/code/18\354\236\245_\353\252\250\353\213\210\355\204\260\353\247\201\352\263\274_\353\224\224\353\262\204\352\271\205.scala" similarity index 100% rename from code/Production_Applications-Chapter_18_Monitoring_and_Debugging.scala rename to "code/18\354\236\245_\353\252\250\353\213\210\355\204\260\353\247\201\352\263\274_\353\224\224\353\262\204\352\271\205.scala" diff --git a/code/Production_Applications-Chapter_19_Performance_Tuning.py "b/code/19\354\236\245_\354\204\261\353\212\245_\355\212\234\353\213\235.py" similarity index 90% rename from code/Production_Applications-Chapter_19_Performance_Tuning.py rename to "code/19\354\236\245_\354\204\261\353\212\245_\355\212\234\353\213\235.py" index 5027983..1a5096c 100644 --- a/code/Production_Applications-Chapter_19_Performance_Tuning.py +++ "b/code/19\354\236\245_\354\204\261\353\212\245_\355\212\234\353\213\235.py" @@ -1,4 +1,4 @@ -# Original loading code that does *not* cache DataFrame +# DataFrame을 캐싱하지 않는 원본 코드 DF1 = spark.read.format("csv")\ .option("inferSchema", "true")\ .option("header", "true")\ diff --git a/code/Streaming-Chapter_21_Structured_Streaming_Basics.py "b/code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.py" similarity index 100% rename from code/Streaming-Chapter_21_Structured_Streaming_Basics.py rename to "code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.py" diff --git a/code/Streaming-Chapter_21_Structured_Streaming_Basics.scala "b/code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.scala" similarity index 89% rename from code/Streaming-Chapter_21_Structured_Streaming_Basics.scala rename to "code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.scala" index 45b23d6..013fabe 100644 --- a/code/Streaming-Chapter_21_Structured_Streaming_Basics.scala +++ 
"b/code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.scala" @@ -1,18 +1,18 @@ -// in Scala +// 스칼라 버전 val static = spark.read.json("/data/activity-data/") val dataSchema = static.schema // COMMAND ---------- -// in Scala +// 스칼라 버전 val streaming = spark.readStream.schema(dataSchema) .option("maxFilesPerTrigger", 1).json("/data/activity-data") // COMMAND ---------- -// in Scala +// 스칼라 버전 val activityCounts = streaming.groupBy("gt").count() @@ -23,7 +23,7 @@ spark.conf.set("spark.sql.shuffle.partitions", 5) // COMMAND ---------- -// in Scala +// 스칼라 버전 val activityQuery = activityCounts.writeStream.queryName("activity_counts") .format("memory").outputMode("complete") .start() @@ -41,7 +41,7 @@ spark.streams.active // COMMAND ---------- -// in Scala +// 스칼라 버전 for( i <- 1 to 5 ) { spark.sql("SELECT * FROM activity_counts").show() Thread.sleep(1000) @@ -50,7 +50,7 @@ for( i <- 1 to 5 ) { // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.expr val simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'")) .where("stairs") @@ -65,7 +65,7 @@ val simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'")) // COMMAND ---------- -// in Scala +// 스칼라 버전 val deviceModelStats = streaming.cube("gt", "model").avg() .drop("avg(Arrival_time)") .drop("avg(Creation_Time)") @@ -76,7 +76,7 @@ val deviceModelStats = streaming.cube("gt", "model").avg() // COMMAND ---------- -// in Scala +// 스칼라 버전 val historicalAgg = static.groupBy("gt", "model").avg() val deviceModelStats = streaming.drop("Arrival_Time", "Creation_Time", "Index") .cube("gt", "model").avg() @@ -87,7 +87,7 @@ val deviceModelStats = streaming.drop("Arrival_Time", "Creation_Time", "Index") // COMMAND ---------- -// in Scala +// 스칼라 버전 // Subscribe to 1 topic val ds1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") @@ -107,7 +107,7 @@ val ds3 = spark.readStream.format("kafka") // COMMAND ---------- -// in Scala +// 스칼라 버전 ds1.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)") .writeStream.format("kafka") .option("checkpointLocation", "/to/HDFS-compatible/dir") @@ -116,14 +116,17 @@ ds1.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)") ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .writeStream.format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") - .option("checkpointLocation", "/to/HDFS-compatible/dir")\ + .option("checkpointLocation", "/to/HDFS-compatible/dir") .option("topic", "topic1") .start() // COMMAND ---------- -//in Scala +// 스칼라 버전 + +import org.apache.spark.sql.ForeachWriter + datasetOfString.write.foreach(new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { // open a database connection @@ -139,25 +142,25 @@ datasetOfString.write.foreach(new ForeachWriter[String] { // COMMAND ---------- -// in Scala +// 스칼라 버전 val socketDF = spark.readStream.format("socket") .option("host", "localhost").option("port", 9999).load() // COMMAND ---------- -activityCounts.format("console").write() +activityCounts.writeStream.format("console").outputMode("complete").start() // COMMAND ---------- -// in Scala +// 스칼라 버전 activityCounts.writeStream.format("memory").queryName("my_device_table") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.streaming.Trigger activityCounts.writeStream.trigger(Trigger.ProcessingTime("100 seconds")) @@ 
-166,7 +169,7 @@ activityCounts.writeStream.trigger(Trigger.ProcessingTime("100 seconds")) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.streaming.Trigger activityCounts.writeStream.trigger(Trigger.Once()) @@ -175,7 +178,7 @@ activityCounts.writeStream.trigger(Trigger.Once()) // COMMAND ---------- -// in Scala +// 스칼라 버전 case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt) val dataSchema = spark.read diff --git a/code/Streaming-Chapter_21_Structured_Streaming_Basics.sql "b/code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.sql" similarity index 100% rename from code/Streaming-Chapter_21_Structured_Streaming_Basics.sql rename to "code/21\354\236\245_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215_\355\231\234\354\232\251.sql" diff --git a/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.py "b/code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.py" similarity index 93% rename from code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.py rename to "code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.py" index 96f2c52..889f5c3 100644 --- a/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.py +++ "b/code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.py" @@ -9,7 +9,7 @@ # COMMAND ---------- -withEventTime = streaming\.selectExpr( +withEventTime = streaming.selectExpr( "*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time") @@ -28,7 +28,7 @@ # COMMAND ---------- from pyspark.sql.functions import window, col -withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count()\ +withEventTime.groupBy(window(col("event_time"), "10 minutes"), col("User")).count()\ .writeStream\ .queryName("pyevents_per_window")\ .format("memory")\ diff --git a/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.scala "b/code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.scala" similarity index 97% rename from code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.scala rename to "code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.scala" index feff54b..78e1d53 100644 --- a/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.scala +++ "b/code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 spark.conf.set("spark.sql.shuffle.partitions", 5) val static = spark.read.json("/data/activity-data") val streaming = spark @@ -15,7 +15,7 @@ streaming.printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 val withEventTime = streaming.selectExpr( "*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time") @@ -23,7 +23,7 @@ val withEventTime = streaming.selectExpr( // COMMAND ---------- -// in Scala 
+// 스칼라 버전 import org.apache.spark.sql.functions.{window, col} withEventTime.groupBy(window(col("event_time"), "10 minutes")).count() .writeStream @@ -40,9 +40,9 @@ spark.sql("SELECT * FROM events_per_window").printSchema() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{window, col} -withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count() +withEventTime.groupBy(window(col("event_time"), "10 minutes"), col("User")).count() .writeStream .queryName("events_per_window") .format("memory") @@ -52,7 +52,7 @@ withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{window, col} withEventTime.groupBy(window(col("event_time"), "10 minutes", "5 minutes")) .count() @@ -65,7 +65,7 @@ withEventTime.groupBy(window(col("event_time"), "10 minutes", "5 minutes")) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.{window, col} withEventTime .withWatermark("event_time", "5 hours") @@ -80,7 +80,7 @@ withEventTime // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.expr withEventTime diff --git a/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.sql "b/code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.sql" similarity index 100% rename from code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.sql rename to "code/22\354\236\245_\354\235\264\353\262\244\355\212\270_\354\213\234\352\260\204\352\263\274_\354\203\201\355\203\234_\352\270\260\353\260\230_\354\262\230\353\246\254.sql" diff --git a/code/Streaming-Chapter_23_Structured_Streaming_in_Production.py "b/code/23\354\236\245_\354\232\264\354\230\201_\355\231\230\352\262\275\354\227\220\354\204\234\354\235\230_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215.py" similarity index 100% rename from code/Streaming-Chapter_23_Structured_Streaming_in_Production.py rename to "code/23\354\236\245_\354\232\264\354\230\201_\355\231\230\352\262\275\354\227\220\354\204\234\354\235\230_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215.py" diff --git a/code/Streaming-Chapter_23_Structured_Streaming_in_Production.scala "b/code/23\354\236\245_\354\232\264\354\230\201_\355\231\230\352\262\275\354\227\220\354\204\234\354\235\230_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215.scala" similarity index 99% rename from code/Streaming-Chapter_23_Structured_Streaming_in_Production.scala rename to "code/23\354\236\245_\354\232\264\354\230\201_\355\231\230\352\262\275\354\227\220\354\204\234\354\235\230_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215.scala" index f8b7393..864f226 100644 --- a/code/Streaming-Chapter_23_Structured_Streaming_in_Production.scala +++ "b/code/23\354\236\245_\354\232\264\354\230\201_\355\231\230\352\262\275\354\227\220\354\204\234\354\235\230_\352\265\254\354\241\260\354\240\201_\354\212\244\355\212\270\353\246\254\353\260\215.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val static = spark.read.json("/data/activity-data") val streaming = spark .readStream diff --git a/code/Advanced_Analytics_and_Machine_Learning_Chapter_24_Advanced_Analytics_and_Machine_Learning.java 
"b/code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.java" similarity index 100% rename from code/Advanced_Analytics_and_Machine_Learning_Chapter_24_Advanced_Analytics_and_Machine_Learning.java rename to "code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.java" diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_24_Advanced_Analytics_and_Machine_Learning.py "b/code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.py" similarity index 100% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_24_Advanced_Analytics_and_Machine_Learning.py rename to "code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.py" diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_24_Advanced_Analytics_and_Machine_Learning.scala "b/code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.scala" similarity index 90% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_24_Advanced_Analytics_and_Machine_Learning.scala rename to "code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.scala" index d13cb09..871a957 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_24_Advanced_Analytics_and_Machine_Learning.scala +++ "b/code/24\354\236\245_\352\263\240\352\270\211_\353\266\204\354\204\235\352\263\274_\353\250\270\354\213\240\353\237\254\353\213\235_\352\260\234\354\232\224.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 import org.apache.spark.ml.linalg.Vectors val denseVec = Vectors.dense(1.0, 2.0, 3.0) val size = 3 @@ -11,7 +11,7 @@ denseVec.toSparse // COMMAND ---------- -// in Scala +// 스칼라 버전 var df = spark.read.json("/data/simple-ml") df.orderBy("value2").show() @@ -24,7 +24,7 @@ spark.read.format("libsvm").load( // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.RFormula val supervised = new RFormula() .setFormula("lab ~ . 
+ color:value1 + color:value2") @@ -32,7 +32,7 @@ val supervised = new RFormula() // COMMAND ---------- -// in Scala +// 스칼라 버전 val fittedRF = supervised.fit(df) val preparedDF = fittedRF.transform(df) preparedDF.show() @@ -40,26 +40,26 @@ preparedDF.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3)) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.LogisticRegression val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features") // COMMAND ---------- -// in Scala +// 스칼라 버전 println(lr.explainParams()) // COMMAND ---------- -// in Scala +// 스칼라 버전 val fittedLR = lr.fit(train) @@ -70,20 +70,20 @@ fittedLR.transform(train).select("label", "prediction").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val Array(train, test) = df.randomSplit(Array(0.7, 0.3)) // COMMAND ---------- -// in Scala +// 스칼라 버전 val rForm = new RFormula() val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.Pipeline val stages = Array(rForm, lr) val pipeline = new Pipeline().setStages(stages) @@ -91,7 +91,7 @@ val pipeline = new Pipeline().setStages(stages) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.tuning.ParamGridBuilder val params = new ParamGridBuilder() .addGrid(rForm.formula, Array( @@ -104,7 +104,7 @@ val params = new ParamGridBuilder() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator val evaluator = new BinaryClassificationEvaluator() .setMetricName("areaUnderROC") @@ -114,7 +114,7 @@ val evaluator = new BinaryClassificationEvaluator() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.tuning.TrainValidationSplit val tvs = new TrainValidationSplit() .setTrainRatio(0.75) // also the default. 
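// A minimal sketch (illustrative, not quoted from the book): assuming `pipeline`, `params`,
// `evaluator`, and `train` are the values defined in the chapter 24 hunks above, this shows the
// standard org.apache.spark.ml.tuning.TrainValidationSplit wiring that the builder chain above
// leads into. `tvsSketch` and `tvsSketchModel` are hypothetical names.
import org.apache.spark.ml.tuning.TrainValidationSplit
val tvsSketch = new TrainValidationSplit()
  .setTrainRatio(0.75)            // hold out 25% of `train` for validation (also the default)
  .setEstimatorParamMaps(params)  // grid produced by ParamGridBuilder
  .setEstimator(pipeline)         // RFormula + LogisticRegression pipeline
  .setEvaluator(evaluator)        // BinaryClassificationEvaluator with areaUnderROC
val tvsSketchModel = tvsSketch.fit(train) // returns a TrainValidationSplitModel; .bestModel holds the winning pipeline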
@@ -125,7 +125,7 @@ val tvs = new TrainValidationSplit() // COMMAND ---------- -// in Scala +// 스칼라 버전 val tvsFitted = tvs.fit(train) @@ -136,7 +136,7 @@ evaluator.evaluate(tvsFitted.transform(test)) // 0.9166666666666667 // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.PipelineModel import org.apache.spark.ml.classification.LogisticRegressionModel val trainedPipeline = tvsFitted.bestModel.asInstanceOf[PipelineModel] @@ -152,7 +152,7 @@ tvsFitted.write.overwrite().save("/tmp/modelLocation") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.tuning.TrainValidationSplitModel val model = TrainValidationSplitModel.load("/tmp/modelLocation") model.transform(test) diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.py "b/code/25\354\236\245_\353\215\260\354\235\264\355\204\260_\354\240\204\354\262\230\353\246\254_\353\260\217_\355\224\274\354\262\230_\354\227\224\354\247\200\353\213\210\354\226\264\353\247\201.py" similarity index 97% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.py rename to "code/25\354\236\245_\353\215\260\354\235\264\355\204\260_\354\240\204\354\262\230\353\246\254_\353\260\217_\355\224\274\354\262\230_\354\227\224\354\247\200\353\213\210\354\226\264\353\247\201.py" index d0bdab8..d316035 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.py +++ "b/code/25\354\236\245_\353\215\260\354\235\264\355\204\260_\354\240\204\354\262\230\353\246\254_\353\260\217_\355\224\274\354\262\230_\354\227\224\354\247\200\353\213\210\354\226\264\353\247\201.py" @@ -54,7 +54,7 @@ # COMMAND ---------- from pyspark.ml.feature import QuantileDiscretizer -bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id") +bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id").setOutputCol("result") fittedBucketer = bucketer.fit(contDF) fittedBucketer.transform(contDF).show() @@ -192,8 +192,8 @@ from pyspark.ml.feature import NGram unigram = NGram().setInputCol("DescOut").setN(1) bigram = NGram().setInputCol("DescOut").setN(2) -unigram.transform(tokenized.select("DescOut")).show(False) -bigram.transform(tokenized.select("DescOut")).show(False) +unigram.transform(tokenized.select("DescOut")).show(10, False) +bigram.transform(tokenized.select("DescOut")).show(10, False) # COMMAND ---------- @@ -206,7 +206,7 @@ .setMinTF(1)\ .setMinDF(2) fittedCV = cv.fit(tokenized) -fittedCV.transform(tokenized).show(False) +fittedCV.transform(tokenized).show(10, False) # COMMAND ---------- diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.scala "b/code/25\354\236\245_\353\215\260\354\235\264\355\204\260_\354\240\204\354\262\230\353\246\254_\353\260\217_\355\224\274\354\262\230_\354\227\224\354\247\200\353\213\210\354\226\264\353\247\201.scala" similarity index 93% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.scala rename to "code/25\354\236\245_\353\215\260\354\235\264\355\204\260_\354\240\204\354\262\230\353\246\254_\353\260\217_\355\224\274\354\262\230_\354\227\224\354\247\200\353\213\210\354\226\264\353\247\201.scala" index c341142..7aaa3a2 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.scala +++ 
"b/code/25\354\236\245_\353\215\260\354\235\264\355\204\260_\354\240\204\354\262\230\353\246\254_\353\260\217_\355\224\274\354\262\230_\354\227\224\354\247\200\353\213\210\354\226\264\353\247\201.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val sales = spark.read.format("csv") .option("header", "true") .option("inferSchema", "true") @@ -18,7 +18,7 @@ sales.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.Tokenizer val tkn = new Tokenizer().setInputCol("Description") tkn.transform(sales.select("Description")).show(false) @@ -26,7 +26,7 @@ tkn.transform(sales.select("Description")).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.StandardScaler val ss = new StandardScaler().setInputCol("features") ss.fit(scaleDF).transform(scaleDF).show(false) @@ -34,7 +34,7 @@ ss.fit(scaleDF).transform(scaleDF).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.RFormula val supervised = new RFormula() .setFormula("lab ~ . + color:value1 + color:value2") @@ -43,7 +43,7 @@ supervised.fit(simpleDF).transform(simpleDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.SQLTransformer val basicTransformation = new SQLTransformer() @@ -58,7 +58,7 @@ basicTransformation.transform(sales).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.VectorAssembler val va = new VectorAssembler().setInputCols(Array("int1", "int2", "int3")) va.transform(fakeIntDF).show() @@ -66,13 +66,13 @@ va.transform(fakeIntDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val contDF = spark.range(20).selectExpr("cast(id as double)") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.Bucketizer val bucketBorders = Array(-1.0, 5.0, 10.0, 250.0, 600.0) val bucketer = new Bucketizer().setSplits(bucketBorders).setInputCol("id") @@ -81,16 +81,16 @@ bucketer.transform(contDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.QuantileDiscretizer -val bucketer = new QuantileDiscretizer().setNumBuckets(5).setInputCol("id") +val bucketer = new QuantileDiscretizer().setNumBuckets(5).setInputCol("id").setOutputCol("result") val fittedBucketer = bucketer.fit(contDF) fittedBucketer.transform(contDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.StandardScaler val sScaler = new StandardScaler().setInputCol("features") sScaler.fit(scaleDF).transform(scaleDF).show() @@ -98,7 +98,7 @@ sScaler.fit(scaleDF).transform(scaleDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.MinMaxScaler val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features") val fittedminMax = minMax.fit(scaleDF) @@ -107,7 +107,7 @@ fittedminMax.transform(scaleDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.MaxAbsScaler val maScaler = new MaxAbsScaler().setInputCol("features") val fittedmaScaler = maScaler.fit(scaleDF) @@ -116,7 +116,7 @@ fittedmaScaler.transform(scaleDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.ElementwiseProduct import org.apache.spark.ml.linalg.Vectors val scaleUpVec = Vectors.dense(10.0, 15.0, 20.0) @@ -128,7 +128,7 @@ scalingUp.transform(scaleDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.Normalizer val manhattanDistance = new 
Normalizer().setP(1).setInputCol("features") manhattanDistance.transform(scaleDF).show() @@ -136,7 +136,7 @@ manhattanDistance.transform(scaleDF).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.StringIndexer val lblIndxr = new StringIndexer().setInputCol("lab").setOutputCol("labelInd") val idxRes = lblIndxr.fit(simpleDF).transform(simpleDF) @@ -145,7 +145,7 @@ idxRes.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val valIndexer = new StringIndexer() .setInputCol("value1") .setOutputCol("valueInd") @@ -161,7 +161,7 @@ valIndexer.fit(simpleDF).setHandleInvalid("skip") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.IndexToString val labelReverse = new IndexToString().setInputCol("labelInd") labelReverse.transform(idxRes).show() @@ -169,7 +169,7 @@ labelReverse.transform(idxRes).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.VectorIndexer import org.apache.spark.ml.linalg.Vectors val idxIn = spark.createDataFrame(Seq( @@ -186,7 +186,7 @@ indxr.fit(idxIn).transform(idxIn).show // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder} val lblIndxr = new StringIndexer().setInputCol("color").setOutputCol("colorInd") val colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color")) @@ -196,7 +196,7 @@ ohe.transform(colorLab).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.Tokenizer val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut") val tokenized = tkn.transform(sales.select("Description")) @@ -205,7 +205,7 @@ tokenized.show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.RegexTokenizer val rt = new RegexTokenizer() .setInputCol("Description") @@ -217,7 +217,7 @@ rt.transform(sales.select("Description")).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.RegexTokenizer val rt = new RegexTokenizer() .setInputCol("Description") @@ -230,7 +230,7 @@ rt.transform(sales.select("Description")).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.StopWordsRemover val englishStopWords = StopWordsRemover.loadDefaultStopWords("english") val stops = new StopWordsRemover() @@ -241,7 +241,7 @@ stops.transform(tokenized).show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.NGram val unigram = new NGram().setInputCol("DescOut").setN(1) val bigram = new NGram().setInputCol("DescOut").setN(2) @@ -251,7 +251,7 @@ bigram.transform(tokenized.select("DescOut")).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.CountVectorizer val cv = new CountVectorizer() .setInputCol("DescOut") @@ -265,7 +265,7 @@ fittedCV.transform(tokenized).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 val tfIdfIn = tokenized .where("array_contains(DescOut, 'red')") .select("DescOut") @@ -275,7 +275,7 @@ tfIdfIn.show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.{HashingTF, IDF} val tf = new HashingTF() .setInputCol("DescOut") @@ -289,13 +289,13 @@ val idf = new IDF() // COMMAND ---------- -// in Scala +// 스칼라 버전 idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.Word2Vec import org.apache.spark.ml.linalg.Vector import org.apache.spark.sql.Row @@ 
-320,7 +320,7 @@ result.collect().foreach { case Row(text: Seq[_], features: Vector) => // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.PCA val pca = new PCA().setInputCol("features").setK(2) pca.fit(scaleDF).transform(scaleDF).show(false) @@ -328,7 +328,7 @@ pca.fit(scaleDF).transform(scaleDF).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.PolynomialExpansion val pe = new PolynomialExpansion().setInputCol("features").setDegree(2) pe.transform(scaleDF).show(false) @@ -336,7 +336,7 @@ pe.transform(scaleDF).show(false) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.{ChiSqSelector, Tokenizer} val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut") val tokenized = tkn @@ -353,14 +353,14 @@ chisq.fit(prechi).transform(prechi) // COMMAND ---------- -// in Scala +// 스칼라 버전 val fittedPCA = pca.fit(scaleDF) fittedPCA.write.overwrite().save("/tmp/fittedPCA") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.PCAModel val loadedPCA = PCAModel.load("/tmp/fittedPCA") loadedPCA.transform(scaleDF).show() diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_26_Classification.py "b/code/26\354\236\245_\353\266\204\353\245\230.py" similarity index 94% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_26_Classification.py rename to "code/26\354\236\245_\353\266\204\353\245\230.py" index 1e9dffc..e9b0d3f 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_26_Classification.py +++ "b/code/26\354\236\245_\353\266\204\353\245\230.py" @@ -64,7 +64,7 @@ # COMMAND ---------- from pyspark.mllib.evaluation import BinaryClassificationMetrics -out = model.transform(bInput)\ +out = trainedModel.transform(bInput)\ .select("prediction", "label")\ .rdd.map(lambda x: (float(x[0]), float(x[1]))) metrics = BinaryClassificationMetrics(out) @@ -74,9 +74,6 @@ print metrics.areaUnderPR print metrics.areaUnderROC -print "Receiver Operating Characteristic" -metrics.roc.toDF().show() - # COMMAND ---------- diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_26_Classification.scala "b/code/26\354\236\245_\353\266\204\353\245\230.scala" similarity index 88% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_26_Classification.scala rename to "code/26\354\236\245_\353\266\204\353\245\230.scala" index 5803082..4c1a335 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_26_Classification.scala +++ "b/code/26\354\236\245_\353\266\204\353\245\230.scala" @@ -1,11 +1,11 @@ -// in Scala +// 스칼라 버전 val bInput = spark.read.format("parquet").load("/data/binary-classification") .selectExpr("features", "cast(label as double) as label") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.LogisticRegression val lr = new LogisticRegression() println(lr.explainParams()) // see all parameters @@ -14,14 +14,14 @@ val lrModel = lr.fit(bInput) // COMMAND ---------- -// in Scala +// 스칼라 버전 println(lrModel.coefficients) println(lrModel.intercept) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.BinaryLogisticRegressionSummary val summary = lrModel.summary val bSummary = summary.asInstanceOf[BinaryLogisticRegressionSummary] @@ -32,7 +32,7 @@ bSummary.pr.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.DecisionTreeClassifier val dt = new DecisionTreeClassifier() println(dt.explainParams()) @@ -41,7 
+41,7 @@ val dtModel = dt.fit(bInput) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.RandomForestClassifier val rfClassifier = new RandomForestClassifier() println(rfClassifier.explainParams()) @@ -50,7 +50,7 @@ val trainedModel = rfClassifier.fit(bInput) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.GBTClassifier val gbtClassifier = new GBTClassifier() println(gbtClassifier.explainParams()) @@ -59,7 +59,7 @@ val trainedModel = gbtClassifier.fit(bInput) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.classification.NaiveBayes val nb = new NaiveBayes() println(nb.explainParams()) @@ -68,9 +68,9 @@ val trainedModel = nb.fit(bInput.where("label != 0")) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics -val out = model.transform(bInput) +val out = trainedModel.transform(bInput) .select("prediction", "label") .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])) val metrics = new BinaryClassificationMetrics(out) @@ -78,7 +78,7 @@ val metrics = new BinaryClassificationMetrics(out) // COMMAND ---------- -// in Scala +// 스칼라 버전 metrics.areaUnderPR metrics.areaUnderROC println("Receiver Operating Characteristic") diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_27_Regression.py "b/code/27\354\236\245_\355\232\214\352\267\200.py" similarity index 96% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_27_Regression.py rename to "code/27\354\236\245_\355\232\214\352\267\200.py" index 6a55a7c..b5627f6 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_27_Regression.py +++ "b/code/27\354\236\245_\355\232\214\352\267\200.py" @@ -60,7 +60,7 @@ from pyspark.ml.tuning import CrossValidator, ParamGridBuilder glr = GeneralizedLinearRegression().setFamily("gaussian").setLink("identity") pipeline = Pipeline().setStages([glr]) -params = ParamGridBuilder().addGrid(glr.regParam, [0, 0.5, 1]).build() +params = ParamGridBuilder().addGrid(glr.regParam, [0.0, 0.5, 1.0]).build() evaluator = RegressionEvaluator()\ .setMetricName("rmse")\ .setPredictionCol("prediction")\ diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_27_Regression.scala "b/code/27\354\236\245_\355\232\214\352\267\200.scala" similarity index 91% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_27_Regression.scala rename to "code/27\354\236\245_\355\232\214\352\267\200.scala" index 58cfc17..aa6ddb9 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_27_Regression.scala +++ "b/code/27\354\236\245_\355\232\214\352\267\200.scala" @@ -1,12 +1,12 @@ -// in Scala +// 스칼라 버전 val df = spark.read.load("/data/regression") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.regression.LinearRegression -val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3)\ +val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3) .setElasticNetParam(0.8) println(lr.explainParams()) val lrModel = lr.fit(df) @@ -14,7 +14,7 @@ val lrModel = lr.fit(df) // COMMAND ---------- -// in Scala +// 스칼라 버전 val summary = lrModel.summary summary.residuals.show() println(summary.objectiveHistory.toSeq.toDF.show()) @@ -24,7 +24,7 @@ println(summary.r2) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.regression.GeneralizedLinearRegression val glr = new GeneralizedLinearRegression() .setFamily("gaussian") @@ -38,7 +38,7 @@ val glrModel = glr.fit(df) // COMMAND 
---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.regression.DecisionTreeRegressor val dtr = new DecisionTreeRegressor() println(dtr.explainParams()) @@ -47,7 +47,7 @@ val dtrModel = dtr.fit(df) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.regression.RandomForestRegressor import org.apache.spark.ml.regression.GBTRegressor val rf = new RandomForestRegressor() @@ -60,7 +60,7 @@ val gbtModel = gbt.fit(df) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.evaluation.RegressionEvaluator import org.apache.spark.ml.regression.GeneralizedLinearRegression import org.apache.spark.ml.Pipeline @@ -69,7 +69,7 @@ val glr = new GeneralizedLinearRegression() .setFamily("gaussian") .setLink("identity") val pipeline = new Pipeline().setStages(Array(glr)) -val params = new ParamGridBuilder().addGrid(glr.regParam, Array(0, 0.5, 1)) +val params = new ParamGridBuilder().addGrid(glr.regParam, Array(0.0, 0.5, 1.0)) .build() val evaluator = new RegressionEvaluator() .setMetricName("rmse") @@ -85,7 +85,7 @@ val model = cv.fit(df) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.mllib.evaluation.RegressionMetrics val out = model.transform(df) .select("prediction", "label") diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_28_Recommendation.py "b/code/28\354\236\245_\354\266\224\354\262\234.py" similarity index 100% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_28_Recommendation.py rename to "code/28\354\236\245_\354\266\224\354\262\234.py" diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_28_Recommendation.scala "b/code/28\354\236\245_\354\266\224\354\262\234.scala" similarity index 93% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_28_Recommendation.scala rename to "code/28\354\236\245_\354\266\224\354\262\234.scala" index 7658bdf..1f18f82 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_28_Recommendation.scala +++ "b/code/28\354\236\245_\354\266\224\354\262\234.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 import org.apache.spark.ml.recommendation.ALS val ratings = spark.read.textFile("/data/sample_movielens_ratings.txt") .selectExpr("split(value , '::') as col") @@ -21,7 +21,7 @@ val predictions = alsModel.transform(test) // COMMAND ---------- -// in Scala +// 스칼라 버전 alsModel.recommendForAllUsers(10) .selectExpr("userId", "explode(recommendations)").show() alsModel.recommendForAllItems(10) @@ -30,7 +30,7 @@ alsModel.recommendForAllItems(10) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.evaluation.RegressionEvaluator val evaluator = new RegressionEvaluator() .setMetricName("rmse") @@ -42,7 +42,7 @@ println(s"Root-mean-square error = $rmse") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.mllib.evaluation.{ RankingMetrics, RegressionMetrics} @@ -53,7 +53,7 @@ val metrics = new RegressionMetrics(regComparison) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics} import org.apache.spark.sql.functions.{col, expr} val perUserActual = predictions @@ -64,7 +64,7 @@ val perUserActual = predictions // COMMAND ---------- -// in Scala +// 스칼라 버전 val perUserPredictions = predictions .orderBy(col("userId"), col("prediction").desc) .groupBy("userId") @@ -73,7 +73,7 @@ val perUserPredictions = predictions // COMMAND ---------- -// in Scala +// 스칼라 버전 val perUserActualvPred = perUserActual.join(perUserPredictions, Seq("userId")) .map(row => ( 
row(1).asInstanceOf[Seq[Integer]].toArray, @@ -84,7 +84,7 @@ val ranks = new RankingMetrics(perUserActualvPred.rdd) // COMMAND ---------- -// in Scala +// 스칼라 버전 ranks.meanAveragePrecision ranks.precisionAt(5) diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_29_Unsupervised_Learning.py "b/code/29\354\236\245_\353\271\204\354\247\200\353\217\204_\355\225\231\354\212\265.py" similarity index 100% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_29_Unsupervised_Learning.py rename to "code/29\354\236\245_\353\271\204\354\247\200\353\217\204_\355\225\231\354\212\265.py" diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_29_Unsupervised_Learning.scala "b/code/29\354\236\245_\353\271\204\354\247\200\353\217\204_\355\225\231\354\212\265.scala" similarity index 91% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_29_Unsupervised_Learning.scala rename to "code/29\354\236\245_\353\271\204\354\247\200\353\217\204_\355\225\231\354\212\265.scala" index fd0eaf7..e59a558 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_29_Unsupervised_Learning.scala +++ "b/code/29\354\236\245_\353\271\204\354\247\200\353\217\204_\355\225\231\354\212\265.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.VectorAssembler val va = new VectorAssembler() @@ -18,7 +18,7 @@ sales.cache() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.clustering.KMeans val km = new KMeans().setK(5) println(km.explainParams()) @@ -27,7 +27,7 @@ val kmModel = km.fit(sales) // COMMAND ---------- -// in Scala +// 스칼라 버전 val summary = kmModel.summary summary.clusterSizes // number of points kmModel.computeCost(sales) @@ -37,7 +37,7 @@ kmModel.clusterCenters.foreach(println) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.clustering.BisectingKMeans val bkm = new BisectingKMeans().setK(5).setMaxIter(5) println(bkm.explainParams()) @@ -46,7 +46,7 @@ val bkmModel = bkm.fit(sales) // COMMAND ---------- -// in Scala +// 스칼라 버전 val summary = bkmModel.summary summary.clusterSizes // number of points kmModel.computeCost(sales) @@ -56,7 +56,7 @@ kmModel.clusterCenters.foreach(println) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.clustering.GaussianMixture val gmm = new GaussianMixture().setK(5) println(gmm.explainParams()) @@ -65,7 +65,7 @@ val model = gmm.fit(sales) // COMMAND ---------- -// in Scala +// 스칼라 버전 val summary = model.summary model.weights model.gaussiansDF.show() @@ -76,7 +76,7 @@ summary.probability.show() // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer} val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut") val tokenized = tkn.transform(sales.drop("features")) @@ -93,7 +93,7 @@ val prepped = cvFitted.transform(tokenized) // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.ml.clustering.LDA val lda = new LDA().setK(10).setMaxIter(5) println(lda.explainParams()) @@ -102,7 +102,7 @@ val model = lda.fit(prepped) // COMMAND ---------- -// in Scala +// 스칼라 버전 model.describeTopics(3).show() cvFitted.vocabulary diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_30_Graph_Analysis.py "b/code/30\354\236\245_\352\267\270\353\236\230\355\224\204_\353\266\204\354\204\235.py" similarity index 92% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_30_Graph_Analysis.py rename to 
"code/30\354\236\245_\352\267\270\353\236\230\355\224\204_\353\266\204\354\204\235.py" index dcb416d..8cffc6f 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_30_Graph_Analysis.py +++ "b/code/30\354\236\245_\352\267\270\353\236\230\355\224\204_\353\266\204\354\204\235.py" @@ -13,6 +13,9 @@ # COMMAND ---------- +# graphframes(https://spark-packages.org/package/graphframes/graphframes) 라이브러리가 필요합니다. +# http://graphframes.github.io/quick-start.html +# DataBricks Runtime: https://docs.databricks.com/user-guide/libraries.html#maven-libraries from graphframes import GraphFrame stationGraph = GraphFrame(stationVertices, tripEdges) diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_30_Graph_Analysis.scala "b/code/30\354\236\245_\352\267\270\353\236\230\355\224\204_\353\266\204\354\204\235.scala" similarity index 84% rename from code/Advanced_Analytics_and_Machine_Learning-Chapter_30_Graph_Analysis.scala rename to "code/30\354\236\245_\352\267\270\353\236\230\355\224\204_\353\266\204\354\204\235.scala" index 8286e6e..648be88 100644 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_30_Graph_Analysis.scala +++ "b/code/30\354\236\245_\352\267\270\353\236\230\355\224\204_\353\266\204\354\204\235.scala" @@ -1,4 +1,4 @@ -// in Scala +// 스칼라 버전 val bikeStations = spark.read.option("header","true") .csv("/data/bike-data/201508_station_data.csv") val tripData = spark.read.option("header","true") @@ -7,7 +7,7 @@ val tripData = spark.read.option("header","true") // COMMAND ---------- -// in Scala +// 스칼라 버전 val stationVertices = bikeStations.withColumnRenamed("name", "id").distinct() val tripEdges = tripData .withColumnRenamed("Start Station", "src") @@ -16,7 +16,12 @@ val tripEdges = tripData // COMMAND ---------- -// in Scala +// 스칼라 버전 + +// graphframes(https://spark-packages.org/package/graphframes/graphframes) 라이브러리가 필요합니다. 
+// http://graphframes.github.io/quick-start.html +// DataBricks Runtime: https://docs.databricks.com/user-guide/libraries.html#maven-libraries + import org.graphframes.GraphFrame val stationGraph = GraphFrame(stationVertices, tripEdges) stationGraph.cache() @@ -24,7 +29,7 @@ stationGraph.cache() // COMMAND ---------- -// in Scala +// 스칼라 버전 println(s"Total Number of Stations: ${stationGraph.vertices.count()}") println(s"Total Number of Trips in Graph: ${stationGraph.edges.count()}") println(s"Total Number of Trips in Original Data: ${tripData.count()}") @@ -32,14 +37,14 @@ println(s"Total Number of Trips in Original Data: ${tripData.count()}") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.desc stationGraph.edges.groupBy("src", "dst").count().orderBy(desc("count")).show(10) // COMMAND ---------- -// in Scala +// 스칼라 버전 stationGraph.edges .where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'") .groupBy("src", "dst").count() @@ -49,7 +54,7 @@ stationGraph.edges // COMMAND ---------- -// in Scala +// 스칼라 버전 val townAnd7thEdges = stationGraph.edges .where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'") val subgraph = GraphFrame(stationGraph.vertices, townAnd7thEdges) @@ -57,13 +62,13 @@ val subgraph = GraphFrame(stationGraph.vertices, townAnd7thEdges) // COMMAND ---------- -// in Scala +// 스칼라 버전 val motifs = stationGraph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)") // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.expr motifs.selectExpr("*", "to_timestamp(ab.`Start Date`, 'MM/dd/yyyy HH:mm') as abStart", @@ -79,7 +84,7 @@ motifs.selectExpr("*", // COMMAND ---------- -// in Scala +// 스칼라 버전 import org.apache.spark.sql.functions.desc val ranks = stationGraph.pageRank.resetProbability(0.15).maxIter(10).run() ranks.vertices.orderBy(desc("pagerank")).select("id", "pagerank").show(10) @@ -87,21 +92,21 @@ ranks.vertices.orderBy(desc("pagerank")).select("id", "pagerank").show(10) // COMMAND ---------- -// in Scala +// 스칼라 버전 val inDeg = stationGraph.inDegrees inDeg.orderBy(desc("inDegree")).show(5, false) // COMMAND ---------- -// in Scala +// 스칼라 버전 val outDeg = stationGraph.outDegrees outDeg.orderBy(desc("outDegree")).show(5, false) // COMMAND ---------- -// in Scala +// 스칼라 버전 val degreeRatio = inDeg.join(outDeg, Seq("id")) .selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio") degreeRatio.orderBy(desc("degreeRatio")).show(10, false) @@ -110,33 +115,33 @@ degreeRatio.orderBy("degreeRatio").show(10, false) // COMMAND ---------- -// in Scala +// 스칼라 버전 stationGraph.bfs.fromExpr("id = 'Townsend at 7th'") .toExpr("id = 'Spear at Folsom'").maxPathLength(2).run().show(10) // COMMAND ---------- -// in Scala +// 스칼라 버전 spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // COMMAND ---------- -// in Scala +// 스칼라 버전 val minGraph = GraphFrame(stationVertices, tripEdges.sample(false, 0.1)) val cc = minGraph.connectedComponents.run() // COMMAND ---------- -// in Scala +// 스칼라 버전 cc.where("component != 0").show() // COMMAND ---------- -// in Scala +// 스칼라 버전 val scc = minGraph.stronglyConnectedComponents.maxIter(3).run() diff --git "a/code/31\354\236\245_\353\224\245\353\237\254\353\213\235.py" "b/code/31\354\236\245_\353\224\245\353\237\254\353\213\235.py" new file mode 100644 index 0000000..ea5bf5e --- /dev/null +++ "b/code/31\354\236\245_\353\224\245\353\237\254\353\213\235.py" @@ -0,0 +1,138 @@ +# databricks 런타임 환경에서 테스트 하기 위해서는 아래의 코드를 먼저 실행해야 합니다. 
+%sh +curl -O http://download.tensorflow.org/example_images/flower_photos.tgz +tar xzf flower_photos.tgz &>/dev/null + + +# COMMAND ---------- +dbutils.fs.ls('file:/databricks/driver/flower_photos') + + +# COMMAND ---------- + +img_dir = '/tmp/flower_photos' +dbutils.fs.mkdirs(img_dir) + +dbutils.fs.cp('file:/databricks/driver/flower_photos/tulips', img_dir + "/tulips", recurse=True) +dbutils.fs.cp('file:/databricks/driver/flower_photos/daisy', img_dir + "/daisy", recurse=True) +dbutils.fs.cp('file:/databricks/driver/flower_photos/LICENSE.txt', img_dir) + + +# COMMAND ---------- + +sample_img_dir = img_dir + "/sample" +dbutils.fs.rm(sample_img_dir, recurse=True) +dbutils.fs.mkdirs(sample_img_dir) +files = dbutils.fs.ls(img_dir + "/daisy")[0:10] + dbutils.fs.ls(img_dir + "/tulips")[0:1] +for f in files: + dbutils.fs.cp(f.path, sample_img_dir) + +dbutils.fs.ls(sample_img_dir) + + +# COMMAND ---------- +# 본문 내용은 여기서 부터 시작입니다. + +# Spark 2.3 버전으로 실행해야 합니다. +# 스파크 딥러닝 관련 내용은 다음의 링크를 참조하십시오. +# https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html + +# spark-deep-learning 프로젝트는 다음의 링크를 참조하십시오. +# https://github.com/databricks/spark-deep-learning + +from pyspark.ml.image import ImageSchema + +# 이미지 파일이 많기 때문에 /tulips 디렉터리와 /daisy 디렉터리의 일부 파일을 /sample 디렉터리에 복제하여 사용합니다. +# 약 10개의 파일을 /sample 디렉터리에 복제합니다. +img_dir = '/data/deep-learning-images/' +sample_img_dir = img_dir + "/sample" + +image_df = ImageSchema.readImages(sample_img_dir) + +# COMMAND ---------- + +image_df.printSchema() + + +# COMMAND ---------- + +from pyspark.ml.image import ImageSchema +from pyspark.sql.functions import lit +from sparkdl.image import imageIO + +tulips_df = ImageSchema.readImages(img_dir + "/tulips").withColumn("label", lit(1)) +daisy_df = imageIO.readImagesWithCustomFn(img_dir + "/daisy", decode_f=imageIO.PIL_decode).withColumn("label", lit(0)) +tulips_train, tulips_test = tulips_df.randomSplit([0.6, 0.4]) +daisy_train, daisy_test = daisy_df.randomSplit([0.6, 0.4]) +train_df = tulips_train.unionAll(daisy_train) +test_df = tulips_test.unionAll(daisy_test) + +# 메모리 오버헤드를 줄이기 위해 파티션을 나눕니다. +train_df = train_df.repartition(100) +test_df = test_df.repartition(100) + + +# COMMAND ---------- + +from pyspark.ml.classification import LogisticRegression +from pyspark.ml import Pipeline +from sparkdl import DeepImageFeaturizer + +featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3") +lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label") +p = Pipeline(stages=[featurizer, lr]) + +p_model = p.fit(train_df) + + +# COMMAND ---------- + +from pyspark.ml.evaluation import MulticlassClassificationEvaluator + +tested_df = p_model.transform(test_df) +evaluator = MulticlassClassificationEvaluator(metricName="accuracy") +print("Test set accuracy = " + str(evaluator.evaluate(tested_df.select("prediction", "label")))) + + +# COMMAND ---------- + +from pyspark.sql.types import DoubleType +from pyspark.sql.functions import expr + +def _p1(v): + return float(v.array[1]) + +p1 = udf(_p1, DoubleType()) +df = tested_df.withColumn("p_1", p1(tested_df.probability)) +wrong_df = df.orderBy(expr("abs(p_1 - label)"), ascending=False) +wrong_df.select("image.origin", "p_1", "label").limit(10) + + +# COMMAND ---------- + +from pyspark.ml.image import ImageSchema +from sparkdl import DeepImagePredictor + +# 빠른 테스트를 위해 앞서 선언한 샘플 이미지 디렉터리를 사용합니다. 
+image_df = ImageSchema.readImages(sample_img_dir) + +predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10) +predictions_df = predictor.transform(image_df) + + +# COMMAND ---------- + +df = p_model.transform(image_df) +df.select("image.origin", (1-p1(df.probability)).alias("p_daisy")).show() + + +# COMMAND ---------- + +from keras.applications import InceptionV3 +from sparkdl.udf.keras_image_model import registerKerasImageUDF +from keras.applications import InceptionV3 + +registerKerasImageUDF("my_keras_inception_udf", InceptionV3(weights="imagenet")) + +# COMMAND ---------- + diff --git a/code/Ecosystem-Chapter_32_Language_Specifics.py "b/code/32\354\236\245_\354\226\270\354\226\264\353\263\204_\355\212\271\354\204\261.py" similarity index 100% rename from code/Ecosystem-Chapter_32_Language_Specifics.py rename to "code/32\354\236\245_\354\226\270\354\226\264\353\263\204_\355\212\271\354\204\261.py" diff --git a/code/Ecosystem-Chapter_32_Language_Specifics.r "b/code/32\354\236\245_\354\226\270\354\226\264\353\263\204_\355\212\271\354\204\261.r" similarity index 88% rename from code/Ecosystem-Chapter_32_Language_Specifics.r rename to "code/32\354\236\245_\354\226\270\354\226\264\353\263\204_\355\212\271\354\204\261.r" index ee9f825..e988a9b 100644 --- a/code/Ecosystem-Chapter_32_Language_Specifics.r +++ "b/code/32\354\236\245_\354\226\270\354\226\264\353\263\204_\355\212\271\354\204\261.r" @@ -35,7 +35,7 @@ collect(count(groupBy(retail.data, "country"))) # COMMAND ---------- -sample(mtcars) # fails +sample(mtcars) # 오류 발생 # COMMAND ---------- @@ -165,12 +165,14 @@ library(sparklyr) # COMMAND ---------- - +# 데이터브릭스 환경의 경우 다음의 링크를 참조하십시오. +# https://docs.databricks.com/spark/latest/sparkr/sparklyr.html sc <- spark_connect(master = "local") # COMMAND ---------- - +# 데이터브릭스 환경의 경우 다음의 링크를 참조하십시오. 
+# https://docs.databricks.com/spark/latest/sparkr/sparklyr.html spark_connect(master = "local", config = spark_config()) @@ -187,6 +189,10 @@ setShufflePartitions <- dbGetQuery(sc, "SET spark.sql.shuffle.partitions=10") # COMMAND ---------- +# https://spark.rstudio.com/reference/spark_write_csv/ +# https://spark.rstudio.com/reference/spark_write_json/ +# https://spark.rstudio.com/reference/spark_write_parquet/ + spark_write_csv(tbl_name, location) spark_write_json(tbl_name, location) spark_write_parquet(tbl_name, location) diff --git a/code/Ecosystem-Chapter_33_Ecosystem_and_Community.scala "b/code/33\354\236\245_\354\227\220\354\275\224_\354\213\234\354\212\244\355\205\234\352\263\274_\354\273\244\353\256\244\353\213\210\355\213\260.scala" similarity index 100% rename from code/Ecosystem-Chapter_33_Ecosystem_and_Community.scala rename to "code/33\354\236\245_\354\227\220\354\275\224_\354\213\234\354\212\244\355\205\234\352\263\274_\354\273\244\353\256\244\353\213\210\355\213\260.scala" diff --git a/code/Advanced_Analytics_and_Machine_Learning-Chapter_31_Deep_Learning.py b/code/Advanced_Analytics_and_Machine_Learning-Chapter_31_Deep_Learning.py deleted file mode 100644 index b00f242..0000000 --- a/code/Advanced_Analytics_and_Machine_Learning-Chapter_31_Deep_Learning.py +++ /dev/null @@ -1,85 +0,0 @@ -from sparkdl import readImages -img_dir = '/data/deep-learning-images/' -image_df = readImages(img_dir) - - -# COMMAND ---------- - -image_df.printSchema() - - -# COMMAND ---------- - -from sparkdl import readImages -from pyspark.sql.functions import lit -tulips_df = readImages(img_dir + "/tulips").withColumn("label", lit(1)) -daisy_df = readImages(img_dir + "/daisy").withColumn("label", lit(0)) -tulips_train, tulips_test = tulips_df.randomSplit([0.6, 0.4]) -daisy_train, daisy_test = daisy_df.randomSplit([0.6, 0.4]) -train_df = tulips_train.unionAll(daisy_train) -test_df = tulips_test.unionAll(daisy_test) - - -# COMMAND ---------- - -from pyspark.ml.classification import LogisticRegression -from pyspark.ml import Pipeline -from sparkdl import DeepImageFeaturizer -featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", - modelName="InceptionV3") -lr = LogisticRegression(maxIter=1, regParam=0.05, elasticNetParam=0.3, - labelCol="label") -p = Pipeline(stages=[featurizer, lr]) -p_model = p.fit(train_df) - - -# COMMAND ---------- - -from pyspark.ml.evaluation import MulticlassClassificationEvaluator -tested_df = p_model.transform(test_df) -evaluator = MulticlassClassificationEvaluator(metricName="accuracy") -print("Test set accuracy = " + str(evaluator.evaluate(tested_df.select( - "prediction", "label")))) - - -# COMMAND ---------- - -from pyspark.sql.types import DoubleType -from pyspark.sql.functions import expr -# a simple UDF to convert the value to a double -def _p1(v): - return float(v.array[1]) -p1 = udf(_p1, DoubleType()) -df = tested_df.withColumn("p_1", p1(tested_df.probability)) -wrong_df = df.orderBy(expr("abs(p_1 - label)"), ascending=False) -wrong_df.select("filePath", "p_1", "label").limit(10).show() - - -# COMMAND ---------- - -from sparkdl import readImages, DeepImagePredictor -image_df = readImages(img_dir) -predictor = DeepImagePredictor( - inputCol="image", - outputCol="predicted_labels", - modelName="InceptionV3", - decodePredictions=True, - topK=10) -predictions_df = predictor.transform(image_df) - - -# COMMAND ---------- - -df = p_model.transform(image_df) - - -# COMMAND ---------- - -from keras.applications import InceptionV3 -from 
sparkdl.udf.keras_image_model import registerKerasImageUDF -from keras.applications import InceptionV3 -registerKerasImageUDF("my_keras_inception_udf", InceptionV3(weights="imagenet")) - - -# COMMAND ---------- -
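
A note on the GraphFrames requirement called out in the chapter 30 comments above: on Databricks the library is attached as a Maven library (see the linked docs), while on a local machine it is typically pulled in when the shell is launched. The sketch below is a minimal, unofficial example of that local setup; the --packages coordinate is an assumption for a Spark 2.3 / Scala 2.11 build (check the quick-start link referenced in the comments for the coordinate matching your environment), and the /data/bike-data paths are the same ones used by the chapter 30 examples.

# Launch PySpark with the GraphFrames package from the Spark Packages repository.
# The exact coordinate below is an assumption; see http://graphframes.github.io/quick-start.html
# for the version string that matches your Spark and Scala versions.
#
#   pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

from graphframes import GraphFrame

# Same bike-share data and column renames used in the chapter 30 examples above.
bikeStations = spark.read.option("header", "true")\
  .csv("/data/bike-data/201508_station_data.csv")
tripData = spark.read.option("header", "true")\
  .csv("/data/bike-data/201508_trip_data.csv")

stationVertices = bikeStations.withColumnRenamed("name", "id").distinct()
tripEdges = tripData\
  .withColumnRenamed("Start Station", "src")\
  .withColumnRenamed("End Station", "dst")

# If the import above succeeds, the package is attached correctly.
stationGraph = GraphFrame(stationVertices, tripEdges)
stationGraph.cache()
print("Total Number of Stations: " + str(stationGraph.vertices.count()))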
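
The chapter 31 notebook above prepares its test data with %sh and dbutils, which exist only inside the Databricks runtime. For readers working on a local machine, the sketch below is one possible stand-in using only the Python standard library; the ~/flower_photos download location is an assumption, and the "ten daisy images plus one tulip image into a sample directory" convention simply mirrors what the notebook's dbutils.fs commands do.

import glob
import os
import shutil
import tarfile
import urllib.request

# Download and unpack the same flowers archive the notebook fetches with %sh curl.
archive = os.path.expanduser("~/flower_photos.tgz")
target_dir = os.path.expanduser("~/flower_photos")
if not os.path.exists(target_dir):
    urllib.request.urlretrieve(
        "http://download.tensorflow.org/example_images/flower_photos.tgz", archive)
    with tarfile.open(archive) as tar:
        tar.extractall(os.path.dirname(target_dir))

# Build the small sample directory: ten daisy images plus one tulip image,
# matching what the notebook copies with dbutils.fs.cp.
sample_dir = os.path.join(target_dir, "sample")
shutil.rmtree(sample_dir, ignore_errors=True)
os.makedirs(sample_dir)
files = sorted(glob.glob(os.path.join(target_dir, "daisy", "*.jpg")))[:10] \
      + sorted(glob.glob(os.path.join(target_dir, "tulips", "*.jpg")))[:1]
for f in files:
    shutil.copy(f, sample_dir)

print(os.listdir(sample_dir))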