Cloud Dataproc - Cluster and Jobs

Step-01: Introduction

  • Create Dataproc Single Node Cluster
  • Create Job1: sort-words-job and verify
  • Create Job2: distinct-list-job and verify

Step-02: Create Dataproc Cluster

  • Go to Dataproc -> CREATE CLUSTER
  • Create Dataproc cluster: Cluster on Compute Engine

Setup Cluster

  • Cluster Name: mydataproc-cluster
  • Region: us-central1
  • Zone: any
  • Cluster Type: Single Node (1 master, 0 workers)
  • Leave all other settings at their defaults

Configure Nodes

  • Primary Disk Size: 100GB
  • Leave all other settings at their defaults

Customize Cluster

  • Leave all settings at their defaults

Manage Security

  • Leave all settings at their defaults
  • Click on CREATE
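
The console steps above can also be sketched as a single gcloud command. The flags below mirror the choices made in the wizard (single-node cluster, us-central1, 100GB primary disk); verify the flags against your gcloud version before running.

```shell
# Sketch of a gcloud equivalent of the console wizard above:
# a single-node Dataproc cluster in us-central1 with a 100GB primary disk.
gcloud dataproc clusters create mydataproc-cluster \
    --region=us-central1 \
    --single-node \
    --master-boot-disk-size=100GB
```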

Step-03: Review Dataproc Cluster

  • Dataproc Cluster
  • MONITORING
  • JOBS
  • VM INSTANCES
  • CONFIGURATION
  • WEB INTERFACES
  • Go to Compute Instances and verify VM instance created for cluster

Step-04: Review Python files and Upload to Cloud Storage

  • These PySpark files will be used to create the Jobs

Step-04-01: distinct-list.py

#! /usr/bin/python
import pyspark

# Create Number List
numbers = [1,2,3,1,2,3,4,4,2,3,6,6,7,2,2,1,3,4,5,8,1,2]

# Python SparkContext
sc = pyspark.SparkContext()

# Create RDD with parallelize method of SparkContext
rdd = sc.parallelize(numbers)

# Return distinct elements from RDD
distinct_numbers = rdd.distinct().collect()

# Print distinct numbers which we can verify in Cloud Dataproc Logs
print('Distinct Numbers:', distinct_numbers)
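
If you don't have a Spark cluster handy, the behavior of `rdd.distinct()` on this list can be reproduced with plain Python. The snippet below is a local sketch only, not part of the Dataproc job; note that Spark does not guarantee the order of `distinct()` results, so the sketch sorts them for deterministic inspection.

```python
# Local sketch of what rdd.distinct().collect() returns for this list.
numbers = [1,2,3,1,2,3,4,4,2,3,6,6,7,2,2,1,3,4,5,8,1,2]

# A Python set keeps exactly the distinct elements, like rdd.distinct().
distinct_numbers = list(set(numbers))

# Sorted here only to make the output deterministic; Spark's order may differ.
print('Distinct Numbers:', sorted(distinct_numbers))
```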

Step-04-02: sort-words.py

#! /usr/bin/python
import pyspark

# Python SparkContext
sc = pyspark.SparkContext()

# Create RDD with parallelize method of SparkContext
rdd = sc.parallelize(["orange", "pear", "date", "grape", "banana", "kiwi", "cherry", "fig", "lemon", "mango", "apple"])

# Collect the RDD elements and sort them alphabetically
words = sorted(rdd.collect())

# Print sorted words which we can verify in Cloud Dataproc Logs
print(words)

Step-04-03: Upload files to Cloud Storage Bucket

  • Create a new Cloud Storage bucket or use an existing one
  • Cloud Storage Bucket: mybucket-1023
  • Upload files
  • sort-words.py
  • distinct-list.py
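
The uploads can also be done from the command line. The commands below assume the bucket name used above and that both files are in the current directory; this is a sketch, not the only way to upload.

```shell
# Upload both PySpark files to the bucket created/used above.
gcloud storage cp sort-words.py distinct-list.py gs://mybucket-1023/

# Verify the upload by listing the bucket contents.
gcloud storage ls gs://mybucket-1023/
```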

Step-05: Create sort-words Job and Verify Job Logs

Step-05-01: Create Dataproc Job

  • Go to Dataproc -> Jobs on Clusters -> Jobs -> SUBMIT JOB
  • Job ID: sort-words-job
  • Region: us-central1
  • Cluster: mydataproc-cluster
  • Job Type: PySpark
  • Main Python File: gs://mybucket-1023/sort-words.py
  • Leave all other settings at their defaults
  • Click on SUBMIT
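
The same job can be submitted from the command line. The sketch below mirrors the console fields above (job ID, cluster, region, main Python file); check the flags against your gcloud version.

```shell
# Sketch of a gcloud equivalent of submitting the job via the console.
gcloud dataproc jobs submit pyspark gs://mybucket-1023/sort-words.py \
    --id=sort-words-job \
    --cluster=mydataproc-cluster \
    --region=us-central1
```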

Step-05-02: Verify Output Job Logs

  • Go to Dataproc -> Jobs on Clusters -> Jobs -> sort-words-job
  • Verify output job logs
  • Observation: All the words in the list will be sorted in alphabetical order
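
The expected job output can be reproduced locally with plain Python, since the job simply collects the RDD and sorts it. This is a local sketch of what `sorted(rdd.collect())` returns for the word list in sort-words.py.

```python
# Local sketch: sorting the same word list the PySpark job uses.
fruits = ["orange", "pear", "date", "grape", "banana", "kiwi",
          "cherry", "fig", "lemon", "mango", "apple"]

# sorted() returns the words in alphabetical order, matching the job logs.
words = sorted(fruits)
print(words)
```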

Step-06: Create distinct-list Job and Verify Job Logs

Step-06-01: Create Dataproc Job

  • Go to Dataproc -> Jobs on Clusters -> Jobs -> SUBMIT JOB
  • Job ID: distinct-list-job
  • Region: us-central1
  • Cluster: mydataproc-cluster
  • Job Type: PySpark
  • Main Python File: gs://mybucket-1023/distinct-list.py
  • Leave all other settings at their defaults
  • Click on SUBMIT
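
As with the first job, this submission can be sketched as a gcloud command mirroring the console fields above; verify the flags against your gcloud version.

```shell
# Sketch of a gcloud equivalent of submitting the distinct-list job.
gcloud dataproc jobs submit pyspark gs://mybucket-1023/distinct-list.py \
    --id=distinct-list-job \
    --cluster=mydataproc-cluster \
    --region=us-central1
```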

Step-06-02: Verify Output Job Logs

  • Go to Dataproc -> Jobs on Clusters -> Jobs -> distinct-list-job
  • Verify output job logs
  • Observation: Only the distinct numbers from the list will be printed (duplicates removed)

Step-07: gcloud Commands: Cloud Dataproc: CleanUp

# Set Project
gcloud config set project PROJECT_ID
gcloud config set project mydatabases123

# Set Cloud Dataproc Region
gcloud config set dataproc/region VALUE
gcloud config set dataproc/region us-central1

# List Jobs
gcloud dataproc jobs list

# List Clusters
gcloud dataproc clusters list

# Delete Cluster
gcloud dataproc clusters delete mydataproc-cluster

# List Clusters
gcloud dataproc clusters list