
Jan 21, 2025 - 09:22
Run PySpark Local Python Windows Notebook

Introduction

PySpark is the Python API for Apache Spark, an open-source distributed computing system that enables fast, scalable data processing. PySpark allows Python developers to leverage the powerful capabilities of Spark for big data analytics, machine learning, and data engineering tasks without needing to delve into the complexities of Java or Scala.

With PySpark, users can process large datasets across clusters, perform distributed data transformations, and run machine learning algorithms. It integrates seamlessly with popular data processing frameworks like Hadoop and supports multiple data formats, making it a versatile tool in data science and analytics.

This introduction provides an overview of PySpark's capabilities, focusing on its core components like RDDs (Resilient Distributed Datasets), DataFrames, and SQL functionalities. Whether you're processing big data, building models, or performing ETL tasks, PySpark offers a robust platform for executing high-performance data workflows in a Python environment.

Installation

  1. Install Python from https://www.python.org/downloads/
  2. Install Java. First, download a recent JDK from https://jdk.java.net/. I'm using Java 23 for this post.
  3. Install PySpark. You also need to download Apache Spark itself; I'm using https://www.apache.org/dyn/closer.lua/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz for this post, extracted to D:\Soft\pyspark.
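Alternatively, PySpark is also published on PyPI (https://pypi.org/project/pyspark/), and that package bundles Spark itself, so a plain pip install is enough if you don't need a standalone Spark distribution:

pip install pyspark

This post sticks with the standalone download so that SPARK_HOME points at a full Spark distribution you can also use outside Python.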

Configuration in Python

  1. Java
import os

# Point JAVA_HOME at your JDK install and put its bin directory first on PATH
os.environ["JAVA_HOME"] = r"D:\Soft\JAVA\jdk-23.0.1"
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + ";" + os.environ["PATH"]
  2. PySpark
import os

# Point SPARK_HOME at the extracted Spark distribution and expose its bin directory
os.environ["SPARK_HOME"] = r"D:\Soft\pyspark\spark-3.5.4-bin-hadoop3"
os.environ["PATH"] = os.path.join(os.environ["SPARK_HOME"], "bin") + ";" + os.environ["PATH"]

Once that's done, you can also check PySpark from the command line:
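For example (this assumes Spark's bin directory is on the PATH of that shell, either via the snippet above or a system-wide environment variable):

pyspark --version

This prints the Spark version banner and exits.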

Try an Example with a PySpark Notebook

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession  # this import was missing from the original snippet

# Build a local Spark session; local[*] uses all available cores.
# Note: event logging requires the event-log directory (spark.eventLog.dir) to exist.
spark = SparkSession.builder \
    .appName("Debugging Example") \
    .master("local[*]") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "1") \
    .getOrCreate()

spark.sparkContext.setLogLevel("DEBUG")

# Enable Arrow-based columnar data transfers
# (spark.sql.execution.arrow.enabled is deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Generate a pandas DataFrame with 100 rows of random values
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from the pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Rename the columns
df = df.toDF("a", "b", "c")
df

Use df.show(5) to print the first five rows and verify that everything works.
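A couple of standard ways to inspect the result; note that with the Arrow setting above, the conversion back to pandas is Arrow-accelerated as well:

df.show(5)                  # print the first 5 rows in tabular form
df.printSchema()            # print the inferred column names and types
result_pdf = df.toPandas()  # convert back to a pandas DataFrame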

Reference

https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html
https://www.python.org/downloads/
https://stackoverflow.com/questions/77295900/display-spark-dataframe-in-visual-studio-code
https://jdk.java.net/
https://pypi.org/project/pyspark/
