Data Engineer: PySpark- End-to-End DataFrame With Apache Spark--2

This post walks through basic data exploration and cleaning with a PySpark DataFrame.

Scenario:

A field that should hold integers also contains string values, and these need to be removed. If a value is a non-numeric string, it is replaced with null; otherwise the original value is returned as an integer.
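Before bringing Spark in, the rule can be sketched in plain Python. The helper name `to_int_or_none` and the sample values below are assumptions made up for illustration; they are not taken from the CSV file used in this post.

```python
# Hypothetical helper name and sample values, for illustration only.
def to_int_or_none(value):
    """Return value as an int, or None when it cannot be parsed as one."""
    try:
        return int(value)
    except (ValueError, TypeError):
        # ValueError: text that is not a whole number, such as "abc", "12.5" or ""
        # TypeError: a missing value (None)
        return None


samples = ["1000", " 42 ", "abc", "12.5", "", None]
cleaned = [to_int_or_none(v) for v in samples]
print(cleaned)  # [1000, 42, None, None, None, None]
```

Note that `int()` accepts surrounding whitespace but rejects decimal strings such as "12.5", so those become null as well under this rule.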

Steps:

Read the data from a CSV file
Load it into a DataFrame
Create the Python function
Register the Python function in Spark
Call the function on the Spark DataFrame

# import the required packages

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import expr

# initialize the Spark session

spark = (SparkSession
 .builder
 .appName("Cleaning data")
 .getOrCreate())


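# declare the schema
# No explicit schema is declared here, so every column is read as a string.

# read the CSV file into a DataFrame
# Note: when a schema with date fields is supplied, those fields come back as null
# unless the matching dateFormat is also set on the DataFrame reader.
# Calling .schema(firecallschema) on the reader declares the column data types.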

cleaning = (spark.read.format("csv").options(header="true").
             load('/FileStore/tables/cleaning.csv'))

# view the schema of the DataFrame, similar to df.info() in pandas
cleaning.printSchema()






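# declare the Python function

def cleaning_int(string):
    # return the value as an int, or None when it is not a valid integer
    try:
        x = int(string)
    except (ValueError, TypeError):
        # ValueError: non-numeric string; TypeError: the value is None (null)
        x = None
    return x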

# register the function in the Spark layer

spark.udf.register("cleaning_data", cleaning_int, IntegerType())

# register the DataFrame as a temp view
cleaning.createOrReplaceTempView("cleaning_db")
spark.sql("SELECT * FROM cleaning_db").show()

# apply the registered function to the Salary column

cleaning.withColumn('Salary', expr("cleaning_data(Salary)")).show()


Thursday, November 18, 2021
