We will look at basic data exploration and cleaning using a PySpark DataFrame.
Scenario:
We need to remove string values from a field that should hold integers. If a value is a string that cannot be read as an integer, it becomes null; otherwise the original value is returned.
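To make the problem concrete, here is a small plain-Python illustration that needs no Spark. The sample rows and the Name column are hypothetical; only the Salary column name comes from this post. It shows that values read from a CSV arrive as text, so a stray string can sit in a column that is meant to be numeric.

```python
import csv
import io

# Hypothetical sample data; only the Salary column name is taken from the post.
sample_csv = "Name,Salary\nAsha,1000\nRavi,abc\nMeena,2500\n"

# Without a declared schema, every CSV value is read as text.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
salaries = [row["Salary"] for row in rows]
print(salaries)  # ['1000', 'abc', '2500']
```

The goal of the cleaning step is to turn a column like this into 1000, null, 2500.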
Steps:
- read the data from a CSV file
- convert it into a DataFrame
- create the Python function
- register the Python function in Spark
- call the function on the Spark DataFrame
# import the required packages
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
# initiate the Spark session
spark = (SparkSession
         .builder
         .appName("Cleaning data")
         .getOrCreate())
# read the CSV file into a DataFrame
# No schema is declared here, so every column is read as a string.
# To declare the data types, add .schema(...) to the reader. If the schema
# has date fields, also declare the date format in the DataFrame reader,
# otherwise those fields come back as null.
cleaning = (spark.read.format("csv")
            .options(header="true")
            .load('/FileStore/tables/cleaning.csv'))
# view the schema of the DataFrame, similar to df.info() in pandas
cleaning.printSchema()
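# declare the Python function
# int() raises ValueError for non-numeric text and TypeError for None (null),
# so both are caught and turned into None
def cleaning_int(string):
    try:
        x = int(string)
    except (ValueError, TypeError):
        x = None
    return x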
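Before registering the function in Spark, it can be checked on ordinary Python values. The sketch below repeats the function so that it runs on its own; the sample values are made up for illustration, and catching TypeError alongside ValueError is an assumption added here so that a null (None) input returns None instead of raising.

```python
def cleaning_int(string):
    try:
        x = int(string)
    except (ValueError, TypeError):
        # ValueError: text that is not an integer; TypeError: None input
        x = None
    return x

# Hypothetical sample values, including edge cases
samples = ["1000", " 42 ", "abc", "12.5", "", None]
cleaned = [cleaning_int(s) for s in samples]
print(cleaned)  # [1000, 42, None, None, None, None]
```

Note that int() accepts surrounding whitespace but rejects decimal strings such as "12.5", so those also become null.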
# register the function in the Spark layer so it can be called from SQL expressions
spark.udf.register("cleaning_data", cleaning_int, IntegerType())
# register the DataFrame as a temp view so it can be queried with SQL
cleaning.createOrReplaceTempView("firecalls_db")
spark.sql("Select * from firecalls_db")
cleaning.withColumn('Salary',expr("cleaning_data(Salary)")).show()