Different results for pyspark Dataframe filter using raw query and dataframe API

I am using PySpark on a large cluster, where I am trying to filter out rows in which at least one of the fields (email1 / phone1) matches its respective regex.

I have done it in the two ways shown below:

  1. Raw query string inside pyspark `df.filter()`

    df_fl.filter('''( (email1 rlike '{email_regex}') and (email1 is not null) )
            or ( (phone1 rlike '{phone_regex}') and (phone1 is not null) )'''
           .format(email_regex=email_regex, phone_regex=phone_regex)).count()
    

---> 796,034

The equivalent syntax below yields a different result:

2. PySpark built-in DataFrame API

    df_fl.filter( (df_fl.email1.rlike(email_regex) & df_fl.email1.isNotNull()) |
                  (df_fl.phone1.rlike(phone_regex) & df_fl.phone1.isNotNull())
                ).count()

-----> 95,614

I am unable to figure out why these two syntaxes yield different results even though both seem to perform the same logical operation. Also, which one gives the correct result, given that the DataFrame has 1.6 million rows in total?
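One plausible cause (an assumption, since the post does not show the regexes): Spark's SQL parser processes backslash escapes inside string literals, so a pattern like `\w` embedded in a raw query string can reach `rlike` as just `w`, whereas the DataFrame API passes the Python string through unchanged. A minimal pure-Python sketch of that effect, using a hypothetical `email_regex`:

```python
import re

# Hypothetical email regex of the kind the question might use
# (the actual email_regex is not shown in the post).
email_regex = r'^[\w.+-]+@[\w-]+\.[\w.]+$'

# Simulate the backslashes being consumed one extra time, as a SQL
# string-literal parser may do to a pattern embedded in a raw query.
degraded = email_regex.replace('\\', '')

sample = 'alice@example.com'
print(bool(re.match(email_regex, sample)))  # True
print(bool(re.match(degraded, sample)))     # False: \w became a literal w
```

If this is the cause, doubling the backslashes in the raw-query version (or comparing the two filtered DataFrames with `exceptAll`) should reveal which count is correct.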
