I am using PySpark on a large cluster, where I am trying to filter for rows in which at least one of the fields (email1 / phone1) matches its respective regex.
I have done it in the two ways shown below:
1. Raw SQL expression passed to df.filter()
df_fl.filter(
    '''( ( email1 rlike '{email_regex}' ) and ( email1 is not null ) )
       or ( ( phone1 rlike '{phone_regex}' ) and ( phone1 is not null ) )'''
    .format(email_regex=email_regex, phone_regex=phone_regex)
).count()
---> 796,034
Below is the equivalent syntax, which gives a different result:
2. PySpark built-in DataFrame API
df_fl.filter(
    (df_fl.email1.rlike(email_regex) & df_fl.email1.isNotNull())
    | (df_fl.phone1.rlike(phone_regex) & df_fl.phone1.isNotNull())
).count()
---> 95,614
I am unable to figure out why these two syntaxes yield different results even though both seem to perform the same operation logically. Also, which of the two gives the correct result? There are 1.6 million rows in total in the DataFrame.
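To make the comparison concrete, here is a minimal, self-contained sketch I put together (toy data and a toy pattern, not my real regexes) showing one way the two styles can diverge when the pattern contains backslashes, assuming the default setting spark.sql.parser.escapedStringLiterals=false: the SQL parser unescapes string literals once more before the pattern reaches the regex engine, while the DataFrame API hands the Python string over unchanged.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('a.c',), ('abc',)], ['email1'])
pattern = 'a\\.c'  # the Python string a\.c, i.e. a literal dot in regex terms

# SQL expression: the parser unescapes 'a\.c' to 'a.c', where '.' matches
# any character, so both rows pass the filter.
print(df.filter("email1 rlike '{p}'".format(p=pattern)).count())  # 2

# DataFrame API: the pattern reaches the regex engine as a\.c, so only the
# row containing a literal dot passes.
print(df.filter(df.email1.rlike(pattern)).count())  # 1

Calling .explain() on each filtered DataFrame also shows the RLIKE pattern that survives parsing, which makes any such discrepancy visible in the plan. I am not sure whether this fully explains my numbers, though.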