I ran into a problem while trying to avoid data leakage from the test data into the trained model.
I fit a model on the training data, then called transform on the test data (after first running the transformations on the test data).
I saw the question "Spark, ML, StringIndexer: handling unseen labels",
but does setHandleInvalid("keep") on the StringIndexer actually solve the issue?
It didn't do the trick for me.
I run a pipeline with StringIndexer + OneHotEncoderEstimator + VectorAssembler + StandardScaler.
It first runs on the training data, and after I fit it to create a model, I call model.transform(testData).
I get an error:
stage 936.0 failed 4 times, most recent failure: Lost task 0.3 in stage 936.0 (TID 159221, 172.28.5.87, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
....
Caused by: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 51947, y.size = 178469
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
I think it happens because I don't apply all the transformations to the new data as well up front, when creating the model.
I tried using setHandleInvalid("keep") on both the StringIndexer and the OneHotEncoderEstimator, but it didn't help.
I'm running PySpark 2.3.1.