Spark ML: running a trained PipelineModel on new (test) data gives an error

I ran into a problem while trying to avoid data leakage from the test data into the trained model. I trained a model on the training data only, then called transform on the test data (after first running the transformations on the test data). I saw the question "Spark, ML, StringIndexer: handling unseen labels", but does setHandleInvalid("keep") on the StringIndexer actually solve the issue?

It didn't do the trick for me.

My pipeline consists of StringIndexer + OneHotEncoderEstimator + VectorAssembler + StandardScaler.

I first fit it on the training data to create a model, and then I call model.transform(testData).

I get an error:

stage 936.0 failed 4 times, most recent failure: Lost task 0.3 in stage 936.0 (TID 159221, 172.28.5.87, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
....    
Caused by: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 51947, y.size = 178469
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)

I think this happens because I don't also run all the transformations on the new data at the beginning, when creating the model.
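To make the size mismatch in the trace concrete, here is a minimal plain-Python sketch (not Spark code, and not a diagnosis of this particular job). It assumes, without confirmation from the post, that the encoding step ends up fitted on a different dataset than the one the downstream weights were learned from. The helpers fit_one_hot, encode and dot are hypothetical stand-ins, and the alphabetical tie-breaking is my own choice.

```python
from collections import Counter

def fit_one_hot(values):
    """Learn a label -> index map from the values seen at fit time."""
    counts = Counter(values)
    # Most frequent label first; ties broken alphabetically (an assumption).
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(labels)}

def encode(index, value):
    """One-hot encode a value; the vector width equals the fitted vocabulary size."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec

def dot(x, y):
    """Dot product that rejects mismatched sizes, similar to the check in the trace."""
    if len(x) != len(y):
        raise ValueError(
            "non-matching sizes: x.size = %d, y.size = %d" % (len(x), len(y))
        )
    return sum(a * b for a, b in zip(x, y))

train = ["a", "b", "a", "c"]
test = ["a", "b"]

train_index = fit_one_hot(train)      # fitted once, on the training data only
weights = [0.5] * len(train_index)    # stand-in for values learned at the training width

# Reusing the fitted index keeps the test vectors at the training width.
scores = [dot(encode(train_index, v), weights) for v in test]

# Re-fitting on the test data changes the width, so the same dot product fails.
refit_index = fit_one_hot(test)
try:
    dot(encode(refit_index, "a"), weights)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
```

If something like this is what happens in the real job, comparing the feature vector length produced for the training data with the one produced for the test data would show it.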

I tried using setHandleInvalid("keep") on the StringIndexer and the OneHotEncoderEstimator, but it didn't help.
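For reference, this is roughly what the handleInvalid option controls on an indexer, sketched in plain Python rather than with Spark itself. The three modes mirror the documented StringIndexer options ("error", "skip", "keep"). The function names fit_indexer and transform are hypothetical, and the tie-breaking order is an assumption.

```python
from collections import Counter

def fit_indexer(values):
    """Return the labels ordered by descending frequency (ties alphabetical, an assumption)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))

def transform(labels, values, handle_invalid="error"):
    """Map values to indices; unseen values are handled according to handle_invalid."""
    lookup = {label: i for i, label in enumerate(labels)}
    out = []
    for v in values:
        if v in lookup:
            out.append(lookup[v])
        elif handle_invalid == "keep":
            out.append(len(labels))  # extra bucket at index numLabels
        elif handle_invalid == "skip":
            continue                 # drop the row with the unseen label
        else:
            raise ValueError("Unseen label: %s" % v)
    return out

labels = fit_indexer(["a", "b", "a", "c"])
kept = transform(labels, ["a", "d", "c"], handle_invalid="keep")
skipped = transform(labels, ["a", "d", "c"], handle_invalid="skip")
```

Note that "keep" only decides where an unseen label goes. It does not by itself make vectors produced by two separately fitted encoders the same size.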

I am running PySpark 2.3.1.
