Multi-label classification on a large dataset with over 600 labels

I'm trying to train a multi-label text classifier on a dataset of about 1 million rows. After cleaning the data, I'm using a sparse matrix of Word2Vec features (feature size 300).

The columns I have are: 1. ID, 2. Dictionary, 3. Label.

The dictionary size varies from 10 keys to 900 keys.

The steps I followed on the Dictionary column are:

  1. Converting the dictionary to a string
  2. Keeping only the useful tokens from the string
  3. Removing stopwords
  4. Stemming the words
  5. Training a Word2Vec model with feature size 300
  6. Word2Vec feature generation
  7. Label encoding
  8. Converting the feature vectors to a NumPy array
  9. Converting the NumPy array to a sparse matrix of shape (1114220, 300)
  10. Trying a OneVsRest model for training
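To make the pipeline concrete, here is a minimal runnable sketch of steps 1-9 on toy data. The record layout, the helper names, the stopword list, and the 4-dimensional stand-in vectors are all illustrative assumptions; in the real pipeline the lookup would come from a trained gensim Word2Vec model with a vector size of 300, and stemming is skipped here for brevity:

```python
import re
import numpy as np
from scipy.sparse import csr_matrix

STOPWORDS = {"the", "a", "an", "of", "and"}  # placeholder list

def dict_to_tokens(d):
    """Steps 1-3: flatten the dict to a string, keep word tokens,
    drop stopwords."""
    text = " ".join(f"{k} {v}" for k, v in d.items())
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def doc_vector(tokens, wv, dim):
    """Step 6: average the word vectors of the tokens -- one common way
    to turn per-word Word2Vec vectors into a fixed-size document feature."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy stand-in for a trained model's keyed vectors (dim=4 instead of 300).
rng = np.random.default_rng(0)
wv = {w: rng.normal(size=4) for w in ["server", "error", "disk", "timeout"]}

records = [{"msg": "disk error", "src": "server"}, {"msg": "timeout"}]
X = np.vstack([doc_vector(dict_to_tokens(r), wv, 4) for r in records])
X_sparse = csr_matrix(X)   # steps 8-9 (dense vectors gain little from this)
print(X_sparse.shape)      # (2, 4)
```

Note that averaged Word2Vec vectors are dense, so wrapping them in a sparse matrix (step 9) doesn't actually save memory.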

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

onevsrest = OneVsRestClassifier(SVC(probability=True), n_jobs=-1)
onevsrest.fit(sparse_matrix, df.labels)

I ran this model for nearly two days before the process was killed automatically.

I also tried logistic regression:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=1, dual=False, solver='saga', n_jobs=-1)
lr.fit(sparse_matrix, df.labels)


I still faced the same issue: the model keeps training for about two days and then gets killed.
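One lighter-weight variant I'm wondering about (not yet run on the full data) is keeping the same OneVsRest wrapper but swapping the kernel SVC, with its probability calibration, for LinearSVC, which fits one linear model per label and tends to scale much better on large sample counts. The snippet below is just a sketch on toy data to show the shape of the call; the sizes and labels are placeholders, not the real dataset:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X_toy = rng.rand(60, 5)    # stand-in for the (1114220, 300) feature matrix
y_toy = np.arange(60) % 3  # stand-in for the encoded labels

# One binary LinearSVC per label, trained in parallel.
clf = OneVsRestClassifier(LinearSVC(), n_jobs=-1)
clf.fit(X_toy, y_toy)
print(clf.predict(X_toy).shape)  # (60,)
```

Would this kind of linear one-vs-rest setup be a reasonable substitute at this scale?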

Am I doing something wrong, or is there a better way to approach this type of problem?
