I'm trying to train a multi-label classification model on text data that comprises 1 million rows. After cleaning the data, I'm using a sparse matrix of Word2Vec features (feature size is 300).
The data I have consists of three columns:
1. ID
2. Dictionary
3. Label
The dictionary size varies from 10 keys to 900 keys.
The steps I followed on the Dictionary column are:
- Converted the dictionary to a string
- Kept only good tokens from the string
- Removed stopwords
- Stemmed the words
- Trained a Word2Vec model with a feature size of 300
- Generated the Word2Vec features
- Label-encoded the labels
- Converted the feature vectors to a NumPy array
- Converted the NumPy array to a sparse matrix of shape (1114220, 300)
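For context, the feature-generation part looks roughly like the sketch below (simplified: df is my pandas DataFrame, the 'dictionary' and 'labels' column names are placeholders for my actual ones, and here I just average each row's word vectors to get a single 300-dimensional vector; I'm using gensim 4.x and NLTK):

import numpy as np
from scipy.sparse import csr_matrix
from gensim.models import Word2Vec
from nltk.corpus import stopwords          # needs nltk.download('stopwords') once
from nltk.stem import PorterStemmer
from sklearn.preprocessing import LabelEncoder

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean(row_dict):
    # step 1: flatten the dictionary into one string (keys and values)
    text = ' '.join(f'{k} {v}' for k, v in row_dict.items())
    # steps 2-4: keep alphabetic tokens, drop stopwords, stem the rest
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

corpus = df['dictionary'].apply(clean).tolist()   # 'dictionary' is a placeholder column name

# step 5: train Word2Vec with 300-dimensional vectors (gensim 4.x API)
w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=2, workers=4)

# step 6: one feature vector per row, here the average of its word vectors
def doc_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

# steps 7-9: label encoding, NumPy array, then sparse matrix of shape (1114220, 300)
features = np.vstack([doc_vector(tokens) for tokens in corpus])
labels = LabelEncoder().fit_transform(df['labels'])   # 'labels' is also a placeholder
sparse_matrix = csr_matrix(features)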
Then I tried a OneVsRest model for training:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
onevsrest = OneVsRestClassifier(SVC(probability=True), n_jobs=-1)
onevsrest.fit(sparse_matrix, df.labels)
This model was training for nearly two days before the process got killed automatically.
I also tried logistic regression:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l1', C=1, dual=False, solver='saga', n_jobs=-1)
lr.fit(sparse_matrix, df.labels)
I still faced the same issue (the model keeps training for two days and then gets killed).
Am I doing something wrong? Or is there a better way to approach this type of problem?