I am trying to build a machine learning model with scikit-learn that predicts the category from a product description entered by the user. The program works for some inputs but gives the wrong category for others.
Here is my code:
import pandas as pd
import numpy as np
df=pd.read_excel('D:\\android\\testdata2.xlsx')
X=df['Product Description']
Y=df['Category']
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=.25,random_state=4)
from sklearn.feature_extraction.text import CountVectorizer
count_vector=CountVectorizer()
X_train_count=count_vector.fit_transform((X_train))
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer= TfidfTransformer()
X_train_tfidf=tfidf_transformer.fit_transform(X_train_count)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, Y_train)
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pickle
text_clf=Pipeline([('vect',CountVectorizer()),('tfidf',TfidfTransformer()),('clf',MultinomialNB()),])
text_clf=text_clf.fit(X_train,Y_train)
joblib.dump(text_clf,'model.pkl')
X_test1=[ 'MULTIVAC']
predicted=text_clf.predict(X_test1)
proab=text_clf.predict_proba(X_test)
print (str(predicted))
print (max(proab[0]))
Here is the data that I am using: model-data
The output with some test cases changes, for example:
I/P:X_test1=[ 'IMPLANT MAMMAIRE ANATOMIQUE ']
O/P:['IMPLANT MAMMAIRE']
0.09326544037258762
But if I shorten the input and keep only 'IMPLANT', the output changes. This should not happen, because it gives the wrong category.
I/P: X_test1=[ 'IMPLANT ']
O/P:['INSTRUMENT ELECTROCHIRURGIE']
0.09326544037258762
Another example. The category should be:
I/P: X_test1=[ 'SET FISTULE GANT']
O/P:['SET BRANCHEMENT DEBRANCHEMENT HEMODIALYSE FISTULE']
0.09326544037258762
But the output comes as:
I/P: X_test1=[ 'SET FISTULE GANT']
O/P:['BEVACIZUMAB']
0.08333333333333333
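One detail in the code above is worth checking: predict is called on X_test1, but predict_proba is called on X_test, so the printed probability belongs to the first row of the held-out split rather than to the description that was typed in. That may be why the same value appears under several different inputs. Below is a minimal sketch of reading the label and its probability from the same input. The toy descriptions, the shortened category names, and the helper predict_with_confidence are hypothetical stand-ins for illustration, not the real spreadsheet or part of the original program.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for testdata2.xlsx (not the real data).
descriptions = [
    "IMPLANT MAMMAIRE ANATOMIQUE",
    "IMPLANT MAMMAIRE ROND",
    "SET FISTULE GANT",
    "SET FISTULE COMPRESSE",
]
categories = [
    "IMPLANT MAMMAIRE",
    "IMPLANT MAMMAIRE",
    "SET FISTULE",
    "SET FISTULE",
]

# Same three-step pipeline as in the question.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(descriptions, categories)


def predict_with_confidence(model, text):
    """Return (label, probability) computed from the same single input."""
    proba = model.predict_proba([text])[0]  # one row of class probabilities
    best = int(np.argmax(proba))            # index of the most likely class
    return model.classes_[best], float(proba[best])


label, confidence = predict_with_confidence(text_clf, "IMPLANT")
print(label, confidence)
```

With this shape, the probability printed next to a label always refers to that label, which makes it easier to see how confident the model really is when a short input such as 'IMPLANT' is classified.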