I have an issue trying to perform text classification on the 20_newsgroups dataset loaded from sklearn. Only 6 newsgroups were selected for this case, so I have only 6 labels.
I get very low accuracy on the test dataset, and I noticed that Magpie predicts the same label for all inputs; only the confidence scores differ. When I vary the number of epochs and the vector dimensions, the model starts to predict 2-3 different labels, but performance is still very low (around 15% accuracy). What could be wrong here? A model that predicts the same output for any input is useless.
I have texts in a variable X and labels in a variable y. Then I create a folder data_six where I place every text and every label in separate .txt and .lab files using this code:
counter = 1
for i in range(len(X)):
    if y[i] in codes_to_leave:
        name_text = "data_six/" + str(counter) + ".txt"
        name_label = "data_six/" + str(counter) + ".lab"
        with open(name_text, 'w') as f1:
            f1.write(X[i])
        with open(name_label, 'w') as f2:
            f2.write(str(y[i]))
        counter += 1
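Magpie's data format pairs each N.txt document with an N.lab label file of the same basename, so the layout written above can be sanity-checked in isolation. A minimal self-contained sketch of that check (the X, y, and codes_to_leave values here are dummy stand-ins, not the real dataset):

```python
import os
import tempfile

# Dummy stand-ins for the X / y / codes_to_leave variables in the issue.
X = ["puck and stick for sale", "great hockey game last night", "sermon notes"]
y = [6, 10, 15]
codes_to_leave = {6, 10, 15}

data_dir = tempfile.mkdtemp()
counter = 1
for text, label in zip(X, y):
    if label in codes_to_leave:
        with open(os.path.join(data_dir, "%d.txt" % counter), "w") as f1:
            f1.write(text)
        with open(os.path.join(data_dir, "%d.lab" % counter), "w") as f2:
            f2.write(str(label))
        counter += 1

# Every .txt file must have a matching .lab file, or the documents
# cannot be paired with their labels.
txts = {f[:-4] for f in os.listdir(data_dir) if f.endswith(".txt")}
labs = {f[:-4] for f in os.listdir(data_dir) if f.endswith(".lab")}
print(txts == labs)   # True when the layout is consistent
print(len(txts))      # 3 document/label pairs written
```

If the two sets of basenames ever diverge, the model is being scored against labels that do not belong to the predicted documents.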
Then I train word2vec and a model. After training, I use the following function to make predictions and measure accuracy:
import os

def predict_and_evaluate(data_folder):
    filenames = os.listdir(data_folder)
    count_true = 0
    count_true_in_3 = 0
    count_all = 0
    for filename in filenames:
        if filename.endswith('.txt'):
            count_all += 1
            # read predictions from the same folder the labels live in
            prediction_list = magpie.predict_from_file(os.path.join(data_folder, filename))
            # sort the (label, confidence) pairs by confidence, highest first
            prediction_list = sorted(prediction_list, key=lambda p: p[1], reverse=True)
            prediction_name = prediction_list[0][0]
            prediction_code = label_dict[prediction_name]
            print(prediction_code)
            top3_names = [p[0] for p in prediction_list[:3]]
            top3_codes = [label_dict[name] for name in top3_names]
            with open(os.path.join(data_folder, filename[:-3] + 'lab'), 'r') as f:
                y_true = int(f.read())
            if y_true == prediction_code:
                count_true += 1
            if y_true in top3_codes:
                count_true_in_3 += 1
    accuracy = float(count_true) / float(count_all)
    accuracy_top_3 = float(count_true_in_3) / float(count_all)
    return accuracy, accuracy_top_3
As a result, every input gets "misc.forsale" or "rec.sport.hockey" (meaning one of these categories receives the highest probability for every input). When I change the number of epochs and/or vector dimensions, other categories such as soc.religion.christian may be predicted instead, but the pattern stays the same: the same prediction for every input.
Can somebody please tell me what might be the reason for such weird behavior?