Question about prediction accuracy improvement #161
hey @dyf180615, good questions! The problem is complex and I can't fix it for you with a GitHub comment, but here are a few tips:
Hope that helps a bit!
@jstypka thanks so much for your tips! I saw your reply before I went to sleep yesterday; I even dreamed about some tips similar to yours, but I forgot everything after waking up in the morning :)
@jstypka
A very good tool!
I have used it in a real application scenario.
The scenario is as follows:
Chinese documents about environmental penalties, with 26 labels indicating the type of problem found at the enterprise, and each document carrying 1-5 labels. The document-length distribution is: about 5000 documents in the 0-49 byte range, 3300 in 50-100 bytes, 1100 in 100-149 bytes, and 500 in 150-700 bytes. The problem with the labels is that 3-4 of them dominate, accounting for up to 80% cumulatively; these labels often appear alone or in combination, their associations are complex, and the text describing them is similar.
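For context, the imbalance described above is easy to quantify by counting label frequencies across the dataset. A minimal sketch with Python's standard library, using made-up label names and documents (the real data would be the 26 penalty-type labels):

```python
from collections import Counter

# Hypothetical sample: each document carries a list of label ids (made-up data)
doc_labels = [
    ["waste_water"], ["waste_water", "air_emission"],
    ["air_emission"], ["waste_water"], ["noise", "dust"],
]

counts = Counter(label for labels in doc_labels for label in labels)
total = sum(counts.values())

# Cumulative share of the most frequent labels, to see how few labels
# cover e.g. 80% of all label assignments
cumulative = 0.0
for label, count in counts.most_common():
    cumulative += count / total
    print(f"{label}: {count} ({cumulative:.0%} cumulative)")
```

Running this on the real dataset would show exactly which labels make up the dominant 80% and how steep the drop-off is.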
The setup is this: I use 10,000 documents, 8,000 for training and 2,000 for testing, with a CNN, word-vector dimension 128, batch size 64, and epochs=10. The trained model is then used to predict labels for the same 10,000 documents it was trained on.
The results are as follows: I keep predictions with confidence greater than 0.2, capped at the top 5 when more than 5 labels pass. About 5,000 documents (roughly 66%) are completely correct, and 90% of those are single-label; only about 5% are completely wrong, meaning no correct label was predicted at all. 15% are over-predicted, i.e. extra labels appear alongside correct hits, and 5% are under-predicted, i.e. some true labels are missing although others are hit. I am very satisfied with the single-label results. However, the multi-label problems are very prominent: over-prediction is a real headache, and most of the over- and under-predictions involve the 3-4 dominant labels I mentioned, which are harder to distinguish.
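One way to make "completely correct", "over-predicted" and "under-predicted" precise is to compare the predicted and true label sets per document. A minimal sketch (the function name and example sets are invented for illustration):

```python
def compare_labels(true_set, pred_set):
    """Classify one document's predicted label set against its true label set."""
    if pred_set == true_set:
        return "exact"   # completely correct
    if not pred_set & true_set:
        return "wrong"   # no correct label predicted
    if pred_set > true_set:
        return "over"    # all true labels hit, plus extras (over-prediction)
    if pred_set < true_set:
        return "under"   # subset of true labels, nothing extra (under-prediction)
    return "mixed"       # some hits, some misses, some extras

print(compare_labels({"a", "b"}, {"a", "b"}))       # exact
print(compare_labels({"a", "b"}, {"a", "b", "c"}))  # over
print(compare_labels({"a", "b"}, {"a"}))            # under
```

Counting these categories over the test set (rather than the training set) would give a less optimistic but more honest picture of the numbers above.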
So, in your view, how can I improve accuracy to what I want: 95% exact match on single-label documents, and fewer over-predicted labels, so that the overall label accuracy reaches 80-90%? Is that difficult?
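The filtering rule described above (confidence greater than 0.2, at most 5 labels) can be sketched as a small post-processing step on the model's per-label scores; the scores below are made up:

```python
def select_labels(scores, threshold=0.2, top_k=5):
    """Keep labels scoring above the threshold, capped at top_k by score."""
    passed = [(label, s) for label, s in scores.items() if s > threshold]
    passed.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in passed[:top_k]]

scores = {"waste_water": 0.81, "air_emission": 0.35, "noise": 0.19, "dust": 0.05}
print(select_labels(scores))  # ['waste_water', 'air_emission']
```

One knob worth trying: instead of a single global threshold, tune a per-label threshold on a held-out validation set; dominant labels that tend to be over-predicted can then be given a stricter cutoff than rare ones.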
I also have some additional questions:
1. During word segmentation I tried removing stop words that I judged removable, and forcing some common domain terms into a custom dictionary. Does this help the results? What I observed is that the model distinguishes the differing parts of the labels better, but the confidence of some labels that were previously predicted with high confidence also drops.
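On question 1: stop-word removal on already-segmented text is just a token filter. A minimal sketch, assuming the documents are pre-segmented into tokens (the stop list and tokens here are invented):

```python
STOP_WORDS = {"的", "了", "和"}  # hypothetical stop list

def remove_stop_words(tokens):
    """Drop stop-word tokens after segmentation, keeping order."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["企业", "的", "废水", "排放", "了"]
print(remove_stop_words(tokens))  # ['企业', '废水', '排放']
```

Note that removing tokens also removes them from the word2vec training contexts, which shifts the surrounding embeddings; that is one plausible reason previously confident labels lose confidence after adding stop words.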
2. For word-vector training I use only the 10,000 documents; is that too little to establish an appropriate distance space for each word vector? Would it be better to use word vectors pretrained on someone else's 120 GB Chinese corpus? Do I need to worry about the dictionaries differing? (I used an extra custom dictionary.)
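On question 2: before switching to externally pretrained vectors, it is worth checking how much of your corpus vocabulary, including the custom-dictionary terms, the pretrained vocabulary actually covers. A minimal sketch with made-up vocabularies:

```python
def vocab_coverage(corpus_vocab, pretrained_vocab):
    """Fraction of corpus tokens present in the pretrained embedding vocab."""
    if not corpus_vocab:
        return 0.0, set()
    missing = corpus_vocab - pretrained_vocab
    covered = 1.0 - len(missing) / len(corpus_vocab)
    return covered, missing

corpus_vocab = {"环保", "处罚", "废水", "自定义词"}  # includes a custom-dictionary term
pretrained_vocab = {"环保", "处罚", "废水", "企业"}  # hypothetical pretrained vocab
covered, missing = vocab_coverage(corpus_vocab, pretrained_vocab)
print(f"coverage: {covered:.0%}, missing: {missing}")
```

Tokens missing from the pretrained vocabulary, often exactly the custom-dictionary terms that the external corpus segmented differently, fall back to unknown or randomly initialized vectors, so the dictionary mismatch does matter and is worth measuring before committing.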
3. Regarding the attention mechanism: is it applicable to my scenario, i.e. increasing the model's attention to specific words in the text, and how would I apply it to magpie?
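On question 3: magpie does not ship an attention layer, so adding one means modifying its Keras model, but the mechanism itself is simple enough to sketch without any framework: score each word vector against a context vector, softmax the scores, and take the weighted sum. All numbers here are toy values; in a real model the context vector is learned:

```python
import math

def attention_pool(word_vectors, context):
    """Softmax-weighted average of word vectors, scored against a context vector."""
    scores = [sum(w * c for w, c in zip(vec, context)) for vec in word_vectors]
    max_s = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vectors[0])
    pooled = [sum(weights[i] * word_vectors[i][d] for i in range(len(word_vectors)))
              for d in range(dim)]
    return pooled, weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d word vectors
context = [1.0, 0.0]                            # hypothetical learned context vector
pooled, weights = attention_pool(vectors, context)
```

In Keras terms this would be a custom layer inserted between the convolutional output and the pooling step; whether it helps on short documents with a few dominant, similarly worded labels is an empirical question worth a small experiment.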
4. Are there other methods you feel are likely to improve prediction accuracy further? Thank you very much; I hope to get your help.