WER for conformer update #124
Comments
@gandroz Wow cool, if you got the same result for |
And one more thing is that there's a very small difference between |
I'll try to continue training for several more epochs; training does not seem to have finished yet. I'll read the paper again to look for any clue on how to reduce the WER even more. |
@gandroz You should check or regenerate the transcript file, maybe when creating |
I checked both files, my config file too and got the same results. So
weird. I'll try to debug to find any mistake
On Sat, 23 Jan 2021 at 13:03, Nguyễn Lê Huy <[email protected]> wrote:
… @gandroz <https://github.com/gandroz> You should check or generate the
transcript file again, may be when creating test-other transcript file,
you point to the test-clean directory.
If everything is right, then it's so weird haha 😆
|
I found why I always got the same test metrics.... I tested on the |
@gandroz Can you post your full config file you are using to generate the ~5% WER results? Thanks!!! |
@ncilfone sure !
I used a sentencepiece (unigram) model as vocab, currently trying with the BPE version |
Thanks @gandroz! Is that the vocab here: vocabularies/librispeech_train_4_1030.subwords Edit: Based on the config it seems like you might generate one before training? Also is this just single GPU training? |
no it's not that vocab. However, you can train yours with |
Yeah just realized that you generate it based on the config options. Thanks for letting me know! I'm assuming you are doing the featurization of the WAV files in TF as the stft etc. should be a bit faster on the GPU. DALI might be another place to look too although I've never used it... |
Final question I promise... It looks like you are using and tokens in SentencePiece but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one? |
I think the best way to accelerate processing is to pre-process the fbank features, just as it is done in fairseq.
I'm not sure I understand your question. SentencePiece is an unsupervised text tokenizer and detokenizer, so you have to train a model on the LibriSpeech transcripts. During training, the tokenized transcripts in each batch are padded to the length of the longest one in that batch. |
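The per-batch padding described above can be illustrated with a minimal sketch. This is not the repository's featurizer code: the function name `pad_batch`, the padding ID `0`, and the token IDs are all assumptions made up for the example.

```python
from typing import List, Sequence


def pad_batch(token_ids: Sequence[Sequence[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every tokenized transcript to the longest one in the batch.

    pad_id=0 is an assumption; use whatever ID your tokenizer reserves for padding.
    """
    max_len = max((len(ids) for ids in token_ids), default=0)
    return [list(ids) + [pad_id] * (max_len - len(ids)) for ids in token_ids]


# Three made-up tokenized transcripts of different lengths.
batch = [[12, 7, 431], [5], [88, 19]]
padded = pad_batch(batch)
print(padded)
```

Because the target length is computed per batch, short utterances grouped together cost less padding than if every transcript were padded to the longest one in the whole dataset.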
Ugh forgot that markdown will remove the notation I used... This is what I meant... It looks like you are using |
Oh I see. You are right, the transcripts do not have those tokens, and they are useless as far as I understand it. However, you can add them when encoding some text. You can find more details in the repo, and I've just realized that there is a TensorFlow binding... I think I'll try it instead of the Python implementation I used. |
Hi @gandroz , |
@tund not yet, it took me a week to test on |
Thanks for your reply @gandroz . |
Indeed, I could just perform greedy search for this test. In the near future perhaps... |
@gandroz any chance you can post your loss curves? |
How are you able to achieve such good results with your models? I've trained a conformer subword model, but it stops improving after ~20 epochs. I've updated the Keras trainer to use EarlyStopping, which stops the training process after 5 epochs without improvement in validation loss. What am I missing? Train data: 50 hrs. Audio lengths: not sure:
The test results are complete rubbish:
config
|
@mjurkus Could you show the loss curves? |
@mjurkus my training was performed on the LibriSpeech data, with 960 h of data for training. ASR needs lots of data to converge, so maybe you need more. Furthermore, maybe the LibriSpeech data is cleaner than yours? I also have some proprietary data, but it is way worse than LibriSpeech (not even the same sampling rate). But perhaps you could share the training curves? |
Your model does not seem to learn anything.... Try to reduce your LR, explore some data augmentation as it could help. |
Using the conformer with characters worked way better than using subwords. I managed to get decent results (WER ~15%), but I do not have the graphs for those. Regarding augmentation: I figured that this config enables augmentation.
|
I've just finished training with espnet, with everything the same except join_dim=640. The resulting WER is 4.9 on test_clean and 11.9 on test_other. How can I get the results from the Conformer paper? @gandroz have you received any reply from the Conformer authors? |
@jinggaizi What vocabulary size did you use: 1k, 4k, or English characters (around 28)? |
1k |
@jinggaizi no, I have no news from the author. I could try to email him again, he's smart. However, I am surprised by the WER you achieved with ESPNET. They say they had much better results (however, I suspect it was not with the small model, but anyway). Did you use the RNNT or a transformer as the decoder? When ESPNET announced they had the same or better results than the paper, it was with a transformer, as you can see in their sources. Maybe you could ask ESPNET how they manage to achieve such good results: on which machine, with which config, etc. |
@usimarit thanks for your reply. My result used RNNT as the decoder: the encoder is a small-size conformer, the decoder is 1 LSTM layer (dim=320), and the dimension of the joint network is 640. espnet (https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1) has no RNNT result, and I suspect its results are better because of speed augmentation. |
@gandroz hi, do you have any news from the author? Do you train the model on GPU or TPU? Have you ever tried a larger batch size? I assume Google always uses a larger batch size. I have only worked on a Titan Xp with a small batch size; maybe a larger batch size can improve the result of the transducer. |
@jinggaizi I've run it with a batch size of 2048 (which is what I think they used in the original paper taken from this ref here http://arxiv.org/abs/2011.06110) via batch accumulation on 8 GPUs (with a joint dim of 320) for days and I can barely get below 5.9% on dev-clean. |
It seems like a larger batch size doesn't work; I have no new ideas. (Sent by email in reply to Nicholas Cilfone's message of Tuesday, 23 February 2021, 10:02 PM.)
|
@ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient. |
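To make the point about batch accumulation concrete, here is a minimal sketch on a toy one-parameter least-squares model (an assumption chosen for illustration, not the conformer or the repository's trainer). It shows that summing per-example gradients over several micro-batches and normalizing once gives the same gradient as one large batch; only the wall-clock cost differs. The names `grad_sum` and `accumulated_grad` are hypothetical.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def grad_sum(w: float, batch: List[Example]) -> float:
    """Sum over the batch of d/dw (w * x - y)^2 for a toy linear model."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def accumulated_grad(w: float, data: List[Example], micro_batch_size: int) -> float:
    """Accumulate gradients over micro-batches, then normalize by the full batch size."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        total += grad_sum(w, data[start:start + micro_batch_size])
    return total / len(data)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
full = grad_sum(0.5, data) / len(data)                    # one batch of 4
accum = accumulated_grad(0.5, data, micro_batch_size=2)   # two micro-batches of 2
print(full, accum)
```

Under this scheme the effective batch size is the micro-batch size times the number of accumulation steps times the number of replicas. The gradients match, but each optimizer update still needs several sequential forward and backward passes, which is why a true large batch on larger hardware is faster.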
@ncilfone which GPU model did you use with the 2048 batch size? Did you improve the RNNT training as described in https://arxiv.org/pdf/1909.12415.pdf? |
Just a follow-up with the author of the paper. I asked him for some clues to find out how we can achieve the same results. I asked about the dataset and whether the model was pre-trained or not, and asked for details on the hyperparameters that are not always mentioned in the paper.
So maybe a major difference comes from the batch size, which is... HUGE! I really don't know how they manage to train the large (or even the small) model with so much data. Maybe an avenue could be to split the model over multiple GPUs instead of replicating the model on multiple GPUs. We could surely increase the batch size by doing so. |
Thanks @gandroz, they have their HUGE TPUs, that's why they're able to get SOTA results. I'll try to implement gradient accumulation in keras builtin function and test on colab TPUs, hope it will get nearer to their result. |
Hi @usimarit, I see a high-bias issue: rnnt_loss stays in the 240s and does not go down further in the conformer trainer (both the keras and non-keras versions). I tried learning rates of 0.5/sqrt(dmodel), 0.05/sqrt(dmodel), and 0.005/sqrt(dmodel) with the 960 hours of LibriSpeech. There is not much difference in the loss curve. Please let me know if I need to modify anything in the config file to train a model that matches the WER performance of the reference latest.h5 (WER of 6.5 in my testing). Thanks |
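For readers comparing the three learning rates above, the following is a hedged sketch of a Transformer-style schedule with linear warmup and inverse-square-root decay, normalized so that the peak equals `peak_factor / sqrt(d_model)`. Whether the repository's schedule uses exactly this normalization is an assumption, and `d_model=144` and `warmup_steps=10000` are illustrative values, not taken from the thread.

```python
import math


def transformer_lr(step: int, d_model: int, warmup_steps: int, peak_factor: float) -> float:
    """Linear warmup to peak_factor / sqrt(d_model), then decay proportional to 1 / sqrt(step)."""
    peak = peak_factor / math.sqrt(d_model)
    step = max(step, 1)  # avoid division by zero at step 0
    return peak * min(step / warmup_steps, math.sqrt(warmup_steps / step))


# Illustrative values only (assumptions, not the thread's config).
D_MODEL, WARMUP = 144, 10_000
for factor in (0.5, 0.05, 0.005):
    peak_lr = transformer_lr(WARMUP, D_MODEL, WARMUP, factor)
    print(f"peak_factor={factor}: peak learning rate {peak_lr:.6f}")
```

With this normalization, changing the factor by 10x scales the whole curve by 10x without moving the warmup point, which can help when checking that a config change actually reaches the optimizer.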
@MadhuAtBerkeley I trained with that config on google drive, except that I used |
@usimarit Thanks! I confirm that use_tf:False does help, and now I see the loss curve going below 100. |
Why does setting use_tf to False help the training, given that the tf version and the numpy version perform similar methods? |
Hi, could you please post your config in espnet? |
The only difference is the numpy version uses |
batch-size: 6
criterion: loss
etype: transformer
transformer-lr: 10
transformer-enc-positional-encoding-type: rel_pos
rnnt-mode: 'rnnt' # switch to 'rnnt-att' to use transducer with attention
|
@gandroz hi, have you had any response from the author? I am running some experiments with a small batch size. Have you tried other methods to improve the result? |
@jinggaizi No, no news from the author yet; I'll let you know as soon as I have some. I cannot work on the project for the moment, so no news from me either. |
Hey, thanks for the updated config! Any rough estimates of how long it took to train (I'm guessing a few days at least)? Also, any luck with pre-computing fbanks? |
Hello, is gradient accumulation not supported in the latest version (v1.0.0)? |
@changji-ustc I haven't added support for it in the keras training loop yet; I'm working on this. |
@usimarit Have you been able to get a better WER with conformer? I see a lot of changes in the word piece branch. With mixed precision and batch size 16 (effective batch size 96), the best Librispeech WER I've gotten is 6.4%. With a medium conformer model with mixed precision and batch size 12 (effective batch size 72), the best WER I've gotten is 4.6%. (Warmup steps 40k, with only Librispeech) |
Hi,
I've just finished training a conformer using the sentencepiece featurizer on LibriSpeech for 50 epochs.
Here are the results if you want to update your readme:
Test results:
G_WER = 5.22291565
G_CER = 1.9693377
B_WER = 5.19438553
B_CER = 1.95449066
BLM_WER = 100
BLM_CER = 100
The strange part is that I got the same metrics on the test-other dataset, hmmm...
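As a reference for how WER and CER figures like those above are usually defined, here is a minimal sketch based on Levenshtein edit distance over words or characters. It is not the repository's metric code; `edit_distance` and `error_rate` are hypothetical names, and the sentences are made up.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "word") -> float:
    """Corpus-level error rate in percent; unit is 'word' for WER or 'char' for CER."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * errors / total


refs = ["the cat sat on the mat"]
hyps = ["the cat sit on mat"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER = {wer:.2f}%, CER = {cer:.2f}%")
```

Since the score depends on both the hypotheses and the reference transcripts, identical numbers on two different test sets are a strong hint that the same files were evaluated twice.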