* Upgrade PyTorch Lightning version in requirements
Signed-off-by: Abhishree <[email protected]>
* Initial fixes for PTL2.0
Signed-off-by: Abhishree <[email protected]>
* Add further fixes to support lightning 2.0
Signed-off-by: Abhishree <[email protected]>
* Add replacements for replace_sampler_ddp, resume_from_checkpoint_fit_path and a few occurrences of validation_epoch_end
Signed-off-by: Abhishree <[email protected]>
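For reference, a minimal sketch of the two replacements (argument names per the Lightning 2.0 API; `model` stands in for any LightningModule):

```python
import pytorch_lightning as pl

# PTL 1.x: pl.Trainer(replace_sampler_ddp=False, resume_from_checkpoint="last.ckpt")
# PTL 2.0: the sampler flag is renamed, and resumption moves to fit():
trainer = pl.Trainer(use_distributed_sampler=False)  # was replace_sampler_ddp
trainer.fit(model, ckpt_path="last.ckpt")            # was resume_from_checkpoint
```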
* Replace all occurrences of validation_epoch_end with on_validation_epoch_end
Signed-off-by: Abhishree <[email protected]>
* Replace training_epoch_end, test_epoch_end with on_train_epoch_end and on_test_epoch_end respectively
Signed-off-by: Abhishree <[email protected]>
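The recurring pattern behind these renames, as a minimal sketch (not any specific NeMo model; `_compute_loss` is a hypothetical helper): Lightning 2.0 no longer passes step outputs to the epoch-end hooks, so the module collects them itself.

```python
import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.validation_step_outputs = []  # replaces the removed `outputs` argument

    def validation_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # hypothetical helper
        self.validation_step_outputs.append(loss)
        return loss

    def on_validation_epoch_end(self):  # was: validation_epoch_end(self, outputs)
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log("val_loss", avg_loss)
        self.validation_step_outputs.clear()  # free memory between epochs
```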
* Change logger=None to logger=False in Trainer object
Signed-off-by: Abhishree <[email protected]>
* Remove Trainer args deprecated in PTL 2.0 from the TrainerConfig dataclass
Signed-off-by: Abhishree <[email protected]>
* Modify trainer.precision check and other small edits
Signed-off-by: Abhishree <[email protected]>
* Replace logger=None with logger=False in test_ptl_stateless_timer.py Trainer
Signed-off-by: Abhishree <[email protected]>
* Add default values for args to fix AttributeError
Signed-off-by: Abhishree <[email protected]>
* Add the following modifications
1) Remove outputs arg from on_validation_epoch_end, on_test_epoch_end and store the outputs as an instance attribute of the class
2) Replace resume_from_checkpoint with ckpt_path as needed
3) Explicitly add accelerator='cpu' in UTs being run on CPU
Signed-off-by: Abhishree <[email protected]>
* Remove outputs arg from on_validation_epoch_end, on_test_epoch_end
Signed-off-by: Abhishree <[email protected]>
* Remove outputs arg in on_validation_epoch_end in MultiBinaryAccuracy docstrings
Signed-off-by: Abhishree <[email protected]>
* Add val, test outputs as instance vars in PunctuationCapitalizationModel and TokenClassificationModel
Signed-off-by: Abhishree <[email protected]>
* Replace trainer.fit_loop.max_steps with trainer.fit_loop.epoch_loop.max_steps in test_optimizers_schedulers.py
Signed-off-by: Abhishree <[email protected]>
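For context, a sketch of the attribute move that test relied on (`expected_max_steps` is illustrative):

```python
# PTL 1.x: trainer.fit_loop.max_steps
# PTL 2.0: max_steps lives on the epoch loop inside the fit loop
assert trainer.fit_loop.epoch_loop.max_steps == expected_max_steps
```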
* Revert an extra space that was mistakenly added
Signed-off-by: Abhishree <[email protected]>
* Use self.validation_step_outputs and self.test_step_outputs in test_ema.py for uniformity
Signed-off-by: Abhishree <[email protected]>
* Use self.validation_step_outputs and self.test_step_outputs in test_ptl_stateless_timer.py and check_for_ranks.py for uniformity
Signed-off-by: Abhishree <[email protected]>
* Add self.validation_step_outputs.clear() and self.test_step_outputs.clear() wherever missing
Signed-off-by: Abhishree <[email protected]>
* Remove outputs arg from on_train_epoch_end
Signed-off-by: Abhishree <[email protected]>
* Remove outputs from on_validation_epoch_end in multi_binary_acc.py
Signed-off-by: Abhishree <[email protected]>
* Remove output args from on_validation_epoch_end in the docstrings of some ASR files
Signed-off-by: Abhishree <[email protected]>
* Remove output args from on_validation_epoch_end and clear memory from validation_step_outputs
Signed-off-by: Abhishree <[email protected]>
* Add on_validation_epoch_end and remove outputs args for nlp models
Signed-off-by: Abhishree <[email protected]>
* Append output of validation_step to validation_step_outputs in EncDecClassificationModel
Signed-off-by: Abhishree <[email protected]>
* Add the following changes
1) Index self.validation_step_outputs and self.test_step_outputs with dataloader_idx wherever needed
2) Initialize self.validation_step_outputs and self.test_step_outputs as empty lists and add support for multiple dataloaders if they exist (see the sketch below)
3) Remove self.pre_configure_ddp from the NLPDDPStrategy class as it was removed in PTL 2.0
Signed-off-by: Abhishree <[email protected]>
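A sketch of the initialization in (2), assuming `self._validation_dl` has already been set up by the model:

```python
# One output list per validation dataloader when there are several; flat otherwise.
if isinstance(self._validation_dl, list) and len(self._validation_dl) > 1:
    self.validation_step_outputs = [[] for _ in range(len(self._validation_dl))]
else:
    self.validation_step_outputs = []
```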
* Add default value dataloader_idx=0 for on_validation_batch_end() in megatron_base_model.py
Signed-off-by: Abhishree <[email protected]>
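In Lightning 2.0 the batch-level hooks default dataloader_idx to 0, so overrides should match; a minimal sketch:

```python
def on_validation_batch_end(self, outputs, batch, batch_idx, dataloader_idx=0):
    # dataloader_idx now defaults to 0, matching the PTL 2.0 hook signature
    super().on_validation_batch_end(outputs, batch, batch_idx, dataloader_idx)
```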
* Typecast precision to str in attention.py and utils_funcs.py to avoid TypeError
Signed-off-by: Abhishree <[email protected]>
* Add if condition check for multiple dataloaders when appending to validation outputs
Signed-off-by: Abhishree <[email protected]>
* Separate validation pass to be used with both validation_step and test_step
Signed-off-by: Abhishree <[email protected]>
* Add if condition check for multiple dataloaders while appending to test_step_outputs in punctuation_capitalization_model.py
Signed-off-by: Abhishree <[email protected]>
* Add condition check for multiple dataloaders based on the type of trainer.val/test_dataloaders or self._validation/test_dl instead of their length
Signed-off-by: Abhishree <[email protected]>
* Comment Megatron T5 IA3 PP=2 in CI pipeline due to dataloader_iter issue with PTL 2.0
Signed-off-by: Abhishree <[email protected]>
* Modify precision checks to account for 16-mixed and bf16-mixed
Signed-off-by: Abhishree <[email protected]>
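A sketch of the widened check (Lightning 2.0 reports precision as strings such as '16-mixed' and 'bf16-mixed'; `trainer` is assumed in scope and `autocast_dtype` is an illustrative name):

```python
import torch

precision = str(trainer.precision)  # the cast also covers int-valued 1.x configs
if precision in ("16", "16-mixed"):
    autocast_dtype = torch.float16
elif precision in ("bf16", "bf16-mixed"):
    autocast_dtype = torch.bfloat16
else:
    autocast_dtype = torch.float32
```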
* Append output of validation/test_step to self.validation/test_step_outputs in CTCG2PModel
Signed-off-by: Abhishree <[email protected]>
* Set find_unused_parameters=True in g2p heteronym models
1) Add find_unused_parameters=True for the DDP strategy in g2p_heteronym_classification_train_and_evaluate.py
2) Remove outputs arg in validation/test_step and add instance variables instead in heteronym_classification.py
Signed-off-by: Abhishree <[email protected]>
* Remove outputs from on_test_epoch_end in DialogueGPTClassificationModel
Signed-off-by: Abhishree <[email protected]>
* Add validation/test outputs in sgdqa_model and modify dialogue_config.yaml
Signed-off-by: Abhishree <[email protected]>
* Add split arg self.test_step_outputs to TextClassificationModel
Signed-off-by: Abhishree <[email protected]>
* Add test_step_outputs to dialogue and text classification models
Signed-off-by: Abhishree <[email protected]>
* Change condition check for multiple dataloaders:
1) Replace ds_item with a list in dialogue_config.yaml
2) Check the length of val/test_dataloaders or validation/test_dl along with a list type check in sgdqa_model.py while appending outputs of validation/test_step
3) Check the length of _validation/test_dl when creating self.validation/test_step_outputs in ModelPT and punctuation_capitalization_model.py
Signed-off-by: Abhishree <[email protected]>
* Add additional condition for multi dataloaders
Check len(self.trainer.val/test_dataloaders) > 1 along with type(self.trainer.val/test_dataloaders) == list for multiple dataloaders in validation/test_step (see the sketch below)
Signed-off-by: Abhishree <[email protected]>
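Putting both conditions together, a sketch of the guarded append in validation_step (`_validation_pass` is a hypothetical name for the shared validation pass introduced a few commits above):

```python
def validation_step(self, batch, batch_idx, dataloader_idx=0):
    loss = self._validation_pass(batch, batch_idx)  # hypothetical shared pass
    if type(self.trainer.val_dataloaders) == list and len(self.trainer.val_dataloaders) > 1:
        self.validation_step_outputs[dataloader_idx].append(loss)
    else:
        self.validation_step_outputs.append(loss)
    return loss
```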
* Add val step outputs and default val for dataloader_idx
1) Append validation_step output to self.validation_step_outputs in MultiLabelIntentSlotClassificationModel
2) Add default val for dataloader_idx for on_test_batch_start/end in TimingCallback
3) Add self.validation/test_step_outputs in BERTQAModel and remove outputs arg
Signed-off-by: Abhishree <[email protected]>
* Add val/test_step_outputs to S2SQAModel and GPTQAModel
Signed-off-by: Abhishree <[email protected]>
* Edit Jenkinsfile for bert_pretraining.py
Disable validation for this test in the Jenkinsfile as a workaround for the trainer.val_dataloaders None error
Signed-off-by: Abhishree <[email protected]>
* Modify precision to support 16-mixed, bf16-mixed in megatron_gpt_pretraining.py
Signed-off-by: Abhishree <[email protected]>
* Add ddp_find_unused_parameters_true and remove output args
1) Add ddp_find_unused_parameters_true for trainer.strategy in self_alignment_pretraining.py as it has unused parameters (see the sketch below)
2) Remove output args and add self.validation/test_step_outputs to validation/test_step in mt_enc_dec_model.py
3) Comment out tests in the Jenkinsfile that need to be fixed
Signed-off-by: Abhishree <[email protected]>
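Both spellings of the strategy fix in (1), sketched against the Lightning 2.0 API:

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Registered-alias form, as used in self_alignment_pretraining.py:
trainer = pl.Trainer(strategy="ddp_find_unused_parameters_true")
# Equivalent explicit form:
trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=True))
```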
* Precision fix in megatron_nmt_training.py for 16-mixed, bf16-mixed
Signed-off-by: Abhishree <[email protected]>
* Precision fix for megatron_bert_pretraining.py and megatron_bert_model.py
Signed-off-by: Abhishree <[email protected]>
* Precision fix and validation/test_step_outputs
1) Add fix to account for 16-mixed and bf16-mixed in megatron_retro_mutransfer_pretrain.py, megatron_retro_pretraining.py
2) Reset ckpt_path for test in enc_dec_nmt.py
3) Remove outputs args and add validation/test_step_outputs in megatron_retrieval_model.py
4) Comment out Megatron Bert Pretraining and Resume Training with Pipeline Parallelism and add back NMT Training Post-LN
Signed-off-by: Abhishree <[email protected]>
* Fix precision and skip a few failing tests
Signed-off-by: Abhishree <[email protected]>
* Add missing comment lines in Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Comment out Jenkins tests and super().on_validation_epoch_end() in megatron_gpt_sft_model.py
Signed-off-by: Abhishree <[email protected]>
* Minor edit to Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Minor edit in Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Edit in Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Comment missed lines in Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Fix precision and validation/test outputs
1) Add precision fix to account for 16-mixed and bf16-mixed in megatron_t5_pretraining.py
2) Remove outputs args and append loss to self.validation/test_step_outputs in megatron_lm_encoder_decoder_model.py
3) Add back resume_from_checkpoint in megatron_t5_config.yaml
4) Comment out certain tests in the Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Fix precision and validation/test/predict errors in megatron_t5_prompt_learning.py
Signed-off-by: Abhishree <[email protected]>
* Fix precision handling and precision typos in all files
1) Account for 16-mixed and bf16-mixed in megatron_bart_pretraining.py and megatron_t5_seq2seq_finetune.py
2) Fix precision typo in all files
Signed-off-by: Abhishree <[email protected]>
* Fix all CI TTS tests and comment out a few Jenkins tests
Signed-off-by: Abhishree <[email protected]>
* Combine xx_epoch_end and on_xx_epoch_end
Merge on_inference_epoch_end into the inference_epoch_end function and keep a single on_validation/test_epoch_end in megatron_finetune_model.py and megatron_gpt_sft_model.py
Signed-off-by: Abhishree <[email protected]>
* Add a missing comment in Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Add try/except StopIteration in validation_step for models with dataloader_iter
Signed-off-by: Abhishree <[email protected]>
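A sketch of the guard for models whose validation_step consumes a dataloader_iter directly (`_validation_pass` again hypothetical):

```python
def validation_step(self, dataloader_iter, batch_idx):
    try:
        batch = next(dataloader_iter)
    except StopIteration:
        return  # the loop can invoke the step once more after exhaustion
    return self._validation_pass(batch, batch_idx)
```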
* Remove pyyaml from requirements
Signed-off-by: Abhishree <[email protected]>
* Add try/except for inference_step in megatron_finetune_model.py
Signed-off-by: Abhishree <[email protected]>
* Remove limit_val_batches for mockGPTDataset test
Signed-off-by: Abhishree <[email protected]>
* Add new self.validation_step_outputs for MegatronGPTSFTModel
Signed-off-by: Abhishree <[email protected]>
* Minor edit Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Initialize self.validation/test_step_outputs in megatron_gpt_sft_model.py
Initialize self.validation/test_step_outputs in the setup of MegatronGPTSFTModel to handle cases where dataloaders are not set up in ModelPT, for example while restoring the model.
Signed-off-by: Abhishree <[email protected]>
* Remove resume_from_checkpoint trainer arg from conf yaml files
Signed-off-by: Abhishree <[email protected]>
* Remove resume_from_checkpoint as trainer arg in GPT, T5 configs
Signed-off-by: Abhishree <[email protected]>
* Remove resume_from_checkpoint in duplex_tn_config.yaml
Signed-off-by: Abhishree <[email protected]>
* Fix typos, unused imports and refactor code to remove redundant funcs
Signed-off-by: Abhishree <[email protected]>
* Remove commented code in megatron_nmt_model.py
Signed-off-by: Abhishree <[email protected]>
* Fix overridden functions to match parent class functions
Signed-off-by: Abhishree <[email protected]>
* Prefetch dataloader_iter to prevent hang for PP>1
Signed-off-by: Abhishree <[email protected]>
* Override setup() in NLPDDPStrategy to avoid hang during predict with PP>1
Signed-off-by: Abhishree <[email protected]>
* Uncomment tests in Jenkinsfile
Signed-off-by: Abhishree <[email protected]>
* Add '16' to precision checks and other minor fixes
Signed-off-by: Abhishree <[email protected]>
* Clear validation/test_step_outputs with dataloader_idx for multi dataloaders
Signed-off-by: Abhishree <[email protected]>
* Minor edits
Signed-off-by: Abhishree <[email protected]>
* Modify precision checks to avoid indexing
Signed-off-by: Abhishree <[email protected]>
* Remove self.validation_step_outputs_sft and add dataloader_idx to clear outputs
Signed-off-by: Abhishree <[email protected]>
* Reference checkpoint with trainer.ckpt_path
Signed-off-by: Abhishree <[email protected]>
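A sketch of the replacement reference (Lightning 2.0 exposes the restored checkpoint path on the trainer; `model` and `resume_path` are illustrative):

```python
# PTL 1.x: trainer.resume_from_checkpoint
# PTL 2.0: pass the path to fit(), then read it back from the trainer
trainer.fit(model, ckpt_path=resume_path)
print(trainer.ckpt_path)  # the checkpoint used for restoration
```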
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add _prefetch to NLPModel and minor fixes
Signed-off-by: Abhishree <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add limit_val_batches in Jenkinsfile for NMT
1) Add trainer.limit_val_batches in Megatron NMT Training TP=2
2) Remove unused import in ModelPT
Signed-off-by: Abhishree <[email protected]>
---------
Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings"
3667
3669
}
3668
3670
}
3671
+
// @athitten Remove /home/TestData/nlp/megatron_sft/trec.jsonl for validation and test file until we have support for multiple dataloaders in lightning 2.0
examples/asr/conf/asr_adapters/asr_adaptation.yaml (0 additions, 1 deletion)

@@ -187,7 +187,6 @@ trainer:
   precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
   log_every_n_steps: 10 # Interval of logging.
   enable_progress_bar: True
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
   check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
examples/asr/conf/conformer/conformer_ctc_bpe.yaml (0 additions, 1 deletion)

@@ -204,7 +204,6 @@ trainer:
   precision: 32 # 16, 32, or bf16
   log_every_n_steps: 10 # Interval of logging.
   enable_progress_bar: True
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
   check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_bpe.yaml (0 additions, 1 deletion)

@@ -239,7 +239,6 @@ trainer:
   precision: 32 # 16, 32, or bf16
   log_every_n_steps: 10 # Interval of logging.
   enable_progress_bar: True
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
   check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
examples/asr/conf/squeezeformer/squeezeformer_ctc_bpe.yaml (0 additions, 1 deletion)

@@ -179,7 +179,6 @@ trainer:
   precision: 32 # 16, 32, or bf16
   log_every_n_steps: 10 # Interval of logging.
   enable_progress_bar: True
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
   check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
examples/asr/conf/ssl/wav2vec/wav2vec_ci.yaml (0 additions, 1 deletion)

@@ -138,7 +138,6 @@ trainer:
   gradient_clip_val: 0.0
   precision: 32 # 16, 32, or bf16
   log_every_n_steps: 100 # Interval of logging.
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
   check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
examples/nlp/dialogue/conf/dialogue_config.yaml (0 additions, 1 deletion)

@@ -25,7 +25,6 @@ trainer:
   accelerator: gpu
   log_every_n_steps: 5 # Interval of logging.
   val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
   enable_checkpointing: False # Provided by exp_manager
examples/nlp/duplex_text_normalization/conf/duplex_tn_config.yaml (0 additions, 1 deletion)

@@ -71,7 +71,6 @@ decoder_trainer:
   strategy: ddp
   log_every_n_steps: 1 # Interval of logging.
   val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
-  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.