Hi, thanks for developing this wonderful project. I found that `torch.nn.functional.scaled_dot_product_attention` [throws an error when both `attn_mask` and `is_causal` are set](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html), but the current language_model.py code passes both:

https://github.com/huggingface/nanoVLM/blob/6ba9082e16f1fc8c21a1f8d0c54b26c9233c8771/models/language_model.py#L141

A simple fix is to build the causal mask yourself, fold it into `attn_mask`, and call SDPA with `is_causal=False`, but if there are other ways, I'd like to know.
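
For reference, here's a minimal sketch of the workaround I mean. The tensor shapes and the `padding_mask` argument are my assumptions for illustration, not nanoVLM's actual signature:

```python
import torch
import torch.nn.functional as F

def sdpa_causal_with_padding(q, k, v, padding_mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim).
    # padding_mask: (batch, seq_len) boolean, True where a token is real.
    L, S = q.size(-2), k.size(-2)
    # Boolean causal mask: True means "may attend" per the SDPA docs.
    causal = torch.ones(L, S, dtype=torch.bool, device=q.device).tril()
    if padding_mask is not None:
        # Broadcast padding over heads and query positions: (B, 1, 1, S),
        # then AND with the (L, S) causal mask -> (B, 1, L, S).
        mask = causal & padding_mask[:, None, None, :].bool()
    else:
        mask = causal
    # Pass the combined mask and leave is_causal at its default (False),
    # so the two arguments are never set together.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

This keeps the padding semantics while making causality explicit in `attn_mask`, at the cost of materializing an (L, S) mask instead of letting the fused kernel handle causality internally.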