-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
feat(ml): better multilingual search with nllb models #13567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c29a7ad
to
8ea15a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome
@@ -144,6 +144,7 @@ | |||
page: nextPage, | |||
withExif: true, | |||
isVisible: true, | |||
language: $lang, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a new setting? I like to have my UI in English, but I prefer to use my native language for searching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we could have a Search Language user preference separate from the UI language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing is that the setting only applies to four non-default models, so I wonder if it'd be confusing to users. Maybe we can hide the setting if they're not using a relevant model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also have this be something the admin can override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean making it use a certain search language by default for all users? That could be nice, possibly doing the same for the UI language as well. But I think we should avoid making too many changes at once in this PR, so the admin stuff can be done separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few related FRs came up that I consolidated into #13651.
@mertalev I think this is good to go, but it needs rebasing, there's a bunch of conflicts. If you want it to go in still, can you work on that? 🙂 |
Sure! I got caught up trying to make it work in mobile, but the language codes for localization were different IIRC. The web portion should be good though after rebasing. |
7c718a9
to
02193bf
Compare
02193bf
to
23ae018
Compare
Tested on web and mobile with a variety of languages and confirmed the language is passed to the ML service, which uses it to pass the right language token to the NLLB model. |
Do the model benchmarks account for this token being passed? |
The results in the docs are when the language token is passed correctly. The models might not even make the list when no token is passed because the input is interpreted as English. |
Description
The NLLB models have the highest quality results for multilingual text in our model catalog. However, these models expect the input to explicitly indicate the language of the text. Since we don't provide this, the results are worse than other multilingual models.
This PR makes the machine learning service accept a
language
option that it maps to the corresponding FLORES200 token that NLLB expects. The server is updated to accept this parameter and forward it to the machine learning service, while web and mobile are updated to provide the language based on current user settings. Search still works without it, and the option only has an impact when using an NLLB model.How Has This Been Tested?
I tested on web and can confirm the results for Turkish are significantly better. Changing the language in the user settings has a marked effect on the ranking. I haven't tested mobile so I'm not sure if all the language codes there line up with the map in the machine learning service.