This repository contains the code and data to generate Chinese homophone words to probe machine translation evaluation systems for our paper Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems accepted by the 10th Workshop on Noisy and User-generated Text (W-NUT) at NAACL 2025. Our method using self-information achieves much higher correlation with human preferences than an existing method proposed by Hiruncharoenvate et al. (2015). These generated homophone words can be used for censorship avoidance or as what we show, creating data perturbations to probe the robustness of MT or MT evaluation systems.
Please clone our repository and install the following dependencies.
pip install jieba pandas pypinyin numpy torch transformers
Pytorch and transormers are used to run language models to get self-information scores, which is the method (the last step) we propose to filter all possible homophone candidates. If you are restricted with GPUs or you just want to generate all homophones without filtering, this step is not necessary.
This step is to prepare the Weibo corpus
python preprocessing.py
Once the corpus is prepared, run the following code to generate all possible homophones and the top-k homophones filtered using the baseline method proposed by Hiruncharoenvate et al. (2015). The generator uses the file extracted_keywords.xlsx which contains translation error words discovered by Qian et al. (2023). You can replace this file with your own words that need homophone substitution.
python generator.py
Pick most errorneous words in translation, and optionally filter homophone candidates based on frequency for getting self-information scores. To create your own data, you can run the following code. To replicate our results, you can skip this step and use the data in freq_candidates folder to get self-information scores in the next step.
python prepare4self_info.py
Check for which word you want to get self-information scores and run the following code.
python self_info.py
Chaya Hiruncharoenvate, Zhiyuan Lin, and Eric Gilbert. 2015. Algorithmically Bypassing Censorship on Sina Weibo with Nondeterministic Homophone Substitutions. In Proceedings of the International AAAI Conference on Web and Social Media, 9(1):150–158.
Shenbin Qian, Constantin Orasan, Felix Do Carmo, Qiuliang Li, and Diptesh Kanojia. 2023. Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 125–135, Tampere, Finland. European Association for Machine Translation.