Comparison of Translation Quality between Large Language Models and Neural Machine Translation Systems: A Case Study of Chinese-English Language Pair


  • Xinchen Li, School of Arts, English and Languages, Queen’s University Belfast, Belfast, United Kingdom


Large language model, Neural machine translation, Translation quality assessment, Chinese-English translation, Comparative study


A number of Neural Machine Translation (NMT) systems have already proved capable of handling translation tasks that are not too demanding. However, rapid advances in AI technology in recent years have given Large Language Models (LLMs) great potential, raising the possibility that they may outperform NMT systems as translators. To determine whether LLMs translate better than NMT, and how genre and translation direction influence translation quality, this article tests and compares two LLMs, ChatGPT 3.5 and Wenxin Yiyan (ERNIE Bot 3.5), and one NMT system, DeepL, on the Chinese-English language pair, using a quantitative method that combines BLEU scoring with statistical analysis in SPSS. The results show no significant improvement in the LLMs’ translation quality over the NMT system. All three systems tend to perform better in non-literary translation than in literary translation, and they produce target texts (TTs) of higher quality when translating from Chinese into English than from English into Chinese.
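
For readers unfamiliar with the BLEU scoring mentioned above, the following is a minimal sketch of sentence-level BLEU in the sense of Papineni et al. (2002): the geometric mean of clipped 1- to 4-gram precisions multiplied by a brevity penalty. It is an illustration only, not the scoring pipeline used in the study. The function names, the single-reference setup, the absence of smoothing, and whitespace tokenization are all assumptions made here; Chinese output would additionally need word segmentation or character-level splitting before scoring.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference BLEU for one tokenized sentence pair.

    Hypothetical helper for illustration; not the tool used in the article.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clipped counts: each candidate n-gram is credited at most as
        # often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, a single zero precision makes BLEU zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref = "the cat sat on a mat".split()
    print(round(sentence_bleu(hyp, ref), 3))  # prints 0.537
```

Published evaluations usually report corpus-level BLEU, which pools n-gram counts over all segments before taking the precisions, and often apply smoothing. Sentence-level scores such as the one above are therefore not directly comparable with corpus-level figures.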






How to Cite

Li, X. (2024). Comparison of Translation Quality between Large Language Models and Neural Machine Translation Systems: A Case Study of Chinese-English Language Pair. International Journal of Education and Humanities, 4(2), 121–128. Retrieved from