Add Natasha NLP Engine for Russian language support #1739

mklyazhev · 2025-10-02T00:03:18Z

mklyazhev
Oct 2, 2025

Hello Presidio team,

I would like to propose adding support for the Natasha NLP engine to Presidio.

Natasha (https://github.com/natasha/natasha) is a powerful library for Russian NLP optimized for CPU performance. I have started implementing this in my own fork for personal use because I'm thrilled with the functionality Presidio provides, but I found it was missing an engine like this for the Russian language.

Would the team be open to a pull request for this feature? I'm happy to contribute and can prepare a PR with tests and documentation.

omri374 · 2025-10-08T09:11:18Z

omri374
Oct 8, 2025
Maintainer

Hi, thanks for your suggestion! As long as it uses permissive licenses (for its entire supply chain), a contribution would be great. The simplest approach imho is to create a custom EntityRecognizer rather than an NlpEngine, but both options are possible. In any case, it should be an optional dependency.
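
For illustration, one common way to keep a library such as Natasha optional is to import it lazily and fail with an actionable message when it is missing. The following is only a minimal sketch using the standard library: the helper name `load_optional` and the extra name in the error message are hypothetical and are not part of Presidio.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise an ImportError that says how to get it.

    `extra` is a hypothetical packaging extra name, used only in the error message.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f"Assuming an extra named '{extra}' exists, install it with: "
            f"pip install 'presidio-analyzer[{extra}]'"
        ) from exc
```

With this pattern, a recognizer or engine would call `load_optional("natasha", "natasha")` inside its loading step rather than at module import time, so users who never enable Russian support do not need the package.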

0 replies

mklyazhev · 2025-10-09T13:05:43Z

mklyazhev
Oct 9, 2025
Author

Thanks for the helpful feedback!

I'll ensure Natasha and its dependencies use permissive licenses and will start working on the PR.

Regarding the EntityRecognizer vs. NlpEngine approach: I'll take a closer look at Natasha's other capabilities (like tokenization, lemmatization). If they align well with the broader functionality of Presidio's NlpArtifacts, it might be beneficial to implement the full NlpEngine. If not, I'll stick with the simpler, self-contained EntityRecognizer as you suggested.

I'll keep you updated on the approach I choose.

0 replies

mklyazhev · 2025-10-10T13:58:59Z

mklyazhev
Oct 10, 2025
Author

@omri374
Hi, I've explored both implementation options: EntityRecognizer and NlpEngine.

Natasha is a full-featured NLP pipeline, like SpaCy or Stanza. It provides not only NER but also tokenization, lemmatization, and morphological analysis.

Implementing Natasha as a self-contained EntityRecognizer would require running a separate NLP engine (SpaCy by default) alongside it, leading to redundant processing (e.g., tokenizing the same text twice). Alternatively, it would have to rely on the preprocessing results from the main NlpEngine, which may produce worse results than Natasha's own native preprocessing.

Therefore, I believe the NlpEngine approach is better suited here. This would:

- Avoid redundant NLP processing.
- Allow future Russian-language recognizers to reuse the NlpArtifacts produced by NatashaNlpEngine.
- Ensure that Natasha's NER model uses the native text processing tools it was trained and optimized with.

I will proceed with the NlpEngine + Recognizer implementation, ensuring that natasha is handled as an optional dependency, as you recommended. Please let me know if you have any objections.
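
To make the "process once, reuse everywhere" argument concrete, here is a toy sketch in plain Python. It does not use Presidio or Natasha: `ToyEngine`, `ToyArtifacts`, and both checker functions are hypothetical stand-ins, and the lowercasing step is a crude placeholder for real lemmatization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyArtifacts:
    """Hypothetical stand-in for the shared preprocessing output."""
    tokens: List[str]
    normalized: List[str]  # lowercased tokens without trailing punctuation (not real lemmas)


class ToyEngine:
    """Hypothetical engine that counts how often the text is processed."""

    def __init__(self) -> None:
        self.calls = 0

    def process(self, text: str) -> ToyArtifacts:
        self.calls += 1
        tokens = text.split()
        normalized = [t.lower().strip(".,") for t in tokens]
        return ToyArtifacts(tokens=tokens, normalized=normalized)


def count_capitalized(artifacts: ToyArtifacts) -> int:
    """First toy 'recognizer': reuses the tokens instead of re-tokenizing."""
    return sum(1 for t in artifacts.tokens if t[:1].isupper())


def contains_form(artifacts: ToyArtifacts, form: str) -> bool:
    """Second toy 'recognizer': reuses the normalized forms."""
    return form in artifacts.normalized


engine = ToyEngine()
artifacts = engine.process("Наташа живёт в Москве.")
results = (count_capitalized(artifacts), contains_form(artifacts, "москве"))
```

Both checks consume the same artifacts, so the text is processed only once. A self-contained recognizer with its own pipeline would add a second pass over the same text.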

5 replies

omri374 Oct 10, 2025
Maintainer

Sounds good! Have you validated the licensing? Per Microsoft's OSS standards, we have to make sure there are no license conflicts.

mklyazhev Oct 12, 2025
Author

I've found two potential issues related to pymorphy2 (a core dependency for natasha):

1. pymorphy2 is declared under the MIT license in its setup.py and on PyPI. However, the GitHub repository itself is missing a LICENSE file. While the author's intent is clearly permissive, I wanted to bring this to your attention.
2. pymorphy2 uses the deprecated pkg_resources module, which triggers a UserWarning. More critically, setuptools plans to remove this module around November 2025 (in v81.0.0). If I understand correctly, this will cause an ImportError and break pymorphy2. There is an open issue for this in the natasha repository (see natasha/natasha#138, "Pymorphy2 uses pkg_resources but does not list Setuptools among its dependencies").

There has been an attempt to transfer the project to community maintenance to fix ongoing issues, but so far without success (see issue: pymorphy2/pymorphy2#174). My assumption is that these two points might be blockers for accepting the PR. What are your thoughts on this?
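
As a quick way to check whether a given environment is exposed to the pkg_resources concern above, something like the following could be run. It is a minimal standard-library sketch; the function names are my own and it only tests whether the module can be found, not whether pymorphy2 itself works.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def pkg_resources_available() -> bool:
    """pkg_resources ships with setuptools, so it is absent when setuptools is missing or has dropped it."""
    return module_available("pkg_resources")


if __name__ == "__main__":
    if pkg_resources_available():
        print("pkg_resources found: code importing it should still load here.")
    else:
        print("pkg_resources not found: code importing it would raise ImportError here.")
```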

omri374 Oct 13, 2025
Maintainer

Thanks for the analysis. It would be best to wait until NatashaNLP adopts pymorphy3 or another alternative. In the meantime, would you be interested in creating a sample instead?

mklyazhev Oct 13, 2025
Author

Do you mean the samples in the docs/samples/python directory? If so, I agree that this is a good solution until the dependency issues in natasha are resolved.

omri374 Oct 14, 2025
Maintainer

Yes exactly
