16 June 2015

dotNET Internationalization (i18n) using po files and GNU tools



Originally posted on http://issues.umbraco.org/issue/U4-6698

Localization based on GNU Gettext strategy instead of developer driven xml files

http://www.gnu.org/software/gettext/
http://en.wikipedia.org/wiki/Gettext

http://www.gnu.org/software/gettext/manual/gettext.html#Why
Since this is a long rant, I’ll help you along a bit. The summary is -> http://poedit.net/wordpress <- so Umbraco should adopt this too to get in on this translation ecosystem…

Branch: https://github.com/janhebnes/Umbraco-CMS/blob/dev-v7-localization-using-gettext/
The interesting part is build/Translation.cmd and the resulting pot file at
https://github.com/janhebnes/Umbraco-CMS/blob/dev-v7-localization-using-gettext/src/Umbraco.Web.UI/Umbraco/config/lang/messages.pot

----

This was an open space topic presented at Codegarden 15. I would like the Umbraco core project to refactor the internal text translation ecosystem and adopt the GNU Gettext po + pot file translation strategy. The purpose is to remove the developer pain of translating strings for the internal GUI, and to open up for a larger set of tools and communities, maybe on a package level as well as in Umbraco core.

----

Translation, i18n or localization in ASP.NET using resource files is broken and unusable in an open source space; many fine blog posts argue why.
http://www.expatsoftware.com/articles/2010/03/why-internationalization-is-hopelessly.html
http://manas.com.ar/blog/2009/10/01/using-gnu-gettext-for-i18n-in-c-and-asp-net.html

The current implementation in Umbraco uses a simple and clean custom XML file based system with id references to text arrays. A developer must create the string as an id plus the initial translation, e.g. English (en). The system is easily extendable by package creators but requires “specialized” translation handling. A missing “id” could be injected at runtime, but if it belongs to an edge-case error message it might never find its way into the main translation XML file. The biggest problem right now, I believe, is that the package ecosystem runs its own translation subsystem in each package, leaving the Umbraco translations fragmented and a tedious task for developers to handle.

The GNU Project has another strategy that has been in use since 1995.
https://en.wikipedia.org/wiki/Gettext/
https://www.gnu.org/software/gettext/
https://www.gnu.org/software/gettext/manual/html_node/index.html
https://docs.python.org/2/library/gettext.html#gettext it's big in Python and
https://developer.mozilla.org/en-US/docs/gettext and big in PHP

The Gettext strategy is to scan the source files and generate a pot (po template) file containing the message “ids”; the default keywords marking localized strings are _(“”) or gettext(“”). The po template is then merged into the localized po files, updating them so that e.g. outdated translation messages are commented out.
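The merge step can be illustrated with a small sketch. This is not how msgmerge is implemented (the real tool also does fuzzy matching and keeps comments); the function names merge and render are made up for this example, and it only shows the idea of keeping current ids and marking dropped ones as obsolete with the "#~" prefix.

```python
def merge(template_ids, existing):
    """Return (active, obsolete) catalogs, loosely mimicking the msgmerge idea.

    template_ids: msgids found in the freshly generated pot file.
    existing:     dict of msgid -> msgstr from the current localized po file.
    """
    # Keep every id from the template, reusing a translation when one exists.
    active = {msgid: existing.get(msgid, "") for msgid in template_ids}
    # Translations whose id vanished from the source become obsolete.
    obsolete = {m: s for m, s in existing.items() if m not in active}
    return active, obsolete


def render(active, obsolete):
    """Write the merged catalog as po text; obsolete entries get '#~'."""
    lines = []
    for msgid, msgstr in active.items():
        lines += ['msgid "%s"' % msgid, 'msgstr "%s"' % msgstr, ""]
    for msgid, msgstr in obsolete.items():
        lines += ['#~ msgid "%s"' % msgid, '#~ msgstr "%s"' % msgstr, ""]
    return "\n".join(lines)
```

A new string shows up with an empty msgstr for the translator to fill in, while a string removed from the code stays in the file, commented out, in case it comes back.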

The source files can be any code files; C# and JavaScript are supported. Comments placed above a piece of code are carried into the po files as context notes for translators, and the po files also contain reference comments showing where in the source each translation is used.
https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html#xgettext-Invocation
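To make the scanning idea concrete, here is a rough Python sketch of what a scanner does in principle. In practice xgettext does this job; the regular expression and the helper names extract_messages and to_pot are my own simplifications for illustration, and they ignore escaped quotes, string concatenation, translator comments and plural calls.

```python
import re

# Matches _("...") or gettext("...") with a plain double-quoted literal.
CALL = re.compile(r'\b(?:_|gettext)\(\s*"([^"\\]*)"\s*\)')


def extract_messages(sources):
    """Map each msgid to the list of 'file:line' places where it is used.

    sources: dict of filename -> file content (C#, JavaScript, ...).
    """
    found = {}
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in CALL.finditer(line):
                ref = "%s:%d" % (filename, lineno)
                found.setdefault(match.group(1), []).append(ref)
    return found


def to_pot(messages):
    """Render the messages as pot entries with '#:' reference comments."""
    entries = []
    for msgid, refs in messages.items():
        entries.append('#: %s\nmsgid "%s"\nmsgstr ""\n' % (" ".join(refs), msgid))
    return "\n".join(entries)
```

The "#:" lines are the reference comments mentioned above: a translator opening the file in a po editor can see every place a string is used.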

Note that the base gettext translation mechanics also handle plural forms
https://github.com/neris/NGettext/blob/master/src/NGettext.Tests/BaseCatalogTest.cs
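As a small illustration of the plural mechanics, here is a sketch using Python's standard gettext module rather than NGettext. It uses NullTranslations, which has no catalog loaded and simply picks the singular form when the count is 1 and the plural form otherwise; a real catalog would apply the plural rule of the target language instead. The describe helper is just a name chosen for this example.

```python
import gettext

# No catalog loaded: ngettext falls back to the English-style rule
# (singular when n == 1, plural otherwise).
translations = gettext.NullTranslations()


def describe(count):
    """Return a count-aware message, choosing the form via ngettext."""
    template = translations.ngettext("%d file", "%d files", count)
    return template % count
```

The point is that the developer hands both forms and the number to the translation layer, and the choice of form is made there, not in application code.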

There is a large ecosystem built around the translation process for these file formats, and the tools can be used to cooperate on a global scale. http://poedit.net/ is one of the main tools, and different implementations can be found on Linux and Mac systems.

One of the main arguments to why Umbraco should change the strategy is this: http://poedit.net/wordpress

How could we work with this?

I have tried this out on my pet project https://github.com/janhebnes/startlist.club and I am using https://github.com/vslavik/gettext-tools-windows for the main “xgettext” scanner and running a batch file for updating the translation files https://github.com/janhebnes/startlist.club/blob/master/Translation.cmd

For working with _ and gettext in ASP.NET I based my implementation on http://www.fairtutor.com/fairlylocal and tweaked it a bit to introduce LocalizedDisplayNames on models and have _ on the PageView. Everything is available for review here: https://github.com/janhebnes/startlist.club/tree/master/FlightJournal.Web/Translations

I have looked at other options on the po parsing side:
https://github.com/neris/NGettext or https://github.com/fsateler/gettext-cs-utils (using a tt file in the project instead of a batch)

http://www.fairtutor.com/fairlylocal uses a project build step; I have currently chosen a separate batch file.

----

Disclaimer: I have not analysed the Umbraco core before, so this is based on a short review of the core. (This is where your feedback is much needed.)
----

How do we go about handling the change?


C# sources

Most methods live in \src\umbraco.businesslogic\ui.cs: methods like GetText and Text, plus a TextService.Localize. So maybe someone already had GNU gettext in mind when designing this (nobody came forward about it at Codegarden 15). They all reference TextService.Localize. The main issue is making the “messageid” detectable by code scanning up front. Changing TextService.Localize to use po sources is the easy part.

Umbraco.ui contains:
GetText (used 65 times)
Text (used 53+14+378+20+134+345+10+2120+36) 3110 times
Any call that does not pass a literal message id is unusable to the scanner, e.g. ui.Text(action.Alias, u) in \src\umbraco.cms\businesslogic\workflow\Notification.cs, and should be made detectable in some other way to get into the pot file.
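A quick way to find such calls could look like the sketch below. It is only an illustration: the regular expression and the find_dynamic_calls name are my own, it only inspects the first argument of ui.Text and ui.GetText, and it will misjudge string literals that contain a comma or a closing parenthesis.

```python
import re

# Captures the first argument of ui.Text(...) or ui.GetText(...).
# Simplified: the argument is cut at the first ',' or ')'.
CALL = re.compile(r'\bui\.(?:Text|GetText)\(\s*([^,)]*)')


def find_dynamic_calls(source):
    """Return (line number, first argument) for calls a scanner cannot read."""
    dynamic = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CALL.finditer(line):
            first_arg = match.group(1).strip()
            is_literal = (
                len(first_arg) >= 2
                and first_arg.startswith('"')
                and first_arg.endswith('"')
            )
            if not is_literal:
                dynamic.append((lineno, first_arg))
    return dynamic
```

Running something like this over the C# sources would give a worklist of the calls that need another way into the pot file.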

Javascript sources

The translatable strings, aka “messages”, in the backoffice AngularJS JavaScript must be made scannable with the default gettext utility. This means scanning Umbraco.Web.UI.Client and the js files. 136 of the “messages” follow the pattern <localize key="or">; this should be changed to either _ or gettext (see also 5.1.6 for further inspiration https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html#xgettext-Invocation)
Localize is used 136 times
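To get an overview of what needs converting, the existing keys could be listed with a small script. This is a sketch under my own assumptions: it treats the views as plain text, uses a regular expression instead of a real HTML parser, and the localize_keys name is made up for the example.

```python
import re

# Matches <localize ... key="..."> in a view; attributes before key are allowed.
LOCALIZE = re.compile(r'<localize\b[^>]*\bkey="([^"]+)"')


def localize_keys(view_html):
    """Return the keys used by <localize> elements, in order of appearance."""
    return LOCALIZE.findall(view_html)
```

The resulting list can be compared against the pot file to check that no key is lost when the views are changed to _ or gettext.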

Once this is tweaked, a new JavaScript method that actually handles _ or gettext must be written, and finally the source of the translations must be rewritten to use a po or mo file, replacing \src\Umbraco.Web\umbraco.presentation\umbraco\js\language.aspx.cs

Changing TextService.Localize to use po or mo sources

Using e.g. https://github.com/neris/NGettext or http://www.fairtutor.com/fairlylocal in the form of e.g. https://github.com/janhebnes/startlist.club/tree/master/FlightJournal.Web/Translations could be one way.
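The lookup side is small enough to sketch. The following is not NGettext or FairlyLocal code, just a minimal illustration in Python of reading msgid/msgstr pairs from po text and falling back to the msgid when no translation exists. parse_po and localize are hypothetical names, and the parser ignores multi-line strings, escapes, contexts and plural entries.

```python
import re

# A single-line msgid or msgstr entry, e.g.: msgid "Save"
LINE = re.compile(r'^(msgid|msgstr)\s+"(.*)"$')


def parse_po(text):
    """Build a msgid -> msgstr dict from simple single-line po entries."""
    catalog = {}
    msgid = None
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # comments, blank lines, header continuation lines
        keyword, value = match.groups()
        if keyword == "msgid":
            msgid = value
        elif msgid:  # an empty msgid is the po header entry, skip it
            catalog[msgid] = value
            msgid = None
    return catalog


def localize(catalog, msgid):
    """Return the translation, or the msgid itself when none is available."""
    return catalog.get(msgid) or msgid
```

The fallback is what makes an untranslated or brand-new string harmless: the user sees the id, which reads well if the id is the English text.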


Branch Created for demonstration

https://github.com/janhebnes/Umbraco-CMS/blob/dev-v7-localization-using-gettext/
Batch file added https://github.com/janhebnes/Umbraco-CMS/blob/dev-v7-localization-using-gettext/build/Translation.cmd
Resulting pot based on current source https://github.com/janhebnes/Umbraco-CMS/blob/dev-v7-localization-using-gettext/src/Umbraco.Web.UI/Umbraco/config/lang/messages.pot

messageid as english text

To remove a step for developers and make the translation process simpler, the messageid or key should be changed to the actual English text instead of a system key. But this can be a later step.

Potential roadmap

The core Umbraco translation can be maintained by teams on Crowdin.
The package system can hook into the same format and network and, when a package is installed, drop package po files in the lang folders, using the same system for handling strings in JavaScript and C#, and the same setup with pot templates per package. Depending on strategy, the translation of packages could be a central service in the core, so a package manager just handles creating the pot, and the ecosystem around translation updates and maintains the package po files based on the package pot.

Feedback

I would like to get some community feedback on this feature.

1 comment:

Unknown said...

As an alternative solution to translate gettext .po files, I would suggest the online localization tool https://poeditor.com/ which would make the l10n process more automated, allowing collaborative translation work also.