Langcodes and plural forms

We all know Drupal and Symfony both have multilingual support (i18n, l10n, pluralization). But do they match? Do they do it equally well? How does one test this? This is what I did. I wanted to add tests for Symfony's PluralizationRules, as there were none, and I wanted clean and simple PluralForms support for loading Drupal's *.po files. How hard can it be?!? These tests should cover:
  1. Langcodes. Are all known language codes available (or easy to add)?
  2. Parsing Gettext PluralForm rules like
    nplurals=5; plural=n==1 ? 0 : n==2 ? 1 : n<7 ? 2 : n<11 ? 3 : 4 
  3. Pluralization rules. Do they work as defined somewhere on the internet?
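To make point 2 concrete, here is a minimal sketch of splitting such a rule into its nplurals count and its plural expression. The helper name parsePluralForms is made up for this post; it is not part of Drupal or Symfony, and it only handles the simple "nplurals=N; plural=EXPR" shape shown above.

```javascript
// Split a Gettext plural-forms rule such as
//   "nplurals=2; plural=(n != 1);"
// into its two parts. parsePluralForms is a hypothetical helper name.
function parsePluralForms(header) {
  var match = /nplurals\s*=\s*(\d+)\s*;\s*plural\s*=\s*([^;]+?)\s*;?\s*$/.exec(header);
  if (!match) {
    throw new Error("Not a plural-forms rule: " + header);
  }
  return { nplurals: parseInt(match[1], 10), plural: match[2] };
}

var rule = parsePluralForms("nplurals=5; plural=n==1 ? 0 : n==2 ? 1 : n<7 ? 2 : n<11 ? 3 : 4");
console.log(rule.nplurals); // 5
console.log(rule.plural);   // n==1 ? 0 : n==2 ? 1 : n<7 ? 2 : n<11 ? 3 : 4
```

The expression is returned as a plain string; evaluating it is a separate step.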

Langcodes

How many langcodes are there? I looked at several sources:
  • 187 on http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
  • 96 in standard_language_list() and 104 on http://localize.drupal.org/translate/languages
  • 132 on http://translate.sourceforge.net/wiki/l10n/pluralforms (as they have plural forms defined).
  • 96 grepped from Symfony PluralizationRules
I did not use Wikipedia, as that would have inflated the differences too much (and I forgot to include it before writing the tests (doh)). Running phpunit --verbose src/Symfony/Component/Translation/Tests/PluralizationRulesTest.php gives an incomplete test that reports the missing languages per source.
1) Symfony\Component\Translation\Tests\PluralizationRulesTest::testLangcodesKnownByOthers
Symfony (95) : misses (47) ach, ak, an, arn, ast, ay, br, cgg, csb, en-gb
    , es_AR, gd, gsw-berne, ht, hy, ia, jbo, kk, kw, ky, lo, mai, mfe, mi
    , mnk, my, nap, oc, pms, pt-br, pt-pt, rm, sah, sco, se, si, son
    , su, ta-lk, tg, tt, ug, uz, wo, yo, zh-hans, zh-hant
Drupal (96) : misses (46) ach, ak, an, arn, ay, bh, br, cgg, csb, es_AR
    , fur, fy, gun, ha, ia, jbo, km, kw, lb, ln, mai, mnk, ms, nah, nap
    , no, nso, om, pap, pms, ps, pt_BR, rm, sah, so, son, su, tg, tk, tt, uz, wa, wo, yo, zh, zu
Sourceforge (130) : misses (12) bh, en-gb, gsw-berne, ht, my, om, pt-br
    , pt-pt, ta-lk, zh-hans, zh-hant, zu
Total: 142
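The numbers in this report add up as plain set differences: each source's known count plus its misses equals the total of 142, so "misses" is the union of all lists minus the source's own list. A minimal sketch of that calculation follows. The three short lists are made-up samples, not the real data, and reportMisses is a hypothetical name. Codes are compared literally, which is why spellings like pt-br and pt_BR both end up in the report above.

```javascript
// Hypothetical, shortened langcode lists; the real lists are much longer.
var sources = {
  Symfony: ["en", "fr", "pt_BR"],
  Drupal: ["en", "fr", "en-gb"],
  Sourceforge: ["en", "fr", "ach"]
};

// For every source, list the codes that appear elsewhere but not in it.
function reportMisses(sources) {
  var union = {};
  Object.keys(sources).forEach(function (name) {
    sources[name].forEach(function (code) { union[code] = true; });
  });
  var all = Object.keys(union).sort();
  var report = { total: all.length, misses: {} };
  Object.keys(sources).forEach(function (name) {
    report.misses[name] = all.filter(function (code) {
      return sources[name].indexOf(code) === -1;
    });
  });
  return report;
}

var report = reportMisses(sources);
console.log(report.total);                     // 5
console.log(report.misses.Symfony.join(", ")); // ach, en-gb
```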
I'm not sure what the 'real' standard is, but afaik en-gb, for example, is wrong and should be en_GB. I hope some readers can clarify this. Symfony is a little rude about the langcode it is given: it cuts the code at '_', except for pt_BR, which puzzled me at first (the pluralization rules for Brazil differ from those for Portugal). Why do both Drupal and Symfony lack the longer list that Wikipedia, or even Sourceforge, has? Do we have enough time to fix this in Symfony (and thus Drupal), or do we 'Do it with Drupal'?
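To show what that underscore handling means for a lookup, here is a small illustration of the behaviour as I described it. It is not Symfony's actual code, baseLangcode is a made-up name, and hyphenated codes such as en-gb are deliberately left untouched because only '_' is mentioned above.

```javascript
// Illustration only: reduce a langcode to its language part at '_',
// keeping pt_BR as the one exception. Not Symfony's implementation.
function baseLangcode(langcode) {
  if (langcode === "pt_BR") {
    return langcode;
  }
  var pos = langcode.indexOf("_");
  return pos === -1 ? langcode : langcode.slice(0, pos);
}

console.log(baseLangcode("es_AR")); // es
console.log(baseLangcode("pt_BR")); // pt_BR
console.log(baseLangcode("en-gb")); // en-gb
```

With handling like this, es_AR would silently share the rule for es, while pt_BR keeps a rule of its own.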

Parsing PluralForms

Drupal is capable of parsing these strings by using eval. For locale.module, which uses eval, chx described a nice algorithm to cache the eval results by generating a lookup table. Symfony, on the other hand, uses hard-coded rules, with the possibility to overwrite or add a rule per langcode. Thanks to Sourceforge we have the rules as strings, which we can use for our tests. (I could have downloaded all Drupal *.po files and extracted the rules, but I hope others join in or donate some money.) Using the algorithm outlined by chx we can now query each langcode known to Symfony (95-47) and see whether the generated tables match, show a similar pattern, or are completely wrong compared to the Sourceforge rules.
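The lookup-table idea can be sketched roughly as follows: evaluate the plural expression once for every n in a range and store the resulting form indexes as a string of digits. This is my own minimal version in JavaScript, not chx's code or anything from locale.module; compilePlural and pluralTable are hypothetical names, and the character whitelist is an assumption about what the rules contain.

```javascript
// Turn a Gettext plural expression into a function of n.
// compilePlural and pluralTable are hypothetical names.
function compilePlural(expr) {
  // Only accept characters seen in the plural expressions quoted in this post.
  if (!/^[n0-9\s()?:!=<>&|%]+$/.test(expr)) {
    throw new Error("Unexpected characters in plural expression: " + expr);
  }
  // Number() turns boolean results such as (n != 1) into 0 or 1.
  return new Function("n", "return Number(" + expr + ");");
}

// Build the lookup table for n = from..to, one digit per n.
function pluralTable(expr, from, to) {
  var plural = compilePlural(expr);
  var table = "";
  for (var n = from; n <= to; n++) {
    table += plural(n);
  }
  return table;
}

console.log(pluralTable("n==1 ? 0 : n==2 ? 1 : n<7 ? 2 : n<11 ? 3 : 4", 0, 11));
// prints 201222233334
```

Two such tables can then be compared character by character to see where a hard-coded rule and a rule string disagree.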

Just nplural

First let's check nplurals. A mismatch does not mean Symfony is wrong, only that Symfony and Sourceforge disagree. Please help update the Sourceforge wiki (or point me to a better location).
2) Symfony\Component\Translation\Tests\PluralizationRulesTest::testPluralFormMatchesSourceforge
PluralForm don't match between hardcoded Symfony and Sourceforge definition.
...
Mismatch by nplural (number of plural forms).
1	fa	Symfony(2) <-> nplurals=1; plural=0

5	ga	Symfony(3) <-> nplurals=5; plural=n==1 ? 0 : n==2 ? 1 : n<7 ? 2 : n<11 ? 3 : 4

2	jv	Symfony(1) <-> nplurals=2; plural=n!=0
2	kn	Symfony(1) <-> nplurals=2; plural=(n!=1)
2	tr	Symfony(1) <-> nplurals=2; plural=(n>1)
2	zh	Symfony(1) <-> nplurals=2; plural=(n > 1)

Patterns

Take the PluralForm nplurals=6; plural= n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5; from Sourceforge. (A note there states that Mozilla uses a different rule, namely nplurals=6; plural=n==0 ? 5 : n==1 ? 0 : n==2 ? 1 : n%100>=3 && n%100<=10 ? 2 : n%100>=11 ? 3 : 4; which leads to a different lookup table.) Using this PluralForm gives us
0-99
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
5444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444

100-199
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
5554444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
That is, we have calculated which plural form index to choose when translating: for "12 lines", for example, we need to choose form 4. While writing this down I noticed I made an error eval-ing the string, as the result should contain all indexes 0-5.
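One plausible explanation for tables that contain only 4 and 5, and this is a guess rather than something I have confirmed, is ternary grouping. C reads a chain of ?: from right to left, while PHP versions before 8 read an unparenthesized chain from left to right. The sketch below writes out the Sourceforge rule both ways in JavaScript; the function names are made up for the illustration.

```javascript
// The Sourceforge rule, grouped the way C (and JavaScript) reads it.
function pluralRight(n) {
  return n == 0 ? 0 : n == 1 ? 1 : n == 2 ? 2
    : n % 100 >= 3 && n % 100 <= 10 ? 3
    : n % 100 >= 11 ? 4 : 5;
}

// The same rule with explicit left-to-right grouping, which is how PHP
// versions before 8 read an unparenthesized chain of ternaries.
function pluralLeft(n) {
  return ((((n == 0 ? 0 : n == 1) ? 1 : n == 2) ? 2
    : (n % 100 >= 3 && n % 100 <= 10)) ? 3
    : n % 100 >= 11) ? 4 : 5;
}

// Build a string of form indexes for n = from..to.
function table(plural, from, to) {
  var out = "";
  for (var n = from; n <= to; n++) {
    out += plural(n);
  }
  return out;
}

console.log(table(pluralLeft, 0, 9));   // 5444444444
console.log(table(pluralRight, 0, 12)); // 0123333333344
```

The left-grouped version starts with 5444... for 0-99 and 5554... for 100-199, as in the tables above, while the right-grouped version uses all six indexes. Either way "12 lines" ends up at form 4.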

Appendix

Just counting

Running jQuery('.level2 tbody > tr').size() on http://translate.sourceforge.net/wiki/l10n/pluralforms gives 132 language codes. (You need to jQuerify the page first; on newer jQuery versions use .length instead of .size().)

Generating PHP code from it.

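Getting the PHP code for those is quite simple.
 // Run in the browser console on the plural forms page, after loading jQuery.
{
  var $ = jQuery;
  var lines = [];
  // Each table row holds the langcode in the first cell and the
  // full plural form rule (nplurals=...; plural=...) in the last cell.
  $('.level2 tbody > tr').each(function () {
    var langcode = $('td:first', this).text().trim();
    var pluralForm = $('td:last', this).text().trim();
    lines.push("'" + langcode + "' => '" + pluralForm + "',");
  });
  var php = "array(\n  "
    + lines.join("\n  ")
    + "\n);\n";
  // The last expression is what the console prints.
  php;
}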

Other resources

  • Way more language codes: https://translations.launchpad.net/+languages
  • msgctxt: http://translate.sourceforge.net/wiki/toolkit/duplicates_duplicatestyle?s[]=msgctxt