Changelog:
Release Overview
The features for this release include support for CLDR 28 and Unicode 8.0.
For more details, including migration issues, see below.
Common Changes
CLDR 28: For details of the many changes in CLDR, see CLDR 28.
Unicode data updated to Unicode 8.0: 41 new emoji characters, 5,771 new ideographs for Chinese/Japanese/Korean, 6 new scripts, improved character properties data, etc.
ICU data size reduced by about 7.2% (1.8MB) via sharing string values across resource bundles. [#11537]
DateIntervalFormat now handles intervals with seconds, and sets FieldPosition more consistently. [#11706, #11726]
DateFormat::createInstanceForSkeleton() caches DateFormat patterns rather than DateTimePatternGenerator instances, for better performance (for cache hits) and lower heap memory consumption. [#11780]
StringSearch (based on collation) defaults to matches on normalization boundaries rather than grapheme cluster boundaries, which yields more matches on Indic text. [#11750]
RuleBasedNumberFormat (spelled-out numbers) now handles rounding (Java only), infinity, and NaN. [#11653, #11760, #8223]
Most of the old Normalizer/unorm.h has been replaced by (and reimplemented via) Normalizer2, and is now deprecated. [#7303]
COLON has been withdrawn as a date pattern character corresponding to the date field [UDAT_]TIME_SEPARATOR_FIELD; there is currently no pattern character corresponding to that field. [#11773]
Support for locale key "cf" to specify currency format style, and interaction with NumberFormat values for UNumberFormatStyle: [#11787]
For NumberFormat style UNUM_CURRENCY / CURRENCYSTYLE, the default is "standard" currency style (typically using minus sign for negative numbers), but the new locale key "cf" may be used with values "standard" or "account" to specify currency format style ("account" indicates accounting style, often using parentheses for negative numbers).
For other NumberFormat styles, the locale key "cf" is ignored (they override the locale preference):
UNUM_CURRENCY_ISO / ISOCURRENCYSTYLE
UNUM_CURRENCY_PLURAL / PLURALCURRENCYSTYLE
UNUM_CURRENCY_ACCOUNTING / ACCOUNTINGCURRENCYSTYLE
UNUM_CASH_CURRENCY / CASHCURRENCYSTYLE
A new NumberFormat style is available to explicitly specify standard style, ignoring the locale key "cf":
UNUM_CURRENCY_STANDARD / STANDARDCURRENCYSTYLE
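The interaction between the requested style and the "cf" key described above can be sketched as a small decision function. This is an illustration of the rule only, written against the C++ standard library rather than ICU; the enum, the function names, and the simplified keyword parsing are all hypothetical and are not ICU's implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the UNumberFormatStyle currency values listed above.
enum class Style {
    Currency, CurrencyIso, CurrencyPlural,
    CurrencyAccounting, CashCurrency, CurrencyStandard
};

// Returns the value of `key` in a locale ID such as "en_US@cf=account;nu=latn",
// or an empty string if the keyword is absent. Simplified parsing, for illustration.
std::string keywordValue(const std::string& localeId, const std::string& key) {
    std::size_t pos = localeId.find('@');
    if (pos == std::string::npos) return "";
    ++pos;
    while (pos < localeId.size()) {
        std::size_t end = localeId.find(';', pos);
        if (end == std::string::npos) end = localeId.size();
        std::size_t eq = localeId.find('=', pos);
        if (eq != std::string::npos && eq < end &&
            localeId.compare(pos, eq - pos, key) == 0) {
            return localeId.substr(eq + 1, end - eq - 1);
        }
        pos = end + 1;
    }
    return "";
}

// Only the plain Currency style consults "cf"; every other style overrides it.
Style effectiveStyle(Style requested, const std::string& localeId) {
    if (requested != Style::Currency) return requested;
    return keywordValue(localeId, "cf") == "account"
               ? Style::CurrencyAccounting
               : Style::CurrencyStandard;
}
```

With this sketch, a Currency request for "en_US@cf=account" resolves to accounting style, while a CurrencyStandard request for the same locale stays standard.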
ICU4C Specific Changes
C API support for CompactDecimalFormat via UNumberFormatStyle additions: UNUM_DECIMAL_COMPACT_SHORT, UNUM_DECIMAL_COMPACT_LONG [#11693]
Larger UnicodeString object stores more characters inside the object without heap allocation; the UnicodeString object size is now build-time-configurable. [#11551]
On 64-bit machines, increase from object size 40 bytes with 15 internal UChars to a new default of 64 bytes with 27 UChars.
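The two size figures above are consistent with a fixed per-object overhead at 2 bytes per UChar: 40 - 15*2 = 10 bytes and 64 - 27*2 = 10 bytes. A minimal sketch of that arithmetic follows; the 10-byte overhead is inferred from these two data points rather than taken from the ICU headers, and the function names are hypothetical.

```cpp
#include <cstddef>

// Bytes per UTF-16 code unit (UChar).
constexpr std::size_t kBytesPerUChar = 2;

// Overhead implied by an object size and its internal UChar capacity.
constexpr std::size_t impliedOverhead(std::size_t objectSize, std::size_t inlineUChars) {
    return objectSize - inlineUChars * kBytesPerUChar;
}

// Assumption: the same 10-byte overhead applies to both configurations.
constexpr std::size_t kAssumedOverhead = 10;

// Internal UChar capacity for a given object size under that assumption.
constexpr std::size_t inlineCapacity(std::size_t objectSize) {
    return (objectSize - kAssumedOverhead) / kBytesPerUChar;
}

static_assert(impliedOverhead(40, 15) == 10, "old 64-bit layout");
static_assert(impliedOverhead(64, 27) == 10, "new 64-bit default");
static_assert(inlineCapacity(64) == 27, "matches the stated new default");
```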
Some C++ classes now have swap() and moveFrom() methods, and support C++11 move semantics on compilers that support them. [#10086]
UnicodeString, LocalPointer, LocalArray
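As a rough picture of what swap(), moveFrom(), and C++11 move semantics mean for an owning class, here is a minimal sketch of a hypothetical owning pointer. The class name OwnedPtr and its members are invented for illustration; this is not the LocalPointer source, only the general pattern.

```cpp
#include <utility>

// Hypothetical owning pointer, illustrating swap()/moveFrom()/move semantics.
template <typename T>
class OwnedPtr {
public:
    explicit OwnedPtr(T* p = nullptr) : ptr_(p) {}
    ~OwnedPtr() { delete ptr_; }

    // Ownership is unique, so copying is disabled.
    OwnedPtr(const OwnedPtr&) = delete;
    OwnedPtr& operator=(const OwnedPtr&) = delete;

    // C++11 move construction and assignment steal the pointer.
    OwnedPtr(OwnedPtr&& other) noexcept : ptr_(other.ptr_) { other.ptr_ = nullptr; }
    OwnedPtr& operator=(OwnedPtr&& other) noexcept { return moveFrom(other); }

    // Explicit move: usable even where rvalue references are unavailable.
    OwnedPtr& moveFrom(OwnedPtr& other) noexcept {
        if (this != &other) {
            delete ptr_;
            ptr_ = other.ptr_;
            other.ptr_ = nullptr;
        }
        return *this;
    }

    // Exchanges ownership without allocating or freeing anything.
    void swap(OwnedPtr& other) noexcept { std::swap(ptr_, other.ptr_); }

    T* get() const { return ptr_; }
    bool isNull() const { return ptr_ == nullptr; }

private:
    T* ptr_;
};
```

After `b.moveFrom(a)`, `a` is left empty and `b` owns the object; no copy of the pointee is made.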
DecimalFormat code refactored to fix bugs, improve maintainability, and improve performance. [#10458]
New FilteredBreakIterator suppresses certain segment boundaries. For example, it can suppress the sentence boundary in the middle of "Mr. Smith". [#11248]
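The idea of suppressing a boundary after an abbreviation can be shown with a toy splitter. This sketch uses only the C++ standard library and a deliberately naive rule (a sentence starts after a period followed by a space); the function name and the suppression set are hypothetical, and the real FilteredBreakIterator works on ICU break rules and locale data instead.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns the offsets at which a new sentence starts, using the naive rule
// "period followed by a space", but skipping periods that end a token listed
// in `suppressions` (for example "Mr.").
std::vector<std::size_t> sentenceStarts(const std::string& text,
                                        const std::set<std::string>& suppressions) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (text[i] != '.' || text[i + 1] != ' ') continue;
        // Locate the start of the space-delimited token that ends at this period.
        std::size_t tokenStart = text.rfind(' ', i);
        tokenStart = (tokenStart == std::string::npos) ? 0 : tokenStart + 1;
        const std::string token = text.substr(tokenStart, i - tokenStart + 1);
        if (suppressions.count(token) == 0) {
            starts.push_back(i + 2);  // first character after ". "
        }
    }
    return starts;
}
```

For "Mr. Smith arrived. He sat down.", the unfiltered rule reports a break before "Smith" and before "He"; with "Mr." in the suppression set only the break before "He" remains.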
The internal, shared cache has been changed from unbounded to bounded. [#11767]
For [U]BreakIterator with type UBRK_SENTENCE, the locale key "ss" can now be used with value "standard" to specify that standard sentence break suppression data should be used, or with value "none" to indicate that no break suppression data should be used (the default). [#11770]
Collator: first-time startup time improved 20% due to precalculated unsafe-backward table [#11886]
A number of memory leaks and buffer overruns have been fixed based on static code analysis, mostly in data build tools
The features for this release include support of CLDR 27 (with a major cleanup of region locales, among many other improvements), formatting for scientific notation ("1.2 × 10³"), an update to Unicode 7.0 data for spoof-checking, narrow AM/PM markers ("7:45p"), and various performance enhancements. For C/C++, there are new methods for flexible dates ("Nov 10", or "Sept 2015"), named capture groups for regular expressions, formatting of compound units ("3.5 meters per second"), new C wrappers, and independent timezone resource loading. ICU4J has been improved and tested for using ICU4C data and for running on Android.
Data from the CLDR 25 release: Many bug fixes
Time zone data: 2014b, including post CLDR 25 time zone data update to CLDR.
U+20BD Ruble Sign added (from Unicode 7.0, otherwise ICU 53 still uses Unicode 6.3)
MeasureFormat API for new units in CLDR 24
Hoisted setContext/getContext from SimpleDateFormat to DateFormat, implement context-sensitive capitalization of relative dates
Added setContext/getContext methods to NumberFormat (and unum_setContext/unum_getContext for UNumberFormat), implement context-sensitive number formatting (for RBNF spellout)
Improved lenient date parsing consistency between ICU4C and ICU4J, add finer-grained control of date parsing leniency
Fixed numeric rounding in TimeUnitFormat
Fixes to Unicode 6.3 bidirectional algorithm implementations to behave exactly like reference implementations
Improved UTF-16 charset detection
Collation code re-implemented
Many bugs fixed, some enhancements implemented
Passes full UCA conformance tests now
Updated to UCA 6.3/CLDR 24 root collation
Performance: C++ UTF-8 and Java string comparisons significantly faster (very small reduction for C++ UTF-16)
Collation data size (uncompressed) reduced from 4.48MB (ICU 52) to 2.62MB
New data format, removed empty files, fixed genrb bug
More APIs function when collation rule strings have been omitted from the data files (e.g., getTailoredSet())
Java Collator.compare(Object, Object) now works with CharSequence, not just String
Java Collator base class (does not apply to RuleBasedCollator instances): getters for strength, decomposition mode, and locales return hardcoded default values; their setters do nothing
Rule syntax and semantics tightened and improved, matching LDML 25 Collation Rule Syntax
In particular, rule chains now must start with a reset.
Setting of variableTop deprecated, and not supported in rule syntax any more
Replaced by the new maxVariable setting; see LDML 25 Collation Settings
Accounting format supported in NumberFormat
RelativeDateTimeFormatter class for formatting relative times such as "3 weeks ago" or "next Tuesday."
Updated Spoof Checker for Unicode Security Standard version 6.3.
Unicode 6.3: New bidi control codes, new Bidi_Class property values, two new bidi "bracket" properties; for other property value changes see the UAX 44 summary.
The bidi algorithm implementation has also been updated to support the new properties and to match the updated algorithm in the Unicode 6.3 version of UAX 9.
Note: ICU 52 still uses collation root data based on Unicode Collation Algorithm 6.2 (UCA 6.2). (However, ICU 52 does use CLDR 24 collation tailoring data.)
CLDR 24: Improved coverage for top 70+ languages, fractional plural rules and forms, many new measurement units, major simplification of collation rule syntax, preliminary version of European Ordering Rules, new relative fields such as “last Sunday” and “now”, and much more.
Time zone data: 2013g.
Support new variants of Islamic calendar:
"islamic-umalqura": Umm al-Qura.
"islamic-tbla": Tabular (fixed intercalary years), with astronomical epoch.
Made Calendar getDayOfWeekType behave as documented.
New API for converting between Windows time zone ID and IANA tz database ID.
Technology Preview: New API for more granular control of DateFormat parse leniency.
DateTimePatternGenerator:
Support recently-added time zone pattern characters O, X, x and updated support for V, Z.
Support newly-defined skeleton character ‘J’ to generate preferred hour cycle without any day period indicator (such as AM/PM for h).
Implement support for plurals that depend on displayed fractional values.
MessageFormat and currency formatting etc. select appropriate plural forms for values with decimal digits (after the decimal point).
Segmentation:
Add dictionary-based word & line break for Lao.
Bug fixes:
* fix for enumset.h not being installed on Windows
* zOS pkgdata fix
* Test fixes
* Region enumeration fix
* make stable sort faster
* host failures for DateFormatTest
* LayoutEngine security patches (see above)
* ubrk fix for word_POSIX infinite loop
* fix memory leak/crash in LayoutEngine
* fix header guard typo in layout/TibetanReordering.h
Common Changes
==============
CLDR 23: Collation tailorings put native script first; non-Gregorian calendar formats are more consistent; much improved data for Armenian (hy), Georgian (ka), Mongolian (mn), and Welsh (cy); …
Time zone data: 2013b
Date format/parse now supports CLDR short weekday names ("EEEEEE", "cccccc").
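In LDML date patterns the number of repeated 'E' letters selects the weekday-name width, and six letters select the new short width. The following sketch shows that mapping; it covers 'E' only, the function names are hypothetical, and the English sample names are assumed values for illustration rather than data read from CLDR.

```cpp
#include <cassert>
#include <string>

enum class Width { Abbreviated, Wide, Narrow, Short };

// Maps the count of consecutive 'E' pattern letters to a weekday-name width.
// Precondition: 1 <= letterCount <= 6.
Width weekdayWidth(int letterCount) {
    assert(letterCount >= 1 && letterCount <= 6);
    if (letterCount <= 3) return Width::Abbreviated;  // E, EE, EEE
    if (letterCount == 4) return Width::Wide;         // EEEE
    if (letterCount == 5) return Width::Narrow;       // EEEEE
    return Width::Short;                              // EEEEEE
}

// Sample English names for Tuesday at each width (assumed values).
std::string sampleTuesday(Width w) {
    switch (w) {
        case Width::Abbreviated: return "Tue";
        case Width::Wide:        return "Tuesday";
        case Width::Narrow:      return "T";
        case Width::Short:       return "Tu";
    }
    return "";
}
```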
Support DisplayContext for date formatting, locale display names.
DateTimePatternGenerator behavior is now much more consistent between C and J.
Support new timezone pattern characters in LDML spec: X+, x+, O, OOOO, V, VV, VVV.
Updated SpoofChecker for v5 of UTS39.
AlphabeticIndex enhancements:
New thread-safe ImmutableIndex sub-API
Build an index for a custom Collator.
Make data-driven for Chinese collations.
New API for CLDR script metadata.
ICU4C Specific Changes
======================
Support for “dangi” Korean luni-solar calendar (already in ICU4J).
Add CompactDecimalFormat (already in ICU4J).
Add TerritoryContainment APIs (already in ICU4J).
UnicodeString default constructor and destructor now inline.
Layout engine now supports 'morx' tables.
Fixed some ICU 50 regressions:
Affixes set with e.g. DecimalFormat::setPositivePrefix were ignored for parse.
UNUM_PARSE_INT_ONLY no longer handled grouping separator.
Add ucal_getTimeZoneID.
The C++ AlphabeticIndex implementation is now on par with Java, including full support for all Chinese collation tailorings.
U8_NEXT() and similar low-level macros now support NUL-terminated UTF-8 strings.
New macros like U8_NEXT_OR_FFFD() return U+FFFD for an ill-formed sequence.
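To make the "return U+FFFD for an ill-formed sequence" behaviour concrete, here is a stand-alone decoder sketch that does not use the ICU macros. The function name nextOrFFFD is hypothetical, and the number of bytes it consumes for an ill-formed sequence is a simplification that may differ from what the ICU macros do.

```cpp
#include <cstddef>
#include <string>

// Decodes one code point from `s` starting at index `i` and advances `i`.
// Returns U+FFFD for an ill-formed sequence; at least one byte is consumed.
// Precondition: i < s.size().
char32_t nextOrFFFD(const std::string& s, std::size_t& i) {
    const char32_t kReplacement = 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // ASCII

    int trailCount;
    char32_t cp;
    char32_t minValue;  // smallest code point allowed for this length
    if (lead >= 0xC2 && lead <= 0xDF)      { trailCount = 1; cp = lead & 0x1F; minValue = 0x80; }
    else if (lead >= 0xE0 && lead <= 0xEF) { trailCount = 2; cp = lead & 0x0F; minValue = 0x800; }
    else if (lead >= 0xF0 && lead <= 0xF4) { trailCount = 3; cp = lead & 0x07; minValue = 0x10000; }
    else return kReplacement;  // lone trail byte or invalid lead byte

    for (int k = 0; k < trailCount; ++k) {
        if (i >= s.size()) return kReplacement;  // truncated sequence
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return kReplacement;  // leave the non-trail byte unread
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return kReplacement;
    }
    return cp;
}
```

A caller can loop `while (i < s.size())` and treat every returned U+FFFD as a marker for damaged input instead of checking a negative error value.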
Conversion: New "good one-way" mapping type, for example for Variation Selector sequences.
* 9306 Layout Engine changes for harfbuzz integration
* 9677 Affixes set with e.g. DecimalFormat::setPositivePrefix now ignored for parse
* 9714 OS/400 test failures
* 9728 Fail building icu4c with mingw-w64
* 9737 Locale::GetDefault() in locid.cpp is not thread-safe
* 9771 Updated Currency from/to data (CLDR 5470)
* 9748 Visual Studio 2010/2012 issues
* 9780 UNUM_PARSE_INT_ONLY no longer handles grouping sep
* 9783 New Turkish Lira symbol
* 9789 Date format parsing problem with new CLDR data
* 9793 Currency data integration issue with CLDR 5470 changes
* 9801 UCONFIG_NO_CONVERSION test failure
* 9802 No data test failure
* Unicode 6.2: Turkish Lira Sign, improved word & line segmentation (BreakIterator) for symbols
* CLDR 22.1: Data coverage & quality improved across all major languages; new short width type for weekday names; new zhuyin (Bopomofo) collation for Chinese; improved data for CompactDecimalFormat & RBNF
* Time zone data: 2012h
* Ordinal-number support in MessageFormat & PluralRules
* Deprecate setLocale(locale) in PluralFormat
* Dictionary-based break iterators (word segmentation):
* Support Chinese & Japanese, use more compact dictionary format, port all but Khmer support to Java
* Update Khmer dictionary
* Change Java util.ListFormat to text.ListFormatter and other updates, use CLDR data, port to C++
* Add updated IBM-eucJP and IBM-5233 converter
* Improve number formatting performance
* C++ GenderInfo: Effective combined gender of a list of people's genders (ported from Java)
* Thread safety support cannot be removed (see the Readme)
* Default compilers: Clang is now used if available (see the Readme)
* C++ Collator API cleanup, subclassing-API-breaking changes (see the Readme)
* Add option to genrb tool for writing java resource bundle files
* Time zone format APIs
* 9242 ICU4C fails to parse pattern containing EEE properly whilst ICU4J parses it successfully
* 9258 Number format performance
* 9283 uregex_open fails for look-behind assertion + case-insensitive
* 9284 Date format roundtrip test failure
* 9295 HPPA endianness detection
* 9313 Problem building ICU4C with Cygwin/MSVC
* 9332 Linux s390 endianness detection
* 9336 Problem building ICU4C 49.1.1 on zOS
* Unicode 6.1: New scripts & blocks; changes to grapheme break & line break
property values; some characters change from symbol to Po or No; etc.
* CLDR 21.0.1: Changes in segmentation data to match Unicode 6.1; new structures
for support of Chinese calendar, for context-dependent capitalization, for
gender of lists of people, for ordinal categories, and for multiple number
systems per locale; deprecation of "commonlyUsed" element in timezone names;
removal of "whole-locale" aliases; major cleanups of timezone names,
delimiter data, abbreviated number data.
* Normalizer2 API additions
* Easier-to-use getInstance() variants; e.g., getNFDInstance()
* Getter for the combining-class value for a code point
* Getter for the raw Decomposition_Mapping
* Pairwise composition
* TimeZone class: (C++) Getter for unknown time zone, (Java) fields for GMT &
unknown zone
* Support for deprecation of the "commonlyUsed" element for CLDR metazones
* DateTimePatternGenerator can now use separate patterns for skeletons that
differ only in MMM vs MMMM or EEE vs EEEE, etc.
* Support for custom DecimalFormatSymbols in RuleBasedNumberFormat
* Format and parse Chinese calendar dates including support for intercalary
months
* Context Transforms for context-dependent capitalization behavior
* APIs for TimeZoneNames and TimeZoneFormat
* Support for new date format pattern "ZZZZZ" for ISO 8601 zone format
* Options for ambiguous local time resolution in Calendar
* Support for ISO 4217 numeric currency code
* CLDR 2.0: The CLDR 2.0 release contains numerous improvements and bug fixes
approved by the CLDR committee, including much additional data for many
languages.
* Explicit parent locale support in data imported from CLDR.
* MessageFormat and related classes (choice/plural/select) have been
reimplemented, with several improvements and some incompatible changes.
* Extended PluralFormat pattern syntax supports explicit-value forms and
offsets.
* Utility APIs in PluralRules (get some/all/unique keyword values)
* Time zone API to return a list of available canonical system time zone IDs.
* Time zone API to return a region.
* Collation: Full implementation & public API for script reordering
* Dictionary-type trie
* GB18030-2005 update
* Common Locale Data Repository (CLDR) 1.9.1
* Update timezone data support to Olson 2011c
* 8271 UCOL_RUNTIME_VERSION should be updated for 4.6
* 8277 Collation Reordering Use Of USCRIPT_UNKNOWN
* 8290 Can't find Hangul with search coll (usearch doesn't handle CE iter
behavior)
* 8303 ULocale#toLanguageTag() should not supply "und" as language when the
locale has only private use
* 8341 USpoof uses NFKD, should be NFD
major changes:
Locale Data: ICU uses and supports data from Common Locale Data Repository
(CLDR) 1.7, which includes data for 146 languages, 159 territories, and
468 locales: 21% more locale data than the previous release.
Number system support and the number keyword.
Number system override in DateFormat
Numerics used by Hebrew Calendar date in Hebrew locale
BCP47 (language tag) / Locale transformation
BCP47 mapping of LDML keywords
Encoding selector: Return a list of charsets that can handle the input text
Simple duration: Implementation of CLDR duration format
Available/Preferred keywords for a locale (Calendar, Collation, and Currency)
StringPrep standard profiles: RFC3491 NAMEPREP, RFC3530 NFS4, RFC3722 iSCSI,
RFC3920 NodePrep/ResourcePrep, RFC4011 MIB, RFC4013 SASLprep, RFC4505 trace
and RFC4518 LDAPprep
Miscellaneous Arabic shaping enhancements
UTF-8 friendly internal data structure for Unicode data lookup
API to get CLDR version used by ICU
ISCII charset converter updates (added Gurumukhi, other updates)
Performance improvements in Time Zone Name format/parse, and in
DateIntervalFormat construction
Pkgsrc changes:
o New MASTER_SITE
o Adjust PLIST
o Remove no-longer-needed patches, since corresponding changes
have been adopted upstream
o BUILDLINK_ABI_DEPENDS bumped to >=4.0, since a new shared library
version is installed
o Fixes security vulnerability, ref. below.
Dependent pkgsrc packages will have their revisions bumped shortly
due to the (possibly/probably) changed ABI.
Upstream changes:
4.0.1:
ICU4C 4.0.1 is a maintenance release of ICU4C 4.0. The primary
changes of this release were:
* Updated time zone data to 2008i
* Technical preview of a string search implementation using the
Boyer-Moore algorithm (#6286). For details, see the upstream
tech note.
* #5691 Conversion: consistent illegal sequences
* #6435 Bad @stable ICU4.0 tags
* #6597 TestDisplayNamesMeta failure
* #6670 Test failure in format/TimeZoneTest/TestShortZoneIDs
4.0:
Major changes in ICU 4.0 include the following:
* Common Changes
o Unicode 5.1 (#5696)
o Locale Data: ICU uses and supports data from Common
Locale Data Repository (CLDR) 1.6, which includes many
improvements in quality and quantity of data.
o add/removeLikelySubtags (#6124)
o Charset converter file size improvement (#5987)
o Date Interval Formatting (#6157) Note: Calendar type
supported by this feature is Gregorian only in this
release.
o Improved Plural support
* ICU4C Specific Changes
o Additional Calendars
+ Chinese (#4081)
+ Coptic/Ethiopic (#4571)
* ICU4J Specific Changes
o Charset
+ Graduated from Technology Preview status
+ ICU2022 Converter (#5791)
+ HZ Converter (#6128)
+ SCSU/BOCU-1 Converter (#2147)
+ Charset Converter Callback (#6144)
o Thai Dictionary break iterator (#5385)
o JDK TimeZone support (#5975)
o Locale Service Provider (#5976)
o More convenient formatting of year+month, day+month,
and other combinations (#6304)
o Simple Duration Formatting (#6303)
* ICU4C Security Fixes
ICU4C 4.0 resolves the vulnerabilities CVE-2007-4770 and
CVE-2007-4771 which were found in earlier versions of ICU.
The standard ICU tests verify that these have been corrected;
however, the updated versions of the previous tests may be
run by applying the following patch to ICU 4.0: r24324.
ICU4C and ICU4J 4.0 also resolve the issue underlying
CVE-2008-1036.
Major changes in ICU 3.6 include the following:
- Unicode: ICU uses and supports Unicode 5.0, the latest major release of Unicode. Unicode 5.0 will be used in many operating systems and applications, so this version of ICU is important for maintaining interoperability with them. More information about Unicode 5.0 can be found in the Unicode press release.
- Locale Data: ICU uses and supports data from Common Locale Data Repository (CLDR) 1.4, which includes many improvements in quality and quantity of data. There is 25% more CLDR locale data in 245 locales in ICU.
- ICU4C Specific Changes
- Charset Detection: A charset detection framework was added, which provides heuristics for detecting the charset for unlabeled sequences of bytes.
- Layout: The font layout engine has support added for Tibetan, Sinhala and Old Hangul.
- BiDi: The BiDi algorithm was enhanced to be more flexible and efficient.
- ICU Data Management: The new icupkg tool provides an easier way to manage ICU's data library. This tool allows you to add, update or remove data from ICU's data archive.
- Time Zones: The time zone data is modularized to allow easier building and updating of the data.
- Word Boundaries: The Thai word break iteration was improved to be more accurate. Also, dictionary-based detection of Thai word boundaries is now active for all locales.
- UText
- The BreakIterator uses UText for abstract text processing.
- 64-bit indexing is now used to allow access to larger chunks of text.
- API for read-only locking for security and robustness was added.
- Performance
- The u_sprintf/u_sscanf performance from the icuio library has been improved for number formatting/parsing.
- Constructing a DateFormat is significantly faster for many locales.
- Opening and closing a charset converter is significantly faster.
- The UTF-8 transformation functions and macros are faster.
- The UText API was improved for performance.
- The collation open and close functions have a small performance improvement.