String Deduplication – A new feature in Java 8 Update 20

7 Comments

Strings consume a lot of memory in any application. Especially the char[] containing the individual UTF-16 characters is contributing to most of the memory consumption of a JVM by each character eating up two bytes.
It is not uncommon to find 30% of the memory consumed by Strings, because not only are Strings the best format to interact with humans, but also popular HTTP APIs use lots of Strings. With Java 8 Update 20 we now have access to a new feature called String Deduplication, which requires the G1 Garbage Collector and is turned off by default.
String Deduplication takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach:
Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char.
If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

So how does this work in practice? First you need the Java 8 Update 20 which was just recently released.

Then you can run the following code with: -Xmx256m -XX:+UseG1GC

public class LotsOfStrings {
 
  private static final LinkedList<String> LOTS_OF_STRINGS = new LinkedList<>();
 
  public static void main(String[] args) throws Exception {
    int iteration = 0;
    while (true) {
      for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 1000; j++) {
          LOTS_OF_STRINGS.add(new String("String " + j));
        }
      }
      iteration++;
      System.out.println("Survived Iteration: " + iteration);
      Thread.sleep(100);
    }
  }
}

This code will run and terminate after 30 iterations with an OutOfMemoryError.

Now run it with String Deduplication enabled:
-Xmx256m -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

It will now run significantly longer and terminate after 50 iterations.

The JVM now also prints what it does, so lets take a look:

[GC concurrent-string-deduplication, 4658.2K->0.0B(4658.2K), avg 99.6%, 0.0165023 secs]
   [Last Exec: 0.0165023 secs, Idle: 0.0953764 secs, Blocked: 0/0.0000000 secs]
      [Inspected:          119538]
         [Skipped:              0(  0.0%)]
         [Hashed:          119538(100.0%)]
         [Known:                0(  0.0%)]
         [New:             119538(100.0%)   4658.2K]
      [Deduplicated:       119538(100.0%)   4658.2K(100.0%)]
         [Young:              372(  0.3%)     14.5K(  0.3%)]
         [Old:             119166( 99.7%)   4643.8K( 99.7%)]
   [Total Exec: 4/0.0802259 secs, Idle: 4/0.6491928 secs, Blocked: 0/0.0000000 secs]
      [Inspected:          557503]
         [Skipped:              0(  0.0%)]
         [Hashed:          556191( 99.8%)]
         [Known:              903(  0.2%)]
         [New:             556600( 99.8%)     21.2M]
      [Deduplicated:       554727( 99.7%)     21.1M( 99.6%)]
         [Young:             1101(  0.2%)     43.0K(  0.2%)]
         [Old:             553626( 99.8%)     21.1M( 99.8%)]
   [Table]
      [Memory Usage: 81.1K]
      [Size: 2048, Min: 1024, Max: 16777216]
      [Entries: 2776, Load: 135.5%, Cached: 0, Added: 2776, Removed: 0]
      [Resize Count: 1, Shrink Threshold: 1365(66.7%), Grow Threshold: 4096(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
   [Queue]
      [Dropped: 0]

For our convenience we do not need to add up all data ourselves but can use the handy totals calculation.
The above snippet is the forth execution of String Deduplication, it took 16ms and looked at about 120k Strings.
All of them are new, meaning not yet looked at. These numbers look different in real apps, where strings are passed multiple times, thus some might be skipped or have a hashcode already (as you might know the hash code of a String is computed lazy).
In above case all strings could be deduplicated, removing 4.5MB of data from memory.
The Table section gives statistics about the internal tracking table, and the Queue one lists how many requests for deduplication have been dropped due to load, which is one part of the overhead reduction mechanism.

So how does this compare to String Interning? I blogged about how great String Interning is for memory efficiency. In fact the String Deduplication is almost like interning with the exception that interning reuses the whole String instance, not just the char array.

The argument the creators of JDK Enhancement Proposal 192 make is that often developers do not know where the right place to intern strings would be, or that this place is hidden behind frameworks. As I wrote, you need some knowledge where you typically encounter duplicates (like country names).
String Deduplication also benefits duplicate Strings across applications inside the same JVM and thus also includes stuff like XML Schemas, urls, jar names etc which one normally would assume not appear multiple times.

It also adds no runtime overhead as it is performed asynchronously and concurrent during garbage collection, while String Interning happens in the application thread. This now also explains the reason we find that Thread.sleep() in above code. Without the sleep there would be too much pressure on GC, so String Deduplication would not run at all. But this is a problem only for such sample code. Real applications usually find a few ms spare time to run String Deduplication.

Kommentare

  • 29. August 2014 von BEMPEL

    minor comment: char array contains UTF-16 characters:

    http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1

    “The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.”

    • Thanks for the clarification, I have updated the text. I think the main aspect to point out that a char takes up two bytes.

      • 1. September 2014 von Christoffer Hammarström

        Actually, to be even more specific, each char is a UTF-16 code *unit*, not a character.

        So it would be more correct to say “Especially the char[] containing the individual UTF-16 code units”.

          • 8. January 2015 von Michael Böckling

            This is actually important when you want to correctly split a String. Using charAt() might corrupt your data, as some characters (CJK languages) need 2 code units. What you want in that case is offsetByCodePoint(), but you give up O(1) string indexing in return.

            It is a tricky topic, and confusing the terminology is easy, I just wanted to point out that there is some value in being pedantic about it.

  • It’s about 1.2% performance promotion on my solr 4.9 search cluster

    • 2. September 2014 von codeslubber

      In other words, meaningless. Wonder if this was in the hopper for 4 years like Java 8, which hardly anyone seems to be using.

Comment

Your email address will not be published. Required fields are marked *