Save Memory by Using String Intern in Java

During Attila Szegedis talk about “lessons learned about the JVM” at QCon London, I was surprised that he emphasized the importance of knowing the amount of data you store in memory. It is not common to be concerned about object size in enterprise Java programming, but he gave a good example of what they had to do Twitter.

Recap: Memory Footprint of Data

Question: How much memory does the String “Hello World” consume?
Answer: 62/86 Bytes (32/64 bit Java)!
This breaks down into 8/16 (Object Header for String) + 11 * 2 (characters) + [8/16 (Object Header char Array) + 4 (array length) padded to 16/24] + 4 (Offset) + 4 (Count) + 4 (HashCode) + 4/8 (Reference to char Array). [On 64Bit the size of String Object is padded to 40].

The Problem

Imagine you have a lot of Locations attached to tweets in your data store. The implementation of the location as a Java class could look like this

class Location {
	String city;
	String region;
	String countryCode;
	double long;
	double lat;
}

So if you load all the locations of tweets ever made, it is quite obvious that you load a lot of String objects, and at the scale Twitter has, there are for sure a lot of duplicate Strings. Attila said that this data did not fit into a 32 GB heap. So the question is: can we reduce memory consumption, so that all Locations fit into memory?

Let us have a look at two possible solutions, which can even augment each other.

Attilas Solution

There is a, more or less, hidden dependency between the data stored in the Location class, which, once realized, will solve the problem in an elegant way, non-technical way. We can just apply normalization to the Object and split it into two:

class SharedLocation {
	String city;
	String region;
	String countryCode;
}
class Location {
	SharedLocation sharedLocation;
	double long;
	double lat;
}

This is actually a neat solution, because cities rarely change the region and country they reside in. The combination of those Strings is unique. And it is flexible, so that even violations of those uniqueness can be handled. This makes always sense for data which could be user input. By doing so, multiple Tweets from “Solingen, NRW, DE” just consume one SharedLocation.
But still “Ratingen, NRW, DE” would store 3 additional String in memory, rather than just the new one “Ratingen”. By refactoring the data model like that the amount of data for that Twitter research project dropped to about 20GB.

String interning

But what to do when you do not want to, or simply cannot refactor the data model? Or the researcher at twitter would not have had a 20GB Heap?
The answer is String interning, which keeps every String only once in memory. But there is huge confusion about String interning. Many people ask if it speeds up equals checks, because equal strings are actually identical Strings when using interning. Yes, it might do that (like all objects should)

// java.lang.String
public boolean equals(Object anObject) {
  if (this == anObject) {
    return true;
  }
  //...
}

But equals performance is not the reason you should do interning. String interning is intended to reuse String objects to save memory.

Only use String.intern() on Strings you know are occurring multiple times, and only do it to save memory

How effective interning is, is determined by the ratio of duplicate/unique strings. And it depends on whether it is easy to change code at string generating places.

So how does it work?

String interning takes a String instance (so it already exists in the Heap) and checks if an identical copy exists already in a StringTable.
That StringTable is basically a HashSet that stores the String in the Permanent Generation. The only purpose of that Table is to keep a single instance of the String alive. If it is in there, the instance is returned. If its not, its added to the String Table:

// OpenJDK 6 code
JVM_ENTRY(jstring, JVM_InternString(JNIEnv *env, jstring str))
  JVMWrapper("JVM_InternString");
  JvmtiVMObjectAllocEventCollector oam;
  if (str == NULL) return NULL;
  oop string = JNIHandles::resolve_non_null(str);
  oop result = StringTable::intern(string, CHECK_NULL);
  return (jstring) JNIHandles::make_local(env, result);
JVM_END
 
oop StringTable::intern(Handle string_or_null, jchar* name,
                        int len, TRAPS) {
  unsigned int hashValue = hash_string(name, len);
  int index = the_table()->hash_to_index(hashValue);
  oop string = the_table()->lookup(index, name, len, hashValue);
 
  // Found
  if (string != NULL) return string;
 
  // Otherwise, add to symbol to table
  return the_table()->basic_add(index, string_or_null, name, len,
                                hashValue, CHECK_NULL);
}

As a result, this specific instance of a String only exists once.

How to use interning

The right place to use String interning is where you read from data store and add the Objects/String to a larger scope. Note that ALL Strings which are hardcoded (as constant or anywhere in code) are automatically interned by the compiler.
An example would be:

String city = resultSet.getString(1);
String region = resultSet.getString(2);
String countryCode = resultSet.getString(3);
double city = resultSet.getDouble(4);
double city = resultSet.getDouble(5);
 
Location location = new Location(city.intern(), region.intern(), countryCode.intern(), long, lat);
allLocations.add(location);

All newly created location objects will use the interned string. The temporary strings read from the database will be garbage collected.

How to find out how effective would be string interning

You are best off taking a heap dump of a quite full heap. It can even be collected at an OutOfMemoryError.
Open it in MAT and select the java.lang.String from the histogram. On that one use “Java Basics” and “Group By Value”

Depending on the heap size, this may take a long time. In the end the result will be like this. Sorting either along retained heap or number of objects will reveal interesting things:

From this screenshot we can see that empty Strings take a lot of memory! 2 million empty Strings take a total of 130MB. Then we see some amount of JavaScript that is loaded, a few more technical strings like the keys, which are used for localization. And we can see some Strings related to business logic.
These business logic strings are probably the easiest to intern, because we might know where they are loaded into memory.
For the other ones we need to use “Merge shortest Path to GC Root” to identify where they are stored, which may or may not reveal to use where it is created.

Tradeoffs

So why not always do String interning? Because it slows your code down!
Here a small example:

private static final int MAX = 40000000;
public static void main(String[] args) throws Exception {
	long t = System.currentTimeMillis();
	String[] arr = new String[MAX];
	for (int i = 0; i < MAX; i++) {
		arr[i] = new String(DB_DATA[i % 10]);
		// and: arr[i] = new String(DB_DATA[i % 10]).intern();
	}
	System.out.println((System.currentTimeMillis() - t) + "ms");
	System.gc();
	System.out.println(arr[0]);
}

The code uses a String array to keep a strong reference to the String objects. and we print the first element in the end to avoid removal of the structure due to optimization. Then we load 10 different Strings from the database. I used new String() here to illustrate the temp String allocation you always do when you read from storage. At the end I do a GC, so that the results are correct and no leftovers are included.
This was run on a 64bit Windows, JDK 1.6.0_27, i5-2520M with 8GB Ram. Run with -XX:+PrintGCDetails -Xmx6G -Xmn3G to log all GC activity. Here is the output:
Without intern()

1519ms
[GC [PSYoungGen: 2359296K->393210K(2752512K)] 2359296K->2348002K(4707456K), 5.4071058 secs] [Times: user=8.84 sys=1.00, real=5.40 secs] 
[Full GC (System) [PSYoungGen: 393210K->392902K(2752512K)] [PSOldGen: 1954792K->1954823K(1954944K)] 2348002K->2347726K(4707456K) [PSPermGen: 2707K->2707K(21248K)], 5.3242785 secs] [Times: user=3.71 sys=0.20, real=5.32 secs] 
DE
Heap
 PSYoungGen      total 2752512K, used 440088K [0x0000000740000000, 0x0000000800000000, 0x0000000800000000)
  eden space 2359296K, 18% used [0x0000000740000000,0x000000075adc6360,0x00000007d0000000)
  from space 393216K, 0% used [0x00000007d0000000,0x00000007d0000000,0x00000007e8000000)
  to   space 393216K, 0% used [0x00000007e8000000,0x00000007e8000000,0x0000000800000000)
 PSOldGen        total 1954944K, used 1954823K [0x0000000680000000, 0x00000006f7520000, 0x0000000740000000)
  object space 1954944K, 99% used [0x0000000680000000,0x00000006f7501fd8,0x00000006f7520000)
 PSPermGen       total 21248K, used 2724K [0x000000067ae00000, 0x000000067c2c0000, 0x0000000680000000)
  object space 21248K, 12% used [0x000000067ae00000,0x000000067b0a93e0,0x000000067c2c0000)

With intern()

4838ms
[GC [PSYoungGen: 2359296K->156506K(2752512K)] 2359296K->156506K(2757888K), 0.1962062 secs] [Times: user=0.69 sys=0.01, real=0.20 secs] 
[Full GC (System) [PSYoungGen: 156506K->156357K(2752512K)] [PSOldGen: 0K->18K(5376K)] 156506K->156376K(2757888K) [PSPermGen: 2708K->2708K(21248K)], 0.2576126 secs] [Times: user=0.25 sys=0.00, real=0.26 secs] 
DE
Heap
 PSYoungGen      total 2752512K, used 250729K [0x0000000740000000, 0x0000000800000000, 0x0000000800000000)
  eden space 2359296K, 10% used [0x0000000740000000,0x000000074f4da6f8,0x00000007d0000000)
  from space 393216K, 0% used [0x00000007d0000000,0x00000007d0000000,0x00000007e8000000)
  to   space 393216K, 0% used [0x00000007e8000000,0x00000007e8000000,0x0000000800000000)
 PSOldGen        total 5376K, used 18K [0x0000000680000000, 0x0000000680540000, 0x0000000740000000)
  object space 5376K, 0% used [0x0000000680000000,0x0000000680004b30,0x0000000680540000)
 PSPermGen       total 21248K, used 2725K [0x000000067ae00000, 0x000000067c2c0000, 0x0000000680000000)
  object space 21248K, 12% used [0x000000067ae00000,0x000000067b0a95d0,0x000000067c2c0000)

We can see that the difference is significant. It took 3.3 seconds longer to run with interning. But the memory saved was enormous. When the code finished, the program using interning used 253472K(250M) of Memory. The other one used 2397635K (2.4G). That is quite a difference and illustrates nicely the tradeoffs when using String interning.

  • Facebook
  • Delicious
  • Digg
  • StumbleUpon
  • Reddit
  • Blogger
  • LinkedIn
Fabian Lange

9 Responses to Save Memory by Using String Intern in Java

  1. Der Code wird nicht nur langsamer, sondern “interne Strings” werden ja auch nicht mehr gelöscht. Wenn man da falsch optimiert, läuft man Gefahr es viel schlimmer zu machen.

    Die eigentlich interessante Frage hier: wieso werden so viele leere Stringinstanzen erzeugt und verbleiben offensichtlich auch referenziert im Speicher?

  2. Fabian Lange says:

    Hallo Michael,
    je nach JVM werden intere Strings auch Garbage Collected. Dafür müssen Sie allerdings tatsächlich Garbage sein (also nicht mehr stark/strong referenziert).
    In dem Screenshot sind es deshalb so viele leere Strings, weil hier eine Kundendatenbank im Speicher ist. Diese wurde über Hibernate aus der Datenbank gelesen und sehr viele String Felder haben keinen Wert. Deshalb gibt es halt so viele Leerstrings.

    Fabian

  3. Andrej Golovnin says:

    Hallo Michael, Hallo Fabian,

    seit Java 7 (Oracle JVM) werden interne String in dem normalen Java Heap Space angelegt (s. Bug #6962931). Zusätzlich wurde eine neue XX-Option (s. Bug #6962930) eingeführt: -XX:StringTableSize=hashTableSize. Über diese Option kann man jetzt die Größe des Hashtables für interne Strings festlegen. Der Default-Wert ist auf 1009 festgelegt. Für bestimmte Anwendungen kann es durchaus sinnvoll sein, die Größe des Hashtables zu verändern. S. Bug #6988220 für mehr Details.

    Gruß,
    Andrej

  4. di says:

    nice article. may be one of these days you can write something about Memory Footprint.there is so much inaccurate information out there , it is difficult to find out what is going on really.

  5. Simon says:

    Are interned strings eligible for collection at a full GC? Would it be wise to Intern uuid strings that are pk identities of entities with a long lived process?

  6. Jamie says:

    There’s also a non-permgen intern implementation available in the Guava libraries which provides weak/strong reference implementations.

    It also supports other immutable objects as well.

    Interface:
    http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/collect/Interner.java

    and

    Implementations:
    http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/collect/Interners.java

  7. B says:

    what is the difference ?
    String s1=new String(“abcde”);
    sop(s1);

    sop(s1.intern());

    am confused of string .,.

    • Fabian Lange says:

      if you use intern, your method sop might get a reference to s1, which would be the same, or if that “abcde” was already interned before, it would get a reference to that previously interned string instead and s1 would be garbage collected.

  8. This was a really nice article!
    I’ve been used String interning one project to lead with a search hit 1000 names per second with Red-Black Tree – O(log n) strategy on server side, with multiple threads managed with ExecutorService on thread pool .

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>