Recently I ran into a problem where hashing some data had a big impact on the application's performance. A few factors influenced this, but the end result was that the application needed 2.5x more instances: imagine having 15 instances of your application in production and suddenly needing 38 to serve the same traffic.
In order to understand the problem and to find out what to fix, I had to dig deeper. In this post I will talk about what I found and how I found it.
The setup was:
- hashing algorithm: SHA256
- the data that should be hashed lives in a Map<String, List<String>>
- data should be hashed when it matches some regex patterns
- when a pattern matches, the content is replaced with the hashed one
- the Guava Hashing class was used
Because there were some iterations over the data and some string operations, the initial thought was that the hashing algorithm was slow and should be replaced. To prove the theory and compare their performance, JMH (the Java Microbenchmark Harness) was used.
Set up the JMH project:
```
mvn archetype:generate \
    -DinteractiveMode=false \
    -DarchetypeGroupId=org.openjdk.jmh \
    -DarchetypeArtifactId=jmh-java-benchmark-archetype \
    -DgroupId=com.iftodi \
    -DartifactId=jmh.hashing \
    -Dversion=1.0
```
Add the Guava and Apache Commons Lang dependencies for the Hashing and RandomStringUtils classes:
```xml
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0-jre</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.9</version>
</dependency>
```
Benchmark:
```java
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import org.apache.commons.lang3.RandomStringUtils;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.charset.Charset;
import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
public class HashingBenchmark {

    private static final int STRING_LENGTH = 26;

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testMethod(Blackhole blackhole) {
        // Hash a random 26-character string and sink the result into the Blackhole
        // so the JIT cannot eliminate the work.
        HashCode hashedString = Hashing.sha256()
                .hashString(RandomStringUtils.random(STRING_LENGTH), Charset.defaultCharset());
        blackhole.consume(hashedString);
    }
}
```
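The same benchmark was reused for SHA-1 by swapping the hash function; below is a minimal sketch of that variant (the class name is mine, only the Hashing.sha1() call is the point, and I am assuming the average-time numbers came from a separate run with Mode.AverageTime). Imports are the same as above.

```java
@State(Scope.Thread)
public class Sha1HashingBenchmark {

    private static final int STRING_LENGTH = 26;

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testMethod(Blackhole blackhole) {
        // Identical workload, only Hashing.sha1() instead of Hashing.sha256()
        HashCode hashedString = Hashing.sha1()
                .hashString(RandomStringUtils.random(STRING_LENGTH), Charset.defaultCharset());
        blackhole.consume(hashedString);
    }
}
```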
SHA256:
Throughput:
Result "com.iftodi.HashingBenchmark.testMethod": 232.410 ±(99.9%) 5.036 ops/ms [Average] (min, avg, max) = (214.934, 232.410, 238.504), stdev = 6.723 CI (99.9%): [227.374, 237.446] (assumes normal distribution) Benchmark Mode Cnt Score Error Units HashingBenchmark.testMethod thrpt 25 232.410 ± 5.036 ops/ms
Average time:
Result "com.iftodi.HashingBenchmark.testMethod": 0.004 ±(99.9%) 0.001 ms/op [Average] (min, avg, max) = (0.004, 0.004, 0.004), stdev = 0.001 CI (99.9%): [0.004, 0.004] (assumes normal distribution) Benchmark Mode Cnt Score Error Units HashingBenchmark.testMethod avgt 25 0.004 ± 0.001 ms/op
SHA1:
Throughput:
Result "com.iftodi.HashingBenchmark.testMethod": 245.732 ±(99.9%) 2.870 ops/ms [Average] (min, avg, max) = (233.948, 245.732, 251.042), stdev = 3.831 CI (99.9%): [242.862, 248.602] (assumes normal distribution) Benchmark Mode Cnt Score Error Units HashingBenchmark.testMethod thrpt 25 245.732 ± 2.870 ops/ms
Average time:
Result "com.iftodi.HashingBenchmark.testMethod": 0.004 ±(99.9%) 0.001 ms/op [Average] (min, avg, max) = (0.004, 0.004, 0.004), stdev = 0.001 CI (99.9%): [0.004, 0.004] (assumes normal distribution) Benchmark Mode Cnt Score Error Units HashingBenchmark.testMethod avgt 25 0.004 ± 0.001 ms/op
The results show no big difference between the two algorithms, certainly nothing that would explain such a big impact on performance.
It was necessary to dig deeper, and a few interesting things were found:
- looping over a map using Map.entrySet().forEach(entry -> {}) is faster than looping over the keys and getting the value inside the loop; even just looping over entrySet() is faster than looping over keySet() (see the first sketch after this list)
- regex patterns use backtracking, and when they are not optimized you can run into a StackOverflowError, or at the very least pay a performance penalty (this bug exists since Java 1.4 and will not be fixed: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=5050507). It is important to write regexes that do not do a lot of extra work to match what is required (see the second sketch after this list). More about optimizing regular expressions: https://www.javaworld.com/article/2077757/optimizing-regular-expressions-in-java.html
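To illustrate the first point, here is a small sketch (not the application code) of the two iteration styles over the same Map<String, List<String>>; the keySet() loop pays for an extra get(key) lookup on every iteration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapIterationExample {

    public static void main(String[] args) {
        Map<String, List<String>> data = new HashMap<>();
        data.put("some-key", Arrays.asList("value1", "value2"));

        // Slower: iterate over the keys and look every value up again
        for (String key : data.keySet()) {
            List<String> values = data.get(key); // extra lookup per key
            process(key, values);
        }

        // Faster: iterate over the entries, key and value come already paired
        data.entrySet().forEach(entry -> process(entry.getKey(), entry.getValue()));
    }

    private static void process(String key, List<String> values) {
        // placeholder for the real work (regex matching, hashing, ...)
    }
}
```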
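For the second point, a minimal sketch of catastrophic backtracking; the patterns are illustrative, not the ones from the application. The nested quantifier in (a+)+b gives the engine exponentially many ways to split the run of 'a' characters before the match finally fails, while the equivalent a+b fails almost instantly:

```java
import java.util.regex.Pattern;

public class BacktrackingExample {

    public static void main(String[] args) {
        // Nested quantifier: exponential backtracking on a near-miss input
        Pattern slow = Pattern.compile("(a+)+b");
        // Equivalent pattern without the nested quantifier: fails in linear time
        Pattern fast = Pattern.compile("a+b");

        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa"; // only 'a' characters, no trailing 'b'

        long start = System.nanoTime();
        System.out.println("slow: " + slow.matcher(input).matches()
                + " in " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        System.out.println("fast: " + fast.matcher(input).matches()
                + " in " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```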
I think the following may also affect performance, but I have not benchmarked it, so it is only a theory:
- when using streams, use them correctly: while debugging I found code that, instead of calling collect(Collectors.toList()) at the end of the stream, used forEach and added the elements to an external list.
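As a sketch of what that looks like (simplified, not the original code): the forEach version fills an external list as a side effect, while collect(Collectors.toList()) lets the stream build the result itself and keeps the pipeline side-effect free:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamCollectExample {

    public static void main(String[] args) {
        List<String> values = Arrays.asList("one", "two", "three");

        // The pattern found while debugging: mutate an external list from forEach
        List<String> viaForEach = new ArrayList<>();
        values.stream()
              .map(StreamCollectExample::hash)
              .forEach(viaForEach::add);

        // The idiomatic way: let the terminal operation build the list
        List<String> viaCollect = values.stream()
              .map(StreamCollectExample::hash)
              .collect(Collectors.toList());

        System.out.println(viaForEach.equals(viaCollect)); // true
    }

    // Placeholder standing in for the real hashing logic
    private static String hash(String value) {
        return Integer.toHexString(value.hashCode());
    }
}
```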