Off-heap Hash Map in Java, part 2

I spent some time trying to reproduce the issue we had in production that justified the use of an off-heap hash map. And it was a total failure! In theory I knew it was related to cases when an app needs huge maps consuming almost all heap memory, and those maps are not static: they are constantly changing, triggering Full GC cycles. Anyway, I got some interesting results just by comparing the time and memory usage of map population.

So I had this simple code to create a map. OHM implements the same Map interface as HashMap, so it is pretty simple to test them both.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
// OHMap comes from the Binary Off-heap Hash Map project (import omitted)

public class HashMapTest {
    public static void main(String[] args) {
        //final Map<String, String> map = new HashMap<>(15_000_000, 0.9f);
        final Map<String, String> map = new OHMap<>(15_000_000);

        // populate the same map 10 times and time each round
        for (int i = 0; i < 10; i++) {
            map.clear();

            System.out.print("Loading map...");
            long start = System.currentTimeMillis();
            populateMap(map);
            long end = System.currentTimeMillis();
            System.out.println("Done in " + (end - start) + "ms.");
        }
    }

    private static void populateMap(Map<String, String> map) {
        // 10 million entries: short numeric keys, 36-character UUID values
        for (int i = 0; i < 10_000_000; i++) {
            map.put(String.valueOf(i), UUID.randomUUID().toString());
        }
    }
}

I started with OpenJDK 8 and, as expected, OHM was slower than HashMap: 33ms vs 23ms. But memory consumption was quite the opposite! I had to pump -Xmx up to 3GB to make the HashMap test work, and the total memory used by the Java process was 3181MB. OHM worked even with -Xmx1G, though its total memory consumption was also close to 3GB.
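
Both runs were plain java invocations with the heap capped via -Xmx, roughly like this (class name as above, -Xmx3G for the HashMap variant and -Xmx1G for the OHM one; all other JVM settings were left at their defaults):

java -Xmx3G HashMapTest
java -Xmx1G HashMapTest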

Now the most interesting results (which prompted me to post this) came from OpenJDK 11! The performance difference between HashMap and OHM was shocking: 17ms vs. 34ms!!! And memory consumption for the HashMap test with -Xmx3G was lower than 3GB!

Undoubtedly the Java engineers did a good job with the core of JDK 11. With such results I may have no need for OHM in production once we switch to Java 11. But I still wasn't able to reproduce the state of continuous GC cycles with huge hash maps. My next try will be adding multi-threading to get closer to the production use case.
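
For what it's worth, here is a rough sketch of what that multi-threaded population could look like. It is only a sketch: plain HashMap is not thread-safe, so the real test will need ConcurrentHashMap, a synchronized wrapper, or a thread-safe OHM variant.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentPopulate {
    // each thread writes its own disjoint key range
    static void populate(Map<String, String> map, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int perThread = 10_000_000 / threads;
        for (int t = 0; t < threads; t++) {
            final int from = t * perThread;
            final int to = from + perThread;
            pool.submit(() -> {
                for (int i = from; i < to; i++) {
                    map.put(String.valueOf(i), UUID.randomUUID().toString());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}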

Off-heap Hash Map in Java, part 1

A month ago I had to deal with an interesting case: huge hash maps in Java! When the input started to grow we switched to a larger AWS instance. But that didn't help, because the input kept growing, and I observed huge heap consumption and very long GC pauses. Essentially the application was spending part of its time doing its job and the rest doing GC. When it started to hit the max heap size I had to do something, so I started investigating off-heap solutions.

My “googling” quickly led me to two Java solutions: Java Large Off Heap Cache (OHC) and Binary Off Heap Hash Map. Both solutions treat keys and values as blobs of bytes. I chose BinaryOffheapHashMap because it is a small codebase which I can understand. Even though that code was experimental, it solved my task: creating a hash map outside of the GC world. You can read more about that project here. OHC looks more “professional” and is something I will try next time.
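
To illustrate what "keys and values as blobs of bytes" means in practice, here is a tiny generic Java example. This is not the actual OHC or BOHM API, just the serialization round trip that any binary map forces you to do yourself:

import java.nio.charset.StandardCharsets;

public class BlobRoundTrip {
    public static void main(String[] args) {
        // the map never sees Strings, only raw bytes it can copy off-heap
        byte[] keyBlob = "42".getBytes(StandardCharsets.UTF_8);
        byte[] valueBlob = "some payload".getBytes(StandardCharsets.UTF_8);

        // reading a value back means deserializing the blob yourself
        String value = new String(valueBlob, StandardCharsets.UTF_8);
        System.out.println(value); // prints: some payload
    }
}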

So my experiments allowed me to look at Java from a different angle: “greedy” memory consumption and really nasty GC cycles. I will publish my test results in my next post.

QConSF 2017

About

QCon has been hosted by InfoQ for 11 years in a row. It attracts a lot of engineers from all over the world. All the famous brands like IBM, Oracle, Google, Netflix, LinkedIn, etc. tend to have at least a few talks there. QCon describes itself as a conference for senior engineers with an emphasis on practical approaches. As an InfoQ reader I decided to give its San Francisco edition a try.

This year QCon hosted 175 speakers across 18 tracks! I visited three days of presentations and three workshops. To be honest I wasn't very comfortable with how crowded this conference was: more than 1600 people registered! I've only been to one US conference before, AWS Community Summit 2017, so I'm not qualified to give grades, but in my opinion the organization and IT infrastructure were well above my understanding of “standard”. Even breakfasts and lunches were thought through so people didn't need to stand in lines for long.

The material quality and its diversity were pretty good and sometimes surprising. I was also impressed by the number of “big” companies participating as sponsors and speakers. I attended sessions hosted by IBM, LinkedIn, Oracle, Reddit, Docker, and AWS. There were two exhibition rooms where you could talk to engineers and managers from Microsoft, Pivotal, MySQL, Vaadin, AppDynamics, Azul, RedisLab, and MongoDB! There were pretty long breaks between sessions (25 minutes) where you could share your thoughts with either the presenters or your fellow engineers. I personally met engineers from Canada, Norway, Poland, the Netherlands, the US (Cincinnati and Texas), Russia, and Ukraine!

Hypes and Buzzwords

Some say that QCon is an indicator of the next “big things” and that QCon spread the “Microservices” hype first. Based on the 2017 track titles and their popularity (by votes), Microservices is still the #1 buzzword: I counted more than 20 sessions mentioning the words “microservice” or “service”! The next big thing this time was Chaos Engineering: the Chaos Architecture talk got the Best Attended mark! And I would give #3 to Serverless and Containers, because those themes were very connected to Microservices.

IMHO

I had never attended large conferences before, and it was essential for me to try one of the best and understand the importance and role conferences play in the day-to-day life of an engineer. And… I am confused. I am not disappointed, no. I just proved to myself that a conference brings low usefulness and minimal feedback if you treat it wrong. Let me elaborate on that statement.

In the Internet era everything can be found and learned from online resources. Literally everything! Services like Coursera or Udemy will even force you to learn stuff because you paid your own hard-earned money for it 🙂 So if you were looking for “new” material and “secret” knowledge, you'd be very disappointed. The true purpose is to share knowledge and to give and receive feedback! So the real pearls of any conference are the Open Spaces and “Ask Me Anything” sessions, where you can get “secret” or even “sacred” knowledge from authors, maintainers, or early adopters!

That doesn't mean a conference is useless if you only attend presentations and workshops. It can still be very useful for anyone who can't or doesn't want to track all the changes in the IT world via the Internet. Or if you want to hear or try something completely different, something outside of your daily duties. I think there is also another benefit: just making sure that you (and your company) are not insane and that you are doing the right things, e.g. using the right frameworks, tools, databases, etc.

CMake and MinGW

I used to build my C/C++ toy projects in Code::Blocks, but now I have moved to CLion. That IDE uses CMake under the hood. It automates the whole process (same as Code::Blocks, btw), but I was interested in how to build my project without an IDE. CMake has a nice and short tutorial, but it misses the main point: how to start the build!!! I had to surf the Internet for other tutorials. One of them gave me some clues, but if you follow it you may run into some interesting troubles. First of all, never run cmake in the source folder! Create a separate folder like “build” and run cmake there. Secondly, if you have the MS Visual C++ compiler then cmake will detect and use it, which wasn't my goal. So I had to read another tutorial which gave more insight. And then I realized I should have just read cmake --help more carefully 🙂

Anyway, here is a short note on how to run cmake with MinGW:

mkdir build
cd build
cmake -G "MinGW Makefiles" ..
mingw32-make

Explanation:

First of all, don't forget to install CMake from the official website (or use choco). Secondly, add it to the user's PATH variable. Then you can open a command line and go to your project source. CMake generates tons of files, and that's why it's better to run it in a separate folder.
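
These commands also assume there is a CMakeLists.txt in the project root. If you are starting from scratch, a minimal one for a toy C++ project looks roughly like this (project and file names are placeholders):

cmake_minimum_required(VERSION 3.0)
project(toy)
add_executable(toy main.cpp)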

Note that running cmake on Linux/macOS is similar: just use -G "Unix Makefiles" and then make!
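
In other words, the Unix recipe becomes:

mkdir build
cd build
cmake -G "Unix Makefiles" ..
make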

Jupyter Notebook on Amazon Linux

Jupyter Notebook is an app for data analysis. The idea is to combine documentation and code! My wife uses it for her data science courses on Coursera. Once she complained that some tasks took a whole night to complete on her laptop. Her Sony Vaio is pretty powerful, but definitely not a mainframe. When I noticed that Notebook is actually a web application, I immediately suggested running it in Amazon AWS! Here is a short instruction on how to set up Jupyter Notebook there.

First you have to provision an EC2 instance with Amazon Linux. I recommend the so-called “compute-optimized” instance types (cX), as they provide the most CPU power. Amazon Linux already comes with Python 2.7.12, which is enough for Jupyter. Installing Jupyter is pretty simple:

sudo pip install jupyter

Then you need to start it. Here is what I do:

ssh -i <rsa-key> ec2-user@<ec2-machine-public-dns>
screen
jupyter notebook --no-browser

First I log in to the EC2 instance. Then I start a screen session so I can easily log out or disconnect and let Jupyter run in the background. The third line launches Jupyter Notebook. Note the --no-browser flag: by default Notebook tries to launch a browser, and we don't want that. Jupyter will print out a login URL similar to http://localhost:8888/?token=a917d6207a4726774e2fd4d6053d12e24b0326628e2d7350. Copy it to your clipboard.
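
If you haven't used screen before: press Ctrl-A and then d to detach while Jupyter keeps running; on your next login you can reattach to the session with:

screen -r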

The next step is to create an SSH tunnel to access our Jupyter instance:

ssh -i <rsa-key> -fNL 8888:localhost:8888 ec2-user@<ec2-machine-public-dns>

Here -f sends ssh to the background, -N skips running a remote command, and -L forwards local port 8888 to port 8888 on the instance. Now you can open your browser and paste the saved URL.

The last thing you can do (if you want to try data science stuff) is install popular Python packages. But before that you need to install GCC and its prerequisites. On Amazon Linux (and Red Hat) it's super easy:

sudo yum groupinstall "Development Tools"

Then you can install actual packages using pip:

sudo pip install numpy
sudo pip install pandas
sudo pip install xgboost
sudo pip install sklearn

And so on…
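
A quick sanity check that the packages actually built (hypothetical output; versions will vary):

python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"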