JanusGraph: A small glimpse into the big world of graphs

11/14/2019
G DATA Blog

Read below to learn from Florian Hockmann of G DATA how JanusGraph compares with Neo4j and why you should keep an eye out for TinkerPop 4, and get expert tips on graph data modeling.

JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. The project was forked from Titan and brought under open governance in the Linux Foundation in 2017.

Tell us a little about yourself and what you are working on today?

Florian Hockmann

My name is Florian Hockmann, and I work as an R&D engineer at G DATA, a German antivirus vendor. The team I’m part of is responsible for analysing the hundreds of thousands of malware samples we receive each day. We use a graph database to store information about these malware samples so that we can find connections between similar ones.

How did you get involved with JanusGraph?

We originally used Titan, the predecessor of JanusGraph. Titan was a natural fit for us as we were looking for a database that could scale horizontally and that enabled us to find connections between malware samples, which is a typical use case for a graph database. After the company that developed Titan was acquired and shortly afterwards stopped all work on Titan, we were left with a database system that was no longer maintained. So, we were of course quite happy when IBM and others forked Titan to found JanusGraph, and we wanted to contribute to this new project to play our part in ensuring that JanusGraph succeeds as a scalable open source graph database.

I had already been involved in Apache TinkerPop, where I mostly develop Gremlin.Net, the .NET variant of Gremlin, so it was a natural fit for me to contribute a JanusGraph extension library for Gremlin.Net. But I have also made small contributions to other parts of the project and helped new users on the mailing list and on StackOverflow. This was a good way for me to get to know the various parts of the project and become more involved in it.

What should people know when deciding between Neo4j and JanusGraph?

I see two main differentiating factors between these two graph databases. First, Neo4j is largely self-contained: it implements its own storage engine, indices, server component, network protocol, and query language.

JanusGraph, on the other hand, relies on third-party projects for most of these aspects. The reasoning behind this is that there are already existing solutions for these problems that are good at their specific job. By using them, JanusGraph can really concentrate on the graph aspect instead of having to also solve these problems again.

JanusGraph can, for example, use Elasticsearch or Apache Solr for advanced index capabilities like full-text search and scalable databases like Apache Cassandra or HBase to store the data. Because of that, it’s probably easier to get started with Neo4j as fewer moving parts are involved, but JanusGraph offers more flexibility as users can choose, for example, between different storage and index backends based on their specific needs. Users can decide for themselves which approach they prefer.
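
To give an idea of what this looks like in practice, here is a minimal sketch in Java that opens a JanusGraph instance backed by Cassandra for storage and Elasticsearch for indexing. The hostnames are placeholders and would need to point at your own cluster.

```java
// Minimal sketch: open JanusGraph with Cassandra (via CQL) as the storage
// backend and Elasticsearch as the index backend. Hostnames are placeholders.
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenGraph {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")                 // Apache Cassandra
                .set("storage.hostname", "cassandra.example")  // placeholder host
                .set("index.search.backend", "elasticsearch")  // full-text/mixed indexes
                .set("index.search.hostname", "es.example")    // placeholder host
                .open();

        System.out.println("Graph open: " + graph.isOpen());
        graph.close();
    }
}
```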

The other key differentiating factor I see is the user-facing interface of the two databases, with the query language as its central aspect. JanusGraph implements Apache TinkerPop, which can be considered the de-facto standard for graph databases right now, as most graph databases currently implement it. This offers users largely the same experience across different graph databases, similar to the role SQL plays for relational databases.

While it’s also possible to use TinkerPop and its query language Gremlin with Neo4j, Neo4j mostly promotes its own query language, Cypher. So, most Neo4j users probably end up using that language.

Users, of course, have to decide for themselves which query language they prefer, Gremlin or Cypher, and how important it is for them to be able to easily switch to another graph database at some point in the future.
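
As a rough illustration of the difference, here is a small Gremlin traversal written in Java against a hypothetical malware-sample schema; the labels and property names are made up, and the Cypher shown in the comment is only a rough equivalent for comparison.

```java
// Hypothetical schema: (sample)-[uses]->(library). The traversal finds other
// samples that use at least one of the same libraries as a given sample
// (the original sample itself is not filtered out, to keep the sketch short).
// A roughly equivalent Cypher query might look like:
//   MATCH (s:sample {sha256: $sha})-[:uses]->()<-[:uses]-(other)
//   RETURN DISTINCT other.sha256
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class SharedLibraries {
    public static List<Object> relatedSamples(GraphTraversalSource g, String sha256) {
        return g.V().has("sample", "sha256", sha256)
                .out("uses")      // sample -> libraries it uses
                .in("uses")       // libraries -> other samples using them
                .dedup()
                .values("sha256")
                .toList();
    }
}
```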

Apart from these technical aspects, I, of course, also want to point out that JanusGraph is an open source project that is completely community-driven. Users who want to see a certain feature implemented can therefore simply implement it themselves.

What advice would you give people who want to deploy JanusGraph in production?

I already mentioned that JanusGraph combines several components, such as index and storage backends, into a graph database with rich functionality. While this approach gives users great flexibility and a rich feature set, it can also be a bit overwhelming for new users.

I would, however, like to point out that you don’t need deep knowledge of all components to get started with JanusGraph. When I started with Titan (and it’s basically still the same for JanusGraph), I didn’t really know anything about Cassandra or Elasticsearch, but I was still able to set up and deploy Titan with these backends quickly.

Over the years, we switched from Cassandra to Scylla, added Apache Spark for machine learning, and made our deployment easier to scale by moving JanusGraph into Docker containers hosted on Docker Swarm.

So, my advice is to start with a small and simple deployment and then increase the size of the deployment and its complexity as needed. JanusGraph’s docs also contain a chapter, “Deployment Scenarios,” that describes a relatively simple getting-started scenario and how it can be evolved into a more advanced scenario.

Another project that is very important for JanusGraph is TinkerPop, which I already mentioned a few times. So, I would advise new users to get familiar with TinkerPop and, most importantly, its graph query language Gremlin. There are really good resources to get started like TinkerPop’s tutorials or the free e-book Practical Gremlin.
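
For readers who want to experiment right away, the sketch below shows one common way to connect to a running Gremlin Server (for example the one shipped with JanusGraph) from Java and submit a first traversal. The host, port, and the traversal source name "g" are assumptions that have to match your server configuration.

```java
// Minimal sketch: connect to a Gremlin Server and run a first traversal.
// "localhost", port 8182, and the traversal source name "g" are assumptions.
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class FirstTraversal {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build().addContactPoint("localhost").port(8182).create();
        GraphTraversalSource g = AnonymousTraversalSource.traversal()
                .withRemote(DriverRemoteConnection.using(cluster, "g"));

        long vertexCount = g.V().count().next(); // count all vertices in the graph
        System.out.println("Vertices: " + vertexCount);

        g.close();
        cluster.close();
    }
}
```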

What are you looking forward to in JanusGraph and TinkerPop in the next few years?

For JanusGraph especially, it’s hard to predict future development, as the project is completely community-driven and many contributions come from interested users who want to improve JanusGraph based on their own experience and needs.

Apart from many small performance improvements, JanusGraph will most likely soon have an in-memory backend with significantly improved performance that is also ready for production use, as opposed to the current in-memory backend, which is only intended for testing purposes. This improved backend is a good example of a contribution made by users of JanusGraph, in this case developers at Goldman Sachs.

Backends are in general an area where I expect substantial improvements in the next few years for JanusGraph. We, of course, simply benefit from improvements in new releases of the backends themselves, but completely new backends can also provide big improvements or completely new functionality for JanusGraph.

FoundationDB, for example, looks very promising, as it concentrates entirely on providing a scalable storage engine with ACID transactions, while additional layers can add features like rich data models or advanced index capabilities. This approach seems to be a good fit for JanusGraph’s modular architecture and has the potential to solve some frequent problems with JanusGraph, like storing supernodes or performing upserts.

But, it’s good that you also asked about TinkerPop, as many improvements for JanusGraph will actually come from TinkerPop, especially when the next major version, TinkerPop 4, gets released.

The development of TinkerPop 4 is still at a very early stage, but some major improvements can already be identified. What I’m personally looking forward to most is a wider range of execution engines for Gremlin traversals. Right now, one can choose between executing a traversal with a single thread, which is a good fit for real-time use cases, or on a computing cluster with Spark (e.g., for machine learning or graph analytics).

At G DATA, we often have use cases that are in the middle of these two options, as they should be answered in a matter of a few seconds—which isn’t quite possible with Spark since it has some overhead—but they involve traversing over a significant number of edges, which also isn’t a good fit for single-threaded execution. An additional execution engine that is able to use more computing resources but that doesn’t need to load the whole graph first could be the perfect fit for those use cases.
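
As a sketch of the two options that exist today, the snippet below runs the same traversal once in the normal single-threaded (OLTP) way and once distributed over Spark via TinkerPop's SparkGraphComputer. The Spark connection itself would still need to be configured in the graph's properties, which is omitted here.

```java
// Sketch of the two execution modes mentioned above. "oltpGraph" is a normal
// JanusGraph instance; "olapGraph" stands for a graph configured for OLAP
// (for JanusGraph this is typically a HadoopGraph wired to the JanusGraph
// input format), since SparkGraphComputer reads the graph through Hadoop.
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;

public class ExecutionModes {
    public static void countEdges(Graph oltpGraph, Graph olapGraph) {
        // OLTP: executed by a single thread, good for real-time queries
        GraphTraversalSource g = oltpGraph.traversal();
        long oltpCount = g.V().outE().count().next();

        // OLAP: the same traversal distributed over a Spark cluster
        GraphTraversalSource olap = olapGraph.traversal().withComputer(SparkGraphComputer.class);
        long olapCount = olap.V().outE().count().next();

        System.out.println(oltpCount + " / " + olapCount);
    }
}
```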

A lot of effort is also currently being spent on creating a more abstract data model for TinkerPop that is not specific to graphs. This has the potential to open up TinkerPop to non-graph databases and computing engines as well, which could really grow the ecosystem of TinkerPop-enabled databases.

Do you have any tips or tricks for performant graph modeling?

This may sound obvious, but I think many users still aren’t doing it—namely evaluating a new schema or major changes to a schema before taking it into production.

This should be done with real data if possible, and the evaluation should include queries that model actual use cases. There is really no other way to ensure that your schema is actually a good fit for your use cases, and changing the schema later in production is a lot more time consuming than doing an initial evaluation.
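
One practical tool for such an evaluation is Gremlin's profile() step, which reports how long each step of a traversal takes and how many elements it touches, so you can see, for example, whether an index is actually used. The traversal below is just a placeholder standing in for one of your real use-case queries.

```java
// Run a use-case query with profile() and print the per-step metrics.
// The labels, property names, and the "abc..." value are placeholders.
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.util.TraversalMetrics;

public class EvaluateQuery {
    public static void printProfile(GraphTraversalSource g) {
        TraversalMetrics metrics = g.V().has("sample", "sha256", "abc...") // placeholder
                .out("uses")
                .profile()
                .next();
        System.out.println(metrics); // per-step timings and element counts
    }
}
```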

A topic that is very important for probably all graph databases is supernodes, as they can be really painful and lead to very high query execution times. So, it’s best to check early whether supernodes can occur in your data model and then work around them, for example, by changing the schema accordingly.
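
A simple, if expensive, way to spot candidates is to list the vertices with the highest edge counts on a test dataset, as in the sketch below.

```java
// Find the ten vertices with the most incident edges, i.e. the most likely
// supernode candidates. This scans the whole graph, so run it on test data.
import static org.apache.tinkerpop.gremlin.process.traversal.Order.desc;

import java.util.List;
import java.util.Map;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;

public class SupernodeCheck {
    public static List<Map<Object, Object>> topDegreeVertices(GraphTraversalSource g) {
        return g.V()
                .order().by(__.bothE().count(), desc) // sort by total edge count
                .limit(10)
                .valueMap(true)                       // include id and label
                .toList();
    }
}
```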

Another general thing to consider in a graph model is whether something should be a property on a vertex or a separate vertex of its own, connected to the original vertex by an edge. My usual approach is to decide whether I want to be able to search for other vertices that have the same value for that property; in that case, I model it as its own vertex with edges connecting it to all vertices that share the value. Otherwise, it can usually just be a vertex property.
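
The sketch below illustrates both options with a made-up "compiler" attribute of a malware sample: option A stores it as a plain vertex property, while option B models it as its own vertex (using a get-or-create pattern) so that all samples sharing the value can be reached by traversing edges.

```java
// Two ways to model a made-up "compiler" attribute of a malware sample.
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.addV;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.unfold;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class ModelingOptions {

    // Option A: compiler as a plain vertex property. Sufficient if we never
    // need to pivot from one sample to all samples with the same compiler.
    public static Vertex asProperty(GraphTraversalSource g, String sha256, String compiler) {
        return g.addV("sample")
                .property("sha256", sha256)
                .property("compiler", compiler)
                .next();
    }

    // Option B: compiler as its own vertex connected by an edge. All samples
    // built with the same compiler are then reachable via:
    //   g.V(sample).out("compiledWith").in("compiledWith")
    public static Vertex asVertex(GraphTraversalSource g, String sha256, String compiler) {
        Vertex sample = g.addV("sample").property("sha256", sha256).next();
        Vertex comp = g.V().has("compiler", "name", compiler)
                .fold()
                .coalesce(unfold(), addV("compiler").property("name", compiler)) // get-or-create
                .next();
        g.V(sample).addE("compiledWith").to(comp).iterate();
        // With JanusGraph, remember to commit the transaction, e.g. g.tx().commit()
        return sample;
    }
}
```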

How can someone get involved with JanusGraph?

It depends on whether you want to contribute code, improve the documentation, or help in some other way, for example by helping other users on the mailing list who run into a problem you have already encountered and know how to solve.

For code or documentation changes, you can just look through our open issues on GitHub to find one that interests you or create a new issue to describe the suggested improvement and then just submit a pull request for it.

This is no different than for other open source projects. One advantage of JanusGraph for new contributors is probably that it consists of so many different modules that there is a wide range of topics to contribute to: something specific to a certain backend like Cassandra or Elasticsearch, core areas like how a query is executed, or utility aspects around JanusGraph like schema management or client libraries for a particular programming language. So, you can choose an area where you already have some knowledge or that simply interests you.

If someone is interested in contributing to JanusGraph but needs some guidance to get started, then it’s of course always possible to ask me or any other active contributor and we are more than happy to help.

This interview is a shortened version of a conversation with Florian. You can find the whole interview here in the IBM blog.

by Stefan Karpenstein
Public Relations Manager