A while ago I shared a tip to improve insert performance with Spring Data MongoDB in reactive applications.
This article explains that performance tip in detail.
Spring Boot, MongoDB and Reactor are a good match to write scalable applications, but there is a gotcha when inserting data. In a Spring Boot reactive application using MongoDB, a typical repository component is:
public interface ItemRepository extends ReactiveMongoRepository<Item, String> {
    // Other methods, as needed
}
To insert a List of business items in the database, the insert(Iterable<S> entities) repository method works as intended: a single network call is used for all inserted data. Inserting a Flux of items is a bit more complicated, since a Flux is a reactive type that represents an asynchronous sequence of elements. At first, the insert(org.reactivestreams.Publisher<S> entities) repository method seems to be a good match:
Flux<Item> itemFlux = ...
itemRepository.insert(itemFlux);
But if we monitor the database calls, we see that the items are inserted one by one. We enable MongoDB request logging by changing the right log category, for example in application.properties:
logging.level.org.springframework.data.mongodb.core.ReactiveMongoTemplate=debug
When the code is executed to insert a Flux of (for example) 6 items, the logs show:
19:22:18.890 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting Document containing fields: [date, _class] in collection: item
19:22:18.892 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting Document containing fields: [date, _class] in collection: item
19:22:18.893 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting Document containing fields: [date, _class] in collection: item
19:22:18.893 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting Document containing fields: [date, _class] in collection: item
19:22:18.893 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting Document containing fields: [date, _class] in collection: item
19:22:18.894 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting Document containing fields: [date, _class] in collection: item
This behavior can also be seen in the sources of the repository implementation class SimpleReactiveMongoRepository from Spring Data. The insert(T objectToSave, String collectionName) method of the underlying mongoOperations is executed for each item of the Flux:
@Override
public <S extends T> Flux<S> insert(Publisher<S> entities) {
    Assert.notNull(entities, "The given Publisher of entities must not be null!");
    return Flux.from(entities).flatMap(entity -> mongoOperations.insert(entity, entityInformation.getCollectionName()));
}
Note that some other methods from SimpleReactiveMongoRepository with a Publisher parameter also use one database call per persistent item. The optimization described in the next section can be applied to these methods as well.
In some cases inserting the items one by one is not an issue, for example if the items in the Flux are emitted with some delay between each element.
But sometimes we prefer grouping the database calls. To do so, we can simply combine the buffer(int) operator of Reactor with the insert(Iterable<S> entities) repository method:
Flux<Item> itemFlux = ...
itemFlux.buffer(3).flatMap(itemRepository::insert);
With a batch size of 3, the logs are now:
19:37:24.445 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting list of Documents containing 3 items
19:37:24.446 DEBUG [ctor-http-nio-3] o.s.d.m.core.ReactiveMongoTemplate : Inserting list of Documents containing 3 items
The insertions are now correctly grouped. For small volumes of data, the performance difference may not be noticeable. But for larger volumes it can make a difference.
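To make the effect of the batch size concrete, here is a minimal plain-Java sketch that involves neither Reactor nor MongoDB. It mimics what buffer(int) does to a finite sequence: elements are grouped into lists of at most batchSize items, and each list stands for one database call. The BatchingSketch class and its partition method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingSketch {

    // Groups the items into consecutive lists of at most batchSize elements,
    // similar to what Reactor's buffer(int) emits for a finite Flux.
    // Hypothetical helper, not part of Reactor or Spring Data.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5, 6);
        // Each batch stands for one simulated database call.
        System.out.println(partition(items, 1).size()); // 6 calls, like the one-by-one insertion
        System.out.println(partition(items, 3).size()); // 2 calls, like buffer(3)
        // When the item count is not a multiple of the batch size, the last batch is smaller.
        System.out.println(partition(List.of(1, 2, 3, 4, 5, 6, 7), 3));
    }
}
```

With 6 items and a batch size of 3, the sketch produces 2 batches, which matches the 2 "Inserting list of Documents containing 3 items" log lines above.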
Let's measure the performance of these strategies. I used a remote MongoDB (thanks to the free plan of Clever Cloud MongoDB hosting) to emphasize the performance difference due to network latency, although in my case the ping between the application and the database was only 16 ms. The database used a replica set of 2 instances. The minimum and maximum sizes of the MongoDB client connection pool were both set to 100. The ReactiveMongoTemplate logs were temporarily disabled.
Strategy | 100 items | 1000 items | 10000 items |
---|---|---|---|
Default | 30 ms | 200 ms | 1900 ms |
Optimized with batch size of 1 | 30 ms | 200 ms | 1900 ms |
Optimized with batch size of 10 | 24 ms | 34 ms | 210 ms |
Optimized with batch size of 100 | 24 ms | 30 ms | 80 ms |
As expected, the optimized insertion with a batch size of 1 has the same duration as the default insertion, since they execute similar database calls.
With larger batch sizes we see performance benefits, and the benefits grow with the number of items. We also see diminishing returns: for 10000 items, increasing the batch size from 1 to 10 gives a x9 speed increase, but increasing it from 10 to 100 gives only a x2.6 speed increase.
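The speed-up factors quoted above follow directly from the 10000 items column of the table. The short sketch below only redoes that arithmetic; SpeedupSketch and speedup are hypothetical names chosen for the illustration.

```java
import java.util.Locale;

class SpeedupSketch {

    // Ratio between the slower and the faster duration, both in milliseconds.
    static double speedup(double slowerMs, double fasterMs) {
        return slowerMs / fasterMs;
    }

    public static void main(String[] args) {
        // Durations for 10000 items, taken from the table above.
        double batch1 = 1900;
        double batch10 = 210;
        double batch100 = 80;
        System.out.println(String.format(Locale.ROOT, "%.1f", speedup(batch1, batch10)));   // 9.0
        System.out.println(String.format(Locale.ROOT, "%.1f", speedup(batch10, batch100))); // 2.6
    }
}
```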
If you are not familiar with reactive programming, you may be surprised by how fast the default insertion is. It takes only 200 ms for 1000 individual insertions, so one insertion takes 0.2 ms on average, well below the ping (16 ms). How? Simply because the insertions are executed asynchronously: there is no need to wait for one insertion to complete before starting the next one. This maximizes the connection pool usage and reduces the total duration.
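The effect of asynchronous execution can be reproduced without any database. The following sketch is a simplified model, not the actual driver behavior: each simulated insertion completes after a fixed 16 ms delay, all insertions are started without waiting for the previous one, and the total elapsed time is measured. It uses only the JDK's CompletableFuture (Java 9 or later). The AsyncLatencySketch class and its methods are hypothetical, and the model ignores the connection pool limit and the server-side processing time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

class AsyncLatencySketch {

    // Assumed round-trip latency of one simulated insertion, in milliseconds.
    static final long LATENCY_MS = 16;

    // Simulates one insertion: the future completes after LATENCY_MS,
    // without blocking the calling thread in the meantime.
    static CompletableFuture<Integer> simulatedInsert(int item) {
        Executor delayed = CompletableFuture.delayedExecutor(LATENCY_MS, TimeUnit.MILLISECONDS);
        return CompletableFuture.supplyAsync(() -> item, delayed);
    }

    // Starts all insertions back to back, then waits for all of them.
    // Returns the total elapsed time in milliseconds.
    static long runConcurrently(int count) {
        long start = System.nanoTime();
        List<CompletableFuture<Integer>> pending = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            pending.add(simulatedInsert(i));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        int count = 100;
        long elapsed = runConcurrently(count);
        System.out.println("Concurrent insertions: " + elapsed + " ms");
        System.out.println("One-at-a-time estimate: " + (count * LATENCY_MS) + " ms");
    }
}
```

Because the delays overlap, the total duration stays close to a single latency instead of growing with the number of insertions. In the real application the overlap is bounded by the connection pool size and the database throughput, which is why 1000 insertions take 200 ms rather than 16 ms.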
We saw that some persistence methods of Spring Data MongoDB reactive repositories may not deliver ideal performance. We described a solution to improve this, and measured the performance difference.
A sample project is available on GitHub, see spring-data-mongodb-reactive-insertion. The project README describes how to configure and run the application.
© 2007-2024 Florian Beaufumé