OpenSearch is tightly coupled to the Lucene core APIs, which provide the following functionality:
- Encoding
- Transactions
- Merges
- Search
- And more…
In this presentation I will discuss how the OpenSearch storage encoding can be extended to popular formats (e.g. Parquet, Avro) that are readable by public big data systems such as Apache Spark. This provides a long-term strategic benefit for the project, as it allows OpenSearch to integrate with big data systems without reindexing or transforming the data. Beyond integrations, it also lets OpenSearch readily adopt encoding developments happening outside of Lucene, such as new compression algorithms.

I will then discuss the various ways of solving this problem and how, in my case, I chose to extend the encoding via a new extension mechanism that involves an external writer. The approach is quite generic and allows many other aspects of the Lucene codec to be extended with native implementations in languages such as Rust or Python.