Reading and indexing data in Elasticsearch using ASP.NET Core and NEST 5.x (Part 2/4)

In part 1 we gave our .NET application a way to communicate with our Elastic cluster. In this part we’ll index the actual data: more than 170,000 recipes from the OpenRecipes dump!

Implementation

For the sake of simplicity, we aren’t going to use an actual database, but just a simple JSON file.

You can find out how to keep Elastic in sync with your current database here.

These are the steps we’re going to follow:

  1. Create the Recipe class, which will be the model for the recipes in our index.
  2. Implement a DataIndexer class that is going to do the actual reading from file and indexing.
  3. Register the DataIndexer class so it can be injected wherever needed.
  4. Expose an API endpoint responsible for processing index requests and passing them into the DataIndexer service.
  5. Download our JSON data file.
  6. Index the actual data by sending a request to our API.

Step 1. Create the model for our Recipes

Create a folder called Models in your project’s root path. Add a class to it called Recipe with the following implementation:

using Nest;
using System;
// Specify the type name
[ElasticsearchType(Name = "recipe")]
public class Recipe
{
    public string Id { get; set; }
    // Mark the Name property as a Completion field
    // In part 4, when we implement the Autocomplete method, you'll find out why this is needed
    [Completion]
    public string Name { get; set; }
    // Specify the Ingredients as a text field to enable full-text search
    [Text]
    public string Ingredients { get; set; }

    public string Url { get; set; }

    public string Image { get; set; }

    public string CookTime { get; set; }

    public string RecipeYield { get; set; }

    public DateTime? DatePublished { get; set; }

    public string PrepTime { get; set; }
    // Specify the Description as a text field to enable full-text search
    [Text]
    public string Description { get; set; }
}

Step 2. Implement the DataIndexer service

In your Elastic folder, create a new class called DataIndexer with the following implementation.

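// Usings for the complete class, including the method we add later in this step
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Options;
using Nest;
using Newtonsoft.Json;
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public class DataIndexer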
{
   public DataIndexer(ElasticClientProvider clientProvider, IHostingEnvironment env, IOptions<ElasticConnectionSettings> settings)
   {
       this.client = clientProvider.Client; // Get the ElasticClient
       this.contentRootPath = Path.Combine(env.ContentRootPath, "Data"); // Where we'll be looking for the file to read and index
       this.defaultIndex = settings.Value.DefaultIndex; // The default index
   }

   private readonly ElasticClient client;
   private readonly string contentRootPath;
   private readonly string defaultIndex;
}

The contentRootPath field has the Data directory hardcoded onto the end of the content root path. This is the directory where we’ll store our JSON data file.

Now we’ll have to implement the actual method that’s going to index our data. Add a method called IndexRecipesFromFile to your DataIndexer class. For simplicity, it’s just going to return a boolean value indicating whether the indexing was successful.

public async Task<bool> IndexRecipesFromFile(string fileName, bool deleteIndexIfExists, string index = null){}

Again, for the sake of simplicity, we’ll assume that the file named by fileName is present in the project’s Data directory (the one we hardcoded earlier).

Before showing you the C# code, I’ll write down the method in pseudocode, so you can better understand what we’re trying to do. The IndexRecipesFromFile method will work in the following way:

// Read the json file
// Convert the file content to an array of C# objects
// Pass the converted array of objects into ElasticClient's IndexMany method

Okay, let’s implement that!

Handling the case where no index was passed in:

if (index == null)
{
    index = this.defaultIndex;
}
else
{
    index = index.ToLower();
}

If the user passed an index name, it is converted to lowercase, since Elasticsearch index names must be lowercase.
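If you’d rather keep this logic out of the indexing method, it can be folded into a small helper. This is only a sketch: the IndexNames class and its Resolve method are hypothetical names that aren’t part of the tutorial’s code. Unlike the snippet above, it also treats an empty or whitespace-only name as "not passed", and it uses ToLowerInvariant so the result doesn’t depend on the machine’s culture settings.

```csharp
using System;

public static class IndexNames
{
    // Hypothetical helper: falls back to the default index when no name is passed,
    // otherwise lowercases the name because Elasticsearch rejects uppercase index names.
    public static string Resolve(string index, string defaultIndex)
    {
        if (string.IsNullOrWhiteSpace(index))
        {
            return defaultIndex;
        }

        // ToLowerInvariant avoids culture-specific casing surprises (e.g. the Turkish "I")
        return index.Trim().ToLowerInvariant();
    }
}
```

Inside IndexRecipesFromFile this would be used as index = IndexNames.Resolve(index, this.defaultIndex);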

Reading the file:

// Declared outside the using blocks so it is still in scope when we parse it
string rawJsonCollection;

using (FileStream fs = new FileStream(Path.Combine(contentRootPath, fileName), FileMode.Open, FileAccess.Read))
{
   using (StreamReader reader = new StreamReader(fs))
   {
       // Won't be efficient with large files, but better for brevity
       rawJsonCollection = await reader.ReadToEndAsync();
   }
}

As you can see, we’re simply opening a StreamReader to read our file. The using blocks guarantee that the streams will be closed after use.

Now let’s parse the contents:

Recipe[] mappedCollection = JsonConvert.DeserializeObject<Recipe[]>(rawJsonCollection, new JsonSerializerSettings
{
    Error = HandleDeserializationError
});

// A private method on the DataIndexer class
// https://stackoverflow.com/questions/26107656/ignore-parsing-errors-during-json-net-data-parsing
private void HandleDeserializationError(object sender, Newtonsoft.Json.Serialization.ErrorEventArgs errorArgs)
{
    // Kept in a local so it can be inspected or logged while debugging
    var currentError = errorArgs.ErrorContext.Error.Message;
    // Mark the error as handled so Json.NET carries on instead of throwing
    errorArgs.ErrorContext.Handled = true;
}

We’re making use of the JsonConvert class, which is present in the Newtonsoft.Json namespace.

The HandleDeserializationError method is needed because the data set is huge, so some malformed entries are very likely. We don’t want the whole import to fail with an exception just because a single recipe’s name is broken.
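As the comment in the reading code admits, loading the whole file into one string isn’t efficient. If that becomes a problem, Json.NET can deserialize directly from a stream instead. The following is a minimal sketch of that alternative, not what the tutorial’s code does: JsonFileReader and ReadArray are hypothetical names. Note that this call is synchronous.

```csharp
using System.IO;
using Newtonsoft.Json;

public static class JsonFileReader
{
    // Hypothetical helper: deserializes a JSON array straight from a stream,
    // so the raw file content never has to sit in memory as one big string.
    public static T[] ReadArray<T>(Stream stream, JsonSerializerSettings settings = null)
    {
        JsonSerializer serializer = JsonSerializer.Create(settings ?? new JsonSerializerSettings());

        using (var streamReader = new StreamReader(stream))
        using (var jsonReader = new JsonTextReader(streamReader))
        {
            // The serializer pulls tokens from the reader as it goes
            return serializer.Deserialize<T[]>(jsonReader);
        }
    }
}
```

With this in place, the read and parse steps would collapse into a single call such as JsonFileReader.ReadArray<Recipe>(fs, new JsonSerializerSettings { Error = HandleDeserializationError }), where fs is the FileStream opened earlier.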

I have added a boolean parameter called deleteIndexIfExists which, if true, drops the index and creates a new one before indexing the actual documents. Let’s handle that.

// If the user asked to drop the index prior to indexing the documents. Useful when you want to "hard reset" things
// The flag is checked first so we only ask Elastic about the index when we have to
if (deleteIndexIfExists && (await this.client.IndexExistsAsync(index)).Exists)
{
    await this.client.DeleteIndexAsync(index);
}

If the index we specified doesn’t exist in Elastic, we need to create it. The CreateIndexDescriptor you see below creates the mapping for our Recipe type. If we don’t create the mapping ourselves, Elastic will infer one based on its own assumptions, but we want to call AutoMap explicitly so that the attributes we assigned to the Recipe class earlier are applied. This is called attribute mapping.

if (!this.client.IndexExists(index).Exists)
{
    // AutoMap creates the mapping according to the model's properties and attributes
    var indexDescriptor = new CreateIndexDescriptor(index)
                    .Mappings(mappings => mappings
                        .Map<Recipe>(m => m.AutoMap()));

    await this.client.CreateIndexAsync(index, i => indexDescriptor);
}

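Each Elastic index has a setting called max_result_window, which is the maximum value of from + size for searches against that index. It defaults to 10000, which means that with a page size of 100, for example, we can’t paginate past page 100. Since we have far more recipes than that, we’ll set it to int.MaxValue. Bear in mind that deep pages get more expensive for Elastic to serve the further you go, so treat this as a shortcut for the tutorial.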

// Max out the result window so you can have pagination for >100 pages
this.client.UpdateIndexSettings(index, ixs => ixs
     .IndexSettings(s => s
         .Setting("max_result_window", int.MaxValue)));
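To make the from + size arithmetic concrete, here is a small sketch of how a page number maps onto the window. The Paging class is a hypothetical helper that isn’t part of the tutorial’s code, and the page size of 100 used in the note below is just an assumed example.

```csharp
using System;

public static class Paging
{
    // Elasticsearch's default value for max_result_window
    public const int DefaultMaxResultWindow = 10000;

    // Converts a 1-based page number into the "from" offset of a search request
    public static int From(int page, int pageSize)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return (page - 1) * pageSize;
    }

    // A search is only allowed when from + size does not exceed the window
    public static bool FitsInWindow(int page, int pageSize, int maxResultWindow = DefaultMaxResultWindow)
    {
        // long arithmetic so a very large page number can't overflow int
        long from = (long)(page - 1) * pageSize;
        return from + pageSize <= maxResultWindow;
    }
}
```

With 100 results per page, page 100 starts at offset 9900 and still fits (9900 + 100 = 10000), while page 101 starts at offset 10000 and would be rejected under the default setting.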

Now let’s do the actual indexing:

// Then index the documents
int batchSize = 10000; // magic :O
int totalBatches = (int)Math.Ceiling((double)mappedCollection.Length / batchSize);

for (int i = 0; i < totalBatches; i++)
{
    // Pass the index explicitly, otherwise the documents end up in the client's default index
    var response = await this.client.IndexManyAsync(mappedCollection.Skip(i * batchSize).Take(batchSize), index);
    if (!response.IsValid)
    {
        return false;
    }
}

return true;

Simply returning a bare boolean is clumsy, since the caller can’t tell what went wrong, but it’s fine for the scope of this tutorial.
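If the Skip/Take arithmetic in the loop above feels a bit opaque, it can be pulled out into a reusable extension method. This is a sketch only: BatchingExtensions and InBatches are hypothetical names that aren’t part of the tutorial’s code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchingExtensions
{
    // Hypothetical helper: splits a list into consecutive chunks of at most batchSize items
    public static IEnumerable<IReadOnlyList<T>> InBatches<T>(this IReadOnlyList<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (batchSize < 1) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return Iterate(source, batchSize);
    }

    private static IEnumerable<IReadOnlyList<T>> Iterate<T>(IReadOnlyList<T> source, int batchSize)
    {
        // Integer version of Math.Ceiling(count / batchSize)
        int totalBatches = (source.Count + batchSize - 1) / batchSize;

        for (int i = 0; i < totalBatches; i++)
        {
            // The last batch simply holds whatever is left over
            yield return source.Skip(i * batchSize).Take(batchSize).ToList();
        }
    }
}
```

For the 173278 recipes in our data set and a batch size of 10000, this yields 18 batches, the last one holding 3278 recipes. The loop would then become a foreach over mappedCollection.InBatches(batchSize).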

Step 3. Register the DataIndexer service

We did this quite a few times in part 1.

Simply open your Startup.cs file and in the ConfigureServices method, add the following line.

services.AddTransient<DataIndexer>();

This will make our DataIndexer class available for injection.
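If you’re wondering what the transient lifetime means in practice, the sketch below shows it in isolation. It assumes the Microsoft.Extensions.DependencyInjection package (which ASP.NET Core already uses), and SampleService is a hypothetical stand-in for DataIndexer so the example doesn’t need an Elastic cluster.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical stand-in for DataIndexer
public class SampleService { }

public static class TransientDemo
{
    // Returns true when the container hands out a fresh instance on every resolve
    public static bool ResolvesNewInstanceEachTime()
    {
        var services = new ServiceCollection();
        services.AddTransient<SampleService>();

        IServiceProvider provider = services.BuildServiceProvider();

        var first = provider.GetRequiredService<SampleService>();
        var second = provider.GetRequiredService<SampleService>();

        // Transient services are created anew each time they are requested
        return !ReferenceEquals(first, second);
    }
}
```

In other words, every controller instance that asks for a DataIndexer gets its own copy.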

Step 4. Expose the API endpoint

In your Controllers folder, create a class called IndexController with the following implementation.

using Elastic;
using Microsoft.AspNetCore.Mvc;
using System.Threading.Tasks;

[Route("/api/[controller]")]
public class IndexController : Controller
{
    public IndexController(DataIndexer indexer)
    {
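        // "this." is required here: without it the parameter is assigned to itself and the field stays null
        this.indexer = indexer;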
    }

    private readonly DataIndexer indexer;

    [HttpGet("file")]
    public async Task<IActionResult> IndexDataFromFile([FromQuery]string fileName, string index, bool deleteIndexIfExists)
    {
        var response = await indexer.IndexRecipesFromFile(fileName, deleteIndexIfExists, index);
        return Ok(response);
    }
}

As you can see, we’re simply injecting the DataIndexer service we just implemented and calling its IndexRecipesFromFile method. The fileName, index and deleteIndexIfExists parameters are read from the query string. This action is accessible by making a GET request to http://yourappurl/api/index/file.
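If you want to call the endpoint from code rather than from a GUI tool, something along these lines should do. It’s a rough sketch: IndexApiClient and its methods are hypothetical names, and the base address you pass in is an assumption about where your application is running.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class IndexApiClient
{
    // Builds the request URL for the IndexDataFromFile action, escaping the query values
    public static Uri BuildIndexRequestUri(Uri appBaseUri, string fileName, string index = null, bool deleteIndexIfExists = false)
    {
        string query = "fileName=" + Uri.EscapeDataString(fileName)
            + "&deleteIndexIfExists=" + (deleteIndexIfExists ? "true" : "false");

        // The index parameter is optional; the service falls back to the default index
        if (!string.IsNullOrEmpty(index))
        {
            query += "&index=" + Uri.EscapeDataString(index);
        }

        return new Uri(appBaseUri, "/api/index/file?" + query);
    }

    // Sends the GET request and returns the raw response body
    public static async Task<string> TriggerIndexingAsync(HttpClient http, Uri requestUri)
    {
        HttpResponseMessage response = await http.GetAsync(requestUri);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

Since indexing takes a few minutes, you may need to raise HttpClient’s Timeout above its default before calling TriggerIndexingAsync.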

Step 5. Download our JSON file

For our recipes search engine, we’ll be using a slightly edited version of the OpenRecipes dump that you can download here.

Now, after you’ve downloaded the archive, in your project’s root path, create a folder called Data and extract openrecipes-big.json into it.

[Image: Extracted data]

Now we’re ready to do some indexing!

Step 6. Send a request to index the data

Download Postman (or use your favourite HTTP client) and start it up. It’s a tool that lets us make requests to our API in a friendly way. It’s extremely powerful, and you can read about all of its capabilities in the docs. It should look something like this:

[Image: Postman]

Now, fire up your application and send a GET request to:

http://yourappurl/api/index/file?fileName=openrecipes-big.json

This is done by simply typing in the url in Postman and hitting Send.

Be aware that this is going to take some time (3-5 minutes on my machine). After a while you should see a response similar to this one:

[Image: Postman Indexing Response]

We can now verify that the recipes are indexed correctly by sending a GET request to

http://yourelasticurl/indexName/typeName/_count

The indexName variable is the one you specified in your appsettings.json and the typeName is the one that you’ve specified in the ElasticsearchType attribute (on the Recipe class).

You should see a response like this one:

[Image: Postman Count Response]

We can see that 173278 recipes are present in our Elastic index.
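If you’d like to do the same check from code, the body returned by the _count endpoint is a small JSON object with a count field, which can be read with Json.NET. A minimal sketch, where CountResponse and ParseCount are hypothetical names and fetching the body (for example with HttpClient) is left out:

```csharp
using System;
using Newtonsoft.Json.Linq;

public static class CountResponse
{
    // Extracts the "count" field from the body returned by Elasticsearch's _count API
    public static long ParseCount(string responseBody)
    {
        JObject json = JObject.Parse(responseBody);
        JToken count = json["count"];

        if (count == null)
        {
            throw new FormatException("The response does not contain a \"count\" field.");
        }

        return (long)count;
    }
}
```

For a body like {"count":173278,"_shards":{...}} this returns 173278.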

Congratulations!

We’ve completed the second part of the tutorial!

Now that we have the recipes indexed, we need to be able to search them.

Don’t miss part 3, where we’ll be implementing the Search functionality!

And of course, if you have any questions, don’t hesitate to contact me!

5 Comments

  1. Mohammed

    Hi,
    I am getting the below error, once I executing the
    var response = await this.client.IndexManyAsync(mappedCollection.Skip(i * batchSize).Take(batchSize), index, type); line

    “response = {Invalid NEST response built from a successful low level call on POST: /%22recipes%22/recipe/_bulk?pretty=true&error_trace=true}”

    That line is in the below mentioned function

    “private async Task IndexDocuments(Recipe[] mappedCollection, string index)”

  2. Dipak Patel

    Hi,

    I just finished this second part, response is success but there is no index. it shows 0 results when i looked for count.

    Do you know where can i check further?

  3. Mahidul Islam

    I am getting error NullReferenceException: Object reference not set to an instance of an object. pleaseeee help me out

    • dnikolovv

      Where exactly do you get NullReference? Can you provide the stack trace?
