NuGet.Jobs.Db2AzureSearch

Overview

Subsystem: Search 🔎

This tool creates the resources needed to run the NuGet search service. These resources can be updated using the Catalog2AzureSearch and Auxiliary2AzureSearch jobs.

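Specifically, this tool creates:

  • The Azure Search indexes (the search index and the hijack index).
  • The Azure Blob Storage container for the search auxiliary files.
  • The search auxiliary files and the catalog cursor, both written to search storage.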

Running the job

You can run this job using:

NuGet.Jobs.Db2AzureSearch.exe -Configuration path\to\your\settings.json

Using DEV resources

The easiest way to run the tool if you are on the nuget.org team is to use the DEV environment resources:

  1. Install the certificate used to authenticate as our client AAD app registration into your CurrentUser certificate store.
  2. Clone our internal NuGetDeployment repository.
  3. Update your cloned copy of the DEV Db2AzureSearch appsettings.json file to authenticate using the certificate you installed:
{
    ...
    "KeyVault_VaultName": "PLACEHOLDER",
    "KeyVault_ClientId": "PLACEHOLDER",
    "KeyVault_CertificateThumbprint": "PLACEHOLDER",
    "KeyVault_ValidateCertificate": true,
    "KeyVault_StoreName": "My",
    "KeyVault_StoreLocation": "CurrentUser"
    ...
}
  4. Update the -Configuration CLI option to point to the DEV Azure Search settings: NuGetDeployment/src/Jobs/NuGet.Jobs.Cloud/Jobs/Db2AzureSearch/DEV/northcentralus/appsettings.json

Using personal Azure resources

As an alternative to using nuget.org's DEV resources, you can also run this tool using your personal Azure resources.

Prerequisites

  • Gallery DB. This can be initialized locally using the NuGetGallery.
  • Azure Search. You can create your own Azure Search resource using the Azure Portal.
  • Azure Blob Storage. You can create your own Azure Blob Storage using the Azure Portal.

In your Azure Blob Storage account, you will need to create a container named ng-search-data and upload the following files:

  1. downloads.v1.json with content []
  2. ExcludedPackages.v1.json with content []

You will also need to create a second container (if it does not already exist) named content and upload the following file:

  1. flags.json with content {}

If you are on the nuget.org team, you can copy these files from the PROD auxiliary files container.
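If you are not copying the files from PROD, the empty seed files can be generated locally before uploading them. The following is a minimal sketch that only writes the files to a local folder laid out by container name; the function name write_seed_files and the folder layout are illustrative assumptions, and the upload itself is left to whichever tool you prefer (for example, the Azure Portal).

```python
import json
from pathlib import Path

# Container name -> {blob name: JSON content}, as listed in the prerequisites.
SEED_FILES = {
    "ng-search-data": {
        "downloads.v1.json": [],
        "ExcludedPackages.v1.json": [],
    },
    "content": {
        "flags.json": {},
    },
}


def write_seed_files(root):
    """Write each seed file to <root>/<container>/<blob> and return the paths."""
    written = []
    for container, blobs in SEED_FILES.items():
        container_dir = Path(root) / container
        container_dir.mkdir(parents=True, exist_ok=True)
        for blob_name, content in blobs.items():
            path = container_dir / blob_name
            path.write_text(json.dumps(content), encoding="utf-8")
            written.append(path)
    return written
```

Calling write_seed_files("seed") produces seed/ng-search-data/downloads.v1.json, seed/ng-search-data/ExcludedPackages.v1.json and seed/content/flags.json, which can then be uploaded to the matching containers.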

Settings

Once you've created your Azure resources, you can create your settings.json file. There are a few PLACEHOLDER values you will need to fill in yourself:

  • The GalleryDb:ConnectionString setting is the connection string to your Gallery DB.
  • The SearchServiceName setting is the name of your Azure Search resource. For example, use the name foo-bar for the Azure Search service with URL https://foo-bar.search.windows.net.
  • The SearchServiceApiKey setting is an admin key that has write permissions to the Azure Search resource. Make sure the Azure Search resource you're connecting to has API keys enabled, either alongside managed identity (RBAC) access or with managed identity authentication disabled.
  • The StorageConnectionString and AuxiliaryDataStorageConnectionString settings are both the connection string to your Azure Blob Storage account.
  • The DownloadsV1JsonUrl setting is the URL to the downloads.v1.json file above. Make sure it works without authentication.
  • The FeatureFlags:ConnectionString setting is the connection string to your Azure Blob storage account.
{
  "GalleryDb": {
    "ConnectionString": "PLACEHOLDER"
  },

  "Db2AzureSearch": {
    "AzureSearchBatchSize": 1000,
    "MaxConcurrentBatches": 4,
    "MaxConcurrentVersionListWriters": 8,
    "SearchServiceName": "PLACEHOLDER",
    "SearchServiceApiKey": "PLACEHOLDER",
    "SearchIndexName": "search-000",
    "HijackIndexName": "hijack-000",
    "StorageConnectionString": "PLACEHOLDER",
    "StorageContainer": "v3-azuresearch-000",
    "StoragePath": "",
    "GalleryBaseUrl": "https://www.nuget.org/",
    "AuxiliaryDataStorageConnectionString": "PLACEHOLDER",
    "AuxiliaryDataStorageContainer": "ng-search-data",
    "AuxiliaryDataStorageExcludedPackagesPath": "ExcludedPackages.v1.json",
    "DownloadsV1JsonUrl": "PLACEHOLDER",
    "FlatContainerBaseUrl": "https://api.nuget.org/",
    "FlatContainerContainerName": "v3-flatcontainer",
    "AllIconsInFlatContainer": false,
    "DatabaseBatchSize": 10000,
    "CatalogIndexUrl": "https://api.nuget.org/v3/catalog0/index.json",
    "EnablePopularityTransfers": true,
    "Scoring": {
      "FieldWeights": {
        "PackageId": 9,
        "TokenizedPackageId": 9,
        "Tags": 5
      },
      "DownloadScoreBoost": 30000,
      "PopularityTransfer": 0.99
    }
  },

  "FeatureFlags": {
    "ConnectionString": "PLACEHOLDER"
  }
}
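Before running the job, it can help to confirm that no PLACEHOLDER value was missed. The sketch below is an optional helper, not part of the tool: it walks a settings file and reports every setting still equal to "PLACEHOLDER", using the same colon-separated names as the list above. The function names are illustrative assumptions.

```python
import json


def find_placeholders(node, path=""):
    """Return the colon-separated paths of all settings still set to "PLACEHOLDER"."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}:{key}" if path else key
            found.extend(find_placeholders(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_placeholders(value, f"{path}:{index}"))
    elif node == "PLACEHOLDER":
        found.append(path)
    return found


def check_settings_file(settings_path):
    """Load a JSON settings file and list the settings that still need a value."""
    with open(settings_path, encoding="utf-8") as f:
        return find_placeholders(json.load(f))
```

For example, check_settings_file("settings.json") returns an empty list once every placeholder has been replaced.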

Building from NuGet.Insights Kusto tables

For local development and fast iteration, you can have the job build the search resources from the NuGet.Insights Kusto tables.

You can use the following configuration as a starting point:

{
  "Db2AzureSearch": {
    "AzureSearchBatchSize": 1000,
    "MaxConcurrentBatches": 4,
    "MaxConcurrentVersionListWriters": 8,
    "SearchServiceName": "<AZURE AI SEARCH RESOURCE NAME>",
    "SearchServiceUseDefaultCredential": true,
    "SearchIndexName": "search-001",
    "HijackIndexName": "hijack-001",
    "StorageConnectionString": "<AZURE STORAGE CONNECTION STRING>",
    "StorageContainer": "v3-azuresearch-001",
    "StoragePath": "",
    "GalleryBaseUrl": "https://www.nuget.org/",
    "FlatContainerBaseUrl": "https://api.nuget.org/",
    "FlatContainerContainerName": "v3-flatcontainer",
    "AllIconsInFlatContainer": false,
    "EnablePopularityTransfers": true,
    "Scoring": {
      "FieldWeights": {
        "PackageId": 9,
        "TokenizedPackageId": 9,
        "Tags": 5
      },
      "DownloadScoreBoost": 30000,
      "PopularityTransfer": 0.99
    },
    "Development": {
      "ReplaceContainersAndIndexes": true,
      "DisableVersionListWriters": false,
      "KustoConnectionString": "https://<KUSTO CLUSTER NAME>.kusto.windows.net",
      "KustoDatabaseName": "<KUSTO DATABASE NAME>",
      "KustoTableNameFormat": "Ni{0}",
      "KustoTopPackageCount": 100000,
      "KustoOnlyLatestPackages": true
    }
  },

  "FeatureFlags": {
    "ConnectionString": "<FEATURE FLAGS AZURE STORAGE CONNECTION STRING>"
  },

  "KeyVault_VaultName": "<KEY VAULT NAME, IF NEEDED>",
  "KeyVault_UseManagedIdentity": true
}
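The KustoTableNameFormat value "Ni{0}" reads like a format string in which {0} is replaced by a table's base name. That interpretation is an assumption based on the placeholder syntax, and the base name used below is hypothetical rather than an actual NuGet.Insights table. A minimal sketch of the expansion:

```python
def expand_table_name(name_format, base_name):
    """Substitute a base table name into a format string such as "Ni{0}"."""
    return name_format.format(base_name)


# "Example" is a made-up base name used only to show the substitution.
table_name = expand_table_name("Ni{0}", "Example")
```

If your Kusto tables use a different prefix, adjusting the format string should be enough to point the job at them, provided the assumption above holds.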

Algorithm

At a high level, here's how Db2AzureSearch works:

  1. Create the Azure Search indexes
  2. Create the Azure Blob storage container for the search auxiliary files
  3. Capture the catalog's cursor
  4. Load initial data from Gallery DB and statistics auxiliary files
  5. Process package metadata in batches
    1. Load a chunk of packages from Gallery DB
    2. Generate and upload documents to the Azure Search indexes
    3. Update the search version list resource
  6. Write the search auxiliary files to search storage
  7. Write the catalog's cursor to search storage
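The ordering of these steps can be sketched with in-memory stand-ins. This is an illustration of the control flow only, not the job's actual implementation: the function names, the document shape, and the storage keys are all assumptions, and the real search and hijack documents differ from the simplified ones used here.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_db2azuresearch(packages, catalog_cursor, batch_size):
    """Walk through the seven steps using dictionaries instead of Azure resources."""
    # Steps 1-2: create the indexes and the storage container (stand-ins).
    indexes = {"search": [], "hijack": []}
    storage = {}

    # Step 3: capture the catalog cursor before reading any package data.
    captured_cursor = catalog_cursor

    # Step 4: load initial data (an empty stand-in for the auxiliary data).
    auxiliary = {"downloads": []}

    # Step 5: process package metadata in batches.
    version_lists = {}
    for chunk in batches(packages, batch_size):
        for package_id, version in chunk:
            # Simplified document; the real indexes use richer, distinct shapes.
            document = {"id": package_id, "version": version}
            indexes["search"].append(document)
            indexes["hijack"].append(document)
            version_lists.setdefault(package_id, []).append(version)
    storage["version-lists"] = version_lists

    # Step 6: write the auxiliary files to search storage.
    storage["auxiliary"] = auxiliary

    # Step 7: write the cursor last, after all other data is in place.
    storage["cursor"] = captured_cursor
    return indexes, storage
```

Note that the cursor is captured in step 3 but only written in step 7. Presumably this lets a later catalog-driven update (such as Catalog2AzureSearch) replay any changes that happened while the job was running, though that rationale is an inference from the step order rather than something stated above.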