id: databricks-sql
namespace: company.team

description: |
  This flow demonstrates how to run a SQL query on a Databricks lakehouse.

tasks:
  - id: sql_query
    type: io.kestra.plugin.databricks.sql.Query
    accessToken: "{{ secret('DATABRICKS_TOKEN') }}"
    host: "{{ secret('DATABRICKS_HOST') }}"
    httpPath: sql/protocolv1/o/3497567377305430/0713-130754-9oa8zceh
    sql: SELECT * FROM samples.nyctaxi.trips LIMIT 100;
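
  # The Query task writes its result set to Kestra's internal storage and
  # exposes it as `outputs.sql_query.uri` — in Amazon Ion format, which is
  # why the next task converts it with IonToCsv.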

  - id: csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.sql_query.uri }}"
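
  # IonToCsv stores the converted file in internal storage as well; the
  # Python task below reads it via `outputs.csv.uri`.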

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    beforeCommands:
      - pip install kestra pandas > /dev/null
    script: |
      import pandas as pd

      df = pd.read_csv("{{ outputs.csv.uri }}")
      print(df.head())
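      # A minimal sketch of returning values to Kestra, assuming the `kestra`
      # package installed in beforeCommands exposes the `Kestra.outputs` helper:
      from kestra import Kestra
      Kestra.outputs({"row_count": len(df)})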

extend:
  title: Run a SQL query on Databricks, output the result to a CSV file and read
    that CSV file in a Python script with Pandas
  description: >+
    This flow demonstrates how to run a SQL query on a Databricks lakehouse. The
    results are stored in a CSV file. Then, the flow uses Pandas to read the CSV
    file and further process the data.


    The Databricks SQL Connector is designed to perform SQL queries on a
    Databricks lakehouse and handle the results. In simple terms, it allows you
    to query data stored in Databricks and manipulate or analyze this data. This
    process involves three main steps:

    1. **SQL Query Execution**: The connector executes a SQL query on the
    Databricks lakehouse.

    2. **CSV Conversion**: The results of the SQL query are converted into CSV
    format.

    3. **Data Processing with Pandas**: The CSV data is then processed using the
    Pandas library in Python for further analysis or manipulation.


    This flow is particularly useful for extracting and transforming data stored
    in Databricks for various data analysis and data science applications.

  tags:
    - Databricks
    - Python
    - pip
  ee: false
  demo: false
  meta_description: This flow runs a SQL query on a Databricks lakehouse. The
    results are stored in a CSV file. Then, the flow uses Pandas to read the
    CSV file and further process the data.
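
# The two secrets referenced above must be configured on the Kestra instance.
# A minimal sketch for a local open-source setup, assuming secrets are provided
# as base64-encoded environment variables prefixed with SECRET_ (the host and
# token values below are placeholders):
#   export SECRET_DATABRICKS_HOST=$(echo -n "dbc-xxxx.cloud.databricks.com" | base64)
#   export SECRET_DATABRICKS_TOKEN=$(echo -n "dapiXXXXXXXX" | base64)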