id: databricks-sql
namespace: company.team

description: |
  This flow demonstrates how to run a SQL query on a Databricks lakehouse.

tasks:
  - id: sql_query
    type: io.kestra.plugin.databricks.sql.Query
    accessToken: "{{ secret('DATABRICKS_TOKEN') }}"
    host: "{{ secret('DATABRICKS_HOST') }}"
    httpPath: sql/protocolv1/o/3497567377305430/0713-130754-9oa8zceh
    sql: SELECT * FROM samples.nyctaxi.trips LIMIT 100;
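
  # The Query task writes its result set to Kestra's internal storage and
  # exposes it as `outputs.sql_query.uri` — in Amazon Ion format, which is
  # why the next task converts it with IonToCsv.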

  - id: csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.sql_query.uri }}"
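
  # IonToCsv stores the converted file in internal storage as well; the
  # Python task below reads it via `outputs.csv.uri`.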

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    beforeCommands:
      - pip install kestra pandas > /dev/null
    script: |
      import pandas as pd

      df = pd.read_csv("{{ outputs.csv.uri }}")
      print(df.head())
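      # A minimal sketch of returning values to Kestra, assuming the `kestra`
      # package installed in beforeCommands exposes the `Kestra.outputs` helper:
      from kestra import Kestra
      Kestra.outputs({"row_count": len(df)})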

extend:
  title: Run a SQL query on Databricks, output the result to a CSV file and read
    that CSV file in a Python script with Pandas
  description: >+
    This flow demonstrates how to run a SQL query on a Databricks lakehouse. The
    results are stored in a CSV file. Then, the flow uses Pandas to read the CSV
    file and further process the data.


    The Databricks SQL Connector is designed to perform SQL queries on a
    Databricks lakehouse and handle the results. In simple terms, it allows you
    to query data stored in Databricks and manipulate or analyze this data. This
    process involves three main steps:

    1. **SQL Query Execution**: The connector executes a SQL query on the
    Databricks lakehouse.

    2. **CSV Conversion**: The results of the SQL query are converted into CSV
    format.

    3. **Data Processing with Pandas**: The CSV data is then processed using the
    Pandas library in Python for further analysis or manipulation.


    This flow is particularly useful for extracting and transforming data stored
    in Databricks for various data analysis and data science applications.

  tags:
    - Databricks
    - Python
    - pip
  ee: false
  demo: false
  meta_description: This flow runs a SQL query on a Databricks lakehouse. The
    results are stored in a CSV file. Then, the flow uses Pandas to read the
    CSV file and further process the data.
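
# The two secrets referenced above must be configured on the Kestra instance.
# A minimal sketch for a local open-source setup, assuming secrets are provided
# as base64-encoded environment variables prefixed with SECRET_ (the host and
# token values below are placeholders):
#   export SECRET_DATABRICKS_HOST=$(echo -n "dbc-xxxx.cloud.databricks.com" | base64)
#   export SECRET_DATABRICKS_TOKEN=$(echo -n "dapiXXXXXXXX" | base64)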