The purpose of this project is to provide data for developer productivity by collecting and reporting:
- What are the costliest tasks (time x frequency) of local developers, by project?
- What is the breakdown of these tasks in terms of cache effectiveness, network, or something else?
- Given a task, tell me the cache rate, mean/stddev/histogram, inputs, network, etc. How has this changed over time?
- What are the costliest errors (again, time x frequency) and build failures, segmented by environment and failure type?
- What is the impact of flaky test failures on local builds?
- What versions of Gradle BT are in use at Gradle, and with what frequency? Similarly, what versions of guava are in use across active projects at Gradle?
These applications collect and index Gradle Enterprise and GitHub events data into Google Cloud Storage and BigQuery.
This application is responsible for streaming build event data from a configured Gradle Enterprise export API endpoint into a specified Google Cloud Storage bucket. It pulls all builds down and all configured event types, avoiding as much data interaction (parsing, filtering) as possible.
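To verify that events are landing, you can list the raw bucket; objects are keyed by build start time, as in the examples later in this README. The bucket name comes from the GCS_RAW_BUCKET_NAME configuration above.

```shell
# Sketch: inspect the raw bucket the collector writes to.
# The date-keyed layout (year/month/day) matches the input paths used
# by the indexer examples later in this README.
gsutil ls "gs://build-events-raw/2019/01/01/" | head -n 5
```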
Reference Materials:
gcloud compute instances create build-event-collectorator1 \
--preemptible \
--image-family debian-9 \
--image-project debian-cloud \
--machine-type n1-highmem-8 \
--scopes "userinfo-email,cloud-platform" \
--metadata startup-script='#!/bin/sh
APP_NAME="build-event-collectorator"
APP_VERSION="0.5.0"
export GRADLE_ENTERPRISE_HOSTNAME="gradle-enterprise.mycompany.com"
export GRADLE_ENTERPRISE_USERNAME="my-username"
export GRADLE_ENTERPRISE_PASSWORD="my-password"
export GCS_RAW_BUCKET_NAME="build-events-raw"
gsutil cp "gs://gradle-build-analysis-apps/maven2/org/gradle/buildeng/analysis/${APP_NAME}/${APP_VERSION}/${APP_NAME}-${APP_VERSION}.zip" .
apt-get update && apt-get -y --force-yes install openjdk-8-jdk unzip
update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
echo "Running ${APP_NAME}-${APP_VERSION}..."
unzip "${APP_NAME}-${APP_VERSION}.zip"
sh "${APP_NAME}-${APP_VERSION}/bin/${APP_NAME}"
echo "Application exited"'
By default the collector processes only builds from the moment the app is started, but you can collect past builds by setting export BACKFILL_DAYS=<number> in the startup script. Similarly, you can set export LAST_BUILD_ID="jh4qknspatp2y" to start streaming from the build immediately after the given build ID.
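For example, the startup script above could be extended like this (the values are illustrative):

```shell
# Illustrative excerpt of the collector startup script with backfill enabled.
export BACKFILL_DAYS=7                  # re-process the last 7 days of builds
# export LAST_BUILD_ID="jh4qknspatp2y"  # ...or resume just after a known build
echo "Backfilling ${BACKFILL_DAYS:-0} days of builds"
```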
This app is not needed right now. It streamed build events to Google Cloud Pub/Sub to allow fan-out to many build event collectors. Currently the limiting factor is outbound network from the Gradle Enterprise server, so multiple downloaders do not make processing faster.
These applications are responsible for transforming raw data pulled from Google Cloud Storage, filtering and combining those events using Apache Beam. Each application consists of an EventModel that represents the schema of the BigQuery table to be generated, an EventJsonTransformer which filters and transforms data, and a BuildIndexer which writes to BigQuery.
You can develop your own indexer by creating a Model, an EventsJsonTransformer, and an Indexer. You will find several indexers for inspiration under build-event-indexerator/src/main/kotlin.
You can run any indexer locally using a task rule and the DirectRunner:
./gradlew :build-event-indexerator:indexBuildEvents --args="--runner=DirectRunner --project=build-analysis --input=gs://build-events-raw/2019/01/01/22*.txt --output=build-analysis:gradle_builds.builds"
Once you’re happy with your Apache Beam setup, create a Google Dataflow job to run over a larger input.
./gradlew :build-event-indexerator:indexTestEvents --args="--runner=DataflowRunner --project=build-analysis --input=gs://build-events-raw/2019/01/** --output=build-analysis:gradle_builds.test_executions --region=us-central1 --tempLocation=gs://gradle-dataflow-tmp/$(openssl rand -hex 8)"
Let’s break this down a bit:
- --runner=DataflowRunner tells Apache Beam that you want to use Google Dataflow, which uses Google Compute Engine under the hood.
- --project=build-analysis configures the Google Cloud project name. We use build-analysis.
- --input=gs://build-events-raw/2019/01/** will consume all files from the given Google Cloud Storage bucket for the month of January 2019. These build files are keyed by build start time.
- --output=build-analysis:gradle_builds.my_builds_table is the BigQuery table that will be created (if necessary) or appended to.
- --region=us-central1 is the Google Compute region to use for workers. Quotas are set by region, and we have requested increased capacity so that Dataflow can process TBs of build data before Eric dies of old age. This is optional and us-central1 is the default.
- --tempLocation=gs://gradle-dataflow-tmp/$(openssl rand -hex 8) is an existing GCS bucket plus a random key that Dataflow jobs can use to store temporary files.
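The random suffix in --tempLocation is just a throwaway key so that concurrent jobs do not collide; openssl rand -hex 8 produces 8 random bytes rendered as 16 hex characters:

```shell
# Generate a unique temp path for a Dataflow job run.
TMP_KEY=$(openssl rand -hex 8)   # 8 random bytes rendered as 16 hex chars
echo "gs://gradle-dataflow-tmp/${TMP_KEY}"
```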
Reporting-specific tables must be created in BigQuery in order to keep Data Studio costs and performance reasonable.
You need to use the BigQuery CLI or API to create new partitioned tables. Using time-partitioned tables allows Data Studio to avoid scanning all of the data when querying a subset of the time range.
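The bq query invocations below create the partitioned destination tables on the fly via --time_partitioning_field. If you prefer to create an empty partitioned table up front, a sketch (the table name and schema here are illustrative, not from this project):

```shell
# Sketch: explicitly create a day-partitioned reporting table.
# Table name and column schema are illustrative placeholders.
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=date \
  "build-analysis:reports.my_dashboard" \
  date:DATE,project:STRING,total_build_time:INTEGER
```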
You can generate tables for various dashboards using the following queries:
bq query --location="US" --destination_table="build-analysis:reports.builds_dashboard" --time_partitioning_field="date" --use_legacy_sql="false" --replace --batch '
SELECT
DATE(buildTimestamp) AS date,
rootProjectName AS project,
buildId,
STARTS_WITH(buildAgentId, "tcagent") AS ci,
SUM(wallClockDuration) AS total_build_time
FROM
`gradle_builds.builds`
WHERE
rootProjectName IN ("gradle",
"dotcom",
"dotcom-docs",
"gradle-kotlin-dsl",
"ci-health",
"build-analysis",
"gradle-profiler",
"gradle-site-plugin",
"gradlehub")
AND buildTimestamp > "2019-01-01"
GROUP BY
1,
2,
3,
4;'
bq query --location="US" --destination_table="build-analysis:reports.failures_dashboard" --time_partitioning_field="timestamp" --use_legacy_sql="false" --replace --batch '
SELECT
buildId,
rootProjectName AS project,
buildTimestamp AS timestamp,
wallClockDuration AS build_duration,
STARTS_WITH(buildAgentId, "tcagent") AS ci,
failureData.category AS failure_category,
failed_task,
JSON_EXTRACT(env.value,
"$.name") AS os
FROM
`gradle_builds.builds` builds,
UNNEST(failureData.taskPaths) AS failed_task
CROSS JOIN
UNNEST(environmentParameters) AS env
WHERE
rootProjectName IN ("gradle",
"dotcom",
"dotcom-docs",
"gradle-kotlin-dsl",
"ci-health",
"build-analysis",
"gradle-profiler",
"gradle-site-plugin",
"gradlehub")
AND buildTimestamp > "2019-01-01"
AND BYTE_LENGTH(failureId) > 0
AND env.key = "Os"'
bq query --location="US" --destination_table="build-analysis:reports.tasks_dashboard" --time_partitioning_field="date" --use_legacy_sql="false" --replace --batch '
SELECT
DATE(buildTimestamp) AS date,
rootProjectName AS project,
CONCAT(tasks.buildPath, " > ", tasks.path) AS absolute_task_path,
tasks.className AS task_type,
tasks.outcome,
tasks.cacheable,
CASE
WHEN tasks.cacheable IS FALSE THEN "NOT_CACHEABLE"
WHEN tasks.cacheable IS TRUE
AND tasks.outcome IN ("from_cache") THEN "CACHE_HIT"
WHEN tasks.cacheable IS TRUE AND tasks.outcome IN ("success", "failed") THEN "CACHE_MISS"
WHEN tasks.cacheable IS TRUE
AND tasks.outcome IN ("up_to_date",
"skipped",
"no_source") THEN "UP_TO_DATE"
ELSE "UNKNOWN"
END AS cache_use,
STARTS_WITH(buildAgentId, "tcagent") AS ci,
SUM(tasks.wallClockDuration) AS total_time_ms,
AVG(tasks.wallClockDuration) AS avg_duration,
STDDEV(tasks.wallClockDuration) AS stddev_duration
FROM
`gradle_builds.task_executions`,
UNNEST(tasks) AS tasks
WHERE
rootProjectName IN ("gradle",
"dotcom",
"dotcom-docs",
"gradle-kotlin-dsl",
"ci-health",
"build-analysis",
"gradle-profiler",
"gradle-site-plugin",
"gradlehub")
AND buildTimestamp > "2019-01-01"
GROUP BY
1,
2,
3,
4,
5,
6,
7,
8;'
bq query --location="US" --destination_table="build-analysis:reports.tests_dashboard" --time_partitioning_field="date" --use_legacy_sql="false" --replace --batch '
SELECT
DATE(buildTimestamp) AS date,
rootProjectName as project,
CONCAT(t.className, ".", t.name) AS test_name,
t.taskId as task_path,
exec.failed AS failed,
STARTS_WITH(buildAgentId, "tcagent") AS ci,
SUM(exec.wallClockDuration) AS total_time_ms,
AVG(exec.wallClockDuration) AS avg_duration,
STDDEV(exec.wallClockDuration) stddev_duration
FROM
`gradle_builds.test_executions`,
UNNEST(tests) AS t,
UNNEST(t.executions) AS exec
WHERE
rootProjectName IN ("gradle",
"dotcom",
"dotcom-docs",
"gradle-kotlin-dsl",
"ci-health",
"build-analysis",
"gradle-profiler",
"gradle-site-plugin",
"gradlehub")
AND buildTimestamp > "2019-01-01"
AND t.suite = FALSE
GROUP BY
1,
2,
3,
4,
5,
6;'
bq query --location="US" --destination_table="build-analysis:reports.dependencies_dashboard" --use_legacy_sql="false" --replace --batch '
SELECT
DISTINCT(CONCAT(md.group, ":", md.module)) AS group_and_module,
rootProjectName AS project_name,
md.version,
COUNT(buildId) build_count
FROM
`gradle_builds.dependencies` AS d,
UNNEST(moduleDependencies) AS md
WHERE
rootProjectName IN ("gradle",
"dotcom",
"dotcom-docs",
"gradle-kotlin-dsl",
"ci-health",
"build-analysis",
"gradle-profiler",
"gradle-site-plugin",
"gradlehub")
AND buildTimestamp > "2019-01-01"
GROUP BY
1,
2,
3;'
You can query build data using:
- Google Cloud Project: your-google-cloud-project
- BigQuery Dataset: gradle_builds
Here are many of the BigQuery tables generated. All of them that have a timestamp field are partitioned by that field:
- builds
- build_cache_interactions
- build_failures
- dependency_resolutions
- exceptions
- network_activity
- task_executions
- test_executions
Schemas are generated from data classes under build-event-indexerator/src/main/kotlin/org/gradle/buildeng/analysis/model/ using BigQueryTableSchemaGenerator.
Some fields are JSON. See BigQuery JSON functions for reference.
You can use Google Cloud Scheduler or plain old cron to schedule ~daily data updates. See the Cloud Scheduler docs.
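As a sketch, a crontab entry for a daily refresh might look like this. The wrapper script is hypothetical; it would simply re-run the bq query commands shown above.

```shell
# Hypothetical crontab entry: refresh dashboard tables daily at 06:00 UTC.
# /opt/build-analysis/refresh-dashboards.sh is a made-up wrapper that
# re-runs the bq query commands shown earlier in this README.
0 6 * * * /opt/build-analysis/refresh-dashboards.sh >> /var/log/ba-refresh.log 2>&1
```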
SELECT
FORMAT_TIMESTAMP('%Y-%m-%d', buildTimestamp) AS day,
STARTS_WITH(buildAgentId, 'tcagent') AS isCI,
COUNT(buildId) AS count
FROM
`gradle_builds.builds`
WHERE
buildTimestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND BYTE_LENGTH(failureId) > 0
GROUP BY 1, 2
ORDER BY 1, 2;
SELECT
buildToolVersion,
COUNT(buildId) as count
FROM
`gradle_builds.builds`
WHERE
rootProjectName = 'gradle'
and buildTimestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY
1
ORDER BY
2 DESC;
SELECT
JSON_EXTRACT(env.value,
'$.version') as jdk_version,
COUNT(env.value) as count
FROM
`gradle_builds.builds`,
UNNEST(environmentParameters) AS env
WHERE
buildAgentId NOT LIKE 'tcagent%'
AND rootProjectName = 'gradle'
AND env.key LIKE 'Jvm'
AND buildTimestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY
1
ORDER BY
2 DESC;
SELECT
buildAgentId,
JSON_EXTRACT(env.value,
'$.daemon') AS daemon,
JSON_EXTRACT(env.value,
'$.taskOutputCache') AS build_cache,
COUNT(env.value) AS count
FROM
`gradle_builds.builds`,
UNNEST(environmentParameters) AS env
WHERE
buildAgentId NOT LIKE 'tcagent%'
AND env.key LIKE 'BuildModes'
and (JSON_EXTRACT(env.value,
'$.daemon') = 'false' OR JSON_EXTRACT(env.value,
'$.taskOutputCache') = 'false')
AND buildTimestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3
ORDER BY 4 DESC;
- Given a task, tell me the cache rate, mean/stddev/histogram, etc. How has this changed over time?
- Given a test, tell me the outcome history, duration, flakiness, etc.
- What are the costliest tests? Are there Test tasks that never fail? Could we run them less frequently?
- What are the costliest errors (again, time x frequency) and build failures, segmented by environment and failure type?
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-f