Updated spark with scala and python #36997

Open: wants to merge 10 commits into base: main

utieyin
Contributor

@utieyin utieyin commented Dec 13, 2024

A customer escalated an issue with Spark: https://github.com/chainguard-dev/customer-issues/issues/1926
The previous spark-3.5 package has also been renamed to spark-3.5-scala-2.12 to reflect the earlier change to Scala 2.13.

Related: https://github.com/chainguard-dev/customer-issues/issues/1926

Pre-review Checklist

For new package PRs only

  • This PR is marked as fixing a pre-existing package request bug

@kranurag7 kranurag7 added the approved-to-run A repo member has approved this external contribution label Dec 14, 2024
Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error: The build seems to halt after attempting to download Scala and Maven without an explicit error message. The pipeline doesn't proceed past the curl commands.

• Error Category: Build Configuration/Environment

• Failure Point: During the make-distribution.sh script execution when trying to download build dependencies

• Root Cause Analysis: The script is attempting to download Scala and Maven but likely failing silently. The environment appears to be missing required Maven configuration.

• Suggested Fix:

  1. Add maven-wrapper to environment dependencies
  2. Modify the pipeline to pre-download required dependencies
  3. Update the environment section:
environment:
  contents:
    packages:
      - maven-wrapper
      - scala
      # existing packages...
  environment:
    LANG: en_US.UTF-8
    M2_HOME: /usr/share/maven
    MAVEN_OPTS: "-Xmx2048m -XX:ReservedCodeCacheSize=512m"

• Explanation: The build system is trying to download build tools at runtime, which can be unreliable. By providing Maven and Scala through the package manager and configuring proper Maven environment variables, we ensure the build tools are available and properly configured.

• Additional Notes:

  • The current approach relies on runtime downloads which can be flaky
  • Using system-provided Maven and Scala is more reliable
  • Maven memory settings help prevent OOM issues during large builds

• References:

Consider this a critical fix since build reliability is essential for CI/CD pipelines.

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error:

curl: (28) Failed to connect to archive.apache.org port 443 after 128725 ms: Could not connect to server
shasum: /home/build/build/apache-maven-3.9.6-bin.tar.gz.sha512: no properly formatted SHA checksum lines found

• Error Category: Dependency/Network

• Failure Point: Maven download and verification during the build process

• Root Cause Analysis:
The build is failing because it cannot download Maven from the Apache archive server and subsequently cannot verify the checksum. This appears to be a network connectivity or timeout issue when trying to reach archive.apache.org.

• Suggested Fix:

  1. Install Maven from the Wolfi package repository instead of downloading it from archive.apache.org during the build:
environment:
  contents:
    packages:
      - maven  # Use the system-provided Maven package instead

Remove the explicit Maven download since Wolfi already provides Maven as a package.

• Explanation:
Instead of downloading Maven during the build process, using the system-provided Maven package from Wolfi's repositories will be more reliable and faster. The built-in package management system (apk) is more resilient to network issues and already handles verification.

• Additional Notes:

  • The system-provided Maven package is maintained and verified by Wolfi
  • This approach reduces build time by eliminating the need to download and verify Maven
  • This is more consistent with Wolfi's package management philosophy
  • The Maven version provided by Wolfi should be sufficient for building Spark

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

Based on the error output, here's my analysis:

• Detected Error:

curl: (28) Failed to connect to archive.apache.org port 443 after 128901 ms: Could not connect to server
shasum: /home/build/build/apache-maven-3.9.6-bin.tar.gz.sha512: no properly formatted SHA checksum lines found

• Error Category: Dependency/Network

• Failure Point: Maven download and verification during the make-distribution.sh script execution

• Root Cause Analysis: The build is failing because:

  1. Connection timeout trying to download Maven from archive.apache.org
  2. Subsequent SHA512 checksum verification fails because the Maven archive wasn't downloaded
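
The failure chain above (a timed-out download leaves nothing valid for shasum to read) can be caught one step earlier with a small guard. This is a minimal sketch, not what Spark's build scripts do: `has_sha512_line` is a hypothetical helper name, and the check assumes only that a SHA-512 digest is 128 hexadecimal characters.

```shell
# Hypothetical helper (not part of Spark): succeed only if the file is
# non-empty and has a line starting with a 128-hex-digit SHA-512 digest.
has_sha512_line() {
  local checksum_file=$1
  [ -s "$checksum_file" ] &&
    grep -Eq '^[0-9a-fA-F]{128}([[:space:]]|$)' "$checksum_file"
}
```

A caller could run this on the downloaded .sha512 file and stop with a clear "download failed" message before shasum is invoked, so the network error is not masked by a checksum-format error.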

• Suggested Fix:

  1. Add explicit Maven configuration in the environment section:
environment:
  contents:
    packages:
      # Existing packages...
      - maven-3.9
  environment:
    LANG: en_US.UTF-8
    M2_HOME: /usr/share/java/maven-3.9
    MAVEN_HOME: /usr/share/java/maven-3.9
    PATH: /usr/share/java/maven-3.9/bin:$PATH
  2. Modify the make-distribution.sh patch to use system Maven instead of downloading it:
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -166,7 +166,7 @@
-MVN="$SPARK_HOME/build/mvn"
+MVN="mvn"

• Explanation:

  • The build is trying to download Maven but failing due to network issues
  • Wolfi OS already provides maven-3.9 package
  • Using the system-provided Maven will avoid the download/verification steps
  • This approach is more reliable as it uses pre-verified packages

• Additional Notes:

  • This issue is common in CI environments with network restrictions
  • Using system packages is preferred over downloading during build
  • The maven-3.9 package in Wolfi is maintained and security-patched

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error:

curl: (28) Failed to connect to archive.apache.org port 443 after 131779 ms: Could not connect to server
shasum: /home/build/build/apache-maven-3.9.6-bin.tar.gz.sha512: no properly formatted SHA checksum lines found

• Error Category: Dependency/Network

• Failure Point: Maven download and verification during the build process

• Root Cause Analysis: The build is failing because it cannot download Maven 3.9.6 from Apache's archive server and consequently cannot verify its checksum. This appears to be either a network connectivity issue or a temporary outage of the Apache archive server.

• Suggested Fix:

  1. Use the Maven package already installed in the build environment instead of downloading it. Modify the environment section to add:
environment:
  contents:
    packages:
      # existing packages...
  environment:
    # existing environment vars...
    PATH: /usr/share/maven/bin:$PATH
    M2_HOME: /usr/share/maven
  2. Remove or comment out the Maven download portion from the build script (it should use the system Maven instead)

• Explanation: The build environment already includes Maven 3.9 from Wolfi's package repository (maven-3.9=3.9.9-r0). By ensuring the Maven binaries are in the PATH and M2_HOME is set correctly, the build will use the system-provided Maven instead of trying to download it, avoiding the network connectivity issues.

• Additional Notes:

  • This approach is more reliable as it doesn't depend on external download availability
  • The system Maven version (3.9.9) is actually newer than the one the build is trying to download (3.9.6)
  • Using system packages is preferred in Wolfi as they are maintained and security-patched

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error: The build is failing while trying to download Maven and Scala dependencies during the build process.

• Error Category: Dependency/Download

• Failure Point: In the make-distribution.sh script when trying to download Maven and Scala dependencies

• Root Cause Analysis: The build is attempting to download Maven and Scala from external sources, but appears to be failing. This is likely because we already have Maven installed in the build environment and should use that instead.

• Suggested Fix:

  1. Modify the make-distribution.patch file to skip downloading Maven and use the system Maven:
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -162,6 +162,9 @@
 VERSION=$VERSION
 
 if [ -z "$JAVA_HOME" ]; then
   echo "Error: JAVA_HOME is not set, cannot proceed."
   exit -1
 fi
+
+# Use system Maven instead of downloading
+MVN="mvn"

• Explanation:

  • The build environment already has Maven 3.9 installed via the package manager
  • Modifying the script to use the system Maven will prevent unnecessary downloads
  • This approach is more reliable as it uses vetted packages from the Wolfi repository

• Additional Notes:

  • Make sure the maven-3.9 package is in the environment section of the melange.yaml
  • The system Maven is already configured with the correct settings and repositories
  • This approach aligns with Wolfi's principle of using system packages where possible

• References:

Follow up action: Let me know if you need additional build error details after applying this fix.

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error:

curl: (28) Failed to connect to archive.apache.org port 443 after 129871 ms: Could not connect to server
shasum: /home/build/build/apache-maven-3.9.6-bin.tar.gz.sha512: no properly formatted SHA checksum lines found

• Error Category: Dependency/Network

• Failure Point: Maven download and verification step during make-distribution.sh execution

• Root Cause Analysis: The build is failing because it cannot download Maven 3.9.6 from Apache's archive server, likely due to network connectivity issues or server unavailability. The subsequent checksum verification fails because the download was incomplete.

• Suggested Fix:

  1. Remove the Maven download from make-distribution.sh since Wolfi already provides Maven 3.9:
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
- # Download Maven if it hasn't been installed
- if [ ! -f "${MAVEN_PATH}/bin/mvn" ]; then
-   # Download Maven
-   MAVEN_VERSION=3.9.6
-   wget "https://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz"
-   tar xzf "apache-maven-${MAVEN_VERSION}-bin.tar.gz"
-   rm -rf "${MAVEN_PATH}"
-   mv "apache-maven-${MAVEN_VERSION}" "${MAVEN_PATH}"
- fi

• Explanation: Since Wolfi OS provides Maven 3.9 as a package and it's already included in the build environment packages list, we can remove the Maven download step from the make-distribution.sh script. This eliminates the network dependency and uses the system-provided Maven instead.

• Additional Notes:

  • The build environment already has maven-3.9=3.9.9-r0 installed
  • Using system-provided Maven is more reliable than downloading during build
  • This approach is more consistent with Wolfi's package management philosophy

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

Based on the build error log, I can help diagnose and fix the issue.

• Detected Error:

exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.18/scala-2.12.18.tgz
exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz?action=download
exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz.sha512

• Error Category: Build/Configuration

• Failure Point: make-distribution.sh script trying to download Scala and Maven dependencies

• Root Cause Analysis: The build is attempting to download Scala and Maven instead of using the system-provided Maven from the environment packages. This is happening because the build script isn't properly detecting or using the installed Maven package.

• Suggested Fix:

  1. Add environment variables to explicitly set Maven and Scala locations:
environment:
  environment:
    LANG: en_US.UTF-8
    MAVEN_HOME: /usr/share/java/maven-3.9
    M2_HOME: /usr/share/java/maven-3.9
    PATH: /usr/share/java/maven-3.9/bin:$PATH
    SCALA_HOME: /usr/share/java/scala
  2. Modify the make-distribution.sh patch to skip downloading Maven and Scala if system versions are present

• Explanation: By explicitly setting the Maven and Scala environment variables, we ensure the build system uses the packaged versions instead of trying to download them. This addresses the underlying issue while maintaining the build integrity.
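
One way to express "skip the download if a system version is present" is to resolve the launcher once and fall back to the bundled wrapper only when no `mvn` is on PATH. The sketch below is illustrative only: `pick_mvn` is a hypothetical helper, and the `build/mvn` fallback is borrowed from the `MVN="$SPARK_HOME/build/mvn"` line quoted in the patches in this thread.

```shell
# Hypothetical helper: print the Maven launcher to use.
# Prefers an mvn already on PATH; otherwise prints the given fallback.
pick_mvn() {
  local fallback=$1
  if command -v mvn >/dev/null 2>&1; then
    command -v mvn
  else
    printf '%s\n' "$fallback"
  fi
}

# Example (assumes SPARK_HOME is set by the caller):
#   MVN=$(pick_mvn "$SPARK_HOME/build/mvn")
```

Compared with hard-coding `MVN="mvn"`, this keeps the script usable in environments where Maven is not installed as a package.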

• Additional Notes:

  • The system already has Maven 3.9 installed as a package
  • The build script's behavior of downloading dependencies can be overridden
  • This approach aligns with Wolfi's principle of using system packages

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error:

curl: (28) Failed to connect to archive.apache.org port 443 after 132404 ms: Could not connect to server
shasum: /home/build/build/apache-maven-3.9.6-bin.tar.gz.sha512: no properly formatted SHA checksum lines found

• Error Category: Dependency/Network

• Failure Point: Maven download and verification step in make-distribution.sh

• Root Cause Analysis: The build is failing because it cannot download Maven 3.9.6 from Apache's archive server and verify its checksum. This appears to be either a network connectivity issue or potentially a temporary outage of the Apache archive server.

• Suggested Fix:

  1. Modify the make-distribution.patch to use Maven that's already installed in the build environment instead of downloading it:
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -89,8 +89,8 @@
-MVN="build/apache-maven-${MVN_VERSION}/bin/mvn"
+MVN="/usr/share/java/maven-3.9/bin/mvn"

• Explanation:

  • The build environment already has Maven 3.9 installed via the package manager (maven-3.9=3.9.9-r0)
  • By using the system-installed Maven instead of downloading it, we avoid network dependencies
  • This aligns with Wolfi's principle of using package manager-provided tools where possible

• Additional Notes:

  • The current error occurs during the build preparation phase before the actual Spark compilation
  • Using system Maven is more reliable as it's already verified and installed
  • This approach is more efficient as it eliminates an unnecessary download

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

Based on the build error output, I can help diagnose and fix the issue.

• Detected Error: The build is failing during the curl commands to download Scala and Maven dependencies

• Error Category: Dependency

• Failure Point: During the make-distribution.sh script execution when trying to download external dependencies

• Root Cause Analysis: The build script is trying to download Scala and Maven binaries but appears to be failing silently. This could be due to:

  1. Network connectivity issues in the build environment
  2. Missing curl configurations
  3. Invalid or outdated download URLs

• Suggested Fix:

  1. Add explicit curl download verification by modifying the make-distribution.patch:
diff --git a/dev/make-distribution.sh b/dev/make-distribution.sh
index a123456..b123456 100755
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -166,7 +166,10 @@
   local url=$1
   local output=$2
   
-  curl --silent --show-error -L "$url" > "$output"
+  curl --silent --show-error -L --retry 3 --retry-delay 2 \
+       --max-time 60 --fail "$url" > "$output" || {
+    echo "Failed to download $url" >&2
+    return 1
+  }

• Explanation:

  • Adds retry logic to handle temporary network issues
  • Includes proper error handling and output
  • Sets reasonable timeouts
  • Uses --fail to catch HTTP errors
  • Provides visibility into download failures
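
The same retry-and-report behavior can also be written as a generic wrapper, which is useful when the command being retried is not curl. This is a minimal sketch under stated assumptions: `retry` is a hypothetical helper name, and the attempt count and delay are arbitrary illustrative values.

```shell
# Hypothetical helper: run a command up to max_attempts times,
# sleeping `delay` seconds between failed attempts.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: retry 3 2 curl --silent --show-error --fail -L -o "$output" "$url"
```

Note that retries only help with transient failures; they would not fix a build environment that cannot reach archive.apache.org at all.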

• Additional Notes:

  1. Verify these URLs are still valid:
  2. Consider caching these dependencies in the build environment
  3. May want to add SHA verification for downloaded artifacts

• References:

  1. Apache Spark Build Documentation:
    https://spark.apache.org/docs/latest/building-spark.html
  2. Maven Distribution Mirror System:
    https://www.apache.org/dyn/closer.lua
  3. Scala Download Archive:
    https://downloads.lightbend.com/scala/

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error: "patching file dev/make-distribution.sh
Hunk #1 FAILED at 36.
1 out of 1 hunk FAILED -- saving rejects to file dev/make-distribution.sh.rej
patch unexpectedly ends in middle of line"

• Error Category: Configuration

• Failure Point: Patch application step in the pipeline

• Root Cause Analysis: The patch file make-distribution.patch cannot be applied cleanly to the source code, likely because the target file content has changed in version 3.5.3 compared to what the patch was created against.

• Suggested Fix:

  1. Regenerate the patch file against the current version:
git clone https://github.com/apache/spark
cd spark
git checkout v3.5.3
# Make your changes to dev/make-distribution.sh
git diff dev/make-distribution.sh > make-distribution.patch
  2. Or review and manually update the patch file to match the current file structure:
  • Check dev/make-distribution.sh at line 36 in v3.5.3
  • Compare with your patch content
  • Adjust the patch context and line numbers

• Explanation: Patch failures typically occur when the target file has changed and no longer matches the patch context. Regenerating the patch against the current version ensures the line numbers and context match exactly.

• Additional Notes:

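  • The "patch unexpectedly ends in middle of line" message usually means the patch file is missing its final newline; CRLF line endings can also break patch application
  • git diff already emits 3 lines of context by default (-U3); pass a larger value such as -U5 if more context is needed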
  • Verify patch file has proper line endings (LF not CRLF)
  • Consider using patch --verbose for debugging
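
The two file-level problems mentioned in these notes (CRLF line endings and a missing final newline) can be checked before the patch ever reaches the pipeline. A minimal sketch, assuming nothing beyond standard `grep` and `tail`; `lint_patch` is a hypothetical helper name.

```shell
# Hypothetical helper: flag CRLF line endings and a missing trailing
# newline in a patch file. Returns non-zero if either problem is found.
lint_patch() {
  local patch_file=$1 status=0
  local cr
  cr=$(printf '\r')
  if grep -q "$cr" "$patch_file"; then
    echo "$patch_file: contains CR characters (CRLF line endings?)" >&2
    status=1
  fi
  # Command substitution strips a final newline, so a non-empty result
  # means the last byte of the file is something other than a newline.
  if [ -s "$patch_file" ] && [ -n "$(tail -c 1 "$patch_file")" ]; then
    echo "$patch_file: missing trailing newline" >&2
    status=1
  fi
  return "$status"
}
```

This does not verify that the hunks match the target file; it only rules out the formatting causes before the context and line numbers are compared.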

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

Based on the build error output, I can help identify and fix the issue:

• Detected Error: exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.18/scala-2.12.18.tgz
(Build fails attempting to download Scala)

• Error Category: Build Configuration/Dependencies

• Failure Point: During make-distribution.sh script execution when trying to download Scala dependencies

• Root Cause Analysis: The build script is attempting to download Scala and Maven dependencies directly, but appears to be failing silently. This is likely due to either network connectivity issues or missing curl configurations.

• Suggested Fix:
Add the following to the pipeline section before the make-distribution.sh execution:

pipeline:
  - runs: |
      # Pre-download required dependencies
      mkdir -p build
      curl -L https://downloads.lightbend.com/scala/2.12.18/scala-2.12.18.tgz -o build/scala.tgz
      curl -L https://archive.apache.org/dist/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz -o build/maven.tar.gz
      
      # Extract dependencies
      cd build
      tar xf scala.tgz
      tar xf maven.tar.gz
      cd ..
      
      # Now run make-distribution
      ./dev/make-distribution.sh ...

• Explanation:

  • The build script needs Scala and Maven binaries to compile Spark
  • Pre-downloading and extracting these dependencies ensures they're available
  • Using explicit download locations prevents redirect issues
  • Creating a build directory maintains clean workspace organization

• Additional Notes:

  • Consider adding SHA512 verification for downloaded artifacts
  • May want to add retry logic for downloads
  • Could cache these dependencies in future builds

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error: "patching file dev/make-distribution.sh
Hunk #1 FAILED at 36.
1 out of 1 hunk FAILED -- saving rejects to file dev/make-distribution.sh.rej
patch unexpectedly ends in middle of line"

• Error Category: Configuration/Patch

• Failure Point: Patch application step in the pipeline failing to apply make-distribution.patch

• Root Cause Analysis: The patch file is either malformed or doesn't match the target file content at version 3.5.3, likely due to upstream changes in the make-distribution.sh script

• Suggested Fix:

  1. Regenerate the patch against the current version:
git clone https://github.com/apache/spark
cd spark
git checkout v3.5.3
# Make your changes to dev/make-distribution.sh
git diff dev/make-distribution.sh > make-distribution.patch
  2. Alternatively, directly modify the make-distribution.sh script in the pipeline instead of using a patch:
  - runs: |
      # Add your modifications here directly
      sed -i 's/original/replacement/' dev/make-distribution.sh

• Explanation: The patch is failing because the context lines in the patch file don't match the target file at version 3.5.3. Regenerating the patch against the exact version ensures the line numbers and context match perfectly.

• Additional Notes:

  • Patches are version-sensitive and need to be maintained when upgrading packages
  • Consider using sed/awk commands for simple modifications instead of patches
  • Verify the patch format is correct (no DOS line endings, proper unified diff format)

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error: "Bad checksum from https://archive.apache.org/dist/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz.sha512"

• Error Category: Build Configuration

• Failure Point: Maven installation during build setup

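• Root Cause Analysis: The build downloads Maven 3.9.6 and then fails checksum verification. The most likely cause is that the archive or its .sha512 file did not download intact, as in the earlier runs that timed out connecting to archive.apache.org. A version mismatch is unlikely to be the cause, since the checksum is compared against the downloaded 3.9.6 archive rather than against the Maven installed in the environment.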

• Suggested Fix:

  1. Remove the Maven download attempt by modifying the pipeline to use the system-provided Maven:
environment:
  contents:
    packages:
      - maven-3.9    # Use system Maven instead of downloading
  environment:
    M2_HOME: /usr/share/java/maven-3.9
    PATH: /usr/share/java/maven-3.9/bin:$PATH
  2. Add a patch to make-distribution.sh to skip Maven download:
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -166,6 +166,7 @@
   MVN="$SPARK_HOME/build/mvn"
 fi

+MVN="mvn"  # Force use of system Maven

• Explanation: The build is failing because it's trying to download and verify Maven 3.9.6, but we already have Maven 3.9.x installed in the system. By forcing the use of system Maven and properly setting up the Maven environment variables, we can bypass this download step and use the pre-installed version.

• Additional Notes:

  • The system already has Maven 3.9 installed via the package manager
  • Using system packages is preferred for reproducibility
  • This approach aligns with Wolfi's principle of using system-provided tools where possible

• References:

Contributor

octo-sts bot commented Dec 14, 2024

Gen AI suggestions to solve the build error:

• Detected Error: "patch: **** malformed patch at line 13: @@ -166,7 +166,7 @@ export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -XX:ReservedCodeCacheSize=128m}"

• Error Category: Build Configuration

• Failure Point: The "patch" step in the pipeline attempting to apply make-distribution.patch

• Root Cause Analysis: The patch file appears to be malformed or corrupted, specifically at line 13. This is typically caused by incorrect patch formatting, line endings, or copy/paste errors.

• Suggested Fix:

  1. Verify the patch file format is correct using:
dos2unix make-distribution.patch  # Convert line endings
  2. Ensure the patch follows unified diff format:
--- a/dev/make-distribution.sh
+++ b/dev/make-distribution.sh
@@ -166,7 +166,7 @@
  3. Check for proper whitespace and no extra blank lines

• Explanation: Patch files must follow strict formatting rules. The error points at a hunk header line (the line starting with @@), which patch rejects when the preceding content does not match what it expects; this is a common issue when patches are created or edited on different platforms or through copy/paste.
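
One frequent cause of a "malformed patch" error at a hunk header is a hand-edited hunk whose body no longer matches the line counts declared in its `@@ -start,count +start,count @@` header. The sketch below checks that consistency with awk. It is a rough illustration, not a replacement for `patch --dry-run`: `check_hunks` is a hypothetical helper name, and it only counts lines without comparing them to the target file.

```shell
# Hypothetical helper: verify that each hunk body has as many old/new
# lines as its "@@ -a,b +c,d @@" header declares. Exits non-zero on mismatch.
check_hunks() {
  awk '
    # Extract the count from "-166,7" or "+166,7"; a missing count means 1.
    function count(range,   parts, n) {
      n = split(substr(range, 2), parts, ",")
      return (n > 1) ? parts[2] + 0 : 1
    }
    function close_hunk() {
      if (in_hunk && (old_seen != old_want || new_seen != new_want)) {
        printf "line %d: header declares -%d/+%d lines, body has -%d/+%d\n", hunk_line, old_want, new_want, old_seen, new_seen
        bad = 1
      }
      in_hunk = 0
    }
    # A hunk ends as soon as both declared counts are satisfied.
    in_hunk && old_seen == old_want && new_seen == new_want { close_hunk() }
    /^@@ / {
      close_hunk()
      in_hunk = 1; hunk_line = NR
      old_want = count($2); new_want = count($3)
      old_seen = 0; new_seen = 0
      next
    }
    in_hunk && /^ /  { old_seen++; new_seen++; next }  # context line
    in_hunk && /^-/  { old_seen++; next }              # removed line
    in_hunk && /^\+/ { new_seen++; next }              # added line
    in_hunk && /^\\/ { next }                          # "\ No newline at end of file"
    in_hunk { close_hunk() }                           # anything else ends the hunk early
    END { close_hunk(); exit bad }
  ' "$1"
}
```

For instance, a hunk headed `@@ -166,7 +166,7 @@` that contains only one removed and one added line would be reported, because the header promises seven old and seven new lines.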

• Additional Notes:

  • Use diff -u to generate proper unified diff format
  • Verify no hidden characters were introduced
  • Consider regenerating the patch if needed:
# Run inside the Spark checkout after editing the script; keeps the a/ and b/ prefixes shown above
git diff dev/make-distribution.sh > make-distribution.patch

• References:
