When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations may need to consider the external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects require specialized dependencies and libraries that aren't included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through every stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.
SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:
- Over 300 built-in transformation steps
- Feature engineering capabilities
- Data standardization and cleaning functions
- A custom code editor supporting Python, PySpark, and SparkSQL
In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler data flow. Using this approach, you can run custom scripts that rely on modules not natively supported by SageMaker Canvas.
Solution overview
To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.
The solution follows three main steps:
- Upload custom scripts and dependencies to Amazon S3
- Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code
- Train and export the model
The following diagram illustrates the solution architecture.
In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various delivery metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical patterns and features.
Prerequisites
As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don't already have a SageMaker domain configured in your account, you also need permissions to create a SageMaker AI domain.
Create the data flow
To create the data flow, follow these steps:
- On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven't done so already.
- After your domain is created, choose Open Canvas.
- In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.
The initial data flow will open with one data source and one data type.
- In the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv.
- Choose Next, as shown in the following screenshot. Then choose Import.
- After both datasets have been added, select the plus sign. In the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.
- To perform an inner join on the ProductId column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.
- After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.
The dataset contains the XShippingDistance (Long) and YShippingDistance (Long) columns. For our purposes, we want to use a custom function that computes the total distance from the X and Y coordinates and then drops the individual coordinate columns. For this example, we calculate the total distance using a function that relies on the mpmath library.
- To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:
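The original code block isn't reproduced here; the following is a minimal sketch of such a Python (Pandas) custom transform. It assumes the joined dataset exposes the XShippingDistance (Long) and YShippingDistance (Long) columns mentioned earlier and that the editor provides the current dataframe as df; adjust the column names to match your dataset.

```python
# A minimal sketch of the Python (Pandas) custom transform, assuming the
# column names XShippingDistance (Long) and YShippingDistance (Long) and
# that Data Wrangler exposes the current dataframe as `df`.
from mpmath import mp, sqrt  # mpmath is not bundled with SageMaker Canvas

mp.dps = 50  # working precision for the distance calculation

def calculate_total_distance(df, x_col="XShippingDistance (Long)",
                             y_col="YShippingDistance (Long)"):
    # Euclidean distance computed from the X and Y shipping coordinates
    df["TotalDistance"] = [
        float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))
        for x, y in zip(df[x_col], df[y_col])
    ]
    # Drop the individual coordinate columns once the total is computed
    return df.drop(columns=[x_col, y_col])

df = calculate_total_distance(df)
```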
Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.
This error occurs because mpmath isn't a module natively supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the custom function differently.
Zip the script and dependencies
To use a function that relies on a module not natively supported in Canvas, the custom script must be zipped together with the modules it depends on. For this example, we use our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.
The script.py file contains two functions: one that is compatible with the Python (Pandas) runtime (calculate_total_distance), and one that is compatible with the Python (PySpark) runtime (udf_total_distance).
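The contents of script.py aren't shown in this post; the following is a minimal sketch under the same assumptions as before. The column names and the function signatures (each function receives the Data Wrangler dataframe and returns the transformed dataframe) are illustrative assumptions.

```python
# script.py -- a minimal sketch; column names and function signatures are
# assumptions for illustration.
from mpmath import mp, sqrt

mp.dps = 50
X_COL = "XShippingDistance (Long)"
Y_COL = "YShippingDistance (Long)"


def calculate_total_distance(df):
    """Variant for the Python (Pandas) runtime."""
    df["TotalDistance"] = [
        float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))
        for x, y in zip(df[X_COL], df[Y_COL])
    ]
    return df.drop(columns=[X_COL, Y_COL])


def udf_total_distance(df):
    """Variant for the Python (PySpark) runtime, expressed as a Spark UDF."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType

    @udf(returnType=DoubleType())
    def total_distance(x, y):
        return float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))

    return (
        df.withColumn("TotalDistance", total_distance(col(X_COL), col(Y_COL)))
          .drop(X_COL, Y_COL)
    )
```

Keeping both runtime variants in one file lets the same .zip archive serve either editor mode.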
To ensure that the script can run, install mpmath in the same directory as script.py by running pip install mpmath (with a plain pip install, you may need to target the directory explicitly, for example using pip's --target option).
Run zip -r my_project.zip . to create a .zip file containing the script and the mpmath installation. The current directory now contains the .zip file, our Python script, and the installation our script depends on, as shown in the following screenshot.
Upload to Amazon S3
After creating the .zip file, upload it to an Amazon S3 bucket.
After the .zip file has been uploaded to Amazon S3, it can be accessed from SageMaker Canvas.
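For example, you can perform the upload with a few lines of boto3; the bucket name below is a placeholder, so substitute your own bucket and key.

```python
# Upload the packaged script and dependencies to Amazon S3.
# "amzn-s3-demo-bucket" is a placeholder; use your own bucket name.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="my_project.zip",
    Bucket="amzn-s3-demo-bucket",
    Key="my_project.zip",
)
```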
Run the custom script
Return to the data flow in SageMaker Canvas, replace the previous custom function code with the following code, and choose Update.
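The replacement code isn't reproduced here; the following is a minimal sketch of what such a custom transform could look like, assuming a placeholder bucket name and object key and the script.py layout sketched earlier.

```python
# A minimal sketch of the replacement custom transform. The bucket name and
# key are placeholders; point them at the .zip file you uploaded.
import importlib
import sys
import zipfile

import boto3

bucket = "amzn-s3-demo-bucket"               # placeholder bucket name
key = "my_project.zip"                       # placeholder object key
local_zip = "/tmp/my_project.zip"
extract_dir = "/tmp/my_project"
function_name = "calculate_total_distance"   # use "udf_total_distance" for Python (PySpark)

# Download the archive from Amazon S3 and unzip it locally
boto3.client("s3").download_file(bucket, key, local_zip)
with zipfile.ZipFile(local_zip, "r") as zf:
    zf.extractall(extract_dir)

# Add the unzipped script and its bundled dependencies (mpmath) to the local path
sys.path.insert(0, extract_dir)
script = importlib.import_module("script")

# Apply the custom transformation to the Data Wrangler dataframe
df = getattr(script, function_name)(df)
```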
This example code unzips the .zip file and adds the required modules to the local path so that they're available to the function at runtime. Because mpmath was added to the local path, you can now call a function that relies on this external library.
The preceding code runs using the Python (Pandas) runtime and the calculate_total_distance function. To use the Python (PySpark) runtime, update the function_name variable to call the udf_total_distance function instead.
Complete the data flow
As a final step, drop irrelevant columns before training the model. Follow these steps:
- In the SageMaker Canvas console, select + Add transform. In the dropdown menu, select Manage columns.
- Under Transform, choose Drop column. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.
The final dataset should contain 13 columns. The complete data flow is shown in the following image.
Train the model
To train the model, follow these steps:
- In the top right of the page, select Create model, and name your dataset and model.
- Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the screenshot below.
When building the model, you can choose to run a quick build or a standard build. A quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A standard build prioritizes accuracy over latency, but the model takes longer to train.
Results
After the model build is complete, you can view the model's accuracy, along with metrics such as F1, precision, and recall. Using a standard build, the model achieved 94.5% accuracy.
After completing model training, there are four ways in which you can use your model:
- Deploy the model directly from SageMaker Canvas to an endpoint
- Add the model to the SageMaker Model Registry
- Export your model to a Jupyter notebook
- Send your model to Amazon QuickSight for use in dashboard visualizations
Clean up
To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you're done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.
If you created an S3 bucket specifically for this example, you may also want to empty and delete the bucket.
Summary
In this post, we demonstrated how custom dependencies stored in Amazon S3 can be loaded and integrated into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:
- Package custom code and dependencies into a .zip file
- Store and access these dependencies from Amazon S3
- Implement custom data transformations in SageMaker Data Wrangler
- Train a predictive model using the transformed data
This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 built-in functions.
To try out custom transformations, see the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring these related posts:
About the author
Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and machine learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.