When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations may need to consider the external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects require specialized dependencies and libraries that aren't included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through every stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.
SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:
- Over 300 built-in transformation steps
- Feature engineering capabilities
- Data standardization and cleaning functions
- A custom code editor supporting Python, PySpark, and SparkSQL
In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler data flow. Using this approach, you can run custom scripts that rely on modules not natively supported by SageMaker Canvas.
Solution overview
To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.
The solution follows three main steps:
- Upload custom scripts and dependencies to Amazon S3
- Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code
- Train and export the model
The following diagram illustrates the solution architecture.
In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various delivery metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical patterns and features.
Prerequisites
As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don't already have a SageMaker domain configured in your account, you also need permissions to create a SageMaker AI domain.
Create the data flow
To create the data flow, follow these steps:
- On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven't done so already.
- After your domain is created, choose Open Canvas.
- In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.
The initial data flow will open with one data source and one data type.
- In the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv.
- Choose Next, as shown in the following screenshot. Then choose Import.
- After both datasets have been added, select the plus sign. In the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.
- To perform an inner join on the ProductId column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.
- After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.
The dataset contains the XShippingDistance (Long) and YShippingDistance (Long) columns. For our purposes, we want to use a custom function that computes the total distance from the X and Y coordinates and then drops the individual coordinate columns. For this example, we calculate the total distance using a function that relies on the mpmath library.
- To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:
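The original code block isn't reproduced here; the following is a minimal sketch of such a Python (Pandas) custom transform. It assumes the joined dataset exposes the XShippingDistance (Long) and YShippingDistance (Long) columns mentioned earlier and that the editor provides the current dataframe as df; adjust the column names to match your dataset.

```python
# A minimal sketch of the Python (Pandas) custom transform, assuming the
# column names XShippingDistance (Long) and YShippingDistance (Long) and
# that Data Wrangler exposes the current dataframe as `df`.
from mpmath import mp, sqrt  # mpmath is not bundled with SageMaker Canvas

mp.dps = 50  # working precision for the distance calculation

def calculate_total_distance(df, x_col="XShippingDistance (Long)",
                             y_col="YShippingDistance (Long)"):
    # Euclidean distance computed from the X and Y shipping coordinates
    df["TotalDistance"] = [
        float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))
        for x, y in zip(df[x_col], df[y_col])
    ]
    # Drop the individual coordinate columns once the total is computed
    return df.drop(columns=[x_col, y_col])

df = calculate_total_distance(df)
```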
Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.
This error occurs because mpmath isn't a module natively supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the custom function differently.
Zip the script and dependencies
To use a function that relies on a module not natively supported in Canvas, the custom script must be zipped together with the modules it depends on. For this example, we use our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.
The script.py file contains two functions: one that is compatible with the Python (Pandas) runtime (calculate_total_distance), and one that is compatible with the Python (PySpark) runtime (udf_total_distance).
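The contents of script.py aren't shown in this post; the following is a minimal sketch under the same assumptions as before. The column names and the function signatures (each function receives the Data Wrangler dataframe and returns the transformed dataframe) are illustrative assumptions.

```python
# script.py -- a minimal sketch; column names and function signatures are
# assumptions for illustration.
from mpmath import mp, sqrt

mp.dps = 50
X_COL = "XShippingDistance (Long)"
Y_COL = "YShippingDistance (Long)"


def calculate_total_distance(df):
    """Variant for the Python (Pandas) runtime."""
    df["TotalDistance"] = [
        float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))
        for x, y in zip(df[X_COL], df[Y_COL])
    ]
    return df.drop(columns=[X_COL, Y_COL])


def udf_total_distance(df):
    """Variant for the Python (PySpark) runtime, expressed as a Spark UDF."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType

    @udf(returnType=DoubleType())
    def total_distance(x, y):
        return float(sqrt(mp.mpf(x) ** 2 + mp.mpf(y) ** 2))

    return (
        df.withColumn("TotalDistance", total_distance(col(X_COL), col(Y_COL)))
          .drop(X_COL, Y_COL)
    )
```

Keeping both runtime variants in one file lets the same .zip archive serve either editor mode.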
To ensure that the script can run, install mpmath in the same directory as script.py by running pip install mpmath (with a plain pip install, you may need to target the directory explicitly, for example using pip's --target option).
Run zip -r my_project.zip . to create a .zip file containing the script and the mpmath installation. The current directory now contains the .zip file, our Python script, and the installation our script depends on, as shown in the following screenshot.
Upload to Amazon S3
After creating the .zip file, upload it to an Amazon S3 bucket.
After the .zip file has been uploaded to Amazon S3, it can be accessed from SageMaker Canvas.
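For example, you can perform the upload with a few lines of boto3; the bucket name below is a placeholder, so substitute your own bucket and key.

```python
# Upload the packaged script and dependencies to Amazon S3.
# "amzn-s3-demo-bucket" is a placeholder; use your own bucket name.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="my_project.zip",
    Bucket="amzn-s3-demo-bucket",
    Key="my_project.zip",
)
```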
Run the custom script
Return to the data flow in SageMaker Canvas, replace the previous custom function code with the following code, and choose Update.
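The replacement code isn't reproduced here; the following is a minimal sketch of what such a custom transform could look like, assuming a placeholder bucket name and object key and the script.py layout sketched earlier.

```python
# A minimal sketch of the replacement custom transform. The bucket name and
# key are placeholders; point them at the .zip file you uploaded.
import importlib
import sys
import zipfile

import boto3

bucket = "amzn-s3-demo-bucket"               # placeholder bucket name
key = "my_project.zip"                       # placeholder object key
local_zip = "/tmp/my_project.zip"
extract_dir = "/tmp/my_project"
function_name = "calculate_total_distance"   # use "udf_total_distance" for Python (PySpark)

# Download the archive from Amazon S3 and unzip it locally
boto3.client("s3").download_file(bucket, key, local_zip)
with zipfile.ZipFile(local_zip, "r") as zf:
    zf.extractall(extract_dir)

# Add the unzipped script and its bundled dependencies (mpmath) to the local path
sys.path.insert(0, extract_dir)
script = importlib.import_module("script")

# Apply the custom transformation to the Data Wrangler dataframe
df = getattr(script, function_name)(df)
```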
This example code unzips the .zip file and adds the required modules to the local path so that they're available to the function at runtime. Because mpmath was added to the local path, you can now call a function that relies on this external library.
The preceding code runs using the Python (Pandas) runtime and the calculate_total_distance function. To use the Python (PySpark) runtime, update the function_name variable to call the udf_total_distance function instead.
Complete the data flow
As a final step, drop irrelevant columns before training the model. Follow these steps:
- In the SageMaker Canvas console, select + Add transform. In the dropdown menu, select Manage columns.
- Under Transform, choose Drop column. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.
The final dataset should contain 13 columns. The complete data flow is shown in the following image.
Train the model
To train the model, follow these steps:
- In the top right of the page, select Create model, and name your dataset and model.
- Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the screenshot below.
When building the model, you can choose to run a quick build or a standard build. A quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A standard build prioritizes accuracy over latency, but the model takes longer to train.
Results
After the model build is complete, you can view the model's accuracy, along with metrics such as F1, precision, and recall. Using a standard build, the model achieved 94.5% accuracy.
After completing model training, there are four ways in which you can use your model:
- Deploy the model directly from SageMaker Canvas to an endpoint
- Add the model to the SageMaker Model Registry
- Export your model to a Jupyter notebook
- Send your model to Amazon QuickSight for use in dashboard visualizations
Clean up
To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you're done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.
If you created an S3 bucket specifically for this example, you may also want to empty and delete the bucket.
Summary
In this post, we demonstrated how custom dependencies stored in Amazon S3 can be loaded and integrated into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:
- Package custom code and dependencies into a .zip file
- Store and access these dependencies from Amazon S3
- Implement custom data transformations in SageMaker Data Wrangler
- Train a predictive model using the transformed data
This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 built-in functions.
To try out custom transformations, see the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring these related posts:
About the author
Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and machine learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.