Use natural language to test the behavior of your ML models
Imagine that you create an ML model to predict customer sentiment based on reviews. Once the model is deployed, you realize that it incorrectly labels certain positive reviews as negative when they are rephrased with negative words.
This is just one example of how a highly accurate machine learning model can still fail without proper testing. Testing your model for correctness and reliability before deployment is therefore crucial.
But how do you test your ML model? A straightforward approach is to write a unit test:
from textblob import TextBlob

def test_sentiment_the_same_after_paraphrasing():
    sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."
    sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

    sentiment_original = TextBlob(sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(sent_paraphrased).sentiment.polarity

    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative
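If you save this test in a file such as test_sentiment.py (the file name here is just an example), you can run it with pytest:

pytest test_sentiment.py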
This approach works, but it can be difficult for non-technical or business stakeholders to understand. Wouldn’t it be nice if you could incorporate project goals and objectives into your tests, expressed in natural language?
That’s where behave comes in handy.
Feel free to play around and fork the source code for this article here:
behave is a Python framework for behavior-driven development (BDD). BDD is a software development methodology that:
- Emphasizes collaboration between stakeholders (such as business analysts, developers, and testers)
- Allows users to define requirements and specifications for a software application
Because behave provides a common language and format for expressing requirements and specifications, it is well suited for defining and validating the behavior of machine learning models.
To install behave, type:
pip install behave
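You can check that the installation succeeded by printing the installed version:

behave --version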
Let’s use behave to perform various tests on machine learning models.
Invariance tests check whether an ML model produces consistent results under different conditions.
One example of an invariance test is checking whether a model is invariant to paraphrasing. If a model is not invariant to paraphrasing, it may misclassify a positive review as negative when that review is rephrased with negative words.
Feature File
To use behave for invariance tests, create a directory named features. In that directory, create a file called invariant_test_sentiment.feature:

└── features/
    └── invariant_test_sentiment.feature

Within the invariant_test_sentiment.feature file, we will specify the project requirements:
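Here is one way to write it (the wording mirrors the scenario that appears in the test output later in this section):

Feature: Sentiment Analysis
  As a data scientist
  I want to ensure that my model is invariant to paraphrasing
  So that my model can produce consistent results in real-world scenarios.

  Scenario: Paraphrased text
    Given a text
    When the text is paraphrased
    Then both text should have the same sentiment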
The “Given”, “When” and “Then” parts of this file present the actual steps that will be executed during the test.
Python Step Implementation
To implement the steps used in the scenarios with Python, start by creating a features/steps directory and a file named invariant_test_sentiment.py inside it:

└── features/
    ├── invariant_test_sentiment.feature
    └── steps/
        └── invariant_test_sentiment.py

The invariant_test_sentiment.py file contains the following code, which tests whether the sentiment produced by the TextBlob model is consistent between the original text and its paraphrased version.
from behave import given, then, when
from textblob import TextBlob

@given("a text")
def step_given_positive_sentiment(context):
    context.sent = "The hotel room was great! It was spacious, clean and had a nice view of the city."

@when("the text is paraphrased")
def step_when_paraphrased(context):
    context.sent_paraphrased = "The hotel room wasn't bad. It wasn't cramped, dirty, and had a decent view of the city."

@then("both text should have the same sentiment")
def step_then_sentiment_analysis(context):
    # Get sentiment of each sentence
    sentiment_original = TextBlob(context.sent).sentiment.polarity
    sentiment_paraphrased = TextBlob(context.sent_paraphrased).sentiment.polarity

    # Print sentiment
    print(f"Sentiment of the original text: {sentiment_original:.2f}")
    print(f"Sentiment of the paraphrased sentence: {sentiment_paraphrased:.2f}")

    # Assert that both sentences have the same sentiment
    both_positive = (sentiment_original > 0) and (sentiment_paraphrased > 0)
    both_negative = (sentiment_original < 0) and (sentiment_paraphrased < 0)
    assert both_positive or both_negative
Explanation of the above code:
- Steps are identified by decorators matching the feature’s predicates: given, when, and then.
- Each decorator accepts a string containing the rest of the sentence from the matching scenario step.
- The context variable lets you share values between steps.
Run the Test
To run the invariant_test_sentiment.feature test, type the following command:
behave features/invariant_test_sentiment.feature
Output:
Feature: Sentiment Analysis # features/invariant_test_sentiment.feature:1
As a data scientist
I want to ensure that my model is invariant to paraphrasing
So that my model can produce consistent results in real-world scenarios.
Scenario: Paraphrased text
Given a text
When the text is paraphrased
Then both text should have the same sentiment
Traceback (most recent call last):
assert both_positive or both_negative
AssertionError
Captured stdout:
Sentiment of the original text: 0.66
Sentiment of the paraphrased sentence: -0.38
Failing scenarios:
features/invariant_test_sentiment.feature:6 Paraphrased text
0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
2 steps passed, 1 failed, 0 skipped, 0 undefined
The output shows that the first two steps passed and the last one failed, indicating that the model is affected by paraphrasing.
A directional test checks whether a change in an input (the independent variable) affects the model’s output (the dependent variable) in a particular direction, either positive or negative.
An example of a directional test is testing whether the presence of a specific word has a positive or negative effect on the sentiment score of a given text.
To use behave for directional tests, we will create two files, directional_test_sentiment.feature and directional_test_sentiment.py:

└── features/
    ├── directional_test_sentiment.feature
    └── steps/
        └── directional_test_sentiment.py
Feature File
The code in directional_test_sentiment.feature
specifies the project requirements as follows:
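Its content could look like this (the wording matches the scenario shown in the test output below):

Feature: Sentiment Analysis with Specific Word
  As a data scientist
  I want to ensure that the presence of a specific word has a positive or negative effect on the sentiment score of a text

  Scenario: Sentiment analysis with specific word
    Given a sentence
    And the same sentence with the addition of the word 'awesome'
    When I input the new sentence into the model
    Then the sentiment score should increase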
Notice that “And” is added to the prose. Because the previous step starts with “Given”, behave treats the “And” step as another “Given” step.
Python Step Implementation
The code in directional_test_sentiment.py implements the test scenario, checking whether the presence of the word “awesome” positively affects the sentiment score generated by the TextBlob model.
from behave import given, then, when
from textblob import TextBlob

@given("a sentence")
def step_given_positive_word(context):
    context.sent = "I love this product"

@given("the same sentence with the addition of the word '{word}'")
def step_given_a_positive_word(context, word):
    context.new_sent = f"I love this {word} product"

@when("I input the new sentence into the model")
def step_when_use_model(context):
    context.sentiment_score = TextBlob(context.sent).sentiment.polarity
    context.adjusted_score = TextBlob(context.new_sent).sentiment.polarity

@then("the sentiment score should increase")
def step_then_positive(context):
    assert context.adjusted_score > context.sentiment_score
The second step uses the parameter syntax {word}. When the .feature file is executed, the value specified for {word} in the scenario is automatically passed to the corresponding step function.
This means that if the scenario states that the same sentence must include the word “awesome”, behave will automatically replace {word} with “awesome”.
This parameterization is useful when you want to try different values for the {word} parameter without changing either the .feature file or the .py file.
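For example, to test a different word, you would only edit the scenario step in the .feature file. The word 'fantastic' below is purely illustrative and not part of the original example:

And the same sentence with the addition of the word 'fantastic'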
Run the Test
behave features/directional_test_sentiment.feature
Output:
Feature: Sentiment Analysis with Specific Word
As a data scientist
I want to ensure that the presence of a specific word has a positive or negative effect on the sentiment score of a text
Scenario: Sentiment analysis with specific word
Given a sentence
And the same sentence with the addition of the word 'awesome'
When I input the new sentence into the model
Then the sentiment score should increase

1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 0 skipped
4 steps passed, 0 failed, 0 skipped, 0 undefined
Since all the steps passed, we can infer that the sentiment score increases due to the presence of the new word.
Minimum functionality testing is a type of testing that verifies whether the system or product meets the minimum requirements and is functional for its intended use.
An example of minimal functionality testing is checking if the model can handle different types of inputs, such as numeric, categorical, or textual data.
To use a minimum functionality test for input validation, create two files, minimum_func_test_input.feature and minimum_func_test_input.py:

└── features/
    ├── minimum_func_test_input.feature
    └── steps/
        └── minimum_func_test_input.py
Feature File
The code in minimum_func_test_input.feature
specifies the project requirements as follows:
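One possible version of this file, mirroring the scenarios that appear in the test output below:

Feature: Test my_ml_model

  Scenario: Test integer input
    Given I have an integer input of 42
    When I run the model
    Then the output should be an array of one number

  Scenario: Test float input
    Given I have a float input of 3.14
    When I run the model
    Then the output should be an array of one number

  Scenario: Test list input
    Given I have a list input of [1, 2, 3]
    When I run the model
    Then the output should be an array of three numbers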
Python Step Implementation
The code in minimum_func_test_input.py
implements the requirements, checking whether the output generated by predict
for a specific input type meets expectations.
from behave import given, then, when
import numpy as np
from sklearn.linear_model import LinearRegression
from typing import Union

def predict(input_data: Union[int, float, str, list]):
    """Train a simple linear model and predict the output for the input data"""
    # Reshape the input data
    if isinstance(input_data, (int, float, list)):
        input_array = np.array(input_data).reshape(-1, 1)
    else:
        raise ValueError("Input type not supported")

    # Create a linear regression model
    model = LinearRegression()

    # Train the model on a sample dataset
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([2, 4, 6, 8, 10])
    model.fit(X, y)

    # Predict the output using the input array
    return model.predict(input_array)

@given("I have an integer input of {input_value}")
def step_given_integer_input(context, input_value):
    context.input_value = int(input_value)

@given("I have a float input of {input_value}")
def step_given_float_input(context, input_value):
    context.input_value = float(input_value)

@given("I have a list input of {input_value}")
def step_given_list_input(context, input_value):
    # Convert the string representation of the list (e.g., "[1, 2, 3]") into a Python list
    context.input_value = eval(input_value)

@when("I run the model")
def step_when_run_model(context):
    context.output = predict(context.input_value)

@then("the output should be an array of one number")
def step_then_check_single_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 1

@then("the output should be an array of three numbers")
def step_then_check_triple_output(context):
    assert isinstance(context.output, np.ndarray)
    assert all(isinstance(x, (int, float)) for x in context.output)
    assert len(context.output) == 3
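As a quick sanity check outside of behave, you can also call predict directly. The expected values below assume the toy training data above, where y is always twice X:

print(predict(42))        # approximately [84.]
print(predict([1, 2, 3])) # approximately [2. 4. 6.]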
Run the Test
behave features/minimum_func_test_input.feature
Output:
Feature: Test my_ml_model

Scenario: Test integer input
Given I have an integer input of 42
When I run the model
Then the output should be an array of one number
Scenario: Test float input
Given I have a float input of 3.14
When I run the model
Then the output should be an array of one number
Scenario: Test list input
Given I have a list input of [1, 2, 3]
When I run the model
Then the output should be an array of three numbers
1 feature passed, 0 failed, 0 skipped
3 scenarios passed, 0 failed, 0 skipped
9 steps passed, 0 failed, 0 skipped, 0 undefined
Since all the steps passed, we can conclude that the model’s outputs match our expectations.
This section describes some drawbacks of using behave compared to pytest and explains why the tool may still be worth considering.
Learning Curve
Using behavior-driven development (BDD) with behave can involve a steeper learning curve than the more traditional testing approach used by pytest.
Counterargument: The focus on collaboration in BDD can lead to better alignment between business requirements and software development, resulting in a more efficient development process overall.
Slower Performance
behave tests can be slower than pytest tests because behave must parse the feature files and map them to step definitions before running the tests.
Counterargument: behave’s well-defined steps can lead to tests that are easier to understand and modify, reducing the overall effort required for test maintenance.
Less Flexibility
behave is more rigid in its syntax, while pytest allows more flexibility in defining tests and fixtures.
Counterargument: behave’s rigid structure can help ensure consistency and readability across tests, making them easier to understand and maintain over time.
Summary
Although behave has some drawbacks compared to pytest, its focus on collaboration, well-defined steps, and structured approach can still make it a valuable tool for development teams.
Congratulations! You have just learned how to use behave to test machine learning models. I hope this knowledge helps you create more understandable tests.