Easily integrate LLM into your Scikit-learn workflow with Scikit-LLM

Image generated by DALL-E 2

Text analysis tasks have been around for a long time because the need for them never goes away. Research has come a long way, from simple descriptive statistics to text classification and advanced text generation. With large language models (LLMs) added to our toolkit, these tasks have become even more accessible.

Scikit-LLM is a Python package developed for text analysis with the power of LLMs. It stands out because it lets us integrate LLMs into the standard Scikit-Learn workflow.

So what is this package all about and how does it work? Let's get into it.

Scikit-LLM is a Python package to enhance text data analysis tasks through LLMs. It was developed by Beastbyte to bring together the standard Scikit-Learn library and the power of language models. Scikit-LLM designed its API to resemble Scikit-Learn's, so it should feel familiar to use.

Installation

To use the package, we need to install it. To do that, you can use the following command.

pip install scikit-llm

At the time of writing, Scikit-LLM only supports some of the OpenAI and GPT4ALL models. That's why we will work only with the OpenAI model here. However, you can use the GPT4ALL models by installing the extra component first.

pip install scikit-llm[gpt4all]
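To confirm that the install worked before going further, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and it assumes the distribution is registered under the name `scikit-llm` used in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "scikit-llm" is the distribution name used in the pip command
found = installed_version("scikit-llm")
print(found if found else "scikit-llm is not installed in this environment")
```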

After installation, you must set your OpenAI API key and organization ID to access the LLM models.

from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("")
SKLLMConfig.set_openai_org("")
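Hard-coding keys in a script makes them easy to leak, so one option is to read them from environment variables instead. The sketch below is an illustration, not part of Scikit-LLM: the variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` and both helper functions are assumptions chosen for this example. Only the two `SKLLMConfig` calls come from the snippet above.

```python
import os


def load_openai_credentials(env=None):
    """Read the key and organization ID from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")   # assumed variable name
    org = env.get("OPENAI_ORG_ID")    # assumed variable name
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key, org


def configure_skllm(env=None):
    """Pass the credentials to Scikit-LLM using the same calls as above."""
    # Imported here so load_openai_credentials works even without skllm installed
    from skllm.config import SKLLMConfig

    key, org = load_openai_credentials(env)
    SKLLMConfig.set_openai_key(key)
    if org:
        SKLLMConfig.set_openai_org(org)
```

Calling `configure_skllm()` once at the start of a script would then replace the two hard-coded calls.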

Testing Scikit-LLM

Let's test some of Scikit-LLM's capabilities in the configured environment. One thing LLMs can do is classify text without any retraining, which is known as zero-shot classification. We will start by fitting the zero-shot classifier on the labeled sample data.

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Labels: positive, neutral, negative
X, y = get_classification_dataset()

# Initialize the model with GPT-3.5
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

You only need to supply the text data as X and the labels as y. In this case, the labels are sentiments: positive, neutral, or negative.
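Because the sample data comes with labels, the predictions can be compared against `y`. The sketch below computes accuracy in plain Python; the two short lists are placeholders made up for illustration, standing in for `y` and `labels` from the snippet above, and `accuracy` is a hypothetical helper name.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty set of labels")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Placeholder values; in practice pass y and labels from the classifier run
example_true = ["positive", "negative", "neutral", "positive"]
example_pred = ["positive", "negative", "positive", "positive"]
print(f"Accuracy: {accuracy(example_true, example_pred):.2f}")  # prints "Accuracy: 0.75"
```

Scikit-Learn's `accuracy_score` computes the same quantity if you prefer a library call.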

As you can see, the process mirrors the fit and predict methods in Scikit-Learn. However, zero-shot classification does not actually require a training data set. That's why we can provide just the labels, without any training data.

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier()
# No training texts are needed; only the candidate labels are passed
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)
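To see why the label list alone is enough, it helps to picture the kind of prompt a zero-shot classifier can send to the model. The sketch below is not Scikit-LLM's actual prompt or code; `build_zero_shot_prompt` and `parse_label` are hypothetical helpers that only illustrate the idea of listing the candidate labels and asking the model to pick one.

```python
def build_zero_shot_prompt(text, candidate_labels):
    """Build an instruction that asks the model to choose one of the labels."""
    label_list = ", ".join(candidate_labels)
    return (
        "Classify the text into exactly one of these categories: "
        f"{label_list}.\n"
        f"Text: {text}\n"
        "Answer with the category name only."
    )


def parse_label(response, candidate_labels, default=None):
    """Map a raw model reply back onto a known label, ignoring case and spacing."""
    cleaned = response.strip().strip(".").lower()
    for label in candidate_labels:
        if label.lower() == cleaned:
            return label
    return default


demo_labels = ["positive", "negative", "neutral"]
print(build_zero_shot_prompt("The movie was great.", demo_labels))
print(parse_label(" Positive. ", demo_labels))  # prints "positive"
```

Nothing in this flow depends on example texts, which is why only the labels have to be supplied up front.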

The same approach extends to multi-label classification, as in the code below.

from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

X, _ = get_multilabel_classification_dataset()

candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety",
    "Customer Support",
    "Packaging",
]

# max_labels caps how many labels can be assigned to each text
clf = MultiLabelZeroShotGPTClassifier(max_labels=4)
clf.fit(None, [candidate_labels])
labels = clf.predict(X)
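Multi-label predictions are often easier to analyze as a 0/1 indicator table with one column per candidate label. The sketch below assumes each prediction is a list of label strings; that output shape is an assumption, so check what your installed version returns. `to_indicator_rows` and the sample predictions are made up for illustration.

```python
def to_indicator_rows(predictions, candidate_labels):
    """Turn per-text label lists into rows of 0/1 flags, one flag per candidate label."""
    rows = []
    for predicted in predictions:
        chosen = set(predicted)
        rows.append([1 if label in chosen else 0 for label in candidate_labels])
    return rows


demo_candidates = ["Quality", "Price", "Delivery"]
# Placeholder predictions, not real model output
demo_predictions = [["Quality", "Price"], ["Delivery"], []]

for row in to_indicator_rows(demo_predictions, demo_candidates):
    print(row)
# prints [1, 1, 0], then [0, 0, 1], then [0, 0, 0]
```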

The most useful thing about Scikit-LLM is that it lets us bring the power of LLMs into the typical Scikit-Learn workflow.

Scikit-LLM in the machine learning pipeline

In the following example, I will show how we can use Scikit-LLM as a vectorizer and XGBoost as the classifier. We will also wrap both steps in a single model pipeline.

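First, we load the data, split it into training and test sets, and use a label encoder to transform the labels into numeric values.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X, y = get_classification_dataset()

# Hold out a test set; the 80/20 split and the seed are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)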
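For readers new to `LabelEncoder`, the short sketch below shows what the encoding step does on a handful of made-up labels. The variable names are chosen for this example only; the encoder assigns integer codes in sorted label order and can map them back afterwards.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_codes = demo_encoder.fit_transform(["positive", "negative", "neutral", "positive"])

# Classes are stored in sorted order, and each label maps to its index
print(demo_encoder.classes_.tolist())  # ['negative', 'neutral', 'positive']
print(demo_codes.tolist())             # [2, 0, 1, 2]

# inverse_transform recovers the original strings from the codes
print(demo_encoder.inverse_transform([0, 2]).tolist())  # ['negative', 'positive']
```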

Next, we define a pipeline that performs vectorization and model fitting. We can do it with the following code.

from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer

steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)

# Fit the pipeline on the training data
clf.fit(X_train, y_train_enc)

Finally, we can make predictions with the following code.

pred_enc = clf.predict(X_test)
preds = le.inverse_transform(pred_enc)
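The same pipeline pattern can be tried offline, without an API key, by swapping in local components. In the sketch below `TfidfVectorizer` stands in for `GPTVectorizer` and `LogisticRegression` stands in for `XGBClassifier`; these substitutions and the toy sentences are assumptions made for illustration, not the setup used in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy data made up for this sketch
toy_X_train = [
    "I loved this movie, it was wonderful",
    "Great acting and a wonderful story",
    "Terrible plot and awful pacing",
    "I hated it, truly awful",
]
toy_y_train = ["positive", "positive", "negative", "negative"]
toy_X_test = ["A wonderful film", "An awful film"]

# Encode string labels as integers, as in the article's pipeline
toy_le = LabelEncoder()
toy_y_train_enc = toy_le.fit_transform(toy_y_train)

offline_clf = Pipeline(
    [
        ("vec", TfidfVectorizer()),     # stand-in for GPTVectorizer
        ("clf", LogisticRegression()),  # stand-in for XGBClassifier
    ]
)
offline_clf.fit(toy_X_train, toy_y_train_enc)

# Predict encoded classes, then map them back to the original strings
offline_preds = toy_le.inverse_transform(offline_clf.predict(toy_X_test))
print(offline_preds.tolist())
```

Once this runs, replacing the two stand-in steps with the LLM vectorizer and XGBoost is a change to the `steps` list only; the fit and predict calls stay the same.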

As we can see, Scikit-LLM and XGBoost work together inside a standard Scikit-Learn pipeline. Combining packages this way lets us pair LLM-based text vectors with a conventional classifier.

There are several other tasks you can perform with Scikit-LLM, including model fine-tuning; I suggest checking the documentation for more information. You can also use the open-source GPT4ALL models if necessary.

Scikit-LLM is a Python package that brings LLMs to text data analysis tasks in Scikit-Learn. In this article, we discussed how to use Scikit-LLM for text classification and how to combine it with other models in a machine learning pipeline.

Cornellius Yudha Wijaya is an assistant data science manager and data writer. While working full-time at Allianz Indonesia, he loves sharing Python and data tips through social media and writing.
