Image generated by DALL-E 2
Text analysis tasks have been around for a long time because the need for them never goes away. The field has come a long way, from simple descriptive statistics to text classification and advanced text generation. With large language models (LLMs) added to our toolbox, these tasks have become even more accessible.
Scikit-LLM is a Python package for text analysis tasks powered by LLMs. It stands out because it integrates with the standard Scikit-Learn workflow.
So what is this package all about and how does it work? Let's get into it.
Scikit-LLM is a Python package that enhances text data analysis tasks with LLMs. It was developed by BeastByte to bridge the standard Scikit-Learn library and the power of language models. Its API is designed to resemble Scikit-Learn's, so it should feel familiar to use.
Installation
To use the package, we first need to install it. We can do that with pip.
pip install scikit-llm
At the time of writing, Scikit-LLM only supports some of the OpenAI and GPT4ALL models, which is why we will only work with an OpenAI model here. However, you can use a GPT4ALL model by first installing the extra component.
pip install "scikit-llm[gpt4all]"
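After installation, you must configure your OpenAI key to access the LLM models.
from skllm.config import SKLLMConfig
# Replace the placeholders with your own OpenAI API key and organization ID
SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")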
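Hard-coding the key in a script makes it easy to leak by accident. One alternative is to read the credentials from environment variables. The sketch below assumes that setup; the variable names OPENAI_API_KEY and OPENAI_ORG_ID and the helper load_openai_credentials are assumptions made for this example and are not part of Scikit-LLM.

```python
import os


def load_openai_credentials(env=None):
    """Read the OpenAI key and organization from environment variables.

    The variable names are an assumption for this sketch; use whatever
    names your own environment defines.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    org = env.get("OPENAI_ORG_ID", "")
    if not key:
        raise ValueError("OPENAI_API_KEY is not set")
    return key, org


# Usage with Scikit-LLM (commented out so the sketch runs without the package):
# from skllm.config import SKLLMConfig
# key, org = load_openai_credentials()
# SKLLMConfig.set_openai_key(key)
# SKLLMConfig.set_openai_org(org)
```

Passing a plain dictionary as env makes the helper easy to try out without touching the real environment.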
Testing Scikit-LLM
Let's test some of Scikit-LLM's capabilities with the configured environment. One thing LLMs can do is classify text without any retraining, which is known as zero-shot classification. We will first try it on the sample data, using the familiar fit and predict calls.
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset
# Load the demo sentiment dataset; labels: positive, neutral, negative
X, y = get_classification_dataset()
# Initialize the classifier with GPT-3.5
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
# No retraining happens here; the classifier takes its candidate labels from y
clf.fit(X, y)
labels = clf.predict(X)
We only need to pass the text data as X and the labels as y. In this case, the labels are sentiments: positive, neutral, or negative.
As you can see, the process mirrors the fit and predict methods in Scikit-Learn. However, zero-shot classification does not actually require a training dataset, so we can pass only the candidate labels and no training data.
X, _ = get_classification_dataset()
clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)
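To see why no training text is needed, it helps to picture what a zero-shot request might look like: the candidate labels go into the prompt, and the model is asked to pick one of them. The sketch below is illustrative only. The function build_zero_shot_prompt and the prompt wording are assumptions made for this example and do not reproduce Scikit-LLM's internals.

```python
def build_zero_shot_prompt(text, candidate_labels):
    """Build a classification prompt from the text and the candidate labels.

    Hypothetical helper: the wording is made up for illustration.
    """
    labels = ", ".join(candidate_labels)
    return (
        "Classify the text into exactly one of these labels: "
        f"{labels}.\n"
        f"Text: {text}\n"
        "Label:"
    )


prompt = build_zero_shot_prompt(
    "The delivery was late and the box was damaged.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

Because the labels are the only task-specific input, how they are worded can matter a great deal for the quality of the predictions.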
The same approach extends to multi-label classification, as shown in the code below.
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset
X, _ = get_multilabel_classification_dataset()
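candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety",
    "Customer Support",
    "Packaging",
]
# Allow at most four labels per text
clf = MultiLabelZeroShotGPTClassifier(max_labels=4)
# For multi-label data, y is a list of label lists, so wrap the candidates in a list
clf.fit(None, [candidate_labels])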
labels = clf.predict(X)
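Once the predictions come back, a quick tally shows which topics dominate the reviews. The sketch below assumes each prediction is a list of label strings. The sample_predictions values are made up for illustration and stand in for the real output of predict.

```python
from collections import Counter

# Hypothetical predictions standing in for `labels` from clf.predict(X);
# assumes each item is a list of label strings.
sample_predictions = [
    ["Quality", "Price"],
    ["Delivery", "Packaging"],
    ["Quality", "Customer Support", "Service"],
]

# Flatten the per-review label lists and count each label.
label_counts = Counter(label for row in sample_predictions for label in row)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```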
The great thing about Scikit-LLM is that it lets us bring the power of LLMs into a typical Scikit-Learn pipeline.
Scikit-LLM in the Machine Learning Pipeline
In the following example, I will show how to use Scikit-LLM as a vectorizer and XGBoost as the classifier, with both steps wrapped in a pipeline.
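First, we load the data, split it into training and test sets, and set up the label encoder to transform the labels into numeric values.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
X, y = get_classification_dataset()
# Hold out a test set (the 80/20 stratified split is an assumed choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)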
Next, we define a pipeline that performs the vectorization and fits the model. We can do it with the following code.
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer
steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)
# Fit the pipeline on the training data
clf.fit(X_train, y_train_enc)
Finally, we can make predictions with the following code.
pred_enc = clf.predict(X_test)
preds = le.inverse_transform(pred_enc)
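To judge the predictions, we would compare them with the held-out labels. A minimal, dependency-free sketch follows. The accuracy helper and the sample values are assumptions made for illustration; in practice, sklearn.metrics offers ready-made functions such as accuracy_score.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels standing in for y_test and preds.
y_test_sample = ["positive", "negative", "neutral", "negative"]
preds_sample = ["positive", "negative", "negative", "negative"]

print(f"Accuracy: {accuracy(y_test_sample, preds_sample):.2f}")
```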
As we can see, Scikit-LLM and XGBoost work together inside a Scikit-Learn pipeline. Combining these packages lets us pair an LLM-based text representation with a conventional classifier.
There are still several other tasks you can perform with Scikit-LLM, including model tuning; I suggest checking the documentation for more information. You can also use the open-source GPT4ALL models if necessary.
Scikit-LLM is a Python package that powers Scikit-Learn text analysis tasks with LLMs. In this article, we discussed how to use Scikit-LLM for text classification and how to combine it with other tools in a machine learning pipeline.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves sharing Python and data tips through social media and writing media.