In this tutorial, we will build an efficient legal AI chatbot using open-source tools. It provides a step-by-step guide to creating a chatbot with the bigscience/T0pp LLM, Hugging Face Transformers, and PyTorch. We will walk through setting up the model, optimizing performance with PyTorch, and building an efficient, accessible AI-powered legal assistant.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "bigscience/T0pp" # Open-source and available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
First, we load bigscience/T0pp, an open-source LLM, using Hugging Face Transformers. The code initializes a tokenizer for text preprocessing and loads the model with AutoModelForSeq2SeqLM, which lets it perform text generation tasks such as answering legal queries.
import spacy
import re
nlp = spacy.load("en_core_web_sm")
def preprocess_legal_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text) # Remove extra spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text) # Remove special characters
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop] # Lemmatization, dropping stop words
    return " ".join(tokens)
sample_text = "The contract is valid for 5 years, terminating on December 31, 2025."
print(preprocess_legal_text(sample_text))
Next, we preprocess legal text using spaCy and regular expressions to produce cleaner, more structured input for NLP tasks. The function first converts the text to lowercase, collapses extra whitespace, and removes special characters using regex, then tokenizes and lemmatizes the text with the spaCy NLP pipeline. It also filters out stop words so that only meaningful terms remain. The cleaned text is a more compact input for machine learning and language models such as bigscience/T0pp, which can help improve the accuracy of the chatbot's legal responses.
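To make the regex stage concrete, here is a minimal sketch of only the lowercasing and regex cleanup, without the spaCy lemmatization and stop-word filtering. The helper name `clean_text` is an assumption introduced for illustration and is not part of the tutorial's code.

```python
import re

def clean_text(text):
    """Hypothetical helper: the regex-only part of the preprocessing step."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace into one space
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # keep only letters, digits, and whitespace
    return text.strip()

cleaned = clean_text("The contract is valid for 5 years,   terminating on December 31, 2025.")
print(cleaned)
# the contract is valid for 5 years terminating on december 31 2025
```

Note that punctuation such as commas and periods is removed before spaCy sees the text, which can affect how dates and sentence boundaries are handled downstream.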
def extract_legal_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents] # (entity text, entity label) pairs
    return entities
sample_text = "Apple Inc. signed a contract with Microsoft on June 15, 2023."
print(extract_legal_entities(sample_text))
Here, we extract entities from legal text using spaCy's named entity recognition (NER) capabilities. The function processes the input text with the spaCy NLP model, identifying and extracting key entities such as organizations, dates, and legal terms. It returns a list of tuples, each containing the recognized entity and its category (for example, organization, date, or law-related term).
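A list of (entity, label) tuples is often easier to work with once it is grouped by label. The sketch below shows one way to do that; `group_entities` is a hypothetical helper, and the example input is hand-written in the shape described above rather than captured from a real spaCy run, since the actual entities depend on the model and its version.

```python
from collections import defaultdict

def group_entities(entities):
    """Hypothetical helper: group (text, label) tuples by their label."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Hand-written illustration, not real model output.
example_entities = [
    ("Apple Inc.", "ORG"),
    ("Microsoft", "ORG"),
    ("June 15, 2023", "DATE"),
]
grouped = group_entities(example_entities)
print(grouped)
```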
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
def embed_text(text):
    inputs = embedding_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = embedding_model(**inputs)
    embedding = output.last_hidden_state.mean(dim=1).squeeze().cpu().numpy() # Mean-pool token vectors into one 1D vector
    return embedding
legal_docs = [
    "A contract is legally binding if signed by both parties.",
    "An NDA prevents disclosure of confidential information.",
    "A non-compete agreement prohibits working for a competitor."
]
# Stack the per-document vectors into one float32 array, the dtype FAISS expects
doc_embeddings = np.array([embed_text(doc) for doc in legal_docs], dtype=np.float32)
print("Embeddings Shape:", doc_embeddings.shape) # Should be (num_samples, embedding_dim)
index = faiss.IndexFlatL2(doc_embeddings.shape[1]) # Dimension should match embedding size
index.add(doc_embeddings)
query = "What happens if I break an NDA?"
query_embedding = embed_text(query).reshape(1, -1) # Reshape for FAISS
_, retrieved_indices = index.search(query_embedding, 1)
print(f"Best matching legal text: {legal_docs[retrieved_indices[0][0]]}")
With the code above, we build a legal document retrieval system using FAISS for efficient semantic search. We first load the MiniLM embedding model from Hugging Face to generate numerical representations of text. The embed_text function processes legal documents and queries by computing contextual embeddings with MiniLM and averaging the token vectors into a single vector per text. These embeddings are stored in a FAISS vector index, enabling fast similarity search.
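For intuition, a flat L2 index performs an exhaustive nearest-neighbor search: it compares the query against every stored vector and keeps the closest ones by squared Euclidean distance. The following pure-Python sketch mimics that behavior on toy 3-dimensional vectors. The function name `flat_l2_search` and the vectors are made up for illustration; real MiniLM embeddings have 384 dimensions, and FAISS does this far more efficiently.

```python
def flat_l2_search(stored_vectors, query, k=1):
    """Hypothetical helper: brute-force k-nearest-neighbor search by squared L2 distance."""
    scored = []
    for position, vector in enumerate(stored_vectors):
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    distances = [distance for distance, _ in top]
    positions = [position for _, position in top]
    return distances, positions

# Toy "document embeddings" and a query that lies closest to the second one.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.1, 0.9, 0.0]

distances, positions = flat_l2_search(toy_vectors, toy_query, k=1)
print(positions)
# [1]
```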
def legal_chatbot(query):
    inputs = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    output = model.generate(**inputs, max_length=100) # Cap the generated answer at 100 tokens
    return tokenizer.decode(output[0], skip_special_tokens=True)
query = "What happens if I break an NDA?"
print(legal_chatbot(query))
Finally, we define a legal chatbot that generates responses to legal queries using a pretrained language model. The legal_chatbot function takes a user query, processes it with the tokenizer, and generates an answer with the model. The answer is decoded into readable text, with any special tokens removed. When a query such as “What happens if I break an NDA?” is entered, the chatbot returns a model-generated legal response.
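As written, legal_chatbot passes the raw query to the model and does not use the result of the FAISS retrieval step. One possible way to connect the two, shown here as a sketch under stated assumptions rather than the tutorial's method, is to prepend the retrieved passage to the query before tokenization. The helper name `build_prompt` and the prompt wording are assumptions.

```python
def build_prompt(query, context):
    """Hypothetical helper: combine a retrieved passage with the user query."""
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

prompt = build_prompt(
    "What happens if I break an NDA?",
    "An NDA prevents disclosure of confidential information.",
)
print(prompt)
```

The resulting string could then be passed to legal_chatbot in place of the raw query.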
In conclusion, by integrating the bigscience/T0pp LLM, Hugging Face Transformers, and PyTorch, we have demonstrated how to build a powerful and scalable legal AI chatbot using open-source resources. This project is a solid foundation for creating reliable AI-powered legal tools, making legal assistance more accessible and automated.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.