In Python, strings are immutable sequences of Unicode characters and are human-readable, whereas bytes represent raw binary data. A bytes object is also immutable and consists of a sequence of 8-bit values. In Python 3, string literals are Unicode by default, while bytes literals are prefixed with a b.
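As a small illustration of the difference, the sketch below compares a string with a bytes object holding the same word. The sample word and the variable names are only assumptions chosen for this example.

```python
# A str holds Unicode characters; a bytes object holds 8-bit integer values.
text = "café"            # string literal (Unicode)
data = b"caf\xc3\xa9"    # bytes literal: the same word encoded as UTF-8

print(type(text), len(text))   # 4 characters
print(type(data), len(data))   # 5 bytes, because "é" takes two bytes in UTF-8

# Indexing a str returns a one-character str; indexing bytes returns an int.
print(text[0])   # c
print(data[0])   # 99
```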
Converting bytes to strings is a common task in Python, particularly when working with data from network operations, file I/O, or responses from certain APIs. This is a tutorial on how to convert bytes to strings in Python.
1. Convert bytes to strings using the decode() method
The easiest way to convert bytes to a string is by calling the decode() method on the bytes object (or byte string). Pass the character encoding that was used to produce the bytes; if you omit it, Python assumes UTF-8.
Note: Strings do not have an associated binary encoding and bytes do not have an associated text encoding. To convert bytes to strings, you can use the decode() method on the bytes object. And to convert a string to bytes, you can use the encode() method on the string. In either case, specify the encoding to use.
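A quick round trip may make the note more concrete. This is a minimal sketch with made-up variable names; it shows that the same text produces different bytes under different encodings, and that decoding with the matching encoding recovers the text.

```python
text = "Hello, World!"

# str -> bytes: the encoding decides which bytes are produced.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

print(len(utf8_bytes))    # 13
print(len(utf16_bytes))   # 28 (a 2-byte byte order mark plus 2 bytes per character)

# bytes -> str: decode with the same encoding that produced the bytes.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16")
print(from_utf8 == text and from_utf16 == text)   # True
```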
Example 1: UTF-8 encoding
Here we convert byte_data, which holds UTF-8 encoded bytes, to a string using the decode() method:
# Sample byte object
byte_data = b'Hello, World!'
# Converting bytes to string
string_data = byte_data.decode('utf-8')
print(string_data)
You should get the following result:
Output >>>
Hello, World!
You can check the data types before and after conversion as follows:
print(type(byte_data))
print(type(string_data))
Data types should be as expected:
Output >>>
<class 'bytes'>
<class 'str'>
Example 2: Handling other encodings
Sometimes the byte stream may use an encoding other than UTF-8. You can handle this by specifying the corresponding encoding scheme when calling the decode() method on the bytes object.
Here's how you can decode a UTF-16 encoded byte string:
# Sample byte object
byte_data_utf16 = b'\xff\xfeH\x00e\x00l\x00l\x00o\x00,\x00 \x00W\x00o\x00r\x00l\x00d\x00!\x00'
# Converting bytes to string
string_data_utf16 = byte_data_utf16.decode('utf-16')
print(string_data_utf16)
And here is the result:
Output >>>
Hello, World!
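It is worth seeing what happens when the encoding passed to decode() does not match the data. The sketch below uses a made-up sample string; it shows that a wrong encoding does not always raise an error and can instead produce garbled text.

```python
original = "héllo"
utf8_bytes = original.encode("utf-8")

# Latin-1 maps every possible byte to a character, so decoding never fails,
# but the two UTF-8 bytes of "é" come out as two unrelated characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)   # hÃ©llo

# Decoding with the encoding that produced the bytes gives the original text.
correct = utf8_bytes.decode("utf-8")
print(correct)   # héllo
```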
Using Chardet to detect encoding
In practice, you may not always know which encoding scheme is being used. And mismatched encodings can lead to errors or unreadable text. How can you fix this?
You can use the chardet library (install it using pip: pip install chardet) to detect the encoding and then use it in the call to the decode() method. Here is an example:
import chardet
# Sample byte object with unknown encoding
byte_data_unknown = b'\xe4\xbd\xa0\xe5\xa5\xbd'
# Detecting the encoding
detected_encoding = chardet.detect(byte_data_unknown)
encoding = detected_encoding['encoding']
print(encoding)
# Converting bytes to string using detected encoding
string_data_unknown = byte_data_unknown.decode(encoding)
print(string_data_unknown)
You should get a similar result:
Output >>>
utf-8
你好
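If installing a third-party library is not an option, a different and much cruder approach is to try a short list of candidate encodings in order. The sketch below is a standard-library-only alternative and does not use chardet. The helper name decode_with_candidates and the candidate list are assumptions made for this example. Keep in mind that a clean decode does not prove the encoding is right, so the order of the candidates matters.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; not part of chardet or the stdlib.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 is accepted by the first candidate.
text, used = decode_with_candidates(b'\xe4\xbd\xa0\xe5\xa5\xbd')
print(used, text)

# Bytes that are not valid UTF-8 fall through to the next candidate.
text_legacy, used_legacy = decode_with_candidates(b'caf\xe9')
print(used_legacy, text_legacy)
```

Strict encodings such as UTF-8 go first because they reject most data that was not written in them, while Latin-1 accepts any byte sequence and is therefore only useful as a last resort.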
Handling errors in decoding
The bytes object you are working with may not always be valid; sometimes it may contain sequences that are invalid for the specified encoding. This will cause errors.
Here, byte_data_invalid contains the invalid sequence \xff:
# Sample byte object with invalid sequence for UTF-8
byte_data_invalid = b'Hello, World!\xff'
# try converting bytes to string
string_data = byte_data_invalid.decode('utf-8')
print(string_data)
When you try to decode it, you will get the following error:
Traceback (most recent call last):
  File "/home/balapriya/bytes2str/main.py", line 5, in <module>
    string_data = byte_data_invalid.decode('utf-8')
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 13: invalid start byte
But there are a couple of ways to handle these errors. You can ignore them when decoding, or you can replace the invalid sequences with a placeholder.
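Before choosing either option, it can help to catch the exception and look at what went wrong. The following is a minimal sketch; the variable names bad_bytes and position are assumptions for this example, while start, end, reason, and object are attributes of UnicodeDecodeError.

```python
byte_data_invalid = b'Hello, World!\xff'

try:
    string_data = byte_data_invalid.decode('utf-8')
except UnicodeDecodeError as exc:
    # The exception records where decoding failed and why.
    position = exc.start
    bad_bytes = exc.object[exc.start:exc.end]
    print(f"Cannot decode {bad_bytes!r} at position {position}: {exc.reason}")
    string_data = None
```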
Ignore errors
To ignore invalid sequences when decoding, you can set errors to ignore in the decode() method call:
# Sample byte object with invalid sequence for UTF-8
byte_data_invalid = b'Hello, World!\xff'
# Converting bytes to string while ignoring errors
string_data = byte_data_invalid.decode('utf-8', errors="ignore")
print(string_data)
Now you will get the following result without any error:
Output >>>
Hello, World!
Replace errors
You can also replace invalid sequences with a placeholder. To do this, you can set errors to replace as shown:
# Sample byte object with invalid sequence for UTF-8
byte_data_invalid = b'Hello, World!\xff'
# Converting bytes to string while replacing errors with a placeholder
string_data_replace = byte_data_invalid.decode('utf-8', errors="replace")
print(string_data_replace)
Now the invalid sequence (at the end) is replaced by a placeholder:
Output >>>
Hello, World!�
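Both ignore and replace discard the original invalid byte. If you need to keep it, the standard library offers two further error handlers that this tutorial does not otherwise cover: backslashreplace and surrogateescape. The sketch below shows them on the same sample data; the variable names are assumptions for this example.

```python
byte_data_invalid = b'Hello, World!\xff'

# Keep the offending byte visible as an escape sequence in the text.
escaped = byte_data_invalid.decode('utf-8', errors='backslashreplace')
print(escaped)   # Hello, World!\xff

# Carry the byte through as a lone surrogate so it can be restored later.
smuggled = byte_data_invalid.decode('utf-8', errors='surrogateescape')
restored = smuggled.encode('utf-8', errors='surrogateescape')
print(restored == byte_data_invalid)   # True
```

The surrogateescape result is not printed here because a string containing a lone surrogate may fail to encode when written to the terminal.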
2. Convert bytes to strings using the str() constructor
The decode() method is the most common way to convert bytes to strings. But you can also use the str() constructor to get a string from a bytes object. You can pass the encoding scheme to str() like so:
# Sample byte object
byte_data = b'Hello, World!'
# Converting bytes to string
string_data = str(byte_data,'utf-8')
print(string_data)
This produces:
Output >>>
Hello, World!
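One pitfall with str() is easy to miss: if you leave out the encoding, it does not decode anything and instead returns the printable representation of the bytes object. The sketch below contrasts the two calls and also shows that str() accepts the same errors argument as decode(). The variable names are assumptions for this example.

```python
# Without an encoding, str() returns the repr of the bytes object, not decoded text.
as_repr = str(b'Hello, World!')
print(as_repr)   # b'Hello, World!'

# With an encoding, str() decodes, and it accepts errors just like decode().
byte_data_invalid = b'Hello, World!\xff'
as_text = str(byte_data_invalid, 'utf-8', errors='replace')
print(as_text)   # Hello, World!�
```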
3. Convert bytes to strings using the codecs module
Another way to convert bytes to strings in Python is to use the decode() function from the built-in codecs module. This module provides convenient functions for encoding and decoding.
You can call the decode() function with the bytes object and the encoding scheme as shown:
import codecs
# Sample byte object
byte_data = b'Hello, World!'
# Converting bytes to string
string_data = codecs.decode(byte_data,'utf-8')
print(string_data)
As expected, this also produces the following result:
Output >>>
Hello, World!
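The codecs module is also useful when bytes arrive in pieces, for example from a network socket, and a multi-byte character may be split across two pieces. The sketch below uses an incremental decoder for that case. The chunk boundaries are made up for this example; codecs.getincrementaldecoder() is part of the standard library.

```python
import codecs

# The UTF-8 bytes for two characters, deliberately split mid-character.
chunks = [b'\xe4\xbd', b'\xa0\xe5', b'\xa5\xbd']

# The incremental decoder buffers incomplete sequences between calls.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]

# final=True signals the end of input, so leftover bytes would raise an error.
pieces.append(decoder.decode(b'', final=True))

text = ''.join(pieces)
print(text)   # 你好
```

Calling decode() on each chunk separately would fail here, because the first chunk on its own ends in the middle of a character.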
Summary
In this tutorial, we learned how to convert bytes to strings in Python, while also handling different encodings and possible errors in an elegant manner. Specifically, we learned how to:
- Use the decode() method to convert bytes to a string, specifying the correct encoding.
- Handle possible decoding errors using the errors parameter with options like ignore or replace.
- Use the str() constructor to convert a valid bytes object to a string.
- Use the decode() function of the codecs module, which is built into the Python standard library, to convert a valid bytes object to a string.
Happy coding!
Bala Priya C is a technical developer and writer from India. She enjoys working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, programming, and drinking coffee! Currently, she is working on learning and sharing her knowledge with the developer community by creating tutorials, how-to guides, opinion pieces, and more. Bala also creates interesting resource overviews and coding tutorials.