Text Data Types

The object Base Class and Collections Abstract Base Class

Recall from the previous tutorial covering the data model that the object class is the base class of all classes. dir can be used to view a list of its identifiers:

In [1]: dir(object)
Out[1]: [
            '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', 
            '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', 
            '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', 
            '__sizeof__', '__str__', '__subclasshook__'
           ]

The identifiers can also be viewed if object is input followed by a dot .:

In [2]: object.
# -------------------------------
# Available Identifiers for `object`:
# -------------------------------------
#   🔧 Functions:
#     - __init__(self, /, *args, **kwargs)          : Initializes the object.
#     - __new__(*args, **kwargs)                    : Creates a new instance of the class.
#     - __getstate__(self, /)                       : Helper for pickle.
#     - __dir__(self, /)                            : Default dir() implementation.
#     - __sizeof__(self, /)                         : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                      : Checks for equality with another object.
#     - __ne__(self, value, /)                      : Checks for inequality with another object.
#     - __lt__(self, value, /)                      : Checks if the object is less than another.
#     - __le__(self, value, /)                      : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                      : Checks if the object is greater than another.
#     - __ge__(self, value, /)                      : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                           : Returns a string representation of the object.
#     - __str__(self, /)                            : Returns a string for display purposes.
#     - __format__(self, format_spec, /)            : Returns a formatted string representation of the object.
#     - __hash__(self, /)                           : Returns a hash of the object.
#     - __getattribute__(self, name, /)             : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)           : Sets an attribute on the object.
#     - __delattr__(self, name, /)                  : Deletes an attribute from the object.
#     - __reduce__(self, /)                         : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)            : Similar to __reduce__, with a protocol argument.
#     - __init_subclass__(...)                      : Called when a class is subclassed; default 
#                                                     implementation does nothing.
#     - __subclasshook__(...)                       : Customize issubclass() for abstract classes.
#
#    🔍 Attributes:
#     - __class__                                    : The class of the object.
#     - __doc__                                      : The docstring of the object.
# -------------------------------------

If the str class is now examined, notice that it has many more identifiers:

In [2]: dir(str)
Out[2]: [
          '__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__',
          '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__',
          '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__',
          '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__',
          '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__',
          '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode',
          'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum',
          'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric',
          'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
          'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace',
          'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines',
          'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'
       ]

Because object is the base class, it is present in the str class's method resolution order:

In [3]: str.mro()
Out[3]: [str, object]

Recall that the str class inherits everything from the object class. Some identifiers are redefined in the str class to provide additional functionality, and further identifiers are added. The method resolution order means that when a method is redefined in the str class, that version is used in preference to the equivalent method in the object class.
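
The lookup described above can be sketched in a few lines. This is only an illustration: defining_class is a hypothetical helper written for this example, not part of Python, and it simply walks the method resolution order looking for the first class whose own namespace defines a given name.

```python
def defining_class(cls, name):
    """Return the first class in cls's MRO whose own namespace defines name.

    Hypothetical helper for illustration only.
    """
    for klass in cls.mro():
        if name in vars(klass):
            return klass
    raise AttributeError(name)


# The MRO is a list of class objects, searched from left to right.
mro = str.mro()
print(mro)

# __len__ is supplied by str; object has no such method.
print(defining_class(str, '__len__'))

# __init_subclass__ is not redefined by str, so the lookup falls back to object.
print(defining_class(str, '__init_subclass__'))

# __repr__ is redefined in str, so the str version is preferred.
print(str.__repr__('abc'))      # formal representation, including the quotes
print(object.__repr__('abc'))   # generic '<str object at 0x...>' form
```

Calling object.__repr__ directly bypasses the redefined method, which shows what every str instance would look like if str had not supplied its own version.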

The str class follows the design pattern of the immutable Collection abstract base class and therefore behaves as an immutable Collection. When str is input followed by a dot ., the identifiers are typically listed alphabetically. However, it is easier to understand the identifiers in the str class when they are grouped by design pattern and purpose:

In [4]: str.
# -------------------------------
# Available Identifiers for `str`:
# -------------------------------------

# 🔧 Functions from `object` (inherited by `str`):
#     - __init__(self, /, *args, **kwargs)          : Initializes the object.
#     - __new__(*args, **kwargs)                    : Creates a new instance of the class.
#     - __getstate__(self, /)                       : Helper for pickle.
#     - __dir__(self, /)                            : Default dir() implementation.
#     - __sizeof__(self, /)                         : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                      : Checks for equality with another object.
#     - __ne__(self, value, /)                      : Checks for inequality with another object.
#     - __lt__(self, value, /)                      : Checks if the object is less than another.
#     - __le__(self, value, /)                      : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                      : Checks if the object is greater than another.
#     - __ge__(self, value, /)                      : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                           : Returns a string representation of the object.
#     - __str__(self, /)                            : Returns a string for display purposes.
#     - __format__(self, format_spec, /)            : Returns a formatted string representation of the object.
#     - __hash__(self, /)                           : Returns a hash of the object.
#     - __getattribute__(self, name, /)             : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)           : Sets an attribute on the object.
#     - __delattr__(self, name, /)                  : Deletes an attribute from the object.
#     - __reduce__(self, /)                         : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)            : Similar to __reduce__, with a protocol argument.

# 🔍 Attributes from `object`:
#     - __class__                                   : The class of the string.
#     - __doc__                                     : The docstring of the string class.

# 🔧 Collection-Based Methods (from `str` and the Collection ABC):
#     - __contains__(self, key, /)                  : Checks if a substring is in the string (`in`).
#     - __iter__(self, /)                           : Returns an iterator over the string.
#     - __len__(self, /)                            : Returns the length of the string.
#     - __getitem__(self, key, /)                   : Retrieves a character by index (`[]`).
#     - count(self, sub, start=0,                   : Counts the occurrences of a substring.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                   : Returns the index of the first occurrence of a substring.
#             end=9223372036854775807, /) 

# 🔧 Supplementary Collection-Based Methods:
#     - rindex(self, sub, start=0,                  : Returns the highest index of a substring.
#              end=9223372036854775807, /) 
#     - find(self, sub, start=0,                    : Finds the first index of a substring.
#            end=9223372036854775807, /) 
#     - rfind(self, sub, start=0,                   : Finds the highest index of a substring.
#             end=9223372036854775807, /) 
#     - replace(self, old, new, count=-1, /)        : Replaces occurrences of a substring.

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                     : Implements string concatenation (`+`).
#     - __mul__(self, value, /)                     : Implements string repetition (`*`).
#     - __rmul__(self, value, /)                    : Implements reflected multiplication (`*`).

# 🔧 Encoding-Related Methods:
#     - encode(self, /, encoding='utf-8',           : Encodes the string using a specified encoding.
#              errors='strict')
#

# 🔧 String-Specific Dunder Methods (from `str`):
#     - __getnewargs__(self, /)                     : Returns the constructor arguments used when pickling.

# 🔧 Additional String-Specific Methods (Grouped by Similarity):

# 🔧 Formatting Methods:
#     - format(self, /, *args, **kwargs)            : Formats the string using a format string.
#     - format_map(self, mapping, /)                : Formats the string using a dictionary.
#     - translate(self, table, /)                   : Maps characters using a translation table.
#     - __mod__(self, value, /)                     : Implements C style string formatting using `%`.
#     - __rmod__(self, value, /)                    : Implements reverse C style string formatting using `%`.

# 🅰️ Case-Specific Methods:
#     - lower(self, /)                              : Converts all characters to lowercase.
#     - casefold(self, /)                           : Returns a casefolded version for caseless matching.
#     - upper(self, /)                              : Converts all characters to uppercase.
#     - capitalize(self, /)                         : Capitalizes the first character of the string.
#     - title(self, /)                              : Returns a title-cased version of the string.
#     - swapcase(self, /)                           : Swaps the case of all characters.

# 🔠 Boolean Methods (Grouped by Type):

# Character Classification:
#     - isascii(self, /)                            : Checks if all characters are ASCII.
#     - isalpha(self, /)                            : Checks if the string contains only alphabetic characters.

# Numeric Classification:
#     - isdecimal(self, /)                          : Checks if the string contains only decimal characters.
#     - isdigit(self, /)                            : Checks if the string contains only digits.
#     - isnumeric(self, /)                          : Checks if the string contains only numeric characters.

# Whitespace and Titlecase:
#     - islower(self, /)                            : Checks if all characters are lowercase.
#     - isupper(self, /)                            : Checks if all characters are uppercase.
#     - isspace(self, /)                            : Checks if the string contains only whitespace.
#     - istitle(self, /)                            : Checks if the string is title-cased.
#     - isprintable(self, /)                        : Checks if all characters are printable.
#     - isidentifier(self, /)                       : Checks if the string is a valid Python identifier.

# Starts or Ends With:
#     - startswith(self, prefix, start=0,           : Checks if the string starts with a prefix.
#                  end=9223372036854775807, /) 
#     - endswith(self, suffix, start=0,             : Checks if the string ends with a suffix.
#                end=9223372036854775807, /)   

# 🔄 Manipulation Methods (Grouping Similar Functions):
#     - ljust(self, width, fillchar=' ', /)         : Left-justifies the string in a field of a given width.
#     - rjust(self, width, fillchar=' ', /)         : Right-justifies the string in a field of a given width.
#     - center(self, width, fillchar=' ', /)        : Centers the string in a field of a given width.
#     - zfill(self, width, /)                       : Pads the string with zeros on the left.
#     - expandtabs(self, tabsize=8, /)              : Expands tabs in the string into spaces.

# 🔄 Stripping Methods:
#     - lstrip(self, chars=None, /)                 : Strips leading characters from the string.
#     - rstrip(self, chars=None, /)                 : Strips trailing characters from the string.
#     - strip(self, chars=None, /)                  : Strips leading and trailing characters from the string.
#     - removeprefix(self, prefix, /)               : Removes the specified prefix from the string.
#     - removesuffix(self, suffix, /)               : Removes the specified suffix from the string.

# 🧩 Splitting and Joining:
#     - split(self, sep=None, maxsplit=-1, /)       : Splits the string at occurrences of a separator.
#     - rsplit(self, sep=None, maxsplit=-1, /)      : Splits the string at occurrences of a separator, from
#                                                     the right.
#     - splitlines(self, keepends=False, /)         : Splits the string at line breaks.
#     - join(self, iterable, /)                     : Joins an iterable with the string as a separator.
#     - partition(self, sep, /)                     : Splits the string into a 3-tuple around a separator.
#     - rpartition(self, sep, /)                    : Splits the string into a 3-tuple around a separator,
#                                                     from the right.
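
The Collection-based identifiers grouped in the listing above can be exercised directly. The following is a minimal sketch, assuming the abstract base classes from the standard collections.abc module are the ones being referred to; the variable names are chosen only for this example.

```python
from collections.abc import Collection, MutableSequence, Sequence

greeting = 'Hello World!'

# Collection requires __contains__, __iter__ and __len__.
print(isinstance(greeting, Collection))
print('World' in greeting)       # __contains__
print(len(greeting))             # __len__
print(list(greeting)[:5])        # __iter__ yields one character at a time

# str is also an ordered Sequence, so indexing, count and index are available.
print(isinstance(greeting, Sequence))
print(greeting[0], greeting.count('l'), greeting.index('W'))

# str is immutable: it is not a MutableSequence and item assignment fails.
print(isinstance(greeting, MutableSequence))
try:
    greeting[0] = 'J'
except TypeError as error:
    assignment_error = error
    print(type(error).__name__)
```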

The str is a Collection where each element (fundamental unit) in the Collection is a Unicode character. In Python 3 the str class is always Unicode, and the Unicode Transformation Format-8 (UTF-8) is the default whenever a str is encoded to bytes. This greatly simplifies text-related operations, as the user does not normally need to handle encoding and decoding using various other translation tables.
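
As a rough illustration of the relationship between Unicode characters and UTF-8, the sketch below encodes three sample characters (chosen for this example, not taken from the tutorial) and shows that one character may need one, two or three bytes.

```python
# Each str element is one Unicode character, identified by a code point.
for character in ('A', 'α', '€'):
    encoded = character.encode('utf-8')
    print(character, hex(ord(character)), len(encoded), encoded)

# Encoding and decoding with the same table round-trips the text.
sample = 'Aα€'
sample_bytes = sample.encode()          # encoding defaults to 'utf-8'
print(len(sample), len(sample_bytes))
print(sample_bytes.decode() == sample)  # decoding also defaults to 'utf-8'
```

The length of the str counts characters, whereas the length of the encoded result counts bytes, so the two values differ as soon as a non-ASCII character is present.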

Another text datatype is the bytes class. The bytes class is also a Collection, where each element in the Collection is a byte. The byte is a logical unit in a computer's memory. It is helpful to conceptualise it as a combination of 8 binary switches:

Each combination of the 8 switches above corresponds to an int from 0 to 255 (256 possible values), so the bytes class also has some numeric behaviour. An encoding standard assigns a single byte or a sequence of bytes to each Unicode character. However, unlike the Unicode str, there are a variety of encoding tables, and a bytes Collection must be decoded using the same encoding table that was used to encode it for the text to make sense. Notice that the identifiers in the bytes class are largely consistent with the identifiers in the str class but may behave slightly differently, as they use a different unit in the Collection:

In [4]: dir(bytes)
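Out[4]: ['__add__', '__bytes__', '__class__', '__contains__', '__delattr__', '__dir__',
          '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
          '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__',
          '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__',
          '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
          '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__',
          '__subclasshook__', 'capitalize', 'center', 'count', 'decode', 'endswith',
          'expandtabs', 'find', 'fromhex', 'hex', 'index', 'isalnum', 'isalpha',
          'isascii', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join',
          'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix',
          'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit',
          'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
          'translate', 'upper', 'zfill']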

In [5]: bytes.
# -------------------------------
# Available Identifiers for `bytes`:
# -------------------------------------

# 🔧 Functions from `object` (inherited by `bytes`):
#     - __init__(self, /, *args, **kwargs)             : Initializes the object.
#     - __new__(*args, **kwargs)                       : Creates a new instance of the class.
#     - __getstate__(self, /)                          : Helper for pickle.
#     - __dir__(self, /)                               : Default dir() implementation.
#     - __sizeof__(self, /)                            : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                         : Checks for equality with another object.
#     - __ne__(self, value, /)                         : Checks for inequality with another object.
#     - __lt__(self, value, /)                         : Checks if the object is less than another.
#     - __le__(self, value, /)                         : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                         : Checks if the object is greater than another.
#     - __ge__(self, value, /)                         : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                              : Returns a string representation of the object.
#     - __str__(self, /)                               : Returns a string for display purposes.
#     - __format__(self, format_spec, /)               : Returns a formatted string representation of the object.
#     - __hash__(self, /)                              : Returns a hash of the object.
#     - __getattribute__(self, name, /)                : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)              : Sets an attribute on the object.
#     - __delattr__(self, name, /)                     : Deletes an attribute from the object.
#     - __reduce__(self, /)                            : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)               : Similar to __reduce__, with a protocol argument.

# 🔍 Attributes from `object`:
#     - __class__                                      : The class of the bytes object.
#     - __doc__                                        : The docstring of the bytes class.

# 🔧 Collection-Based Methods (from `bytes` and the Collection ABC):
#     - __contains__(self, key, /)                     : Checks if a byte value is in the bytes (`in`).
#     - __iter__(self, /)                              : Returns an iterator over the bytes.
#     - __len__(self, /)                               : Returns the length of the bytes.
#     - __getitem__(self, key, /)                      : Retrieves a byte by index (`[]`).
#     - count(self, sub, start=0,                      : Counts the occurrences of a sub-byte sequence.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                      : Returns the index of the first occurrence of a sub-byte.
#             end=9223372036854775807, /) 

# 🔧 Supplementary Collection-Based Methods:
#     - rindex(self, sub, start=0,                     : Returns the highest index of a sub-byte sequence.
#              end=9223372036854775807, /) 
#     - find(self, sub, start=0,                       : Finds the index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - rfind(self, sub, start=0,                      : Finds the highest index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - replace(self, old, new, count=-1, /)           : Replaces occurrences of a sub-byte sequence.

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                        : Implements bytes concatenation (`+`).
#     - __mul__(self, value, /)                        : Implements bytes repetition (`*`).
#     - __rmul__(self, value, /)                       : Implements reflected multiplication (`*`).

# 🔧 Encoding-Related Methods:
#     - decode(self, encoding='utf-8',                 : Decodes the bytes using a specified encoding.
#             errors='strict', /)

# 🔧 Bytes-Specific Dunder Methods (from `bytes`):
#     - __bytes__(self, /)                             : Returns a copy of the bytes object.

# 🔧 Additional Bytes-Specific Methods (Grouped by Similarity):

# 🔧 Formatting and Representation:
#     - hex(self, /)                                   : Returns a string of hexadecimal values.
#     - fromhex(string, /)                             : Creates a `bytes` object from a hexadecimal string.
#     - __mod__(self, value, /)                        : Implements C-style formatting using `%`.
#     - __rmod__(self, value, /)                       : Implements reverse C-style formatting using `%`.

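# 🅰️ Case-Specific Methods (ASCII only; each returns a new `bytes` instance):
#     - lower(self, /)                                 : Converts ASCII characters to lowercase.
#     - upper(self, /)                                 : Converts ASCII characters to uppercase.
#     - capitalize(self, /)                            : Capitalizes the first byte and lowercases the rest.
#     - title(self, /)                                 : Converts to title case.
#     - swapcase(self, /)                              : Swaps the case of ASCII characters.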

# 🔠 Boolean Methods (Data Validation):
#     - isalnum(self, /)                               : Checks if all bytes are alphanumeric.
#     - isalpha(self, /)                               : Checks if all bytes are alphabetic.
#     - isascii(self, /)                               : Checks if all bytes are ASCII.
#     - isdigit(self, /)                               : Checks if all bytes are digits.
#     - islower(self, /)                               : Checks if all bytes are lowercase alphabetic.
#     - isupper(self, /)                               : Checks if all bytes are uppercase alphabetic.
#     - isspace(self, /)                               : Checks if all bytes are whitespace.
#     - startswith(self, prefix, start=0,              : Checks if starts with a prefix.
#                 end=9223372036854775807, /) 
#     - endswith(self, suffix, start=0,                : Checks if ends with a suffix.
#               end=9223372036854775807, /)   

# 🔄 Manipulation Methods (Grouping Similar Functions):
#     - ljust(self, width, fillchar=b' ', /)           : Left-justifies in a field of a given width.
#     - rjust(self, width, fillchar=b' ', /)           : Right-justifies in a field of a given width.
#     - center(self, width, fillchar=b' ', /)          : Centers in a field of a given width.
#     - zfill(self, width, /)                          : Pads with zeros on the left.
#     - expandtabs(self, tabsize=8, /)                 : Expands tabs into spaces.

# 🔄 Stripping Methods:
#     - lstrip(self, bytes=None, /)                    : Strips leading bytes from the bytes object.
#     - rstrip(self, bytes=None, /)                    : Strips trailing bytes from the bytes object.
#     - strip(self, bytes=None, /)                     : Strips leading and trailing bytes from the bytes object.

# 🧩 Splitting and Joining:
#     - split(self, sep=None, maxsplit=-1, /)          : Splits at occurrences of a separator.
#     - rsplit(self, sep=None, maxsplit=-1, /)         : Splits at occurrences of a separator, from the right.
#     - splitlines(self, keepends=False, /)            : Splits at line breaks.
#     - join(self, iterable_of_bytes, /)               : Joins an iterable with bytes as a separator.
#     - partition(self, sep, /)                        : Splits into a 3-tuple around a separator.
#     - rpartition(self, sep, /)                       : Splits into a 3-tuple around a separator, from the right.
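
The numeric behaviour of bytes, and the need to decode with the same table used for encoding, can be sketched as follows. The sample character and the choice of 'latin-1' as the mismatched table are assumptions made only for this illustration.

```python
data = 'α'.encode('utf-8')

# Iterating over or indexing a bytes instance yields int values, not characters.
print(list(data))
print(data[0])

# A slice is still a bytes instance.
print(data[0:1])

# Values outside the range 0 to 255 cannot be stored in a byte.
try:
    bytes([256])
except ValueError as error:
    range_error = error
    print(error)

# Decoding with a different table from the one used for encoding garbles the text.
print(data.decode('utf-8'))
print(data.decode('latin-1'))
```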

The str and bytes classes are immutable, which means that every method returns a new instance (of the same class or of a different class) rather than modifying the original. The bytes class has a mutable counterpart, the bytearray class, which has additional methods that mutate the bytearray in place:

In [5]: bytearray.
# -------------------------------
# Available Identifiers for `bytearray`:
# -------------------------------------

# 🔧 Functions from `object` (inherited by `bytearray`):
#     - __init__(self, /, *args, **kwargs)             : Initializes the object.
#     - __new__(*args, **kwargs)                       : Creates a new instance of the class.
#     - __getstate__(self, /)                          : Helper for pickle.
#     - __dir__(self, /)                               : Default dir() implementation.
#     - __sizeof__(self, /)                            : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                         : Checks for equality with another object.
#     - __ne__(self, value, /)                         : Checks for inequality with another object.
#     - __lt__(self, value, /)                         : Checks if the object is less than another.
#     - __le__(self, value, /)                         : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                         : Checks if the object is greater than another.
#     - __ge__(self, value, /)                         : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                              : Returns a string representation of the object.
#     - __str__(self, /)                               : Returns a string for display purposes.
#     - __format__(self, format_spec, /)               : Returns a formatted string representation of the object.
#     - __hash__(self, /)                              : Returns a hash of the object.
#     - __getattribute__(self, name, /)                : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)              : Sets an attribute on the object.
#     - __delattr__(self, name, /)                     : Deletes an attribute from the object.
#     - __reduce__(self, /)                            : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)               : Similar to __reduce__, with a protocol argument.

# 🔍 Attributes from `object`:
#     - __class__                                      : The class of the bytearray object.
#     - __doc__                                        : The docstring of the bytearray class.

# 🔧 Collection-Based Methods (from `bytearray` and the Collection ABC):
#     - __contains__(self, key, /)                     : Checks if a byte value is in the bytearray (`in`).
#     - __iter__(self, /)                              : Returns an iterator over the bytearray.
#     - __len__(self, /)                               : Returns the length of the bytearray.
#     - __getitem__(self, key, /)                      : Retrieves a byte by index (`[]`).
#     - count(self, sub, start=0,                      : Counts the occurrences of a sub-byte sequence.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                      : Returns the index of the first occurrence of a sub-byte.
#             end=9223372036854775807, /) 

# 🔧 Supplementary Collection-Based Methods:
#     - rindex(self, sub, start=0,                     : Returns the highest index of a sub-byte sequence.
#              end=9223372036854775807, /) 
#     - find(self, sub, start=0,                       : Finds the lowest index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - rfind(self, sub, start=0,                      : Finds the highest index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - replace(self, old, new, count=-1, /)           : Replaces occurrences of a sub-byte sequence.

# 🔧 Mutable Collection-Specific Methods:
#     - __setitem__(self, key, value, /)               : Assigns a value to an item (`[] =`).
#     - __delitem__(self, key, /)                      : Deletes an item from the bytearray.
#     - append(self, item, /)                          : Appends a byte to the end of the bytearray.
#     - extend(self, iterable_of_bytes, /)             : Appends multiple bytes to the bytearray.
#     - insert(self, index, item, /)                   : Inserts a byte at a specific position.
#     - pop(self, index=-1, /)                         : Removes and returns a byte at a given index.
#     - remove(self, value, /)                         : Removes the first occurrence of a value.
#     - clear(self, /)                                 : Removes all bytes from the bytearray.
#     - reverse(self, /)                               : Reverses the order of bytes in place.

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                        : Implements bytearray concatenation (`+`).
#     - __mul__(self, value, /)                        : Implements bytearray repetition (`*`).
#     - __rmul__(self, value, /)                       : Implements reflected multiplication (`*`).

# 🔧 Encoding-Related Methods:
#     - decode(self, encoding='utf-8',                 : Decodes the bytearray using a specified encoding.
#             errors='strict', /)

# 🔧 Bytes-Specific Dunder Methods (from `bytearray`):
#     - __bytes__(self, /)                             : Returns a bytes object copy of the bytearray.

# 🔧 Additional Bytearray-Specific Methods (Grouped by Similarity):

# 🔧 Formatting and Representation:
#     - hex(self, /)                                   : Returns a string of hexadecimal values.
#     - fromhex(string, /)                             : Creates a `bytearray` object from a hexadecimal string.
#     - __mod__(self, value, /)                        : Implements C-style formatting using `%`.
#     - __rmod__(self, value, /)                       : Implements reverse C-style formatting using `%`.

# 🅰️ Case-Specific Methods:
#     - lower(self, /)                                 : Converts to lowercase.
#     - upper(self, /)                                 : Converts to uppercase.
#     - capitalize(self, /)                            : Capitalizes the first byte.
#     - title(self, /)                                 : Converts to title case.
#     - swapcase(self, /)                              : Swaps case.

# 🔠 Boolean Methods (Data Validation):
#     - isalnum(self, /)                               : Checks if all bytes are alphanumeric.
#     - isalpha(self, /)                               : Checks if all bytes are alphabetic.
#     - isascii(self, /)                               : Checks if all bytes are ASCII.
#     - isdigit(self, /)                               : Checks if all bytes are digits.
#     - islower(self, /)                               : Checks if all bytes are lowercase alphabetic.
#     - isupper(self, /)                               : Checks if all bytes are uppercase alphabetic.
#     - isspace(self, /)                               : Checks if all bytes are whitespace.
#     - startswith(self, prefix, start=0,              : Checks if starts with a prefix.
#                 end=9223372036854775807, /) 
#     - endswith(self, suffix, start=0,                : Checks if ends with a suffix.
#               end=9223372036854775807, /)   

# 🔄 Manipulation Methods (Grouping Similar Functions):
#     - ljust(self, width, fillchar=b' ', /)           : Left-justifies in a field of a given width.
#     - rjust(self, width, fillchar=b' ', /)           : Right-justifies in a field of a given width.
#     - center(self, width, fillchar=b' ', /)          : Centers in a field of a given width.
#     - zfill(self, width, /)                          : Pads with zeros on the left.
#     - expandtabs(self, tabsize=8, /)                 : Expands tabs into spaces.

# 🔄 Stripping Methods:
#     - lstrip(self, bytes=None, /)                    : Strips leading bytes from the bytearray.
#     - rstrip(self, bytes=None, /)                    : Strips trailing bytes from the bytearray.
#     - strip(self, bytes=None, /)                     : Strips leading and trailing bytes from the bytearray.

# 🧩 Splitting and Joining:
#     - split(self, sep=None, maxsplit=-1, /)          : Splits at occurrences of a separator.
#     - rsplit(self, sep=None, maxsplit=-1, /)         : Splits at occurrences of a separator, from the right.
#     - splitlines(self, keepends=False, /)            : Splits at line breaks.
#     - join(self, iterable_of_bytes, /)               : Joins an iterable with bytearray as a separator.
#     - partition(self, sep, /)                        : Splits into a 3-tuple around a separator.
#     - rpartition(self, sep, /)                       : Splits into a 3-tuple around a separator, from the right.

Instantiation, Encoding and Collection Properties

A str instance can be explicitly instantiated using:

In [5]: exit
In [6]: str('Hello World!')
Out[6]: 'Hello World!'

The return value shows the formal representation, which is the preferred way to instantiate a str. Since the str class is the fundamental builtins text class, a str instance is normally instantiated with a string literal, without explicitly using the str class. A Unicode str can use any Unicode character. In this example Greek letters will be used:

In [7]: 'Γεια σου Κοσμο!'
Out[7]: 'Γεια σου Κοσμο!'

Greek Alphabet

Greek Alphabet	Uppercase	Lower Case
Alpha	Α	α
Beta	Β	β
Gamma	Γ	γ
Delta	Δ	δ
Epsilon	Ε	ε or ϵ
Zeta	Ζ	ζ
Eta	Η	η
Theta	Θ	θ
Iota	Ι	ι
Kappa	Κ	κ
Lambda	Λ	λ
Mu	Μ	μ
Nu	Ν	ν
Xi	Ξ	ξ
Omicron	Ο	ο
Pi	Π	π
Rho	Ρ	ρ
Sigma	Σ	σ or ς
Tau	Τ	τ
Upsilon	Υ	υ
Phi	Φ	φ
Chi	Χ	χ
Psi	Ψ	ψ
Omega	Ω	ω

The str instances can be assigned to object names:

In [8]: ascii_text = 'Hello World!'
        text = 'Γεια σου Κοσμο!'

And will display in the Variable Explorer:

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
text	str	15	Γεια σου Κοσμο!

Notice the Variable Explorer displays the type and the size. For a str, the size is its length, which is the number of Unicode characters.

Another text data type is the bytes class. Recall the bytes class is a Collection where each element is a byte, and a byte can be conceptualised as a combination of 8 switches, each of which is either off (0) or on (1).
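
The 8 switches can be made visible with a short sketch. It uses the builtins format and chr functions, which are not otherwise used in this section, and the byte value is chosen only for illustration:

```python
# A byte has 8 bits, so it can hold 2 ** 8 = 256 distinct values (0 to 255).
byte_value = 0b01001000  # 8 switches: off, on, off, off, on, off, off, off

# format(..., '08b') pads the binary string to 8 digits, one per switch.
switches = format(byte_value, '08b')
print(switches)          # 01001000

# The same byte viewed as a decimal int and as its ASCII character.
print(byte_value)        # 72
print(chr(byte_value))   # H
```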

The bytes class requires an encoding table. The encoding table maps a command to a memory configuration in bytes. One of the first widespread encoding tables was the American Standard Code for Information Interchange (ASCII). Very early computers were modelled on the typewriter. The typewriter has a limited number of commands that control the device. Many of these commands are printable key presses, however there are commands that aren't printable, such as the carriage return and form feed, which are needed in order to print text onto a piece of paper.

Notice the limited number of characters in ASCII is essentially restricted to the English language. Select ASCII Encoding to view all the ASCII Characters:

ASCII Encoding

Binary to Character Mapping
Binary	Character
0b00000000	NUL (null character)
0b00000001	SOH (start of header)
0b00000010	STX (start of text)
0b00000011	ETX (end of text)
0b00000100	EOT (end of transmission)
0b00000101	ENQ (enquiry)
0b00000110	ACK (acknowledge)
0b00000111	BEL (bell)
0b00001000	BS (backspace)
0b00001001	TAB (horizontal tab)
0b00001010	LF (line feed)
0b00001011	VT (vertical tab)
0b00001100	FF (form feed)
0b00001101	CR (carriage return)
0b00001110	SO (shift out)
0b00001111	SI (shift in)
0b00010000	DLE (data link escape)
0b00010001	DC1 (device control 1)
0b00010010	DC2 (device control 2)
0b00010011	DC3 (device control 3)
0b00010100	DC4 (device control 4)
0b00010101	NAK (negative acknowledge)
0b00010110	SYN (synchronous idle)
0b00010111	ETB (end of transmission block)
0b00011000	CAN (cancel)
0b00011001	EM (end of medium)
0b00011010	SUB (substitute)
0b00011011	ESC (escape)
0b00011100	FS (file separator)
0b00011101	GS (group separator)
0b00011110	RS (record separator)
0b00011111	US (unit separator)
0b00100000	(space)
0b00100001	! (exclamation mark)
0b00100010	" (double quote)
0b00100011	# (number sign)
0b00100100	$ (dollar sign)
0b00100101	% (percent)
0b00100110	& (ampersand)
0b00100111	' (single quote)
0b00101000	( (left parenthesis)
0b00101001	) (right parenthesis)
0b00101010	* (asterisk)
0b00101011	+ (plus)
0b00101100	, (comma)
0b00101101	- (hyphen)
0b00101110	. (period)
0b00101111	/ (slash)
0b00110000	0 (digit zero)
0b00110001	1 (digit one)
0b00110010	2 (digit two)
0b00110011	3 (digit three)
0b00110100	4 (digit four)
0b00110101	5 (digit five)
0b00110110	6 (digit six)
0b00110111	7 (digit seven)
0b00111000	8 (digit eight)
0b00111001	9 (digit nine)
0b00111010	: (colon)
0b00111011	; (semicolon)
0b00111100	< (less than)
0b00111101	= (equal sign)
0b00111110	> (greater than)
0b00111111	? (question mark)
0b01000000	@ (commercial at)
0b01000001	A (uppercase A)
0b01000010	B (uppercase B)
0b01000011	C (uppercase C)
0b01000100	D (uppercase D)
0b01000101	E (uppercase E)
0b01000110	F (uppercase F)
0b01000111	G (uppercase G)
0b01001000	H (uppercase H)
0b01001001	I (uppercase I)
0b01001010	J (uppercase J)
0b01001011	K (uppercase K)
0b01001100	L (uppercase L)
0b01001101	M (uppercase M)
0b01001110	N (uppercase N)
0b01001111	O (uppercase O)
0b01010000	P (uppercase P)
0b01010001	Q (uppercase Q)
0b01010010	R (uppercase R)
0b01010011	S (uppercase S)
0b01010100	T (uppercase T)
0b01010101	U (uppercase U)
0b01010110	V (uppercase V)
0b01010111	W (uppercase W)
0b01011000	X (uppercase X)
0b01011001	Y (uppercase Y)
0b01011010	Z (uppercase Z)
0b01011011	[ (left square bracket)
0b01011100	\ (backslash)
0b01011101	] (right square bracket)
0b01011110	^ (caret)
0b01011111	_ (underscore)
0b01100000	` (grave accent)
0b01100001	a (lowercase a)
0b01100010	b (lowercase b)
0b01100011	c (lowercase c)
0b01100100	d (lowercase d)
0b01100101	e (lowercase e)
0b01100110	f (lowercase f)
0b01100111	g (lowercase g)
0b01101000	h (lowercase h)
0b01101001	i (lowercase i)
0b01101010	j (lowercase j)
0b01101011	k (lowercase k)
0b01101100	l (lowercase l)
0b01101101	m (lowercase m)
0b01101110	n (lowercase n)
0b01101111	o (lowercase o)
0b01110000	p (lowercase p)
0b01110001	q (lowercase q)
0b01110010	r (lowercase r)
0b01110011	s (lowercase s)
0b01110100	t (lowercase t)
0b01110101	u (lowercase u)
0b01110110	v (lowercase v)
0b01110111	w (lowercase w)
0b01111000	x (lowercase x)
0b01111001	y (lowercase y)
0b01111010	z (lowercase z)
0b01111011	{ (left curly brace)
0b01111100	\| (vertical bar)
0b01111101	} (right curly brace)
0b01111110	~ (tilde)
0b01111111	DEL (delete)

The bytes class can be used to cast a str instance to a bytes instance:

In [9]: bytes(ascii_text)
Traceback (most recent call last):

  Cell In[9], line 1
    bytes(ascii_text)

TypeError: string argument without an encoding

Notice an encoding table needs to be specified:

In [10]: bytes(ascii_text, encoding='ascii')
Out[10]: b'Hello World!'

The formal representation shows the preferred way of constructing a bytes instance which consists only of ASCII characters. Notice that the prefix b is used to distinguish a bytes object from a str object:

In [11]: b'Hello World!'
Out[11]: b'Hello World!'
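
The role of the encoding table can be checked with a small sketch. It assumes the two str instances assigned earlier and uses the str method encode, together with its errors parameter, to show what happens when a character has no entry in the ASCII table:

```python
ascii_text = 'Hello World!'
text = 'Γεια σου Κοσμο!'

# Casting with bytes and calling str.encode give the same bytes instance.
same_result = ascii_text.encode('ascii') == bytes(ascii_text, encoding='ascii')

# The Greek letters have no entry in the ASCII table, so encoding fails.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# errors='replace' substitutes ? for each character that cannot be encoded.
replaced = text.encode('ascii', errors='replace')
print(same_result, encode_failed, replaced)
```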

\ is a special character in a string (str object or bytes object) that is used to insert an escape character. For example \t is a tab and \n is a new line (\n is the line feed; on a typewriter a new line needed two commands, the carriage return and the line feed):

In [12]: ascii_text = 'Hello\tWorld!'
         b_ascii_text = b'Hello\tWorld!'

Notice that the Variable Explorer will display the printed format with the escape character processed:

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
b_ascii_text	bytes	12	Hello World!
text	str	15	Γεια σου Κοσμο!

Binary is machine readable but humans have problems transcribing a long line of zeros and ones. Therefore it is common to split the 8 bit byte into two 4 bit halves. Each half is represented by a hexadecimal character:

binary	hexadecimal	decimal
0b0000	0x0	0
0b0001	0x1	1
0b0010	0x2	2
0b0011	0x3	3
0b0100	0x4	4
0b0101	0x5	5
0b0110	0x6	6
0b0111	0x7	7
0b1000	0x8	8
0b1001	0x9	9
0b1010	0xa	10
0b1011	0xb	11
0b1100	0xc	12
0b1101	0xd	13
0b1110	0xe	14
0b1111	0xf	15
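
The split of a byte into two 4 bit halves can be sketched with the builtins divmod function. The variable names below are illustrative only:

```python
byte_value = 0b01001000  # the ASCII byte for H

# Dividing by 16 separates the upper 4 bits from the lower 4 bits.
high_half, low_half = divmod(byte_value, 16)
print(high_half, low_half)   # 4 8

# Each half indexes one hexadecimal character from the table above.
hex_digits = '0123456789abcdef'
manual_hex = hex_digits[high_half] + hex_digits[low_half]
print(manual_hex)            # 48

# format with '02x' gives the same two hexadecimal characters.
print(format(byte_value, '02x'))  # 48
```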

All the ASCII characters can be reviewed using the three numbering systems binary (base 2 denoted with the prefix 0b), hexadecimal (base 16 denoted with the prefix 0x) and decimal (base 10 standard representation, therefore no prefix). Select ASCII Encoding to view all the ASCII Characters:

ASCII Encoding

Binary	Hexadecimal	Decimal	Character Name
0b00000000	0x00	0	NUL (null)
0b00000001	0x01	1	SOH (start of heading)
0b00000010	0x02	2	STX (start of text)
0b00000011	0x03	3	ETX (end of text)
0b00000100	0x04	4	EOT (end of transmission)
0b00000101	0x05	5	ENQ (enquiry)
0b00000110	0x06	6	ACK (acknowledge)
0b00000111	0x07	7	BEL (bell)
0b00001000	0x08	8	BS (backspace)
0b00001001	0x09	9	HT (horizontal tab)
0b00001010	0x0a	10	LF (line feed)
0b00001011	0x0b	11	VT (vertical tab)
0b00001100	0x0c	12	FF (form feed)
0b00001101	0x0d	13	CR (carriage return)
0b00001110	0x0e	14	SO (shift out)
0b00001111	0x0f	15	SI (shift in)
0b00010000	0x10	16	DLE (data link escape)
0b00010001	0x11	17	DC1 (device control 1)
0b00010010	0x12	18	DC2 (device control 2)
0b00010011	0x13	19	DC3 (device control 3)
0b00010100	0x14	20	DC4 (device control 4)
0b00010101	0x15	21	NAK (negative acknowledgment)
0b00010110	0x16	22	SYN (synchronous idle)
0b00010111	0x17	23	ETB (end of transmission block)
0b00011000	0x18	24	CAN (cancel)
0b00011001	0x19	25	EM (end of medium)
0b00011010	0x1a	26	SUB (substitute)
0b00011011	0x1b	27	ESC (escape)
0b00011100	0x1c	28	FS (file separator)
0b00011101	0x1d	29	GS (group separator)
0b00011110	0x1e	30	RS (record separator)
0b00011111	0x1f	31	US (unit separator)
0b00100000	0x20	32	(space)
0b00100001	0x21	33	! (exclamation mark)
0b00100010	0x22	34	" (double quote)
0b00100011	0x23	35	# (number sign)
0b00100100	0x24	36	$ (dollar sign)
0b00100101	0x25	37	% (percent)
0b00100110	0x26	38	& (ampersand)
0b00100111	0x27	39	' (apostrophe)
0b00101000	0x28	40	( (left parenthesis)
0b00101001	0x29	41	) (right parenthesis)
0b00101010	0x2a	42	* (asterisk)
0b00101011	0x2b	43	+ (plus sign)
0b00101100	0x2c	44	, (comma)
0b00101101	0x2d	45	- (minus sign)
0b00101110	0x2e	46	. (period)
0b00101111	0x2f	47	/ (slash)
0b00110000	0x30	48	0 (zero)
0b00110001	0x31	49	1 (one)
0b00110010	0x32	50	2 (two)
0b00110011	0x33	51	3 (three)
0b00110100	0x34	52	4 (four)
0b00110101	0x35	53	5 (five)
0b00110110	0x36	54	6 (six)
0b00110111	0x37	55	7 (seven)
0b00111000	0x38	56	8 (eight)
0b00111001	0x39	57	9 (nine)
0b00111010	0x3a	58	: (colon)
0b00111011	0x3b	59	; (semicolon)
0b00111100	0x3c	60	< (less than)
0b00111101	0x3d	61	= (equal sign)
0b00111110	0x3e	62	> (greater than)
0b00111111	0x3f	63	? (question mark)
0b01000000	0x40	64	@ (at sign)
0b01000001	0x41	65	A (capital A)
0b01000010	0x42	66	B (capital B)
0b01000011	0x43	67	C (capital C)
0b01000100	0x44	68	D (capital D)
0b01000101	0x45	69	E (capital E)
0b01000110	0x46	70	F (capital F)
0b01000111	0x47	71	G (capital G)
0b01001000	0x48	72	H (capital H)
0b01001001	0x49	73	I (capital I)
0b01001010	0x4a	74	J (capital J)
0b01001011	0x4b	75	K (capital K)
0b01001100	0x4c	76	L (capital L)
0b01001101	0x4d	77	M (capital M)
0b01001110	0x4e	78	N (capital N)
0b01001111	0x4f	79	O (capital O)
0b01010000	0x50	80	P (capital P)
0b01010001	0x51	81	Q (capital Q)
0b01010010	0x52	82	R (capital R)
0b01010011	0x53	83	S (capital S)
0b01010100	0x54	84	T (capital T)
0b01010101	0x55	85	U (capital U)
0b01010110	0x56	86	V (capital V)
0b01010111	0x57	87	W (capital W)
0b01011000	0x58	88	X (capital X)
0b01011001	0x59	89	Y (capital Y)
0b01011010	0x5a	90	Z (capital Z)
0b01011011	0x5b	91	[ (opening bracket)
0b01011100	0x5c	92	\ (backslash)
0b01011101	0x5d	93	] (closing bracket)
0b01011110	0x5e	94	^ (caret)
0b01011111	0x5f	95	_ (underscore)
0b01100000	0x60	96	` (grave accent)
0b01100001	0x61	97	a (lowercase a)
0b01100010	0x62	98	b (lowercase b)
0b01100011	0x63	99	c (lowercase c)
0b01100100	0x64	100	d (lowercase d)
0b01100101	0x65	101	e (lowercase e)
0b01100110	0x66	102	f (lowercase f)
0b01100111	0x67	103	g (lowercase g)
0b01101000	0x68	104	h (lowercase h)
0b01101001	0x69	105	i (lowercase i)
0b01101010	0x6a	106	j (lowercase j)
0b01101011	0x6b	107	k (lowercase k)
0b01101100	0x6c	108	l (lowercase l)
0b01101101	0x6d	109	m (lowercase m)
0b01101110	0x6e	110	n (lowercase n)
0b01101111	0x6f	111	o (lowercase o)
0b01110000	0x70	112	p (lowercase p)
0b01110001	0x71	113	q (lowercase q)
0b01110010	0x72	114	r (lowercase r)
0b01110011	0x73	115	s (lowercase s)
0b01110100	0x74	116	t (lowercase t)
0b01110101	0x75	117	u (lowercase u)
0b01110110	0x76	118	v (lowercase v)
0b01110111	0x77	119	w (lowercase w)
0b01111000	0x78	120	x (lowercase x)
0b01111001	0x79	121	y (lowercase y)
0b01111010	0x7a	122	z (lowercase z)
0b01111011	0x7b	123	{ (left brace)
0b01111100	0x7c	124	\| (vertical bar)
0b01111101	0x7d	125	} (right brace)
0b01111110	0x7e	126	~ (tilde)
0b01111111	0x7f	127	DEL (delete)

A bytes instance can be represented as a hexadecimal string:

In [13]: b_ascii_text.hex()
Out[13]: '48656c6c6f09576f726c6421'

\x is used to insert a hexadecimal escape character and expects 2 hexadecimal digits:

In [14]: b'\x48\x65\x6c\x6c\x6f\x09\x57\x6f\x72\x6c\x64\x21'
Out[14]: b'Hello\tWorld!'

Notice the formal representation prefers using the printable ASCII character where present over the hexadecimal escape character. If a character is included outwith the ASCII printable character range for example the NUL character at 0x00:

In [15]: b'\x00\x48\x65\x6c\x6c\x6f\x09\x57\x6f\x72\x6c\x64\x21'
Out[15]: b'\x00Hello\tWorld!'

Then it has no printable alternative and this byte therefore remains represented as a hexadecimal escape character.

Notice that ASCII only occupies half the possible values of a byte. The remaining values were used regionally in extended ASCII tables:

Extended ASCII Tables

binary	hexadecimal	decimal	latin1	latin2	latin3	latin4	cyrillic	arabic	greek	hebrew	turkish	nordic	thai
0b10000000	0x80	128
0b10000001	0x81	129
0b10000010	0x82	130
0b10000011	0x83	131
0b10000100	0x84	132
0b10000101	0x85	133
0b10000110	0x86	134
0b10000111	0x87	135
0b10001000	0x88	136
0b10001001	0x89	137
0b10001010	0x8a	138
0b10001011	0x8b	139
0b10001100	0x8c	140
0b10001101	0x8d	141
0b10001110	0x8e	142
0b10001111	0x8f	143
0b10010000	0x90	144
0b10010001	0x91	145
0b10010010	0x92	146
0b10010011	0x93	147
0b10010100	0x94	148
0b10010101	0x95	149
0b10010110	0x96	150
0b10010111	0x97	151
0b10011000	0x98	152
0b10011001	0x99	153
0b10011010	0x9a	154
0b10011011	0x9b	155
0b10011100	0x9c	156
0b10011101	0x9d	157
0b10011110	0x9e	158
0b10011111	0x9f	159
0b10100000	0xa0	160	NBSP	NBSP	NBSP	NBSP	NBSP	NBSP	NBSP	NBSP	NBSP	NBSP	NBSP
0b10100001	0xa1	161	¡	Ą	Ħ	Ą	Ё		‘		¡	Ą	ก
0b10100010	0xa2	162	¢	˘	˘	ĸ	Ђ		’	¢	¢	Ē	ข
0b10100011	0xa3	163	£	Ł	£	Ŗ	Ѓ		£	£	£	Ģ	ฃ
0b10100100	0xa4	164	¤	¤	¤	¤	Є	¤	€	¤	¤	Ī	ค
0b10100101	0xa5	165	¥	Ľ		Ĩ	Ѕ		₯	¥	¥	Ĩ	ฅ
0b10100110	0xa6	166	¦	Ś	Ĥ	Ļ	І		¦	¦	¦	Ķ	ฆ
0b10100111	0xa7	167	§	§	§	§	Ї		§	§	§	§	ง
0b10101000	0xa8	168	¨	¨	¨	¨	Ј		¨	¨	¨	Ļ	จ
0b10101001	0xa9	169	©	Š	İ	Š	Љ		©	©	©	Đ	ฉ
0b10101010	0xaa	170	ª	Ş	Ş	Ē	Њ		ͺ	×	ª	Š	ช
0b10101011	0xab	171	«	Ť	Ğ	Ģ	Ћ		«	«	«	Ŧ	ซ
0b10101100	0xac	172	¬	Ź	Ĵ	Ŧ	Ќ	،	¬	¬	¬	Ž	ฌ
0b10101101	0xad	173	SHY	SHY	SHY	SHY	SHY	SHY	SHY	SHY	SHY	SHY	ญ
0b10101110	0xae	174	®	Ž		Ž	Ў			®	®	Ū	ฎ
0b10101111	0xaf	175	¯	Ż	Ż	¯	Џ		―	¯	¯	Ŋ	ฏ
0b10110000	0xb0	176	°	°	°	°	А		°	°	°	°	ฐ
0b10110001	0xb1	177	±	ą	ħ	ą	Б		±	±	±	ą	ฑ
0b10110010	0xb2	178	²	˛	²	˛	В		²	²	²	ē	ฒ
0b10110011	0xb3	179	³	ł	³	ŗ	Г		³	³	³	ģ	ณ
0b10110100	0xb4	180	´	´	´	´	Д		΄	´	´	ī	ด
0b10110101	0xb5	181	µ	ľ	µ	ĩ	Е		΅	µ	µ	ĩ	ต
0b10110110	0xb6	182	¶	ś	ĥ	ļ	Ж		Ά	¶	¶	ķ	ถ
0b10110111	0xb7	183	·	ˇ	·	ˇ	З		·	·	·	·	ท
0b10111000	0xb8	184	¸	¸	¸	¸	И		Έ	¸	¸	ļ	ธ
0b10111001	0xb9	185	¹	š	ı	š	Й		Ή	¹	¹	đ	น
0b10111010	0xba	186	º	ş	ş	ē	К		Ί	÷	º	š	บ
0b10111011	0xbb	187	»	ť	ğ	ģ	Л	؛	»	»	»	ŧ	ป
0b10111100	0xbc	188	¼	ź	ĵ	ŧ	М		Ό	¼	¼	ž	ผ
0b10111101	0xbd	189	½	˝	½	Ŋ	Н		½	½	½	―	ฝ
0b10111110	0xbe	190	¾	ž		ž	О		Ύ	¾	¾	ū	พ
0b10111111	0xbf	191	¿	ż	ż	ŋ	П	؟	Ώ		¿	ŋ	ฟ
0b11000000	0xc0	192	À	Ŕ	À	Ā	Р		ΐ		À	Ā	ภ
0b11000001	0xc1	193	Á	Á	Á	Á	С	ء	Α		Á	Á	ม
0b11000010	0xc2	194	Â	Â	Â	Â	Т	آ	Β		Â	Â	ย
0b11000011	0xc3	195	Ã	Ă		Ã	У	أ	Γ		Ã	Ã	ร
0b11000100	0xc4	196	Ä	Ä	Ä	Ä	Ф	ؤ	Δ		Ä	Ä	ฤ
0b11000101	0xc5	197	Å	Ĺ	Ċ	Å	Х	إ	Ε		Å	Å	ล
0b11000110	0xc6	198	Æ	Ć	Ĉ	Æ	Ц	ئ	Ζ		Æ	Æ	ฦ
0b11000111	0xc7	199	Ç	Ç	Ç	Į	Ч	ا	Η		Ç	Į	ว
0b11001000	0xc8	200	È	Č	È	Č	Ш	ب	Θ		È	Č	ศ
0b11001001	0xc9	201	É	É	É	É	Щ	ة	Ι		É	É	ษ
0b11001010	0xca	202	Ê	Ę	Ê	Ę	Ъ	ت	Κ		Ê	Ę	ส
0b11001011	0xcb	203	Ë	Ë	Ë	Ë	Ы	ث	Λ		Ë	Ë	ห
0b11001100	0xcc	204	Ì	Ě	Ì	Ė	Ь	ج	Μ		Ì	Ė	ฬ
0b11001101	0xcd	205	Í	Í	Í	Í	Э	ح	Ν		Í	Í	อ
0b11001110	0xce	206	Î	Î	Î	Î	Ю	خ	Ξ		Î	Î	ฮ
0b11001111	0xcf	207	Ï	Ď	Ï	Ī	Я	د	Ο		Ï	Ï	ฯ
0b11010000	0xd0	208	Ð	Đ		Đ	а	ذ	Π		Ğ	Ð	ะ
0b11010001	0xd1	209	Ñ	Ń	Ñ	Ņ	б	ر	Ρ		Ñ	Ņ	ั
0b11010010	0xd2	210	Ò	Ň	Ò	Ō	в	ز			Ò	Ō	า
0b11010011	0xd3	211	Ó	Ó	Ó	Ķ	г	س	Σ		Ó	Ó	ำ
0b11010100	0xd4	212	Ô	Ô	Ô	Ô	д	ش	Τ		Ô	Ô	ิ
0b11010101	0xd5	213	Õ	Ő	Ġ	Õ	е	ص	Υ		Õ	Õ	ี
0b11010110	0xd6	214	Ö	Ö	Ö	Ö	ж	ض	Φ		Ö	Ö	ึ
0b11010111	0xd7	215	×	×	×	×	з	ط	Χ		×	Ũ	ื
0b11011000	0xd8	216	Ø	Ř	Ĝ	Ø	и	ظ	Ψ		Ø	Ø	ุ
0b11011001	0xd9	217	Ù	Ů	Ù	Ų	й	ع	Ω		Ù	Ų	ู
0b11011010	0xda	218	Ú	Ú	Ú	Ú	к	غ	Ϊ		Ú	Ú	ฺ
0b11011011	0xdb	219	Û	Ű	Û	Û	л		Ϋ		Û	Û
0b11011100	0xdc	220	Ü	Ü	Ü	Ü	м		ά		Ü	Ü
0b11011101	0xdd	221	Ý	Ý	Ŭ	Ũ	н		έ		İ	Ý
0b11011110	0xde	222	Þ	Ţ	Ŝ	Ū	о		ή		Ş	Þ
0b11011111	0xdf	223	ß	ß	ß	ß	п		ί	‗	ß	ß	฿
0b11100000	0xe0	224	à	ŕ	à	ā	р	ـ	ΰ	א	à	ā	เ
0b11100001	0xe1	225	á	á	á	á	с	ف	α	ב	á	á	แ
0b11100010	0xe2	226	â	â	â	â	т	ق	β	ג	â	â	โ
0b11100011	0xe3	227	ã	ă		ã	у	ك	γ	ד	ã	ã	ใ
0b11100100	0xe4	228	ä	ä	ä	ä	ф	ل	δ	ה	ä	ä	ไ
0b11100101	0xe5	229	å	ĺ	ċ	å	х	م	ε	ו	å	å	ๅ
0b11100110	0xe6	230	æ	ć	ĉ	æ	ц	ن	ζ	ז	æ	æ	ๆ
0b11100111	0xe7	231	ç	ç	ç	į	ч	ه	η	ח	ç	į	็
0b11101000	0xe8	232	è	č	è	č	ш	و	θ	ט	è	č	่
0b11101001	0xe9	233	é	é	é	é	щ	ى	ι	י	é	é	้
0b11101010	0xea	234	ê	ę	ê	ę	ъ	ي	κ	ך	ê	ę	๊
0b11101011	0xeb	235	ë	ë	ë	ë	ы	ً	λ	כ	ë	ë	๋
0b11101100	0xec	236	ì	ě	ì	ė	ь	ٌ	μ	ל	ì	ė	์
0b11101101	0xed	237	í	í	í	í	э	ٍ	ν	ם	í	í	ํ
0b11101110	0xee	238	î	î	î	î	ю	َ	ξ	מ	î	î	๎
0b11101111	0xef	239	ï	ď	ï	ī	я	ُ	ο	ן	ï	ï	๏
0b11110000	0xf0	240	ð	đ		đ	№	ِ	π	נ	ğ	ð	0
0b11110001	0xf1	241	ñ	ń	ñ	ņ	ё	ّ	ρ	ס	ñ	ņ	1
0b11110010	0xf2	242	ò	ň	ò	ō	ђ	ْ	ς	ע	ò	ō	2
0b11110011	0xf3	243	ó	ó	ó	ķ	ѓ		σ	ף	ó	ó	3
0b11110100	0xf4	244	ô	ô	ô	ô	є		τ	פ	ô	ô	4
0b11110101	0xf5	245	õ	ő	ġ	õ	ѕ		υ	ץ	õ	õ	5
0b11110110	0xf6	246	ö	ö	ö	ö	і		φ	צ	ö	ö	6
0b11110111	0xf7	247	÷	÷	÷	÷	ї		χ	ק	÷	ũ	7
0b11111000	0xf8	248	ø	ř	ĝ	ø	ј		ψ	ר	ø	ø	8
0b11111001	0xf9	249	ù	ů	ù	ų	љ		ω	ש	ù	ų	9
0b11111010	0xfa	250	ú	ú	ú	ú	њ		ϊ	ת	ú	ú	๚
0b11111011	0xfb	251	û	ű	û	û	ћ		ϋ		û	û	๛
0b11111100	0xfc	252	ü	ü	ü	ü	ќ		ό		ü	ü
0b11111101	0xfd	253	ý	ý	ŭ	ũ	§		ύ	‎LRM	ı	ý
0b11111110	0xfe	254	þ	ţ	ŝ	ū	ў		ώ	‏RLM	ş	þ
0b11111111	0xff	255	ÿ	˙	˙	˙	џ				ÿ	ĸ

If 0xe5 is examined for example, notice that it maps to a different character in many of the extended ASCII tables:

binary	0b11100101
hexadecimal	0xe5
decimal	229
latin1	å
latin2	ĺ
latin3	ċ
latin4	å
cyrillic	х
arabic	م
greek	ε
hebrew	ו
turkish	å
nordic	å
thai	ๅ

If inserted as a hexadecimal escape character in a str, notice that it displays as the latin1 character. A str is Unicode and the first 256 Unicode code points match latin1, the most common extended ASCII table:

In [16]: '\xe5'
Out[16]: 'å'

If it is inserted as a hexadecimal character into a bytes object, notice that it remains a hexadecimal escape sequence:

In [17]: b'\xe5'
Out[17]: b'\xe5'

The bytes object can be decoded to a string when the correct encoding is applied:

In [18]: b'\xe5'.decode('latin1')
Out[18]: 'å'
In [19]: b'\xe5'.decode('latin2')
Out[19]: 'ĺ'
In [20]: b'\xe5'.decode('latin3')
Out[20]: 'ċ'
In [21]: b'\xe5'.decode('latin4')
Out[21]: 'å'
In [22]: b'\xe5'.decode('cyrillic')
Out[22]: 'х'
In [23]: b'\xe5'.decode('arabic')
Out[23]: 'م'
In [24]: b'\xe5'.decode('greek')
Out[24]: 'ε'
In [25]: b'\xe5'.decode('iso8859-9') # turkish
Out[25]: 'å'
In [26]: b'\xe5'.decode('hebrew')
Out[26]: 'ו'
In [27]: b'\xe5'.decode('iso8859-10') # nordic
Out[27]: 'å'
In [28]: b'\xe5'.decode('thai')
Out[28]: 'ๅ'

Returning to the str instance text, it can be encoded into a bytes object using the greek encoding table:

In [29]: text
Out[29]: 'Γεια σου Κοσμο!'

In [31]: text.encode(encoding='greek')
Out[31]: b'\xc3\xe5\xe9\xe1 \xf3\xef\xf5 \xca\xef\xf3\xec\xef!'

Notice in the bytes object that each of the printable ASCII characters is displayed using its ASCII character and the non-ASCII characters are displayed using a hexadecimal escape character.

1 byte character encoding was suitable for offline regional computing, however the advent of the internet resulted in a number of issues. Essentially content produced on a computer in Greece using the greek encoding table would be read on a computer in the UK using the latin1 encoding table and the following character substitution would take place:

In [32]: text.encode(encoding='greek').decode(encoding='latin1')
Out[32]: 'Ãåéá óïõ Êïóìï!'
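
The substitution can be reversed when the original encoding table is known. The following is a minimal sketch, assuming the garbled str was produced exactly as above, with no characters lost along the way:

```python
text = 'Γεια σου Κοσμο!'

# Encoded with the greek table but decoded with the latin1 table.
garbled = text.encode(encoding='greek').decode(encoding='latin1')
print(garbled)    # Ãåéá óïõ Êïóìï!

# latin1 assigns a character to every byte value, so encoding the garbled
# str with latin1 restores the original bytes, which can then be decoded
# with the correct table.
recovered = garbled.encode(encoding='latin1').decode(encoding='greek')
print(recovered)  # Γεια σου Κοσμο!
```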

1 byte (8 bit) encoding allows:

In [33]: 2 ** 8
Out[33]: 256

commands. 2 bytes (16 bits) encoding allows:

In [34]: 2 ** 16
Out[34]: 65536

commands.

The utf-16 standard was produced which includes all the characters seen in the extended ASCII tables:

In [35]: text.encode(encoding='utf-16-be')
Out[35]: b'\x03\x93\x03\xb5\x03\xb9\x03\xb1\x00 \x03\xc3\x03\xbf\x03\xc5\x00 \x03\x9a\x03\xbf\x03\xc3\x03\xbc\x03\xbf\x00!'

The bytes instance can be displayed as a hexadecimal string:

In [36]: text.encode(encoding='utf-16-be').hex()
Out[36]: '039303b503b903b1002003c303bf03c50020039a03bf03c303bc03bf0021'

Let's examine an ASCII character. In utf-16 encoding, the byte corresponding to the ASCII character when ascii encoding is used, is paired with the NULL byte 00 and the 2 bytes are used to encode the character.

utf-16-be is a variant of utf-16 that is Big Endian. Big Endian is typically the way humans write numbers, where the Big (most significant byte 00) is placed before the Little (least significant byte 61):

In [37]: 'a'.encode(encoding='utf-16-be').hex()
Out[37]: '0061'

Intel processors typically use Little Endian where the Little (least significant byte 61) is placed before the Big (most significant byte 00):

In [38]: 'a'.encode(encoding='utf-16-le').hex()
Out[38]: '6100'

There was initially some confusion because of this and therefore a Byte Order Marker (BOM) is included as a prefix. The utf-16 codec in Python writes the BOM and then uses the byte order of the machine, which is Little Endian here:

In [39]: 'a'.encode(encoding='utf-16').hex()
Out[39]: 'fffe6100'

The BOM can be seen by examination of an empty str:

In [40]: ''.encode(encoding='utf-16').hex()
Out[40]: 'fffe'

A Greek character can also be examined using the utf-16 encoding variants:

In [39]: 'α'.encode(encoding='utf-16-be').hex()
Out[39]: '03b1'
In [40]: 'α'.encode(encoding='utf-16-le').hex()
Out[40]: 'b103'
In [41]: 'α'.encode(encoding='utf-16').hex()
Out[41]: 'fffeb103'
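
The purpose of the BOM can be sketched with the constants in the standard library codecs module, which are not used elsewhere in this tutorial. When bytes are decoded with utf-16, the BOM is read to select the byte order and is then discarded:

```python
import codecs

# Prefix each byte order variant with its matching BOM.
le_bytes = codecs.BOM_UTF16_LE + 'α'.encode(encoding='utf-16-le')
be_bytes = codecs.BOM_UTF16_BE + 'α'.encode(encoding='utf-16-be')
print(le_bytes.hex())  # fffeb103
print(be_bytes.hex())  # feff03b1

# The utf-16 codec uses the BOM to pick the byte order when decoding.
decoded_le = le_bytes.decode(encoding='utf-16')
decoded_be = be_bytes.decode(encoding='utf-16')
print(decoded_le, decoded_be)  # α α
```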

Some languages such as Chinese use more than 50000 characters and therefore 65536 commands are insufficient to incorporate all Latin and Asian characters. A fixed 2 bytes per character was therefore superseded by utf-32. utf-32 uses 4 bytes (32 bits) encoding which allows:

In [42]: 2 ** 32
Out[42]: 4294967296

commands which is sufficient to cover all characters used in all languages. utf-32 has byte ordering variants:

In [43]: 'a'.encode(encoding='utf-32-be').hex()
Out[43]: '00000061'
In [44]: 'a'.encode(encoding='utf-32-le').hex()
Out[44]: '61000000'
In [45]: 'a'.encode(encoding='utf-32').hex()
Out[45]: 'fffe000061000000'


In [46]: 'α'.encode(encoding='utf-32-be').hex()
Out[46]: '000003b1'
In [47]: 'α'.encode(encoding='utf-32-le').hex()
Out[47]: 'b1030000'
In [48]: 'α'.encode(encoding='utf-32').hex()
Out[48]: 'fffe0000b1030000'

In [49]: '我'.encode(encoding='utf-32-be').hex()
Out[49]: '00006211'
In [50]: '我'.encode(encoding='utf-32-le').hex()
Out[50]: '11620000'
In [51]: '我'.encode(encoding='utf-32').hex()
Out[51]: 'fffe000011620000'

A Unicode character can be inserted into a string using the escape character \U, which expects 8 hexadecimal digits in the format shown by utf-32-be:

In [52]: '\U00000061'
Out[52]: 'a'
In [53]: '\U000003b1'
Out[53]: 'α'
In [54]: '\U00006211'
Out[54]: '我'
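
The number following \U is the Unicode code point of the character. As a brief sketch, the builtins functions ord and chr (not otherwise used in this section) convert between a character and its code point, and the shorter \u escape expects 4 hexadecimal digits:

```python
# ord returns the code point of a character as an int.
code_point = ord('α')
print(hex(code_point))  # 0x3b1

# chr is the inverse and returns the character at a code point.
from_chr = chr(0x3b1)

# \u expects 4 hexadecimal digits, \U expects 8.
short_escape = '\u03b1'
long_escape = '\U000003b1'

# The utf-32-be bytes are the code point written as a 4 byte number.
utf32_hex = 'α'.encode(encoding='utf-32-be').hex()
print(utf32_hex)        # 000003b1
```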

In a Python string the \ is an instruction to insert an escape character and \U expects 8 hexadecimal characters. On Windows \ is used as the default directory separator. Therefore \\ has to be used within a file path, where the first \ is an instruction to insert an escape character and the second \ is the escape character to be inserted. The prefix R can be used for a raw string, in which \ is not processed as an escape character. Note upper case R is preferentially used for a raw string and syntax highlighting won't be applied. Lower case r is instead preferentially used for a regular expression and syntax highlighting for a regular expression may be applied:

In [55]: 'C:\\Users\\Philip'
Out[55]: 'C:\\Users\\Philip'

In [56]: R'C:\Users\Philip'
Out[56]: 'C:\\Users\\Philip'

In [57]: r'C:\Users\Philip'
Out[57]: 'C:\\Users\\Philip'

On Windows if a string is used instead of a raw string, the following error message is common:

In [58]: 'C:\Users\Philip'
  Cell In[58], line 1
    'C:\Users\Philip'
    ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
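
A short sketch confirms that the escaped str and the raw str are the same object value. The use of PureWindowsPath from the standard library pathlib module is an addition for illustration and is not part of this tutorial:

```python
from pathlib import PureWindowsPath

escaped = 'C:\\Users\\Philip'
raw = R'C:\Users\Philip'

# Both spellings create the same 15 character str.
print(escaped == raw)  # True
print(len(raw))        # 15

# PureWindowsPath splits on the Windows directory separator on any platform.
path = PureWindowsPath(raw)
print(path.name)       # Philip
```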

The number of padding zeros for an ASCII character and the confusion due to the byte order marker resulted in a new standard with a variable byte length per character:

In [59]: 'a'.encode(encoding='utf-8').hex()
Out[59]: '61'
In [60]: 'α'.encode(encoding='utf-8').hex()
Out[60]: 'ceb1'
In [61]: '我'.encode(encoding='utf-8').hex()
Out[61]: 'e68891'
In [62]: '🐱'.encode(encoding='utf-8').hex()
Out[62]: 'f09f90b1'

It is called utf-8 because it is built from 1 byte (8 bit) units and the ASCII characters only occupy 1 byte. Greek characters occupy 2 bytes (16 bits), most Asian characters occupy 3 bytes (24 bits) and most emojis occupy 4 bytes (32 bits).

There is no byte order marker. Instead the leading bits of the first byte outline the expected number of bytes per character:

number of bytes	binary sequence
1	0b 0aaaaaaa
2	0b 110aaaaa 10aaaaaa
3	0b 1110aaaa 10aaaaaa 10aaaaaa
4	0b 11110aaa 10aaaaaa 10aaaaaa 10aaaaaa

These underlying patterns can be seen when the binary sequence for each of the characters above is examined:

In [63]: '0b'+bin(int('a'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(8)
Out[63]: '0b01100001'
In [64]: '0b'+bin(int('α'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(16)
Out[64]: '0b1100111010110001'
In [65]: '0b'+bin(int('我'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(24)
Out[65]: '0b111001101000100010010001'
In [65]: '0b'+bin(int('🐱'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(32)
Out[65]: '0b11110000100111111001000010110001'
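
The table of binary sequences can be turned into a hand-written encoder. The function utf8_encode below is a hypothetical helper written only for illustration; it skips the validation a real codec performs and is compared against the str method encode:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the bit patterns."""
    if code_point < 0x80:      # 1 byte:  0aaaaaaa
        return bytes([code_point])
    if code_point < 0x800:     # 2 bytes: 110aaaaa 10aaaaaa
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:   # 3 bytes: 1110aaaa 10aaaaaa 10aaaaaa
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 4 bytes: 11110aaa 10aaaaaa 10aaaaaa 10aaaaaa
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# Compare the hand-written encoder with the builtin codec.
for character in 'aα我🐱':
    manual = utf8_encode(ord(character))
    print(character, manual.hex(), manual == character.encode(encoding='utf-8'))
```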

Although utf-8 was designed to not require a BOM, Microsoft produced a version, utf-8-sig, which has the BOM:

In [66]: ''.encode(encoding='utf-8').hex()
Out[66]: ''

In [67]: ''.encode(encoding='utf-8-sig').hex()
Out[67]: 'efbbbf'

In [68]: 'a'.encode(encoding='utf-8-sig').hex()
Out[68]: 'efbbbf61'
In [69]: 'α'.encode(encoding='utf-8-sig').hex()
Out[69]: 'efbbbfceb1'
In [70]: '我'.encode(encoding='utf-8-sig').hex()
Out[70]: 'efbbbfe68891'
In [71]: '🐱'.encode(encoding='utf-8-sig').hex()
Out[71]: 'efbbbff09f90b1'
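
The choice of codec also matters when decoding. As a small sketch, decoding utf-8-sig bytes with utf-8 leaves the BOM in the str as the invisible character \ufeff, whereas decoding with utf-8-sig removes it:

```python
data = 'α'.encode(encoding='utf-8-sig')
print(data.hex())  # efbbbfceb1

# Decoding with utf-8 keeps the BOM as the first character.
with_bom = data.decode(encoding='utf-8')
print(len(with_bom))     # 2

# Decoding with utf-8-sig strips the BOM.
without_bom = data.decode(encoding='utf-8-sig')
print(len(without_bom))  # 1

# utf-8-sig also accepts bytes that have no BOM.
no_bom_input = 'α'.encode(encoding='utf-8').decode(encoding='utf-8-sig')
```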

The bytes class has the alternative constructor fromhex which can be used to construct a bytes instance from a hexadecimal string:

In [72]: bytes.fromhex('61')
Out[72]: b'a'
In [73]: bytes.fromhex('ceb1')
Out[73]: b'\xce\xb1'
In [74]: bytes.fromhex('e68891')
Out[74]: b'\xe6\x88\x91'
In [75]: bytes.fromhex('f09f90b1')
Out[75]: b'\xf0\x9f\x90\xb1'
In [76]: exit
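
The alternative constructor fromhex and the method hex are inverses of one another, which can be sketched as a round trip. The separator argument to hex is an addition here and assumes Python 3.8 or later:

```python
# Whitespace between the bytes in the hexadecimal string is ignored.
data = bytes.fromhex('ce b1 e6 88 91')
print(data)           # b'\xce\xb1\xe6\x88\x91'

# hex returns the hexadecimal string, optionally with a separator.
print(data.hex())     # ceb1e68891
print(data.hex(' '))  # ce b1 e6 88 91

# The bytes decode to the two characters encoded earlier.
decoded = data.decode(encoding='utf-8')
print(decoded)        # α我
```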

From now on utf-8 will be used as the default encoding table. The following str instances can be instantiated and encoded to bytes instances:

In [1]: ascii_text = 'Hello World!'
In [2]: text = 'Γεια σου Κοσμο!'
In [3]: ascii_text.encode(encoding='utf-8')
Out[3]: b'Hello World!'
In [4]: text.encode(encoding='utf-8')
Out[4]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'

As ascii_text consists only of printable ASCII characters, the formal representation of the bytes instance returned displays each byte as its printable ASCII character.

As text contains a mixture of printable ASCII characters and non-ASCII characters, the formal representation displays each byte as its printable ASCII character where applicable and as a hexadecimal escape sequence otherwise. If these are assigned to variables:

In [5]: ascii_text_b = b'Hello World!'
In [6]: text_b = b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'

If these are shown in the Variable Explorer:

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
ascii_text_b	bytes	12	Hello World!
text	str	15	Γεια σου Κοσμο!
text_b	bytes	27	Γεια σου Κοσμο!

The Variable Explorer in Spyder assumes 'utf-8' encoding for a bytes instance and attempts to display any printable character.

Notice the lengths of text and text_b are different because the element in each class is different. An element of text is a Unicode character and an element of text_b is a byte, and some of the characters are encoded to multiple bytes.

This can be seen by casting each Collection explicitly to a tuple:

In [7]: text_as_tuple = tuple(text)
In [8]: text_b_as_tuple = tuple(text_b)

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
ascii_text_b	bytes	12	Hello World!
text	str	15	Γεια σου Κοσμο!
text_as_tuple	tuple	15	('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b	bytes	27	Γεια σου Κοσμο!
text_b_as_tuple	tuple	27	(206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)

If text_as_tuple is expanded, the value at each index can be seen to be a Unicode character because a Unicode character is an element of a str:

text_as_tuple - tuple (15 elements)
Index	Value
0	'Γ'
1	'ε'
2	'ι'
3	'α'
4	' '
5	'σ'
6	'ο'
7	'υ'
8	' '
9	'Κ'
10	'ο'
11	'σ'
12	'μ'
13	'ο'
14	'!'

If text_b_as_tuple is expanded, the value at each index can be seen to be an int between 0:256:

text_b_as_tuple - tuple (27 elements)
Index	Value
0	206
1	147
2	206
3	181
4	206
5	185
6	206
7	177
8	32
9	207
10	131
11	206
12	191
13	207
14	133
15	32
16	206
17	154
18	206
19	191
20	207
21	131
22	206
23	188
24	206
25	191
26	33
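
The difference between the 15 elements of text and the 27 elements of text_b can be accounted for by encoding each character on its own. This is a small sketch; the name bytes_per_character is illustrative:

```python
text = 'Γεια σου Κοσμο!'
text_b = text.encode(encoding='utf-8')

# Each Greek letter occupies 2 bytes, each space and the ! occupy 1 byte.
bytes_per_character = [len(character.encode(encoding='utf-8'))
                       for character in text]
print(bytes_per_character)

# 12 Greek letters * 2 bytes + 3 ASCII characters * 1 byte = 27 bytes.
print(len(text), sum(bytes_per_character), len(text_b))  # 15 27 27
```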

Recall a byte is a numeric value between 0:256.

The binary bin and hexadecimal hex functions can be used to display this int as a binary string or hexadecimal string:

In [9]: text_b_as_tuple_bin = tuple([bin(byte) for byte in text_b_as_tuple])
        text_b_as_tuple_hex = tuple([hex(byte) for byte in text_b_as_tuple])

text_b_as_tuple_bin and text_b_as_tuple_hex display in the Variable Explorer:

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
ascii_text_b	bytes	12	Hello World!
text	str	15	Γεια σου Κοσμο!
text_as_tuple	tuple	15	('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b	bytes	27	Γεια σου Κοσμο!
text_b_as_tuple	tuple	27	(206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin	tuple	27	('0b11001110', '0b10010011', '0b11001110', '0b10110101', '0b11001110', …)
text_b_as_tuple_hex	tuple	27	('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

text_b_as_tuple_bin can be expanded to view each byte in binary:

text_b_as_tuple_bin - tuple (27 elements)
Index	Value
0	0b11001110
1	0b10010011
2	0b11001110
3	0b10110101
4	0b11001110
5	0b10111001
6	0b11001110
7	0b10110001
8	0b100000
9	0b11001111
10	0b10000011
11	0b11001110
12	0b10111111
13	0b11001111
14	0b10000101
15	0b100000
16	0b11001110
17	0b10011010
18	0b11001110
19	0b10111111
20	0b11001111
21	0b10000011
22	0b11001110
23	0b10111100
24	0b11001110
25	0b10111111
26	0b100001

text_b_as_tuple_hex can be expanded to view each byte in hexadecimal:

text_b_as_tuple_hex - tuple (27 elements)
Index	Value
0	0xce
1	0x93
2	0xce
3	0xb5
4	0xce
5	0xb9
6	0xce
7	0xb1
8	0x20
9	0xcf
10	0x83
11	0xce
12	0xbf
13	0xcf
14	0x85
15	0x20
16	0xce
17	0x9a
18	0xce
19	0xbf
20	0xcf
21	0x83
22	0xce
23	0xbc
24	0xce
25	0xbf
26	0x21

The bytes class can be used to cast a tuple of int values between 0:256 to a bytes instance:

In [10]: bytes((206, 147, 206, 181, 206, 185, 206, 177,  32, 207,
                131, 206, 191, 207, 133,  32, 206, 154, 206, 191,
                207, 131, 206, 188, 206, 191,  33))
Out[10]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
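
The round trip between a str, its utf-8 bytes and a tuple of int values can be sketched as follows. This is a minimal illustration that assumes the same text as above; the name round_trip is hypothetical and not part of the session:

```python
text = 'Γεια σου Κοσμο!'

# Encode the str to bytes, then view each byte as an int.
text_b = text.encode(encoding='utf-8')
text_b_as_tuple = tuple(text_b)

# Every int is in range(256), so bytes() accepts the tuple.
round_trip = bytes(text_b_as_tuple)

print(round_trip == text_b)                  # True
print(round_trip.decode(encoding='utf-8'))   # Γεια σου Κοσμο!

# A value outside range(256) is rejected with a ValueError.
try:
    bytes((206, 256))
except ValueError as error:
    print(error)
```

The tuple is only an intermediate view; the bytes instance itself already behaves as a Collection of int values.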

Now that the element in each Collection is understood, the following Collection based identifiers can be used:

# 🔧 Collection-Based Methods (from `str` and the Collection ABC):
#     - __contains__(self, key, /)                  : Checks if a substring is in the string (`in`).
#     - __iter__(self, /)                           : Returns an iterator over the string.
#     - __len__(self, /)                            : Returns the length of the string.
#     - __getitem__(self, key, /)                   : Retrieves a character by index (`[]`).
#     - count(self, sub, start=0,                   : Counts the occurrences of a substring.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                   : Returns the index of the first occurrence of a substring.
#             end=9223372036854775807, /)

The data model method __len__ defines the behaviour of the builtins function len and essentially retrieves the Size shown on the Variable Explorer:

In [11]: len(text) # text.__len__()
Out[11]: 15
In [12]: len(text_b) # text_b.__len__()
Out[12]: 27

The data model method __contains__ defines the behaviour of the in keyword:

In [13]: 'ει' in text # text.__contains__('ει')
Out[13]: True
In [14]: 'ε' in text # text.__contains__('ε')
Out[14]: True
In [15]: bytes((147, 206)) in text_b # text_b.__contains__(bytes((147, 206)))
Out[15]: True
In [16]: 147 in text_b # text_b.__contains__(147)
Out[16]: True

The data model method __getitem__ will retrieve a value at an integer index:

In [17]: text[1] # text.__getitem__(1)
Out[17]: 'ε'

Notice that Python uses zero-based indexing. This means the first value is at index 0 and the last value is at the length of the Collection minus 1:

In [17]: text[0] 
Out[17]: 'Γ'

In [18]: text[len(text)] 
Traceback (most recent call last):

  Cell In[18], line 1
    text[len(text)]

IndexError: string index out of range

In [19]: text[len(text)-1] 
Out[19]: '!'

text Variable Explorer

text_as_tuple - tuple (15 elements)
Index	Value
0	'Γ'
1	'ε'
2	'ι'
3	'α'
4	' '
5	'σ'
6	'ο'
7	'υ'
8	' '
9	'Κ'
10	'ο'
11	'σ'
12	'μ'
13	'ο'
14	'!'

The builtins class slice takes the same input arguments, start, stop[, step], as the builtins class range:

In [20]: slice()
# Docstring popup
"""
Init signature: slice(self, /, *args, **kwargs)
Docstring:     
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:     
"""

range uses zero-based indexing, so it is inclusive of the start bound and exclusive of the stop bound:

In [20]: tuple(range(0, 5, 1))
Out[20]: (0, 1, 2, 3, 4)

In [21]: tuple(range(0, 5)) # default step=1
Out[21]: (0, 1, 2, 3, 4)

In [22]: tuple(range(5)) # default start=0
Out[22]: (0, 1, 2, 3, 4)

slice behaves consistently:

In [23]: text[slice(0, 5, 1)]
Out[23]: 'Γεια '

In [24]: text[slice(0, 5)] # default step=1
Out[24]: 'Γεια '

In [25]: text[slice(5)] # default start=0
Out[25]: 'Γεια '

Essentially the selection runs from index 0 (inclusive) up to index 5 (exclusive):

text Variable Explorer Annotated

text_as_tuple - tuple (15 elements)
Index	Value
0	'Γ'
1	'ε'
2	'ι'
3	'α'
4	' '
5	'σ'
6	'ο'
7	'υ'
8	' '
9	'Κ'
10	'ο'
11	'σ'
12	'μ'
13	'ο'
14	'!'

The slice instance can be assigned to an object name for the sake of readability:

In [26]: selection = slice(0, 5, 1)
In [27]: text[selection] 
Out[27]: 'Γεια '

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
ascii_text_b	bytes	12	Hello World!
selection	slice	1	slice(0, 5, 1)
text	str	15	Γεια σου Κοσμο!
text_as_tuple	tuple	15	('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b	bytes	27	Γεια σου Κοσμο!
text_b_as_tuple	tuple	27	(206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin	tuple	27	('0b11001110', '0b10010011', '0b11001110', '0b10110101', '0b11001110', …)
text_b_as_tuple_hex	tuple	27	('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

However normally slicing is done using a colon : instead:

In [28]: text[0:5:1] # text[slice(0, 5, 1)]
Out[28]: 'Γεια '

In [29]: text[0:5] # default step=1
Out[29]: 'Γεια '

In [30]: text[:5] # default start=0
Out[30]: 'Γεια '

The colon notation is a bit more flexible, as the start, the stop or the step can each be omitted:

In [31]: text[:2] # default start=0
Out[31]: 'Γε'

If a step of -1 is used, the string is reversed:

In [32]: text[::-1] # default start=-1 for a negative step
Out[32]: '!ομσοΚ υοσ αιεΓ'

With a negative step, the default start is -1 and the default stop is -len(text)-1, which is one position before the first character:

In [33]: text[-1:-len(text)-1:-1]
Out[33]: '!ομσοΚ υοσ αιεΓ'
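
The relationship between the colon notation and slice instances can be sketched as below. An omitted bound corresponds to None in the slice instance, and the slice method indices resolves those None values for a given length. The names forward and backward are illustrative only:

```python
text = 'Γεια σου Κοσμο!'

# An omitted bound in colon notation corresponds to None in a slice.
print(text[:5] == text[slice(None, 5, None)])       # True
print(text[5:] == text[slice(5, None, None)])       # True
print(text[::-1] == text[slice(None, None, -1)])    # True

# slice.indices resolves None for a collection of a given length,
# returning a (start, stop, step) tuple suitable for range.
forward = slice(None, 5).indices(len(text))
backward = slice(None, None, -1).indices(len(text))
print(forward)     # (0, 5, 1)
print(backward)    # (14, -1, -1)

# range accepts the resolved values; here stop=-1 is a range bound
# meaning "one before index 0", not the negative index -1.
reversed_text = ''.join(text[index] for index in range(*backward))
print(reversed_text == text[::-1])                  # True
```

Note that the stop returned by indices is meant for range, so it is expressed differently from the negative index -len(text)-1 used in the slice above, although both describe the position before the first character.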

Now examine the bytes instance text_b. Notice that indexing a single value returns an int corresponding to the byte:

In [34]: text_b[0]
Out[34]: 206

However, slicing returns a bytes instance:

In [35]: text_b[0:1]
Out[35]: b'\xce'
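
A short sketch of the difference between indexing and slicing a bytes instance, assuming the same utf-8 encoded text. The names first_int and first_bytes are illustrative:

```python
text_b = 'Γεια σου Κοσμο!'.encode(encoding='utf-8')

first_int = text_b[0]        # indexing gives an int
first_bytes = text_b[0:1]    # slicing gives a bytes instance of length 1

print(type(first_int).__name__, first_int)        # int 206
print(type(first_bytes).__name__, first_bytes)    # bytes b'\xce'

# The two views are related by casting a one-element tuple to bytes.
print(bytes((first_int,)) == first_bytes)         # True

# The first Greek character occupies two bytes, so a two-byte slice decodes.
print(text_b[0:2].decode(encoding='utf-8'))       # Γ
```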

text_b Variable Explorer

text_b_as_tuple - tuple (27 elements)
Index	Value
0	206
1	147
2	206
3	181
4	206
5	185
6	206
7	177
8	32
9	207
10	131
11	206
12	191
13	207
14	133
15	32
16	206
17	154
18	206
19	191
20	207
21	131
22	206
23	188
24	206
25	191
26	33

The data model method __iter__ defines the behaviour of the builtins function iter, which returns an iterator over the str:

In [36]: forward = iter(text)

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
ascii_text_b	bytes	12	Hello World!
forward	str_iterator	1	<str_iterator at 0x22a2f5b70a0>
selection	slice	1	slice(0, 5, 1)
text	str	15	Γεια σου Κοσμο!
text_as_tuple	tuple	15	('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b	bytes	27	Γεια σου Κοσμο!
text_b_as_tuple	tuple	27	(206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin	tuple	27	('0b11001110', '0b10010011', '0b11001110', '0b10110101', '0b11001110', …)
text_b_as_tuple_hex	tuple	27	('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

An iterator essentially yields a single value at a time. Each call to the builtins function next returns the next value and consumes it, so it cannot be retrieved from the iterator again:

In [37]: next(forward)
Out[37]: 'Γ'
In [38]: next(forward)
Out[38]: 'ε'
In [39]: next(forward)
Out[39]: 'ι'

A while loop can be constructed that breaks when the StopIteration error is encountered:

In [40]: forward = iter(text)
       : while True:
       :     try:
       :         print(next(forward))
       :     except StopIteration:
       :         break
       :
Γ
ε
ι
α
 
σ
ο
υ
 
Κ
ο
σ
μ
ο
!

The syntax of a for loop is cleaner. Behind the scenes it uses an iterator in much the same way as the while loop above:

In [41]: for unicode_char in text:
       :     print(unicode_char)
       :
Γ
ε
ι
α
 
σ
ο
υ
 
Κ
ο
σ
μ
ο
!
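
The equivalence between the two loops can be sketched with a small helper. The function manual_for is a hypothetical name introduced here for illustration; it is not how the interpreter is implemented, only a model of the same steps:

```python
def manual_for(iterable, action):
    """Apply action to each item, emulating a for loop (illustrative helper)."""
    iterator = iter(iterable)        # calls iterable.__iter__()
    while True:
        try:
            item = next(iterator)    # calls iterator.__next__()
        except StopIteration:
            break                    # iterator exhausted, leave the loop
        action(item)


text = 'Γεια σου Κοσμο!'

collected_manual = []
manual_for(text, collected_manual.append)

collected_for = []
for unicode_char in text:
    collected_for.append(unicode_char)

print(collected_manual == collected_for)   # True
```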

The enumerate class can be used to enumerate the str. To visualise the enumerate instance, it can be cast into a dictionary:

In [42]: enum_text = enumerate(text)
In [43]: enum_text_as_dict = dict(enum_text)

Variable Explorer
Name ▲	Type	Size	Value
ascii_text	str	12	Hello World!
ascii_text_b	bytes	12	Hello World!
enum_text	enumerate	1	<enumerate at 0x22a2ed61260>
enum_text_as_dict	dict	15	{0: 'Γ', 1: 'ε', 2: 'ι', 3: 'α', 4: ' ', 5: 'σ', 6: 'ο', 7: 'υ', 8: ' ', 9: 'Κ', …}
forward	str_iterator	1	<str_iterator at 0x22a2f5b70a0>
selection	slice	1	slice(0, 5, 1)
text	str	15	Γεια σου Κοσμο!
text_as_tuple	tuple	15	('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b	bytes	27	Γεια σου Κοσμο!
text_b_as_tuple	tuple	27	(206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin	tuple	27	('0b11001110', '0b10010011', '0b11001110', '0b10110101', '0b11001110', …)
text_b_as_tuple_hex	tuple	27	('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

enum_text_as_dict can be expanded:

enum_text_as_dict
Key	Value
0	Γ
1	ε
2	ι
3	α
4
5	σ
6	ο
7	υ
8
9	Κ
10	ο
11	σ
12	μ
13	ο
14	!

A for loop can be constructed with the enumeration of text:

In [44]: for index, unicode_char in enumerate(text):
       :    print(index, unicode_char)
       :
0 Γ
1 ε
2 ι
3 α
4  
5 σ
6 ο
7 υ
8  
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !

The negative indexes can also be examined using:

In [45]: for index, unicode_char in enumerate(text):
       :    print(index-len(text), unicode_char)
       :
-15 Γ
-14 ε
-13 ι
-12 α
-11  
-10 σ
-9 ο
-8 υ
-7  
-6 Κ
-5 ο
-4 σ
-3 μ
-2 ο
-1 !

The negative indexes can be viewed alongside the positive indexes:

In [46]: for index, unicode_char in enumerate(text):
       :     print(index-len(text), unicode_char)
       : for index, unicode_char in enumerate(text):
       :     print(index, unicode_char)
       :
-15 Γ
-14 ε
-13 ι
-12 α
-11  
-10 σ
-9 ο
-8 υ
-7  
-6 Κ
-5 ο
-4 σ
-3 μ
-2 ο
-1 !
0 Γ
1 ε
2 ι
3 α
4  
5 σ
6 ο
7 υ
8  
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !

This makes it easier to conceptualise slicing using a negative step:

In [47]: text[-8:-11:-1]
Out[47]: 'υοσ'

A step of 2 can be used to return a str of every second unicode character:

In [48]: text[::2]
Out[48]: 'Γι ο ομ!'

In [49]: text[1::2]
Out[49]: 'εασυΚσο'

It is possible to do the same for the bytes instance:

In [50]: for index, byte_int in enumerate(text_b):
       :     print(index-len(text_b), byte_int)
       : for index, byte_int in enumerate(text_b):
       :     print(index, byte_int)
       :
-27 206
-26 147
-25 206
-24 181
-23 206
-22 185
-21 206
-20 177
-19 32
-18 207
-17 131
-16 206
-15 191
-14 207
-13 133
-12 32
-11 206
-10 154
-9 206
-8 191
-7 207
-6 131
-5 206
-4 188
-3 206
-2 191
-1 33
0 206
1 147
2 206
3 181
4 206
5 185
6 206
7 177
8 32
9 207
10 131
11 206
12 191
13 207
14 133
15 32
16 206
17 154
18 206
19 191
20 207
21 131
22 206
23 188
24 206
25 191
26 33

However slicing using a step with a multibyte encoding such as utf-8 will usually result in a bytes instance that cannot be decoded:

In [51]: text_b[0::2]
Out[51]: b'\xce\xce\xce\xce \x83\xbf\x85\xce\xce\xcf\xce\xce!'

In [52]: text_b[0::2].decode(encoding='utf-8')
Traceback (most recent call last):

  Cell In[52], line 1
    text_b[0::2].decode(encoding='utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
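
A hedged sketch of two ways to cope with this: decoding with the errors parameter for inspection, or slicing the str before encoding so that whole characters are kept. The names every_second_byte and every_second_char_b are illustrative:

```python
text = 'Γεια σου Κοσμο!'
text_b = text.encode(encoding='utf-8')

# Slicing the bytes with a step splits multibyte characters apart.
every_second_byte = text_b[0::2]
try:
    every_second_byte.decode(encoding='utf-8')
except UnicodeDecodeError as error:
    print('cannot decode:', error.reason)

# errors='replace' substitutes U+FFFD for undecodable bytes instead of
# raising, which is useful for inspection only.
print(every_second_byte.decode(encoding='utf-8', errors='replace'))

# Slicing the str first keeps whole characters, so the result decodes cleanly.
every_second_char_b = text[0::2].encode(encoding='utf-8')
print(every_second_char_b.decode(encoding='utf-8'))   # Γι ο ομ!
```

Slicing at the str level is usually the safer choice, because the str is a Collection of Unicode characters rather than of bytes.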

The Collection method count will count the number of occurrences of a substring in a str:

In [53]: text
Out[53]: 'Γεια σου Κοσμο!'

In [53]: text.count('σου')
Out[53]: 1

In [54]: text.count('σ')
Out[54]: 2

The Collection method index will retrieve the index of the first occurrence of a value:

In [55]: dict(enumerate(text))
Out[55]: 
{0: 'Γ',
 1: 'ε',
 2: 'ι',
 3: 'α',
 4: ' ',
 5: 'σ',
 6: 'ο',
 7: 'υ',
 8: ' ',
 9: 'Κ',
 10: 'ο',
 11: 'σ',
 12: 'μ',
 13: 'ο',
 14: '!'}

In [56]: text.index('σ')
Out[56]: 5

The optional positional input arguments start and end can be used to restrict the range of indexes to search over:

In [57]: first = text.index('σ')
       : text.index('σ', first+1, len(text))
Out[57]: 11

The method index will produce a ValueError when the substring is not found:

In [58]: second = text.index('σ', first+1, len(text))
       : text.index('σ', second+1, len(text))
Traceback (most recent call last):

  Cell In[58], line 2
    text.index('σ', second+1, len(text))

ValueError: substring not found

In the str class there is a similar method find that behaves similarly to index but returns -1 when a substring is not found:

In [59]: text.find('σ')
Out[59]: 5

In [60]: text.find('σ', first+1, len(text))
Out[60]: 11

In [61]: text.find('σ', second+1, len(text))
Out[61]: -1
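
Because find returns -1 rather than raising, it is convenient for collecting every occurrence in a loop. The helper find_all below is a hypothetical name used for illustration:

```python
def find_all(string, sub):
    """Return a tuple of every index where sub starts in string (illustrative helper)."""
    indexes = []
    start = string.find(sub)
    while start != -1:
        indexes.append(start)
        # Resume the search one position after the previous match.
        start = string.find(sub, start + 1)
    return tuple(indexes)


text = 'Γεια σου Κοσμο!'
print(find_all(text, 'σ'))      # (5, 11)
print(find_all(text, 'σου'))    # (5,)
print(find_all(text, 'xyz'))    # ()
```

The same loop written with index would need a try and except ValueError block to detect the end of the search.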

index and find search from left to right and have the counterparts rindex and rfind, which search from right to left:

In [62]: text.rindex('σ')
Out[62]: 11

Once again, these only differ when the substring is not found: rindex raises a ValueError, whereas rfind returns -1.

The replace method replaces an old substring with a new substring and returns a new str with the changes. By default every occurrence of the old substring is replaced. If a count is specified, only that many replacements are made, starting from the left; for example a count of 1 makes only the first replacement:

In [63]: text
Out[63]: 'Γεια σου Κοσμο!'

In [64]: text.replace('Γεια', 'Γϵια')
Out[64]: 'Γϵια σου Κοσμο!'

In [65]: text.replace('σ', 'ς')
Out[65]: 'Γεια ςου Κοςμο!'

In [66]: text.replace('σ', 'ς', 1)
Out[66]: 'Γεια ςου Κοσμο!'

The bytes class behaves similarly:

In [67]: text_b
Out[67]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'

In [68]: dict(enumerate(text_b))
Out[68]: 
{0: 206,
 1: 147,
 2: 206,
 3: 181,
 4: 206,
 5: 185,
 6: 206,
 7: 177,
 8: 32,
 9: 207,
 10: 131,
 11: 206,
 12: 191,
 13: 207,
 14: 133,
 15: 32,
 16: 206,
 17: 154,
 18: 206,
 19: 191,
 20: 207,
 21: 131,
 22: 206,
 23: 188,
 24: 206,
 25: 191,
 26: 33}

In [69]: text_b.count(bytes((207, 131)))
Out[69]: 2

In [70]: bytes((207, 131))
Out[70]: b'\xcf\x83'

In [71]: text_b.index(bytes((207, 131)))
Out[71]: 9

In [72]: text_b.count(207)
Out[72]: 3

In [73]: text_b.index(207)
Out[73]: 9

In [74]: text_b.index(207, 9+1, len(text_b))
Out[74]: 13

The str class has the following Collection based binary operators:

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                     : Implements string concatenation (`+`).
#     - __mul__(self, value, /)                     : Implements string repetition (`*`).
#     - __rmul__(self, value, /)                    : Implements reflected multiplication (`*`).

The data model method __add__ defines the behaviour of the + operator and performs str concatenation:

In [75]: text
Out[75]: 'Γεια σου Κοσμο!'
In [76]: ascii_text
Out[76]: 'Hello World!'
In [77]: text + ascii_text # text.__add__(ascii_text)
Out[77]: 'Γεια σου Κοσμο!Hello World!'

Notice that no space is added. If a space is desired, it can also be concatenated:

In [78]: text + ' ' + ascii_text
Out[78]: 'Γεια σου Κοσμο! Hello World!'

The data model method __mul__ defines the behaviour of the * operator and performs str replication with an int instance:

In [79]: text * 3 # text.__mul__(3)
Out[79]: 'Γεια σου Κοσμο!Γεια σου Κοσμο!Γεια σου Κοσμο!'

The reflected data model method __rmul__ gives instructions for when the positions of the str instance and int instance around the operator are swapped:

In [80]: 3 * text # (3).__mul__(text) returns NotImplemented
                  # text.__rmul__(3) 
Out[80]: 'Γεια σου Κοσμο!Γεια σου Κοσμο!Γεια σου Κοσμο!'
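
The fallback from __mul__ to __rmul__ can be sketched with a small user-defined class. Repeater is a hypothetical class introduced only for illustration; it wraps a str and is not part of the standard library:

```python
class Repeater:
    """Hypothetical class wrapping a str to illustrate __mul__ and __rmul__."""

    def __init__(self, value):
        self.value = value

    def __mul__(self, count):
        # Handles Repeater * int.
        if not isinstance(count, int):
            return NotImplemented
        return Repeater(self.value * count)

    def __rmul__(self, count):
        # Handles int * Repeater, tried after int.__mul__ returns NotImplemented.
        return self.__mul__(count)

    def __repr__(self):
        return f'Repeater({self.value!r})'


greeting = Repeater('ab')
print(greeting * 3)           # Repeater('ababab')
print(3 * greeting)           # Repeater('ababab')
print((3).__mul__('ab'))      # NotImplemented
```

Returning NotImplemented, rather than raising, is what allows Python to try the reflected method on the other operand.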

The bytes class behaves similarly:

In [81]: text_b + ascii_text_b
Out[81]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!Hello World!'
In [82]: text_b * 3
Out[82]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [83]: exit

Instantiation and MutableCollection Properties

The bytes class has a mutable counterpart, the bytearray class. A bytearray instance can be instantiated by casting a bytes instance to a bytearray:

In [1]: text_b = b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [2]: text_b
Out[2]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [3]: text_ba = bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
In [4]: text_ba
Out[4]: bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')

The printed formal representation Out[4] shows the recommended way to instantiate a bytearray is by casting a bytes instance to a bytearray. There is no shorthand way of instantiating this class as it is less commonly used.

All the methods that do not mutate the instance behave consistently with the bytes class:

In [4]: len(text_ba)
Out[4]: 27
In [5]: 207 in text_ba
Out[5]: True
In [6]: text_ba.count(207)
Out[6]: 3
In [7]: text_ba.index(207)
Out[7]: 9

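The builtins function hash returns a hash value for a hashable object. The immutable builtins classes covered so far are hashable, whereas their mutable counterparts are not. Notice that text_b, which is immutable, has a hash value (the number itself varies between Python sessions), but text_ba, which is mutable, is unhashable: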

In [8]: hash(text_b)
Out[8]: -2033065742153678299
In [9]: hash(text_ba)
Traceback (most recent call last):

  Cell In[9], line 1
    hash(text_ba)

TypeError: unhashable type: 'bytearray'
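
One practical consequence is that a bytes instance can be used as a dict key whereas a bytearray cannot. A minimal sketch, where the name lookup is illustrative:

```python
text_b = b'\xce\x93'
text_ba = bytearray(text_b)

# Equal bytes values hash equally within a session, so bytes can be dict keys.
print(hash(text_b) == hash(bytes((206, 147))))   # True
lookup = {text_b: 'Γ'}
print(lookup[bytes((206, 147))])                 # Γ

# A bytearray cannot be a key; cast it to bytes first if a key is needed.
try:
    lookup[text_ba] = 'Γ'
except TypeError as error:
    print(error)                                 # unhashable type: 'bytearray'
print(lookup[bytes(text_ba)])                    # Γ
```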

The data model method __getitem__ can be used to index into an immutable bytes or mutable bytearray.

In [10]: text_b[0]
Out[10]: 206
In [11]: text_ba[0]
Out[11]: 206
In [12]: hex(text_ba[0])
Out[12]: '0xce'

The id function can be used to obtain the identification of an object:

In [13]: id(text_ba)
Out[13]: 1968878586928
In [14]: text_ba
Out[14]: bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')

The data model method __setitem__ defines the behaviour when indexing into a value and using assignment:

In [15]: int('0xcf', base=16)
Out[15]: 207
In [15]: text_ba[0] = 207

Notice because a value is being assigned in In [15] there is no Out[15]. If text_ba is examined, it has been updated in place, and the object id does not change:

In [16]: text_ba
Out[16]: bytearray(b'\xcf\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
In [16]: id(text_ba)
Out[16]: 1968878586928

The data model method __delitem__ defines the behaviour when deleting a value that has been indexed into:

In [18]: del text_ba[0]

Notice there is no Out[18] and instead text_ba is modified in place:

In [19]: text_ba
Out[19]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')

The first byte is missing, and text_ba will no longer decode properly because only one byte of a two-byte character has been deleted. Notice the identification is constant:

In [20]: id(text_ba)
Out[20]: 1968878586928

The mutable method append will append a single byte, represented by an int, to the end of a bytearray:

In [21]: text_ba.append(206) # '\xce'

As this method mutates the instance, it has no return value. text_ba can be seen to be modified in place:

In [22]: text_ba
Out[22]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce')

The mutable method extend can be used to extend the bytearray by another bytearray:

In [23]: text_ba.extend(bytearray((177, 206, 177))) # '\xb1\xce\xb1'

Once again this method is mutable and text_ba is modified in place:

In [24]: text_ba
Out[24]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')

The mutable method insert can be used to insert a single byte as an int at an index, for example 148 at index 1:

In [25]: text_ba.insert(1, 148) # '\x94'

Once again this method is mutable and text_ba is modified in place:

In [26]: text_ba
Out[26]: bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')

The mutable method remove can be used to remove the first occurrence of a byte:

In [27]: text_ba.remove(206) # '\xce'

Once again this method is mutable and text_ba is modified in place. The \xce that was at index 2 is gone, and \xb5, which was previously at index 3, is now at index 2:

In [28]: text_ba
Out[28]: bytearray(b'\x93\x94\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')

The mutable method reverse can be used to reverse the order of the bytes in the bytearray:

In [29]: text_ba.reverse()

Once again this method is mutable and text_ba is modified in place:

In [30]: text_ba
Out[30]: bytearray(b'\xb1\xce\xb1\xce!\xbf\xce\xbc\xce\x83\xcf\xbf\xce\x9a\xce \x85\xcf\xbf\xce\x83\xcf \xb1\xce\xb9\xce\xb5\x94\x93')

The mutable method clear will clear each byte from the bytearray:

In [31]: text_ba.clear()

Once again this method is mutable and text_ba is modified in place:

In [30]: text_ba
Out[30]: bytearray(b'')

The mutable method extend can be used to extend this empty bytearray:

In [31]: text_ba.extend(bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce\xb1'))

Most mutable methods have no return value, which distinguishes them clearly from immutable methods, which have a return value. The mutable method pop is an exception because it returns the value popped (by default the last value) and also mutates the bytearray in place:

In [31]: text_ba.pop()
Out[31]: 177 # '\xb1'

In [32]: text_ba
Out[32]: bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce')

An index to pop can be specified:

In [34]: text_ba.pop(1)
Out[34]: 148 # '\x94'

In [32]: text_ba
Out[32]: bytearray(b'\x93\xce\xb5\xce\xb9\xce')

Notice that after all these mutable methods are used the identification of text_ba remains the same:

In [33]: id(text_ba)
Out[33]: 1968878586928

The copy method can be used to create a copy of the bytearray:

In [34]: text_ba2 = text_ba.copy()

Notice the copy has a different identification:

In [35]: id(text_ba2)
Out[35]: 1968878402416

The copies (at present) have equal values but are not the same object:

In [36]: text_ba2 == text_ba
Out[36]: True
In [37]: text_ba2 is text_ba
Out[37]: False
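
The difference between a copy and a second object name for the same bytearray can be sketched as follows. The names alias and duplicate are illustrative:

```python
text_ba = bytearray(b'\x93\xce\xb5')

alias = text_ba                # a second label on the same object
duplicate = text_ba.copy()     # a new object with equal contents

text_ba.append(206)            # mutate through the original label

print(alias is text_ba)        # True
print(alias == text_ba)        # True, the alias sees the appended byte
print(duplicate is text_ba)    # False
print(duplicate == text_ba)    # False, the copy is unaffected
print(len(text_ba), len(duplicate))   # 4 3
```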

The __add__ and __mul__ data model methods behave consistently:

In [38]: text_ba + text_ba2
Out[38]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce')
In [39]: text_ba * 3
Out[39]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce')

However there is a subtle difference when the in-place counterparts are used. For the immutable bytes, two operations take place: concatenation, which returns a new value, and then reassignment. Notice the identification changes, which means the label text_b has been peeled off the old bytes instance with identification 1968877623024 and placed on the new bytes instance with identification 1968877576688:

In [40]: text_b
Out[40]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [41]: id(text_b) 
Out[41]: 1968877623024
In [42]: text_b += b'\xce'
In [43]: text_b
Out[43]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce'
In [44]: id(text_b) 
Out[44]: 1968877576688

Notice for the mutable bytearray that a single operation takes place and the identification remains constant:

In [45]: text_ba
Out[45]: bytearray(b'\x93\xce\xb5\xce\xb9\xce')
In [46]: id(text_ba) 
Out[46]: 1968878586928
In [47]: text_ba += bytearray(b'\xce')
In [48]: text_ba
Out[48]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xce')
In [49]: id(text_ba) 
Out[49]: 1968878586928
In [50]: text_ba *= 2
In [51]: text_ba
Out[51]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xce\x93\xce\xb5\xce\xb9\xce\xce')
In [52]: id(text_ba) 
Out[52]: 1968878586928
In [53]: exit
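
The rebinding versus in-place behaviour can be checked without comparing raw identification numbers. This is a minimal sketch; the names old_label and id_before are illustrative:

```python
text_b = b'\xce\x93'
text_ba = bytearray(text_b)

old_label = text_b             # keep a reference to the original bytes
id_before = id(text_ba)

text_b += b'\xce'              # builds a new bytes instance and rebinds text_b
text_ba += b'\xce'             # bytearray.__iadd__ extends the existing object

print(text_b is old_label)         # False, text_b now labels a new object
print(old_label)                   # b'\xce\x93', the original is unchanged
print(id(text_ba) == id_before)    # True, same bytearray object
print(hasattr(bytes, '__iadd__'), hasattr(bytearray, '__iadd__'))   # False True
```

As bytes defines no __iadd__, the += operator falls back to __add__ followed by reassignment, which matches the two operations described above.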

Formatted Strings

Returning to the str class, the remaining str methods will now be examined. Recall that a str is immutable and all methods therefore return a value, which is commonly another str instance.

It is common to insert an object into a string and format it within the string body, to produce what is known as a formatted string.

Look at the following string body:

In [1]: body = 'The string to 0 is 1 2!'

Supposing there are three str instances:

In [2]: var0 = 'print'
      : var1 = 'hello'
      : var2 = 'world'

The str method format can be used to insert these str instances within the string body. Let's examine the docstring of the str method format:

In [3]: body.format(
# Docstring popup
"""
Docstring:
S.format(*args, **kwargs) -> str

Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method
"""

From the docstring, the string body should contain curly braces, which are used as placeholders to insert a Python object. Each placeholder can be numbered positionally:

In [3]: body = 'The string to {0} is {1} {2}!'

The *args in the docstring indicates a variable number of positional arguments. When inserting multiple object instances into the string body, each positional argument should correspond to a placeholder:

In [4]: body.format(var0, var1, var2)
Out[4]: 'The string to print is hello world!'

The string body can alternatively be set up to contain named placeholders:

In [5]: body = 'The string to {var0_} is {var1_} {var2_}!'

The **kwargs in the docstring indicates a variable number of named arguments, also known as keyword arguments:

In [6]: body.format(var0_=var0, var1_=var1, var2_=var2)
Out[6]: 'The string to print is hello world!'

Combining the above:

In [7]: 'The string to {var0_} is {var1_} {var2_}!'.format(var0_=var0, var1_=var1, var2_=var2)
Out[7]: 'The string to print is hello world!'

It is common for the placeholder to be given the same name as the object name of the object to be inserted:

In [8]: 'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)
Out[8]: 'The string to print is hello world!'

Notice in the above that each object name is repeated three times, which is cumbersome. A shorthand for the expression above is to use the prefix f, which stands for formatted string:

In [9]: f'The string to {var0} is {var1} {var2}!'
Out[9]: 'The string to print is hello world!'

The object data model method __format__ defines the behaviour of the builtins function format:

In [10]: format(
# Docstring popup
"""
Signature: format(object, format_spec, /)
Docstring:
Default object formatter.

Return str(self) if format_spec is empty. Raise TypeError otherwise.
Type:      method_descriptor
"""

Notice there is a format specification format_spec. 's' denotes the format specification for a str instance:

In [11]: format('Hello World!', 's')
Out[11]: 'Hello World!'

If it is prefixed with a number, for instance '22s', this is an instruction for the str instance to occupy a width of 22 within the formatted string. Because the original length was 12, it is followed by 10 trailing spaces:

In [12]: format('Hello World!', '22s')
Out[12]: 'Hello World!          '

Prefixing with a 0 is not common with a str instance and replaces each space with a 0:

In [13]: format('Hello World!', '022s')
Out[13]: 'Hello World!0000000000'

The format specification is placed inside the placeholder after the variable, and the colon : is used to separate the variable from the format specification:

In [14]: f'The string to {var0:s} is {var1} {var2}!'
Out[14]: 'The string to print is hello world!'
In [15]: f'The string to {var0:10s} is {var1} {var2}!'
Out[15]: 'The string to print      is hello world!'
In [16]: f'The string to {var0:010s} is {var1} {var2}!'
Out[16]: 'The string to print00000 is hello world!'
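
To see how the format specification reaches the __format__ data model method, a user-defined class can be sketched. Temperature is a hypothetical class used only for illustration; it delegates the numeric formatting to its float value and appends a unit:

```python
class Temperature:
    """Hypothetical class showing how __format__ receives the format specification."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, format_spec):
        # Delegate the numeric formatting to the float, then append a unit.
        return format(self.celsius, format_spec) + ' °C'


reading = Temperature(21.456)

print(format(reading, '.1f'))      # 21.5 °C
print(f'{reading:08.2f}')          # 00021.46 °C
print('{:.0f}'.format(reading))    # 21 °C
```

The builtins function format, the str method format and a formatted string all pass the text after the colon to the same __format__ method.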

Numeric values are commonly inserted into a string body:

In [17]: num1 = 1
       : num2 = 0.0000123456789
       : num3 = 12.3456789

In [18]: f'The numbers are {num1}, {num2} and {num3}.' 
Out[18]: 'The numbers are 1, 1.23456789e-05 and 12.3456789.'

num1 is an integer and an integer can have various format specifiers. d is used to represent a decimal integer:

In [19]: f'The numbers are {num1:d}, {num2} and {num3}.' 
Out[19]: 
'The numbers are 1, 1.23456789e-05 and 12.3456789.'

The width can also be specified:

In [19]: f'The numbers are {num1:5d}, {num2} and {num3}.' 
Out[19]: 
'The numbers are     1, 1.23456789e-05 and 12.3456789.'

Prefixing this with 0 will display leading zeros:

In [19]: f'The numbers are {num1:05d}, {num2} and {num3}.' 
Out[19]: 
'The numbers are 00001, 1.23456789e-05 and 12.3456789.'

num2 and num3 are float instances and the format specifier f can be used to express each float in the fixed format:

In [20]: for num in range(9, -1, -1):
       :    print('0.'+num*'0'+'123')
       :
       : for num in range(18):
       :    print('123'+num*'0'+'.')
       :
0.000000000123
0.00000000123
0.0000000123
0.000000123
0.00000123
0.0000123
0.000123
0.00123
0.0123
0.123
123.
1230.
12300.
123000.
1230000.
12300000.
123000000.
1230000000.
12300000000.
123000000000.
1230000000000.
12300000000000.
123000000000000.
1230000000000000.
12300000000000000.
123000000000000000.
1230000000000000000.
12300000000000000000.

Typically when the float is very small or very large, scientific notation is used, with the format specifier e. The format specifier g is the general format and uses the fixed format or the exponential format depending on the size of the float:

In [21]: for num in range(9, -1, -1):
       :     print(float('0.'+num*'0'+'123'))
       :
       : for num in range(18):
       :     print(float('123'+num*'0'+'.'))
1.23e-10
1.23e-09
1.23e-08
1.23e-07
1.23e-06
1.23e-05
0.000123
0.00123
0.0123
0.123
123.0
1230.0
12300.0
123000.0
1230000.0
12300000.0
123000000.0
1230000000.0
12300000000.0
123000000000.0
1230000000000.0
12300000000000.0
123000000000000.0
1230000000000000.0
1.23e+16
1.23e+17
1.23e+18
1.23e+19

In [22]: f'The numbers are {num1:g}, {num2:g} and {num3:g}.' 
Out[22]: 'The numbers are 1, 1.23457e-05 and 12.3457.'
In [23]: f'The numbers are {num1:f}, {num2:f} and {num3:f}.' 
Out[23]: 'The numbers are 1.000000, 0.000012 and 12.345679.'
In [24]: f'The numbers are {num1:e}, {num2:e} and {num3:e}.' 
Out[24]: 'The numbers are 1.000000e+00, 1.234568e-05 and 1.234568e+01.'

A width of 10 characters, with 3 characters past the decimal point can be specified:

In [25]: format(num1, '10.3e')
Out[25]: ' 1.000e+00'
In [26]: format(num1, '010.3e')
Out[26]: '01.000e+00'
In [27]: format(num1, '010.2e')
Out[27]: '001.00e+00'

Notice the width includes all the characters used to represent the number as a string such as the decimal point, e and power.

The same modifications can be made in the fixed format:

In [25]: format(num1, '10.3f')
Out[25]: '     1.000'
In [26]: format(num1, '010.3f')
Out[26]: '000001.000'
In [27]: format(num1, '010.2f')
Out[27]: '0000001.00'

In [28]: f'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.' 
Out[28]: 'The numbers are 001, 00.000 and 01.235e+01.'
In [29]: exit
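
When the width and precision are not known in advance, placeholders can be nested inside the format specification of a formatted string. A brief sketch, where the names width and precision are illustrative:

```python
num3 = 12.3456789
width = 10
precision = 3

# Placeholders can be nested inside the format specification of an f-string.
fixed = f'{num3:{width}.{precision}f}'
zero_padded = f'{num3:0{width}.{precision}f}'
exponential = f'{num3:0{width}.{precision}e}'

print(repr(fixed))          # '    12.346'
print(repr(zero_padded))    # '000012.346'
print(repr(exponential))    # '01.235e+01'
```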

Returning to the string body:

In [1]: body = 'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.'

The docstring of the str method format_map can be viewed:

In [2]: body.format_map(
# Docstring popup
"""
Docstring:
S.format_map(mapping) -> str

Return a formatted version of S, using substitutions from mapping.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method
"""

To use this method, all the variables to be incorporated into the formatted string are grouped together in a mapping such as a dict:

In [2]: numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

The str method format_map can then be used to map all the variables from this dict into their placeholders within the string body:

In [3]: body.format_map(numbers)
Out[3]: 'The numbers are 001, 00.000 and 01.235e+01.'

In the str class the data model method __mod__ is defined to implement C-style formatted strings, and controls the behaviour of the % operator:

In [4]: body = 'The numbers are %03d, %06.3f and %0.3g.' 
      : nums = (1, 0.0000123456789, 12.3456789)
In [5]: body % nums
Out[5]: 'The numbers are 001, 00.000 and 12.3.'
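
The three approaches can be compared side by side using a single mapping. This is a minimal sketch; the names with_format, with_format_map and with_percent are illustrative, and the %(name) form of the % operator is used so that the same dict can be supplied:

```python
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

body = 'The numbers are {num1:03d}, {num2:06.3f} and {num3:.3g}.'
with_format = body.format(**numbers)       # unpack the dict into keyword arguments
with_format_map = body.format_map(numbers) # pass the mapping directly

# The % operator also accepts a mapping when placeholders are written as %(name)spec.
with_percent = 'The numbers are %(num1)03d, %(num2)06.3f and %(num3).3g.' % numbers

print(with_format)    # The numbers are 001, 00.000 and 12.3.
print(with_format == with_format_map == with_percent)   # True
```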

Case Methods

The Greek alphabet looks as follows, notice it has uppercase and lowercase letters. Notice also that some characters such as epsilon and sigma have two lowercase variations:

Greek Alphabet

Greek Alphabet	Uppercase	Lower Case
Alpha	Α	α
Beta	Β	β
Gamma	Γ	γ
Delta	Δ	δ
Epsilon	Ε	ε or ϵ
Zeta	Ζ	ζ
Eta	Η	η
Theta	Θ	θ
Iota	Ι	ι
Kappa	Κ	κ
Lambda	Λ	λ
Mu	Μ	μ
Nu	Ν	ν
Xi	Ξ	ξ
Omicron	Ο	ο
Pi	Π	π
Rho	Ρ	ρ
Sigma	Σ	σ or ς
Tau	Τ	τ
Upsilon	Υ	υ
Phi	Φ	φ
Chi	Χ	χ
Psi	Ψ	ψ
Omega	Ω	ω

The str case method upper returns a string where every character is upper case:

In [6]: 'γεια σου κοσμο!'.upper()
Out[6]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'

The str case method capitalize (U.S. spelling with z) returns a string where only the first character is in upper case and the rest of the characters are in lower case:

In [7]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.capitalize()
Out[7]: 'Γεια σου κοσμο!'

The str case method title returns a string where the first character of each word is in upper case and the rest of the characters are in lower case:

In [8]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.title()
Out[8]: 'Γεια Σου Κοσμο!'

The str case method lower returns a string where each character is in lower case:

In [9]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.lower()
Out[9]: 'γεια σου κοσμο!'

The following characters are less common lower case variants of epsilon and sigma. As they are already lower case, they are unchanged when the str method lower is used on them:

In [10]: 'ϵ'.lower()
Out[10]: 'ϵ'
In [11]: 'ς'.lower()
Out[11]: 'ς'

The str case method casefold returns a string where each character is in lower case and transforms any variants to the most common variant:

In [12]: 'ϵ'.casefold()
Out[12]: 'ε'
In [13]: 'ς'.casefold()
Out[13]: 'σ'

The difference between the str methods lower and casefold can be seen in the example below:

In [14]: 'Γϵια ςου Κοςμο!'.lower()
Out[14]: 'γϵια ςου κοςμο!'

In [15]: 'Γϵια ςου Κοςμο!'.casefold()
Out[15]: 'γεια σου κοσμο!'
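This difference matters most when comparing text without regard to case. A minimal sketch, where the function name equal_ignoring_case is a hypothetical helper:

```python
def equal_ignoring_case(left: str, right: str) -> bool:
    """Hypothetical helper: caseless comparison using casefold."""
    return left.casefold() == right.casefold()


variant = 'Γϵια ςου Κοςμο!'  # uses the variant forms ϵ and ς
plain = 'γεια σου κοσμο!'

# lower leaves the variant characters in place, so the strings differ
lower_matches = variant.lower() == plain.lower()

# casefold maps the variants to the common form, so the strings agree
casefold_matches = equal_ignoring_case(variant, plain)

print(lower_matches, casefold_matches)
```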

The str case method swapcase swaps the case of each character in the str:

In [16]: 'Γεια Σου Κοσμο!'.swapcase()
Out[16]: 'γΕΙΑ σΟΥ κΟΣΜΟ!'

Boolean Classification

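The str class has a number of boolean classification methods which return True if every Unicode character in a str satisfies the classification. The case classifications islower and isupper only examine cased characters, so spaces and punctuation are ignored: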

In [17]: 'γεια σου κοσμο!'.islower()
Out[17]: True
In [18]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.islower()
Out[18]: False
In [19]: 'γεια σου κοσμο!'.isupper()
Out[19]: False
In [20]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.isupper()
Out[20]: True

The boolean classification istitle will return True if the str is title case:

In [21]: 'Γεια σου κοσμο!'.istitle()
Out[21]: False
In [22]: 'Γεια Σου Κοσμο!'.istitle()
Out[22]: True

The boolean classification isspace will return True if each character is whitespace. This includes tabs and newlines:

In [23]: ' '.isspace()
Out[23]: True
In [24]: '   '.isspace()
Out[24]: True
In [25]: ' \t\n\r\x0b\x0c'.isspace()
Out[25]: True

The escape character \t represents a tab, \n represents a new line and \r a carriage return. \x0b is the vertical tab and \x0c is the form feed. These are less commonly used and are shown as hexadecimal escape sequences.

The boolean classification isprintable will check to see if every character in the string is printable, i.e. the string contains no non-printable characters such as the ASCII control characters:

In [26]: '\x00'.isprintable()
Out[26]: False
In [27]: 'Γεια σου Κοσμο!'.isprintable()
Out[27]: True

The boolean classification isascii will check to see if every character in the string is an ASCII character:

In [28]: 'Γεια σου Κοσμο!'.isascii()
Out[28]: False
In [29]: 'Hello World!'.isascii()
Out[29]: True

The boolean classification isalpha will check to see if every character in the string is alphabetical. Note this isn't limited to only ASCII alphabetical characters:

In [30]: 'Γεια σου Κοσμο!'.isalpha()
Out[30]: False

In [31]: 'αβγΑΒΓ'.isalpha()
Out[31]: True

In [32]: 'abcABC'.isalpha()
Out[32]: True

There are three numeric classifications and the difference between these can be seen by examining the following numeric groups:

In [33]: numeric_groups = {'ascii': '0123456789', 
                           'font1': '𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿', 
                           'font2': '𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵', 
                           'font3': '𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡', 
                           'subscript': '₀₁₂₃₄₅₆₇₈₉',
                           'superscript': '⁰¹²³⁴⁵⁶⁷⁸⁹',
                           'circled1': '➀➁➂➃➄➅➆➇➈',
                           'circled2': '➉',
                           'fractions': '½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉', 
                           'asciihex': '0123456789abcdef', }

isdecimal is the most restrictive and recognises numeric digits of various different fonts:

In [34]: for key, value in numeric_groups.items():
       :      print(key, value, value.isdecimal())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ False
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ False
circled1 ➀➁➂➃➄➅➆➇➈ False
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False

isdigit recognises more, including subscripts, superscripts and circled digits. However the circled 10 isn't recognised, as it represents a two-digit number rather than a single digit:

In [35]: for key, value in numeric_groups.items():
       :      print(key, value, value.isdigit())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False

isnumeric recognises more including the circled 10 and fractions:

In [36]: for key, value in numeric_groups.items():
       :      print(key, value, value.isnumeric())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef False

isalnum is essentially a combination of the Unicode characters accepted by isalpha and isnumeric:

In [37]: for key, value in numeric_groups.items():
       :      print(key, value, value.isalnum())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef True
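One practical consequence of these classifications is that int accepts strings of Unicode decimal digits (isdecimal) but rejects characters that are only digit or numeric. The standard-library unicodedata module exposes the value attached to each character. A small sketch:

```python
import unicodedata

# A string that satisfies isdecimal can be passed directly to int
decimal_value = int('𝟘𝟙𝟚𝟛')

# A superscript satisfies isdigit but not isdecimal, so int rejects it
try:
    int('²')
    superscript_accepted = True
except ValueError:
    superscript_accepted = False

# unicodedata gives the value associated with each kind of character
digit_value = unicodedata.digit('²')        # digit value of a superscript
fraction_value = unicodedata.numeric('½')   # numeric value of a fraction
circled_ten = unicodedata.numeric('➉')      # numeric value of the circled 10

print(decimal_value, superscript_accepted, digit_value, fraction_value, circled_ten)
```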

The boolean classification isidentifier will check to see if the string is a valid identifier name. Recall that identifiers (object names) cannot begin with a number but can include a number elsewhere, and cannot use spaces or special characters with the exception of the underscore:

In [38]: 'variable'.isidentifier()
Out[38]: True
In [39]: '2variable'.isidentifier()
Out[39]: False
In [40]: 'variable2'.isidentifier()
Out[40]: True
In [41]: 'variable 2'.isidentifier()
Out[41]: False
In [42]: 'variable_2'.isidentifier()
Out[42]: True
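Note that isidentifier also returns True for reserved keywords such as class, because it only checks the form of the name. A sketch that combines it with the standard-library keyword module; the function name is_usable_name is a hypothetical helper:

```python
import keyword


def is_usable_name(name: str) -> bool:
    """Hypothetical helper: valid identifier that is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


keyword_is_identifier = 'class'.isidentifier()

checks = {name: is_usable_name(name)
          for name in ('variable_2', '2variable', 'class', 'κοσμος')}
print(checks)
```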

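The str methods startswith and endswith return True if the str begins or ends with the specified substring respectively.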
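A brief sketch of startswith and endswith follows. Both accept either a single str or a tuple of alternatives, and optional start and end positions:

```python
greeting = 'Γεια σου Κοσμο!'

starts = greeting.startswith('Γεια')
ends = greeting.endswith('!')

# A tuple tests several alternatives at once
is_web_link = 'https://www.domain.com'.startswith(('http://', 'https://'))

# An optional start position restricts the comparison to a slice
starts_at_five = greeting.startswith('σου', 5)

print(starts, ends, is_web_link, starts_at_five)
```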

Alignment Methods

The str alignment methods can be used as an alternative way to format a string. The methods ljust (left justify), rjust (right justify) and center will align a string within a specified width:

In [43]: len('Γεια σου Κοσμο!')
Out[43]: 15

In [44]: 'Γεια σου Κοσμο!'.ljust(20)
Out[44]: 'Γεια σου Κοσμο!     '

In [45]: 'Γεια σου Κοσμο!'.rjust(20)
Out[45]: '     Γεια σου Κοσμο!'

In [46]: 'Γεια σου Κοσμο!'.center(20)
Out[46]: '  Γεια σου Κοσμο!   '

These str alignment methods accept an optional fill character:

In [47]: 'Γεια σου Κοσμο!'.rjust(20, '0')
Out[47]: '00000Γεια σου Κοσμο!'

Right justification with a fill character of 0 is commonly used for numeric strings and is available as the str method zfill (zero fill):

In [48]: '1'.zfill(5)
Out[48]: '00001'

The str method expandtabs replaces each tab with spaces up to the next tab stop. Tab stops occur every tabsize columns and the default value is 8:

In [49]: '\tΓεια σου Κοσμο!'.expandtabs()
Out[49]: '        Γεια σου Κοσμο!'
In [50]: '\tΓεια σου Κοσμο!'.expandtabs(4)
Out[50]: '    Γεια σου Κοσμο!'
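The alignment methods above can be combined to lay out simple text tables. A minimal sketch, in which the rows and column widths are illustrative assumptions:

```python
rows = [('alpha', 1), ('beta', 22), ('gamma', 333)]

lines = []
for name, value in rows:
    # Left-justify the label to 8 characters with a dot leader,
    # then right-justify the number to 5 characters
    lines.append(name.ljust(8, '.') + str(value).rjust(5))

table = '\n'.join(lines)
print(table)
```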

Stripping Methods

The methods lstrip (left strip), rstrip (right strip) and strip remove the leading whitespace, the trailing whitespace, or both from a string by default:

In [51]: '  Γεια σου Κοσμο!   '.lstrip()
Out[51]: 'Γεια σου Κοσμο!   '
In [52]: '  Γεια σου Κοσμο!   '.rstrip()
Out[52]: '  Γεια σου Κοσμο!'
In [53]: '  Γεια σου Κοσμο!   '.strip()
Out[53]: 'Γεια σου Κοσμο!'

Alternatively they can be used to strip a specified character:

In [54]: '00001'.lstrip('0')
Out[54]: '1'

Or any combination of several characters, which are treated as a set of characters rather than as a prefix:

In [55]: '0x01'.lstrip('0x')
Out[55]: '1'

Sometimes it is more useful to use the str methods removeprefix and removesuffix, which will remove only a specified prefix or suffix:

In [56]: '0x01'.removeprefix('0x')
Out[56]: '01'
In [57]: '0x01'.removesuffix('01')
Out[57]: '0x'
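The distinction becomes visible when the characters being stripped also occur immediately after the prefix. A short sketch, assuming Python 3.9 or later for removeprefix:

```python
value = '0x00ff'

# lstrip removes every leading character that is in the set {'0', 'x'}
stripped = value.lstrip('0x')

# removeprefix removes the exact prefix once and nothing more
unprefixed = value.removeprefix('0x')

# When the prefix is absent the string is returned unchanged
untouched = value.removeprefix('0b')

print(stripped, unprefixed, untouched)
```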

Splitting and Joining Methods

The str method split splits a string at each run of whitespace by default, returning a list of str instances. Conceptually this splits a sentence into its words:

In [58]: 'Γεια σου Κοσμο!'.split()
Out[58]: ['Γεια', 'σου', 'Κοσμο!']

This is complemented by the str method join, which joins a list of str instances using the str it is called on as the separator:

In [59]: ' '.join(['Γεια', 'σου', 'Κοσμο!'])
Out[59]: 'Γεια σου Κοσμο!'

A different character can be specified in the split method:

In [60]: 'Γεια σου Κοσμο!'.split('σ')
Out[60]: ['Γεια ', 'ου Κο', 'μο!']

In [61]: 'σ'.join(['Γεια ', 'ου Κο', 'μο!'])
Out[61]: 'Γεια σου Κοσμο!'
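Note that passing a space explicitly is not the same as relying on the default. With no argument, runs of any whitespace are treated as a single separator, whereas an explicit separator is matched literally. A short sketch:

```python
text = 'Γεια  σου\tΚοσμο!'  # two spaces, then a tab

# Default: any run of whitespace separates the words
default_split = text.split()

# Explicit space: each single space is a separator and the tab is not
space_split = text.split(' ')

print(default_split)
print(space_split)
```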

A maximum number of splits can be specified with maxsplit. Here split can be seen to operate on the string from left to right:

In [60]: 'Γεια σου Κοσμο!'.split(maxsplit=1)
Out[60]: ['Γεια', 'σου Κοσμο!']

The counterpart rsplit operates from right to left:

In [61]: 'Γεια σου Κοσμο!'.rsplit(maxsplit=1)
Out[61]: ['Γεια σου', 'Κοσμο!']

When maxsplit isn't specified, rsplit and split behave identically and split is generally preferred.

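The str method splitlines is similar to split with the separator specified as a new line \n. However it also recognises other line boundaries such as \r\n, and it does not return a trailing empty string when the text ends with a line break: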

In [62]: print('Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n')
Γεια σου Κοσμο!
Γεια σου Κοσμο!
Γεια σου Κοσμο!

In [63]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.splitlines()
Out[63]: ['Γεια σου Κοσμο!', 'Γεια σου Κοσμο!', 'Γεια σου Κοσμο!']
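The behaviour of splitlines can be compared with split('\n') on text that mixes line endings. A minimal sketch:

```python
text = 'one\ntwo\r\nthree\n'

# splitlines recognises both \n and \r\n and drops the final empty entry
by_splitlines = text.splitlines()

# split('\n') leaves the carriage return and a trailing empty string
by_split = text.split('\n')

# keepends=True retains the line boundary on each line
with_ends = text.splitlines(keepends=True)

print(by_splitlines)
print(by_split)
print(with_ends)
```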

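The str method partition is similar to split but splits only once, at the first occurrence of the separator, and always returns a three element tuple containing the text before the separator, the separator itself and the text after it: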

In [64]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.partition('\n')
Out[64]: ('Γεια σου Κοσμο!', '\n', 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\n')

partition operates left to right. Its counterpart rpartition operates right to left:

In [65]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.rpartition('\n')
Out[65]: ('Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!', '\n', '')
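Because partition always returns three elements, it is convenient for parsing lines of the form key=value where the value may itself contain the separator. A sketch, in which the function name parse_setting and the input line are hypothetical:

```python
def parse_setting(line: str) -> tuple[str, str]:
    """Hypothetical helper: split 'key=value' at the first '='."""
    key, separator, value = line.partition('=')
    if not separator:
        raise ValueError(f'no "=" found in {line!r}')
    return key.strip(), value.strip()


# Only the first '=' is used, so the '=' inside the value survives
setting = parse_setting('url = https://www.domain.com/?q=1')

# When the separator is absent the last two elements are empty strings
missing = 'no separator here'.partition('=')

print(setting)
print(missing)
```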

The string Module

The string module contains identifiers that are related to string manipulation but are not methods of the str class, mainly constants such as character sets. A design choice was made to compartmentalise these into a separate string module. As a result, all the identifiers of the str class outside of the data model identifiers are callable methods, which return a new value because a str is immutable. Compartmentalising these also keeps the namespace of the str class smaller.

In [1]: import string
In [2]: string.
# -------------------------------
# Available Identifiers for `string` module:
# -------------------------------------

# 🔠 Character Sets:
#     ascii_letters : Concatenation of `ascii_lowercase` and `ascii_uppercase`.
#     ascii_lowercase : Lowercase ASCII letters (`abcdefghijklmnopqrstuvwxyz`).
#     ascii_uppercase : Uppercase ASCII letters (`ABCDEFGHIJKLMNOPQRSTUVWXYZ`).
#     digits : Decimal digit characters (`0123456789`).
#     hexdigits : Hexadecimal digit characters (`0123456789abcdefABCDEF`).
#     octdigits : Octal digit characters (`01234567`).
#     printable : Characters deemed "printable" (`digits`, `ascii_letters`, punctuation, and whitespace).
#     punctuation : String of all ASCII punctuation characters.
#     whitespace : String of all ASCII whitespace characters.

In [3]: string.ascii_letters
Out[3]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [4]: string.ascii_lowercase
Out[4]: 'abcdefghijklmnopqrstuvwxyz'
In [5]: string.ascii_uppercase
Out[5]: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [6]: string.digits # base 10
Out[6]: '0123456789'
In [7]: string.hexdigits # base 16
Out[7]: '0123456789abcdefABCDEF'
In [8]: string.octdigits # base 8
Out[8]: '01234567'
In [9]: string.punctuation
Out[9]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [10]: string.whitespace
Out[10]: ' \t\n\r\x0b\x0c'
In [11]: string.printable
Out[11]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
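These constants are convenient for simple membership checks. A small sketch, where the function name is_hex_string is a hypothetical helper:

```python
import string


def is_hex_string(text: str) -> bool:
    """Hypothetical helper: True if text is non-empty and only hex digits."""
    return bool(text) and all(char in string.hexdigits for char in text)


# The composite constants are concatenations of the simpler ones
letters_match = string.ascii_letters == string.ascii_lowercase + string.ascii_uppercase
printable_match = string.printable == (string.digits + string.ascii_letters
                                       + string.punctuation + string.whitespace)

print(is_hex_string('03b1'), is_hex_string('0x3b1'), letters_match, printable_match)
```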

Translation

The str method maketrans is a static method that creates a translation table which maps one character to another. A translation table from Greek to Latin letters can be made. To visualise this, it can be cast into a dict:

In [12]: greek2latin = str.maketrans('αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ', 'abgdezhqiklmnxoprstyfcuwABGDEZHQIKLMNXOPRSTYFCUW')

In [13]: greek2latin_as_dict = dict(greek2latin)

Translation Table
Key	Value
945	97
946	98
947	103
948	100
949	101
950	122
951	104
952	113
953	105
954	107
955	108
956	109
957	110
958	120
959	111
960	112
961	114
963	115
964	116
965	121
966	102
967	99
968	117
969	119
913	65
914	66
915	71
916	68
917	69
918	90
919	72
920	81
921	73
922	75
923	76
924	77
925	78
926	88
927	79
928	80
929	82
931	83
932	84
933	89
934	70
935	67
936	85
937	87

Notice the keys and the values are numerical, displayed as int instances. Each int is a Unicode code point. These can be understood better by looking at the int in binary or hexadecimal using the bin and hex functions respectively. The str methods explored above will be used to pad the result so that all the bits or hexadecimal digits are displayed. The chr (character) function returns the Unicode character corresponding to the supplied code point and the ord (ordinal) function performs the inverse operation. The comments note how many bytes each character occupies when encoded as utf-8:

In [14]: bin(945)
Out[14]: '0b1110110001' # 2 bytes 'utf-8'

In [15]: '0b'+bin(945).removeprefix('0b').zfill(16)
Out[15]: '0b0000001110110001'

In [14]: hex(945)
Out[14]: '0x3b1' # 2 bytes 'utf-8'

In [15]: '0x'+hex(945).removeprefix('0x').zfill(4)
Out[15]: '0x03b1'

In [16]: chr(945)
Out[16]: 'α'

In [17]: ord('α')
Out[17]: 945

In [18]: bin(97)
Out[18]: '0b1100001' # 1 byte 'utf-8'

In [19]: '0b'+bin(97).removeprefix('0b').zfill(8)
Out[19]: '0b01100001'

In [20]: hex(97)
Out[20]: '0x61' # 1 byte 'utf-8'

In [21]: '0x'+hex(97).removeprefix('0x').zfill(2)
Out[21]: '0x61'

In [22]: chr(97)
Out[22]: 'a'

In [23]: ord('a')
Out[23]: 97
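The padding above can also be expressed with a format specification, and the utf-8 bytes can be inspected directly with the str method encode. Note that the encoded bytes are derived from the code point but are not the same number. A short sketch:

```python
code_point = ord('α')

# '#018b' keeps the 0b prefix and zero-pads the value to 16 bits
padded_binary = format(code_point, '#018b')

# '#06x' keeps the 0x prefix and zero-pads the value to 4 hex digits
padded_hex = format(code_point, '#06x')

# The utf-8 encoding of the character occupies two bytes
utf8_bytes = 'α'.encode('utf-8')
utf8_length = len(utf8_bytes)

print(padded_binary, padded_hex, utf8_bytes, utf8_length)
```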

The str method translate can use this translation table to convert characters from the Greek to the Latin alphabet:

In [24]: 'Γεια σου Κοσμο!'.translate(greek2latin)
Out[24]: 'Geia soy Kosmo!'
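maketrans also accepts a single dict, which allows one character to be mapped to several. The sketch below merges such a dict with a one to one table; the digraph choices are illustrative assumptions rather than a standard transliteration scheme:

```python
greek = 'αβγδεζηθικλμνξοπρστυφχψω'
latin = 'abgdezhqiklmnxoprstyfcuw'

# One to one table, as in the example above (lowercase only)
one_to_one = str.maketrans(greek, latin)

# A dict argument may map a character to a longer string
digraphs = str.maketrans({'θ': 'th', 'χ': 'ch', 'ψ': 'ps'})

# Both tables are dicts keyed by code point, so they can be merged
combined = {**one_to_one, **digraphs}

single = 'θεμα'.translate(one_to_one)
transliterated = 'θεμα'.translate(combined)

print(single, transliterated)
```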

Recall that a static method is not bound to an instance or a class, but is merely found in the class's namespace as it's the expected place for the method to be found.

When the translation table was made, the two strings supplied had to contain an equal number of Unicode characters for 1 to 1 mapping. Sometimes it is desirable to create a translation table that removes characters entirely. In this case an empty string should be supplied for each of the first two positional arguments and the characters that are to be mapped to None should be supplied as a third positional argument. Here these are the punctuation characters, which are available as string.punctuation:

In [25]: remove_punctuation = str.maketrans('', '', string.punctuation)
In [26]: remove_punctuation_as_dict = dict(remove_punctuation)

remove_punctuation_as_dict
Key	Value
33	None
34	None
35	None
36	None
37	None
38	None
39	None
40	None
41	None
42	None
43	None
44	None
45	None
46	None
47	None
58	None
59	None
60	None
61	None
62	None
63	None
64	None
91	None
92	None
93	None
94	None
95	None
96	None
123	None
124	None
125	None
126	None

And this can be used to remove the punctuation, in combination with a casefold and split to get a list of lowercase words:

In [27]: 'Γεια σου Κοσμο!'.translate(remove_punctuation).casefold().split()
Out[27]: ['γεια', 'σου', 'κοσμο']

This can be used to count the number of occurrences of each word using a collection such as a Counter and the top words can be examined:

In [28]: from collections import Counter
In [29]: Counter(['γεια', 'σου', 'κοσμο'])
Out[29]: Counter({'γεια': 1, 'σου': 1, 'κοσμο': 1})
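With a slightly longer piece of text the counts become more informative. A minimal sketch in which the sample text is an illustrative assumption; the Counter method most_common returns the words with the highest counts:

```python
import string
from collections import Counter

remove_punctuation = str.maketrans('', '', string.punctuation)
text = 'Γεια σου Κοσμο! Γεια σου. Γεια!'

# Remove punctuation, normalise the case and split into words
words = text.translate(remove_punctuation).casefold().split()

word_counts = Counter(words)
top_two = word_counts.most_common(2)

print(word_counts)
print(top_two)
```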

This is essentially the starting point for many natural language processing problems. A natural language toolkit in English would filter out stop words:

In [30]: stop_words = ['a', 'an', 'the', 'at', 'by', 'for', 
                       'in', 'of', 'on', 'to', 'he', 'she', 
                       'it', 'they', 'we', 'you', 'I', 'me', 'my',
                       'your', 'and', 'but', 'or', 'so', 'yet', 'is', 
                       'am', 'are', 'was', 'were', 'be', 'being', 'been', 
                       'have', 'has', 'had', 'do', 'does', 'did', 'not', 
                       'this', 'that', 'these', 'those', 'all', 'any', 
                       'some', 'such'
                      ]

And would usually examine words that carry sentiment:

In [31]: sentiment_dict = {'positive': ['happy', 'joyful', 'love', 
                                        'excellent', 'great', 'fantastic', 
                                        'amazing', 'wonderful', 'cheerful', 
                                        'positive'],
                           'negative': ['sad', 'hate', 'terrible', 'awful', 
                                        'bad', 'horrible', 'disappointing', 
                                        'angry', 'frustrated', 'negative'],
                           'neutral': ['okay', 'fine', 'average', 'normal', 
                                       'medium','fair', 'indifferent', 
                                       'moderate', 'tolerable', 'usual']}

A natural language problem would essentially take a piece of text and convert it into a number, for example a score that can be evaluated across a large number of product reviews.
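The idea can be sketched with the str methods covered in this tutorial. The word lists below are abbreviated, the function name sentiment_score is hypothetical, and counting positive words minus negative words is a deliberately naive scoring rule rather than how a real toolkit works:

```python
import string

# Abbreviated, illustrative word lists (assumptions, not a real lexicon)
stop_words = {'a', 'the', 'is', 'was', 'and', 'but', 'it', 'this', 'i'}
positive_words = {'happy', 'love', 'excellent', 'great'}
negative_words = {'sad', 'hate', 'terrible', 'awful'}

remove_punctuation = str.maketrans('', '', string.punctuation)


def sentiment_score(review: str) -> int:
    """Count positive words minus negative words after removing stop words."""
    words = review.translate(remove_punctuation).casefold().split()
    content_words = [word for word in words if word not in stop_words]
    positives = sum(word in positive_words for word in content_words)
    negatives = sum(word in negative_words for word in content_words)
    return positives - negatives


reviews = ['This is a great product, I love it!',
           'Terrible. The battery was awful and I hate the screen.']
scores = [sentiment_score(review) for review in reviews]
print(scores)
```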

Some additional translation tables may need to be created to remove accents from accented characters, which casefold doesn't handle.
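One alternative to writing such a table by hand, assuming that canonical decomposition is acceptable for the text in question, is the standard-library unicodedata module. The sketch below decomposes each character and drops the combining marks; the function name strip_accents is a hypothetical helper:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(char for char in decomposed
                   if not unicodedata.combining(char))


unaccented = strip_accents('Γειά σου Κόσμε!')
print(unaccented)
```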

Python has a number of third-party natural language toolkits, which are out of the scope of this tutorial.

The re Module

The str class contains a number of simple methods which allow, for example, a substring to be found within a string. These are complemented by regular expressions. Consider the following str instance text:

In [32]: exit
In [1]: text = 'Email local@domain.com, local2@domain2.co.uk Telephone 0000000000 Website https://www.domain.com'

Notice it has two emails, a telephone number and a website, which you as a human can instantly recognise. Python has a regular expressions module re. The purpose of this module is to create a pattern in the form of a regular expression and to search within a string for this pattern:

In [2]: import re
In [3]: re.
# -------------------------------
# Available Identifiers for `re` module:
# -------------------------------------

## Functions
# - `re.match(pattern, string)`
# - `re.search(pattern, string)`
# - `re.findall(pattern, string)`
# - `re.finditer(pattern, string)`
# - `re.sub(pattern, repl, string)`
# - `re.subn(pattern, repl, string)`
# - `re.split(pattern, string)`
# - `re.compile(pattern, flags=0)`
# - `re.escape(string)`
# - `re.fullmatch(pattern, string)`
# - `re.purge()`

## Flags
# - `re.IGNORECASE`
# - `re.I`
# - `re.MULTILINE`
# - `re.M`
# - `re.DOTALL`
# - `re.S`
# - `re.VERBOSE`
# - `re.X`

## Match Object Methods
# - `match.group([group])`
# - `match.groups()`
# - `match.start([group])`
# - `match.end([group])`
# - `match.span([group])`
# - `match.re`
# - `match.string`
# - `match.lastindex`
# - `match.lastgroup`

## Special Sequences
# - `\d` - Matches any decimal digit.
# - `\D` - Matches any non-digit character.
# - `\w` - Matches any alphanumeric character (and underscore).
# - `\W` - Matches any non-alphanumeric character.
# - `\s` - Matches any whitespace character.
# - `\S` - Matches any non-whitespace character.
# - `\b` - Matches a word boundary.
# - `\B` - Matches a non-word boundary.

## Character Classes
# - `[abc]` - Matches any character in the set.
# - `[^abc]` - Matches any character not in the set.
# - `[a-z]` - Matches any character in the range from a to z.
# - `.` - Matches any character except a newline.

## Groups
# - `(...)` - Capturing group.
# - `(?:...)` - Non-capturing group.
# - `(?P<name>...)` - Named capturing group.
# - `(?=...)` - Positive lookahead.
# - `(?!...)` - Negative lookahead.
# - `(?<=...)` - Positive lookbehind.
# - `(?<!...)` - Negative lookbehind.

Notice the use of \ for a special sequence. Because \ is also the escape character in an ordinary str, a pattern should be supplied as a raw string with the prefix r. Lower case r is preferred as many IDEs will apply syntax highlighting for regular expressions. If upper case R is used, the raw string will still work as a regular expression but the IDE will highlight it in the same way as a normal string:
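The effect of the r prefix can be seen by comparing the same pattern written with and without it. A minimal sketch, using an illustrative sample string:

```python
import re

plain = '\bcat\b'  # Python interprets \b as the backspace character
raw = r'\bcat\b'   # \b reaches the re module as a word boundary

plain_length = len(plain)
raw_length = len(raw)

# The plain pattern looks for literal backspace characters and finds none
plain_matches = re.findall(plain, 'cat concat cat')

# The raw pattern matches 'cat' only as a whole word
raw_matches = re.findall(raw, 'cat concat cat')

print(plain_length, raw_length, plain_matches, raw_matches)
```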

In [3]: email_pattern = r'\b[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
      : number_pattern = r'\b\d{10}\b'
      : website_pattern = r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

Note that inside a character class | is a literal pipe character rather than alternation, so the letters of the top level domain are written as [A-Za-z].

For the email local@domain.com the pattern is r'\b[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'. This pattern can be broken down:

\b is a word boundary at the beginning.
[A-Za-z0-9._]+ is the local component of the email
- [A-Z] # string.ascii_uppercase
- [a-z] # string.ascii_lowercase
- [0-9] # string.digits
- [._] additional characters allowed in the local component
- + used to denote 1 or more characters
@ is the at symbol
[A-Za-z0-9.-]+ is the domain name
- [A-Z] # string.ascii_uppercase
- [a-z] # string.ascii_lowercase
- [0-9] # string.digits
- [.-] additional characters allowed in the domain name
- + used to denote 1 or more characters
\. is the dot ., note that an unescaped . matches any character in a regular expression, so the literal dot is escaped with a backslash.
[A-Za-z]{2,} is the top level domain
- [A-Z] # string.ascii_uppercase
- [a-z] # string.ascii_lowercase
- {2,} two or more characters
\b is a word boundary at the end.

For the number 0000000000 the pattern is r'\b\d{10}\b'. This pattern can be broken down:

\b is a word boundary at the beginning.
\d decimal characters
- {10} ten of them
\b is a word boundary at the end.

For the website https://www.domain.com the pattern is r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'. This pattern can be broken down:

https? is the Hypertext Transfer Protocol (Secured)
- http literal
- s? optional, meaning s (may or may not be present)
:// literal (used to separate the protocol from the address)
(?:www\.)? is the optional www. prefix
- (?:) creates a non-capturing group
- www literal
- \. the literal dot, escaped with a backslash
- ? optional, meaning www. (may or may not be present)
[A-Za-z0-9.-]+ is the domain (same as email)
\.[A-Za-z]{2,} is the dot and the top level domain (same as email)
\b is a word boundary at the end.

The regular expression function findall will return a list of pattern matches:

In [4]: re.findall(email_pattern, text)
Out[4]: ['local@domain.com', 'local2@domain2.co.uk']

In [5]: re.findall(number_pattern, text)
Out[5]: ['0000000000']

In [6]: re.findall(website_pattern, text)
Out[6]: ['https://www.domain.com']

The regular expressions module is very powerful and regular expressions can get quite complicated. A simple demonstration here was used just to show the concept.
