Text Data Types

The object Base Class and Collections Abstract Base Class

Recall from the previous tutorial covering the data model that the object class is the base class of all classes. dir can be used to view a list of its identifiers:

In [1]: dir(object)
Out[1]: [
            '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', 
            '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', 
            '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', 
            '__sizeof__', '__str__', '__subclasshook__'
           ]

The identifiers can also be viewed if object is input followed by a dot .:

In [2]: object.
# -------------------------------
# Available Identifiers for `object`:
# -------------------------------------
#   🔧 Functions:
#     - __init__(self, /, *args, **kwargs)          : Initializes the object.
#     - __new__(*args, **kwargs)                    : Creates a new instance of the class.
#     - __getstate__(self, /)                       : Helper for pickling; returns the object's state.
#     - __dir__(self, /)                            : Default dir() implementation.
#     - __sizeof__(self, /)                         : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                      : Checks for equality with another object.
#     - __ne__(self, value, /)                      : Checks for inequality with another object.
#     - __lt__(self, value, /)                      : Checks if the object is less than another.
#     - __le__(self, value, /)                      : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                      : Checks if the object is greater than another.
#     - __ge__(self, value, /)                      : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                           : Returns a string representation of the object.
#     - __str__(self, /)                            : Returns a string for display purposes.
#     - __format__(self, format_spec, /)            : Returns a formatted string representation of the object.
#     - __hash__(self, /)                           : Returns a hash of the object.
#     - __getattribute__(self, name, /)             : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)           : Sets an attribute on the object.
#     - __delattr__(self, name, /)                  : Deletes an attribute from the object.
#     - __reduce__(self, /)                         : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)            : Similar to __reduce__, with a protocol argument.
#     - __init_subclass__(...)                      : Called when a class is subclassed; default 
#                                                     implementation does nothing.
#     - __subclasshook__(...)                       : Customize issubclass() for abstract classes.
#
#    🔍 Attributes:
#     - __class__                                    : The class of the object.
#     - __doc__                                      : The docstring of the object.
# -------------------------------------

If the str class is now examined, notice that it has many more identifiers:

In [2]: dir(str)
Out[2]: [
          '__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__',
          '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__',
          '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__',
          '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__',
          '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__',
          '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode',
          'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum',
          'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric',
          'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
          'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace',
          'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines',
          'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'
       ]

Because object is the base class, it is present in the method resolution order of the str class:

In [3]: str.mro()
Out[3]: [str, object]

Recall the str class inherits everything from the object class. Some identifiers are redefined in the str class for additional functionality and additional identifiers are supplemented. The method resolution order means that an identifier is looked up in the str class first, so a method redefined in the str class is used in preference to the equivalent method in the object class.
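The lookup order described above can be checked directly. The following is a minimal sketch, not part of the original tutorial: the variable names are illustrative, and the particular identifiers tested are just a small sample.

```python
# Names of the classes in the method resolution order of str.
mro_names = [cls.__name__ for cls in str.mro()]

# Identifiers defined directly on str (as opposed to merely inherited).
defined_in_str = vars(str)

# These are redefined in str, so the str version is found first.
redefined = [name for name in ('__eq__', '__repr__', '__hash__')
             if name in defined_in_str]

# These are not redefined in str, so the lookup falls through to object.
inherited = [name for name in ('__init_subclass__', '__subclasshook__', '__dir__')
             if name not in defined_in_str and name in vars(object)]

print(mro_names)
print(redefined)
print(inherited)
```

vars(str) only lists what the str class defines itself, whereas dir(str) also includes everything inherited from object, which is why the two are compared here.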

The str class follows the design pattern of the immutable Collection abstract base class and therefore behaves as an immutable Collection. When str is input followed by a dot ., the identifiers are typically listed alphabetically. However, the identifiers in the str class are easier to understand when they are grouped by design pattern and purpose:

In [4]: str.
# -------------------------------
# Available Identifiers for `str`:
# -------------------------------------

# 🔧 Functions from `object` (inherited by `str`):
#     - __init__(self, /, *args, **kwargs)          : Initializes the object.
#     - __new__(*args, **kwargs)                    : Creates a new instance of the class.
#     - __getstate__(self, /)                       : Helper for pickling; returns the object's state.
#     - __dir__(self, /)                            : Default dir() implementation.
#     - __sizeof__(self, /)                         : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                      : Checks for equality with another object.
#     - __ne__(self, value, /)                      : Checks for inequality with another object.
#     - __lt__(self, value, /)                      : Checks if the object is less than another.
#     - __le__(self, value, /)                      : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                      : Checks if the object is greater than another.
#     - __ge__(self, value, /)                      : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                           : Returns a string representation of the object.
#     - __str__(self, /)                            : Returns a string for display purposes.
#     - __format__(self, format_spec, /)            : Returns a formatted string representation of the object.
#     - __hash__(self, /)                           : Returns a hash of the object.
#     - __getattribute__(self, name, /)             : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)           : Sets an attribute on the object.
#     - __delattr__(self, name, /)                  : Deletes an attribute from the object.
#     - __reduce__(self, /)                         : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)            : Similar to __reduce__, with a protocol argument.

# 🔍 Attributes from `object`:
#     - __class__                                   : The class of the string.
#     - __doc__                                     : The docstring of the string class.

# 🔧 Collection-Based Methods (from `str` and the Collection ABC):
#     - __contains__(self, key, /)                  : Checks if a substring is in the string (`in`).
#     - __iter__(self, /)                           : Returns an iterator over the string.
#     - __len__(self, /)                            : Returns the length of the string.
#     - __getitem__(self, key, /)                   : Retrieves a character by index (`[]`).
#     - count(self, sub, start=0,                   : Counts the occurrences of a substring.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                   : Returns the index of the first occurrence of a substring.
#             end=9223372036854775807, /) 

# 🔧 Supplementary Collection-Based Methods:
#     - rindex(self, sub, start=0,                  : Returns the highest index of a substring.
#              end=9223372036854775807, /) 
#     - find(self, sub, start=0,                    : Finds the first index of a substring.
#            end=9223372036854775807, /) 
#     - rfind(self, sub, start=0,                   : Finds the highest index of a substring.
#             end=9223372036854775807, /) 
#     - replace(self, old, new, count=-1, /)        : Replaces occurrences of a substring.

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                     : Implements string concatenation (`+`).
#     - __mul__(self, value, /)                     : Implements string repetition (`*`).
#     - __rmul__(self, value, /)                    : Implements reflected multiplication (`*`).

# 🔧 Encoding-Related Methods:
#     - encode(self, /, encoding='utf-8',           : Encodes the string using a specified encoding.
#              errors='strict')

# 🔧 String-Specific Dunder Methods (from `str`):
#     - __getnewargs__(self, /)                     : Supports pickling. Note `str` has no `__bytes__`;
#                                                     use `encode` to convert a string to bytes.

# 🔧 Additional String-Specific Methods (Grouped by Similarity):

# 🔧 Formatting Methods:
#     - format(self, /, *args, **kwargs)            : Formats the string using a format string.
#     - format_map(self, mapping, /)                : Formats the string using a dictionary.
#     - translate(self, table, /)                   : Maps characters using a translation table.
#     - __mod__(self, value, /)                     : Implements C style string formatting using `%`.
#     - __rmod__(self, value, /)                    : Implements reverse C style string formatting using `%`.

# 🅰️ Case-Specific Methods:
#     - lower(self, /)                              : Converts all characters to lowercase.
#     - casefold(self, /)                           : Returns a casefolded version for caseless matching.
#     - upper(self, /)                              : Converts all characters to uppercase.
#     - capitalize(self, /)                         : Capitalizes the first character of the string.
#     - title(self, /)                              : Returns a title-cased version of the string.
#     - swapcase(self, /)                           : Swaps the case of all characters.

# 🔠 Boolean Methods (Grouped by Type):

# Character Classification:
#     - isascii(self, /)                            : Checks if all characters are ASCII.
#     - isalpha(self, /)                            : Checks if the string contains only alphabetic characters.

# Numeric Classification:
#     - isdecimal(self, /)                          : Checks if the string contains only decimal characters.
#     - isdigit(self, /)                            : Checks if the string contains only digits.
#     - isnumeric(self, /)                          : Checks if the string contains only numeric characters.

# Whitespace and Titlecase:
#     - islower(self, /)                            : Checks if all characters are lowercase.
#     - isupper(self, /)                            : Checks if all characters are uppercase.
#     - isspace(self, /)                            : Checks if the string contains only whitespace.
#     - istitle(self, /)                            : Checks if the string is title-cased.
#     - isprintable(self, /)                        : Checks if all characters are printable.
#     - isidentifier(self, /)                       : Checks if the string is a valid Python identifier.

# Starts or Ends With:
#     - startswith(self, prefix, start=0,           : Checks if the string starts with a prefix.
#                  end=9223372036854775807, /) 
#     - endswith(self, suffix, start=0,             : Checks if the string ends with a suffix.
#                end=9223372036854775807, /)   

# 🔄 Manipulation Methods (Grouping Similar Functions):
#     - ljust(self, width, fillchar=' ', /)         : Left-justifies the string in a field of a given width.
#     - rjust(self, width, fillchar=' ', /)         : Right-justifies the string in a field of a given width.
#     - center(self, width, fillchar=' ', /)        : Centers the string in a field of a given width.
#     - zfill(self, width, /)                       : Pads the string with zeros on the left.
#     - expandtabs(self, tabsize=8, /)              : Expands tabs in the string into spaces.

# 🔄 Stripping Methods:
#     - lstrip(self, chars=None, /)                 : Strips leading characters from the string.
#     - rstrip(self, chars=None, /)                 : Strips trailing characters from the string.
#     - strip(self, chars=None, /)                  : Strips leading and trailing characters from the string.
#     - removeprefix(self, prefix, /)               : Removes the specified prefix from the string.
#     - removesuffix(self, suffix, /)               : Removes the specified suffix from the string.

# 🧩 Splitting and Joining:
#     - split(self, sep=None, maxsplit=-1, /)       : Splits the string at occurrences of a separator.
#     - rsplit(self, sep=None, maxsplit=-1, /)      : Splits the string at occurrences of a separator,
#                                                     from the right.
#     - splitlines(self, keepends=False, /)         : Splits the string at line breaks.
#     - join(self, iterable, /)                     : Joins an iterable with the string as a separator.
#     - partition(self, sep, /)                     : Splits the string into a 3-tuple around a separator.
#     - rpartition(self, sep, /)                    : Splits the string into a 3-tuple around a separator,
#                                                     from the right.
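The Collection-based identifiers grouped above can be exercised on a small example. This is a sketch for illustration only; it uses the abstract base classes from the standard library module collections.abc, and the variable names are hypothetical.

```python
from collections.abc import Collection, MutableSequence, Sequence

text = 'Hello World!'

# str satisfies the Collection and Sequence ABCs but not the mutable variant.
checks = {
    'Collection': isinstance(text, Collection),
    'Sequence': isinstance(text, Sequence),
    'MutableSequence': isinstance(text, MutableSequence),
}

length = len(text)            # uses __len__
has_world = 'World' in text   # uses __contains__
first = text[0]               # uses __getitem__
letters = list(text)[:5]      # uses __iter__
n_l = text.count('l')         # number of occurrences of 'l'
first_o = text.index('o')     # index of the first 'o'

print(checks)
print(length, has_world, first, letters, n_l, first_o)
```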

The str is a Collection where each element (fundamental unit) in the Collection is a Unicode character. Because a str is defined in terms of Unicode characters rather than raw bytes, most text-related operations need no explicit encoding or decoding. An encoding is only required when a str is converted to bytes, and Python uses the Unicode Transformation Format-8 (UTF-8) as the default for this conversion, so the user rarely needs to handle the various other translation tables.
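The distinction between a Unicode character and its encoded bytes can be seen with a single character. This is a minimal sketch with illustrative variable names; the character is written as an escape sequence so that its code point is unambiguous.

```python
# Greek small letter alpha, written as an escape so the code point is explicit.
char = '\u03b1'

code_point = ord(char)             # position of the character in Unicode
utf8 = char.encode('utf-8')        # two bytes in UTF-8
utf16 = char.encode('utf-16-le')   # two different bytes in UTF-16 (little endian)

print(char, code_point, hex(code_point))
print(utf8, utf16)

# Decoding with the encoding that produced the bytes restores the character.
print(utf8.decode('utf-8') == char)
print(chr(code_point) == char)
```

The str itself is the same in both cases; only the byte representation produced by encode depends on the encoding chosen.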

Another text datatype is the bytes class. The bytes class is also a Collection where each element in the Collection is a byte. The byte is a logical unit in a computer's memory. It is helpful to conceptualise it as the combination of 8 binary switches:

img_001

Each combination of the 8 switches above corresponds to an int between 0 and 255 inclusive (256 possible values), so the bytes class also has some numeric behaviour. An encoding standard maps each Unicode character to a single byte or to a sequence of multiple bytes. However, unlike the Unicode str, there are a variety of encoding tables, and the bytes Collection must be encoded and decoded using the same encoding table for the text to make sense. Notice that the identifiers in the bytes class are largely consistent with the identifiers in the str class but may behave slightly differently, as they use a different unit in the Collection:

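In [4]: dir(bytes)
Out[4]: [
          '__add__', '__buffer__', '__bytes__', '__class__', '__contains__', '__delattr__',
          '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
          '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__',
          '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__',
          '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__',
          '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center',
          'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'hex', 'index',
          'isalnum', 'isalpha', 'isascii', 'isdigit', 'islower', 'isspace', 'istitle',
          'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition',
          'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition',
          'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
          'title', 'translate', 'upper', 'zfill'
        ]

In [5]: bytes.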
# -------------------------------
# Available Identifiers for `bytes`:
# -------------------------------------

# 🔧 Functions from `object` (inherited by `bytes`):
#     - __init__(self, /, *args, **kwargs)             : Initializes the object.
#     - __new__(*args, **kwargs)                       : Creates a new instance of the class.
#     - __getstate__(self, /)                          : Helper for pickling; returns the object's state.
#     - __dir__(self, /)                               : Default dir() implementation.
#     - __sizeof__(self, /)                            : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                         : Checks for equality with another object.
#     - __ne__(self, value, /)                         : Checks for inequality with another object.
#     - __lt__(self, value, /)                         : Checks if the object is less than another.
#     - __le__(self, value, /)                         : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                         : Checks if the object is greater than another.
#     - __ge__(self, value, /)                         : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                              : Returns a string representation of the object.
#     - __str__(self, /)                               : Returns a string for display purposes.
#     - __format__(self, format_spec, /)               : Returns a formatted string representation of the object.
#     - __hash__(self, /)                              : Returns a hash of the object.
#     - __getattribute__(self, name, /)                : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)              : Sets an attribute on the object.
#     - __delattr__(self, name, /)                     : Deletes an attribute from the object.
#     - __reduce__(self, /)                            : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)               : Similar to __reduce__, with a protocol argument.

# 🔍 Attributes from `object`:
#     - __class__                                      : The class of the bytes object.
#     - __doc__                                        : The docstring of the bytes class.

# 🔧 Collection-Based Methods (from `bytes` and the Collection ABC):
#     - __contains__(self, key, /)                     : Checks if a byte value is in the bytes (`in`).
#     - __iter__(self, /)                              : Returns an iterator over the bytes.
#     - __len__(self, /)                               : Returns the length of the bytes.
#     - __getitem__(self, key, /)                      : Retrieves a byte by index (`[]`).
#     - count(self, sub, start=0,                      : Counts the occurrences of a sub-byte sequence.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                      : Returns the index of the first occurrence of a sub-byte.
#             end=9223372036854775807, /) 

# 🔧 Supplementary Collection-Based Methods:
#     - rindex(self, sub, start=0,                     : Returns the highest index of a sub-byte sequence.
#              end=9223372036854775807, /) 
#     - find(self, sub, start=0,                       : Finds the index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - rfind(self, sub, start=0,                      : Finds the highest index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - replace(self, old, new, count=-1, /)           : Replaces occurrences of a sub-byte sequence.

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                        : Implements bytes concatenation (`+`).
#     - __mul__(self, value, /)                        : Implements bytes repetition (`*`).
#     - __rmul__(self, value, /)                       : Implements reflected multiplication (`*`).

# 🔧 Encoding-Related Methods:
#     - decode(self, /, encoding='utf-8',              : Decodes the bytes using a specified encoding.
#             errors='strict')

# 🔧 Bytes-Specific Dunder Methods (from `bytes`):
#     - __bytes__(self, /)                             : Returns a copy of the bytes object.
#     - __iter__(self, /)                              : Returns an iterator over the bytes.

# 🔧 Additional Bytes-Specific Methods (Grouped by Similarity):

# 🔧 Formatting and Representation:
#     - hex(self, /)                                   : Returns a string of hexadecimal values.
#     - fromhex(string, /)                             : Creates a `bytes` object from a hexadecimal string.
#     - __mod__(self, value, /)                        : Implements C-style formatting using `%`.
#     - __rmod__(self, value, /)                       : Implements reverse C-style formatting using `%`.

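# 🅰️ Case-Specific Methods (ASCII Only, Each Returns a New `bytes` Instance):
#     - lower(self, /)                                 : Converts ASCII characters to lowercase.
#     - upper(self, /)                                 : Converts ASCII characters to uppercase.
#     - capitalize(self, /)                            : Capitalizes the first byte and lowercases the rest.
#     - title(self, /)                                 : Converts to title case.
#     - swapcase(self, /)                              : Swaps the case of ASCII characters.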

# 🔠 Boolean Methods (Data Validation):
#     - isalnum(self, /)                               : Checks if all bytes are alphanumeric.
#     - isalpha(self, /)                               : Checks if all bytes are alphabetic.
#     - isascii(self, /)                               : Checks if all bytes are ASCII.
#     - isdigit(self, /)                               : Checks if all bytes are digits.
#     - islower(self, /)                               : Checks if all bytes are lowercase alphabetic.
#     - isupper(self, /)                               : Checks if all bytes are uppercase alphabetic.
#     - isspace(self, /)                               : Checks if all bytes are whitespace.
#     - startswith(self, prefix, start=0,              : Checks if starts with a prefix.
#                 end=9223372036854775807, /) 
#     - endswith(self, suffix, start=0,                : Checks if ends with a suffix.
#               end=9223372036854775807, /)   

# 🔄 Manipulation Methods (Grouping Similar Functions):
#     - ljust(self, width, fillchar=b' ', /)           : Left-justifies in a field of a given width.
#     - rjust(self, width, fillchar=b' ', /)           : Right-justifies in a field of a given width.
#     - center(self, width, fillchar=b' ', /)          : Centers in a field of a given width.
#     - zfill(self, width, /)                          : Pads with zeros on the left.
#     - expandtabs(self, tabsize=8, /)                 : Expands tabs into spaces.

# 🔄 Stripping Methods:
#     - lstrip(self, bytes=None, /)                    : Strips leading bytes from the bytes object.
#     - rstrip(self, bytes=None, /)                    : Strips trailing bytes from the bytes object.
#     - strip(self, bytes=None, /)                     : Strips leading and trailing bytes from the bytes object.

# 🧩 Splitting and Joining:
#     - split(self, sep=None, maxsplit=-1, /)          : Splits at occurrences of a separator.
#     - rsplit(self, sep=None, maxsplit=-1, /)         : Splits at occurrences of a separator, from the right.
#     - splitlines(self, keepends=False, /)            : Splits at line breaks.
#     - join(self, iterable_of_bytes, /)               : Joins an iterable with bytes as a separator.
#     - partition(self, sep, /)                        : Splits into a 3-tuple around a separator.
#     - rpartition(self, sep, /)                       : Splits into a 3-tuple around a separator, from the right.
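The numeric behaviour of bytes, and the need to decode with the same encoding table that was used to encode, can be sketched as follows. The variable names are illustrative, and latin-1 is chosen only as an example of a mismatched table.

```python
data = 'Γεια'.encode('utf-8')

values = list(data)      # iterating over bytes yields int values
first = data[0]          # indexing a single position returns an int
first_two = data[:2]     # slicing returns a bytes instance

# Each element must fit in one byte, so 256 is rejected.
try:
    bytes([256])
    out_of_range = False
except ValueError:
    out_of_range = True

same_table = data.decode('utf-8')      # matches the table used to encode
other_table = data.decode('latin-1')   # a different table gives garbled text

print(values)
print(first, first_two)
print(out_of_range)
print(same_table)
print(repr(other_table))
```

Decoding with latin-1 does not raise an error here because that table assigns a character to every byte value; the result is simply not the original text.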

The str and bytes classes are immutable, which essentially means that no method modifies an instance in place; every method other than the constructor returns a new instance (of the same class or of a different class). The bytes class has a mutable counterpart, the bytearray, which has additional methods that mutate the bytearray in place:

In [5]: bytearray.
# -------------------------------
# Available Identifiers for `bytearray`:
# -------------------------------------

# 🔧 Functions from `object` (inherited by `bytearray`):
#     - __init__(self, /, *args, **kwargs)             : Initializes the object.
#     - __new__(*args, **kwargs)                       : Creates a new instance of the class.
#     - __getstate__(self, /)                          : Helper for pickling; returns the object's state.
#     - __dir__(self, /)                               : Default dir() implementation.
#     - __sizeof__(self, /)                            : Returns the size of the object in memory, in bytes.
#     - __eq__(self, value, /)                         : Checks for equality with another object.
#     - __ne__(self, value, /)                         : Checks for inequality with another object.
#     - __lt__(self, value, /)                         : Checks if the object is less than another.
#     - __le__(self, value, /)                         : Checks if the object is less than or equal to another.
#     - __gt__(self, value, /)                         : Checks if the object is greater than another.
#     - __ge__(self, value, /)                         : Checks if the object is greater than or equal to another.
#     - __repr__(self, /)                              : Returns a string representation of the object.
#     - __str__(self, /)                               : Returns a string for display purposes.
#     - __format__(self, format_spec, /)               : Returns a formatted string representation of the object.
#     - __hash__                                       : Set to None; a bytearray is mutable and therefore unhashable.
#     - __getattribute__(self, name, /)                : Gets an attribute from the object.
#     - __setattr__(self, name, value, /)              : Sets an attribute on the object.
#     - __delattr__(self, name, /)                     : Deletes an attribute from the object.
#     - __reduce__(self, /)                            : Prepares the object for pickling.
#     - __reduce_ex__(self, protocol, /)               : Similar to __reduce__, with a protocol argument.

# 🔍 Attributes from `object`:
#     - __class__                                      : The class of the bytearray object.
#     - __doc__                                        : The docstring of the bytearray class.

# 🔧 Collection-Based Methods (from `bytearray` and the Collection ABC):
#     - __contains__(self, key, /)                     : Checks if a byte value is in the bytearray (`in`).
#     - __iter__(self, /)                              : Returns an iterator over the bytearray.
#     - __len__(self, /)                               : Returns the length of the bytearray.
#     - __getitem__(self, key, /)                      : Retrieves a byte by index (`[]`).
#     - count(self, sub, start=0,                      : Counts the occurrences of a sub-byte sequence.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                      : Returns the index of the first occurrence of a sub-byte.
#             end=9223372036854775807, /) 

# 🔧 Supplementary Collection-Based Methods:
#     - rindex(self, sub, start=0,                     : Returns the highest index of a sub-byte sequence.
#              end=9223372036854775807, /) 
#     - find(self, sub, start=0,                       : Finds the lowest index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - rfind(self, sub, start=0,                      : Finds the highest index of a sub-byte sequence.
#            end=9223372036854775807, /)  
#     - replace(self, old, new, count=-1, /)           : Replaces occurrences of a sub-byte sequence.

# 🔧 Mutable Collection-Specific Methods:
#     - __setitem__(self, key, value, /)               : Assigns a value to an item (`[] =`).
#     - __delitem__(self, key, /)                      : Deletes an item from the bytearray.
#     - append(self, item, /)                          : Appends a byte to the end of the bytearray.
#     - extend(self, iterable_of_bytes, /)             : Appends multiple bytes to the bytearray.
#     - insert(self, index, item, /)                   : Inserts a byte at a specific position.
#     - pop(self, index=-1, /)                         : Removes and returns a byte at a given index.
#     - remove(self, value, /)                         : Removes the first occurrence of a value.
#     - clear(self, /)                                 : Removes all bytes from the bytearray.
#     - reverse(self, /)                               : Reverses the order of bytes in place.

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                        : Implements bytearray concatenation (`+`).
#     - __mul__(self, value, /)                        : Implements bytearray repetition (`*`).
#     - __rmul__(self, value, /)                       : Implements reflected multiplication (`*`).

# 🔧 Encoding-Related Methods:
#     - decode(self, /, encoding='utf-8',              : Decodes the bytearray using a specified encoding.
#             errors='strict')

# 🔧 Bytes-Specific Dunder Methods (from `bytearray`):
#     - __bytes__(self, /)                             : Returns a bytes object copy of the bytearray.
#     - __iter__(self, /)                              : Returns an iterator over the bytearray.

# 🔧 Additional Bytearray-Specific Methods (Grouped by Similarity):

# 🔧 Formatting and Representation:
#     - hex(self, /)                                   : Returns a string of hexadecimal values.
#     - fromhex(string, /)                             : Creates a `bytearray` object from a hexadecimal string.
#     - __mod__(self, value, /)                        : Implements C-style formatting using `%`.
#     - __rmod__(self, value, /)                       : Implements reverse C-style formatting using `%`.

# 🅰️ Case-Specific Methods:
#     - lower(self, /)                                 : Converts to lowercase.
#     - upper(self, /)                                 : Converts to uppercase.
#     - capitalize(self, /)                            : Capitalizes the first byte.
#     - title(self, /)                                 : Converts to title case.
#     - swapcase(self, /)                              : Swaps case.

# 🔠 Boolean Methods (Data Validation):
#     - isalnum(self, /)                               : Checks if all bytes are alphanumeric.
#     - isalpha(self, /)                               : Checks if all bytes are alphabetic.
#     - isascii(self, /)                               : Checks if all bytes are ASCII.
#     - isdigit(self, /)                               : Checks if all bytes are digits.
#     - islower(self, /)                               : Checks if all bytes are lowercase alphabetic.
#     - isupper(self, /)                               : Checks if all bytes are uppercase alphabetic.
#     - isspace(self, /)                               : Checks if all bytes are whitespace.
#     - startswith(self, prefix, start=0,              : Checks if starts with a prefix.
#                 end=9223372036854775807, /) 
#     - endswith(self, suffix, start=0,                : Checks if ends with a suffix.
#               end=9223372036854775807, /)   

# 🔄 Manipulation Methods (Grouping Similar Functions):
#     - ljust(self, width, fillchar=b' ', /)           : Left-justifies in a field of a given width.
#     - rjust(self, width, fillchar=b' ', /)           : Right-justifies in a field of a given width.
#     - center(self, width, fillchar=b' ', /)          : Centers in a field of a given width.
#     - zfill(self, width, /)                          : Pads with zeros on the left.
#     - expandtabs(self, tabsize=8, /)                 : Expands tabs into spaces.

# 🔄 Stripping Methods:
#     - lstrip(self, bytes=None, /)                    : Strips leading bytes from the bytearray.
#     - rstrip(self, bytes=None, /)                    : Strips trailing bytes from the bytearray.
#     - strip(self, bytes=None, /)                     : Strips leading and trailing bytes from the bytearray.

# 🧩 Splitting and Joining:
#     - split(self, sep=None, maxsplit=-1, /)          : Splits at occurrences of a separator.
#     - rsplit(self, sep=None, maxsplit=-1, /)         : Splits at occurrences of a separator, from the right.
#     - splitlines(self, keepends=False, /)            : Splits at line breaks.
#     - join(self, iterable_of_bytes, /)               : Joins an iterable with bytearray as a separator.
#     - partition(self, sep, /)                        : Splits into a 3-tuple around a separator.
#     - rpartition(self, sep, /)                       : Splits into a 3-tuple around a separator, from the right.

Instantiation, Encoding and Collection Properties

A str instance can be explicitly instantiated using:

In [5]: exit
In [6]: str('Hello World!')
Out[6]: 'Hello World!'

The return value shows the printed formal representation, which, recall, is the preferred way to initialise a str. Since the str class is the fundamental builtins text class, the preferred way to instantiate a str instance is with a string literal, without explicitly using the str class. A Unicode str can use any Unicode character. In this example Greek letters will be used:

In [7]: 'Γεια σου Κοσμο!'
Out[7]: 'Γεια σου Κοσμο!'
Greek Alphabet
Greek Alphabet Uppercase Lower Case
Alpha Α α
Beta Β β
Gamma Γ γ
Delta Δ δ
Epsilon Ε ε or ϵ
Zeta Ζ ζ
Eta Η η
Theta Θ θ
Iota Ι ι
Kappa Κ κ
Lambda Λ λ
Mu Μ μ
Nu Ν ν
Xi Ξ ξ
Omicron Ο ο
Pi Π π
Rho Ρ ρ
Sigma Σ σ or ς
Tau Τ τ
Upsilon Υ υ
Phi Φ φ
Chi Χ χ
Psi Ψ ψ
Omega Ω ω

The str instances can be assigned to object names:

In [8]: ascii_text = 'Hello World!'
        text = 'Γεια σου Κοσμο!'

And will display in the Variable Explorer:

Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello World!
text str 15 Γεια σου Κοσμο!

Notice the Variable Explorer displays the type and the size of each str. The size is the length, which is the number of Unicode characters in the str.

Another text datatype is the bytes class. Recall the bytes class is a Collection where each element in the Collection is a byte, and a byte can be conceptualised as a combination of 8 switches:

img_001

The bytes class requires an encoding table. The encoding table maps a command to a memory configuration in bytes. One of the first widespread encoding tables was the American Standard Code for Information Interchange (ASCII). A very early generation computer is based on the typewriter. The typewriter has a limited number of commands that control the device. Many of these commands are printable key presses, however there are commands that aren't printable, such as the carriage return and form feed, which need to be used in order to print text out onto a piece of paper:

img_002

Notice the limited number of characters in ASCII are essentially restricted to the English language. Select ASCII Encoding to view all the ASCII characters:

ASCII Encoding
Binary to Character Mapping
Binary Character
0b00000000 NUL (null character)
0b00000001 SOH (start of header)
0b00000010 STX (start of text)
0b00000011 ETX (end of text)
0b00000100 EOT (end of transmission)
0b00000101 ENQ (enquiry)
0b00000110 ACK (acknowledge)
0b00000111 BEL (bell)
0b00001000 BS (backspace)
0b00001001 TAB (horizontal tab)
0b00001010 LF (line feed)
0b00001011 VT (vertical tab)
0b00001100 FF (form feed)
0b00001101 CR (carriage return)
0b00001110 SO (shift out)
0b00001111 SI (shift in)
0b00010000 DLE (data link escape)
0b00010001 DC1 (device control 1)
0b00010010 DC2 (device control 2)
0b00010011 DC3 (device control 3)
0b00010100 DC4 (device control 4)
0b00010101 NAK (negative acknowledge)
0b00010110 SYN (synchronous idle)
0b00010111 ETB (end of transmission block)
0b00011000 CAN (cancel)
0b00011001 EM (end of medium)
0b00011010 SUB (substitute)
0b00011011 ESC (escape)
0b00011100 FS (file separator)
0b00011101 GS (group separator)
0b00011110 RS (record separator)
0b00011111 US (unit separator)
0b00100000 (space)
0b00100001 ! (exclamation mark)
0b00100010 " (double quote)
0b00100011 # (number sign)
0b00100100 $ (dollar sign)
0b00100101 % (percent)
0b00100110 & (ampersand)
0b00100111 ' (single quote)
0b00101000 ( (left parenthesis)
0b00101001 ) (right parenthesis)
0b00101010 * (asterisk)
0b00101011 + (plus)
0b00101100 , (comma)
0b00101101 - (hyphen)
0b00101110 . (period)
0b00101111 / (slash)
0b00110000 0 (digit zero)
0b00110001 1 (digit one)
0b00110010 2 (digit two)
0b00110011 3 (digit three)
0b00110100 4 (digit four)
0b00110101 5 (digit five)
0b00110110 6 (digit six)
0b00110111 7 (digit seven)
0b00111000 8 (digit eight)
0b00111001 9 (digit nine)
0b00111010 : (colon)
0b00111011 ; (semicolon)
0b00111100 < (less than)
0b00111101 = (equal sign)
0b00111110 > (greater than)
0b00111111 ? (question mark)
0b01000000 @ (commercial at)
0b01000001 A (uppercase A)
0b01000010 B (uppercase B)
0b01000011 C (uppercase C)
0b01000100 D (uppercase D)
0b01000101 E (uppercase E)
0b01000110 F (uppercase F)
0b01000111 G (uppercase G)
0b01001000 H (uppercase H)
0b01001001 I (uppercase I)
0b01001010 J (uppercase J)
0b01001011 K (uppercase K)
0b01001100 L (uppercase L)
0b01001101 M (uppercase M)
0b01001110 N (uppercase N)
0b01001111 O (uppercase O)
0b01010000 P (uppercase P)
0b01010001 Q (uppercase Q)
0b01010010 R (uppercase R)
0b01010011 S (uppercase S)
0b01010100 T (uppercase T)
0b01010101 U (uppercase U)
0b01010110 V (uppercase V)
0b01010111 W (uppercase W)
0b01011000 X (uppercase X)
0b01011001 Y (uppercase Y)
0b01011010 Z (uppercase Z)
0b01011011 [ (left square bracket)
0b01011100 \ (backslash)
0b01011101 ] (right square bracket)
0b01011110 ^ (caret)
0b01011111 _ (underscore)
0b01100000 ` (grave accent)
0b01100001 a (lowercase a)
0b01100010 b (lowercase b)
0b01100011 c (lowercase c)
0b01100100 d (lowercase d)
0b01100101 e (lowercase e)
0b01100110 f (lowercase f)
0b01100111 g (lowercase g)
0b01101000 h (lowercase h)
0b01101001 i (lowercase i)
0b01101010 j (lowercase j)
0b01101011 k (lowercase k)
0b01101100 l (lowercase l)
0b01101101 m (lowercase m)
0b01101110 n (lowercase n)
0b01101111 o (lowercase o)
0b01110000 p (lowercase p)
0b01110001 q (lowercase q)
0b01110010 r (lowercase r)
0b01110011 s (lowercase s)
0b01110100 t (lowercase t)
0b01110101 u (lowercase u)
0b01110110 v (lowercase v)
0b01110111 w (lowercase w)
0b01111000 x (lowercase x)
0b01111001 y (lowercase y)
0b01111010 z (lowercase z)
0b01111011 { (left curly brace)
0b01111100 | (vertical bar)
0b01111101 } (right curly brace)
0b01111110 ~ (tilde)

The bytes class can be used to cast a str instance to a bytes instance:

In [9]: bytes(ascii_text)
Traceback (most recent call last):

  Cell In[9], line 1
    bytes(ascii_text)

TypeError: string argument without an encoding

Notice an encoding table needs to be specified:

In [10]: bytes(ascii_text, encoding='ascii')
Out[10]: b'Hello World!'

The print out of the formal representation shows the preferential way of constructing a bytes instance which consists of only ASCII characters. Notice that the prefix b is used to distinguish a bytes object from a str object:

In [11]: b'Hello World!'
Out[11]: b'Hello World!'
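
The cast can also be sketched as a standalone script. The str method encode is used here as the method form of the same conversion, and the names cast_bytes and encoded_bytes are illustrative only:

```python
ascii_text = 'Hello World!'

# Casting with the bytes class requires the encoding to be named.
cast_bytes = bytes(ascii_text, encoding='ascii')

# The str method encode performs the same conversion.
encoded_bytes = ascii_text.encode(encoding='ascii')

print(cast_bytes)                   # b'Hello World!'
print(cast_bytes == encoded_bytes)  # True

# A str and a bytes instance are different types and do not compare equal.
print(ascii_text == cast_bytes)     # False
```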

\ is a special character in a string (str object or bytes object) that is used to insert an escape sequence. For example \t is a tab and \n is a new line (the line feed; on a typewriter, starting a new line needs both a line feed and a carriage return, and the carriage return has its own escape sequence \r):

In [12]: ascii_text = 'Hello\tWorld!'
         b_ascii_text = b'Hello\tWorld!'

Notice that the Variable Explorer will display the printed format with the escape character processed:

Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello    World!
b_ascii_text bytes 12 Hello    World!
text str 15 Γεια σου Κοσμο!
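
A short sketch of the difference between the formal representation and the printed format of a string containing an escape sequence. It only uses the builtins len, repr, print and ord:

```python
ascii_text = 'Hello\tWorld!'

# The escape sequence is stored as a single tab character.
print(len(ascii_text))    # 12

# repr shows the escape sequence, print processes it.
print(repr(ascii_text))
print(ascii_text)

# The tab and the line feed are ASCII control characters.
print(ord('\t'))          # 9
print(ord('\n'))          # 10
```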

Binary is machine readable but humans have problems transcribing a long line of zeros and ones. Therefore it is common to split the 8 bit byte into two 4 bit halves. Each half is represented by a hexadecimal character:

img_003

binary hexadecimal decimal
0b0000 0x0 0
0b0001 0x1 1
0b0010 0x2 2
0b0011 0x3 3
0b0100 0x4 4
0b0101 0x5 5
0b0110 0x6 6
0b0111 0x7 7
0b1000 0x8 8
0b1001 0x9 9
0b1010 0xa 10
0b1011 0xb 11
0b1100 0xc 12
0b1101 0xd 13
0b1110 0xe 14
0b1111 0xf 15
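
The table above can be checked with int literals and the builtins bin, hex and int. This is a minimal sketch and the value 0b10100101 is an arbitrary example byte:

```python
value = 0b1010_0101              # binary literal, the underscore separates the two halves

print(value)                     # 165
print(hex(value))                # 0xa5
print(bin(value))                # 0b10100101

# int parses a string in a given base.
print(int('a5', base=16))        # 165
print(int('10100101', base=2))   # 165

# Split the byte into its two 4 bit halves.
high, low = divmod(value, 16)
print(hex(high), hex(low))       # 0xa 0x5
```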

All the ASCII characters can be reviewed using the three numbering systems binary (base 2 denoted with the prefix 0b), hexadecimal (base 16 denoted with the prefix 0x) and decimal (base 10 standard representation, therefore no prefix). Select ASCII Encoding to view all the ASCII Characters:

ASCII Encoding
Binary Hexadecimal Decimal Character Name
0b00000000 0x00 0 NUL (null)
0b00000001 0x01 1 SOH (start of heading)
0b00000010 0x02 2 STX (start of text)
0b00000011 0x03 3 ETX (end of text)
0b00000100 0x04 4 EOT (end of transmission)
0b00000101 0x05 5 ENQ (enquiry)
0b00000110 0x06 6 ACK (acknowledge)
0b00000111 0x07 7 BEL (bell)
0b00001000 0x08 8 BS (backspace)
0b00001001 0x09 9 HT (horizontal tab)
0b00001010 0x0a 10 LF (line feed)
0b00001011 0x0b 11 VT (vertical tab)
0b00001100 0x0c 12 FF (form feed)
0b00001101 0x0d 13 CR (carriage return)
0b00001110 0x0e 14 SO (shift out)
0b00001111 0x0f 15 SI (shift in)
0b00010000 0x10 16 DLE (data link escape)
0b00010001 0x11 17 DC1 (device control 1)
0b00010010 0x12 18 DC2 (device control 2)
0b00010011 0x13 19 DC3 (device control 3)
0b00010100 0x14 20 DC4 (device control 4)
0b00010101 0x15 21 NAK (negative acknowledgment)
0b00010110 0x16 22 SYN (synchronous idle)
0b00010111 0x17 23 ETB (end of transmission block)
0b00011000 0x18 24 CAN (cancel)
0b00011001 0x19 25 EM (end of medium)
0b00011010 0x1a 26 SUB (substitute)
0b00011011 0x1b 27 ESC (escape)
0b00011100 0x1c 28 FS (file separator)
0b00011101 0x1d 29 GS (group separator)
0b00011110 0x1e 30 RS (record separator)
0b00011111 0x1f 31 US (unit separator)
0b00100000 0x20 32 (space)
0b00100001 0x21 33 ! (exclamation mark)
0b00100010 0x22 34 " (double quote)
0b00100011 0x23 35 # (number sign)
0b00100100 0x24 36 $ (dollar sign)
0b00100101 0x25 37 % (percent)
0b00100110 0x26 38 & (ampersand)
0b00100111 0x27 39 ' (apostrophe)
0b00101000 0x28 40 ( (left parenthesis)
0b00101001 0x29 41 ) (right parenthesis)
0b00101010 0x2a 42 * (asterisk)
0b00101011 0x2b 43 + (plus sign)
0b00101100 0x2c 44 , (comma)
0b00101101 0x2d 45 - (minus sign)
0b00101110 0x2e 46 . (period)
0b00101111 0x2f 47 / (slash)
0b00110000 0x30 48 0 (zero)
0b00110001 0x31 49 1 (one)
0b00110010 0x32 50 2 (two)
0b00110011 0x33 51 3 (three)
0b00110100 0x34 52 4 (four)
0b00110101 0x35 53 5 (five)
0b00110110 0x36 54 6 (six)
0b00110111 0x37 55 7 (seven)
0b00111000 0x38 56 8 (eight)
0b00111001 0x39 57 9 (nine)
0b00111010 0x3a 58 : (colon)
0b00111011 0x3b 59 ; (semicolon)
0b00111100 0x3c 60 < (less than)
0b00111101 0x3d 61 = (equal sign)
0b00111110 0x3e 62 > (greater than)
0b00111111 0x3f 63 ? (question mark)
0b01000000 0x40 64 @ (at sign)
0b01000001 0x41 65 A (capital A)
0b01000010 0x42 66 B (capital B)
0b01000011 0x43 67 C (capital C)
0b01000100 0x44 68 D (capital D)
0b01000101 0x45 69 E (capital E)
0b01000110 0x46 70 F (capital F)
0b01000111 0x47 71 G (capital G)
0b01001000 0x48 72 H (capital H)
0b01001001 0x49 73 I (capital I)
0b01001010 0x4a 74 J (capital J)
0b01001011 0x4b 75 K (capital K)
0b01001100 0x4c 76 L (capital L)
0b01001101 0x4d 77 M (capital M)
0b01001110 0x4e 78 N (capital N)
0b01001111 0x4f 79 O (capital O)
0b01010000 0x50 80 P (capital P)
0b01010001 0x51 81 Q (capital Q)
0b01010010 0x52 82 R (capital R)
0b01010011 0x53 83 S (capital S)
0b01010100 0x54 84 T (capital T)
0b01010101 0x55 85 U (capital U)
0b01010110 0x56 86 V (capital V)
0b01010111 0x57 87 W (capital W)
0b01011000 0x58 88 X (capital X)
0b01011001 0x59 89 Y (capital Y)
0b01011010 0x5a 90 Z (capital Z)
0b01011011 0x5b 91 [ (opening bracket)
0b01011100 0x5c 92 \ (backslash)
0b01011101 0x5d 93 ] (closing bracket)
0b01011110 0x5e 94 ^ (caret)
0b01011111 0x5f 95 _ (underscore)
0b01100000 0x60 96 ` (grave accent)
0b01100001 0x61 97 a (lowercase a)
0b01100010 0x62 98 b (lowercase b)
0b01100011 0x63 99 c (lowercase c)
0b01100100 0x64 100 d (lowercase d)
0b01100101 0x65 101 e (lowercase e)
0b01100110 0x66 102 f (lowercase f)
0b01100111 0x67 103 g (lowercase g)
0b01101000 0x68 104 h (lowercase h)
0b01101001 0x69 105 i (lowercase i)
0b01101010 0x6a 106 j (lowercase j)
0b01101011 0x6b 107 k (lowercase k)
0b01101100 0x6c 108 l (lowercase l)
0b01101101 0x6d 109 m (lowercase m)
0b01101110 0x6e 110 n (lowercase n)
0b01101111 0x6f 111 o (lowercase o)
0b01110000 0x70 112 p (lowercase p)
0b01110001 0x71 113 q (lowercase q)
0b01110010 0x72 114 r (lowercase r)
0b01110011 0x73 115 s (lowercase s)
0b01110100 0x74 116 t (lowercase t)
0b01110101 0x75 117 u (lowercase u)
0b01110110 0x76 118 v (lowercase v)
0b01110111 0x77 119 w (lowercase w)
0b01111000 0x78 120 x (lowercase x)
0b01111001 0x79 121 y (lowercase y)
0b01111010 0x7a 122 z (lowercase z)
0b01111011 0x7b 123 { (left brace)
0b01111100 0x7c 124 | (vertical bar)
0b01111101 0x7d 125 } (right brace)
0b01111110 0x7e 126 ~ (tilde)
0b01111111 0x7f 127 (delete)
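
Any row of the table can be reproduced with the builtin ord and a format specification. The helper name ascii_row is hypothetical and only used for illustration:

```python
def ascii_row(character):
    """Return the binary, hexadecimal and decimal forms of one character."""
    code = ord(character)
    # '#010b' pads to 8 binary digits after the 0b prefix,
    # '#04x' pads to 2 hexadecimal digits after the 0x prefix.
    return f'{code:#010b} {code:#04x} {code}'

print(ascii_row('A'))   # 0b01000001 0x41 65
print(ascii_row(' '))   # 0b00100000 0x20 32
print(ascii_row('~'))   # 0b01111110 0x7e 126
```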

A bytes instance can be represented as a hexadecimal string:

In [13]: b_ascii_text.hex()
Out[13]: '48656c6c6f09576f726c6421'

\x is used to insert a hexadecimal escape character and expects 2 hexadecimal digits:

In [14]: b'\x48\x65\x6c\x6c\x6f\x09\x57\x6f\x72\x6c\x64\x21'
Out[14]: b'Hello\tWorld!'

Notice the formal representation prefers using the printable ASCII character where present over the hexadecimal escape character. If a character is included outwith the ASCII printable character range for example the NUL character at 0x00:

In [15]: b'\x00\x48\x65\x6c\x6c\x6f\x09\x57\x6f\x72\x6c\x64\x21'
Out[15]: b'\x00Hello\tWorld!'

Then it has no printable alternative and this byte therefore remains represented as a hexadecimal escape character.

Notice that ASCII only occupies half the possible values that span over a byte. The remaining values were used regionally in extended ASCII tables:

Extended ASCII Tables
binary hexadecimal decimal latin1 latin2 latin3 latin4 cyrillic arabic greek hebrew turkish nordic thai
0b10000000 0x80 128
0b10000001 0x81 129
0b10000010 0x82 130
0b10000011 0x83 131
0b10000100 0x84 132
0b10000101 0x85 133
0b10000110 0x86 134
0b10000111 0x87 135
0b10001000 0x88 136
0b10001001 0x89 137
0b10001010 0x8a 138
0b10001011 0x8b 139
0b10001100 0x8c 140
0b10001101 0x8d 141
0b10001110 0x8e 142
0b10001111 0x8f 143
0b10010000 0x90 144
0b10010001 0x91 145
0b10010010 0x92 146
0b10010011 0x93 147
0b10010100 0x94 148
0b10010101 0x95 149
0b10010110 0x96 150
0b10010111 0x97 151
0b10011000 0x98 152
0b10011001 0x99 153
0b10011010 0x9a 154
0b10011011 0x9b 155
0b10011100 0x9c 156
0b10011101 0x9d 157
0b10011110 0x9e 158
0b10011111 0x9f 159
0b10100000 0xa0 160 NBSP NBSP NBSP NBSP NBSP NBSP NBSP NBSP NBSP NBSP NBSP
0b10100001 0xa1 161 ¡ Ą Ħ Ą Ё ¡ Ą
0b10100010 0xa2 162 ¢ ˘ ˘ ĸ Ђ ¢ ¢ Ē
0b10100011 0xa3 163 £ Ł £ Ŗ Ѓ £ £ £ Ģ
0b10100100 0xa4 164 ¤ ¤ ¤ ¤ Є ¤ ¤ ¤ Ī
0b10100101 0xa5 165 ¥ Ľ Ĩ Ѕ ¥ ¥ Ĩ
0b10100110 0xa6 166 ¦ Ś Ĥ Ļ І ¦ ¦ ¦ Ķ
0b10100111 0xa7 167 § § § § Ї § § § §
0b10101000 0xa8 168 ¨ ¨ ¨ ¨ Ј ¨ ¨ ¨ Ļ
0b10101001 0xa9 169 © Š İ Š Љ © © © Đ
0b10101010 0xaa 170 ª Ş Ş Ē Њ ͺ × ª Š
0b10101011 0xab 171 « Ť Ğ Ģ Ћ « « « Ŧ
0b10101100 0xac 172 ¬ Ź Ĵ Ŧ Ќ ، ¬ ¬ ¬ Ž
0b10101101 0xad 173 SHY SHY SHY SHY SHY SHY SHY SHY SHY SHY
0b10101110 0xae 174 ® Ž Ž Ў ® ® Ū
0b10101111 0xaf 175 ¯ Ż Ż ¯ Џ ¯ ¯ Ŋ
0b10110000 0xb0 176 ° ° ° ° А ° ° ° °
0b10110001 0xb1 177 ± ą ħ ą Б ± ± ± ą
0b10110010 0xb2 178 ² ˛ ² ˛ В ² ² ² ē
0b10110011 0xb3 179 ³ ł ³ ŗ Г ³ ³ ³ ģ
0b10110100 0xb4 180 ´ ´ ´ ´ Д ΄ ´ ´ ī
0b10110101 0xb5 181 µ ľ µ ĩ Е ΅ µ µ ĩ
0b10110110 0xb6 182 ś ĥ ļ Ж Ά ķ
0b10110111 0xb7 183 · ˇ · ˇ З · · · ·
0b10111000 0xb8 184 ¸ ¸ ¸ ¸ И Έ ¸ ¸ ļ
0b10111001 0xb9 185 ¹ š ı š Й Ή ¹ ¹ đ
0b10111010 0xba 186 º ş ş ē К Ί ÷ º š
0b10111011 0xbb 187 » ť ğ ģ Л ؛ » » » ŧ
0b10111100 0xbc 188 ¼ ź ĵ ŧ М Ό ¼ ¼ ž
0b10111101 0xbd 189 ½ ˝ ½ Ŋ Н ½ ½ ½
0b10111110 0xbe 190 ¾ ž ž О Ύ ¾ ¾ ū
0b10111111 0xbf 191 ¿ ż ż ŋ П ؟ Ώ ¿ ŋ
0b11000000 0xc0 192 À Ŕ À Ā Р ΐ À Ā
0b11000001 0xc1 193 Á Á Á Á С ء Α Á Á
0b11000010 0xc2 194 Â Â Â Â Т آ Β Â Â
0b11000011 0xc3 195 Ã Ă Ã У أ Γ Ã Ã
0b11000100 0xc4 196 Ä Ä Ä Ä Ф ؤ Δ Ä Ä
0b11000101 0xc5 197 Å Ĺ Ċ Å Х إ Ε Å Å
0b11000110 0xc6 198 Æ Ć Ĉ Æ Ц ئ Ζ Æ Æ
0b11000111 0xc7 199 Ç Ç Ç Į Ч ا Η Ç Į
0b11001000 0xc8 200 È Č È Č Ш ب Θ È Č
0b11001001 0xc9 201 É É É É Щ ة Ι É É
0b11001010 0xca 202 Ê Ę Ê Ę Ъ ت Κ Ê Ę
0b11001011 0xcb 203 Ë Ë Ë Ë Ы ث Λ Ë Ë
0b11001100 0xcc 204 Ì Ě Ì Ė Ь ج Μ Ì Ė
0b11001101 0xcd 205 Í Í Í Í Э ح Ν Í Í
0b11001110 0xce 206 Î Î Î Î Ю خ Ξ Î Î
0b11001111 0xcf 207 Ï Ď Ï Ī Я د Ο Ï Ï
0b11010000 0xd0 208 Ð Đ Đ а ذ Π Ğ Ð
0b11010001 0xd1 209 Ñ Ń Ñ Ņ б ر Ρ Ñ Ņ
0b11010010 0xd2 210 Ò Ň Ò Ō в ز Ò Ō
0b11010011 0xd3 211 Ó Ó Ó Ķ г س Σ Ó Ó
0b11010100 0xd4 212 Ô Ô Ô Ô д ش Τ Ô Ô
0b11010101 0xd5 213 Õ Ő Ġ Õ е ص Υ Õ Õ
0b11010110 0xd6 214 Ö Ö Ö Ö ж ض Φ Ö Ö
0b11010111 0xd7 215 × × × × з ط Χ × Ũ
0b11011000 0xd8 216 Ø Ř Ĝ Ø и ظ Ψ Ø Ø
0b11011001 0xd9 217 Ù Ů Ù Ų й ع Ω Ù Ų
0b11011010 0xda 218 Ú Ú Ú Ú к غ Ϊ Ú Ú
0b11011011 0xdb 219 Û Ű Û Û л Ϋ Û Û
0b11011100 0xdc 220 Ü Ü Ü Ü м ά Ü Ü
0b11011101 0xdd 221 Ý Ý Ŭ Ũ н έ İ Ý
0b11011110 0xde 222 Þ Ţ Ŝ Ū о ή Ş Þ
0b11011111 0xdf 223 ß ß ß ß п ί ß ß ฿
0b11100000 0xe0 224 à ŕ à ā р ـ ΰ א à ā
0b11100001 0xe1 225 á á á á с ف α ב á á
0b11100010 0xe2 226 â â â â т ق β ג â â
0b11100011 0xe3 227 ã ă ã у ك γ ד ã ã
0b11100100 0xe4 228 ä ä ä ä ф ل δ ה ä ä
0b11100101 0xe5 229 å ĺ ċ å х م ε ו å å
0b11100110 0xe6 230 æ ć ĉ æ ц ن ζ ז æ æ
0b11100111 0xe7 231 ç ç ç į ч ه η ח ç į
0b11101000 0xe8 232 è č è č ш و θ ט è č
0b11101001 0xe9 233 é é é é щ ى ι י é é
0b11101010 0xea 234 ê ę ê ę ъ ي κ ך ê ę
0b11101011 0xeb 235 ë ë ë ë ы ً λ כ ë ë
0b11101100 0xec 236 ì ě ì ė ь ٌ μ ל ì ė
0b11101101 0xed 237 í í í í э ٍ ν ם í í
0b11101110 0xee 238 î î î î ю َ ξ מ î î
0b11101111 0xef 239 ï ď ï ī я ُ ο ן ï ï
0b11110000 0xf0 240 ð đ đ ِ π נ ğ ð 0
0b11110001 0xf1 241 ñ ń ñ ņ ё ّ ρ ס ñ ņ 1
0b11110010 0xf2 242 ò ň ò ō ђ ْ ς ע ò ō 2
0b11110011 0xf3 243 ó ó ó ķ ѓ σ ף ó ó 3
0b11110100 0xf4 244 ô ô ô ô є τ פ ô ô 4
0b11110101 0xf5 245 õ ő ġ õ ѕ υ ץ õ õ 5
0b11110110 0xf6 246 ö ö ö ö і φ צ ö ö 6
0b11110111 0xf7 247 ÷ ÷ ÷ ÷ ї χ ק ÷ ũ 7
0b11111000 0xf8 248 ø ř ĝ ø ј ψ ר ø ø 8
0b11111001 0xf9 249 ù ů ù ų љ ω ש ù ų 9
0b11111010 0xfa 250 ú ú ú ú њ ϊ ת ú ú
0b11111011 0xfb 251 û ű û û ћ ϋ û û
0b11111100 0xfc 252 ü ü ü ü ќ ό ü ü
0b11111101 0xfd 253 ý ý ŭ ũ § ύ ‎LRM ı ý
0b11111110 0xfe 254 þ ţ ŝ ū ў ώ ‏RLM ş þ
0b11111111 0xff 255 ÿ ˙ ˙ ˙ џ ÿ ĸ

If 0xe5 is examined for example, then notice that it maps to a different character in most of the extended ASCII tables:

binary 0b11100101
hexadecimal 0xe5
decimal 229
latin1 å
latin2 ĺ
latin3 ċ
latin4 å
cyrillic х
arabic م
greek ε
hebrew ו
turkish å
nordic å
thai ๅ

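If inserted as a hexadecimal escape character in a string, notice that it displays as the latin1 character. In a str, \xe5 refers to the Unicode code point U+00E5, and the first 256 Unicode code points match latin1, the most common extended ASCII table: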

In [16]: '\xe5'
Out[16]: 'å'

If it is inserted as a hexadecimal character into a bytes object, notice that it remains a hexadecimal escape sequence:

In [17]: b'\xe5'
Out[17]: b'\xe5'

The bytes object can be decoded to a string when the correct encoding is applied:

In [18]: b'\xe5'.decode('latin1')
Out[18]: 'å'
In [19]: b'\xe5'.decode('latin2')
Out[19]: 'ĺ'
In [20]: b'\xe5'.decode('latin3')
Out[20]: 'ċ'
In [21]: b'\xe5'.decode('latin4')
Out[21]: 'å'
In [22]: b'\xe5'.decode('cyrillic')
Out[22]: 'х'
In [23]: b'\xe5'.decode('arabic')
Out[23]: 'م'
In [24]: b'\xe5'.decode('greek')
Out[24]: 'ε'
In [25]: b'\xe5'.decode('iso8859-9') # turkish
Out[25]: 'å'
In [26]: b'\xe5'.decode('hebrew')
Out[26]: 'ו'
In [27]: b'\xe5'.decode('iso8859-10') # nordic
Out[27]: 'å'
In [28]: b'\xe5'.decode('thai')
Out[28]: 'ๅ'
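
The individual cells above can be condensed into a loop. This is a sketch that uses a subset of the encoding names from the cells above, and the dictionary name decoded is illustrative:

```python
byte = b'\xe5'
encodings = ('latin1', 'latin2', 'latin3', 'greek', 'cyrillic', 'hebrew')

# The same byte decodes to a different character under each table.
decoded = {name: byte.decode(name) for name in encodings}

for name, character in decoded.items():
    print(name, character)
```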

Returning to the str instance text, it can be encoded into a bytes object using the greek encoding table:

In [29]: text
Out[29]: 'Γεια σου Κοσμο!'

In [31]: text.encode(encoding='greek')
Out[31]: b'\xc3\xe5\xe9\xe1 \xf3\xef\xf5 \xca\xef\xf3\xec\xef!'

Notice in the bytes object that each of the printable ASCII characters is displayed using its ASCII character and the non-ASCII characters are displayed using a hexadecimal escape character.

1 byte character encoding was suitable for offline regional computing. However, the advent of the internet resulted in a number of issues. Essentially a computer in Greece would produce content using the greek encoding table, the content would then be read using a computer in the UK with the latin1 encoding table, and the following character substitution would take place:

In [32]: text.encode(encoding='greek').decode(encoding='latin1')
Out[32]: 'Ãåéá óïõ Êïóìï!'
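
The substitution is reversible as long as the wrongly decoded text is left unmodified, because latin1 maps every byte to a character. A minimal sketch, where the names garbled and recovered are illustrative:

```python
text = 'Γεια σου Κοσμο!'

# Encoded with the greek table, then decoded with the latin1 table.
garbled = text.encode(encoding='greek').decode(encoding='latin1')
print(garbled)

# Reversing the two steps recovers the original text.
recovered = garbled.encode(encoding='latin1').decode(encoding='greek')
print(recovered == text)   # True
```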

1 byte (8 bit) encoding allows:

In [33]: 2 ** 8
Out[33]: 256

commands. 2 bytes (16 bits) encoding allows:

In [34]: 2 ** 16
Out[34]: 65536

commands.

The utf-16 standard was produced which includes all the characters seen in the extended ASCII tables:

In [35]: text.encode(encoding='utf-16-be')
Out[35]: b'\x03\x93\x03\xb5\x03\xb9\x03\xb1\x00 \x03\xc3\x03\xbf\x03\xc5\x00 \x03\x9a\x03\xbf\x03\xc3\x03\xbc\x03\xbf\x00!'

The bytes instance can be displayed as a hexadecimal string:

In [36]: text.encode(encoding='utf-16-be').hex()
Out[36]: '039303b503b903b1002003c303bf03c50020039a03bf03c303bc03bf0021'

Let's examine an ASCII character. In utf-16 encoding, the byte corresponding to the ASCII character when ascii encoding is used, is paired with the NULL byte 00 and the 2 bytes are used to encode the character.

utf-16-be is a variant of utf-16 that is Big Endian. Big Endian is typically the way humans count, where the Big (most significant byte 00) is placed before the Little (least significant byte 61):

In [37]: 'a'.encode(encoding='utf-16-be').hex()
Out[37]: '0061'

Intel processors typically use Little Endian where the Little (least significant byte 61) is placed before the Big (most significant byte 00):

In [38]: 'a'.encode(encoding='utf-16-le').hex()
Out[38]: '6100'

There was initially some confusion because of this and therefore the utf-16 encoding without a byte order suffix includes a Byte Order Marker (BOM) as a prefix. When encoding, Python uses the byte order of the machine, which is Little Endian on Intel processors:

In [39]: 'a'.encode(encoding='utf-16').hex()
Out[39]: 'fffe6100'

The BOM can be seen by examination of an empty str:

In [40]: ''.encode(encoding='utf-16').hex()
Out[40]: 'fffe'

A Greek character can also be examined using the utf-16 encoding variants:

In [39]: 'α'.encode(encoding='utf-16-be').hex()
Out[39]: '03b1'
In [40]: 'α'.encode(encoding='utf-16-le').hex()
Out[40]: 'b103'
In [41]: 'α'.encode(encoding='utf-16').hex()
Out[41]: 'fffeb103'
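
The effect of the byte order can be sketched with int.from_bytes and the BOM constants from the standard library codecs module. The variable names be and le are illustrative:

```python
import codecs

be = 'α'.encode(encoding='utf-16-be')
le = 'α'.encode(encoding='utf-16-le')

# The same code point is recovered when each is read with its own byte order.
print(int.from_bytes(be, byteorder='big'))      # 945
print(int.from_bytes(le, byteorder='little'))   # 945
print(ord('α'))                                 # 945

# Decoding with the wrong byte order gives a different character.
print(le.decode(encoding='utf-16-be') == 'α')   # False

# With a BOM prefix the utf-16 codec detects the byte order itself.
print((codecs.BOM_UTF16_BE + be).decode(encoding='utf-16'))
print((codecs.BOM_UTF16_LE + le).decode(encoding='utf-16'))
```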

Some languages such as Chinese use more than 50000 characters and therefore 65536 commands is insufficient to incorporate all Latin and Asian characters with a fixed 2 bytes per character. utf-32 uses 4 bytes (32 bits) encoding which allows:

In [42]: 2 ** 32
Out[42]: 4294967296

commands which is sufficient to cover all characters used in all languages. utf-32 has byte ordering variants:

In [43]: 'a'.encode(encoding='utf-32-be').hex()
Out[43]: '00000061'
In [44]: 'a'.encode(encoding='utf-32-le').hex()
Out[44]: '61000000'
In [45]: 'a'.encode(encoding='utf-32').hex()
Out[45]: 'fffe000061000000'


In [46]: 'α'.encode(encoding='utf-32-be').hex()
Out[46]: '000003b1'
In [47]: 'α'.encode(encoding='utf-32-le').hex()
Out[47]: 'b1030000'
In [48]: 'α'.encode(encoding='utf-32').hex()
Out[48]: 'fffe0000b1030000'

In [49]: '我'.encode(encoding='utf-32-be').hex()
Out[49]: '00006211'
In [50]: '我'.encode(encoding='utf-32-le').hex()
Out[50]: '11620000'
In [51]: '我'.encode(encoding='utf-32').hex()
Out[51]: 'fffe000011620000'

A Unicode character can be inserted into a string using the hexadecimal escape character \U. This expects 8 hexadecimal digits in the format shown by utf-32-be:

In [52]: '\U00000061'
Out[52]: 'a'
In [53]: '\U000003b1'
Out[53]: 'α'
In [54]: '\U00006211'
Out[54]: '我'
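
The 8 hexadecimal digits are the code point of the character, which the builtins ord and chr convert to and from an int. The helper name unicode_escape is hypothetical and only builds the escape sequence as text:

```python
def unicode_escape(character):
    """Return the 8 digit escape sequence of one character as text."""
    return f'\\U{ord(character):08x}'

for character in ('a', 'α', '我'):
    code_point = ord(character)
    # chr is the inverse of ord.
    print(character, code_point, unicode_escape(character), chr(code_point) == character)

# The 4 digit form \u and the 8 digit form \U insert the same character.
print('\u03b1' == '\U000003b1')   # True
```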

In a Python string the \ is an instruction to insert an escape character and \U expects 8 hexadecimal characters. On Windows \ is used as the default directory separator. Therefore \\ has to be used within a file path, where the first \ is an instruction to insert an escape character and the second \ is the escape character to be inserted. The prefix R can be used for a raw string, in which \ is not processed as an escape character. Note upper case R is preferentially used for a raw string and syntax highlighting won't be applied. Lower case r is instead preferentially used for a regular expression and syntax highlighting for a regular expression may be applied:

In [55]: 'C:\\Users\\Philip'
Out[55]: 'C:\\Users\\Philip'

In [56]: R'C:\Users\Philip'
Out[56]: 'C:\\Users\\Philip'

In [57]: r'C:\Users\Philip'
Out[57]: 'C:\\Users\\Philip'

On Windows if a string is used instead of a raw string, the following error message is common:

In [58]: 'C:\Users\Philip'
  Cell In[58], line 1
    'C:\Users\Philip'
    ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
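
A short sketch confirming that the escaped form and the raw form produce the same str. The path is the example path used above:

```python
escaped = 'C:\\Users\\Philip'
raw = R'C:\Users\Philip'

# Both literals create the same 15 characters.
print(escaped == raw)     # True
print(len(raw))           # 15

# print shows each backslash once, the formal representation doubles it.
print(raw)
print(repr(raw))
```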

The number of padding zeros for an ASCII character and the confusion due to the byte order marker resulted in a new standard with a variable byte length per character:

In [59]: 'a'.encode(encoding='utf-8').hex()
Out[59]: '61'
In [60]: 'α'.encode(encoding='utf-8').hex()
Out[60]: 'ceb1'
In [61]: '我'.encode(encoding='utf-8').hex()
Out[61]: 'e68891'
In [62]: '🐱'.encode(encoding='utf-8').hex()
Out[62]: 'f09f90b1'

It is called utf-8 because it is built from 8 bit (1 byte) units and the ASCII characters only occupy 1 byte (8 bits). Greek characters occupy 2 bytes (16 bits), most Asian characters occupy 3 bytes (24 bits) and emojis typically occupy 4 bytes (32 bits).

There is no byte order marker. Under the hood, the leading bits of the first byte state how many bytes the character uses, and every continuation byte begins with 10:

number of bytes binary sequence
1 0b 0aaaaaaa
2 0b 110aaaaa 10aaaaaa
3 0b 1110aaaa 10aaaaaa 10aaaaaa
4 0b 11110aaa 10aaaaaa 10aaaaaa 10aaaaaa

These underlying patterns can be seen when the binary sequence for each of the characters above is examined:

In [63]: '0b'+bin(int('a'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(8)
Out[63]: '0b01100001'
In [64]: '0b'+bin(int('α'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(16)
Out[64]: '0b1100111010110001'
In [65]: '0b'+bin(int('我'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(24)
Out[65]: '0b111001101000100010010001'
In [65]: '0b'+bin(int('🐱'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(32)
Out[65]: '0b11110000100111111001000010110001'
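
The bit patterns in the table can be filled in by hand. The function utf8_encode below is a hypothetical helper written for illustration, not how Python implements the codec, and it assumes a single valid character (surrogate code points, which the builtin codec rejects, are not checked):

```python
def utf8_encode(character):
    """Encode one character to utf-8 by filling the bit patterns by hand."""
    code = ord(character)
    if code < 0x80:           # 1 byte:  0aaaaaaa
        return bytes([code])
    if code < 0x800:          # 2 bytes: 110aaaaa 10aaaaaa
        return bytes([0b11000000 | (code >> 6),
                      0b10000000 | (code & 0b111111)])
    if code < 0x10000:        # 3 bytes: 1110aaaa 10aaaaaa 10aaaaaa
        return bytes([0b11100000 | (code >> 12),
                      0b10000000 | ((code >> 6) & 0b111111),
                      0b10000000 | (code & 0b111111)])
    # 4 bytes: 11110aaa 10aaaaaa 10aaaaaa 10aaaaaa
    return bytes([0b11110000 | (code >> 18),
                  0b10000000 | ((code >> 12) & 0b111111),
                  0b10000000 | ((code >> 6) & 0b111111),
                  0b10000000 | (code & 0b111111)])

for character in ('a', 'α', '我', '🐱'):
    manual = utf8_encode(character)
    # Compare the hand built bytes with the builtin codec.
    print(manual.hex(), manual == character.encode(encoding='utf-8'))
```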

Although utf-8 was designed to not require a BOM, Microsoft produced a version, utf-8-sig, which has the BOM:

In [66]: ''.encode(encoding='utf-8').hex()
Out[66]: ''

In [67]: ''.encode(encoding='utf-8-sig').hex()
Out[67]: 'efbbbf'

In [68]: 'a'.encode(encoding='utf-8-sig').hex()
Out[68]: 'efbbbf61'
In [69]: 'α'.encode(encoding='utf-8-sig').hex()
Out[69]: 'efbbbfceb1'
In [70]: '我'.encode(encoding='utf-8-sig').hex()
Out[70]: 'efbbbfe68891'
In [71]: '🐱'.encode(encoding='utf-8-sig').hex()
Out[71]: 'efbbbff09f90b1'
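
When decoding, the choice between utf-8 and utf-8-sig determines whether the BOM is stripped. A minimal sketch, with data as an illustrative name:

```python
data = 'α'.encode(encoding='utf-8-sig')

# Decoding with utf-8-sig strips the BOM.
print(data.decode(encoding='utf-8-sig') == 'α')      # True

# Decoding with utf-8 keeps the BOM as the character U+FEFF.
print(data.decode(encoding='utf-8') == '\ufeffα')    # True
print(len(data.decode(encoding='utf-8')))            # 2
```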

The bytes class has the alternative constructor fromhex which can be used to construct a bytes instance from a hexadecimal string:

In [72]: bytes.fromhex('61')
Out[72]: b'a'
In [73]: bytes.fromhex('ceb1')
Out[73]: b'\xce\xb1'
In [74]: bytes.fromhex('e68891')
Out[74]: b'\xe6\x88\x91'
In [75]: bytes.fromhex('f09f90b1')
Out[75]: b'\xf0\x9f\x90\xb1'
In [76]: exit
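
The methods hex and fromhex are inverses of one another, which can be sketched as a round trip over the hexadecimal strings used above:

```python
for hex_string in ('61', 'ceb1', 'e68891', 'f09f90b1'):
    data = bytes.fromhex(hex_string)
    # Decode the bytes back to the character and report the number of bytes.
    print(data.decode(encoding='utf-8'), len(data), data.hex() == hex_string)

# fromhex skips whitespace between the pairs of digits.
print(bytes.fromhex('ce b1') == bytes.fromhex('ceb1'))   # True
```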

From now on utf-8 will be used as the default encoding table. The following str instances can be instantiated and encoded to bytes instances:

In [1]: ascii_text = 'Hello World!'
In [2]: text = 'Γεια σου Κοσμο!'
In [3]: ascii_text.encode(encoding='utf-8')
Out[3]: b'Hello World!'
In [4]: text.encode(encoding='utf-8')
Out[4]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'

As ascii_text consists only of printable ASCII characters, the bytes instance returned, which shows the preferred formal representation, displays each byte as its printable ASCII character.

As text contains a mixture of printable ASCII characters and non-ASCII characters, the formal representation displays each byte as its printable ASCII character where applicable and as a hexadecimal escape sequence otherwise. These can be assigned to variables:

In [5]: ascii_text_b = b'Hello World!'
In [6]: text_b = b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'     

These are shown in the Variable Explorer:

Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello World!
ascii_text_b bytes 12 Hello World!
text str 15 Γεια σου Κοσμο!
text_b bytes 27 Γεια σου Κοσμο!

The Variable Explorer in Spyder assumes 'utf-8' encoding for a bytes instance and attempts to display any printable character.

Notice the lengths of text and text_b are different because the element in each class is different. In text_b some of the characters are encoded to multiple bytes.

This can be seen by casting each Collection explicitly to a tuple:

In [7]: text_as_tuple = tuple(text)
In [8]: text_b_as_tuple = tuple(text_b)
Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello World!
ascii_text_b bytes 12 Hello World!
text str 15 Γεια σου Κοσμο!
text_as_tuple tuple 15 ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b bytes 27 Γεια σου Κοσμο!
text_b_as_tuple tuple 27 (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)

If text_as_tuple is expanded, the value at each index can be seen to be a Unicode character because a Unicode character is an element of a str:

text_as_tuple - tuple (15 elements)
Index Value
0 'Γ'
1 'ε'
2 'ι'
3 'α'
4 ' '
5 'σ'
6 'ο'
7 'υ'
8 ' '
9 'Κ'
10 'ο'
11 'σ'
12 'μ'
13 'ο'
14 '!'

If text_b_as_tuple is expanded, the value at each index can be seen to be an int between 0:256:

text_b_as_tuple - tuple (27 elements)
Index Value
0 206
1 147
2 206
3 181
4 206
5 185
6 206
7 177
8 32
9 207
10 131
11 206
12 191
13 207
14 133
15 32
16 206
17 154
18 206
19 191
20 207
21 131
22 206
23 188
24 206
25 191
26 33

Recall a byte is a numeric value between 0:256:

img_001

The binary bin and hexadecimal hex functions can be used to display this int as a binary string or hexadecimal string:

In [9]: text_b_as_tuple_bin = tuple([bin(byte) for byte in text_b_as_tuple])
        text_b_as_tuple_hex = tuple([hex(byte) for byte in text_b_as_tuple])

text_b_as_tuple_bin and text_b_as_tuple_hex display in the Variable Explorer:

Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello World!
ascii_text_b bytes 12 Hello World!
text str 15 Γεια σου Κοσμο!
text_as_tuple tuple 15 ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b bytes 27 Γεια σου Κοσμο!
text_b_as_tuple tuple 27 (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin tuple 27 ('0b11001110', '0b10010011', '0b11001110', '0b10110101', '0b11001110', …)
text_b_as_tuple_hex tuple 27 ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

text_b_as_tuple_bin can be expanded to view each byte in binary:

text_b_as_tuple_bin - tuple (27 elements)
Index Value
0 0b11001110
1 0b10010011
2 0b11001110
3 0b10110101
4 0b11001110
5 0b10111001
6 0b11001110
7 0b10110001
8 0b100000
9 0b11001111
10 0b10000011
11 0b11001110
12 0b10111111
13 0b11001111
14 0b10000101
15 0b100000
16 0b11001110
17 0b10011010
18 0b11001110
19 0b10111111
20 0b11001111
21 0b10000011
22 0b11001110
23 0b10111100
24 0b11001110
25 0b10111111
26 0b100001

text_b_as_tuple_hex can be expanded to view each byte in hexadecimal:

text_b_as_tuple_hex - tuple (27 elements)
Index Value
0 0xce
1 0x93
2 0xce
3 0xb5
4 0xce
5 0xb9
6 0xce
7 0xb1
8 0x20
9 0xcf
10 0x83
11 0xce
12 0xbf
13 0xcf
14 0x85
15 0x20
16 0xce
17 0x9a
18 0xce
19 0xbf
20 0xcf
21 0x83
22 0xce
23 0xbc
24 0xce
25 0xbf
26 0x21

The bytes class can be used to cast a tuple of int values between 0:256 to a bytes instance:

In [10]: bytes((206, 147, 206, 181, 206, 185, 206, 177,  32, 207,
                131, 206, 191, 207, 133,  32, 206, 154, 206, 191,
                207, 131, 206, 188, 206, 191,  33))
Out[10]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
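
The round trip between a str, its bytes and a tuple of int values can be sketched as follows. This is a minimal illustration and the name rebuilt is an assumption introduced only for this example:

```python
text = 'Γεια σου Κοσμο!'

# Encode the str to UTF-8 bytes, then view each byte as an int
text_b = text.encode(encoding='utf-8')
text_b_as_tuple = tuple(text_b)

# Casting the tuple of int values back to bytes recovers the original bytes
rebuilt = bytes(text_b_as_tuple)

print(len(text), len(text_b))            # number of characters, number of bytes
print(rebuilt == text_b)
print(rebuilt.decode(encoding='utf-8'))
```

Each Greek letter occupies two bytes in UTF-8, while the spaces and the exclamation mark occupy one byte each, which is why the two lengths differ.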

Now that the element in each Collection is understood, the following Collection based identifiers can be used:

# 🔧 Collection-Based Methods (from `str` and the Collection ABC):
#     - __contains__(self, key, /)                  : Checks if a substring is in the string (`in`).
#     - __iter__(self, /)                           : Returns an iterator over the string.
#     - __len__(self, /)                            : Returns the length of the string.
#     - __getitem__(self, key, /)                   : Retrieves a character by index (`[]`).
#     - count(self, sub, start=0,                   : Counts the occurrences of a substring.
#             end=9223372036854775807, /) 
#     - index(self, sub, start=0,                   : Returns the index of the first occurrence of a substring.
#             end=9223372036854775807, /) 

The data model method __len__ defines the behaviour of the builtins function len and essentially retrieves the Size shown on the Variable Explorer:

In [11]: len(text) # text.__len__()
Out[11]: 15
In [12]: len(text_b) # text_b.__len__()
Out[12]: 27

The data model method __contains__ defines the behaviour of the in keyword:

In [13]: 'ει' in text # text.__contains__('ει')
Out[13]: True
In [14]: 'ε' in text # text.__contains__('ε')
Out[14]: True
In [15]: bytes((147, 206)) in text_b # text_b.__contains__(bytes((147, 206)))
Out[15]: True
In [16]: 147 in text_b # text_b.__contains__(147)
Out[16]: True

The data model method __getitem__ will retrieve a value at an integer index:

In [17]: text[1] # text.__getitem__(1)
Out[17]: 'ε'

Notice that Python uses zero-order indexing. This means the first index is 0 and the last index is the length of the Collection minus 1:

In [17]: text[0] 
Out[17]: 'Γ'

In [18]: text[len(text)] 
Traceback (most recent call last):

  Cell In[18], line 1
    text[len(text)]

IndexError: string index out of range

In [19]: text[len(text)-1] 
Out[19]: '!'
text Variable Explorer
text_as_tuple - tuple (15 elements)
Index Value
0 'Γ'
1 'ε'
2 'ι'
3 'α'
4 ' '
5 'σ'
6 'ο'
7 'υ'
8 ' '
9 'Κ'
10 'ο'
11 'σ'
12 'μ'
13 'ο'
14 '!'

The builtins class slice takes the same input arguments start, stop[, step] as the builtins class range:

In [20]: slice()
# Docstring popup
"""
Init signature: slice(self, /, *args, **kwargs)
Docstring:     
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:     
"""

range uses zero-order indexing, so it is inclusive of the start bound and exclusive of the stop bound:

In [20]: tuple(range(0, 5, 1))
Out[20]: (0, 1, 2, 3, 4)

In [21]: tuple(range(0, 5)) # default step=1
Out[21]: (0, 1, 2, 3, 4)

In [22]: tuple(range(5)) # default start=0
Out[22]: (0, 1, 2, 3, 4)

slice behaves consistently:

In [23]: text[slice(0, 5, 1)]
Out[23]: 'Γεια '

In [24]: text[slice(0, 5)] # default step=1
Out[24]: 'Γεια '

In [25]: text[slice(5)] # default start=0
Out[25]: 'Γεια '

Essentially the selection runs from index 0 (inclusive) up to index 5 (exclusive):

text Variable Explorer Annotated
text_as_tuple - tuple (15 elements)
Index Value
0 'Γ'
1 'ε'
2 'ι'
3 'α'
4 ' '
5 'σ'
6 'ο'
7 'υ'
8 ' '
9 'Κ'
10 'ο'
11 'σ'
12 'μ'
13 'ο'
14 '!'

The slice instance can be assigned to an object name for the sake of readability:

In [26]: selection = slice(0, 5, 1)
In [27]: text[selection] 
Out[27]: 'Γεια '
Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello World!
ascii_text_b bytes 12 Hello World!
selection slice 1 slice(0, 5, 1)
text str 15 Γεια σου Κοσμο!
text_as_tuple tuple 15 ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b bytes 27 Γεια σου Κοσμο!
text_b_as_tuple tuple 27 (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin tuple 27 ('0b11001110', '0b10010011', '0b11001110', '0b10110101','0b11001110', …)
text_b_as_tuple_hex tuple 27 ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

However normally slicing is done using a colon : instead:

In [28]: text[0:5:1] # text[slice(0, 5, 1)]
Out[28]: 'Γεια '

In [29]: text[0:5] # default step=1
Out[29]: 'Γεια '

In [30]: text[:5] # default start=0
Out[30]: 'Γεια '

Using the notation with the colons is a bit more flexible:

In [31]: text[:2] # default start=0
Out[31]: 'Γε'

If a step of -1 is used, the string is reversed:

In [32]: text[::-1] # negative step, default start=-1
Out[32]: '!ομσοΚ υοσ αιεΓ'

This means the default start is -1 and stop is -len(text)-1 taking into account zero-order indexing:

In [33]: text[-1:-len(text)-1:-1]
Out[33]: '!ομσοΚ υοσ αιεΓ'
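
The defaults that a slice resolves to can be inspected with the slice method indices, which takes the length of the Collection. The sketch below is illustrative and the names reverse and reversed_text are assumptions introduced for this example:

```python
text = 'Γεια σου Κοσμο!'

# slice.indices(length) returns the concrete (start, stop, step)
# that the slice resolves to for a sequence of that length
reverse = slice(None, None, -1)
start, stop, step = reverse.indices(len(text))
print(start, stop, step)

# The returned values are range-style bounds, so they can be passed to range
reversed_text = ''.join(text[index] for index in range(start, stop, step))
print(reversed_text == text[::-1])
```

Note that the stop value returned by indices is a bound for range and not a negative index, so it should not be placed back inside square brackets.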

The bytes instance text_b can now be examined. Notice that indexing a single value returns an int corresponding to the byte:

In [34]: text_b[0]
Out[34]: 206

However, slicing returns a bytes instance:

In [35]: text_b[0:1]
Out[35]: b'\xce'
text_b Variable Explorer
text_b_as_tuple - tuple (27 elements)
Index Value
0 206
1 147
2 206
3 181
4 206
5 185
6 206
7 177
8 32
9 207
10 131
11 206
12 191
13 207
14 133
15 32
16 206
17 154
18 206
19 191
20 207
21 131
22 206
23 188
24 206
25 191
26 33
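
The difference between indexing and slicing a bytes instance can be sketched as follows. The names first_int, first_bytes and zero_filled are assumptions introduced for this example:

```python
text_b = 'Γεια σου Κοσμο!'.encode('utf-8')

first_int = text_b[0]       # indexing returns an int
first_bytes = text_b[0:1]   # slicing returns a bytes instance of length 1

print(first_int, first_bytes)

# Wrapping the int in a tuple rebuilds the single-byte bytes instance
print(bytes((first_int,)) == first_bytes)

# Passing a bare int to bytes creates that many zero bytes instead
zero_filled = bytes(3)
print(zero_filled)
```

When a single byte is needed as a bytes instance, slicing with a width of 1 avoids the conversion from int altogether.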

The data model method __iter__ defines the behaviour of the builtins function iter and casts the str into an iterator:

In [36]: forward = iter(text)
Variable Explorer
Name ▲ Type Size Value
ascii_text str 12 Hello World!
ascii_text_b bytes 12 Hello World!
forward str_iterator 1 <str_iterator at 0x22a2f5b70a0>
selection slice 1 slice(0, 5, 1)
text str 15 Γεια σου Κοσμο!
text_as_tuple tuple 15 ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b bytes 27 Γεια σου Κοσμο!
text_b_as_tuple tuple 27 (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin tuple 27 ('0b11001110', '0b10010011', '0b11001110', '0b10110101','0b11001110', …)
text_b_as_tuple_hex tuple 27 ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

The iterator essentially only displays a single value at a time. The builtins function next can be called to advance to the next value, which consumes the previous value:

In [37]: next(forward)
Out[37]: 'Γ'
In [38]: next(forward)
Out[38]: 'ε'
In [39]: next(forward)
Out[39]: 'ι'

A while loop can be constructed that breaks when the StopIteration error is encountered:

In [40]: forward = iter(text)
       : while True:
       :     try:
       :         print(next(forward))
       :     except StopIteration:
       :         break
       :
Γ
ε
ι
α
 
σ
ο
υ
 
Κ
ο
σ
μ
ο
!

The syntax for a for loop is cleaner. However behind the scenes, the while loop and iterator are used:

In [41]: for unicode_char in text:
       :     print(unicode_char)
       :
Γ
ε
ι
α
 
σ
ο
υ
 
Κ
ο
σ
μ
ο
!

The enumerate class can be used to enumerate the str. To visualise the enumeration object it can be cast into a dictionary:

In [42]: enum_text = enumerate(text)
In [43]: enum_text_as_dict = dict(enum_text)
Name ▲ Type Size Value
ascii_text str 12 Hello World!
ascii_text_b bytes 12 Hello World!
forward str_iterator 1 <str_iterator at 0x22a2f5b70a0>
enum_text enumerate 1 <enumerate at 0x22a2ed61260>
enum_text_as_dict dict 15 {0: 'Γ', 1: 'ε', 2: 'ι', 3: 'α', 4: ' ', 5: 'σ', 6: 'ο', 7: 'υ', 8: ' ', 9: 'Κ', …}
selection slice 1 slice(0, 5, 1)
text str 15 Γεια σου Κοσμο!
text_as_tuple tuple 15 ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …)
text_b bytes 27 Γεια σου Κοσμο!
text_b_as_tuple tuple 27 (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …)
text_b_as_tuple_bin tuple 27 ('0b11001110', '0b10010011', '0b11001110', '0b10110101','0b11001110', …)
text_b_as_tuple_hex tuple 27 ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …)

enum_text_as_dict can be expanded:

enum_text_as_dict
Key Value
0 Γ
1 ε
2 ι
3 α
4
5 σ
6 ο
7 υ
8
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !

A for loop can be constructed with the enumeration of text:

In [44]: for index, unicode_char in enumerate(text):
       :    print(index, unicode_char)
       :
0 Γ
1 ε
2 ι
3 α
4  
5 σ
6 ο
7 υ
8  
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !

The negative indexes can also be examined using:

In [45]: for index, unicode_char in enumerate(text):
       :    print(index-len(text), unicode_char)
       :
-15 Γ
-14 ε
-13 ι
-12 α
-11  
-10 σ
-9 ο
-8 υ
-7  
-6 Κ
-5 ο
-4 σ
-3 μ
-2 ο
-1 !

The negative indexes can be viewed alongside the positive indexes:

In [46]: for index, unicode_char in enumerate(text):
       :     print(index-len(text), unicode_char)
       : for index, unicode_char in enumerate(text):
       :     print(index, unicode_char)
       :
-15 Γ
-14 ε
-13 ι
-12 α
-11  
-10 σ
-9 ο
-8 υ
-7  
-6 Κ
-5 ο
-4 σ
-3 μ
-2 ο
-1 !
0 Γ
1 ε
2 ι
3 α
4  
5 σ
6 ο
7 υ
8  
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !

This makes it easier to conceptualise slicing using a negative step:

In [47]: text[-8:-11:-1]
Out[47]: 'υοσ'

A step of 2 can be used to return a str of every second unicode character:

In [48]: text[::2]
Out[48]: 'Γι ο ομ!'

In [49]: text[1::2]
Out[49]: 'εασυΚσο'

It is possible to do the same for the bytes instance:

In [50]: for index, byte_int in enumerate(text_b):
       :     print(index-len(text_b), byte_int)
       : for index, byte_int in enumerate(text_b):
       :     print(index, byte_int)
       :
-27 206
-26 147
-25 206
-24 181
-23 206
-22 185
-21 206
-20 177
-19 32
-18 207
-17 131
-16 206
-15 191
-14 207
-13 133
-12 32
-11 206
-10 154
-9 206
-8 191
-7 207
-6 131
-5 206
-4 188
-3 206
-2 191
-1 33
0 206
1 147
2 206
3 181
4 206
5 185
6 206
7 177
8 32
9 207
10 131
11 206
12 191
13 207
14 133
15 32
16 206
17 154
18 206
19 191
20 207
21 131
22 206
23 188
24 206
25 191
26 33

However slicing using a step with a multibyte encoding such as utf-8 will usually result in a bytes instance that cannot be decoded:

In [51]: text_b[0::2]
Out[51]: b'\xce\xce\xce\xce \x83\xbf\x85\xce\xce\xcf\xce\xce!'

In [52]: text_b[0::2].decode(encoding='utf-8')
Traceback (most recent call last):

  Cell In[52], line 1
    text_b[0::2].decode(encoding='utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
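
One way to avoid this problem, sketched below under the assumption that the original str is still available, is to slice the str first and encode afterwards. The names broken, decoded_ok and every_second are introduced only for this example:

```python
text = 'Γεια σου Κοσμο!'
text_b = text.encode('utf-8')

# Slicing the bytes with a step splits the two-byte characters
broken = text_b[0::2]
try:
    broken.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Slicing the str first and encoding afterwards keeps each character whole
every_second = text[::2].encode('utf-8')

print(decoded_ok)
print(every_second.decode('utf-8'))
```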

The Collection method count will count the number of occurrences of a substring in a str:

In [53]: text
Out[53]: 'Γεια σου Κοσμο!'

In [53]: text.count('σου')
Out[53]: 1

In [54]: text.count('σ')
Out[54]: 2

The Collection method index will retrieve the index of the first occurrence of a value:

In [55]: dict(enumerate(text))
Out[55]: 
{0: 'Γ',
 1: 'ε',
 2: 'ι',
 3: 'α',
 4: ' ',
 5: 'σ',
 6: 'ο',
 7: 'υ',
 8: ' ',
 9: 'Κ',
 10: 'ο',
 11: 'σ',
 12: 'μ',
 13: 'ο',
 14: '!'}

In [56]: text.index('σ')
Out[56]: 5

The optional positional input arguments start and end can be used to restrict the range of indexes to search over:

In [57]: first = text.index('σ')
       : text.index('σ', first+1, len(text))
Out[57]: 11

The method index will produce a ValueError when the substring is not found:

In [58]: second = text.index('σ', first+1, len(text))
       : text.index('σ', second+1, len(text))
Traceback (most recent call last):

  Cell In[58], line 2
    text.index('σ', second+1, len(text))

ValueError: substring not found

In the str class there is a similar method find that behaves similarly to index but returns -1 when a substring is not found:

In [59]: text.find('σ')
Out[59]: 5

In [60]: text.find('σ', first+1, len(text))
Out[60]: 11

In [61]: text.find('σ', second+1, len(text))
Out[61]: -1
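
Because find returns -1 instead of raising an error, it is convenient in a loop. The following is a minimal sketch of collecting every index at which a substring starts. The function find_all is a hypothetical helper and not a method of the str class:

```python
def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    indexes = []
    start = 0
    while True:
        found = text.find(sub, start)
        if found == -1:          # no further occurrence
            break
        indexes.append(found)
        start = found + 1        # continue searching after this match
    return indexes


print(find_all('Γεια σου Κοσμο!', 'σ'))
print(find_all('Γεια σου Κοσμο!', 'ο'))
```

The same loop written with index would need a try and except ValueError block to detect the end of the search.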

index and find search from left to right and have the counterparts rindex and rfind, which search from right to left:

In [62]: text.rindex('σ')
Out[62]: 11

Once again, these only differ when the substring is not found, raising a ValueError or returning -1 respectively.

The replace method replaces an old substring with a new substring and returns a new str with the changes. By default every occurrence of the old substring is replaced. If a count is specified, for example 1, only that many replacements are made, starting from the left:

In [63]: text
Out[63]: 'Γεια σου Κοσμο!'

In [64]: text.replace('Γεια', 'Γϵια')
Out[64]: 'Γϵια σου Κοσμο!'

In [65]: text.replace('σ', 'ς')
Out[65]: 'Γεια ςου Κοςμο!'

In [66]: text.replace('σ', 'ς', 1)
Out[66]: 'Γεια ςου Κοσμο!'

The bytes class behaves similarly:

In [67]: text_b
Out[67]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'

In [68]: dict(enumerate(text_b))
Out[68]: 
{0: 206,
 1: 147,
 2: 206,
 3: 181,
 4: 206,
 5: 185,
 6: 206,
 7: 177,
 8: 32,
 9: 207,
 10: 131,
 11: 206,
 12: 191,
 13: 207,
 14: 133,
 15: 32,
 16: 206,
 17: 154,
 18: 206,
 19: 191,
 20: 207,
 21: 131,
 22: 206,
 23: 188,
 24: 206,
 25: 191,
 26: 33}

In [69]: text_b.count(bytes((207, 131)))
Out[69]: 2

In [70]: bytes((207, 131))
Out[70]: b'\xcf\x83'

In [71]: text_b.index(bytes((207, 131)))
Out[71]: 9

In [72]: text_b.count(207)
Out[72]: 3

In [73]: text_b.index(207)
Out[73]: 9

In [74]: text_b.index(207, 9+1, len(text_b))
Out[74]: 13

The str class has the following Collection based binary operators:

# 🔧 Collection-Like Operators:
#     - __add__(self, value, /)                     : Implements string concatenation (`+`).
#     - __mul__(self, value, /)                     : Implements string repetition (`*`).
#     - __rmul__(self, value, /)                    : Implements reflected multiplication (`*`).

The data model method __add__ defines the behaviour of the + operator and performs str concatenation:

In [75]: text
Out[75]: 'Γεια σου Κοσμο!'
In [76]: ascii_text
Out[76]: 'Hello World!'
In [77]: text + ascii_text # text.__add__(ascii_text)
Out[77]: 'Γεια σου Κοσμο!Hello World!'

Notice that no space is added. If a space is desired, it can also be concatenated:

In [78]: text + ' ' + ascii_text
Out[78]: 'Γεια σου Κοσμο! Hello World!'

The data model method __mul__ defines the behaviour of the * operator and performs str replication with an int instance:

In [79]: text * 3 # text.__mul__(3)
Out[79]: 'Γεια σου Κοσμο!Γεια σου Κοσμο!Γεια σου Κοσμο!'

The reverse data model method __rmul__ gives instructions when the position of the str instance and int instance around the operator are reversed:

In [80]: 3 * text # (3).__mul__(text) # Not Defined in int class
                  # text.__rmul__(3) 
Out[80]: 'Γεια σου Κοσμο!Γεια σου Κοσμο!Γεια σου Κοσμο!'

The bytes class behaves similarly:

In [81]: text_b + ascii_text_b
Out[81]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!Hello World!'
In [82]: text_b * 3
Out[82]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [83]: exit
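
The fallback from __mul__ to the reverse method __rmul__ described above can be sketched with a small user-defined class. The class Repeater is hypothetical and exists only to make the two calls visible:

```python
class Repeater:
    """Hypothetical class used only to illustrate __mul__ and __rmul__."""

    def __init__(self, value):
        self.value = value

    def __mul__(self, count):
        # Called for: instance * count
        if not isinstance(count, int):
            return NotImplemented
        return Repeater(self.value * count)

    def __rmul__(self, count):
        # Called for: count * instance, once int.__mul__ has declined
        return self.__mul__(count)

    def __repr__(self):
        return f'Repeater({self.value!r})'


print(Repeater('ab') * 2)
print(2 * Repeater('ab'))

# The int class does not know how to multiply by a str and declines
print((3).__mul__('ab'))
```

The last line shows the special value NotImplemented being returned by the int class, which is the signal for Python to try the reverse method of the right-hand operand.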

Instantiation and MutableCollection Properties

The bytes class has the mutable counterpart the bytearray. A bytearray instance can be instantiated by casting from a bytes instance to a bytearray:

In [1]: text_b = b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [2]: text_b
Out[2]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [3]: text_ba = bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
In [4]: text_ba
Out[4]: bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')

The printed formal representation Out[4] shows the recommended way to instantiate a bytearray is by casting a bytes instance to a bytearray. There is no shorthand way of instantiating this class as it is less commonly used.

The behaviour of all the immutable methods is consistent:

In [4]: len(text_ba)
Out[4]: 27
In [5]: 207 in text_ba
Out[5]: True
In [6]: text_ba.count(207)
Out[6]: 3
In [7]: text_ba.index(207)
Out[7]: 9

The hash function can be used to check whether an object is hashable, which for the builtins classes corresponds to being immutable (an object that does not change). Notice that text_b, which is immutable, has a hash value, but text_ba, which is mutable, is unhashable:

In [8]: hash(text_b)
Out[8]: -2033065742153678299
In [9]: hash(text_ba)
Traceback (most recent call last):

  Cell In[9], line 1
    hash(text_ba)

TypeError: unhashable type: 'bytearray'
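
A practical consequence, sketched below, is that a bytes instance can be used as a key in a dict while a bytearray cannot. The names lookup, bytearray_hashable and same_hash are assumptions introduced for this example:

```python
text_b = 'Γεια'.encode('utf-8')
text_ba = bytearray(text_b)

# An immutable bytes instance is hashable and can be a dict key
lookup = {text_b: 'greeting'}

# A mutable bytearray is unhashable and is rejected as a dict key
try:
    lookup[text_ba] = 'greeting'
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

# Equal bytes instances have equal hash values within one session;
# the actual number normally differs from one session to the next
same_hash = hash(text_b) == hash(bytes(text_ba))

print(bytearray_hashable, same_hash)
print(lookup[bytes(text_ba)])
```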

The data model method __getitem__ can be used to index into an immutable bytes or mutable bytearray.

In [10]: text_b[0]
Out[10]: 206
In [11]: text_ba[0]
Out[11]: 206
In [12]: hex(text_ba[0])
Out[12]: '0xce'

The id function can be used to obtain the identification of an object:

In [13]: id(text_ba)
Out[13]: 1968878586928
In [14]: text_ba
Out[14]: bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')

The data model method __setitem__ defines the behaviour when indexing into a value and using assignment:

In [15]: int('0xcf', base=16)
Out[15]: 207
In [15]: text_ba[0] = 207

Notice because a value is being assigned in In [15] there is no Out[15]. If text_ba is examined, it is updated in place. Notice that the object id does not change:

In [16]: text_ba
Out[16]: bytearray(b'\xcf\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
In [16]: id(text_ba)
Out[16]: 1968878586928

The data model method __delitem__ defines the behaviour when deleting a value that has been indexed into:

In [18]: del text_ba[0]

Notice there is no Out[18] and instead text_ba is modified in place:

In [19]: text_ba
Out[19]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')

The first byte is missing, and text_ba will no longer decode properly because only a single byte of a two-byte character was deleted. Notice the identification is constant:

In [20]: id(text_ba)
Out[20]: 1968878586928

The mutable method append will append a single byte, represented by an int, to the end of a bytearray:

In [21]: text_ba.append(206) # '\xce'

As this method is mutable it has no return value. text_ba can be seen to be modified in place:

In [22]: text_ba
Out[22]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce')

The mutable method extend can be used to extend the bytearray by another bytearray:

In [23]: text_ba.extend(bytearray((177, 206, 177))) # '\xb1\xce\xb1'

Once again this method is mutable and text_ba is modified in place:

In [24]: text_ba
Out[24]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')

The mutable method insert can be used to insert a single byte as an int at an index, for example at index 1:

In [25]: text_ba.insert(1, 148) # '\x94'

Once again this method is mutable and text_ba is modified in place:

In [26]: text_ba
Out[26]: bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')

The mutable method remove can be used to remove the first occurrence of a byte:

In [27]: text_ba.remove(206) # '\xce'

Once again this method is mutable and text_ba is modified in place. The \xce that was at index 2 is no longer there, and \xb5, which was previously at index 3, is now at index 2:

In [28]: text_ba
Out[28]: bytearray(b'\x93\x94\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')

The mutable method reverse can be used to reverse the order of each byte in the bytearray:

In [29]: text_ba.reverse()

Once again this method is mutable and text_ba is modified in place:

In [30]: text_ba
Out[30]: bytearray(b'\xb1\xce\xb1\xce!\xbf\xce\xbc\xce\x83\xcf\xbf\xce\x9a\xce \x85\xcf\xbf\xce\x83\xcf \xb1\xce\xb9\xce\xb5\x94\x93')

The mutable method clear will clear each byte from the bytearray:

In [31]: text_ba.clear()

Once again this method is mutable and text_ba is modified in place:

In [30]: text_ba
Out[30]: bytearray(b'')

The mutable method extend can be used to extend this empty bytearray:

In [31]: text_ba.extend(bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce\xb1'))

Most mutable methods have no return value, which distinguishes them clearly from immutable methods which have a return value. The mutable method pop is an exception because it returns the value popped (by default the last value) and mutates the bytearray in place:

In [31]: text_ba.pop()
Out[31]: 177 # '\xb1'
In [32]: text_ba
Out[32]: bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce')

An index to pop can be specified:

In [34]: text_ba.pop(1)
Out[34]: 148 # '\x94'
In [32]: text_ba
Out[32]: bytearray(b'\x93\xce\xb5\xce\xb9\xce')

Notice that after all these mutable methods are used the identification of text_ba remains the same:

In [33]: id(text_ba)
Out[33]: 1968878586928

The copy method can be used to create a copy of the bytearray:

In [34]: text_ba2 = text_ba.copy()

Notice the copy has a different identification:

In [35]: id(text_ba2)
Out[35]: 1968878402416

The copies (at present) have equal values but are not the same object:

In [36]: text_ba2 == text_ba
Out[36]: True
In [37]: text_ba2 is text_ba
Out[37]: False
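
The difference between a copy and a second label on the same object can be sketched as follows. The names alias and clone are assumptions introduced for this example:

```python
text_ba = bytearray(b'\x93\xce\xb5')

alias = text_ba          # a second label on the same object
clone = text_ba.copy()   # a new object with an equal value

# Mutating through one label is visible through the other label,
# but the copy is unaffected
text_ba.append(206)

print(alias is text_ba, alias == text_ba)
print(clone is text_ba, clone == text_ba)
```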

The __add__ and __mul__ data model methods behave consistently:

In [38]: text_ba + text_ba2
Out[38]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce')
In [39]: text_ba * 3
Out[39]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce')

However there is a subtle difference when the in-place counterparts are used. For the immutable bytes, two operations take place: concatenation, which returns a new value, and then reassignment. Notice the identification changes, which means the label text_b has been peeled off the old bytes instance with identification 1968877623024 and placed on the new bytes instance with identification 1968877576688:

In [40]: text_b
Out[40]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [41]: id(text_b) 
Out[41]: 1968877623024
In [42]: text_b += b'\xce'
In [43]: text_b
Out[43]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce'
In [44]: id(text_b) 
Out[44]: 1968877576688

Notice for the mutable bytearray that a single operation has taken place and the identification remains constant:

In [45]: text_ba
Out[45]: bytearray(b'\x93\xce\xb5\xce\xb9\xce')
In [46]: id(text_ba) 
Out[46]: 1968878586928
In [47]: text_ba += bytearray(b'\xce')
In [48]: text_ba
Out[48]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xce')
In [49]: id(text_ba) 
Out[49]: 1968878586928
In [50]: text_ba *= 2
In [51]: text_ba
Out[51]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xce\x93\xce\xb5\xce\xb9\xce\xce')
In [52]: id(text_ba) 
Out[52]: 1968878586928
In [53]: exit
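
The comparison above can be condensed into a small check. The function mutated_in_place is a hypothetical helper that keeps a reference to the original object and tests whether the label still points at it after the in-place operator is used:

```python
def mutated_in_place(obj, extra):
    """Return True if obj += extra modified obj rather than rebinding the label."""
    original = obj       # keep a second label on the starting object
    obj += extra
    return obj is original


# bytes is immutable: += builds a new instance and rebinds the label
print(mutated_in_place(b'\xce\x93', b'\xce'))

# bytearray is mutable: += extends the existing instance
print(mutated_in_place(bytearray(b'\xce\x93'), b'\xce'))
```

Using is on a saved reference is equivalent to comparing the values returned by id before and after the operation.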

Formatted Strings

Returning to the str class, the remaining str methods will now be examined. Recall that a str is immutable and all methods therefore return a value, which is commonly another str instance.

It is common to insert an object into a string and format it within the string body, to produce what is known as a formatted string.

Look at the following string body:

In [1]: body = 'The string to 0 is 1 2!'

Supposing there are three str instances:

In [2]: var0 = 'print'
      : var1 = 'hello'
      : var2 = 'world'

The str method format can be used to insert these str instances within the string body. Let's examine the docstring of the str method format:

In [3]: body.format(
# Docstring popup
"""
Docstring:
S.format(*args, **kwargs) -> str

Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method
"""

From the docstring, the string body should contain curly braces, which are used as placeholders to insert a Python object. Each placeholder can be numbered positionally:

In [3]: body = 'The string to {0} is {1} {2}!'

The *args in the docstring indicates a variable number of positional arguments. When inserting multiple object instances into the string body, each positional argument should correspond to a placeholder:

In [4]: body.format(var0, var1, var2)
Out[4]: 'The string to print is hello world!'

The string body can alternatively be set up to contain named placeholders:

In [5]: body = 'The string to {var0_} is {var1_} {var2_}!'

The **kwargs in the docstring indicates a variable number of named arguments also known as keyword parameters:

In [6]: body.format(var0_=var0, var1_=var1, var2_=var2)
Out[6]: 'The string to print is hello world!'

Combining the above:

In [7]: 'The string to {var0_} is {var1_} {var2_}!'.format(var0_=var0, var1_=var1, var2_=var2)
Out[7]: 'The string to print is hello world!'

It is common for the placeholder to be given the same name as the object name of the object to be inserted:

In [8]: 'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)
Out[8]: 'The string to print is hello world!'

Notice in the above that each object name is essentially repeated 3 times, which is pretty cumbersome. A shorthand way of writing the expression above is to use the prefix f, which means formatted string:

In [9]: f'The string to {var0} is {var1} {var2}!'
Out[9]: 'The string to print is hello world!'
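
The three spellings above can be compared side by side. This is a minimal sketch and the names positional, named, fstring and shout are assumptions introduced for this example:

```python
var0, var1, var2 = 'print', 'hello', 'world'

# Positional placeholders filled by positional arguments
positional = 'The string to {0} is {1} {2}!'.format(var0, var1, var2)

# Named placeholders filled by named arguments
named = 'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)

# A formatted string looks up the object names directly
fstring = f'The string to {var0} is {var1} {var2}!'

# A placeholder in a formatted string can also hold an expression
shout = f'{var1.upper()} {var2.upper()}!'

print(positional == named == fstring)
print(shout)
```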

The object data model __format__ method defines the behaviour of the builtins function format:

In [10]: format(
# Docstring popup
"""
Signature: format(object, format_spec, /)
Docstring:
Default object formatter.

Return str(self) if format_spec is empty. Raise TypeError otherwise.
Type:      method_descriptor
"""

Notice there is a format specification format_spec. 's' denotes the format specification for a str instance:

In [11]: format('Hello World!', 's')
Out[11]: 'Hello World!'

If it is prefixed with a number for instance '22s', this is an instruction for the str instance to occupy a width of 22 within the formatted string. Because the original length was 12, it now has 10 spaces until the end of the string:

In [12]: format('Hello World!', '22s')
Out[12]: 'Hello World!          '

Prefixing the width with a 0 is not common with a str instance. In Python 3.10 and later it fills the remaining width with 0 instead of a space:

In [13]: format('Hello World!', '022s')
Out[13]: 'Hello World!0000000000'

The format specification is placed inside the placeholder after the variable, and a colon : separates the variable from the format specification:

In [14]: f'The string to {var0:s} is {var1} {var2}!'
Out[14]: 'The string to print is hello world!'
In [15]: f'The string to {var0:10s} is {var1} {var2}!'
Out[15]: 'The string to print      is hello world!'
In [16]: f'The string to {var0:010s} is {var1} {var2}!'
Out[16]: 'The string to print00000 is hello world!'

Numeric values are commonly inserted into a string body:

In [17]: num1 = 1
       : num2 = 0.0000123456789
       : num3 = 12.3456789
In [18]: f'The numbers are {num1}, {num2} and {num3}.' 
Out[18]: 'The numbers are 1, 1.23456789e-05 and 12.3456789.'

num1 is an integer and an integer can have various format specifiers. d is used to represent a decimal integer:

In [19]: f'The numbers are {num1:d}, {num2} and {num3}.' 
Out[19]: 
'The numbers are 1, 1.23456789e-05 and 12.3456789.'

The width can also be specified:

In [19]: f'The numbers are {num1:5d}, {num2} and {num3}.' 
Out[19]: 
'The numbers are     1, 1.23456789e-05 and 12.3456789.'

Prefixing this with 0 will display leading zeros:

In [19]: f'The numbers are {num1:05d}, {num2} and {num3}.' 
Out[19]: 
'The numbers are 00001, 1.23456789e-05 and 12.3456789.'

num2 and num3 are float instances and the format specifier f can be used to express each float in the fixed format:

In [20]: for num in range(9, -1, -1):
       :    print('0.'+num*'0'+'123')
       :
       : for num in range(18):
       :    print('123'+num*'0'+'.')
       :
0.000000000123
0.00000000123
0.0000000123
0.000000123
0.00000123
0.0000123
0.000123
0.00123
0.0123
0.123
123.
1230.
12300.
123000.
1230000.
12300000.
123000000.
1230000000.
12300000000.
123000000000.
1230000000000.
12300000000000.
123000000000000.
1230000000000000.
12300000000000000.
123000000000000000.
1230000000000000000.
12300000000000000000.

Typically when the float is very small or very large, scientific notation is used, with the format e. The format g is the general format and uses the fixed format or the exponential format depending on the size of the float:

In [21]: for num in range(9, -1, -1):
       :     print(float('0.'+num*'0'+'123'))
       :
       : for num in range(18):
       :     print(float('123'+num*'0'+'.'))
1.23e-10
1.23e-09
1.23e-08
1.23e-07
1.23e-06
1.23e-05
0.000123
0.00123
0.0123
0.123
123.0
1230.0
12300.0
123000.0
1230000.0
12300000.0
123000000.0
1230000000.0
12300000000.0
123000000000.0
1230000000000.0
12300000000000.0
123000000000000.0
1230000000000000.0
1.23e+16
1.23e+17
1.23e+18
1.23e+19

In [22]: f'The numbers are {num1:g}, {num2:g} and {num3:g}.'
Out[22]: 'The numbers are 1, 1.23457e-05 and 12.3457.'
In [23]: f'The numbers are {num1:f}, {num2:f} and {num3:f}.'
Out[23]: 'The numbers are 1.000000, 0.000012 and 12.345679.'
In [24]: f'The numbers are {num1:e}, {num2:e} and {num3:e}.'
Out[24]: 'The numbers are 1.000000e+00, 1.234568e-05 and 1.234568e+01.'

A width of 10 characters, with 3 characters past the decimal point can be specified:

In [25]: format(num1, '10.3e')
Out[25]: ' 1.000e+00'
In [26]: format(num1, '010.3e')
Out[26]: '01.000e+00'
In [27]: format(num1, '010.2e')
Out[27]: '001.00e+00'

Notice the width includes all the characters used to represent the number as a string such as the decimal point, e and power.

The same modifications can be made in the fixed format:

In [25]: format(num1, '10.3f')
Out[25]: '     1.000'
In [26]: format(num1, '010.3f')
Out[26]: '000001.000'
In [27]: format(num1, '010.2f')
Out[27]: '0000001.00'
In [28]: f'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.' 
Out[28]: 'The numbers are 001, 00.000 and 01.235e+01.'
In [29]: exit
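
The three float formats can be compared for the same values and the same width and precision. The sketch below is illustrative and the names values and rows are assumptions introduced for this example:

```python
values = (1, 0.0000123456789, 12.3456789)

# Build one row per format specification: fixed, exponential and general
rows = {}
for spec in ('10.3f', '10.3e', '10.3g'):
    rows[spec] = [format(value, spec) for value in values]

for spec, row in rows.items():
    print(spec, row)
```

Every formatted value occupies the requested width of 10 characters. For the f and e formats the precision is the number of digits after the decimal point, whereas for the g format it is the number of significant digits.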

Returning to the string body:

In [1]: body = 'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.'

The docstring of the str method format_map can be viewed:

In [2]: body.format_map(
# Docstring popup
"""
Docstring:
S.format_map(mapping) -> str

Return a formatted version of S, using substitutions from mapping.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method
"""

To use this method, all the variables to be incorporated into the formatted string are grouped together in a mapping such as a dict:

In [2]: numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

The str method format_map can then be used to map all the variables from this dict into their placeholders within the string body:

In [3]: body.format_map(numbers)
Out[3]: 'The numbers are 001, 00.000 and 01.235e+01.'
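
For a plain dict, format_map gives the same result as unpacking the dict into format. It differs when the mapping is a dict subclass, because the mapping is used directly. The sketch below illustrates this. The class KeepMissing is hypothetical and the names via_map, via_kwargs and partial are assumptions introduced for this example:

```python
body = 'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.'
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

via_map = body.format_map(numbers)
via_kwargs = body.format(**numbers)   # unpacks the dict into named arguments
print(via_map == via_kwargs)


class KeepMissing(dict):
    """Hypothetical dict subclass that leaves unknown placeholders untouched."""

    def __missing__(self, key):
        return '{' + key + '}'


# format_map uses the mapping directly, so __missing__ is consulted
partial = '{num1} and {num4}'.format_map(KeepMissing(num1=1))
print(partial)
```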

In the str class the data model method __mod__ defines the behaviour of the % operator and implements C-style formatted strings:

In [4]: body = 'The numbers are %03d, %06.3f and %0.3g.' 
      : nums = (1, 0.0000123456789, 12.3456789)
In [5]: body % nums
Out[5]: 'The numbers are 001, 00.000 and 12.3.'
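
The C-style specifications map closely onto the format specifications used with the str method format. The sketch below compares the two. The names c_style, new_style and single are assumptions introduced for this example:

```python
nums = (1, 0.0000123456789, 12.3456789)

# C-style formatting with the % operator and a tuple of values
c_style = 'The numbers are %03d, %06.3f and %0.3g.' % nums

# The same specifications after the colon in each placeholder
new_style = 'The numbers are {:03d}, {:06.3f} and {:0.3g}.'.format(*nums)

print(c_style == new_style)

# A tuple to be inserted as one value must be wrapped in another tuple,
# otherwise % treats each element as a separate value
single = 'Value: %s' % ((1, 2),)
print(single)
```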

Case Methods

The Greek alphabet looks as follows, notice it has uppercase and lowercase letters. Notice also that some characters such as epsilon and sigma have two lowercase variations:

Greek Alphabet
Greek Alphabet Uppercase Lower Case
Alpha Α α
Beta Β β
Gamma Γ γ
Delta Δ δ
Epsilon Ε ε or ϵ
Zeta Ζ ζ
Eta Η η
Theta Θ θ
Iota Ι ι
Kappa Κ κ
Lambda Λ λ
Mu Μ μ
Nu Ν ν
Xi Ξ ξ
Omicron Ο ο
Pi Π π
Rho Ρ ρ
Sigma Σ σ or ς
Tau Τ τ
Upsilon Υ υ
Phi Φ φ
Chi Χ χ
Psi Ψ ψ
Omega Ω ω

The str case method upper returns a string where every character is upper case:

In [6]: 'γεια σου κοσμο!'.upper()
Out[6]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'

The str case method capitalize (U.S. spelling with z) returns a string where only the first character is in upper case and the rest of the characters are in lower case:

In [7]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.capitalize()
Out[7]: 'Γεια σου κοσμο!'

The str case method title returns a string where the first character and the first character after every space are in upper case and the rest of the characters are in lower case:

In [8]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.title()
Out[8]: 'Γεια Σου Κοσμο!'

The str case method lower returns a string where each character is in lower case:

In [9]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.lower()
Out[9]: 'γεια σου κοσμο!'

The following characters are less common lower case variants of epsilon and sigma. Because they are already lower case, they are unchanged when the str method lower is used on them:

In [10]: 'ϵ'.lower()
Out[10]: 'ϵ'
In [11]: 'ς'.lower()
Out[11]: 'ς'

The str case method casefold returns a string where each character is in lower case and any variants are transformed to the most common variant:

In [12]: 'ϵ'.casefold()
Out[12]: 'ε'
In [13]: 'ς'.casefold()
Out[13]: 'σ'

The difference between the str methods lower and casefold can be seen in the example below:

In [14]: 'Γϵια ςου Κοςμο!'.lower()
Out[14]: 'γϵια ςου κοςμο!'

In [15]: 'Γϵια ςου Κοςμο!'.casefold()
Out[15]: 'γεια σου κοσμο!'
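
A common use of casefold is caseless comparison. The sketch below uses a hypothetical helper same_ignoring_case and, as an additional example not taken from the text above, the German ß, which casefold expands to ss:

```python
def same_ignoring_case(first, second):
    """Hypothetical helper: compare two strings after case folding."""
    return first.casefold() == second.casefold()


variant = 'Γϵια ςου Κοςμο!'
common = 'γεια σου κοσμο!'

# lower leaves the variant characters in place, so the strings differ
print(variant.lower() == common.lower())     # False
# casefold maps the variants to the common form
print(same_ignoring_case(variant, common))   # True

# casefold also expands the German sharp s
print('STRASSE'.lower() == 'straße'.lower())     # False
print(same_ignoring_case('STRASSE', 'straße'))   # True
```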

The str case method swapcase swaps the case of each character in the str:

In [16]: 'Γεια Σου Κοσμο!'.swapcase()
Out[16]: 'γΕΙΑ σΟΥ κΟΣΜΟ!'

Boolean Classification

The str class has a number of boolean classification methods, which return True if the characters in a str satisfy the classification. The methods islower and isupper only examine cased characters, so spaces and punctuation are ignored:

In [17]: 'γεια σου κοσμο!'.islower()
Out[17]: True
In [18]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.islower()
Out[18]: False
In [19]: 'γεια σου κοσμο!'.isupper()
Out[19]: False
In [20]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.isupper()
Out[20]: True

The boolean classification istitle will return True if the str is title case:

In [21]: 'Γεια σου κοσμο!'.istitle()
Out[21]: False
In [22]: 'Γεια Σου Κοσμο!'.istitle()
Out[22]: True

The boolean classification isspace will return True if every character is whitespace. This includes tabs and newlines:

In [23]: ' '.isspace()
Out[23]: True
In [24]: '   '.isspace()
Out[24]: True
In [25]: ' \t\n\r\x0b\x0c'.isspace()
Out[25]: True

The escape character \t represents a tab, \n represents a new line and \r a carriage return. \x0b is the vertical tab and \x0c is the form feed. These are less commonly used and are expressed as hexadecimal escape sequences.

The boolean classification isprintable will check to see if every character in the string is printable, i.e. the string doesn't contain any non-printable characters such as the null character:

In [26]: '\x00'.isprintable()
Out[26]: False
In [27]: 'Γεια σου Κοσμο!'.isprintable()
Out[27]: True

The boolean classification isascii will check to see if every character in the string is an ASCII character:

In [28]: 'Γεια σου Κοσμο!'.isascii()
Out[28]: False
In [29]: 'Hello World!'.isascii()
Out[29]: True

The boolean classification isalpha will check to see if every character in the string is alphabetical. Note this isn't limited to only ASCII alphabetical characters:

In [30]: 'Γεια σου Κοσμο!'.isalpha()
Out[30]: False

In [31]: 'αβγΑΒΓ'.isalpha()
Out[31]: True

In [32]: 'abcABC'.isalpha()
Out[32]: True

There are three numeric classifications and the difference between these can be seen by examining the following numeric groups:

In [33]: numeric_groups = {'ascii': '0123456789', 
                           'font1': '𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿', 
                           'font2': '𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵', 
                           'font3': '𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡', 
                           'subscript': '₀₁₂₃₄₅₆₇₈₉',
                           'superscript': '⁰¹²³⁴⁵⁶⁷⁸⁹',
                           'circled1': '➀➁➂➃➄➅➆➇➈',
                           'circled2': '➉',
                           'fractions': '½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉', 
                           'asciihex': '0123456789abcdef', }

isdecimal is the most restrictive and recognises numeric digits of various different fonts:

In [34]: for key, value in numeric_groups.items():
       :      print(key, value, value.isdecimal())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ False
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ False
circled1 ➀➁➂➃➄➅➆➇➈ False
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False

isdigit recognises more, including subscripts, superscripts and circled digits. However, the circled 10 isn't recognised as it represents a two-digit number rather than a single digit:

In [35]: for key, value in numeric_groups.items():
       :      print(key, value, value.isdigit())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False

isnumeric recognises more including the circled 10 and fractions:

In [36]: for key, value in numeric_groups.items():
       :      print(key, value, value.isnumeric())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef False

isalnum essentially accepts the combination of Unicode characters accepted by isalpha and isnumeric:

In [37]: for key, value in numeric_groups.items():
       :      print(key, value, value.isalnum())
       :
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef True
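
The practical consequence of these classifications can be sketched with the int class and the standard unicodedata module. The helper to_int_or_none is a hypothetical name used only for this illustration:

```python
import unicodedata


def to_int_or_none(text):
    """Return int(text), or None when int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


# decimal characters from another font are accepted by int()
print(to_int_or_none('𝟙𝟚𝟛'))   # 123
# superscripts satisfy isdigit but are not decimal characters
print(to_int_or_none('¹²³'))   # None

# unicodedata exposes the numeric value behind each classification
print(unicodedata.digit('³'))    # 3
print(unicodedata.numeric('½'))  # 0.5
print(unicodedata.numeric('➉'))  # 10.0
```

In other words, isdecimal is the check that corresponds to strings int can convert, while isdigit and isnumeric accept characters that need unicodedata to be turned into a value.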

The boolean classification isidentifier will check to see if the string is a valid identifier name. Recall identifiers (object names) cannot begin with a number, but can include a number elsewhere, and cannot use spaces or special characters with the exception of the underscore:

In [38]: 'variable'.isidentifier()
Out[38]: True
In [39]: '2variable'.isidentifier()
Out[39]: False
In [40]: 'variable2'.isidentifier()
Out[40]: True
In [41]: 'variable 2'.isidentifier()
Out[41]: False
In [42]: 'variable_2'.isidentifier()
Out[42]: True
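
Note that isidentifier only checks the spelling rules, so reserved keywords also return True. A minimal sketch combining it with the standard keyword module; the function name is_usable_name is hypothetical:

```python
import keyword


def is_usable_name(name):
    """True if name is a valid identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


print('class'.isidentifier())        # True, although class is a keyword
print(is_usable_name('class'))       # False
print(is_usable_name('variable_2'))  # True
print(is_usable_name('μεταβλητη'))   # True, non-ASCII letters are allowed
```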

startswith endswith
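
The str methods startswith and endswith return True if the string begins or ends with the supplied substring. A minimal sketch using the greeting from the earlier examples:

```python
greeting = 'Γεια σου Κοσμο!'

print(greeting.startswith('Γεια'))   # True
print(greeting.endswith('!'))        # True
print(greeting.startswith('γεια'))   # False, the comparison is case sensitive

# a tuple of candidates can be supplied, any match returns True
print(greeting.startswith(('Hello', 'Γεια')))  # True

# an optional start index restricts the comparison to a slice
print(greeting.startswith('σου', 5))  # True
```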

Alignment Methods

The str alignment methods can be used as an alternative way to format a string. Left justify ljust, right justify rjust and center will align a string within a specified width:

In [43]: len('Γεια σου Κοσμο!')
Out[43]: 15

In [44]: 'Γεια σου Κοσμο!'.ljust(20)
Out[44]: 'Γεια σου Κοσμο!     '

In [45]: 'Γεια σου Κοσμο!'.rjust(20)
Out[45]: '     Γεια σου Κοσμο!'

In [46]: 'Γεια σου Κοσμο!'.center(20)
Out[46]: '  Γεια σου Κοσμο!   '

These str alignment methods accept an optional fill character:

In [47]: 'Γεια σου Κοσμο!'.rjust(20, '0')
Out[47]: '00000Γεια σου Κοσμο!'

Using right justification with a fill character of 0 is commonly used for numeric strings and is available as the str method zerofill zfill:

In [48]: '1'.zfill(5)
Out[48]: '00001'
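
The two approaches differ when the numeric string has a sign. A short sketch of the difference:

```python
# zfill keeps a leading sign in front of the padding
print('-1'.zfill(5))       # -0001
print('+7'.zfill(4))       # +007

# rjust treats the sign as an ordinary character
print('-1'.rjust(5, '0'))  # 000-1

# a string that is already wide enough is returned unchanged
print('12345'.zfill(3))    # 12345
```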

The str method expandtabs can be used to expand tabs to a specified number of spaces, the default value is 8:

In [49]: '\tΓεια σου Κοσμο!'.expandtabs()
Out[49]: '        Γεια σου Κοσμο!'
In [50]: '\tΓεια σου Κοσμο!'.expandtabs(4)
Out[50]: '    Γεια σου Κοσμο!'

Stripping Methods

The methods left strip lstrip, right strip rstrip and strip remove leading whitespace, trailing whitespace, or both, by default:

In [51]: '  Γεια σου Κοσμο!   '.lstrip()
Out[51]: 'Γεια σου Κοσμο!   '
In [52]: '  Γεια σου Κοσμο!   '.rstrip()
Out[52]: '  Γεια σου Κοσμο!'
In [53]: '  Γεια σου Κοσμο!   '.strip()
Out[53]: 'Γεια σου Κοσμο!'

Alternatively they can be used to strip a specified character:

In [54]: '00001'.lstrip('0')
Out[54]: '1'

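Or any of multiple characters. The argument is treated as a set of characters to strip rather than as a prefix, so here the leading 0, x and 0 are all removed: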

In [55]: '0x01'.lstrip('0x')
Out[55]: '1'

Sometimes it is more useful to use the str methods removeprefix and removesuffix, which will remove only a specified prefix or suffix:

In [56]: '0x01'.removeprefix('0x')
Out[56]: '01'
In [57]: '0x01'.removesuffix('01')
Out[57]: '0x'
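
The difference matters most when the text after the prefix begins with characters that are also in the prefix. A small sketch with a hypothetical filename (removeprefix and removesuffix require Python 3.9 or later):

```python
filename = 'test_tests.py'

# lstrip removes any leading characters from the set {'t', 'e', 's', '_'}
print(filename.lstrip('test_'))        # .py

# removeprefix removes the exact prefix once
print(filename.removeprefix('test_'))  # tests.py
print(filename.removesuffix('.py'))    # test_tests

# if the prefix is absent the string is returned unchanged
print(filename.removeprefix('demo_'))  # test_tests.py
```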

Splitting and Joining Methods

The str method split splits a string on whitespace by default, returning a list of str instances. Conceptually this splits a sentence into its words:

In [58]: 'Γεια σου Κοσμο!'.split()
Out[58]: ['Γεια', 'σου', 'Κοσμο!']

This is complemented by the str method join, which joins a list of str instances using the str it is called from as the separator:

In [59]: ' '.join(['Γεια', 'σου', 'Κοσμο!'])
Out[59]: 'Γεια σου Κοσμο!'

A different character can be specified in the split method:

In [60]: 'Γεια σου Κοσμο!'.split('σ')
Out[60]: ['Γεια ', 'ου Κο', 'μο!']

In [61]: 'σ'.join(['Γεια ', 'ου Κο', 'μο!'])
Out[61]: 'Γεια σου Κοσμο!'

A maximum split can be specified and here, split can be seen to operate on the string, left to right:

In [60]: 'Γεια σου Κοσμο!'.split(maxsplit=1)
Out[60]: ['Γεια', 'σου Κοσμο!']

The counterpart rsplit operates from right to left:

In [61]: 'Γεια σου Κοσμο!'.rsplit(maxsplit=1)
Out[61]: ['Γεια σου', 'Κοσμο!']

When maxsplit isn't specified, rsplit and split behave identically and split is generally preferred.

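The str method splitlines is similar to split with the split character specified as a new line \n. Unlike split('\n'), it also recognises other line boundaries such as \r\n and does not return an empty string after a trailing new line: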

In [62]: print('Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n')
Γεια σου Κοσμο!
Γεια σου Κοσμο!
Γεια σου Κοσμο!

In [63]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.splitlines()
Out[63]: ['Γεια σου Κοσμο!', 'Γεια σου Κοσμο!', 'Γεια σου Κοσμο!']
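
The difference between splitlines and split('\n') can be sketched with a short string that mixes line endings (the sample text is made up for illustration):

```python
text = 'one\ntwo\r\nthree\n'

# split leaves the carriage return and a trailing empty string
print(text.split('\n'))                # ['one', 'two\r', 'three', '']

# splitlines recognises both endings and drops the trailing empty string
print(text.splitlines())               # ['one', 'two', 'three']

# keepends=True retains the line endings
print(text.splitlines(keepends=True))  # ['one\n', 'two\r\n', 'three\n']
```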

The str method partition is similar to split but splits only once, at the first occurrence of the separator, and always returns a three-element tuple of the text before the separator, the separator itself and the text after it:

In [64]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.partition('\n')
Out[64]: ('Γεια σου Κοσμο!', '\n', 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\n')

partition operates left to right. Its counterpart rpartition operates right to left:

In [65]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.rpartition('\n')
Out[65]: ('Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!', '\n', '')
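
Because the tuple always has three elements, it can be unpacked safely even when the separator is missing. A minimal sketch, where parse_setting and the sample strings are hypothetical:

```python
def parse_setting(line):
    """Hypothetical helper: split 'key=value' at the first '='."""
    key, separator, value = line.partition('=')
    if not separator:
        # the separator element is empty when '=' was not found
        return None
    return key.strip(), value.strip()


print(parse_setting('path = /usr/bin=local'))  # ('path', '/usr/bin=local')
print(parse_setting('no separator here'))      # None

# rpartition splits at the last occurrence instead
stem, _, extension = 'archive.tar.gz'.rpartition('.')
print(stem, extension)  # archive.tar gz
```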

The string Module

The string module contains identifiers that are related to string manipulation but are not available as attributes of the str class, mostly non-callable constants. A design choice was made to compartmentalise these into a separate string module. As a result, all the identifiers of the str class, outwith the data model identifiers, are callable methods which return a value and leave the immutable str instance unchanged. Compartmentalising these also keeps the namespace of the str class smaller.

In [1]: import string
In [2]: string.
# Available Identifiers for `string` module
# -------------------------------
# Available Identifiers in `string`:
# ----------------------------------

# 🔠 Character Sets:
#     ascii_letters : Concatenation of `ascii_lowercase` and `ascii_uppercase`.
#     ascii_lowercase : Lowercase ASCII letters (`abcdefghijklmnopqrstuvwxyz`).
#     ascii_uppercase : Uppercase ASCII letters (`ABCDEFGHIJKLMNOPQRSTUVWXYZ`).
#     digits : Decimal digit characters (`0123456789`).
#     hexdigits : Hexadecimal digit characters (`0123456789abcdefABCDEF`).
#     octdigits : Octal digit characters (`01234567`).
#     printable : Characters deemed "printable" (`digits`, `ascii_letters`, punctuation, and whitespace).
#     punctuation : String of all ASCII punctuation characters.
#     whitespace : String of all ASCII whitespace characters.
In [3]: string.ascii_letters
Out[3]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [4]: string.ascii_lowercase
Out[4]: 'abcdefghijklmnopqrstuvwxyz'
In [5]: string.ascii_uppercase
Out[5]: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [6]: string.digits # base 10
Out[6]: '0123456789'
In [7]: string.hexdigits # base 16
Out[7]: '0123456789abcdefABCDEF'
In [8]: string.octdigits # base 8
Out[8]: '01234567'
In [9]: string.punctuation
Out[9]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [10]: string.whitespace
Out[10]: ' \t\n\r\x0b\x0c'
In [11]: string.printable
Out[11]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
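
These constants are convenient for membership tests. A minimal sketch, where is_hex_string is a hypothetical helper name:

```python
import string


def is_hex_string(text):
    """True if text is non-empty and every character is a hex digit."""
    return bool(text) and all(ch in string.hexdigits for ch in text)


print(is_hex_string('03b1'))  # True
print(is_hex_string('03g1'))  # False

# printable is the concatenation of four of the other constants
combined = (string.digits + string.ascii_letters
            + string.punctuation + string.whitespace)
print(combined == string.printable)  # True
```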

Translation

The str method maketrans is a static method that creates a translation table which maps one character to another. A translation table from Greek to Latin letters can be made. To visualise this, it can be cast into a dict:

In [12]: greek2latin = str.maketrans('αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ', 'abgdezhqiklmnxoprstyfcuwABGDEZHQIKLMNXOPRSTYFCUW')

In [13]: greek2latin_as_dict = dict(greek2latin)
Translation Table
Key Value
945 97
946 98
947 103
948 100
949 101
950 122
951 104
952 113
953 105
954 107
955 108
956 109
957 110
958 120
959 111
960 112
961 114
963 115
964 116
965 121
966 102
967 99
968 117
969 119
913 65
914 66
915 71
916 68
917 69
918 90
919 72
920 81
921 73
922 75
923 76
924 77
925 78
926 88
927 79
928 80
929 82
931 83
932 84
933 89
934 70
935 67
936 85
937 87

Notice the keys and the values are numerical, displayed as int instances. Each int is the Unicode code point of a character. These can be understood better by looking at the int in binary or hexadecimal using the bin and hex functions respectively. The str methods explored above will be used to display all the bits or hexadecimal digits. The character function chr returns the Unicode character corresponding to the supplied int, and the ordinal function ord performs the inverse operation. The comments note how many bytes each character occupies when encoded as utf-8:

In [14]: bin(945)
Out[14]: '0b1110110001' # 2 bytes 'utf-8'

In [15]: '0b'+bin(945).removeprefix('0b').zfill(16)
Out[15]: '0b0000001110110001'

In [14]: hex(945)
Out[14]: '0x3b1' # 2 bytes 'utf-8'

In [15]: '0x'+hex(945).removeprefix('0x').zfill(4)
Out[15]: '0x03b1'

In [16]: chr(945)
Out[16]: 'α'

In [17]: ord('α')
Out[17]: 945
In [18]: bin(97)
Out[18]: '0b1100001' # 1 byte 'utf-8'

In [19]: '0b'+bin(97).removeprefix('0b').zfill(8)
Out[19]: '0b01100001'

In [20]: hex(97)
Out[20]: '0x61' # 1 byte 'utf-8'

In [21]: '0x'+hex(97).removeprefix('0x').zfill(2)
Out[21]: '0x61'

In [22]: chr(97)
Out[22]: 'a'

In [23]: ord('a')
Out[23]: 97

The str method translate can use this translation table to convert characters from the Greek to the Latin alphabet:

In [24]: 'Γεια σου Κοσμο!'.translate(greek2latin)
Out[24]: 'Geia soy Kosmo!'
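
maketrans also accepts a single dict, whose keys are single characters and whose values may be strings of any length. The sketch below extends a lowercase table with a few multi-letter replacements; the particular digraphs chosen (th, ch, ps) are an assumption for illustration and not part of the mapping used above:

```python
lower_greek = 'αβγδεζηθικλμνξοπρστυφχψω'
lower_latin = 'abgdezhqiklmnxoprstyfcuw'

# one-to-one table, as in the example above (lowercase only)
table = str.maketrans(lower_greek, lower_latin)

# dict form: single-character keys, replacement strings of any length
table.update(str.maketrans({'θ': 'th', 'χ': 'ch', 'ψ': 'ps', 'ς': 's'}))

print('θεος'.translate(table))  # theos
print('χαος'.translate(table))  # chaos
```

Since the table is an ordinary dict keyed by code point, update simply overrides the earlier one-to-one entries for θ, χ and ψ and adds the final sigma ς.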

Recall that a static method is not bound to an instance or a class, but is merely found in the class's namespace as it's the expected place for the method to be found.

When the translation table was made, the two strings supplied had to contain an equal number of Unicode characters for one-to-one mapping. Sometimes it is desirable to create a translation table that removes characters entirely. In this case an empty string should be supplied for each of the first two positional arguments, and the characters that are to be mapped to None should be supplied as a third positional argument. Here these are the punctuation characters, which are available as string.punctuation:

In [25]: remove_punctuation = str.maketrans('', '', string.punctuation)
In [26]: remove_punctuation_as_dict = dict(remove_punctuation)
remove_punctuation_as_dict
Key Value
33 None
34 None
35 None
36 None
37 None
38 None
39 None
40 None
41 None
42 None
43 None
44 None
45 None
46 None
47 None
58 None
59 None
60 None
61 None
62 None
63 None
64 None
91 None
92 None
93 None
94 None
95 None
96 None
123 None
124 None
125 None
126 None

And this can be used to remove the punctuation, in combination with a casefold and split to get a list of lowercase words:

In [27]: 'Γεια σου Κοσμο!'.translate(remove_punctuation).casefold().split()
Out[27]: ['γεια', 'σου', 'κοσμο']

This can be used to count the number of occurrences of each word using a collection such as a Counter, and the top words can be examined:

In [28]: from collections import Counter
In [29]: Counter(['γεια', 'σου', 'κοσμο'])
Out[29]: Counter({'γεια': 1, 'σου': 1, 'κοσμο': 1})
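
The steps above can be combined into one function. This is a minimal sketch; word_counts and the sample sentence are hypothetical:

```python
import string
from collections import Counter

remove_punctuation = str.maketrans('', '', string.punctuation)


def word_counts(text):
    """Strip punctuation, casefold, split into words and count them."""
    words = text.translate(remove_punctuation).casefold().split()
    return Counter(words)


counts = word_counts('Γεια σου Κοσμο! Γεια σου, γεια!')
print(counts.most_common(2))  # [('γεια', 3), ('σου', 2)]
```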

This is a common first step in many natural language processing problems. A natural language toolkit for English would also filter out stop words:

In [30]: stop_words = ['a', 'an', 'the', 'at', 'by', 'for', 
                       'in', 'of', 'on', 'to', 'he', 'she', 
                       'it', 'they', 'we', 'you', 'I', 'me', 'my',
                       'your', 'and', 'but', 'or', 'so', 'yet', 'is', 
                       'am', 'are', 'was', 'were', 'be', 'being', 'been', 
                       'have', 'has', 'had', 'do', 'does', 'did', 'not', 
                       'this', 'that', 'these', 'those', 'all', 'any', 
                       'some', 'such'
                      ]

And would usually look for words that carry sentiment:

In [31]: sentiment_dict = {'positive': ['happy', 'joyful', 'love', 
                                        'excellent', 'great', 'fantastic', 
                                        'amazing', 'wonderful', 'cheerful', 
                                        'positive'],
                           'negative': ['sad', 'hate', 'terrible', 'awful', 
                                        'bad', 'horrible', 'disappointing', 
                                        'angry', 'frustrated', 'negative'],
                           'neutral': ['okay', 'fine', 'average', 'normal', 
                                       'medium','fair', 'indifferent', 
                                       'moderate', 'tolerable', 'usual']}

A natural language problem would essentially take a piece of text and convert it into a number, for example a score that can be evaluated across a large number of product reviews.
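
As a very rough sketch of this idea, the function below counts positive words minus negative words after removing punctuation and stop words. The scoring rule, the function name sentiment_score and the shortened word sets are assumptions made for illustration; real toolkits are considerably more sophisticated:

```python
import string

# small subsets of the stop words and sentiment words listed above
stop_words = {'i', 'it', 'is', 'the', 'this', 'was', 'and'}
positive = {'love', 'great', 'fantastic', 'excellent'}
negative = {'hate', 'terrible', 'awful', 'bad'}

remove_punctuation = str.maketrans('', '', string.punctuation)


def sentiment_score(review):
    """Hypothetical score: positive word count minus negative word count."""
    words = review.translate(remove_punctuation).casefold().split()
    content_words = [word for word in words if word not in stop_words]
    score = 0
    for word in content_words:
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    return score


print(sentiment_score('I love this product, it is fantastic!'))           # 2
print(sentiment_score('The delivery was terrible and the box was bad.'))  # -2
print(sentiment_score('It is okay.'))                                     # 0
```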

Some additional translation tables may need to be created to remove accents from accented characters, which casefold doesn't handle.
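
An alternative to hand-written translation tables, sketched here as an assumption rather than the approach described above, is to decompose the text with the standard unicodedata module and drop the combining marks. The name strip_accents is hypothetical:

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('Γειά σου Κόσμε!'))  # Γεια σου Κοσμε!
print(strip_accents('café'))             # cafe
```

This only removes marks that Unicode treats as separate combining characters after decomposition, so letters such as ø, which have no decomposition, are left unchanged.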

Python has a number of third-party natural language toolkits, which are out of the scope of this tutorial.

The re Module

The str class contains a number of simple methods which allow, for example, a substring to be found within a string. These are complemented by regular expressions. Consider the following str instance text:

In [32]: exit
In [1]: text = 'Email [email protected], [email protected] Telephone 0000000000 Website https://www.domain.com'

Notice it has two emails, a telephone number and a website, which a human reader can instantly recognise. Python has a regular expressions module re. The purpose of this module is to create a pattern in the form of a regular expression and search within a string for this pattern:

In [2]: import re
In [3]: re.
# Available Identifiers for `re` module
# -------------------------------
# Available Identifiers for `re`:
# -------------------------------------

## Functions
# - `re.match(pattern, string)`
# - `re.search(pattern, string)`
# - `re.findall(pattern, string)`
# - `re.finditer(pattern, string)`
# - `re.sub(pattern, repl, string)`
# - `re.subn(pattern, repl, string)`
# - `re.split(pattern, string)`
# - `re.compile(pattern, flags=0)`
# - `re.escape(string)`
# - `re.fullmatch(pattern, string)`
# - `re.purge()`

## Flags
# - `re.IGNORECASE`
# - `re.I`
# - `re.MULTILINE`
# - `re.M`
# - `re.DOTALL`
# - `re.S`
# - `re.VERBOSE`
# - `re.X`

## Match Object Methods
# - `match.group([group])`
# - `match.groups()`
# - `match.start([group])`
# - `match.end([group])`
# - `match.span([group])`
# - `match.re`
# - `match.string`
# - `match.lastindex`
# - `match.lastgroup`

## Special Sequences
# - `\d` - Matches any decimal digit.
# - `\D` - Matches any non-digit character.
# - `\w` - Matches any alphanumeric character (and underscore).
# - `\W` - Matches any non-alphanumeric character.
# - `\s` - Matches any whitespace character.
# - `\S` - Matches any non-whitespace character.
# - `\b` - Matches a word boundary.
# - `\B` - Matches a non-word boundary.

## Character Classes
# - `[abc]` - Matches any character in the set.
# - `[^abc]` - Matches any character not in the set.
# - `[a-z]` - Matches any character in the range from a to z.
# - `.` - Matches any character except a newline.

## Groups
# - `(...)` - Capturing group.
# - `(?:...)` - Non-capturing group.
# - `(?P<name>...)` - Named capturing group.
# - `(?=...)` - Positive lookahead.
# - `(?!...)` - Negative lookahead.
# - `(?<=...)` - Positive lookbehind.
# - `(?<!...)` - Negative lookbehind.

Notice the use of \ for a special sequence. Because \ is also the escape character in an ordinary string, a pattern should be supplied as a raw string with the prefix r. Lower case r is preferred as many IDEs will apply regular expression syntax highlighting to it. If upper case R is used, the raw string will still work as a regular expression but the IDE will highlight it in the same way as a normal string:

In [3]: email_pattern = r'\b[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
      : number_pattern = r'\b\d{10}\b'
      : website_pattern = r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'

If the email is examined [email protected] the pattern is r'\b[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'. This pattern can be broken down:

  • \b beginning of a word boundary.
  • [A-Za-z0-9._]+ is the local component of the email
    • [A-Z] # string.ascii_uppercase
    • [a-z] # string.ascii_lowercase
    • [0-9] # string.digits
    • [._] additional characters allowed in the local component
    • + used to denote 1 or more character
  • @ is the at symbol
  • [A-Za-z0-9.-]+ is the domain name
    • [A-Z] # string.ascii_uppercase
    • [a-z] # string.ascii_lowercase
    • [0-9] # string.digits
    • [.-] additional characters allowed in the domain name
    • + used to denote 1 or more characters
  • \. is the dot ., note that . is a special character in a regular expression (it matches any character), so in this case it is escaped with \ to match a literal dot.
  • [A-Z|a-z]{2,} is the top level domain
    • [A-Z] # string.ascii_uppercase
    • [a-z] # string.ascii_lowercase
    • | inside square brackets is a literal pipe rather than alternation, so this class also accepts |; [A-Za-z] is the stricter form
    • {2,} two or more characters
  • \b ending of a word boundary.

If the number is examined 0000000000 the pattern is r'\b\d{10}\b'. This pattern can be broken down:

  • \b beginning of a word boundary.
  • \d decimal characters
    • {10} ten of them
  • \b ending of a word boundary.

If the website is examined https://www.domain.com the pattern is r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'. This pattern can be broken down:

  • https? is the Hypertext Transfer Protocol (Secured)
    • http literal
    • s? optional, meaning s (may or may not be present)
    • :// literal (used to separate the protocol from the address)
  • (?:www\.)? is the optional www. prefix
    • (?:) creates a non-capturing group
    • www literal
    • \. the dot is escaped so that it matches a literal dot
    • ? optional, meaning www. (may or may not be present)
  • [A-Za-z0-9.-]+ is the domain (same as email)
  • \.[A-Z|a-z]{2,} is the top level domain (same as email)
  • \b ending of a word boundary.

The regular expression function findall will return a list of pattern matches:

In [4]: re.findall(email_pattern, text)
Out[4]: ['[email protected]', '[email protected]']

In [5]: re.findall(number_pattern, text)
Out[5]: ['0000000000']

In [6]: re.findall(website_pattern, text)
Out[6]: ['https://www.domain.com']
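
When more than the matched text is needed, a compiled pattern with named groups and finditer can be used. The sketch below uses a hypothetical sample string with placeholder addresses, and a variant of the email pattern with named groups local and domain added for illustration:

```python
import re

# hypothetical sample text, the addresses are placeholders
sample = 'Email alice@example.com, bob@example.org Telephone 0123456789'

# compile once, then reuse the pattern object
email_pattern = re.compile(
    r'\b(?P<local>[A-Za-z0-9._]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})\b')

# finditer yields match objects, giving access to groups and positions
pairs = []
for match in email_pattern.finditer(sample):
    pairs.append((match.group('local'), match.group('domain')))
    print(match.group('local'), match.group('domain'), match.span())

# re.sub replaces every match, here masking the telephone number
masked = re.sub(r'\b\d{10}\b', '**********', sample)
print(masked)
```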

The regular expressions module is very powerful and regular expressions can get quite complicated. A simple demonstration here was used just to show the concept.

Return to Python Tutorials