Recall from the previous tutorial covering the data model that the `object` class is the base class of all classes. `dir` can be used to view a list of its identifiers:
In [1]: dir(object)
Out[1]: [
'__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__',
'__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__',
'__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
'__sizeof__', '__str__', '__subclasshook__'
]
The identifiers can also be viewed if `object` is input followed by a dot `.`:
In [2]: object.
# -------------------------------
# Available Identifiers for `object`:
# -------------------------------------
# 🔧 Functions:
# - __init__(self, /, *args, **kwargs) : Initializes the object.
# - __new__(*args, **kwargs) : Creates a new instance of the class.
# - __dir__(self, /) : Default dir() implementation.
# - __sizeof__(self, /) : Returns the size of the object in memory, in bytes.
# - __eq__(self, value, /) : Checks for equality with another object.
# - __ne__(self, value, /) : Checks for inequality with another object.
# - __lt__(self, value, /) : Checks if the object is less than another.
# - __le__(self, value, /) : Checks if the object is less than or equal to another.
# - __gt__(self, value, /) : Checks if the object is greater than another.
# - __ge__(self, value, /) : Checks if the object is greater than or equal to another.
# - __repr__(self, /) : Returns a string representation of the object.
# - __str__(self, /) : Returns a string for display purposes.
# - __format__(self, format_spec, /) : Returns a formatted string representation of the object.
# - __hash__(self, /) : Returns a hash of the object.
# - __getattribute__(self, name, /) : Gets an attribute from the object.
# - __setattr__(self, name, value, /) : Sets an attribute on the object.
# - __delattr__(self, name, /) : Deletes an attribute from the object.
# - __reduce__(self, /) : Prepares the object for pickling.
# - __reduce_ex__(self, protocol, /) : Similar to __reduce__, with a protocol argument.
# - __init_subclass__(...) : Called when a class is subclassed; default
# implementation does nothing.
# - __subclasshook__(...) : Customize issubclass() for abstract classes.
#
# 🔍 Attributes:
# - __class__ : The class of the object.
# - __doc__ : The docstring of the object.
# -------------------------------------
If the `str` class is now examined, notice that it has many more identifiers:
In [2]: dir(str)
Out[2]: [
'__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__',
'__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__',
'__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode',
'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum',
'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric',
'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace',
'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines',
'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'
]
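The difference between the two lists can also be computed rather than read by eye. The following is a minimal sketch that uses only `dir` and the builtin `set`; the variable names are illustrative, and the exact number of identifiers depends on the Python version:

```python
# Identifiers available on object and on str
object_names = set(dir(object))
str_names = set(dir(str))

# Everything object provides is also available on str (inheritance)
print(object_names <= str_names)

# Identifiers that str supplements on top of object
extra_names = sorted(str_names - object_names)
print(len(extra_names))
print(extra_names[:5])
```

Set subtraction is used here because the order of the identifiers is not important, only their membership.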
Because `object` is the base class, it is present in the method resolution order of the `str` class:
In [3]: str.mro()
Out[3]: [str, object]
Recall that the `str` class inherits everything from the `object` class. Some identifiers are redefined in the `str` class to provide additional functionality, and further identifiers are added. The method resolution order essentially means that a method redefined in the `str` class is used in preference to the equivalent method in the `object` class.
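The lookup described above can be checked directly. This is a small sketch, assuming CPython, that uses `vars` to look at what the `str` class defines itself; `__repr__` and `__dir__` are simply convenient examples of a redefined and an inherited method:

```python
# The method resolution order is a list of classes, searched left to right
print(str.mro())  # [<class 'str'>, <class 'object'>]

# __repr__ is redefined in str, so the str version is found first
print('__repr__' in vars(str))          # True
print(str.__repr__ is object.__repr__)  # False

# __dir__ is not redefined in str, so the lookup falls through to object
print('__dir__' in vars(str))           # False
print(str.__dir__ is object.__dir__)    # True
```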
The `str` class follows the design pattern of the immutable `Collection` abstract base class and therefore has the behaviour of an immutable `Collection`. When `str` is input followed by a dot `.`, the identifiers are typically listed alphabetically. However, it is easier to understand the identifiers in the `str` class when they are grouped by design pattern and purpose:
In [4]: str.
# -------------------------------
# Available Identifiers for `str`:
# -------------------------------------
# 🔧 Functions from `object` (inherited by `str`):
# - __init__(self, /, *args, **kwargs) : Initializes the object.
# - __new__(*args, **kwargs) : Creates a new instance of the class.
# - __dir__(self, /) : Default dir() implementation.
# - __sizeof__(self, /) : Returns the size of the object in memory, in bytes.
# - __eq__(self, value, /) : Checks for equality with another object.
# - __ne__(self, value, /) : Checks for inequality with another object.
# - __lt__(self, value, /) : Checks if the object is less than another.
# - __le__(self, value, /) : Checks if the object is less than or equal to another.
# - __gt__(self, value, /) : Checks if the object is greater than another.
# - __ge__(self, value, /) : Checks if the object is greater than or equal to another.
# - __repr__(self, /) : Returns a string representation of the object.
# - __str__(self, /) : Returns a string for display purposes.
# - __format__(self, format_spec, /) : Returns a formatted string representation of the object.
# - __hash__(self, /) : Returns a hash of the object.
# - __getattribute__(self, name, /) : Gets an attribute from the object.
# - __setattr__(self, name, value, /) : Sets an attribute on the object.
# - __delattr__(self, name, /) : Deletes an attribute from the object.
# - __reduce__(self, /) : Prepares the object for pickling.
# - __reduce_ex__(self, protocol, /) : Similar to __reduce__, with a protocol argument.
# 🔍 Attributes from `object`:
# - __class__ : The class of the string.
# - __doc__ : The docstring of the string class.
# 🔧 Collection-Based Methods (from `str` and the Collection ABC):
# - __contains__(self, key, /) : Checks if a substring is in the string (`in`).
# - __iter__(self, /) : Returns an iterator over the string.
# - __len__(self, /) : Returns the length of the string.
# - __getitem__(self, key, /) : Retrieves a character by index (`[]`).
# - count(self, sub, start=0, : Counts the occurrences of a substring.
# end=9223372036854775807, /)
# - index(self, sub, start=0, : Returns the index of the first occurrence of a substring.
# end=9223372036854775807, /)
# 🔧 Supplementary Collection-Based Methods:
# - rindex(self, sub, start=0, : Returns the highest index of a substring.
# end=9223372036854775807, /)
# - find(self, sub, start=0, : Finds the first index of a substring.
# end=9223372036854775807, /)
# - rfind(self, sub, start=0, : Finds the highest index of a substring.
# end=9223372036854775807, /)
# - replace(self, old, new, count=-1, /) : Replaces occurrences of a substring.
# 🔧 Collection-Like Operators:
# - __add__(self, value, /) : Implements string concatenation (`+`).
# - __mul__(self, value, /) : Implements string repetition (`*`).
# - __rmul__(self, value, /) : Implements reflected multiplication (`*`).
# 🔧 Encoding-Related Methods:
# - encode(self, /, encoding='utf-8', : Encodes the string using a specified encoding.
# errors='strict')
# 🔧 Additional String-Specific Methods (Grouped by Similarity):
# 🔧 Formatting Methods:
# - format(self, /, *args, **kwargs) : Formats the string using a format string.
# - format_map(self, mapping, /) : Formats the string using a dictionary.
# - translate(self, table, /) : Maps characters using a translation table.
# - __mod__(self, value, /) : Implements C style string formatting using `%`.
# - __rmod__(self, value, /) : Implements reverse C style string formatting using `%`.
# 🅰️ Case-Specific Methods:
# - lower(self, /) : Converts all characters to lowercase.
# - casefold(self, /) : Returns a casefolded version for caseless matching.
# - upper(self, /) : Converts all characters to uppercase.
# - capitalize(self, /) : Capitalizes the first character of the string.
# - title(self, /) : Returns a title-cased version of the string.
# - swapcase(self, /) : Swaps the case of all characters.
# 🔠 Boolean Methods (Grouped by Type):
# Character Classification:
# - isascii(self, /) : Checks if all characters are ASCII.
# - isalpha(self, /) : Checks if the string contains only alphabetic characters.
# Numeric Classification:
# - isdecimal(self, /) : Checks if the string contains only decimal characters.
# - isdigit(self, /) : Checks if the string contains only digits.
# - isnumeric(self, /) : Checks if the string contains only numeric characters.
# Whitespace and Titlecase:
# - islower(self, /) : Checks if all characters are lowercase.
# - isupper(self, /) : Checks if all characters are uppercase.
# - isspace(self, /) : Checks if the string contains only whitespace.
# - istitle(self, /) : Checks if the string is title-cased.
# - isprintable(self, /) : Checks if all characters are printable.
# - isidentifier(self, /) : Checks if the string is a valid Python identifier.
# Starts or Ends With:
# - startswith(self, prefix, start=0, : Checks if the string starts with a prefix.
# end=9223372036854775807, /)
# - endswith(self, suffix, start=0, : Checks if the string ends with a suffix.
# end=9223372036854775807, /)
# 🔄 Manipulation Methods (Grouping Similar Functions):
# - ljust(self, width, fillchar=' ', /) : Left-justifies the string in a field of a given width.
# - rjust(self, width, fillchar=' ', /) : Right-justifies the string in a field of a given width.
# - center(self, width, fillchar=' ', /) : Centers the string in a field of a given width.
# - zfill(self, width, /) : Pads the string with zeros on the left.
# - expandtabs(self, tabsize=8, /) : Expands tabs in the string into spaces.
# 🔄 Stripping Methods:
# - lstrip(self, chars=None, /) : Strips leading characters from the string.
# - rstrip(self, chars=None, /) : Strips trailing characters from the string.
# - strip(self, chars=None, /) : Strips leading and trailing characters from the string.
# - removeprefix(self, prefix, /) : Removes the specified prefix from the string.
# - removesuffix(self, suffix, /) : Removes the specified suffix from the string.
# 🧩 Splitting and Joining:
# - split(self, sep=None, maxsplit=-1, /) : Splits the string at occurrences of a separator.
# - rsplit(self, sep=None, maxsplit=-1, /) : Splits the string at occurrences of a separator, from the right.
# - splitlines(self, keepends=False, /) : Splits the string at line breaks.
# - join(self, iterable, /) : Joins an iterable with the string as a separator.
# - partition(self, sep, /) : Splits the string into a 3-tuple around a separator.
# - rpartition(self, sep, /) : Splits the string into a 3-tuple around a separator, from the right.
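The `Collection` behaviour grouped in the listing above can be confirmed with the abstract base classes in the standard library module `collections.abc`. The sketch below is illustrative only; the example text is an assumption chosen for the demonstration:

```python
from collections.abc import Collection, Sequence, MutableSequence

text = 'Hello World!'

# str satisfies the Collection ABC: it is sized, iterable and a container
print(isinstance(text, Collection))       # True
print(len(text))                          # 12 (__len__)
print('World' in text)                    # True (__contains__)
print(list(text)[:5])                     # ['H', 'e', 'l', 'l', 'o'] (__iter__)

# It is also an ordered Sequence, so indexing, count and index are available
print(isinstance(text, Sequence))         # True
print(text[0], text.count('l'), text.index('W'))  # H 3 6

# It is immutable, so it is not a MutableSequence
print(isinstance(text, MutableSequence))  # False
```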
A `str` is a `Collection` where each element (fundamental unit) in the `Collection` is a Unicode character. The `str` class is always Unicode and, by default, uses the Unicode Transformation Format-8 (UTF-8) whenever a Unicode character has to be encoded. This greatly simplifies text-related operations, as the user does not need to handle encoding and decoding using various other translation tables.
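To make the idea of a Unicode character as the fundamental unit concrete, the builtin functions `ord` and `chr` convert between a character and its Unicode code point. This is a minimal sketch; the short Greek word is just an example value:

```python
text = 'Γεια'

# Iterating over a str yields one Unicode character at a time
characters = list(text)
print(characters)          # ['Γ', 'ε', 'ι', 'α']

# ord gives the Unicode code point of a character, chr is the inverse
code_points = [ord(character) for character in text]
print([hex(code_point) for code_point in code_points])  # ['0x393', '0x3b5', '0x3b9', '0x3b1']
print(chr(0x393))          # Γ
```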
Another text datatype is the `bytes` class. The `bytes` class is also a `Collection`, where each element in the `Collection` is a byte. The byte is a logical unit in a computer's memory. It is helpful to conceptualise it as a combination of 8 binary switches. Each combination of the 8 switches corresponds to an `int` between `0` and `255` inclusive (256 combinations in total), so the `bytes` class also has some numeric behaviour. An encoding standard is used to designate a single byte or multiple bytes to a Unicode character. However, unlike the Unicode `str`, there are a variety of encoding tables, and the numeric `bytes` `Collection` must be encoded and decoded using the same encoding table for the text to make sense. Notice that the identifiers in the `bytes` class are largely consistent with the identifiers in the `str` class, but may behave slightly differently as they use a different unit in the `Collection`:
In [4]: dir(bytes)
Out[4]: ['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__',
'__format__', '__ge__', '__getitem__', '__getattribute__', '__gt__',
'__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__ne__', '__repr__', '__rmod__', '__sizeof__', '__str__',
'__bytes__', 'capitalize', 'count', 'decode', 'endswith',
'expandtabs', 'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isupper', 'join', 'ljust', 'lower', 'replace',
'rfind', 'rindex', 'rjust', 'split', 'splitlines', 'startswith',
'title', 'upper', 'zfill']
In [5]: bytes.
# -------------------------------
# Available Identifiers for `bytes`:
# -------------------------------------
# 🔧 Functions from `object` (inherited by `bytes`):
# - __init__(self, /, *args, **kwargs) : Initializes the object.
# - __new__(*args, **kwargs) : Creates a new instance of the class.
# - __dir__(self, /) : Default dir() implementation.
# - __sizeof__(self, /) : Returns the size of the object in memory, in bytes.
# - __eq__(self, value, /) : Checks for equality with another object.
# - __ne__(self, value, /) : Checks for inequality with another object.
# - __lt__(self, value, /) : Checks if the object is less than another.
# - __le__(self, value, /) : Checks if the object is less than or equal to another.
# - __gt__(self, value, /) : Checks if the object is greater than another.
# - __ge__(self, value, /) : Checks if the object is greater than or equal to another.
# - __repr__(self, /) : Returns a string representation of the object.
# - __str__(self, /) : Returns a string for display purposes.
# - __format__(self, format_spec, /) : Returns a formatted string representation of the object.
# - __hash__(self, /) : Returns a hash of the object.
# - __getattribute__(self, name, /) : Gets an attribute from the object.
# - __setattr__(self, name, value, /) : Sets an attribute on the object.
# - __delattr__(self, name, /) : Deletes an attribute from the object.
# - __reduce__(self, /) : Prepares the object for pickling.
# - __reduce_ex__(self, protocol, /) : Similar to __reduce__, with a protocol argument.
# 🔍 Attributes from `object`:
# - __class__ : The class of the bytes object.
# - __doc__ : The docstring of the bytes class.
# 🔧 Collection-Based Methods (from `bytes` and the Collection ABC):
# - __contains__(self, key, /) : Checks if a byte value is in the bytes (`in`).
# - __iter__(self, /) : Returns an iterator over the bytes.
# - __len__(self, /) : Returns the length of the bytes.
# - __getitem__(self, key, /) : Retrieves a byte by index (`[]`).
# - count(self, sub, start=0, : Counts the occurrences of a sub-byte sequence.
# end=9223372036854775807, /)
# - index(self, sub, start=0, : Returns the index of the first occurrence of a sub-byte.
# end=9223372036854775807, /)
# 🔧 Supplementary Collection-Based Methods:
# - rindex(self, sub, start=0, : Returns the highest index of a sub-byte sequence.
# end=9223372036854775807, /)
# - find(self, sub, start=0, : Finds the index of a sub-byte sequence.
# end=9223372036854775807, /)
# - rfind(self, sub, start=0, : Finds the highest index of a sub-byte sequence.
# end=9223372036854775807, /)
# - replace(self, old, new, count=-1, /) : Replaces occurrences of a sub-byte sequence.
# 🔧 Collection-Like Operators:
# - __add__(self, value, /) : Implements bytes concatenation (`+`).
# - __mul__(self, value, /) : Implements bytes repetition (`*`).
# - __rmul__(self, value, /) : Implements reflected multiplication (`*`).
# 🔧 Encoding-Related Methods:
# - decode(self, encoding='utf-8', : Decodes the bytes using a specified encoding.
# errors='strict', /)
# 🔧 Bytes-Specific Dunder Methods (from `bytes`):
# - __bytes__(self, /) : Returns a copy of the bytes object.
# 🔧 Additional Bytes-Specific Methods (Grouped by Similarity):
# 🔧 Formatting and Representation:
# - hex(self, /) : Returns a string of hexadecimal values.
# - fromhex(string, /) : Creates a `bytes` object from a hexadecimal string.
# - __mod__(self, value, /) : Implements C-style formatting using `%`.
# - __rmod__(self, value, /) : Implements reverse C-style formatting using `%`.
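# 🅰️ Case-Specific Methods (each returns a new `bytes` object):
# - lower(self, /) : Converts ASCII characters to lowercase.
# - upper(self, /) : Converts ASCII characters to uppercase.
# - capitalize(self, /) : Capitalizes the first byte and lowercases the rest.
# - title(self, /) : Converts to title case.
# - swapcase(self, /) : Swaps the case of ASCII characters.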
# 🔠 Boolean Methods (Data Validation):
# - isalnum(self, /) : Checks if all bytes are alphanumeric.
# - isalpha(self, /) : Checks if all bytes are alphabetic.
# - isascii(self, /) : Checks if all bytes are ASCII.
# - isdigit(self, /) : Checks if all bytes are digits.
# - islower(self, /) : Checks if all bytes are lowercase alphabetic.
# - isupper(self, /) : Checks if all bytes are uppercase alphabetic.
# - isspace(self, /) : Checks if all bytes are whitespace.
# - startswith(self, prefix, start=0, : Checks if starts with a prefix.
# end=9223372036854775807, /)
# - endswith(self, suffix, start=0, : Checks if ends with a suffix.
# end=9223372036854775807, /)
# 🔄 Manipulation Methods (Grouping Similar Functions):
# - ljust(self, width, fillchar=b' ', /) : Left-justifies in a field of a given width.
# - rjust(self, width, fillchar=b' ', /) : Right-justifies in a field of a given width.
# - center(self, width, fillchar=b' ', /) : Centers in a field of a given width.
# - zfill(self, width, /) : Pads with zeros on the left.
# - expandtabs(self, tabsize=8, /) : Expands tabs into spaces.
# 🔄 Stripping Methods:
# - lstrip(self, bytes=None, /) : Strips leading bytes from the bytes object.
# - rstrip(self, bytes=None, /) : Strips trailing bytes from the bytes object.
# - strip(self, bytes=None, /) : Strips leading and trailing bytes from the bytes object.
# 🧩 Splitting and Joining:
# - split(self, sep=None, maxsplit=-1, /) : Splits at occurrences of a separator.
# - rsplit(self, sep=None, maxsplit=-1, /) : Splits at occurrences of a separator, from the right.
# - splitlines(self, keepends=False, /) : Splits at line breaks.
# - join(self, iterable_of_bytes, /) : Joins an iterable with bytes as a separator.
# - partition(self, sep, /) : Splits into a 3-tuple around a separator.
# - rpartition(self, sep, /) : Splits into a 3-tuple around a separator, from the right.
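The numeric behaviour of `bytes`, and the need to decode with the same encoding table that was used to encode, can be illustrated as follows. This is a sketch under stated assumptions: the Greek word and the choice of `'latin-1'` as the mismatched table are example values only:

```python
text = 'Γεια'

# Encoding produces a bytes Collection; each element is an int from 0 to 255
data = text.encode('utf-8')
print(data)
print(list(data[:2]))        # [206, 147] -> the two bytes that encode 'Γ'

# Values outside the range of a single byte are rejected
try:
    bytes([256])
except ValueError as error:
    print(error)

# Decoding with the same table recovers the text
round_trip = data.decode('utf-8')
print(round_trip == text)    # True

# Decoding with a different table gives different, meaningless characters
mismatched = data.decode('latin-1')
print(mismatched == text)    # False
```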
The `str` and `bytes` classes are immutable, which essentially means that every method, with the exception of the constructor, returns a new instance (of the same class or of a different class) rather than modifying the original. The `bytes` class has a mutable counterpart, the `bytearray` class, which has additional methods that mutate the `bytearray` in place:
In [5]: bytearray.
# -------------------------------
# Available Identifiers for `bytearray`:
# -------------------------------------
# 🔧 Functions from `object` (inherited by `bytearray`):
# - __init__(self, /, *args, **kwargs) : Initializes the object.
# - __new__(*args, **kwargs) : Creates a new instance of the class.
# - __dir__(self, /) : Default dir() implementation.
# - __sizeof__(self, /) : Returns the size of the object in memory, in bytes.
# - __eq__(self, value, /) : Checks for equality with another object.
# - __ne__(self, value, /) : Checks for inequality with another object.
# - __lt__(self, value, /) : Checks if the object is less than another.
# - __le__(self, value, /) : Checks if the object is less than or equal to another.
# - __gt__(self, value, /) : Checks if the object is greater than another.
# - __ge__(self, value, /) : Checks if the object is greater than or equal to another.
# - __repr__(self, /) : Returns a string representation of the object.
# - __str__(self, /) : Returns a string for display purposes.
# - __format__(self, format_spec, /) : Returns a formatted string representation of the object.
# - __hash__ : Set to `None`; a mutable bytearray is not hashable.
# - __getattribute__(self, name, /) : Gets an attribute from the object.
# - __setattr__(self, name, value, /) : Sets an attribute on the object.
# - __delattr__(self, name, /) : Deletes an attribute from the object.
# - __reduce__(self, /) : Prepares the object for pickling.
# - __reduce_ex__(self, protocol, /) : Similar to __reduce__, with a protocol argument.
# 🔍 Attributes from `object`:
# - __class__ : The class of the bytearray object.
# - __doc__ : The docstring of the bytearray class.
# 🔧 Collection-Based Methods (from `bytearray` and the Collection ABC):
# - __contains__(self, key, /) : Checks if a byte value is in the bytearray (`in`).
# - __iter__(self, /) : Returns an iterator over the bytearray.
# - __len__(self, /) : Returns the length of the bytearray.
# - __getitem__(self, key, /) : Retrieves a byte by index (`[]`).
# - count(self, sub, start=0, : Counts the occurrences of a sub-byte sequence.
# end=9223372036854775807, /)
# - index(self, sub, start=0, : Returns the index of the first occurrence of a sub-byte.
# end=9223372036854775807, /)
# 🔧 Supplementary Collection-Based Methods:
# - rindex(self, sub, start=0, : Returns the highest index of a sub-byte sequence.
# end=9223372036854775807, /)
# - find(self, sub, start=0, : Finds the lowest index of a sub-byte sequence.
# end=9223372036854775807, /)
# - rfind(self, sub, start=0, : Finds the highest index of a sub-byte sequence.
# end=9223372036854775807, /)
# - replace(self, old, new, count=-1, /) : Replaces occurrences of a sub-byte sequence.
# 🔧 Mutable Collection-Specific Methods:
# - __setitem__(self, key, value, /) : Assigns a value to an item (`[] =`).
# - __delitem__(self, key, /) : Deletes an item from the bytearray.
# - append(self, item, /) : Appends a byte to the end of the bytearray.
# - extend(self, iterable_of_ints, /) : Appends multiple bytes to the bytearray.
# - insert(self, index, item, /) : Inserts a byte at a specific position.
# - pop(self, index=-1, /) : Removes and returns a byte at a given index.
# - remove(self, value, /) : Removes the first occurrence of a value.
# - clear(self, /) : Removes all bytes from the bytearray.
# - reverse(self, /) : Reverses the order of bytes in place.
# 🔧 Collection-Like Operators:
# - __add__(self, value, /) : Implements bytearray concatenation (`+`).
# - __mul__(self, value, /) : Implements bytearray repetition (`*`).
# - __rmul__(self, value, /) : Implements reflected multiplication (`*`).
# 🔧 Encoding-Related Methods:
# - decode(self, encoding='utf-8', : Decodes the bytearray using a specified encoding.
# errors='strict', /)
# 🔧 Bytes-Specific Dunder Methods (from `bytearray`):
# - __bytes__(self, /) : Returns a bytes object copy of the bytearray.
# 🔧 Additional Bytearray-Specific Methods (Grouped by Similarity):
# 🔧 Formatting and Representation:
# - hex(self, /) : Returns a string of hexadecimal values.
# - fromhex(string, /) : Creates a `bytearray` object from a hexadecimal string.
# - __mod__(self, value, /) : Implements C-style formatting using `%`.
# - __rmod__(self, value, /) : Implements reverse C-style formatting using `%`.
# 🅰️ Case-Specific Methods:
# - lower(self, /) : Converts to lowercase.
# - upper(self, /) : Converts to uppercase.
# - capitalize(self, /) : Capitalizes the first byte.
# - title(self, /) : Converts to title case.
# - swapcase(self, /) : Swaps case.
# 🔠 Boolean Methods (Data Validation):
# - isalnum(self, /) : Checks if all bytes are alphanumeric.
# - isalpha(self, /) : Checks if all bytes are alphabetic.
# - isascii(self, /) : Checks if all bytes are ASCII.
# - isdigit(self, /) : Checks if all bytes are digits.
# - islower(self, /) : Checks if all bytes are lowercase alphabetic.
# - isupper(self, /) : Checks if all bytes are uppercase alphabetic.
# - isspace(self, /) : Checks if all bytes are whitespace.
# - startswith(self, prefix, start=0, : Checks if starts with a prefix.
# end=9223372036854775807, /)
# - endswith(self, suffix, start=0, : Checks if ends with a suffix.
# end=9223372036854775807, /)
# 🔄 Manipulation Methods (Grouping Similar Functions):
# - ljust(self, width, fillchar=b' ', /) : Left-justifies in a field of a given width.
# - rjust(self, width, fillchar=b' ', /) : Right-justifies in a field of a given width.
# - center(self, width, fillchar=b' ', /) : Centers in a field of a given width.
# - zfill(self, width, /) : Pads with zeros on the left.
# - expandtabs(self, tabsize=8, /) : Expands tabs into spaces.
# 🔄 Stripping Methods:
# - lstrip(self, bytes=None, /) : Strips leading bytes from the bytearray.
# - rstrip(self, bytes=None, /) : Strips trailing bytes from the bytearray.
# - strip(self, bytes=None, /) : Strips leading and trailing bytes from the bytearray.
# 🧩 Splitting and Joining:
# - split(self, sep=None, maxsplit=-1, /) : Splits at occurrences of a separator.
# - rsplit(self, sep=None, maxsplit=-1, /) : Splits at occurrences of a separator, from the right.
# - splitlines(self, keepends=False, /) : Splits at line breaks.
# - join(self, iterable_of_bytes, /) : Joins an iterable with bytearray as a separator.
# - partition(self, sep, /) : Splits into a 3-tuple around a separator.
# - rpartition(self, sep, /) : Splits into a 3-tuple around a separator, from the right.
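The difference between the immutable `bytes` class and the mutable `bytearray` class can be sketched as follows. The variable names and the example values are assumptions made for illustration:

```python
greeting = b'hello'

# Methods of the immutable bytes class return a new instance
shout = greeting.upper()
print(greeting, shout)            # b'hello' b'HELLO'

# Item assignment is not supported by bytes
try:
    greeting[0] = ord('H')
except TypeError as error:
    print(error)

# The mutable bytearray can be changed in place
buffer = bytearray(greeting)
identity_before = id(buffer)
buffer[0] = ord('H')              # __setitem__
buffer.append(ord('!'))           # append a single byte (an int)
buffer.extend(b' world')          # extend with several bytes
print(buffer)                     # bytearray(b'Hello! world')
print(id(buffer) == identity_before)  # True
```

The unchanged `id` shows that the same `bytearray` instance was modified, whereas `upper` on the `bytes` instance produced a separate object.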
A `str` instance can be explicitly instantiated using:
In [5]: exit
In [6]: str('Hello World!')
Out[6]: 'Hello World!'
The return value is shown using the formal representation, which, recall, is the preferred way to initialise a `str`. Since the `str` class is the fundamental text class in `builtins`, the preferred way to create a `str` instance is with a string literal, without explicitly using the `str` class. A Unicode `str` can use any Unicode character. In this example Greek letters will be used:
In [7]: 'Γεια σου Κοσμο!'
Out[7]: 'Γεια σου Κοσμο!'
Greek Alphabet
Greek Alphabet | Uppercase | Lower Case |
---|---|---|
Alpha | Α | α |
Beta | Β | β |
Gamma | Γ | γ |
Delta | Δ | δ |
Epsilon | Ε | ε or ϵ |
Zeta | Ζ | ζ |
Eta | Η | η |
Theta | Θ | θ |
Iota | Ι | ι |
Kappa | Κ | κ |
Lambda | Λ | λ |
Mu | Μ | μ |
Nu | Ν | ν |
Xi | Ξ | ξ |
Omicron | Ο | ο |
Pi | Π | π |
Rho | Ρ | ρ |
Sigma | Σ | σ or ς |
Tau | Τ | τ |
Upsilon | Υ | υ |
Phi | Φ | φ |
Chi | Χ | χ |
Psi | Ψ | ψ |
Omega | Ω | ω |
The `str` instances can be assigned to object names:
In [8]: ascii_text = 'Hello World!'
text = 'Γεια σου Κοσμο!'
These will be displayed in the Variable Explorer:
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
text | str | 15 | Γεια σου Κοσμο! |
Notice that the Variable Explorer displays the type and the size. For a `str`, the size is its length, which is the number of Unicode characters.
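The sizes shown in the Variable Explorer correspond to `len`, which counts Unicode characters rather than bytes. The following sketch compares the two; the choice of `'utf-8'` and `'utf-16-le'` as example encodings is an assumption made for illustration:

```python
ascii_text = 'Hello World!'
text = 'Γεια σου Κοσμο!'

# len counts Unicode characters, matching the Size column
print(len(ascii_text))                   # 12
print(len(text))                         # 15

# The number of bytes depends on the encoding that is chosen
print(len(ascii_text.encode('utf-8')))   # 12
print(len(text.encode('utf-8')))         # 27
print(len(text.encode('utf-16-le')))     # 30
```

In UTF-8 each of the 12 Greek letters takes 2 bytes, while the 2 spaces and the exclamation mark take 1 byte each, giving 27 bytes.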
Another text datatype is the `bytes` class. Recall that the `bytes` class is a `Collection` where each element in the `Collection` is a byte, and a byte can be conceptualised as a combination of 8 switches, each of which is either off (0) or on (1).
The `bytes` class requires an encoding table. The encoding table maps a command to a memory configuration in `bytes`. One of the first widespread encoding tables was the American Standard Code for Information Interchange (ASCII). Very early generation computers were based on the typewriter. The typewriter has a limited number of commands that control the device. Many of these commands are printable key presses; however, there are also commands that aren't printable, such as the carriage return and form feed, which need to be used in order to print text out onto a piece of paper.
Notice that the limited number of characters in ASCII is essentially restricted to the English language. Select ASCII Encoding to view all the ASCII characters:
ASCII Encoding
Binary to Character Mapping | |
---|---|
Binary | Character |
0b00000000 | NUL (null character) |
0b00000001 | SOH (start of header) |
0b00000010 | STX (start of text) |
0b00000011 | ETX (end of text) |
0b00000100 | EOT (end of transmission) |
0b00000101 | ENQ (enquiry) |
0b00000110 | ACK (acknowledge) |
0b00000111 | BEL (bell) |
0b00001000 | BS (backspace) |
0b00001001 | TAB (horizontal tab) |
0b00001010 | LF (line feed) |
0b00001011 | VT (vertical tab) |
0b00001100 | FF (form feed) |
0b00001101 | CR (carriage return) |
0b00001110 | SO (shift out) |
0b00001111 | SI (shift in) |
0b00010000 | DLE (data link escape) |
0b00010001 | DC1 (device control 1) |
0b00010010 | DC2 (device control 2) |
0b00010011 | DC3 (device control 3) |
0b00010100 | DC4 (device control 4) |
0b00010101 | NAK (negative acknowledge) |
0b00010110 | SYN (synchronous idle) |
0b00010111 | ETB (end of transmission block) |
0b00011000 | CAN (cancel) |
0b00011001 | EM (end of medium) |
0b00011010 | SUB (substitute) |
0b00011011 | ESC (escape) |
0b00011100 | FS (file separator) |
0b00011101 | GS (group separator) |
0b00011110 | RS (record separator) |
0b00011111 | US (unit separator) |
0b00100000 | SP (space) |
0b00100001 | ! (exclamation mark) |
0b00100010 | " (double quote) |
0b00100011 | # (number sign) |
0b00100100 | $ (dollar sign) |
0b00100101 | % (percent) |
0b00100110 | & (ampersand) |
0b00100111 | ' (single quote) |
0b00101000 | ( (left parenthesis) |
0b00101001 | ) (right parenthesis) |
0b00101010 | * (asterisk) |
0b00101011 | + (plus) |
0b00101100 | , (comma) |
0b00101101 | - (hyphen) |
0b00101110 | . (period) |
0b00101111 | / (slash) |
0b00110000 | 0 (digit zero) |
0b00110001 | 1 (digit one) |
0b00110010 | 2 (digit two) |
0b00110011 | 3 (digit three) |
0b00110100 | 4 (digit four) |
0b00110101 | 5 (digit five) |
0b00110110 | 6 (digit six) |
0b00110111 | 7 (digit seven) |
0b00111000 | 8 (digit eight) |
0b00111001 | 9 (digit nine) |
0b00111010 | : (colon) |
0b00111011 | ; (semicolon) |
0b00111100 | < (less than) |
0b00111101 | = (equal sign) |
0b00111110 | > (greater than) |
0b00111111 | ? (question mark) |
0b01000000 | @ (commercial at) |
0b01000001 | A (uppercase A) |
0b01000010 | B (uppercase B) |
0b01000011 | C (uppercase C) |
0b01000100 | D (uppercase D) |
0b01000101 | E (uppercase E) |
0b01000110 | F (uppercase F) |
0b01000111 | G (uppercase G) |
0b01001000 | H (uppercase H) |
0b01001001 | I (uppercase I) |
0b01001010 | J (uppercase J) |
0b01001011 | K (uppercase K) |
0b01001100 | L (uppercase L) |
0b01001101 | M (uppercase M) |
0b01001110 | N (uppercase N) |
0b01001111 | O (uppercase O) |
0b01010000 | P (uppercase P) |
0b01010001 | Q (uppercase Q) |
0b01010010 | R (uppercase R) |
0b01010011 | S (uppercase S) |
0b01010100 | T (uppercase T) |
0b01010101 | U (uppercase U) |
0b01010110 | V (uppercase V) |
0b01010111 | W (uppercase W) |
0b01011000 | X (uppercase X) |
0b01011001 | Y (uppercase Y) |
0b01011010 | Z (uppercase Z) |
0b01011011 | [ (left square bracket) |
0b01011100 | \ (backslash) |
0b01011101 | ] (right square bracket) |
0b01011110 | ^ (caret) |
0b01011111 | _ (underscore) |
0b01100000 | ` (grave accent) |
0b01100001 | a (lowercase a) |
0b01100010 | b (lowercase b) |
0b01100011 | c (lowercase c) |
0b01100100 | d (lowercase d) |
0b01100101 | e (lowercase e) |
0b01100110 | f (lowercase f) |
0b01100111 | g (lowercase g) |
0b01101000 | h (lowercase h) |
0b01101001 | i (lowercase i) |
0b01101010 | j (lowercase j) |
0b01101011 | k (lowercase k) |
0b01101100 | l (lowercase l) |
0b01101101 | m (lowercase m) |
0b01101110 | n (lowercase n) |
0b01101111 | o (lowercase o) |
0b01110000 | p (lowercase p) |
0b01110001 | q (lowercase q) |
0b01110010 | r (lowercase r) |
0b01110011 | s (lowercase s) |
0b01110100 | t (lowercase t) |
0b01110101 | u (lowercase u) |
0b01110110 | v (lowercase v) |
0b01110111 | w (lowercase w) |
0b01111000 | x (lowercase x) |
0b01111001 | y (lowercase y) |
0b01111010 | z (lowercase z) |
0b01111011 | { (left curly brace) |
0b01111100 | \| (vertical bar) |
0b01111101 | } (right curly brace) |
0b01111110 | ~ (tilde) |
0b01111111 | DEL (delete) |
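The printable part of the table above can be regenerated with the builtin functions `format` and `chr`. This is a minimal sketch; the `rows` list and the slice that is printed are illustrative choices:

```python
# Build (binary, character) rows for the printable ASCII range 0x20 to 0x7e
rows = []
for code in range(0x20, 0x7f):
    binary = format(code, '#010b')   # 8 binary digits plus the 0b prefix
    rows.append((binary, chr(code)))

# Show the rows for the uppercase letters A, B and C
for binary, character in rows[33:36]:
    print(binary, character)
```

The format specification `'#010b'` pads to a total width of 10 characters, which is the `0b` prefix followed by 8 binary digits.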
The bytes
class can be used to cast a str
instance to a bytes
instance:
In [9]: bytes(ascii_text)
bytes(ascii_text)
Traceback (most recent call last):
Cell In[9], line 1
bytes(ascii_text)
TypeError: string argument without an encoding
Notice an encoding table needs to be specified:
In [10]: bytes(ascii_text, encoding='ascii')
Out[10]: b'Hello World!'
The formal representation shows the preferred way of constructing a bytes instance that consists only of ASCII characters. Notice that the prefix b is used to distinguish a bytes object from a str object:
In [11]: b'Hello World!'
Out[11]: b'Hello World!'
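The conversion above can be sketched as a short script. This is only an illustration: the names message, ascii_bytes and same_bytes are not from the tutorial, and str.encode is shown as an equivalent spelling of the same conversion.

```python
ascii_text = 'Hello World!'

# Without an encoding, bytes() cannot convert a str and raises a TypeError.
try:
    bytes(ascii_text)
except TypeError as error:
    message = str(error)

# Supplying the encoding table returns the bytes instance.
ascii_bytes = bytes(ascii_text, encoding='ascii')

# str.encode performs the same conversion.
same_bytes = ascii_text.encode(encoding='ascii')

print(message)
print(ascii_bytes)
print(ascii_bytes == same_bytes)
```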
\ is a special character in a string (str object or bytes object) that introduces an escape sequence. For example \t is a tab and \n is a new line (\n is the line feed character; Windows text files conventionally end each line with a carriage return \r followed by a line feed \n):
In [12]: ascii_text = 'Hello\tWorld!'
    ...: b_ascii_text = b'Hello\tWorld!'
Notice that the Variable Explorer will display the printed format with the escape character processed:
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
b_ascii_text | bytes | 12 | Hello World! |
text | str | 15 | Γεια σου Κοσμο! |
Binary is machine readable but humans have problems transcribing a long line of zeros and ones. Therefore it is common to split the 8 bit byte into two 4 bit halves. Each half is represented by a single hexadecimal character:
binary | hexadecimal | decimal |
---|---|---|
0b0000 | 0x0 | 0 |
0b0001 | 0x1 | 1 |
0b0010 | 0x2 | 2 |
0b0011 | 0x3 | 3 |
0b0100 | 0x4 | 4 |
0b0101 | 0x5 | 5 |
0b0110 | 0x6 | 6 |
0b0111 | 0x7 | 7 |
0b1000 | 0x8 | 8 |
0b1001 | 0x9 | 9 |
0b1010 | 0xa | 10 |
0b1011 | 0xb | 11 |
0b1100 | 0xc | 12 |
0b1101 | 0xd | 13 |
0b1110 | 0xe | 14 |
0b1111 | 0xf | 15 |
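As a rough sketch of the split described above, a byte can be separated into its two 4 bit halves with a shift and a mask. The names high_half and low_half are illustrative only, and the byte chosen is the ASCII value of H.

```python
byte = 0b01001000  # ASCII value of the letter H

# The upper 4 bits are obtained by shifting right by 4 places.
high_half = byte >> 4

# The lower 4 bits are obtained by masking with 0b1111.
low_half = byte & 0b1111

# Each half maps to exactly one hexadecimal character.
hex_digits = format(high_half, 'x') + format(low_half, 'x')

print(bin(high_half), bin(low_half), hex_digits)
```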
All the ASCII characters can be reviewed using the three numbering systems binary (base 2 denoted with the prefix 0b
), hexadecimal (base 16 denoted with the prefix 0x
) and decimal (base 10 standard representation, therefore no prefix). Select ASCII Encoding to view all the ASCII Characters:
ASCII Encoding
Binary | Hexadecimal | Decimal | Character Name |
---|---|---|---|
0b00000000 | 0x00 | 0 | NUL (null) |
0b00000001 | 0x01 | 1 | SOH (start of heading) |
0b00000010 | 0x02 | 2 | STX (start of text) |
0b00000011 | 0x03 | 3 | ETX (end of text) |
0b00000100 | 0x04 | 4 | EOT (end of transmission) |
0b00000101 | 0x05 | 5 | ENQ (enquiry) |
0b00000110 | 0x06 | 6 | ACK (acknowledge) |
0b00000111 | 0x07 | 7 | BEL (bell) |
0b00001000 | 0x08 | 8 | BS (backspace) |
0b00001001 | 0x09 | 9 | HT (horizontal tab) |
0b00001010 | 0x0a | 10 | LF (line feed) |
0b00001011 | 0x0b | 11 | VT (vertical tab) |
0b00001100 | 0x0c | 12 | FF (form feed) |
0b00001101 | 0x0d | 13 | CR (carriage return) |
0b00001110 | 0x0e | 14 | SO (shift out) |
0b00001111 | 0x0f | 15 | SI (shift in) |
0b00010000 | 0x10 | 16 | DLE (data link escape) |
0b00010001 | 0x11 | 17 | DC1 (device control 1) |
0b00010010 | 0x12 | 18 | DC2 (device control 2) |
0b00010011 | 0x13 | 19 | DC3 (device control 3) |
0b00010100 | 0x14 | 20 | DC4 (device control 4) |
0b00010101 | 0x15 | 21 | NAK (negative acknowledgment) |
0b00010110 | 0x16 | 22 | SYN (synchronous idle) |
0b00010111 | 0x17 | 23 | ETB (end of transmission block) |
0b00011000 | 0x18 | 24 | CAN (cancel) |
0b00011001 | 0x19 | 25 | EM (end of medium) |
0b00011010 | 0x1a | 26 | SUB (substitute) |
0b00011011 | 0x1b | 27 | ESC (escape) |
0b00011100 | 0x1c | 28 | FS (file separator) |
0b00011101 | 0x1d | 29 | GS (group separator) |
0b00011110 | 0x1e | 30 | RS (record separator) |
0b00011111 | 0x1f | 31 | US (unit separator) |
0b00100000 | 0x20 | 32 | SP (space) |
0b00100001 | 0x21 | 33 | ! (exclamation mark) |
0b00100010 | 0x22 | 34 | " (double quote) |
0b00100011 | 0x23 | 35 | # (number sign) |
0b00100100 | 0x24 | 36 | $ (dollar sign) |
0b00100101 | 0x25 | 37 | % (percent) |
0b00100110 | 0x26 | 38 | & (ampersand) |
0b00100111 | 0x27 | 39 | ' (apostrophe) |
0b00101000 | 0x28 | 40 | ( (left parenthesis) |
0b00101001 | 0x29 | 41 | ) (right parenthesis) |
0b00101010 | 0x2a | 42 | * (asterisk) |
0b00101011 | 0x2b | 43 | + (plus sign) |
0b00101100 | 0x2c | 44 | , (comma) |
0b00101101 | 0x2d | 45 | - (minus sign) |
0b00101110 | 0x2e | 46 | . (period) |
0b00101111 | 0x2f | 47 | / (slash) |
0b00110000 | 0x30 | 48 | 0 (zero) |
0b00110001 | 0x31 | 49 | 1 (one) |
0b00110010 | 0x32 | 50 | 2 (two) |
0b00110011 | 0x33 | 51 | 3 (three) |
0b00110100 | 0x34 | 52 | 4 (four) |
0b00110101 | 0x35 | 53 | 5 (five) |
0b00110110 | 0x36 | 54 | 6 (six) |
0b00110111 | 0x37 | 55 | 7 (seven) |
0b00111000 | 0x38 | 56 | 8 (eight) |
0b00111001 | 0x39 | 57 | 9 (nine) |
0b00111010 | 0x3a | 58 | : (colon) |
0b00111011 | 0x3b | 59 | ; (semicolon) |
0b00111100 | 0x3c | 60 | < (less than) |
0b00111101 | 0x3d | 61 | = (equal sign) |
0b00111110 | 0x3e | 62 | > (greater than) |
0b00111111 | 0x3f | 63 | ? (question mark) |
0b01000000 | 0x40 | 64 | @ (at sign) |
0b01000001 | 0x41 | 65 | A (capital A) |
0b01000010 | 0x42 | 66 | B (capital B) |
0b01000011 | 0x43 | 67 | C (capital C) |
0b01000100 | 0x44 | 68 | D (capital D) |
0b01000101 | 0x45 | 69 | E (capital E) |
0b01000110 | 0x46 | 70 | F (capital F) |
0b01000111 | 0x47 | 71 | G (capital G) |
0b01001000 | 0x48 | 72 | H (capital H) |
0b01001001 | 0x49 | 73 | I (capital I) |
0b01001010 | 0x4a | 74 | J (capital J) |
0b01001011 | 0x4b | 75 | K (capital K) |
0b01001100 | 0x4c | 76 | L (capital L) |
0b01001101 | 0x4d | 77 | M (capital M) |
0b01001110 | 0x4e | 78 | N (capital N) |
0b01001111 | 0x4f | 79 | O (capital O) |
0b01010000 | 0x50 | 80 | P (capital P) |
0b01010001 | 0x51 | 81 | Q (capital Q) |
0b01010010 | 0x52 | 82 | R (capital R) |
0b01010011 | 0x53 | 83 | S (capital S) |
0b01010100 | 0x54 | 84 | T (capital T) |
0b01010101 | 0x55 | 85 | U (capital U) |
0b01010110 | 0x56 | 86 | V (capital V) |
0b01010111 | 0x57 | 87 | W (capital W) |
0b01011000 | 0x58 | 88 | X (capital X) |
0b01011001 | 0x59 | 89 | Y (capital Y) |
0b01011010 | 0x5a | 90 | Z (capital Z) |
0b01011011 | 0x5b | 91 | [ (opening bracket) |
0b01011100 | 0x5c | 92 | \ (backslash) |
0b01011101 | 0x5d | 93 | ] (closing bracket) |
0b01011110 | 0x5e | 94 | ^ (caret) |
0b01011111 | 0x5f | 95 | _ (underscore) |
0b01100000 | 0x60 | 96 | ` (grave accent) |
0b01100001 | 0x61 | 97 | a (lowercase a) |
0b01100010 | 0x62 | 98 | b (lowercase b) |
0b01100011 | 0x63 | 99 | c (lowercase c) |
0b01100100 | 0x64 | 100 | d (lowercase d) |
0b01100101 | 0x65 | 101 | e (lowercase e) |
0b01100110 | 0x66 | 102 | f (lowercase f) |
0b01100111 | 0x67 | 103 | g (lowercase g) |
0b01101000 | 0x68 | 104 | h (lowercase h) |
0b01101001 | 0x69 | 105 | i (lowercase i) |
0b01101010 | 0x6a | 106 | j (lowercase j) |
0b01101011 | 0x6b | 107 | k (lowercase k) |
0b01101100 | 0x6c | 108 | l (lowercase l) |
0b01101101 | 0x6d | 109 | m (lowercase m) |
0b01101110 | 0x6e | 110 | n (lowercase n) |
0b01101111 | 0x6f | 111 | o (lowercase o) |
0b01110000 | 0x70 | 112 | p (lowercase p) |
0b01110001 | 0x71 | 113 | q (lowercase q) |
0b01110010 | 0x72 | 114 | r (lowercase r) |
0b01110011 | 0x73 | 115 | s (lowercase s) |
0b01110100 | 0x74 | 116 | t (lowercase t) |
0b01110101 | 0x75 | 117 | u (lowercase u) |
0b01110110 | 0x76 | 118 | v (lowercase v) |
0b01110111 | 0x77 | 119 | w (lowercase w) |
0b01111000 | 0x78 | 120 | x (lowercase x) |
0b01111001 | 0x79 | 121 | y (lowercase y) |
0b01111010 | 0x7a | 122 | z (lowercase z) |
0b01111011 | 0x7b | 123 | { (left brace) |
0b01111100 | 0x7c | 124 | | (vertical bar) |
0b01111101 | 0x7d | 125 | } (right brace) |
0b01111110 | 0x7e | 126 | ~ (tilde) |
0b01111111 | 0x7f | 127 | DEL (delete) |
A bytes instance can be represented as a hexadecimal string:
In [13]: b_ascii_text.hex()
Out[13]: '48656c6c6f09576f726c6421'
\x introduces a hexadecimal escape sequence and expects exactly 2 hexadecimal digits:
In [14]: b'\x48\x65\x6c\x6c\x6f\x09\x57\x6f\x72\x6c\x64\x21'
Out[14]: b'Hello\tWorld!'
Notice the formal representation prefers using the printable ASCII character where present over the hexadecimal escape character. If a character is included outwith the ASCII printable character range for example the NUL character at 0x00
:
In [15]: b'\x00\x48\x65\x6c\x6c\x6f\x09\x57\x6f\x72\x6c\x64\x21'
Out[15]: b'\x00Hello\tWorld!'
Then it has no printable alternative and this byte therefore remains represented as a hexadecimal escape character.
Notice that ASCII only occupies half of the possible values that a byte can take. The remaining values were used regionally in extended ASCII tables:
Extended ASCII Tables
binary | hexadecimal | decimal | latin1 | latin2 | latin3 | latin4 | cyrillic | arabic | greek | hebrew | turkish | nordic | thai |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0b10000000 | 0x80 | 128 | |||||||||||
0b10000001 | 0x81 | 129 | |||||||||||
0b10000010 | 0x82 | 130 | |||||||||||
0b10000011 | 0x83 | 131 | |||||||||||
0b10000100 | 0x84 | 132 | |||||||||||
0b10000101 | 0x85 | 133 | |||||||||||
0b10000110 | 0x86 | 134 | |||||||||||
0b10000111 | 0x87 | 135 | |||||||||||
0b10001000 | 0x88 | 136 | |||||||||||
0b10001001 | 0x89 | 137 | |||||||||||
0b10001010 | 0x8a | 138 | |||||||||||
0b10001011 | 0x8b | 139 | |||||||||||
0b10001100 | 0x8c | 140 | |||||||||||
0b10001101 | 0x8d | 141 | |||||||||||
0b10001110 | 0x8e | 142 | |||||||||||
0b10001111 | 0x8f | 143 | |||||||||||
0b10010000 | 0x90 | 144 | |||||||||||
0b10010001 | 0x91 | 145 | |||||||||||
0b10010010 | 0x92 | 146 | |||||||||||
0b10010011 | 0x93 | 147 | |||||||||||
0b10010100 | 0x94 | 148 | |||||||||||
0b10010101 | 0x95 | 149 | |||||||||||
0b10010110 | 0x96 | 150 | |||||||||||
0b10010111 | 0x97 | 151 | |||||||||||
0b10011000 | 0x98 | 152 | |||||||||||
0b10011001 | 0x99 | 153 | |||||||||||
0b10011010 | 0x9a | 154 | |||||||||||
0b10011011 | 0x9b | 155 | |||||||||||
0b10011100 | 0x9c | 156 | |||||||||||
0b10011101 | 0x9d | 157 | |||||||||||
0b10011110 | 0x9e | 158 | |||||||||||
0b10011111 | 0x9f | 159 | |||||||||||
0b10100000 | 0xa0 | 160 | NBSP | NBSP | NBSP | NBSP | NBSP | NBSP | NBSP | NBSP | NBSP | NBSP | NBSP |
0b10100001 | 0xa1 | 161 | ¡ | Ą | Ħ | Ą | Ё | ‘ | ¡ | Ą | ก | ||
0b10100010 | 0xa2 | 162 | ¢ | ˘ | ˘ | ĸ | Ђ | ’ | ¢ | ¢ | Ē | ข | |
0b10100011 | 0xa3 | 163 | £ | Ł | £ | Ŗ | Ѓ | £ | £ | £ | Ģ | ฃ | |
0b10100100 | 0xa4 | 164 | ¤ | ¤ | ¤ | ¤ | Є | ¤ | € | ¤ | ¤ | Ī | ค |
0b10100101 | 0xa5 | 165 | ¥ | Ľ | Ĩ | Ѕ | ₯ | ¥ | ¥ | Ĩ | ฅ | ||
0b10100110 | 0xa6 | 166 | ¦ | Ś | Ĥ | Ļ | І | ¦ | ¦ | ¦ | Ķ | ฆ | |
0b10100111 | 0xa7 | 167 | § | § | § | § | Ї | § | § | § | § | ง | |
0b10101000 | 0xa8 | 168 | ¨ | ¨ | ¨ | ¨ | Ј | ¨ | ¨ | ¨ | Ļ | จ | |
0b10101001 | 0xa9 | 169 | © | Š | İ | Š | Љ | © | © | © | Đ | ฉ | |
0b10101010 | 0xaa | 170 | ª | Ş | Ş | Ē | Њ | ͺ | × | ª | Š | ช | |
0b10101011 | 0xab | 171 | « | Ť | Ğ | Ģ | Ћ | « | « | « | Ŧ | ซ | |
0b10101100 | 0xac | 172 | ¬ | Ź | Ĵ | Ŧ | Ќ | ، | ¬ | ¬ | ¬ | Ž | ฌ |
0b10101101 | 0xad | 173 | SHY | SHY | SHY | SHY | SHY | SHY | SHY | SHY | SHY | SHY | ญ |
0b10101110 | 0xae | 174 | ® | Ž | Ž | Ў | ® | ® | Ū | ฎ | |||
0b10101111 | 0xaf | 175 | ¯ | Ż | Ż | ¯ | Џ | ― | ¯ | ¯ | Ŋ | ฏ | |
0b10110000 | 0xb0 | 176 | ° | ° | ° | ° | А | ° | ° | ° | ° | ฐ | |
0b10110001 | 0xb1 | 177 | ± | ą | ħ | ą | Б | ± | ± | ± | ą | ฑ | |
0b10110010 | 0xb2 | 178 | ² | ˛ | ² | ˛ | В | ² | ² | ² | ē | ฒ | |
0b10110011 | 0xb3 | 179 | ³ | ł | ³ | ŗ | Г | ³ | ³ | ³ | ģ | ณ | |
0b10110100 | 0xb4 | 180 | ´ | ´ | ´ | ´ | Д | ΄ | ´ | ´ | ī | ด | |
0b10110101 | 0xb5 | 181 | µ | ľ | µ | ĩ | Е | ΅ | µ | µ | ĩ | ต | |
0b10110110 | 0xb6 | 182 | ¶ | ś | ĥ | ļ | Ж | Ά | ¶ | ¶ | ķ | ถ | |
0b10110111 | 0xb7 | 183 | · | ˇ | · | ˇ | З | · | · | · | · | ท | |
0b10111000 | 0xb8 | 184 | ¸ | ¸ | ¸ | ¸ | И | Έ | ¸ | ¸ | ļ | ธ | |
0b10111001 | 0xb9 | 185 | ¹ | š | ı | š | Й | Ή | ¹ | ¹ | đ | น | |
0b10111010 | 0xba | 186 | º | ş | ş | ē | К | Ί | ÷ | º | š | บ | |
0b10111011 | 0xbb | 187 | » | ť | ğ | ģ | Л | ؛ | » | » | » | ŧ | ป |
0b10111100 | 0xbc | 188 | ¼ | ź | ĵ | ŧ | М | Ό | ¼ | ¼ | ž | ผ | |
0b10111101 | 0xbd | 189 | ½ | ˝ | ½ | Ŋ | Н | ½ | ½ | ½ | ― | ฝ | |
0b10111110 | 0xbe | 190 | ¾ | ž | ž | О | Ύ | ¾ | ¾ | ū | พ | ||
0b10111111 | 0xbf | 191 | ¿ | ż | ż | ŋ | П | ؟ | Ώ | ¿ | ŋ | ฟ | |
0b11000000 | 0xc0 | 192 | À | Ŕ | À | Ā | Р | ΐ | À | Ā | ภ | ||
0b11000001 | 0xc1 | 193 | Á | Á | Á | Á | С | ء | Α | Á | Á | ม | |
0b11000010 | 0xc2 | 194 | Â | Â | Â | Â | Т | آ | Β | Â | Â | ย | |
0b11000011 | 0xc3 | 195 | Ã | Ă | Ã | У | أ | Γ | Ã | Ã | ร | ||
0b11000100 | 0xc4 | 196 | Ä | Ä | Ä | Ä | Ф | ؤ | Δ | Ä | Ä | ฤ | |
0b11000101 | 0xc5 | 197 | Å | Ĺ | Ċ | Å | Х | إ | Ε | Å | Å | ล | |
0b11000110 | 0xc6 | 198 | Æ | Ć | Ĉ | Æ | Ц | ئ | Ζ | Æ | Æ | ฦ | |
0b11000111 | 0xc7 | 199 | Ç | Ç | Ç | Į | Ч | ا | Η | Ç | Į | ว | |
0b11001000 | 0xc8 | 200 | È | Č | È | Č | Ш | ب | Θ | È | Č | ศ | |
0b11001001 | 0xc9 | 201 | É | É | É | É | Щ | ة | Ι | É | É | ษ | |
0b11001010 | 0xca | 202 | Ê | Ę | Ê | Ę | Ъ | ت | Κ | Ê | Ę | ส | |
0b11001011 | 0xcb | 203 | Ë | Ë | Ë | Ë | Ы | ث | Λ | Ë | Ë | ห | |
0b11001100 | 0xcc | 204 | Ì | Ě | Ì | Ė | Ь | ج | Μ | Ì | Ė | ฬ | |
0b11001101 | 0xcd | 205 | Í | Í | Í | Í | Э | ح | Ν | Í | Í | อ | |
0b11001110 | 0xce | 206 | Î | Î | Î | Î | Ю | خ | Ξ | Î | Î | ฮ | |
0b11001111 | 0xcf | 207 | Ï | Ď | Ï | Ī | Я | د | Ο | Ï | Ï | ฯ | |
0b11010000 | 0xd0 | 208 | Ð | Đ | Đ | а | ذ | Π | Ğ | Ð | ะ | ||
0b11010001 | 0xd1 | 209 | Ñ | Ń | Ñ | Ņ | б | ر | Ρ | Ñ | Ņ | ั | |
0b11010010 | 0xd2 | 210 | Ò | Ň | Ò | Ō | в | ز | Ò | Ō | า | ||
0b11010011 | 0xd3 | 211 | Ó | Ó | Ó | Ķ | г | س | Σ | Ó | Ó | ำ | |
0b11010100 | 0xd4 | 212 | Ô | Ô | Ô | Ô | д | ش | Τ | Ô | Ô | ิ | |
0b11010101 | 0xd5 | 213 | Õ | Ő | Ġ | Õ | е | ص | Υ | Õ | Õ | ี | |
0b11010110 | 0xd6 | 214 | Ö | Ö | Ö | Ö | ж | ض | Φ | Ö | Ö | ึ | |
0b11010111 | 0xd7 | 215 | × | × | × | × | з | ط | Χ | × | Ũ | ื | |
0b11011000 | 0xd8 | 216 | Ø | Ř | Ĝ | Ø | и | ظ | Ψ | Ø | Ø | ุ | |
0b11011001 | 0xd9 | 217 | Ù | Ů | Ù | Ų | й | ع | Ω | Ù | Ų | ู | |
0b11011010 | 0xda | 218 | Ú | Ú | Ú | Ú | к | غ | Ϊ | Ú | Ú | ฺ | |
0b11011011 | 0xdb | 219 | Û | Ű | Û | Û | л | Ϋ | Û | Û | |||
0b11011100 | 0xdc | 220 | Ü | Ü | Ü | Ü | м | ά | Ü | Ü | |||
0b11011101 | 0xdd | 221 | Ý | Ý | Ŭ | Ũ | н | έ | İ | Ý | |||
0b11011110 | 0xde | 222 | Þ | Ţ | Ŝ | Ū | о | ή | Ş | Þ | |||
0b11011111 | 0xdf | 223 | ß | ß | ß | ß | п | ί | ‗ | ß | ß | ฿ | |
0b11100000 | 0xe0 | 224 | à | ŕ | à | ā | р | ـ | ΰ | א | à | ā | เ |
0b11100001 | 0xe1 | 225 | á | á | á | á | с | ف | α | ב | á | á | แ |
0b11100010 | 0xe2 | 226 | â | â | â | â | т | ق | β | ג | â | â | โ |
0b11100011 | 0xe3 | 227 | ã | ă | ã | у | ك | γ | ד | ã | ã | ใ | |
0b11100100 | 0xe4 | 228 | ä | ä | ä | ä | ф | ل | δ | ה | ä | ä | ไ |
0b11100101 | 0xe5 | 229 | å | ĺ | ċ | å | х | م | ε | ו | å | å | ๅ |
0b11100110 | 0xe6 | 230 | æ | ć | ĉ | æ | ц | ن | ζ | ז | æ | æ | ๆ |
0b11100111 | 0xe7 | 231 | ç | ç | ç | į | ч | ه | η | ח | ç | į | ็ |
0b11101000 | 0xe8 | 232 | è | č | è | č | ш | و | θ | ט | è | č | ่ |
0b11101001 | 0xe9 | 233 | é | é | é | é | щ | ى | ι | י | é | é | ้ |
0b11101010 | 0xea | 234 | ê | ę | ê | ę | ъ | ي | κ | ך | ê | ę | ๊ |
0b11101011 | 0xeb | 235 | ë | ë | ë | ë | ы | ً | λ | כ | ë | ë | ๋ |
0b11101100 | 0xec | 236 | ì | ě | ì | ė | ь | ٌ | μ | ל | ì | ė | ์ |
0b11101101 | 0xed | 237 | í | í | í | í | э | ٍ | ν | ם | í | í | ํ |
0b11101110 | 0xee | 238 | î | î | î | î | ю | َ | ξ | מ | î | î | ๎ |
0b11101111 | 0xef | 239 | ï | ď | ï | ī | я | ُ | ο | ן | ï | ï | ๏ |
0b11110000 | 0xf0 | 240 | ð | đ | đ | № | ِ | π | נ | ğ | ð | 0 | |
0b11110001 | 0xf1 | 241 | ñ | ń | ñ | ņ | ё | ّ | ρ | ס | ñ | ņ | 1 |
0b11110010 | 0xf2 | 242 | ò | ň | ò | ō | ђ | ْ | ς | ע | ò | ō | 2 |
0b11110011 | 0xf3 | 243 | ó | ó | ó | ķ | ѓ | σ | ף | ó | ó | 3 | |
0b11110100 | 0xf4 | 244 | ô | ô | ô | ô | є | τ | פ | ô | ô | 4 | |
0b11110101 | 0xf5 | 245 | õ | ő | ġ | õ | ѕ | υ | ץ | õ | õ | 5 | |
0b11110110 | 0xf6 | 246 | ö | ö | ö | ö | і | φ | צ | ö | ö | 6 | |
0b11110111 | 0xf7 | 247 | ÷ | ÷ | ÷ | ÷ | ї | χ | ק | ÷ | ũ | 7 | |
0b11111000 | 0xf8 | 248 | ø | ř | ĝ | ø | ј | ψ | ר | ø | ø | 8 | |
0b11111001 | 0xf9 | 249 | ù | ů | ù | ų | љ | ω | ש | ù | ų | 9 | |
0b11111010 | 0xfa | 250 | ú | ú | ú | ú | њ | ϊ | ת | ú | ú | ๚ | |
0b11111011 | 0xfb | 251 | û | ű | û | û | ћ | ϋ | û | û | ๛ | ||
0b11111100 | 0xfc | 252 | ü | ü | ü | ü | ќ | ό | ü | ü | |||
0b11111101 | 0xfd | 253 | ý | ý | ŭ | ũ | § | ύ | LRM | ı | ý | ||
0b11111110 | 0xfe | 254 | þ | ţ | ŝ | ū | ў | ώ | RLM | ş | þ | ||
0b11111111 | 0xff | 255 | ÿ | ˙ | ˙ | ˙ | џ | ÿ | ĸ |
If 0xe5 is examined for example, then notice that it maps to a different character in most of the extended ASCII tables:
binary | 0b11100101 |
---|---|
hexadecimal | 0xe5 |
decimal | 229 |
latin1 | å |
latin2 | ĺ |
latin3 | ċ |
latin4 | å |
cyrillic | х |
arabic | م |
greek | ε |
hebrew | ו |
turkish | å |
nordic | å |
thai | ๅ |
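If inserted as a hexadecimal escape character in a string, notice that it is displayed as the latin1 character. This is because \x in a str specifies a Unicode code point, and the first 256 Unicode code points match latin1, the most common extended ASCII table: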
In [16]: '\xe5'
Out[16]: 'å'
If it is inserted as a hexadecimal character into a bytes object, notice that it remains a hexadecimal escape sequence:
In [17]: b'\xe5'
Out[17]: b'\xe5'
The bytes
object can be decoded to a string when the correct encoding is applied:
In [18]: b'\xe5'.decode('latin1')
Out[18]: 'å'
In [19]: b'\xe5'.decode('latin2')
Out[19]: 'ĺ'
In [20]: b'\xe5'.decode('latin3')
Out[20]: 'ċ'
In [21]: b'\xe5'.decode('latin4')
Out[21]: 'å'
In [22]: b'\xe5'.decode('cyrillic')
Out[22]: 'х'
In [23]: b'\xe5'.decode('arabic')
Out[23]: 'م'
In [24]: b'\xe5'.decode('greek')
Out[24]: 'ε'
In [25]: b'\xe5'.decode('iso8859-9') # turkish
Out[25]: 'å'
In [26]: b'\xe5'.decode('hebrew')
Out[26]: 'ו'
In [27]: b'\xe5'.decode('iso8859-10') # nordic
Out[27]: 'å'
In [28]: b'\xe5'.decode('thai')
Out[28]: 'ๅ'
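The individual calls above can also be gathered into a single loop. This is a small sketch that uses a subset of the codec names already shown; the dictionary name decoded is illustrative.

```python
raw = b'\xe5'

# A subset of the regional encoding tables used above.
codecs_to_try = ('latin1', 'latin2', 'greek', 'hebrew', 'thai')

# The same byte decodes to a different character under each table.
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, character in decoded.items():
    print(name, character)
```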
Returning to the str
instance text
, it can be encoded into a bytes
object
using the greek
encoding table:
In [29]: text
Out[29]: 'Γεια σου Κοσμο!'
In [31]: text.encode(encoding='greek')
Out[31]: b'\xc3\xe5\xe9\xe1 \xf3\xef\xf5 \xca\xef\xf3\xec\xef!'
Notice in the bytes object that each of the printable ASCII characters is displayed using its ASCII character and the non-ASCII characters are displayed using a hexadecimal escape character.
1 byte character encoding was suitable for offline regional computing. However, the advent of the internet exposed a number of issues. For example, content produced on a computer in Greece using the greek encoding table could be read on a computer in the UK using the latin1 encoding table, and the following character substitution would take place:
In [32]: text.encode(encoding='greek').decode(encoding='latin1')
Out[32]: 'Ãåéá óïõ Êïóìï!'
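A short sketch of the substitution, and of how it can be undone when the original encoding table is known. The names garbled and recovered are illustrative. The recovery works here because latin1 assigns a character to every byte value, so no information is lost in the wrong decode.

```python
text = 'Γεια σου Κοσμο!'

# Content written with the greek table...
greek_bytes = text.encode(encoding='greek')

# ...and read back with the latin1 table gives substituted characters.
garbled = greek_bytes.decode(encoding='latin1')

# Re-encoding with latin1 restores the original bytes,
# which can then be decoded with the correct table.
recovered = garbled.encode(encoding='latin1').decode(encoding='greek')

print(garbled)
print(recovered)
```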
1 byte (8 bit) encoding allows:
In [33]: 2 ** 8
Out[33]: 256
commands. 2 bytes (16 bits) encoding allows:
In [34]: 2 ** 16
Out[34]: 65536
commands.
The utf-16
standard was produced which includes all the characters seen in the extended ASCII tables:
In [35]: text.encode(encoding='utf-16-be')
Out[35]: b'\x03\x93\x03\xb5\x03\xb9\x03\xb1\x00 \x03\xc3\x03\xbf\x03\xc5\x00 \x03\x9a\x03\xbf\x03\xc3\x03\xbc\x03\xbf\x00!'
The bytes
instance can be displayed as a hexadecimal string:
In [36]: text.encode(encoding='utf-16-be').hex()
Out[36]: '039303b503b903b1002003c303bf03c50020039a03bf03c303bc03bf0021'
Let's examine an ASCII character. In utf-16
encoding, the byte corresponding to the ASCII character when ascii
encoding is used, is paired with the NULL byte 00
and the 2 bytes are used to encode the character.
utf-16-be is a variant of utf-16 that is Big Endian. Big Endian matches the way humans usually write numbers, where the Big (most significant byte 00) is placed before the Little (least significant byte 61):
In [37]: 'a'.encode(encoding='utf-16-be').hex()
Out[37]: '0061'
Intel processors typically use Little Endian where the Little (least significant byte 61
) is placed before the Big (most significant byte 00
):
In [38]: 'a'.encode(encoding='utf-16-le').hex()
Out[38]: '6100'
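There was initially some confusion because of this, and therefore a Byte Order Marker (BOM) can be included as a prefix to state the byte order. The utf-16 encoding in Python writes the BOM followed by the text in the native byte order of the machine, which is Little Endian here: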
In [39]: 'a'.encode(encoding='utf-16').hex()
Out[39]: 'fffe6100'
The BOM can be seen by examination of an empty str:
In [40]: ''.encode(encoding='utf-16').hex()
Out[40]: 'fffe'
A Greek character can also be examined using the utf-16
encoding variants:
In [39]: 'α'.encode(encoding='utf-16-be').hex()
Out[39]: '03b1'
In [40]: 'α'.encode(encoding='utf-16-le').hex()
Out[40]: 'b103'
In [41]: 'α'.encode(encoding='utf-16').hex()
Out[41]: 'fffeb103'
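The BOM can be compared against the constants in the standard library codecs module. This is a minimal sketch that assumes nothing beyond the standard library; it checks sys.byteorder because the utf-16 codec writes in the native byte order of the machine running the code.

```python
import codecs
import sys

encoded = 'a'.encode(encoding='utf-16')

# The first 2 bytes are the BOM, the remaining 2 bytes are the character.
bom = encoded[:2]

# The expected BOM depends on the native byte order of the machine.
if sys.byteorder == 'little':
    expected_bom = codecs.BOM_UTF16_LE
else:
    expected_bom = codecs.BOM_UTF16_BE

# Decoding with utf-16 reads the BOM and removes it from the result.
decoded = encoded.decode(encoding='utf-16')

print(bom.hex(), expected_bom.hex(), decoded)
```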
Some languages such as Chinese use more than 50000 characters and therefore 65536 commands are insufficient to incorporate all Latin and Asian characters. A fixed 2 byte encoding could therefore not cover every character (utf-16 itself represents the additional characters using a pair of 2 byte units) and utf-32 was introduced. utf-32 uses 4 bytes (32 bits) encoding which allows:
In [42]: 2 ** 32
Out[42]: 4294967296
commands which is sufficient to cover all characters used in all languages. utf-32
has byte ordering variants:
In [43]: 'a'.encode(encoding='utf-32-be').hex()
Out[43]: '00000061'
In [44]: 'a'.encode(encoding='utf-32-le').hex()
Out[44]: '61000000'
In [45]: 'a'.encode(encoding='utf-32').hex()
Out[45]: 'fffe000061000000'
In [46]: 'α'.encode(encoding='utf-32-be').hex()
Out[46]: '000003b1'
In [47]: 'α'.encode(encoding='utf-32-le').hex()
Out[47]: 'b1030000'
In [48]: 'α'.encode(encoding='utf-32').hex()
Out[48]: 'fffe0000b1030000'
In [49]: '我'.encode(encoding='utf-32-be').hex()
Out[49]: '00006211'
In [50]: '我'.encode(encoding='utf-32-le').hex()
Out[50]: '11620000'
In [51]: '我'.encode(encoding='utf-32').hex()
Out[51]: 'fffe000011620000'
A Unicode character can be inserted into a string using the hexadecimal escape character \U. This expects 8 hexadecimal digits in the format shown by utf-32-be:
In [52]: '\U00000061'
Out[52]: 'a'
In [53]: '\U000003b1'
Out[53]: 'α'
In [54]: '\U00006211'
Out[54]: '我'
In a Python string the \ is an instruction to insert an escape character and \U expects 8 hexadecimal characters. On Windows \ is used as the default directory separator. Therefore \\ has to be used within a file path, where the first \ is the instruction to insert an escape character and the second \ is the escape character to be inserted. The prefix R can be used for a raw string, in which \ is not treated as the start of an escape sequence. Python treats upper case R and lower case r identically. By convention in some editors, upper case R is preferred for a raw string, so that no regular expression syntax highlighting is applied, and lower case r is preferred for a regular expression, where regular expression syntax highlighting may be applied:
In [55]: 'C:\\Users\\Philip'
Out[55]: 'C:\\Users\\Philip'
In [56]: R'C:\Users\Philip'
Out[56]: 'C:\\Users\\Philip'
In [57]: r'C:\Users\Philip'
Out[57]: 'C:\\Users\\Philip'
On Windows if a string is used instead of a raw string, the following error message is common:
In [58]: 'C:\Users\Philip'
Cell In[58], line 1
'C:\Users\Philip'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
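The two working spellings of the path can be compared directly. This is a small sketch using the same example path as above; the names escaped and raw are illustrative.

```python
# Each \\ in a normal string inserts a single backslash.
escaped = 'C:\\Users\\Philip'

# In a raw string the backslash is taken literally.
raw = R'C:\Users\Philip'

# Both spellings produce the same str instance contents.
print(escaped == raw)
print(len(raw))
print(raw)
```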
The number of padding zero bytes required for an ASCII character and the confusion due to the byte order marker resulted in a new standard with a variable byte length per character:
In [59]: 'a'.encode(encoding='utf-8').hex()
Out[59]: '61'
In [60]: 'α'.encode(encoding='utf-8').hex()
Out[60]: 'ceb1'
In [61]: '我'.encode(encoding='utf-8').hex()
Out[61]: 'e68891'
In [62]: '🐱'.encode(encoding='utf-8').hex()
Out[62]: 'f09f90b1'
It is called utf-8
because the ASCII characters only occupy 1 byte (8 bits). Greek characters occupy 2 bytes (16 bits), Asian characters occupy 3 bytes (24 bits) and emojis cover 4 bytes (32 bits).
There is no byte order marker. Instead, the leading bits of the first byte state the expected number of bytes in the character, and each continuation byte begins with the bits 10:
number of bytes | binary sequence |
---|---|
1 | 0b 0aaaaaaa |
2 | 0b 110aaaaa 10aaaaaa |
3 | 0b 1110aaaa 10aaaaaa 10aaaaaa |
4 | 0b 11110aaa 10aaaaaa 10aaaaaa 10aaaaaa |
These underlying patterns can be seen when the binary sequence for each of the characters above is examined:
In [63]: '0b'+bin(int('a'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(8)
Out[63]: '0b01100001'
In [64]: '0b'+bin(int('α'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(16)
Out[64]: '0b1100111010110001'
In [65]: '0b'+bin(int('我'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(24)
Out[65]: '0b111001101000100010010001'
In [65]: '0b'+bin(int('🐱'.encode(encoding='utf-8').hex(), base=16)).removeprefix('0b').zfill(32)
Out[65]: '0b11110000100111111001000010110001'
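The patterns in the table above can be turned into a check on the first byte of an encoded character. This is only a sketch: the function name utf8_length_from_lead_byte is hypothetical and is not part of Python or of the tutorial.

```python
def utf8_length_from_lead_byte(lead_byte):
    """Return the number of bytes announced by the first byte of a character."""
    if lead_byte >> 7 == 0b0:        # 0aaaaaaa
        return 1
    if lead_byte >> 5 == 0b110:      # 110aaaaa
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110aaaa
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110aaa
        return 4
    raise ValueError('not a valid utf-8 leading byte')


lengths = {}
for character in ('a', 'α', '我', '🐱'):
    encoded = character.encode(encoding='utf-8')
    # Indexing a bytes instance returns the int value of the byte.
    lengths[character] = utf8_length_from_lead_byte(encoded[0])

print(lengths)
```

A continuation byte such as 0b10010001 matches none of the four patterns, so the function raises a ValueError for it.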
Although utf-8 was designed to not require a BOM, Microsoft produced a version utf-8-sig which has the BOM:
In [66]: ''.encode(encoding='utf-8').hex()
Out[66]: ''
In [67]: ''.encode(encoding='utf-8-sig').hex()
Out[67]: 'efbbbf'
In [68]: 'a'.encode(encoding='utf-8-sig').hex()
Out[68]: 'efbbbf61'
In [69]: 'α'.encode(encoding='utf-8-sig').hex()
Out[69]: 'efbbbfceb1'
In [70]: '我'.encode(encoding='utf-8-sig').hex()
Out[70]: 'efbbbfe68891'
In [71]: '🐱'.encode(encoding='utf-8-sig').hex()
Out[71]: 'efbbbff09f90b1'
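The choice of codec also matters when decoding. As a brief sketch (the names kept and stripped are illustrative), decoding with utf-8 keeps the BOM as a character in the string, whereas decoding with utf-8-sig removes it.

```python
with_bom = 'α'.encode(encoding='utf-8-sig')

# Decoding with utf-8 keeps the BOM as the character U+FEFF.
kept = with_bom.decode(encoding='utf-8')

# Decoding with utf-8-sig removes the BOM.
stripped = with_bom.decode(encoding='utf-8-sig')

# utf-8-sig also accepts bytes that have no BOM.
no_bom = 'α'.encode(encoding='utf-8').decode(encoding='utf-8-sig')

print(len(kept), len(stripped), no_bom)
```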
The bytes
class has the alternative constructor fromhex
which can be used to construct a bytes
instance from a hexadecimal string:
In [72]: bytes.fromhex('61')
Out[72]: b'a'
In [73]: bytes.fromhex('ceb1')
Out[73]: b'\xce\xb1'
In [74]: bytes.fromhex('e68891')
Out[74]: b'\xe6\x88\x91'
In [75]: bytes.fromhex('f09f90b1')
Out[75]: b'\xf0\x9f\x90\xb1'
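Each bytes instance built with fromhex can be decoded back to its character. This is a small sketch using the same four hexadecimal strings as above; the names hex_strings and characters are illustrative.

```python
hex_strings = ('61', 'ceb1', 'e68891', 'f09f90b1')

# Construct the bytes instance from the hexadecimal string, then decode it.
characters = [bytes.fromhex(hex_string).decode(encoding='utf-8')
              for hex_string in hex_strings]

print(characters)
```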
In [76]: exit
From now on utf-8 will be used as the default encoding table. The following str instances can be instantiated and encoded to bytes instances:
In [1]: ascii_text = 'Hello World!'
In [2]: text = 'Γεια σου Κοσμο!'
In [3]: ascii_text.encode(encoding='utf-8')
Out[3]: b'Hello World!'
In [4]: text.encode(encoding='utf-8')
Out[4]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
As ascii_text consists only of printable ASCII characters, the bytes instance returned displays each byte as its printable ASCII character, which is the preferred formal representation.
As text contains a mixture of printable ASCII characters and non-ASCII characters, the formal representation displays each byte as its printable ASCII character where applicable and as a hexadecimal escape sequence otherwise. These can be assigned to variables:
In [5]: ascii_text_b = b'Hello World!'
In [6]: text_b = b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
If these are shown in the Variable Explorer:
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
ascii_text_b | bytes | 12 | Hello World! |
text | str | 15 | Γεια σου Κοσμο! |
text_b | bytes | 27 | Γεια σου Κοσμο! |
The Variable Explorer in Spyder assumes 'utf-8'
encoding for a bytes
instance and attempts to display any printable character.
Notice the length of text
and text_b
are different because the element in each class is different. In text_b
some of the characters are encoded to multiple bytes
:
This can be seen by casting each Collection explicitly to a tuple:
In [7]: text_as_tuple = tuple(text)
In [8]: text_b_as_tuple = tuple(text_b)
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
ascii_text_b | bytes | 12 | Hello World! |
text | str | 15 | Γεια σου Κοσμο! |
text_as_tuple | tuple | 15 | ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …) |
text_b | bytes | 27 | Γεια σου Κοσμο! |
text_b_as_tuple | tuple | 27 | (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …) |
If text_as_tuple
is expanded, the value at each index can be seen to be a Unicode character because a Unicode character is an element of a str
:
text_as_tuple - tuple (15 elements) | |
---|---|
Index | Value |
0 | 'Γ' |
1 | 'ε' |
2 | 'ι' |
3 | 'α' |
4 | ' ' |
5 | 'σ' |
6 | 'ο' |
7 | 'υ' |
8 | ' ' |
9 | 'Κ' |
10 | 'ο' |
11 | 'σ' |
12 | 'μ' |
13 | 'ο' |
14 | '!' |
If text_b_as_tuple
is expanded, the value at each index can be seen to be an int
between 0:256
:
text_b_as_tuple - tuple (27 elements) | |
---|---|
Index | Value |
0 | 206 |
1 | 147 |
2 | 206 |
3 | 181 |
4 | 206 |
5 | 185 |
6 | 206 |
7 | 177 |
8 | 32 |
9 | 207 |
10 | 131 |
11 | 206 |
12 | 191 |
13 | 207 |
14 | 133 |
15 | 32 |
16 | 206 |
17 | 154 |
18 | 206 |
19 | 191 |
20 | 207 |
21 | 131 |
22 | 206 |
23 | 188 |
24 | 206 |
25 | 191 |
26 | 33 |
Recall a byte
is a numeric value between 0:256
:
The binary bin
and hexadecimal hex
functions can be used to display this int
as a binary string or hexadecimal string:
In [9]: text_b_as_tuple_bin = tuple([bin(byte) for byte in text_b_as_tuple])
In [10]: text_b_as_tuple_hex = tuple([hex(byte) for byte in text_b_as_tuple])
text_b_as_tuple_bin
and text_b_as_tuple_hex
display in the Variable Explorer:
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
ascii_text_b | bytes | 12 | Hello World! |
text | str | 15 | Γεια σου Κοσμο! |
text_as_tuple | tuple | 15 | ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …) |
text_b | bytes | 27 | Γεια σου Κοσμο! |
text_b_as_tuple | tuple | 27 | (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …) |
text_b_as_tuple_bin | tuple | 27 | ('0b11001110', '0b10010011', '0b11001110', '0b10110101', '0b11001110', …) |
text_b_as_tuple_hex | tuple | 27 | ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …) |
text_b_as_tuple_bin
can be expanded to view each byte in binary:
text_b_as_tuple_bin - tuple (27 elements) | |
---|---|
Index | Value |
0 | 0b11001110 |
1 | 0b10010011 |
2 | 0b11001110 |
3 | 0b10110101 |
4 | 0b11001110 |
5 | 0b10111001 |
6 | 0b11001110 |
7 | 0b10110001 |
8 | 0b100000 |
9 | 0b11001111 |
10 | 0b10000011 |
11 | 0b11001110 |
12 | 0b10111111 |
13 | 0b11001111 |
14 | 0b10000101 |
15 | 0b100000 |
16 | 0b11001110 |
17 | 0b10011010 |
18 | 0b11001110 |
19 | 0b10111111 |
20 | 0b11001111 |
21 | 0b10000011 |
22 | 0b11001110 |
23 | 0b10111100 |
24 | 0b11001110 |
25 | 0b10111111 |
26 | 0b100001 |
text_b_as_tuple_hex
can be expanded to view each byte in hexadecimal:
text_b_as_tuple_hex - tuple (27 elements) | |
---|---|
Index | Value |
0 | 0xce |
1 | 0x93 |
2 | 0xce |
3 | 0xb5 |
4 | 0xce |
5 | 0xb9 |
6 | 0xce |
7 | 0xb1 |
8 | 0x20 |
9 | 0xcf |
10 | 0x83 |
11 | 0xce |
12 | 0xbf |
13 | 0xcf |
14 | 0x85 |
15 | 0x20 |
16 | 0xce |
17 | 0x9a |
18 | 0xce |
19 | 0xbf |
20 | 0xcf |
21 | 0x83 |
22 | 0xce |
23 | 0xbc |
24 | 0xce |
25 | 0xbf |
26 | 0x21 |
The bytes
class can be used to cast a tuple
of int
values between 0:256
to a bytes
instance:
In [10]: bytes((206, 147, 206, 181, 206, 185, 206, 177, 32, 207,
131, 206, 191, 207, 133, 32, 206, 154, 206, 191,
207, 131, 206, 188, 206, 191, 33))
Out[10]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
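The relationship between the str, the bytes and the tuple of int values can be checked with a round trip. This is a minimal sketch; the names ints, rebuilt, first_byte and first_slice are illustrative.

```python
text = 'Γεια σου Κοσμο!'
text_b = text.encode(encoding='utf-8')

# Casting a bytes instance to a tuple gives one int per byte.
ints = tuple(text_b)

# Casting the tuple of int values back gives an equal bytes instance.
rebuilt = bytes(ints)

# Indexing a bytes instance returns an int, slicing returns a bytes instance.
first_byte = text_b[0]
first_slice = text_b[0:1]

print(len(text), len(text_b))
print(first_byte, first_slice)
print(rebuilt.decode(encoding='utf-8'))
```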
Now that the element in each Collection
is understood, the following Collection
based identifiers can be used:
# 🔧 Collection-Based Methods (from `str` and the Collection ABC):
# - __contains__(self, key, /) : Checks if a substring is in the string (`in`).
# - __iter__(self, /) : Returns an iterator over the string.
# - __len__(self, /) : Returns the length of the string.
# - __getitem__(self, key, /) : Retrieves a character by index (`[]`).
# - count(self, sub, start=0, : Counts the occurrences of a substring.
# end=9223372036854775807, /)
# - index(self, sub, start=0, : Returns the index of the first occurrence of a substring.
# end=9223372036854775807, /)
The data model method __len__
defines the behaviour of the builtins
function len
and essentially retrieves the Size shown on the Variable Explorer:
In [11]: len(text) # text.__len__()
Out[11]: 15
In [12]: len(text_b) # text_b.__len__()
Out[12]: 27
The data model method __contains__
defines the behaviour of the in
keyword:
In [13]: 'ει' in text # text.__contains__('ει')
Out[13]: True
In [14]: 'ε' in text # text.__contains__('ε')
Out[14]: True
In [15]: bytes((147, 206)) in text_b # text_b.__contains__(bytes((147, 206)))
Out[15]: True
In [16]: 147 in text_b # text_b.__contains__(147)
Out[16]: True
The data model method __getitem__
will retrieve a value at an integer index:
In [17]: text[1] # text.__getitem__(1)
Out[17]: 'ε'
Notice that Python uses zero-based indexing. This means the first element is at index 0
and the last element is at the length of the Collection
minus 1
:
In [17]: text[0]
Out[17]: 'Γ'
In [18]: text[len(text)]
Traceback (most recent call last):
Cell In[18], line 1
text[len(text)]
IndexError: string index out of range
In [19]: text[len(text)-1]
Out[19]: '!'
text Variable Explorer
text_as_tuple - tuple (15 elements) | |
---|---|
Index | Value |
0 | 'Γ' |
1 | 'ε' |
2 | 'ι' |
3 | 'α' |
4 | ' ' |
5 | 'σ' |
6 | 'ο' |
7 | 'υ' |
8 | ' ' |
9 | 'Κ' |
10 | 'ο' |
11 | 'σ' |
12 | 'μ' |
13 | 'ο' |
14 | '!' |
The builtins
class slice
has consistent input arguments start, stop[, step]
to the builtins
class range
:
In [20]: slice()
# Docstring popup
"""
Init signature: slice(self, /, *args, **kwargs)
Docstring:
slice(stop)
slice(start, stop[, step])
Create a slice object. This is used for extended slicing (e.g. a[0:10:2]).
Type: type
Subclasses:
"""
range
uses zero-based indexing, so it is inclusive of the start bound and exclusive of the stop bound:
In [20]: tuple(range(0, 5, 1))
Out[20]: (0, 1, 2, 3, 4)
In [21]: tuple(range(0, 5)) # default step=1
Out[21]: (0, 1, 2, 3, 4)
In [22]: tuple(range(5)) # default start=0
Out[22]: (0, 1, 2, 3, 4)
slice
behaves consistently:
In [23]: text[slice(0, 5, 1)]
Out[23]: 'Γεια '
In [24]: text[slice(0, 5)] # default step=1
Out[24]: 'Γεια '
In [25]: text[slice(5)] # default start=0
Out[25]: 'Γεια '
Essentially the selection starts at, and includes, index 0
and stops at, but excludes, index 5
:
text Variable Explorer Annotated
text_as_tuple - tuple (15 elements) | |
---|---|
Index | Value |
0 | 'Γ' |
1 | 'ε' |
2 | 'ι' |
3 | 'α' |
4 | ' ' |
5 | 'σ' |
6 | 'ο' |
7 | 'υ' |
8 | ' ' |
9 | 'Κ' |
10 | 'ο' |
11 | 'σ' |
12 | 'μ' |
13 | 'ο' |
14 | '!' |
The slice
instance can be assigned to an object
name for the sake of readability:
In [26]: selection = slice(0, 5, 1)
In [27]: text[selection]
Out[27]: 'Γεια '
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
ascii_text_b | bytes | 12 | Hello World! |
selection | slice | 1 | slice(0, 5, 1) |
text | str | 15 | Γεια σου Κοσμο! |
text_as_tuple | tuple | 15 | ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …) |
text_b | bytes | 27 | Γεια σου Κοσμο! |
text_b_as_tuple | tuple | 27 | (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …) |
text_b_as_tuple_bin | tuple | 27 | ('0b11001110', '0b10010011', '0b11001110', '0b10110101','0b11001110', …) |
text_b_as_tuple_hex | tuple | 27 | ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …) |
However normally slicing is done using a colon :
instead:
In [28]: text[0:5:1] # text[slice(0, 5, 1)]
Out[28]: 'Γεια '
In [29]: text[0:5] # default step=1
Out[29]: 'Γεια '
In [30]: text[:5] # default start=0
Out[30]: 'Γεια '
Using the notation with the colons is a bit more flexible:
In [31]: text[:2] # default start=0
Out[31]: 'Γε'
If a step of -1
is used, the string is reversed:
In [32]: text[::-1] # negative step: default start is the last character
Out[32]: '!ομσοΚ υοσ αιεΓ'
This means that for a negative step the default start is -1
and the default stop is equivalent to -len(text)-1
taking into account zero-based indexing:
In [33]: text[-1:-len(text)-1:-1]
Out[33]: '!ομσοΚ υοσ αιεΓ'
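These defaults can be inspected with the slice method indices, which resolves a slice against a given length. The sketch below is illustrative (the names forward, backward and reversed_text are not from the tutorial). Note that the resolved values are intended for use with range, where a stop of -1 means "stop before index 0" rather than "the last index":

```python
text = 'Γεια σου Κοσμο!'

# slice.indices(length) returns a concrete (start, stop, step) tuple.
forward = slice(None, None, 1).indices(len(text))
backward = slice(None, None, -1).indices(len(text))

# The resolved values can be unpacked into range to visit each index.
reversed_text = ''.join(text[index] for index in range(*backward))
```

Here forward resolves to (0, 15, 1) and backward resolves to (14, -1, -1), which visits the same indexes as text[-1:-len(text)-1:-1].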
The bytes
instance text_b
can now be examined. Notice that indexing a single value returns an int
corresponding to the byte:
In [34]: text_b[0]
Out[34]: 206
However, slicing returns a bytes
instance:
In [35]: text_b[0:1]
Out[35]: b'\xce'
text_b Variable Explorer
text_b_as_tuple - tuple (27 elements) | |
---|---|
Index | Value |
0 | 206 |
1 | 147 |
2 | 206 |
3 | 181 |
4 | 206 |
5 | 185 |
6 | 206 |
7 | 177 |
8 | 32 |
9 | 207 |
10 | 131 |
11 | 206 |
12 | 191 |
13 | 207 |
14 | 133 |
15 | 32 |
16 | 206 |
17 | 154 |
18 | 206 |
19 | 191 |
20 | 207 |
21 | 131 |
22 | 206 |
23 | 188 |
24 | 206 |
25 | 191 |
26 | 33 |
The data model method __iter__
defines the behaviour of the builtins function iter
and casts the str
into an iterator:
In [36]: forward = iter(text)
Variable Explorer | |||
---|---|---|---|
Name ▲ | Type | Size | Value |
ascii_text | str | 12 | Hello World! |
ascii_text_b | bytes | 12 | Hello World! |
forward | str_iterator | 1 | <str_iterator at 0x22a2f5b70a0> |
selection | slice | 1 | slice(0, 5, 1) |
text | str | 15 | Γεια σου Κοσμο! |
text_as_tuple | tuple | 15 | ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …) |
text_b | bytes | 27 | Γεια σου Κοσμο! |
text_b_as_tuple | tuple | 27 | (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …) |
text_b_as_tuple_bin | tuple | 27 | ('0b11001110', '0b10010011', '0b11001110', '0b10110101','0b11001110', …) |
text_b_as_tuple_hex | tuple | 27 | ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …) |
The iterator essentially only yields a single value at a time. The builtins
function next
can be called to retrieve the next value, which is then consumed and cannot be retrieved again:
In [37]: next(forward)
Out[37]: 'Γ'
In [38]: next(forward)
Out[38]: 'ε'
In [39]: next(forward)
Out[39]: 'ι'
A while
loop can be constructed that breaks when the StopIteration
error is encountered:
In [40]: forward = iter(text)
: while True:
: try:
: print(next(forward))
: except StopIteration:
: break
:
Γ
ε
ι
α
σ
ο
υ
Κ
ο
σ
μ
ο
!
The syntax for a for
loop is cleaner. However behind the scenes, the while
loop and iterator are used:
In [41]: for unicode_char in text:
: print(unicode_char)
:
Γ
ε
ι
α
σ
ο
υ
Κ
ο
σ
μ
ο
!
The enumerate
class can be used to enumerate the str
. To visualise the enumeration
object
it can be cast into a dictionary:
In [42]: enum_text = enumerate(text)
In [43]: enum_text_as_dict = dict(enum_text)
Name ▲ | Type | Size | Value |
---|---|---|---|
ascii_text | str | 12 | Hello World! |
ascii_text_b | bytes | 12 | Hello World! |
forward | str_iterator | 1 | <str_iterator at 0x22a2f5b70a0> |
enum_text | enumerate | 1 | <enumerate at 0x22a2ed61260> |
enum_text_as_dict | dict | 15 | {0: 'Γ', 1: 'ε', 2: 'ι', 3: 'α', 4: ' ', 5: 'σ', 6: 'ο', 7: 'υ', 8: ' ', 9: 'Κ', …} |
selection | slice | 1 | slice(0, 5, 1) |
text | str | 15 | Γεια σου Κοσμο! |
text_as_tuple | tuple | 15 | ('Γ', 'ε', 'ι', 'α', ' ', 'σ', 'ο', 'υ', ' ', 'Κ', …) |
text_b | bytes | 27 | Γεια σου Κοσμο! |
text_b_as_tuple | tuple | 27 | (206, 147, 206, 181, 206, 185, 206, 177, 32, 207, …) |
text_b_as_tuple_bin | tuple | 27 | ('0b11001110', '0b10010011', '0b11001110', '0b10110101','0b11001110', …) |
text_b_as_tuple_hex | tuple | 27 | ('0xce', '0x93', '0xce', '0xb5', '0xce', '0xb9', '0xce', '0xb1', '0x20', …) |
enum_text_as_dict
can be expanded:
enum_text_as_dict | |
---|---|
Key | Value |
0 | Γ |
1 | ε |
2 | ι |
3 | α |
4 | |
5 | σ |
6 | ο |
7 | υ |
8 | |
9 | Κ |
10 | ο |
11 | σ |
12 | μ |
13 | ο |
14 | ! |
A for
loop can be constructed with the enumeration of text
:
In [44]: for index, unicode_char in enumerate(text):
: print(index, unicode_char)
:
0 Γ
1 ε
2 ι
3 α
4
5 σ
6 ο
7 υ
8
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !
The negative indexes can also be examined using:
In [45]: for index, unicode_char in enumerate(text):
: print(index-len(text), unicode_char)
:
-15 Γ
-14 ε
-13 ι
-12 α
-11
-10 σ
-9 ο
-8 υ
-7
-6 Κ
-5 ο
-4 σ
-3 μ
-2 ο
-1 !
The negative indexes can be viewed alongside the positive indexes:
In [46]: for index, unicode_char in enumerate(text):
: print(index-len(text), unicode_char)
: for index, unicode_char in enumerate(text):
: print(index, unicode_char)
:
-15 Γ
-14 ε
-13 ι
-12 α
-11
-10 σ
-9 ο
-8 υ
-7
-6 Κ
-5 ο
-4 σ
-3 μ
-2 ο
-1 !
0 Γ
1 ε
2 ι
3 α
4
5 σ
6 ο
7 υ
8
9 Κ
10 ο
11 σ
12 μ
13 ο
14 !
This makes it easier to conceptualise slicing using a negative step:
In [47]: text[-8:-11:-1]
Out[47]: 'υοσ'
A step of 2
can be used to return a str
of every second unicode character:
In [48]: text[::2]
Out[48]: 'Γι ο ομ!'
In [49]: text[1::2]
Out[49]: 'εασυΚσο'
It is possible to do the same for the bytes
instance:
In [50]: for index, byte_int in enumerate(text_b):
: print(index-len(text_b), byte_int)
: for index, byte_int in enumerate(text_b):
: print(index, byte_int)
:
-27 206
-26 147
-25 206
-24 181
-23 206
-22 185
-21 206
-20 177
-19 32
-18 207
-17 131
-16 206
-15 191
-14 207
-13 133
-12 32
-11 206
-10 154
-9 206
-8 191
-7 207
-6 131
-5 206
-4 188
-3 206
-2 191
-1 33
0 206
1 147
2 206
3 181
4 206
5 185
6 206
7 177
8 32
9 207
10 131
11 206
12 191
13 207
14 133
15 32
16 206
17 154
18 206
19 191
20 207
21 131
22 206
23 188
24 206
25 191
26 33
However slicing using a step with a multibyte encoding such as utf-8
will usually result in a bytes
instance that cannot be decoded:
In [51]: text_b[0::2]
Out[51]: b'\xce\xce\xce\xce \x83\xbf\x85\xce\xce\xcf\xce\xce!'
In [52]: text_b[0::2].decode(encoding='utf-8')
Traceback (most recent call last):
Cell In[52], line 1
text_b[0::2].decode(encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
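One way to avoid this, sketched below under the assumption that the bytes are valid utf-8, is to decode first, slice the str, and encode again. The names every_second and lossy are illustrative. The errors='replace' option of decode is an alternative that substitutes the replacement character U+FFFD for each undecodable byte instead of raising an error:

```python
text_b = 'Γεια σου Κοσμο!'.encode('utf-8')

# Slice whole Unicode characters, then return to bytes.
every_second = text_b.decode('utf-8')[::2].encode('utf-8')

# Slicing the raw bytes splits multibyte characters; errors='replace'
# decodes anyway, marking each damaged sequence with U+FFFD.
lossy = text_b[0::2].decode('utf-8', errors='replace')
```

The first approach keeps every selected character intact, whereas the second only makes the damage visible.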
The Collection
method count
will count the number of occurrences of a substring in a str
:
In [53]: text
Out[53]: 'Γεια σου Κοσμο!'
In [53]: text.count('σου')
Out[53]: 1
In [54]: text.count('σ')
Out[54]: 2
The Collection
method index
will retrieve the index of the first occurrence of a value:
In [55]: dict(enumerate(text))
Out[55]:
{0: 'Γ',
1: 'ε',
2: 'ι',
3: 'α',
4: ' ',
5: 'σ',
6: 'ο',
7: 'υ',
8: ' ',
9: 'Κ',
10: 'ο',
11: 'σ',
12: 'μ',
13: 'ο',
14: '!'}
In [56]: text.index('σ')
Out[56]: 5
The optional positional input arguments start
and end
can be used to restrict the range of indexes to search over:
In [57]: first = text.index('σ')
: text.index('σ', first+1, len(text))
Out[57]: 11
The method index
will produce a ValueError
when the substring is not found:
In [58]: second = text.index('σ', first+1, len(text))
: text.index('σ', second+1, len(text))
Traceback (most recent call last):
Cell In[58], line 2
text.index('σ', second+1, len(text))
ValueError: substring not found
In the str
class there is a similar method find
that behaves similarly to index
but returns -1
when a substring is not found:
In [59]: text.find('σ')
Out[59]: 5
In [60]: text.find('σ', first+1, len(text))
Out[60]: 11
In [61]: text.find('σ', second+1, len(text))
Out[61]: -1
index
and find
search from left to right and have the counterparts, rindex
and rfind
which operate from right to left:
In [62]: text.rindex('σ')
Out[62]: 11
Once again, these only differ when the substring is not found, raising a ValueError or returning
or -1
upon failure respectively.
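Because find returns -1 instead of raising an error, it is convenient in a loop that collects every occurrence. The function find_all below is a hypothetical helper written for illustration and is not a str method:

```python
def find_all(text, sub):
    """Return a list with the index of every occurrence of sub in text."""
    indexes = []
    start = text.find(sub)
    while start != -1:                      # -1 signals "not found"
        indexes.append(start)
        start = text.find(sub, start + 1)   # resume just past the last hit
    return indexes


text = 'Γεια σου Κοσμο!'
sigma_indexes = find_all(text, 'σ')
omicron_indexes = find_all(text, 'ο')
```

The first and last entries of each list correspond to the values returned by find and rfind respectively.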
The replace
method can be used to replace an old substring with a new substring returning a new str
with the changes. If the old substring is found multiple times, it will be replaced by the new string multiple times by default unless the count of the number of replacements is specified, for example 1
where it will only make the first replacement:
In [63]: text
Out[63]: 'Γεια σου Κοσμο!'
In [64]: text.replace('Γεια', 'Γϵια')
Out[64]: 'Γϵια σου Κοσμο!'
In [65]: text.replace('σ', 'ς')
Out[65]: 'Γεια ςου Κοςμο!'
In [66]: text.replace('σ', 'ς', 1)
Out[66]: 'Γεια ςου Κοσμο!'
The bytes
class behaves similarly:
In [67]: text_b
Out[67]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [68]: dict(enumerate(text_b))
Out[68]:
{0: 206,
1: 147,
2: 206,
3: 181,
4: 206,
5: 185,
6: 206,
7: 177,
8: 32,
9: 207,
10: 131,
11: 206,
12: 191,
13: 207,
14: 133,
15: 32,
16: 206,
17: 154,
18: 206,
19: 191,
20: 207,
21: 131,
22: 206,
23: 188,
24: 206,
25: 191,
26: 33}
In [69]: text_b.count(bytes((207, 131)))
Out[69]: 2
In [70]: bytes((207, 131))
Out[70]: b'\xcf\x83'
In [71]: text_b.index(bytes((207, 131)))
Out[71]: 9
In [72]: text_b.count(207)
Out[72]: 3
In [73]: text_b.index(207)
Out[73]: 9
In [74]: text_b.index(207, 9+1, len(text_b))
Out[74]: 13
The str
has the following Collection
based binary operators:
# 🔧 Collection-Like Operators:
# - __add__(self, value, /) : Implements string concatenation (`+`).
# - __mul__(self, value, /) : Implements string repetition (`*`).
# - __rmul__(self, value, /) : Implements reflected multiplication (`*`).
The data model method __add__
defines the behaviour of the +
operator and performs str
concatenation:
In [75]: text
Out[75]: 'Γεια σου Κοσμο!'
In [76]: ascii_text
Out[76]: 'Hello World!'
In [77]: text + ascii_text # text.__add__(ascii_text)
Out[77]: 'Γεια σου Κοσμο!Hello World!'
Notice that no space is added, if this is desired it can also be concatenated:
In [78]: text + ' ' + ascii_text
Out[78]: 'Γεια σου Κοσμο! Hello World!'
The data model method __mul__
defines the behaviour of the *
operator and performs str
replication with an int
instance:
In [79]: text * 3 # text.__mul__(3)
Out[79]: 'Γεια σου Κοσμο!Γεια σου Κοσμο!Γεια σου Κοσμο!'
The reverse data model method __rmul__
gives instructions when the position of the str
instance and int
instance around the operator are reversed:
In [80]: 3 * text # (3).__mul__(text) # Not Defined in int class
# text.__rmul__(3)
Out[80]: 'Γεια σου Κοσμο!Γεια σου Κοσμο!Γεια σου Κοσμο!'
The bytes
class behaves similarly:
In [81]: text_b + ascii_text_b
Out[81]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!Hello World!'
In [82]: text_b * 3
Out[82]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [83]: exit
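The fallback from __mul__ to the reflected __rmul__ can be sketched with a small user-defined class. Repeater is a hypothetical class used only for illustration. When the left operand's __mul__ returns NotImplemented, Python tries the right operand's __rmul__:

```python
class Repeater:
    """Hypothetical class, for illustration only."""

    def __init__(self, word):
        self.word = word

    def __mul__(self, count):
        if not isinstance(count, int):
            return NotImplemented   # let Python try the other operand
        return self.word * count

    # Replication is symmetric, so the reflected method reuses __mul__.
    __rmul__ = __mul__


left = Repeater('ab') * 2           # Repeater.__mul__ is used
right = 3 * Repeater('ab')          # int.__mul__ fails, Repeater.__rmul__ is used
unsupported = (3).__mul__('ab')     # int does not know how to multiply by a str
```

The value of unsupported is the NotImplemented singleton, which is why 3 * text falls back to text.__rmul__(3).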
The bytes
class has the mutable counterpart the bytearray
. A bytearray
instance can be instantiated by casting from a bytes
instance to a bytearray
:
In [1]: text_b = b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [2]: text_b
Out[2]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [3]: text_ba = bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
In [4]: text_ba
Out[4]: bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
The printed formal representation Out[4]
shows the recommended way to instantiate a bytearray
is by casting a bytes
instance to a bytearray
. There is no shorthand way of instantiating this class as it is less commonly used.
The behaviour of all the immutable methods is consistent:
In [4]: len(text_ba)
Out[4]: 27
In [5]: 207 in text_ba
Out[5]: True
In [6]: text_ba.count(207)
Out[6]: 3
In [7]: text_ba.index(207)
Out[7]: 9
The hash
function can be used to check whether an object
is hashable, which is the case for immutable builtins such as bytes. Notice that text_b
which is immutable has a hash value (the exact value can differ between sessions) but text_ba
which is mutable is unhashable:
In [8]: hash(text_b)
Out[8]: -2033065742153678299
In [9]: hash(text_ba)
Traceback (most recent call last):
Cell In[9], line 1
hash(text_ba)
TypeError: unhashable type: 'bytearray'
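A practical consequence, shown in the minimal sketch below, is that a bytes instance can be used as a dict key whereas a bytearray cannot. The names lookup, message and frozen are illustrative:

```python
text_b = b'\xce\x93'
text_ba = bytearray(text_b)

lookup = {text_b: 'Γ'}              # bytes is hashable, so it can be a key

try:
    lookup[text_ba] = 'Γ'           # bytearray is unhashable
except TypeError as error:
    message = str(error)

# Casting to bytes gives a hashable snapshot of the current contents.
frozen = bytes(text_ba)
value = lookup[frozen]
```

Casting the bytearray to bytes takes an immutable snapshot, which can then be looked up as normal.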
The data model method __getitem__
can be used to index into an immutable bytes
or mutable bytearray
.
In [10]: text_b[0]
Out[10]: 206
In [11]: text_ba[0]
Out[11]: 206
In [12]: hex(text_ba[0])
Out[12]: '0xce'
The id
function can be used to obtain the identification of an object
:
In [13]: id(text_ba)
Out[13]: 1968878586928
In [14]: text_ba
Out[14]: bytearray(b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
The data model method __setitem__
defines the behaviour when indexing into a value and using assignment:
In [15]: int('0xcf', base=16)
Out[15]: 207
In [15]: text_ba[0] = 207
Notice because a value is being assigned in In [15]
there is no Out[15]
. If text_ba
is examined, it is updated in place, notice that the object
id does not change:
In [16]: text_ba
Out[16]: bytearray(b'\xcf\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
In [16]: id(text_ba)
Out[16]: 1968878586928
The data model method __delitem__
defines the behaviour when deleting a value that has been indexed into:
In [18]: del text_ba[0]
Notice there is no Out[18]
and instead text_ba
is modified in place:
In [19]: text_ba
Out[19]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!')
The first byte
is missing, and the result won't decode properly because only one byte of a two-byte character was deleted. Notice the identification is constant:
In [20]: id(text_ba)
Out[20]: 1968878586928
The mutable method append
will append a single byte, given as an int in range(0, 256),
to the end of a bytearray
:
In [21]: text_ba.append(206) # '\xce'
As this method is mutable it has no return value. text_ba
can be seen to be modified in place:
In [22]: text_ba
Out[22]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce')
The mutable method extend
can be used to extend the bytearray by another bytearray:
In [23]: text_ba.extend(bytearray((177, 206, 177))) # '\xb1\xce\xb1'
Once again this method is mutable and text_ba
is modified in place:
In [24]: text_ba
Out[24]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')
The mutable method insert
can be used to insert a single byte as an int
at an index, for example at index 1
:
In [25]: text_ba.insert(1, 148) # '\x94'
Once again this method is mutable and text_ba
is modified in place:
In [26]: text_ba
Out[26]: bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')
The mutable method remove
can be used to remove the first occurrence of a byte
:
In [27]: text_ba.remove(206) # '\xce'
Once again this method is mutable and text_ba
is modified in place, the \xce
that was at index 2 is no longer here and instead \xb5
which was previously at index 3 is shown at index 2:
In [28]: text_ba
Out[28]: bytearray(b'\x93\x94\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce\xb1\xce\xb1')
The mutable method reverse
can be used to reverse the order of each byte
in the bytearray:
In [29]: text_ba.reverse()
Once again this method is mutable and text_ba
is modified in place:
In [30]: text_ba
Out[30]: bytearray(b'\xb1\xce\xb1\xce!\xbf\xce\xbc\xce\x83\xcf\xbf\xce\x9a\xce \x85\xcf\xbf\xce\x83\xcf \xb1\xce\xb9\xce\xb5\x94\x93')
The mutable method clear
will clear each byte
from the bytearray:
In [31]: text_ba.clear()
Once again this method is mutable and text_ba
is modified in place:
In [30]: text_ba
Out[30]: bytearray(b'')
The mutable method extend can be used to extend this empty bytearray
:
In [31]: text_ba.extend(bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce\xb1'))
Most mutable methods have no return value, which distinguishes them clearly from immutable methods which have a return value. The mutable method pop
is unique because it returns the value popped (by default the last value) and mutates the bytearray
in place:
In [31]: text_ba.pop()
Out[31]: 177 # '\xb1'
In [32]: text_ba
Out[32]: bytearray(b'\x93\x94\xce\xb5\xce\xb9\xce')
An index to pop can be specified:
In [34]: text_ba.pop(1)
Out[34]: 148 # '\x94'
In [32]: text_ba
Out[32]: bytearray(b'\x93\xce\xb5\xce\xb9\xce')
Notice that after all these mutable methods are used the identification of text_ba
remains the same:
In [33]: id(text_ba)
Out[33]: 1968878586928
The copy
method can be used to create a copy of the bytearray
:
In [34]: text_ba2 = text_ba.copy()
Notice the copy has a different identification:
In [35]: id(text_ba2)
Out[35]: 1968878402416
The copies (at present) have equal values but are not the same object
:
In [36]: text_ba2 == text_ba
Out[36]: True
In [37]: text_ba2 is text_ba
Out[37]: False
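The difference between a copy and a second name for the same object matters as soon as one of them is mutated. In the sketch below the names original, alias and duplicate are illustrative:

```python
original = bytearray(b'\x93\xce')

alias = original                # a second label on the same object
duplicate = original.copy()     # a new object with equal contents

original.append(0xb5)           # mutate through the first label

alias_changed = alias == bytearray(b'\x93\xce\xb5')
duplicate_unchanged = duplicate == bytearray(b'\x93\xce')
```

The alias sees the appended byte because it refers to the same object, while the copy keeps the contents it had when it was made.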
The __add__
and __mul__
data model methods behave consistently:
In [38]: text_ba + text_ba2
Out[38]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce')
In [39]: text_ba * 3
Out[39]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce\x93\xce\xb5\xce\xb9\xce')
However there is a subtle difference when the in place counterparts are used. Notice for the immutable bytes
that two operations take place, essentially concatenation returning a new value and then reassignment, notice the identification changes which means the label text_b
has been peeled off the old bytes
instance with identification 1968877623024
and placed on the new bytes
instance with identification 1968877576688
:
In [40]: text_b
Out[40]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!'
In [41]: id(text_b)
Out[41]: 1968877623024
In [42]: text_b += b'\xce'
In [43]: text_b
Out[43]: b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xce\xbf\xcf\x83\xce\xbc\xce\xbf!\xce'
In [44]: id(text_b)
Out[44]: 1968877576688
Notice for the mutable bytearray
that a single operation has taken place and the identification remains constant:
In [45]: text_ba
Out[45]: bytearray(b'\x93\xce\xb5\xce\xb9\xce')
In [46]: id(text_ba)
Out[46]: 1968878586928
In [47]: text_ba += bytearray(b'\xce')
In [48]: text_ba
Out[48]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xce')
In [49]: id(text_ba)
Out[49]: 1968878586928
In [50]: text_ba *= 2
In [51]: text_ba
Out[51]: bytearray(b'\x93\xce\xb5\xce\xb9\xce\xce\x93\xce\xb5\xce\xb9\xce\xce')
In [52]: id(text_ba)
Out[52]: 1968878586928
In [53]: exit
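The same distinction can be observed without comparing identification numbers by keeping a second label on each object before the in place operator is used. This is a minimal sketch with illustrative names:

```python
data_b = b'\x93\xce'
data_ba = bytearray(data_b)

alias_b = data_b        # second label on the bytes instance
alias_ba = data_ba      # second label on the bytearray instance

data_b += b'\xb5'       # builds a new bytes instance and moves the label
data_ba += b'\xb5'      # mutates the existing bytearray in place
```

After these statements alias_b still refers to the old two-byte value, whereas alias_ba shows the appended byte because it is the same object as data_ba.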
Returning to the str
class, the remaining str
methods will now be examined. Recall that a str
is immutable and all methods therefore return a value, which is commonly another str
instance.
It is common to insert an object into a string and format it within the string body, to produce what is known as a formatted string.
Look at the following string body:
In [1]: body = 'The string to 0 is 1 2!'
Supposing there are three str
instances:
In [2]: var0 = 'print'
: var1 = 'hello'
: var2 = 'world'
The str
method format
can be used to insert these str
instances within the string body. Let's examine the docstring of the str
method format
:
In [3]: body.format(
# Docstring popup
"""
Docstring:
S.format(*args, **kwargs) -> str
Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
Type: builtin_function_or_method
"""
From the docstring, the string body should contain curly braces, which are used as placeholders to insert a Python object
. Each placeholder can be numbered positionally:
In [3]: body = 'The string to {0} is {1} {2}!'
The *args
in the docstring indicates a variable number of positional arguments. When inserting multiple object
instances into the string body, each positional argument should correspond to a placeholder:
In [4]: body.format(var0, var1, var2)
Out[4]: 'The string to print is hello world!'
The string body can alternatively be set up to contain named arguments:
In [5]:
body = 'The string to {var0_} is {var1_} {var2_}!'
The **kwargs
in the docstring indicates a variable number of named arguments, also known as keyword arguments:
In [6]:
body.format(var0_=var0, var1_=var1, var2_=var2)
Out[6]: 'The string to print is hello world!'
Combining the above:
In [7]: 'The string to {var0_} is {var1_} {var2_}!'.format(var0_=var0, var1_=var1, var2_=var2)
Out[7]: 'The string to print is hello world!'
It is common for the placeholder to be given the same name as the object to be inserted:
In [8]: 'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)
Out[8]: 'The string to print is hello world!'
Notice in the above that each object
name is essentially repeated 3 times which is pretty cumbersome. Therefore a shorthand way of writing the expression above is to use the prefix f
, f
means formatted string:
In [9]: f'The string to {var0} is {var1} {var2}!'
Out[9]: 'The string to print is hello world!'
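The approaches above can be compared side by side in one short sketch. The empty placeholders in the variable auto rely on automatic numbering, a feature of format that is not covered above; the other names are illustrative:

```python
var0, var1, var2 = 'print', 'hello', 'world'

positional = 'The string to {0} is {1} {2}!'.format(var0, var1, var2)
named = 'The string to {a} is {b} {c}!'.format(a=var0, b=var1, c=var2)
auto = 'The string to {} is {} {}!'.format(var0, var1, var2)
f_string = f'The string to {var0} is {var1} {var2}!'

expected = 'The string to print is hello world!'
```

All four variants produce the same str, so the choice between them is mainly one of readability.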
The object
data model __format__
method defines the behaviour of the builtins
function format:
In [10]: format(
# Docstring popup
"""
Signature: format(object, format_spec, /)
Docstring:
Default object formatter.
Return str(self) if format_spec is empty. Raise TypeError otherwise.
Type: method_descriptor
"""
Notice there is a format specification format_spec
. 's'
denotes the format specification for a str
instance:
In [11]: format('Hello World!', 's')
Out[11]: 'Hello World!'
If it is prefixed with a number for instance '22s'
, this is an instruction for the str
instance to occupy a width of 22 within the formatted string. Because the original length was 12, it now has 10 spaces until the end of the string:
In [12]: format('Hello World!', '22s')
Out[12]: 'Hello World!          '
Prefixing with a 0
is not common with a str
instance and replaces each space with a 0
:
In [13]: format('Hello World!', '022s')
Out[13]: 'Hello World!0000000000'
The format specified is inserted within a variable within the placeholder and the colon :
is used to separate out the variable from the format specification:
In [14]: f'The string to {var0:s} is {var1} {var2}!'
Out[14]: 'The string to print is hello world!'
In [15]: f'The string to {var0:10s} is {var1} {var2}!'
Out[15]: 'The string to print      is hello world!'
In [16]: f'The string to {var0:010s} is {var1} {var2}!'
Out[16]: 'The string to print00000 is hello world!'
Numeric values are commonly inserted into a string body:
In [17]: num1 = 1
: num2 = 0.0000123456789
: num3 = 12.3456789
In [18]: f'The numbers are {num1}, {num2} and {num3}.'
Out[18]: 'The numbers are 1, 1.23456789e-05 and 12.3456789.'
num1
is an integer and an integer can have various format specifiers. d
is used to represent a decimal integer:
In [19]: f'The numbers are {num1:d}, {num2} and {num3}.'
Out[19]:
'The numbers are 1, 1.23456789e-05 and 12.3456789.'
The width can also be specified:
In [19]: f'The numbers are {num1:5d}, {num2} and {num3}.'
Out[19]:
'The numbers are     1, 1.23456789e-05 and 12.3456789.'
Prefixing this with 0
will display leading zeros:
In [19]: f'The numbers are {num1:05d}, {num2} and {num3}.'
Out[19]:
'The numbers are 00001, 1.23456789e-05 and 12.3456789.'
num2
and num3
are float
instances and the format specifier f
can be used to express each float
in the fixed format:
In [20]: for num in range(9, -1, -1):
: print('0.'+num*'0'+'123')
:
: for num in range(18):
: print('123'+num*'0'+'.')
:
0.000000000123
0.00000000123
0.0000000123
0.000000123
0.00000123
0.0000123
0.000123
0.00123
0.0123
0.123
123.
1230.
12300.
123000.
1230000.
12300000.
123000000.
1230000000.
12300000000.
123000000000.
1230000000000.
12300000000000.
123000000000000.
1230000000000000.
12300000000000000.
123000000000000000.
1230000000000000000.
12300000000000000000.
Typically when the float
is very small or very large, scientific notation is used, with the format e
. The format g
is the general format and uses the fixed format or the exponential format depending on the size of the float
:
In [21]: for num in range(9, -1, -1):
: print(float('0.'+num*'0'+'123'))
:
: for num in range(18):
: print(float('123'+num*'0'+'.'))
1.23e-10
1.23e-09
1.23e-08
1.23e-07
1.23e-06
1.23e-05
0.000123
0.00123
0.0123
0.123
123.0
1230.0
12300.0
123000.0
1230000.0
12300000.0
123000000.0
1230000000.0
12300000000.0
123000000000.0
1230000000000.0
12300000000000.0
123000000000000.0
1230000000000000.0
1.23e+16
1.23e+17
1.23e+18
1.23e+19
In [22]:
f'The numbers are {num1:g}, {num2:g} and {num3:g}.'
Out[22]: 'The numbers are 1, 1.23457e-05 and 12.3457.'
In [23]:
f'The numbers are {num1:f}, {num2:f} and {num3:f}.'
Out[23]: 'The numbers are 1.000000, 0.000012 and 12.345679.'
In [24]:
f'The numbers are {num1:e}, {num2:e} and {num3:e}.'
Out[24]: 'The numbers are 1.000000e+00, 1.234568e-05 and 1.234568e+01.'
A width of 10 characters, with 3 characters past the decimal point can be specified:
In [25]: format(num1, '10.3e')
Out[25]: ' 1.000e+00'
In [26]: format(num1, '010.3e')
Out[26]: '01.000e+00'
In [27]: format(num1, '010.2e')
Out[27]: '001.00e+00'
Notice the width includes all the characters used to represent the number as a string such as the decimal point, e and power.
The same modifications can be made in the fixed format:
In [25]: format(num1, '10.3f')
Out[25]: '     1.000'
In [26]: format(num1, '010.3f')
Out[26]: '000001.000'
In [27]: format(num1, '010.2f')
Out[27]: '0000001.00'
In [28]: f'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.'
Out[28]: 'The numbers are 001, 00.000 and 01.235e+01.'
In [29]: exit
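A fixed width is mostly useful for lining values up in columns. The sketch below is illustrative (the names rows and lines are not from the tutorial) and combines a width of 6 for each name with a width of 12 and a precision of 3 for each number:

```python
rows = (('num1', 1),
        ('num2', 0.0000123456789),
        ('num3', 12.3456789))

# Each name is padded to 6 characters and each number to 12 characters.
lines = [f'{name:6s}{value:12.3e}' for name, value in rows]

for line in lines:
    print(line)
```

Every line has the same length of 18 characters, so the exponents line up regardless of the magnitude of each number.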
Returning to the string body:
In [1]: body = 'The numbers are {num1:03d}, {num2:06.3f} and {num3:010.3e}.'
The docstring of the str
method format_map
can be viewed:
In [2]: body.format_map(
# Docstring popup
"""
Docstring:
S.format_map(mapping) -> str
Return a formatted version of S, using substitutions from mapping.
The substitutions are identified by braces ('{' and '}').
Type: builtin_function_or_method
"""
To use this method, all the variables to be incorporated into the formatted string are grouped together in a mapping such as a dict
:
In [2]: numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}
The str
method format_map
can then be used to map all the variables from this dict
into their placeholders within the string body:
In [3]: body.format_map(numbers)
Out[3]: 'The numbers are 001, 00.000 and 01.235e+01.'
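Unlike format, format_map uses the mapping directly rather than copying it into a new dict, so a dict subclass can customise what happens when a key is missing. KeepMissing below is a hypothetical helper written for illustration, which leaves unknown placeholders in the string:

```python
class KeepMissing(dict):
    """Hypothetical helper: leaves unknown placeholders untouched."""

    def __missing__(self, key):
        # Called by dict lookup when key is absent.
        return '{' + key + '}'


body = 'The numbers are {num1:03d} and {num2}.'
partial = body.format_map(KeepMissing(num1=1))
complete = partial.format_map({'num2': 12.3456789})
```

This sketch only works for placeholders without a format specification, because the specification would otherwise be applied to the substituted str.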
In the str
class the data model method __mod__
is defined to implement C-style formatted strings which controls the behaviour of the operator %
.
In [4]: body = 'The numbers are %03d, %06.3f and %0.3g.'
: nums = (1, 0.0000123456789, 12.3456789)
In [5]: body % nums
Out[5]: 'The numbers are 001, 00.000 and 12.3.'
The Greek alphabet looks as follows, notice it has uppercase and lowercase letters. Notice also that some characters such as epsilon and sigma have two lowercase variations:
Greek Alphabet
Greek Alphabet | Uppercase | Lower Case |
---|---|---|
Alpha | Α | α |
Beta | Β | β |
Gamma | Γ | γ |
Delta | Δ | δ |
Epsilon | Ε | ε or ϵ |
Zeta | Ζ | ζ |
Eta | Η | η |
Theta | Θ | θ |
Iota | Ι | ι |
Kappa | Κ | κ |
Lambda | Λ | λ |
Mu | Μ | μ |
Nu | Ν | ν |
Xi | Ξ | ξ |
Omicron | Ο | ο |
Pi | Π | π |
Rho | Ρ | ρ |
Sigma | Σ | σ or ς |
Tau | Τ | τ |
Upsilon | Υ | υ |
Phi | Φ | φ |
Chi | Χ | χ |
Psi | Ψ | ψ |
Omega | Ω | ω |
The str
case method upper
returns a string where every character is upper case:
In [6]: 'γεια σου κοσμο!'.upper()
Out[6]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'
The str
case method capitalize
(U.S. spelling with z) returns a string where only the first character is in upper case and the rest of the characters are in lower case:
In [7]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.capitalize()
Out[7]: 'Γεια σου κοσμο!'
The str
case method title
returns a string where the first character of each word is in upper case and the rest of the characters are in lower case:
In [8]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.title()
Out[8]: 'Γεια Σου Κοσμο!'
The str
case method lower
returns a string where each character is in lower case:
In [9]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.lower()
Out[9]: 'γεια σου κοσμο!'
The following characters are less common lower case variants of epsilon and sigma. Therefore when the str
method lower
is used on them, they are unchanged:
In [10]: 'ϵ'.lower()
Out[10]: 'ϵ'
In [11]: 'ς'.lower()
Out[11]: 'ς'
The str
case method casefold
returns a string where each character is in lower case and transforms any variants to the most common variant:
In [12]: 'ϵ'.casefold()
Out[12]: 'ε'
In [13]: 'ς'.casefold()
Out[13]: 'σ'
The difference between the str
methods lower
and casefold
can be seen in the example below:
In [14]: 'Γϵια ςου Κοςμο!'.lower()
Out[14]: 'γϵια ςου κοςμο!'
In [15]: 'Γϵια ςου Κοςμο!'.casefold()
Out[15]: 'γεια σου κοσμο!'
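This is why casefold is generally preferred over lower for case-insensitive comparisons. The function same_ignoring_case below is a hypothetical helper used only to illustrate the idea:

```python
def same_ignoring_case(left, right):
    """Compare two str instances, ignoring case and letter variants."""
    return left.casefold() == right.casefold()


variant = 'Γϵια ςου Κοςμο!'     # uses the variant epsilon and final sigma
common = 'Γεια σου Κοσμο!'      # uses the common lower case letters

match_casefold = same_ignoring_case(variant, common)
match_lower = variant.lower() == common.lower()
```

Comparing with casefold treats the two greetings as equal, whereas comparing with lower does not because the variant letters are left unchanged.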
The str
case method swapcase
swaps the case of each character in the str
:
In [16]: 'Γεια Σου Κοσμο!'.swapcase()
Out[16]: 'γΕΙΑ σΟΥ κΟΣΜΟ!'
The str
class has a number of boolean classification methods which return True
if every Unicode character in a str
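satisfies the classification. For islower and isupper only cased characters are considered, so spaces and punctuation are ignored: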
In [17]: 'γεια σου κοσμο!'.islower()
Out[17]: True
In [18]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.islower()
Out[18]: False
In [19]: 'γεια σου κοσμο!'.isupper()
Out[19]: False
In [20]: 'ΓΕΙΑ ΣΟΥ ΚΟΣΜΟ!'.isupper()
Out[20]: True
The boolean classification istitle
will return True
if the str
is title case:
In [21]: 'Γεια σου κοσμο!'.istitle()
Out[21]: False
In [22]: 'Γεια Σου Κοσμο!'.istitle()
Out[22]: True
The boolean classification isspace
will return True
if each character is whitespace, this includes tabs and newlines:
In [23]: ' '.isspace()
Out[23]: True
In [24]: ' '.isspace()
Out[24]: True
In [25]: ' \t\n\r\x0b\x0c'.isspace()
Out[25]: True
The escape character \t
represents a tab, \n
represents a new line and \r
a carriage return. \x0b
is the vertical tab and \x0c
is the form feed. These are less commonly used and are shown as hexadecimal escape sequences.
The boolean classification isprintable
will check to see if every character in the string is printable, i.e. the string contains no non-printable characters such as the ASCII control characters:
In [26]: '\x00'.isprintable()
Out[26]: False
In [27]: 'Γεια σου Κοσμο!'.isprintable()
Out[27]: True
The boolean classification isascii
will check to see if every character in the string is an ASCII character:
In [28]: 'Γεια σου Κοσμο!'.isascii()
Out[28]: False
In [29]: 'Hello World!'.isascii()
Out[29]: True
The boolean classification isalpha
will check to see if every character in the string is alphabetical. Note this isn't limited to only ASCII alphabetical characters:
In [30]: 'Γεια σου Κοσμο!'.isalpha()
Out[30]: False
In [31]: 'αβγΑΒΓ'.isalpha()
Out[31]: True
In [32]: 'abcABC'.isalpha()
Out[32]: True
There are three numeric classifications and the difference between these can be seen by examining the following numeric groups:
In [33]: numeric_groups = {'ascii': '0123456789',
'font1': '𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿',
'font2': '𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵',
'font3': '𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡',
'subscript': '₀₁₂₃₄₅₆₇₈₉',
'superscript': '⁰¹²³⁴⁵⁶⁷⁸⁹',
'circled1': '➀➁➂➃➄➅➆➇➈',
'circled2': '➉',
'fractions': '½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉',
'asciihex': '0123456789abcdef', }
isdecimal
is the most restrictive and recognises numeric digits of various different fonts:
In [34]: for key, value in numeric_groups.items():
: print(key, value, value.isdecimal())
:
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ False
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ False
circled1 ➀➁➂➃➄➅➆➇➈ False
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False
isdigit
recognises more, including subscripts, superscripts and circled digits. However, the circled 10 isn't recognised as it represents a two-digit number rather than a single digit:
In [35]: for key, value in numeric_groups.items():
: print(key, value, value.isdigit())
:
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False
isnumeric
recognises more including the circled 10 and fractions:
In [36]: for key, value in numeric_groups.items():
: print(key, value, value.isnumeric())
:
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef False
isalnum
essentially accepts the combination of Unicode characters accepted by isalpha
and isnumeric
:
In [37]: for key, value in numeric_groups.items():
: print(key, value, value.isalnum())
:
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef True
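The distinction between the numeric classifications matters when converting text to a number. The sketch below assumes a hypothetical helper named to_int_or_none and uses the standard unicodedata module to read back the value attached to each kind of character:

```python
import unicodedata


def to_int_or_none(text):
    """Return int(text) when every character is a decimal digit, else None."""
    # Hypothetical helper: isdecimal is checked first because int() raises
    # ValueError for characters such as superscripts and fractions.
    return int(text) if text.isdecimal() else None


print(to_int_or_none('0123'))    # 123
print(to_int_or_none('²'))       # None
print(unicodedata.digit('²'))    # 2
print(unicodedata.numeric('½'))  # 0.5
print(unicodedata.numeric('➉'))  # 10.0
```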
The boolean classification isidentifier
will check to see if the string is a valid identifier name. Recall identifiers (object
names) cannot begin with a number, but can include a number elsewhere, and cannot use spaces or special characters with the exception of the underscore:
In [38]: 'variable'.isidentifier()
Out[38]: True
In [39]: '2variable'.isidentifier()
Out[39]: False
In [40]: 'variable2'.isidentifier()
Out[40]: True
In [41]: 'variable 2'.isidentifier()
Out[41]: False
In [42]: 'variable_2'.isidentifier()
Out[42]: True
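The str methods startswith and endswith return True if the string begins or ends, respectively, with a given substring.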
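A minimal sketch of the startswith and endswith methods, which check whether a string begins or ends with a given substring. The example string is the greeting used throughout, and a tuple of candidates is also accepted:

```python
greeting = 'Γεια σου Κοσμο!'

print(greeting.startswith('Γεια'))         # True
print(greeting.endswith('!'))              # True
# The comparison is case sensitive.
print(greeting.startswith('γεια'))         # False
# A tuple matches if any one of its elements matches.
print(greeting.endswith(('!', '?', '.')))  # True
```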
The str
alignment methods can be used as an alternative way to format a string. Left justify ljust
, right justify rjust
and center
will align a string using a specified width:
In [43]: len('Γεια σου Κοσμο!')
Out[43]: 15
In [44]: 'Γεια σου Κοσμο!'.ljust(20)
Out[44]: 'Γεια σου Κοσμο!     '
In [45]: 'Γεια σου Κοσμο!'.rjust(20)
Out[45]: '     Γεια σου Κοσμο!'
In [46]: 'Γεια σου Κοσμο!'.center(20)
Out[46]: '  Γεια σου Κοσμο!   '
These str
alignment methods accept an optional fill character:
In [47]: 'Γεια σου Κοσμο!'.rjust(20, '0')
Out[47]: '00000Γεια σου Κοσμο!'
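For comparison, the same left and right alignment can be written with the format specification used in formatted strings. This is a small sketch rather than a full treatment of the mini-language, which also has ^ for centring:

```python
text = 'Γεια σου Κοσμο!'

# '<' left aligns and '>' right aligns within the given width.
print(f'{text:<20}' == text.ljust(20))        # True
print(f'{text:>20}' == text.rjust(20))        # True
# A fill character is placed before the alignment symbol.
print(f'{text:0>20}' == text.rjust(20, '0'))  # True
```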
Using right justification with a fill character of 0
is commonly used for numeric strings and is available as the str
method zerofill zfill
:
In [48]: '1'.zfill(5)
Out[48]: '00001'
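One difference between zfill and rjust with a fill character of 0 shows up with signed numeric strings: zfill inserts the zeros after the sign. A short illustration, with the formatted string equivalent for comparison:

```python
print('-1'.zfill(5))       # -0001
print('-1'.rjust(5, '0'))  # 000-1
print(f'{-1:05d}')         # -0001
```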
The str
method expandtabs
can be used to expand tabs to spaces, padding to the next tab stop. The default tab size is 8
:
In [49]: '\tΓεια σου Κοσμο!'.expandtabs()
Out[49]: '        Γεια σου Κοσμο!'
In [50]: '\tΓεια σου Κοσμο!'.expandtabs(4)
Out[50]: '    Γεια σου Κοσμο!'
The methods left strip lstrip
, right strip rstrip
and strip
strip leading whitespace, trailing whitespace, or both from a string by default:
In [51]: ' Γεια σου Κοσμο! '.lstrip()
Out[51]: 'Γεια σου Κοσμο! '
In [52]: ' Γεια σου Κοσμο! '.rstrip()
Out[52]: ' Γεια σου Κοσμο!'
In [53]: ' Γεια σου Κοσμο! '.strip()
Out[53]: 'Γεια σου Κοσμο!'
Alternatively they can be used to strip a specified character:
In [54]: '00001'.lstrip('0')
Out[54]: '1'
Or any combination of multiple characters, as the argument is treated as a set of characters rather than as a prefix:
In [55]: '0x01'.lstrip('0x')
Out[55]: '1'
Sometimes it is more useful to use the str
methods removeprefix
and removesuffix
which will remove only a specified prefix or suffix:
In [56]: '0x01'.removeprefix('0x')
Out[56]: '01'
In [57]: '0x01'.removesuffix('01')
Out[57]: '0x'
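The difference between stripping a set of characters and removing an exact suffix is easy to overlook. The sketch below uses a hypothetical file name to show it; removeprefix and removesuffix require Python 3.9 or later:

```python
filename = 'test.txt'

# rstrip treats '.txt' as the set of characters {'.', 't', 'x'}.
print(filename.rstrip('.txt'))        # tes
# removesuffix removes the exact suffix once.
print(filename.removesuffix('.txt'))  # test
# If the suffix is absent the string is returned unchanged.
print(filename.removesuffix('.csv'))  # test.txt
```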
The str
method split
splits a string on whitespace by default, returning a list
of str
instances. Conceptually this splits a sentence into its words:
In [58]: 'Γεια σου Κοσμο!'.split()
Out[58]: ['Γεια', 'σου', 'Κοσμο!']
This is complemented by the str
method join
which joins a list
of str
instances:
In [59]: ' '.join(['Γεια', 'σου', 'Κοσμο!'])
Out[59]: 'Γεια σου Κοσμο!'
A different character can be specified in the split
method:
In [60]: 'Γεια σου Κοσμο!'.split('σ')
Out[60]: ['Γεια ', 'ου Κο', 'μο!']
In [61]: 'σ'.join(['Γεια ', 'ου Κο', 'μο!'])
Out[61]: 'Γεια σου Κοσμο!'
A maximum number of splits can be specified. Here split can be seen to operate on the string from left to right:
In [60]: 'Γεια σου Κοσμο!'.split(maxsplit=1)
Out[60]: ['Γεια', 'σου Κοσμο!']
The counterpart rsplit
operates from right to left:
In [61]: 'Γεια σου Κοσμο!'.rsplit(maxsplit=1)
Out[61]: ['Γεια σου', 'Κοσμο!']
When maxsplit
isn't specified, rsplit
and split
behave identically and split
is generally preferred.
The str
method splitlines
is essentially split
with the split character being specified as a new line \n
, except that a trailing new line does not produce an empty final element:
In [62]: print('Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n')
Γεια σου Κοσμο!
Γεια σου Κοσμο!
Γεια σου Κοσμο!
In [63]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.splitlines()
Out[63]: ['Γεια σου Κοσμο!', 'Γεια σου Κοσμο!', 'Γεια σου Κοσμο!']
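The behaviour of splitlines can be compared directly with split using '\n' as the separator. The sample text below is made up for illustration and mixes two kinds of line ending:

```python
text = 'one\ntwo\r\nthree\n'

# split leaves the carriage return in place and adds an empty last element.
print(text.split('\n'))                # ['one', 'two\r', 'three', '']
# splitlines recognises several line endings and drops them.
print(text.splitlines())               # ['one', 'two', 'three']
# keepends=True retains the line endings.
print(text.splitlines(keepends=True))  # ['one\n', 'two\r\n', 'three\n']
```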
The str
method partition
is similar to split
but only occurs once and always returns a three element tuple around the split character:
In [64]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.partition('\n')
Out[64]: ('Γεια σου Κοσμο!', '\n', 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\n')
Partition operates left to right, there is the rpartition
counterpart which operates right to left:
In [65]: 'Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!\n'.rpartition('\n')
Out[65]: ('Γεια σου Κοσμο!\nΓεια σου Κοσμο!\nΓεια σου Κοσμο!', '\n', '')
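Because a three element tuple is always returned, partition is convenient for unpacking. The sketch below uses a hypothetical key and value setting and also shows what happens when the separator is absent:

```python
setting = 'path=/usr/bin=extra'

key, separator, value = setting.partition('=')
print(key, value)                 # path /usr/bin=extra
print(setting.rpartition('='))    # ('path=/usr/bin', '=', 'extra')
# When the separator is absent the original string is kept whole.
print('novalue'.partition('='))   # ('novalue', '', '')
print('novalue'.rpartition('='))  # ('', '', 'novalue')
```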
The string
module contains constants that are related to string manipulation but are not available as non-callable attributes in the str
class. A design choice was made to compartmentalise these into a separate string
module. As a result, all the identifiers of the str
class, outwith the data model identifiers, are callable methods which return a new value and leave the immutable str
instance unchanged. Compartmentalising these also reduced the memory overhead in the str
class.
In [1]: import string
In [2]: string.
# Available Identifiers for `string` module
# -------------------------------
# Available Identifiers in `string`:
# ----------------------------------
# 🔠 Character Sets:
# ascii_letters : Concatenation of `ascii_lowercase` and `ascii_uppercase`.
# ascii_lowercase : Lowercase ASCII letters (`abcdefghijklmnopqrstuvwxyz`).
# ascii_uppercase : Uppercase ASCII letters (`ABCDEFGHIJKLMNOPQRSTUVWXYZ`).
# digits : Decimal digit characters (`0123456789`).
# hexdigits : Hexadecimal digit characters (`0123456789abcdefABCDEF`).
# octdigits : Octal digit characters (`01234567`).
# printable : Characters deemed "printable" (`digits`, `ascii_letters`, punctuation, and whitespace).
# punctuation : String of all ASCII punctuation characters.
# whitespace : String of all ASCII whitespace characters.
In [3]: string.ascii_letters
Out[3]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [4]: string.ascii_lowercase
Out[4]: 'abcdefghijklmnopqrstuvwxyz'
In [5]: string.ascii_uppercase
Out[5]: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [6]: string.digits # base 10
Out[6]: '0123456789'
In [7]: string.hexdigits # base 16
Out[7]: '0123456789abcdefABCDEF'
In [8]: string.octdigits # base 8
Out[8]: '01234567'
In [9]: string.punctuation
Out[9]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [10]: string.whitespace
Out[10]: ' \t\n\r\x0b\x0c'
In [11]: string.printable
Out[11]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
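These constants are mostly used for membership tests. A minimal sketch, assuming a hypothetical helper named is_ascii_hex:

```python
import string


def is_ascii_hex(text):
    """Return True if text is non-empty and made only of ASCII hex digits."""
    # Hypothetical helper; string.hexdigits holds both cases of a-f.
    return bool(text) and all(char in string.hexdigits for char in text)


print(is_ascii_hex('03b1'))   # True
print(is_ascii_hex('0x3b1'))  # False, 'x' is not a hexadecimal digit
print(is_ascii_hex('𝟶𝟷'))     # False, although '𝟶𝟷'.isdecimal() is True
```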
The str
method maketrans
is a static method that creates a translation table which maps from one character to another (conceptualise the translation). A translation table from Greek to Latin letters can be made. To visualise this, it can be cast into a dict
:
In [12]: greek2latin = str.maketrans('αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ', 'abgdezhqiklmnxoprstyfcuwABGDEZHQIKLMNXOPRSTYFCUW')
In [13]: greek2latin_as_dict = dict(greek2latin)
Translation Table | |
---|---|
Key | Value |
945 | 97 |
946 | 98 |
947 | 103 |
948 | 100 |
949 | 101 |
950 | 122 |
951 | 104 |
952 | 113 |
953 | 105 |
954 | 107 |
955 | 108 |
956 | 109 |
957 | 110 |
958 | 120 |
959 | 111 |
960 | 112 |
961 | 114 |
963 | 115 |
964 | 116 |
965 | 121 |
966 | 102 |
967 | 99 |
968 | 117 |
969 | 119 |
913 | 65 |
914 | 66 |
915 | 71 |
916 | 68 |
917 | 69 |
918 | 90 |
919 | 72 |
920 | 81 |
921 | 73 |
922 | 75 |
923 | 76 |
924 | 77 |
925 | 78 |
926 | 88 |
927 | 79 |
928 | 80 |
929 | 82 |
931 | 83 |
932 | 84 |
933 | 89 |
934 | 70 |
935 | 67 |
936 | 85 |
937 | 87 |
Notice the keys and the values are numerical, displayed as int
instances. These can be understood better by looking at the int
in binary or hexadecimal using the bin
and hex
functions respectively. The str
methods explored above will be used to pad these to a whole number of bytes. The character chr
function returns the Unicode character whose code point is the supplied int
. The comments below note how many bytes each character occupies when encoded as utf-8
. The ordinal ord
function performs the inverse operation:
In [14]: bin(945)
Out[14]: '0b1110110001' # 2 bytes 'utf-8'
In [15]: '0b'+bin(945).removeprefix('0b').zfill(16)
Out[15]: '0b0000001110110001'
In [14]: hex(945)
Out[14]: '0x3b1' # 2 bytes 'utf-8'
In [15]: '0x'+hex(945).removeprefix('0x').zfill(4)
Out[15]: '0x03b1'
In [16]: chr(945)
Out[16]: 'α'
In [17]: ord('α')
Out[17]: 945
In [18]: bin(97)
Out[18]: '0b1100001' # 1 byte 'utf-8'
In [19]: '0b'+bin(97).removeprefix('0b').zfill(8)
Out[19]: '0b01100001'
In [20]: hex(97)
Out[20]: '0x61' # 1 byte 'utf-8'
In [21]: '0x'+hex(97).removeprefix('0x').zfill(2)
Out[21]: '0x61'
In [22]: chr(97)
Out[22]: 'a'
In [23]: ord('a')
Out[23]: 97
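The byte counts noted in the comments above can be checked with the str method encode. This short sketch contrasts the code point returned by ord with the bytes produced by a utf-8 encoding:

```python
for character in ('a', 'α'):
    encoded = character.encode('utf-8')
    # ord gives the code point, encode gives the bytes used to store it.
    print(character, ord(character), encoded, len(encoded))
# a 97 b'a' 1
# α 945 b'\xce\xb1' 2
```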
The str
method translate
can use this translation table to convert characters from the Greek to the Latin alphabet:
In [24]: 'Γεια σου Κοσμο!'.translate(greek2latin)
Out[24]: 'Geia soy Kosmo!'
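maketrans also accepts a single dict, in which case each key must be one character but a value may be a longer string. The table below is a hypothetical mapping used only to illustrate this:

```python
# Hypothetical table: each key is one character, values may be longer.
digraphs = str.maketrans({'θ': 'th', 'χ': 'ch', 'ψ': 'ps'})

# Characters without an entry in the table are left unchanged.
print('ψυχη'.translate(digraphs))  # psυchη
```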
Recall that a static method is not bound to an instance or a class, but is merely found in the namespace of the class, as that is the expected place for the method to be found.
When the translation table was made, the two strings supplied had to contain an equal number of Unicode characters for 1 to 1 mapping. Sometimes it is desirable to create a translation table that removes characters entirely. In this case an empty string should be supplied for each of the first two positional arguments, and the characters that are to be mapped to None should be supplied as a third positional argument. Here these are the punctuation characters, which are available as string.punctuation
:
In [25]: remove_punctuation = str.maketrans('', '', string.punctuation)
In [26]: remove_punctuation_as_dict = dict(remove_punctuation)
remove_punctuation_as_dict | |
---|---|
Key | Value |
33 | None |
34 | None |
35 | None |
36 | None |
37 | None |
38 | None |
39 | None |
40 | None |
41 | None |
42 | None |
43 | None |
44 | None |
45 | None |
46 | None |
47 | None |
58 | None |
59 | None |
60 | None |
61 | None |
62 | None |
63 | None |
64 | None |
91 | None |
92 | None |
93 | None |
94 | None |
95 | None |
96 | None |
123 | None |
124 | None |
125 | None |
126 | None |
And this can be used to remove the punctuation, in combination with a casefold
and split
to get a list
of lowercase words:
In [27]: 'Γεια σου Κοσμο!'.translate(remove_punctuation).casefold().split()
Out[27]: ['γεια', 'σου', 'κοσμο']
This can be used to count the number of occurrences of each word using a collection
such as a Counter
and the top words can be examined:
In [28]: from collections import Counter
In [29]: Counter(['γεια', 'σου', 'κοσμο'])
Out[29]: Counter({'γεια': 1, 'σου': 1, 'κοσμο': 1})
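Putting the steps together, a slightly longer made-up sentence shows the counts more clearly. The sample text is hypothetical, and most_common is used to examine the top words:

```python
import string
from collections import Counter

remove_punctuation = str.maketrans('', '', string.punctuation)
# Hypothetical sample text used only for illustration.
text = 'Γεια σου Κοσμο! Γεια σου, γεια.'

words = text.translate(remove_punctuation).casefold().split()
word_counts = Counter(words)
print(word_counts.most_common(2))  # [('γεια', 3), ('σου', 2)]
```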
This is essentially the basis of many natural language processing problems. A natural language toolkit for English would typically filter out stop words:
In [30]: stop_words = ['a', 'an', 'the', 'at', 'by', 'for',
'in', 'of', 'on', 'to', 'he', 'she',
'it', 'they', 'we', 'you', 'I', 'me', 'my',
'your', 'and', 'but', 'or', 'so', 'yet', 'is',
'am', 'are', 'was', 'were', 'be', 'being', 'been',
'have', 'has', 'had', 'do', 'does', 'did', 'not',
'this', 'that', 'these', 'those', 'all', 'any',
'some', 'such'
]
And would usually examine words that carry sentiment:
In [31]: sentiment_dict = {'positive': ['happy', 'joyful', 'love',
'excellent', 'great', 'fantastic',
'amazing', 'wonderful', 'cheerful',
'positive'],
'negative': ['sad', 'hate', 'terrible', 'awful',
'bad', 'horrible', 'disappointing',
'angry', 'frustrated', 'negative'],
'neutral': ['okay', 'fine', 'average', 'normal',
'medium','fair', 'indifferent',
'moderate', 'tolerable', 'usual']}
A natural language problem would essentially take a piece of text and convert it into a number, for example a sentiment score that can be evaluated across a large number of product reviews.
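As a rough illustration of turning text into a number, the sketch below counts positive words minus negative words. The scoring rule, the function name sentiment_score and the shortened word lists are all assumptions made for this example and are far simpler than what a real toolkit would do:

```python
import string

# Hypothetical, deliberately tiny word lists for illustration only.
stop_words = {'a', 'the', 'is', 'it', 'was', 'and', 'but', 'this'}
positive_words = {'happy', 'love', 'great', 'excellent'}
negative_words = {'sad', 'hate', 'terrible', 'bad'}
remove_punctuation = str.maketrans('', '', string.punctuation)


def sentiment_score(review):
    """Count positive words minus negative words (a toy scoring rule)."""
    words = review.translate(remove_punctuation).casefold().split()
    words = [word for word in words if word not in stop_words]
    positives = sum(word in positive_words for word in words)
    negatives = sum(word in negative_words for word in words)
    return positives - negatives


print(sentiment_score('This is great, I love it!'))   # 2
print(sentiment_score('The delivery was terrible.'))  # -1
```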
Some additional translation tables may need to be created to remove accents from accented characters, which casefold
doesn't handle.
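As an alternative to writing such a translation table by hand, accents can often be removed with the standard unicodedata module. This is a different technique from maketrans: the text is decomposed so each accent becomes a separate combining mark, and those marks are then dropped. The function name remove_accents is hypothetical:

```python
import unicodedata


def remove_accents(text):
    """Drop combining marks after decomposing each character (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(char for char in decomposed
                   if not unicodedata.combining(char))


print(remove_accents('Γειά σου Κόσμε!'))  # Γεια σου Κοσμε!
```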
Python has a number of third-party natural language toolkits, which are out of the scope of this tutorial.
The str
class contains a number of simple methods which allow, for example, a substring to be found within a string. These are complemented by regular expressions. Consider the following str
instance text
:
In [32]: exit
In [1]: text = 'Email [email protected], [email protected] Telephone 0000000000 Website https://www.domain.com'
Notice it has two emails, a telephone number and a website, which you as a human can instantly recognise. Python has a regular expressions re
module and the purpose of this module is to create a pattern in the form of a regular expression and search within a string for this pattern:
In [2]: import re
In [3]: re.
# Available Identifiers for `re` module
# -------------------------------
# Available Identifiers for `re`:
# -------------------------------------
## Functions
# - `re.match(pattern, string)`
# - `re.search(pattern, string)`
# - `re.findall(pattern, string)`
# - `re.finditer(pattern, string)`
# - `re.sub(pattern, repl, string)`
# - `re.subn(pattern, repl, string)`
# - `re.split(pattern, string)`
# - `re.compile(pattern, flags=0)`
# - `re.escape(string)`
# - `re.fullmatch(pattern, string)`
# - `re.purge()`
## Flags
# - `re.IGNORECASE`
# - `re.I`
# - `re.MULTILINE`
# - `re.M`
# - `re.DOTALL`
# - `re.S`
# - `re.VERBOSE`
# - `re.X`
## Match Object Methods
# - `match.group([group])`
# - `match.groups()`
# - `match.start([group])`
# - `match.end([group])`
# - `match.span([group])`
# - `match.re`
# - `match.string`
# - `match.lastindex`
# - `match.lastgroup`
## Special Sequences
# - `\d` - Matches any decimal digit.
# - `\D` - Matches any non-digit character.
# - `\w` - Matches any alphanumeric character (and underscore).
# - `\W` - Matches any non-alphanumeric character.
# - `\s` - Matches any whitespace character.
# - `\S` - Matches any non-whitespace character.
# - `\b` - Matches a word boundary.
# - `\B` - Matches a non-word boundary.
## Character Classes
# - `[abc]` - Matches any character in the set.
# - `[^abc]` - Matches any character not in the set.
# - `[a-z]` - Matches any character in the range from a to z.
# - `.` - Matches any character except a newline.
## Groups
# - `(...)` - Capturing group.
# - `(?:...)` - Non-capturing group.
# - `(?P<name>...)` - Named capturing group.
# - `(?=...)` - Positive lookahead.
# - `(?!...)` - Negative lookahead.
# - `(?<=...)` - Positive lookbehind.
# - `(?<!...)` - Negative lookbehind.
Notice the use of \
for a special sequence. Because \
is also the escape character in a normal string, a pattern should be supplied as a raw string with the prefix r
. Lower case r
is preferred as many IDEs will apply syntax highlighting for regular expressions. If upper case R
is used, the raw string will still work as a regular expression but the IDE will just highlight it in the same way as a normal string:
In [3]: email_pattern = r'\b[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
: number_pattern = r'\b\d{10}\b'
: website_pattern = r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
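The reason the raw string prefix matters for patterns like these can be seen by comparing lengths. In a normal string '\b' is a single backspace character, whereas in a raw string it stays as a backslash followed by b, which is what the re module needs:

```python
# In a normal string '\b' is the single backspace character.
print(len('\b'), len(r'\b'))  # 1 2
# The raw string is equivalent to escaping each backslash by hand.
print(r'\b\d{10}\b' == '\\b\\d{10}\\b')  # True
```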
If the email is examined [email protected]
the pattern is r'\b[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
. This pattern can be broken down:
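- `\b` is the beginning of a word boundary.
- `[A-Za-z0-9._]+` is the local component of the email:
  - `A-Z` are the characters in string.ascii_uppercase
  - `a-z` are the characters in string.ascii_lowercase
  - `0-9` are the characters in string.digits
  - `._` are additional characters allowed in the local component
  - `+` denotes one or more characters
- `@` is the at symbol.
- `[A-Za-z0-9.-]+` is the domain name:
  - `A-Z` are the characters in string.ascii_uppercase
  - `a-z` are the characters in string.ascii_lowercase
  - `0-9` are the characters in string.digits
  - `.-` are additional characters allowed in the domain name
  - `+` denotes one or more characters
- `\.` is the dot. Note that `.` has a special meaning in a regular expression, so here it is escaped with a backslash.
- `[A-Z|a-z]{2,}` is the top level domain:
  - `A-Z` are the characters in string.ascii_uppercase
  - `a-z` are the characters in string.ascii_lowercase
  - `|` inside square brackets is a literal vertical bar rather than alternation, so `[A-Za-z]` would be the stricter form
  - `{2,}` denotes two or more characters
- `\b` is the end of a word boundary.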
If the number is examined 0000000000
the pattern is r'\b\d{10}\b'
. This pattern can be broken down:
- `\b` is the beginning of a word boundary.
- `\d{10}`:
  - `\d` is a decimal character
  - `{10}` denotes ten of them
- `\b` is the end of a word boundary.
If the website is examined https://www.domain.com
the pattern is r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
. This pattern can be broken down:
- `https?` is the Hypertext Transfer Protocol (Secured):
  - `http` is a literal
  - `s?` is optional, meaning `s` may or may not be present
- `://` is a literal (used to separate the protocol from the address).
- `(?:www\.)?`:
  - `(?:...)` creates a non-capturing group
  - `www` is a literal
  - `\.` is the dot, escaped with a backslash
  - `?` is optional, meaning `www.` may or may not be present
- `[A-Za-z0-9.-]+` is the domain (same as email).
- `\.[A-Z|a-z]{2,}` is the top level domain (same as email).
- `\b` is the end of a word boundary.
The regular expression function findall
will return a list
of pattern matches:
In [4]: re.findall(email_pattern, text)
Out[4]: ['[email protected]', '[email protected]']
In [5]: re.findall(number_pattern, text)
Out[5]: ['0000000000']
In [6]: re.findall(website_pattern, text)
Out[6]: ['https://www.domain.com']
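When the components of each match are needed rather than the whole match, a compiled pattern with named groups can be used with finditer. The sketch below is an assumption-laden variation on the email pattern: the sample text and addresses are made up, the group names local and domain are hypothetical, and the top level domain is written as [A-Za-z]:

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
sample = ('Email first.last@example.com, info@example.org '
          'Website https://www.example.com')

# Variation on the email pattern with named groups for each component.
email_regex = re.compile(
    r'\b(?P<local>[A-Za-z0-9._]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})\b')

for match in email_regex.finditer(sample):
    print(match.group('local'), match.group('domain'))
# first.last example.com
# info example.org
```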
The regular expressions module is very powerful and regular expressions can get quite complicated. A simple demonstration here was used just to show the concept.