module documentation

This module contains utilities for parsing strings and text.
Class Re Wrapper around regex pattern objects which remembers the last match object and gives list access to its capturing groups. See module re for regex docs.
Class TextBuffer No summary
Function escape_string Escape special characters with a backslash. Escapes newline, tab, the backslash itself and any characters in chars.
Function link_type Function that returns a link type for urls and page links
Function normalize_win32_share Translates paths for Windows shares into the platform-specific form.
Function parse_date Returns a tuple of (year, month, day) for a date string, or None if the string could not be parsed.
Function split_escaped_string Split string on char while respecting backslash escapes
Function split_quoted_strings Split a word list respecting quotes
Function unescape_quoted_string Removes quotes from a string and unescapes embedded quotes
Function unescape_string Unescape backslash escapes in string. Recognizes \n and \t for newline and tab respectively, otherwise keeps the literal character.
Function uri_scheme Function that returns a scheme for URIs, URLs and email addresses
Function url_decode Replace url-encoding hex sequences with their proper characters.
Function url_encode Replaces non-standard characters in urls with hex codes.
Function valid_interwiki_key Undocumented
Constant URL_ENCODE_DATA Undocumented
Constant URL_ENCODE_PATH Undocumented
Constant URL_ENCODE_READABLE Undocumented
Variable is_email_re Undocumented
Variable is_interwiki_keyword_re Undocumented
Variable is_interwiki_re Undocumented
Variable is_path_re Undocumented
Variable is_uri_re Undocumented
Variable is_url_re Undocumented
Variable is_win32_path_re Undocumented
Variable is_win32_share_re Undocumented
Variable is_www_link_re Undocumented
Variable url_re Undocumented
Function _escape Undocumented
Function _unescape Undocumented
Function _url_decode Undocumented
Function _url_decode_bytes Undocumented
Function _url_encode Undocumented
Function _url_encode_on_error Undocumented
Function _url_encode_readable Undocumented
Variable _classes Undocumented
Variable _parse_date_re Undocumented
Variable _url_bytes_decode_re Undocumented
Variable _url_decode_ascii_re Undocumented
Variable _url_decode_unicode_bytes_re Undocumented
Variable _url_encode_path_re Undocumented
Variable _url_encode_re Undocumented
def escape_string(string, chars=''): (source)
Escape special characters with a backslash. Escapes newline, tab, the backslash itself and any characters in chars.
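The escaping rule described above can be illustrated with a minimal sketch. The function name escape_string_sketch is hypothetical and this is not the module's own implementation; it only assumes that newline and tab become \n and \t, and that the backslash and every character in chars get a backslash prefix.

```python
def escape_string_sketch(string, chars=''):
    # Hypothetical illustration, not the module's code.
    special = {'\n': 'n', '\t': 't'}
    out = []
    for c in string:
        if c in special:
            # Newline and tab become the two-character sequences \n and \t
            out.append('\\' + special[c])
        elif c == '\\' or c in chars:
            # The backslash itself and any caller-supplied characters
            out.append('\\' + c)
        else:
            out.append(c)
    return ''.join(out)
```

For example, calling the sketch with chars=':' would escape every colon in addition to the default characters.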
def link_type(link): (source)
Function that returns a link type for urls and page links
def normalize_win32_share(path): (source)
Translates paths for Windows shares into the platform-specific form. On Windows it translates smb:// URLs to the \\host\share form, and vice versa on all other platforms. Returns the original path unchanged if it is already in the right form, or if it is not a path for a share drive.
Parameters
path: a filesystem path or URL
Returns
the platform specific path or the original input path
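The translation between smb:// URLs and UNC-style share paths might look roughly like the sketch below. It is an assumption-laden illustration: the name normalize_share_sketch and the explicit on_windows flag are hypothetical (the documented function takes only path and presumably detects the platform itself), and URL-encoding of the path is ignored here.

```python
import re

# Hypothetical patterns for this sketch; not the module's is_win32_share_re
_SMB_RE = re.compile(r'^smb://([^/]+)/(.+)$')
_UNC_RE = re.compile(r'^\\\\([^\\]+)\\(.+)$')

def normalize_share_sketch(path, on_windows):
    if on_windows:
        m = _SMB_RE.match(path)
        if m:
            # smb://host/share/dir -> \\host\share\dir
            return '\\\\' + m.group(1) + '\\' + m.group(2).replace('/', '\\')
    else:
        m = _UNC_RE.match(path)
        if m:
            # \\host\share\dir -> smb://host/share/dir
            return 'smb://' + m.group(1) + '/' + m.group(2).replace('\\', '/')
    # Already in the right form, or not a share path at all
    return path
```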
def parse_date(string): (source)

Returns a tuple of (year, month, day) for a date string, or None if the string could not be parsed. Currently supported formats:

  • dd?-mm?
  • dd?-mm?-yy
  • dd?-mm?-yyyy
  • yyyy-mm?-dd?

Where '-' can be replaced by any separator. Any preceding or trailing text will be ignored (so we can parse journal page names correctly).

TODO: Some setting to prefer US dates with mm-dd instead of dd-mm

TODO: More date formats?
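As an illustration of the matching idea, here is a minimal sketch that handles only the two formats with a four-digit year (yyyy-mm?-dd? and dd?-mm?-yyyy). The name parse_date_sketch and both patterns are hypothetical; how the real function resolves a missing or two-digit year is not stated here, so those formats are left out, and validating the result with datetime.date is an assumption of this sketch.

```python
import datetime
import re

# Hypothetical patterns: any single non-digit character acts as separator,
# and surrounding text is ignored because search() is used instead of match()
_YMD_RE = re.compile(r'(?<!\d)(\d{4})\D(\d{1,2})\D(\d{1,2})(?!\d)')
_DMY_RE = re.compile(r'(?<!\d)(\d{1,2})\D(\d{1,2})\D(\d{4})(?!\d)')

def parse_date_sketch(string):
    m = _YMD_RE.search(string)
    if m:
        year, month, day = (int(g) for g in m.groups())
    else:
        m = _DMY_RE.search(string)
        if not m:
            return None
        day, month, year = (int(g) for g in m.groups())
    try:
        # Reject impossible dates such as 31 February
        datetime.date(year, month, day)
    except ValueError:
        return None
    return (year, month, day)
```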

def split_escaped_string(string, char): (source)
Split string on char while respecting backslash escapes
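Splitting while respecting backslash escapes can be sketched as a single pass over the characters. The name split_escaped_sketch is hypothetical, and the sketch assumes the escape sequences are left in place in the returned parts; whether the real function removes them is not stated here.

```python
def split_escaped_sketch(string, char):
    # Hypothetical illustration, not the module's code.
    parts, current = [], []
    chars = iter(string)
    for c in chars:
        if c == '\\':
            # Keep the backslash and the character it protects together,
            # so an escaped separator never triggers a split
            current.append(c)
            current.append(next(chars, ''))
        elif c == char:
            parts.append(''.join(current))
            current = []
        else:
            current.append(c)
    parts.append(''.join(current))
    return parts
```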
def split_quoted_strings(string, unescape=True, strict=True): (source)

Split a word list respecting quotes

This function always expects full words to be quoted; even if quotes appear in the middle of a word, they are considered word boundaries.

(The XDG Desktop Entry spec says full words must be quoted and quotes in a word escaped, but doesn't specify what to do with loose quotes in a string.)

Also, a comma "," is handled specially and is always considered a word on its own.

Parameters
string: string to split in words
unescape: if True quotes are removed, else they are left in place
strict: if True unmatched quotes will cause a ValueError to be raised, if False unmatched quotes are ignored.
Returns
list of strings
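The word-splitting rules above could be approximated with a regex tokenizer, as in the sketch below. The name split_quoted_sketch and the pattern are hypothetical; the sketch has no strict parameter and simply skips unmatched quotes, which corresponds only loosely to strict=False.

```python
import re

# Hypothetical tokenizer for this sketch, not the module's own pattern
_WORD_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"    # double-quoted word, may contain backslash escapes
  | '(?:\\.|[^'\\])*'    # single-quoted word
  | ,                    # a comma is always a word on its own
  | [^\s,"']+            # bare word; ends at whitespace, comma or quote
''', re.VERBOSE)

def split_quoted_sketch(string, unescape=True):
    words = _WORD_RE.findall(string)
    if not unescape:
        return words
    result = []
    for word in words:
        if word[0] in '"\'':
            # Drop the surrounding quotes, then resolve backslash escapes
            word = re.sub(r'\\(.)', r'\1', word[1:-1])
        result.append(word)
    return result
```

Because a bare word ends at any quote character, a quote in the middle of a word acts as a word boundary in this sketch as well.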
def unescape_quoted_string(string): (source)
Removes quotes from a string and unescapes embedded quotes
Returns
string
def unescape_string(string): (source)
Unescape backslash escapes in string. Recognizes \n and \t for newline and tab respectively, otherwise keeps the literal character.
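A minimal sketch of this unescaping rule follows. The name unescape_string_sketch is hypothetical, and dropping a lone trailing backslash is an assumption of the sketch rather than documented behavior.

```python
def unescape_string_sketch(string):
    # Hypothetical illustration, not the module's code.
    known = {'n': '\n', 't': '\t'}
    out = []
    chars = iter(string)
    for c in chars:
        if c == '\\':
            # \n and \t map to newline and tab; any other escaped
            # character is kept literally (a trailing backslash is dropped)
            nxt = next(chars, '')
            out.append(known.get(nxt, nxt))
        else:
            out.append(c)
    return ''.join(out)
```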
def uri_scheme(link): (source)
Function that returns a scheme for URIs, URLs and email addresses
def url_decode(url, mode=URL_ENCODE_PATH): (source)

Replace url-encoding hex sequences with their proper characters.

Mode can be:

  • URL_ENCODE_DATA: decode all chars
  • URL_ENCODE_PATH: same as URL_ENCODE_DATA
  • URL_ENCODE_READABLE: decode only whitespace and unicode characters

The mode URL_ENCODE_READABLE will not decode any other characters, so urls decoded with this mode can still contain escape sequences. They are safe to use within zim, but should be re-encoded with URL_ENCODE_READABLE before handing them to an external program.

This method will only decode non-ASCII byte codes when the whole byte equivalent of the URL is valid UTF-8. Otherwise it is assumed that the encoding was done in another format, and the decoding fails silently for these byte sequences.
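The "decode only if valid UTF-8, otherwise fail silently" behavior can be illustrated with the standard library. This sketch is a simplification, not the module's implementation: the name url_decode_sketch is hypothetical, it ignores the mode argument, and it returns the whole input unchanged when any byte sequence is not valid UTF-8.

```python
from urllib.parse import unquote

def url_decode_sketch(url):
    # Hypothetical illustration built on urllib, not the module's code.
    try:
        # errors='strict' makes invalid UTF-8 raise instead of being replaced
        return unquote(url, errors='strict')
    except UnicodeDecodeError:
        # Probably encoded in another charset: leave the escapes alone
        return url
```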

def url_encode(url, mode=URL_ENCODE_PATH): (source)

Replaces non-standard characters in urls with hex codes.

Mode can be:

  • URL_ENCODE_DATA: encode all unsafe characters
  • URL_ENCODE_PATH: encode all unsafe characters except '/'
  • URL_ENCODE_READABLE: encode whitespace and all unicode characters

The mode URL_ENCODE_READABLE can be applied to urls that are already encoded because it does not touch the "%" character. The modes URL_ENCODE_DATA and URL_ENCODE_PATH can only be applied to strings that are known not to be encoded.

The encoded URL is a string containing only ASCII characters.
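The difference between the modes, and the reason only the readable mode is safe on already encoded urls, can be illustrated with the standard library. Both function names below are hypothetical, and the exact set of characters the real function treats as unsafe may differ from what urllib.parse.quote uses.

```python
import re
from urllib.parse import quote

def url_encode_sketch(url, keep_slash=True):
    # Roughly URL_ENCODE_PATH (keep_slash=True) or URL_ENCODE_DATA (False).
    # The "%" character is encoded too, so encoded input gets double-encoded.
    return quote(url, safe='/' if keep_slash else '')

def url_encode_readable_sketch(url):
    # Roughly URL_ENCODE_READABLE: only whitespace, control characters and
    # non-ASCII characters are replaced, so existing %XX escapes survive
    return re.sub(r'[^\x21-\x7e]', lambda m: quote(m.group(0)), url)
```

In this sketch, url_encode_sketch('a%20b') turns the existing escape into 'a%2520b', while url_encode_readable_sketch leaves it untouched.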

def valid_interwiki_key(name): (source)

Undocumented

URL_ENCODE_DATA: int = (source)

Undocumented

Value
0
URL_ENCODE_PATH: int = (source)

Undocumented

Value
1
URL_ENCODE_READABLE: int = (source)

Undocumented

Value
2
is_email_re = (source)

Undocumented

is_interwiki_keyword_re = (source)

Undocumented

is_interwiki_re = (source)

Undocumented

is_path_re = (source)

Undocumented

is_uri_re = (source)

Undocumented

is_url_re = (source)

Undocumented

is_win32_path_re = (source)

Undocumented

is_win32_share_re = (source)

Undocumented

is_www_link_re = (source)

Undocumented

url_re = (source)

Undocumented

def _escape(match): (source)

Undocumented

def _unescape(match): (source)

Undocumented

def _url_decode(match): (source)

Undocumented

def _url_decode_bytes(match): (source)

Undocumented

def _url_encode(match): (source)

Undocumented

def _url_encode_on_error(error): (source)

Undocumented

def _url_encode_readable(match): (source)

Undocumented

_classes: dict[str, str] = (source)

Undocumented

_parse_date_re = (source)

Undocumented

_url_bytes_decode_re = (source)

Undocumented

_url_decode_ascii_re = (source)

Undocumented

_url_decode_unicode_bytes_re = (source)

Undocumented

_url_encode_path_re = (source)

Undocumented

_url_encode_re = (source)

Undocumented