zim.parsing

module documentation

This module contains utilities for parsing strings and text

Class	`Re`	Wrapper around regex pattern objects which remembers the last match object and gives list access to its capturing groups. See module re for regex docs.
Class	`TextBuffer`	No summary
Function	`escape_string`	Escape special characters with a backslash. Escapes newline, tab, backslash itself and any characters in `chars`.
Function	`link_type`	Function that returns a link type for urls and page links
Function	`normalize_win32_share`	No summary
Function	`parse_date`	Returns a tuple of (year, month, day) for a date string, or None if the string could not be parsed.
Function	`split_escaped_string`	Split string on `char` while respecting backslash escapes
Function	`split_quoted_strings`	Split a word list respecting quotes
Function	`unescape_quoted_string`	Removes quotes from a string and unescapes embedded quotes.
Function	`unescape_string`	Unescape backslash escapes in string. Recognizes `\n` and `\t` for newline and tab respectively, otherwise keeps the literal character.
Function	`uri_scheme`	Function that returns a scheme for URIs, URLs and email addresses
Function	`url_decode`	Replace url-encoding hex sequences with their proper characters.
Function	`url_encode`	Replaces non-standard characters in urls with hex codes.
Function	`valid_interwiki_key`	Undocumented
Constant	`URL_ENCODE_DATA`	Undocumented
Constant	`URL_ENCODE_PATH`	Undocumented
Constant	`URL_ENCODE_READABLE`	Undocumented
Variable	`is_email_re`	Undocumented
Variable	`is_interwiki_keyword_re`	Undocumented
Variable	`is_interwiki_re`	Undocumented
Variable	`is_path_re`	Undocumented
Variable	`is_uri_re`	Undocumented
Variable	`is_url_re`	Undocumented
Variable	`is_win32_path_re`	Undocumented
Variable	`is_win32_share_re`	Undocumented
Variable	`is_www_link_re`	Undocumented
Variable	`url_re`	Undocumented
Function	`_escape`	Undocumented
Function	`_unescape`	Undocumented
Function	`_url_decode`	Undocumented
Function	`_url_decode_bytes`	Undocumented
Function	`_url_encode`	Undocumented
Function	`_url_encode_on_error`	Undocumented
Function	`_url_encode_readable`	Undocumented
Variable	`_classes`	Undocumented
Variable	`_parse_date_re`	Undocumented
Variable	`_url_bytes_decode_re`	Undocumented
Variable	`_url_decode_ascii_re`	Undocumented
Variable	`_url_decode_unicode_bytes_re`	Undocumented
Variable	`_url_encode_path_re`	Undocumented
Variable	`_url_encode_re`	Undocumented

def escape_string(string, chars=''):

Escape special characters with a backslash. Escapes newline, tab, backslash itself and any characters in chars.
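The escaping rule described above can be illustrated with a minimal sketch. `escape_string_sketch` is a hypothetical stand-in written for this page, not zim's actual implementation; it only assumes the behavior stated in the description.

```python
def escape_string_sketch(string, chars=''):
    """Hypothetical stand-in for escape_string; not zim's actual code."""
    out = []
    for c in string:
        if c == '\n':
            out.append('\\n')        # newline becomes backslash + 'n'
        elif c == '\t':
            out.append('\\t')        # tab becomes backslash + 't'
        elif c == '\\' or c in chars:
            out.append('\\' + c)     # backslash itself and any extra chars
        else:
            out.append(c)
    return ''.join(out)


print(escape_string_sketch('key=a\tb', chars='='))  # prints: key\=a\tb
```

Processing the string in a single pass avoids escaping the backslashes that the function itself has just added.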

def link_type(link):

Function that returns a link type for urls and page links

def normalize_win32_share(path):

Translates paths for Windows shares into the platform-specific form. On Windows it translates smb:// URLs to the \\host\share form, and on all other platforms it does the reverse. The original path is returned unchanged if it is already in the right form, or if it is not a path to a share drive.

Parameters
path	a filesystem path or URL
Returns
the platform specific path or the original input path
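As a rough illustration of the translation described above, the sketch below converts between smb:// URLs and the UNC form `\\host\share`. The function name and the `on_windows` flag are hypothetical (the flag exists only to make the example testable on any platform), and details such as URL decoding, which the real function may handle, are left out.

```python
def normalize_share_sketch(path, on_windows):
    """Hypothetical stand-in for normalize_win32_share; not zim's code."""
    if on_windows:
        if path.startswith('smb://'):
            # smb://host/share/dir -> \\host\share\dir
            return '\\\\' + path[len('smb://'):].replace('/', '\\')
    elif path.startswith('\\\\'):
        # \\host\share\dir -> smb://host/share/dir
        return 'smb://' + path[2:].replace('\\', '/')
    # Already in the right form, or not a share path at all
    return path


print(normalize_share_sketch('smb://host/share/dir', on_windows=True))
print(normalize_share_sketch('\\\\host\\share\\dir', on_windows=False))
```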

def parse_date(string):

Returns a tuple of (year, month, day) for a date string, or None if the string could not be parsed. Currently supported formats:

dd?-mm?
dd?-mm?-yy
dd?-mm?-yyyy
yyyy-mm?-dd?

Where '-' can be replaced by any separator. Any preceding or trailing text will be ignored (so we can parse journal page names correctly).

TODO: Some setting to prefer US dates with mm-dd instead of dd-mm
TODO: More date formats?
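The formats listed above can be matched with a single regular expression, as in the sketch below. This is an illustration under stated assumptions rather than zim's implementation: the name `parse_date_sketch`, the `default_year` parameter (used when the year is missing), and the pivot that maps two-digit years below 50 to 20xx are all hypothetical choices made for the example.

```python
import datetime
import re

# 1-4 digits, a separator, 1-2 digits, optionally a separator and 1-4 digits.
# The lookarounds keep the match from starting or ending inside a longer number.
_DATE_RE = re.compile(r'(?<!\d)(\d{1,4})\D(\d{1,2})(?:\D(\d{1,4}))?(?!\d)')


def parse_date_sketch(string, default_year=None):
    """Hypothetical stand-in for parse_date; not zim's actual code."""
    match = _DATE_RE.search(string)  # search() ignores surrounding text
    if match is None:
        return None
    first, month, last = match.groups()
    if len(first) == 4:
        # yyyy-mm?-dd?
        if last is None:
            return None  # "yyyy-mm" is not one of the listed formats
        year, day = int(first), int(last)
    else:
        # dd?-mm?, dd?-mm?-yy or dd?-mm?-yyyy
        day = int(first)
        if last is None:
            year = default_year if default_year is not None \
                else datetime.date.today().year
        else:
            year = int(last)
            if len(last) <= 2:
                # Assumed pivot for two-digit years
                year += 2000 if year < 50 else 1900
    month = int(month)
    try:
        datetime.date(year, month, day)  # reject impossible dates
    except ValueError:
        return None
    return (year, month, day)


print(parse_date_sketch('Journal:2012-03-07'))  # prints: (2012, 3, 7)
print(parse_date_sketch('7/3/2012 notes'))      # prints: (2012, 3, 7)
```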

def split_escaped_string(string, char):

Split string on char while respecting backslash escapes
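A minimal sketch of escape-aware splitting is shown below. `split_escaped_sketch` is a hypothetical name, and the choice to leave the backslash sequences intact in the returned parts (so they can be unescaped afterwards) is an assumption, not documented behavior.

```python
def split_escaped_sketch(string, char):
    """Hypothetical stand-in for split_escaped_string; not zim's code."""
    parts, current = [], []
    chars = iter(string)
    for c in chars:
        if c == '\\':
            # Keep the backslash together with the character it escapes,
            # so an escaped separator never triggers a split.
            current.append(c + next(chars, ''))
        elif c == char:
            parts.append(''.join(current))
            current = []
        else:
            current.append(c)
    parts.append(''.join(current))
    return parts


print(split_escaped_sketch('a\\:b:c', ':'))
```

Here the first `:` is preceded by a backslash and stays inside the first part; only the second `:` splits the string.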

def split_quoted_strings(string, unescape=True, strict=True):

Split a word list respecting quotes

This function always expects full words to be quoted; if quotes appear in the middle of a word, they are considered word boundaries.

(The XDG Desktop Entry spec says full words must be quoted and quotes in a word escaped, but doesn't specify what to do with loose quotes in a string.)

Also, a comma "," is handled specially and is always considered a word on its own.

Parameters
string	string to split in words
unescape	if `True` quotes are removed, else they are left in place
strict	if `True` unmatched quotes will cause a `ValueError` to be raised, if `False` unmatched quotes are ignored.
Returns
list of strings
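The rules above (quoted words, quotes as word boundaries, a comma as its own word, and the `strict` switch) can be sketched with a small tokenizer. This is an illustration only: `split_quoted_sketch` is a hypothetical name, and both the handling of backslash escapes inside quotes and the way a stray quote is skipped in non-strict mode are assumptions.

```python
import re

_WORD_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"  |   # double-quoted word, backslash escapes allowed
    '(?:\\.|[^'\\])*'  |   # single-quoted word
    ,                  |   # a comma is always a word on its own
    [^\s,'"]+              # run of unquoted, non-space characters
''', re.VERBOSE)


def split_quoted_sketch(string, unescape=True, strict=True):
    """Hypothetical stand-in for split_quoted_strings; not zim's code."""
    words = []
    pos = 0
    while pos < len(string):
        if string[pos].isspace():
            pos += 1
            continue
        match = _WORD_RE.match(string, pos)
        if match is None:
            # Only a quote without a closing partner can fail to match here
            if strict:
                raise ValueError('Unmatched quote')
            pos += 1  # non-strict: skip the stray quote (assumption)
            continue
        word = match.group(0)
        if unescape and word[0] in '"\'':
            # Drop the surrounding quotes and unescape embedded characters
            word = re.sub(r'\\(.)', r'\1', word[1:-1])
        words.append(word)
        pos = match.end()
    return words


print(split_quoted_sketch('foo "bar baz", qux'))
# prints: ['foo', 'bar baz', ',', 'qux']
```

Because quote characters are excluded from the unquoted-word pattern, a quote in the middle of a word ends that word, which matches the word-boundary rule described above.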

def unescape_quoted_string(string):

Removes quotes from a string and unescapes embedded quotes

Returns
string

def unescape_string(string):

Unescape backslash escapes in string. Recognizes \n and \t for newline and tab respectively, otherwise keeps the literal character.
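The unescaping rule can be sketched as follows. `unescape_string_sketch` is a hypothetical stand-in, not zim's implementation, and dropping a lone trailing backslash is an assumption made for the example.

```python
def unescape_string_sketch(string):
    """Hypothetical stand-in for unescape_string; not zim's actual code."""
    out = []
    chars = iter(string)
    for c in chars:
        if c == '\\':
            # Look at the escaped character; a lone trailing backslash
            # yields '' and is dropped in this sketch (assumption).
            nxt = next(chars, '')
            out.append({'n': '\n', 't': '\t'}.get(nxt, nxt))
        else:
            out.append(c)
    return ''.join(out)


print(unescape_string_sketch('key\\=value'))  # prints: key=value
```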

def uri_scheme(link):

Function that returns a scheme for URIs, URLs and email addresses

def url_decode(url, mode=URL_ENCODE_PATH):

Replace url-encoding hex sequences with their proper characters.

Mode can be:

URL_ENCODE_DATA: decode all chars
URL_ENCODE_PATH: same as URL_ENCODE_DATA
URL_ENCODE_READABLE: decode only whitespace and unicode characters

The mode URL_ENCODE_READABLE will not decode any other characters, so URLs decoded with this mode can still contain escape sequences. They are safe to use within zim, but should be re-encoded with URL_ENCODE_READABLE before handing them to an external program.

This method will only decode non-ASCII byte codes when the _whole_ byte equivalent of the URL is valid UTF-8. Otherwise it is assumed the encoding was done in another format, and the decoding fails silently for these byte sequences.
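The UTF-8 rule in the last paragraph can be illustrated with the standard library. The sketch below is not zim's implementation: `url_decode_sketch` is a hypothetical name, it ignores the `mode` argument, and the fallback (decoding only the ASCII-range escapes) is one possible reading of "fails silently".

```python
import re
from urllib.parse import unquote_to_bytes

# %00 through %7F, i.e. escapes for ASCII bytes only
_ASCII_ESCAPE_RE = re.compile(r'%([0-7][0-9A-Fa-f])')


def url_decode_sketch(url):
    """Hypothetical stand-in for url_decode; not zim's actual code."""
    raw = unquote_to_bytes(url)
    try:
        # Succeeds only if the whole decoded byte string is valid UTF-8
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Some other encoding was used: leave non-ASCII escapes untouched
        return _ASCII_ESCAPE_RE.sub(
            lambda m: chr(int(m.group(1), 16)), url)


print(url_decode_sketch('caf%C3%A9%20bar'))  # prints: café bar
print(url_decode_sketch('caf%E9%20bar'))     # prints: caf%E9 bar
```

In the second call `%E9` is a Latin-1 byte, not valid UTF-8, so it is left encoded while the ASCII escape `%20` is still decoded.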

def url_encode(url, mode=URL_ENCODE_PATH):

Replaces non-standard characters in urls with hex codes.

Mode can be:

URL_ENCODE_DATA: encode all un-safe chars
URL_ENCODE_PATH: encode all un-safe chars except '/'
URL_ENCODE_READABLE: encode whitespace and all unicode characters

The mode URL_ENCODE_READABLE can be applied to urls that are already encoded because it does not touch the "%" character. The modes URL_ENCODE_DATA and URL_ENCODE_PATH can only be applied to strings that are known not to be encoded.

The encoded URL is a string containing only ASCII characters.
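The difference between the three modes can be approximated with `urllib.parse.quote` from the standard library. This is a sketch under assumptions, not zim's implementation: the `Mode` enum is a hypothetical stand-in for the `URL_ENCODE_*` constants (whose values are not shown here), and the set of characters zim treats as safe may differ from the one `quote` uses.

```python
import re
from enum import Enum
from urllib.parse import quote


class Mode(Enum):
    """Hypothetical stand-in for the URL_ENCODE_* constants."""
    DATA = 'data'
    PATH = 'path'
    READABLE = 'readable'


# Whitespace and every non-ASCII character
_READABLE_RE = re.compile(r'[\s\u0080-\U0010ffff]')


def url_encode_sketch(url, mode=Mode.PATH):
    """Hypothetical stand-in for url_encode; not zim's actual code."""
    if mode is Mode.DATA:
        return quote(url, safe='')     # encode everything unsafe, '/' too
    if mode is Mode.PATH:
        return quote(url, safe='/')    # same, but keep path separators
    # READABLE: leave '%' and other ASCII punctuation untouched
    return _READABLE_RE.sub(lambda m: quote(m.group(0), safe=''), url)


print(url_encode_sketch('a b/c?d', Mode.DATA))      # prints: a%20b%2Fc%3Fd
print(url_encode_sketch('a b/c?d', Mode.PATH))      # prints: a%20b/c%3Fd
print(url_encode_sketch('a b/c?d', Mode.READABLE))  # prints: a%20b/c?d
```

Because the readable mode never touches "%", applying it to an already encoded URL does not double-encode the existing escape sequences.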

def valid_interwiki_key(name):

Undocumented

URL_ENCODE_DATA: int

Undocumented

URL_ENCODE_PATH: int

Undocumented

URL_ENCODE_READABLE: int

Undocumented

is_email_re

Undocumented

is_interwiki_keyword_re

Undocumented

is_interwiki_re

Undocumented

is_path_re

Undocumented

is_uri_re

Undocumented

is_url_re

Undocumented

is_win32_path_re

Undocumented

is_win32_share_re

Undocumented

is_www_link_re

Undocumented

url_re

Undocumented

def _escape(match):

Undocumented

def _unescape(match):

Undocumented

def _url_decode(match):

Undocumented

def _url_decode_bytes(match):

Undocumented

def _url_encode(match):

Undocumented

def _url_encode_on_error(error):

Undocumented

def _url_encode_readable(match):

Undocumented

_classes: dict[str, str]

Undocumented

_parse_date_re

Undocumented

_url_bytes_decode_re

Undocumented

_url_decode_ascii_re

Undocumented

_url_decode_unicode_bytes_re

Undocumented

_url_encode_path_re

Undocumented

_url_encode_re

Undocumented