o -åÏ_öã@s0ddlZddlZddlmZGdd„deƒZdS)éNé)Ú ProbingStatec@sneZdZdZddd„Zdd„Zedd„ƒZd d „Zed d „ƒZ d d„Z e dd„ƒZ e dd„ƒZ e dd„ƒZdS)Ú CharSetProbergffffffî?NcCsd|_||_t t¡|_dS©N)Ú_stateÚ lang_filterÚloggingÚ getLoggerÚ__name__Úlogger)Úselfr©r ú7/usr/lib/python3/dist-packages/chardet/charsetprober.pyÚ__init__'szCharSetProber.__init__cCs tj|_dSr)rÚ DETECTINGr©r r r rÚreset,s zCharSetProber.resetcCódSrr rr r rÚ charset_name/szCharSetProber.charset_namecCrrr )r Úbufr r rÚfeed3ózCharSetProber.feedcCs|jSr)rrr r rÚstate6szCharSetProber.statecCsdS)Ngr rr r rÚget_confidence:rzCharSetProber.get_confidencecCst dd|¡}|S)Ns([-])+ó )ÚreÚsub)rr r rÚfilter_high_byte_only=sz#CharSetProber.filter_high_byte_onlycCs\tƒ}t d|¡}|D] }| |dd…¡|dd…}| ¡s&|dkr&d}| |¡q |S)u9 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [€-ÿ] marker: everything else [^a-zA-Z€-ÿ] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. s%[a-zA-Z]*[€-ÿ]+[a-zA-Z]*[^a-zA-Z€-ÿ]?Néÿÿÿÿó€r)Ú bytearrayrÚfindallÚextendÚisalpha)rÚfilteredÚwordsÚwordÚ last_charr r rÚfilter_international_wordsBsÿ  z(CharSetProber.filter_international_wordscCs¤tƒ}d}d}tt|ƒƒD]7}|||d…}|dkrd}n|dkr$d}|dkrD| ¡sD||kr@|s@| |||…¡| d¡|d}q |sP| ||d …¡|S) aÈ Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. Also retains English alphabet and high byte characters immediately before occurrences of >. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. Frró>ós
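The docstring embedded above for `filter_international_words` describes three byte classes (alphabet, international, marker) and a filter that keeps only words containing at least one international byte. The following is a minimal sketch of that behaviour, pieced together from the docstring and the names visible in the bytecode (`re.findall`, `bytearray`, `extend`, `isalpha`). It is written as a standalone function rather than a static method, and the exact regular expression is an assumption based on the byte ranges the docstring lists, so it should be read as an illustration, not as the library's source.

```python
import re

# Assumed pattern, following the docstring's byte classes:
#   alphabet:      [a-zA-Z]
#   international: [\x80-\xFF]
#   marker:        everything else
# A "word" is optional letters, one or more international bytes,
# optional letters, then at most one trailing marker byte.
_INTERNATIONAL_WORD = re.compile(
    rb"[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?"
)


def filter_international_words(buf: bytes) -> bytearray:
    """Keep only words that contain at least one high (>= 0x80) byte.

    Each trailing marker byte is replaced by a single ASCII space.
    """
    filtered = bytearray()
    for word in _INTERNATIONAL_WORD.findall(buf):
        # Everything except the last byte is copied through unchanged.
        filtered.extend(word[:-1])
        last_char = word[-1:]
        # bytes.isalpha() is ASCII-only, so a high byte is not "alpha";
        # the second test keeps high bytes and rewrites only markers.
        if not last_char.isalpha() and last_char < b"\x80":
            last_char = b" "
        filtered.extend(last_char)
    return filtered


if __name__ == "__main__":
    sample = b"hello caf\xe9 world na\xefve!"
    print(filter_international_words(sample))
```

In the sample, `hello` and `world` are pure ASCII and are dropped, while the two words that contain a high byte are kept and their trailing markers become spaces.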