Author: guardlan (亚修)
Board: PHP
Title: [Question] This is probably a special character... how do I filter it out?
Time: Thu Dec 29 01:29:54 2011
# Here is a small piece of my logging code
$string = 'the string to be logged';
$string = trim(preg_replace('/\s+/', ' ', preg_replace('/[\f\r\n\t]+/s', ' ', clear_unicode_spaces(clear_invisible_unicode($string)))));
$logger->log($string); # log function
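I watch the log with tail -f.
Originally I only used preg_replace to turn \r\n and the other line-break characters into spaces,
but sometimes I still see records broken across lines, like in the screenshot below...
http://guardlan.myweb.hinet.net/tail.png
When I open the log in vim instead, I see a light-blue > symbol...
http://guardlan.myweb.hinet.net/vim.png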

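I kept assuming it was a special character, so I found clear_invisible_unicode and clear_unicode_spaces online, two functions written specifically for filtering strings, and used them.
But I still see this > symbol in the log.
What exactly is it, and how can I filter it out...
I have pasted clear_invisible_unicode and clear_unicode_spaces below.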
function clear_invisible_unicode($input){
$invisible = array(
"\0",
"\xc2\xad", # 'SOFT HYPHEN' (U+00AD)
"\xcc\xb7", # 'COMBINING SHORT SOLIDUS OVERLAY' (U+0337)
"\xcc\xb8", # 'COMBINING LONG SOLIDUS OVERLAY' (U+0338)
"\xcd\x8f", # 'COMBINING GRAPHEME JOINER' (U+034F)
"\xe1\x85\x9f", # 'HANGUL CHOSEONG FILLER' (U+115F)
"\xe1\x85\xa0", # 'HANGUL JUNGSEONG FILLER' (U+1160)
"\xe2\x80\x8b", # 'ZERO WIDTH SPACE' (U+200B)
"\xe2\x80\x8c", # 'ZERO WIDTH NON-JOINER' (U+200C)
"\xe2\x80\x8d", # 'ZERO WIDTH JOINER' (U+200D)
"\xe2\x80\x8e", # 'LEFT-TO-RIGHT MARK' (U+200E)
"\xe2\x80\x8f", # 'RIGHT-TO-LEFT MARK' (U+200F)
"\xe2\x80\xaa", # 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
"\xe2\x80\xab", # 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
"\xe2\x80\xac", # 'POP DIRECTIONAL FORMATTING' (U+202C)
"\xe2\x80\xad", # 'LEFT-TO-RIGHT OVERRIDE' (U+202D)
"\xe2\x80\xae", # 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
"\xe3\x85\xa4", # 'HANGUL FILLER' (U+3164)
"\xef\xbb\xbf", # 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)
"\xef\xbe\xa0", # 'HALFWIDTH HANGUL FILLER' (U+FFA0)
"\xef\xbf\xb9", # 'INTERLINEAR ANNOTATION ANCHOR' (U+FFF9)
"\xef\xbf\xba", # 'INTERLINEAR ANNOTATION SEPARATOR' (U+FFFA)
"\xef\xbf\xbb", # 'INTERLINEAR ANNOTATION TERMINATOR' (U+FFFB)
);
return str_replace($invisible, '', $input);
}
function clear_unicode_spaces($input){
$spaces = array(
"\x9", # 'CHARACTER TABULATION' (U+0009)
//"\xa", # 'LINE FEED (LF)' (U+000A)
"\xb", # 'LINE TABULATION' (U+000B)
"\xc", # 'FORM FEED (FF)' (U+000C)
//"\xd", # 'CARRIAGE RETURN (CR)' (U+000D)
"\x20", # 'SPACE' (U+0020)
"\xc2\xa0", # 'NO-BREAK SPACE' (U+00A0)
"\xe1\x9a\x80", # 'OGHAM SPACE MARK' (U+1680)
"\xe1\xa0\x8e", # 'MONGOLIAN VOWEL SEPARATOR' (U+180E)
"\xe2\x80\x80", # 'EN QUAD' (U+2000)
"\xe2\x80\x81", # 'EM QUAD' (U+2001)
"\xe2\x80\x82", # 'EN SPACE' (U+2002)
"\xe2\x80\x83", # 'EM SPACE' (U+2003)
"\xe2\x80\x84", # 'THREE-PER-EM SPACE' (U+2004)
"\xe2\x80\x85", # 'FOUR-PER-EM SPACE' (U+2005)
"\xe2\x80\x86", # 'SIX-PER-EM SPACE' (U+2006)
"\xe2\x80\x87", # 'FIGURE SPACE' (U+2007)
"\xe2\x80\x88", # 'PUNCTUATION SPACE' (U+2008)
"\xe2\x80\x89", # 'THIN SPACE' (U+2009)
"\xe2\x80\x8a", # 'HAIR SPACE' (U+200A)
"\xe2\x80\xa8", # 'LINE SEPARATOR' (U+2028)
"\xe2\x80\xa9", # 'PARAGRAPH SEPARATOR' (U+2029)
"\xe2\x80\xaf", # 'NARROW NO-BREAK SPACE' (U+202F)
"\xe2\x81\x9f", # 'MEDIUM MATHEMATICAL SPACE' (U+205F)
"\xe3\x80\x80", # 'IDEOGRAPHIC SPACE' (U+3000)
);
return str_replace($spaces, ' ', $input);
}
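For comparison, here is a minimal sketch of a different approach: letting PCRE Unicode property classes do the matching instead of hand-written byte lists. The name normalize_log_line is hypothetical and not from the original post, and the behavior is not identical to the two functions above (for example, it does not touch the Hangul fillers or the combining solidus overlays, and it turns NUL into a space instead of deleting it). It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of the original post): squeeze a string
// onto a single log line using Unicode general categories.
function normalize_log_line($input)
{
    // Remove format characters (category Cf), e.g. U+200B ZERO WIDTH SPACE,
    // the bidi marks and U+FEFF. With the /u modifier, preg_replace returns
    // null if $input is not valid UTF-8.
    $s = preg_replace('/\p{Cf}+/u', '', $input);
    if ($s === null) {
        return null;
    }

    // Replace each run of separators (category Z: spaces, U+2028, U+2029)
    // and control characters (category Cc: \r, \n, \t, ...) with one space.
    $s = preg_replace('/[\p{Z}\p{Cc}]+/u', ' ', $s);

    return trim($s);
}
```

A call such as normalize_log_line($string) could stand in for the nested expression in the logging snippet, if that difference in coverage is acceptable.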
--
※ Origin: 批踢踢实业坊 (ptt.cc)
◆ From: 111.240.54.65
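1F: push LPH66: Can you upload part of the file? (Don't paste it; find some hosting and upload it) 12/29 03:46
2F: push mervynW: Without converting it to ASCII codes there is no way to tell what it is 12/29 09:44
3F: → poi987poi987: Try filtering it with str_replace(chr($ascii), "", $input) 12/29 18:57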
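Thanks for the hints...
I looked up the ASCII codes for that part of the log and found that the "special character" is empty... there is no character there at all...
Then I found that it seems to be caused by PieTTY misdetecting the window boundary...
After shrinking the window, the > showed up on a different line =.=
How ridiculous... XDDD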
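
The byte check described above can be sketched roughly like this. hex_bytes is a hypothetical helper name, not something from the thread; it just prints every byte of a string as two hex digits, so anything that is not ordinary text stands out.

```php
<?php
// Hypothetical helper (not from the thread): render each byte of $s
// as two lowercase hex digits, separated by spaces.
function hex_bytes($s)
{
    // bin2hex gives one long hex string; str_split cuts it into byte pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}
```

If the dump of a suspicious log line shows only the expected bytes, the odd symbol is probably coming from the terminal or the editor's line wrapping and not from the data, which is what happened here.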
※ Edited: guardlan, from: 111.240.54.65 (12/30 00:26)