
regex: Elisp mechanism for converting PCRE regexps to Emacs regexps

November 24, 2024 12lyj

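I admit to a strong bias: I like PCRE regexps much better than Emacs regexps, if for no other reason than that when I type '(' I almost always want a grouping operator. And, of course, \w and similar are far more convenient than the other equivalents.

But it would be crazy to expect to change the internals of Emacs, of course. Still, I would think it should be possible to convert a PCRE expression to an Emacs expression, doing all the needed conversions, so that I can write: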

(defun my-super-regexp-function ...
   ;; Use re-search-forward: search-forward would treat its argument as a literal string.
   (re-search-forward (pcre-convert "__\\w: \\d+")))

(or similar).

Does anyone know of an elisp library that can do this?
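
To make the syntax gap concrete, here is a minimal sketch using only the built-in `string-match-p`; the sample strings are made up for illustration. In Emacs regexps a bare `(` matches a literal parenthesis, and grouping and alternation need backslashes, which must be doubled again inside a Lisp string.

```elisp
;; Emacs regexp syntax: "(" and "|" are literal; grouping is "\\(" ... "\\)".
;; All sample strings below are illustrative only.

;; Bare parens are literal, so this looks for the text "(ab)" itself.
(string-match-p "(ab)" "xx(ab)")              ; => 2
(string-match-p "(ab)" "abab")                ; => nil

;; Grouping and alternation need backslashes (doubled in a Lisp string).
(string-match-p "\\(ab\\)+c" "ababc")         ; => 0
(string-match-p "\\(?:cat\\|dog\\)s" "dogs")  ; => 0

;; PCRE's \d is usually written as a bracket expression in Emacs.
(string-match-p "[0-9]+" "abc123")            ; => 3
(string-match-p "[[:digit:]]+" "abc123")      ; => 3
```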

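Edit: Selecting a response from the answers below...

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into both types of solution.

In the end, it looks like both the exec-a-script and the straight elisp versions of the solution would work, but from a pure speed and "correctness" standpoint the elisp version is certainly the one people would prefer (myself included).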

Please refer to the following approach:

https://github.com/joddie/pcre2el is the up-to-date version of this answer.

pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

  • convert Emacs syntax to PCRE
  • convert either syntax to rx, an S-expression based regexp syntax
  • untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
  • show the complete list of strings (productions) matching a regexp, provided the list is finite
  • provide live font-locking of regexp syntax (so far only for Elisp buffers – other modes on the TODO list)
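
As a rough sketch of how this could cover the use case from the question: assuming pcre2el is installed and that it provides `rxt-pcre-to-elisp`, a thin wrapper can translate the pattern before searching. The wrapper name `my-pcre-search-forward` is hypothetical and not part of pcre2el, and the `require` is guarded so the snippet still loads without the package. The last form uses the built-in `rx` macro, the S-expression syntax mentioned in the list above.

```elisp
;; Sketch only: assumes the pcre2el package provides `rxt-pcre-to-elisp'.
;; `my-pcre-search-forward' is a hypothetical helper, not part of pcre2el.
(defun my-pcre-search-forward (pcre &optional bound noerror count)
  "Search forward from point for PCRE, translated to Emacs regexp syntax."
  (unless (require 'pcre2el nil t)
    (error "pcre2el is not installed"))
  (re-search-forward (rxt-pcre-to-elisp pcre) bound noerror count))

;; The built-in `rx' macro needs no extra package. This form expresses
;; roughly the same pattern as the PCRE "__\w: \d+" from the question.
(string-match-p (rx "__" wordchar ": " (+ digit)) "x __a: 42") ; => 2
```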


The original answer's text follows...

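Here's a quick and ugly Emacs Lisp solution (EDIT: now located more permanently here). It is based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:
  • parenthesis grouping ( .. )
  • alternation |
  • numerical repeats {M,N}
  • string quoting \Q .. \E
  • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
  • character classes: \d, \D, \h, \H, \s, \S, \v, \V
  • \w and \W left as they are (using Emacs' own idea of word and non-word characters)

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. When a character class includes something like \D, this is done by converting it into a non-capturing group with alternation.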
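
The character-class case can be illustrated without the converter. Inside an Emacs bracket expression a backslash is an ordinary character, so a PCRE class such as `[0-3\D]` has no one-to-one translation, and splitting it into a non-capturing group is one way around that. The Emacs regexp below is written by hand to show the idea for a non-negated class; it is not captured output, and `my-class-example` is a hypothetical name.

```elisp
;; PCRE:  [0-3\D]   (a digit from 0 to 3, or any non-digit)
;; Hand-written Emacs equivalent: a shy group with alternation.
;; `my-class-example' is a hypothetical name used only for this illustration.
(defvar my-class-example "\\(?:[0-3]\\|[^0-9]\\)")

(string-match-p my-class-example "2")   ; => 0   (in 0-3)
(string-match-p my-class-example "x")   ; => 0   (non-digit)
(string-match-p my-class-example "7")   ; => nil (digit outside 0-3)

;; Copying the PCRE class verbatim does not work, because a backslash
;; inside an Emacs bracket expression is an ordinary character:
(string-match-p "[0-3\\D]" "\\")        ; => 0   (matches a literal backslash)
(string-match-p "[0-3\\D]" "x")         ; => nil ("x" is not in the set)
```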

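It passes the tests I wrote for it, but there are certainly bugs, and the method of scanning token by token is probably slow. In other words, no warranty. But perhaps it will do the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)
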
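
The author's tests are not shown, but checks of this kind might look like the following sketch. It assumes the listing below has already been evaluated so that `pcre-to-elisp` is defined, and the guard makes it a no-op otherwise. The expected strings were traced by hand from the listing, so treat them as a sketch and not as the author's test suite.

```elisp
;; Assumes `pcre-to-elisp' from the listing below is already defined.
;; Expected values are hand-traced, not taken from the author's tests.
(when (fboundp 'pcre-to-elisp)
  (list
   ;; The example from the question: \w is kept, \d becomes [0-9].
   (equal (pcre-to-elisp "__\\w: \\d+") "__\\w: [0-9]+")
   ;; Grouping, alternation and counted repeats gain backslashes.
   (equal (pcre-to-elisp "(a|b){2,3}") "\\(a\\|b\\)\\{2,3\\}")
   ;; \Q .. \E is passed through `regexp-quote'.
   (equal (pcre-to-elisp "\\Qa.b\\E") "a\\.b")
   ;; \d inside a character class expands to the range 0-9.
   (equal (pcre-to-elisp "[\\d_]+") "[0-9_]+")))
```

Each `equal` form should evaluate to `t` if the converter behaves as traced.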
    (eval-when-compile (require 'cl)) 
     
    (defvar pcre-horizontal-whitespace-chars 
      (mapconcat 'char-to-string 
                 '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003 
                          #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F 
                          #x205F #x3000) 
                 "")) 
     
    (defvar pcre-vertical-whitespace-chars 
      (mapconcat 'char-to-string 
                 '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) "")) 
     
    (defvar pcre-whitespace-chars 
      (mapconcat 'char-to-string '(9 10 12 13 32) "")) 
     
    (defvar pcre-horizontal-whitespace 
      (concat "[" pcre-horizontal-whitespace-chars "]")) 
     
    (defvar pcre-non-horizontal-whitespace 
      (concat "[^" pcre-horizontal-whitespace-chars "]")) 
     
    (defvar pcre-vertical-whitespace 
      (concat "[" pcre-vertical-whitespace-chars "]")) 
     
    (defvar pcre-non-vertical-whitespace 
      (concat "[^" pcre-vertical-whitespace-chars "]")) 
     
    (defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]")) 
     
    (defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]")) 
     
    (eval-when-compile 
      (defmacro pcre-token-case (&rest cases) 
        "Consume a token at point and evaluate corresponding forms. 
     
    CASES is a list of `cond'-like clauses, (REGEXP FORMS 
    ...). Considering CASES in order, if the text at point matches 
    REGEXP then moves point over the matched string and returns the 
    value of FORMS. Returns `nil' if none of the CASES matches." 
        (declare (debug (&rest (sexp &rest form)))) 
        `(cond 
          ,@(mapcar 
             (lambda (case) 
               (let ((token (car case)) 
                     (action (cdr case))) 
                 `((looking-at ,token) 
                   (goto-char (match-end 0)) 
                   ,@action))) 
             cases) 
          (t nil)))) 
     
    (defun pcre-to-elisp (pcre) 
      "Convert PCRE, a regexp in PCRE notation, into Elisp string form." 
      (with-temp-buffer 
        (insert pcre) 
        (goto-char (point-min)) 
        (let ((capture-count 0) (accum '()) 
              (case-fold-search nil)) 
          (while (not (eobp)) 
            (let ((translated 
                   (or 
                    ;; Handle tokens that are treated the same in 
                    ;; character classes 
                    (pcre-re-or-class-token-to-elisp)    
     
                    ;; Other tokens 
                    (pcre-token-case 
                     ("|" "\\|") 
                     ("(" (incf capture-count) "\\(") 
                     (")" "\\)") 
                     ("{" "\\{") 
                     ("}" "\\}") 
     
                     ;; Character class 
                     ("\\[" (pcre-char-class-to-elisp)) 
     
                     ;; Backslash + digits => backreference or octal char? 
                     ("\\\\\\([0-9]+\\)" 
                      (let* ((digits (match-string 1)) 
                             (dec (string-to-number digits))) 
                        ;; from "man pcrepattern": If the number is 
                        ;; less than 10, or if there have been at 
                        ;; least that many previous capturing left 
                        ;; parentheses in the expression, the entire 
                        ;; sequence is taken as a back reference.    
                        (cond ((< dec 10) (concat "\\" digits)) 
                              ((>= capture-count dec) 
                               (error "backreference \\%s can't be used in Emacs regexps" 
                                      digits)) 
                              (t 
                               ;; from "man pcrepattern": if the 
                               ;; decimal number is greater than 9 and 
                               ;; there have not been that many 
                               ;; capturing subpatterns, PCRE re-reads 
                               ;; up to three octal digits following 
                               ;; the backslash, and uses them to 
                               ;; generate a data character. Any 
                               ;; subsequent digits stand for 
                               ;; themselves. 
                               (goto-char (match-beginning 1)) 
                               (re-search-forward "[0-7]\\{0,3\\}") 
                               (char-to-string (string-to-number (match-string 0) 8)))))) 
     
                     ;; Regexp quoting. 
                     ("\\\\Q" 
                      (let ((beginning (point))) 
                        (search-forward "\\E") 
                        (regexp-quote (buffer-substring beginning (match-beginning 0))))) 
     
                     ;; Various character classes 
                     ("\\\\d" "[0-9]") 
                     ("\\\\D" "[^0-9]") 
                     ("\\\\h" pcre-horizontal-whitespace) 
                     ("\\\\H" pcre-non-horizontal-whitespace) 
                     ("\\\\s" pcre-whitespace) 
                     ("\\\\S" pcre-non-whitespace) 
                     ("\\\\v" pcre-vertical-whitespace) 
                     ("\\\\V" pcre-non-vertical-whitespace) 
     
                     ;; Use Emacs' native notion of word characters 
                     ("\\\\[Ww]" (match-string 0)) 
     
                     ;; Any other escaped character 
                     ("\\\\\\(.\\)" (regexp-quote (match-string 1))) 
     
                     ;; Any normal character 
                     ("." (match-string 0)))))) 
              (push translated accum))) 
          (apply 'concat (reverse accum))))) 
     
    (defun pcre-re-or-class-token-to-elisp () 
      "Consume the PCRE token at point and return its Elisp equivalent. 
     
    Handles only tokens which have the same meaning in character 
    classes as outside them." 
      (pcre-token-case 
       ("\\\\a" (char-to-string #x07))  ; bell 
       ("\\\\c\\(.\\)"                  ; control character 
        (char-to-string 
         (- (string-to-char (upcase (match-string 1))) 64))) 
       ("\\\\e" (char-to-string #x1b))  ; escape 
       ("\\\\f" (char-to-string #x0c))  ; formfeed 
       ("\\\\n" (char-to-string #x0a))  ; linefeed 
       ("\\\\r" (char-to-string #x0d))  ; carriage return 
       ("\\\\t" (char-to-string #x09))  ; tab 
       ("\\\\x\\([A-Za-z0-9]\\{2\\}\\)" 
        (char-to-string (string-to-number (match-string 1) 16))) 
       ("\\\\x{\\([A-Za-z0-9]*\\)}" 
        (char-to-string (string-to-number (match-string 1) 16))))) 
     
    (defun pcre-char-class-to-elisp () 
      "Consume the remaining PCRE character class at point and return its Elisp equivalent. 
     
    Point should be after the opening \"[\" when this is called, and 
    will be just after the closing \"]\" when it returns." 
      (let ((accum '("[")) 
            (pcre-char-class-alternatives '()) 
            (negated nil)) 
        (when (looking-at "\\^") 
          (setq negated t) 
          (push "^" accum) 
          (forward-char)) 
        (when (looking-at "\\]") (push "]" accum) (forward-char)) 
     
        (while (not (looking-at "\\]")) 
          (let ((translated 
                 (or 
                  (pcre-re-or-class-token-to-elisp) 
                  (pcre-token-case               
                   ;; Backslash + digits => always an octal char 
                   ("\\\\\\([0-7]\\{1,3\\}\\)"     
                    (char-to-string (string-to-number (match-string 1) 8))) 
     
                   ;; Various character classes. To implement negative char classes, 
                   ;; we cons them onto the list `pcre-char-class-alternatives' and 
                   ;; transform the char class into a shy group with alternation 
                   ("\\\\d" "0-9") 
                   ("\\\\D" (push (if negated "[0-9]" "[^0-9]") 
                                  pcre-char-class-alternatives) "") 
                   ("\\\\h" pcre-horizontal-whitespace-chars) 
                   ("\\\\H" (push (if negated 
                                      pcre-horizontal-whitespace 
                                    pcre-non-horizontal-whitespace) 
                                  pcre-char-class-alternatives) "") 
                   ("\\\\s" pcre-whitespace-chars) 
                   ("\\\\S" (push (if negated 
                                      pcre-whitespace 
                                    pcre-non-whitespace) 
                                  pcre-char-class-alternatives) "") 
                   ("\\\\v" pcre-vertical-whitespace-chars) 
                   ("\\\\V" (push (if negated 
                                      pcre-vertical-whitespace 
                                    pcre-non-vertical-whitespace) 
                                  pcre-char-class-alternatives) "") 
                   ("\\\\w" (push (if negated "\\W" "\\w")  
                                  pcre-char-class-alternatives) "") 
                   ("\\\\W" (push (if negated "\\w" "\\W")  
                                  pcre-char-class-alternatives) "") 
     
                   ;; Leave POSIX syntax unchanged 
                   ("\\[:[a-z]*:\\]" (match-string 0)) 
     
                   ;; Ignore other escapes 
                   ("\\\\\\(.\\)" (match-string 0)) 
     
                   ;; Copy everything else 
                   ("." (match-string 0)))))) 
            (push translated accum))) 
        (push "]" accum) 
        (forward-char) 
        (let ((class 
               (apply 'concat (reverse accum)))) 
          (when (or (equal class "[]") 
                    (equal class "[^]")) 
            (setq class "")) 
          (if (not pcre-char-class-alternatives) 
              class 
            (concat "\\(?:" 
                    class "\\|" 
                    (mapconcat 'identity 
                               pcre-char-class-alternatives 
                               "\\|") 
                    "\\)")))))