Find complete words in a string that may or may not be surrounded by "_"

2

I'm passing two lists to a class, DictionaryMaker() . It matches the words in each item of one list against the items of the other list, in order to pair the items up as key and value.

The class and the method work for me, but to make the matching more precise I wanted to use regular expressions and re.compile to refine the matches further.

  • I still don't understand the syntax very well. I've read a lot about it, and I think the problem is that I'm trying something like:

    (r'\b({})\b'.format(i))
    
  • This checks the word by its boundaries, so that I only detect 'foo' and not 'football' . The problem is that in the other item being checked, the word sits between underscores: '_foo_' .

    I know I should apply \w* (I BELIEVE), but I do not know how.

  • What I need is a match even when the word is found between underscores, so that 'foo' matches '_foo_' .

How could I achieve it?

import re

chekingList = [u'Hitch_neck_01_proxy', u'Hitch_head_proxy', u'Hitch_chest_proxy', 
           u'Hitch_spine_04_proxy',u'Hitch_spine_03_proxy', u'Hitch_spine_02_proxy',
           u'Hitch_upperarm_r_proxy', u'Hitch_lowerarm_r_proxy', u'Hitch_upperarm_l_proxy',
           u'Hitch_lowerarm_l_proxy', u'Hitch_hips_proxy', u'Hitch_upperleg_l_proxy',
           u'Hitch_lowerleg_l_proxy', u'Hitch_upperleg_r_proxy', u'Hitch_lowerleg_r_proxy',
           u'Hitch_foot_l_proxy', u'Hitch_toes_l_proxy', u'Hitch_foot_r_proxy', 
           u'Hitch_toes_r_proxy', u'Hitch_hand_l_proxy','nestor_colt_02_nes','maria_perez_04_vie',
           'juan_carlos_lara_curso','referendum_julio_jodido']

checkerList = [u'suck_neck_01_target', u'suck_head_target', u'suck_chest_target', 
           u'suck_spine_04_target',u'suck_spine_03_target', u'suck_spine_02_target',
           u'suck_upperarm_r_target', u'suck_lowerarm_r_target', u'suck_upperarm_l_target',
           u'suck_lowerarm_l_target', u'suck_hips_target', u'suck_upperleg_l_target',
           u'suck_lowerleg_l_target', u'suck_upperleg_r_target', u'suck_lowerleg_r_target',
           u'suck_foot_l_target', u'suck_toes_l_target', u'suck_foot_r_target', 
           u'suck_toes_r_target', u'suck_hand_l_target',]


class DictionaryMaker:

    # __INIT__
    def __init__(self,listA=None,listB=None):
        self.listA = listA
        self.listB = listB

    # Pass first the list that you want as KEYS,
    # then the list that you want as VALUES.
    # fixedValue is the TOLERANCE (minimum number of matching words).

    def Match(self,listA,listB,fixedValue=2):

        dictionary = {}
        for x,y in [(x,y) for x in listA for y in listB]:    
            def BreakWord(x):
                counter = 0
                list2Check = x.split("_")
                for i in list2Check:
                    find = re.compile(r'\b({})\b'.format(i))
                    if find.search(y):
                        print ("it Match")
                        counter += 1
                    else:
                        print ("NOT MATCH")    

                return counter

            counter = BreakWord(x)
            print (counter)
            if counter >= fixedValue:                
                dictionary[y] = x

        # print the dictionary Created for debugging
        for k,v in dictionary.items():
            print ("{} < -- is key from : ---- >> {}".format(k,v))
        print ("            ")
        return dictionary

dict = DictionaryMaker()
DicForTestResult = dict.Match(chekingList,checkerList)
    
asked by Nestor Colt 28.06.2017 at 18:54

1 answer

8

The '_' is part of the word

\b matches at word boundaries . That is, at a position that has a word character on one side and no word character on the other.

The word characters (or \w ) are [a-zA-Z0-9_] . As you can see, the _ is also included, and it is considered part of the same word.

So, to solve it, you have to modify the expression a bit so that it also matches in that case.
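
A minimal sketch of the behaviour described above, using only the standard re module. The sample strings are illustrative and not taken from the question's lists:

```python
import re

# "_" is a word character, so there is no word boundary between "_" and "f".
plain = re.compile(r'\bfoo\b')

print(bool(plain.search('a foo b')))   # True: spaces are not word characters
print(bool(plain.search('football')))  # False: "foo" is followed by "t"
print(bool(plain.search('_foo_')))     # False: "_" counts as part of the word
```

This is why the original pattern finds 'foo' but never '_foo_' .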


Regular expression

Some options for you to choose:

  • Matches \bfoo\b or _foo_

    r'\bfoo\b|_foo_'
    
    • This also matches 'aaa_foo_zzz' .
    • Demo: regex101

  • The same as before, but requiring that _foo_ is not surrounded by a word character.

    r'\bfoo\b|\b_foo_\b'
    
  • Matches foo if it has \b or _ on either side.

    r'(?:\b|_)foo(?:\b|_)'
    
    • This would also match 'aaa_foo_zzz' , or 'foo_' , or 'aaa_foo.' .
    • I would use this expression for your examples.
    • Demo: regex101

  • Matches foo or _foo_ as a complete word, as in the second case, but written as a single expression.

    r'\b(_?)foo\1\b'
    

  • Description of the constructions used

    For the options above, the following were used:

    • | - Works as alternation . It is the same as an OR , and has one of the lowest precedences in regex. That is, something like ^aaaaa|bbb$ is interpreted as ^aaaaa or bbb$ (note that ^ applies only to the first, and $ only to the second).

      The expression r'\bfoo\b|_foo_' can be thought of as the union of 2 alternative expressions.

    • (?: ... ) - It's a non-capturing group . It serves just that purpose: grouping a construction.

      In the case of r'(?:\b|_)foo(?:\b|_)' we are using it so that the | only applies to those 2 options (and not to the whole regex).

      That is, (?:\b|_) matches at a word boundary position, or matches a _ .

    • ( ... ) - It's also a group, but a capturing group . It saves the text it matched in memory. That way, we can reference it later in the expression.

      In the case of r'\b(_?)foo\1\b' we are optionally matching a _ (the ? makes it optional). So it matches either _ or nothing.

      As it is the first (and only) group we use, by referring to it as \1 we are making it match the same thing again: a _ if there was one, or nothing if there was not.
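
    To compare the four options side by side, here is a small sketch. The sample strings and the names patterns , samples and results are made up for illustration, and the fourth pattern is written with the \1 backreference that the description of the capturing group implies:

```python
import re

# The four options from the list above (option 4 assumes the \1 backreference).
patterns = {
    'option 1': r'\bfoo\b|_foo_',
    'option 2': r'\bfoo\b|\b_foo_\b',
    'option 3': r'(?:\b|_)foo(?:\b|_)',
    'option 4': r'\b(_?)foo\1\b',
}
samples = ['foo', '_foo_', 'aaa_foo_zzz', 'foo_', 'football']

# For each option, keep the samples in which the pattern is found.
results = {
    name: [s for s in samples if re.search(rx, s)]
    for name, rx in patterns.items()
}
for name, matched in results.items():
    print(name, matched)
# option 1 ['foo', '_foo_', 'aaa_foo_zzz']
# option 2 ['foo', '_foo_']
# option 3 ['foo', '_foo_', 'aaa_foo_zzz', 'foo_']
# option 4 ['foo', '_foo_']
```

    None of them matches 'football' , and only the third one accepts a word that has \b on one side and _ on the other.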


    Code

    Demo of the code with _ :
    link

    Now, you may be interested in not distinguishing upper and lower case letters . This is set by passing re.I (or re.IGNORECASE ).

    re.compile(r'(?:\b|_){}(?:\b|_)'.format(i), re.IGNORECASE)
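
    As a rough illustration of the flag, the sketch below wraps that call in a hypothetical helper, word_regex (not part of the original code). The re.escape call is an extra precaution I am assuming here, so that a word containing regex metacharacters is still treated literally:

```python
import re

def word_regex(word, ignore_case=True):
    # Hypothetical helper: builds the (?:\b|_)word(?:\b|_) pattern for one word.
    # re.escape keeps any special characters in `word` literal.
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)), flags)

print(bool(word_regex('hitch').search('Hitch_neck_01_proxy')))         # True
print(bool(word_regex('hitch', False).search('Hitch_neck_01_proxy')))  # False
```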
    


    On the other hand, you are comparing each word within each item separately. For example, with 'Hitch_neck_01_proxy' you are building one regular expression for 'Hitch' , another for 'neck' , another for '01' , and another for 'proxy' . That could be made more efficient.


    All in one regular expression

    re.compile( r'(?:\b|_)(?:Hitch|neck|01|proxy)(?=\b|_)', re.I)
    

    and we call re.findall() .

    At the end, instead of using (?:\b|_) , we now use (?=\b|_) , which is a positive lookahead that does not consume a character. Thus, if the word is followed by _ , the next item can also match.
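
    The difference can be seen in a short sketch with two of the words from the question's lists. The names words , consuming , lookahead and target are chosen here only for illustration:

```python
import re

words = '(?:neck|01)'
# Trailing group consumes the "_" after the word.
consuming = re.compile(r'(?:\b|_){}(?:\b|_)'.format(words), re.I)
# Trailing lookahead only checks for "_" or a boundary, without consuming it.
lookahead = re.compile(r'(?:\b|_){}(?=\b|_)'.format(words), re.I)

target = 'suck_neck_01_target'
print(consuming.findall(target))  # ['_neck_']
print(lookahead.findall(target))  # ['_neck', '_01']
```

    With the consuming version, the _ after 'neck' is already used up, so '01' has neither a boundary nor a _ available in front of it and is missed.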

    for x,y in [(x,y) for x in listA for y in listB]:
        def BreakWord(x):
            list2Check = x.split("_")
    
            pattern2Check = '(?:' + '|'.join(list2Check) + ')'
            regex = r'(?:\b|_){}(?=\b|_)'.format(pattern2Check)
    
            find = re.compile(regex, re.IGNORECASE)
            resultado = find.findall(y)
    
            if resultado:
                print (r"r'{}'  MATCHES    '{}'".format(regex,y))
            else:
                print (r"r'{}'  DOES NOT MATCH '{}'".format(regex,y))
    
            return len(resultado)
    

    Demo: link
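
    Putting the pieces together, a self-contained sketch could look like the one below. It is not the original class: count_shared_words and match_names are hypothetical names, re.escape is an added precaution, and the short lists are a subset of the ones in the question. Like the original Match method, it stores the item of the second list as the dictionary key.

```python
import re

def count_shared_words(item, other):
    """Count how many "_"-separated words of `item` are found in `other`."""
    words = [re.escape(w) for w in item.split('_') if w]
    if not words:
        return 0
    # One alternation for all the words, with a lookahead at the end.
    pattern = r'(?:\b|_)(?:{})(?=\b|_)'.format('|'.join(words))
    return len(re.findall(pattern, other, re.IGNORECASE))

def match_names(list_a, list_b, tolerance=2):
    """Pair items that share at least `tolerance` words (list_b item -> list_a item)."""
    pairs = {}
    for x in list_a:
        for y in list_b:
            if count_shared_words(x, y) >= tolerance:
                pairs[y] = x
    return pairs

list_a = ['Hitch_neck_01_proxy', 'Hitch_head_proxy', 'Hitch_spine_04_proxy']
list_b = ['suck_neck_01_target', 'suck_head_target', 'suck_spine_04_target']
print(match_names(list_a, list_b))
# {'suck_neck_01_target': 'Hitch_neck_01_proxy',
#  'suck_spine_04_target': 'Hitch_spine_04_proxy'}
```

    Note that with a tolerance of 2, 'Hitch_head_proxy' is not paired with 'suck_head_target' , because they only share the word 'head' .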

        
    answered by 28.06.2017 / 19:14