Find complete words in a string that may or may not be surrounded by "_"

Question

I'm passing two lists to a class DictionaryMaker() . It matches the words in each item of one list against the items of the other list, in order to pair the items up as key and value.

The class and the method work for me, but to make the result more precise I wanted to use RegEx and re.compile to further refine the matches.

I still don't fully understand the syntax. I've read a lot, and I think the problem is that I'm trying something like:
```
(r'\b({})\b'.format(i))
```
This checks the word at its boundaries, so that only 'foo' is detected and not 'football' . The problem is that in the other item being checked, the word sits between underscores: '_foo_' .

I know I should apply \w* (I think), but I don't know how.
The thing is, I need a match even when the word is found between underscores -> 'foo' should match '_foo_' .

How could I achieve it?

```
import re

chekingList = [u'Hitch_neck_01_proxy', u'Hitch_head_proxy', u'Hitch_chest_proxy', 
           u'Hitch_spine_04_proxy',u'Hitch_spine_03_proxy', u'Hitch_spine_02_proxy',
           u'Hitch_upperarm_r_proxy', u'Hitch_lowerarm_r_proxy', u'Hitch_upperarm_l_proxy',
           u'Hitch_lowerarm_l_proxy', u'Hitch_hips_proxy', u'Hitch_upperleg_l_proxy',
           u'Hitch_lowerleg_l_proxy', u'Hitch_upperleg_r_proxy', u'Hitch_lowerleg_r_proxy',
           u'Hitch_foot_l_proxy', u'Hitch_toes_l_proxy', u'Hitch_foot_r_proxy', 
           u'Hitch_toes_r_proxy', u'Hitch_hand_l_proxy','nestor_colt_02_nes','maria_perez_04_vie',
           'juan_carlos_lara_curso','referendum_julio_jodido']

checkerList = [u'suck_neck_01_target', u'suck_head_target', u'suck_chest_target', 
           u'suck_spine_04_target',u'suck_spine_03_target', u'suck_spine_02_target',
           u'suck_upperarm_r_target', u'suck_lowerarm_r_target', u'suck_upperarm_l_target',
           u'suck_lowerarm_l_target', u'suck_hips_target', u'suck_upperleg_l_target',
           u'suck_lowerleg_l_target', u'suck_upperleg_r_target', u'suck_lowerleg_r_target',
           u'suck_foot_l_target', u'suck_toes_l_target', u'suck_foot_r_target', 
           u'suck_toes_r_target', u'suck_hand_l_target',]


class DictionaryMaker:

    # __INIT__
    def __init__(self,listA=None,listB=None):
        self.listA = listA
        self.listB = listB

    # Must Pass first the list what you want as KEYS 
    # Then pass the list that you want as VALUES
    # It Has FIXEDVALUE for TOLERANCE

    def Match(self,listA,listB,fixedValue=2):

        dictionary = {}
        for x,y in [(x,y) for x in listA for y in listB]:    
            def BreakWord(x):
                counter = 0
                list2Check = x.split("_")
                for i in list2Check:
                    find = re.compile(r'\b({})\b'.format(i))
                    if find.search(y):
                        print ("it Match")
                        counter += 1
                    else:
                        print ("NOT MATCH")    

                return counter

            counter = BreakWord(x)
            print counter
            if counter >= fixedValue:                
                dictionary[y] = x

        # print the dictionary Created for debugging
        for k,v in dictionary.items():
            print ("{} < -- is key from : ---- >> {}".format(k,v))
        print "            "
        return dictionary

dict = DictionaryMaker()
DicForTestResult = dict.Match(chekingList,checkerList)
```

python regex python-2.7

asked by Nestor Colt 28.06.2017 в 16:54


1 answer


score 8 · Accepted Answer

The '_' is part of the word

\b matches word boundaries: a position that has a word character on one side and no word character on the other.

The word characters (\w ) are [a-zA-Z0-9_] . As you can see, _ is included, so it is considered part of the same word.

So, to solve it, you have to modify the expression slightly so that it also accepts the underscores.
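
To see why the original pattern fails, here is a minimal sketch using the 'foo' example from the question. The sample strings are made up for illustration only.

```python
import re

# Pattern from the question: 'foo' delimited by word boundaries.
whole_word = re.compile(r'\bfoo\b')

# Hypothetical sample strings, chosen only for illustration.
samples = ['a foo b', 'football', '_foo_', 'suck_foo_target']

# True where the pattern finds a match, False otherwise.
results = {s: bool(whole_word.search(s)) for s in samples}

for s in samples:
    print('{!r}: {}'.format(s, results[s]))
```

Only 'a foo b' matches. In '_foo_' there is no boundary between '_' and 'f' , because both are word characters.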

Regular expression

Some options for you to choose:

Match either \bfoo\b or _foo_

r'\bfoo\b|_foo_'

This also matches 'aaa_foo_zzz' .
Demo: regex101

The same as before, but requiring that _foo_ is not adjacent to another word character.

r'\bfoo\b|\b_foo_\b'

Demo: regex101

Match foo if it has \b or _ on each side.

r'(?:\b|_)foo(?:\b|_)'

This would also match 'aaa_foo_zzz' , or 'foo_' , or 'aaa_foo.' .
Judging by your examples, this is the expression I would use.
Demo: regex101

Match foo or _foo_ as a complete word, as in the second case, but written as a single expression.

r'\b(_?)foo\1\b'

Demo: regex101
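
As a rough comparison, the sketch below runs the four options against a few sample strings. The labels A to D and the sample list are only for illustration. The fourth pattern is written with the backreference \1 , on the assumption that this is what the description of that option intends.

```python
import re

# The four options, in the order they are listed above.
patterns = {
    'A': r'\bfoo\b|_foo_',
    'B': r'\bfoo\b|\b_foo_\b',
    'C': r'(?:\b|_)foo(?:\b|_)',
    'D': r'\b(_?)foo\1\b',  # assumes the backreference form
}

# Hypothetical sample strings, chosen only for illustration.
samples = ['foo', '_foo_', 'aaa_foo_zzz', 'foo_', 'aaa_foo.', 'football']

def matching_samples(regex):
    """Return the samples in which the regex finds a match."""
    return [s for s in samples if re.search(regex, s)]

for name in sorted(patterns):
    print('{}: {}'.format(name, matching_samples(patterns[name])))
```

B and D accept only 'foo' and '_foo_' , A additionally accepts 'aaa_foo_zzz' , and C is the most permissive. None of them accepts 'football' .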

Description of the constructions used

The options above use the following constructions:

| - Alternation. It works like an OR and has one of the lowest precedences in regex. That is, something like ^aaaaa|bbb$ is interpreted as ^aaaaa or bbb$ (note that ^ applies only to the first alternative, and $ only to the second).

The expression r'\bfoo\b|_foo_' can be thought of as the union of 2 alternative expressions.

(?: ... ) - A group. It serves exactly that purpose: to group a construction.

In the case of r'(?:\b|_)foo(?:\b|_)' we use it so that the | applies only to those 2 options (and not to the whole regex).

That is, (?:\b|_) matches either a word-boundary position or a _ .

( ... ) - Also a group, but a capturing group. It saves the text it matched in memory, so that we can reference it later in the expression.

In the case of r'\b(_?)foo\1\b' we optionally match a _ (the ? makes it optional), so the group matches either _ or nothing.

As it is the first (and only) group we use, referring to it as \1 makes the expression match that same text again: a _ if there was one, or nothing if there was not.
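
The precedence of | and the effect of both kinds of group can be checked with a short sketch. The variable names and test strings are illustrative, and the last pattern assumes the capturing group is referenced later with \1 .

```python
import re

# Without a group, ^ binds only to 'aaaaa' and $ only to 'bbb'.
loose = re.compile(r'^aaaaa|bbb$')
# With a non-capturing group, both anchors apply to either alternative.
strict = re.compile(r'^(?:aaaaa|bbb)$')

loose_hits = [bool(loose.search(s)) for s in ('aaaaa-tail', 'head-bbb')]
strict_hits = [bool(strict.search(s)) for s in ('aaaaa-tail', 'head-bbb', 'bbb')]

# A capturing group stores what it matched; \1 must then repeat that text.
paired = re.compile(r'\b(_?)foo\1\b')
captured = [paired.search(s).group(1) for s in ('foo', '_foo_')]

print(loose_hits)
print(strict_hits)
print(captured)
```

The group captures an empty string for 'foo' and '_' for '_foo_' , which is why \1 demands a closing _ only when there was an opening one.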

Code

Demo of the code with r'(?:\b|_)foo(?:\b|_)' :
link

Now, you may also want to ignore the difference between upper and lower case letters. That is done by passing re.I (or re.IGNORECASE ).

re.compile(r'(?:\b|_){}(?:\b|_)'.format(i), re.IGNORECASE)
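
A small sketch of the difference the flag makes, using one item from the question. The lowercase word 'hitch' is a made-up input, and re.escape() is an extra precaution that is not in the line above: it only matters if a word could contain regex metacharacters.

```python
import re

word = 'hitch'  # hypothetical lowercase search word
template = r'(?:\b|_){}(?:\b|_)'

# re.escape() neutralises any regex metacharacters in the word.
sensitive = re.compile(template.format(re.escape(word)))
insensitive = re.compile(template.format(re.escape(word)), re.IGNORECASE)

item = 'Hitch_neck_01_proxy'
sensitive_match = sensitive.search(item)
insensitive_match = insensitive.search(item)

print(sensitive_match)             # no match: 'H' differs from 'h'
print(insensitive_match.group(0))  # the matched text includes the trailing '_'
```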

On the other hand, you are comparing each word within each item separately. For example, with 'Hitch_neck_01_proxy' you are building one regular expression for 'Hitch' , another for 'neck' , another for '01' , and another for 'proxy' . That could be made more efficient.

All in one regular expression

re.compile( r'(?:\b|_)(?:Hitch|neck|01|proxy)(?=\b|_)', re.I)

and we call re.findall() .

At the end, instead of using (?:\b|_) , we now use (?=\b|_) , which is a positive lookahead that does not consume a character. Thus, if the word is followed by _ , the next item can also match.
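
To illustrate why the ending matters when counting matches, this sketch runs both variants of the combined expression over one item from the question. The names consuming and lookahead are illustrative only.

```python
import re

item = 'Hitch_neck_01_proxy'

# The trailing group consumes the '_' that the next word would need in front.
consuming = re.compile(r'(?:\b|_)(?:Hitch|neck|01|proxy)(?:\b|_)', re.I)
# The lookahead only checks for '_' or a boundary and leaves it in place.
lookahead = re.compile(r'(?:\b|_)(?:Hitch|neck|01|proxy)(?=\b|_)', re.I)

consumed_hits = consuming.findall(item)
lookahead_hits = lookahead.findall(item)

print(consumed_hits)   # ['Hitch_', '_01_']
print(lookahead_hits)  # ['Hitch', '_neck', '_01', '_proxy']
```

With the consuming version every second word is skipped, so the count would be 2 instead of 4.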

```
for x,y in [(x,y) for x in listA for y in listB]:
    def BreakWord(x):
        list2Check = x.split("_")

        # Join all the words of x into one alternation, e.g. (?:Hitch|neck|01|proxy)
        pattern2Check = '(?:' + '|'.join(list2Check) + ')'
        regex = r'(?:\b|_){}(?=\b|_)'.format(pattern2Check)

        find = re.compile(regex, re.IGNORECASE)
        # findall() returns every non-overlapping match found in y
        resultado = find.findall(y)

        if resultado:
            print (r"r'{}'  MATCHES    '{}'".format(regex,y))
        else:
            print (r"r'{}'  DOES NOT MATCH '{}'".format(regex,y))

        # The number of matches takes the place of the old counter
        return len(resultado)
```

Demo: link
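
Putting the pieces together, the whole matching step could look roughly like the sketch below. It is one possible arrangement, not the code from the question: the function names count_shared_words and match are hypothetical, the nested function and debug prints are dropped, and re.escape() is added as a precaution.

```python
import re

def count_shared_words(x, y):
    """Count how many underscore-separated words of x occur in y."""
    # Skip empty pieces (e.g. from a leading '_') and escape the rest.
    words = [re.escape(w) for w in x.split('_') if w]
    if not words:
        return 0
    regex = r'(?:\b|_)(?:{})(?=\b|_)'.format('|'.join(words))
    return len(re.findall(regex, y, re.IGNORECASE))

def match(list_a, list_b, fixed_value=2):
    """Map each item of list_b to an item of list_a sharing enough words."""
    result = {}
    for x in list_a:
        for y in list_b:
            if count_shared_words(x, y) >= fixed_value:
                result[y] = x
    return result

pairs = match(['Hitch_neck_01_proxy', 'Hitch_head_proxy'],
              ['suck_neck_01_target', 'suck_head_target'])
print(pairs)
```

Note that with the default tolerance of 2, 'Hitch_head_proxy' and 'suck_head_target' are not paired, because they share only the word 'head' .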