Python: compare two files and export the differences


Searching the web, I managed to write the differences between two text files to a new file with the following code:

with open('file1.txt', 'r') as file1:
    with open('file2.txt', 'r') as file2:
        difference = set(file1).difference(file2)

with open('diff.txt', 'w') as file_out:
    for line in difference:
        file_out.write(line)

The problem is that the differences are exported with the lines arranged alphabetically. For example:

file1.txt:

hola
casa
televisor
sillón
bebida

file2.txt:

hola
casa

diff.txt:

bebida
sillón
televisor

How do I make sure that diff.txt does not have the lines arranged alphabetically?

    
asked by tomillo 13.05.2018 at 17:24

1 answer


The problem is that sets do not preserve the order of their elements. In fact, the output is not in alphabetical order; it has no defined order at all. If you run the script repeatedly you can get different outputs. Any apparent order in sets or dictionaries (as in Python 3.6) should, for now, be considered only a side effect of the implementation.
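As a small illustration of that point, here is a sketch that uses in-memory lists in place of the two files (the list names are only for this example). The contents of the set difference are well defined, but the order in which they come out is not:

```python
# In-memory stand-ins for the lines of file1.txt and file2.txt.
file1_lines = ["hola", "casa", "televisor", "sillón", "bebida"]
file2_lines = ["hola", "casa"]

difference = set(file1_lines).difference(file2_lines)

# The contents of the result are well defined...
print(difference == {"televisor", "sillón", "bebida"})  # True

# ...but the iteration order is not, and it may change between runs.
print(list(difference))
```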

This is a pity, because set difference is a very efficient way to find the differences between two data sets: hash table lookups are O(1) on average and O(n) in the worst case.

Since you are looking for the lines of file1.txt that are not in file2.txt, but not the other way round, you can still use a set to store the lines of file2.txt and then iterate over file1.txt in order, using the in operator to check whether each line is in that set:

with open('file1.txt', 'r') as file1:
    with open('file2.txt', 'r') as file2:
        with open('output.txt', 'w') as out_file:
            # Keep the lines of file2 in a set for fast membership tests
            f2_lines = set(file2)
            # Walk file1 in its original order
            for line in file1:
                if line not in f2_lines:
                    out_file.write(line)

For your example, the output will always be:

televisor
sillón
bebida

Note that if a line is repeated in file1, it will appear the same number of times in the output file. If you do not want this, you can add each line to the set the first time it is found:

with open('file1.txt', 'r') as file1:
    with open('file2.txt', 'r') as file2:
        with open('output.txt', 'w') as out_file:
            f2_lines = set(file2)
            for line in file1:
                if line not in f2_lines:
                    out_file.write(line)
                    # Remember the line so later repeats are skipped
                    f2_lines.add(line)

For the following content of file1.txt:

hola
casa
televisor
bebida
sillón
bebida

the first snippet will produce:

televisor
bebida
sillón
bebida

and the second:

televisor
bebida
sillón

  

WARNING: both the code in the question and the snippets above compare raw lines, that is, including the newline character. This matters: if the last line of a file does not end with a newline, or if the files use different line endings (CR+LF, LF, CR, ...) that are not normalized on reading, the comparison will fail. In such cases you can apply str.rstrip to each line, or fix the files before comparing them.
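One way to apply the str.rstrip suggestion is sketched below. The helper names normalized and difference_ignoring_newlines are made up for this example, and the lists stand in for lines read from the files without any newline translation.

```python
def normalized(line):
    """Strip trailing CR/LF so line endings do not affect the comparison."""
    return line.rstrip("\r\n")


def difference_ignoring_newlines(lines_a, lines_b):
    """Lines of lines_a (without line endings) that are not in lines_b, in order."""
    seen = {normalized(line) for line in lines_b}
    return [normalized(line) for line in lines_a if normalized(line) not in seen]


# file1 uses CR+LF on its first line; file2's last line has no trailing newline.
file1_lines = ["hola\r\n", "casa\n", "televisor\n"]
file2_lines = ["hola\n", "casa"]

result = difference_ignoring_newlines(file1_lines, file2_lines)
print(result)
```

Because the returned lines no longer carry a line ending, a newline has to be added back when writing them, for example with out_file.write(line + "\n").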

If your files have exactly the same content up to the end of file2.txt and differ only in that file1.txt has extra lines that were never added to file2.txt, as in your example, then we can work with the file cursor instead of loading data into memory:

import shutil

with open('file1.txt', 'r') as file1:
    with open('file2.txt', 'r') as file2:
        with open('output.txt', 'w') as out_file:
            file2.seek(0, 2)                      # move file2's cursor to the end of the file
            file1.seek(file2.tell())              # move file1's cursor to that same position
            shutil.copyfileobj(file1, out_file)   # copy file1 from there to the end

The output is the same as that of the first snippet: possible duplicates are not removed. If file1.txt is the same size as file2.txt or smaller, the output file will be empty.
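A variation on the same idea, as a sketch only: opening the files in binary mode makes the offset a plain byte count, and os.path.getsize gives the size of the shorter file without opening it. The function name copy_appended_tail is hypothetical, and the code assumes, without checking, that the longer file starts with the exact bytes of the shorter one.

```python
import os
import shutil
import tempfile


def copy_appended_tail(longer_path, shorter_path, out_path):
    """Copy to out_path the bytes of longer_path beyond the size of shorter_path.

    Assumes (does not verify) that longer_path begins with the exact
    contents of shorter_path.
    """
    offset = os.path.getsize(shorter_path)
    with open(longer_path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(offset)                 # skip the shared prefix
        shutil.copyfileobj(src, dst)     # copy whatever remains


# Small demonstration using temporary files with the question's data.
with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "file1.txt")
    path2 = os.path.join(tmp, "file2.txt")
    out = os.path.join(tmp, "output.txt")

    with open(path2, "wb") as f:
        f.write("hola\ncasa\n".encode("utf-8"))
    with open(path1, "wb") as f:
        f.write("hola\ncasa\ntelevisor\nsillón\nbebida\n".encode("utf-8"))

    copy_appended_tail(path1, path2, out)
    with open(out, "rb") as f:
        tail = f.read()

print(tail.decode("utf-8"))
```

If the shorter file is as large as the longer one or larger, the seek lands at or past the end and the output file is left empty, which matches the behaviour described above.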

    
answered by 13.05.2018 / 22:40