How do I convert a CSV (Unicode) file to CSV (UTF-8) in Python 3.6.5?

I downloaded a CSV (Unicode) file from a website. The fields are tab-separated and enclosed in double quotes, and the line breaks do not fall at the end of each record. This is how the CSV looks when opened in Notepad:

"campo1"    "campo2"    "campo3"    "campo4"    
"campo5"    "campo6"    "000000AA1011"  "000000AA1012"  
"000000AA1013" "000000AA1014"   "000000AA1015"  "000000AA1016"  
"000000AA1017"  "000000AA1018"  "000000AA1019" "000000AA1020"   
"000000AA1021"  "000000AA1022"  

I want to convert it to CSV (UTF-8) so that the fields are separated by commas, are not enclosed in double quotes, and each record ends with a line break. That way I can run MySQL's LOAD DATA LOCAL INFILE and insert the CSV records into the MySQL table without problems. This is how I want the CSV to look when opened in Notepad:

campo1,campo2,campo3,campo4,campo5,campo6
000000AA1011,000000AA1012,000000AA1013,000000AA1014,000000AA1015,000000AA1016
000000AA1017,000000AA1018,000000AA1019,000000AA1020,000000AA1021,000000AA1022

It is worth mentioning that I first tried to load the CSV (Unicode) directly into the MySQL table as follows:

LOAD DATA LOCAL INFILE 'C:/data.csv' INTO TABLE bd.down_data
    FIELDS TERMINATED BY '\t' ESCAPED BY '"'
    LINES TERMINATED BY '\n' IGNORE 1 LINES
    (campo1, campo2, campo3);

This query inserts the data but puts a space after each character, for example:

    campo1                       campo2                       campo3
0 0 0 0 0 0 A A 1 0 1 1     0 0 0 0 0 0 A A 1 0 1 2     0 0 0 0 0 0 A A 1 0 1 3
0 0 0 0 0 0 A A 1 0 1 7     0 0 0 0 0 0 A A 1 0 1 8     0 0 0 0 0 0 A A 1 0 1 9

I then tried to remove the spaces as follows:

Select REPLACE(campo1,' ','') as campo1 from bd.down_data;

but this changes nothing. I suspect the problem is the encoding of the original CSV file, so I want to fix it in the CSV itself (fields separated by commas and without double quotes) before running LOAD DATA LOCAL INFILE. If I re-save the same CSV from Excel as CSV, overwriting it, and open it in Notepad, it appears comma-separated.

I'm doing this in Python:

import csv

path =  'data.csv'

with open(path, 'r', encoding='utf-8', errors='ignore') as infile, open('final.csv', 'w') as outfile:
     inputs = csv.reader(infile)
     output = csv.writer(outfile)

     for index, row in enumerate(inputs):

         if index == 0:
             continue
         output.writerow(row)

This creates the new file final.csv, but with no records, and raises the following error:

Traceback (most recent call last):
  File "export.py", line 9, in <module>
    for index, row in enumerate(inputs):
_csv.Error: line contains NULL byte

UPDATE:

real file:

link

    
asked by El Cóndor 11.09.2018 at 08:04

1 answer

The file is Unicode text encoded as UTF-16-LE (at least that is the case for the sample file). This is what Microsoft calls a "Unicode" file, although the term is imprecise. So the input file must be opened as UTF-16 and the output file written as UTF-8:

import csv

path =  'data.csv'

with open(path, 'r', encoding='utf-16-le', newline='') as infile, \
     open('final.csv', 'w', encoding='utf-8', newline='') as outfile:
    inputs = csv.reader(infile, delimiter="\t", quotechar='"')
    output = csv.writer(outfile, delimiter=",", quotechar='"')
    next(inputs)  # Discard the first row (header)
    output.writerows(inputs)

The conversion script generates a CSV without quotes (unless they are needed, for example when a cell contains the separator as part of its data) and with the comma as the separator:

  

0000001NV462, MC1, AUTO,
000000A34147, MC2, AUTO, 3,
000000B51066, MC3, AUTO,
000000B94242, MCG, AUTO, 2,
000000C61056, MC9, AUTO, 534,
4X30M39458, MC2, AUTO,
SA30J76548, MC9, AUTO,
SA70A15440, MCG, AUTO, 5,
SA70A15460, MCB, AUTO, 1,
SA70A15467, MCB, AUTO, 159,
SA70A15468, MCB, AUTO, 159,
...
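The UTF-16-LE encoding most likely also explains both symptoms in the question. In UTF-16-LE every ASCII character is followed by a zero byte, so a reader that assumes a single-byte or UTF-8 encoding sees a NUL character after each real character. That would account for the apparent "spaces" in MySQL, which REPLACE(campo1,' ','') cannot remove because they are not actual spaces, and for the "line contains NULL byte" error from csv.reader. A small illustration, using a value from the question's sample as input:

```python
# Illustration only: the sample value is taken from the question's data.
value = "000000AA1011"

# Each character becomes two bytes: the ASCII code followed by a zero byte.
raw = value.encode("utf-16-le")
print(raw[:8])  # b'0\x000\x000\x000\x00'

# Decoding with the wrong codec keeps the zero bytes as U+0000 characters,
# which is what the csv module complained about in the traceback.
wrong = raw.decode("utf-8", errors="ignore")
print("\x00" in wrong)  # True

# Decoding with the matching codec gives the original text back.
right = raw.decode("utf-16-le")
print(right)  # 000000AA1011
```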

The file has an empty column at the end, with no header and no data, at least in this file. To drop it, a slice is enough:

output.writerows(row[:-1] for row in inputs)
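Putting both pieces together, the conversion could be sketched as a small function. This is only a sketch under some assumptions: the function name utf16_tsv_to_utf8_csv, the drop_last_column parameter, and the sample data are made up for illustration. It also uses the 'utf-16' codec instead of 'utf-16-le', on the assumption that the file starts with a byte order mark (BOM); 'utf-16' consumes the BOM, whereas 'utf-16-le' would leave it as a character in the first field.

```python
import csv
import os
import tempfile


def utf16_tsv_to_utf8_csv(src, dst, drop_last_column=True):
    """Convert a tab-separated UTF-16 file to a comma-separated UTF-8 file."""
    # Assumption: the input starts with a BOM, so 'utf-16' can detect the byte order.
    with open(src, "r", encoding="utf-16", newline="") as infile, \
         open(dst, "w", encoding="utf-8", newline="") as outfile:
        reader = csv.reader(infile, delimiter="\t", quotechar='"')
        writer = csv.writer(outfile, delimiter=",", quotechar='"')
        next(reader, None)  # discard the header row, if there is one
        for row in reader:
            # Optionally drop the empty trailing column produced by the final tab.
            writer.writerow(row[:-1] if drop_last_column else row)


# Made-up sample: quoted, tab-separated fields with a trailing tab on each line.
sample = '"campo1"\t"campo2"\t\r\n"000000AA1011"\t"000000AA1012"\t\r\n'
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.csv")
dst = os.path.join(workdir, "final.csv")
with open(src, "w", encoding="utf-16", newline="") as f:
    f.write(sample)  # the 'utf-16' codec writes a BOM first

utf16_tsv_to_utf8_csv(src, dst)
with open(dst, "r", encoding="utf-8", newline="") as f:
    result = f.read()
print(repr(result))
```

If the header row should be kept in the output, as in the desired result shown in the question, the next(reader, None) line can be removed and IGNORE 1 LINES kept in the LOAD DATA statement.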
    
answered by 11.09.2018 / 16:50