Quote:
So, when opening both files it says utf-8
No, it writes in cp1252 and reads in utf-8. None of the open('w') calls specifies an encoding, and on Windows the default is the system code page (cp1252 in this case), so writing to engleski_redovi.txt goes through, while writing to srpski_redovi.txt ( srp_red.write(f'{sr}\n'), line 17 ) raises a UnicodeEncodeError. For example:
Code:
>>> s = 'rečenica'
>>> s.encode('cp1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 2: character maps to <undefined>
>>> s.encode('utf-8')
b're\xc4\x8denica'
>>> bs = s.encode('utf-8')
>>> print(bs)
b're\xc4\x8denica'
>>> type(bs)
bytes
>>> bs.decode('utf-8')
'rečenica'
>>> import encodings
>>> encodings.aliases.aliases.values()
dict_values(['iso8859_15', 'johab', 'cp869', 'iso8859_8', 'cp1258', 'cp1140', 'mac_roman', 'gbk', 'utf_32', 'cp865', 'ptcp154',
'shift_jis', 'iso8859_9', 'euc_kr', 'latin_1', 'cp500', 'hz', 'cp1252', 'mac_roman', 'cp863', 'iso8859_8', 'ascii', 'iso8859_7', 'cp273',
'utf_8', 'cp1251', 'iso8859_10', 'bz2_codec', 'cp775', 'cp855', 'cp1125', 'iso8859_14', etc...])
>>> encodings.aliases.aliases['utf8']
'utf_8'
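To see which encoding open() falls back to on your machine, and to sidestep the problem by passing the encoding explicitly, here is a minimal sketch. The file name is the one from your post; the temporary directory and the variable names default_encoding, path and read_back are only my assumptions for the demo.

```python
import locale
import os
import tempfile

# Roughly what open() falls back to when no encoding is passed
# (cp1252 on many Windows setups, usually UTF-8 on Linux/macOS).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

sr = 'rečenica'

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'srpski_redovi.txt')

# An explicit encoding makes the write independent of the platform default.
with open(path, 'w', encoding='utf-8') as srp_red:
    srp_red.write(f'{sr}\n')

# Read it back with the same encoding to confirm the round trip.
with open(path, encoding='utf-8') as srp_red:
    read_back = srp_red.read()

print(read_back == f'{sr}\n')
```

If the first print shows cp1252, every open(..., 'w') without encoding= will hit the same error as soon as it meets a character like č.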
Also, this line:
for en, ser in zip(engleski, srpski), where are engleski and srpski defined in the code you posted?
There is also no need to use with that many times, for example:
Code (python):
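import re
import csv

# Open both input files in a single with statement, explicitly as UTF-8.
with open('engleski.txt', encoding='utf-8') as eng_fh, open('srpski.txt', encoding='utf-8') as srp_fh:
    for eng, srp in zip(eng_fh, srp_fh):
        # Split each line into sentences after '.', '?' or '!'.
        engleski = re.split(r"(?<=\.|\?|\!)\s", eng.strip())
        srpski = re.split(r"(?<=\.|\?|\!)\s", srp.strip())
        # newline='' lets the csv module control line endings (avoids blank rows on Windows).
        with open('englesko-srpski.csv', 'a', encoding='utf-8', newline='') as csv_file:
            writer = csv.writer(csv_file)
            for en, sr in zip(engleski, srpski):
                writer.writerow([en, sr])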
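If you want to go one step further, all three files can share one with statement. This is only a sketch under my own assumptions: the function name pair_sentences, the SENTENCE_END pattern and the two sample sentences are made up for the demo, and it opens the CSV with 'w' instead of 'a' so that rerunning it does not duplicate rows (switch back to 'a' if appending is what you want).

```python
import csv
import os
import re
import tempfile

# Split after '.', '?' or '!' when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!])\s")


def pair_sentences(eng_path, srp_path, csv_path):
    """Write aligned sentence pairs from two parallel text files to a CSV."""
    with open(eng_path, encoding='utf-8') as eng_fh, \
         open(srp_path, encoding='utf-8') as srp_fh, \
         open(csv_path, 'w', encoding='utf-8', newline='') as csv_file:
        writer = csv.writer(csv_file)
        for eng, srp in zip(eng_fh, srp_fh):
            engleski = SENTENCE_END.split(eng.strip())
            srpski = SENTENCE_END.split(srp.strip())
            for en, sr in zip(engleski, srpski):
                writer.writerow([en, sr])


# Tiny made-up sample so the sketch runs on its own.
tmp = tempfile.mkdtemp()
eng_path = os.path.join(tmp, 'engleski.txt')
srp_path = os.path.join(tmp, 'srpski.txt')
csv_path = os.path.join(tmp, 'englesko-srpski.csv')
with open(eng_path, 'w', encoding='utf-8') as fh:
    fh.write('This is a sentence. Is this a question?\n')
with open(srp_path, 'w', encoding='utf-8') as fh:
    fh.write('Ovo je rečenica. Da li je ovo pitanje?\n')

pair_sentences(eng_path, srp_path, csv_path)

# Read the CSV back to check what was written.
with open(csv_path, encoding='utf-8', newline='') as fh:
    rows = list(csv.reader(fh))
print(rows)
```

Keep in mind that zip() stops at the shorter input, so if a line splits into a different number of sentences in the two languages, the extra ones are silently dropped.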