File encoding change after online editing (OD-2045)

sususweet opened 1 year ago

The original file encoding is GB18030, but after editing in the online editor, its encoding changes to UTF-8. Is there any way for the backend to write the file back in its original encoding?

Activities

sususweet changed fields 1 year ago
Name	Previous Value	Current Value
Type	Question	Improvement
sususweet referenced from pull request 1 year ago

feat: keep gbk file encoding when using online editing. (OD-55) Discarded
Robin Shen commented 1 year ago
This is expected behavior, as explained in OD-82.

If the code is changed to use the original encoding, you will encounter the issue below:

Commit a file containing only English words. The file encoding will be detected as ISO-8859-1.

Now edit the file to add some Chinese characters. The file content will be converted to ISO-8859-1 bytes, and the Chinese characters will be garbled.
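
The failure mode described above can be reproduced in a few lines. This is a minimal Python sketch, not OneDev's actual code: the variable names are illustrative, and `errors="replace"` is an assumption standing in for an encoder that substitutes `?` for characters it cannot map.

```python
# Illustrative sketch only; names are hypothetical and this is not OneDev code.
original = "int main() { return 0; }\n"   # pure-ASCII file content
edited = original + "// 中文注释\n"         # Chinese characters added online

# Saving with the detected "original" encoding (ISO-8859-1) cannot represent
# the new characters. A strict encoder raises an error...
try:
    edited.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...and a lenient one silently replaces each unmappable character with "?".
lossy = edited.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

# Saving as UTF-8 round-trips without loss.
utf8_roundtrip = edited.encode("utf-8").decode("utf-8")

print(strict_ok)                  # False
print(lossy.splitlines()[-1])     # // ????
print(utf8_roundtrip == edited)   # True
```

Either way the Chinese text is lost under ISO-8859-1, while UTF-8 preserves it.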
sususweet commented 1 year ago
PR OD-55 has a new commit to solve this issue. This commit sets the encoding of a file containing only English words to UTF-8 instead of ISO-8859-1, so that Chinese characters can be added when editing.
Robin Shen commented 1 year ago

A file containing only English words may already exist, so this fix is not complete. Actually, I do not think there can be a complete fix for this. So once a file is edited from the UI, OneDev will always use UTF-8 for encoding.
sususweet commented 1 year ago

This fix treats files containing only English words as UTF-8 instead of ISO-8859-1. Since our team uses OneDev for some Visual Studio projects, this fix may be useful for managing VS projects and editing GBK-encoded files. Please give this fix a try and see whether any other bugs remain.

As we know, GBK is the default encoding for VS projects, and changing a file's encoding can cause unexpected issues.
Robin Shen commented 1 year ago
This fix still has a problem:

Assume a file test.c has already been committed to the repository and contains only ASCII characters.

When it is edited from the UI, the initial encoding will be detected as ISO-8859-1 (even if it was originally encoded as UTF-8, since the bytes are the same in both encodings for pure ASCII characters).

When some Chinese characters are added and saved, sticking to the initial encoding (ISO-8859-1) will corrupt the file.

As mentioned before, the approach of using the original encoding will not work. Please either avoid editing online, or change your encoding to UTF-8, which is the default for the majority of code editors.
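
The ambiguity behind this argument is easy to check. The following Python sketch (illustrative names only, not taken from OneDev) shows that pure-ASCII content produces identical bytes in several encodings, so the stored bytes alone cannot reveal which encoding the author intended.

```python
# Illustrative sketch only: pure-ASCII text yields identical bytes in all of
# these encodings, so the bytes cannot tell which one the author used.
ascii_text = "int main() { return 0; }\n"
encodings = ["utf-8", "iso-8859-1", "gb18030"]
encoded = {name: ascii_text.encode(name) for name in encodings}
all_identical = len(set(encoded.values())) == 1
print(all_identical)   # True

# Once a non-ASCII character is present, the byte sequences diverge.
chinese = "中"
diverged = {name: chinese.encode(name) for name in ["utf-8", "gb18030"]}
print(diverged["utf-8"] != diverged["gb18030"])   # True
```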
sususweet commented 1 year ago
Sorry, I don't fully understand what you mean. With the fixed code, these are the test results we obtained:

Take a file test.c that is already committed to the repository and contains only ASCII characters.

When it is edited from the UI, the initial encoding is detected as UTF-8 instead of ISO-8859-1. (We changed the default encoding here.)

When some Chinese characters are added and saved, sticking to the initial encoding (UTF-8) does NOT corrupt the file.

Please have a look at the test result below. 20240902_001442.mp4
Robin Shen commented 1 year ago

I did not realize you were changing the default encoding from ISO-8859-1 to UTF-8 in UniversalEncodingListener.java. This may cause backward compatibility issues, as not every ISO-8859-1 character is a valid UTF-8 character.
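
The compatibility concern can be illustrated with a short Python sketch. It is an assumption-labelled example, not code from UniversalEncodingListener.java: an existing file whose bytes are valid ISO-8859-1 may be rejected when read as UTF-8.

```python
# Illustrative sketch only: bytes that are valid ISO-8859-1 need not be valid UTF-8.
latin1_bytes = "café".encode("iso-8859-1")   # b'caf\xe9'

# ISO-8859-1 maps every byte to a character, so decoding always succeeds.
as_latin1 = latin1_bytes.decode("iso-8859-1")

# The same bytes are rejected by a strict UTF-8 decoder, because 0xE9 starts
# a multi-byte sequence that is never completed.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_latin1)    # café
print(valid_utf8)   # False
```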
sususweet commented 1 year ago

A new commit has been pushed that keeps the original encoding only for GBK-encoded files. Please review the update.

This fix accommodates Windows Visual Studio projects, in which source code files are encoded in GBK.
Robin Shen changed state to 'Closed' 1 year ago
Previous Value	Current Value
Open	Closed
Robin Shen commented 1 year ago

Sorry, the workaround to process GBK alone is not acceptable. I am closing the issue.
jbauer commented 1 year ago
Git doesn't really care about encoding, as it only stores binary data, but it works best with UTF-8: if Git detects that a file is text, it assumes UTF-8 by default.

If you use a tool that cannot produce UTF-8 files, you should tell Git the file encoding using a .gitattributes file and its working-tree-encoding attribute. Git will then convert back and forth between the specified encoding and UTF-8 (which is used to store the data).

If OneDev honored an existing .gitattributes file, it could convert the string received from the browser to the encoding specified in .gitattributes before committing.

It is important to understand that ALL Git client applications you use would then need to understand working-tree-encoding. If you use any Git client that does not, you will corrupt the encoding.

/.gitattributes: *.c working-tree-encoding=GB18030

https://git-scm.com/docs/gitattributes/2.46.0
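
As a rough illustration of the suggestion above, the Python sketch below looks up `working-tree-encoding` for a path and encodes editor text accordingly. Everything in it is hypothetical: `working_tree_encoding` is a made-up helper, not part of Git or OneDev, and it matches patterns against the file name with `fnmatch`, which is a simplification of Git's real pattern rules.

```python
from fnmatch import fnmatch
from typing import Optional


def working_tree_encoding(gitattributes: str, path: str) -> Optional[str]:
    """Return the working-tree-encoding set for path, or None if unset.

    Hypothetical helper. Simplification: each pattern is matched against the
    file name only, which does not implement Git's full pattern semantics.
    """
    result = None
    file_name = path.rsplit("/", 1)[-1]
    for line in gitattributes.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if not fnmatch(file_name, pattern):
            continue
        for attr in attrs:
            if attr.startswith("working-tree-encoding="):
                # A later matching line overrides an earlier one.
                result = attr.split("=", 1)[1]
            elif attr in ("-working-tree-encoding", "!working-tree-encoding"):
                result = None
    return result


attributes = "*.c working-tree-encoding=GB18030\n"
encoding = working_tree_encoding(attributes, "src/test.c")

# Text as received from the browser, encoded with the configured encoding,
# falling back to UTF-8 when no attribute applies.
text = "// 中文\n"
data = text.encode(encoding or "utf-8")

print(encoding)                         # GB18030
print(data.decode("gb18030") == text)   # True
```

A path with no matching pattern, such as `README.md` here, yields `None` and therefore falls back to UTF-8.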
Type	Improvement
Priority	Normal
Assignee	Robin Shen
Labels	No labels
