Conventional commits: Regex improvements for summary (OD-1401)
Released
jbauer opened 11 months ago

Currently the regex is

(([\w\-.]+)\s*(\(([\w\s\-./,]+)\)\s*)?(!\s*)?:.+)|(revert\s+\".*\")

which has some issues. I have prepared some example commit messages using the above regex at https://regex101.com/r/9XWUhp/1 . On the right side is a "MATCH INFORMATION" section that shows matching details for each line.

So we can see that:

  1. It allows many or no whitespace between type, optional scope, optional !, terminal : and after the terminal :. E.g. feat:message and      fix     (test)     !     :     message are valid in OneDev.
  2. It allows almost free-form text in the optional scope instead of just a noun.
  3. It does not match from the beginning of a line which causes some lines to match that should not match

The spec says:

  1. Commits MUST be prefixed with a type, which consists of a noun, feat, fix, etc., followed by the OPTIONAL scope, OPTIONAL !, and REQUIRED terminal colon and space.
  2. A scope MAY be provided after a type. A scope MUST consist of a noun describing a section of the codebase surrounded by parenthesis
  3. A description MUST immediately follow the colon and space after the type/scope prefix.

Rule 1. and 5. make it pretty clear that no whitespace is between type, scope, ! and : and that : needs a single following space. Rule 4. makes it clear that a scope must be a noun. Whitespace is not a letter and thus whitespace can not be part of a noun. That means whitespace surrounding the noun is not allowed. Generally only a single noun is allowed but OneDev allows whitespace and some special characters _-/,.. I assume you allow this in order to support some form of namespace, e.g. type(security/login): message. But I would strongly discourage , and whitespace an possibly also . because the whole point of conventional commits is to generate changelogs and if you have multiple scopes you don't really want to show up the same commit in multiple sections of the changelog. So usually you just want a single noun as a scope

If you look closely at the example lines of the link above you can see some tricky messages:

  • fix(fix:pe)!:message matches against fix:pe)!:message with group 2 being fix and no further groups. This would pass OneDev checks because if checks fix and does not detect a scope
  • fix(scope): matches because OneDev allows empty message
  • my:fix(sco: message: matches with group 2 being my and no group 4. This would be allowed if no types are configured or my was configured
  • my/fix(scope)!: message matches against fix(scope)!: message which would be allowed because group 2 is fix and group 4 is scope
  • rerevert "message" matches because it does not check the start of line

So I would vote for using the regex used in https://regex101.com/r/BPgw1H/1

Robin Shen commented 11 months ago

I am aware of the spec putting strict constraints over the spaces and texts. But just want to relax it to allow unintentional extra spaces in the commit message. You are right, the relaxed form violates conventional commit convention. I will fix that.

OneDev changed state to 'Closed' 11 months ago
Previous Value Current Value
Open
Closed
OneDev commented 11 months ago

State changed as code fixing the issue is committed (8290cf51)

Referenced from commit 11 months ago
OneDev changed state to 'Released' 11 months ago
Previous Value Current Value
Closed
Released
OneDev commented 11 months ago

State changed as build #3704 is successful

jbauer commented 11 months ago

In code line

https://code.onedev.io/onedev/server/~files/8d7349b87de3e9128318b970e632b4f7897e82b5/server-core/src/main/java/io/onedev/server/model/support/code/BranchProtection.java?position=source-417.1-417.59-1

you split the scope using /, , and whitespace. The comma and whitespace have been removed from the regex so it can be removed from the split as well.

Additionally I feel like the split on / should be removed as well and a user should configure a single scope named, e.g. security/login. So this would be a single scope and not two scopes.

What do you think?

Robin Shen commented 11 months ago

Thanks for pointing this out. Will make it consistent with regex.

For '/', I was originally thinking of using it as separator for multiple scopes. I ever see this separator in some examples. Do you know the recommended separator for multiple scopes? Or is this still not allowed in the spec?

jbauer commented 11 months ago

@robin I googled a bit and it seems that the spec itself strictly means a single noun. See answer in: https://github.com/conventional-commits/conventionalcommits.org/issues/472

However the answer also states that a linter might be a bit forgiving if desired since you never know how teams want to define scopes. However I think a linter should not be too forgiving so that a parser / changelog generator still work correctly.

I have also seen an example using fix(shopping cart): message which is technically forbidden as it is not a single noun but of course describes a single scope.

So I would always treat the content within the parenthesis as a single scope and allow the user to define scopes containing

  • single whitespace, e.g. shopping cart
  • / to support some hierarchy/namespace, e.g. security/login
  • - to support hyphenated words, e.g. self-service, command-cli

I would not support _ in scopes because it is usually not used in written text.

Something like: https://regex101.com/r/f6syXh/1

Robin Shen commented 11 months ago

Thanks for the investigation. Will do as you suggested.

Referenced from commit 11 months ago
jbauer commented 11 months ago

@robin Quickly tried the new regex on regex101.com and german umlauts do not work anymore because the \w predefined class has been replaced with a-zA-Z and the unicode setting that you can turn on on regex101.com (and java) is for predefined classes.

But beside the umlaut issue the regex seems fine in terms of checking commit subject line structure.

Given that you also aim for I18N in OneDev any regex in OneDev might need to be unicode compatible in the future.
For example the regex . does not match any character but instead matches any unicode code point. That is if you have something like ^.$ and try matching it against à it will fail to match if à is encoded using two unicode code points (one for the letter 'a' and one for the accent mark '`').

The page https://www.regular-expressions.info/unicode.html has a pretty good description of unicode support in regex. After reading it twice I think a-zA-Z should be replaced with \p{L} which matches all kinds of unicode letters (which does not include _ unlike \w).

According to the linked page \p{L} or \p{Letter} include the unicode categories:

  • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first - letter of the word is capitalized.
  • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
  • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
  • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.

These categories can be looked up at https://www.compart.com/en/unicode/category

See: https://regex101.com/r/RL1uoQ/1

Robin Shen commented 11 months ago

That is very good to know. Really appreciated! Will change to use \p{L}

issue 1 of 1
Type
Improvement
Priority
Normal
Assignee
Issue Votes (0)
Watchers (4)
Reference
OD-1401
Please wait...
Page is in error, reload to recover