Request for Adding Support for **Monoatomic Groups** in GroupGrammar #8

Syzseisus opened this issue Sep 15, 2024 · 0 comments

Hello, thanks for your wonderful work!

I would like to request adding support for monoatomic groups in the GroupGrammar class.

Currently, the system effectively handles multi-atom groups by converting them to GroupSELFIES, but monoatomic groups (e.g., [C], [F], [N]) are treated separately as individual tokens outside the GroupGrammar vocabulary. This leads to challenges when attempting to extract the full molecular connectivity between groups and individual atoms.

Required Feature:

  • Add monoatomic groups (e.g., fragC, fragF, fragN) to GroupGrammar.vocab so that atoms such as ["B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "Li", "Na", "K", "Rb", "Cs", "Fr", "Be", "Mg", "Ca", "Sr", "Ba", "Ra"] can also be processed as groups.
  • Allow these monoatomic groups to be added dynamically or included in the essential grammar set, similar to how groups like frag65 and frag66 are treated.
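
To make the request concrete, here is a minimal sketch of how the monoatomic entries could be generated. Everything in it is hypothetical: the helper names, the fragX naming scheme, and the per-element bond limits are my assumptions, not part of the library. The pattern strings only mirror the `*1` attachment-point notation visible in the GROUPS listing of the example in this issue; whether GroupGrammar would accept such single-atom patterns is also an assumption.

```python
# Hypothetical helper; none of these names come from the group_selfies package.
ATOMS = ["B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I",
         "Li", "Na", "K", "Rb", "Cs", "Fr",
         "Be", "Mg", "Ca", "Sr", "Ba", "Ra"]

# Assumed maximum number of bonds per element (illustrative values only;
# adjust them to the bonding constraints actually in use).
MAX_BONDS = {
    "B": 3, "C": 4, "N": 3, "O": 2, "P": 5, "S": 6,
    "F": 1, "Cl": 1, "Br": 1, "I": 1,
    "Li": 1, "Na": 1, "K": 1, "Rb": 1, "Cs": 1, "Fr": 1,
    "Be": 2, "Mg": 2, "Ca": 2, "Sr": 2, "Ba": 2, "Ra": 2,
}


def monoatomic_pattern(symbol: str, max_bonds: int) -> str:
    """Build a single-atom pattern with one '*1' attachment point per bond."""
    # All attachment points but the last are written as branches.
    return symbol + "(*1)" * (max_bonds - 1) + "*1"


def monoatomic_groups() -> dict:
    """Map a hypothetical group name (fragC, fragF, ...) to its pattern."""
    return {f"frag{s}": monoatomic_pattern(s, MAX_BONDS[s]) for s in ATOMS}


groups = monoatomic_groups()
print(groups["fragC"])  # C(*1)(*1)(*1)*1
print(groups["fragF"])  # F*1
```

A table like this could either be merged into the essential grammar set up front or consulted on demand when an ungrouped atom is encountered.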

Motivation:

The main issue arises when trying to extract the molecular connectivity between the subgraphs represented by GroupSELFIES. GroupSELFIES, in essence, represents the original molecular graph by grouping atoms into subgraphs (i.e., groups). The connectivity between group tokens is clearly defined, but for monoatomic tokens like [C], the connectivity remains unclear. This inconsistency makes it difficult to extract subgraph-to-subgraph connectivity in a unified way.

Adding support for monoatomic groups would allow all atoms, even single atoms like [C] and [F], to be treated as subgraphs, ensuring that the connections between subgraphs can be easily traced and understood.

Example:

Here is an example in which single atoms sit outside the defined groups. The dump is annotated (# 1 to # 9) to show how these atoms would appear if they were treated as groups. Ideally, atoms like [C] and [F] should be included as monoatomic groups within GroupGrammar.vocab so that their connectivity is explicit:

smiles: Cc1ccc(NC(=O)c2ccc(COc3ccc(F)cc3)o2)c(C)c1
GroupSELFIES: [C][:2frag65][=Branch][:0frag68][Ring1][:5frag66][#Branch][F][pop][pop][pop][#Branch][C][pop]
ATOMS
0 C 1/4 bonds filled group_tag=(3, 0)  # 1
1 C 4/4 bonds filled group_tag=(0, 8)
2 C 3/4 bonds filled group_tag=(0, 6)
3 C 3/4 bonds filled group_tag=(0, 4)
4 C 4/4 bonds filled group_tag=(0, 3)
5 N 2/3 bonds filled group_tag=(0, 2)
6 C 4/4 bonds filled group_tag=(0, 1)
7 O 2/2 bonds filled group_tag=(0, 0)
8 C 4/4 bonds filled group_tag=(2, 1)
9 C 3/4 bonds filled group_tag=(2, 0)
10 C 3/4 bonds filled group_tag=(2, 6)
11 C 4/4 bonds filled group_tag=(2, 4)
12 C 2/4 bonds filled group_tag=(1, 12)
13 O 2/2 bonds filled group_tag=(1, 0)
14 C 4/4 bonds filled group_tag=(1, 1)
15 C 3/4 bonds filled group_tag=(1, 2)
16 C 3/4 bonds filled group_tag=(1, 4)
17 C 4/4 bonds filled group_tag=(1, 6)
18 F 1/1 bonds filled group_tag=(4, 0)  # 2
19 C 3/4 bonds filled group_tag=(1, 8)
20 C 3/4 bonds filled group_tag=(1, 10)
21 O 2/2 bonds filled group_tag=(2, 3)
22 C 4/4 bonds filled group_tag=(0, 12)
23 C 1/4 bonds filled group_tag=(5, 0)  # 3
24 C 3/4 bonds filled group_tag=(0, 10)

BONDS
0 -> 1 order=1 group_idxs [0, 3]  # 4
1 -> 2 order=2 group_idxs []
2 -> 3 order=1 group_idxs []
3 -> 4 order=2 group_idxs []
4 -> 5 order=1 group_idxs []
4 -> 22 order=1 group_idxs []
5 -> 6 order=1 group_idxs []
6 -> 7 order=2 group_idxs []
6 -> 8 order=1 group_idxs [0, 2]
8 -> 9 order=2 group_idxs []
9 -> 10 order=1 group_idxs []
10 -> 11 order=2 group_idxs []
11 -> 12 order=1 group_idxs [1, 2]
11 -> 21 order=1 group_idxs []
12 -> 13 order=1 group_idxs []
13 -> 14 order=1 group_idxs []
14 -> 15 order=2 group_idxs []
15 -> 16 order=1 group_idxs []
16 -> 17 order=2 group_idxs []
17 -> 18 order=1 group_idxs [1, 4]  # 5
17 -> 19 order=1 group_idxs []
19 -> 20 order=2 group_idxs []
20 -> 14 order=1 group_idxs []
21 -> 8 order=1 group_idxs []
22 -> 23 order=1 group_idxs [0, 5]  # 6
22 -> 24 order=2 group_idxs []
24 -> 1 order=1 group_idxs []

GROUPS
<Group frag65 O=C(N(C1=C(*1)C(*1)=C(*1)C(*1)=C1*1)*1)*1>
<Group frag66 O(C1=C(*1)C(*1)=C(*1)C(*1)=C1*1)C(*1)(*1)*1>
<Group frag68 C1=C(*1)OC(*1)=C1*1>
<Group C ??>  # 7
<Group F ??>  # 8
<Group C ??>  # 9

In this example:

  • The lines marked 1, 2, and 3 show single atoms ([C], [F], and [C]) carrying a group_tag of their own, which is the behavior we want to implement.
  • As a result, bonds to these monoatomic groups, marked 4, 5, and 6, are recorded with group_idxs just like bonds between multi-atom groups.
  • The GROUPS list then contains single-member groups, marked 7, 8, and 9.

To this end:

  1. Monoatomic groups must be defined in GroupGrammar.vocab.
  2. When converting graphs to GroupSELFIES, monoatomic groups must be matched to group tokens.
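
The matching step in point 2 could look roughly like the sketch below. It is not the library's encoder: the function name and the plain-list inputs are hypothetical, and it assumes that atoms left unmatched by the multi-atom groups carry no tag (None here). Each such atom gets the next free group index and position 0, which is how atoms 0, 18, and 23 end up with tags (3, 0), (4, 0), and (5, 0) in the example.

```python
def tag_leftover_atoms(symbols, group_tags, n_groups):
    """Give every untagged atom its own single-atom group.

    symbols    : element symbol of each atom, in atom order
    group_tags : (group_idx, idx_in_group) per atom, or None if ungrouped
    n_groups   : number of multi-atom groups already matched
    Returns the completed tag list and the names of the new groups.
    """
    new_tags = list(group_tags)
    new_names = []
    for i, tag in enumerate(new_tags):
        if tag is None:
            # Next free group index; a single atom is always position 0.
            new_tags[i] = (n_groups + len(new_names), 0)
            new_names.append(f"frag{symbols[i]}")  # hypothetical naming
    return new_tags, new_names


# Three atoms taken from the example: atoms 0, 1, and 18.
tags, names = tag_leftover_atoms(
    symbols=["C", "C", "F"],
    group_tags=[None, (0, 8), None],
    n_groups=3,  # frag65, frag66, frag68 were matched first
)
print(tags)   # [(3, 0), (0, 8), (4, 0)]
print(names)  # ['fragC', 'fragF']
```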

Conclusion:

By adding support for monoatomic groups, the molecular connectivity between all subgraphs (whether complex groups or individual atoms) can be traced uniformly, greatly simplifying tasks such as graph extraction, reconstruction, and representation.
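
As an illustration of the uniform tracing this would enable, here is a small sketch that reads a text dump in the format of the example and derives the group-to-group edges from the atoms' group_tag values alone. The parser and its function name are my own and are not part of the library; it assumes every atom has a tag, which only holds once monoatomic groups exist.

```python
import re

# Patterns for the dump format shown in the example.
ATOM_RE = re.compile(r"(\d+) (\S+) \d+/\d+ bonds filled group_tag=\((\d+), (\d+)\)")
BOND_RE = re.compile(r"(\d+) -> (\d+) order=(\d+)")

# Excerpt from the example (atoms 0, 1, 2, 17, 18 and three of their bonds).
DUMP = """
ATOMS
0 C 1/4 bonds filled group_tag=(3, 0)
1 C 4/4 bonds filled group_tag=(0, 8)
2 C 3/4 bonds filled group_tag=(0, 6)
17 C 4/4 bonds filled group_tag=(1, 6)
18 F 1/1 bonds filled group_tag=(4, 0)
BONDS
0 -> 1 order=1
1 -> 2 order=2
17 -> 18 order=1
"""


def group_edges(dump):
    """Return the set of (group_a, group_b) pairs joined by at least one bond."""
    atom_group = {}
    bonds = []
    for line in dump.splitlines():
        line = line.strip()
        atom = ATOM_RE.match(line)
        bond = BOND_RE.match(line)
        if atom:
            atom_group[int(atom.group(1))] = int(atom.group(3))
        elif bond:
            bonds.append((int(bond.group(1)), int(bond.group(2))))
    edges = set()
    for a, b in bonds:
        ga, gb = atom_group[a], atom_group[b]
        if ga != gb:  # bonds inside one group are not group-level edges
            edges.add((min(ga, gb), max(ga, gb)))
    return edges


print(sorted(group_edges(DUMP)))  # [(0, 3), (1, 4)]
```

The two edges match the group_idxs entries marked 4 and 5 in the example. Without monoatomic groups, atoms 0 and 18 would have no group index to look up and these edges could not be derived this way.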

Thank you for considering this request! Looking forward to your feedback.
