Friday, May 6, 2016

A cheminformatics problem: protonate with SMILES and InChI

I am teaching a Python programming course, and this is one of the projects I want to try this year.

The overall goal of the project is to write one or more programs that generate protonation states for a list of nitrogen-containing molecules specified by name. The project uses SMILES, and maybe InChI, which you can read more about here.

Getting started
1. Write code that, from the list molecules = ["CCN", "CNC", "CN(C)C"], generates this output:

CCN
CC[NH3+]

C[N@H]C
C[N@@H]C
C[NH2+]C

C[N@](C)C
C[N@@](C)C
C[N@H+](C)C
C[N@@H+](C)C
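
One possible way to approach step 1, offered as a minimal sketch rather than a reference solution: count how many heavy atoms are bonded to the nitrogen and look up which forms to write out. The names heavy_neighbors, protonation_states and FORMS are made up for this illustration. The tiny parser assumes acyclic SMILES with single-letter atoms, no bracket atoms in the input, and exactly one nitrogen.

```python
# Forms to emit, keyed by the number of heavy atoms bonded to the nitrogen.
# The table simply mirrors the expected output listed in step 1.
FORMS = {
    1: ["N", "[NH3+]"],
    2: ["[N@H]", "[N@@H]", "[NH2+]"],
    3: ["[N@]", "[N@@]", "[N@H+]", "[N@@H+]"],
}


def heavy_neighbors(smiles):
    """Return (string positions, neighbour counts) for every atom.

    Assumption: acyclic SMILES, single-letter atoms, branches in
    parentheses, no ring-closure digits and no bracket atoms.
    """
    positions, counts, stack = [], [], []
    prev = None  # index of the atom the next atom will bond to
    for i, ch in enumerate(smiles):
        if ch == "(":
            stack.append(prev)      # remember the branch point
        elif ch == ")":
            prev = stack.pop()      # return to the branch point
        elif ch.isalpha():
            idx = len(counts)
            positions.append(i)
            counts.append(0)
            if prev is not None:
                counts[prev] += 1
                counts[idx] += 1
            prev = idx
        # bond symbols such as '=' are skipped
    return positions, counts


def protonation_states(smiles):
    """Replace the first nitrogen by each form from FORMS."""
    positions, counts = heavy_neighbors(smiles)
    for pos, n in zip(positions, counts):
        if smiles[pos] == "N":
            forms = FORMS.get(n, ["N"])
            return [smiles[:pos] + form + smiles[pos + 1:] for form in forms]
    return [smiles]


molecules = ["CCN", "CNC", "CN(C)C"]
for smi in molecules:
    print("\n".join(protonation_states(smi)))
    print()
```

Keeping the forms in a lookup table makes it easy to change what is written for each kind of amine without touching the parser.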

2. Write code that, from the list molecules = ["C(C(=O)O)N"], generates this output (the order is not important):

C(C(=O)O)N
C(C(=O)O)[NH3+]
C(C(=O)[O-])N
C(C(=O)[O-])[NH3+]
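
The four glycine strings are every combination of two independent choices (acid O or [O-], amine N or [NH3+]). A minimal sketch of that idea with itertools.product follows; the function name combine and the hand-written fragment list are assumptions for illustration, and a real solution would have to find the sites itself.

```python
from itertools import product


def combine(fragments):
    """Join every combination of the alternatives given for each fragment.

    Each fragment is a tuple of alternative strings; fixed pieces of the
    SMILES string are tuples with a single entry.
    """
    return ["".join(parts) for parts in product(*fragments)]


# Glycine, split by hand (an assumption of this sketch) into fixed pieces
# and the two titratable sites.
glycine = [
    ("C(C(=O)",),       # fixed
    ("O", "[O-]"),      # carboxylic acid oxygen
    (")",),             # fixed
    ("N", "[NH3+]"),    # amine nitrogen
]

for state in combine(glycine):
    print(state)
```

With n independent sites of two forms each, this produces 2**n strings, which is worth keeping in mind for the larger molecules later in the project.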

3. C(C(=O)O)N is the amino acid glycine.  Extend this program to work for alanine, asparagine, aspartate, and lysine.  Use this site to get SMILES strings for these amino acids. Find a picture of asparagine and make sure you're treating the side-chain correctly.
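
For step 3, the sites can be located automatically instead of by hand. The sketch below uses two regular expressions, one for a carboxylic acid oxygen and one for a primary amine nitrogen that is not part of an amide, which is the point of the asparagine warning. The SMILES strings in AMINO_ACIDS are my assumption of what the site returns, so check them yourself. The patterns only cover the way these particular strings are written (a C(=O)O group, and an N at the end of the string or of a branch), and all names are hypothetical.

```python
import re
from itertools import product

# Assumed SMILES strings; verify them against the site linked in step 3.
AMINO_ACIDS = {
    "alanine": "CC(C(=O)O)N",
    "asparagine": "C(C(C(=O)O)N)C(=O)N",
    "aspartate": "C(C(C(=O)O)N)C(=O)O",
    "lysine": "C(CCN)CC(C(=O)O)N",
}

# 'O' directly after 'C(=O)' and closing a branch or the string: acid OH.
ACID_O = re.compile(r"(?<=C\(=O\))O(?=\)|$)")
# 'N' closing a branch or the string, but NOT directly after 'C(=O)':
# this skips the amide nitrogen in the asparagine side chain.
AMINE_N = re.compile(r"(?<!C\(=O\))N(?=\)|$)")


def protonation_states(smiles):
    """Return every combination of charged and neutral sites."""
    sites = [(m.start(), "[O-]") for m in ACID_O.finditer(smiles)]
    sites += [(m.start(), "[NH3+]") for m in AMINE_N.finditer(smiles)]
    states = []
    for choice in product([False, True], repeat=len(sites)):
        chars = list(smiles)  # each site is exactly one character wide
        for (pos, charged), use in zip(sites, choice):
            if use:
                chars[pos] = charged
        states.append("".join(chars))
    return states


for name, smi in AMINO_ACIDS.items():
    print(name)
    print("\n".join(protonation_states(smi)))
    print()
```

String matching like this is fragile, because the same molecule can be written as many different SMILES strings. A cheminformatics toolkit that works on the molecular graph would be the more robust choice for the full project.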

The project
4. (optional) Figure out how to generate a file containing SMILES strings from a file containing names. The best way is probably bash.  Get inspiration here, here, and here.

5. Generate all possible protonation state SMILES for the molecules in Table 2 in this paper. If you completed step 4 you can use tools like https://pdftables.com/ to generate a file with the names.

6. Repeat for Tables 1 and 3.

7. (optional) The neutral forms of the side-chain groups of the amino acids histidine and arginine have tautomers. Generate SMILES for all tautomers (InChI might help you identify tautomers).

8. (optional) Do any of the molecules in steps 5 and 6 have tautomers? If so, generate SMILES for all tautomers.
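
Step 4 suggests bash, but the same lookup can be sketched in Python. The example below assumes the NCI/CADD Chemical Identifier Resolver is used as the name-to-SMILES service; that choice, the function names, and the file names in the usage comment are my assumptions, not necessarily what the links in step 4 point to. It needs network access when name_to_smiles is actually called.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed service: the NCI/CADD Chemical Identifier Resolver.
BASE_URL = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(name):
    """Build the lookup URL for one chemical name (spaces are escaped)."""
    return f"{BASE_URL}/{quote(name)}/smiles"


def name_to_smiles(name):
    """Fetch the SMILES string for a name; raises an error if lookup fails."""
    with urlopen(resolver_url(name)) as response:
        return response.read().decode().strip()


def convert_file(names_path, smiles_path):
    """Read one name per line and write one 'name SMILES' pair per line."""
    with open(names_path) as src, open(smiles_path, "w") as dst:
        for line in src:
            name = line.strip()
            if name:
                dst.write(f"{name} {name_to_smiles(name)}\n")


# Example usage (hypothetical file names, requires network access):
# convert_file("names.txt", "smiles.txt")
```

Names that the service does not recognise will raise an exception in this sketch, so a real script would probably want to catch that and log the failed names.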

Some code snippets to get you started



This work is licensed under a Creative Commons Attribution 4.0
