
Molecular Descriptors

INTRODUCTION
Molecular descriptors are numerical values that characterize properties of molecules.
Examples:
- Physicochemical properties (empirical)
- Values from algorithms, such as 2D fingerprints

Descriptors vary in the complexity of the encoded information and in compute time.

Descriptors for Large Data Sets
- Descriptors representing properties of complete molecules. Examples: logP, molar refractivity
- Descriptors calculated from 2D graphs. Examples: topological indexes, 2D fingerprints
- Descriptors requiring 3D representations. Example: pharmacophore descriptors

DESCRIPTORS CALCULATED FROM 2D STRUCTURES


Simple counts of features
- Lipinski Rule of Five (H bonds, MW, etc.)
- Number of ring systems
- Number of rotatable bonds

Not likely to discriminate sufficiently when used alone; combined with other descriptors for best effect.

Physicochemical Properties
Hydrophobicity
- LogP: the logarithm of the partition coefficient between n-octanol and water
- ClogP (Leo and Hansch): a fragment-based estimate whose fragment values are derived from a small set of simple molecules
- BioByte: http://www.biobyte.com/
- Daylight's MedChem Help page: http://www.daylight.com/dayhtml/databases/medchem/medchemhelp.html
- Isolating carbon: a carbon that is not doubly or triply bonded to a heteroatom

ACD Labs Calculated Properties

http://www.acdlabs.com
- ACD Labs values are now incorporated into the CAS Registry File for millions of compounds
- I-Lab: http://ilab.acdlabs.com/
  - Name generation
  - NMR prediction
  - Physical property prediction

Molar Refractivity

$$MR = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{MW}{d}$$

where $n$ is the refractive index, $d$ is the density, and $MW$ is the molecular weight. Measures the steric bulk of a molecule.
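The formula above translates directly into a short function. This is a minimal sketch: the function name and the input values for water are illustrative assumptions, not taken from the slides, and the result is in cm³/mol only if the density is given in g/cm³.

```python
def molar_refractivity(n: float, mw: float, density: float) -> float:
    """Molar refractivity: ((n^2 - 1) / (n^2 + 2)) * (MW / d)."""
    if density <= 0:
        raise ValueError("density must be positive")
    n2 = n * n
    return (n2 - 1.0) / (n2 + 2.0) * (mw / density)


# Approximate, assumed values for liquid water (refractive index, g/mol, g/cm^3).
print(round(molar_refractivity(1.333, 18.015, 0.997), 2))
```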

Topological Indexes
Single-valued descriptors calculated from the 2D graph of the molecule
- Characterize structures according to size, degree of branching, and overall shape
- Example: the Wiener index counts the number of bonds on the shortest path between each pair of atoms and sums these distances over all pairs

Wiener Index
- Add up all the off-diagonal elements of the distance matrix and divide by 2 (because the matrix is symmetrical)
- The Wiener index correlates well with the boiling points of alkanes
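As a small illustration of the "sum the off-diagonal elements and halve" recipe, the sketch below takes a distance matrix as a list of lists. The two hard-coded matrices are assumed examples (the carbon skeletons of n-butane and isobutane), not data from the slides.

```python
def wiener_index(dist):
    """Sum all off-diagonal entries of a symmetric distance matrix, then halve."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return total // 2


# Carbon skeletons only (hydrogens suppressed); distances are counted in bonds.
N_BUTANE = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
ISOBUTANE = [[0, 1, 1, 1], [1, 0, 2, 2], [1, 2, 0, 2], [1, 2, 2, 0]]

print(wiener_index(N_BUTANE), wiener_index(ISOBUTANE))  # 10 9
```

The branched isomer gets the smaller value, which is the kind of size and branching information the index is meant to capture.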

Zagreb Index
- For each non-hydrogen atom, count its connections to other non-hydrogen atoms (regardless of bond order), square that count, and sum the squares over all atoms
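A minimal sketch of this calculation, assuming the molecule is given as a list of bonds between numbered non-hydrogen atoms (the function name and the two example skeletons are illustrative choices):

```python
def zagreb_index(bonds, n_atoms):
    """Sum of squared degrees over all non-hydrogen atoms (bond order ignored)."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sum(d * d for d in degree)


N_BUTANE_BONDS = [(0, 1), (1, 2), (2, 3)]   # degrees 1, 2, 2, 1
ISOBUTANE_BONDS = [(0, 1), (0, 2), (0, 3)]  # degrees 3, 1, 1, 1

print(zagreb_index(N_BUTANE_BONDS, 4), zagreb_index(ISOBUTANE_BONDS, 4))  # 10 12
```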

Topological Indexes: Others
Molecular Connectivity Indexes
- Randić branching index
  - Defines the degree of an atom as the number of adjacent non-hydrogen atoms
  - The bond connectivity value is the reciprocal of the square root of the product of the degrees of the two atoms in the bond
  - The branching index is the sum of the bond connectivities over all bonds in the molecule
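The three steps of the branching index (atom degrees, bond connectivities, sum over bonds) can be sketched as follows. The bond-list input format and the example skeletons are assumptions made for illustration.

```python
import math


def branching_index(bonds, n_atoms):
    """Sum over bonds of 1 / sqrt(degree(a) * degree(b)), hydrogens suppressed."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in bonds)


n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (0, 2), (0, 3)]

print(round(branching_index(n_butane, 4), 4))   # 1.9142
print(round(branching_index(isobutane, 4), 4))  # 1.7321
```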

- Chi indexes introduce valence values to encode sigma, pi, and lone-pair electrons

Kappa Shape Indexes
- Characterize aspects of molecular shape
- Compare the molecule with the extreme shapes possible for that number of atoms
- The extremes range from linear molecules to the completely connected graph

2D Fingerprints
Two types:
- One based on a fragment dictionary
  - Each bit position corresponds to a specific substructure fragment
  - Fragments that occur infrequently may be more useful
- Another based on hashed methods
  - Not dependent on a pre-defined dictionary
  - Any fragment can be encoded

Originally designed for substructure searching, not as molecular descriptors
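To make the hashed approach concrete, here is a toy sketch. It is not any vendor's fingerprint: it assumes linear paths of up to three bonds, ignores bond orders, and uses CRC32 as a stand-in hash. The function name and parameters are hypothetical.

```python
import zlib


def hashed_fingerprint(atoms, bonds, max_bonds=3, n_bits=64):
    """Return the set of bit positions set by all linear paths up to max_bonds."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    bits = set()

    def walk(path):
        labels = [atoms[i] for i in path]
        # Use one canonical direction so A-B-C and C-B-A hash identically.
        text = "-".join(min(labels, labels[::-1]))
        bits.add(zlib.crc32(text.encode("ascii")) % n_bits)
        if len(path) - 1 < max_bonds:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    return bits


# Ethanol skeleton numbered two different ways gives the same bit set.
fp_a = hashed_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = hashed_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(fp_a == fp_b)  # True
```

No dictionary is consulted: any path that occurs is encoded, at the cost that different fragments can collide on the same bit.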

Topological indexes

- another type of numerical descriptor that can be calculated from a 2D structure diagram
- there are many different topological indexes
- some are designed to represent structural features such as branching or shape
- they can be calculated from connection tables, or closely related formats, e.g. the distance matrix
  - an N x N table showing the distance (in bonds) between each pair of atoms
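A distance matrix can be derived from a connection table with a breadth-first search from each atom. The sketch below assumes a connected graph and uses neighbour lists read from the 13-atom example in this section; the dictionary layout and function name are illustrative choices.

```python
from collections import deque

# Neighbour lists for the 13-atom hydrogen-suppressed example (1-based numbering).
NEIGHBOURS = {
    1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5, 6], 5: [4], 6: [4, 7],
    7: [6, 8, 12], 8: [7, 9], 9: [8, 10], 10: [9, 11, 13],
    11: [10, 12], 12: [11, 7], 13: [10],
}


def distance_matrix(neighbours):
    """Shortest bond-count distance between every pair of atoms (BFS per atom)."""
    atoms = sorted(neighbours)
    matrix = []
    for start in atoms:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nxt in neighbours[atom]:
                if nxt not in dist:
                    dist[nxt] = dist[atom] + 1
                    queue.append(nxt)
        matrix.append([dist[a] for a in atoms])
    return matrix


D = distance_matrix(NEIGHBOURS)
print(D[0])  # [0, 1, 2, 2, 3, 3, 4, 5, 6, 7, 6, 5, 8]
```

The printed row agrees with the first row of the example matrix, apart from the diagonal, where the slide shows the element symbol instead of 0.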

Redundant Connection Table
(each bond appears twice, once under each of its atoms; bond order in parentheses)

  Atom  Element  Hydrogens  Neighbours (bond order)
   1.   O        1          2 (1)
   2.   C        0          1 (1), 3 (2), 4 (1)
   3.   O        0          2 (2)
   4.   C        1          2 (1), 5 (1), 6 (1)
   5.   N        2          4 (1)
   6.   C        2          4 (1), 7 (1)
   7.   C        0          6 (1), 8 (2), 12 (1)
   8.   C        1          7 (2), 9 (1)
   9.   C        1          8 (1), 10 (2)
  10.   C        0          9 (2), 11 (1), 13 (1)
  11.   C        1          10 (1), 12 (2)
  12.   C        1          11 (2), 7 (1)
  13.   O        1          10 (1)

Distance Matrix
(element symbols shown on the diagonal)

        1  2  3  4  5  6  7  8  9 10 11 12 13
   1.   O  1  2  2  3  3  4  5  6  7  6  5  8
   2.   1  C  1  1  2  2  3  4  5  6  5  4  7
   3.   2  1  O  2  3  3  4  5  6  7  6  5  8
   4.   2  1  2  C  1  1  2  3  4  5  4  3  6
   5.   3  2  3  1  N  2  3  4  5  6  5  4  7
   6.   3  2  3  1  2  C  1  2  3  4  3  2  5
   7.   4  3  4  2  3  1  C  1  2  3  2  1  4
   8.   5  4  5  3  4  2  1  C  1  2  3  2  3
   9.   6  5  6  4  5  3  2  1  C  1  2  3  2
  10.   7  6  7  5  6  4  3  2  1  C  1  2  1
  11.   6  5  6  4  5  3  2  3  2  1  C  1  2
  12.   5  4  5  3  4  2  1  2  3  2  1  C  3
  13.   8  7  8  6  7  5  4  3  2  1  2  3  O

Kier Shape Indexes
- Several indexes based on the number of atoms ($N$) and the numbers of paths of one, two, and three bonds in the graph ($^1P$, $^2P$, $^3P$; $^1P$ is simply the number of bonds)

$$^1\kappa = \frac{N (N-1)^2}{(^1P)^2}$$

$$^2\kappa = \frac{(N-1)(N-2)^2}{(^2P)^2}$$

$$^3\kappa = \frac{(N-1)(N-3)^2}{(^3P)^2} \quad \text{(if } N \text{ is odd)}$$

$$^3\kappa = \frac{(N-3)(N-2)^2}{(^3P)^2} \quad \text{(if } N \text{ is even)}$$

- alpha-modified kappa indexes can be generated, where $N$ is adjusted to take into account the sizes of atoms relative to sp3-hybridised carbon
- a molecular flexibility index is derived from these: $\phi = {}^1\kappa_\alpha \, {}^2\kappa_\alpha / N$

Molecular Connectivity Indexes
- a whole series of indexes, developed by Kier and Hall in the late 1970s, following earlier work by Randić
- involves identifying all possible subgraphs of different sizes in the molecule
- size of subgraph determines the order of the index
  - 0-bond subgraphs give the $^0\chi$ index

  - 1-bond subgraphs give the $^1\chi$ index
  - 2-bond subgraphs give the $^2\chi$ index
  - 3-bond subgraphs give the $^3\chi$ indexes, etc.
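Counting subgraphs of a given size is also what the Kier kappa shape formulas earlier in this section need. The sketch below assumes that P in each kappa formula is the number of paths of one, two, and three bonds respectively (the usual Kier definition). It omits the alpha modification, and the adjacency-dictionary input is an illustrative choice.

```python
def count_paths(adj, length):
    """Count distinct simple paths containing `length` bonds."""
    def extend(path):
        if len(path) - 1 == length:
            return 1
        return sum(extend(path + [n]) for n in adj[path[-1]] if n not in path)

    # Every path is found once from each end, so halve the total.
    return sum(extend([a]) for a in adj) // 2


def kappa_indexes(adj):
    """Return (kappa1, kappa2, kappa3); None where the path count is zero."""
    n = len(adj)
    p1, p2, p3 = (count_paths(adj, m) for m in (1, 2, 3))
    k1 = n * (n - 1) ** 2 / p1 ** 2 if p1 else None
    k2 = (n - 1) * (n - 2) ** 2 / p2 ** 2 if p2 else None
    if not p3:
        k3 = None
    elif n % 2:  # N odd
        k3 = (n - 1) * (n - 3) ** 2 / p3 ** 2
    else:        # N even
        k3 = (n - 3) * (n - 2) ** 2 / p3 ** 2
    return k1, k2, k3


N_BUTANE = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ISOBUTANE = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(kappa_indexes(N_BUTANE))   # (4.0, 3.0, 4.0)
print(kappa_indexes(ISOBUTANE))  # kappa3 is None: no three-bond paths
```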

Molecular Connectivity Indexes
At higher orders the subgraphs are divided into
- path subgraphs (only 1- and 2-connected nodes)
- cluster subgraphs (no 2-connected nodes)
- path-cluster subgraphs (any sort of node)
- chain subgraphs (involving rings)

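Molecular Connectivity Indexes
- For each subgraph order and type the index is calculated as

$$\chi = \sum_{\text{subgraphs}} \; \prod_{i} \delta_i^{-1/2}$$

  where $\delta_i$ is the number of connections of node $i$ in the subgraph; the product runs over the nodes of one subgraph and the sum over all subgraphs of that order and type
- molecular connectivity indexes also exist in a valence-modified form that takes into account the heteroatoms present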

Molecular Connectivity Indexes
- many experiments have been done to find correlations between them (and other indexes) and measured physico-chemical or biological properties
- this uses a statistical technique called multiple regression analysis to build an equation of the form

  Property = c0 + c1x1 + c2x2 + c3x3 + c4x4 + c5x5 + ...

  where x1, x2, etc. are topological indexes and c0, c1, c2, etc. are constants
- good correlations have often been obtained

What do topological indexes mean?
- Good question!
- it is often difficult to assign a chemical meaning to, e.g., the order-6, path-cluster, valence-modified Kier index
- topological indexes effectively encode the same information as fingerprint fragments, in a less obvious way but one which can be processed numerically

Atom-Pair Descriptors
- Encode all pairs of atoms in a molecule
- Include the length of the shortest bond-by-bond path between them
- Each atom is described by its elemental type plus the number of non-hydrogen atoms attached to it and its number of π-bonding electrons

BCUT Descriptors
- Designed to encode atomic properties that govern intermolecular interactions
- Used in diversity analysis
- Encode atomic charge, atomic polarizability, and atomic hydrogen bonding ability

BCUT descriptors
- A type of topological index with a complex history
  - B = Frank Burden
  - C = Chemical Abstracts Service
  - UT = University of Texas (Bob Pearlman)
- based on 3D structure of molecule
- 6 different indexes generated for each molecule
- often used as descriptors for cell-based partitioning of chemical space (6 descriptors = 6 dimensions)

DESCRIPTORS BASED ON 3D REPRESENTATIONS
- Require the generation of 3D conformations
- Can be computationally time-consuming with large data sets
- Usually must take into account conformational flexibility
- 3D fragment screens encode spatial relationships between atoms, ring centroids, and planes

Pharmacophore Keys & Other 3D Descriptors
- Based on atoms or substructures thought to be relevant for receptor binding
- Typically include hydrogen bond donors and acceptors, charged centers, aromatic ring centers, and hydrophobic centers
- Others: 3D topographical indexes, geometric atom pairs, quantum mechanical calculations for HOMO and LUMO

DATA VERIFICATION AND MANIPULATION
- Data spread and distribution
  - Coefficient of variation (standard deviation divided by the mean)
- Scaling (standardization): making sure that each descriptor has an equal chance of contributing to the overall analysis
- Correlations
- Reducing the dimensionality of a data set: Principal Components Analysis
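The coefficient of variation and one common form of standardization (subtract the mean, divide by the standard deviation) can be sketched with the standard library. Using the population standard deviation is an assumption here, and the sample values are made up for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return pstdev(values) / m


def autoscale(values):
    """Rescale a descriptor column to zero mean and unit standard deviation."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # constant descriptor carries no information
    return [(v - m) / s for v in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up descriptor values
print(coefficient_of_variation(column))  # 0.4
print(autoscale(column))
```

After scaling, a descriptor measured in hundreds (such as molecular weight) and one measured in single digits (such as a ring count) contribute on the same footing.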

Chemical Structure Representation and Search Systems

Topics to be Covered
- Clustering: identifying classes of molecules similar to each other, but different to those in other classes
- Topological indexes: numbers that can be calculated from connection tables
- Property prediction: predicting physicochemical or biological properties directly from connection tables
- The Drug Discovery Process: virtual screening

Cluster Analysis
- process of putting molecules (or other objects) into classes, based on similarity
  - molecules in the same cluster are similar to each other
  - molecules in different clusters are different from each other
- many different methods and algorithms
  - different clustering methods will result in different clusters, with different relationships between them
  - different algorithms can be used to implement the same method (some may be more efficient than others)

Downs, G. M., Barnard, J. M., Rev. Comput. Chem., 18 (2002)

Hierarchical and non-hierarchical
- A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not

Hierarchical Agglomerative
- the hierarchy is built from the bottom upwards
- several different methods and algorithms
- basic Lance-Williams algorithm (common to all methods)
  - starts with a table of similarities between all pairs of items
  - at each step the most similar pair of molecules (or previously-formed clusters) are merged together
  - until everything is in one big cluster
- methods differ in how they determine the similarity between clusters
  - single link chooses clusters whose closest members are most similar
  - complete link chooses clusters whose furthest members are most similar
  - other methods (e.g. Group-average method and Ward's method) use some sort of average member

Hierarchical Agglomerative
- Lance-Williams algorithm is slow
  - O(N²) to generate the pairwise similarity table initially
  - this table must be updated N times, once for each merge (agglomeration) of clusters
  - overall time requirements are O(N³)
- more efficient algorithms can be used for some methods
  - single link can be O(N log N) with a k-D tree algorithm
  - Ward's method and Group-Average method can be O(N²) using Murtagh's Reciprocal Nearest-Neighbour algorithm
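The difference between single link and complete link is easiest to see in a naive implementation. The sketch below is not the Lance-Williams update: for clarity it recomputes each inter-cluster distance from the member pairs, and it works on distances (small means similar) rather than similarities. The one-dimensional points are an invented example.

```python
def agglomerate(dist, n_clusters, linkage="single"):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    # single link: distance of closest members; complete link: furthest members
    pick = min if linkage == "single" else max
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = pick(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [0, 10, 21, 33, 60, 70]
dist = [[abs(p - q) for q in points] for p in points]

print(agglomerate(dist, 3, "single"))    # [[0, 1, 2], [3], [4, 5]]
print(agglomerate(dist, 3, "complete"))  # [[0, 1], [2, 3], [4, 5]]
```

Single link chains the third point onto the first cluster, while complete link prefers the more compact pairing, so the same data gives different clusters under the two methods.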

Hierarchical Divisive
- the hierarchy is built from the top downwards
- at each step a cluster is chosen to divide, until each cluster has only one member
- various ways of choosing the next cluster to divide
  - one with most members
  - one with least similar pair of members
  - etc.
- various ways of dividing it
  - using a single descriptor (e.g. a fingerprint bit) [monothetic]
  - using all descriptors (based on similarities between pairs of members) [polythetic]
- most polythetic methods are slow

Non-hierarchical methods
- usually faster than hierarchical
- several different methods, e.g. the Leader algorithm
  - make a single pass through the dataset (O(N))
  - if a molecule is similar enough (need to define a threshold) to an existing cluster, it joins that cluster
  - otherwise it starts (leads) a new cluster
  - results depend on the order of processing
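A minimal sketch of the Leader algorithm follows. Two assumptions are made that the slides do not specify: similarity to a cluster is measured against its first member (the leader), and similarity is the Tanimoto coefficient on sets of fingerprint bits. The three fingerprints are invented.

```python
def tanimoto(a, b):
    """Shared bits divided by total distinct bits."""
    return len(a & b) / len(a | b)


def leader_cluster(items, similarity, threshold):
    """Single pass: join the first cluster whose leader is similar enough."""
    clusters = []  # first member of each list is that cluster's leader
    for item in items:
        for cluster in clusters:
            if similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])  # item leads a new cluster
    return clusters


FPS = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {3, 4, 5, 6}}


def sim(x, y):
    return tanimoto(FPS[x], FPS[y])


print(leader_cluster(["A", "B", "C"], sim, 0.5))  # [['A', 'B'], ['C']]
print(leader_cluster(["B", "A", "C"], sim, 0.5))  # [['B', 'A', 'C']]
```

The two calls differ only in processing order, yet give two clusters in one case and one in the other.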

Nearest neighbour methods
- non-hierarchical
- best-known example is the Jarvis-Patrick method
  - identify the top k (e.g. 20) nearest neighbours for each molecule
  - two molecules join the same cluster if they have at least kmin of their top k nearest neighbours in common
- very popular for chemical applications from the mid 1980s
- rather less popular now
  - tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)
- some variations have been tried
  - variable-length nearest-neighbour lists (threshold similarity)
  - reclustering of singletons
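The shared-neighbour test can be sketched as below. The `require_mutual` option is an addition from the standard formulation of the method (each molecule must also appear in the other's neighbour list) and is not stated in the text above. The points, k, and kmin are invented for illustration.

```python
def jarvis_patrick(dist, k, k_min, require_mutual=True):
    """Cluster items that share at least k_min of their k nearest neighbours."""
    n = len(dist)
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(others[:k]))

    parent = list(range(n))  # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if require_mutual and not (j in neighbours[i] and i in neighbours[j]):
                continue
            if len(neighbours[i] & neighbours[j]) >= k_min:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


points = [0, 1, 2, 3, 20, 21, 22, 23, 50]
dist = [[abs(p - q) for q in points] for p in points]

print(jarvis_patrick(dist, k=3, k_min=2))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8]]
```

With the mutual condition the outlier at 50 stays a singleton; with `require_mutual=False` it is absorbed into the second cluster because its neighbour list overlaps with that cluster's members.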

Relocation methods
- non-hierarchical
- clusters are initialised (sometimes randomly)
- iterative refinement then relocates molecules between clusters to improve some objective function
- simplest and most common example is K-means
  - select k random molecules to act as cluster seeds (k is the required number of clusters)
  - assign each remaining molecule to the closest seed
  - calculate the centroid (mean) of each cluster
  - relocate molecules to the nearest cluster centroid if necessary
  - recalculate centroids and repeat until no further changes
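The K-means steps above can be sketched in a few lines. To keep the example reproducible the seeds are passed in explicitly instead of being chosen at random, and the two-dimensional points stand in for descriptor vectors; both are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, seeds, max_iter=100):
    """Return (assignment, centroids) after relocating until nothing changes."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no molecule moved: converged
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, seeds=[(0, 0), (0, 1)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Both seeds start in the same group here, and the relocation step still recovers the two obvious clusters. Less favourable seeds can leave the method in a poorer local optimum.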

K-means clustering
- K-means has the advantage of being fast (O(Nk)) and is popular with statisticians
- however it has several disadvantages
  - sensitive to the initial choice of seeds (can try non-random sets of seeds)
  - can converge to a local (rather than global) optimum
  - tends to produce only spherical clusters of similar size
  - difficult to decide what value of k to choose

Overlapping and fuzzy clusters
- some clustering methods produce overlapping clusters, in which some molecules are members of more than one cluster
- in fuzzy clustering, each molecule has partial membership of all clusters
  - degree of membership in each cluster is in the range 0.0 to 1.0
  - sum of membership over all clusters is 1.0
- fuzzy clustering is arguably a better representation of the real world, but makes it difficult to make decisions

Which method is best?
- as with similarity measures and structure descriptors, there is no definite agreement
  - this is probably why there are so many methods
- empirical property-prediction experiments have been done to evaluate different methods
  - predicted property value is the average of the other members of the same cluster (Sheffield University work); calculate the correlation coefficient between observed and predicted properties
  - active and inactive molecules should be in separate clusters (Abbott Laboratories work)

Which method is best?

- Sheffield University work (mid-1980s) showed Ward's (hierarchical agglomerative) and Jarvis-Patrick methods gave the best predictions
  - at that time Jarvis-Patrick was significantly faster
- Joint CAS/Sheffield/BCI study in the early 1990s showed Ward's and minimum diameter (hierarchical divisive) methods significantly better than Jarvis-Patrick
- similar conclusions in the Abbott study (mid 1990s)
- more recent work at Eli Lilly recommended K-means
  - certainly better for very large datasets, because of speed
- still a very active area of research

How many clusters to choose?
- Hierarchical methods allow the user to choose any slice across the hierarchy
  - but what level is the best one to choose?
  - there are methods that give a score to each level
  - the aim is to get the fewest and tightest clusters

How many clusters to choose?
- Non-hierarchical methods
  - the Jarvis-Patrick method decides for itself on the basis of user-selected k and kmin
  - with other methods (e.g. k-means) it is more difficult: what is the natural number of clusters?

The natural number of clusters

What is clustering used for?
- compound acquisition
  - purchase compounds from clusters that contain no compounds from existing collections
- high-throughput screening
  - choose one compound per cluster in the first round
  - test other compounds from clusters where hits are found
- homogeneous subsets for QSAR
- diverse subset selection from combinatorial libraries
  - maximise different clusters represented; penalise over-representation of individual clusters
- classification of new compounds
  - which existing cluster is a new compound closest to?

A clustering of clustering methods

Descriptor calculation
- various numerical descriptors can be calculated for chemical structures
  - molecular weight
  - counts of features: hydrogen bond donors/acceptors, aromatic rings, rotatable bonds, etc.
- these can be used in similarity searching and clustering

Property Prediction
- it is often useful to be able to calculate a physico-chemical property for a compound from its structure
- regression equations have been used to do this from topological indexes, but usually only for limited sets of molecules
- it would be better to have a more general method

logP
- some important properties have had a lot of attention in this respect
- octanol-water partition coefficient has been found very useful in predicting the bioavailability of a drug
  - it needs to be soluble enough in lipid to be able to cross cell membranes
  - but soluble enough in water not to get stuck there
- many methods have been proposed for calculating a good estimate from the structure

Leo, A. J. Chemical Reviews, 1993, 93, 1281-1306

logP calculation
- fragment-based methods (ClogP)
  - pioneered by Corwin Hansch and Al Leo (Pomona College)
  - identify large fragments, whose contribution to the logP value is known from their occurrence in other compounds with measured logP
  - large training set of compounds with accurately-measured logP (the Starlist)
  - works very well if the test compound has the right fragments; problems arise if the test compound contains fragments that are missing from the training set

logP calculation
- atom-based methods (AlogP, XlogP, SlogP)
  - pioneered by Gordon Crippen (Univ. Michigan)
  - based on identifying a series of atom types in the molecule
    - essentially, small atom-centred fragments
    - usually 60-200 such fragments are involved
  - each atom type is assigned a numerical value
  - logP is obtained by adding the values for the atom types present in the test molecule
  - atom-type values are obtained by regression analysis, based on a set of compounds with measured logP
  - sometimes some extra correction factors are used too

Atom-based property calculations

- atom-based principle has also been used for other properties
  - molar refractivity
  - charged partial surface area
  - intestinal absorption
  - etc.

The Drug Discovery Process
- pharmaceutical companies are in the business of identifying compounds that may be useful new drugs
- tens or hundreds of thousands of compounds are made and tested every year (screening)
  - tests are usually simple binding assays (does the molecule bind to a target protein?)
- testing is done in two stages
  - Lead Generation (find a compound that binds)
  - Lead Optimisation (find a compound that binds better)
- chemical informatics techniques are important at both these stages

Drug development
- Patents will be applied for as soon as a good compound (or class of compounds) is identified
  - need to get in before the competition
  - patent life (20 years) starts counting down from here
- Much development work has still to be done
  - animal tests
  - clinical trials (several phases)
  - regulatory requirements
  - many drugs may fail during the process
- Patent may have only 10 years left to run by the time a new drug is marketed

The need for early attrition

- Only a tiny proportion of compounds make it all the way through this process
- If a potential new drug is going to fail, it is better that it fails early, before too much money has been spent on it
- If you can identify the failures before you even synthesise them, so much the better: virtual screening

Three stages of screening
- in silico (in silicon)
  - virtual screening, entirely in the computer
- in vitro (in glass)
  - uses test tube models of biological systems
  - enzyme assays etc.
  - requires real compounds
- in vivo (in life)
  - compounds tested in living organisms

Virtual Screening
- Often based on the concept of drug-likeness
  - do these compounds actually look like drugs?
- need to calculate appropriate properties
  - Is the compound likely to have suitable properties for Absorption, Distribution, Metabolism, Excretion, Toxicity (ADMET or ADME/Tox)?
- suitable property ranges are identified by analysing databases of existing drugs

Lipinski Rule of Five
- Widely used set of properties for virtual screening
- Developed at Pfizer, 1997
  - molecular weight ≤ 500
  - logP ≤ 5.0
  - ≤ 5 hydrogen bond donors (number of OH and NH groups)
  - ≤ 10 hydrogen bond acceptors (number of O and N atoms)
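As a virtual-screening filter the rule reduces to four threshold checks. The sketch below counts violations, treating values above 500, 5, 5, and 10 as violations. The function name is hypothetical and the property values in the example calls are invented, since the properties themselves would have to be computed or looked up elsewhere.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four Rule of Five limits a compound exceeds."""
    return sum([
        mol_weight > 500,
        logp > 5.0,
        h_donors > 5,      # OH and NH groups
        h_acceptors > 10,  # O and N atoms
    ])


# Invented property values for two hypothetical compounds.
print(lipinski_violations(180.2, 1.2, 1, 4))    # 0
print(lipinski_violations(650.0, 6.1, 2, 12))   # 3
```

Returning a count rather than a yes/no answer lets the caller decide how strict to be, for example rejecting only compounds with two or more violations.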

Lead generation
- when testing a large number of compounds to identify a new lead, it is obviously desirable to have them as different from each other as possible
- pharmaceutical companies purchase large numbers of compounds from 3rd party suppliers (often Eastern European) to test
- they also synthesise combinatorial libraries of compounds
- chemical diversity is an important feature of such compound collections and libraries
  - the idea is to cover as much of chemical space as possible

Lead optimisation
- when a lead compound has been identified, the next stage is to find compounds that are similar to it, which might bind even better
- this can involve similarity searching to find compounds previously made, or available commercially for purchase
- in later stages, as the activity of the compound becomes better understood, medicinal chemists will make specific changes to the molecule which they hope will improve its binding affinity

Conclusions

- Clustering is a useful technique for identifying classes of molecules in a dataset
  - there are many different methods and algorithms
  - some are faster or more effective than others
- Topological indices are numbers that can be calculated from structures represented as connection tables
  - there are many different indices available, some of which are designed to represent gross features like shape and branching
- Topological indices can be used in regression equations to predict properties of a structure
  - other methods are available for property prediction, based on summing scores for different fragments or atom types
- Calculated properties can be used in virtual screening

Conclusions
- Many computer techniques are available to manipulate chemical structure representations
  - some have inherent limitations but are nonetheless useful
- Structure and substructure search algorithms are among the most important and useful
- There are useful techniques for calculating estimates of physico-chemical and other properties
- Identifying structurally similar molecules can lead to identifying molecules with similar biological activities
- Chemoinformatics is now a vital part of the drug discovery process in the pharmaceutical industry
