INTRODUCTION
Molecular descriptors are numerical values that characterize properties of molecules.
Examples:
- Physicochemical properties (empirical)
- Values from algorithms, such as 2D fingerprints
Descriptors for Large Data Sets
- Descriptors representing properties of complete molecules
  - Examples: LogP, Molar Refractivity
- Not likely to discriminate sufficiently when used alone
- Combined with other descriptors for best effect
Physicochemical Properties
Hydrophobicity
- LogP: the logarithm of the partition coefficient between n-octanol and water
- ClogP (Leo and Hansch): based on a small set of values from a small set of simple molecules
  - BioByte: http://www.biobyte.com/
  - Daylight's MedChem Help page: http://www.daylight.com/dayhtml/databases/medchem/medchemhelp.html
  - Isolating carbon: one not doubly or triply bonded to a heteroatom
- ACD Labs: http://www.acdlabs.com
  - ACD Labs values are now incorporated into the CAS Registry File for millions of compounds
  - I-Lab (http://ilab.acdlabs.com/): name generation, NMR prediction, physical property prediction
Molar Refractivity
$$MR = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{MW}{d}$$
where n is the refractive index, d is density, and MW is molecular weight.
Measures the steric bulk of a molecule.
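A minimal sketch of the molar refractivity formula above. The function name and the input values for water are assumptions chosen for illustration (approximate textbook values, not data from these slides).

```python
def molar_refractivity(n: float, mw: float, density: float) -> float:
    """Return MR = ((n^2 - 1) / (n^2 + 2)) * (MW / d)."""
    if density <= 0:
        raise ValueError("density must be positive")
    n2 = n * n
    return (n2 - 1.0) / (n2 + 2.0) * (mw / density)


# Assumed, approximate inputs for liquid water (illustration only).
mr_water = molar_refractivity(n=1.333, mw=18.015, density=0.997)
print(f"MR(water) = {mr_water:.2f} cm^3/mol")
```

With MW in g/mol and d in g/cm^3 the result has units of cm^3/mol, which is why MR is read as a measure of molecular volume.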
Topological Indexes
- Single-valued descriptors calculated from the 2D graph of the molecule
- Characterize structures according to size, degree of branching, and overall shape
- Example: the Wiener Index counts the number of bonds between pairs of atoms and sums the distances between all pairs
Wiener Index
- Add up all the off-diagonal elements of the distance matrix and divide by 2 (because the matrix is symmetrical)
- The Wiener index correlates well with the boiling points of alkanes
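The summation described above can be sketched as follows. The adjacency-list input (one list of neighbour indexes per non-hydrogen atom) and the function names are assumptions made for this illustration.

```python
from collections import deque


def distance_matrix(adjacency):
    """Bond-count distances between all atom pairs, by breadth-first search."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in seen:
                    seen[nbr] = seen[atom] + 1
                    queue.append(nbr)
        for atom, d in seen.items():
            dist[start][atom] = d
    return dist


def wiener_index(adjacency):
    """Half the sum of all off-diagonal distance-matrix elements."""
    dist = distance_matrix(adjacency)
    return sum(sum(row) for row in dist) // 2


# Carbon skeletons only (hydrogens suppressed).
n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1], [0, 2, 3], [1], [1]]
print(wiener_index(n_butane), wiener_index(isobutane))
```

The branched isomer gives the smaller value, which is the kind of size and branching information the index is meant to capture.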
Zagreb Index
- For each non-hydrogen atom, take the square of the number of connections to other non-hydrogen atoms (regardless of bond order), then add up these squares over all atoms
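A short sketch of this calculation, assuming the molecule is supplied as a hydrogen-suppressed adjacency list (an input format chosen here for illustration):

```python
def zagreb_index(adjacency):
    """Sum, over all non-hydrogen atoms, of the squared number of neighbours."""
    return sum(len(neighbours) ** 2 for neighbours in adjacency)


# Carbon skeletons only; bond order is ignored, as in the definition.
n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1], [0, 2, 3], [1], [1]]
print(zagreb_index(n_butane), zagreb_index(isobutane))
```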
Topological Indexes: Others
Molecular Connectivity Indexes
- Randić (et al.) branching index
  - Defines the degree of an atom as the number of adjacent non-hydrogen atoms
  - The bond connectivity value is the reciprocal of the square root of the product of the degrees of the two atoms in the bond
  - The branching index is the sum of the bond connectivities over all bonds in the molecule
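The three steps of the branching index can be sketched like this. The adjacency-list representation and function name are assumptions for illustration.

```python
import math


def randic_index(adjacency):
    """Sum over bonds of 1 / sqrt(degree(i) * degree(j))."""
    total = 0.0
    for atom, neighbours in enumerate(adjacency):
        for nbr in neighbours:
            if atom < nbr:  # visit each bond once
                degree_product = len(adjacency[atom]) * len(adjacency[nbr])
                total += 1.0 / math.sqrt(degree_product)
    return total


n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1], [0, 2, 3], [1], [1]]
print(round(randic_index(n_butane), 4), round(randic_index(isobutane), 4))
```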
- Chi indexes introduce valence values to encode sigma, pi, and lone pair electrons
Kappa Shape Indexes
- Characterize aspects of molecular shape
- Compare the molecule with the "extreme shapes" possible for that number of atoms
- Range from linear molecules to the completely connected graph
2D Fingerprints
Two types:
- One based on a fragment dictionary
  - Each bit position corresponds to a specific substructure fragment
  - Fragments that occur infrequently may be more useful
- Another based on hashed methods
  - Not dependent on a pre-defined dictionary
  - Any fragment can be encoded
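A hedged sketch of the hashed approach. It assumes the fragments have already been enumerated and written as strings. The fragment notation, the 64-bit length, and the use of MD5 as the hash are all illustrative choices, not the scheme of any particular fingerprint package.

```python
import hashlib


def hashed_fingerprint(fragments, n_bits=64):
    """Set one bit per fragment, at a position derived from a hash of its text."""
    bits = [0] * n_bits
    for fragment in fragments:
        digest = hashlib.md5(fragment.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits[position] = 1  # different fragments may collide on the same bit
    return bits


# Hypothetical fragment strings for an ethanol-like structure.
fp = hashed_fingerprint(["C-C", "C-O", "C-C-O"])
print(sum(fp), "bits set out of", len(fp))
```

Because no dictionary is involved, any fragment string maps to some bit. The price is that two different fragments can set the same bit.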
Topological indexes
- another type of numerical descriptor that can be calculated from a 2D structure diagram
- there are many different topological indexes
  - some are designed to represent structural features such as branching or shape
- they can be calculated from connection tables, or closely related formats, e.g. the distance matrix
  - an N x N table showing the distance (in bonds) between each pair of atoms
 1.   O  1  2  2  3  3  4  5  6  7  6  5  8
 2.   1  C  1  1  2  2  3  4  5  6  5  4  7
 3.   2  1  O  2  3  3  4  5  6  7  6  5  8
 4.   2  1  2  C  1  1  2  3  4  5  4  3  6
 5.   3  2  3  1  N  2  3  4  5  6  5  4  7
 6.   3  2  3  1  2  C  1  2  3  4  3  2  5
 7.   4  3  4  2  3  1  C  1  2  3  2  1  4
 8.   5  4  5  3  4  2  1  C  1  2  3  2  3
 9.   6  5  6  4  5  3  2  1  C  1  2  3  2
10.   7  6  7  5  6  4  3  2  1  C  1  2  1
11.   6  5  6  4  5  3  2  3  2  1  C  1  2
12.   5  4  5  3  4  2  1  2  3  2  1  C  3
13.   8  7  8  6  7  5  4  3  2  1  2  3  O
Kier Shape Indexes
- Several indexes based on the number of atoms (N) and the number of paths of m bonds (P_m) in the graph; P_1 is simply the number of bonds
$\kappa_1 = N (N-1)^2 / P_1^2$
$\kappa_2 = (N-1)(N-2)^2 / P_2^2$
$\kappa_3 = (N-1)(N-3)^2 / P_3^2$ (if N is odd)
$\kappa_3 = (N-3)(N-2)^2 / P_3^2$ (if N is even)
- alpha-modified kappa indexes can be generated, where N is adjusted to take into account the sizes of atoms relative to sp2-hybridised carbons
- a molecular flexibility index is derived from these: $\phi = \kappa_{1\alpha} \kappa_{2\alpha} / N$
Molecular Connectivity Indexes
- a whole series of indexes, developed by Kier and Hall in the late 1970s, following earlier work by Randić
- involves identifying all possible subgraphs of different sizes in the molecule
- the size of the subgraph determines the order of the index
  - 0-bond subgraphs give the ${}^{0}\chi$ index
  - 1-bond subgraphs give the ${}^{1}\chi$ index
  - 2-bond subgraphs give the ${}^{2}\chi$ index
  - 3-bond subgraphs give the ${}^{3}\chi$ indexes, etc.
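The Kier shape index formulas given earlier can be sketched as below. This sketch assumes that P in each formula is the number of paths of one, two, or three bonds respectively (Kier's definition), and it returns NaN when no such path exists. The adjacency-list input and function names are illustrative.

```python
import math


def count_paths(adjacency, length):
    """Count simple paths containing `length` bonds, each path counted once."""
    def extend(path):
        if len(path) == length + 1:
            return 1
        return sum(extend(path + [nbr])
                   for nbr in adjacency[path[-1]] if nbr not in path)
    # Every path is found twice (once from each end), hence the division.
    return sum(extend([start]) for start in range(len(adjacency))) // 2


def kappa_indexes(adjacency):
    """Return (kappa1, kappa2, kappa3) for a hydrogen-suppressed graph."""
    n = len(adjacency)
    p1, p2, p3 = (count_paths(adjacency, m) for m in (1, 2, 3))

    def ratio(numerator, paths):
        return numerator / paths ** 2 if paths else math.nan

    k1 = ratio(n * (n - 1) ** 2, p1)
    k2 = ratio((n - 1) * (n - 2) ** 2, p2)
    if n % 2 == 1:
        k3 = ratio((n - 1) * (n - 3) ** 2, p3)
    else:
        k3 = ratio((n - 3) * (n - 2) ** 2, p3)
    return k1, k2, k3


n_pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
neopentane = [[1, 2, 3, 4], [0], [0], [0], [0]]
print(kappa_indexes(n_pentane))
print(kappa_indexes(neopentane))
```

The linear chain and the star-shaped isomer share the same first-order value but differ sharply at second order, which is where branching shows up.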
- At higher orders the subgraphs are divided into
  - path subgraphs (only 1- and 2-connected nodes)
  - cluster subgraphs (no 2-connected nodes)
  - path-cluster subgraphs (any sort of node)
  - chain subgraphs (involving rings)
- For each subgraph order and type the index is calculated as
$${}^{m}\chi_t = \sum_{s} \prod_{i \in s} \delta_i^{-1/2}$$
  where the sum runs over all subgraphs s of order m and type t, and $\delta_i$ is the number of connections of node i in the subgraph
- molecular connectivity indexes also exist in a valence-modified form that takes into account the heteroatoms present
- many experiments have been done to find correlations between them (and other indexes) and measured physico-chemical or biological properties
- this uses a statistical technique called multiple regression analysis to build an equation of the form
$$\text{Property} = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 + c_5 x_5 + \dots$$
  where x1, x2 etc. are topological indexes and c1, c2 etc. are constants
- good correlations have often been obtained
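A minimal sketch of fitting such a regression equation by ordinary least squares, solving the normal equations in plain Python. The descriptor values and the "true" coefficients below are synthetic, made up purely to show that the fit recovers them. Real studies would use measured properties and a statistics package.

```python
def fit_linear(X, y):
    """Least-squares fit of y = c0 + c1*x1 + c2*x2 + ... (returns [c0, c1, ...])."""
    rows = [[1.0] + list(x) for x in X]  # prepend the constant term
    p = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]


# Synthetic data: two hypothetical indexes per molecule, property built exactly
# from c0 = 2, c1 = 3, c2 = -1.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 4.0)]
y = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in X]
coefficients = fit_linear(X, y)
print([round(c, 6) for c in coefficients])
```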
- it is often difficult to assign some chemical meaning to, e.g., the order-6, path-cluster, valence-modified Kier index
- topological indexes effectively encode the same information as fingerprint fragments, in a less obvious way but one which can be processed numerically
Atom-Pair Descriptors
- Encode all pairs of atoms in a molecule
- Include the length of the shortest bond-by-bond path between them
- Each atom is described by its elemental type plus the number of attached non-hydrogen atoms and the number of π-bonding electrons
BCUT Descriptors
- Designed to encode atomic properties that govern intermolecular interactions
- Used in diversity analysis
- Encode atomic charge, atomic polarizability, and atomic hydrogen bonding ability
- A type of topological index with a complex history
  - B = Frank Burden
  - C = Chemical Abstracts Service
  - UT = University of Texas (Bob Pearlman)
- often used as descriptors for cell-based partitioning of chemical space
  - 6 descriptors = 6 dimensions
- 3D descriptors can be computationally time consuming with large data sets
- Usually must take into account conformational flexibility
- 3D fragment screens encode spatial relationships between atoms, ring centroids, and planes
Pharmacophore Keys & Other 3D Descriptors
- Based on atoms or substructures thought to be relevant for receptor binding
- Typically include hydrogen bond donors and acceptors, charged centers, aromatic ring centers, and hydrophobic centers
- Others: 3D topographical indexes, geometric atom pairs, quantum mechanical calculations for HOMO and LUMO
DATA VERIFICATION AND MANIPULATION
- Data spread and distribution
  - Coefficient of variation (standard deviation divided by the mean)
- Scaling (standardization): making sure that each descriptor has an equal chance of contributing to the overall analysis
- Correlations
- Reducing the dimensionality of a data set: Principal Components Analysis
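The first two items can be sketched with the standard library. This assumes the sample standard deviation and zero-mean, unit-variance scaling (often called autoscaling), and it does not cover the correlation or principal components steps. The descriptor values are invented for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def autoscale(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("a constant descriptor cannot be scaled")
    return [(v - mean) / sd for v in values]


# Hypothetical values of one descriptor across eight molecules.
descriptor = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cv = coefficient_of_variation(descriptor)
scaled = autoscale(descriptor)
print(round(cv, 3), [round(v, 2) for v in scaled])
```

After scaling, a descriptor measured in hundreds (such as molecular weight) and one measured in single digits (such as logP) contribute on the same footing to a distance or similarity calculation.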
Chemical Structure Representation and Search Systems
Topics to be Covered
- Clustering: identifying classes of molecules similar to each other, but different to those in other classes
- Property prediction: predicting physicochemical or biological properties directly from connection tables
Cluster Analysis
- process of putting molecules (or other objects) into classes, based on similarity
  - molecules in the same cluster are similar to each other
  - molecules in different clusters are different from each other
- many different methods and algorithms
  - different clustering methods will result in different clusters, with different relationships between them
  - different algorithms can be used to implement the same method (some may be more efficient than others)
Reference: Downs, G. M., Barnard, J. M., Rev. Comput. Chem., 18 (2002)
Hierarchical and non-hierarchical
- A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not
Hierarchical Agglomerative
- the hierarchy is built from the bottom upwards
- several different methods and algorithms
- basic Lance-Williams algorithm (common to all methods)
  - starts with a table of similarities between all pairs of items
  - at each step the most similar pair of molecules (or previously formed clusters) is merged, until everything is in one big cluster
- methods differ in how they determine the similarity between clusters
  - single link chooses clusters whose closest members are most similar
  - complete link chooses clusters whose furthest members are most similar
  - other methods (e.g. Group-average method and Ward's method) use some sort of average member
- the Lance-Williams algorithm is slow
  - O(N^2) to generate the pairwise similarity table initially
  - this table must be updated N times, once for each merge (agglomeration) of clusters
  - overall time requirements are O(N^3)
- more efficient algorithms can be used for some methods
  - single link can be O(N log N) with a k-D trees algorithm
  - Ward's method and the Group-Average method can be O(N^2) using Murtagh's Reciprocal Nearest-Neighbour algorithm
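A deliberately naive sketch of bottom-up clustering with single or complete link. It recomputes inter-cluster distances from the members at every step rather than using the Lance-Williams update, so it illustrates the merging logic only, not an efficient algorithm. Euclidean distance on descriptor tuples is an assumption made here, and smallest distance stands in for highest similarity.

```python
def agglomerate(points, linkage="single"):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Single link uses the closest pair of members, complete link the furthest.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pick(dist(points[a], points[b])
                         for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


# Five hypothetical molecules described by a single descriptor value.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for step in agglomerate(points, "single"):
    print(step)
```

Cutting the merge history at a chosen distance yields a flat set of clusters. The two linkage rules give the same merge order here but report different distances for the later merges.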
Hierarchical Divisive
- the hierarchy is built from the top downwards
- at each step a cluster is chosen to divide, until each cluster has only one member
- various ways of choosing the next cluster to divide
  - the one with most members
  - the one with the least similar pair of members
  - etc.
- various ways of dividing it
  - using a single descriptor (e.g. a fingerprint bit) [monothetic]
  - using all descriptors (based on similarities between pairs of members) [polythetic]
Non-hierarchical methods
- usually faster than hierarchical
- several different methods, e.g. the Leader algorithm
  - makes a single pass through the dataset (O(N))
  - if a molecule is similar enough (need to define a threshold) to an existing cluster, it joins that cluster
  - otherwise it starts (leads) a new cluster
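The single pass can be sketched as follows. Two assumptions go beyond the text: "similar to an existing cluster" is taken to mean similar to that cluster's leader (its first member), and similarity is the Tanimoto coefficient on fingerprints stored as sets of bit positions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(items, similarity, threshold):
    """Assign each item to the first cluster whose leader is similar enough."""
    leaders, clusters = [], []
    for item in items:
        for index, leader in enumerate(leaders):
            if similarity(item, leader) >= threshold:
                clusters[index].append(item)
                break
        else:
            # No existing leader was close enough: this item leads a new cluster.
            leaders.append(item)
            clusters.append([item])
    return clusters


# Hypothetical fingerprints, given as sets of set bits.
fingerprints = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {7, 8}]
clusters = leader_cluster(fingerprints, tanimoto, threshold=0.6)
print(clusters)
```

The result depends on the order in which molecules are presented, which is the usual trade-off for a single-pass method.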
Nearest neighbour methods
- non-hierarchical
- the best-known example is the Jarvis-Patrick method
  - identify the top k (e.g. 20) nearest neighbours for each molecule
  - two molecules join the same cluster if they have at least kmin of their top k nearest neighbours in common
- very popular for chemical applications from the mid-1980s, rather less popular now
  - tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)
- some variations have been tried
  - variable-length nearest-neighbour lists (threshold similarity)
  - reclustering of singletons
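A small sketch of the Jarvis-Patrick idea. Beyond the shared-neighbour test described above, it also requires the two molecules to appear in each other's neighbour lists. That condition comes from the original Jarvis-Patrick formulation rather than from the text here, and is exposed as a flag. Euclidean distance on descriptor tuples is an illustrative assumption.

```python
def jarvis_patrick(points, k, k_min, require_mutual=True):
    """Cluster point indexes by shared nearest neighbours."""
    n = len(points)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Top-k nearest-neighbour list for every point (excluding itself).
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        neighbours.append(set(order[:k]))

    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbours[j] and j in neighbours[i]
            shared = len(neighbours[i] & neighbours[j])
            if shared >= k_min and (mutual or not require_mutual):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Two tight groups and one outlier, each described by one descriptor value.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (50.0,)]
print(jarvis_patrick(points, k=2, k_min=1))
```

The outlier ends up as a singleton, the behaviour noted above for this method.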
Relocation methods
- non-hierarchical
- clusters are initialised (sometimes randomly)
- iterative refinement then relocates molecules between clusters to improve some objective function
- simplest and most common example is K-means
  - select k random molecules to act as cluster "seeds" (k is the required number of clusters)
  - assign each remaining molecule to the closest seed
  - calculate the centroid (mean) of each cluster
  - relocate molecules to the nearest cluster centroid if necessary
  - recalculate centroids and repeat until no further changes
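The K-means steps above can be sketched in plain Python. The squared Euclidean distance, the fixed random seed, and the iteration cap are assumptions added for a reproducible illustration.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (assignment, centroids) after iterative relocation."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random molecules act as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign (or relocate) every point to its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # no further changes
        assignment = new_assignment
        # Recalculate each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids


# Two well-separated groups in a hypothetical 2-descriptor space.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)
```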
K-means clustering
- K-means has the advantage of being fast (O(Nk)) and is popular with statisticians
- however it has several disadvantages
  - sensitive to the initial choice of seeds (can try non-random sets of seeds)
  - can converge to a local (rather than global) optimum
  - tends to produce only spherical clusters of similar size
  - difficult to decide what value of k to choose
Overlapping and fuzzy clusters
- some clustering methods produce overlapping clusters, in which some molecules are members of more than one cluster
- in fuzzy clustering, each molecule has partial membership of all clusters
  - degree of membership in each cluster is in the range 0.0 to 1.0
  - sum of membership over all clusters is 1.0
- fuzzy clustering is arguably a better representation of the real world, but it makes it difficult to make decisions
Which method is best?
- as with similarity measures and structure descriptors, there is no definite agreement
  - this is probably why there are so many methods
- empirical property-prediction experiments have been done to evaluate different methods
  - predicted property value is the average of the other members of the same cluster (Sheffield University work); the correlation coefficient between observed and predicted properties is then calculated
  - active and inactive molecules should be in separate clusters (Abbott Laboratories work)
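The first evaluation scheme above (predict each molecule's property from the other members of its cluster, then correlate predictions with observations) might be sketched like this. The cluster labels and property values are invented, and singletons are simply left unpredicted. The published studies may have handled those details differently.

```python
import math


def predict_from_clusters(cluster_ids, values):
    """Predict each value as the mean of the other members of its cluster."""
    predictions = []
    for i, cid in enumerate(cluster_ids):
        others = [v for j, (c, v) in enumerate(zip(cluster_ids, values))
                  if c == cid and j != i]
        predictions.append(sum(others) / len(others) if others else None)
    return predictions


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical clustering of six molecules and their measured property.
cluster_ids = [0, 0, 0, 1, 1, 2]
observed = [1.0, 1.2, 0.8, 5.0, 5.4, 9.0]
predicted = predict_from_clusters(cluster_ids, observed)

# Singletons have no neighbours to predict from, so they are left out.
pairs = [(o, p) for o, p in zip(observed, predicted) if p is not None]
r = pearson([o for o, _ in pairs], [p for _, p in pairs])
print(predicted, round(r, 3))
```

A clustering method that groups molecules with similar property values gives a correlation close to 1.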
- Sheffield University work (mid-1980s) showed that Ward's (hierarchical agglomerative) and Jarvis-Patrick methods gave the best predictions
  - at that time Jarvis-Patrick was significantly faster
- Joint CAS/Sheffield/BCI study in the early 1990s showed Ward's and minimum diameter (hierarchical divisive) to be significantly better than Jarvis-Patrick
- similar conclusions in the Abbott study (mid-1990s)
- more recent work at Eli Lilly recommended K-means
  - certainly better for very large datasets, because of speed
How many clusters to choose?
Hierarchical methods
- allow the user to choose any "slice" across the hierarchy
- but what level is the best one to choose?
- there are methods that give a score to each level, aiming for the fewest and tightest clusters
Non-hierarchical methods
- the Jarvis-Patrick method decides for itself, on the basis of the user-selected k and kmin
- with other methods (e.g. k-means) it is more difficult
  - what is the "natural" number of clusters?
[Figure: The natural number of clusters]
What is clustering used for?
- compound acquisition
  - purchase compounds from clusters that contain no compounds from existing collections
- high-throughput screening
  - choose one compound per cluster in the first round
  - test other compounds from clusters where hits are found
- homogeneous subsets for QSAR
- diverse subset selection from combinatorial libraries
  - maximise the number of different clusters represented; penalise over-representation of individual clusters
- classification of new compounds
  - which existing cluster is a new compound closest to?
[Figure: A clustering of clustering methods]
Descriptor calculation
- various numerical descriptors can be calculated for chemical structures
  - molecular weight
  - counts of features: hydrogen bond donors/acceptors, aromatic rings, rotatable bonds, etc.
Property Prediction
- it is often useful to be able to calculate a physico-chemical property for a compound from its structure
- regression equations have been used to do this from topological indexes, but usually only for limited sets of molecules
- it would be better to have a more general method
logP
- the octanol-water partition coefficient has been found very useful in predicting the bioavailability of a drug
  - it needs to be soluble enough in lipid to be able to cross cell membranes
  - but soluble enough in water not to get stuck there
- many methods have been proposed for calculating a good estimate from the structure
Reference: Leo, A. J. Chemical Reviews, 1993, 93, 1281-1306
logP calculation: fragment-based methods (ClogP)
- pioneered by Corwin Hansch and Al Leo (Pomona College)
- identify large fragments, whose contribution to the logP value is known from their occurrence in other compounds with measured logP
- large training set of compounds with accurately measured logP (the "Starlist")
- works very well if the test compound has the "right" fragments
  - problems arise if the test compound contains fragments that are missing from the training set
logP calculation: atom-based methods (AlogP, XlogP, SlogP)
- pioneered by Gordon Crippen (Univ. Michigan)
- based on identifying a series of atom types in the molecule
  - essentially, small atom-centred fragments
  - usually 60-200 such fragments are involved
- each atom type is assigned a numerical value
- logP is obtained by adding the values for the atom types present in the test molecule
- atom-type values are obtained by regression analysis, based on a set of compounds with measured logP
- sometimes some extra correction factors are used too
- the atom-based principle has also been used for other properties
  - molar refractivity
  - charged partial surface area
  - intestinal absorption
  - etc.
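The additive scheme can be sketched as a weighted sum over atom-type counts. The atom-type names and contribution values below are hypothetical placeholders, not the published AlogP, XlogP, or SlogP parameters. A real implementation would also need a routine that assigns atom types from the connection table.

```python
# HYPOTHETICAL atom-type contributions, invented for illustration only.
HYPOTHETICAL_CONTRIBUTIONS = {
    "C_aliphatic": 0.30,
    "C_aromatic": 0.25,
    "O_hydroxyl": -0.40,
    "N_amine": -0.60,
}


def estimate_logp(atom_type_counts, contributions, correction=0.0):
    """Sum contribution * count over atom types, plus an optional correction."""
    missing = set(atom_type_counts) - set(contributions)
    if missing:
        # Mirrors the weakness of additive schemes: unknown types cannot be scored.
        raise KeyError(f"no contribution available for: {sorted(missing)}")
    total = sum(contributions[atom_type] * count
                for atom_type, count in atom_type_counts.items())
    return total + correction


# An ethanol-like molecule: two aliphatic carbons and one hydroxyl oxygen.
logp = estimate_logp({"C_aliphatic": 2, "O_hydroxyl": 1},
                     HYPOTHETICAL_CONTRIBUTIONS)
print(round(logp, 2))
```

The same function shape would serve for other additive properties by swapping in a different contribution table.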
The Drug Discovery Process
- pharmaceutical companies are in the business of identifying compounds that may be useful new drugs
- tens or hundreds of thousands of compounds are made and tested every year (screening)
  - tests are usually simple binding assays (does the molecule bind to a target protein?)
- testing is done in two stages
  - Lead Generation (find a compound that binds)
  - Lead Optimisation (find a compound that binds better)
Drug development
- Patents will be applied for as soon as a good compound (or class of compounds) is identified
  - need to get in before the competition
  - patent life (20 years) starts counting down from here
- Much development work has still to be done
  - animal tests
  - clinical trials (several phases)
  - regulatory requirements
  - many drugs may fail during the process
- Patent may have only 10 years left to run by the time a new drug is marketed
- Only a tiny proportion of compounds make it all the way through this process
- If a potential new drug is going to fail, it is better that it fail early, before too much money has been spent on it
- If you can identify the failures before you even synthesise them, so much the better: "virtual screening"
Three stages of screening
- in silico ("in silicon")
  - virtual screening entirely in the computer
- in vitro ("in glass")
  - uses "test tube" models of biological systems
  - enzyme assays etc.
  - requires real compounds
Virtual Screening
- Often based on the concept of "drug-likeness"
  - do these compounds actually look like drugs?
  - need to calculate appropriate properties
- Is the compound likely to have suitable properties for
  - Absorption
  - Distribution
  - Metabolism
  - Excretion
  - Toxicity
- ADMET or ADME/Tox
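Lipinski Rule of Five
- Widely used set of properties for virtual screening
- Developed at Pfizer, 1997
  - molecular weight no greater than 500
  - logP no greater than 5.0
  - no more than 5 hydrogen bond donors (number of OH and NH groups)
  - no more than 10 hydrogen bond acceptors (number of N and O atoms)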
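A sketch of a Rule of Five filter. The thresholds follow Lipinski's published wording ("not more than"), and the check includes the hydrogen bond acceptor limit of 10 from that publication. The property values passed in are assumed to have been calculated elsewhere, and the example numbers are approximate or invented.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return a list describing which Rule of Five limits are exceeded."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    return violations


# Approximate values for an aspirin-like small molecule (illustration only).
small = rule_of_five_violations(mol_weight=180.2, logp=1.2,
                                h_donors=1, h_acceptors=4)
# An invented large, lipophilic compound.
large = rule_of_five_violations(mol_weight=750.0, logp=6.2,
                                h_donors=6, h_acceptors=12)
print(small, len(large))
```

Returning the list of violations rather than a single yes/no makes it easy to apply a looser policy, such as tolerating one violation.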
Lead generation
- when testing a large number of compounds to identify a new lead, it is obviously desirable to have them as different from each other as possible
- pharmaceutical companies purchase large numbers of compounds from 3rd party suppliers (often Eastern European) to test
- they also synthesise combinatorial libraries of compounds
- chemical diversity is an important feature of such compound collections and libraries
  - the idea is to cover as much of "chemical space" as possible
Lead optimisation
- when a lead compound has been identified, the next stage is to find compounds that are similar to it, which might bind even better
- this can involve similarity searching to find compounds previously made, or available commercially for purchase
- in later stages, as the activity of the compound becomes better understood, medicinal chemists will make specific changes to the molecule which they hope will improve its binding affinity
Conclusions
- Clustering is a useful technique for identifying classes of molecules in a dataset
  - there are many different methods and algorithms
  - some are faster or more effective than others
- Topological indices are numbers that can be calculated from structures represented as connection tables
  - there are many different indices available, some of which are designed to represent gross features like shape and branching
- Topological indices can be used in regression equations to predict properties of a structure
  - other methods are available for property prediction, based on summing scores for different fragments or atom types
- Many computer techniques are available to manipulate chemical structure representations
  - some have inherent limitations but are nonetheless useful
- Structure and substructure search algorithms are among the most important and useful
- There are useful techniques for calculating estimates of physico-chemical and other properties
- Identifying structurally similar molecules can lead to identifying molecules with similar biological activities
- Chemoinformatics is now a vital part of the drug discovery process in the pharmaceutical industry