
Character data in Mesquite

(updated August 2005)

Abstract classes and interfaces are available for character data from the zero-dimensional (the representation of one state of a single character), to the one-dimensional (a data vector with states for a character in each of numTaxa taxa or at numNodes nodes), to the two-dimensional (a data matrix with states for each of numChars characters in each of numTaxa taxa). In addition, standard subclasses of these are available for discrete and continuous characters.

The reason these abstract classes exist is so that many of the modules can pass references to objects in this datatype-neutral way. Some modules, e.g. those to shade the branches of the tree, don't even need to know what character type they are dealing with, since their needs are met by the methods provided (at least in abstract form) in the generic superclass.
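As a rough illustration of this datatype-neutral design, the sketch below uses hypothetical names (StateVector, DiscreteVector, ContinuousVector, BranchLabeler) that are not Mesquite's actual classes. The labeling code calls only methods declared in the abstract superclass, so it works unchanged for either data type.

```java
import java.util.Locale;

// Hypothetical stand-in for a generic superclass (not a Mesquite class).
abstract class StateVector {
    abstract int getNumNodes();
    // Datatype-neutral text for the state at one node.
    abstract String stateAsString(int node);
}

class DiscreteVector extends StateVector {
    private final int[] states;
    DiscreteVector(int[] states) { this.states = states.clone(); }
    @Override int getNumNodes() { return states.length; }
    @Override String stateAsString(int node) { return Integer.toString(states[node]); }
}

class ContinuousVector extends StateVector {
    private final double[] states;
    ContinuousVector(double[] states) { this.states = states.clone(); }
    @Override int getNumNodes() { return states.length; }
    @Override String stateAsString(int node) {
        return String.format(Locale.ROOT, "%.2f", states[node]);
    }
}

// A "module" that labels branches using only the superclass methods,
// so it never needs to know which character type it was given.
class BranchLabeler {
    static String[] labels(StateVector v) {
        String[] out = new String[v.getNumNodes()];
        for (int node = 0; node < out.length; node++) {
            out[node] = v.stateAsString(node);
        }
        return out;
    }
}
```

Calling BranchLabeler.labels with a DiscreteVector or a ContinuousVector goes through the same code path; only the subclass decides how a state is rendered.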

The class and interface hierarchy seems more complex than it needs to be, but we struggled considerably to simplify it and yet retain appropriate abstractness. (This is one place where we wish Java had had multiple inheritance!)


Classes and interfaces for matrices of characters X taxa

Class CharacterData

This abstract class represents an entire data set, including extra information like names of characters and perhaps states. It is more or less equivalent to a Characters block in a NEXUS file plus some associated parts of an Assumptions block, because a CharacterData object also includes the model sets, character inclusion sets, and so on.

Currently the CharacterData class is used only for stored matrices (i.e., matrices stored as FileElements within a file). For temporary matrices produced by simulations and other calculations, the simpler MCharactersStates and its subclasses are used. Most modules requiring data matrices on which to do calculations receive them through an interface descended from MCharactersStatesHolder, such as MCharactersDistribution. Thus, most modules don't deal with CharacterData. Notable exceptions are the modules that edit data matrices, which deal with CharacterData objects intimately.

Subclasses of CharacterData must define various methods, including those to add and move characters, return the number of characters, and return a String describing the states in a particular character for a particular taxon. The subclasses (and not CharacterData itself) are responsible for many obviously necessary methods, because they depend on the particular type of data (setState(int ic, int it, double state) would not apply to a discrete matrix, for example).
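The division of labor between the abstract class and its subclasses might look roughly like the following sketch. All class names and signatures here are hypothetical simplifications for illustration (only the setState(int ic, int it, double state) form comes from the text above); they are not the actual Mesquite declarations.

```java
// Hypothetical sketch; names are illustrative, not Mesquite's API.
abstract class SketchCharacterData {
    abstract int getNumChars();
    abstract void addCharacters(int howMany);        // append empty characters
    abstract String statesAsString(int ic, int it);  // text for one cell
}

class SketchContinuousData extends SketchCharacterData {
    private double[][] matrix; // indexed [ic][it]
    private final int numTaxa;

    SketchContinuousData(int numChars, int numTaxa) {
        this.numTaxa = numTaxa;
        this.matrix = new double[numChars][numTaxa];
    }
    @Override int getNumChars() { return matrix.length; }
    @Override void addCharacters(int howMany) {
        double[][] bigger = new double[matrix.length + howMany][numTaxa];
        for (int ic = 0; ic < matrix.length; ic++) bigger[ic] = matrix[ic];
        matrix = bigger;
    }
    // Type-specific setter: lives in the subclass because a double
    // state would make no sense for a discrete matrix.
    void setState(int ic, int it, double state) { matrix[ic][it] = state; }
    @Override String statesAsString(int ic, int it) {
        return Double.toString(matrix[ic][it]);
    }
}

class SketchDiscreteData extends SketchCharacterData {
    private int[][] matrix; // indexed [ic][it]
    private final int numTaxa;

    SketchDiscreteData(int numChars, int numTaxa) {
        this.numTaxa = numTaxa;
        this.matrix = new int[numChars][numTaxa];
    }
    @Override int getNumChars() { return matrix.length; }
    @Override void addCharacters(int howMany) {
        int[][] bigger = new int[matrix.length + howMany][numTaxa];
        for (int ic = 0; ic < matrix.length; ic++) bigger[ic] = matrix[ic];
        matrix = bigger;
    }
    // Type-specific setter for a single discrete state.
    void setState(int ic, int it, int state) { matrix[ic][it] = state; }
    @Override String statesAsString(int ic, int it) {
        return Integer.toString(matrix[ic][it]);
    }
}
```

Code holding only a SketchCharacterData reference can count characters and ask for state strings, but must know the concrete subclass before it can set a state.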

These subclasses might eventually have to maintain color tables for use in tracing characters and coloring cells in matrices, but for now the only color-related method is

drawColoredStates(Graphics g, int x, int y, int width, int height, int ic, int it)

which draws, in color and within the given rectangle, the states in character ic and taxon it. The reason this is the responsibility of the CharacterData class (as opposed to a module, for example) is that the CharacterData class is responsible for figuring out the String representation of states and might be responsible for color tables.

In the future it may be important to make a method within CharacterData to read Characters blocks in NEXUS files, so that new Characters types can be invented without the need for new reading modules. However, the current system in which various Managers participate in file reading and writing works well (and the Managers are needed anyway to keep the menu items and list windows current when new data sets are added or read).

Interface MCharactersStatesHolder

The interface MCharactersStatesHolder is the base interface for those used to pass most character matrices among calculating modules. (The first letter "M" can be thought of as "Multiple" or "Matrix".) It and its descendant interfaces are:

The classes implementing these interfaces are descendants of MCharactersStates. Each type of data (e.g., categorical vs. continuous) has its own descendant hierarchy whose members implement the above interfaces.

Class MCharactersStates

This class contains information for a set of characters over a set of taxa or nodes, and thus is two-dimensional like CharacterData (characters X nodes). At first it may seem a duplicate of CharacterData, but MCharactersStates contains none of the extra information about characters, and the other dimension is best thought of as nodes. Generally, it is used for passing stripped-down data matrices to modules for calculation, or for storage with calculations on trees, much like the old downstate, upstate and finalstate arrays in MacClade. Because it spans multiple characters, it is usually used for calculations involving all characters at once. Thus, in various simulations, reconstructions and other calculations of characters on trees, subclasses of MCharactersStates are the classes of choice for passing information around.

There are two main subclasses, one (MCharactersDistribution) for the states at each of the terminal taxa, and one (MCharactersHistory) for the states at each of the nodes of a tree. The reason a single two-dimensional array can be used for the states at all the nodes (instead of requiring special storage attached to each node) is that nodes are simply numbered in Tree objects, so the node number serves as the index of one dimension of the array.
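The node-number indexing can be sketched as follows. SketchTree and SketchHistory are hypothetical, simplified stand-ins (the real Tree and MCharactersHistory classes are richer); the point is only that an integer node number doubles as an array index, so no per-node storage objects are needed.

```java
// Hypothetical sketch: nodes are plain integers, as assumed from the text.
class SketchTree {
    private final int[] parent; // parent[node]; -1 marks the root

    SketchTree(int[] parent) { this.parent = parent.clone(); }
    int getNumNodes() { return parent.length; }
    int parentOf(int node) { return parent[node]; }
}

// States for all characters at all nodes in one two-dimensional array.
class SketchHistory {
    private final double[][] states; // indexed [ic][node]

    SketchHistory(int numChars, int numNodes) {
        states = new double[numChars][numNodes];
    }
    void setState(int ic, int node, double s) { states[ic][node] = s; }
    double getState(int ic, int node) { return states[ic][node]; }

    // Change in character ic along the branch leading to 'node'
    // (0 for the root, which has no branch below it).
    double changeOnBranch(SketchTree tree, int ic, int node) {
        int p = tree.parentOf(node);
        if (p < 0) return 0.0;
        return states[ic][node] - states[ic][p];
    }
}
```

Because the tree hands out node numbers and the history is indexed by them, any calculation that walks the tree can look up or store states with a single array access.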

Many calculations in Mesquite pass character data matrices around via the datatype-neutral interface MCharactersDistribution.

Subclasses exist for different character types (e.g. categorical, continuous). Some current ones are:

MCharactersStates (implements interface MCharactersStatesHolder)


Classes and interfaces for vectors of one character X taxa

Interface CharacterStatesHolder

The interface CharacterStatesHolder is the base interface for those used to pass most character vectors among calculating modules. It and its descendant interfaces are:

The classes implementing these interfaces are descendants of CharacterStates. Each type of data (e.g., categorical vs. continuous) has its own descendant hierarchy whose members implement the above interfaces.

CharacterStates

Subclasses of CharacterStates represent a vector of character states in a series of taxa or nodes. The subclasses are:

CharacterStates (implements interface CharacterStatesHolder)


Classes and interfaces for character state of one character X one taxon (or node)

CharacterState

A CharacterState object contains the state of a single character in a single taxon. There are methods to set it, write it as a string, set its values according to a String, query whether it is unassigned, and so on. The details of what exactly is stored depend on the data type. CategoricalState objects store a long representing a state set. ContinuousState objects store an array of doubles for the items representing the state (e.g., mean, variance).
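To make the idea of a long representing a state set concrete, here is a minimal sketch. It assumes that bit i of the long being set means state i is in the set; the class StateSetBits and its methods are hypothetical, and a real implementation may reserve some bits for other purposes, which this sketch ignores.

```java
// Hypothetical helper; the bit layout is an assumption for illustration.
class StateSetBits {
    // Return the set with 'state' added (bit number 'state' turned on).
    static long addState(long set, int state) {
        return set | (1L << state);
    }

    // True if 'state' is a member of the set.
    static boolean contains(long set, int state) {
        return (set & (1L << state)) != 0;
    }

    // Number of states in the set (more than 1 means polymorphic or uncertain).
    static int cardinality(long set) {
        return Long.bitCount(set);
    }

    // Text form such as "{0,2}" listing the states present.
    static String describe(long set) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (int state = 0; state < 64; state++) {
            if (contains(set, state)) {
                if (!first) sb.append(',');
                sb.append(state);
                first = false;
            }
        }
        return sb.append('}').toString();
    }
}
```

Packing the set into one primitive keeps a whole matrix of state sets in a plain array of longs, and set operations such as union and intersection reduce to single bitwise instructions.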

CharacterState objects provide a datatype-independent way to extract information from matrices and pass it around. Two of the standard subclasses are:

CharacterState


Colors of states

General systems for defining colors for states are not yet settled. Colors and patterns are needed for character tracing, for filling cells of the matrix, and possibly for charts or other purposes.

Currently, methods in CharacterHistory are used for tracing. For the CharacterHistory object of reconstructed states, the method prepareColors is called first; it surveys the nodes to accumulate information about which states are present. Then getColorsAtNode finds the colors at a node according to its states, and getLegendStates finds the colors and names to put in the legend. Methods in CharacterData are used for coloring cells in the matrix.

The reason CharacterHistory is used for tracing is that it knows all the states it has at its nodes (its CharacterData object might not). The reason CharacterData is used for matrix cells is that the Data Windows use CharacterData objects. In the future it would be good to have color control centralized, perhaps in a CharacterData object whose methods are passed the CharacterHistory object when it needs to help with a tracing.
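The tracing sequence described above (survey first, then look up colors per node and for the legend) can be sketched as below. The class TracingColorsSketch, its simplified signatures, the one-state-per-node storage, and the palette of color names are all assumptions made for illustration; the actual CharacterHistory methods take different arguments and return real color objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of a survey-then-lookup color scheme.
class TracingColorsSketch {
    private static final String[] PALETTE = {"white", "yellow", "green", "blue", "black"};
    private final int[] stateAtNode;              // one state per node, for simplicity
    private final TreeSet<Integer> statesPresent = new TreeSet<>();

    TracingColorsSketch(int[] stateAtNode) { this.stateAtNode = stateAtNode.clone(); }

    // Pass 1: survey all nodes to learn which states occur.
    void prepareColors() {
        statesPresent.clear();
        for (int s : stateAtNode) statesPresent.add(s);
    }

    // Color for a state = its rank among the states actually present.
    private String colorOf(int state) {
        int rank = statesPresent.headSet(state).size();
        return PALETTE[rank % PALETTE.length];
    }

    // Pass 2a: color to paint on the branch at this node.
    String getColorAtNode(int node) { return colorOf(stateAtNode[node]); }

    // Pass 2b: legend entries, one per state present, in ascending order.
    List<String> getLegendStates() {
        List<String> legend = new ArrayList<>();
        for (int s : statesPresent) legend.add(s + ": " + colorOf(s));
        return legend;
    }
}
```

The survey has to come first because, in this scheme, a state's color depends on which other states occur in the reconstruction, which is exactly the information the history object has and the data matrix might not.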

One complication with coloring states is that there are several options for the span of colors used for a single character:

  1. Use a fixed scale of 0 to the maximum state conceivable (e.g., for DNA data), and use the same colors regardless of the states present in the particular character.
  2. Use a scale of min-max observed in the character in the data matrix (not necessarily in the reconstruction, which might be on a tree with taxa trimmed).
  3. Use a scale of min-max in the character reconstruction.
  4. Use any of the above scales, but avoid assigning white and black as colors.

Choice 1 could be used to trace DNA data on trees. Choices 2 and 3 might be used for non-nucleotide data in tracing characters. Choice 4 might be used to color cells in a matrix, so that the strings describing the states could still be observed despite the coloring.
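The four choices differ only in which minimum and maximum define the scale and in whether the extremes are avoided. The sketch below illustrates this with a gray scale; the class ColorScaleSketch, the gray-level representation (0 = black, 255 = white), and the cutoffs 40 and 215 are arbitrary assumptions for illustration, not values used by Mesquite.

```java
// Hypothetical sketch of choosing a color scale for a character's states.
class ColorScaleSketch {
    // Position of 'state' along the scale [min, max], from 0.0 to 1.0.
    static double fraction(int state, int min, int max) {
        if (max <= min) return 0.0; // degenerate scale: a single state
        return (double) (state - min) / (max - min);
    }

    // Gray level for a state. With avoidExtremes (choice 4) the range is
    // compressed so that neither pure black nor pure white is assigned.
    static int grayLevel(int state, int min, int max, boolean avoidExtremes) {
        int lo = avoidExtremes ? 40 : 0;
        int hi = avoidExtremes ? 215 : 255;
        return (int) Math.round(lo + fraction(state, min, max) * (hi - lo));
    }

    // Observed {min, max} among the given states. Pass the matrix column
    // for choice 2, or the reconstructed node states for choice 3.
    static int[] observedRange(int[] states) {
        int min = states[0];
        int max = states[0];
        for (int s : states) {
            if (s < min) min = s;
            if (s > max) max = s;
        }
        return new int[] {min, max};
    }
}
```

For example, with nucleotide states 0 to 3 on a fixed scale (choice 1), state 1 always receives the same gray level, whereas with an observed range of 1 to 2 (choices 2 or 3) the same state sits at the bottom of the scale.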


© W. Maddison & D. Maddison 1999-2005