Mesquite Character data


Character data in Mesquite

(updated August 2005)

Abstract classes and interfaces are available for character data from the zero-dimensional (the representation of one state of a single character) to the one-dimensional (a data vector with states for a character in each of numTaxa taxa or at numNodes nodes), to the two-dimensional (a data matrix with states for each of numChars characters in each of numTaxa taxa). As well, standard subclasses of these are available for discrete and continuous characters.

The reason these abstract classes exist is so that many of the modules can pass references to objects in this datatype-neutral way. Some modules, e.g. those to shade the branches of the tree, don't even need to know what character type they are dealing with, since their needs are met by the methods provided (at least in abstract form) in the generic superclass.

The class and interface hierarchy seems more complex than it needs to be, but we struggled considerably to simplify it and yet retain appropriate abstractness. (This is one place where we wish Java had had multiple inheritance!)

Classes and interfaces for matrices of characters X taxa

Class CharacterData

This abstract class represents an entire data set, including extra information like names of characters and perhaps states. It is more or less equivalent to a Characters block in a NEXUS file plus some associated parts of an Assumptions block, because a CharacterData object also includes the model sets, character inclusion sets, and so on.

Currently the CharacterData class is used only for stored matrices (i.e., matrices stored as FileElements within a file). For temporary matrices produced by simulations and other calculations, the simpler MCharactersStates and its subclasses are used. Most modules that require data matrices for calculations receive them through an interface descended from MCharactersStatesHolder, such as MCharactersDistribution. Thus, most modules don't deal with CharacterData. The notable exceptions are the modules that edit data matrices, which deal with CharacterData objects intimately.

Subclasses of CharacterData must define various methods, including those to add and move characters, return the number of characters, and return a String describing the states in a particular character for a particular taxon. The subclasses (and not CharacterData itself) are responsible for many obviously necessary methods, because they depend on the particular type of data (setState(int ic, int it, double state) would not apply to a discrete matrix, for example).
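The division of labor described above can be sketched roughly as follows. This is a simplified illustration rather than Mesquite's source: the class names MatrixSketch, ContinuousMatrixSketch and CategoricalMatrixSketch, the method name stateString, and the "&" separator are all hypothetical. Only the general shape of setState(int ic, int it, double state) comes from the text.

```java
// Hypothetical, simplified stand-ins; not the real Mesquite classes.
abstract class MatrixSketch {
    // Questions every data type can answer, so the abstract class declares them.
    abstract int getNumChars();
    abstract int getNumTaxa();
    abstract String stateString(int ic, int it);
}

class ContinuousMatrixSketch extends MatrixSketch {
    private final double[][] values; // [character][taxon]

    ContinuousMatrixSketch(int numChars, int numTaxa) {
        values = new double[numChars][numTaxa];
    }
    int getNumChars() { return values.length; }
    int getNumTaxa() { return values.length == 0 ? 0 : values[0].length; }
    // Type-specific setter: only meaningful for continuous data.
    void setState(int ic, int it, double state) { values[ic][it] = state; }
    String stateString(int ic, int it) { return Double.toString(values[ic][it]); }
}

class CategoricalMatrixSketch extends MatrixSketch {
    private final long[][] stateSets; // each long is a set of states, one bit per state

    CategoricalMatrixSketch(int numChars, int numTaxa) {
        stateSets = new long[numChars][numTaxa];
    }
    int getNumChars() { return stateSets.length; }
    int getNumTaxa() { return stateSets.length == 0 ? 0 : stateSets[0].length; }
    // Type-specific setter: takes a state set, not a double.
    void setState(int ic, int it, long stateSet) { stateSets[ic][it] = stateSet; }
    String stateString(int ic, int it) {
        long set = stateSets[ic][it];
        StringBuilder sb = new StringBuilder();
        for (int s = 0; s <= 56; s++) {          // only state bits, not flag bits
            if ((set & (1L << s)) != 0) {
                if (sb.length() > 0) sb.append('&');
                sb.append(s);
            }
        }
        return sb.toString();
    }
}
```

A caller holding only a MatrixSketch reference can ask for sizes and state strings without knowing the data type, while code that sets states has to know which subclass it is dealing with.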

These subclasses might eventually have to maintain color tables for use in tracing characters and coloring cells in matrices, but for now the only color-related method is

drawColoredStates(Graphics g, int x, int y, int width, int height, int ic, int it)

which draws, in color, the states of character ic in taxon it within the given rectangle. This is the responsibility of the CharacterData class (as opposed to a module, for example) because the CharacterData class is responsible for working out the String representation of states and might be responsible for color tables.

In the future it may be important to make a method within CharacterData to read Characters blocks in NEXUS files, so that new Characters types can be invented without the need for new reading modules. However, the current system in which various Managers participate in file reading and writing works well (and the Managers are needed anyway to keep the menu items and list windows current when new data sets are added or read).

Interface MCharactersStatesHolder

The Interface MCharactersStatesHolder is the base interface for those used to pass most character matrices among calculating modules. (The first letter "M" can be thought of as "Multiple" or "Matrix".) It and its descendent interfaces are:

Interface MCharactersStatesHolder — Key methods are getNumTaxa, getNumChars, and getCharacterState. Because classes satisfying this interface may be temporary representations of the information in a full-blown CharacterData, they can store the CharacterData which they represent as their "parent data", obtainable by getParentData.
- Interface MCharactersDistribution — Intended to portray the states in terminal taxa. Can supply CharacterDistributions for single characters, or can supply an "upgraded" version of itself to a full CharacterData via makeCharacterData.
  - Interface MAdjustableDistribution — Allows the matrix to be changed, e.g. the character states to be set via the setCharacterState, tradeStatesBetweenTaxa or transferFrom methods.
    - Interface MCharactersHistory — Intended to portray the states at all nodes of a tree. Can supply CharacterHistory objects for single characters.

The classes implementing these interfaces are descendants of MCharactersStates. Each type of data (e.g., categorical vs. continuous) has its own descendant hierarchy whose members implement the above interfaces.
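The layering of these interfaces can be illustrated with a much reduced sketch. The method names getNumChars, getNumTaxa, getCharacterState, getParentData, setCharacterState and tradeStatesBetweenTaxa are taken from the descriptions above, but the signatures here are assumptions. The interface and class names ending in "Sketch", SimpleAdjustable and StateCounter are hypothetical, and a String stands in for a CharacterState object.

```java
// Hypothetical, pared-down interfaces; the real ones carry many more methods.
interface HolderSketch {
    int getNumChars();
    int getNumTaxa();
    String getCharacterState(int ic, int it); // a String stands in for a CharacterState
    Object getParentData();                   // the stored matrix represented, or null
}

// States in terminal taxa; read-only from the caller's point of view.
interface DistributionSketch extends HolderSketch { }

// Adds the ability to change states.
interface AdjustableSketch extends DistributionSketch {
    void setCharacterState(int ic, int it, String state);
    void tradeStatesBetweenTaxa(int ic, int it1, int it2);
}

class SimpleAdjustable implements AdjustableSketch {
    private final String[][] states; // [character][taxon]

    SimpleAdjustable(int numChars, int numTaxa) {
        states = new String[numChars][numTaxa];
        for (String[] row : states) java.util.Arrays.fill(row, "?");
    }
    public int getNumChars() { return states.length; }
    public int getNumTaxa() { return states.length == 0 ? 0 : states[0].length; }
    public String getCharacterState(int ic, int it) { return states[ic][it]; }
    public Object getParentData() { return null; } // temporary matrix: no parent
    public void setCharacterState(int ic, int it, String state) { states[ic][it] = state; }
    public void tradeStatesBetweenTaxa(int ic, int it1, int it2) {
        String tmp = states[ic][it1];
        states[ic][it1] = states[ic][it2];
        states[ic][it2] = tmp;
    }
}

class StateCounter {
    // A calculating module that only reads needs nothing beyond the base interface.
    static int count(HolderSketch m, String state) {
        int n = 0;
        for (int ic = 0; ic < m.getNumChars(); ic++)
            for (int it = 0; it < m.getNumTaxa(); it++)
                if (state.equals(m.getCharacterState(ic, it))) n++;
        return n;
    }
}
```

The point of the design is that StateCounter accepts any holder, whether it is adjustable or not, while code that needs to modify a matrix asks for the narrower adjustable interface.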

Class MCharactersStates

This class contains information for a set of characters over a set of taxa or nodes, and thus is two-dimensional like CharacterData (characters X nodes). At first it may seem a duplicate of CharacterData, but MCharactersStates contains none of the extra information about characters, and the other dimension is best thought of as nodes. Generally, it is used for passing stripped-down data matrices to modules for calculation, or for storage with calculations on trees, much like the old downstate, upstate and finalstate arrays in MacClade. Because it is over multiple characters, it is usually used for calculations involving all characters at once. Thus, in various simulations, reconstructions and other calculations of characters on trees, subclasses of MCharactersStates are the classes of choice for passing information around.

There are two main subclasses, one (MCharactersDistribution) for the states at each of the terminal taxa, and one (MCharactersHistory) for the states at each of the nodes of a tree. The reason a single two-dimensional array can be used for the states at all the nodes (instead of requiring special storage attached to each node) is that nodes are simply numbered in a Tree, and thus the node number serves as the index into one dimension of the array.
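As a minimal sketch of node-number indexing, assume a hypothetical tree encoding in which parent[node] gives the node number of each node's ancestor (this is not Mesquite's Tree class). States for all nodes then fit in a plain [character][node] array, and a calculation such as counting changes along branches needs no per-node storage objects.

```java
class NodeIndexedHistorySketch {
    // parent[node] is the node number of the ancestor; -1 marks the root.
    // states[ic][node] holds the state of character ic at that node.
    static int countChanges(int[] parent, long[][] states, int ic) {
        int changes = 0;
        for (int node = 0; node < parent.length; node++) {
            int anc = parent[node];
            if (anc >= 0 && states[ic][node] != states[ic][anc]) {
                changes++; // state differs from the ancestor's state
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Node 0 is the root; nodes 1 and 2 are its descendants;
        // nodes 3 and 4 descend from node 2.
        int[] parent = {-1, 0, 0, 2, 2};
        long[][] states = { {1L, 1L, 2L, 2L, 1L} }; // one character, five nodes
        System.out.println(countChanges(parent, states, 0));
    }
}
```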

Many calculations in Mesquite pass character data matrices around via the datatype-neutral interface MCharactersDistribution.

Subclasses exist for different character types (e.g. categorical, continuous). Some current ones are:

MCharactersStates (implements interface MCharactersStatesHolder)

MCategoricalStates — MCharactersStates for categorical characters
- MCategoricalDistribution (implements interface MCharactersDistribution)
  - MCategoricalEmbedded — used as a reference to a matrix in an existing CharacterData object.
    - MDNAEmbedded — subclass for nucleotide data
    - MProteinEmbedded — subclass for amino acid data
  - MCategoricalAdjustable (implements interface MAdjustableDistribution) — usually used for temporarily created characters, such as those coming from a simulation or a reconstruction. Size can be adjusted & states altered.
    - MDNAAdjustable — subclass for nucleotide data
    - MProteinAdjustable — subclass for amino acid data
    - MCategoricalHistory (implements interface MCharactersHistory) — used for categorical states at nodes.

MContinuousStates — MCharactersStates for continuous valued characters
- MContinuousDistribution (implements interface MCharactersDistribution)
  - MContinuousEmbedded — used as a reference to a matrix in an existing CharacterData object.
  - MContinuousAdjustable (implements interface MAdjustableDistribution) — usually used for temporarily created characters, such as those coming from a simulation or a reconstruction. Size can be adjusted & states altered.
    - MContinuousHistory (implements interface MCharactersHistory) — used for continuous valued states at nodes.

Classes and interfaces for vectors of one character X taxa

Interface CharacterStatesHolder

The Interface CharacterStatesHolder is the base interface for those used to pass most character vectors among calculating modules. It and its descendent interfaces are:

Interface CharacterStatesHolder — Key methods are getNumTaxa (and getNumNodes) and getCharacterState. Because classes satisfying this interface may be temporary representations of the information in a full-blown CharacterData, they can store the CharacterData which they represent as their "parent data", obtainable by getParentData, and the character within that as their "parent character", obtainable by getParentCharacter.
- Interface CharacterDistribution — Intended to portray the states in terminal taxa. Via this interface many modules pass character distribution information to one another for calculations.
  - Interface AdjustableDistribution — Allows the character states to be changed, e.g. the character states to be set via the setCharacterState or tradeStatesBetweenTaxa methods.
    - Interface CharacterHistory — Intended to portray the states at all nodes of a tree. Has special methods for returning Colors for use in character tracing displays.

The classes implementing these interfaces are descendants of CharacterStates. Each type of data (e.g., categorical vs. continuous) has its own descendant hierarchy whose members implement the above interfaces.

Class CharacterStates

Subclasses of CharacterStates represent a vector of character states in a series of taxa or nodes. The subclasses are:

CharacterStates (implements interface CharacterStatesHolder)

CategoricalStates — CharacterStates for categorical characters
- CategoricalDistribution (implements interface CharacterDistribution)
  - CategoricalEmbedded — used as a reference to a character within an existing CharacterData object.
    - DNAEmbedded — subclass for nucleotide data
    - ProteinEmbedded — subclass for amino acid data
  - CategoricalAdjustable (implements interface AdjustableDistribution) — usually used for temporarily created characters, such as those coming from a simulation or a reconstruction. Size can be adjusted & states altered.
    - DNACharacterAdjustable — subclass for nucleotide data
      - RNACharacterAdjustable — subclass for nucleotide data
    - ProteinAdjustable — subclass for amino acid data
    - CategoricalHistory (implements interface CharacterHistory) — used for categorical states at nodes.
      - DNACharacterHistory — subclass for nucleotide data
        - RNACharacterHistory — subclass for nucleotide data
      - ProteinCharacterHistory — subclass for amino acid data

ContinuousStates — CharacterStates for continuous valued characters
- ContinuousDistribution (implements interface CharacterDistribution)
  - ContinuousEmbedded — used as a reference to a character within an existing CharacterData object.
  - ContinuousAdjustable (implements interface AdjustableDistribution) — usually used for temporarily created characters, such as those coming from a simulation or a reconstruction. Size can be adjusted & states altered.
    - ContinuousHistory (implements interface CharacterHistory) — used for continuous valued states at nodes.
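The difference between the Embedded and Adjustable classes listed above can be illustrated with a small sketch for continuous data. Everything here is hypothetical (the names VectorSketch, EmbeddedSketch, AdjustableVectorSketch and MeanSketch, and all signatures), and a double[][] stands in for a CharacterData object. The assumption illustrated is that an embedded vector is a view onto one character of a stored matrix, while an adjustable vector owns its own storage.

```java
// Hypothetical read-only view of one character across taxa.
interface VectorSketch {
    int getNumTaxa();
    double getState(int it);
}

// "Embedded": refers to one character of an existing matrix; stores no states itself.
class EmbeddedSketch implements VectorSketch {
    private final double[][] parentData;  // [character][taxon], stands in for CharacterData
    private final int parentCharacter;

    EmbeddedSketch(double[][] parentData, int parentCharacter) {
        this.parentData = parentData;
        this.parentCharacter = parentCharacter;
    }
    public int getNumTaxa() { return parentData[parentCharacter].length; }
    public double getState(int it) { return parentData[parentCharacter][it]; }
    int getParentCharacter() { return parentCharacter; }
}

// "Adjustable": owns its storage, so a simulation or reconstruction can fill it.
class AdjustableVectorSketch implements VectorSketch {
    private double[] states;

    AdjustableVectorSketch(int numTaxa) { states = new double[numTaxa]; }
    public int getNumTaxa() { return states.length; }
    public double getState(int it) { return states[it]; }
    void setState(int it, double state) { states[it] = state; }
    // Resizing keeps existing states and pads new entries with 0.0.
    void setNumTaxa(int numTaxa) { states = java.util.Arrays.copyOf(states, numTaxa); }
}

class MeanSketch {
    // A calculation written against the interface works with either kind of vector.
    static double mean(VectorSketch v) {
        double sum = 0;
        for (int it = 0; it < v.getNumTaxa(); it++) sum += v.getState(it);
        return sum / v.getNumTaxa();
    }
}
```

Because the embedded vector only refers to the stored matrix, later edits to that matrix are visible through it, whereas the adjustable vector is independent of any stored data.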

Classes and interfaces for character state of one character X one taxon (or node)

Class CharacterState

A CharacterState object contains the state of a single character in a single taxon. There are methods to set it, write it as a string, set its values according to a String, query whether it is unassigned, and so on. The details of what exactly is stored depend on the data type: CategoricalState objects store a long representing a state set, whereas ContinuousState objects store an array of doubles for the items representing the state (e.g., mean, variance).

CharacterState objects are a datatype independent way that information can be extracted from matrices and passed around. Two of the standard subclasses are:

CharacterState

CategoricalState — the state of a categorical character. Categorical characters have a series of alternative states. Sets of states were stored in MacClade 3 as Pascal sets or C bit fields, and state set operations were done with bitwise ANDs and ORs, etc. In Mesquite, they are stored as long primitives (64 bits, one bit per state), and thus could in principle handle over 60 character states. The maximum value for a state is currently 56, which leaves bits 57-63 available as special flags. Bit 63 indicates the state is unassigned. Bit 62 indicates the state is invalid ("impossible"). Bit 61 indicates the state is uncertain (as opposed to polymorphic). Bit 60 indicates the state is to be represented as a string with lower case symbols. An entirely empty set (0L) is interpreted as inapplicable (i.e., a gap). CategoricalState can be instantiated to hold a categorical state set, but it also provides static methods for manipulating these state sets in a bitwise way.
ContinuousState — the state of a continuous-valued character. Allows for multiple items (e.g., mean, variance, sample size), corresponding to ITEMS in the NEXUS file format for continuous characters. ContinuousState objects hold an array of up to 32 items.
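The bit layout described for categorical state sets can be sketched as below. The bit positions and the maximum state of 56 are taken from the text. The class name StateSetSketch, the constant names and the method names are chosen for illustration and should not be read as the actual static methods of CategoricalState.

```java
final class StateSetSketch {
    static final int MAX_STATE = 56;             // maximum state value given in the text
    static final long UNASSIGNED = 1L << 63;     // bit 63
    static final long INVALID = 1L << 62;        // bit 62 ("impossible")
    static final long UNCERTAIN = 1L << 61;      // bit 61 (as opposed to polymorphic)
    static final long LOWER_CASE = 1L << 60;     // bit 60
    static final long INAPPLICABLE = 0L;         // entirely empty set, i.e. a gap
    // Mask that selects only the state bits 0..56 and drops any flag bits.
    static final long STATE_MASK = (1L << (MAX_STATE + 1)) - 1;

    // Builds a state set from individual state numbers.
    static long makeSet(int... states) {
        long set = 0L;
        for (int s : states) {
            if (s < 0 || s > MAX_STATE) {
                throw new IllegalArgumentException("state out of range: " + s);
            }
            set |= 1L << s;
        }
        return set;
    }

    static boolean isElement(long set, int state) { return (set & (1L << state)) != 0; }

    // Set operations are single bitwise instructions on the state bits.
    static long union(long a, long b) { return (a | b) & STATE_MASK; }
    static long intersection(long a, long b) { return (a & b) & STATE_MASK; }
    static int cardinality(long set) { return Long.bitCount(set & STATE_MASK); }

    static boolean isUnassigned(long set) { return (set & UNASSIGNED) != 0; }
    static boolean isUncertain(long set) { return (set & UNCERTAIN) != 0; }
    static boolean isInapplicable(long set) { return set == INAPPLICABLE; }
}
```

For example, a taxon scored as uncertain between states 0 and 2 could be represented as makeSet(0, 2) | UNCERTAIN. Masking in cardinality keeps the flag bit from being counted as a state.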

Colors of states

General systems for defining colors for states are not yet settled. Colors and patterns are needed for character tracing, for filling cells of the matrix, and possibly for charts or other purposes.

Currently, methods in CharacterHistory are used for tracing. For the CharacterHistory object of reconstructed states, the method prepareColors is called first; it surveys the nodes to accumulate information about which states are present. Then getColorsAtNode finds the colors at a node according to its states, and getLegendStates finds the colors and names to put in the legend. Methods in CharacterData are used for coloring cells in the matrix.

The reason CharacterHistory is used for tracing is that it knows all the states it has at its nodes (its CharacterData object might not). The reason CharacterData is used for matrix cells is that the Data Windows use CharacterData objects. In the future it would be good to have color control centralized, perhaps in the CharacterData object, whose methods would be passed the CharacterHistory object when it needs to help with a tracing.

One complication with coloring states is that there are several options for the span of colors used for a single character:

1. Use a fixed scale of 0 to the maximum state conceivable (e.g., for DNA data), and use the same colors regardless of the states present in the particular character
2. Use a scale of min-max observed in the character in the data matrix (not necessarily in the reconstruction, which might be on a tree with taxa trimmed)
3. Use a scale of min-max in the character reconstruction
4. Use any of the above scales, but avoid assigning white and black as colors

Choice 1 could be used to trace DNA data on trees. Choices 2 and 3 might be used for non-nucleotide data in tracing characters. Choice 4 might be used to color cells in a matrix, so that the strings describing the states could still be observed despite the coloring.
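The four choices can be made concrete with a small sketch that maps a state to a position along a color ramp. This is an illustration under stated assumptions, not an existing Mesquite facility: the class ColorScaleSketch, the enum Scale, both methods, the white-to-black gray ramp, and the 0.1 to 0.9 margins used to avoid white and black are all hypothetical.

```java
final class ColorScaleSketch {
    // Choices 1 to 3: which range the color ramp spans.
    enum Scale { FIXED, DATA_RANGE, RECONSTRUCTION_RANGE }

    // Returns a position in [0, 1] along the ramp for the given state.
    // dataRange and reconRange are {min, max} pairs; only the one selected is read.
    static double position(int state, Scale scale, boolean avoidExtremes,
                           int maxConceivable, int[] dataRange, int[] reconRange) {
        int min;
        int max;
        switch (scale) {
            case FIXED:
                min = 0;
                max = maxConceivable;
                break;
            case DATA_RANGE:
                min = dataRange[0];
                max = dataRange[1];
                break;
            default: // RECONSTRUCTION_RANGE
                min = reconRange[0];
                max = reconRange[1];
                break;
        }
        double p = (max == min) ? 0.0 : (double) (state - min) / (max - min);
        // Choice 4: squeeze into [0.1, 0.9] so neither end of the ramp is used.
        if (avoidExtremes) {
            p = 0.1 + 0.8 * p;
        }
        return p;
    }

    // Maps a ramp position to a gray level, white (255) at 0 and black (0) at 1.
    static int grayLevel(double p) {
        return (int) Math.round(255 * (1.0 - p));
    }
}
```

With a fixed scale, state 3 of a four-state character always gets the same color. With the observed ranges, the same state may get a different color in different characters, which is the trade-off among the choices above.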